UDP- Cleaning the Data Lesson

Cleaning the Data

Cleaning data is an essential step in improving its quality, and spreadsheet programs are a common tool for the job. Cleaning and transforming the data often involves removing invalid records and converting every column to a consistent set of values. You may also combine two different datasets into a single table, remove duplicate entries, or apply any number of other normalizations. As you acquire data, you will notice that it often contains many inconsistencies: names are spelled or formatted in different ways, amounts are written as badly formatted numbers, and some data may not be usable at all because of file corruption or missing values. Cleaning the data is very likely to be the most time-intensive part of your project when working with big data.
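
The same cleaning steps can also be written as a short script, which may help make each step concrete. The Python sketch below is only an illustration, not part of the spreadsheet workflow used in this lesson. The column names (name, amount), the sample rows, and the helper functions parse_amount and clean_rows are all assumptions made up for the example.

```python
def parse_amount(text):
    """Turn strings such as ' $1,200.50 ' into a float; return None if unusable."""
    cleaned = text.strip().replace("$", "").replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


def clean_rows(rows):
    """Normalize names, parse amounts, drop invalid rows, and remove duplicates."""
    seen = set()
    result = []
    for row in rows:
        # Collapse extra spaces and use one capitalization style for names.
        name = " ".join(row.get("name", "").split()).title()
        amount = parse_amount(row.get("amount", ""))
        if not name or amount is None:
            continue  # invalid record: missing name or unreadable amount
        key = (name, amount)
        if key in seen:
            continue  # duplicate entry
        seen.add(key)
        result.append({"name": name, "amount": amount})
    return result


# Hypothetical sample data showing the kinds of problems described above.
raw_rows = [
    {"name": "  alice  smith ", "amount": "$1,200.50"},
    {"name": "Alice Smith", "amount": "1200.5"},
    {"name": "BOB JONES", "amount": "n/a"},
    {"name": "", "amount": "30"},
    {"name": "carol  lee", "amount": " 75 "},
]
cleaned_rows = clean_rows(raw_rows)
```

In this sample, the two Alice Smith rows become identical after normalization, so only one is kept. The row with an unreadable amount and the row with no name are dropped as invalid.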

Whenever you download a dataset, the very first thing you should do is make a copy of it so that you always have the original raw data. Make all of your changes in this copy.
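
If you ever handle the files with a script instead of by hand, the copy step might look like the Python sketch below. The function name make_working_copy and the "_working" suffix are assumptions chosen for illustration. The function refuses to overwrite an existing copy so that earlier work is not lost by accident.

```python
import shutil
from pathlib import Path


def make_working_copy(original_path):
    """Copy the raw file next to itself with a '_working' suffix and return the new path."""
    original = Path(original_path)
    working = original.with_name(f"{original.stem}_working{original.suffix}")
    if working.exists():
        raise FileExistsError(f"{working} already exists; refusing to overwrite it")
    shutil.copy2(original, working)  # copies the contents and the file timestamps
    return working


# Example call with a hypothetical file name:
# make_working_copy("survey_data.csv")  # creates survey_data_working.csv
```

The original file is only read, never modified, so it stays available as the untouched raw data.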

You will be using a spreadsheet program to clean and analyze the data collected in the last lesson. 

[CC BY 4.0] UNLESS OTHERWISE NOTED | IMAGES: LICENSED AND USED ACCORDING TO TERMS OF SUBSCRIPTION