You need to clean your data before you start analysing it. The reason is not hard to understand. Data is gathered through extensive manual effort. Researchers spend hours talking to people or sifting through statistics to find the right data.
Even more tedious is the data entry process. Duplicate entries, spelling mistakes and values typed into the wrong column creep in despite best efforts. The chance of error rises when data is collected and entered separately by several surveyors, who often lack sufficient training.
That’s why it is important to clean data. If you don’t do so then you run the risk of arriving at the wrong conclusions.
It is not hard to clean data. However, you have to be patient, because some files may have missing data or contain variables that are unfamiliar to you.
The best way to clean data is to follow these eight steps:
# 1. Eliminate spelling mistakes
Run a spellcheck to remove spelling errors. There are two kinds of errors that need to be corrected. The first is the spelling of cities, towns or villages. A village or small town is often spelt in two or three different ways. Identify the multiple spellings and make sure only one is used throughout.
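One way to spot variant spellings is to compare each name against a list of accepted spellings. The sketch below uses `get_close_matches` from Python's standard `difflib` module. The place names, the `CANONICAL_PLACES` list and the `standardise_place` helper are illustrative assumptions, not part of any real data set.

```python
from difflib import get_close_matches

# Hypothetical list of accepted spellings; in practice this would come
# from a reviewed reference list for your survey area.
CANONICAL_PLACES = ["Pune", "Nashik", "Aurangabad"]

def standardise_place(name, canonical=CANONICAL_PLACES, cutoff=0.8):
    """Return the closest accepted spelling, or the input unchanged if none is close."""
    candidate = name.strip().title()
    matches = get_close_matches(candidate, canonical, n=1, cutoff=cutoff)
    return matches[0] if matches else name

raw = ["Pune", "Poona", "Nasik", "Nashik", "Aurangabaad"]
cleaned = [standardise_place(p) for p in raw]
print(cleaned)
```

Fuzzy matching only catches near misses. A spelling as different as "Poona" falls below the cutoff used here and is left unchanged, so such cases still need a manual review or an explicit list of known variants.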
# 2. Eliminate multiple representations
Sometimes the same word is abbreviated in different ways. For instance, Maharashtra may be abbreviated as Maha or M’rashtra. Clean data requires that Maharashtra be abbreviated in one style only. Otherwise, you will end up with two different results for the same value.
Sometimes the same abbreviation appears in both lower case and upper case. Here too, your data will be read as two different values. The abbreviation of United Nations Organisation should therefore be either UNO or uno throughout, never a mix of both.
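A simple way to collapse such variants is a lookup table that maps every known form to one label, applied after trimming spaces and ignoring case. This is a minimal sketch. The `ALIASES` table and the `normalise_label` helper are hypothetical, and the table would need to be built from the variants that actually occur in your file.

```python
# Hypothetical alias table: every known variant (lower-cased) maps to one label.
ALIASES = {
    "maha": "Maharashtra",
    "m'rashtra": "Maharashtra",
    "maharashtra": "Maharashtra",
    "uno": "UNO",
}

def normalise_label(value):
    """Map a raw label to its single agreed form; unknown labels are only trimmed."""
    # Lower-case the value and treat curly and straight apostrophes alike.
    key = value.strip().lower().replace("\u2019", "'")
    return ALIASES.get(key, value.strip())

raw = ["Maha", "M\u2019rashtra", "Maharashtra", "UNO", "uno", " Uno "]
labels = [normalise_label(v) for v in raw]
distinct = sorted(set(labels))
print(distinct)
```

Counting the distinct labels before and after this step is a quick check that the variants have really been merged.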
# 3. Use one standard / scale / format
It is essential that one system of measurement be followed. You need to clean data when you find that weight has been recorded in kilograms as well as in pounds, or that currency is given in both millions and lakhs. Date format is another area of concern. Sometimes the same data file uses both day-month-year and month-day-year. In such cases too, you must clean the data and use only one format.
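The conversions themselves are simple arithmetic once you know which unit or format each source used. The sketch below assumes kilograms, millions and ISO dates (year-month-day) as the target standards. The function names are illustrative. Note that the date format has to be known for each source, because a value such as 03-05-2021 cannot be interpreted reliably on its own.

```python
from datetime import datetime

LB_TO_KG = 0.45359237   # one international pound in kilograms
LAKH = 100_000
MILLION = 1_000_000

def to_kg(value, unit):
    """Convert a weight to kilograms; the unit must be stated explicitly."""
    unit = unit.strip().lower()
    if unit in ("kg", "kilogram", "kilograms"):
        return float(value)
    if unit in ("lb", "lbs", "pound", "pounds"):
        return float(value) * LB_TO_KG
    raise ValueError(f"unknown weight unit: {unit!r}")

def lakhs_to_millions(value):
    """Convert an amount in lakhs to millions (10 lakhs = 1 million)."""
    return value * LAKH / MILLION

def to_iso_date(text, source_format):
    """Parse a date in a known source format and return it as YYYY-MM-DD."""
    return datetime.strptime(text, source_format).date().isoformat()

# The same calendar day, entered by two surveyors using different conventions.
day_first = to_iso_date("05-03-2021", "%d-%m-%Y")
month_first = to_iso_date("03-05-2021", "%m-%d-%Y")
print(day_first, month_first, to_kg(100, "lb"), lakhs_to_millions(25))
```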
# 4. Remove duplication
Duplication happens when scores of surveyors are used to collect and enter data. Very often, they don’t realise that the data they are entering has already been entered. Such duplication corrupts results and needs to be weeded out.
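Duplicates can be found by choosing the fields that together identify a record and keeping only the first row for each combination. In this sketch the field names `household_id` and `village` are hypothetical, and the comparison ignores case and stray spaces, which is an assumption that may not suit every data set.

```python
def drop_duplicates(rows, key_fields):
    """Keep the first row for each combination of key fields."""
    seen = set()
    unique = []
    for row in rows:
        # Build a comparison key that ignores case and surrounding spaces.
        key = tuple(str(row[f]).strip().lower() for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

rows = [
    {"household_id": "H001", "village": "Shirur"},
    {"household_id": "h001 ", "village": "shirur"},   # same household, entered twice
    {"household_id": "H002", "village": "Shirur"},
]
unique_rows = drop_duplicates(rows, ["household_id", "village"])
print(len(rows), "rows before,", len(unique_rows), "rows after")
```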
# 5. Identify missing values
No data analysis can be complete if key values are missing. Values may be missing because of an oversight or an error during entry. Either way, you need to identify the missing values and go back to the source to get the missing information.
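Before going back to the source, it helps to have a list of exactly which records and fields are incomplete. The sketch below reports each gap as a row index and field name. The set of markers treated as "missing" and the sample field names are assumptions that should be adapted to how blanks were recorded in your file.

```python
# Hypothetical markers that surveyors may have typed in place of a real value.
MISSING_MARKERS = {"", "na", "n/a", "null", "-"}

def find_missing(rows, required_fields):
    """Return (row_index, field) pairs for every required value that is absent."""
    gaps = []
    for i, row in enumerate(rows):
        for field in required_fields:
            value = row.get(field)
            if value is None or str(value).strip().lower() in MISSING_MARKERS:
                gaps.append((i, field))
    return gaps

rows = [
    {"name": "Asha", "age": 34},
    {"name": "", "age": 41},        # blank name
    {"name": "Ravi", "age": None},  # age not recorded
    {"name": "Meena"},              # age column absent
]
gaps = find_missing(rows, ["name", "age"])
print(gaps)
```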
# 6. Remove redundancy
When doing data analysis you may come across some data that is not relevant to your analysis. This data can be deleted before you start your analysis. It is essential that you focus on key variables only, and not on the entire data file.
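In practice this often means working on a copy that contains only the columns you need, which leaves the original file intact. A minimal sketch, with hypothetical column names:

```python
def keep_columns(rows, wanted):
    """Return new rows containing only the wanted columns, in the order given."""
    return [{k: row[k] for k in wanted if k in row} for row in rows]

rows = [
    {"student": "Asha", "marks": 85, "bus_route": "7A"},
    {"student": "Ravi", "marks": 90, "bus_route": "3C"},
]
trimmed = keep_columns(rows, ["student", "marks"])
print(trimmed)
```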
# 7. Correct range values
Clean data means that the researcher has checked for and eliminated typos in values that must fall within a known range. For instance, while analysing school data you may find that the 80-90 marks column shows a student having scored 900 marks though his actual marks were 90. The extra zero was a mistake. Your entire analysis will go haywire if such typos are not corrected.
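Typos of this kind are easy to flag by checking each value against the range it is allowed to take. The sketch below assumes marks run from 0 to 100. The field names are illustrative, and flagged values should be checked against the source rather than corrected automatically.

```python
def out_of_range(rows, field, low, high):
    """Return (row_index, value) pairs for values outside the allowed range."""
    return [
        (i, row[field])
        for i, row in enumerate(rows)
        if not (low <= row[field] <= high)
    ]

rows = [
    {"student": "Asha", "marks": 85},
    {"student": "Ravi", "marks": 900},  # likely typo for 90
    {"student": "Meena", "marks": 90},
]
suspect = out_of_range(rows, "marks", 0, 100)
print(suspect)
```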