Data Cleaning: A primary step towards a data science project
After gathering data for a project, the next logical step is to clean and audit the data. Data cleaning is one of the most important parts of a data science project. It comes as a part of the data gathering process (refer to the DS lifecycle). Clean data directly impacts the quality of the project's results, be it the visualizations or the statistical models you intend to build on the gathered data.
Data can be dirty for many reasons. Sometimes the processes of data generation or data collection introduce errors. Let's see a few.
Table of Contents
- 1 Sources of Dirty Data
- 2 Measurement of Data Quality
- 3 Data Cleaning Process
Sources of Dirty Data
- User entry errors
- Different schema
- Legacy systems
- Evolving Applications
- No unique identifiers
- Data migration
- Programmer error
- Corruption in transmission
User Entry Errors
A very common example of a user entry error is date entries. For example, while filling out a survey form, the surveyor enters the date as 'MM/DD/YY' whereas the data entry format is 'DD/MM/YY'.
E.g., the date entry 7/1/2016:
- 7/1/2016 → 7th January 2016 in DD/MM/YYYY format
- 7/1/2016 → 1st July 2016 in MM/DD/YYYY format
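The ambiguity above can be made concrete with a short sketch: the same raw string parses to two different dates depending on which format you assume, which is why the entry format should always be recorded explicitly.

```python
from datetime import datetime

def parse_date(raw, fmt):
    """Parse a raw date string with an explicit, known format."""
    return datetime.strptime(raw, fmt).date()

raw = "7/1/2016"
as_dmy = parse_date(raw, "%d/%m/%Y")  # 7th January 2016
as_mdy = parse_date(raw, "%m/%d/%Y")  # 1st July 2016
```

If the format is not recorded alongside the data, entries like this one are impossible to disambiguate after the fact.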
Different Schemas
While gathering data from multiple diverse sources such as the web, you will encounter various schemas. During data modeling, you tend to design a schema that matches most of the available data on the web. In such a situation, you tend to ignore the handful of data sources that do not match your schema, yet still source data from them. These sources can introduce errors into your data.
Legacy Systems
While gathering information from or about legacy systems, we might not always get the information in the required format. Say, for example, a legacy system encodes its logs in a way that current standards do not understand. In such cases, you can only convert the portion of the information that current systems understand. In the process, some information is lost, and the conversion itself can introduce errors into the translated/generated data.
Evolving Applications
As applications evolve, the data they generate, such as log files, changes with them: the content that makes up a log entry keeps changing. Say, for example, an Apache access log entry looks like this:
184.108.40.206 - - [07/Mar/2004:16:11:58 -0800] "GET /twiki/bin/view/TWiki/WikiSyntax HTTP/1.1" 200 7352
If the next version of Apache changes the log format and your processing didn't account for the change, it will assign values to the wrong fields. So this, too, can be a source of error.
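One way to catch this failure mode is to parse log lines against an explicit pattern and treat non-matching lines as errors instead of silently mis-assigning fields. A minimal sketch for the common log format shown above (the field names are my own choices, not Apache's):

```python
import re

# Expected fields of the Apache common log format (a sketch; real logs vary)
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_access_log(line):
    """Return a dict of fields, or None if the line doesn't match the format."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

line = ('184.108.40.206 - - [07/Mar/2004:16:11:58 -0800] '
        '"GET /twiki/bin/view/TWiki/WikiSyntax HTTP/1.1" 200 7352')
record = parse_access_log(line)
```

When the log format changes, this parser returns `None` rather than putting the wrong value into the wrong field, so the change is noticed immediately.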
No Unique Identifiers
A good example of data not having a unique identifier is when you crawl data from websites and store it in a data store that does not assign a unique identifier to each record. Say you create a view from that data that does not include a primary key, and you then use the view to query the original data. Doing this can introduce errors.
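One common mitigation is to derive a stable surrogate key from the fields that identify a record, so every crawled record gets an identifier even when the source provides none. A minimal sketch, assuming the chosen fields (here, a hypothetical `url` field) are together unique:

```python
import hashlib

def surrogate_key(record, fields):
    """Derive a stable surrogate key from identifying fields.

    Assumes the chosen fields together uniquely identify the record.
    """
    raw = "|".join(str(record[f]) for f in fields)
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()[:12]

page = {"url": "https://example.com/a", "title": "A"}
key = surrogate_key(page, ["url"])
```

Because the key is derived deterministically from the record's own fields, the same record always maps to the same key, and views built on the data can join back to the original safely.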
Data Migration
Data keeps migrating from one data store to another for various reasons; engineers do it all the time. Using migration features that do not preserve all the data's features can lead to migration errors. E.g., using mongoexport in MongoDB can lose some metadata and may lead to faulty data:
mongoexport --db test --collection traffic --out traffic.json
An alternative is to use mongodump, which dumps the raw BSON and preserves its metadata:
mongodump --db test --collection collection
Programmer Error
Sometimes a programming glitch can cause the generated data to be faulty.
Corruption in Transmission
Data can get corrupted while being transmitted over a network or otherwise. Such corruption can occur due to network outages; CRC checks are commonly used to detect it.
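The idea behind a CRC check can be sketched with Python's standard `zlib.crc32`: the sender appends a checksum to the payload, and the receiver recomputes it to detect corruption. The message framing here (4-byte big-endian CRC suffix) is my own convention for the example, not a real protocol:

```python
import zlib

def with_checksum(payload: bytes) -> bytes:
    """Append a CRC-32 checksum so the receiver can detect corruption."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def verify(message: bytes) -> bool:
    """Recompute the CRC-32 of the payload and compare it with the suffix."""
    payload, crc = message[:-4], int.from_bytes(message[-4:], "big")
    return zlib.crc32(payload) == crc

msg = with_checksum(b"hello")
corrupted = b"jello" + msg[5:]  # flip the first byte, keep the old checksum
```

A CRC detects accidental corruption well, but it is not a cryptographic guarantee; for tamper detection a keyed hash would be needed.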
Measurement of Data Quality
After data collection, the immediate step is to measure the quality of data accumulated. Measuring gives you a clear picture of the extent to which the data can be useful. Various metrics can be used to measure data quality.
- Validity: Conforms to the schema
- Accuracy: Conforms to a gold standard
- Completeness: Presence of all records
- Consistency: Match other data
- Uniformity: Diversity in data (uniformity in diversity)
Validity
A valid data entry conforms to the schema that you decided on in the data modeling step. Generally, we want the data to conform to the data model/schema. The higher the validity of your data, the better.
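Validity can be measured directly by checking each record against the schema and computing the fraction that conforms. A minimal sketch, using a hypothetical city schema and made-up records:

```python
def is_valid(record, schema):
    """Check that a record has every required field with the expected type."""
    return all(
        field in record and isinstance(record[field], ftype)
        for field, ftype in schema.items()
    )

def valid_ratio(rows, schema):
    """Fraction of records that conform to the schema."""
    return sum(is_valid(r, schema) for r in rows) / len(rows)

CITY_SCHEMA = {"name": str, "population": int}  # hypothetical schema
rows = [
    {"name": "Pune", "population": 3124458},
    {"name": "Delhi", "population": "unknown"},  # wrong type: invalid
]
```

Here the second record fails the type check, so the validity of this tiny data set is 0.5.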
Accuracy
Accuracy is measured by comparison with a gold standard for the same data. Say you are gathering data about cities: you could compare it with the data available from an organization like dataforcities.org. The comparison then yields a measure, which we term accuracy.
Completeness
Completeness is a measure of how much of the data is present, with reference to the gold standard. Taking the cities example again: if the data contains only cities created more than a decade ago, then we are surely missing city data created in the last decade. Since the data set does not contain everything it should, we would term it incomplete. One can also come up with a metric measuring the completeness of data.
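One such completeness metric is the fraction of gold-standard records that appear in the gathered data. A minimal sketch, with a made-up gold standard of city names:

```python
def completeness(found_keys, gold_keys):
    """Fraction of gold-standard records present in the gathered data."""
    return len(set(found_keys) & set(gold_keys)) / len(set(gold_keys))

gold = {"Pune", "Delhi", "Mumbai", "Chennai"}  # hypothetical gold standard
gathered = {"Pune", "Delhi", "Mumbai"}
score = completeness(gathered, gold)
```

A score of 1.0 means every gold-standard record was found; here one of four cities is missing, so the score is 0.75.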
Consistency
In Wikipedia’s terms, “Data Consistency refers to the usability of data; we want data to be constant in time, also letting us be capable of using and showing them in different ways without changing their structure.” In agreement with this, data should stay consistent when used elsewhere. This is where data format and structure, such as JSON, CSV, or plain text, come into play.
Uniformity
The data should have an inherent uniformity, meaning the data can be about diverse entities, but there should be uniformity in how it is accessed and processed. I would like to call it “Uniformity in Diversity” 😉
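In practice, uniformity often means normalizing records from heterogeneous sources onto one shape so downstream code can access them identically. A minimal sketch, where the alternate field names (`city_name`, `pop`) are hypothetical examples of what different sources might use:

```python
def normalize(record):
    """Map records from heterogeneous sources onto one uniform shape."""
    return {
        "name": record.get("name") or record.get("city_name"),
        "population": int(record.get("population") or record.get("pop") or 0),
    }

sources = [
    {"city_name": "Pune", "pop": "3124458"},      # one source's shape
    {"name": "Delhi", "population": 16787941},    # another source's shape
]
uniform = [normalize(r) for r in sources]
```

After normalization, every record exposes the same keys and types, so code that processes one record processes them all.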
Data Cleaning Process
- Audit your data
- A data cleaning plan
- Execute the plan
- Manually correct data
Audit Your Data
Auditing data is the start of the data cleaning process: while auditing, you find defects in the data and take steps to correct them. It is good practice to audit data at regular intervals. Auditing data manually can also bring out errors that escaped the earlier, automated processes.
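A basic audit can be automated: count missing required fields and duplicate records, then review the report. A minimal sketch over made-up rows (the audited fields are illustrative):

```python
from collections import Counter

def audit(rows, required_fields):
    """A minimal audit: count missing required fields and duplicate rows."""
    missing = Counter()
    for row in rows:
        for field in required_fields:
            if row.get(field) in (None, ""):
                missing[field] += 1
    # Rows are dicts, so compare them by their sorted (key, value) pairs.
    seen = [tuple(sorted(r.items())) for r in rows]
    duplicates = len(seen) - len(set(seen))
    return {"missing": dict(missing), "duplicates": duplicates}

rows = [
    {"name": "Pune", "population": 3124458},
    {"name": "", "population": 16787941},         # missing name
    {"name": "Pune", "population": 3124458},      # duplicate of the first row
]
report = audit(rows, ["name", "population"])
```

A report like this tells you where to focus the cleaning plan before touching any data.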
Data Cleaning Plan
Since the auditing process identifies errors, there must be a plan to fix them. Generally, the plan has two stages –
- Identify Causes
- Define Operations
The stages are straightforward: first identify what causes the errors, then define the operations that will fix them.
Execute the plan
What good is a plan that is not executed? Since now you have everything in place to take action, you just have to execute the plan.
Manually Correct Data
If, after executing the plan, you still find data inconsistencies, it is a good idea to correct the errors manually. Sometimes the overhead of repeating the whole process is greater than simply correcting the data by hand; in such cases, manual correction is a good option.
Remember: clean data yields better results.