Data Cleaning: A primary step towards a data science project


After gathering data for a project, the next logical step is to clean and audit it. Data cleaning is one of the most important parts of a data science project and belongs to the data gathering stage of the data science lifecycle. Clean data directly impacts the quality of the project’s results, be it the visualizations or the statistical models that you intend to build on the gathered data.

Data can be dirty for many reasons; errors can creep in during data generation, collection, transmission, and so on. Let’s look at a few common sources.


Sources of Dirty Data 

  • User entry errors
  • Different schema
  • Legacy systems
  • Evolving applications
  • No unique identifiers
  • Data migration
  • Programmer error
  • Corruption in transmission

User Entry Errors

A very common example of a user entry error is date entry. For example, while filling out a survey form, the surveyor enters the date as ‘MM/DD/YY’ whereas the expected entry format is ‘DD/MM/YY.’

For example, the date entry 7/1/2016 can be read two ways:

  • 7/1/2016 → 7th January 2016 in DD/MM/YYYY format
  • 7/1/2016 → 1st July 2016 in MM/DD/YYYY format
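A quick way to see the ambiguity is to parse the same string under both assumed formats. Here is a minimal Python sketch using only the standard library:

from datetime import datetime

raw = "7/1/2016"

# The same string parses to two different dates depending on the assumed format.
as_dd_mm = datetime.strptime(raw, "%d/%m/%Y")  # 7 January 2016
as_mm_dd = datetime.strptime(raw, "%m/%d/%Y")  # 1 July 2016

print(as_dd_mm.date())  # 2016-01-07
print(as_mm_dd.date())  # 2016-07-01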

Different Schema

While gathering data from multiple diverse sources such as the web, you will come across various schemas. During data modeling, you tend to design a schema that matches most of the available data. In such a situation you may still source data from the handful of sources that do not match your data model schema, and these can become a source of error in your data.
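A common way to handle this is to map each source’s fields onto your target schema at ingestion time. The snippet below is a toy sketch with made-up field names (city/city_name, population/pop_count):

# Hypothetical records from two sources whose schemas name the same fields differently.
source_a = {"city": "Pune", "population": 3124458}
source_b = {"city_name": "Mumbai", "pop_count": "12442373"}

# Mapping from each source's field names to the target data model schema.
FIELD_MAP = {
    "city": "city", "city_name": "city",
    "population": "population", "pop_count": "population",
}

def normalize(record: dict) -> dict:
    """Rename fields to the target schema and coerce types where possible."""
    out = {FIELD_MAP[k]: v for k, v in record.items() if k in FIELD_MAP}
    if "population" in out:
        out["population"] = int(out["population"])
    return out

print(normalize(source_a))  # {'city': 'Pune', 'population': 3124458}
print(normalize(source_b))  # {'city': 'Mumbai', 'population': 12442373}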

Legacy Systems

While gathering information from or about legacy systems, we might not always get the information in the required format. Say, for example, a legacy system encodes its logs in a way that is not understood by current standards. In such cases, you can only convert the portion of the information that current systems understand. In the process, some information is lost, and the conversion itself can introduce errors into the translated/generated data.
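As a toy illustration, assume the legacy logs were stored in a Latin-1 encoding; decoding them under the wrong assumption keeps the readable parts but loses the rest:

# Hypothetical legacy log fragment stored in a Latin-1 encoding.
legacy_bytes = "Zürich, café".encode("latin-1")

# Decoding with the wrong assumption (UTF-8) and errors="replace" substitutes
# replacement characters for the bytes it cannot interpret, i.e. information is lost.
converted = legacy_bytes.decode("utf-8", errors="replace")
print(converted)  # Z�rich, caf�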

Evolving Applications

As applications evolve, the data they generate, such as log files, keeps changing; the content that is part of the log differs from version to version. Say, for example, an Apache access log line looks like this:

64.242.88.10 - - [07/Mar/2004:16:11:58 -0800] "GET /twiki/bin/view/TWiki/WikiSyntax HTTP/1.1" 200 7352

If the next version of Apache changes the log format and your processing does not account for the change, it will assign values to the wrong fields. This too can be a source of error.
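As a sketch, a parser for the log line above might look like the following (the field names are my own, not Apache’s):

import re

# Regex for the common log format line shown above (simplified).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]+)" (?P<status>\d{3}) (?P<size>\d+|-)'
)

line = ('64.242.88.10 - - [07/Mar/2004:16:11:58 -0800] '
        '"GET /twiki/bin/view/TWiki/WikiSyntax HTTP/1.1" 200 7352')

match = LOG_PATTERN.match(line)
print(match.groupdict())
# {'host': '64.242.88.10', 'time': '07/Mar/2004:16:11:58 -0800',
#  'request': 'GET /twiki/bin/view/TWiki/WikiSyntax HTTP/1.1',
#  'status': '200', 'size': '7352'}

# If a newer log format adds or reorders fields, the match can silently fail or
# capture the wrong values, so the parser should be re-validated for each version.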

No Unique Identifiers

A good example of data lacking a unique identifier is when you crawl data from websites and store it in a data store without assigning a unique identifier to each record. Say you create a view from that data which does not include a primary key and then use the view to query the original data; doing this can introduce errors.
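One simple safeguard, sketched below with hypothetical records and field names, is to attach a surrogate key (e.g. a UUID) to every crawled record at ingestion time:

import uuid

# Hypothetical crawled records that arrive without any natural key.
crawled = [
    {"title": "Data Cleaning", "url": "https://example.com/a"},
    {"title": "Data Auditing", "url": "https://example.com/b"},
]

# Attach a surrogate key so later views/queries can join back
# to the original records unambiguously.
for record in crawled:
    record["record_id"] = str(uuid.uuid4())

print(crawled[0]["record_id"])  # e.g. '3f9c5d9a-...'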

Data Migration

Data keeps migrating from one data store to another for various reasons; engineers do it all the time. Using migration tools that do not preserve all of the data’s features can lead to migration errors. For example, using mongoexport in MongoDB can lose some metadata and may lead to faulty data:

mongoexport --db test --collection traffic --out traffic.json

An alternative would be to use mongodump:

mongodump --db test --collection traffic

Programmer Error

Sometimes a programming glitch can cause the generated data to be faulty.

Corruption in Transmission

Data can get corrupted while being transmitted over a network or otherwise. Corruption during transmission can occur due to network outages and typically shows up as CRC check failures.
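A common defence is to compare checksums before and after transmission. Here is a minimal sketch using SHA-256 from the Python standard library:

import hashlib

def sha256_of(data: bytes) -> str:
    """Return the SHA-256 hex digest of a payload."""
    return hashlib.sha256(data).hexdigest()

# Sender computes a checksum before transmission; receiver recomputes it on arrival.
payload = b"city,population\nPune,3124458\n"
sent_checksum = sha256_of(payload)

received = payload  # in reality, this is whatever arrived over the network
if sha256_of(received) != sent_checksum:
    raise ValueError("payload corrupted in transmission")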


Measurement of Data Quality 

After data collection, the immediate step is to measure the quality of the accumulated data. Measuring gives you a clear picture of the extent to which the data will be useful. Various metrics can be used to measure data quality.

  • Validity: Conforms to the schema
  • Accuracy: Conforms to a gold standard
  • Completeness: Presence of all records
  • Consistency: Match other data
  • Uniformity: Diversity in data (uniformity in diversity)

Validity

A valid data entry conforms to the schema that you decided on in the data modeling step. Generally, we want the data to conform to the data model/schema. The higher the validity of your data, the better.
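A minimal validity check, assuming a hypothetical schema of required fields and types, could look like this:

# Hypothetical schema: required fields and their expected types.
SCHEMA = {"city": str, "population": int}

def is_valid(record: dict) -> bool:
    """A record is valid if every schema field is present with the right type."""
    return all(
        field in record and isinstance(record[field], expected)
        for field, expected in SCHEMA.items()
    )

records = [
    {"city": "Pune", "population": 3124458},
    {"city": "Mumbai", "population": "unknown"},  # wrong type -> invalid
]
validity = sum(is_valid(r) for r in records) / len(records)
print(validity)  # 0.5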

Accuracy

Accuracy is measured by comparison with a gold standard of the same data. Say you are gathering data about cities; you can compare it against the data available from an organization such as dataforcities.org. The comparison then yields a measure, which we term accuracy.
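As a rough sketch, with made-up city populations standing in for the gold standard:

# Hypothetical gold-standard populations versus our gathered values.
gold = {"Pune": 3124458, "Mumbai": 12442373, "Nagpur": 2405665}
gathered = {"Pune": 3124458, "Mumbai": 12440000, "Nagpur": 2405665}

# Accuracy: fraction of gathered values that agree with the gold standard.
matches = sum(1 for city, value in gathered.items() if gold.get(city) == value)
accuracy = matches / len(gathered)
print(accuracy)  # 0.666...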

Completeness

It’s a measure of the amount of data present with reference to the available gold standard. Taking the cities example again, if the dataset only contains cities recorded a decade ago, then we are surely missing the cities that came up in the last decade. Since the dataset does not contain all the records, we would term it incomplete. One can also come up with a metric measuring the completeness of data.
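One simple completeness metric is the fraction of gold-standard records that actually appear in your data, for example:

# Completeness: fraction of gold-standard cities actually present in our data.
gold_cities = {"Pune", "Mumbai", "Nagpur", "Nashik"}
our_cities = {"Pune", "Mumbai", "Nagpur"}

completeness = len(our_cities & gold_cities) / len(gold_cities)
print(completeness)  # 0.75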

Consistency

In Wikipedia’s terms, “Data Consistency refers to the usability of data; we want data to be constant in time, also letting us be capable of using and showing them in different ways without changing their structure.” In line with this, data should be consistent with how it is used elsewhere. The data format and structure, such as JSON, CSV, or plain text, come into play here.

Uniformity

The data should have an inherent uniformity: the data can be about diverse entities, but there should be uniformity in how it is accessed and processed. I would like to call it “Uniformity in Diversity” 😉


Data Cleaning Process 

  • Audit your data
  • Make a data cleaning plan
  • Execute the plan
  • Manually correct data

Auditing

Auditing data is the start of the data cleaning process; the aim is to find defects in the data and take steps to correct them. It is a good practice to audit data at regular intervals. Auditing data manually can also bring out errors that escaped the prior processes.
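If the data fits in a DataFrame, a first audit can be as simple as counting missing values and duplicates and checking column types; a small pandas sketch with made-up data:

import pandas as pd

# Hypothetical gathered data with a few typical defects.
df = pd.DataFrame({
    "city": ["Pune", "Mumbai", None, "Pune"],
    "population": [3124458, 12442373, 2405665, 3124458],
})

# A minimal audit: missing values, duplicate rows, and column types.
print(df.isnull().sum())      # missing values per column
print(df.duplicated().sum())  # number of exact duplicate rows
print(df.dtypes)              # data type of each column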

Data Cleaning Plan

Since the auditing process identifies errors, there must be a plan to fix them. Generally, the plan has three stages:

  • Identify Causes
  • Define Operations
  • Test/Audit

The stages are simple, and the actions to be taken in each stage follow directly from its name.

Execute the plan

What good is a plan that is not executed? Now that you have everything in place to take action, you just have to execute the plan.

Manual Correction

If, after execution, you still find some data inconsistencies, it is a good idea to correct the errors manually. Sometimes the overhead of repeating the whole process is greater than just manually correcting the data; in such cases, this is a good option.

Remember: clean data yields better results.


Kaustubh

I look after Technology at Thinkitive. Interested in Machine Learning, Deep Learning, IoT, TinyML and many more areas of application of machine learning.
