The Importance of Data Quality - Part One

Quality analytics depends on many things, not the least of which are an incisive understanding of the business problem to be analyzed and an experienced, knowledgeable team of data pros who have the right tools and techniques to conduct the analysis.

But the single most important ingredient in effective analytics is quality data. In this three-part series we will look at what constitutes quality data and how to ensure that your analytics are based on the best data possible. Essentially, data quality comes down to three factors: input data, methodology and quality control. In this initial post we will examine input data.

Understanding Input Data

As the saying goes: garbage in, garbage out. High-quality input data are fundamental to producing reliable models and datasets. Regardless of how good your models are, if the input data used to build and implement them are bad—incomplete, outdated, biased or otherwise inaccurate—the resulting predictions or datasets have little chance of being reliable.

Now, no data are perfect, which is something you don’t often hear stated so bluntly by a data provider. But the reality is that data are shaped by how, where, when, and from whom they were captured. Any of these aspects can be a source of bias or error. So it is imperative to understand the pedigree of input data and determine how “clean” the data really are before embarking on any analytics effort.

Data, no matter how “up-to-date,” always present a snapshot in time, and that time is, by necessity, in the past. Knowing when (recency and frequency) and how (process) the data were collected is critical to determining the degree of data “cleanliness” and also assists researchers in making informed choices about methodology and what types of analysis may or may not be appropriate.

The recency of input data largely determines how well they reflect the current state of affairs. Data that are 5 years old are bound to be less representative of the present than data that are 5 minutes old, all other things being equal. The frequency of data collection is also important because it influences the types of models that a researcher can use and how often those models can be calibrated and their predictions tested.

As researchers, we have to use history to predict the future. There is no changing this fact. It is our job to determine how well historical data reflect the present or predict the future, and to make adjustments where necessary. This is where the skill, experience and domain knowledge of the researcher are critical. It is quite straightforward to build most models. The real challenge is intelligently using the results.
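As a rough illustration of the recency and frequency checks described above, the following Python sketch summarizes a list of collection timestamps. It is a minimal example under stated assumptions, not a prescribed method: the function name `recency_and_frequency`, the `max_age` threshold and the sample dates are all hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta
from statistics import median


def recency_and_frequency(timestamps, as_of, max_age):
    """Summarize how recent and how frequent a set of collection dates is.

    timestamps: datetimes at which the data were collected
    as_of:      the date the analysis is run (the "present")
    max_age:    hypothetical threshold beyond which data count as stale
    """
    if not timestamps:
        raise ValueError("need at least one collection timestamp")

    ordered = sorted(timestamps)

    # Recency: how old is the most recent observation?
    age = as_of - ordered[-1]

    # Frequency: the typical gap between consecutive collections.
    gaps = [
        (later - earlier).total_seconds()
        for earlier, later in zip(ordered, ordered[1:])
    ]
    typical_gap = timedelta(seconds=median(gaps)) if gaps else None

    return {
        "age": age,
        "typical_gap": typical_gap,
        "is_stale": age > max_age,
    }


# Illustrative, made-up collection dates (roughly monthly).
collected = [
    datetime(2024, 1, 1),
    datetime(2024, 2, 1),
    datetime(2024, 3, 1),
]
summary = recency_and_frequency(
    collected,
    as_of=datetime(2024, 4, 15),
    max_age=timedelta(days=30),
)
print(summary)
```

A summary like this does not say whether the data are good. It only makes the age of the snapshot and the collection interval explicit, so the researcher can judge which models and calibration schedules are appropriate.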

