The Importance of Data - Part Two

In my previous post on data quality, we looked at input data and how recency, frequency and the data collection process can affect the quality and types of analysis that can be done. In this post, we turn to the role of methodology in creating quality data and the factors that go into selecting the right methodology. The discussion will get a bit technical, but it’s worth getting a handle on these important concepts to appreciate what goes into creating quality data.

Applying the Right Methodology
For EA and for the purposes of this examination, “methodology” refers to the techniques we use to build our data products, such as DemoStats and DaytimePop, as well as those used to execute custom projects. These techniques range from simple rule-based algorithms to machine learning methods. In large part, the type, amount, reliability and recency of the available data dictate which method we use.

We build unique methods for nearly every theme and geography captured by our datasets. What does that mean? For DemoStats alone we have created and maintain over 150 unique algorithms to produce 754 variables across 42 themes for 5 time periods and 11 levels of geography. 

It is helpful to think about methodology as a spectrum, with model accuracy at one end and model generalization at the other. There is not a direct trade-off between accuracy and generalization. The best models have high levels of both. Yet modelling techniques tend to start at one end of the spectrum and, through model training, calibration and testing, work toward the other end. The following graphic illustrates where along the accuracy-generalization continuum various modelling techniques start.

Figure 1. The methodology spectrum and where common modelling techniques fall

When deciding which techniques to use to build a standard dataset like WealthScapes or execute a custom project, we compare the advantages and disadvantages of techniques focused on accuracy versus those focused on generalization, as shown in Table 1.

Table 1. Advantages and disadvantages of accuracy and generalization

This table uses a couple of technical terms that are important for anyone working with data, methodology and models to understand. Let’s start with correlation versus causation. Correlation is just a statistical metric: a mathematical formula that compares two variables. Correlation says nothing about the existence of a real-world relationship between two variables or the nature of that relationship. Causation, on the other hand, explicitly looks at how attributes or phenomena interact.
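
To make "a mathematical formula that compares two variables" concrete, here is the most common correlation measure, the Pearson coefficient. The post does not say which measure is meant, so treating it as Pearson's r is an assumption; x and y stand for any two variables observed n times, and the bars denote their averages.

```latex
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The result always lies between -1 and +1. A value near either extreme says only that the two variables move together in the data; nothing in the formula explains why they do.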

For example, if we are trying to predict how many jelly beans a worker in an office consumes in a day, we might find that specific variables correlate highly with jelly bean consumption, such as amount of soda consumed in a day, distance from the worker’s desk to the bowl of jelly beans and number of hours the worker spends in the office. In this scenario, it would be easy to conclude that high soda consumption causes jelly bean consumption. 

But that would be an inappropriate conclusion. Jelly bean consumption and soda consumption are linked, but indirectly. In this case the causal factor driving jelly bean consumption is more likely the worker’s attitude about nutrition; soda consumption is acting as a proxy. If soda were removed from the office environment, it is very possible that jelly bean consumption would increase rather than decrease. In fact, with further tests, we would likely be able to determine that distance to the jelly bean bowl (access) and hours spent in the office (exposure) have a significant and direct causal relationship to jelly bean consumption. The greater a worker’s access and exposure to jelly beans, the more jelly beans the worker will eat. Always remember: Correlation is not causation.
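
The jelly bean scenario can be sketched as a small simulation. This is a toy illustration under invented assumptions, not a model built from real data: every name (`simulate_office`, `sweet_tooth`) and every coefficient below is hypothetical. A hidden attitude toward nutrition drives both soda and jelly bean consumption, while soda itself has no effect on jelly beans. For simplicity, the sketch also leaves out the substitution effect that could make jelly bean consumption rise when soda is removed.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate_office(n_workers=5000, soda_available=True, seed=0):
    """Return (soda, beans, access) lists for a made-up office.

    All coefficients are invented for illustration only.
    """
    rng = random.Random(seed)
    soda, beans, access = [], [], []
    for _ in range(n_workers):
        # Hidden confounder: 0 = very nutrition-conscious, 1 = not at all.
        sweet_tooth = rng.random()
        closeness = rng.random()          # access: 1 = desk beside the bowl
        hours = rng.uniform(4, 10)        # exposure: hours in the office
        soda_noise = rng.gauss(0, 0.3)    # always drawn, so workers match across runs
        bean_noise = rng.gauss(0, 1)

        # Soda depends only on the hidden attitude (when soda is on offer).
        sodas = 3 * sweet_tooth + soda_noise if soda_available else 0.0
        # Jelly beans depend on attitude, access and exposure, never on soda.
        eaten = max(0.0, 10 * sweet_tooth + 5 * closeness + hours + bean_noise)

        soda.append(sodas)
        beans.append(eaten)
        access.append(closeness)
    return soda, beans, access


soda, beans, access = simulate_office(soda_available=True)
_, beans_without_soda, _ = simulate_office(soda_available=False)

print("corr(soda, beans):  ", round(pearson(soda, beans), 2))
print("corr(access, beans):", round(pearson(access, beans), 2))
print("mean beans with soda:   ", round(sum(beans) / len(beans), 2))
print("mean beans without soda:", round(sum(beans_without_soda) / len(beans_without_soda), 2))
```

In this toy setup soda correlates with jelly beans more strongly than access does, even though only access has a causal effect. Taking soda away leaves average jelly bean consumption unchanged, which is what you would expect from a proxy rather than a cause.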
