Common mistakes in data analysis


To those who are not familiar with the data analytics field, data analysis can seem like a simple task: anyone can run a query against the database, calculate the sum and average of sales or another interesting KPI, and conclude whether the targets were met and how to move on from here.

In practice it does not always work that way. Such a naive approach can lead the organization to wrong conclusions and, from there, to wrong decisions. Here are some famous mistakes in data analysis that illustrate how biases and misinterpretation of data produce incorrect conclusions.

Unbalanced population sampling – how a biased sample leads to the wrong answer

In 1936 The Literary Digest magazine conducted a huge survey to try to predict the result of the upcoming US presidential election. The magazine sent election polls to 10 million people and received responses from 2.5 million of them. Analysis of the answers predicted that Alfred Landon would defeat Franklin Roosevelt by 57% to 43%. To the surprise of the poll's editors and readers, Roosevelt won, receiving 62% of the vote.

The data analysis mistakes

In analyzing what went wrong, researchers found two significant mistakes in the data analysis, both known as sampling errors:

1 – The people to whom the survey was sent did not constitute a representative sample of the United States population, since the survey was sent only to telephone owners. The reason was technical: the addresses of people with a telephone line were listed in the telephone book, but in 1936 only the upper and middle classes had telephone lines, so the survey never reached the lower class.

2 – The survey was sent to 10 million people and only 2.5 million of them answered it. The characteristics of people who agree to respond to a survey may differ from those of the general population. The same phenomenon exists in online product reviews: people with a very negative or very positive opinion about a product they ordered are far more inclined to write a review than customers whose opinion of the product was average. (A toy simulation of both effects follows below.)
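
A toy simulation, using invented numbers loosely inspired by the story above, can make both effects concrete: a sampling frame that over-represents one subgroup, plus uneven response rates, is enough to flip the predicted winner.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented electorate: 10 million voters, 62% of whom support candidate A.
population = rng.random(10_000_000) < 0.62

# Selection bias: the sampling frame (e.g. telephone owners) over-represents
# a subgroup in which support for A is only 40%.
frame_size = 2_000_000
frame_supports_a = rng.random(frame_size) < 0.40

# Non-response bias: supporters of A answer less often than supporters of B.
response_prob = np.where(frame_supports_a, 0.20, 0.30)
responded = rng.random(frame_size) < response_prob

print(f"True support for A:        {population.mean():.0%}")
print(f"Poll's predicted support:  {frame_supports_a[responded].mean():.0%}")
```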

What can be learned from this?

When we want to draw conclusions from data, we must run sanity checks to ensure the sample represents the population we are working with. For example, we cannot assume that the conversion rate of traffic that reaches a site from search engines (organic traffic) will be the same as that of traffic coming from a sponsored campaign. To predict how a campaign will perform, make sure the sample of the population from which you are predicting is similar to the population you are predicting for.
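
One way to run such a sanity check, sketched here with invented traffic counts, is to compare the segment mix of the group you are predicting from with the group you are predicting for, for example using a chi-square goodness-of-fit test from scipy:

```python
from scipy.stats import chisquare

# Invented visitor counts by device segment for two traffic sources.
organic_counts  = {"desktop": 5200, "mobile": 3900, "tablet": 900}  # reference population
campaign_counts = {"desktop": 310,  "mobile": 620,  "tablet": 70}   # sample to evaluate

segments = list(organic_counts)
observed = [campaign_counts[s] for s in segments]

# Expected counts if campaign traffic had the same segment mix as organic traffic.
organic_total = sum(organic_counts.values())
expected = [organic_counts[s] / organic_total * sum(observed) for s in segments]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {stat:.1f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("The segment mix differs - be careful extrapolating organic conversion rates.")
```

If the mix clearly differs, conversion rates from one source should not be projected onto the other without adjustment.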

Confounding variables – alternative explanations for the conclusions

“Does semen have antidepressant properties?”

This was the title of a real scientific study conducted on 256 women and published in 2002. The researchers showed that there was a statistically significant relationship between condom use among women and depressive symptoms.

Does the research prove that sperm reduces depression in women? The answer is: probably not.

The researchers compared a group of women who used condoms to a group that did not, and measured their levels of depression. But could there be alternative explanations for the relationship between semen and depression?

We know that women (and men) who have a casual partner, or are in a relationship that is still in its early stages, tend to use condoms, and as the relationship develops women tend to switch to the pill or other contraceptives. It is therefore more likely that being in a long relationship is the variable that reduces depressive symptoms, rather than the men's semen. This phenomenon is called ‘confounding variables’. A confounding variable is a variable that affects the results of the study but that we do not measure or address when drawing conclusions.

What can be learned from this analysis mistake?

Confounding variables may also appear in studies in the business world and cause data analysis mistakes. Consider the following hypothetical example:

In a study by a bank's risk department, the researchers found that the most influential factor in a customer's level of risk was their residential area.

When the researchers tried to understand how a residential area might affect the level of risk, they reviewed previous studies which found that people tend to live near others of a similar socioeconomic status, and that people of low socioeconomic status are less likely to meet their loan repayment obligations. The researchers therefore concluded that although a relationship was found between the customer's area of residence and their level of risk, the variable that really affected the results was the customer's socioeconomic status, not their area of residence.
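
A minimal sketch of how such a confounder can surface, using an entirely invented loan dataset: socioeconomic status drives both where people live and whether they default, so the raw default rates differ by area, but the gap largely disappears once we stratify by status.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 20_000

# Invented data: status drives BOTH the area people live in and the default
# probability; the area itself has no direct effect on risk.
status = rng.choice(["low", "high"], size=n)
area = np.where(rng.random(n) < np.where(status == "low", 0.8, 0.2), "area_A", "area_B")
default = rng.random(n) < np.where(status == "low", 0.15, 0.05)

df = pd.DataFrame({"status": status, "area": area, "default": default})

# Naive view: default rates look very different between the two areas.
print(df.groupby("area")["default"].mean())

# Controlling for the confounder: within each status group the areas look alike.
print(df.groupby(["status", "area"])["default"].mean())
```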

It is important to understand that you cannot avoid confounding variables completely. In any study there may be variables we did not know existed that actually influenced the results, but experience working with data and familiarity with the domain the study covers can reduce this phenomenon.

Hawthorne effect – what to watch out for when doing A/B testing

In the 1930s, a number of studies were conducted at the Hawthorne plant in the United States that examined the relationship between the lighting intensity in the plant and the workers' output. The studies showed that when the lighting was increased, the workers' performance increased, but surprisingly, even when the light intensity was lowered, the workers' performance increased. The explanation for this phenomenon became known as the Hawthorne effect, named after the plant where the experiments were performed.

The reason for the effect is that people notice a sudden change, and the change itself affects their behavior. The factory workers noticed the change in lighting, assumed they were being observed, and whenever there was a change they increased their performance. Over time, at both low and high light intensity, the workers' performance returned to average, regardless of the light intensity.

What can be learned from this?

According to the Hawthorne effect, we must be suspicious of immediate effects caused by changes. For example, one of the most popular and effective techniques for measuring the effect of a change is called A/B testing. In this technique, a change to the product is shown to a subset of users selected for the experiment. These users are measured against the group of regular users, and the data analyst checks whether their behavior changes. For example, an e-commerce website can change the color of the button on the purchase page to examine whether the change leads to more clicks.

According to the Hawthorne effect, there is a chance that the new variation will get more clicks simply because the regular visitors notice that the button changed. That is, the change itself affected the clicks, not the new button. To reduce the Hawthorne effect in your experiments, run the experiment for a long period of time and check whether the increase in the metric persists over time.
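
A minimal sketch of such a check, with invented click counts: compare the two variants with a two-proportion z-test, then repeat the comparison on a later time window to see whether the lift persists once the novelty has worn off.

```python
from math import sqrt
from scipy.stats import norm

def two_proportion_ztest(clicks_a, n_a, clicks_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return p_b - p_a, 2 * norm.sf(abs(z))

# Invented numbers: week 1 right after the change, week 4 once users are used to it.
# Each entry is ((clicks, visitors) for variant A, (clicks, visitors) for variant B).
for label, (a, b) in {"week 1": ((400, 10_000), (520, 10_000)),
                      "week 4": ((405, 10_000), (420, 10_000))}.items():
    lift, p_value = two_proportion_ztest(*a, *b)
    print(f"{label}: lift = {lift:+.2%}, p = {p_value:.3f}")
```

In this made-up example the lift is large and significant in the first week but disappears by week four, which is exactly the pattern a Hawthorne (novelty) effect would produce.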

In summary

I have listed here several types of biases and errors that may occur in research with data. There are many other types of errors that I have not addressed, such as errors in displaying data, data cleansing errors, or relying on averages for variables with high variance.

The important thing to take away from this article is that when working with data you should always run sanity checks on the results.

This does not mean that conclusions should not be drawn from data. On the contrary, making decisions based on data is key to the success of many organizations. To avoid biases and errors, the data analyst should work together with product managers who know the business domain well and, together, test the reliability of the conclusions.


This article was written by Yuval Marnin.
If you need to hire a freelance data analyst, you may contact me at: [email protected]
