Fooled by the numbers: Simpson’s paradox

Juraj Pálka
Published in Analytics Vidhya
7 min read · Apr 29, 2020

Image by Amber Lamoreaux

People love absolutes. A thing is either black or white. Not both. Definitely not a rainbow. A rainbow is very difficult to measure properly and thus impossible to manage.

In my job as a data analyst I am often asked for a simple total number. An aggregate. A single metric to rule them all. In the domain of web analytics, this single number is usually the conversion rate: how many of the people who visited my website did what I wanted them to do?

On our fake website we launched two themes, one featuring the magic of Harry Potter and the other the force of Star Wars. We asked our visitors which version they liked more. As we had a very junior developer for our website, we couldn’t run an A/B test. Instead, we just launched Harry Potter for a week and Star Wars the next week.

Hint: Do not test in this manner on your website.

The results are statistically significant.

Harry Potter’s magic beats the force users. Even the power of the dark side can’t save the designer of the Star Wars theme. Let’s just implement the Harry Potter theme.

Not so fast.

There were some public holidays during the week we tested the Star Wars theme, and more people were visiting the website from mobile devices instead of PCs. When we look at the mobile-only results, we observe the following.

The result is statistically significant.

Seems like Star Wars is winning on mobile.

So if the overall result is in favour of Harry Potter, yet on mobile it is in favour of Star Wars, that means Harry Potter must have won big on PC, right?

The result is statistically significant.

Star Wars wins. “What dark magic is this?” asks Harry.

How can Star Wars be better in both cases yet worse overall?

This is called Simpson’s paradox.

Definition

Simpson’s paradox is a phenomenon in probability and statistics, in which a trend appears in several different groups of data but disappears or reverses when these groups are combined.

A lurking variable is a variable that influences both the dependent variable and the independent variable, causing a spurious association. Confounding is a causal concept, and as such, it cannot be described in terms of correlations or associations.

We are evaluating the performance of our website based on the conversion rate. However, the conversion rate is impacted by the type of device the user has. The data show that in the week when Star Wars was tested, more users visited our website on mobile, which influenced both the number of visits and the conversion rate. In this example, the device is the lurking variable.
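To see the mechanics, here is a minimal sketch with made-up visit and conversion counts (the numbers are my own illustration, not the original results). Star Wars converts better on both devices, yet Harry Potter wins once the devices are pooled, because each overall rate is a weighted average and the two themes were exposed to very different device mixes.

```python
# Hypothetical counts: (visits, conversions) per theme and device.
counts = {
    ("Harry Potter", "PC"):     (1000, 200),  # 20% conversion rate
    ("Harry Potter", "Mobile"): (200,  10),   #  5%
    ("Star Wars",    "PC"):     (200,  50),   # 25%
    ("Star Wars",    "Mobile"): (1000, 80),   #  8%
}

# Star Wars wins within every device, but pooling flips the ranking.
for theme in ("Harry Potter", "Star Wars"):
    visits = sum(v for (t, _), (v, _) in counts.items() if t == theme)
    conversions = sum(c for (t, _), (_, c) in counts.items() if t == theme)
    print(f"{theme}: {conversions / visits:.1%} overall")
# Harry Potter: 17.5% overall, Star Wars: 10.8% overall
```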

How to deal with the Simpson’s paradox?

There are many ways of making sure you don’t fall victim to the trap of Simpson’s paradox, so I have simplified them into three direct actions.

1. Question the aggregated data.

Think about whether there are multiple groups in which the effect of the tested “thing” varies.

The conversion rate varies greatly between PC and mobile. The user’s device is in a causal relation with the conversion rate, which is lower on mobile than on PC.

Action: Break down the total data into smaller groups and validate whether the result is observed in the sub-groups as well (a short sketch of this check follows after the three actions).

2. Get as much context about the data as possible. Data without context should be read as an attempt to manipulate.

You marketed the design of the new theme for your website as a competition. The designer whose proposal is better gets paid. The other doesn't. The Harry Potter designer knows that during public holidays more people visit from mobile, where the conversion rate is lower. That is why he rushed his design to be completed and tested before the holidays started.

Action: Question how the data were generated. Have we collected the data under similar conditions? Did we apply the same methodology in asking for feedback?

3. Try to hunt down any potential lurking variables for your experiment. What factors influence the results that the data don’t show?

We should have mapped all the potential causes influencing the conversion rate. Device is one of them; time and the demography of the visitors might play a role as well. Either test only on a specific audience or evaluate the results for all sub-groups.

Action: Put together a detailed causal diagram. Think about underlying groups in the data and their potential impacts on the result. Draw a map of what affects what.
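Coming back to the first action, here is a small pandas sketch of the break-it-down check, assuming a raw visit log with one row per visit and a 0/1 converted flag (the file name and column names are hypothetical).

```python
import pandas as pd

# Hypothetical visit-level log with columns: theme, device, converted (0/1).
log = pd.read_csv("visits.csv")

# The aggregate everyone asks for: overall conversion rate per theme.
print(log.groupby("theme")["converted"].mean())

# The same metric broken down by a suspected lurking variable.
print(log.groupby(["theme", "device"])["converted"].mean())

# If the per-device ranking disagrees with the overall ranking,
# you are looking at Simpson's paradox.
```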

Splitting the data into smaller groups seems like a solution to all the problems, so what might be the problems with it?

In our little example we dealt with only two themes and only two devices. Imagine having 5 themes, 4 devices, 10 operating systems, 20 browsers, 30 languages, etc. There is no limit to the number of categories you can split your data into. The more you split them, the smaller the sample size you get per group. With a small sample size your results might not be statistically significant, and thus you cannot be sure whether there is a difference between the themes at all.
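For a single sub-group you can check this with a simple test of two proportions; the sketch below uses Fisher’s exact test on made-up counts for one hypothetical browser.

```python
from scipy.stats import fisher_exact

# 2x2 table for one sub-group: rows = themes, columns = (conversions, non-conversions).
# The counts are made up to illustrate a small sample.
table = [[12, 88],   # Harry Potter: 12 conversions out of 100 visits
         [18, 82]]   # Star Wars:    18 conversions out of 100 visits

odds_ratio, p_value = fisher_exact(table)
print(p_value)  # a large p-value means the sub-group is too small to call a winner
```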

Another issue lies in the decision phase. If you find out that a different theme performs best in each web browser, would you invest in personalised development of your website? Will you be designing and developing each new feature in a different theme? If you cannot afford a personalised approach, then you would probably aim for the theme which works best for the majority of your users anyway.

This brings us to the most important part: the decision.

How should I decide if the overall trend does not match the trend in the sub-groups?

You need to be aware of the causal relations within your data. You need to know what you want. There is no correct decision for all situations and all purposes. It all depends.

If the goal is to apply the theme which will bring you more conversions, go for the Star Wars theme. It performs better on both devices. If you took a bet with a friend about which design performed better, and he will ask 20 random people who saw the Star Wars theme and 20 random people who saw the Harry Potter theme whether they liked it, choose the Harry Potter theme.

To better clarify what I mean by knowing what you want, here is another example.

If a political party tells you how they lowered all taxes compared to the previous government, beware. Maybe they are just misusing Simpson's paradox for their own gain. They lowered the tax for people with a monthly income below 1k EUR to 5% to help the lowest income families. At the same time, they lowered the tax for all other incomes to 20%.

You are comparing two different time periods, and in order to validate whether the government is manipulating you or not, you need to decompose the data.

Thanks to the overall economic growth, some families could have moved from the lower income group to the higher income group. This means they were previously in the group paying 10% (now lowered to 5%), and they are now in the group which used to pay 21% (now lowered to 20%).
The income group is the lurking variable in this case.

This inter-group transition greatly changed the composition of each group and thus the overall effect.

With the economic growth in place, almost nobody is below 1k EUR any more. This means that for many people the tax changed from the previous 10% to 20%. In absolute terms, the government collects more in taxes than the previous government, even though they lowered all the tax rates.
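A quick back-of-the-envelope sketch makes this concrete. The tax rates (10% and 21% before, 5% and 20% after) come from the example above; the household counts and incomes are hypothetical numbers of my own.

```python
# (households, average monthly income in EUR, tax rate) per income group
before = [
    (600,  800, 0.10),  # below 1k EUR
    (400, 2000, 0.21),  # above 1k EUR
]
after = [
    (100,  900, 0.05),  # below 1k EUR: far fewer households after the growth
    (900, 1800, 0.20),  # above 1k EUR: many households moved up a group
]

def collected(groups):
    return sum(n * income * rate for n, income, rate in groups)

print(collected(before))  # 216,000 EUR per month
print(collected(after))   # 328,500 EUR per month: more tax, despite lower rates in both groups
```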

If your question was whether the government lied about lowering all the taxes, then the answer is that it didn't. They did lower the tax rates, which is true, but the overall tax paid by the people increased, which they didn't say.

Is the lower tax better for you? If you stayed in the same group, then yes, it is. If you moved to the higher income group, then it isn't: you are paying more than before.

Takeaway: Clarify the question you want to answer with the data. Identify what might be impacting the results and what needs to be taken into consideration (lurking variables). Split the data into relevant groups. Make sure you understand the methodology with which the data were collected and whether it is aligned with your interpretation.

Common lurking variables for Simpson’s paradox

Gender, weekday, years, device type, operating system, price category, income level of the customer, purchased product type, severity of the case

References

Wikipedia: Simpson's paradox

Wikipedia: Confounding

Introduction to Simpson's Paradox (Jupyter notebook examples)

Simpson's paradox: how to prove two opposite arguments using one dataset

Martin Gardner: On the fabric of inductive logic, and some probability paradoxes

Wikipedia: Causal diagram

On Simpson's Paradox and the Sure-Thing Principle

MinutePhysics: Simpson's paradox
