A Basic Guide to A/B Testing


Introduction – What is A/B Testing?

In the A/B testing methodology, the product manager tests a new version of the product on a small group of users, and the test results show whether the new version is a statistically significant improvement over the previous version.

The methodology is called A/B because A and B refer to the different versions of the product. For simplicity, this article covers the simple case of testing two versions against each other, but in practice more than one alternative can be tested against the original version, and more complex experimental designs can be built that allow testing interactions between versions and changes in the product.

It is important to note that this article is a brief overview of the complex subject of A/B testing and is not a practical guide. If you want to run a test correctly and systematically, it is recommended to work with a data analyst who has formal training in statistics and research methods, so that the test is built correctly and without biases that may lead to incorrect conclusions from the experiment.

A/B Testing and Hypothesis Testing

The technique of A/B testing relies on classical statistical research. To conduct a proper test, we must therefore be familiar with hypothesis testing, significance tests, and the damage that statistical biases can cause, especially biases arising from incorrect sampling and from confounding variables.

Every A/B test is actually a hypothesis test that can be formulated as follows:

H0 – There is no difference between the versions of the product.
H1 – The new version of the product is better (or there is some difference between the versions).

*There are other ways to formulate the hypotheses, and the interpretation of the test results depends on how the hypotheses are formulated.
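
To make the two formulations concrete, here is a minimal Python sketch on simulated sales data: the same comparison is run once with a two-sided alternative ("the versions differ") and once with a one-sided alternative ("the new version is better"). The numbers and group sizes are made up for illustration.

```python
# A minimal sketch (simulated numbers, not real data) of the two formulations.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sales_a = rng.normal(loc=100, scale=20, size=500)  # existing version (A)
sales_b = rng.normal(loc=103, scale=20, size=500)  # new version (B)

# H1: "there is some difference between the versions" -> two-sided test
t_two, p_two = stats.ttest_ind(sales_b, sales_a)

# H1: "the new version is better" -> one-sided test
# (the `alternative` argument requires SciPy 1.6 or newer)
t_one, p_one = stats.ttest_ind(sales_b, sales_a, alternative="greater")

print(f"two-sided p-value: {p_two:.4f}")
print(f"one-sided p-value: {p_one:.4f}")
```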

Choosing Metrics for A/B Testing

In order to determine which version yields better results, we must decide on a metric according to which we will measure the success of each version.

For example, whether the average sales of the new version of the product (B) are higher than the average sales of the existing version (A).

It is possible and desirable to set more than one metric, but it should be noted that for each metric we must formulate its own hypothesis and perform a distinct test.

Also, it is important to note that sometimes there may be an improvement in one metric but a decline in another. For example, there may be a situation where in the test it is discovered that the new version of the product has improved the quantity of sales, but has greatly harmed the conversion rate.

After analyzing the results of the statistical tests for all the metrics we examined, we can make a decision whether we want to replace the existing version of the product with the new version.
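
As a rough illustration of working with more than one metric, the sketch below summarizes a small, made-up event-level table per version; the column names (variant, revenue, converted) are assumptions, and each metric then gets its own hypothesis test as described above.

```python
# A sketch of summarizing several metrics per version before testing each one
# separately. The table and its column names are made up for illustration.
import pandas as pd

df = pd.DataFrame({
    "variant":   ["A", "A", "A", "B", "B", "B"],
    "revenue":   [0.0, 35.0, 20.0, 0.0, 55.0, 60.0],
    "converted": [0, 1, 1, 0, 1, 1],
})

summary = df.groupby("variant").agg(
    users=("converted", "size"),           # number of users per version
    conversion_rate=("converted", "mean"),
    avg_revenue=("revenue", "mean"),
)
print(summary)
# Each metric (conversion_rate, avg_revenue) gets its own hypothesis and its
# own statistical test; one metric may improve while another declines.
```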

The Importance of Random User Selection in A/B Testing

In A/B testing, each user of the product is directed to one of the test versions: the existing version (A) or the new version (B).

To avoid sampling bias, users should be randomly assigned to one of the versions.

If we do not assign users randomly and instead give a specific group of users the new version, the results we get may be biased by the characteristics of the group we chose rather than reflect the true effect of the new version.

For instance, if we direct users who came to the product through Google ad campaigns to the new version and find an improvement in sales, we won’t be able to know whether the improvement is due to the fact that users who come from campaigns tend to purchase more or whether the new version is actually better.
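
As one possible way to implement such random assignment, the sketch below hashes the user ID into a bucket, so each user is pseudo-randomly but consistently assigned to a version regardless of how they arrived at the product. The function name, salt, and 10% treatment share are assumptions for illustration.

```python
# A sketch of random, stable assignment: hashing the user id means a user gets
# the same version on every visit. The salt and the 10% treatment share are
# illustrative assumptions.
import hashlib

def assign_variant(user_id: str, salt: str = "ab-test-2024", treatment_share: float = 0.10) -> str:
    """Return "B" (new version) for roughly `treatment_share` of users, "A" otherwise."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform value in [0, 1]
    return "B" if bucket < treatment_share else "A"

print(assign_variant("user-12345"))  # the assignment ignores how the user arrived
```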

Desired Sample Size in A/B Testing

Ideally, we would like to divide users evenly into two groups: one group receives the original version of the product, and the other group receives the new version.

The problem with this approach is that the new version may be worse than the current version, and directing 50% of users to it may significantly harm the product.

So, what is the ideal sample size?

Unfortunately, there is no single answer to this. To have a high level of confidence in the test result, we should aim for the largest practical number of users in the new version. In addition, when the test is performed on a highly diverse population, we will need more users in the test to be able to detect a real difference.

For example, suppose the metric we are measuring is the total purchase amount in an e-commerce company where many customers make high-value purchases and many others make low-value purchases. The variability of the purchase amount will then be high, and to determine whether the new version really had an impact on sales, we will need to run the test on many customers.
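
For a concrete sense of the numbers involved, here is a hedged sketch of a sample-size calculation for a conversion-rate metric using statsmodels' power utilities. The baseline rate of 5%, the minimum lift we care about (to 6%), and the conventional alpha = 0.05 and power = 0.8 are assumptions, not values from this article.

```python
# A sketch of a sample-size calculation for a conversion-rate metric using
# statsmodels' power utilities. The baseline rate (5%), the minimum lift we want
# to detect (to 6%), and alpha=0.05 / power=0.8 are assumptions for illustration.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect = proportion_effectsize(0.06, 0.05)  # expected vs. baseline conversion rate
n_per_group = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.8, ratio=1.0, alternative="two-sided"
)
print(f"users needed in each group: {n_per_group:.0f}")
```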

Duration of A/B Testing

To control for confounding variables, such as days of the week with different user behavior patterns, it is advisable to run the test over an extended period of time. The length of time depends on the nature of the product: sometimes two weeks are enough, and sometimes it is preferable to run the test for a month.
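
As a back-of-the-envelope sketch, a required sample size (for example, from a power calculation) can be translated into a test duration and rounded up to whole weeks, so that every day of the week is covered equally. All traffic figures below are assumptions.

```python
# A back-of-the-envelope duration estimate: required sample size divided by the
# share of daily traffic sent to the test, rounded up to whole weeks so every
# weekday is covered equally. All traffic figures are assumptions.
import math

required_per_group = 4000   # e.g. taken from a power calculation
daily_users = 2000          # assumed daily traffic to the product
share_in_test = 0.2         # e.g. 10% of users to A and 10% to B

days_needed = (2 * required_per_group) / (daily_users * share_in_test)
weeks = math.ceil(days_needed / 7)
print(f"run the test for about {weeks} week(s)")  # here: 20 days -> 3 weeks
```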

Statistical Significance in A/B Testing

In order to reject the null hypothesis (H0) and accept the alternative hypothesis (H1), we need to ensure that there are statistically significant differences in the metrics we are testing.

In the field of statistical inference, there are many types of tests that can determine whether the difference between metrics is statistically significant. The appropriate statistical test is chosen based on the experiment design and the metric being evaluated.

For example, if we want to test whether there is an increase in the average purchase amount per customer, we use a t-test for comparing means. If we want to test for an improvement in the users' conversion rate, we use a proportion test.

Generally, when the p-value is less than 0.05, the result is considered statistically significant. When the p-value is less than 0.01, the result is considered highly significant.
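
The sketch below shows the two tests mentioned above on simulated data: SciPy's independent-samples t-test for comparing average purchase amounts, and statsmodels' two-proportion z-test for comparing conversion rates. The sample sizes and counts are illustrative.

```python
# A sketch of the two tests mentioned above, on simulated/illustrative numbers.
import numpy as np
from scipy import stats
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(7)

# t-test for means: comparing the average purchase amount per customer
purchases_a = rng.normal(loc=48.0, scale=15.0, size=1000)
purchases_b = rng.normal(loc=50.0, scale=15.0, size=1000)
t_stat, p_means = stats.ttest_ind(purchases_b, purchases_a)

# proportion test: comparing conversion rates (conversions out of users)
conversions = np.array([230, 275])   # version A, version B
users = np.array([5000, 5000])
z_stat, p_rates = proportions_ztest(conversions, users)

print(f"t-test p-value: {p_means:.4f}")
print(f"proportion-test p-value: {p_rates:.4f}")
```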

What is A/A testing methodology?

A/A testing is a methodology in which the same version is shown to both the test group and the control group, and we then check whether there is a statistical difference between the groups.

Usually, there should be no difference in the metrics of the two populations of users since they are receiving the same version. However, if we detect a difference between the populations in the test, it means that we have a technical issue in the experiment and we need to identify it before we conduct the actual experiment. For example, it could be that the selection of users for the test is not being done randomly, or that the server displaying the version to the users in the test is slower.
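
A minimal sketch of such an A/A check on simulated data: both groups see the same version and share the same underlying conversion rate, so the test should usually not find a significant difference. A very small p-value here would point to a problem in the assignment or measurement pipeline.

```python
# A sketch of an A/A check on simulated data: both groups see the same version
# and share the same underlying 5% conversion rate, so the test should usually
# not find a significant difference.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(0)
group_1 = rng.binomial(1, 0.05, size=10_000)
group_2 = rng.binomial(1, 0.05, size=10_000)

stat, p_value = proportions_ztest(
    [group_1.sum(), group_2.sum()], [len(group_1), len(group_2)]
)
print(f"A/A p-value: {p_value:.4f}")  # a very small value would signal a problem
```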

Criticism of A/B Testing

There are several critiques of A/B testing:

The Hawthorne Effect or Novelty Effect

In some cases, the mere fact of changing a product and presenting a new version to users can cause changes in the metrics being tested, regardless of the characteristics of the new product. For example, users may click on a new button just because it is new, not because they actually want to use it.

This phenomenon is called the Hawthorne Effect or Novelty Effect, and the way to avoid it is to conduct long tests that will show that there is still a difference between the two versions over time, even when the change is no longer new to users.

Relying too much on statistical significance in A/B testing

From a statistical perspective, the larger the sample size, the smaller the standard error of the sample mean, and the smaller the standard error, the greater the chance that a test result will be statistically significant. Thus, in very large samples, almost every change may appear statistically significant.

Therefore, when deciding whether an A/B test has succeeded, it is advisable not to rely only on significance tests to determine whether we should accept the new version of the product.
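
To illustrate the point, the sketch below simulates a huge experiment in which the new version adds only a 0.05 percentage-point lift in conversion rate; the p-value comes out far below 0.05 even though the practical effect is negligible. The numbers are simulated, not real results.

```python
# A sketch of why significance alone is not enough: with an enormous sample, a
# lift of only 0.05 percentage points still yields a very small p-value. The
# numbers are simulated, not real results.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(1)
n = 10_000_000
conv_a = rng.binomial(n, 0.0500)   # existing version, 5.00% conversion rate
conv_b = rng.binomial(n, 0.0505)   # new version, 5.05% conversion rate

stat, p_value = proportions_ztest([conv_b, conv_a], [n, n])
lift = conv_b / n - conv_a / n
print(f"p-value: {p_value:.2e}, absolute lift: {lift:.4%}")
# The p-value is far below 0.05 even though the practical effect is tiny, so the
# decision should also weigh the size of the effect, not only its significance.
```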


This article was written by Yuval Marnin.
If you need to hire a freelance data analyst, you can contact me at: [email protected]
