
A/B-Testing

Everyone interested in data science has probably come across “A/B-Testing”. While many universities don’t seem to cover this important subject, knowing the term is considered basic knowledge in job interviews.

I have even found that many co-workers have trouble understanding how versatile this procedure can be. There is “no free lunch” for success, and we have to take many different parameters into account in order to design a well-functioning A/B-Test that leads to significant results.

A/B-Testing in a Nutshell

In a simplified way: We have an algorithm (variant A) as baseline and want to introduce a promising new, improved version (variant B) into production. In order to evaluate whether our assumption was right and the new algorithm really leads to an improvement w.r.t. our KPIs (for instance rpm), we would normally start an A/B-Test for which the users are split into two groups. One group gets our baseline algorithm (variant A) and the other group our new algorithm (variant B). After running the test for a certain time span, we stop it and analyze the results w.r.t. the chosen KPIs. Based on the outcome we choose the variant with the highest improvement, which would hopefully be the new version.
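To make this flow concrete, here is a minimal toy sketch in Python. All numbers (user count, conversion rates) are invented purely for the simulation; it only illustrates the split-run-compare pattern described above.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical setup: 10,000 users, each randomly assigned to A or B.
n_users = 10_000
variant = rng.choice(["A", "B"], size=n_users, p=[0.5, 0.5])

# Made-up "true" conversion rates, just for the simulation.
true_rate = {"A": 0.050, "B": 0.055}
converted = np.array([rng.random() < true_rate[v] for v in variant])

# Compare the observed conversion rate per variant.
for v in ("A", "B"):
    mask = variant == v
    print(f"Variant {v}: n={mask.sum():5d}, conversion rate={converted[mask].mean():.4f}")
```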

Important “parameters” for an A/B-Test

You might have wondered what I specifically meant by “certain time span” and “splitting into two groups” etc. I will now introduce some important questions / parameters for designing our A/B-Test.

1. What use case do we want to test?

Our use case should be made very clear. In most cases there will be some requirements coming from another department (or even our team) and we have to implement a new algorithm based on those. Do we want to test a frontend functionality or an algorithm working only in the backend? Are we targeting certain users or is it very generic?

On the one hand, the algorithm we want to test is built around this use case; on the other hand, we also need a clear picture of all its implications.

2. How do we decide if our new algorithm is better than the current one (baseline)?

We need to evaluate whether our A/B-Test was successful. Therefore we need to define KPIs that make sense for our use case. Normally the company wants to maximize revenue, so we’d measure characteristics leading to an uplift, such as rpm (revenue per mille), clicks, purchases, redemptions etc.

Another important aspect is significance. We also need to know that our outcome is significant, otherwise our test would be meaningless. There are many suitable statistical tools, such as calculating the p-value. Just as a reminder: the p-value is the probability of obtaining test results at least as extreme as the observed results of a statistical hypothesis test, assuming that the null hypothesis is correct.
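As an illustration, a common choice for click- or purchase-style KPIs is a two-proportion z-test. A minimal sketch using statsmodels follows; the click and impression counts are made up:

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical outcome: clicks and impressions per variant (made-up numbers).
clicks = [530, 590]            # variant A, variant B
impressions = [10_000, 10_000]

# Null hypothesis: both variants have the same click-through rate.
z_stat, p_value = proportions_ztest(count=clicks, nobs=impressions)

print(f"z = {z_stat:.3f}, p-value = {p_value:.4f}")
if p_value < 0.05:
    print("Difference is significant at the 5% level.")
else:
    print("No significant difference detected.")
```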

In addition, we might have to take other aspects into account, such as other tests running at the same time as ours.

3. Carry-over effect?

We should also take into account whether A/B-Tests from the past might have an effect on our new one. Were there any tests altering the behavior of, or targeting a learning effect on, the users we consider for our new A/B-Test? Other departments might also be launching A/B-Tests in the same time span, and those could have an impact on our results as well. These factors should be checked beforehand so they can be avoided.

4. How do we split?

There are many ways to split into two (or even more) variants. In most examples, and also in frontend testing, we split users randomly into two groups and those users stay in their groups. This means we need to track which user was assigned to which group, taking care that the user is not affected by a change in treatment (a sketch of such a sticky assignment follows below). There may also be cases in which it makes sense to divide users into groups beforehand.

But it might also be the case that the user does not even notice a change in treatment and we have ensured that there is no impact on them. Then we might randomly split the traffic / requests and only log the assigned variant for each request.

We could also split by traffic source, traffic medium, country, living area of the user, device type used by the user etc. There are many possibilities depending on what we want to test.
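For the “sticky” user split mentioned above, one common approach (a sketch, not the only way to do it) is to hash a stable user ID together with an experiment name, so the same user always lands in the same variant without keeping a separate lookup table. The experiment name and split value below are just placeholders:

```python
import hashlib

def assign_variant(user_id: str, experiment: str = "new_ranking_algo",
                   split: float = 0.5) -> str:
    """Deterministically assign a user to variant 'A' or 'B'.

    Hashing user_id + experiment name yields a stable, roughly uniform
    value in [0, 1); users below `split` go to variant A, the rest to B.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 2**32   # map first 8 hex chars to [0, 1)
    return "A" if bucket < split else "B"

# The same user gets the same variant on every request.
print(assign_variant("user-123"))
print(assign_variant("user-123"))
```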

5. Splitting ratio?

Most A/B-Tests are conducted with a 50/50 ratio, meaning that the traffic, users etc. are split randomly with a 50% chance of ending up in one variant or the other.

But maybe we want to start with a smaller group for our new algorithm (e.g. 90/10) and increase the share for variant B once we have our first test results. That way we can make sure we didn’t mess up, and the company won’t lose much with a damaging variant. But that’s only for cases in which we are not sure whether the new algorithm brings the improvement we hoped for. The evaluation of our rpm then has to be carried out proportionately.
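To compare a 90/10 split fairly, the revenue has to be normalized by each variant’s own traffic, for example as rpm. A small sketch of what that could look like; the revenue and impression totals are invented:

```python
# Hypothetical logged totals for a 90/10 split (numbers are made up).
results = {
    "A": {"revenue": 45_000.0, "impressions": 900_000},  # 90% of traffic
    "B": {"revenue": 5_400.0,  "impressions": 100_000},  # 10% of traffic
}

# rpm = revenue per 1,000 impressions, so each variant is judged
# relative to its own traffic share rather than by raw revenue.
for variant, r in results.items():
    rpm = r["revenue"] / r["impressions"] * 1_000
    print(f"Variant {variant}: rpm = {rpm:.2f}")
```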

6. Which time span do we choose?

We could rely on experience from the past: evaluating how long we had to wait until we got clear and significant results in other, similar tests.

Another idea would be to check our evaluation on a daily basis, i.e. monitoring traffic, our preliminary KPI results and our significance test. At some point we would reach significant results and could stop our test.

But most companies would like an estimate of the time needed for the A/B-Test beforehand, in order to know how much the test could cost. So we might have to combine both approaches.
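One way to estimate the duration up front is a classic power analysis: decide on the smallest uplift worth detecting, compute the required sample size per variant, and divide by the expected daily traffic. A sketch using statsmodels, where the baseline rate, target uplift and daily traffic are all assumptions:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Assumptions: 5% baseline conversion rate, we want to detect an uplift to 5.5%.
baseline, target = 0.050, 0.055
effect_size = proportion_effectsize(target, baseline)

# Required users per variant for 80% power at a 5% significance level.
analysis = NormalIndPower()
n_per_variant = analysis.solve_power(effect_size=effect_size, alpha=0.05,
                                     power=0.8, ratio=1.0)

daily_users_per_variant = 2_000   # assumed traffic per variant after a 50/50 split
days_needed = n_per_variant / daily_users_per_variant
print(f"~{n_per_variant:,.0f} users per variant -> about {days_needed:.1f} days")
```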

Another useful aspect, depending on the use case, is to consider whether seasonal effects have an impact on our evaluation. During the COVID lockdowns, for instance, people behaved differently (shopping more online), which also affected possible test results. This should be taken into account as well.

Those are, in my opinion, the most important points to keep in mind while designing an A/B-Test. Besides, this post is getting too long, so if you made it this far: thank you 🙂
