A/B testing is the backbone of tech companies. They try, fail, and quickly implement what works. They trust the results as if they came straight from the C-level, and they obey them. I am the data scientist telling you that maybe you should listen a little less to the “sample size calculator” and the “paired t-test result”.
The more you do something, the more confident you become doing it. The more you automate your A/B testing, the more you feel like “this works like a charm, I get a 1% uplift, this pays for my salary, right?”
So what goes wrong in practice? How is it that you implemented 100 feature changes, each with a 1% uplift, and your final metric moved by only 3% after a year of tough work? Let’s look at why your successful 2-week A/B test went wrong.
Novelty effect
Customer-facing features have a natural novelty effect; other features may trigger a competitor reaction effect. Novelty generates clicks, and then it fades away. Typically, the metric spikes at launch and then decays back toward its long-run level.
Competitors’ reactions take a bit more time, but they will eventually make your feature “the norm” and the impact will diminish. In some systems only the final outcome matters; in that case the declining effect is never observed. It could also be that the effect only declines long after the measured impact.
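A minimal sketch of why this bites: assume (the numbers below are entirely made up) a feature whose true long-run uplift is 0.5%, plus a novelty bonus that decays with a 10-day half-life. A 2-week experiment mostly measures the bonus.

```python
import math

# Hypothetical novelty-decay model: base uplift plus an exponentially
# decaying novelty bonus (10-day half-life). All numbers are invented.
def observed_uplift(day, base=0.005, novelty=0.03, half_life=10):
    return base + novelty * 0.5 ** (day / half_life)

# What a 2-week experiment reports vs. what you keep a year later.
two_week_avg = sum(observed_uplift(d) for d in range(14)) / 14
steady_state = observed_uplift(365)

print(f"2-week average uplift: {two_week_avg:.2%}")
print(f"Uplift after a year:   {steady_state:.2%}")
```

Under these (hypothetical) parameters, the 2-week readout overstates the lasting effect several times over.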
Too many A/B tests running
Have you ever heard “we have a great traffic-splitting system that can run hundreds of experiments concurrently”? Well, it works in many cases, until you have a lot of experiments and one of them is not fully randomised. In practice, most traffic splitting is done on a part of the population, with some mutual exclusion. But imagine you have an ML algorithm serving recommendations while you are experimenting on the service selection menu right above those recommendations.
Your two experiments are deemed not to interfere, and therefore their populations overlap. Unfortunately, the ML model experiment performed much better on the most loyal users and about the same on the rest, and somehow your menu experiment performed worse for exactly those loyal users. Most will argue that checking the control group catches this, but automated systems do not take the time.
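A toy simulation of this failure mode (all numbers invented): experiment B has zero true effect, but because loyal users are unevenly split across B's arms while experiment A boosts only loyal users, B's naive readout comes out negative.

```python
import random

random.seed(0)

# Hypothetical setup: experiment A (ML recommender) lifts conversion
# by 20 points, but only for loyal users. Experiment B (menu change)
# has ZERO true effect; a splitting bug puts loyal users less often
# into B's treatment arm, so B's readout is biased downward.
users = []
for _ in range(100_000):
    loyal = random.random() < 0.2
    in_a = random.random() < 0.5                       # arm of experiment A
    in_b = random.random() < (0.35 if loyal else 0.5)  # biased split for B
    p = 0.10 + (0.20 if (loyal and in_a) else 0.0)     # B never moves p
    users.append((in_b, random.random() < p))

def conversion(arm):
    group = [conv for in_b, conv in users if in_b == arm]
    return sum(group) / len(group)

treat, ctrl = conversion(True), conversion(False)
print(f"B treatment: {treat:.4f}  B control: {ctrl:.4f}")
```

B's treatment arm loses to its control purely through the interaction with A, even though B did nothing.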
You did not capture the right metric
In many cases, an A/B test tracks a multitude of metrics, ranging from business to operational to technical. Let’s take the example of a new feature whose main metric is “Revenue”. For the sake of this example, we assume that revenue is just the number of items we sell, at a price of $1 each. Assume two groups of users: those who spend $2 per month (group low) and those who spend $5 per month (group high). Month-over-month retention is 50% for the first group and 80% for the second. Each group has 100 users.
With this experiment, we successfully increased the monthly spend of group high from $5 to $5.50… so our A/B test reported a revenue increase of $50 per month.
At the same time, retention for group low declined to 48%, but you did not monitor this… Retention is a compounding metric: the loss only becomes significant after several months.
Issue → long-term retention effects are difficult to observe within an A/B test.
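The compounding is easy to check with the numbers above: 100 low-group users at $2/month, with retention falling from 50% to 48%. A sketch:

```python
# Toy model using the example's numbers: 100 "low" users at $2/month,
# month-over-month retention dropping from 50% to 48% after the change.
def low_group_revenue(months, retention):
    users, total = 100.0, 0.0
    for _ in range(months):
        total += users * 2        # $2 per active user per month
        users *= retention        # compounding churn
    return total

for m in (1, 3, 12):
    loss = low_group_revenue(m, 0.50) - low_group_revenue(m, 0.48)
    print(f"after {m:2d} months, cumulative revenue lost: ${loss:.2f}")
```

In month one (the experiment window) the loss is exactly $0: retention has not had time to bite. The gap only opens up over the following months, invisible to the 2-week test.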
One size fits all (outcome of A/B testing)
Just a reminder: A/B testing is a winner-takes-all methodology. It does not personalise, and it gives no indication of the resulting behaviour of individual users.
You don’t learn much from A/B testing
I quite often hear: “We do A/B testing to learn about the user.” Really? What did you learn? You learned that version A was better than version B for conversion rate. That is not a key learning: you did not learn much about users individually, and you did not learn why the version you experimented with worked.
A/B testing is not an individual-level causal method. It just tells you that a variant was, on average, better for the population under the current circumstances.
In other words, if you consider individuals X_i and X_j, you do not know whether one has a higher likelihood than the other of an improved metric under the winning variant.
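To see why the average hides the individuals, here is a hypothetical illustration (all effect sizes invented): a treatment that helps 60% of users a little and hurts the other 40% still shows a positive average uplift, which is all an A/B test reports.

```python
import random

random.seed(1)

# Hypothetical individual treatment effects: +4 points for 60% of users,
# -3 points for the remaining 40%. The A/B test only sees the average.
effects = [0.04 if random.random() < 0.6 else -0.03 for _ in range(10_000)]

ate = sum(effects) / len(effects)                 # average treatment effect
harmed = sum(e < 0 for e in effects) / len(effects)

print(f"average uplift reported by the A/B test: {ate:+.3f}")
print(f"share of users the variant actually hurt: {harmed:.0%}")
```

The test declares a winner while harming a large minority; nothing in the readout tells you which X_i were hurt.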
What to do?
- Bandit experiments? Contextual bandit experiments?
- Causal inference? Counterfactuals?
- Sampling algorithms?
- Holdout groups?
- Bayesian Statistics?
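As a teaser for the first item, a minimal epsilon-greedy bandit sketch (conversion rates are hypothetical and exaggerated for illustration): instead of a fixed 50/50 split, traffic gradually shifts toward the better-performing variant.

```python
import random

random.seed(2)

# Epsilon-greedy bandit sketch. True rates are invented for illustration.
true_rates = {"A": 0.10, "B": 0.20}
counts = {"A": 0, "B": 0}
wins = {"A": 0, "B": 0}
EPS = 0.1  # fraction of traffic reserved for exploration

for _ in range(50_000):
    if random.random() < EPS or counts["A"] == 0 or counts["B"] == 0:
        arm = random.choice(["A", "B"])                        # explore
    else:
        arm = max(counts, key=lambda a: wins[a] / counts[a])   # exploit
    counts[arm] += 1
    wins[arm] += random.random() < true_rates[arm]

print(counts)  # most traffic ends up on the better arm
```

The design trade-off: you lose the clean fixed-horizon inference of a classic A/B test, but you stop sending half your users to the losing variant for two weeks.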
[Next: an analysis of these methods]
Full Article: Adrien @ Medium