Multi-armed bandit testing is built on a classic statistical problem. The usual example involves a set of slot machines and a gambler who suspects one machine pays out more, or more often, than the others. With every token, they must decide which slot machine to play in order to maximise winnings from a fixed budget.
This set-up was first applied in the medical-pharmaceutical field, as a way to allocate fixed budgets across research projects that show a high degree of uncertainty at the start, but a clearer outcome (or lack of one) as they progress.
More recently this problem set-up has been applied to A/B and MVT testing, and it is of growing interest to product managers and marketers who want to assess the impact of a given feature or campaign.
Exploitation vs. Exploration
Our gambler doesn’t really know where to start in the early stages. He needs to explore all the different options and at the same time he has to maximise his profits, exploiting the best-performing slot machine. This is known as the exploration vs. exploitation trade-off:
During an exploration phase the gambler tries random levers to investigate which lever delivers the biggest reward, but at some point he also needs to use that knowledge to maximise his take-home money. There are different strategies to solve this problem, and while none of them is perfect, they provide decent results (approximately an 80-85% chance of selecting the optimal slot machine).
Here is an example of the simplest method to solve the problem:
- Epsilon-greedy method: The arm that appears best so far is selected for a proportion of the trials, and another lever is selected at random (with uniform probability) for the remaining, normally smaller, proportion. If we set the exploration proportion to 10%, the system will exploit the best arm 90% of the time and try a random lever 10% of the time.
Simply put: once an arm is established as the best so far, on the next turn there is a 90% chance of exploiting that arm again and a 10% chance of exploring the others. Every time an arm is pulled, its reward estimate is recalculated.
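As an illustration, a minimal epsilon-greedy loop might look like the Python sketch below. The pay-out probabilities, seed and function name are all invented for the example; in a real test each "pull" would be a user seeing a variation and converting (1) or not (0).

```python
import random

def epsilon_greedy(payout_probs, n_pulls, epsilon=0.1, seed=42):
    """Hypothetical epsilon-greedy sketch: explore with probability
    `epsilon`, otherwise exploit the arm with the best mean so far."""
    rng = random.Random(seed)
    n_arms = len(payout_probs)
    counts = [0] * n_arms      # pulls per arm
    values = [0.0] * n_arms    # running mean reward per arm
    for _ in range(n_pulls):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)        # explore: pick any arm (a common variant)
        else:
            arm = values.index(max(values))    # exploit: best arm so far
        # simulate a pull: 1 token won with the arm's (unknown) pay-out rate
        reward = 1 if rng.random() < payout_probs[arm] else 0
        counts[arm] += 1
        # recalculate the arm's mean reward incrementally
        values[arm] += (reward - values[arm]) / counts[arm]
    return counts, values

# three arms with made-up pay-out rates; one arm will typically end up
# with the lion's share of the pulls
counts, values = epsilon_greedy([0.02, 0.08, 0.04], n_pulls=10_000)
```

Note that each arm's mean can be updated incrementally from the previous mean, so nothing needs to be stored beyond one count and one running average per arm.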
Here are some other methods that are a bit more advanced:
- Epsilon-first strategy: A pure exploration phase is followed by a pure exploitation phase. We arbitrarily define how many times we want to pull the lever and how much exploration we want to do at the beginning. Say the gambler has 500 tokens; he decides to randomly try all levers for the first 50 tokens and then use the best lever for the remaining 450 tokens.
- Epsilon-decreasing strategy: This is similar to the epsilon-greedy strategy, but with epsilon decreasing over time. It enables more exploration at the beginning and focuses more on exploitation as the experiment progresses.
- Optimistic initial values: One of the issues with the greedy strategies is that they are biased by their initial estimates. One way to solve this problem, by encouraging initial exploration, is to set all estimates higher than what we actually expect them to be, so that the system, being ‘disappointed’, will explore more.
- Thompson sampling: Google appears to use a variant of this algorithm for its own content experiments. It is based on Bayesian statistics: we start with prior beliefs about how well each variant will perform, and as the experiment runs those beliefs are challenged and updated.
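To make the last idea concrete, here is a minimal Beta-Bernoulli Thompson sampling sketch for yes/no outcomes such as conversions. This is a toy illustration under simple assumptions, not Google's implementation; the conversion rates, seed and names are invented.

```python
import random

def thompson_sampling(payout_probs, n_pulls, seed=7):
    """Toy Thompson sampling: keep a Beta posterior per arm, sample a
    plausible conversion rate from each, and pull the arm whose sample
    is highest."""
    rng = random.Random(seed)
    n_arms = len(payout_probs)
    alphas = [1] * n_arms   # Beta(1, 1) prior: 1 + observed successes
    betas = [1] * n_arms    # 1 + observed failures
    counts = [0] * n_arms
    for _ in range(n_pulls):
        # sample a candidate conversion rate from each arm's posterior
        samples = [rng.betavariate(alphas[i], betas[i]) for i in range(n_arms)]
        arm = samples.index(max(samples))
        reward = 1 if rng.random() < payout_probs[arm] else 0
        alphas[arm] += reward
        betas[arm] += 1 - reward
        counts[arm] += 1
    return counts

# wide priors mean plenty of exploration early on; as evidence accumulates
# the posteriors sharpen and traffic concentrates on the strongest arm
counts = thompson_sampling([0.03, 0.06, 0.04], n_pulls=5_000)
```

The appeal of this design is that exploration falls out of the maths: an arm with little data produces wildly varying samples, so it still gets occasional pulls, while a well-measured loser almost never wins the sampling step.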
What Does It Mean for Product Managers?
Every time we test a new feature, landing page or ad, we take a risk because we see the potential opportunity of a reward. The multi-armed bandit approach lets us converge on the best possible option while sending the minimum amount of traffic to the worst-performing options. That sounds great! But hold on, there are a few other things to consider:
- Moving baseline: One of the assumptions behind all the strategies above is that the baseline doesn't change; in fact, the assumption is that each arm has a fixed pay-out, one of which may be higher than the others. In the real world, however, the metric we are tracking can shift over time due to seasonality, changing user behaviour, randomness etc. In that case, an arm that happens to be winning during a 'good' phase will be biased upward relative to the other arms.
- Confidence for a given variation: Because all these strategies require a more or less arbitrary allocation of traffic, different volumes are allocated to different variations. In our Epsilon-greedy example above, the arm believed to be the winner will tend to get 90% of the traffic and the remaining arms will have to share the remaining 10%, decreasing the level of confidence for those arms.
- The whole set-up is geared toward optimisation, not prediction: The multi-armed bandit set-up aims to maximise the return from a limited resource, not to predict what will happen in the future if you go with one arm rather than another. There are strategies that aim to provide such predictions (like Thompson sampling), but they are complex and still subject to the previous points.
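To put rough numbers on the confidence point, the back-of-the-envelope sketch below (all traffic figures invented) shows how unevenly epsilon-greedy spreads samples, and what that does to the standard error of each arm's estimated conversion rate.

```python
import math

n_pulls = 10_000   # made-up total traffic
epsilon = 0.1      # exploration proportion from the example above
n_arms = 4

# the presumed winner takes the whole exploitation share; the other
# arms split the exploration share between them
greedy_pulls = (1 - epsilon) * n_pulls              # 9,000 pulls
other_pulls = epsilon * n_pulls / (n_arms - 1)      # ~333 pulls each

def stderr(p, n):
    # standard error of an observed conversion rate p over n samples
    return math.sqrt(p * (1 - p) / n)

# at a 5% conversion rate the winner's estimate is far tighter than the
# losers' (roughly 0.23% vs 1.19% standard error here)
se_winner = stderr(0.05, greedy_pulls)
se_loser = stderr(0.05, other_pulls)
```

With over five times the standard error, the losing arms need far more data before we can say anything confident about them — exactly the trade-off that fixed-allocation test set-ups avoid.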
I’d like to make clear that I am not a data scientist and as far as I know this topic is still the subject of research, but these are my main learnings from hundreds of tests using the multi-armed bandit approach and more traditional set-ups:
- Multi-armed bandit is best used when running a campaign with limited resources, such as time or money. That’s why it makes a lot of sense in AdWords and other bid management tools.
- There are a lot of different implementations of the multi-armed bandit set-up. If you are thinking about adopting a platform that uses this set-up, ask for details of the implementation.
- Understanding the right balance between exploration and exploitation for your purposes is very important.
- Classic test set-ups with fixed allocations provide more reliable results as they are not biased by moving baselines (as well as providing easier ways to predict test duration). However, there could be an opportunity cost of not serving the best variation for the duration of the test.
I hope to have cleared up some of the doubts around this test set-up. If you have any questions or would like to share your experiences please leave a comment!
- Chapters 1 and 2 of Reinforcement Learning
- TechTalk Tutorial: Introduction to Bandits: Algorithms and Theory (and Part 2)
- On the Thompson sampling method: A Bayesian Framework for Reinforcement Learning