Multi-armed bandit testing is built on a classic statistical problem. The usual example involves a set of slot machines and a gambler who suspects one machine pays out more, or more often, than the others. With every token, they must decide which slot machine to play in order to maximise winnings from a fixed budget.
This set-up was first applied in the medical-pharmaceutical field, as a way to allocate fixed budgets across research projects that show a high degree of uncertainty at the start, but a clearer outcome (or lack of one) as they progress.
More recently this problem set-up has been applied to A/B and MVT testing, and it is of growing interest to product managers and marketers who want to assess the impact of a given feature or campaign.
Exploitation vs. Exploration
Our gambler doesn’t really know where to start in the early stages. He needs to explore all the different options and at the same time he has to maximise his profits, exploiting the best-performing slot machine. This is known as the exploration vs. exploitation trade-off:
During an exploration phase the gambler tries random levers to investigate which lever delivers the biggest reward, but at some point he also needs to use that knowledge to maximise his take-home money. There are different strategies to solve this problem, and while none of them is perfect, they provide decent results (approximately an 80-85% chance of selecting the optimal slot machine).
Here is an example of the simplest method to solve the problem:
- Epsilon-greedy method: The arm that appears best so far is selected for a proportion of the trials, and another lever is selected at random (with uniform probability) for the remaining, normally smaller, proportion. If we set the exploration proportion to 10%, the system will exploit the best arm 90% of the time and try a random lever 10% of the time.
Simply put: once an arm is established as the best so far, on the next turn there is a 90% chance of exploiting that arm again and a 10% chance of exploring the others. Every time an arm is pulled, its reward estimate is recalculated.
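As an illustration, a minimal epsilon-greedy loop might look like the Python sketch below. The pay-out probabilities, seed and function name are all invented for the example; in a real test each "pull" would be a user seeing a variation and converting (1) or not (0).

```python
import random

def epsilon_greedy(payout_probs, n_pulls, epsilon=0.1, seed=42):
    """Hypothetical epsilon-greedy sketch: explore with probability
    `epsilon`, otherwise exploit the arm with the best mean so far."""
    rng = random.Random(seed)
    n_arms = len(payout_probs)
    counts = [0] * n_arms      # pulls per arm
    values = [0.0] * n_arms    # running mean reward per arm
    for _ in range(n_pulls):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)        # explore: pick any arm (a common variant)
        else:
            arm = values.index(max(values))    # exploit: best arm so far
        # simulate a pull: 1 token won with the arm's (unknown) pay-out rate
        reward = 1 if rng.random() < payout_probs[arm] else 0
        counts[arm] += 1
        # recalculate the arm's mean reward incrementally
        values[arm] += (reward - values[arm]) / counts[arm]
    return counts, values

# three arms with made-up pay-out rates; one arm will typically end up
# with the lion's share of the pulls
counts, values = epsilon_greedy([0.02, 0.08, 0.04], n_pulls=10_000)
```

Note that each arm's mean can be updated incrementally from the previous mean, so nothing needs to be stored beyond one count and one running average per arm.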
Here are some other methods that are a bit more advanced:
- Epsilon-first strategy: A pure exploration phase is followed by a pure exploitation phase. We arbitrarily define how many times we want to pull the lever and how much exploration we want to do at the beginning. Say the gambler has 500 tokens; he decides to randomly try all levers for the first 50 tokens and then use the best lever for the remaining 450 tokens.
- Epsilon-decreasing strategy: This is similar to the epsilon-greedy strategy, but with epsilon decreasing over time. It enables more exploration at the beginning and focuses more on exploitation as the experiment progresses.
- Optimistic initial values: One of the issues with the greedy strategies is that they are biased by their initial estimates. One way to solve this problem, by encouraging initial exploration, is to set all estimates higher than what we actually expect them to be, so that the system, being ‘disappointed’, will explore more.
- Thompson sampling: Google appears to use a variant of this algorithm for its own content experiments. It is based on Bayesian statistics: we start with prior beliefs about how well each variant will perform, and as the experiment runs those beliefs are challenged and updated.
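To make the last idea concrete, here is a minimal Beta-Bernoulli Thompson sampling sketch for yes/no outcomes such as conversions. This is a toy illustration under simple assumptions, not Google's implementation; the conversion rates, seed and names are invented.

```python
import random

def thompson_sampling(payout_probs, n_pulls, seed=7):
    """Toy Thompson sampling: keep a Beta posterior per arm, sample a
    plausible conversion rate from each, and pull the arm whose sample
    is highest."""
    rng = random.Random(seed)
    n_arms = len(payout_probs)
    alphas = [1] * n_arms   # Beta(1, 1) prior: 1 + observed successes
    betas = [1] * n_arms    # 1 + observed failures
    counts = [0] * n_arms
    for _ in range(n_pulls):
        # sample a candidate conversion rate from each arm's posterior
        samples = [rng.betavariate(alphas[i], betas[i]) for i in range(n_arms)]
        arm = samples.index(max(samples))
        reward = 1 if rng.random() < payout_probs[arm] else 0
        alphas[arm] += reward
        betas[arm] += 1 - reward
        counts[arm] += 1
    return counts

# wide priors mean plenty of exploration early on; as evidence accumulates
# the posteriors sharpen and traffic concentrates on the strongest arm
counts = thompson_sampling([0.03, 0.06, 0.04], n_pulls=5_000)
```

The appeal of this design is that exploration falls out of the maths: an arm with little data produces wildly varying samples, so it still gets occasional pulls, while a well-measured loser almost never wins the sampling step.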
What Does It Mean for Product Managers?
Every time we test a new feature, landing page or ad, we take a risk because we see the potential opportunity of a reward. The multi-armed bandit approach lets us converge on the best possible option while sending the minimum amount of traffic to the worst-performing options. That sounds great! But hold on, there are a few other things to consider:
- Moving baseline: One of the assumptions behind all the strategies above is that the baseline doesn't change; in fact, the assumption is that each arm has a fixed pay-out, one of which may be higher than the others. In the real world, however, the metric we are tracking can shift over time due to seasonality, changing user behaviour, randomness etc. In that case, an arm that happens to be winning during a 'good' phase will be biased upward relative to the other arms.
- Confidence for a given variation: Because all these strategies require a more or less arbitrary allocation of traffic, different volumes are allocated to different variations. In our Epsilon-greedy example above, the arm believed to be the winner will tend to get 90% of the traffic and the remaining arms will have to share the remaining 10%, decreasing the level of confidence for those arms.
- The whole set-up is geared toward optimisation, not prediction: The multi-armed bandit set-up aims to maximise the return from a limited resource, not to predict what will happen in the future if you go with one arm rather than another. There are strategies that aim to provide such predictions (like Thompson sampling), but they are complex and still subject to the previous points.
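To put rough numbers on the confidence point, the back-of-the-envelope sketch below (all traffic figures invented) shows how unevenly epsilon-greedy spreads samples, and what that does to the standard error of each arm's estimated conversion rate.

```python
import math

n_pulls = 10_000   # made-up total traffic
epsilon = 0.1      # exploration proportion from the example above
n_arms = 4

# the presumed winner takes the whole exploitation share; the other
# arms split the exploration share between them
greedy_pulls = (1 - epsilon) * n_pulls              # 9,000 pulls
other_pulls = epsilon * n_pulls / (n_arms - 1)      # ~333 pulls each

def stderr(p, n):
    # standard error of an observed conversion rate p over n samples
    return math.sqrt(p * (1 - p) / n)

# at a 5% conversion rate the winner's estimate is far tighter than the
# losers' (roughly 0.23% vs 1.19% standard error here)
se_winner = stderr(0.05, greedy_pulls)
se_loser = stderr(0.05, other_pulls)
```

With over five times the standard error, the losing arms need far more data before we can say anything confident about them — exactly the trade-off that fixed-allocation test set-ups avoid.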
I’d like to make clear that I am not a data scientist and as far as I know this topic is still the subject of research, but these are my main learnings from hundreds of tests using the multi-armed bandit approach and more traditional set-ups:
- Multi-armed bandit is best used when running a campaign with limited resources, such as time or money. That’s why it makes a lot of sense in AdWords and other bid management tools.
- There are a lot of different implementations of the multi-armed bandit set-up. If you are thinking about adopting a platform that uses this set-up, ask for details of the implementation.
- Understanding the right balance between exploration and exploitation for your purposes is very important.
- Classic test set-ups with fixed allocations provide more reliable results as they are not biased by moving baselines (as well as providing easier ways to predict test duration). However, there could be an opportunity cost of not serving the best variation for the duration of the test.
I hope to have cleared up some of the doubts around this test set-up. If you have any questions or would like to share your experiences please leave a comment!
- Chapters 1 and 2 of Reinforcement Learning
- TechTalk Tutorial: Introduction to Bandits: Algorithms and Theory (and Part 2)
- On the Thompson sampling method: A Bayesian Framework for Reinforcement Learning