This is the first of two posts that deep dive on A/B testing, expanding on a talk I gave at Google Playtime 2016 in London. In this post I share some of the learnings we’ve had after running 60+ A/B tests at Peak, looking at each step of the A/B testing cycle in turn.
When we started out A/B testing 12 months ago at Peak we ran tests by the seat of our pants. We’d dive into a test and work out the details as we went along, thinking we were saving time by doing so. Sometimes we managed to pull it off but this approach had a few big drawbacks:
- We were brainstorming ideas from a limited group of people, often the product team
- We were prioritising ideas without properly evaluating them, perhaps because they came from the loudest voice in the room
- We were loosely defining tests, which meant they took longer to run than they should have done
- We were implementing tests haphazardly, sometimes realising mid-way through that the right things weren’t being tracked
- We were analysing tests incorrectly, leaving lingering doubts over the validity of the results
- We were siloing learnings from tests in one area of the company, or worse, forgetting about them entirely
Fortunately, as we ran more tests we began to identify and correct these problems. We went from treating every A/B test as a one-off to being able to run lots of tests in parallel that had a better chance of succeeding. And this process became one of the most important factors in our growth; a repeatable way for us to experiment with different ideas and improve the product.
So what did we learn?
It all starts with ideas, but where do good ideas come from? Rather than waiting for them to pop up, we’ve found the best ones surface when the whole team is engaged with brainstorming solutions to specific problems, so that’s where we start.
- Use data to identify the biggest problems in your product. If brainstorming is a gun that fires out lots of ideas, then before you pull the trigger you’ll want to make sure that it’s pointing in the right direction. By analysing data you can identify where the biggest problems lie in your product and focus on solving these.
- Talk to your users to understand these problems. Identifying a problem is great, but sometimes in order to really understand it you need to speak to your users. Asking open-ended questions can reveal mental models, motivations, and usability issues that surprise you and force you to rethink your assumptions about a problem.
- Engage the whole team when brainstorming ideas. Once you’ve identified and understood the problem you’ll want a diverse group of minds coming up with potential solutions. Although open idea backlogs are useful, we’ve found cross-functional team meetings every two weeks dedicated to brainstorming produce better results.
- Build a network of peers to swap ideas. Meetups and conferences are a great way to meet peers who are running A/B tests at companies similar to yours, and by catching up every couple of months you can get interesting ideas. Just because a test worked for someone else doesn’t mean it will work for you, but it can give you more confidence in the potential of an idea.
You now have a bunch of ideas, but how do you decide which one to test first? It’s not an exact science but by estimating their impact and cost in a back of the envelope way you can sensibly prioritise them.
- Estimate the probability of the test being successful. You never know in advance which test will succeed, so it’s important to stay open-minded. However, you can bump up the expected probability of success by taking into account factors such as learnings from previous tests, whether this test has worked for a similar company, and feedback from users.
- Estimate the impact of the test in a back of the envelope way. Estimating the potential increase in revenue from monetisation tests is a helpful way of ranking them. Things get more tricky when weighing up the impact of tests that have different sets of KPIs, though. In these cases it’s often your business priorities that help you to decide.
- Estimate the resources required to implement the test. This estimate, usually in days of work, comes from having discussions with the tech and design teams and can be used to offset the upside from your probability and impact estimates.
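The three estimates above combine naturally into a rough expected-value score. As a sketch, here is one way you might rank ideas by expected impact per day of effort; the ideas, probabilities, impact figures, and effort estimates below are entirely hypothetical, and the post only suggests making these estimates in a back-of-the-envelope way.

```python
# Rough expected-value scoring for prioritising A/B test ideas.
# All numbers are made up for illustration.

ideas = [
    # (name, probability of success, estimated impact (revenue/month), effort in days)
    ("Simplify onboarding flow", 0.30,  5000, 10),
    ("New paywall copy",         0.20,  8000,  3),
    ("Localised pricing",        0.15, 12000, 15),
]

def score(idea):
    _, p_success, impact, effort_days = idea
    # Expected impact, discounted by how likely the test is to work,
    # divided by the cost of building it.
    return (p_success * impact) / effort_days

for name, *_ in sorted(ideas, key=score, reverse=True):
    print(name)
```

The point isn’t the precision of the numbers (they’re guesses) but that writing them down forces an explicit comparison, rather than prioritising whichever idea came from the loudest voice in the room.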
Once you’ve decided on an idea you should define the smallest possible test that will validate or invalidate it, or the “Minimum Viable Test” as Brian Balfour calls it.
- Hack it if you can. Don’t worry too much about implementing your tests in a perfect way; the majority of them won’t work and you’ll end up scrapping them. If a test works then you’ll implement it properly afterwards.
- Restrict segment size. Do you really need to know whether a test works in Arabic or on an old iPhone? Often these are details that add little in the way of learnings but increase complexity. Consider therefore restricting by dimensions like language and device type.
- Keep it small, keep it short. Tests that take a long time slow down your velocity by blocking parts of the funnel from further testing. Challenge the scope yourself, and welcome challenges from design and tech, to keep tests small.
- Record test information in one place. It’s easy for people to have different recollections of the details of a test, but by recording them in a shared document you can bring everyone onto the same page. More on this in my next post.
Once you’ve defined a test you then need to implement it. You should ensure there are no critical bugs, that the necessary events are in place, and that the relevant people know the test is happening.
- Don’t forget about events. A common pitfall is to track the wrong events or forget about them altogether and end up having to re-do the test. The best way we’ve found of avoiding this is to (i) write down in advance the KPIs you will judge success on, then work backwards to identify the events you need to calculate these, and (ii) add an events sub-task to every A/B testing task in your project management tool to ensure that it doesn’t get missed.
- Talk through edge cases. Are there any flows you haven’t considered that may invalidate your test? Perhaps from deep links, or a buried button in the settings page? We once had a situation where, if a user killed and re-opened the app, they could bypass a test, which muddied the results.
- Test your tests. Tests can be hard to QA if there are several variants, if the test touches many parts of your product, or if changes in the user experience are supposed to happen at a future date. Penciling in QA helps to avoid embarrassing and time-consuming fixes later on.
- Make everyone aware that the test is happening. Tests can affect many teams, from customer support (who get questions from puzzled users) to CRM (who need to adapt their messaging). Giving everyone a heads up that the test is happening and another reminder once it goes live helps ensure they are ready to manage its effects.
When it comes to analysing the results of a test, you want to ensure the numbers have been run correctly and that the results are easy to understand.
- Make sure your sample size is big enough. The 95% significance level is well known, but significance alone isn’t enough: you also need a sample size large enough to reliably detect the effect you care about. An underpowered test risks missing a real improvement (a false negative), and if you peek at results and stop as soon as they look significant, you risk mistaking natural variability in your data for a win (a false positive). The best way of avoiding this is to calculate the sample size you need for the test upfront, and there are many sample size calculators online to help you do this.
- Sense-check numbers across sources. You’d be surprised how easy it can be to get suspicious numbers, often due to errors in data collection rather than errors in calculation. Examples of good sense-checks are (i) whether the sample size of each variant is in line with what you’d expect given the test targeting, and (ii) whether the number of conversions you see in one data source matches those from others.
- Use charts. Usually if you want to see at a glance which variant performs better on a particular KPI (and by how much) a chart does this better than a table.
- Have clear so-whats and a list of discussion points. What have we learnt from this test? What should we do differently (if anything)? What aspects are puzzling? Results should include a layer of interpretation to help the reader quickly understand the key takeaways and prompt discussion.
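To make the sample size point concrete, here is a minimal sketch of the standard calculation for comparing two conversion rates, using the normal approximation. This is the same maths the online calculators use; the baseline rate and lift in the example are hypothetical, and this is an illustration rather than the exact method used at Peak.

```python
from math import ceil
from statistics import NormalDist  # Python 3.8+ standard library

def sample_size_per_variant(p_base, rel_lift, alpha=0.05, power=0.8):
    """Approximate users needed per variant to detect a relative lift
    in conversion rate, for a two-sided test (normal approximation)."""
    p_test = p_base * (1 + rel_lift)          # rate we hope to detect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # e.g. ~1.96 for 95%
    z_beta = NormalDist().inv_cdf(power)           # e.g. ~0.84 for 80% power
    variance = p_base * (1 - p_base) + p_test * (1 - p_test)
    n = (z_alpha + z_beta) ** 2 * variance / (p_base - p_test) ** 2
    return ceil(n)

# e.g. baseline 5% conversion, aiming to detect a 10% relative lift
print(sample_size_per_variant(0.05, 0.10))
```

Running a calculation like this before the test starts tells you roughly how long the test needs to run, and removes the temptation to call a winner early.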
After a test is complete, make sure that the learnings are shared with the relevant people and next steps are followed through.
- Discuss test results, regardless of whether or not the test worked. It’s tempting to only share the results of tests that worked, since who likes to publicise failures? This is a bad idea though because (i) the team learns from failures as well as successes, and (ii) discussing failures is a good way of coming up with more ideas of things to test.
- If the test worked, implement it properly. If the test was a hack then you’ll want to make sure that you implement the winning variant properly to keep the codebase clean and reap the full benefits of the idea.
- Be careful about simply moving all users to the winning variant after a test finishes. Think about what the experience will be like for users who have been in one variant until now and from tomorrow see the winning one. For big changes in the experience you may want to “bridge” these variants and explain to users what is happening.
- Think about whether you can apply learnings from this test to new tests. Tests on the background colour and icon format of our Google Play Store page have influenced our design inside the app, and tests on CRM copy have influenced how we position ourselves as a brand. This cross-pollination of learnings can only happen if they are not mentally siloed.
I hope this has been a useful set of learnings that you can apply to your own A/B testing. If you have any thoughts then I’d love to hear them in the comments section below. In the next post I’ll cover how using a tracker document can help you manage your tests and better retain learnings from past ones.
Many thanks to Antoine Rapy who created the illustrations for this post.