A/B Testing Is Not What You Think
May 21, 2020
Understanding what an A/B test does is pretty easy. Understanding the process of using A/B testing to systematically maximize user love & financial impact is not.
My goal by the end of this series of posts is to compress my expertise from running hundreds of A/B tests. That experience ranges from running experiments on small websites with users in the thousands, like Leanrr, all the way up to products with tens of millions of users, like Outlook.
You will walk away feeling like a Level 10 A/B Testing Wizard tapping into the compounding effects of learning from every change you make. So hold on to your robe and wizard hat, let’s get started.
What exactly is an A/B test?
The top point of confusion I come across is conflating “the method of A/B testing” and “using A/B testing”. Let’s define the What of both.
The method of A/B testing
When talking about the method of A/B testing, replace "A/B test" with the word "hammer". Let's use the title: A Hammer Is Not What You Think. It becomes pretty clear we're talking about a tool. It exists regardless of whether you know about it. It can do something, but how you wield it is independent of what it is.
Hammer: A heavy tool you use to exert force on an object.
A/B test: A method to PROVE a difference exists between two things.
Using A/B testing
When using A/B testing things escalate.
You can use a hammer to murder someone (very bad) or you can use it to build them a house (very nice).
You can use certain PROVABLE differences to manipulate somebody, including yourself, with a false narrative that you want to believe (very bad). Or you can repeatedly discover the underlying TRUTH (very nice).
Use with caution
Lastly, just like you can have a shitty little hammer that can't get the job done, you can also set A/B tests up incorrectly, and everything you think is PROVEN will actually be totally wrong. Luckily, a ton of smart people have already figured out the theory behind it, so we can avoid the pitfalls by just following their guidance.
Why do YOU care about A/B testing?
If you made it this far you probably care about discovering the TRUTH about the changes you make. (Either that or you are studying for a weird type of trivia.) But how much does it actually matter? How much money might you be losing by not learning everything there is to learn from every change you make?
Let’s use some numbers:
- In the blue universe you discover the TRUTH 50% of the time about every change.
- In a parallel red universe you discover it 90% of the time using A/B tests.
Well, it’s already apparent 90% is better than 50% and I’ll come back to that in a moment. But that’s not what’s important. What’s important are the compounding effects of being even a little bit better. Let’s take a look at how understanding good and bad changes compounds over time.
- You succeed with 2 changes. One performs 2x better, the other performs 10x better.
- Red you will likely see results of both. Blue you might miss one.
- Red you will want to do more 10x type changes.
- Blue you’ll do more changes like the one you happen to see.
- Could be 2x, could be 10x. 50% chance.
- You fail with a change.
- Red you will see the failure and might choose to fix it based on what you learn or you might decide to stop working on it.
- Blue you might miss the change was bad altogether. (Since it’s 50%.) Blue you might spend more time on the change before you realize you’re wasting your time.
So the difference between 50% and 90% has an immediate effect but also compounds over time because of the way it influences future decisions. Now we have one of these things going on:
By learning more with each change and leveraging compounding growth the red you makes the other you blue. (Pun)
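The compounding claim above can be made concrete with a toy simulation. All the numbers here are illustrative assumptions: each change is equally likely to be good or bad, helps or hurts by a flat 2%, and "discovering the truth" means you revert the bad ones. It's a sketch of the argument, not a model of any real product.

```python
import random

def simulate(p_learn, n_changes=100, seed=42):
    """Toy model: each change is good (+2% lift) or bad (-2% lift)
    with equal probability. With probability p_learn you discover
    the truth and revert bad changes; otherwise everything ships
    blindly. Returns the cumulative growth multiplier."""
    rng = random.Random(seed)
    growth = 1.0
    for _ in range(n_changes):
        lift = 0.02 if rng.random() < 0.5 else -0.02
        if rng.random() < p_learn:
            # Truth discovered: keep the change only if it helps.
            if lift > 0:
                growth *= 1 + lift
        else:
            # Truth missed: the change ships, good or bad.
            growth *= 1 + lift
    return growth

blue = simulate(p_learn=0.5)  # blue universe: 50% truth rate
red = simulate(p_learn=0.9)   # red universe: 90% truth rate
print(f"blue: {blue:.2f}x   red: {red:.2f}x")
```

Even this crude model shows the gap widening with every change, because the red you compounds on top of a cleaner base.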
50% was a generous number for non-A/B-testing approaches. Let's take a look at some of the commonly used approaches for deciding whether things work, and you decide what % of "truthiness" you want to assign to each:
- “Feeling It Out”
- It felt better. Things definitely got better.
- Problem: Even IF it was better, how much better was it?
- If something is 20% better and something else is 10% better, only the red you would know and continue focusing on the higher impact changes.
- “Asking For Feedback”
- Your spouse, family, friends, colleagues tell you it is good.
- A couple of customers even email you and tell you it is good.
- This is just like “Feeling It Out” but with more compliments. Nice.
- “Eyeing It” – Look Before, Look After
- On March 12th we added the new survey widget. Sales tanked by 12% the next week.
- Problem: this is too sensitive to the goings-on of the world beyond your change.
- e.g.: A power outage in Brazil might have caused 12% fewer people to buy after your change. It had nothing to do with the survey widget.
- You might end up removing the widget AND never iterating on it, destroying a whole path of improvement.
- The red you on the other hand might see that it is an improvement and continue iterating.
- “Sending Out A Survey”
- You might be more systematic about it and actually ask people to tell you:
- “On a scale of 1 to 5, how much did you like this specific thing we changed?”
- “From 1 to 10, how much would you recommend Windows to somebody?”
- The people who respond to surveys are often a vocal minority and your results have a chance of being biased.
- Also, unless it's a significant difference, it's tough to imagine they'd consciously notice it at all.
So, red universe or blue?
Why do A/B tests work?
A magician never reveals his tricks. But I promised you wizardry: real magic. A/B tests work because of a law of the Universe, and you should high-five it next time you see it. This law is called "The Law of Large Numbers".
Allow me to demonstrate with an example. Imagine we give you a fair coin. If you flip it once, you have a 50/50 chance of it landing heads or tails. Flip it a handful of times and you could go on a streak, heads three times in a row. But it turns out the more you flip it, 100s of times, 1000s of times, the closer you will get to a perfect 50/50 ratio. You and your thumb are agents of chaos.
If we give your friend the coin and they also flip it a "large" number of times, they will also get about 50% heads. If we subtract these two averages from each other, we will see a 0% difference. It's the same coin, so it makes sense: no change. The more times each of you flips the coin, the more certain it becomes that you'll see this phenomenon occur.
Now imagine we actually put a tiny piece of gum on your friend's coin. After 1000s of flips your coin would still be at 50/50, while your friend's coin will skew toward whichever side the gum promotes. Let's say the piece of gum causes the coin to land on that side 70% of the time.
A 70% heads average for the gum coin minus a 50% heads average for the regular coin: that's a 20 percentage point difference for heads. Go heads! Obviously in this case we don't have an incentive to cheer for heads or tails. But if heads represented the % of users who purchase in our business, we'd be on our feet clapping.
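You don't have to take the coin story on faith: it's a few lines to simulate. This sketch flips a fair coin and a "gum" coin (assumed to land on its favored side 70% of the time) and shows the observed rates, and the difference between them, settling down as the number of flips grows.

```python
import random

def flip_rate(p_heads, n_flips, seed=0):
    """Flip a coin with the given heads probability n_flips times
    and return the observed fraction of heads."""
    rng = random.Random(seed)
    return sum(rng.random() < p_heads for _ in range(n_flips)) / n_flips

# Few flips: noisy. Many flips: the averages converge.
for n in (10, 1000, 100_000):
    fair = flip_rate(0.5, n)
    gum = flip_rate(0.7, n, seed=1)  # the gum coin
    print(f"n={n:>7}: fair={fair:.3f}  gum={gum:.3f}  diff={gum - fair:+.3f}")
```

At 10 flips the difference bounces around; by 100,000 flips it sits right on top of the true 20-point gap. That's the Law of Large Numbers doing the work.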
A coin only has 1 variable with 2 values: heads or tails. Your product is just a bunch of many-sided coins. By randomly splitting a bunch of people into A and B and releasing the new experience to one group at the same time, you're observing them flip a bunch of regular coins (current UX) vs gum coins (new UX change).
Due to the nature of randomness, with a large enough sample, a bunch of variables cancel out:
- If 1 person was from Japan in A, 1 person would be from Japan in B.
- If 1 person went on a vacation in bucket A, another person went on vacation in bucket B.
- If 1 person in A has an axe to grind with the VP forcing them to use your app, another person in B will have an axe to grind with their VP too.
That means if you release your change to a perfectly random set of people at the same time and there’s enough of them, all the variables about them will cancel out. The only variable that will remain will be the change you made. That will be the only reason you would see a difference.
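Here's a small simulation of that cancellation. The setup is invented for illustration: each user gets a random background "quirk" (their personal base purchase rate, standing in for country, vacations, and VP grudges), and the treatment adds an assumed flat lift on top. Because bucket assignment is random, the quirks land evenly in both buckets and the measured difference isolates the lift.

```python
import random

def run_ab_test(n_users, treatment_lift, seed=7):
    """Simulate an A/B test where each user has a background trait
    unrelated to our change. Random assignment spreads those traits
    evenly across buckets, so the observed A/B difference reflects
    treatment_lift alone."""
    rng = random.Random(seed)
    a_buys = b_buys = a_n = b_n = 0
    for _ in range(n_users):
        base = rng.choice([0.05, 0.10, 0.20])  # user-specific quirk
        if rng.random() < 0.5:                 # random bucket assignment
            a_n += 1
            a_buys += rng.random() < base
        else:
            b_n += 1
            b_buys += rng.random() < base + treatment_lift
    return a_buys / a_n, b_buys / b_n

a, b = run_ab_test(200_000, treatment_lift=0.03)
print(f"A: {a:.3f}  B: {b:.3f}  measured lift: {b - a:+.3f}")
```

Even though individual users vary wildly, the measured lift comes out close to the 3 points we baked in, because everything else cancels.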
If, like the piece of gum, it causes a shift, that will show up. If it's an incredibly big piece of gum, the difference will show up very quickly. If it's a tiny piece of gum, the difference will take many, many flips before being recognized. (Flips are like user sessions in this case.) It's very hard to predict the "size" of the gum beforehand since, unlike real gum, the impact your change might have is multidimensional. Hence the point of the A/B test.
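How many "flips" does a given gum size need? A common rule of thumb (Lehr's rule: roughly 80% power at 5% significance for a two-bucket comparison) says you need about 16 × variance / lift² users per bucket. The 5% base purchase rate below is an assumed number for illustration.

```python
def samples_per_bucket(base_rate, lift):
    """Lehr's rule of thumb for a two-sample comparison (~80% power,
    5% significance): n ≈ 16 * variance / lift^2 per bucket."""
    variance = base_rate * (1 - base_rate)
    return int(16 * variance / lift ** 2)

# Assumed 5% base purchase rate; vary the size of the "gum".
for lift in (0.10, 0.01, 0.001):
    n = samples_per_bucket(0.05, lift)
    print(f"lift of {lift:>5}: ~{n:,} users per bucket")
```

A huge lift shows up with a few dozen users; a tenth-of-a-point lift needs hundreds of thousands. That's why you can't eyeball small gum.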
Why is this exciting? If you are a designer, engineer, or product manager, you can iterate directly on your skillset. You are getting feedback that you are absolutely certain had to do with your change. Was it good or bad? How good or bad was it? Why was it good or bad? You get to PROVABLY see.
If you are a manager or a director, by developing a comprehensive system about tracking these changes, you can start building a track record of ROI along with gaining a deeper understanding of your customer base. You can develop an institutional memory about what has worked and hasn’t worked to tap into the compounding effects of the red universe.