The term "Multi-Armed Bandit" comes from the field of probability theory and statistics, and it's often used in the context of reinforcement learning and A/B testing.
The name is derived from a hypothetical scenario involving a gambler at a row of slot machines (also known as "one-armed bandits"), who must decide which machines to play, how many times to play each machine, and in which order to play them, in order to maximize their total reward.
In the context of online experiments, a multi-armed bandit algorithm dynamically adjusts the traffic allocation towards different variations based on their performance. This is different from traditional A/B testing where traffic is split evenly and the allocation does not change during the test.
A multi-armed bandit algorithm works by continuously adjusting the traffic towards the best-performing variations until it can confidently pick the best variation. The winning variation will then receive 100% of the traffic.
For example, if a given variant has a 60% probability of being the best, a multi-armed bandit algorithm like Autotune will allocate 60% of the traffic to it. At a high level, the algorithm works by sending more users to a treatment as soon as it recognizes that the treatment is clearly better at maximizing the reward (the target metric).
Throughout the process, higher-performing treatments are allocated more traffic whereas underperforming treatments are allocated less traffic. When the winning treatment beats the second-best treatment by enough margin, the process terminates.
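One common way to implement this probability-matching behavior is Thompson sampling: keep a Beta posterior over each treatment's conversion rate, estimate each treatment's probability of being the best by sampling, and use those probabilities directly as the traffic split. The sketch below is a generic illustration of that idea, not Autotune's actual implementation; the function name, priors, and sample counts are assumptions made for the example.

```python
import numpy as np

def allocate_traffic(successes, failures, n_samples=10_000, seed=0):
    """Estimate each treatment's probability of being best and use it
    as the traffic allocation (probability matching / Thompson sampling).

    successes, failures: per-treatment conversion counts observed so far.
    Returns an array of traffic shares that sums to 1.
    """
    rng = np.random.default_rng(seed)
    successes = np.asarray(successes, dtype=float)
    failures = np.asarray(failures, dtype=float)

    # Draw samples from each treatment's Beta(1 + successes, 1 + failures)
    # posterior over its true conversion rate.
    draws = rng.beta(1 + successes, 1 + failures,
                     size=(n_samples, len(successes)))

    # The fraction of samples in which a treatment has the highest draw
    # approximates its probability of being the best treatment.
    best_counts = np.bincount(draws.argmax(axis=1), minlength=len(successes))
    return best_counts / n_samples
```

Calling `allocate_traffic([60, 20, 10], [140, 80, 90])`, for instance, returns an estimate of each variant's probability of being best, which becomes its share of incoming traffic for the next batch of users.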
Let's say you're running an online store and you want to test three different designs for your checkout page to see which one leads to the highest conversion rate.
Instead of splitting your traffic evenly between the three designs (as you would in a traditional A/B test), you could use a multi-armed bandit algorithm to dynamically adjust the traffic allocation based on the performance of each design.
If design A is performing well and has a 70% chance of being the best, the algorithm will allocate 70% of your traffic to design A. If design B has a 20% chance of being the best, it will receive 20% of the traffic, and design C, with a 10% chance of being the best, will receive the remaining 10% of the traffic.
This way, you're maximizing your conversions while the test is still running, instead of waiting until the end of the test to make changes based on the results.
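To make the checkout example concrete, here is a small simulation with made-up conversion rates for the three designs, showing the allocation drifting toward the stronger design as evidence accumulates. The traffic volumes, conversion rates, and batch sizes are all hypothetical, and the allocation step is the same probability-matching sketch as above rather than any particular product's implementation.

```python
import numpy as np

rng = np.random.default_rng(42)
true_rates = np.array([0.060, 0.050, 0.045])  # hypothetical conversion rates for A, B, C
successes = np.zeros(3)
failures = np.zeros(3)
allocation = np.full(3, 1 / 3)                # start with an even split

for day in range(1, 15):
    visitors = rng.multinomial(1_000, allocation)      # split today's traffic
    conversions = rng.binomial(visitors, true_rates)   # observe conversions
    successes += conversions
    failures += visitors - conversions

    # Re-estimate each design's probability of being best (probability matching).
    draws = rng.beta(1 + successes, 1 + failures, size=(10_000, 3))
    allocation = np.bincount(draws.argmax(axis=1), minlength=3) / 10_000

    print(f"day {day:2d}: allocation = {np.round(allocation, 2)}")
```

Over the simulated days, design A's share of traffic grows as its lead becomes clearer, while the weaker designs receive progressively less exposure.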
Multi-armed bandit algorithms are particularly useful when you want to minimize the opportunity cost of showing sub-optimal variations to your users.
They're also useful when you want to run continuous experiments without a fixed end date, or when the performance of variations changes over time. However, they're not as useful when you want to measure the impact of a variation on multiple metrics, as they can only optimize for a single metric.