The Multi-Armed Bandit Problem

Posted on 21st Aug, 2014

Determining the right title for a news article, the layout for a web page, or the thumbnail for a photo gallery used to be guesswork. On the internet it is easy to log user behavior, and with this information it is possible to find the best version out of a selection of variations. Someone trying to determine the best version has two goals: 1. After the test is completed, the best version should be correctly identified. 2. We do not want to spend too many of our users on the inferior versions, since those views are lost income for the website.

A good starting point among academic papers seems to be "An Empirical Evaluation of Thompson Sampling" by Olivier Chapelle and Lihong Li. The authors are affiliated with Yahoo Research; curiously, the paper is also hosted on Microsoft Research servers.

The setup of this experiment

In this experiment I will show the performance of the epsilon-greedy algorithm, with and without a decreasing epsilon, and of Thompson sampling. The baseline solution to this problem is to present one of the versions to each user at random, and see which version gets the most clicks/sign-ups/the best conversion.

The epsilon-greedy algorithm tries to make a trade-off between exploration (finding the best version) and exploitation (not spending too many users on the wrong versions). The algorithm takes one parameter, epsilon, which is the fraction of users that should be spent on exploration. An epsilon of 1 makes this approach identical to the baseline: presenting versions randomly. A variation is to decrease epsilon slowly, so that more users see the best version as we grow more certain that it actually is the best one.
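To make this concrete, here is a minimal Python sketch of the selection step, assuming per-version counters successes and views (illustrative names of mine, not the code used for this experiment):

    import random

    def choose_arm_epsilon_greedy(successes, views, epsilon):
        """Explore a random version with probability epsilon, otherwise
        exploit the version with the highest observed conversion so far."""
        if random.random() < epsilon:
            return random.randrange(len(views))                        # explore
        rates = [s / v if v > 0 else 0.0 for s, v in zip(successes, views)]
        return rates.index(max(rates))                                  # exploit

    # Decreasing-epsilon variant: after every view, shrink epsilon with
    # epsilon *= decay (e.g. decay = 0.99) so exploration fades out over time.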

Thompson sampling tries to make the same trade-off, but without a user-definable parameter. It fits a Beta distribution to the conversion of each version; for each view it draws a random value from each distribution and presents the version with the highest draw. This has the advantage over epsilon-greedy that after a while it only explores among the best few versions, instead of all of them.
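A sketch of the same selection step for Thompson sampling, again with illustrative names and a uniform Beta(1, 1) prior as an assumption on my part:

    import numpy as np

    def choose_arm_thompson(successes, views):
        """Draw one sample from each version's Beta posterior and present
        the version with the highest draw."""
        successes = np.asarray(successes, dtype=float)
        failures = np.asarray(views, dtype=float) - successes
        draws = np.random.beta(1 + successes, 1 + failures)
        return int(np.argmax(draws))

Because a version with few views has a wide posterior, it still gets an occasional draw that beats the others, which is where the exploration comes from.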

In this experiment we have a bandit with six arms, with conversion rates of 30%, 33%, 25%, 0%, 5%, and 10%. All approaches are tested for 10, 20, 50, 100, ... up to 100,000 views, and each value is tested 1,000 times. Epsilon is set to 0.2, which means the epsilon-greedy algorithm explores on 20% of the views and exploits on the remaining 80%. The decay for the decreasing-epsilon variant is set to 0.99, which means that after every view the probability of exploration decreases by 1%.
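A single simulated run could then be wired together roughly as follows; this is a simplified reconstruction using the sketches above, not the actual notebook code behind the figures:

    import numpy as np

    # True conversion rates of the six arms in this experiment.
    true_rates = np.array([0.30, 0.33, 0.25, 0.00, 0.05, 0.10])

    def simulate_run(choose_arm, n_views):
        """Simulate one run of n_views page views with a given selection rule
        and return the average conversion over that run."""
        successes = np.zeros(len(true_rates))
        views = np.zeros(len(true_rates))
        for _ in range(n_views):
            arm = choose_arm(successes, views)
            views[arm] += 1
            if np.random.rand() < true_rates[arm]:   # simulated click/sign-up
                successes[arm] += 1
        return successes.sum() / n_views

    # e.g. one 1,000-view run per approach, repeated 1,000 times per run length:
    # simulate_run(lambda s, v: choose_arm_epsilon_greedy(s, v, epsilon=0.2), 1000)
    # simulate_run(choose_arm_thompson, 1000)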

Experimental results

In the two figures below we see the frequentist probability that an algorithm chooses the correct version after the run (top) and the average conversion over runs of a certain length (bottom).

The probability of choosing the correct version for the three algorithms, for different numbers of page views

The average number of hits for the three algorithms, for different numbers of page views

In the top figure we can see that there is not much difference in the probability of choosing the correct (highest-converting) version between showing users a random version and using the epsilon-greedy algorithm. Thompson sampling performs a little worse initially, but after enough views to reach a high probability of correctness the difference is gone. Epsilon-greedy with a decreasing epsilon seems to hit a glass ceiling; we will look into that later.

In the bottom figure we can see that there is a big difference in average conversion during the run between the different approaches. The "show random version" approach performs worst, because three of the versions in our test have a very poor conversion and drag the average down; this is a major drawback of the approach. The epsilon-greedy approach does better, but hits a ceiling because it always loses 20% of its views to exploration. With a decreasing epsilon the average conversion for a high number of views is quite good, because exploration is close to zero after a few hundred views. Thompson sampling has the highest conversion if it is used for enough views.
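A back-of-the-envelope calculation shows roughly where the plain epsilon-greedy ceiling sits, assuming exploration picks uniformly among all six versions and exploitation always hits the best one:

    rates = [0.30, 0.33, 0.25, 0.00, 0.05, 0.10]
    ceiling = 0.8 * max(rates) + 0.2 * sum(rates) / len(rates)
    print(round(ceiling, 3))   # ~0.298, noticeably below the best arm's 0.33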

So let's see what happens with decreasing-epsilon-greedy. Below are two scatter plots of the conversion for each run: one for decreasing-epsilon-greedy (top) and one for Thompson sampling (bottom). In both, the average conversion is shown by a superimposed dotted line.

Conversion with the decreasing-epsilon-greedy algorithm, for different numbers of page views

Conversion with Thompson sampling, for different numbers of page views

The reason I chose scatter plots with high transparency instead of box plots is that non-normally distributed values are easier to spot. The plot for Thompson sampling is what I expected to find for all approaches: less variance and a higher conversion as the number of views grows. The decreasing-epsilon-greedy plot shows why it is not performing the best it can: it can get stuck in sub-optimal solutions, such as the versions with a conversion of 30% or 25% instead of the top performer at 33%. While it is possible to use different parameters, in the real world these values also have to be guessed, because the true conversion rates are not known in advance.

I used the IPython Notebook with NumPy and Matplotlib. You can download the code for this experiment here.

Future work:

  • When is it time to stop the experiment and choose the best version? This is not a trivial question, and you can read here why testing for statistical significance is no holy grail.
  • What if you're not interested in clicks or sign-ups (binary events), but in order size or time spent (continuous values)? I have no answer to this right now, but I plan to find one.
  • What if conversion varies by time of day, or by day of the week? Will this influence our results? It probably will. More on that later.