Explore-then-commit algorithm

From Wikipedia, the free encyclopedia

Explore Then Commit (ETC) is an algorithm for the multi-armed bandit problem focused on finding the best trade-off between exploration and exploitation.

Regret minimization

The multi-armed bandit problem is a sequential game where one player has to choose at each turn between actions (arms). Behind every arm there is an unknown distribution that lies in a set known by the player (for example, can be the set of Gaussian distributions or Bernoulli distributions).

At each turn the player chooses (pulls) an arm , they then get an observation of the distribution .

The goal is to minimize the regret at time that is defined as

where

  • is the mean of arm
  • is the highest mean
  • is the number of pulls of arm up to turn

The player has to find an algorithm that chooses at each turn which arm to pull based on the previous actions and observations to minimize the regret .

This is a trade-off problem between exploration (finding the arm with the highest mean) and exploitation (playing the arm which is perceived to be the best as much as possible).[1]

Algorithm

Two runs of ETC with the same M = 10. On the first run it does manage to find the best arm after the exploration while it does not on the second run

The idea of the algorithm is to explore each arm times. Then for the rest of the game the algorithm exploits its discoveries by playing the arm with the highest mean. If the horizon is known, then the number of explorations can depend on .

Adaptations of the algorithm exist[2] and can be found in the literature for other settings.[3]

Pseudocode

The player chooses M
 for each arm i do:
    select arm i M times
    update empirical mean mu[i]
for t from MK+1 to T do:
    select arm a with highest empirical mean mu[a]

Theoretical results

See also

References

Related Articles

Wikiwand AI