Lai–Robbins lower bound
Lower bound for bandit problem
From Wikipedia, the free encyclopedia
The Lai–Robbins lower bound[1] gives an asymptotic lower bound on the regret that any uniformly good algorithm must incur in the stochastic multi-armed bandit problem. The original result was proved by Tze Leung Lai and Herbert Robbins in 1985 for parametric exponential families. Later work extended the statement to more general classes of distributions.[2]
Multi-armed bandit problem
The multi-armed bandit problem (MAB) is a sequential game in which the player must trade off exploration (to learn) and exploitation (to earn).
The player chooses among $K$ actions (arms) with unknown reward distributions $\nu_1, \dots, \nu_K$. The player is assumed to know a class of distributions $\mathcal{D}$ such that $\nu_i \in \mathcal{D}$ for every $i$ (for example, $\mathcal{D}$ may be the family of Gaussian or Bernoulli distributions).
At each round $t = 1, 2, \dots$ the player selects (pulls) an arm $a_t \in \{1, \dots, K\}$ and observes a reward $X_t \sim \nu_{a_t}$.
We denote
- $N_i(t)$ the number of times arm $i$ has been pulled in the first $t$ rounds,
- $\mu = (\mu_1, \dots, \mu_K)$ the vector of arm means, where $\mu_i = \mathbb{E}_{X \sim \nu_i}[X]$,
- $\mu^* = \max_i \mu_i$ the highest mean,
- $\Delta_i = \mu^* - \mu_i$ the gap of arm $i$.
An arm $i$ with $\mu_i = \mu^*$ is called an optimal arm; otherwise it is a suboptimal arm.
The goal is to minimize the regret at horizon $T$, defined by
$$R_T = T \mu^* - \mathbb{E}\left[\sum_{t=1}^{T} X_t\right].$$
Intuitively, the regret is the (expected) total loss compared to always playing an optimal arm:
$$R_T = \sum_{i : \Delta_i > 0} \Delta_i \, \mathbb{E}[N_i(T)].$$
An MAB algorithm is a (possibly randomized) policy that, at each round $t$, chooses an arm $a_t$ using the observations received in previous rounds.[3]
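The setup above can be sketched as a short simulation. The following is a minimal illustration (the function and policy names are hypothetical, not from the cited literature): a Bernoulli bandit environment, a policy interface matching the definition above, and the pseudo-regret $\sum_t (\mu^* - \mu_{a_t})$ accumulated along the way.

```python
import random

def run_bandit(means, policy, horizon, seed=0):
    """Simulate a Bernoulli bandit and return (pseudo-regret, pull counts).

    means  : true arm means (unknown to the policy)
    policy : callable (t, counts, sums) -> arm index
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k          # N_i(t): number of pulls of each arm
    sums = [0.0] * k          # total observed reward per arm
    best = max(means)         # mu^*
    regret = 0.0
    for t in range(1, horizon + 1):
        arm = policy(t, counts, sums)
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]   # gap of the pulled arm
    return regret, counts

# A naive round-robin policy pulls every arm equally often and therefore
# incurs regret linear in the horizon, illustrating why adaptivity is needed.
def round_robin(t, counts, sums):
    return (t - 1) % len(counts)

regret, counts = run_bandit([0.5, 0.6], round_robin, horizon=1000)
```

With two arms of means 0.5 and 0.6, round-robin pulls the suboptimal arm 500 times out of 1000 and accumulates regret $500 \cdot 0.1 = 50$.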
Intuitive example
Suppose a farmer must choose, each year, one of $K$ seed varieties to plant. Each variety $i$ has an unknown average yield $\mu_i$. If the farmer knew the best variety (with mean $\mu^*$) he would plant it every year; in reality he must try varieties to learn which is best. The cumulative regret after $T$ years measures the total expected loss in yield due to imperfect knowledge.
Remarks
- The model above is the stochastic MAB; there also exist adversarial variants.[3]
- One may consider a fixed-horizon setting (known $T$) or an anytime setting (unknown $T$).
Lai–Robbins lower bound
The theorem quantifies how long a suboptimal arm $i$ must be pulled in order to distinguish the instance in which arm $i$ has distribution $\nu_i$ from an alternative instance in which it has a distribution $\nu'_i \in \mathcal{D}$ with mean $\mu(\nu'_i) > \mu^*$.
Knowing a lower bound on the number of pulls of every suboptimal arm yields a lower bound on the regret, since only suboptimal arms contribute to the regret.
Before stating the formal theorem we need to define what a consistent algorithm is.
Consistency (uniformly good algorithms)
Let $\mathcal{D}$ be a class of probability distributions and consider $K$ arms with reward distributions $\nu_1, \dots, \nu_K \in \mathcal{D}$. An algorithm is said to be consistent (also called uniformly good) on $\mathcal{D}$ if, for every instance $(\nu_1, \dots, \nu_K)$ and every $\alpha > 0$, the expected regret grows subpolynomially:
$$R_T = o(T^{\alpha}).$$
This assumption excludes algorithms that perform well on some instances but incur linear regret on others.
Formal lower bound
Fix any suboptimal arm $i$. For a distribution $\nu \in \mathcal{D}$ and a threshold $\mu^* \in \mathbb{R}$, define
$$\mathcal{K}_{\inf}(\nu, \mu^*) = \inf \left\{ D_{\mathrm{KL}}(\nu \,\|\, \nu') : \nu' \in \mathcal{D},\ \mu(\nu') > \mu^* \right\},$$
where $D_{\mathrm{KL}}$ denotes the Kullback–Leibler divergence.
Then, for any algorithm consistent on $\mathcal{D}$ and for every instance $(\nu_1, \dots, \nu_K)$, every suboptimal arm $i$ satisfies
$$\liminf_{T \to \infty} \frac{\mathbb{E}[N_i(T)]}{\ln T} \geq \frac{1}{\mathcal{K}_{\inf}(\nu_i, \mu^*)}.$$
Consequently, the regret satisfies
$$\liminf_{T \to \infty} \frac{R_T}{\ln T} \geq \sum_{i : \Delta_i > 0} \frac{\Delta_i}{\mathcal{K}_{\inf}(\nu_i, \mu^*)}.$$
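For the class of Bernoulli distributions, $\mathcal{K}_{\inf}(\mathrm{Ber}(\mu_i), \mu^*)$ reduces to the binary Kullback–Leibler divergence $\mathrm{kl}(\mu_i, \mu^*)$, so the asymptotic regret constant can be computed in closed form. The sketch below (function names are illustrative, not standard) evaluates that constant for a two-armed Bernoulli instance.

```python
import math

def bernoulli_kl(p, q):
    """Binary KL divergence kl(p, q) between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12  # clamp to avoid log(0)
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def lai_robbins_constant(means):
    """Asymptotic constant sum_i Delta_i / K_inf(nu_i, mu^*) for Bernoulli
    arms, where K_inf(Ber(mu_i), mu^*) = kl(mu_i, mu^*)."""
    mu_star = max(means)
    return sum((mu_star - mu) / bernoulli_kl(mu, mu_star)
               for mu in means if mu < mu_star)

# Any consistent algorithm on this instance must eventually incur at least
# about c * ln(T) regret.
c = lai_robbins_constant([0.3, 0.5])
```

For means $(0.3, 0.5)$ the constant is $0.2 / \mathrm{kl}(0.3, 0.5) \approx 2.43$.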
The original 1985 paper[1] established this result for exponential families; later work showed that the bound holds under much weaker assumptions on .
Intuition
Consistency imposes that, in every instance, an optimal arm is pulled many times; this means that $\mu^*$ is estimated very accurately. The question is then, for a suboptimal arm $i$, how many samples of that arm are needed to be confident, at the appropriate level, that $\mu_i < \mu^*$. To answer it, one considers the most confusing instance: an instance close to the original in which arm $i$ is optimal. It is obtained by keeping $\nu_j$ unchanged for all $j \neq i$ and replacing $\nu_i$ by a distribution $\nu'_i \in \mathcal{D}$ chosen so that $\mu(\nu'_i) > \mu^*$. The number of samples of arm $i$ required to distinguish the instance with $\nu_i$ from the instance with $\nu'_i$ is governed by the Kullback–Leibler divergence between them, and minimizing this divergence over admissible $\nu'_i$ yields the constant $\mathcal{K}_{\inf}(\nu_i, \mu^*)$.
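The construction is easiest to see for Gaussian arms with known variance $\sigma^2$, where $D_{\mathrm{KL}}(\mathcal{N}(\mu_1, \sigma^2) \,\|\, \mathcal{N}(\mu_2, \sigma^2)) = (\mu_1 - \mu_2)^2 / (2\sigma^2)$: the most confusing instance shifts arm $i$'s mean just above $\mu^*$, so $\mathcal{K}_{\inf}(\nu_i, \mu^*) = \Delta_i^2 / (2\sigma^2)$ and the bound reads $\mathbb{E}[N_i(T)] \gtrsim 2\sigma^2 \ln T / \Delta_i^2$. A small numeric sketch (the helper name is illustrative):

```python
import math

def min_pulls_gaussian(mu_i, mu_star, sigma, horizon):
    """Lai-Robbins asymptotic minimum number of pulls of a suboptimal
    Gaussian arm with known variance sigma^2: ln(T) / K_inf, where
    K_inf = (mu_star - mu_i)^2 / (2 sigma^2)."""
    gap = mu_star - mu_i
    k_inf = gap ** 2 / (2 * sigma ** 2)
    return math.log(horizon) / k_inf

# An arm one standard deviation below the best must be pulled roughly
# 2 * ln(10000) / 1 ~ 18 times by horizon T = 10000.
n = min_pulls_gaussian(mu_i=0.0, mu_star=1.0, sigma=1.0, horizon=10_000)
```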
Algorithms achieving the Lai–Robbins lower bound
Several algorithms are known to achieve the Lai–Robbins asymptotic lower bound under specific assumptions on the reward distribution class $\mathcal{D}$. The following table gives a non-exhaustive list of algorithms matching the lower bound.
| Distribution class | Algorithms |
|---|---|
| Gaussian rewards (known variance) | KL-UCB,[4] TS,[5] AdaUCB,[6] RB-SDA[7] |
| Gaussian rewards (unknown variance) | CHK[8] |
| One-dimensional exponential families | KL-UCB[4] |
| Bounded rewards | KL-UCB,[4] IMED,[9] Fast-IMED,[10] DMED,[11] NPTS,[12] KL-UCB-switch[13] |
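As an illustration of how such algorithms use the quantity appearing in the lower bound, here is a minimal sketch of the KL-UCB index for Bernoulli rewards (a simplified version of the algorithm in [4]; the exploration level $\ln t / N_i(t)$ omits the lower-order terms used in the paper's analysis). The index of arm $i$ is the largest mean $q$ whose distribution is still statistically compatible with the observed samples, found by bisection on the convex map $q \mapsto \mathrm{kl}(\hat\mu_i, q)$.

```python
import math

def bernoulli_kl(p, q):
    """Binary KL divergence kl(p, q)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def klucb_index(mean, pulls, t):
    """Largest q >= mean with pulls * kl(mean, q) <= ln(t), by bisection."""
    level = math.log(t) / pulls
    lo, hi = mean, 1.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if bernoulli_kl(mean, mid) <= level:
            lo = mid
        else:
            hi = mid
    return lo

def klucb_policy(t, counts, sums):
    """Pull each arm once, then pull the arm with the highest KL-UCB index."""
    for i, n in enumerate(counts):
        if n == 0:
            return i
    return max(range(len(counts)),
               key=lambda i: klucb_index(sums[i] / counts[i], counts[i], t))
```

The index shrinks toward the empirical mean as an arm is pulled more, and grows slowly (logarithmically in $t$) otherwise, which is what drives the $\ln T / \mathcal{K}_{\inf}$ exploration rate of the lower bound.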
Extension to other problems
Structured bandit
A more complex setting is the structured bandit, where the vector of arm means is known to lie in a restricted set. In this case one can prove a smaller lower bound that exploits the knowledge of this set.[14][15]
Best arm identification (BAI)
A similar result has been proved for best arm identification, which is the same game except that, instead of minimizing the regret, the goal is to identify the best arm with probability at least $1 - \delta$ using as few rounds as possible.[16]
Reinforcement Learning (RL)
Similar results have been proved for regret minimization in average-reward reinforcement learning. The order is also $\ln T$, with a constant that depends on the problem.[17]