Lai–Robbins lower bound


The Lai–Robbins lower bound[1] gives an asymptotic lower bound on the regret that any uniformly good algorithm must incur in the stochastic multi-armed bandit problem. The original result was proved by Tze Leung Lai and Herbert Robbins in 1985 for parametric exponential families. Later work extended the statement to more general classes of distributions.[2]

Multi-armed bandit problem

The multi-armed bandit problem (MAB) is a sequential game in which the player must trade off exploration (to learn) and exploitation (to earn).

The player chooses among $K$ actions (arms) with unknown reward distributions $\nu_1, \dots, \nu_K$. The player is assumed to know a class of distributions $\mathcal{D}$ such that for every arm $a$ one has $\nu_a \in \mathcal{D}$ (for example, $\mathcal{D}$ may be the family of Gaussian or Bernoulli distributions).

At each round $t = 1, 2, \dots$ the player selects (pulls) an arm $a_t \in \{1, \dots, K\}$ and observes a reward $X_t$ drawn from $\nu_{a_t}$.
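This interaction loop can be sketched as a minimal simulation, assuming hypothetical Bernoulli arms (the means and the seed below are purely illustrative):

```python
import random

# Minimal stochastic bandit environment with K Bernoulli arms.
class BernoulliBandit:
    def __init__(self, means, seed=0):
        self.means = means            # unknown to the player in the real game
        self.rng = random.Random(seed)

    def pull(self, arm):
        # Reward X_t is drawn from the distribution of the chosen arm.
        return 1.0 if self.rng.random() < self.means[arm] else 0.0

bandit = BernoulliBandit([0.3, 0.5, 0.7])
rewards = [bandit.pull(arm=2) for _ in range(5)]  # pull arm 2 five times
```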

We denote

  • $N_a(T)$, the number of times arm $a$ has been pulled in the first $T$ rounds,
  • $\mu = (\mu_1, \dots, \mu_K)$, the vector of arm means, where $\mu_a = \mathbb{E}_{X \sim \nu_a}[X]$,
  • $\mu^* = \max_a \mu_a$, the highest mean,
  • $\Delta_a = \mu^* - \mu_a$, the gap of arm $a$.

An arm $a$ with $\mu_a = \mu^*$ is called an optimal arm; otherwise (i.e., when $\Delta_a > 0$) it is a suboptimal arm.

The goal is to minimize the regret at horizon $T$, defined by

$R(T) = T\mu^* - \mathbb{E}\!\left[\sum_{t=1}^{T} X_t\right].$

Intuitively, the regret is the (expected) total loss compared to always playing an optimal arm; it decomposes over the suboptimal arms as

$R(T) = \sum_{a : \Delta_a > 0} \Delta_a \, \mathbb{E}[N_a(T)].$
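The gap-based decomposition of the regret can be sketched numerically (the arm means and pull counts below are illustrative):

```python
def regret(mu, pulls):
    """Regret decomposition: sum over arms of gap * (expected) number of pulls."""
    mu_star = max(mu)
    return sum((mu_star - m) * n for m, n in zip(mu, pulls))

# Three arms with means 0.5, 0.6, 0.4; a policy pulled them 10, 80, 10 times.
# Only the suboptimal arms (gaps 0.1 and 0.2) contribute to the regret.
print(regret([0.5, 0.6, 0.4], [10, 80, 10]))
```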

An MAB algorithm is a (possibly randomized) policy that, at each round $t$, chooses an arm $a_t$ using the observations received in previous rounds.[3]

Intuitive example

Suppose a farmer must choose, each year, one of $K$ seed varieties to plant. Each variety $a$ has an unknown average yield $\mu_a$. If the farmer knew the best variety (with mean $\mu^*$) he would plant it every year; in reality he must try varieties to learn which is best. The cumulative regret after $T$ years measures the total expected loss in yield due to imperfect knowledge.

Remarks

  1. The model above is the stochastic MAB; there also exist adversarial variants.[3]
  2. One may consider a fixed-horizon setting (known $T$) or an anytime setting (unknown $T$).

Lai–Robbins lower bound

The theorem quantifies how many times any good algorithm must pull a suboptimal arm $a$ in order to distinguish whether it is in the instance with $\nu_a$ or in one with an alternative $\nu'_a$, where $\nu'_a$ is such that $\mu(\nu'_a) > \mu^*$.

Knowing a lower bound on the number of pulls of every suboptimal arm gives a lower bound on the regret, since only suboptimal arms contribute to the regret.

Before stating the formal theorem we need to define what a consistent algorithm is.

Consistency (uniformly good algorithms)

Let $\mathcal{D}$ be a class of probability distributions and consider $K$ arms with reward distributions $\nu_1, \dots, \nu_K \in \mathcal{D}$. An algorithm is said to be consistent (also called uniformly good) on $\mathcal{D}$ if, for every instance $(\nu_1, \dots, \nu_K) \in \mathcal{D}^K$, the expected regret grows subpolynomially:

$R(T) = o(T^{\alpha}) \quad \text{for every } \alpha > 0.$
This assumption excludes algorithms that perform well on some instances but incur linear regret on others.

Formal lower bound

Fix a suboptimal arm $a$. For a distribution $\nu \in \mathcal{D}$ and a threshold $u \in \mathbb{R}$, define

$\mathcal{K}_{\inf}(\nu, u) = \inf \left\{ \mathrm{KL}(\nu, \nu') : \nu' \in \mathcal{D},\ \mu(\nu') > u \right\},$

where $\mathrm{KL}$ denotes the Kullback–Leibler divergence.
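For the family of Bernoulli distributions, $\mathcal{K}_{\inf}(\mathrm{Ber}(p), u)$ reduces to the binary Kullback–Leibler divergence $\mathrm{kl}(p, u)$ between the arm mean and the threshold. A minimal sketch of this quantity:

```python
import math

def kl_bernoulli(p, q):
    """Binary KL divergence kl(p, q) between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12  # clamp to avoid log(0) at the boundary
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# kl(p, p) = 0, and the divergence grows as the two means move apart.
print(kl_bernoulli(0.4, 0.7))
```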

Then, for any algorithm consistent on $\mathcal{D}$ and for every instance $(\nu_1, \dots, \nu_K) \in \mathcal{D}^K$, every suboptimal arm $a$ satisfies

$\liminf_{T \to \infty} \frac{\mathbb{E}[N_a(T)]}{\ln T} \geq \frac{1}{\mathcal{K}_{\inf}(\nu_a, \mu^*)}.$

Consequently, the regret satisfies

$\liminf_{T \to \infty} \frac{R(T)}{\ln T} \geq \sum_{a : \Delta_a > 0} \frac{\Delta_a}{\mathcal{K}_{\inf}(\nu_a, \mu^*)}.$
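For Bernoulli arms the asymptotic constant in front of $\ln T$ can be computed in closed form, since $\mathcal{K}_{\inf}$ is then the binary KL divergence. A sketch with illustrative arm means (assumed strictly inside $(0, 1)$):

```python
import math

def kl_bernoulli(p, q):
    # Binary KL divergence; inputs assumed to lie strictly in (0, 1).
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def lai_robbins_constant(mu):
    """Sum over suboptimal arms of gap / kl(mu_a, mu*): the coefficient of ln T."""
    mu_star = max(mu)
    return sum((mu_star - m) / kl_bernoulli(m, mu_star)
               for m in mu if m < mu_star)

print(lai_robbins_constant([0.3, 0.5, 0.7]))
```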

The original 1985 paper[1] established this result for exponential families; later work showed that the bound holds under much weaker assumptions on the class $\mathcal{D}$.

Intuition

Consistency imposes that, on every instance, the number of pulls of an optimal arm must be large. This means that $\mu^*$ is estimated very accurately. The goal is then to determine, for a suboptimal arm $a$, how many samples are needed to be confident, with the appropriate level of confidence, that $\mu_a < \mu^*$. To do so, we use what is called the most confusing instance: an instance close to $(\nu_1, \dots, \nu_K)$ in which arm $a$ is optimal. We define it as $(\nu'_1, \dots, \nu'_K)$ such that, for all $b \neq a$, $\nu'_b = \nu_b$, and $\nu'_a \in \mathcal{D}$ is chosen so that $\mu(\nu'_a) > \mu^*$. The objective is to determine how many samples of arm $a$ are required to distinguish whether we are in the instance with $\nu_a$ or with $\nu'_a$, in terms of the distance $\mathrm{KL}(\nu_a, \nu'_a)$; the smallest such divergence is exactly $\mathcal{K}_{\inf}(\nu_a, \mu^*)$.

Algorithms achieving the Lai–Robbins lower bound

Several algorithms are known to achieve the Lai–Robbins asymptotic lower bound under specific assumptions on the reward distribution class $\mathcal{D}$. The following non-exhaustive list summarizes algorithms matching the lower bound.

Non-exhaustive list of algorithms achieving the lower bound under various distributional assumptions:

  • Gaussian rewards (known variance): KL-UCB,[4] TS,[5] AdaUCB,[6] RB-SDA[7]
  • Gaussian rewards: CHK[8]
  • One-dimensional exponential families: KL-UCB[4]
  • Bounded rewards: KL-UCB,[4] IMED,[9] Fast-IMED,[10] DMED,[11] NPTS,[12] KL-UCB-switch[13]
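To illustrate how index policies of the KL-UCB family match the bound, here is a minimal sketch of the KL-UCB index for Bernoulli rewards: each arm's mean is upper-bounded by the largest $q$ whose binary KL divergence from the empirical mean stays within a logarithmic budget. The exploration term $\ln t$ and the bisection depth below are standard but variant-dependent choices:

```python
import math

def kl_bernoulli(p, q):
    eps = 1e-12  # clamp to avoid log(0)
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean, pulls, t):
    """Largest q >= mean with pulls * kl(mean, q) <= ln(t), found by bisection."""
    budget = math.log(t) / pulls
    lo, hi = mean, 1.0
    for _ in range(50):  # q -> kl(mean, q) is increasing on [mean, 1]
        mid = (lo + hi) / 2
        if kl_bernoulli(mean, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

# An arm pulled rarely gets a larger (more optimistic) index than a well-explored one.
print(kl_ucb_index(0.5, pulls=10, t=100), kl_ucb_index(0.5, pulls=100, t=100))
```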

Extension to other problems

Structured bandit

A more complex setting is the structured bandit, in which the vector of arm means is known to belong to a restricted set. In this case one can prove a smaller lower bound that exploits the knowledge of this set.[14][15]

Best arm identification (BAI)

A similar result has been proved for best arm identification, which is the same game except that, instead of minimizing the regret, the goal is to identify the best arm with probability at least $1 - \delta$ using as few rounds as possible.[16]

Reinforcement Learning (RL)

Similar results have been proved for regret minimization in average-reward reinforcement learning. The regret is also of order $\ln T$, with a constant that depends on the problem.[17]
