Lai–Robbins lower bound
Lower bound for bandit problem
From Wikipedia, the free encyclopedia
The Lai–Robbins lower bound[1] gives an asymptotic lower bound on the regret that any uniformly good algorithm must incur in the stochastic multi-armed bandit problem. The original result was proved by Tze Leung Lai and Herbert Robbins in 1985 for parametric exponential families. Later work extended the statement to more general classes of distributions.[2]
Multi-armed bandit problem
The multi-armed bandit problem (MAB) is a sequential game in which the player must trade off exploration (to learn) and exploitation (to earn).
The player chooses among $K$ actions (arms) with unknown reward distributions $\nu_1, \dots, \nu_K$. The player is assumed to know a class of distributions $\mathcal{D}$ such that $\nu_i \in \mathcal{D}$ for every $i$ (for example, $\mathcal{D}$ may be the family of Gaussian or Bernoulli distributions).
At each round $t = 1, 2, \dots$ the player selects (pulls) an arm $a_t \in \{1, \dots, K\}$ and observes a reward $X_t \sim \nu_{a_t}$.
We denote
- $N_i(t)$ the number of times arm $i$ has been pulled in the first $t$ rounds,
- $\mu = (\mu_1, \dots, \mu_K)$ the vector of arm means, where $\mu_i = \mathbb{E}_{X \sim \nu_i}[X]$,
- $\mu^* = \max_i \mu_i$ the highest mean,
- $\Delta_i = \mu^* - \mu_i$ the gap of arm $i$.
An arm $i$ with $\mu_i = \mu^*$ is called an optimal arm; otherwise it is a suboptimal arm.
The goal is to minimize the regret at horizon $T$, defined by
$$R_T = T \mu^* - \mathbb{E}\left[\sum_{t=1}^{T} X_t\right].$$
Intuitively, the regret is the (expected) total loss compared to always playing an optimal arm:
$$R_T = \sum_{i : \Delta_i > 0} \Delta_i \, \mathbb{E}[N_i(T)].$$
An MAB algorithm is a (possibly randomized) policy that, at each round $t$, chooses an arm $a_t$ using the observations received in previous rounds.[3]
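The setup above can be sketched as a short simulation. The following is a minimal illustration (the function and policy names are hypothetical, not from the cited literature): a Bernoulli bandit environment, a policy interface matching the definition above, and the pseudo-regret $\sum_t (\mu^* - \mu_{a_t})$ accumulated along the way.

```python
import random

def run_bandit(means, policy, horizon, seed=0):
    """Simulate a Bernoulli bandit and return (pseudo-regret, pull counts).

    means  : true arm means (unknown to the policy)
    policy : callable (t, counts, sums) -> arm index
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k          # N_i(t): number of pulls of each arm
    sums = [0.0] * k          # total observed reward per arm
    best = max(means)         # mu^*
    regret = 0.0
    for t in range(1, horizon + 1):
        arm = policy(t, counts, sums)
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]   # gap of the pulled arm
    return regret, counts

# A naive round-robin policy pulls every arm equally often and therefore
# incurs regret linear in the horizon, illustrating why adaptivity is needed.
def round_robin(t, counts, sums):
    return (t - 1) % len(counts)

regret, counts = run_bandit([0.5, 0.6], round_robin, horizon=1000)
```

With two arms of means 0.5 and 0.6, round-robin pulls the suboptimal arm 500 times out of 1000 and accumulates regret $500 \cdot 0.1 = 50$.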
Intuitive example
Suppose a farmer must choose, each year, one of $K$ seed varieties to plant. Each variety $i$ has an unknown average yield $\mu_i$. If the farmer knew the best variety (with mean $\mu^*$) he would plant it every year; in reality he must try varieties to learn which is best. The cumulative regret after $T$ years measures the total expected loss in yield due to imperfect knowledge.
Remarks
- The model above is the stochastic MAB; there also exist adversarial variants.[3]
- One may consider a fixed-horizon setting (known $T$) or an anytime setting (unknown $T$).
Lai–Robbins lower bound
The theorem quantifies how long a suboptimal arm $i$ must be pulled in order to distinguish the instance in which arm $i$ has distribution $\nu_i$ from an alternative instance in which it has a distribution $\nu'_i \in \mathcal{D}$ with mean $\mu(\nu'_i) > \mu^*$.
Knowing a lower bound on the number of pulls of every suboptimal arm yields a lower bound on the regret, since only suboptimal arms contribute to the regret.
Before stating the formal theorem we need to define what a consistent algorithm is.
Consistency (uniformly good algorithms)
Let $\mathcal{D}$ be a class of probability distributions and consider $K$ arms with reward distributions $\nu_1, \dots, \nu_K \in \mathcal{D}$. An algorithm is said to be consistent (also called uniformly good) on $\mathcal{D}$ if, for every instance $(\nu_1, \dots, \nu_K)$ and every $\alpha > 0$, the expected regret grows subpolynomially:
$$R_T = o(T^{\alpha}).$$
This assumption excludes algorithms that perform well on some instances but incur linear regret on others.
Formal lower bound
Fix any suboptimal arm $i$. For a distribution $\nu \in \mathcal{D}$ and a threshold $\mu^* \in \mathbb{R}$, define
$$\mathcal{K}_{\inf}(\nu, \mu^*) = \inf \left\{ D_{\mathrm{KL}}(\nu \,\|\, \nu') : \nu' \in \mathcal{D},\ \mu(\nu') > \mu^* \right\},$$
where $D_{\mathrm{KL}}$ denotes the Kullback–Leibler divergence.
Then, for any algorithm consistent on $\mathcal{D}$ and for every instance $(\nu_1, \dots, \nu_K)$, every suboptimal arm $i$ satisfies
$$\liminf_{T \to \infty} \frac{\mathbb{E}[N_i(T)]}{\ln T} \geq \frac{1}{\mathcal{K}_{\inf}(\nu_i, \mu^*)}.$$
Consequently, the regret satisfies
$$\liminf_{T \to \infty} \frac{R_T}{\ln T} \geq \sum_{i : \Delta_i > 0} \frac{\Delta_i}{\mathcal{K}_{\inf}(\nu_i, \mu^*)}.$$
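For the class of Bernoulli distributions, $\mathcal{K}_{\inf}(\mathrm{Ber}(\mu_i), \mu^*)$ reduces to the binary Kullback–Leibler divergence $\mathrm{kl}(\mu_i, \mu^*)$, so the asymptotic regret constant can be computed in closed form. The sketch below (function names are illustrative, not standard) evaluates that constant for a two-armed Bernoulli instance.

```python
import math

def bernoulli_kl(p, q):
    """Binary KL divergence kl(p, q) between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12  # clamp to avoid log(0)
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def lai_robbins_constant(means):
    """Asymptotic constant sum_i Delta_i / K_inf(nu_i, mu^*) for Bernoulli
    arms, where K_inf(Ber(mu_i), mu^*) = kl(mu_i, mu^*)."""
    mu_star = max(means)
    return sum((mu_star - mu) / bernoulli_kl(mu, mu_star)
               for mu in means if mu < mu_star)

# Any consistent algorithm on this instance must eventually incur at least
# about c * ln(T) regret.
c = lai_robbins_constant([0.3, 0.5])
```

For means $(0.3, 0.5)$ the constant is $0.2 / \mathrm{kl}(0.3, 0.5) \approx 2.43$.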
The original 1985 paper[1] established this result for exponential families; later work showed that the bound holds under much weaker assumptions on .
Intuition
Consistency imposes that, in every instance, an optimal arm is pulled many times; this means that $\mu^*$ is estimated very accurately. The question is then, for a suboptimal arm $i$, how many samples of that arm are needed to be confident, at the appropriate level, that $\mu_i < \mu^*$. To answer it, one considers the most confusing instance: an instance close to the original in which arm $i$ is optimal. It is obtained by keeping $\nu_j$ unchanged for all $j \neq i$ and replacing $\nu_i$ by a distribution $\nu'_i \in \mathcal{D}$ chosen so that $\mu(\nu'_i) > \mu^*$. The number of samples of arm $i$ required to distinguish the instance with $\nu_i$ from the instance with $\nu'_i$ is governed by the Kullback–Leibler divergence between them, and minimizing this divergence over admissible $\nu'_i$ yields the constant $\mathcal{K}_{\inf}(\nu_i, \mu^*)$.
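The construction is easiest to see for Gaussian arms with known variance $\sigma^2$, where $D_{\mathrm{KL}}(\mathcal{N}(\mu_1, \sigma^2) \,\|\, \mathcal{N}(\mu_2, \sigma^2)) = (\mu_1 - \mu_2)^2 / (2\sigma^2)$: the most confusing instance shifts arm $i$'s mean just above $\mu^*$, so $\mathcal{K}_{\inf}(\nu_i, \mu^*) = \Delta_i^2 / (2\sigma^2)$ and the bound reads $\mathbb{E}[N_i(T)] \gtrsim 2\sigma^2 \ln T / \Delta_i^2$. A small numeric sketch (the helper name is illustrative):

```python
import math

def min_pulls_gaussian(mu_i, mu_star, sigma, horizon):
    """Lai-Robbins asymptotic minimum number of pulls of a suboptimal
    Gaussian arm with known variance sigma^2: ln(T) / K_inf, where
    K_inf = (mu_star - mu_i)^2 / (2 sigma^2)."""
    gap = mu_star - mu_i
    k_inf = gap ** 2 / (2 * sigma ** 2)
    return math.log(horizon) / k_inf

# An arm one standard deviation below the best must be pulled roughly
# 2 * ln(10000) / 1 ~ 18 times by horizon T = 10000.
n = min_pulls_gaussian(mu_i=0.0, mu_star=1.0, sigma=1.0, horizon=10_000)
```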
Algorithms achieving the Lai–Robbins lower bound
Several algorithms are known to achieve the Lai–Robbins asymptotic lower bound under specific assumptions on the reward distribution class $\mathcal{D}$. The following table gives a non-exhaustive list of algorithms matching the lower bound.
| Distribution class | Algorithms |
|---|---|
| Gaussian rewards (known variance) | KL-UCB,[4] TS,[5] AdaUCB,[6] RB-SDA[7] |
| Gaussian rewards (unknown variance) | CHK[8] |
| One-dimensional exponential families | KL-UCB[4] |
| Bounded rewards | KL-UCB,[4] IMED,[9] Fast-IMED,[10] DMED,[11] NPTS,[12] KL-UCB-switch[13] |
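As an illustration of how such algorithms use the quantity appearing in the lower bound, here is a minimal sketch of the KL-UCB index for Bernoulli rewards (a simplified version of the algorithm in [4]; the exploration level $\ln t / N_i(t)$ omits the lower-order terms used in the paper's analysis). The index of arm $i$ is the largest mean $q$ whose distribution is still statistically compatible with the observed samples, found by bisection on the convex map $q \mapsto \mathrm{kl}(\hat\mu_i, q)$.

```python
import math

def bernoulli_kl(p, q):
    """Binary KL divergence kl(p, q)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def klucb_index(mean, pulls, t):
    """Largest q >= mean with pulls * kl(mean, q) <= ln(t), by bisection."""
    level = math.log(t) / pulls
    lo, hi = mean, 1.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if bernoulli_kl(mean, mid) <= level:
            lo = mid
        else:
            hi = mid
    return lo

def klucb_policy(t, counts, sums):
    """Pull each arm once, then pull the arm with the highest KL-UCB index."""
    for i, n in enumerate(counts):
        if n == 0:
            return i
    return max(range(len(counts)),
               key=lambda i: klucb_index(sums[i] / counts[i], counts[i], t))
```

The index shrinks toward the empirical mean as an arm is pulled more, and grows slowly (logarithmically in $t$) otherwise, which is what drives the $\ln T / \mathcal{K}_{\inf}$ exploration rate of the lower bound.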
Extension to other problems
Structured bandit
A more complex setting is the structured bandit, where the vector of arm means is known to lie in a restricted set. In this case one can prove a smaller lower bound that exploits the knowledge of this set.[14][15]
Best arm identification (BAI)
A similar result has been proved for best arm identification, which is the same game except that, instead of minimizing the regret, the goal is to identify the best arm with probability at least $1 - \delta$ using as few rounds as possible.[16]
Reinforcement Learning (RL)
Similar results have been proved for regret minimization in average-reward reinforcement learning. The order is also $\ln T$, with a constant that depends on the problem.[17]