Draft:Best arm identification

Problem in machine learning


Best arm identification or pure exploration is a one-player game where the player has to find the best action (arm) among a list of actions (arms) by collecting information in the most efficient way.

  • Comment: Thank you for your contribution to Wikipedia! Right now, this draft is far too technical for an encyclopedia, but it'd be great if you could revise the article to make it understandable for non-scholars. Could you please read WP:NOTGUIDE ("Wikipedia is not a manual, guidebook, textbook, or scientific journal") - in particularly points 6 and 7 - Wikipedia is not a textbook and not a scientific journal. Starting by giving more context than you currently do in the lead would be useful. Maybe mention that it is used in machine learning, spend a bit more time explaining how it's used in clinical trials and so forth. You might not need to include the formulae at all - you could perhaps instead point readers to textbook explanations or similar? I really appreciate your efforts to share scientific knowledge in Wikipedia, so please read this decline as an encouragement to make it more accessible to more people, not as a rejection! Lijil (talk) 19:29, 1 January 2026 (UTC)


Best arm identification is a multi-armed bandit game, because a player only gets information about an arm by playing it. The most common objective in multi-armed bandit games is to minimize the regret (i.e., play the best action as much as possible), but in best arm identification, the goal is to find the best arm as efficiently as possible.

This problem naturally arises in scenarios such as adaptive clinical trials[1] where the number of patients is limited and the quantification of the confidence in a treatment is important. It also arises in hyperparameter optimization[2] where the goal is to find the optimal choice of hyperparameters for an algorithm with the smallest possible number of experiments, as it can be costly in terms of time, energy, or money.

Stochastic multi-armed bandit

The stochastic multi-armed bandit (MAB) is a sequential game with one player and $K$ actions (arms). Each arm $i \in \{1, \dots, K\}$ has an unknown probability distribution $\nu_i$ associated with it.

At each turn, the player chooses one action and receives an observation drawn from the probability distribution associated with that arm. The more an arm is played, the more information is collected about its distribution.

Best arm identification

In BAI the goal is to find the arm whose probability distribution has the highest mean. BAI may be either fixed confidence or fixed horizon. In a fixed-confidence game, a confidence level is fixed at the beginning of the game and the goal is to find the best arm at this confidence level in as few turns as possible. In a fixed-horizon game, the number of turns $T$ is fixed, and the goal is to find the best arm with the highest possible confidence within those $T$ turns.

Math formalisation

We have one player and $K$ actions (arms). Behind each arm $i$ lies an unknown distribution $\nu_i$ with mean $\mu_i$. Each distribution belongs to a known family $\mathcal{F}$ (such as the set of Gaussian distributions or Bernoulli distributions). At each timestep $t = 1, 2, \dots$, the player selects an arm $A_t \in \{1, \dots, K\}$ and observes an independent sample $X_t \sim \nu_{A_t}$ from the corresponding distribution.

We write $\mu^* = \max_{i = 1, \dots, K} \mu_i$ for the highest mean. An arm $i^*$ that satisfies $\mu_{i^*} = \mu^*$ is called an optimal arm; otherwise it is called suboptimal.

In best arm identification (BAI) the objective is to identify an optimal arm.[3]

Two main settings for BAI appear in the literature:

  • Fixed confidence: In this setting, one typically assumes that there exists a unique optimal arm $i^*$. A confidence parameter $\delta \in (0, 1)$ is specified at the beginning. The algorithm must stop at some finite stopping time $\tau_\delta$ and return an arm $\hat{\imath}$ such that the probability of error is bounded: $\mathbb{P}(\hat{\imath} \neq i^*) \leq \delta$. The objective is to minimize the expected sample complexity $\mathbb{E}[\tau_\delta]$.

Such a setting appears, for example, when a constraint on the confidence is required (for example, if we require a confidence level of 95%, then $\delta = 0.05$).

  • Fixed horizon: In this setting, the number of samples $T$ is fixed in advance. The goal is to design an algorithm that minimizes the probability of misidentifying the optimal arm: $\mathbb{P}(\hat{\imath}_T \neq i^*)$.

This setting appears when the number of experiments is limited (for drug tests, the number of patients can be fixed in advance).

Example of simple modelling

Suppose we have $K$ treatments and we want to determine, with a confidence level of 95%, which treatment is the best to heal a specific disease. Each treatment $i$ heals the disease with some probability $\mu_i$, which means that each distribution $\nu_i$ is a Bernoulli distribution, so the family $\mathcal{F}$ is the set of Bernoulli distributions. We can use a fixed-confidence BAI algorithm with $\delta = 0.05$ to minimize $\mathbb{E}[\tau_\delta]$, the number of patients required to find the best treatment with probability 95%.
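This modelling can be sketched in a few lines of code. The healing probabilities below are made-up values for illustration, and `pull` is a hypothetical helper simulating one patient's outcome:

```python
import random

# Hypothetical healing probabilities of three treatments (Bernoulli means).
true_means = [0.45, 0.60, 0.55]

def pull(arm, rng=random):
    """Administer treatment `arm` to one patient: 1 if healed, else 0."""
    return 1 if rng.random() < true_means[arm] else 0

outcome = pull(1)  # outcome of one patient under treatment 1: 0 or 1
```

A BAI algorithm would call `pull` repeatedly, choosing the arm adaptively, until it can name the best treatment with the required confidence.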

Applications

Best arm identification naturally arises in several practical domains:

  • Adaptive clinical trials: The objective is to identify the most effective treatment based on sequentially collected patient data. Each treatment can be modeled as an arm with an underlying distribution of outcomes. The goal is to identify the treatment with the highest expected outcome with high confidence (the fixed-confidence setting with confidence level $1 - \delta$) while minimizing the number of enrolled patients (minimizing $\mathbb{E}[\tau_\delta]$), since enrolling patients is costly and exposing them to less effective drugs should be avoided.[1][4][5]
  • Hyperparameter tuning: Selecting the best configuration for machine learning models efficiently by treating each hyperparameter setting as an arm. The goal is to find the best configuration with as few experiments as possible, since experiments are costly in time and energy.[2]

Fixed confidence level

Typical run for a BAI algorithm with fixed confidence

In the fixed-confidence setting, the goal is to design an algorithm that identifies the best arm with a prescribed confidence level while minimizing the expected number of samples. Any such algorithm requires two key components:

  • Stopping rule: A decision criterion that determines when to stop sampling. Formally, this defines a stopping time $\tau_\delta$ and a recommendation $\hat{\imath}$ such that $\mathbb{P}(\tau_\delta < \infty) = 1$ and $\mathbb{P}(\hat{\imath} \neq i^*) \leq \delta$.
  • Sampling rule: A policy that, at each round $t$, selects the next arm $A_{t+1}$ to sample based on all previous observations $(A_1, X_1, \dots, A_t, X_t)$.

Algorithm structure

The general structure of a fixed-confidence BAI algorithm can be described as follows:

Input: Distribution family F, confidence level δ
Initialize: 
    For each arm i: set N_i ← 0, S_i ← 0
    Set t ← 0, STOP ← false
while not STOP do:
    A_{t+1} ← Sampling_rule(F, (N_i)_i, (ν̂_i)_i, t)
    Observe reward X_{t+1}
    Update: N_{A_{t+1}} ← N_{A_{t+1}} + 1, S_{A_{t+1}} ← S_{A_{t+1}} + X_{t+1}
    Update empirical distribution ν̂_{A_{t+1}}
    t ← t + 1
    STOP ← Stopping_rule(F, (N_i)_i, (ν̂_i)_i, δ)
return î ← arg max_i S_i / N_i
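The sampling and stopping rules in the pseudocode above can be instantiated in many ways. As an illustrative instance (a classical successive-elimination strategy with Hoeffding confidence bounds, not the asymptotically optimal Track-and-Stop algorithm discussed later), the following sketch identifies the best Bernoulli arm; the particular union-bound radius is one standard choice, and the simulated arms are an assumption of the sketch:

```python
import math
import random

def successive_elimination(means, delta, rng):
    """Return (recommended_arm, total_samples); error probability <= delta.

    `means` holds the unknown Bernoulli means and is used only to
    simulate pulls of the arms.
    """
    K = len(means)
    active = list(range(K))
    counts = [0] * K
    sums = [0.0] * K
    total = 0
    while len(active) > 1:
        # Sampling rule: pull every surviving arm once (round robin).
        for i in active:
            sums[i] += 1.0 if rng.random() < means[i] else 0.0
            counts[i] += 1
            total += 1
        # Hoeffding confidence radius with a union bound over arms/rounds.
        n = counts[active[0]]  # all surviving arms share the same count
        radius = math.sqrt(math.log(4.0 * K * n * n / delta) / (2.0 * n))
        leader = max(active, key=lambda i: sums[i] / counts[i])
        lower = sums[leader] / counts[leader] - radius
        # Elimination (stopping) rule: drop arms whose upper confidence
        # bound falls below the empirical leader's lower confidence bound.
        active = [i for i in active if sums[i] / counts[i] + radius >= lower]
    return active[0], total

arm, samples = successive_elimination([0.3, 0.7, 0.5],
                                      delta=0.05, rng=random.Random(0))
```

The sample complexity of this strategy scales with the inverse squared gaps between the best mean and the others, which is simple to analyze but looser than the instance-optimal lower bound below.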

Lower bound

The constant $c^*(\nu)$ for two Gaussian arms of variance 1 with means $\mu$ and $\mu + x$

The minimal expected number of pulls needed to reach the confidence level $1 - \delta$ was determined in 2016.[6] For a given instance $\nu$ and a fixed $\delta$, it provides the minimum possible value of $\mathbb{E}[\tau_\delta]$.

To give the lower bound, we first need to define the function $\mathrm{kl}(x, y) = x \log\frac{x}{y} + (1 - x) \log\frac{1 - x}{1 - y}$, the Kullback–Leibler divergence between two Bernoulli distributions with means $x$ and $y$. The quantity $\mathrm{kl}(\delta, 1 - \delta)$ is equivalent to $\log(1/\delta)$ when $\delta$ tends to $0$.
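The Bernoulli Kullback–Leibler divergence and the stated equivalence with $\log(1/\delta)$ can be checked numerically; a minimal sketch:

```python
import math

def kl_bernoulli(x, y):
    """KL divergence between Bernoulli(x) and Bernoulli(y), 0 < x, y < 1."""
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))

# kl(delta, 1 - delta) / log(1/delta) approaches 1 as delta shrinks.
ratios = [kl_bernoulli(d, 1 - d) / math.log(1 / d)
          for d in (1e-3, 1e-6, 1e-9)]
```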

For any algorithm satisfying the $\delta$-correctness constraint, the expected sample complexity satisfies

$$\mathbb{E}[\tau_\delta] \;\geq\; c^*(\nu)\, \mathrm{kl}(\delta, 1 - \delta),$$

where the problem-dependent constant $c^*(\nu)$ only depends on the instance $\nu$.


Asymptotically optimal algorithms

Since $\mathrm{kl}(\delta, 1 - \delta) \sim \log(1/\delta)$ as $\delta \to 0$, an algorithm is called asymptotically optimal if

$$\limsup_{\delta \to 0} \frac{\mathbb{E}[\tau_\delta]}{\log(1/\delta)} \;\leq\; c^*(\nu).$$

The first algorithm proposed to achieve asymptotic optimality is the Track-and-Stop algorithm,[6] which tracks the optimal sampling proportions attaining the constant $c^*(\nu)$ in the lower bound, computed from the empirical distribution $\hat{\nu}(t)$, and adds some forced exploration to ensure that the estimated proportions converge to the true ones.


Fixed horizon

In the fixed-horizon setting, the total number of samples $T$ is specified in advance, contrasting with the fixed-confidence setting where sampling continues until a confidence criterion is met. The algorithm must select an arm $A_t$ at each round $t \leq T$, and at the end of the horizon, it returns a recommendation $\hat{\imath}_T$. The objective is to minimize the probability of error $\mathbb{P}(\hat{\imath}_T \neq i^*)$.

This setting is particularly relevant when computational or experimental resources are strictly limited, such as in clinical trials where we want to determine which of $K$ treatments is best and patient enrollment is fixed in advance. Each turn $t$ corresponds to one patient: pulling arm $i$ means giving treatment $i$ to that patient, which yields an observation from the distribution of that treatment. The total number of patients is the horizon $T$.

Algorithm structure

A typical fixed-horizon BAI algorithm proceeds as follows:

Input: Distribution family F, horizon T
Initialize: 
    For each arm i: pull i once and set N_i ← 1, ν̂_i ← initial empirical distribution
for t from K + 1 to T do:
    A_t ← Sampling_rule(F, (N_i)_i, (ν̂_i)_i, t)
    Observe reward X_t
    Update: N_{A_t} ← N_{A_t} + 1
    Update empirical distribution ν̂_{A_t}
return î_T ← arg max_i μ̂_i

Unlike the fixed-confidence setting, there is no stopping rule because sampling always stops at time $T$. The algorithm is based only on a sampling rule (together with a final recommendation rule).
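One well-known sampling rule for this setting is Sequential Halving (Karnin, Koren and Somekh, 2013), which splits the budget into elimination rounds and discards the empirically worse half of the arms after each round. The sketch below is an illustrative implementation for simulated Bernoulli arms; the exact budget-splitting details follow one common variant:

```python
import math
import random

def sequential_halving(means, budget, rng):
    """Fixed-horizon BAI: recommend an arm after roughly `budget` pulls.

    `means` holds the unknown Bernoulli means, used only to simulate pulls.
    """
    active = list(range(len(means)))
    rounds = max(1, math.ceil(math.log2(len(means))))
    for _ in range(rounds):
        # Spend an equal share of the budget in each round, spread
        # uniformly over the arms still in contention.
        pulls = max(1, budget // (len(active) * rounds))
        emp = {i: sum(rng.random() < means[i] for _ in range(pulls)) / pulls
               for i in active}
        # Keep the empirically better half (rounded up).
        active.sort(key=lambda i: emp[i], reverse=True)
        active = active[:math.ceil(len(active) / 2)]
    return active[0]

best = sequential_halving([0.2, 0.4, 0.8, 0.5], budget=400,
                          rng=random.Random(0))
```

Because half the arms are dropped each round, surviving arms receive progressively more pulls, concentrating the fixed budget on the hardest comparisons.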

Lower bound

The lower bound in the fixed-horizon setting gives the best confidence level that can be reached with a given number of turns $T$. It is expressed as an asymptotic result when $T$ is large.

Lower bound theorem: For any algorithm and any instance $\nu$, there exists a constant $c(\nu) > 0$ that depends only on $\nu$ such that the probability of error satisfies $\mathbb{P}(\hat{\imath}_T \neq i^*) \geq \exp(-c(\nu)\, T)$ for $T$ large enough.[8]

This result shows that the error probability cannot decay faster than exponentially in the number of turns $T$.

Simple regret

An alternative performance metric for fixed-horizon BAI is the simple regret, defined as $r_T = \mathbb{E}\left[\mu^* - \mu_{\hat{\imath}_T}\right]$, which measures the expected suboptimality of the returned arm.

While the probability of error $\mathbb{P}(\hat{\imath}_T \neq i^*)$ treats all mistakes as equally costly, the simple regret accounts for the gap between the optimal mean $\mu^*$ and the mean $\mu_{\hat{\imath}_T}$ of the arm recommended by the algorithm. This distinction is important in applications where the cost of choosing a suboptimal arm depends on how far it is from optimal.[9]
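A tiny sketch makes the distinction concrete (the means below are made-up values):

```python
def simple_regret(means, recommended):
    """Gap between the best mean and the mean of the recommended arm."""
    return max(means) - means[recommended]

means = [0.2, 0.5, 0.48]
# Both arm 0 and arm 2 are errors under the 0/1 criterion, but their
# simple regrets differ: arm 2 is nearly optimal, arm 0 is far off.
r_near = simple_regret(means, 2)  # ≈ 0.02
r_far = simple_regret(means, 0)   # ≈ 0.3
```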

See also

References
