Talk:Likelihood function/Archive 1

From Wikipedia, the free encyclopedia


Unclassified comments

Article does not communicate well. A relatively simple matter turns out to be difficult to understand. —Preceding unsigned comment added by 80.212.104.206 (talk) 13:34, 18 September 2010 (UTC)


I adjusted the wiktionary entry so it doesn't say that the mathematical definition is 'likelihood = probability'. Someone more mathematical than I may want to check to see if the mathematical definition I gave is correct. I defined "likelihood" in the parameterized-model sense, because that is the only way in which I have ever seen it used (i.e., not in the more abstract Pr(A | B=b) sense currently given in the Wikipedia article). 128.231.132.2 03:06, 21 March 2007 (UTC)


This article needs integrating / refactoring with the other two on the likelihood principle and maximum likelihood method, and a good going-over by someone expert in the field. -- The Anome


I emphatically agree. I've rewritten some related articles and I may get to this one if I ever have time. -- Mike Hardy

All was going well until I hit

In statistics, a likelihood function is a conditional probability function considered as a function of its second argument with its first argument held fixed, thus:

Would it be possible for someone to elaborate on that sentence or to give an example? FarrelIThink 06:12, 21 February 2007 (UTC)



I found the very first sentence under the "Definition" section very confusing:

The likelihood of a set of parameter values, θ, given outcomes x, is equal to the probability of those observed outcomes given those parameter values.

This is not true in the continuous case, as described by the article itself a few sentences later. I think the whole thing would be much clearer if the first sentence were omitted and it simply said "The likelihood function is defined differently for discrete and continuous probability distributions". I'm currently a student of this topic and I had quickly read the first sentence under Definition (and only that sentence), ended up greatly confused, and only later came back to read the rest of the section to clarify things. --nonagonal  Preceding unsigned comment added by Nonagonal (talkcontribs) 19:59, 8 October 2015 (UTC)

The arrow

Can someone tell me what the arrow notation is supposed to mean? --Huggie (talk) 11:30, 3 April 2010 (UTC)

I can't figure this out either. Did you ever find an answer? Jackmjackm (talk) 20:00, 23 September 2023 (UTC)
The arrows indicate mapping, so the notation is saying that for each parameter value θ, we can define a function giving the probability (or probability density) of the data given θ. I.e., we map θ to a function x ↦ P(x | θ). I hope this is somewhat helpful and not too circular! So many suspicious toenails (talk) 16:16, 26 September 2023 (UTC)
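A minimal Python sketch of the two readings of the arrow (an editor's illustration, not part of the original thread; the Bernoulli model is an assumed example): fixing the parameter gives a probability function of the data, while fixing the data gives the likelihood, a function of the parameter.

```python
def pmf(x, theta):
    """Probability of outcome x (0 or 1) under a Bernoulli parameter theta."""
    return theta if x == 1 else 1.0 - theta

def model(theta):
    """theta -> (function of the data): the probability reading of the arrow."""
    return lambda x: pmf(x, theta)

def likelihood(x):
    """x -> (function of the parameter): the likelihood reading of the arrow."""
    return lambda theta: pmf(x, theta)

prob_fn = model(0.5)     # probability function, parameter held fixed
like_fn = likelihood(1)  # likelihood function, data held fixed
```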

Context tag

I added the context tag because the article starts throwing mathematical functions and jargon around from the very beginning with no explanation of what the letters and symbols mean. Rompe 04:40, 15 July 2006 (UTC)

The tag proposes making it more accessible to a general audience. A vernacular usage makes likelihood synonymous with probability, but that is not what is meant here. I doubt this topic can be made readily comprehensible to those not familiar at the very least with probability theory. So I question the appropriateness of the "context" tag. The article starts with the words "In statistics,...". That's enough to tell the general reader that it's not about criminology, church decoration, sports tactics, chemistry, fiction writing, etc. If no such preceding words were there, I'd agree with the "context" tag. Michael Hardy 23:55, 16 July 2006 (UTC)

Which came first

Which came first? the common use as in "in all likelihood this will not occur" or the mathematical function?

See History of probability. "Probable and probability and their cognates in other modern languages derive from medieval learned Latin probabilis ... . The mathematical sense of the term is from 1718. ... The English adjective likely is of Germanic origin, most likely from Old Norse likligr (Old English had geliclic with the same sense), originally meaning "having the appearance of being strong or able", "having the similar appearance or qualities", with a meaning of "probably" recorded from the late 14th century. Similarly, the derived noun likelihood had a meaning of "similarity, resemblance" but took on a meaning of "probability" from the mid 15th century." Mathematical formalizations of probability came later, starting primarily around roughly 1600. Ronald Fisher is credited with popularizing "likelihood" in its modern sense beginning around 1912, according to the Wikipedia article on him. DavidMCEddy (talk) 15:47, 21 March 2016 (UTC)

Backwards

An earlier version of this page said "In a sense, likelihood works backwards from probability: given B, we use the conditional probability Pr(A|B) to reason about A, and, given A, we use the likelihood function L(A|B) to reason about B. ". This makes sense; i.e. it says it's backwards, and it is.

The current version uses L(B|A) instead, i.e. it says: "In a sense, likelihood works backwards from probability: given B, we use the conditional probability Pr(A|B) to reason about A, and, given A, we use the likelihood function L(B|A) to reason about B. " This does not make sense. It says it's backwards, but it talks as if Pr and L are interchangeable.

How about switching back to the earlier version, and providing a concrete example to help clarify it? Possible example: Given that a die is fair, we use the probability of getting 10 sixes in a row given that the die is fair to reason about getting 10 sixes in a row; or given that we got 10 sixes in a row, we use the likelihood of getting 10 sixes in a row given that the die is fair to reason about whether the die is fair. (Or should it say "the likelihood that the die is fair given that 10 sixes occur in a row"? What exactly is the definition of "likelihood" used in this sort of verbal context, anyway?) --Coppertwig 20:28, 24 August 2007 (UTC)
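Coppertwig's die example can be made concrete with a short Python sketch (an editor's illustration; the die loaded to show six half the time is an assumed alternative, not from the thread). The same quantity P(10 sixes | p) is a probability when the parameter p is fixed, and a likelihood in p once the 10 sixes are observed:

```python
fair = 1 / 6

# Probability: the parameter (a fair die) is fixed; reason about the outcome.
p_ten_sixes_given_fair = fair ** 10

# Likelihood: the outcome (10 sixes) is fixed; reason about the parameter.
def L(p_six, n_sixes=10):
    return p_six ** n_sixes

# Compare the fair die with a hypothetical die showing six half the time:
ratio = L(0.5) / L(fair)   # likelihood ratio, loaded vs fair: (0.5 / (1/6))**10
```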

I agree. And similarly, in the "abstract", currently the last sentence ends in "...and indicates how likely a parameter value is in light of the observed outcome." I do not know if it is ok to use the word "likely" in this way. Clearly, replacing it with "probable" in this sentence would make it terribly wrong by committing the common reversal-of-conditional-probabilities mistake. Therefore: is "likely" clearly distinct (and understood) from "probable"? Anyway, I would suggest rewriting to say "... and indicates how likely the observed outcome is to occur for different parameter values." Or am I missing something here? Enlightenmentreloaded (talk) 10:01, 28 October 2011 (UTC)

And the preamble has the variable and the parameter confused in "Equivalently, the likelihood may be written to emphasize that it is the probability of observing sample x given θ,..." 109.250.93.11 (talk) 16:32, 17 November 2022 (UTC)

Likelihood of continuous distributions is a problem

The contribution looks attractive; however, it ignores several basic mathematical facts:

1. Usually likelihood is assessed using not one realization, but a series of observed random variables (independently identically distributed). Then the likelihood expands to a large product. Usually this is transformed by a logarithm to a sum. This transformation is not linear (like that mentioned in the entry), but it attains its maximum at the same point.

2. Likelihood can easily be defined for discrete distributions, where its values are values of some probabilities. A problem arises with an analogue for continuous distributions. Then the probability density function (pdf) is used instead of probability (probability function, pf). This is incorrect unless we use additional assumptions, e.g., continuity of the pdf. Without it, the notion of likelihood does not make sense, although this error occurs in most textbooks. (Do you know any which makes this correct? I did not find any; I did it in my textbook.) In any case, there are two totally different and incomparable notions of likelihood, one for discrete, the other for continuous distributions. As a consequence, there is no notion of likelihood applicable to mixed distributions. (Nevertheless, the maximum likelihood method can be applied separately to the discrete and continuous parts.)

Mirko Navara, http://cmp.felk.cvut.cz/~navara —Preceding unsigned comment added by 88.146.54.129 (talk) 08:16, 22 February 2008 (UTC)

Just to clarify, by "the contribution" are you referring to the whole article or a particular section or edit? I assume the former.
On (1), well, the log-likelihood isn't mentioned in this article but clearly it isn't itself a likelihood. The invariance of maximum likelihood estimates to transformation is surely a matter not for this article but for the one on maximum likelihood. (I haven't checked that article to see what it says on the topic, if anything).
On (2), I think you've got a point that this article lacks a rigorous definition. I think the more accessible definition is needed too and should be given first. If you want to add a more rigorous definition, go ahead. I'm sure i've seen a measure-theoretic definition somewhere but I'm afraid i've never got to grips with measure theory myself.
When you say "I did it in my textbook", is that Teorie Pravděpodobnosti Na Kvantových a Fuzzy Logikách? I'm afraid i can't locate a copy to consult. Qwfp (talk) 09:34, 22 February 2008 (UTC)
The "problem" between definitions of likelihood for discrete and continuous distributions is resolved by using Measure-theoretic probability theory. This generality comes with the substantial cost of learning measure theory. Fortunately, it is unnecessary for many applications. It is, nevertheless, useful for many purposes -- one of which is understanding the commonality of the treatment between discrete, absolutely continuous and other distributions. I just added an "In general" section to explain this: A discrete probability mass function is the probability density function for that distribution with respect to the counting measure on the set of all possible discrete outcomes. For absolutely continuous distributions, the standard density function is the density (Radon-Nikodym derivative) with respect to the Lebesgue measure. I hope this adds more clarity than confusion. DavidMCEddy (talk) 16:19, 21 March 2016 (UTC)
Regarding different definitions for discrete and continuous distributions: this is a mathematical point, not a conceptual point, and should be discussed further down in the article, but not in its introduction, I think. Can we use a small volume element dx in measurement space, and consider 'p(x|theta)dx' instead of 'p(x|theta)' for the continuous case, at least in the introduction? Benjamin.friedrich (talk) —Preceding undated comment added 20:35, 14 May 2020 (UTC)
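A small Python sketch of the p(x|theta)·dx idea above (an editor's illustration; the normal model and the numbers are assumed examples): a density value can exceed 1, so it is not itself a probability, but density times a small volume element approximates the probability of landing in a small interval.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of the normal distribution N(mu, sigma**2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# A density value can exceed 1 -- it is not a probability ...
dens = normal_pdf(0.0, 0.0, 0.1)   # about 3.99 for this narrow distribution

# ... but density * dx approximates a probability, and stays small.
dx = 1e-4
prob_small_interval = dens * dx    # roughly P(X in [-dx/2, dx/2])
```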

Area under the curve

I'm confused about this statement:

"...the integral of a likelihood function is not in general 1. In this example, the integral of the likelihood density over the interval [0, 1] in pH is 1/3, demonstrating again that the likelihood density function cannot be interpreted as a probability density function for pH."

Because the likelihood function is defined up to a scalar, the fact that the integral is 1/3 isn't that meaningful. However, I think we could say that one possibility is twice as likely as another or similarly that the likelihood of being in the range [a,b] is six times as likely as being in the disjoint range [c,d]. Given that pH can't be less than 0 or more than 1, it seems sensible to normalize the likelihood so that the integral over that range is 1. I think that we could then say that if the integral of the normalized likelihood over [a,b] equals 0.5, then there's a 50/50 chance of pH being in the range [a,b], which would correspond to a normalized likelihood of 0.5. Am I mistaken? Why can't we just normalize to 1.0 and then interpret the normalized likelihood function as a probability density function? —Ben FrantzDale (talk) 17:17, 14 August 2008 (UTC)

"Why can't we just normalize to 1.0"?. There are several reasons. One is that the integral in general doesn't exist (isn't finite). If an appropriate weighting function can be found, then the scaled function becomes something else, with its own interpretation, which would move us away from "likelihood function". However, certain theoretical work has been done which makes use of a different scaling ... scaling by a factor to make the maximum of the scaled likelihood equal to one. Melcombe (talk) 08:45, 15 August 2008 (UTC)
Interesting. Can you give an example of when that integral wouldn't be finite? (This question may be getting at the heart of the difference between "likelihood" and "probability" -- a difference which I don't yet fully understand.) —Ben FrantzDale (talk) 12:38, 15 August 2008 (UTC)
An example might be the case where an observation X is from a uniform distribution on (0,a) with a>0. The likelihood function is 1/a for a > (observed X) : so not integrable. A simple change of parameterisation to b=1/a gives a likelihood which is integrable. Melcombe (talk) 13:25, 15 August 2008 (UTC)
Don't forget the simplest case of all: uniform support! Not possible to normalize in this case. Robinh (talk) 14:47, 15 August 2008 (UTC)
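Melcombe's uniform example can be checked in a few lines of Python (an editor's illustration; the observed value 2.0 is an assumed choice). The likelihood L(a) = 1/a for a > x has a logarithmically divergent integral, while the reparameterisation b = 1/a gives L(b) = b on (0, 1/x), which integrates to a finite value:

```python
import math

x_obs = 2.0   # a single observation from Uniform(0, a)

# Integral of L(a) = 1/a over (x_obs, upper): log(upper / x_obs),
# which grows without bound as upper grows -- not integrable.
def integral_L_a(upper):
    return math.log(upper / x_obs)

# With b = 1/a, the likelihood is L(b) = b on (0, 1/x_obs);
# its integral is (1/x_obs)**2 / 2, which is finite.
area_b = (1.0 / x_obs) ** 2 / 2.0
```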

It doesn't make sense to speak of a "likelihood density function". Likelihoods are not densities. Density functions are not defined pointwise. One can convolve them, but not multiply them. Likelihoods are defined pointwise. One can multiply them but not convolve them. One can multiply a likelihood by a density and get another density (although not in general a probability density, until one normalizes). Michael Hardy (talk) 16:00, 15 August 2008 (UTC)

I'm deleting that entire paragraph beginning with, "The likelihood function is not a probability ... ." I agree it's confusing, and I don't see that it adds anything.
The issues raised by a discussion of "the integral of a likelihood function" could be answered clearly with a sensible discussion of likelihood in Bayesian inference. I don't know if I'll find the time to write such a section myself, but it would make a useful addition to this article. DavidMCEddy (talk) 16:37, 21 March 2016 (UTC)

Needs a simpler introduction?

I believe it is a good habit for mathematical articles on Wikipedia, to start with a simple heuristical explanation of the concept, before diving into details and formalism. In this case I think it should be made clearer that the likelihood is simply the pdf regarded as a function of the parameter rather than of the data.

Perhaps the fact that while the pdf is a deterministic function, the likelihood is considered a random function, should also be addressed. Thomas Tvileren (talk) 07:30, 17 April 2009 (UTC)

What is the scaling factor alpha in the introduction good for? If that's for the purpose of simplification of the maximum likelihood method then (a) it is totally misplaced comment and (b) you could put there any strictly increasing function, not just scaling by a constant. --David Pal (talk) 01:35, 1 March 2011 (UTC)
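David Pal's point (b) can be illustrated with a short Python sketch (an editor's illustration; the 7-heads-in-10-flips data are an assumed example): scaling by a constant alpha and, more generally, any strictly increasing transform such as the logarithm leave the maximiser of the likelihood unchanged.

```python
import math

# Bernoulli likelihood for 7 heads in 10 flips.
def L(p):
    return p ** 7 * (1 - p) ** 3

grid = [i / 1000 for i in range(1, 1000)]   # open interval: avoids log(0)
argmax_L = max(grid, key=L)
argmax_scaled = max(grid, key=lambda p: 5.0 * L(p))        # alpha = 5
argmax_logL = max(grid, key=lambda p: math.log(L(p)))      # increasing transform
```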

Median

For a bernoulli trial, is there a significant meaning for the median of the likelihood function? —Preceding unsigned comment added by Fulldecent (talkcontribs) 16:30, 13 August 2009 (UTC)

The Bernoulli trial has a probability distribution function fP defined by fP(0) = 1 − P and fP(1) = P. This means that the likelihood function is Lx defined by L0(P) = 1 − P and L1(P) = P for 0 ≤ P ≤ 1. For x=0 the maximum likelihood estimate of P is 0; the median is 1 − 1/√2 ≈ 0.29; and the mean value is 1/3 ≈ 0.33. For x=1 the maximum likelihood estimate of P is 1; the median is 1/√2 ≈ 0.71; and the mean value is 2/3 ≈ 0.67. These are point estimates for P. Some likelihood functions have a well defined maximum likelihood value but no median. Other likelihood functions have a median but no mean value. See for example the German tank problem#Likelihood function. Bo Jacoby (talk) 22:27, 3 September 2009 (UTC).
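The closed forms behind these numbers can be checked in Python (an editor's illustration, under the normalize-and-treat-as-density reading that is disputed later in this section):

```python
import math

# x = 0: likelihood 1 - P; normalized density 2(1 - P) on [0, 1].
median_x0 = 1 - 1 / math.sqrt(2)   # solves 2m - m**2 = 1/2; about 0.29
mean_x0 = 1 / 3                    # integral of P * 2(1 - P) over [0, 1]

# x = 1: likelihood P; normalized density 2P on [0, 1].
median_x1 = 1 / math.sqrt(2)       # solves m**2 = 1/2; about 0.71
mean_x1 = 2 / 3                    # integral of P * 2P over [0, 1]
```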

The above is wrong.

  • First a minor point. The term "probability distribution function" usually means cumulative distribution function.
  • What sense can it make to call the number proposed above the "median" of the likelihood function? That would be the answer if one treated the function as a probability density function, but that makes sense only if we assume a uniform measure on the line, in effect a prior, so the proposed median is actually the median of the posterior probability distribution, assuming a uniform prior. It's not a median of the likelihood function. If we assumed a different prior, we'd get a different median with the SAME likelihood function. Similar comments apply to the mean. There's no such thing as the mean or the median of a likelihood function. Michael Hardy (talk) 00:02, 4 September 2009 (UTC)

Comment to Michael:

  • The article on probability distribution function allows for the interpretation as probability density function.
  • The uniform prior likelihood function, f(P)=1 for 0 ≤ P ≤ 1, expresses prior ignorance of the actual value of P. A different prior likelihood function expresses some knowledge of the actual value of P, and no such knowledge is provided. It is correct that assuming a uniform prior distribution makes the likelihood function define a posterior distribution, in which the mode, median, mean value, standard deviation etc, are defined.

Your main objection seems to be that tacitly assuming a uniform prior distribution is unjustified. Consider the (bernoulli) process of sampling from an infinite population as a limiting case of the (hypergeometric) process of sampling from a finite population. The J expression

  udaf=.!/&(i.@>:) * !/&(- i.@>:)

computes odds of the hypergeometric distribution.

The program call

  1 udaf 10
10 9 8 7 6 5 4 3 2 1  0
 0 1 2 3 4 5 6 7 8 9 10

computes the odds when you pick 1 pebble from a population of 10 red and white pebbles. The 11 columns are odds for getting 0 or 1 red pebble, when the number of red pebbles in the population is 0 through 10. The 2 rows are likelihoods for the population containing 0 through 10 red pebbles given that the sample contained 0 or 1 red pebble. The top row shows that 0 red pebbles in the population has the maximum likelihood (= 10). A median is about 2.5 red pebbles = 25% of the population. (10+9+8 = 27 < 27.5 < 28 = 7+6+5+4+3+2+1+0). The mean value is 30% and the standard deviation is 24%.

The prior likelihood function is (of course)

  0 udaf 10
1 1 1 1 1 1 1 1 1 1 1

expressing prior ignorance regarding the number of red pebbles in the population. The maximum likelihood value is undefined; the median and the mean are both equal to 50% of the population, and the standard deviation is 32% of the population.

In the limiting case where the number of pebbles in the population is large, you get (unnormalized) binomial distributions in the columns and (unnormalized) beta distributions in the rows.

  5 udaf 16
4368 3003 2002 1287  792  462  252  126   56   21    6    1    0    0    0    0    0
   0 1365 2002 2145 1980 1650 1260  882  560  315  150   55   12    0    0    0    0
   0    0  364  858 1320 1650 1800 1764 1568 1260  900  550  264   78    0    0    0
   0    0    0   78  264  550  900 1260 1568 1764 1800 1650 1320  858  364    0    0
   0    0    0    0   12   55  150  315  560  882 1260 1650 1980 2145 2002 1365    0
   0    0    0    0    0    1    6   21   56  126  252  462  792 1287 2002 3003 4368

Study the finite case first, and the infinite case as a limit of the finite case, rather than to begin with the infinite case where a prior distribution is problematic. It is dangerous to assume that lim(f(x))=f(lim(x)). Bo Jacoby (talk) 10:00, 4 September 2009 (UTC).
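For readers who don't follow J, the `udaf` tables above can be reproduced with a short Python rendering (an editor's illustration; the function name mirrors Bo Jacoby's J verb):

```python
from math import comb

def udaf(n, N):
    """odds[k][R] = C(R, k) * C(N - R, n - k): hypergeometric odds of drawing
    k red pebbles in a sample of n from a population of N with R red ones.
    Rows index k = 0..n; columns index R = 0..N."""
    return [[comb(R, k) * comb(N - R, n - k) for R in range(N + 1)]
            for k in range(n + 1)]

table = udaf(1, 10)   # reproduces the J output of "1 udaf 10" above
prior = udaf(0, 10)   # reproduces "0 udaf 10": a row of ones
```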

graph

The likelihood function for estimating the probability of a coin landing heads-up without prior knowledge after observing HHT

How was this graph generated? Is there a closed form for this calculation? Is there a closed form for a given # of H and # of T? —Preceding unsigned comment added by Fulldecent (talkcontribs) 17:46, 13 August 2009 (UTC)

The expression C(n,i)·p^i·(1−p)^(n−i) is for fixed n,p a binomial distribution function of i, (i=0,..,n), and for fixed n,i a continuous (unnormalized) beta distribution of p, (0≤p≤1). So the graph is simply p²(1−p). Bo Jacoby (talk) 12:33, 20 August 2009 (UTC).
Isn't the correct formula 3p²(1−p), given that the binomial coefficient 3 choose 2 evaluates to 3? Implementing this correctly scales the probabilities on the y-axis.
Littlejohn.farmer (talk) 17:01, 13 February 2023 (UTC)
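A quick Python check of the two formulas under discussion (an editor's illustration): for HHT (2 heads, 1 tail), p²(1−p) and 3p²(1−p) differ only by the constant binomial coefficient, so the factor 3 rescales the y-axis without moving the maximum.

```python
def L_plain(p):
    return p ** 2 * (1 - p)        # likelihood for HHT, constant dropped

def L_binom(p):
    return 3 * p ** 2 * (1 - p)    # with the binomial coefficient C(3, 2) = 3

grid = [i / 1000 for i in range(1001)]
argmax_plain = max(grid, key=L_plain)
argmax_binom = max(grid, key=L_binom)
# Both peak at the grid point nearest p = 2/3.
```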

Probability of causes and not probability of effects?

The definition given here is the opposite that given by D'Agostini, Bayesian Reasoning in Data Analysis (2003). From pp. 34-35: "The possible values which may be observed are classified in belief by f(x|θ). This function is traditionally called `likelihood' and summarizes all previous knowledge on that kind of measurement..." In other words, it is the probability of an effect x given a parameter (cause) θ. The definition given in this entry, proportional to the probability of a cause θ given the effect x, seems more useful, as the concept is more important, but is it possible that there is more than one definition in use in the literature? LiamH (talk) 02:10, 4 October 2009 (UTC)

putting x and theta in bold

since P(x|theta) is describing sets of data points (as if a vector), shouldn't it be put in bold?

theta represents a vector (or set) of parameters, and x represents a vector of data points from a sample.

I might be wrong about this, thought it would be worth mentioning

SuperChocolate (talk) 14:49, 18 September 2014 (UTC)

Discussion

It is confusing to have several different definitions that are approximately the same. We first use P(x|θ), then p_θ(x), then f_θ(x). Then we have two separate discussions on the page about continuous vs. discrete. Can we just define the likelihood for the discrete case and then refer to Likelihood_function#Likelihood_function_of_a_parameterized_model for the continuous case?

It's noted in several places that the likelihood is defined up to a multiplicative constant, is there a reason we don't define it that way?

Finally, there doesn't seem to be uniform notation on the page; can we remedy that?

User:SolidPhase what do you think? Prax54 (talk) 20:40, 28 January 2015 (UTC)

On the points you raise, I think that the article needs substantial revisions. Regarding the definition of likelihood for a continuous distribution, the article previously included more on this, but it looked to me to be in error; so I deleted some. See my edit and especially the explanation's link, which cites Burnham & Anderson.
Confusion seems to have come about for historical reasons. Originally, likelihood was used to compare different parameters of the same model: there, the constant is irrelevant. Now, likelihood is used to compare different models (see Likelihood function#Relative likelihood of models): here, the constant is relevant.
SolidPhase (talk) 13:29, 29 January 2015 (UTC)
Thanks for the response. I am not sure where to start in improving this article. Any suggestions are welcome. Prax54 (talk) 11:31, 21 May 2015 (UTC)

"In general": likelihoods with respect to a dominating measure

I wish to thank Podgorec for attempting to clarify this section by inserting, "with all distributions being absolutely continuous with respect to a common measure" before "whether discrete, absolutely continuous, a mixture or something else." I've reworded this addition and placed it in a parenthetical comment at the end of the sentence. I've done this to make that section more accessible to people unfamiliar with measure-theoretic probability -- without eliminating the mathematical rigor.

If this is not adequate, I fear we will need to cite a modern text on measure-theoretic probability theory. My knowledge of this subject dates from the late 1970s and early 1980s. I think my memory of that material is still adequate for this, but the standard treatment of the subject may have changed -- and I no longer have instant access to a text on the subject to cite now. (It would also be good to mention likelihood in the Wikipedia article on Radon-Nikodym theorem, to help explain one important use, but I won't attempt that right now.) DavidMCEddy (talk)

Definition

User:Gitchygoomy changed the definition of likelihood to read, 'The likelihood of a set of parameter values, θ, given outcomes x, is assumed to be equal to the probability of those observed outcomes given those parameter values', from 'The likelihood ... given outcomes x, is equal to ...'. This is incorrect. I will edit this to read as follows:

'The likelihood of a set of parameter values, θ, given outcomes x, is equal to the probability assumed for those observed outcomes given those parameter values'.

Clearly, something is assumed: The assumption is about the probability itself, not the formal identity of the likelihood to it.

I hope this will address User:Gitchygoomy's concerns with the previous definition. DavidMCEddy (talk) 20:55, 6 February 2017 (UTC)

Thanks for the change but I don't think it's quite fair. If the term is supposed to help draw an inference about an actual probability, then a reference to an assumed probability changes its significance entirely. How can you say that the likelihood is equal to an assumed probability, which is unconstrained?  Preceding unsigned comment added by Gitchygoomy (talkcontribs) 21:55, 6 February 2017 (UTC)
I think it should further be reworded as follows:
'The likelihood of a parameter value (or vector of parameter values), θ, given outcomes x, is equal to the probability (density) assumed for those observed outcomes given those parameter values'. (I will change the definition to this.)
This may not address User:Gitchygoomy's concern, which I do not yet understand.
For discrete probabilities, the probability of any specific outcome is precisely the probability density for that possible outcome. More precisely, this probability density is the Radon-Nikodym derivative of the probability with respect to the counting measure, which is the standard dominating measure for discrete probabilities. (See the discussion of likelihood with measure-theoretic probabilities in the main article.)
More generally, probabilities are always between 0 and 1. This means that probability densities (with respect to non-negative dominating measures) are non-negative but also possibly unbounded.
To User:Gitchygoomy: If this does not address your concern, might you be able to provide a more concrete example? Thanks, DavidMCEddy (talk) 00:03, 7 February 2017 (UTC)

"Historical remarks" section

This section seems to trace the etymology and usage of the word "likelihood", which generally seems irrelevant to the specific concept of likelihood functions. Recommend heavily truncating or removing this section.  Preceding unsigned comment added by Denziloe (talkcontribs) 09:55, 7 September 2017 (UTC)

Yes, I much agree. The Wikipedia article is about the likelihood function in mathematical statistics. That function is not what Peirce was referring to in his papers, when he discussed likelihood. Nor is a detailed etymology of the word relevant to the function. I have removed most of the section, and added an additional citation for Fisher.  BetterMath (talk) 08:20, 9 November 2017 (UTC)

Hello fellow Wikipedians,

I have just modified one external link on Likelihood function. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:

When you have finished reviewing my changes, you may follow the instructions on the template below to fix any issues with the URLs.

This message was posted before February 2018. After February 2018, "External links modified" talk page sections are no longer generated or monitored by InternetArchiveBot. No special action is required regarding these talk page notices, other than regular verification using the archive tool instructions below. Editors have permission to delete these "External links modified" talk page sections if they want to de-clutter talk pages, but see the RfC before doing mass systematic removals. This message is updated dynamically through the template {{source check}} (last update: 5 June 2024).

  • If you have discovered URLs which were erroneously considered dead by the bot, you can report them with this tool.
  • If you found an error with any archives or the URLs themselves, you can fix them with this tool.

Cheers.—InternetArchiveBot (Report bug) 06:24, 23 December 2017 (UTC)

Wording of lead

Why delete "A more detailed discussion of history of likelihood ..."?

Inverse logic

New section on integrability

When is a conditional probability not a conditional probability?

parameter(s) singular or plural?

A good lead

Properties of the likelihood function for MLE
