Cramér–von Mises criterion

In statistics, the Cramér–von Mises criterion is a criterion for judging the goodness of fit of a cumulative distribution function (CDF) $F^{*}$ compared to a given empirical distribution function $F_{n}$, or for comparing two empirical distributions. It is also used as a part of other algorithms, such as minimum distance estimation. The criterion is the quantity $\omega ^{2}$, defined as $\omega ^{2}=\int _{-\infty }^{\infty }[F_{n}(x)-F^{*}(x)]^{2}\,\mathrm {d} F^{*}(x)$.

In one-sample applications $F^{*}$ is the theoretical distribution and $F_{n}$ is the empirically observed distribution. Alternatively the two distributions can both be empirically estimated ones; this is called the two-sample case.

The criterion is named after Harald Cramér and Richard Edler von Mises, who first proposed it in 1928–1930.^[1]^[2] The generalization to two samples is due to Anderson.^[3]

The Cramér–von Mises test is an alternative to the Kolmogorov–Smirnov test (1933).^[4]

Cramér–von Mises test (one sample)

Let $x_{1},x_{2},\ldots ,x_{n}$ be the observed values, in increasing order. Then the test statistic is^[3]^: 1153^[5]

$T=n\omega ^{2}={\frac {1}{12n}}+\sum _{i=1}^{n}\left[{\frac {2i-1}{2n}}-F^{*}(x_{i})\right]^{2}.$

If this value is larger than the tabulated value, then the hypothesis that the data came from the distribution $F^{*}$ can be rejected.
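The one-sample statistic can be computed directly from the sorted sample. The following is a minimal sketch, not a reference implementation: the helper names `normal_cdf` and `cvm_one_sample`, the choice of a standard normal as $F^{*}$, and the sample values are all illustrative assumptions. Critical values still have to come from published tables.

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of a normal distribution, used here only as an example F*."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def cvm_one_sample(sample, cdf):
    """Return T = n * omega^2 for `sample` against the hypothesized CDF `cdf`."""
    xs = sorted(sample)              # the formula requires increasing order
    n = len(xs)
    total = 1.0 / (12.0 * n)         # the 1/(12n) term
    for i, x in enumerate(xs, start=1):
        total += ((2 * i - 1) / (2.0 * n) - cdf(x)) ** 2
    return total

# Hypothetical data, tested against a standard normal distribution.
data = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.7]
t_stat = cvm_one_sample(data, normal_cdf)
```

Sorting inside the function means the caller does not need to pre-order the observations.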

Watson test

A modified version of the Cramér–von Mises test is the Watson test,^[6] which uses the statistic $U^{2}$, where^[5]

$U^{2}=T-n({\bar {F}}-{\tfrac {1}{2}})^{2},$

where ${\bar {F}}={\frac {1}{n}}\sum _{i=1}^{n}F^{*}(x_{i}).$
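As an illustration, Watson's $U^{2}$ can be obtained by computing $T$ and subtracting the correction term $n({\bar {F}}-{\tfrac {1}{2}})^{2}$. This is a sketch under stated assumptions: the function names `uniform_cdf` and `watson_u2` are hypothetical, and the uniform distribution on $[0,1]$ is chosen as $F^{*}$ only to keep the example short.

```python
def uniform_cdf(x):
    """CDF of the uniform distribution on [0, 1], used as an example F*."""
    return min(1.0, max(0.0, x))

def watson_u2(sample, cdf):
    """Return Watson's U^2 = T - n * (F_bar - 1/2)^2."""
    xs = sorted(sample)
    n = len(xs)
    f_vals = [cdf(x) for x in xs]
    # Cramér-von Mises statistic T = n * omega^2
    t = 1.0 / (12.0 * n) + sum(
        ((2 * i - 1) / (2.0 * n) - f) ** 2
        for i, f in enumerate(f_vals, start=1)
    )
    f_bar = sum(f_vals) / n          # mean of F*(x_i)
    return t - n * (f_bar - 0.5) ** 2

u2 = watson_u2([0.05, 0.2, 0.35, 0.6, 0.9], uniform_cdf)
```

Because the subtracted term is non-negative, $U^{2}$ never exceeds $T$ for the same sample.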

Cramér–von Mises test (two samples)

Let $x_{1},x_{2},\ldots ,x_{n}$ and $y_{1},y_{2},\ldots ,y_{m}$ be the observed values in the first and second sample respectively, in increasing order. Within the combined sample of size $n+m$ , let $r_{1},r_{2},\ldots ,r_{n}$ be the ranks of the xs in the combined sample, and let $s_{1},s_{2},\ldots ,s_{m}$ be the ranks of the ys in the combined sample. Anderson^[3]^: 1149 shows that

$T={\frac {nm}{n+m}}\omega ^{2}={\frac {U}{nm(n+m)}}-{\frac {4mn-1}{6(m+n)}}$

where $U$ is defined as

$U=n\sum _{i=1}^{n}(r_{i}-i)^{2}+m\sum _{j=1}^{m}(s_{j}-j)^{2}$

If the value of $T$ is larger than the tabulated values,^[3]^: 1154–1159 the hypothesis that the two samples come from the same distribution can be rejected. (Some books give critical values for $U$, which is more convenient, as it avoids the need to compute $T$ via the expression above. The conclusion will be the same.)

The above assumes there are no duplicates in the $x$ , $y$ , and $r$ sequences. So $x_{i}$ is unique, and its rank is $i$ in the sorted list $x_{1},\ldots ,x_{n}$ . If there are duplicates, and $x_{i}$ through $x_{j}$ are a run of identical values in the sorted list, then one common approach is the midrank^[7] method: assign each duplicate a "rank" of $(i+j)/2$ . In the above equations, in the expressions $(r_{i}-i)^{2}$ and $(s_{j}-j)^{2}$ , duplicates can modify all four variables $r_{i}$ , $i$ , $s_{j}$ , and $j$ .
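The rank-based formulas for $U$ and $T$ can be sketched as follows. This example deliberately covers only the case without duplicates and raises an error otherwise, since the treatment of ties is a separate choice (for example midranks). The function name `cvm_two_sample` and the sample values are illustrative assumptions.

```python
def cvm_two_sample(x_sample, y_sample):
    """Return (U, T) for two samples with no duplicate values."""
    xs = sorted(x_sample)
    ys = sorted(y_sample)
    n, m = len(xs), len(ys)
    combined = sorted(xs + ys)
    if len(set(combined)) != n + m:
        raise ValueError("ties present; this sketch handles distinct values only")
    # Rank of each value within the combined sample (1-based).
    rank = {value: k for k, value in enumerate(combined, start=1)}
    u = n * sum((rank[x] - i) ** 2 for i, x in enumerate(xs, start=1))
    u += m * sum((rank[y] - j) ** 2 for j, y in enumerate(ys, start=1))
    t = u / (n * m * (n + m)) - (4 * m * n - 1) / (6 * (m + n))
    return u, t

u_stat, t_stat = cvm_two_sample([1, 3, 5], [2, 4, 6])
```

For these two interleaved samples the ranks are $r=(1,3,5)$ and $s=(2,4,6)$, so $U=3\cdot 5+3\cdot 14=57$ and $T=57/54-35/36=1/12$.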

Cramér distance

For two distributions on the real line with cumulative distribution functions $F$ and $G$ and finite first moment, the Cramér distance is

$\ell _{2}(F,G)=\left[\int _{-\infty }^{\infty }{\bigl (}F(x)-G(x){\bigr )}^{2}\,dx\right]^{1/2},$

a metric on the space of such distributions.^[8] Note that some sources define the Cramér distance as $\ell _{2}^{2}$ , but this fails the triangle inequality and so cannot be properly defined as a distance. The Cramér distance is the one-dimensional case of the energy distance via the relationship ${\sqrt {2}}\ell _{2}=D$ ,^[9] and when $G$ represents a single observation $y$ with cumulative distribution $G(x)=\mathbf {1} \{x\geq y\}$ , $\ell _{2}^{2}(F,G)$ is equivalent to the continuous ranked probability score, a strictly proper scoring rule.^[10]
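For two empirical distributions the integral can be evaluated exactly, because both CDFs are step functions and their difference is constant between consecutive observed values. The sketch below relies on that observation. The names `ecdf` and `cramer_distance` are hypothetical, and the quadratic-time ECDF evaluation is chosen for clarity rather than speed.

```python
import math

def ecdf(sample, t):
    """Empirical CDF of `sample` evaluated at t."""
    return sum(1 for v in sample if v <= t) / len(sample)

def cramer_distance(x_sample, y_sample):
    """Cramér distance l_2 between the empirical CDFs of two samples."""
    pts = sorted(set(x_sample) | set(y_sample))
    sq = 0.0
    # Outside [min, max] both CDFs agree (0 or 1), so only the gaps
    # between consecutive points contribute to the integral.
    for left, right in zip(pts, pts[1:]):
        diff = ecdf(x_sample, left) - ecdf(y_sample, left)
        sq += diff * diff * (right - left)
    return math.sqrt(sq)

d = cramer_distance([0, 2], [1])
```

In this example the second sample is a single observation $y=1$, so $d^{2}=0.5$ is also the continuous ranked probability score of the two-point forecast $\{0,2\}$ against that observation.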

Under the probability integral transform (PIT), the plot of the empirical distribution of the transformed values $F^{*}(x_{1}),\ldots ,F^{*}(x_{n})$ and the uniform distribution on $[0,1]$ creates a PIT reliability diagram. The Cramér distance $\ell _{2}$ between these two distributions equals $\omega$ , the square root of the criterion, and serves as a numerical score of the calibration error of $F^{*}$ . This may also be referred to as the Root Mean Square Calibration Error (RMSCE).
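Since $\omega ^{2}=T/n$, the calibration score can be computed from the PIT values with the one-sample formula, taking the uniform CDF as the reference. The following is a minimal sketch; the function name `pit_calibration_error` is an assumption, and the inputs are assumed to already be PIT values in $[0,1]$.

```python
import math

def pit_calibration_error(pit_values):
    """Return omega = sqrt(T / n) for PIT values compared with Uniform(0, 1)."""
    us = sorted(pit_values)
    n = len(us)
    # omega^2 = 1/(12 n^2) + (1/n) * sum(((2i - 1)/(2n) - u_(i))^2)
    omega_sq = 1.0 / (12.0 * n * n) + sum(
        ((2 * i - 1) / (2.0 * n) - u) ** 2
        for i, u in enumerate(us, start=1)
    ) / n
    return math.sqrt(omega_sq)

# Evenly spread PIT values give the smallest score attainable for n = 4.
even = pit_calibration_error([0.125, 0.375, 0.625, 0.875])
# PIT values piled up at 0 indicate a badly biased forecast.
biased = pit_calibration_error([0.0, 0.0, 0.0, 0.0])
```

The finite-sample term $1/(12n^{2})$ means the score of a perfectly calibrated forecast approaches 0 only as $n$ grows.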

For a deterministic (point) forecast at $\mu$ , the PIT degenerates to a Bernoulli random variable on $\{0,1\}$ with success probability $p=\Pr(y>\mu )$ , so in the population limit the Cramér distance between the PIT CDF and the uniform distribution evaluates in closed form to

$\ell _{2}=\left[\int _{0}^{1}{\bigl (}(1-p)-x{\bigr )}^{2}\,dx\right]^{1/2}={\sqrt {p^{2}-p+{\tfrac {1}{3}}}}.$

This quantity is minimized at $p={\tfrac {1}{2}}$ (the unbiased case) with value ${\tfrac {1}{\sqrt {12}}}\approx 0.2887$, establishing a calibration-error floor that no point forecast can fall below regardless of how accurate its central value is. In contrast, a well-calibrated probabilistic forecast can approach 0. At the other end, the quantity is maximized at the bias extremes $p\in \{0,1\}$ with value ${\tfrac {1}{\sqrt {3}}}\approx 0.5774$.
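The closed form can be checked numerically. The sketch below compares ${\sqrt {p^{2}-p+1/3}}$ with a midpoint-rule approximation of the integral; both function names and the step count are illustrative assumptions.

```python
import math

def point_forecast_distance(p):
    """Closed form: sqrt(p^2 - p + 1/3) for success probability p."""
    return math.sqrt(p * p - p + 1.0 / 3.0)

def point_forecast_distance_numeric(p, steps=10000):
    """Midpoint-rule approximation of the same integral over [0, 1]."""
    h = 1.0 / steps
    total = sum(((1.0 - p) - (k + 0.5) * h) ** 2 for k in range(steps)) * h
    return math.sqrt(total)

floor_value = point_forecast_distance(0.5)    # unbiased point forecast
worst_value = point_forecast_distance(0.0)    # fully biased point forecast
```

Evaluating both functions at intermediate values such as $p=0.3$ gives results that agree to within the discretization error of the midpoint rule.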
