Polynomial kernel

In machine learning, the polynomial kernel is a kernel function commonly used with support vector machines (SVMs) and other kernelized models, that represents the similarity of vectors (training samples) in a feature space over polynomials of the original variables, allowing learning of non-linear models.

Intuitively, the polynomial kernel looks not only at the given features of input samples to determine their similarity, but also combinations of these. In the context of regression analysis, such combinations are known as interaction features. The (implicit) feature space of a polynomial kernel is equivalent to that of polynomial regression, but without the combinatorial blowup in the number of parameters to be learned. When the input features are binary-valued (booleans), then the features correspond to logical conjunctions of input features.^[1]

For degree- $d$ polynomials, the polynomial kernel is defined as^[2]

K(\mathbf {x} ,\mathbf {y} )=(\mathbf {x} ^{\mathsf {T}}\mathbf {y} +c)^{d}

where $x$ and $y$ are vectors of size $n$ in the input space, i.e. vectors of features computed from training or test samples and $c \geq 0$ is a free parameter trading off the influence of higher-order versus lower-order terms in the polynomial. When $c = 0$ , the kernel is called homogeneous.^[3] (A further generalized polykernel divides $x T y$ by a user-specified scalar parameter $a$ .^[4])

As a kernel, $K$ corresponds to an inner product in a feature space based on some mapping $φ$ :

K(\mathbf {x} ,\mathbf {y} )=\langle \varphi (\mathbf {x} ),\varphi (\mathbf {y} )\rangle

The nature of $φ$ can be seen from an example. Let $d = 2$ , so we get the special case of the quadratic kernel. After using the multinomial theorem (twice—the outermost application is the binomial theorem) and regrouping,

K(\mathbf {x} ,\mathbf {y} )=\left(\sum _{i=1}^{n}x_{i}y_{i}+c\right)^{2}=\sum _{i=1}^{n}\left(x_{i}^{2}\right)\left(y_{i}^{2}\right)+\sum _{i=2}^{n}\sum _{j=1}^{i-1}\left({\sqrt {2}}x_{i}x_{j}\right)\left({\sqrt {2}}y_{i}y_{j}\right)+\sum _{i=1}^{n}\left({\sqrt {2c}}x_{i}\right)\left({\sqrt {2c}}y_{i}\right)+c^{2}

From this it follows that the feature map is given by:

\varphi (x)=\left(x_{n}^{2},\ldots ,x_{1}^{2},{\sqrt {2}}x_{n}x_{n-1},\ldots ,{\sqrt {2}}x_{n}x_{1},{\sqrt {2}}x_{n-1}x_{n-2},\ldots ,{\sqrt {2}}x_{n-1}x_{1},\ldots ,{\sqrt {2}}x_{2}x_{1},{\sqrt {2c}}x_{n},\ldots ,{\sqrt {2c}}x_{1},c\right)

generalizing for $\left(\mathbf {x} ^{T}\mathbf {y} +c\right)^{d}$ , where $\mathbf {x} \in \mathbb {R} ^{n}$ , $\mathbf {y} \in \mathbb {R} ^{n}$ and applying the multinomial theorem:

${\begin{alignedat}{2}\left(\mathbf {x} ^{T}\mathbf {y} +c\right)^{d}&=\sum _{j_{1}+j_{2}+\dots +j_{n+1}=d}{\frac {\sqrt {d!}}{\sqrt {j_{1}!\cdots j_{n}!j_{n+1}!}}}x_{1}^{j_{1}}\cdots x_{n}^{j_{n}}{\sqrt {c}}^{j_{n+1}}{\frac {\sqrt {d!}}{\sqrt {j_{1}!\cdots j_{n}!j_{n+1}!}}}y_{1}^{j_{1}}\cdots y_{n}^{j_{n}}{\sqrt {c}}^{j_{n+1}}\\&=\varphi (\mathbf {x} )^{T}\varphi (\mathbf {y} )\end{alignedat}}$

The last summation has $l_{d}={\tbinom {n+d}{d}}$ elements, so that:

\varphi (\mathbf {x} )=\left(a_{1},\dots ,a_{l},\dots ,a_{l_{d}}\right)

where $l=(j_{1},j_{2},...,j_{n},j_{n+1})$ and

a_{l}={\frac {\sqrt {d!}}{\sqrt {j_{1}!\cdots j_{n}!j_{n+1}!}}}x_{1}^{j_{1}}\cdots x_{n}^{j_{n}}{\sqrt {c}}^{j_{n+1}}\quad |\quad j_{1}+j_{2}+\dots +j_{n}+j_{n+1}=d

Polynomial kernel

Practical use

References

Related Articles