K q-flats

From Wikipedia, the free encyclopedia

In data mining and machine learning, k q-flats algorithm[1][2] is an iterative method which aims to partition m observations into k clusters where each cluster is close to a q-flat, where q is a given integer.

It is a generalization of the k-means algorithm. In k-means algorithm, clusters are formed in the way that each cluster is close to one point, which is a 0-flat. k q-flats algorithm gives better clustering result than k-means algorithm for some data set.

Problem formulation

Given a set A of m observations where each observation is an n-dimensional real vector, k q-flats algorithm aims to partition m observation points by generating k q-flats that minimize the sum of the squares of distances of each observation to a nearest q-flat.

A q-flat is a subset of that is congruent to . For example, a 0-flat is a point; a 1-flat is a line; a 2-flat is a plane; a -flat is a hyperplane. q-flat can be characterized by the solution set of a linear system of equations: , where , .

Denote a partition of as . The problem can be formulated as

where is the projection of onto . Note that is the distance from to .

Algorithm

The algorithm is similar to the k-means algorithm (i.e. Lloyd's algorithm) in that it alternates between cluster assignment and cluster update. In specific, the algorithm starts with an initial set of q-flats , and proceeds by alternating between the following two steps:

Cluster Assignment
(given q-flats, assign each point to closest q-flat): the i-th cluster is updated as
Cluster Update
(given cluster assignment, update the q-flats): For , let with rows corresponding to all assigned to cluster l. Set to be the matrix whose columns are the orthonormal eigenvectors corresponding to the least eigenvalues of and .

Stop whenever the assignments no longer change.

The cluster assignment step uses the following fact: given a q-flat and a vector a, where , the distance from a to the q-flat is

The key part of this algorithm is how to update the cluster, i.e. given m points, how to find a q-flat that minimizes the sum of squares of distances of each point to the q-flat. Mathematically, this problem is: given solve the quadratic optimization problem

where is given, and .

The problem can be solved using Lagrangian multiplier method and the solution is as given in the cluster update step.

It can be shown that the algorithm will terminate in a finite number of iterations (no more than the total number of possible assignments, which is bounded by ). In addition, the algorithm will terminate at a point that the overall objective cannot be decreased either by a different assignment or by defining new cluster planes for these clusters (such point is called "locally optimal" in the references).

This convergence result is a consequence of the fact that problem (P2) can be solved exactly. The same convergence result holds for k-means algorithm because the cluster update problem can be solved exactly.

Relation to other machine learning methods

Applications and variations

References

Related Articles

Wikiwand AI