K q-flats
From Wikipedia, the free encyclopedia
In data mining and machine learning, the k q-flats algorithm[1][2] is an iterative method that aims to partition m observations into k clusters, where each cluster is close to a q-flat and q is a given integer.
It is a generalization of the k-means algorithm. In the k-means algorithm, clusters are formed so that each cluster is close to a single point, which is a 0-flat. The k q-flats algorithm gives better clustering results than the k-means algorithm for some data sets.
Problem formulation
Given a set A of m observations, each an n-dimensional real vector, the k q-flats algorithm aims to partition the m observation points by generating k q-flats that minimize the sum of squared distances from each observation to its nearest q-flat.
A q-flat is a subset of $\mathbb{R}^n$ that is congruent to $\mathbb{R}^q$. For example, a 0-flat is a point; a 1-flat is a line; a 2-flat is a plane; an $(n-1)$-flat is a hyperplane. A q-flat can be characterized by the solution set of a linear system of equations: $F = \{ x \in \mathbb{R}^n \mid W^\top x = \gamma \}$, where $W \in \mathbb{R}^{n \times (n-q)}$ and $\gamma \in \mathbb{R}^{n-q}$.
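As a concrete instance of this characterization (the specific numbers below are an illustrative example, not from the references), the line $y = 2x + 1$ in $\mathbb{R}^2$ is the 1-flat with $W = (2, -1)^\top / \sqrt{5}$ and $\gamma = -1/\sqrt{5}$:

```python
import numpy as np

# The line y = 2x + 1 in R^2, written as the 1-flat {x : W^T x = gamma}.
# W spans the direction orthogonal to the line, scaled to unit norm.
W = np.array([[2.0], [-1.0]]) / np.sqrt(5.0)
gamma = np.array([-1.0 / np.sqrt(5.0)])

# Every point (t, 2t + 1) on the line satisfies W^T x = gamma.
for t in [-3.0, 0.0, 4.5]:
    x = np.array([t, 2.0 * t + 1.0])
    assert np.allclose(W.T @ x, gamma)
```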
Denote a partition of $\{1, 2, \ldots, m\}$ as $S = \{S_1, S_2, \ldots, S_k\}$. The problem can be formulated as

$$\min_{S, \; F_1, \ldots, F_k} \; \sum_{l=1}^{k} \sum_{a_i \in S_l} \left\| a_i - P_{F_l}(a_i) \right\|^2 \qquad (P1)$$

where $P_{F_l}(a_i)$ is the projection of $a_i$ onto the flat $F_l$. Note that $\| a_i - P_{F_l}(a_i) \|$ is the distance from $a_i$ to $F_l$.
Algorithm
The algorithm is similar to the k-means algorithm (i.e. Lloyd's algorithm) in that it alternates between cluster assignment and cluster update. Specifically, the algorithm starts with an initial set of q-flats $F_1^{(0)}, \ldots, F_k^{(0)}$, and proceeds by alternating between the following two steps:
- Cluster assignment (given q-flats, assign each point to the closest q-flat): the l-th cluster is updated as
$$S_l^{(t)} = \left\{ a_i : \left\| (W_l^{(t)})^\top a_i - \gamma_l^{(t)} \right\| \le \left\| (W_j^{(t)})^\top a_i - \gamma_j^{(t)} \right\| \text{ for all } j = 1, \ldots, k \right\}.$$
- Cluster update (given the cluster assignment, update the q-flats): for $l = 1, \ldots, k$, let $A(l) \in \mathbb{R}^{m(l) \times n}$ be the matrix whose rows are the $a_i^\top$ assigned to cluster $l$. Set $W_l^{(t+1)}$ to be the matrix whose columns are the orthonormal eigenvectors corresponding to the $(n-q)$ least eigenvalues of $A(l)^\top \left( I - \frac{e e^\top}{m(l)} \right) A(l)$, and set $\gamma_l^{(t+1)} = \frac{(W_l^{(t+1)})^\top A(l)^\top e}{m(l)}$, where $e$ is the vector of all ones.
Stop whenever the assignments no longer change.
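The two alternating steps above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `k_q_flats` is ours, and for simplicity it initializes from a random assignment (and runs the update step first) rather than from an initial set of flats.

```python
import numpy as np

def k_q_flats(A, k, q, n_iter=100, seed=0):
    """Sketch of the alternating iteration.

    A is the (m, n) data matrix, one observation per row.  Flat l is
    stored as (W_l, gamma_l) with W_l in R^{n x (n-q)}, W_l^T W_l = I,
    so that dist(a, flat l) = ||W_l^T a - gamma_l||.
    """
    m, n = A.shape
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=m)            # arbitrary initial assignment
    for _ in range(n_iter):
        # Cluster update: fit a q-flat to each cluster using eigenvectors
        # of the centered scatter matrix for the n - q smallest eigenvalues.
        flats = []
        for l in range(k):
            X = A[labels == l]
            if len(X) == 0:                     # re-seed an empty cluster
                X = A[rng.integers(m, size=1)]
            mean = X.mean(axis=0)
            B = (X - mean).T @ (X - mean)
            _, V = np.linalg.eigh(B)            # eigenvalues in ascending order
            W = V[:, : n - q]
            flats.append((W, W.T @ mean))
        # Cluster assignment: each point goes to its closest flat.
        D = np.stack([np.linalg.norm(A @ W - g, axis=1) for W, g in flats],
                     axis=1)
        new_labels = D.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # assignments stable: stop
            break
        labels = new_labels
    return labels, flats
```

For example, with k = 1 and q = 1 the sketch reduces to fitting a single line: feeding it points that lie exactly on a line yields a flat whose distances to all points are (numerically) zero.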
The cluster assignment step uses the following fact: given a q-flat $F = \{ x \mid W^\top x = \gamma \}$ and a vector $a$, where $W^\top W = I$, the distance from $a$ to the q-flat $F$ is $\operatorname{dist}(a, F) = \| W^\top a - \gamma \|$.
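This fact follows because, when $W^\top W = I$, the projection of $a$ onto $F$ is $P_F(a) = a - W(W^\top a - \gamma)$, so $\|a - P_F(a)\| = \|W(W^\top a - \gamma)\| = \|W^\top a - \gamma\|$. A quick numerical sanity check with illustrative random values:

```python
import numpy as np

rng = np.random.default_rng(1)
n, q = 5, 2
# A random q-flat in R^n: W gets n - q orthonormal columns via QR.
W, _ = np.linalg.qr(rng.standard_normal((n, n - q)))
gamma = rng.standard_normal(n - q)
a = rng.standard_normal(n)

# Projection of a onto the flat (valid because W^T W = I).
proj = a - W @ (W.T @ a - gamma)
assert np.allclose(W.T @ proj, gamma)                # proj lies on the flat
assert np.allclose(np.linalg.norm(a - proj),
                   np.linalg.norm(W.T @ a - gamma))  # distance formula holds
```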
The key part of this algorithm is how to update the cluster, i.e. given m points, how to find a q-flat that minimizes the sum of squared distances of each point to the q-flat. Mathematically, this problem is: given $A \in \mathbb{R}^{m \times n}$, solve the quadratic optimization problem

$$\min_{W, \, \gamma} \; \left\| A W - e \gamma^\top \right\|_F^2 \quad \text{subject to } W^\top W = I, \qquad (P2)$$

where $W \in \mathbb{R}^{n \times (n-q)}$ and $\gamma \in \mathbb{R}^{n-q}$.
The problem can be solved using the method of Lagrange multipliers, and the solution is as given in the cluster update step.
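The solution can also be checked numerically. For fixed $W$, the optimal $\gamma$ is $W^\top \bar{a}$ (with $\bar{a}$ the mean of the rows of $A$), which reduces the (P2) objective to $\operatorname{tr}(W^\top B W)$ for the centered scatter matrix $B$; the eigenvector choice of $W$ minimizes this over all feasible $W$. The sketch below, with illustrative random data and a helper `p2_objective` of our own naming, verifies that the eigenvector solution beats randomly drawn feasible alternatives:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, q = 40, 6, 3
A = rng.standard_normal((m, n))
mean = A.mean(axis=0)
B = (A - mean).T @ (A - mean)       # equals A^T (I - e e^T / m) A

def p2_objective(W):
    # For fixed W the optimal gamma is W^T mean, so the P2 objective
    # reduces to the squared Frobenius norm of the centered residuals.
    return np.linalg.norm((A - mean) @ W) ** 2

# Closed-form solution: eigenvectors of the n - q smallest eigenvalues.
_, V = np.linalg.eigh(B)            # eigenvalues in ascending order
best = p2_objective(V[:, : n - q])

# It should be no worse than any random feasible W (orthonormal columns).
for _ in range(50):
    Q, _ = np.linalg.qr(rng.standard_normal((n, n - q)))
    assert best <= p2_objective(Q) + 1e-9
```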
It can be shown that the algorithm terminates in a finite number of iterations (no more than the total number of possible assignments, which is bounded by $k^m$). In addition, the algorithm terminates at a point where the overall objective cannot be decreased either by a different assignment or by defining new cluster planes for the current clusters (such a point is called "locally optimal" in the references).
This convergence result is a consequence of the fact that problem (P2) can be solved exactly. The same convergence result holds for the k-means algorithm because its cluster update problem can also be solved exactly.