org.apache.mahout.clustering.streaming.cluster
Class BallKMeans
java.lang.Object
org.apache.mahout.clustering.streaming.cluster.BallKMeans
- All Implemented Interfaces:
- Iterable<Centroid>
public class BallKMeans
- extends Object
- implements Iterable<Centroid>
Implements a ball k-means algorithm for weighted vectors with probabilistic seeding similar to k-means++.
The idea is that k-means++ provides good starting clusters, and ball k-means then refines the result
in only a few passes (or even in a single iteration for well-clusterable data).
A good reference for this class of algorithms is "The Effectiveness of Lloyd-Type Methods for the k-Means Problem"
by Rafail Ostrovsky, Yuval Rabani, Leonard J. Schulman and Chaitanya Swamy. The code here uses the seeding strategy
described in section 4.1.1 of that paper and the ball k-means step described in section 4.2. Unlike the algorithm
described in the paper, this implementation supports multiple iterations.
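The ball k-means step referred to above can be illustrated with a minimal plain-Java sketch. This is not Mahout's implementation: the class BallKMeansStepSketch, the method ballStep, and the use of raw double[] arrays are hypothetical, and the way trimFraction scales the distance to the nearest other centroid is an assumption made for illustration. Each point is assigned to its closest centroid, but only points that fall inside that centroid's ball contribute to the weighted mean.

```java
import java.util.Arrays;

public class BallKMeansStepSketch {

    /** Euclidean distance between two vectors of equal length. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /**
     * One trimmed update pass (hypothetical helper, not a Mahout method).
     * Assumption: the ball radius of a centroid is trimFraction times the
     * distance to its nearest other centroid.
     */
    static double[][] ballStep(double[][] points, double[] weights,
                               double[][] centroids, double trimFraction) {
        int k = centroids.length;
        int dim = centroids[0].length;

        // Distance from each centroid to its nearest other centroid.
        double[] nearestOther = new double[k];
        Arrays.fill(nearestOther, Double.POSITIVE_INFINITY);
        for (int i = 0; i < k; i++) {
            for (int j = 0; j < k; j++) {
                if (i != j) {
                    nearestOther[i] = Math.min(nearestOther[i],
                            distance(centroids[i], centroids[j]));
                }
            }
        }

        double[][] sums = new double[k][dim];
        double[] totalWeight = new double[k];
        for (int p = 0; p < points.length; p++) {
            // Find the closest centroid for this point.
            int best = 0;
            double bestDist = Double.POSITIVE_INFINITY;
            for (int i = 0; i < k; i++) {
                double d = distance(points[p], centroids[i]);
                if (d < bestDist) {
                    bestDist = d;
                    best = i;
                }
            }
            // Only points inside the ball contribute to the new centroid.
            if (bestDist < trimFraction * nearestOther[best]) {
                for (int j = 0; j < dim; j++) {
                    sums[best][j] += weights[p] * points[p][j];
                }
                totalWeight[best] += weights[p];
            }
        }

        // Weighted mean of the points in each ball; an empty ball keeps its centroid.
        double[][] updated = new double[k][];
        for (int i = 0; i < k; i++) {
            updated[i] = centroids[i].clone();
            if (totalWeight[i] > 0) {
                for (int j = 0; j < dim; j++) {
                    updated[i][j] = sums[i][j] / totalWeight[i];
                }
            }
        }
        return updated;
    }

    public static void main(String[] args) {
        double[][] points = {{1, 0}, {2, 0}, {4, 0}, {10, 1}};
        double[] weights = {1, 3, 1, 2};
        double[][] centroids = {{0, 0}, {10, 0}};
        // The point (4, 0) is 4 away from its closest centroid, outside the
        // radius 0.3 * 10 = 3, so it does not influence the update.
        double[][] updated = ballStep(points, weights, centroids, 0.3);
        System.out.println(Arrays.deepToString(updated));
    }
}
```

Calling ballStep repeatedly until the centroids stop moving would correspond to running several iterations, which is roughly what a maxNumIterations bound limits.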
Constructor Summary
BallKMeans(UpdatableSearcher searcher, int numClusters, int maxNumIterations)
BallKMeans(UpdatableSearcher searcher, int numClusters, int maxNumIterations, boolean kMeansPlusPlusInit, int numRuns)
BallKMeans(UpdatableSearcher searcher, int numClusters, int maxNumIterations, double trimFraction, boolean kMeansPlusPlusInit, boolean correctWeights, double testProbability, int numRuns)
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
BallKMeans
public BallKMeans(UpdatableSearcher searcher,
int numClusters,
int maxNumIterations)
BallKMeans
public BallKMeans(UpdatableSearcher searcher,
int numClusters,
int maxNumIterations,
boolean kMeansPlusPlusInit,
int numRuns)
BallKMeans
public BallKMeans(UpdatableSearcher searcher,
int numClusters,
int maxNumIterations,
double trimFraction,
boolean kMeansPlusPlusInit,
boolean correctWeights,
double testProbability,
int numRuns)
splitTrainTest
public Pair<List<? extends WeightedVector>,List<? extends WeightedVector>> splitTrainTest(List<? extends WeightedVector> datapoints)
cluster
public UpdatableSearcher cluster(List<? extends WeightedVector> datapoints)
- Clusters the datapoints in the list, seeding the centroids either randomly or with k-means++.
- Parameters:
- datapoints - the points to be clustered.
- Returns:
- an UpdatableSearcher with the resulting clusters.
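The k-means++ seeding option mentioned above can be sketched in plain Java as follows. This shows the textbook weighted D² sampling scheme, not necessarily the exact procedure this class uses (the class description only says its seeding is "similar to k-means++"). The names WeightedSeedingSketch, sampleProportional and chooseSeeds are hypothetical, and the sketch assumes there are at least k distinct points with positive weight.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class WeightedSeedingSketch {

    /** Squared Euclidean distance between two vectors of equal length. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Samples an index with probability proportional to its non-negative score. */
    static int sampleProportional(double[] scores, Random rng) {
        double total = 0.0;
        for (double s : scores) {
            total += s;
        }
        double target = rng.nextDouble() * total;
        double running = 0.0;
        for (int i = 0; i < scores.length; i++) {
            running += scores[i];
            if (target < running) {
                return i;
            }
        }
        // Rounding fallback: return the last index with a positive score.
        for (int i = scores.length - 1; i >= 0; i--) {
            if (scores[i] > 0) {
                return i;
            }
        }
        return scores.length - 1;
    }

    /**
     * Returns the indices of k seed points (hypothetical helper).
     * The first seed is drawn in proportion to weight; each later seed is
     * drawn in proportion to weight times squared distance to the nearest
     * seed chosen so far.
     */
    static List<Integer> chooseSeeds(double[][] points, double[] weights, int k, Random rng) {
        int n = points.length;
        List<Integer> seeds = new ArrayList<>();

        int first = sampleProportional(weights, rng);
        seeds.add(first);

        // Squared distance from every point to its nearest chosen seed.
        double[] nearest = new double[n];
        for (int i = 0; i < n; i++) {
            nearest[i] = squaredDistance(points[i], points[first]);
        }

        while (seeds.size() < k) {
            double[] scores = new double[n];
            for (int i = 0; i < n; i++) {
                scores[i] = weights[i] * nearest[i];
            }
            int next = sampleProportional(scores, rng);
            seeds.add(next);
            for (int i = 0; i < n; i++) {
                nearest[i] = Math.min(nearest[i], squaredDistance(points[i], points[next]));
            }
        }
        return seeds;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0.1, 0}, {5, 5}, {9, 0}};
        double[] weights = {1, 1, 1, 1};
        // A fixed random seed is used here only to make the demo repeatable.
        System.out.println(chooseSeeds(points, weights, 3, new Random(42)));
    }
}
```

Because an already chosen point has squared distance zero to itself, its score is zero and it cannot be drawn again, so distant, heavily weighted points are favoured as the next seeds.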
iterator
public Iterator<Centroid> iterator()
- Specified by:
- iterator in interface Iterable<Centroid>
Copyright © 2008–2014 The Apache Software Foundation. All rights reserved.