org.apache.mahout.clustering.streaming.cluster
Class BallKMeans

java.lang.Object
  extended by org.apache.mahout.clustering.streaming.cluster.BallKMeans
All Implemented Interfaces:
Iterable<Centroid>

public class BallKMeans
extends Object
implements Iterable<Centroid>

Implements ball k-means for weighted vectors, with probabilistic seeding similar to k-means++. k-means++ supplies good starting clusters, and ball k-means then refines them in only a few passes (a single iteration can suffice for well-clusterable data). A good reference for this class of algorithms is "The Effectiveness of Lloyd-Type Methods for the k-Means Problem" by Rafail Ostrovsky, Yuval Rabani, Leonard J. Schulman and Chaitanya Swamy. The code here uses the seeding strategy described in section 4.1.1 of that paper and the ball k-means step described in section 4.2. Unlike the algorithm described in the paper, this implementation supports multiple iterations.
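The ball k-means step mentioned above can be illustrated with a minimal, self-contained sketch. It is not the Mahout implementation: it uses plain arrays instead of WeightedVector and Centroid, and it assumes that trimFraction (a constructor parameter this page does not describe) scales the ball radius relative to the distance to the nearest other centroid. The class and method names are hypothetical.

```java
// Hypothetical illustration of one ball k-means pass; not Mahout code.
final class BallKMeansStepSketch {
    // Squared Euclidean distance between two points of equal dimension.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Each centroid becomes the weighted mean of the points that are closest
    // to it AND lie inside its ball. Assumption: the ball radius is
    // trimFraction times the distance to the nearest other centroid.
    static double[][] ballStep(double[][] points, double[] weights,
                               double[][] centroids, double trimFraction) {
        int k = centroids.length;
        int dim = centroids[0].length;

        // Squared ball radius per centroid.
        double[] ballRadiusSq = new double[k];
        for (int c = 0; c < k; c++) {
            double nearest = Double.POSITIVE_INFINITY;
            for (int o = 0; o < k; o++) {
                if (o != c) {
                    nearest = Math.min(nearest,
                        squaredDistance(centroids[c], centroids[o]));
                }
            }
            ballRadiusSq[c] = trimFraction * trimFraction * nearest;
        }

        // Accumulate weighted sums over the points inside each ball.
        double[][] sums = new double[k][dim];
        double[] totalWeight = new double[k];
        for (int p = 0; p < points.length; p++) {
            int best = 0;
            double bestDist = Double.POSITIVE_INFINITY;
            for (int c = 0; c < k; c++) {
                double d = squaredDistance(points[p], centroids[c]);
                if (d < bestDist) {
                    bestDist = d;
                    best = c;
                }
            }
            if (bestDist <= ballRadiusSq[best]) {
                for (int i = 0; i < dim; i++) {
                    sums[best][i] += weights[p] * points[p][i];
                }
                totalWeight[best] += weights[p];
            }
        }

        // Centroids whose ball is empty are left where they were.
        double[][] updated = new double[k][];
        for (int c = 0; c < k; c++) {
            if (totalWeight[c] == 0) {
                updated[c] = centroids[c].clone();
                continue;
            }
            updated[c] = new double[dim];
            for (int i = 0; i < dim; i++) {
                updated[c][i] = sums[c][i] / totalWeight[c];
            }
        }
        return updated;
    }
}
```

Calling ballStep repeatedly, feeding each result back in as the next set of centroids, corresponds to running several iterations. Points outside every ball are ignored for the update, which is what makes the step less sensitive to outliers than a plain Lloyd update.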


Constructor Summary
BallKMeans(UpdatableSearcher searcher, int numClusters, int maxNumIterations)
           
BallKMeans(UpdatableSearcher searcher, int numClusters, int maxNumIterations, boolean kMeansPlusPlusInit, int numRuns)
           
BallKMeans(UpdatableSearcher searcher, int numClusters, int maxNumIterations, double trimFraction, boolean kMeansPlusPlusInit, boolean correctWeights, double testProbability, int numRuns)
           
 
Method Summary
 UpdatableSearcher cluster(List<? extends WeightedVector> datapoints)
          Clusters the datapoints in the list, seeding the centroids either randomly or with k-means++.
 Iterator<Centroid> iterator()
           
 Pair<List<? extends WeightedVector>,List<? extends WeightedVector>> splitTrainTest(List<? extends WeightedVector> datapoints)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

BallKMeans

public BallKMeans(UpdatableSearcher searcher,
                  int numClusters,
                  int maxNumIterations)

BallKMeans

public BallKMeans(UpdatableSearcher searcher,
                  int numClusters,
                  int maxNumIterations,
                  boolean kMeansPlusPlusInit,
                  int numRuns)

BallKMeans

public BallKMeans(UpdatableSearcher searcher,
                  int numClusters,
                  int maxNumIterations,
                  double trimFraction,
                  boolean kMeansPlusPlusInit,
                  boolean correctWeights,
                  double testProbability,
                  int numRuns)

Method Detail

splitTrainTest

public Pair<List<? extends WeightedVector>,List<? extends WeightedVector>> splitTrainTest(List<? extends WeightedVector> datapoints)
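This page gives no description for splitTrainTest. As a rough sketch of what a train/test split driven by a parameter such as testProbability could look like, the following shuffles a copy of the input and sets aside a fraction of it as the test set. This is an assumption about the behaviour, not a statement of what the Mahout method does. The class and method names are hypothetical, and a plain list of two lists stands in for Pair.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical illustration of a shuffled train/test split; not Mahout code.
final class TrainTestSplitSketch {
    // Returns a two-element list: index 0 is the training set, index 1 the
    // test set. The test set holds floor(testProbability * size) elements.
    static <T> List<List<T>> split(List<T> datapoints, double testProbability,
                                   Random random) {
        // Work on a copy so the caller's list keeps its order.
        List<T> shuffled = new ArrayList<>(datapoints);
        Collections.shuffle(shuffled, random);

        int numTest = (int) (testProbability * shuffled.size());
        List<T> test = new ArrayList<>(shuffled.subList(0, numTest));
        List<T> train = new ArrayList<>(shuffled.subList(numTest, shuffled.size()));

        List<List<T>> result = new ArrayList<>();
        result.add(train);
        result.add(test);
        return result;
    }
}
```

Passing a seeded Random makes the split reproducible, which is convenient when comparing several clustering runs on the same held-out points.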

cluster

public UpdatableSearcher cluster(List<? extends WeightedVector> datapoints)
Clusters the datapoints in the list, seeding the centroids either randomly or with k-means++.

Parameters:
datapoints - the points to be clustered.
Returns:
an UpdatableSearcher with the resulting clusters.
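
For readers unfamiliar with k-means++, the seeding option can be sketched as follows. This is a generic, weighted k-means++-style procedure on plain arrays, written under the assumption that each point's chance of becoming the next seed is proportional to its weight times its squared distance to the nearest seed chosen so far. It is not the Mahout code and may differ in detail from the variant in section 4.1.1 of the referenced paper. The class and method names are hypothetical.

```java
import java.util.Random;

// Hypothetical illustration of weighted k-means++-style seeding; not Mahout code.
final class KMeansPlusPlusSeedingSketch {
    // Squared Euclidean distance between two points of equal dimension.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Picks an index with probability proportional to scores[i].
    static int sampleProportional(double[] scores, Random random) {
        double total = 0;
        for (double s : scores) {
            total += s;
        }
        double target = random.nextDouble() * total;
        double running = 0;
        for (int i = 0; i < scores.length; i++) {
            running += scores[i];
            if (target < running) {
                return i;
            }
        }
        return scores.length - 1; // fallback for rounding or an all-zero input
    }

    // Returns the indices of k seed points.
    static int[] seed(double[][] points, double[] weights, int k, Random random) {
        int n = points.length;
        int[] seeds = new int[k];

        // First seed: chosen proportionally to weight alone.
        seeds[0] = sampleProportional(weights, random);

        // Squared distance from each point to its nearest seed so far.
        double[] nearestSq = new double[n];
        for (int p = 0; p < n; p++) {
            nearestSq[p] = squaredDistance(points[p], points[seeds[0]]);
        }

        double[] scores = new double[n];
        for (int s = 1; s < k; s++) {
            // Far-away, heavy points are the most likely next seeds.
            for (int p = 0; p < n; p++) {
                scores[p] = weights[p] * nearestSq[p];
            }
            seeds[s] = sampleProportional(scores, random);
            for (int p = 0; p < n; p++) {
                nearestSq[p] = Math.min(nearestSq[p],
                    squaredDistance(points[p], points[seeds[s]]));
            }
        }
        return seeds;
    }
}
```

A point that is already a seed has squared distance zero to itself, so it has zero probability of being drawn again as long as some other point still has a positive score. Random seeding, by contrast, would simply draw k indices without looking at distances.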

iterator

public Iterator<Centroid> iterator()
Specified by:
iterator in interface Iterable<Centroid>


Copyright © 2008–2014 The Apache Software Foundation. All rights reserved.