java.lang.Object
  org.apache.mahout.clustering.streaming.cluster.StreamingKMeans
public class StreamingKMeans
Implements a streaming k-means algorithm for weighted vectors. The goal is to cluster points one at a time, which is especially useful for MapReduce mappers that receive their inputs one at a time.

A rough description of the algorithm: Suppose there are l clusters at some point and a new point p is added. The new point can either be added to one of the existing l clusters or become a new cluster. To decide:
- let c be the closest cluster to point p;
- let d be the distance between c and p;
- if d > distanceCutoff, create a new cluster from p (p is too far away from the clusters to be part of them; distanceCutoff represents the largest distance from a point to its assigned cluster's centroid);
- else (d <= distanceCutoff), create a new cluster with probability d / distanceCutoff (the probability of creating a new cluster increases as d increases).

There will be either l clusters or l + 1 clusters after processing a new point.

As the number of clusters increases, it will eventually go over the numClusters limit (numClusters represents a recommendation for the number of clusters that there should be at the end). To decrease the number of clusters, the existing clusters are treated as data points and are re-clustered (collapsed). This tends to make the number of clusters go down. If the number of clusters is still too high, distanceCutoff is increased.

For more details, see:
- "Streaming k-means approximation" by N. Ailon, R. Jaiswal, C. Monteleoni, http://books.nips.cc/papers/files/nips22/NIPS2009_1085.pdf
- "Fast and Accurate k-means for Large Datasets" by M. Shindler, A. Wong, A. Meyerson, http://books.nips.cc/papers/files/nips24/NIPS2011_1271.pdf
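The per-point decision described above can be sketched in plain Java. This is a minimal illustration and not the Mahout implementation: the class name StreamingStepSketch, the addPoint helper, the use of Euclidean distance over double[] points, and the handling of the very first point (it simply becomes the first cluster) are all assumptions made for the example. Merging a point into its closest centroid and collapsing clusters are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of the new-cluster decision rule; not the Mahout class.
class StreamingStepSketch {
    /** Euclidean distance between two points of equal dimension (assumed measure). */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /**
     * Processes one point p and returns true if p became a new cluster.
     * When false is returned, p would be merged into its closest cluster;
     * that update is omitted from this sketch.
     */
    static boolean addPoint(List<double[]> centroids, double[] p,
                            double distanceCutoff, Random rng) {
        if (centroids.isEmpty()) {
            // Assumption: with no clusters yet, the first point starts one.
            centroids.add(p.clone());
            return true;
        }
        // d is the distance from p to its closest centroid c.
        double d = Double.POSITIVE_INFINITY;
        for (double[] c : centroids) {
            d = Math.min(d, distance(c, p));
        }
        // d > cutoff: always a new cluster; otherwise with probability d / cutoff.
        if (d > distanceCutoff || rng.nextDouble() < d / distanceCutoff) {
            centroids.add(p.clone());
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        List<double[]> centroids = new ArrayList<>();
        Random rng = new Random(42);
        double[][] points = {{0, 0}, {0.1, 0}, {5, 5}, {5.2, 5.1}};
        for (double[] p : points) {
            boolean created = addPoint(centroids, p, 1.0, rng);
            System.out.println("new cluster: " + created + ", total: " + centroids.size());
        }
    }
}
```

A point at distance 0 from an existing centroid never starts a new cluster in this sketch, while any point farther away than distanceCutoff always does; points in between are decided at random, which is why the number of clusters tends to grow as more data arrives.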
Constructor Summary | |
---|---|
StreamingKMeans(UpdatableSearcher searcher,
int numClusters)
Calls StreamingKMeans(searcher, numClusters, 1.3, 10, 2). |
|
StreamingKMeans(UpdatableSearcher searcher,
int numClusters,
double distanceCutoff)
Calls StreamingKMeans(searcher, numClusters, distanceCutoff, 1.3, 10, 2). |
|
StreamingKMeans(UpdatableSearcher searcher,
int numClusters,
double distanceCutoff,
double beta,
double clusterLogFactor,
double clusterOvershoot)
Creates a new StreamingKMeans instance given a searcher and the number of clusters to generate. |
Method Summary | |
---|---|
UpdatableSearcher |
cluster(Centroid datapoint)
Cluster one data point. |
UpdatableSearcher |
cluster(Iterable<Centroid> datapoints)
Cluster the data points in an Iterable. |
UpdatableSearcher |
cluster(Matrix data)
Cluster the rows of a matrix, treating them as Centroids with weight 1. |
double |
getDistanceCutoff()
|
DistanceMeasure |
getDistanceMeasure()
|
int |
getNumClusters()
|
Iterator<Centroid> |
iterator()
|
void |
reindexCentroids()
|
void |
setDistanceCutoff(double distanceCutoff)
|
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public StreamingKMeans(UpdatableSearcher searcher, int numClusters)
See Also: StreamingKMeans(org.apache.mahout.math.neighborhood.UpdatableSearcher, int, double, double, double, double)
public StreamingKMeans(UpdatableSearcher searcher, int numClusters, double distanceCutoff)
See Also: StreamingKMeans(org.apache.mahout.math.neighborhood.UpdatableSearcher, int, double, double, double, double)
public StreamingKMeans(UpdatableSearcher searcher, int numClusters, double distanceCutoff, double beta, double clusterLogFactor, double clusterOvershoot)
searcher
- A Searcher that is used for performing nearest neighbor search. It MUST BE EMPTY initially because it will be used to keep track of the cluster centroids.
numClusters
- An estimated number of clusters to generate for the data points. This can be adjusted, but the actual number will depend on the data.
distanceCutoff
- The initial distance cutoff: the distance between a point and its closest centroid beyond which the new point will definitely be assigned to a new cluster.
beta
- Ratio of the geometric progression used when increasing distanceCutoff. After n increases, distanceCutoff becomes distanceCutoff * beta^n. A smaller value increases the distanceCutoff less aggressively.
clusterLogFactor
- Value multiplied with the number of points counted so far to estimate the number of clusters to aim for. If the final number of clusters is known and this clustering is only for a sketch of the data, this can be the final number of clusters, k.
clusterOvershoot
- Multiplicative slack factor for slowing down the collapse of the clusters.
Method Detail |
---|
public Iterator<Centroid> iterator()
Specified by: iterator in interface Iterable<Centroid>
public UpdatableSearcher cluster(Matrix data)
data
- matrix whose rows are to be clustered.
public UpdatableSearcher cluster(Iterable<Centroid> datapoints)
datapoints
- Iterable whose elements are to be clustered.
public UpdatableSearcher cluster(Centroid datapoint)
datapoint
- the data point to be clustered.
public int getNumClusters()
public void reindexCentroids()
public double getDistanceCutoff()
public void setDistanceCutoff(double distanceCutoff)
public DistanceMeasure getDistanceMeasure()
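As a worked illustration of the beta parameter documented for the six-argument constructor, the following sketch computes the value distanceCutoff * beta^n after n increases. The class CutoffGrowthSketch and the method cutoffAfter are hypothetical names introduced for this example; they are not part of the Mahout API, and the sketch says nothing about when StreamingKMeans actually triggers an increase.

```java
// Hypothetical helper illustrating the geometric growth of distanceCutoff.
class CutoffGrowthSketch {
    /** Returns initialCutoff * beta^n, the cutoff after n increases. */
    static double cutoffAfter(double initialCutoff, double beta, int n) {
        return initialCutoff * Math.pow(beta, n);
    }

    public static void main(String[] args) {
        // Compare the documented default ratio 1.3 with a smaller ratio 1.1.
        for (int n = 0; n <= 5; n++) {
            System.out.println(n + " increases: beta=1.3 -> " + cutoffAfter(1.0, 1.3, n)
                + ", beta=1.1 -> " + cutoffAfter(1.0, 1.1, n));
        }
    }
}
```

For the same number of increases, the smaller ratio yields a smaller cutoff, which matches the note that a smaller beta increases distanceCutoff less aggressively.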