org.apache.mahout.math.stats
Class TDigest

java.lang.Object
  extended by org.apache.mahout.math.stats.TDigest

public class TDigest
extends Object

Adaptive histogram based on something like streaming k-means crossed with Q-digest.

The special characteristics of this algorithm are:

a) smaller summaries than Q-digest

b) works on doubles as well as integers.

c) provides parts-per-million accuracy for extreme quantiles and typically <1000 ppm accuracy for middle quantiles

d) fast

e) simple

f) test coverage > 90%

g) easy to adapt for use with map-reduce


Nested Class Summary
static class TDigest.Group
           
 
Field Summary
static int SMALL_ENCODING
           
static int VERBOSE_ENCODING
           
 
Constructor Summary
TDigest(double compression)
          A histogram structure that will record a sketch of a distribution.
 
Method Summary
 void add(double x)
          Adds a sample to a histogram.
 void add(double x, int w)
          Adds a sample with the given weight to a histogram.
 void add(TDigest other)
           
 void asBytes(ByteBuffer buf)
          Outputs a histogram as bytes using a particularly cheesy encoding.
 void asSmallBytes(ByteBuffer buf)
           
 int byteSize()
          Returns an upper bound on the number of bytes that will be required to represent this histogram.
 double cdf(double x)
           
 int centroidCount()
           
 Iterable<? extends TDigest.Group> centroids()
           
 void compress()
           
 double compression()
           
static int decode(ByteBuffer buf)
           
static void encode(ByteBuffer buf, int n)
           
static TDigest fromBytes(ByteBuffer buf)
          Reads a histogram from a byte buffer.
static TDigest merge(double compression, Iterable<TDigest> subData)
           
 double quantile(double q)
           
 TDigest recordAllData()
          Sets up so that all centroids will record all data assigned to them.
 int size()
          Returns the number of samples represented in this histogram.
 int smallByteSize()
          Returns an upper bound on the number of bytes that will be required to represent this histogram in the tighter representation.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

VERBOSE_ENCODING

public static final int VERBOSE_ENCODING
See Also:
Constant Field Values

SMALL_ENCODING

public static final int SMALL_ENCODING
See Also:
Constant Field Values
Constructor Detail

TDigest

public TDigest(double compression)
A histogram structure that will record a sketch of a distribution.

Parameters:
compression - How should accuracy be traded for size? A value of N here will give quantile errors almost always less than 3/N with considerably smaller errors expected for extreme quantiles. Conversely, you should expect to track about 5 N centroids for this accuracy.
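
To make the trade-off concrete, the sketch below only evaluates the two rules of thumb stated above (quantile error below roughly 3/N, about 5 N centroids) for a few values of N. It does not call TDigest; the class name CompressionTradeoff and its helper methods are illustrative and not part of this API.

```java
final class CompressionTradeoff {
    // Rule of thumb from the parameter description:
    // quantile error is almost always below 3/N.
    static double errorBound(double compression) {
        return 3.0 / compression;
    }

    // Rule of thumb from the parameter description:
    // about 5 N centroids are tracked.
    static long expectedCentroids(double compression) {
        return Math.round(5.0 * compression);
    }

    public static void main(String[] args) {
        for (double n : new double[] {50, 100, 1000}) {
            System.out.printf("compression=%.0f  error < %.4f  centroids ~ %d%n",
                    n, errorBound(n), expectedCentroids(n));
        }
    }
}
```

Larger compression values therefore buy tighter error bounds at the cost of a proportionally larger summary.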
Method Detail

add

public void add(double x)
Adds a sample to a histogram.

Parameters:
x - The value to add.

add

public void add(double x,
                int w)
Adds a sample with the given weight to a histogram.

Parameters:
x - The value to add.
w - The weight of this point.

add

public void add(TDigest other)

merge

public static TDigest merge(double compression,
                            Iterable<TDigest> subData)

compress

public void compress()

size

public int size()
Returns the number of samples represented in this histogram. If you want to know how many centroids are being used, use centroidCount() instead; centroids() returns an Iterable, which has no size() method.

Returns:
the number of samples that have been added.

cdf

public double cdf(double x)
Parameters:
x - the value at which the CDF should be evaluated
Returns:
the approximate fraction of all samples that were less than or equal to x.

quantile

public double quantile(double q)
Parameters:
q - The desired quantile, in the range [0,1].
Returns:
The minimum value x such that the estimated proportion of samples less than or equal to x is q.
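
As a point of reference for what cdf(double) and quantile(double) approximate, the following sketch computes the exact quantities from a fully sorted sample. It is not how TDigest works internally, since TDigest keeps only a compressed summary; the class ExactReference and its methods are hypothetical helpers that could be used to check the approximations in a test.

```java
import java.util.Arrays;

final class ExactReference {
    // Exact fraction of samples less than or equal to x.
    static double cdf(double[] sorted, double x) {
        int count = 0;
        for (double v : sorted) {
            if (v <= x) {
                count++;
            }
        }
        return (double) count / sorted.length;
    }

    // Smallest sample whose exact cdf is at least q.
    static double quantile(double[] sorted, double q) {
        if (q < 0 || q > 1) {
            throw new IllegalArgumentException("q must be in [0,1]");
        }
        int rank = (int) Math.ceil(q * sorted.length);
        return sorted[Math.max(rank - 1, 0)];
    }

    public static void main(String[] args) {
        double[] data = {5, 1, 4, 2, 3};
        Arrays.sort(data); // both helpers expect ascending order
        System.out.println(cdf(data, 3));
        System.out.println(quantile(data, 0.5));
    }
}
```

The exact version needs every sample in memory, which is the cost the compressed summary is meant to avoid.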

centroidCount

public int centroidCount()

centroids

public Iterable<? extends TDigest.Group> centroids()

compression

public double compression()

recordAllData

public TDigest recordAllData()
Sets up so that all centroids will record all data assigned to them. For testing only, really.


byteSize

public int byteSize()
Returns an upper bound on the number of bytes that will be required to represent this histogram.


smallByteSize

public int smallByteSize()
Returns an upper bound on the number of bytes that will be required to represent this histogram in the tighter representation.


asBytes

public void asBytes(ByteBuffer buf)
Outputs a histogram as bytes using a particularly cheesy encoding.
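
The size-then-write pattern suggested by byteSize() and asBytes(ByteBuffer) can be sketched with java.nio.ByteBuffer alone. Everything in UpperBoundBufferSketch is a hypothetical stand-in: its byte layout is invented for illustration and is not the encoding TDigest uses. The sketch also assumes the reader consumes the buffer from its current position, so the buffer is flipped between writing and reading.

```java
import java.nio.ByteBuffer;

final class UpperBoundBufferSketch {
    // Stand-in for a size query such as byteSize(): an upper bound for write().
    static int upperBound(double[] values) {
        return Integer.BYTES + values.length * Double.BYTES;
    }

    // Stand-in for a serializer such as asBytes(ByteBuffer):
    // writes a count followed by the values (invented layout).
    static void write(double[] values, ByteBuffer buf) {
        buf.putInt(values.length);
        for (double v : values) {
            buf.putDouble(v);
        }
    }

    // Stand-in for a reader such as fromBytes(ByteBuffer).
    static double[] read(ByteBuffer buf) {
        int n = buf.getInt();
        double[] out = new double[n];
        for (int i = 0; i < n; i++) {
            out[i] = buf.getDouble();
        }
        return out;
    }

    public static void main(String[] args) {
        double[] values = {1.5, 2.5, 4.0};
        ByteBuffer buf = ByteBuffer.allocate(upperBound(values));
        write(values, buf);
        buf.flip(); // switch the buffer from writing to reading
        System.out.println(read(buf).length);
    }
}
```

Allocating from the upper bound avoids a BufferOverflowException during the write, at the cost of a few possibly unused bytes.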


asSmallBytes

public void asSmallBytes(ByteBuffer buf)

encode

public static void encode(ByteBuffer buf,
                          int n)

decode

public static int decode(ByteBuffer buf)

fromBytes

public static TDigest fromBytes(ByteBuffer buf)
Reads a histogram from a byte buffer.

Returns:
The new histogram structure


Copyright © 2008–2014 The Apache Software Foundation. All rights reserved.