org.apache.mahout.classifier.sgd
Class AdaptiveLogisticRegression

java.lang.Object
  extended by org.apache.mahout.classifier.sgd.AdaptiveLogisticRegression
All Implemented Interfaces:
Closeable, org.apache.hadoop.io.Writable, OnlineLearner

public class AdaptiveLogisticRegression
extends Object
implements OnlineLearner, org.apache.hadoop.io.Writable

This is a meta-learner that maintains a pool of ordinary OnlineLogisticRegression learners, each with a different learning rate. Whichever learner in the pool falls behind in terms of average log-likelihood is tossed out and replaced with variants of the survivors. This lets us automatically derive an annealing schedule that optimizes learning speed. Since on-line learners tend to be I/O bound anyway, maintaining multiple learners in memory costs less than it might seem. Doing this adaptation on-line as we learn also decreases the number of learning-rate parameters required and replaces the normal hyper-parameter search.

One wrinkle is that the pool of learners we maintain is actually a pool of CrossFoldLearners, each of which itself contains several OnlineLogisticRegression objects. These pools allow performance to be estimated on the fly even if we make many passes through the data. This does, however, increase the cost of training: with 5-fold cross-validation, each vector is used four times for training and once for classification. If this becomes a problem, a two-way unbalanced train/test split would probably be preferable to full cross-validation. With the current default settings, 100 learners are running. This is still better than the alternative of running hundreds of training passes to find good hyper-parameters, because we only have to parse and featurize our inputs once. If you already have good hyper-parameters, you might prefer to run a single CrossFoldLearner with those settings.

The fitness measure used here is AUC. Log-likelihood would be an alternative, but it is much easier to get bogus values of log-likelihood than of AUC, and the two measures seem to agree fairly well in practice. It would be nice to allow the fitness function to be pluggable. The use of AUC means that AdaptiveLogisticRegression is mostly suited to binary target variables. This will be fixed before long by extending OnlineAuc to handle non-binary cases or by using a different fitness value in non-binary cases.
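The pool scheme described above can be sketched in miniature (plain Java, no Mahout classes; the fitness function, the 0.3 mutation width, and the pool contents are all invented for illustration): each "learner" is reduced to its learning rate, fitness stands in for held-out AUC, and each round replaces the worst half of the pool with jittered copies of the survivors.

```java
import java.util.Random;

public class PoolAnnealingSketch {
    // Toy fitness: peaks at rate = 0.1, a stand-in for held-out AUC.
    static double fitness(double rate) {
        double d = Math.log(rate) - Math.log(0.1);
        return -d * d;
    }

    // One round: rank the pool by fitness (simple selection sort keeps it
    // dependency-free), keep the best half, replace the rest with
    // multiplicatively jittered copies of the survivors.
    static double[] evolve(double[] pool, Random rng) {
        double[] sorted = pool.clone();
        for (int i = 0; i < sorted.length; i++)
            for (int j = i + 1; j < sorted.length; j++)
                if (fitness(sorted[j]) > fitness(sorted[i])) {
                    double t = sorted[i]; sorted[i] = sorted[j]; sorted[j] = t;
                }
        double[] next = new double[pool.length];
        int half = pool.length / 2;
        for (int i = 0; i < half; i++) next[i] = sorted[i];   // survivors keep their rates
        for (int i = half; i < pool.length; i++)              // variants of the survivors
            next[i] = sorted[i - half] * Math.exp(0.3 * rng.nextGaussian());
        return next;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        double[] pool = {1e-4, 1e-3, 1e-2, 0.1, 1.0, 10.0};
        for (int epoch = 0; epoch < 30; epoch++) pool = evolve(pool, rng);
        // The optimum 0.1 is in the initial pool, so it survives every round.
        System.out.println("best rate after annealing: " + pool[0]);
    }
}
```

The real class does this with EvolutionaryProcess over CrossFoldLearner states rather than bare doubles, but the survive-and-mutate loop is the same basic shape.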


Nested Class Summary
static class AdaptiveLogisticRegression.TrainingExample
           
static class AdaptiveLogisticRegression.Wrapper
          Provides a shim between the EP optimization stuff and the CrossFoldLearner.
 
Field Summary
static int DEFAULT_POOL_SIZE
           
static int DEFAULT_THREAD_COUNT
           
 
Constructor Summary
AdaptiveLogisticRegression()
           
AdaptiveLogisticRegression(int numCategories, int numFeatures, PriorFunction prior)
          Uses DEFAULT_THREAD_COUNT and DEFAULT_POOL_SIZE
AdaptiveLogisticRegression(int numCategories, int numFeatures, PriorFunction prior, int threadCount, int poolSize)
           
 
Method Summary
 double auc()
          Returns the AUC for the current best member of the population.
 void close()
          Prepares the classifier for classification and deallocates any temporary data structures.
 State<AdaptiveLogisticRegression.Wrapper,CrossFoldLearner> getBest()
           
 List<AdaptiveLogisticRegression.TrainingExample> getBuffer()
           
 EvolutionaryProcess<AdaptiveLogisticRegression.Wrapper,CrossFoldLearner> getEp()
           
 int getMaxInterval()
           
 int getMinInterval()
           
 int getNumCategories()
           
 int getNumFeatures()
           
 PriorFunction getPrior()
           
 int getRecord()
           
 State<AdaptiveLogisticRegression.Wrapper,CrossFoldLearner> getSeed()
           
 int nextStep(int recordNumber)
           
 int numFeatures()
          Returns the size of the internal feature vector.
 void readFields(DataInput in)
           
 void setAucEvaluator(OnlineAuc auc)
           
 void setAveragingWindow(int averagingWindow)
           
 void setBest(State<AdaptiveLogisticRegression.Wrapper,CrossFoldLearner> best)
           
 void setBuffer(List<AdaptiveLogisticRegression.TrainingExample> buffer)
           
 void setEp(EvolutionaryProcess<AdaptiveLogisticRegression.Wrapper,CrossFoldLearner> ep)
           
 void setFreezeSurvivors(boolean freezeSurvivors)
           
 void setInterval(int interval)
          How often should the evolutionary optimization of learning parameters occur?
 void setInterval(int minInterval, int maxInterval)
          Starts optimization using the shorter interval and progresses to the longer using the specified number of steps per decade.
 void setPoolSize(int poolSize)
           
 void setRecord(int record)
           
 void setSeed(State<AdaptiveLogisticRegression.Wrapper,CrossFoldLearner> seed)
           
 void setThreadCount(int threadCount)
           
static int stepSize(int recordNumber, double multiplier)
           
 void train(int actual, Vector instance)
          Updates the model using a particular target variable value and a feature vector.
 void train(long trackingKey, int actual, Vector instance)
          Updates the model using a particular target variable value and a feature vector.
 void train(long trackingKey, String groupKey, int actual, Vector instance)
          Updates the model using a particular target variable value and a feature vector.
 void write(DataOutput out)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DEFAULT_THREAD_COUNT

public static final int DEFAULT_THREAD_COUNT
See Also:
Constant Field Values

DEFAULT_POOL_SIZE

public static final int DEFAULT_POOL_SIZE
See Also:
Constant Field Values
Constructor Detail

AdaptiveLogisticRegression

public AdaptiveLogisticRegression()

AdaptiveLogisticRegression

public AdaptiveLogisticRegression(int numCategories,
                                  int numFeatures,
                                  PriorFunction prior)
Uses DEFAULT_THREAD_COUNT and DEFAULT_POOL_SIZE

Parameters:
numCategories - The number of categories (labels) to train on
numFeatures - The number of features used in creating the vectors (i.e. the cardinality of the vector)
prior - The PriorFunction to use
See Also:
AdaptiveLogisticRegression(int, int, org.apache.mahout.classifier.sgd.PriorFunction, int, int)

AdaptiveLogisticRegression

public AdaptiveLogisticRegression(int numCategories,
                                  int numFeatures,
                                  PriorFunction prior,
                                  int threadCount,
                                  int poolSize)
Parameters:
numCategories - The number of categories (labels) to train on
numFeatures - The number of features used in creating the vectors (i.e. the cardinality of the vector)
prior - The PriorFunction to use
threadCount - The number of threads to use for training
poolSize - The number of CrossFoldLearners to maintain in the pool.
Method Detail

train

public void train(int actual,
                  Vector instance)
Description copied from interface: OnlineLearner
Updates the model using a particular target variable value and a feature vector.

There may be an assumption that if multiple passes through the training data are necessary, the training examples will be presented in the same order on each pass. This is because the order of training examples may be used to assign records to different data splits for evaluation by cross-validation. Without this order invariance, records might move between training and test splits from pass to pass, and error estimates could be seriously affected.

If re-ordering is necessary, the alternative API, which allows a tracking key to be attached to each training example, can be used instead.

Specified by:
train in interface OnlineLearner
Parameters:
actual - The value of the target variable. This value should be in the half-open interval [0..n) where n is the number of target categories.
instance - The feature vector for this example.

train

public void train(long trackingKey,
                  int actual,
                  Vector instance)
Description copied from interface: OnlineLearner
Updates the model using a particular target variable value and a feature vector.

There may be an assumption that if multiple passes through the training data are necessary, the tracking key for a record will be the same on each pass, that there will be a relatively large number of distinct tracking keys, and that the low-order bits of the tracking keys will not correlate with any of the input variables. The tracking key is used to assign training examples to different test/training splits.

Examples of useful tracking keys include id numbers for the training records, derived from a database id for the base table from which the record is derived, or the offset of the original record in a data file.

Specified by:
train in interface OnlineLearner
Parameters:
trackingKey - The tracking key for this training example.
actual - The value of the target variable. This value should be in the half-open interval [0..n) where n is the number of target categories.
instance - The feature vector for this example.
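The splitting role of the tracking key can be sketched as follows (plain Java; the modulus-based assignment is an illustration of the general idea, not Mahout's exact scheme). Because the fold depends only on the key, a record lands in the same split on every pass, which is exactly the order-invariance problem the plain train(int, Vector) API suffers from.

```java
public class FoldAssignmentSketch {
    // Assign a record to one of `folds` splits using the low-order bits of
    // its tracking key; the assignment is stable across training passes.
    static int foldOf(long trackingKey, int folds) {
        return (int) Math.floorMod(trackingKey, (long) folds);
    }

    public static void main(String[] args) {
        int[] counts = new int[5];
        for (long id = 0; id < 1000; id++) {
            counts[foldOf(id, 5)]++;
        }
        // Sequential database ids spread evenly over the 5 folds.
        for (int c : counts) System.out.println(c);
    }
}
```

This also shows why the low-order bits must not correlate with the input variables: if they did, the test fold would systematically differ from the training folds.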

train

public void train(long trackingKey,
                  String groupKey,
                  int actual,
                  Vector instance)
Description copied from interface: OnlineLearner
Updates the model using a particular target variable value and a feature vector.

There may be an assumption that if multiple passes through the training data are necessary, the tracking key for a record will be the same on each pass, that there will be a relatively large number of distinct tracking keys, and that the low-order bits of the tracking keys will not correlate with any of the input variables. The tracking key is used to assign training examples to different test/training splits.

Examples of useful tracking keys include id numbers for the training records, derived from a database id for the base table from which the record is derived, or the offset of the original record in a data file.

Specified by:
train in interface OnlineLearner
Parameters:
trackingKey - The tracking key for this training example.
groupKey - An optional value that allows examples to be grouped in the computation of the update to the model.
actual - The value of the target variable. This value should be in the half-open interval [0..n) where n is the number of target categories.
instance - The feature vector for this example.
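The effect of grouping on evaluation can be illustrated with a toy grouped AUC (a self-contained sketch, not Mahout's implementation): AUC is computed within each group, and the per-group values are averaged so that no single large group dominates the fitness signal.

```java
public class GroupedAucSketch {
    // Rank-based AUC: the fraction of (positive, negative) score pairs
    // ordered correctly; ties count as half a win.
    static double auc(double[] posScores, double[] negScores) {
        double wins = 0;
        for (double p : posScores) {
            for (double n : negScores) {
                if (p > n) wins += 1;
                else if (p == n) wins += 0.5;
            }
        }
        return wins / (posScores.length * (double) negScores.length);
    }

    // Average per-group AUCs so each group counts equally regardless of size.
    static double groupedAuc(double[][] posByGroup, double[][] negByGroup) {
        double sum = 0;
        for (int g = 0; g < posByGroup.length; g++) {
            sum += auc(posByGroup[g], negByGroup[g]);
        }
        return sum / posByGroup.length;
    }

    public static void main(String[] args) {
        double[][] pos = {{0.9, 0.8}, {0.7}};        // scores by group
        double[][] neg = {{0.1, 0.2}, {0.7, 0.3}};
        System.out.println(groupedAuc(pos, neg));    // prints 0.875
    }
}
```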

nextStep

public int nextStep(int recordNumber)

stepSize

public static int stepSize(int recordNumber,
                           double multiplier)

close

public void close()
Description copied from interface: OnlineLearner
Prepares the classifier for classification and deallocates any temporary data structures. An online classifier should be able to accept more training after being closed, but closing the classifier may make classification more efficient.

Specified by:
close in interface Closeable
Specified by:
close in interface OnlineLearner

setInterval

public void setInterval(int interval)
How often should the evolutionary optimization of learning parameters occur?

Parameters:
interval - Number of training examples to use in each epoch of optimization.

setInterval

public void setInterval(int minInterval,
                        int maxInterval)
Starts optimization using the shorter interval and progresses to the longer using the specified number of steps per decade. Note that values < 200 are not accepted; even values that small are unlikely to be useful.

Parameters:
minInterval - The minimum epoch length for the evolutionary optimization
maxInterval - The maximum epoch length
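The progression from minInterval to maxInterval can be sketched as a geometric schedule (an illustrative reconstruction; the 20-steps-per-decade factor here is a hypothetical choice, not Mahout's actual arithmetic, which lives in nextStep and stepSize):

```java
public class IntervalScheduleSketch {
    // Grow the evaluation epoch geometrically toward maxInterval,
    // using a hypothetical 20 steps per decade of growth.
    static int nextInterval(int current, int maxInterval) {
        int grown = (int) Math.ceil(current * Math.pow(10, 1.0 / 20));
        return Math.min(grown, maxInterval);
    }

    public static void main(String[] args) {
        int interval = 200;  // Mahout rejects values below 200
        while (interval < 2000) {
            System.out.println(interval);
            interval = nextInterval(interval, 2000);
        }
    }
}
```

The point of such a schedule is that early training benefits from frequent re-optimization of the learning parameters, while later training can afford longer, cheaper epochs.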

setPoolSize

public void setPoolSize(int poolSize)

setThreadCount

public void setThreadCount(int threadCount)

setAucEvaluator

public void setAucEvaluator(OnlineAuc auc)

numFeatures

public int numFeatures()
Returns the size of the internal feature vector. Note that this is not the same as the number of distinct features, especially if feature hashing is being used.

Returns:
The internal feature vector size.
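The gap between the internal vector size and the number of distinct features is easiest to see with feature hashing (a self-contained sketch; Mahout's own hashed feature encoders are not shown here): many named features map into a fixed number of slots, so distinct features can share a slot.

```java
public class FeatureHashingSketch {
    // Map a feature name into one of numFeatures slots.
    static int slotOf(String feature, int numFeatures) {
        return Math.floorMod(feature.hashCode(), numFeatures);
    }

    public static void main(String[] args) {
        String[] features = {"age", "income", "zip=02139", "clicked", "device=mobile"};
        int numFeatures = 4;  // smaller than the number of distinct features
        for (String f : features) {
            System.out.println(f + " -> slot " + slotOf(f, numFeatures));
        }
        // With 5 names and 4 slots, at least two names must collide,
        // so numFeatures() understates the number of distinct features.
    }
}
```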

auc

public double auc()
Returns the AUC for the current best member of the population. If there is no best member, usually because no training has been done yet, the result is NaN.

Returns:
The AUC of the best member of the population or NaN if we can't figure that out.

getBest

public State<AdaptiveLogisticRegression.Wrapper,CrossFoldLearner> getBest()

setBest

public void setBest(State<AdaptiveLogisticRegression.Wrapper,CrossFoldLearner> best)

getRecord

public int getRecord()

setRecord

public void setRecord(int record)

getMinInterval

public int getMinInterval()

getMaxInterval

public int getMaxInterval()

getNumCategories

public int getNumCategories()

getPrior

public PriorFunction getPrior()

setBuffer

public void setBuffer(List<AdaptiveLogisticRegression.TrainingExample> buffer)

getBuffer

public List<AdaptiveLogisticRegression.TrainingExample> getBuffer()

getEp

public EvolutionaryProcess<AdaptiveLogisticRegression.Wrapper,CrossFoldLearner> getEp()

setEp

public void setEp(EvolutionaryProcess<AdaptiveLogisticRegression.Wrapper,CrossFoldLearner> ep)

getSeed

public State<AdaptiveLogisticRegression.Wrapper,CrossFoldLearner> getSeed()

setSeed

public void setSeed(State<AdaptiveLogisticRegression.Wrapper,CrossFoldLearner> seed)

getNumFeatures

public int getNumFeatures()

setAveragingWindow

public void setAveragingWindow(int averagingWindow)

setFreezeSurvivors

public void setFreezeSurvivors(boolean freezeSurvivors)

write

public void write(DataOutput out)
           throws IOException
Specified by:
write in interface org.apache.hadoop.io.Writable
Throws:
IOException

readFields

public void readFields(DataInput in)
                throws IOException
Specified by:
readFields in interface org.apache.hadoop.io.Writable
Throws:
IOException


Copyright © 2008–2014 The Apache Software Foundation. All rights reserved.