org.apache.mahout.clustering.lda.cvb
Class CVB0Driver

java.lang.Object
  extended by org.apache.hadoop.conf.Configured
      extended by org.apache.mahout.common.AbstractJob
          extended by org.apache.mahout.clustering.lda.cvb.CVB0Driver
All Implemented Interfaces:
org.apache.hadoop.conf.Configurable, org.apache.hadoop.util.Tool

public class CVB0Driver
extends AbstractJob

See CachingCVB0Mapper for more details on scalability and room for improvement. To try out this LDA implementation without using Hadoop, see InMemoryCollapsedVariationalBayes0. To train directly in Java code with your own main(), see ModelTrainer and TopicModel.

Usage: ./bin/mahout cvb <options>

Valid options include:

--input path
Input path for SequenceFile<IntWritable, VectorWritable> document vectors. See SparseVectorsFromSequenceFiles for details on how to generate this input format.
--dictionary path
Path to dictionary file(s) generated during construction of input document vectors (glob expression supported). If set, this data is scanned to determine an appropriate value for option --num_terms.
--output path
Output path for topic-term distributions.
--doc_topic_output path
Output path for doc-topic distributions.
--num_topics k
Number of latent topics.
--num_terms nt
Number of unique features defined by the input document vectors. If option --dictionary is defined and this option is unspecified, the term count is calculated from the dictionary.
--topic_model_temp_dir path
Path in which to store model state after each iteration.
--maxIter i
Maximum number of iterations to perform. If this value is less than or equal to the number of iteration states found beneath the path specified by option --topic_model_temp_dir, no further iterations are performed. Instead, output topic-term and doc-topic distributions are generated using data from the specified iteration.
--max_doc_topic_iters i
Maximum number of iterations per doc for p(topic|doc) learning. Defaults to 10.
--doc_topic_smoothing a
Smoothing for doc-topic distribution. Defaults to 0.0001.
--term_topic_smoothing e
Smoothing for topic-term distribution. Defaults to 0.0001.
--random_seed seed
Integer seed for random number generation.
--test_set_percentage p
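Fraction of data to hold out for testing, expressed as a fraction rather than a percentage despite the option name (for example, 0.1 holds out 10%). Defaults to 0.0.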
--iteration_block_size block
Number of iterations between perplexity checks. Defaults to 10. This option is ignored unless option --test_set_percentage is greater than zero.
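
The options above can be assembled programmatically before being handed to the driver. The following is a minimal sketch, not part of Mahout: the class CvbArgsSketch, its build method, and all paths are hypothetical, and only the option names are taken from the list above. It builds the argument array only; the commented line at the end indicates how such an array might be passed on when Hadoop and Mahout are on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Mahout): assembles the argument array
// that could be passed to CVB0Driver.run(String[]) or to ./bin/mahout cvb.
final class CvbArgsSketch {

    static String[] build(String input, String output, int numTopics, int maxIter,
                          String dictionary, String docTopicOutput) {
        List<String> args = new ArrayList<>();
        args.add("--input");
        args.add(input);
        args.add("--output");
        args.add(output);
        args.add("--num_topics");
        args.add(Integer.toString(numTopics));
        args.add("--maxIter");
        args.add(Integer.toString(maxIter));
        if (dictionary != null) {
            // With a dictionary, --num_terms may be left unspecified.
            args.add("--dictionary");
            args.add(dictionary);
        }
        if (docTopicOutput != null) {
            args.add("--doc_topic_output");
            args.add(docTopicOutput);
        }
        return args.toArray(new String[0]);
    }

    public static void main(String[] unused) {
        // All paths below are placeholders chosen for illustration.
        String[] args = build("/data/vectors", "/data/topics", 20, 30,
                              "/data/dictionary.file-*", "/data/doc-topics");
        System.out.println(String.join(" ", args));
        // With Hadoop and Mahout available, the array could be passed on, e.g.:
        // ToolRunner.run(new Configuration(), new CVB0Driver(), args);
    }
}
```

Options not set here (smoothing values, seed, test fraction, and so on) fall back to the defaults documented above.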


Nested Class Summary
static class CVB0Driver.DualDoubleSumReducer
          Sums keys and values independently.
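
To illustrate what "sums keys and values independently" means, here is a small stand-alone sketch. It is not the Hadoop reducer itself: the class DualDoubleSumSketch and the representation of each (key, value) pair as a two-element double array are assumptions made for illustration only.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative stand-in, not the actual reducer: accumulates the keys and
// the values of (key, value) pairs of doubles into two separate totals.
final class DualDoubleSumSketch {

    // Each pair is {key, value}. Returns {sum of all keys, sum of all values}.
    static double[] sum(List<double[]> pairs) {
        double keySum = 0.0;
        double valueSum = 0.0;
        for (double[] pair : pairs) {
            keySum += pair[0];
            valueSum += pair[1];
        }
        return new double[] {keySum, valueSum};
    }

    public static void main(String[] args) {
        List<double[]> pairs = Arrays.asList(
            new double[] {1.0, 10.0},
            new double[] {2.0, 20.0},
            new double[] {3.0, 30.0});
        double[] totals = sum(pairs);
        System.out.println(totals[0] + " " + totals[1]); // prints "6.0 60.0"
    }
}
```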
 
Field Summary
static String BACKFILL_PERPLEXITY
static String DICTIONARY
static String DOC_TOPIC_OUTPUT
static String DOC_TOPIC_SMOOTHING
static String ITERATION_BLOCK_SIZE
static String MAX_ITERATIONS_PER_DOC
static String MODEL_TEMP_DIR
static String MODEL_WEIGHT
static String NUM_REDUCE_TASKS
static String NUM_TERMS
static String NUM_TOPICS
static String NUM_TRAIN_THREADS
static String NUM_UPDATE_THREADS
static String RANDOM_SEED
static String TERM_TOPIC_SMOOTHING
static String TEST_SET_FRACTION
 
Fields inherited from class org.apache.mahout.common.AbstractJob
argMap, inputFile, inputPath, outputFile, outputPath, tempPath
 
Constructor Summary
CVB0Driver()
           
 
Method Summary
static org.apache.hadoop.fs.Path[] getModelPaths(org.apache.hadoop.conf.Configuration conf)
static void main(String[] args)
static org.apache.hadoop.fs.Path modelPath(org.apache.hadoop.fs.Path topicModelStateTempPath, int iterationNumber)
static org.apache.hadoop.fs.Path perplexityPath(org.apache.hadoop.fs.Path topicModelStateTempPath, int iterationNumber)
static double readPerplexity(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path topicModelStateTemp, int iteration)
int run(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path inputPath, org.apache.hadoop.fs.Path topicModelOutputPath, int numTopics, int numTerms, double alpha, double eta, int maxIterations, int iterationBlockSize, double convergenceDelta, org.apache.hadoop.fs.Path dictionaryPath, org.apache.hadoop.fs.Path docTopicOutputPath, org.apache.hadoop.fs.Path topicModelStateTempPath, long randomSeed, float testFraction, int numTrainThreads, int numUpdateThreads, int maxItersPerDoc, int numReduceTasks, boolean backfillPerplexity)
int run(String[] args)
void runIteration(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path corpusInput, org.apache.hadoop.fs.Path modelInput, org.apache.hadoop.fs.Path modelOutput, int iterationNumber, int maxIterations, int numReduceTasks)
 
Methods inherited from class org.apache.mahout.common.AbstractJob
addFlag, addInputOption, addOption, addOption, addOption, addOption, addOutputOption, buildOption, buildOption, getAnalyzerClassFromOption, getCLIOption, getConf, getDimensions, getFloat, getFloat, getGroup, getInputFile, getInputPath, getInt, getInt, getOption, getOption, getOption, getOptions, getOutputFile, getOutputPath, getOutputPath, getTempPath, getTempPath, hasOption, keyFor, maybePut, parseArguments, parseArguments, parseDirectories, prepareJob, prepareJob, prepareJob, prepareJob, setConf, setS3SafeCombinedInputPath, shouldRunNextPhase
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

NUM_TOPICS

public static final String NUM_TOPICS
See Also:
Constant Field Values

NUM_TERMS

public static final String NUM_TERMS
See Also:
Constant Field Values

DOC_TOPIC_SMOOTHING

public static final String DOC_TOPIC_SMOOTHING
See Also:
Constant Field Values

TERM_TOPIC_SMOOTHING

public static final String TERM_TOPIC_SMOOTHING
See Also:
Constant Field Values

DICTIONARY

public static final String DICTIONARY
See Also:
Constant Field Values

DOC_TOPIC_OUTPUT

public static final String DOC_TOPIC_OUTPUT
See Also:
Constant Field Values

MODEL_TEMP_DIR

public static final String MODEL_TEMP_DIR
See Also:
Constant Field Values

ITERATION_BLOCK_SIZE

public static final String ITERATION_BLOCK_SIZE
See Also:
Constant Field Values

RANDOM_SEED

public static final String RANDOM_SEED
See Also:
Constant Field Values

TEST_SET_FRACTION

public static final String TEST_SET_FRACTION
See Also:
Constant Field Values

NUM_TRAIN_THREADS

public static final String NUM_TRAIN_THREADS
See Also:
Constant Field Values

NUM_UPDATE_THREADS

public static final String NUM_UPDATE_THREADS
See Also:
Constant Field Values

MAX_ITERATIONS_PER_DOC

public static final String MAX_ITERATIONS_PER_DOC
See Also:
Constant Field Values

MODEL_WEIGHT

public static final String MODEL_WEIGHT
See Also:
Constant Field Values

NUM_REDUCE_TASKS

public static final String NUM_REDUCE_TASKS
See Also:
Constant Field Values

BACKFILL_PERPLEXITY

public static final String BACKFILL_PERPLEXITY
See Also:
Constant Field Values
Constructor Detail

CVB0Driver

public CVB0Driver()
Method Detail

run

public int run(String[] args)
        throws Exception
Throws:
Exception

run

public int run(org.apache.hadoop.conf.Configuration conf,
               org.apache.hadoop.fs.Path inputPath,
               org.apache.hadoop.fs.Path topicModelOutputPath,
               int numTopics,
               int numTerms,
               double alpha,
               double eta,
               int maxIterations,
               int iterationBlockSize,
               double convergenceDelta,
               org.apache.hadoop.fs.Path dictionaryPath,
               org.apache.hadoop.fs.Path docTopicOutputPath,
               org.apache.hadoop.fs.Path topicModelStateTempPath,
               long randomSeed,
               float testFraction,
               int numTrainThreads,
               int numUpdateThreads,
               int maxItersPerDoc,
               int numReduceTasks,
               boolean backfillPerplexity)
        throws ClassNotFoundException,
               IOException,
               InterruptedException
Throws:
ClassNotFoundException
IOException
InterruptedException
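
The interplay of maxIterations, iterationBlockSize, and testFraction, as described for the options --maxIter, --iteration_block_size, and --test_set_percentage, can be sketched as follows. This is an assumption-laden illustration rather than the driver's code: the class IterationPlanSketch, both method names, and the exact scheduling rule (a check whenever the iteration number is a multiple of the block size) are hypothetical.

```java
// Hypothetical sketch of the control flow implied by the option descriptions;
// names and the exact scheduling rule are assumptions, not CVB0Driver's code.
final class IterationPlanSketch {

    // Number of further iterations to run when `existingStates` iteration
    // states already exist beneath the model temp directory. If maxIterations
    // is less than or equal to that count, nothing more is run.
    static int remainingIterations(int maxIterations, int existingStates) {
        return Math.max(0, maxIterations - existingStates);
    }

    // Assumed rule: perplexity is checked every `blockSize` iterations, and
    // only when some data is held out for testing (testFraction > 0).
    static boolean checksPerplexity(int iteration, int blockSize, float testFraction) {
        return testFraction > 0 && iteration % blockSize == 0;
    }

    public static void main(String[] args) {
        int maxIterations = 30;
        int existingStates = 12;   // e.g. left over from an earlier, interrupted run
        int blockSize = 10;
        float testFraction = 0.1f;

        System.out.println("iterations still to run: "
            + remainingIterations(maxIterations, existingStates));
        for (int i = existingStates + 1; i <= maxIterations; i++) {
            if (checksPerplexity(i, blockSize, testFraction)) {
                System.out.println("perplexity check after iteration " + i);
            }
        }
    }
}
```

With testFraction left at its default of 0.0, checksPerplexity never returns true, which matches the statement that the block size is ignored unless data is held out.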

readPerplexity

public static double readPerplexity(org.apache.hadoop.conf.Configuration conf,
                                    org.apache.hadoop.fs.Path topicModelStateTemp,
                                    int iteration)
                             throws IOException
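Parameters:
topicModelStateTemp - path under which model state is stored after each iteration (see option --topic_model_temp_dir).
iteration - number of the iteration whose perplexity data is read.
Returns:
As originally documented: double[2] where the first value is perplexity and the second is the model weight of those documents sampled during perplexity computation, or null if no perplexity data exists for the given iteration. Note that the declared return type is a single double, which can hold neither an array nor null, so this description does not match the signature.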
Throws:
IOException

modelPath

public static org.apache.hadoop.fs.Path modelPath(org.apache.hadoop.fs.Path topicModelStateTempPath,
                                                  int iterationNumber)

perplexityPath

public static org.apache.hadoop.fs.Path perplexityPath(org.apache.hadoop.fs.Path topicModelStateTempPath,
                                                       int iterationNumber)

runIteration

public void runIteration(org.apache.hadoop.conf.Configuration conf,
                         org.apache.hadoop.fs.Path corpusInput,
                         org.apache.hadoop.fs.Path modelInput,
                         org.apache.hadoop.fs.Path modelOutput,
                         int iterationNumber,
                         int maxIterations,
                         int numReduceTasks)
                  throws IOException,
                         ClassNotFoundException,
                         InterruptedException
Throws:
IOException
ClassNotFoundException
InterruptedException

getModelPaths

public static org.apache.hadoop.fs.Path[] getModelPaths(org.apache.hadoop.conf.Configuration conf)

main

public static void main(String[] args)
                 throws Exception
Throws:
Exception


Copyright © 2008–2014 The Apache Software Foundation. All rights reserved.