org.apache.mahout.clustering.lda.cvb
Class CVB0Driver
java.lang.Object
org.apache.hadoop.conf.Configured
org.apache.mahout.common.AbstractJob
org.apache.mahout.clustering.lda.cvb.CVB0Driver
- All Implemented Interfaces:
- org.apache.hadoop.conf.Configurable, org.apache.hadoop.util.Tool
public class CVB0Driver
- extends AbstractJob
See CachingCVB0Mapper for more details on scalability and room for improvement.
To try out this LDA implementation without using Hadoop, check out InMemoryCollapsedVariationalBayes0. If you want to do training directly in Java code with your own main(), then look to ModelTrainer and TopicModel.
Usage: ./bin/mahout cvb options
Valid options include:
--input path
- Input path for SequenceFile<IntWritable, VectorWritable> document vectors. See SparseVectorsFromSequenceFiles for details on how to generate this input format.
--dictionary path
- Path to dictionary file(s) generated during construction of input document vectors (glob expression supported). If set, this data is scanned to determine an appropriate value for option --num_terms.
--output path
- Output path for topic-term distributions.
--doc_topic_output path
- Output path for doc-topic distributions.
--num_topics k
- Number of latent topics.
--num_terms nt
- Number of unique features defined by input document vectors. If option --dictionary is defined and this option is unspecified, the term count is calculated from the dictionary.
--topic_model_temp_dir path
- Path in which to store model state after each iteration.
--maxIter i
- Maximum number of iterations to perform. If this value is less than or equal to the number of iteration states found beneath the path specified by option --topic_model_temp_dir, no further iterations are performed. Instead, output topic-term and doc-topic distributions are generated using data from the specified iteration.
--max_doc_topic_iters i
- Maximum number of iterations per doc for p(topic|doc) learning. Defaults to 10.
--doc_topic_smoothing a
- Smoothing for doc-topic distribution. Defaults to 0.0001.
--term_topic_smoothing e
- Smoothing for topic-term distribution. Defaults to 0.0001.
--random_seed seed
- Integer seed for random number generation.
--test_set_percentage p
- Fraction of data to hold out for testing. Defaults to 0.0.
--iteration_block_size block
- Number of iterations between perplexity checks. Defaults to 10. This option is ignored unless option --test_set_percentage is greater than zero.
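As a rough illustration of how the options above fit together, the following sketch assembles an argument array for the cvb job in plain Java. The class name CvbArgsSketch, the helper buildArgs, and every path and value are hypothetical; only the option names are taken from the list above. The commented-out call at the end assumes Hadoop's ToolRunner and this driver are on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Mahout: builds a cvb argument array
// from the option names documented above.
class CvbArgsSketch {

    static String[] buildArgs(String input, String dictionary, String output,
                              String docTopicOutput, String tempDir,
                              int numTopics, int maxIter) {
        List<String> args = new ArrayList<>();
        args.add("--input");                args.add(input);
        args.add("--dictionary");           args.add(dictionary);
        args.add("--output");               args.add(output);
        args.add("--doc_topic_output");     args.add(docTopicOutput);
        args.add("--topic_model_temp_dir"); args.add(tempDir);
        args.add("--num_topics");           args.add(Integer.toString(numTopics));
        args.add("--maxIter");              args.add(Integer.toString(maxIter));
        // --num_terms is omitted on purpose: with --dictionary set, the
        // documentation says the term count is derived from the dictionary.
        return args.toArray(new String[0]);
    }

    public static void main(String[] unused) {
        // All paths below are placeholders.
        String[] args = buildArgs("/data/vectors", "/data/dictionary.file-*",
                "/out/topic-term", "/out/doc-topic", "/tmp/cvb-state", 20, 30);
        System.out.println(String.join(" ", args));
        // With Mahout and Hadoop available, the array could be handed to the driver:
        // int exitCode = org.apache.hadoop.util.ToolRunner.run(new CVB0Driver(), args);
    }
}
```

Building the arguments in one place keeps the option names next to the values they take, which makes it easier to spot a missing required option before a job is submitted.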
Method Summary
static org.apache.hadoop.fs.Path[] | getModelPaths(org.apache.hadoop.conf.Configuration conf)
static void | main(String[] args)
static org.apache.hadoop.fs.Path | modelPath(org.apache.hadoop.fs.Path topicModelStateTempPath, int iterationNumber)
static org.apache.hadoop.fs.Path | perplexityPath(org.apache.hadoop.fs.Path topicModelStateTempPath, int iterationNumber)
static double | readPerplexity(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path topicModelStateTemp, int iteration)
int | run(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path inputPath, org.apache.hadoop.fs.Path topicModelOutputPath, int numTopics, int numTerms, double alpha, double eta, int maxIterations, int iterationBlockSize, double convergenceDelta, org.apache.hadoop.fs.Path dictionaryPath, org.apache.hadoop.fs.Path docTopicOutputPath, org.apache.hadoop.fs.Path topicModelStateTempPath, long randomSeed, float testFraction, int numTrainThreads, int numUpdateThreads, int maxItersPerDoc, int numReduceTasks, boolean backfillPerplexity)
int | run(String[] args)
void | runIteration(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path corpusInput, org.apache.hadoop.fs.Path modelInput, org.apache.hadoop.fs.Path modelOutput, int iterationNumber, int maxIterations, int numReduceTasks)
Methods inherited from class org.apache.mahout.common.AbstractJob
addFlag, addInputOption, addOption, addOption, addOption, addOption, addOutputOption, buildOption, buildOption, getAnalyzerClassFromOption, getCLIOption, getConf, getDimensions, getFloat, getFloat, getGroup, getInputFile, getInputPath, getInt, getInt, getOption, getOption, getOption, getOptions, getOutputFile, getOutputPath, getOutputPath, getTempPath, getTempPath, hasOption, keyFor, maybePut, parseArguments, parseArguments, parseDirectories, prepareJob, prepareJob, prepareJob, prepareJob, setConf, setS3SafeCombinedInputPath, shouldRunNextPhase
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
NUM_TOPICS
public static final String NUM_TOPICS
- See Also:
- Constant Field Values
NUM_TERMS
public static final String NUM_TERMS
- See Also:
- Constant Field Values
DOC_TOPIC_SMOOTHING
public static final String DOC_TOPIC_SMOOTHING
- See Also:
- Constant Field Values
TERM_TOPIC_SMOOTHING
public static final String TERM_TOPIC_SMOOTHING
- See Also:
- Constant Field Values
DICTIONARY
public static final String DICTIONARY
- See Also:
- Constant Field Values
DOC_TOPIC_OUTPUT
public static final String DOC_TOPIC_OUTPUT
- See Also:
- Constant Field Values
MODEL_TEMP_DIR
public static final String MODEL_TEMP_DIR
- See Also:
- Constant Field Values
ITERATION_BLOCK_SIZE
public static final String ITERATION_BLOCK_SIZE
- See Also:
- Constant Field Values
RANDOM_SEED
public static final String RANDOM_SEED
- See Also:
- Constant Field Values
TEST_SET_FRACTION
public static final String TEST_SET_FRACTION
- See Also:
- Constant Field Values
NUM_TRAIN_THREADS
public static final String NUM_TRAIN_THREADS
- See Also:
- Constant Field Values
NUM_UPDATE_THREADS
public static final String NUM_UPDATE_THREADS
- See Also:
- Constant Field Values
MAX_ITERATIONS_PER_DOC
public static final String MAX_ITERATIONS_PER_DOC
- See Also:
- Constant Field Values
MODEL_WEIGHT
public static final String MODEL_WEIGHT
- See Also:
- Constant Field Values
NUM_REDUCE_TASKS
public static final String NUM_REDUCE_TASKS
- See Also:
- Constant Field Values
BACKFILL_PERPLEXITY
public static final String BACKFILL_PERPLEXITY
- See Also:
- Constant Field Values
CVB0Driver
public CVB0Driver()
run
public int run(String[] args)
throws Exception
- Throws:
Exception
run
public int run(org.apache.hadoop.conf.Configuration conf,
org.apache.hadoop.fs.Path inputPath,
org.apache.hadoop.fs.Path topicModelOutputPath,
int numTopics,
int numTerms,
double alpha,
double eta,
int maxIterations,
int iterationBlockSize,
double convergenceDelta,
org.apache.hadoop.fs.Path dictionaryPath,
org.apache.hadoop.fs.Path docTopicOutputPath,
org.apache.hadoop.fs.Path topicModelStateTempPath,
long randomSeed,
float testFraction,
int numTrainThreads,
int numUpdateThreads,
int maxItersPerDoc,
int numReduceTasks,
boolean backfillPerplexity)
throws ClassNotFoundException,
IOException,
InterruptedException
- Throws:
ClassNotFoundException
IOException
InterruptedException
readPerplexity
public static double readPerplexity(org.apache.hadoop.conf.Configuration conf,
org.apache.hadoop.fs.Path topicModelStateTemp,
int iteration)
throws IOException
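- Parameters:
topicModelStateTemp - path under which per-iteration model state is stored
iteration - iteration number whose perplexity data is read
- Returns:
a double, as declared in the signature. The accompanying description speaks of a double[2] where the first value is perplexity and the second is the model weight of those documents sampled during perplexity computation, or null if no perplexity data exists for the given iteration. A primitive double can carry neither an array nor null, so treat that description as out of date and consult the source for the value actually returned.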
- Throws:
IOException
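The --iteration_block_size and --test_set_percentage options together determine when perplexity data exists to be read. The sketch below lists the iterations at which a check would occur, assuming checks fall on multiples of the block size; that scheduling rule, the class name PerplexityScheduleSketch, and the method checkIterations are assumptions made for illustration, not taken from the driver's source.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of Mahout.
class PerplexityScheduleSketch {

    // Returns the iteration numbers (1-based) at which perplexity would be checked.
    // Assumption: a check happens whenever the iteration number is a multiple of blockSize.
    static List<Integer> checkIterations(int maxIter, int blockSize, double testFraction) {
        List<Integer> iterations = new ArrayList<>();
        // Per the option description, the block size is ignored unless the
        // held-out fraction is greater than zero.
        if (testFraction <= 0.0 || blockSize <= 0) {
            return iterations;
        }
        for (int i = 1; i <= maxIter; i++) {
            if (i % blockSize == 0) {
                iterations.add(i);
            }
        }
        return iterations;
    }

    public static void main(String[] args) {
        System.out.println(checkIterations(35, 10, 0.1));
        System.out.println(checkIterations(35, 10, 0.0));
    }
}
```

Under this assumption, asking for perplexity at an iteration that is not on the schedule, or with no held-out data, would find nothing to read.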
modelPath
public static org.apache.hadoop.fs.Path modelPath(org.apache.hadoop.fs.Path topicModelStateTempPath,
int iterationNumber)
perplexityPath
public static org.apache.hadoop.fs.Path perplexityPath(org.apache.hadoop.fs.Path topicModelStateTempPath,
int iterationNumber)
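Because model state is stored per iteration number, the --maxIter rule described in the options can be expressed as simple arithmetic: training only continues while fewer iteration states exist than the requested maximum. The following is a minimal sketch of that rule with no Hadoop dependency; the class name ResumeSketch and the method remainingIterations are hypothetical, and the assumption that exactly maxIter minus the existing count remain is an inference from the option text.

```java
// Hypothetical sketch, not part of Mahout.
class ResumeSketch {

    // Number of further iterations to run, given how many iteration states
    // already exist under the model temp dir and the --maxIter value.
    static int remainingIterations(int existingIterationStates, int maxIter) {
        // If maxIter <= existing states, no further iterations are performed.
        return Math.max(0, maxIter - existingIterationStates);
    }

    public static void main(String[] args) {
        System.out.println(remainingIterations(3, 10));
        System.out.println(remainingIterations(10, 10));
        System.out.println(remainingIterations(12, 10));
    }
}
```

In the zero case the option text says the output distributions are still generated from the stored state, so a rerun with a small --maxIter can be used to re-emit results without further training.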
runIteration
public void runIteration(org.apache.hadoop.conf.Configuration conf,
org.apache.hadoop.fs.Path corpusInput,
org.apache.hadoop.fs.Path modelInput,
org.apache.hadoop.fs.Path modelOutput,
int iterationNumber,
int maxIterations,
int numReduceTasks)
throws IOException,
ClassNotFoundException,
InterruptedException
- Throws:
IOException
ClassNotFoundException
InterruptedException
getModelPaths
public static org.apache.hadoop.fs.Path[] getModelPaths(org.apache.hadoop.conf.Configuration conf)
main
public static void main(String[] args)
throws Exception
- Throws:
Exception
Copyright © 2008–2014 The Apache Software Foundation. All rights reserved.