org.apache.mahout.vectorizer.collocations.llr
Class CollocDriver

java.lang.Object
  extended by org.apache.hadoop.conf.Configured
      extended by org.apache.mahout.common.AbstractJob
          extended by org.apache.mahout.vectorizer.collocations.llr.CollocDriver
All Implemented Interfaces:
org.apache.hadoop.conf.Configurable, org.apache.hadoop.util.Tool

public final class CollocDriver
extends AbstractJob

Driver for the log-likelihood ratio (LLR) collocation discovery MapReduce job.
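
For orientation, an LLR score for a candidate collocation is commonly computed from a 2x2 contingency table of co-occurrence counts. The sketch below uses the entropy-based form of that statistic. The class name LlrSketch and the count names k11 to k22 are illustrative assumptions and are not part of this class's API; the exact computation performed by the job may differ.

```java
final class LlrSketch {

    // x * ln(x), taken to be 0 at x = 0.
    private static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalized entropy of a set of counts: N ln N - sum(k ln k).
    private static double entropy(long... counts) {
        long sum = 0;
        double parts = 0.0;
        for (long c : counts) {
            parts += xLogX(c);
            sum += c;
        }
        return xLogX(sum) - parts;
    }

    // k11: both words together, k12: first word without second,
    // k21: second word without first, k22: neither word.
    // (Hypothetical naming, chosen for this sketch.)
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        // Clamp tiny negative values caused by floating-point round-off.
        return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
    }

    public static void main(String[] args) {
        // Independent words: score close to zero.
        System.out.println(logLikelihoodRatio(10, 10, 10, 10));
        // Words that only ever appear together: high score.
        System.out.println(logLikelihoodRatio(10, 0, 0, 10));
    }
}
```

Under this reading, a candidate whose score falls below the minLLRValue threshold would be pruned, while strongly associated word pairs score well above it.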


Field Summary
static boolean DEFAULT_EMIT_UNIGRAMS
           
static String EMIT_UNIGRAMS
           
static String NGRAM_OUTPUT_DIRECTORY
           
static String SUBGRAM_OUTPUT_DIRECTORY
           
 
Fields inherited from class org.apache.mahout.common.AbstractJob
argMap, inputFile, inputPath, outputFile, outputPath, tempPath
 
Constructor Summary
CollocDriver()
           
 
Method Summary
static void generateAllGrams(org.apache.hadoop.fs.Path input, org.apache.hadoop.fs.Path output, org.apache.hadoop.conf.Configuration baseConf, int maxNGramSize, int minSupport, float minLLRValue, int reduceTasks)
          Generates all n-grams for the DictionaryVectorizer job.
static void main(String[] args)
           
 int run(String[] args)
           
 
Methods inherited from class org.apache.mahout.common.AbstractJob
addFlag, addInputOption, addOption, addOption, addOption, addOption, addOutputOption, buildOption, buildOption, getAnalyzerClassFromOption, getCLIOption, getConf, getDimensions, getFloat, getFloat, getGroup, getInputFile, getInputPath, getInt, getInt, getOption, getOption, getOption, getOptions, getOutputFile, getOutputPath, getOutputPath, getTempPath, getTempPath, hasOption, keyFor, maybePut, parseArguments, parseArguments, parseDirectories, prepareJob, prepareJob, prepareJob, prepareJob, setConf, setS3SafeCombinedInputPath, shouldRunNextPhase
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

SUBGRAM_OUTPUT_DIRECTORY

public static final String SUBGRAM_OUTPUT_DIRECTORY
See Also:
Constant Field Values

NGRAM_OUTPUT_DIRECTORY

public static final String NGRAM_OUTPUT_DIRECTORY
See Also:
Constant Field Values

EMIT_UNIGRAMS

public static final String EMIT_UNIGRAMS
See Also:
Constant Field Values

DEFAULT_EMIT_UNIGRAMS

public static final boolean DEFAULT_EMIT_UNIGRAMS
See Also:
Constant Field Values

Constructor Detail

CollocDriver

public CollocDriver()

Method Detail

main

public static void main(String[] args)
                 throws Exception
Throws:
Exception

run

public int run(String[] args)
        throws Exception
Throws:
Exception

generateAllGrams

public static void generateAllGrams(org.apache.hadoop.fs.Path input,
                                    org.apache.hadoop.fs.Path output,
                                    org.apache.hadoop.conf.Configuration baseConf,
                                    int maxNGramSize,
                                    int minSupport,
                                    float minLLRValue,
                                    int reduceTasks)
                             throws IOException,
                                    InterruptedException,
                                    ClassNotFoundException
Generates all n-grams for the DictionaryVectorizer job.

Parameters:
input - input path containing tokenized documents
output - output path where the n-grams, including unigrams, are written
baseConf - base job configuration
maxNGramSize - maximum n-gram size to generate; the minimum allowed value is 2
minSupport - minimum support; n-grams, including unigrams, below this value are pruned
minLLRValue - minimum LLR value; n-grams scoring below this threshold are pruned
reduceTasks - number of reducers to use
Throws:
IOException
InterruptedException
ClassNotFoundException
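
To make the roles of maxNGramSize and minSupport concrete, here is a minimal in-memory sketch of n-gram counting followed by support-based pruning. It is not the MapReduce implementation: the class NGramSketch, the method countNGrams, and the choice to emit every size from 1 up to maxNGramSize are assumptions made for illustration only.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class NGramSketch {

    // Counts every n-gram of 1..maxNGramSize tokens across all documents,
    // then drops those that occur fewer than minSupport times.
    static Map<String, Integer> countNGrams(List<List<String>> docs,
                                            int maxNGramSize,
                                            int minSupport) {
        if (maxNGramSize < 2) {
            throw new IllegalArgumentException("maxNGramSize must be at least 2");
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> tokens : docs) {
            for (int start = 0; start < tokens.size(); start++) {
                for (int n = 1; n <= maxNGramSize && start + n <= tokens.size(); n++) {
                    String gram = String.join(" ", tokens.subList(start, start + n));
                    counts.merge(gram, 1, Integer::sum);
                }
            }
        }
        // Prune unigrams and longer n-grams alike when support is too low.
        counts.values().removeIf(c -> c < minSupport);
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("new", "york", "city"),
                Arrays.asList("new", "york", "times"));
        System.out.println(countNGrams(docs, 2, 2));
    }
}
```

In this toy input only "new", "york", and "new york" appear twice, so they are the only entries that survive a minSupport of 2.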


Copyright © 2008–2014 The Apache Software Foundation. All rights reserved.