org.apache.mahout.vectorizer.tfidf
Class TFIDFConverter

java.lang.Object
  extended by org.apache.mahout.vectorizer.tfidf.TFIDFConverter

public final class TFIDFConverter
extends Object

This class converts a set of input vectors with term frequencies to TfIdf vectors. The SequenceFile input should have a WritableComparable key containing the document identifier and a VectorWritable value containing the term frequency vector. This conversion class uses multiple map/reduce jobs to convert the vectors to TfIdf format.
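
A minimal end-to-end sketch of the intended call sequence (the paths, configuration, and tuning values below are illustrative assumptions, not project defaults):

    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.mahout.common.Pair;
    import org.apache.mahout.vectorizer.tfidf.TFIDFConverter;

    public class TfIdfExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path tfVectors = new Path("/tmp/tf-vectors");    // assumed input: term-frequency vectors
        Path dfDir     = new Path("/tmp/df-count");      // working dir for document frequencies
        Path tfidfDir  = new Path("/tmp/tfidf-vectors"); // final Tf-Idf output

        // Step 1: document frequencies must be computed before processTfIdf.
        Pair<Long[], List<Path>> datasetFeatures =
            TFIDFConverter.calculateDF(tfVectors, dfDir, conf, 100);

        // Step 2: convert the term-frequency vectors to Tf-Idf vectors.
        TFIDFConverter.processTfIdf(tfVectors, tfidfDir, conf, datasetFeatures,
            1,      // minDf: keep features seen in at least one document
            99,     // maxDF: prune features present in more than 99% of documents
            2.0f,   // normPower: L2 normalization (an assumed choice)
            false,  // logNormalize
            true,   // sequentialAccessOutput
            false,  // namedVector
            1);     // numReducers
      }
    }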


Field Summary
static String FEATURE_COUNT
           
static String MAX_DF
           
static String MIN_DF
           
static String VECTOR_COUNT
           
static String WORDCOUNT_OUTPUT_FOLDER
           
 
Method Summary
static Pair<Long[],List<org.apache.hadoop.fs.Path>> calculateDF(org.apache.hadoop.fs.Path input, org.apache.hadoop.fs.Path output, org.apache.hadoop.conf.Configuration baseConf, int chunkSizeInMegabytes)
          Calculates the document frequencies of all terms from the input set of vectors in SequenceFile format.
static void processTfIdf(org.apache.hadoop.fs.Path input, org.apache.hadoop.fs.Path output, org.apache.hadoop.conf.Configuration baseConf, Pair<Long[],List<org.apache.hadoop.fs.Path>> datasetFeatures, int minDf, long maxDF, float normPower, boolean logNormalize, boolean sequentialAccessOutput, boolean namedVector, int numReducers)
          Create Term Frequency-Inverse Document Frequency (Tf-Idf) Vectors from the input set of vectors in SequenceFile format.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

VECTOR_COUNT

public static final String VECTOR_COUNT
See Also:
Constant Field Values

FEATURE_COUNT

public static final String FEATURE_COUNT
See Also:
Constant Field Values

MIN_DF

public static final String MIN_DF
See Also:
Constant Field Values

MAX_DF

public static final String MAX_DF
See Also:
Constant Field Values

WORDCOUNT_OUTPUT_FOLDER

public static final String WORDCOUNT_OUTPUT_FOLDER
See Also:
Constant Field Values
Method Detail

processTfIdf

public static void processTfIdf(org.apache.hadoop.fs.Path input,
                                org.apache.hadoop.fs.Path output,
                                org.apache.hadoop.conf.Configuration baseConf,
                                Pair<Long[],List<org.apache.hadoop.fs.Path>> datasetFeatures,
                                int minDf,
                                long maxDF,
                                float normPower,
                                boolean logNormalize,
                                boolean sequentialAccessOutput,
                                boolean namedVector,
                                int numReducers)
                         throws IOException,
                                InterruptedException,
                                ClassNotFoundException
Create Term Frequency-Inverse Document Frequency (Tf-Idf) vectors from the input set of vectors in SequenceFile format. This job uses a fixed limit on the maximum memory used by the feature chunk per node, splitting the process across multiple map/reduce jobs. Before using this method, calculateDF should be called to produce datasetFeatures.

Parameters:
input - input directory of the vectors in SequenceFile format
output - output directory where the RandomAccessSparseVectors of the documents are generated
datasetFeatures - Document frequencies information calculated by calculateDF
minDf - The minimum document frequency. Default 1
maxDF - The maximum document frequency, expressed as a percentage of the total number of vectors (an integer between 0 and 100). Can be used to prune very high-frequency features. Default 99
numReducers - The number of reducers to spawn. This also affects the possible parallelism since each reducer will typically produce a single output file containing tf-idf vectors for a subset of the documents in the corpus.
Throws:
IOException
InterruptedException
ClassNotFoundException
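
To illustrate what normPower controls, here is a standalone sketch of p-norm normalization (plain arithmetic for illustration, not Mahout's internal code path; with normPower = 2.0f each output vector is scaled to unit Euclidean length):

    // Standalone illustration of p-norm normalization, as selected by normPower.
    static double[] pNormalize(double[] v, double p) {
      double sum = 0.0;
      for (double x : v) {
        sum += Math.pow(Math.abs(x), p);
      }
      double norm = Math.pow(sum, 1.0 / p);
      double[] out = new double[v.length];
      for (int i = 0; i < v.length; i++) {
        out[i] = v[i] / norm;  // with p = 2, the result has unit Euclidean length
      }
      return out;
    }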

calculateDF

public static Pair<Long[],List<org.apache.hadoop.fs.Path>> calculateDF(org.apache.hadoop.fs.Path input,
                                                                       org.apache.hadoop.fs.Path output,
                                                                       org.apache.hadoop.conf.Configuration baseConf,
                                                                       int chunkSizeInMegabytes)
                                                                throws IOException,
                                                                       InterruptedException,
                                                                       ClassNotFoundException
Calculates the document frequencies of all terms from the input set of vectors in SequenceFile format. This job uses a fixed limit on the maximum memory used by the feature chunk per node, splitting the process across multiple map/reduce jobs.

Parameters:
input - input directory of the vectors in SequenceFile format
output - output directory where document frequencies will be stored
chunkSizeInMegabytes - the size in MB of the feature => id chunk to be kept in memory at each node during the Map/Reduce stage. It is recommended that you calculate this based on the number of cores and the free memory available per node. For example, with 2 cores and around 1 GB of spare memory, a chunk size of around 400-500 MB lets two simultaneous reducers create partial vectors without thrashing the system due to increased swapping.
Throws:
IOException
InterruptedException
ClassNotFoundException
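
A back-of-the-envelope helper for choosing chunkSizeInMegabytes along the lines suggested above (a hypothetical sketch, not part of Mahout; the 10% headroom factor is an assumption):

    // Hypothetical helper: split spare memory across concurrent reducers,
    // leaving ~10% headroom so simultaneous reducers don't trigger swapping.
    static int suggestChunkSizeMb(int coresPerNode, int spareMemoryMb) {
      return (int) (spareMemoryMb * 0.9 / coresPerNode);  // e.g. 2 cores, 1024 MB -> ~460 MB
    }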


Copyright © 2008–2014 The Apache Software Foundation. All rights reserved.