org.apache.mahout.vectorizer.common
Class PartialVectorMerger
java.lang.Object
org.apache.mahout.vectorizer.common.PartialVectorMerger
public final class PartialVectorMerger
- extends Object
This class groups a set of input vectors. The SequenceFile input should have a WritableComparable key containing the document id and a VectorWritable value containing the term frequency vector. This class can also normalize the vectors.
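The grouping step described above can be pictured with a small plain-Java sketch. It is not the Mahout implementation: it assumes each partial vector is a sparse map from term index to weight, and that partial vectors sharing a document id are combined by adding their entries. The class name PartialVectorMergeSketch and the method merge are hypothetical names chosen for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this is not the Mahout implementation.
final class PartialVectorMergeSketch {

    // Combines the sparse partial vectors (term index -> weight) of one document
    // by summing the weights of entries that share a term index.
    static Map<Integer, Double> merge(List<Map<Integer, Double>> partials) {
        Map<Integer, Double> merged = new HashMap<>();
        for (Map<Integer, Double> partial : partials) {
            for (Map.Entry<Integer, Double> entry : partial.entrySet()) {
                merged.merge(entry.getKey(), entry.getValue(), Double::sum);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Two partial vectors for the same (assumed) document id.
        Map<Integer, Double> first = new HashMap<>();
        first.put(0, 2.0);
        first.put(5, 1.0);
        Map<Integer, Double> second = new HashMap<>();
        second.put(5, 3.0);
        second.put(9, 4.0);

        System.out.println(merge(Arrays.asList(first, second)));
    }
}
```

In the real job the partial vectors arrive as VectorWritable values keyed by document id, so the grouping by key is handled by the MapReduce shuffle rather than by an in-memory list.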
Method Summary
static void
mergePartialVectors(Iterable<org.apache.hadoop.fs.Path> partialVectorPaths,
org.apache.hadoop.fs.Path output,
org.apache.hadoop.conf.Configuration baseConf,
float normPower,
boolean logNormalize,
int dimension,
boolean sequentialAccess,
boolean namedVector,
int numReducers)
Merge all the partial RandomAccessSparseVectors into the complete Document RandomAccessSparseVector
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Field Detail
NO_NORMALIZING
public static final float NO_NORMALIZING
- See Also:
- Constant Field Values
NORMALIZATION_POWER
public static final String NORMALIZATION_POWER
- See Also:
- Constant Field Values
DIMENSION
public static final String DIMENSION
- See Also:
- Constant Field Values
SEQUENTIAL_ACCESS
public static final String SEQUENTIAL_ACCESS
- See Also:
- Constant Field Values
NAMED_VECTOR
public static final String NAMED_VECTOR
- See Also:
- Constant Field Values
LOG_NORMALIZE
public static final String LOG_NORMALIZE
- See Also:
- Constant Field Values
Method Detail
mergePartialVectors
public static void mergePartialVectors(Iterable<org.apache.hadoop.fs.Path> partialVectorPaths,
org.apache.hadoop.fs.Path output,
org.apache.hadoop.conf.Configuration baseConf,
float normPower,
boolean logNormalize,
int dimension,
boolean sequentialAccess,
boolean namedVector,
int numReducers)
throws IOException,
InterruptedException,
ClassNotFoundException
- Merge all the partial RandomAccessSparseVectors into the complete Document RandomAccessSparseVector
- Parameters:
partialVectorPaths - input directory of the vectors in SequenceFile format
output - output directory where the partial vectors have to be created
baseConf - job configuration
normPower - The normalization value. Must be greater than or equal to 0 or equal to NO_NORMALIZING
dimension - cardinality of the vectors
sequentialAccess - output vectors should be optimized for sequential access
namedVector - output vectors should be named, retaining key (doc id) as a label
numReducers - The number of reducers to spawn
- Throws:
IOException
InterruptedException
ClassNotFoundException
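As a rough illustration of the normPower constraint above, the following plain-Java sketch divides a sparse vector by its p-norm. It is an assumption-laden sketch, not Mahout's code: the class NormPowerSketch, the method normalize, and the sentinel SKIP are hypothetical (the actual NO_NORMALIZING value is listed under Constant Field Values), and treating a power of 0 as "count of non-zero entries" is one common convention assumed here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; this is not the Mahout implementation.
final class NormPowerSketch {

    // Assumed stand-in for NO_NORMALIZING; the real constant value is not stated in this page.
    static final float SKIP = -1.0f;

    // Returns a copy of the vector divided by its p-norm, where p = normPower.
    static Map<Integer, Double> normalize(Map<Integer, Double> vector, float normPower) {
        if (normPower == SKIP) {
            return new HashMap<>(vector); // normalization disabled
        }
        if (normPower < 0) {
            throw new IllegalArgumentException("normPower must be >= 0 or the skip sentinel");
        }
        double norm = 0.0;
        for (double value : vector.values()) {
            double abs = Math.abs(value);
            if (Float.isInfinite(normPower)) {
                norm = Math.max(norm, abs);          // infinity norm: largest magnitude
            } else if (normPower == 0) {
                norm += (abs != 0.0) ? 1.0 : 0.0;    // assumed convention: non-zero count
            } else {
                norm += Math.pow(abs, normPower);    // accumulate |x|^p
            }
        }
        if (normPower > 0 && !Float.isInfinite(normPower)) {
            norm = Math.pow(norm, 1.0 / normPower);  // take the p-th root
        }
        Map<Integer, Double> result = new HashMap<>();
        for (Map.Entry<Integer, Double> entry : vector.entrySet()) {
            // Leave an all-zero (or empty) vector unchanged instead of dividing by zero.
            result.put(entry.getKey(), norm == 0.0 ? entry.getValue() : entry.getValue() / norm);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, Double> vector = new HashMap<>();
        vector.put(0, 3.0);
        vector.put(1, 4.0);
        System.out.println(normalize(vector, 2.0f));
    }
}
```

The sentinel check comes first so that the "skip" value is never mistaken for an invalid negative power, which mirrors the "greater than or equal to 0 or equal to NO_NORMALIZING" wording of the parameter description.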
Copyright © 2008–2014 The Apache Software Foundation. All rights reserved.