org.apache.mahout.vectorizer.common
Class PartialVectorMerger

java.lang.Object
  extended by org.apache.mahout.vectorizer.common.PartialVectorMerger

public final class PartialVectorMerger
extends Object

Groups a set of partial input vectors by document and merges them. The SequenceFile input must have a WritableComparable key containing the document id and a VectorWritable value containing the term-frequency vector. The class also normalizes the resulting vectors.
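The merge step described above can be pictured with plain Java collections. This is a minimal sketch and not Mahout's implementation: the class PartialMergeSketch and its merge method are hypothetical names, each partial vector is modelled as a map from term index to term frequency, and summing entries that share an index is an assumption about how partial vectors are combined.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in, not part of Mahout: merges partial sparse vectors of one document. */
final class PartialMergeSketch {

    private PartialMergeSketch() {}

    /**
     * Merges partial vectors (term index -> term frequency) that share a document id.
     * Entries with the same index are added together; this additive rule is an assumption.
     */
    static Map<Integer, Double> merge(List<Map<Integer, Double>> partials) {
        Map<Integer, Double> merged = new HashMap<>();
        for (Map<Integer, Double> partial : partials) {
            for (Map.Entry<Integer, Double> entry : partial.entrySet()) {
                // Insert the value, or add it to the value already stored at this index.
                merged.merge(entry.getKey(), entry.getValue(), Double::sum);
            }
        }
        return merged;
    }
}
```

In the sketch, two partial vectors {0: 2.0, 5: 1.0} and {5: 3.0, 9: 4.0} for the same document merge into a single vector with three entries, where index 5 holds the sum of both contributions.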


Field Summary
static String DIMENSION
           
static String LOG_NORMALIZE
           
static String NAMED_VECTOR
           
static float NO_NORMALIZING
           
static String NORMALIZATION_POWER
           
static String SEQUENTIAL_ACCESS
           
 
Method Summary
static void mergePartialVectors(Iterable<org.apache.hadoop.fs.Path> partialVectorPaths, org.apache.hadoop.fs.Path output, org.apache.hadoop.conf.Configuration baseConf, float normPower, boolean logNormalize, int dimension, boolean sequentialAccess, boolean namedVector, int numReducers)
          Merges all the partial RandomAccessSparseVectors into the complete document RandomAccessSparseVector.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

NO_NORMALIZING

public static final float NO_NORMALIZING
See Also:
Constant Field Values

NORMALIZATION_POWER

public static final String NORMALIZATION_POWER
See Also:
Constant Field Values

DIMENSION

public static final String DIMENSION
See Also:
Constant Field Values

SEQUENTIAL_ACCESS

public static final String SEQUENTIAL_ACCESS
See Also:
Constant Field Values

NAMED_VECTOR

public static final String NAMED_VECTOR
See Also:
Constant Field Values

LOG_NORMALIZE

public static final String LOG_NORMALIZE
See Also:
Constant Field Values

Method Detail

mergePartialVectors

public static void mergePartialVectors(Iterable<org.apache.hadoop.fs.Path> partialVectorPaths,
                                       org.apache.hadoop.fs.Path output,
                                       org.apache.hadoop.conf.Configuration baseConf,
                                       float normPower,
                                       boolean logNormalize,
                                       int dimension,
                                       boolean sequentialAccess,
                                       boolean namedVector,
                                       int numReducers)
                                throws IOException,
                                       InterruptedException,
                                       ClassNotFoundException
Merges all the partial RandomAccessSparseVectors into the complete document RandomAccessSparseVector.

Parameters:
partialVectorPaths - input directories of the partial vectors in SequenceFile format
output - output directory where the merged vectors are written
baseConf - job configuration
normPower - the normalization power; must be greater than or equal to 0, or equal to NO_NORMALIZING to skip normalization
logNormalize - whether to apply log normalization
dimension - cardinality of the vectors
sequentialAccess - output vectors should be optimized for sequential access
namedVector - output vectors should be named, retaining key (doc id) as a label
numReducers - the number of reducers to spawn
Throws:
IOException
InterruptedException
ClassNotFoundException
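To make the normPower parameter concrete, the sketch below shows what p-norm normalization of a dense vector looks like in plain Java. It is an illustration under stated assumptions, not Mahout's code: NormalizeSketch, norm, normalize, and the SKIP sentinel are hypothetical names, the value of SKIP is a placeholder (the actual NO_NORMALIZING value is listed under Constant Field Values and is not given on this page), and the treatment of p = 0 and p = infinity follows the usual mathematical conventions. Log normalization is not covered.

```java
/** Hypothetical illustration of p-norm normalization; not part of Mahout. */
final class NormalizeSketch {

    /** Placeholder sentinel meaning "skip normalization" (assumed value, for this sketch only). */
    static final double SKIP = -1.0;

    private NormalizeSketch() {}

    /** Returns the p-norm of v: count of non-zeros for p = 0, max |x| for p = infinity. */
    static double norm(double[] v, double p) {
        if (p < 0.0) {
            throw new IllegalArgumentException("p must be >= 0");
        }
        double acc = 0.0;
        for (double x : v) {
            double a = Math.abs(x);
            if (p == 0.0) {
                if (a != 0.0) {
                    acc += 1.0;            // count non-zero entries
                }
            } else if (Double.isInfinite(p)) {
                acc = Math.max(acc, a);    // largest absolute value
            } else {
                acc += Math.pow(a, p);     // sum of |x|^p
            }
        }
        return (p == 0.0 || Double.isInfinite(p)) ? acc : Math.pow(acc, 1.0 / p);
    }

    /** Returns a copy of v divided by its p-norm, or an unchanged copy when p equals SKIP. */
    static double[] normalize(double[] v, double p) {
        if (p == SKIP) {
            return v.clone();
        }
        double n = norm(v, p);
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = (n == 0.0) ? 0.0 : v[i] / n;  // leave an all-zero vector as zeros
        }
        return out;
    }
}
```

With p = 2 the vector (3, 4) has norm 5 and is scaled to roughly (0.6, 0.8); with p = 1 the same vector is divided by 7, so its entries sum to 1.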


Copyright © 2008–2014 The Apache Software Foundation. All rights reserved.