org.apache.mahout.vectorizer
Class DocumentProcessor

java.lang.Object
  extended by org.apache.mahout.vectorizer.DocumentProcessor

public final class DocumentProcessor
extends Object

This class converts a set of input documents in SequenceFile format into StringTuples. The SequenceFile input should have a Text key containing the unique document identifier and a Text value containing the whole document. The document should be stored in UTF-8 encoding, which Hadoop recognizes. The given Analyzer is used to process each document into tokens.


Field Summary
static String ANALYZER_CLASS
           
static String TOKENIZED_DOCUMENT_OUTPUT_FOLDER
           
 
Method Summary
static void tokenizeDocuments(org.apache.hadoop.fs.Path input, Class<? extends org.apache.lucene.analysis.Analyzer> analyzerClass, org.apache.hadoop.fs.Path output, org.apache.hadoop.conf.Configuration baseConf)
          Converts the input documents into token arrays, each stored as a StringTuple. The input documents have to be in SequenceFile format.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

TOKENIZED_DOCUMENT_OUTPUT_FOLDER

public static final String TOKENIZED_DOCUMENT_OUTPUT_FOLDER
See Also:
Constant Field Values

ANALYZER_CLASS

public static final String ANALYZER_CLASS
See Also:
Constant Field Values
Method Detail

tokenizeDocuments

public static void tokenizeDocuments(org.apache.hadoop.fs.Path input,
                                     Class<? extends org.apache.lucene.analysis.Analyzer> analyzerClass,
                                     org.apache.hadoop.fs.Path output,
                                     org.apache.hadoop.conf.Configuration baseConf)
                              throws IOException,
                                     InterruptedException,
                                     ClassNotFoundException
Converts the input documents into token arrays, each stored as a StringTuple. The input documents have to be in SequenceFile format.

Parameters:
input - input directory of the documents in SequenceFile format
output - output directory where the StringTuple token array of each document is to be created
analyzerClass - the Lucene Analyzer class used to tokenize the UTF-8 text
baseConf - the base Hadoop Configuration for the tokenization job
Throws:
IOException
InterruptedException
ClassNotFoundException
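
The per-document mapping described above (document identifier and whole document in, identifier and ordered token list out) can be modeled in plain Java. This is a minimal sketch under stated assumptions, not Mahout's implementation: the class `TokenizeSketch` and its methods `analyze` and `tokenizeAll` are hypothetical names, an in-memory map stands in for the SequenceFile, a list of strings stands in for a StringTuple, and a lowercase-and-split rule stands in for the Lucene Analyzer. It does not call Mahout, Hadoop, or Lucene.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Plain-Java model of the mapping only; no Mahout, Hadoop, or Lucene classes are used.
final class TokenizeSketch {

    // Hypothetical stand-in for a Lucene Analyzer: lowercases the text and
    // splits on runs of characters that are neither letters nor digits.
    static List<String> analyze(String document) {
        List<String> tokens = new ArrayList<>();
        for (String piece : document.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Models the per-record mapping: (document id, whole document) in,
    // (document id, ordered token list) out. The list plays the role of a StringTuple.
    static Map<String, List<String>> tokenizeAll(Map<String, String> documents) {
        Map<String, List<String>> tokenized = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : documents.entrySet()) {
            tokenized.put(entry.getKey(), analyze(entry.getValue()));
        }
        return tokenized;
    }

    public static void main(String[] args) {
        Map<String, String> documents = new LinkedHashMap<>();
        documents.put("doc-1", "Mahout tokenizes UTF-8 text.");
        documents.put("doc-2", "Each document keeps its identifier.");
        System.out.println(tokenizeAll(documents));
    }
}
```

The document identifier passes through unchanged as the key, so downstream steps can still associate each token list with its source document. A real Analyzer may tokenize differently from this stand-in, for example by removing stop words.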


Copyright © 2008–2014 The Apache Software Foundation. All rights reserved.