org.apache.mahout.text
Class SequenceFilesFromDirectory
java.lang.Object
org.apache.hadoop.conf.Configured
org.apache.mahout.common.AbstractJob
org.apache.mahout.text.SequenceFilesFromDirectory
- All Implemented Interfaces:
- org.apache.hadoop.conf.Configurable, org.apache.hadoop.util.Tool
public class SequenceFilesFromDirectory
- extends AbstractJob
Converts a directory of text documents into SequenceFiles of a specified chunk size. This class takes a
parent directory containing subfolders of text documents, recursively reads the files, and creates
SequenceFiles of docid => content. The docid is the relative path of the
document from the parent directory, prepended with a specified prefix. You can also specify the input encoding
of the text files. The content of the output SequenceFiles is encoded as UTF-8 text.
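The key and encoding rules described above can be illustrated with a small standalone sketch. It uses only the JDK and is not the Mahout implementation: the class name DocIdSketch, the helper names docId and toUtf8, and the choice of "/" to join the prefix and path elements are assumptions made for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative only: mimics the docid and encoding rules described above.
class DocIdSketch {

    // Builds a key from the prefix plus the file's path relative to the parent directory.
    // Joining the parts with "/" is an assumption of this sketch.
    static String docId(Path parentDir, Path file, String prefix) {
        Path relative = parentDir.relativize(file);
        StringBuilder key = new StringBuilder(prefix);
        for (Path part : relative) {
            key.append('/').append(part.toString());
        }
        return key.toString();
    }

    // Decodes the raw bytes with the given input charset and re-encodes them as UTF-8.
    static byte[] toUtf8(byte[] raw, Charset inputCharset) {
        String text = new String(raw, inputCharset);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Path parent = Paths.get("corpus");
        Path doc = Paths.get("corpus", "sports", "a.txt");
        System.out.println(docId(parent, doc, "news")); // prints news/sports/a.txt
    }
}
```

In this sketch a document at corpus/sports/a.txt with prefix "news" gets the key news/sports/a.txt, and a file read as ISO-8859-1 is stored as UTF-8 bytes regardless of its original encoding.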
Method Summary |
protected void |
addOptions()
Override this method to add options to the command line of the SequenceFilesFromDirectory job. |
static void |
main(String[] args)
|
protected Map<String,String> |
parseOptions()
Override this method in order to parse your additional options from the command line. |
int |
run(String[] args)
|
Methods inherited from class org.apache.mahout.common.AbstractJob |
addFlag, addInputOption, addOption, addOption, addOption, addOption, addOutputOption, buildOption, buildOption, getAnalyzerClassFromOption, getCLIOption, getConf, getDimensions, getFloat, getFloat, getGroup, getInputFile, getInputPath, getInt, getInt, getOption, getOption, getOption, getOptions, getOutputFile, getOutputPath, getOutputPath, getTempPath, getTempPath, hasOption, keyFor, maybePut, parseArguments, parseArguments, parseDirectories, prepareJob, prepareJob, prepareJob, prepareJob, setConf, setS3SafeCombinedInputPath, shouldRunNextPhase |
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
FILE_FILTER_CLASS_OPTION
public static final String[] FILE_FILTER_CLASS_OPTION
KEY_PREFIX_OPTION
public static final String[] KEY_PREFIX_OPTION
BASE_INPUT_PATH
public static final String BASE_INPUT_PATH
- See Also:
- Constant Field Values
SequenceFilesFromDirectory
public SequenceFilesFromDirectory()
main
public static void main(String[] args)
throws Exception
- Throws:
Exception
run
public int run(String[] args)
throws Exception
- Throws:
Exception
addOptions
protected void addOptions()
- Override this method to add options to the command line of the SequenceFilesFromDirectory job.
Do not forget to call super.addOptions(); otherwise the standard options (input/output dirs etc.) will not be available.
parseOptions
protected Map<String,String> parseOptions()
- Override this method to parse your additional options from the command line. Do not forget to call
super.parseOptions(); otherwise the standard options (input/output dirs etc.) will not be available.
- Returns:
- Map of options
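The reminder to call the superclass method in addOptions and parseOptions can be shown with a minimal standalone sketch. The classes below are hypothetical stand-ins and do not use Mahout's AbstractJob or its option API. BaseJobSketch, WithSuper, WithoutSuper, OptionsDemo, and the option name "myOption" are all invented for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a job base class that registers standard options.
class BaseJobSketch {
    protected final Map<String, String> registered = new LinkedHashMap<>();

    protected void addOptions() {
        registered.put("input", "input directory");
        registered.put("output", "output directory");
    }
}

// Calls the superclass method first, so the standard options are kept.
class WithSuper extends BaseJobSketch {
    @Override
    protected void addOptions() {
        super.addOptions();
        registered.put("myOption", "an extra option");
    }
}

// Forgets the superclass call, so only the extra option is registered.
class WithoutSuper extends BaseJobSketch {
    @Override
    protected void addOptions() {
        registered.put("myOption", "an extra option");
    }
}

class OptionsDemo {
    // Runs addOptions() on the job and returns the names that were registered.
    static Set<String> optionNames(BaseJobSketch job) {
        job.addOptions();
        return job.registered.keySet();
    }

    public static void main(String[] args) {
        System.out.println(optionNames(new WithSuper()));    // prints [input, output, myOption]
        System.out.println(optionNames(new WithoutSuper())); // prints [myOption]
    }
}
```

The same pattern applies to parseOptions: a subclass would typically start from the map returned by the superclass call and add its own entries to it.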
Copyright © 2008–2014 The Apache Software Foundation. All rights reserved.