org.apache.mahout.utils
Class SplitInput

java.lang.Object
  extended by org.apache.hadoop.conf.Configured
      extended by org.apache.mahout.common.AbstractJob
          extended by org.apache.mahout.utils.SplitInput
All Implemented Interfaces:
org.apache.hadoop.conf.Configurable, org.apache.hadoop.util.Tool

public class SplitInput
extends AbstractJob

A utility for splitting input into training and test sets in order to perform cross-validation. It accepts files in the input format used by the Bayes classifiers, any other text files that have one item per line, and SequenceFiles (key/value).

This class can be used to split directories of files or individual files into training and test sets using a number of different methods.

When executed via splitDirectory(Path) or splitFile(Path), the lines read from one or more input files are written to files of the same name in the directories specified by the setTestOutputDirectory(Path) and setTrainingOutputDirectory(Path) methods.

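The composition of the test set is determined using one of the following approaches:

A contiguous block of items can be taken from the input using the setTestSplitSize(int) or setTestSplitPct(int) methods, with setSplitLocation(int) controlling where in the input the block starts.

A random sample of items can be taken from the input using the setTestRandomSelectionSize(int) or setTestRandomSelectionPct(int) methods.

Any one of these size-setting methods can be used to control the size of the test set. If more than one of them is called, a runtime exception will be thrown at execution time.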
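The rule that only one size-setting method may be used can be pictured with a small standalone sketch. The class SplitOptions, its fields, and the -1 "not set" sentinel are hypothetical and are not taken from SplitInput; the sketch only shows the kind of check that rejects a configuration in which more than one of setTestSplitSize, setTestSplitPct, setTestRandomSelectionSize, or setTestRandomSelectionPct has been used.

```java
import java.util.stream.IntStream;

/** Hypothetical holder for the four test-set sizing options; -1 means "not set". */
final class SplitOptions {
    int testSplitSize = -1;
    int testSplitPct = -1;
    int testRandomSelectionSize = -1;
    int testRandomSelectionPct = -1;

    /** Counts how many of the four sizing options have been set. */
    int countSet() {
        return (int) IntStream.of(testSplitSize, testSplitPct,
                        testRandomSelectionSize, testRandomSelectionPct)
                .filter(v -> v >= 0)
                .count();
    }

    /** Throws IllegalArgumentException if more than one sizing option is set. */
    void requireAtMostOne() {
        int set = countSet();
        if (set > 1) {
            throw new IllegalArgumentException(
                    "Only one test-set sizing option may be set, but found " + set);
        }
    }

    public static void main(String[] args) {
        SplitOptions ok = new SplitOptions();
        ok.testSplitPct = 20;
        ok.requireAtMostOne(); // accepted: a single option is set

        SplitOptions bad = new SplitOptions();
        bad.testSplitPct = 20;
        bad.testRandomSelectionSize = 100;
        try {
            bad.requireAtMostOne();
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

Checking all options in one place keeps the failure close to execution time, which matches the behaviour described above, rather than failing inside an individual setter.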

The setSplitLocation(int) method is passed an integer from 0 to 100 (inclusive), which is translated into the position of the start of the test data within the input file.

Given:

The start of the split will always be adjusted forwards in order to ensure that the desired test set size is allocated. Split location has no effect if random sampling is employed.
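The relationship between input size, test set size, and split location can be sketched with plain integer arithmetic. This is not the Mahout implementation: the class SplitMath and its methods are hypothetical, the figures (1000 lines, a 10 percent test set) are illustrative values chosen for the sketch, and two details are assumptions: percentages are rounded up, and a start position that would push the test block past the end of the file is moved so that the whole block still fits.

```java
/** Standalone arithmetic sketch; not part of the Mahout API. */
final class SplitMath {
    /** Test-set size for a percentage of the input, rounded up (rounding is an assumption). */
    static int testSize(int lineCount, int testSplitPct) {
        return (int) (((long) lineCount * testSplitPct + 99) / 100);
    }

    /** First line (0-based) of the test block for a split location of 0..100. */
    static int testStart(int lineCount, int testSize, int splitLocation) {
        int start = (int) ((long) lineCount * splitLocation / 100);
        // If the block would run past the end of the file, move the start so
        // that the full test set still fits (assumed adjustment).
        if (start + testSize > lineCount) {
            start = lineCount - testSize;
        }
        return Math.max(start, 0);
    }

    public static void main(String[] args) {
        int lines = 1000;
        int size = testSize(lines, 10);
        System.out.println(size);                          // 100
        System.out.println(testStart(lines, size, 0));     // 0
        System.out.println(testStart(lines, size, 50));    // 500
        System.out.println(testStart(lines, size, 100));   // 900
    }
}
```

With a split location of 100 the unadjusted start would be line 1000, leaving no room for the test set, so the sketch moves the start to line 900 and the test data comes from the end of the file.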


Nested Class Summary
static interface SplitInput.SplitCallback
          Used to pass information back to a caller once a file has been split without the need for a data object
 
Field Summary
 
Fields inherited from class org.apache.mahout.common.AbstractJob
argMap, inputFile, inputPath, outputFile, outputPath, tempPath
 
Constructor Summary
SplitInput()
           
 
Method Summary
static int countLines(org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path inputFile, Charset charset)
          Count the lines in the specified file, as returned by BufferedReader.readLine()
 SplitInput.SplitCallback getCallback()
           
 Charset getCharset()
           
 org.apache.hadoop.fs.Path getInputDirectory()
           
 int getSplitLocation()
           
 org.apache.hadoop.fs.Path getTestOutputDirectory()
           
 int getTestRandomSelectionPct()
           
 int getTestRandomSelectionSize()
           
 int getTestSplitPct()
           
 int getTestSplitSize()
           
 org.apache.hadoop.fs.Path getTrainingOutputDirectory()
           
static void main(String[] args)
           
 int run(String[] args)
           
 void setCallback(SplitInput.SplitCallback callback)
          Sets the callback used to inform the caller that an input file has been successfully split
 void setCharset(Charset charset)
          Set the charset used to read and write files
 void setInputDirectory(org.apache.hadoop.fs.Path inputDir)
          Set the directory from which input data will be read when the splitDirectory() method is invoked
 void setKeepPct(int keepPct)
          Sets the percentage of the input data to keep in a map reduce split input job
 void setMapRedOutputDirectory(org.apache.hadoop.fs.Path mapRedOutputDirectory)
           
 void setSplitLocation(int splitLocation)
          Set the location of the start of the test/training data split.
 void setTestOutputDirectory(org.apache.hadoop.fs.Path testOutputDir)
          Set the directory to which test data will be written.
 void setTestRandomSelectionPct(int randomSelectionPct)
          Sets the number of random input samples that will be saved to the test set, as a percentage of the size of the input set.
 void setTestRandomSelectionSize(int testRandomSelectionSize)
          Sets the number of random input samples that will be saved to the test set.
 void setTestSplitPct(int testSplitPct)
          Sets the percentage of the input data to allocate to the test split
 void setTestSplitSize(int testSplitSize)
           
 void setTrainingOutputDirectory(org.apache.hadoop.fs.Path trainingOutputDir)
          Set the directory to which training data will be written.
 void setUseMapRed(boolean useMapRed)
          Set to true to use map reduce to split the input
 void splitDirectory()
          Perform a split on the directory specified by setInputDirectory(Path) by calling splitFile(Path) on each file found within that directory.
 void splitDirectory(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path inputDir)
           
 void splitDirectory(org.apache.hadoop.fs.Path inputDir)
          Perform a split on the specified directory by calling splitFile(Path) on each file found within that directory.
 void splitFile(org.apache.hadoop.fs.Path inputFile)
          Perform a split on the specified input file.
 void validate()
          Validates that the current instance is in a consistent state
 
Methods inherited from class org.apache.mahout.common.AbstractJob
addFlag, addInputOption, addOption, addOption, addOption, addOption, addOutputOption, buildOption, buildOption, getAnalyzerClassFromOption, getCLIOption, getConf, getDimensions, getFloat, getFloat, getGroup, getInputFile, getInputPath, getInt, getInt, getOption, getOption, getOption, getOptions, getOutputFile, getOutputPath, getOutputPath, getTempPath, getTempPath, hasOption, keyFor, maybePut, parseArguments, parseArguments, parseDirectories, prepareJob, prepareJob, prepareJob, prepareJob, setConf, setS3SafeCombinedInputPath, shouldRunNextPhase
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

SplitInput

public SplitInput()
Method Detail

run

public int run(String[] args)
        throws Exception
Throws:
Exception

main

public static void main(String[] args)
                 throws Exception
Throws:
Exception

splitDirectory

public void splitDirectory()
                    throws IOException,
                           ClassNotFoundException,
                           InterruptedException
Perform a split on the directory specified by setInputDirectory(Path) by calling splitFile(Path) on each file found within that directory.

Throws:
IOException
ClassNotFoundException
InterruptedException

splitDirectory

public void splitDirectory(org.apache.hadoop.fs.Path inputDir)
                    throws IOException,
                           ClassNotFoundException,
                           InterruptedException
Perform a split on the specified directory by calling splitFile(Path) on each file found within that directory.

Throws:
IOException
ClassNotFoundException
InterruptedException

splitDirectory

public void splitDirectory(org.apache.hadoop.conf.Configuration conf,
                           org.apache.hadoop.fs.Path inputDir)
                    throws IOException,
                           ClassNotFoundException,
                           InterruptedException
Throws:
IOException
ClassNotFoundException
InterruptedException

splitFile

public void splitFile(org.apache.hadoop.fs.Path inputFile)
               throws IOException
Perform a split on the specified input file. Results will be written to files of the same name in the specified training and test output directories. The validate() method is called prior to executing the split.
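The "files of the same name" behaviour can be illustrated with a local-filesystem sketch. It uses java.nio.file instead of the Hadoop FileSystem and Path types, reads the whole file into memory for brevity, and takes the test block position as explicit arguments; the class LocalSplitSketch and its method are hypothetical and do not reproduce the Mahout implementation.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Local-filesystem sketch of a contiguous split; not the Mahout implementation. */
final class LocalSplitSketch {
    /**
     * Writes lines [testStart, testStart + testSize) of inputFile to testDir and
     * all remaining lines to trainingDir, reusing the input file's own name.
     */
    static void splitFile(Path inputFile, Path trainingDir, Path testDir,
                          int testStart, int testSize, Charset charset) throws IOException {
        List<String> lines = Files.readAllLines(inputFile, charset);
        if (testStart < 0 || testSize < 0 || testStart + testSize > lines.size()) {
            throw new IllegalArgumentException("Test block does not fit in the input file");
        }
        int testEnd = testStart + testSize;
        List<String> testLines = lines.subList(testStart, testEnd);
        List<String> trainingLines = new ArrayList<>(lines.subList(0, testStart));
        trainingLines.addAll(lines.subList(testEnd, lines.size()));

        Path name = inputFile.getFileName();
        Files.write(trainingDir.resolve(name), trainingLines, charset);
        Files.write(testDir.resolve(name), testLines, charset);
    }

    public static void main(String[] args) throws IOException {
        Path work = Files.createTempDirectory("split-sketch");
        Path trainingDir = Files.createDirectory(work.resolve("training"));
        Path testDir = Files.createDirectory(work.resolve("test"));
        Path input = Files.write(work.resolve("data.txt"),
                List.of("a", "b", "c", "d", "e"), StandardCharsets.UTF_8);

        splitFile(input, trainingDir, testDir, 1, 2, StandardCharsets.UTF_8);
        System.out.println(Files.readAllLines(testDir.resolve("data.txt")));     // [b, c]
        System.out.println(Files.readAllLines(trainingDir.resolve("data.txt"))); // [a, d, e]
    }
}
```

Both output directories are created before the call, mirroring the requirement that validation fails when an output directory does not exist.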

Throws:
IOException

getTestSplitSize

public int getTestSplitSize()

setTestSplitSize

public void setTestSplitSize(int testSplitSize)

getTestSplitPct

public int getTestSplitPct()

setTestSplitPct

public void setTestSplitPct(int testSplitPct)
Sets the percentage of the input data to allocate to the test split

Parameters:
testSplitPct - a value between 0 and 100 inclusive.

setKeepPct

public void setKeepPct(int keepPct)
Sets the percentage of the input data to keep in a map reduce split input job

Parameters:
keepPct - a value between 0 and 100 inclusive.

setUseMapRed

public void setUseMapRed(boolean useMapRed)
Set to true to use map reduce to split the input

Parameters:
useMapRed - a boolean to indicate whether map reduce should be used

setMapRedOutputDirectory

public void setMapRedOutputDirectory(org.apache.hadoop.fs.Path mapRedOutputDirectory)

getSplitLocation

public int getSplitLocation()

setSplitLocation

public void setSplitLocation(int splitLocation)
Set the location of the start of the test/training data split, expressed as a percentage of the lines in the input. For example, 0 indicates that the test data should be taken from the start of the file, 100 indicates that it should be taken from the end of the file, and 25 indicates that it should start one quarter of the way into the file.

This option is only relevant in cases where random selection is not employed.

Parameters:
splitLocation - a value between 0 and 100 inclusive.

getCharset

public Charset getCharset()

setCharset

public void setCharset(Charset charset)
Set the charset used to read and write files


getInputDirectory

public org.apache.hadoop.fs.Path getInputDirectory()

setInputDirectory

public void setInputDirectory(org.apache.hadoop.fs.Path inputDir)
Set the directory from which input data will be read when the splitDirectory() method is invoked


getTrainingOutputDirectory

public org.apache.hadoop.fs.Path getTrainingOutputDirectory()

setTrainingOutputDirectory

public void setTrainingOutputDirectory(org.apache.hadoop.fs.Path trainingOutputDir)
Set the directory to which training data will be written.


getTestOutputDirectory

public org.apache.hadoop.fs.Path getTestOutputDirectory()

setTestOutputDirectory

public void setTestOutputDirectory(org.apache.hadoop.fs.Path testOutputDir)
Set the directory to which test data will be written.


getCallback

public SplitInput.SplitCallback getCallback()

setCallback

public void setCallback(SplitInput.SplitCallback callback)
Sets the callback used to inform the caller that an input file has been successfully split


getTestRandomSelectionSize

public int getTestRandomSelectionSize()

setTestRandomSelectionSize

public void setTestRandomSelectionSize(int testRandomSelectionSize)
Sets the number of random input samples that will be saved to the test set.


getTestRandomSelectionPct

public int getTestRandomSelectionPct()

setTestRandomSelectionPct

public void setTestRandomSelectionPct(int randomSelectionPct)
Sets the number of random input samples that will be saved to the test set, as a percentage of the size of the input set.

Parameters:
randomSelectionPct - a value between 0 and 100 inclusive.
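Random selection can be sketched as choosing a fixed number of distinct item positions and sending those items to the test set while everything else goes to the training set. This page does not show which sampler Mahout uses, so the sketch substitutes a seeded shuffle from java.util; the class RandomSelectionSketch, its methods, and the round-up of the percentage are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Illustrative random selection by shuffling indices; not the Mahout sampler. */
final class RandomSelectionSketch {
    /** Number of test items for a percentage of the input, rounded up (an assumption). */
    static int sampleSize(int itemCount, int randomSelectionPct) {
        return (int) (((long) itemCount * randomSelectionPct + 99) / 100);
    }

    /** Marks sampleSize distinct positions in [0, itemCount) as test items. */
    static BitSet chooseTestItems(int itemCount, int sampleSize, Random random) {
        if (sampleSize < 0 || sampleSize > itemCount) {
            throw new IllegalArgumentException("Sample size must be between 0 and " + itemCount);
        }
        List<Integer> indices = new ArrayList<>(itemCount);
        for (int i = 0; i < itemCount; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, random);
        BitSet selected = new BitSet(itemCount);
        for (int i = 0; i < sampleSize; i++) {
            selected.set(indices.get(i));
        }
        return selected;
    }

    public static void main(String[] args) {
        int items = 200;
        int size = sampleSize(items, 15);
        BitSet test = chooseTestItems(items, size, new Random(42L));
        // prints "30 test items, 170 training items"
        System.out.println(test.cardinality() + " test items, "
                + (items - test.cardinality()) + " training items");
    }
}
```

Because the positions are chosen across the whole input, a split location has nothing to act on here, which is consistent with the note that it is ignored when random selection is used.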

validate

public void validate()
              throws IOException
Validates that the current instance is in a consistent state

Throws:
IllegalArgumentException - if settings violate class invariants.
IOException - if output directories do not exist or are not directories.
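The two failure modes listed above can be sketched as a pair of small checks. The class ValidationSketch and its helper methods are hypothetical, and the directory check uses java.nio.file rather than the Hadoop FileSystem API; the exact invariants enforced by validate() are not listed on this page, so the 0 to 100 range check stands in for them as an example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Illustrative pre-split checks; not the actual body of validate(). */
final class ValidationSketch {
    /** Rejects percentage settings outside 0..100 inclusive. */
    static void requirePercentage(String name, int value) {
        if (value < 0 || value > 100) {
            throw new IllegalArgumentException(
                    name + " must be between 0 and 100 inclusive, got " + value);
        }
    }

    /** Rejects output locations that are missing or are not directories. */
    static void requireDirectory(String name, Path dir) throws IOException {
        if (dir == null || !Files.isDirectory(dir)) {
            throw new IOException(name + " does not exist or is not a directory: " + dir);
        }
    }

    public static void main(String[] args) throws IOException {
        requirePercentage("testSplitPct", 20);
        Path out = Files.createTempDirectory("validate-sketch");
        requireDirectory("testOutputDirectory", out);
        try {
            requireDirectory("trainingOutputDirectory", out.resolve("missing"));
        } catch (IOException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```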

countLines

public static int countLines(org.apache.hadoop.fs.FileSystem fs,
                             org.apache.hadoop.fs.Path inputFile,
                             Charset charset)
                      throws IOException
Count the lines in the specified file, as returned by BufferedReader.readLine()

Parameters:
fs - the file system on which the input file resides
inputFile - the file whose lines will be counted
charset - the charset of the file to read
Returns:
the number of lines in the input file.
Throws:
IOException - if there is a problem opening or reading the file.
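Counting lines "as returned by BufferedReader.readLine()" means that \n, \r, and \r\n all terminate a line and that a trailing terminator does not add an empty final line. The sketch below shows that behaviour on an arbitrary java.io.Reader; it does not open files through the Hadoop FileSystem, and the class LineCountSketch is a hypothetical stand-in for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/** Counts lines the way BufferedReader.readLine() delimits them. */
final class LineCountSketch {
    static int countLines(Reader source) throws IOException {
        int count = 0;
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader reader = new BufferedReader(source)) {
            while (reader.readLine() != null) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(countLines(new StringReader("a\nb\r\nc"))); // 3
        System.out.println(countLines(new StringReader("a\nb\n")));    // 2
        System.out.println(countLines(new StringReader("")));          // 0
    }
}
```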


Copyright © 2008–2014 The Apache Software Foundation. All rights reserved.