org.apache.mahout.utils
Class SplitInputJob

java.lang.Object
  extended by org.apache.mahout.utils.SplitInputJob

public final class SplitInputJob
extends Object


Nested Class Summary
static class SplitInputJob.SplitInputComparator
          Comparator that randomly permutes key-value pairs
static class SplitInputJob.SplitInputMapper
          Mapper which downsamples the input by downsamplingFactor
static class SplitInputJob.SplitInputReducer
          Reducer which uses MultipleOutputs to randomly allocate key-value pairs between test and training outputs
 
Method Summary
static void run(org.apache.hadoop.conf.Configuration initialConf, org.apache.hadoop.fs.Path inputPath, org.apache.hadoop.fs.Path outputPath, int keepPct, float randomSelectionPercent)
          Run job to downsample, randomly permute and split data into test and training sets.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Method Detail

run

public static void run(org.apache.hadoop.conf.Configuration initialConf,
                       org.apache.hadoop.fs.Path inputPath,
                       org.apache.hadoop.fs.Path outputPath,
                       int keepPct,
                       float randomSelectionPercent)
                throws IOException,
                       ClassNotFoundException,
                       InterruptedException
Runs a job to downsample, randomly permute, and split data into test and training sets. The job takes a SequenceFile as input and writes two SequenceFiles, test-r-00000 and training-r-00000, which contain the test and training sets respectively.

Parameters:
initialConf - initial Hadoop configuration for the job
inputPath - path to input data SequenceFile
outputPath - path for output data SequenceFiles
keepPct - percentage of key-value pairs in the input to keep; the rest are discarded
randomSelectionPercent - percentage of key-value pairs to allocate to the test set; the remainder are allocated to the training set
Throws:
IOException
ClassNotFoundException
InterruptedException
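
Example (illustrative only): the following in-memory sketch mirrors the three stages described above (downsample, randomly permute, split) so that the roles of keepPct and randomSelectionPercent are easier to see. It is not Mahout's implementation, which runs as a MapReduce job over SequenceFiles. The class SplitSketch, its split method, the seed parameter, and the choice of an independent random draw per record are all assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of Mahout: imitates the documented
// semantics of keepPct and randomSelectionPercent on an in-memory list.
final class SplitSketch {

    /** Holds the two output partitions. */
    static final class Split<T> {
        final List<T> test = new ArrayList<>();
        final List<T> training = new ArrayList<>();
    }

    static <T> Split<T> split(List<T> input, int keepPct,
                              float randomSelectionPercent, long seed) {
        Random rng = new Random(seed);

        // Stage 1: downsample. Each record is kept with probability
        // keepPct / 100 (assumption: independent draw per record).
        List<T> kept = new ArrayList<>();
        for (T item : input) {
            if (rng.nextInt(100) < keepPct) {
                kept.add(item);
            }
        }

        // Stage 2: randomly permute the surviving records.
        Collections.shuffle(kept, rng);

        // Stage 3: send roughly randomSelectionPercent percent of the
        // records to the test set and the remainder to the training set.
        Split<T> out = new Split<>();
        for (T item : kept) {
            if (rng.nextFloat() * 100 < randomSelectionPercent) {
                out.test.add(item);
            } else {
                out.training.add(item);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            data.add(i);
        }
        // Keep about half of the records, then put about 20% of those in test.
        Split<Integer> s = split(data, 50, 20.0f, 42L);
        System.out.println("test=" + s.test.size()
                + " training=" + s.training.size());
    }
}
```

In this sketch, keepPct = 100 keeps every record and randomSelectionPercent = 0 sends all kept records to the training set. Because the draws are random, the sizes of the two sets only approximate the requested percentages.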


Copyright © 2008–2014 The Apache Software Foundation. All rights reserved.