org.apache.mahout.classifier.df.mapreduce
Class Builder

java.lang.Object
  extended by org.apache.mahout.classifier.df.mapreduce.Builder
Direct Known Subclasses:
InMemBuilder, PartialBuilder

public abstract class Builder
extends Object

Base class for MapReduce DecisionForest builders. It stores the parameters common to the MapReduce implementations.
Child classes must implement at least the abstract methods configureJob(Job) and parseOutput(Job).


Constructor Summary
protected Builder(TreeBuilder treeBuilder, org.apache.hadoop.fs.Path dataPath, org.apache.hadoop.fs.Path datasetPath, Long seed, org.apache.hadoop.conf.Configuration conf)
           
 
Method Summary
 DecisionForest build(int nbTrees)
           
protected abstract  void configureJob(org.apache.hadoop.mapreduce.Job job)
          Used by the inheriting classes to configure the job
protected  org.apache.hadoop.fs.Path getDataPath()
           
static org.apache.hadoop.fs.Path getDistributedCacheFile(org.apache.hadoop.conf.Configuration conf, int index)
          Helper method.
static int getNbTrees(org.apache.hadoop.conf.Configuration conf)
          Get the number of trees for the map-reduce job.
static int getNumMaps(org.apache.hadoop.conf.Configuration conf)
          Return the value of "mapred.map.tasks".
protected  org.apache.hadoop.fs.Path getOutputPath(org.apache.hadoop.conf.Configuration conf)
          Returns the path of the output directory.
static Long getRandomSeed(org.apache.hadoop.conf.Configuration conf)
          Returns the random seed
static TreeBuilder getTreeBuilder(org.apache.hadoop.conf.Configuration conf)
           
protected static boolean isOutput(org.apache.hadoop.conf.Configuration conf)
          Used only for DEBUG purposes.
static Dataset loadDataset(org.apache.hadoop.conf.Configuration conf)
          Helper method.
protected abstract  DecisionForest parseOutput(org.apache.hadoop.mapreduce.Job job)
          Parse the output files to extract the trees and pass the predictions to the callback
protected  boolean runJob(org.apache.hadoop.mapreduce.Job job)
          Sequential implementations should override this method to simulate the job execution
static void setNbTrees(org.apache.hadoop.conf.Configuration conf, int nbTrees)
          Set the number of trees to grow for the map-reduce job
 void setOutputDirName(String name)
          Sets the output directory name; the directory will be created in the working directory
static void sortSplits(org.apache.hadoop.mapreduce.InputSplit[] splits)
          Sorts the splits by size, so that the biggest go first.
This is the same code used by Hadoop's JobClient.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

Builder

protected Builder(TreeBuilder treeBuilder,
                  org.apache.hadoop.fs.Path dataPath,
                  org.apache.hadoop.fs.Path datasetPath,
                  Long seed,
                  org.apache.hadoop.conf.Configuration conf)
Method Detail

getDataPath

protected org.apache.hadoop.fs.Path getDataPath()

getNumMaps

public static int getNumMaps(org.apache.hadoop.conf.Configuration conf)
Return the value of "mapred.map.tasks".

Parameters:
conf - configuration
Returns:
number of map tasks

isOutput

protected static boolean isOutput(org.apache.hadoop.conf.Configuration conf)
Used only for DEBUG purposes. If false, the mappers don't output anything, so the builder has nothing to process.

Parameters:
conf - configuration
Returns:
true if the builder has to return output, false otherwise

getRandomSeed

public static Long getRandomSeed(org.apache.hadoop.conf.Configuration conf)
Returns the random seed

Parameters:
conf - configuration
Returns:
null if no seed is available
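
Because the seed may be null, callers need to handle the missing case. The following is a minimal sketch of that contract, not Mahout's implementation: java.util.Properties stands in for Hadoop's Configuration, and the class name SeedSketch, the key "example.random.seed", and the helper newRandom are all hypothetical.

```java
import java.util.Properties;
import java.util.Random;

// Hypothetical stand-in: Properties plays the role of Hadoop's Configuration.
final class SeedSketch {
  // Assumed key name, chosen for illustration only.
  private static final String SEED_KEY = "example.random.seed";

  // Returns null when no seed has been stored, mirroring the documented contract.
  static Long getRandomSeed(Properties conf) {
    String seed = conf.getProperty(SEED_KEY);
    return seed == null ? null : Long.valueOf(seed);
  }

  // Falls back to an unseeded generator when the seed is absent.
  static Random newRandom(Properties conf) {
    Long seed = getRandomSeed(conf);
    return seed == null ? new Random() : new Random(seed);
  }
}
```

Returning a boxed Long rather than a primitive lets "no seed" be distinguished from any valid seed value, at the cost of a null check at every call site.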

getTreeBuilder

public static TreeBuilder getTreeBuilder(org.apache.hadoop.conf.Configuration conf)

getNbTrees

public static int getNbTrees(org.apache.hadoop.conf.Configuration conf)
Get the number of trees for the map-reduce job.

Parameters:
conf - configuration
Returns:
number of trees to build

setNbTrees

public static void setNbTrees(org.apache.hadoop.conf.Configuration conf,
                              int nbTrees)
Set the number of trees to grow for the map-reduce job

Parameters:
conf - configuration
nbTrees - number of trees to build
Throws:
IllegalArgumentException - if (nbTrees <= 0)
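
The validation contract above can be sketched as follows. This is an illustration under stated assumptions rather than Mahout's code: java.util.Properties stands in for Hadoop's Configuration, and the class name NbTreesSketch, the key "example.nbtrees", and the -1 default are hypothetical.

```java
import java.util.Properties;

// Hypothetical stand-in: Properties plays the role of Hadoop's Configuration.
final class NbTreesSketch {
  // Assumed key name, chosen for illustration only.
  private static final String NB_TREES_KEY = "example.nbtrees";

  // Rejects non-positive values before anything is written to the configuration.
  static void setNbTrees(Properties conf, int nbTrees) {
    if (nbTrees <= 0) {
      throw new IllegalArgumentException("nbTrees should be greater than 0");
    }
    conf.setProperty(NB_TREES_KEY, Integer.toString(nbTrees));
  }

  // Returns -1 when the key is absent; this default is an assumption of the sketch.
  static int getNbTrees(Properties conf) {
    return Integer.parseInt(conf.getProperty(NB_TREES_KEY, "-1"));
  }
}
```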

setOutputDirName

public void setOutputDirName(String name)
Sets the output directory name; the directory will be created in the working directory.

Parameters:
name - output dir. name

getOutputPath

protected org.apache.hadoop.fs.Path getOutputPath(org.apache.hadoop.conf.Configuration conf)
                                           throws IOException
Returns the path of the output directory.

Parameters:
conf - configuration
Returns:
output dir. path (%WORKING_DIRECTORY%/%OUTPUT_DIR_NAME%)
Throws:
IOException - if we cannot get the default FileSystem
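
To illustrate how the output path relates to the working directory and the name given to setOutputDirName, here is a minimal sketch. It uses java.nio.file.Path instead of Hadoop's Path and takes the working directory as an argument instead of resolving it from a FileSystem; the class name OutputPathSketch and the default name "output" are assumptions.

```java
import java.nio.file.Path;

// Hypothetical stand-in: java.nio.file.Path replaces org.apache.hadoop.fs.Path.
final class OutputPathSketch {
  // Assumed default name, used until setOutputDirName is called.
  private String outputDirName = "output";

  void setOutputDirName(String name) {
    this.outputDirName = name;
  }

  // Resolves the output directory name against the given working directory.
  Path getOutputPath(Path workingDir) {
    return workingDir.resolve(outputDirName);
  }
}
```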

getDistributedCacheFile

public static org.apache.hadoop.fs.Path getDistributedCacheFile(org.apache.hadoop.conf.Configuration conf,
                                                                int index)
                                                         throws IOException
Helper method. Get a path from the DistributedCache

Parameters:
conf - configuration
index - index of the path in the DistributedCache files
Returns:
path from the DistributedCache
Throws:
IOException - if no path is found

loadDataset

public static Dataset loadDataset(org.apache.hadoop.conf.Configuration conf)
                           throws IOException
Helper method. Load a Dataset stored in the DistributedCache

Parameters:
conf - configuration
Returns:
loaded Dataset
Throws:
IOException - if we cannot retrieve the Dataset path from the DistributedCache, or the Dataset could not be loaded

configureJob

protected abstract void configureJob(org.apache.hadoop.mapreduce.Job job)
                              throws IOException
Used by the inheriting classes to configure the job

Parameters:
job - Hadoop's Job
Throws:
IOException - if anything goes wrong while configuring the job

runJob

protected boolean runJob(org.apache.hadoop.mapreduce.Job job)
                  throws ClassNotFoundException,
                         IOException,
                         InterruptedException
Sequential implementations should override this method to simulate the job execution

Parameters:
job - Hadoop's job
Returns:
true if the job succeeded
Throws:
ClassNotFoundException
IOException
InterruptedException

parseOutput

protected abstract DecisionForest parseOutput(org.apache.hadoop.mapreduce.Job job)
                                       throws IOException
Parse the output files to extract the trees and pass the predictions to the callback

Parameters:
job - Hadoop's job
Returns:
Built DecisionForest
Throws:
IOException - if anything goes wrong while parsing the output

build

public DecisionForest build(int nbTrees)
                     throws IOException,
                            ClassNotFoundException,
                            InterruptedException
Throws:
IOException
ClassNotFoundException
InterruptedException
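
This page does not describe how build uses the other methods. One plausible reading, given the abstract hooks configureJob and parseOutput and the overridable runJob, is a template method that calls them in that order. The sketch below illustrates that assumed flow only; a StringBuilder stands in for Hadoop's Job, a List<String> stands in for DecisionForest, and the names BuilderSketch and EchoBuilderSketch are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins: StringBuilder plays Hadoop's Job, List<String> plays DecisionForest.
abstract class BuilderSketch {
  // Records the order in which the hooks run, for illustration only.
  final List<String> calls = new ArrayList<>();

  protected abstract void configureJob(StringBuilder job);

  protected abstract List<String> parseOutput(StringBuilder job);

  // A sequential implementation would override this to simulate the job.
  protected boolean runJob(StringBuilder job) {
    calls.add("runJob");
    return true;
  }

  // Assumed order of operations: configure, run, then parse the output.
  public List<String> build(int nbTrees) {
    StringBuilder job = new StringBuilder();
    job.append("nbTrees=").append(nbTrees);
    configureJob(job);
    if (!runJob(job)) {
      throw new IllegalStateException("job failed");
    }
    return parseOutput(job);
  }
}

// Trivial subclass that supplies the two abstract hooks.
final class EchoBuilderSketch extends BuilderSketch {
  @Override
  protected void configureJob(StringBuilder job) {
    calls.add("configureJob");
    job.append(";configured");
  }

  @Override
  protected List<String> parseOutput(StringBuilder job) {
    calls.add("parseOutput");
    List<String> trees = new ArrayList<>();
    trees.add(job.toString());
    return trees;
  }
}
```

Keeping runJob non-abstract lets a distributed subclass inherit the default execution while an in-memory subclass replaces it, which matches the note on runJob about sequential implementations.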

sortSplits

public static void sortSplits(org.apache.hadoop.mapreduce.InputSplit[] splits)
Sorts the splits by size, so that the biggest go first.
This is the same code used by Hadoop's JobClient.

Parameters:
splits - input splits
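
The ordering described above, largest split first, can be sketched with a descending comparator. This is not the Hadoop JobClient code: the nested Split class is a hypothetical stand-in for InputSplit that carries only a name and a length in bytes.

```java
import java.util.Arrays;
import java.util.Comparator;

final class SortSplitsSketch {
  // Hypothetical stand-in for InputSplit: only a name and a length in bytes.
  static final class Split {
    final String name;
    final long length;

    Split(String name, long length) {
      this.name = name;
      this.length = length;
    }
  }

  // Sorts in place by descending length, so the biggest splits come first.
  static void sortSplits(Split[] splits) {
    Arrays.sort(splits, Comparator.comparingLong((Split s) -> s.length).reversed());
  }
}
```

Scheduling the largest splits first is a common way to keep one long task from starting late and delaying the end of the job.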


Copyright © 2008–2014 The Apache Software Foundation. All rights reserved.