org.apache.mahout.math.hadoop.stochasticsvd
Class SSVDSolver

java.lang.Object
  extended by org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver

public final class SSVDSolver
extends Object

Stochastic SVD solver (API class).

Implementation details are in my working notes in MAHOUT-376 (https://issues.apache.org/jira/browse/MAHOUT-376).

As of this writing, I don't have benchmarks comparing this method to other methods. However, its differentiating characteristics, independent of Hadoop, are thought to be:

  • "faster": precision is traded off in favor of speed. However, there is a lever in the form of the oversampling parameter p. Higher values of p produce better precision at the cost of speed (and a higher minimum RAM requirement). This also means that this method is almost guaranteed to be less precise than Lanczos unless a full-rank SVD decomposition is sought.
  • "more scale": can presumably take on larger problems than the Lanczos solver (not confirmed by benchmark at this time).

    With regard to this implementation specifically, I think a couple of other differentiating points are:

  • no need to specify the input matrix height or width on the command line; the solver works with whatever dimensions the input has;
  • supports any Writable as DRM row keys and copies them to the corresponding rows of the U matrix;
  • can request U, V, Uσ = U*Σ^0.5 or Vσ = V*Σ^0.5, none of which requires a pass over the input A; these jobs are parallel map-only jobs.
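
To make the Uσ = U*Σ^0.5 output concrete, here is a minimal plain-Java sketch of the scaling on small in-memory arrays. The class and method names (HalfSigmaScaling, scaleColumns) are hypothetical and are not part of Mahout; the solver itself produces these outputs as Hadoop jobs. Each output row depends only on the matching row of U and on the singular values, which is consistent with the map-only remark above.

```java
import java.util.Arrays;

// Hypothetical illustration only; not the Mahout implementation.
final class HalfSigmaScaling {

    /** Returns U * diag(sigma^power): column j of u is scaled by sigma[j]^power. */
    static double[][] scaleColumns(double[][] u, double[] sigma, double power) {
        double[][] out = new double[u.length][];
        for (int i = 0; i < u.length; i++) {
            out[i] = new double[u[i].length];
            for (int j = 0; j < u[i].length; j++) {
                out[i][j] = u[i][j] * Math.pow(sigma[j], power);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] u = {{1.0, 0.0}, {0.0, 1.0}};
        double[] sigma = {4.0, 9.0};
        // power = 0.5 gives U*Sigma^0.5; power = 1.0 gives U*Sigma.
        System.out.println(Arrays.deepToString(scaleColumns(u, sigma, 0.5)));
    }
}
```

The same helper covers the V variants by passing V instead of U.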

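    This class is the central public API for the SSVD solver. The use pattern is as follows: construct the solver with the required parameters, set any optional parameters through the setters, call run(), and then read the results through getSingularValues(), getUPath(), getVPath() and the related getters.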


    Constructor Summary
    SSVDSolver(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path[] inputPath, org.apache.hadoop.fs.Path outputPath, int ablockRows, int k, int p, int reduceTasks)
              Creates a new SSVD solver.
     
    Method Summary
     int getAbtBlockHeight()
               
     int getOuterBlockHeight()
               
     org.apache.hadoop.fs.Path getPcaMeanPath()
              Optional.
     int getQ()
               
     Vector getSingularValues()
              This contains the k+p singular values resulting from the solver run.
     String getuHalfSigmaPath()
               
     String getUPath()
              Returns the U path (if the computation was requested and successful).
     String getuSigmaPath()
               
     String getvHalfSigmaPath()
               
     String getVPath()
              Returns the V path (if the computation was requested and successful).
     String getvSigmaPath()
               
     boolean isBroadcast()
               
     boolean isOverwrite()
               
     void run()
              Runs all SSVD jobs.
     void setAbtBlockHeight(int abtBlockHeight)
              The block height of Y_i during power iterations.
     void setBroadcast(boolean broadcast)
              If this property is true, use the DistributedCache mechanism for broadcasting.
     void setComputeU(boolean val)
              The setting controlling whether to compute U matrix of low rank SSVD.
     void setComputeV(boolean val)
              Setting controlling whether to compute V matrix of low-rank SSVD.
     void setcUHalfSigma(boolean cUHat)
               
     void setcUSigma(boolean cUSigma)
               
     void setcVHalfSigma(boolean cVHat)
               
     void setcVSigma(boolean cVSigma)
               
     void setMinSplitSize(int size)
              Sometimes, if the requested A blocks become larger than a split, this setting may be needed to ensure that at least k+p rows of A get into a split.
     void setOuterBlockHeight(int outerBlockHeight)
              The height of outer blocks during Q'A multiplication.
     void setOverwrite(boolean overwrite)
              If true, the driver cleans the output folder first if it exists.
     void setPcaMeanPath(org.apache.hadoop.fs.Path pcaMeanPath)
               
     void setQ(int q)
              Sets q, the number of additional power iterations used to increase precision (0..2).
     
    Methods inherited from class java.lang.Object
    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
     

    Constructor Detail

    SSVDSolver

    public SSVDSolver(org.apache.hadoop.conf.Configuration conf,
                      org.apache.hadoop.fs.Path[] inputPath,
                      org.apache.hadoop.fs.Path outputPath,
                      int ablockRows,
                      int k,
                      int p,
                      int reduceTasks)
    Creates a new SSVD solver. Required parameters are passed to the constructor to ensure they are set. Optional parameters can be set using setters.

    Parameters:
    conf - hadoop configuration
    inputPath - Input path (should be compatible with DistributedRowMatrix as of the time of this writing).
    outputPath - Output path containing U, V and singular values vector files.
    ablockRows - The vertical height of a q-block (bigger values require more memory in mappers and perhaps larger minSplitSize values).
    k - desired rank
    p - SSVD oversampling parameter
    reduceTasks - Number of reduce tasks (where applicable)
    Throws:
    IOException - if an I/O error occurs.
    Method Detail

    getQ

    public int getQ()

    setQ

    public void setQ(int q)
    Sets q, the number of additional power iterations used to increase precision (0..2). Defaults to 0.

    Parameters:
    q -

    setComputeU

    public void setComputeU(boolean val)
    Setting controlling whether to compute the U matrix of the low-rank SSVD. Default is true.


    setComputeV

    public void setComputeV(boolean val)
    Setting controlling whether to compute V matrix of low-rank SSVD.

    Parameters:
    val - true if we want to output V matrix. Default is true.

    setcUHalfSigma

    public void setcUHalfSigma(boolean cUHat)
    Parameters:
    cUHat - whether to produce U*Sigma^0.5 as well (default false)

    setcVHalfSigma

    public void setcVHalfSigma(boolean cVHat)
    Parameters:
    cVHat - whether to produce V*Sigma^0.5 as well (default false)

    setcUSigma

    public void setcUSigma(boolean cUSigma)
    Parameters:
    cUSigma - whether to produce U*Sigma output as well (default false)

    setcVSigma

    public void setcVSigma(boolean cVSigma)
    Parameters:
    cVSigma - whether to produce V*Sigma output as well (default false)

    setMinSplitSize

    public void setMinSplitSize(int size)
    Sometimes, if the requested A blocks become larger than a split, this setting may be needed to ensure that at least k+p rows of A get into a split. This is a requirement for obtaining orthonormalized Q blocks of SSVD.

    Parameters:
    size - the minimum split size to use
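
As a rough aid for picking a value, the sketch below computes a lower bound on the split size from an average serialized row size. This is only an assumption-laden estimate: avgRowBytes is a hypothetical figure the caller has to measure or guess, the class name MinSplitEstimate is not part of Mahout, and real split contents also depend on the input format and compression.

```java
// Hypothetical helper; not part of Mahout.
final class MinSplitEstimate {

    /**
     * Smallest split size, in bytes, that would hold k+p rows
     * if every row took avgRowBytes bytes (an assumed average).
     */
    static long minSplitBytes(int k, int p, long avgRowBytes) {
        if (k < 0 || p < 0 || avgRowBytes < 0) {
            throw new IllegalArgumentException("arguments must be non-negative");
        }
        // multiplyExact throws ArithmeticException on overflow instead of wrapping.
        return Math.multiplyExact((long) k + p, avgRowBytes);
    }

    public static void main(String[] args) {
        // k = 50, p = 15, rows of roughly 1000 bytes each.
        System.out.println(minSplitBytes(50, 15, 1000L));
    }
}
```

Since setMinSplitSize takes an int, a result above Integer.MAX_VALUE could not be passed to it as is.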

    getSingularValues

    public Vector getSingularValues()
    This contains the k+p singular values resulting from the solver run.

    Returns:
    singular values (largest to smallest)
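
Because the returned vector holds k+p values ordered from largest to smallest, a caller interested only in the requested rank k can keep the leading k entries. The sketch below shows that on a plain double[] copy of the values; the class name LeadingSingularValues is hypothetical, and treating the trailing p values as disposable is an assumption about how the result is used, not something this page states.

```java
import java.util.Arrays;

// Hypothetical helper; operates on a plain array copy of the singular values.
final class LeadingSingularValues {

    /** Returns the first k entries of a vector sorted largest to smallest. */
    static double[] leading(double[] allValues, int k) {
        if (k < 0 || k > allValues.length) {
            throw new IllegalArgumentException("k must be between 0 and " + allValues.length);
        }
        return Arrays.copyOf(allValues, k);
    }

    public static void main(String[] args) {
        double[] kPlusP = {9.0, 5.0, 2.0, 0.5, 0.1}; // k = 3, p = 2
        System.out.println(Arrays.toString(leading(kPlusP, 3)));
    }
}
```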

    getUPath

    public String getUPath()
    Returns the U path (if the computation was requested and successful).

    Returns:
    U output hdfs path, or null if computation was not completed for whatever reason.

    getVPath

    public String getVPath()
    Returns the V path (if the computation was requested and successful).

    Returns:
    V output hdfs path, or null if computation was not completed for whatever reason.

    getuSigmaPath

    public String getuSigmaPath()

    getuHalfSigmaPath

    public String getuHalfSigmaPath()

    getvSigmaPath

    public String getvSigmaPath()

    getvHalfSigmaPath

    public String getvHalfSigmaPath()

    isOverwrite

    public boolean isOverwrite()

    setOverwrite

    public void setOverwrite(boolean overwrite)
    If true, the driver cleans the output folder first if it exists.

    Parameters:
    overwrite -

    getOuterBlockHeight

    public int getOuterBlockHeight()

    setOuterBlockHeight

    public void setOuterBlockHeight(int outerBlockHeight)
    The height of outer blocks during Q'A multiplication. Higher values produce fewer keys for combining, shuffling and sorting, which somewhat improves running time, but they require larger blocks to be formed in RAM (so setting this too high can lead to OOM).

    Parameters:
    outerBlockHeight -

    getAbtBlockHeight

    public int getAbtBlockHeight()

    setAbtBlockHeight

    public void setAbtBlockHeight(int abtBlockHeight)
    The block height of Y_i during power iterations. It is probably important to set it higher than the default of 200,000 for extremely sparse inputs and when more RAM is available. A Y_i block in the ABt job would occupy approximately abtBlockHeight x (k+p) x sizeof(double) bytes (as dense).

    Parameters:
    abtBlockHeight -
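
The memory formula above is easy to evaluate before choosing a block height. The following sketch does the arithmetic; the class name AbtBlockMemory is hypothetical, and the figure is only the dense-block estimate from the description, not a measurement of actual task memory.

```java
// Hypothetical helper that evaluates abtBlockHeight x (k+p) x sizeof(double).
final class AbtBlockMemory {

    /** Approximate size in bytes of one dense block of abtBlockHeight x (k+p) doubles. */
    static long denseBytes(long abtBlockHeight, int k, int p) {
        long columns = (long) k + p;
        // multiplyExact throws ArithmeticException on overflow instead of wrapping.
        return Math.multiplyExact(Math.multiplyExact(abtBlockHeight, columns), (long) Double.BYTES);
    }

    public static void main(String[] args) {
        // Default block height of 200,000 with k = 50 and p = 15:
        // 200,000 x 65 x 8 = 104,000,000 bytes.
        System.out.println(denseBytes(200_000L, 50, 15));
    }
}
```

With these example values the estimate is about 104 MB per block, so doubling the block height roughly doubles that requirement.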

    isBroadcast

    public boolean isBroadcast()

    setBroadcast

    public void setBroadcast(boolean broadcast)
    If this property is true, use the DistributedCache mechanism for broadcasting. May improve efficiency. Default is false.

    Parameters:
    broadcast -

    getPcaMeanPath

    public org.apache.hadoop.fs.Path getPcaMeanPath()
    Optional. Single-vector file path for a vector (aka xi in MAHOUT-817 working notes) to be subtracted from each row of input.

    The brute-force approach would turn the input into a dense input, which is often undesirable. By supplying this offset to the SSVD solver, we can avoid most of the overhead caused by increased input density.

    The size of this offset vector is n (the width of the input A). In PCA and R this is known as "column means", but in this case it can be any offset of row vectors to propagate into the SSVD solution.
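
One way to see why an offset can be handled without densifying the input is the identity (A - 1*xi')x = Ax - (xi'x)1: a product with the shifted matrix equals a product with the sparse A plus a single scalar correction. The sketch below illustrates that general identity on in-memory sparse rows. It is not a description of what SSVDSolver does internally, and the class name OffsetProduct and the index/value row layout are assumptions made for the example.

```java
import java.util.Arrays;

// Hypothetical illustration of the algebraic identity; not the Mahout implementation.
final class OffsetProduct {

    /** Dot product of a sparse row (column indices plus values) with a dense vector x. */
    static double sparseDot(int[] idx, double[] val, double[] x) {
        double sum = 0.0;
        for (int t = 0; t < idx.length; t++) {
            sum += val[t] * x[idx[t]];
        }
        return sum;
    }

    /** Computes (A - 1*xi') x while touching only the non-zero entries of A. */
    static double[] shiftedTimes(int[][] idx, double[][] val, double[] xi, double[] x) {
        double shift = 0.0; // xi'x, computed once for all rows
        for (int j = 0; j < xi.length; j++) {
            shift += xi[j] * x[j];
        }
        double[] y = new double[idx.length];
        for (int i = 0; i < idx.length; i++) {
            y[i] = sparseDot(idx[i], val[i], x) - shift;
        }
        return y;
    }

    public static void main(String[] args) {
        // A = [[1, 0, 0], [0, 0, 2]] stored sparsely; xi holds the column means.
        int[][] idx = {{0}, {2}};
        double[][] val = {{1.0}, {2.0}};
        double[] xi = {0.5, 0.0, 1.0};
        double[] x = {1.0, 1.0, 1.0};
        System.out.println(Arrays.toString(shiftedTimes(idx, val, xi, x)));
    }
}
```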


    setPcaMeanPath

    public void setPcaMeanPath(org.apache.hadoop.fs.Path pcaMeanPath)

    run

    public void run()
             throws IOException
    Runs all SSVD jobs.

    Throws:
    IOException - if an I/O error occurs.


    Copyright © 2008–2014 The Apache Software Foundation. All rights reserved.