org.apache.mahout.vectorizer.encoders
Class FeatureVectorEncoder

java.lang.Object
  extended by org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder
Direct Known Subclasses:
CachingValueEncoder, InteractionValueEncoder, TextValueEncoder, WordValueEncoder

public abstract class FeatureVectorEncoder
extends Object

Abstract base class for objects that record features into a feature vector.

By convention, subclasses should provide a constructor that accepts just a field name, as well as setters that customize properties of the conversion, such as adding tokenizers or a weight dictionary.
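The subclass convention described above can be sketched without any Mahout dependency. Everything in this block is a hypothetical stand-in: `SketchEncoder`, `SketchWordEncoder`, `setDictionary`, and the `bucket` helper are illustrative names, a plain `double[]` plays the role of the `Vector`, and the hash is a simple placeholder rather than the one the library uses.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an encoder base class; this is not Mahout code.
abstract class SketchEncoder {
    private final String name;

    protected SketchEncoder(String name) {
        this.name = name;
    }

    public String getName() {
        return name;
    }

    // Records one weighted feature into a plain array acting as the vector.
    public abstract void addToVector(String originalForm, double weight, double[] data);

    // Placeholder hash: mixes field name and value, then folds into [0, numFeatures).
    protected static int bucket(String name, String value, int numFeatures) {
        byte[] bytes = (name + "=" + value).getBytes(StandardCharsets.UTF_8);
        return Math.floorMod(Arrays.hashCode(bytes), numFeatures);
    }
}

// Follows the convention: a constructor taking only the field name, plus a setter.
final class SketchWordEncoder extends SketchEncoder {
    private Map<String, Double> dictionary = new HashMap<>();

    SketchWordEncoder(String name) {
        super(name);
    }

    // Optional customization: per-term weights; unknown terms default to 1.0.
    public void setDictionary(Map<String, Double> dictionary) {
        this.dictionary = new HashMap<>(dictionary);
    }

    @Override
    public void addToVector(String originalForm, double weight, double[] data) {
        double termWeight = dictionary.getOrDefault(originalForm, 1.0);
        data[bucket(getName(), originalForm, data.length)] += weight * termWeight;
    }
}
```

A caller would construct `new SketchWordEncoder("color")`, optionally call `setDictionary`, and then call `addToVector` once per value.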


Field Summary
protected static int CONTINUOUS_VALUE_HASH_SEED
           
protected static int WORD_LIKE_VALUE_HASH_SEED
           
 
Constructor Summary
protected FeatureVectorEncoder(String name)
           
protected FeatureVectorEncoder(String name, int probes)
           
 
Method Summary
abstract  void addToVector(byte[] originalForm, double weight, Vector data)
           
 void addToVector(byte[] originalForm, Vector data)
          Adds a value expressed in byte array form to a vector.
 void addToVector(String originalForm, double weight, Vector data)
          Adds a weighted value expressed in string form to a vector.
 void addToVector(String originalForm, Vector data)
          Adds a value expressed in string form to a vector.
abstract  String asString(String originalForm)
          Converts a value into a form that would help a human understand the internals of how the value is being interpreted.
protected static byte[] bytesForString(String x)
           
 String getName()
           
 int getProbes()
           
protected  double getWeight(byte[] originalForm, double w)
           
protected  int hash(byte[] term1, byte[] term2, int probe, int numFeatures)
          Hash two byte arrays and an integer into the range [0..numFeatures-1].
protected static int hash(byte[] term, int probe, int numFeatures)
          Hash a byte array and an integer into the range [0..numFeatures-1].
protected  int hash(String term, int probe, int numFeatures)
          Hash a string and an integer into the range [0..numFeatures-1].
protected static int hash(String term1, String term2, int probe, int numFeatures)
          Hash two strings and an integer into the range [0..numFeatures-1].
protected  int hash(String term1, String term2, String term3, String term4, int probe, int numFeatures)
          Hash four strings and an integer into the range [0..numFeatures-1].
protected  Iterable<Integer> hashesForProbe(byte[] originalForm, int dataSize, String name, int probe)
          Returns all of the hashes for this probe.
protected abstract  int hashForProbe(byte[] originalForm, int dataSize, String name, int probe)
          Provides the unique hash for a particular probe.
protected  boolean isTraceEnabled()
           
 void setProbes(int probes)
          Sets the number of locations in the feature vector that a value should be in.
 void setTraceDictionary(Map<String,Set<Integer>> traceDictionary)
           
protected  void trace(byte[] subName, int n)
           
protected  void trace(String subName, int n)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

CONTINUOUS_VALUE_HASH_SEED

protected static final int CONTINUOUS_VALUE_HASH_SEED
See Also:
Constant Field Values

WORD_LIKE_VALUE_HASH_SEED

protected static final int WORD_LIKE_VALUE_HASH_SEED
See Also:
Constant Field Values
Constructor Detail

FeatureVectorEncoder

protected FeatureVectorEncoder(String name)

FeatureVectorEncoder

protected FeatureVectorEncoder(String name,
                               int probes)
Method Detail

addToVector

public void addToVector(String originalForm,
                        Vector data)
Adds a value expressed in string form to a vector.

Parameters:
originalForm - The original form of the value as a string.
data - The vector to which the value should be added.

addToVector

public void addToVector(byte[] originalForm,
                        Vector data)
Adds a value expressed in byte array form to a vector.

Parameters:
originalForm - The original form of the value as a byte array.
data - The vector to which the value should be added.

addToVector

public void addToVector(String originalForm,
                        double weight,
                        Vector data)
Adds a weighted value expressed in string form to a vector. In some cases it is convenient to use this method to encode continuous values using the weight as the value. In such cases, the string value should typically be set to null.

Parameters:
originalForm - The original form of the value as a string.
weight - The weight to be applied to this feature.
data - The vector to which the value should be added.
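As a rough illustration of encoding a continuous value through the weight, the sketch below passes `null` as the string form so that only the field name selects the location, and the weight becomes the stored value. `ContinuousSketch` is a hypothetical stand-in with a placeholder hash and a `double[]` in place of `Vector`; it is not how the library itself is implemented.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical stand-in, not Mahout code: a double[] plays the role of Vector.
final class ContinuousSketch {
    private final String name;

    ContinuousSketch(String name) {
        this.name = name;
    }

    // With a null originalForm only the field name determines the slot,
    // so the weight itself becomes the value stored in the vector.
    void addToVector(String originalForm, double weight, double[] data) {
        String key = originalForm == null ? name : name + "=" + originalForm;
        byte[] bytes = key.getBytes(StandardCharsets.UTF_8);
        int slot = Math.floorMod(Arrays.hashCode(bytes), data.length);
        data[slot] += weight;
    }

    public static void main(String[] args) {
        double[] data = new double[32];
        new ContinuousSketch("temperature").addToVector(null, 21.5, data);
        // Exactly one entry is nonzero, and it holds the weight.
        System.out.println(Arrays.stream(data).sum());
    }
}
```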

addToVector

public abstract void addToVector(byte[] originalForm,
                                 double weight,
                                 Vector data)

hashForProbe

protected abstract int hashForProbe(byte[] originalForm,
                                    int dataSize,
                                    String name,
                                    int probe)
Provides the unique hash for a particular probe. For all encoders except text, this is all that is needed and the default implementation of hashesForProbe will do the right thing. For text and similar values, hashesForProbe should be overridden and this method should not be used.

Parameters:
originalForm - The original byte array value.
dataSize - The length of the vector being encoded.
name - The name of the variable being encoded.
probe - The probe number.
Returns:
The hash of the current probe

hashesForProbe

protected Iterable<Integer> hashesForProbe(byte[] originalForm,
                                           int dataSize,
                                           String name,
                                           int probe)
Returns all of the hashes for this probe. For most encoders, this is a singleton, but for text, many hashes are returned, one for each word (unique or not). Most implementations should only implement hashForProbe for simplicity.

Parameters:
originalForm - The original byte array value.
dataSize - The length of the vector being encoded.
name - The name of the variable being encoded.
probe - The probe number.
Returns:
an Iterable of the hashes
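The relationship between hashForProbe and hashesForProbe might look roughly like the following. This is a sketch under stated assumptions, not the library source: `ProbeSketch`, `WordProbeSketch`, `TextProbeSketch`, and `bucket` are hypothetical names, the hash is a placeholder, and whitespace splitting stands in for whatever tokenization a real text encoder uses.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the base class; not Mahout code.
abstract class ProbeSketch {
    protected abstract int hashForProbe(byte[] originalForm, int dataSize, String name, int probe);

    // Default: a singleton holding the one hash for this probe.
    protected Iterable<Integer> hashesForProbe(byte[] originalForm, int dataSize, String name, int probe) {
        return Collections.singletonList(hashForProbe(originalForm, dataSize, name, probe));
    }

    // Placeholder hash folded into [0, dataSize); an assumption for illustration.
    static int bucket(String name, byte[] value, int probe, int dataSize) {
        int h = 31 * (31 * name.hashCode() + Arrays.hashCode(value)) + probe;
        return Math.floorMod(h, dataSize);
    }
}

// Word-like encoder: implements only hashForProbe and inherits the default.
final class WordProbeSketch extends ProbeSketch {
    @Override
    protected int hashForProbe(byte[] originalForm, int dataSize, String name, int probe) {
        return bucket(name, originalForm, probe, dataSize);
    }
}

// Text-like encoder: overrides hashesForProbe and returns one hash per word.
final class TextProbeSketch extends ProbeSketch {
    @Override
    protected int hashForProbe(byte[] originalForm, int dataSize, String name, int probe) {
        throw new UnsupportedOperationException("text values use hashesForProbe");
    }

    @Override
    protected Iterable<Integer> hashesForProbe(byte[] originalForm, int dataSize, String name, int probe) {
        List<Integer> hashes = new ArrayList<>();
        String text = new String(originalForm, StandardCharsets.UTF_8).trim();
        if (text.isEmpty()) {
            return hashes;
        }
        // Repeated words are hashed again, so they contribute more than once.
        for (String word : text.split("\\s+")) {
            hashes.add(bucket(name, word.getBytes(StandardCharsets.UTF_8), probe, dataSize));
        }
        return hashes;
    }
}
```

Keeping the single-hash case in hashForProbe means most subclasses only need one short method, while text-like encoders take over the iteration entirely.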

getWeight

protected double getWeight(byte[] originalForm,
                           double w)

hash

protected int hash(String term,
                   int probe,
                   int numFeatures)
Hash a string and an integer into the range [0..numFeatures-1].

Parameters:
term - The string.
probe - An integer that modifies the resulting hash.
numFeatures - The range into which the resulting hash must fit.
Returns:
An integer in the range [0..numFeatures-1] that has good spread for small changes in term and probe.

hash

protected static int hash(byte[] term,
                          int probe,
                          int numFeatures)
Hash a byte array and an integer into the range [0..numFeatures-1].

Parameters:
term - The bytes.
probe - An integer that modifies the resulting hash.
numFeatures - The range into which the resulting hash must fit.
Returns:
An integer in the range [0..numFeatures-1] that has good spread for small changes in term and probe.
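To make the range guarantee concrete, here is a minimal sketch of folding a byte array and a probe number into [0..numFeatures-1]. The mixing function in `RangeHashSketch` is an assumption chosen for brevity and is not the hash the library uses; the point is the final `Math.floorMod`, which keeps the result non-negative even when the intermediate hash is negative.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical stand-in, not Mahout code.
final class RangeHashSketch {
    // Placeholder mixing: combine the term's hash with the probe, then fold
    // the result into [0, numFeatures). floorMod never returns a negative value
    // for a positive modulus, unlike the % operator.
    static int hash(byte[] term, int probe, int numFeatures) {
        int h = Arrays.hashCode(term) * 0x9E3779B1 + probe * 0x85EBCA6B;
        h ^= h >>> 16;
        return Math.floorMod(h, numFeatures);
    }

    public static void main(String[] args) {
        byte[] term = "apple".getBytes(StandardCharsets.UTF_8);
        for (int probe = 0; probe < 3; probe++) {
            System.out.println("probe " + probe + " -> " + hash(term, probe, 100));
        }
    }
}
```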

hash

protected static int hash(String term1,
                          String term2,
                          int probe,
                          int numFeatures)
Hash two strings and an integer into the range [0..numFeatures-1].

Parameters:
term1 - The first string.
term2 - The second string.
probe - An integer that modifies the resulting hash.
numFeatures - The range into which the resulting hash must fit.
Returns:
An integer in the range [0..numFeatures-1] that has good spread for small changes in term1, term2, and probe.

hash

protected int hash(byte[] term1,
                   byte[] term2,
                   int probe,
                   int numFeatures)
Hash two byte arrays and an integer into the range [0..numFeatures-1].

Parameters:
term1 - The first byte array.
term2 - The second byte array.
probe - An integer that modifies the resulting hash.
numFeatures - The range into which the resulting hash must fit.
Returns:
An integer in the range [0..numFeatures-1] that has good spread for small changes in term1, term2, and probe.

hash

protected int hash(String term1,
                   String term2,
                   String term3,
                   String term4,
                   int probe,
                   int numFeatures)
Hash four strings and an integer into the range [0..numFeatures-1].

Parameters:
term1 - The first string.
term2 - The second string.
term3 - The third string.
term4 - The fourth string.
probe - An integer that modifies the resulting hash.
numFeatures - The range into which the resulting hash must fit.
Returns:
An integer in the range [0..numFeatures-1] that has good spread for small changes in the four terms and probe.

asString

public abstract String asString(String originalForm)
Converts a value into a form that would help a human understand the internals of how the value is being interpreted. For text-like things, this is likely to be a list of the terms found with associated weights (if any).

Parameters:
originalForm - The original form of the value as a string.
Returns:
A string that a human can read.

getProbes

public int getProbes()

setProbes

public void setProbes(int probes)
Sets the number of locations in the feature vector that a value should be in.

Parameters:
probes - Number of locations to increment.
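The effect of the probe count can be sketched as a loop that increments one location per probe. `MultiProbeSketch` is a hypothetical stand-in with a placeholder hash and a `double[]` in place of `Vector`; the default of one probe is also an assumption made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical stand-in, not Mahout code.
final class MultiProbeSketch {
    private final String name;
    private int probes = 1; // assumed default for this sketch

    MultiProbeSketch(String name) {
        this.name = name;
    }

    int getProbes() {
        return probes;
    }

    void setProbes(int probes) {
        this.probes = probes;
    }

    // Adds the weight at one location per probe. Two probes may occasionally
    // land on the same slot, in which case that slot receives the weight twice.
    void addToVector(String originalForm, double weight, double[] data) {
        byte[] key = (name + "=" + originalForm).getBytes(StandardCharsets.UTF_8);
        for (int probe = 0; probe < probes; probe++) {
            int h = Arrays.hashCode(key) * 0x9E3779B1 + probe * 0x85EBCA6B;
            h ^= h >>> 16;
            data[Math.floorMod(h, data.length)] += weight;
        }
    }
}
```

Spreading a value over several locations makes it less likely that two different values collide on every location they occupy, at the cost of a denser vector.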

getName

public String getName()

isTraceEnabled

protected boolean isTraceEnabled()

trace

protected void trace(String subName,
                     int n)

trace

protected void trace(byte[] subName,
                     int n)

setTraceDictionary

public void setTraceDictionary(Map<String,Set<Integer>> traceDictionary)
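This page does not describe the tracing methods, so the following is only a guess at their intent based on the signatures: the dictionary appears to map a feature name to the set of vector indices that feature touched, which helps when interpreting a hashed vector. `TraceSketch`, the `name=subName` key format, and the rule that a null dictionary disables tracing are all assumptions made for this illustration.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in, not Mahout code.
final class TraceSketch {
    private final String name;
    private Map<String, Set<Integer>> traceDictionary; // null: tracing off (assumed)

    TraceSketch(String name) {
        this.name = name;
    }

    void setTraceDictionary(Map<String, Set<Integer>> traceDictionary) {
        this.traceDictionary = traceDictionary;
    }

    boolean isTraceEnabled() {
        return traceDictionary != null;
    }

    // Assumed behavior: remember that feature "name=subName" touched index n.
    void trace(String subName, int n) {
        if (!isTraceEnabled()) {
            return;
        }
        String key = subName == null ? name : name + "=" + subName;
        traceDictionary.computeIfAbsent(key, k -> new TreeSet<>()).add(n);
    }

    // Placeholder encoding step that records the slot it updates.
    void addToVector(String originalForm, double weight, double[] data) {
        int slot = Math.floorMod((name + "=" + originalForm).hashCode(), data.length);
        trace(originalForm, slot);
        data[slot] += weight;
    }
}
```

After encoding, a caller could look up a key such as `color=red` in the dictionary to see which vector positions that value contributed to.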

bytesForString

protected static byte[] bytesForString(String x)


Copyright © 2008–2014 The Apache Software Foundation. All rights reserved.