org.apache.mahout.vectorizer.collocations.llr
Class CollocMapper
java.lang.Object
org.apache.hadoop.mapreduce.Mapper<org.apache.hadoop.io.Text,StringTuple,GramKey,Gram>
org.apache.mahout.vectorizer.collocations.llr.CollocMapper
public class CollocMapper
- extends org.apache.hadoop.mapreduce.Mapper<org.apache.hadoop.io.Text,StringTuple,GramKey,Gram>
Pass 1 of the collocation discovery job, which generates ngrams and emits each ngram together with its component (n-1)grams.
Input is a SequenceFile in which the key is a document id and the value is the tokenized document.
Nested classes/interfaces inherited from class org.apache.hadoop.mapreduce.Mapper
org.apache.hadoop.mapreduce.Mapper.Context
Method Summary
protected void map(org.apache.hadoop.io.Text key, StringTuple value, org.apache.hadoop.mapreduce.Mapper.Context context)
Collocation finder: pass 1 map phase.
protected void setup(org.apache.hadoop.mapreduce.Mapper.Context context)
Methods inherited from class org.apache.hadoop.mapreduce.Mapper
cleanup, run
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
MAX_SHINGLE_SIZE
public static final String MAX_SHINGLE_SIZE
- See Also:
- Constant Field Values
CollocMapper
public CollocMapper()
map
protected void map(org.apache.hadoop.io.Text key,
StringTuple value,
org.apache.hadoop.mapreduce.Mapper.Context context)
throws IOException,
InterruptedException
- Collocation finder: pass 1 map phase.
Receives a token stream, which is passed through a Lucene ShingleFilter. The ShingleFilter delivers ngrams of
the appropriate size, which are then decomposed into head and tail subgrams and collected in the
following manner:
k:head_key, v:head_subgram
k:head_key,ngram_key, v:ngram
k:tail_key, v:tail_subgram
k:tail_key,ngram_key, v:ngram
The 'head' or 'tail' prefix specifies whether the subgram in question is the head or the tail of the
ngram. In this implementation the head of the ngram is an (n-1)gram, and the tail is a unigram (1-gram).
For example, given 'click and clack' and an ngram length of 3:
k: head_'click and', v: head_'click and'
k: head_'click and',ngram_'click and clack', v: ngram_'click and clack'
k: tail_'clack', v: tail_'clack'
k: tail_'clack',ngram_'click and clack', v: ngram_'click and clack'
Also counts the total number of ngrams encountered and adds it to the counter
CollocDriver.Count.NGRAM_TOTAL.
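The head/tail decomposition described above can be sketched in plain Java. This is only an illustration, not the code used by CollocMapper: the class name SubgramSketch and the method decompose are hypothetical, plain strings stand in for the GramKey and Gram types, and the ShingleFilter step is assumed to have already produced the ngram.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: plain strings stand in for Mahout's GramKey and Gram. */
class SubgramSketch {

  /** Builds the four key/value records described above for a single ngram. */
  static List<String> decompose(String ngram) {
    String[] tokens = ngram.trim().split("\\s+");
    if (tokens.length < 2) {
      throw new IllegalArgumentException("need at least two tokens: " + ngram);
    }
    // Head is the leading (n-1)gram; tail is the final unigram.
    String head = String.join(" ", Arrays.copyOfRange(tokens, 0, tokens.length - 1));
    String tail = tokens[tokens.length - 1];
    String full = String.join(" ", tokens);

    List<String> records = new ArrayList<>();
    // Each subgram is emitted once on its own and once paired with the full ngram.
    records.add("k: head_'" + head + "', v: head_'" + head + "'");
    records.add("k: head_'" + head + "',ngram_'" + full + "', v: ngram_'" + full + "'");
    records.add("k: tail_'" + tail + "', v: tail_'" + tail + "'");
    records.add("k: tail_'" + tail + "',ngram_'" + full + "', v: ngram_'" + full + "'");
    return records;
  }

  public static void main(String[] args) {
    // Prints the four records for the 'click and clack' example.
    decompose("click and clack").forEach(System.out::println);
  }
}
```

In the real mapper each record would be written to the Hadoop context rather than collected in a list, and the NGRAM_TOTAL counter would be incremented once per ngram; both are left out here to keep the sketch free of Hadoop dependencies.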
- Overrides:
map
in class org.apache.hadoop.mapreduce.Mapper<org.apache.hadoop.io.Text,StringTuple,GramKey,Gram>
- Throws:
IOException
- if there's a problem with the ShingleFilter reading data or the collector collecting output.
InterruptedException
setup
protected void setup(org.apache.hadoop.mapreduce.Mapper.Context context)
throws IOException,
InterruptedException
- Overrides:
setup
in class org.apache.hadoop.mapreduce.Mapper<org.apache.hadoop.io.Text,StringTuple,GramKey,Gram>
- Throws:
IOException
InterruptedException
Copyright © 2008–2014 The Apache Software Foundation. All rights reserved.