Class Summary
Class | Description |
---|---|
AggregatorMapper | Outputs each pattern once per item it contains, so that the reducer can group patterns by item and select the top K frequent patterns. |
AggregatorReducer | Groups all frequent patterns containing an item and outputs the top K patterns containing that particular item. |
CountDescendingPairComparator<A extends Comparable<? super A>,B extends Comparable<? super B>> | Defines an ordering on Pairs whose second element is a count. |
FPGrowthDriver | |
MultiTransactionTreeIterator | Iterates over multiple transaction trees to produce a single iterator of transactions. |
ParallelCountingMapper | Maps all items in a particular transaction, in the same way as the Hadoop WordCount example. |
ParallelCountingReducer | Sums up the item counts and outputs each item with its count; can also be used as a local Combiner. |
ParallelFPGrowthCombiner | Takes each group of dependent transactions and compacts it into a TransactionTree structure. |
ParallelFPGrowthMapper | Maps each transaction to all unique item groups in the transaction. |
ParallelFPGrowthReducer | Takes each group of transactions, runs vanilla FPGrowth on it, and outputs the top K frequent patterns for each group. |
PFPGrowth | Parallel FP Growth Driver Class. |
TransactionTree | A compact representation of transactions, modeled along the lines of an FPTree; it saves considerable space and speeds up the Map/Reduce phases of the PFPGrowth algorithm by reducing the data passed from the mapper to the reducer, where FPGrowth mining is done. |
We have a Top K Parallel FPGrowth implementation. Given a huge transaction list, we find all unique features (field values) and eliminate those features whose frequency across the whole dataset is less than minSupport. Using the N remaining features, we find the top K closed patterns for each of them, generating N*K patterns. The FPGrowth algorithm is a generic implementation: any object type can be used to denote a feature. The current implementation requires you to use String as the object type. You may implement a version for any object type by creating the Iterators, Convertors and TopKPatternWritable for that particular object. For more information, please refer to the package org.apache.mahout.fpm.pfpgrowth.convertors.string.
// input, encoding, pattern (the feature-splitting regex), minSupport, maxHeapSize (K)
// and writer (a SequenceFile.Writer) are assumed to be defined by the caller.
FPGrowth<String> fp = new FPGrowth<String>();
Set<String> features = new HashSet<String>(); // features for which patterns should be collected
fp.generateTopKStringFrequentPatterns(
    // the transaction stream: one List<String> of features per input line
    new StringRecordIterator(new FileLineIterable(new File(input), encoding, false), pattern),
    // the frequency list (f-list) of features meeting minSupport
    fp.generateFList(
        new StringRecordIterator(new FileLineIterable(new File(input), encoding, false), pattern),
        minSupport),
    minSupport,
    maxHeapSize, // K: the maximum number of patterns kept per feature
    features,
    // converts the mined patterns and writes them out through the given writer
    new StringOutputConvertor(new SequenceFileOutputCollector<Text,TopKStringPatterns>(writer)));
The FPGrowth object consumes the transactions as an Iterator<List<String>> and emits, for each feature, a [String, List<Pair<List<String>,Long>>] tuple, which is passed to the appropriate writer class that takes care of storing the object, in this case in a SequenceFileOutputFormat.
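To make the stored format concrete, here is a minimal read-back sketch. It assumes the sequence file was written with Text keys and TopKStringPatterns values as above, and that TopKStringPatterns exposes its contents via getPatterns() as a List<Pair<List<String>,Long>>; the path and configuration are placeholders.

import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.common.Pair;
import org.apache.mahout.fpm.pfpgrowth.convertors.string.TopKStringPatterns;

public class ReadPatterns {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path(args[0]); // the sequence file written by SequenceFileOutputCollector

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    try {
      Text feature = new Text();                               // key: the feature
      TopKStringPatterns patterns = new TopKStringPatterns();  // value: its top K patterns
      while (reader.next(feature, patterns)) {
        for (Pair<List<String>, Long> p : patterns.getPatterns()) {
          // each pattern is a list of features plus its support count
          System.out.println(feature + "\t" + p.getFirst() + "\t" + p.getSecond());
        }
      }
    } finally {
      reader.close();
    }
  }
}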
The command line launcher for string transaction data, org.apache.mahout.fpm.pfpgrowth.FPGrowthJob, has other features, including specifying the regex pattern used to split a line of a transaction into its constituent features.
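For example, a whitespace/comma splitter pattern would decompose a line of transaction data as follows; the pattern and the sample line here are purely illustrative.

import java.util.Arrays;
import java.util.regex.Pattern;

public class SplitDemo {
  public static void main(String[] args) {
    // a hypothetical splitter pattern: one or more commas or whitespace characters
    Pattern splitter = Pattern.compile("[,\\s]+");
    String transactionLine = "milk bread,butter  eggs";
    // the same kind of splitting the StringRecordIterator applies with the supplied pattern
    System.out.println(Arrays.asList(splitter.split(transactionLine)));
    // prints: [milk, bread, butter, eggs]
  }
}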
The numGroups parameter in FPGrowthJob specifies the number of groups into which the transactions are decomposed. The numTreeCacheEntries parameter specifies the number of generated conditional FP-Trees to keep in memory so that they need not be regenerated. Increasing this number increases memory consumption but may improve speed up to a certain point; this depends entirely on the dataset in question. A value of 5-10 is recommended for mining up to the top 100 patterns for each feature.
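A rough sketch of setting these knobs when driving the parallel job programmatically is shown below. It assumes a PFPGrowth.runPFPGrowth(Parameters) entry point and uses the parameter names from this page as keys; both the entry point and the exact key strings may vary between Mahout versions, so treat this purely as an illustration rather than the definitive launch code.

import org.apache.mahout.common.Parameters;
import org.apache.mahout.fpm.pfpgrowth.PFPGrowth;

public class RunPfpGrowth {
  public static void main(String[] args) throws Exception {
    Parameters params = new Parameters();
    params.set("input", "transactions.dat");  // illustrative input path
    params.set("output", "patterns");         // illustrative output directory
    params.set("minSupport", "3");            // drop features occurring fewer than 3 times
    params.set("maxHeapSize", "50");          // K: top patterns kept per feature
    params.set("numGroups", "1000");          // groups into which transactions are decomposed
    params.set("numTreeCacheEntries", "5");   // cached conditional FP-Trees (5-10 suggested above)
    PFPGrowth.runPFPGrowth(params);           // assumed driver entry point
  }
}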