org.apache.mahout.classifier.sgd
Class TrainNewsGroups

java.lang.Object
  extended by org.apache.mahout.classifier.sgd.TrainNewsGroups

public final class TrainNewsGroups
extends Object

Reads the 20 newsgroups data and trains an adaptive logistic regression model on it. The first command line argument gives the path of the directory holding the training data. The optional second argument, leakType, defines which classes of features to use. Importantly, leakType controls whether a synthetic date is injected into the data as a target leak and, if so, how.

The value of leakType % 3 determines whether and how the target leak is injected, according to the following table:

0    No leak injected.
1    Synthetic date injected in MMM-yyyy format. This will be a single token and is a perfect target leak, since each newsgroup is given a different month.
2    Synthetic date injected in dd-MMM-yyyy HH:mm:ss format. The day varies, and thus there are more leak symbols that need to be learned. Ultimately this is just as big a leak as case 1.
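
As an illustration of what the two leak formats look like, the following sketch formats a date with the two patterns named above. The class and method names are hypothetical and not part of Mahout, and the English locale and UTC time zone are assumptions made only so the output is reproducible; the actual trainer may configure its formatters differently.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper (not part of Mahout) showing the two leak date patterns.
final class LeakDateFormats {
    private LeakDateFormats() {}

    // Builds a fresh formatter per call because SimpleDateFormat is not thread-safe.
    // Locale.ENGLISH and UTC are assumptions chosen for reproducible output.
    private static String format(String pattern, Date date) {
        SimpleDateFormat fmt = new SimpleDateFormat(pattern, Locale.ENGLISH);
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.format(date);
    }

    /** Case 1: month and year only, so every date in a month yields the same single token. */
    static String monthLeak(Date date) {
        return format("MMM-yyyy", date);
    }

    /** Case 2: full timestamp, so there are many more distinct leak symbols to learn. */
    static String timestampLeak(Date date) {
        return format("dd-MMM-yyyy HH:mm:ss", date);
    }

    public static void main(String[] args) {
        Date epoch = new Date(0L); // 1970-01-01T00:00:00Z
        System.out.println(monthLeak(epoch));     // prints "Jan-1970"
        System.out.println(timestampLeak(epoch)); // prints "01-Jan-1970 00:00:00"
    }
}
```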

leakType also determines what other text will be indexed. If leakType is greater than or equal to 6, then neither headers nor text body will be used for features and the leak is the only source of data. If leakType is 3, 4, or 5, then only subject words will be used as features. If leakType is less than 3, then both subject and body text will be used as features.

A leakType of 0 gives no leak and all textual features.

See the following table for a summary of commonly used values for leakType:

leakType  Leak?        Subject?  Body?

0         no           yes       yes
1         MMM-yyyy     yes       yes
2         dd-MMM-yyyy  yes       yes

3         no           yes       no
4         MMM-yyyy     yes       no
5         dd-MMM-yyyy  yes       no

6         no           no        no
7         MMM-yyyy     no        no
8         dd-MMM-yyyy  no        no
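
The rules above can be sketched as a small decoder. This is a minimal illustration under stated assumptions, not the code used by TrainNewsGroups: the class and method names are hypothetical, and leakType is assumed to be non-negative.

```java
// Hypothetical helper (not part of Mahout): decodes leakType following the
// rules described in this class comment. Assumes leakType >= 0.
final class LeakTypeDecoder {
    private LeakTypeDecoder() {}

    /** Leak format selected by leakType % 3, or "none" when no leak is injected. */
    static String leakFormat(int leakType) {
        switch (leakType % 3) {
            case 1:
                return "MMM-yyyy";
            case 2:
                return "dd-MMM-yyyy HH:mm:ss";
            default:
                return "none";
        }
    }

    /** Subject words are used as features unless leakType is 6 or greater. */
    static boolean usesSubject(int leakType) {
        return leakType < 6;
    }

    /** Body text is used as features only when leakType is less than 3. */
    static boolean usesBody(int leakType) {
        return leakType < 3;
    }

    public static void main(String[] args) {
        // Prints one row per commonly used leakType value (0 through 8).
        for (int leakType = 0; leakType <= 8; leakType++) {
            System.out.printf("%d leak=%s subject=%b body=%b%n",
                    leakType, leakFormat(leakType),
                    usesSubject(leakType), usesBody(leakType));
        }
    }
}
```

For example, a leakType of 4 decodes to the MMM-yyyy leak with subject words but no body text.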


Method Summary
static void main(String[] args)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Method Detail

main

public static void main(String[] args)
                 throws IOException
Throws:
IOException


Copyright © 2008–2014 The Apache Software Foundation. All rights reserved.