Sifaka Feature Vectors

The Feature Vectors tab finds the features most highly correlated with each document label and creates feature vectors that machine learning software can use for classification. Categories for the feature vectors can be selected from either document labels or saved sets of documents.

Overview of Feature Vectors input

Select feature types: Select the feature types for which to calculate feature scores. The more feature types are selected, the longer the calculation takes.
Minimum Frequency: The minimum number of times the feature must appear to be used.
Negative Documents: Select the documents to use as the negative category. Choose none for no negative category. Choose random for a random sample of documents proportional in size to the other categories. Choose all to use every document that is not labeled as one of the selected categories.
Label Categories: Each tab is a label type in the document collection. There is also a tab for saved sets. To calculate feature vectors from saved sets of documents, first create a few saved sets using the saved sets tutorial. Once a tab is selected, use the checkboxes to select categories for calculating kappa values of features.
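
The page does not spell out how Sifaka computes a feature's kappa value, so the following is only a minimal sketch of one plausible reading: Cohen's kappa between "the feature appears in the document" and "the document belongs to the category". The function names, the toy documents, and the choice to count Minimum Frequency as total occurrences across the collection are assumptions made for illustration, not Sifaka's actual implementation.

```python
from collections import Counter


def cohen_kappa(feature_present, in_category):
    """Cohen's kappa between two equal-length lists of booleans."""
    n = len(feature_present)
    if n == 0 or n != len(in_category):
        raise ValueError("inputs must be non-empty and the same length")
    # Observed agreement: documents where presence and membership match.
    p_o = sum(f == c for f, c in zip(feature_present, in_category)) / n
    # Agreement expected by chance, from the two marginal rates.
    p_f = sum(feature_present) / n
    p_c = sum(in_category) / n
    p_e = p_f * p_c + (1 - p_f) * (1 - p_c)
    if p_e == 1.0:
        return 0.0  # degenerate case: no variation to agree on
    return (p_o - p_e) / (1 - p_e)


def rank_features(docs, labels, category, min_frequency=1):
    """Rank features for one category by kappa, highest first.

    docs is a list of token lists, labels a parallel list of label strings.
    Features occurring fewer than min_frequency times in total are skipped
    (hypothetical interpretation of the Minimum Frequency field).
    """
    totals = Counter(tok for doc in docs for tok in doc)
    doc_sets = [set(doc) for doc in docs]
    in_cat = [label == category for label in labels]
    scores = {}
    for feat, count in totals.items():
        if count < min_frequency:
            continue
        present = [feat in tokens for tokens in doc_sets]
        scores[feat] = cohen_kappa(present, in_cat)
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy data, invented for this example.
docs = [
    ["profit", "quarter"],
    ["profit", "dividend"],
    ["merger", "stake"],
    ["merger", "profit"],
]
labels = ["earn", "earn", "acq", "acq"]
ranked = rank_features(docs, labels, "earn", min_frequency=2)
print(ranked)
```

With this toy data, only profit and merger meet the minimum frequency of 2, and profit ranks first for the earn category because it appears in both earn documents.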

Example

Select an index from the Indexes pane, for example: reuters.
Select the Feature Vectors tab in the right content tab pane.
Select the Label Type, for example: topics.
Select the Feature types that you want, for example: term and noun-phrase.
Enter a Minimum Frequency for features, for example: 10. Features that appear fewer times than this will not be included in the feature vector.
Select the Negative Documents. Choose random to use a random sampling of negative documents proportional to the size of the label categories selected.
Select the Label Categories for calculating kappa values. The label categories table is sortable by label value or count by clicking on the column header. In this example, the five label categories with the highest counts (earn, acq, money-fx, grain, crude) are selected.
Press the View Features button. This experiment may take several minutes. A progress indicator is displayed while Sifaka calculates the feature scores. Note: If the experiment takes longer than 15 minutes to run, check which version of Java is installed.
When the kappa values have been calculated, a table appears on the right with one row per category, showing the kappa scores of the first, tenth, fiftieth, and one hundredth highest-ranked features.
At the end of each row there is a View button. Click on that button to view all the features and their kappa scores for each category.
When exporting feature vectors, all features are selected by default. To limit the number of features, enter feature selection criteria. Check Top number of features to select and enter a number to keep only that many of the highest-ranked features from each category. Check Feature score above threshold and enter the kappa value that a feature must exceed to be exported. If both criteria are checked, choose AND to select features that satisfy both criteria, or OR to select features that satisfy at least one of them.
Select whether to export Feature Weights as Binary, TF (term frequency), or TF-IDF (term frequency * inverse document frequency).
Click Save Results to export the feature vectors to an ARFF file that can be used by WEKA.
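
The last three steps (feature selection, feature weighting, and ARFF export) can be sketched as follows. This is an illustrative approximation rather than Sifaka's code: the function names are hypothetical, the TF-IDF variant shown (raw term frequency times the natural log of N/df) is one common definition and may differ from what Sifaka writes, and the quoting helper is a simplification of ARFF's quoting rules.

```python
import math


def select_features(scores, top_n=None, threshold=None, combine="AND"):
    """Filter a {feature: kappa} dict for one category, best score first."""
    ranked = sorted(scores, key=lambda f: (-scores[f], f))
    if top_n is None and threshold is None:
        return ranked  # default: every feature is exported
    in_top = set(ranked[:top_n]) if top_n is not None else None
    above = None
    if threshold is not None:
        above = {f for f in ranked if scores[f] > threshold}
    if in_top is None:
        keep = above
    elif above is None:
        keep = in_top
    elif combine == "AND":
        keep = in_top & above
    else:  # "OR"
        keep = in_top | above
    return [f for f in ranked if f in keep]


def feature_weight(mode, tf, df, n_docs):
    """Weight of one feature in one document.

    tf: occurrences in the document; df: number of documents containing
    the feature; n_docs: collection size.
    """
    if mode == "binary":
        return 1.0 if tf > 0 else 0.0
    if mode == "tf":
        return float(tf)
    if mode == "tfidf":
        return tf * math.log(n_docs / df) if df else 0.0
    raise ValueError("unknown mode: " + mode)


def quote(name):
    """Single-quote a name so spaces (e.g. in noun phrases) are preserved."""
    return "'" + name.replace("\\", "\\\\").replace("'", "\\'") + "'"


def to_arff(relation, features, rows, classes):
    """Build ARFF text. rows is a list of (weights, class_label) pairs."""
    lines = ["@RELATION " + quote(relation), ""]
    for feat in features:
        lines.append("@ATTRIBUTE " + quote(feat) + " NUMERIC")
    class_values = ",".join(quote(c) for c in classes)
    lines.append("@ATTRIBUTE class {" + class_values + "}")
    lines += ["", "@DATA"]
    for weights, label in rows:
        values = [str(w) for w in weights] + [quote(label)]
        lines.append(",".join(values))
    return "\n".join(lines) + "\n"


# Toy scores, invented for this example.
scores = {"profit": 0.5, "dividend": 0.3, "merger": -1.0}
selected = select_features(scores, top_n=2, threshold=0.4, combine="AND")
rows = [([feature_weight("binary", 3, 2, 4)], "earn")]
arff_text = to_arff("reuters", selected, rows, ["earn", "acq"])
print(arff_text)
```

In this sketch, AND keeps only features that are both in the top N and above the threshold, while OR keeps features that meet either criterion, matching the behavior described in the steps above.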