Sally

A Tool for Embedding Strings in Vector Spaces

Example 1: Text Categorization

This example deals with text categorization: you are given a set of text documents from different categories and the task is to learn a classifier that predicts these categories on unseen text (see Joachims, ECML 1999).

Requirements

To run this example, we need some machine learning tools. First of all, you will need a version of Sally for processing the string data.

Download Sally and compile it. Enable support for LibArchive using the configuration option --enable-libarchive.

We restrict ourselves to the fast classification method LibLinear in this example. LibLinear implements linear Support Vector Machines (SVMs) and allows one to learn and apply classifiers both effectively and efficiently.

Download LibLinear and compile it.

Configure Sally

Everything is set; let's start by configuring Sally for this particular task. We will use the following example configuration to explain the parameters of Sally.

Download the example configuration example1.cfg

The first section input defines the input format of the data. We can see that the format is defined as "arc" which corresponds to string data stored as files in an archive.

  input = {
     input_format  = "arc"; 
     ...
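
With the "arc" format, Sally treats every file inside an archive as one string. If you want to peek into such an archive yourself, a small Python sketch can list its contents (the helper name list_archive is our own, not part of Sally):

```python
import zipfile

def list_archive(path):
    """Return the names of the files stored in a zip archive."""
    with zipfile.ZipFile(path) as zf:
        return zf.namelist()
```

For example, list_archive("reuters.zip") lists the news articles contained in the data set used below.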

The next section in the configuration file describes the features extracted from the strings and how they are used for mapping to a vector space. For analysis of text documents, a standard approach is to simply consider the words (tokens) contained in each document. This concept is often denoted as a "bag-of-words", where each document is represented by one such bag. To define such a bag-of-words model in Sally, we only need to define a few parameters.

  features = {
     ngram_len     = 1;
     granularity   = "tokens";
     token_delim   = "%0a%0d%20%22.,:;!?";
     vect_embed    = "tfidf";
     vect_norm     = "none";
     ...
  1. First, we need to set the n-gram length ngram_len to 1 and tell Sally to look at tokens. This instructs Sally to consider all tokens (words) independently of each other. If you specify an n-gram length of 2, a pair of two consecutive tokens is considered; if you choose 3, a triple of tokens is used, and so on.
  2. Next, we need to tell Sally how tokens are defined. This is done by specifying a set of delimiter symbols token_delim that is used to partition the strings into tokens. For analysis of text we usually define these delimiters to be white-space and punctuation symbols.
  3. Finally, we need to define the embedding vect_embed and normalization vect_norm of the vectors. We use a standard embedding called TF-IDF, where each token is mapped to an individual dimension and weighted according to its term and document frequencies. As TF-IDF is already normalized, we do not need an extra normalization step.
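
To make this mapping more concrete, here is a rough Python sketch of a bag-of-words embedding with TF-IDF weighting. It is only an illustration of the concept: the delimiter set mirrors token_delim above, but Sally's exact TF-IDF variant and tokenization details may differ.

```python
import math
import re
from collections import Counter

# Delimiters roughly matching token_delim in example1.cfg
# (%0a, %0d, %20, %22 decode to newline, carriage return, space, quote).
DELIM = r"[\n\r \".,:;!?]+"

def tokens(text, n=1):
    """Split text at the delimiters and return n-grams of tokens."""
    toks = [t for t in re.split(DELIM, text) if t]
    return [" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)]

def tfidf(docs, n=1):
    """Map each document to a dict of n-gram -> TF-IDF weight."""
    bags = [Counter(tokens(d, n)) for d in docs]
    df = Counter(t for bag in bags for t in bag)  # document frequency
    N = len(docs)
    return [{t: tf * math.log(N / df[t]) for t, tf in bag.items()}
            for bag in bags]
```

Note how a token that appears in every document (here "oil" in both example articles) receives weight zero, while rare, category-specific tokens are emphasized.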

As the last step, we need to define the output format generated by Sally. We will use LibLinear in this example which supports the LibSVM format for reading data. Thus, we simply choose "libsvm" as format.

  output = {
     output_format = "libsvm";
     ...
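
In the LibSVM format, every vector becomes one text line: a numeric label followed by index:value pairs in ascending order of dimension. A minimal Python sketch of this formatting (libsvm_line is a hypothetical helper, not part of Sally or LibLinear):

```python
def libsvm_line(label, vec):
    """Format a sparse vector {dimension: value} as a LibSVM line.

    Dimensions are sorted ascending, as the format requires.
    """
    pairs = " ".join(f"{i}:{v:g}" for i, v in sorted(vec.items()))
    return f"{label} {pairs}"
```

For example, libsvm_line(1, {1: 2.0, 3: 0.5}) yields the line "1 1:2 3:0.5".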

Run the example

We are done with the configuration. Now it is time to apply Sally and learn a classifier for the categories. First, we download a data set of text documents.

Download the example data set reuters.zip

If you look into the data set reuters.zip, you will notice that it contains 2,264 news articles from Reuters. Each article is assigned to one of the following five categories: acq, earn, crude, interest and trade. You can determine the category of an article simply by looking at the suffix of its file name. This is a standard trick used by Sally for associating labels with files.
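
The suffix convention is easy to replicate; a one-line Python sketch extracts the category from a file name (label_of is our own helper name, not part of Sally):

```python
def label_of(filename):
    """Return the part after the last dot, e.g. "0001.earn" -> "earn"."""
    return filename.rsplit(".", 1)[-1]
```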

We now execute the following sequence of commands which will first generate the vectors in LibSVM format and then instruct LibLinear to compute a 5-fold cross-validation. Finally, LibLinear reports the average classification accuracy.

  sally -c example1.cfg reuters.zip reuters.libsvm
  train -v 5 -c 100 reuters.libsvm

If we had a second data set, we could now train a classifier on the first one and predict the categories of text on the second.