Sally – Home
A tool for embedding strings
Sally is a small tool for mapping a set of strings to a set of
vectors. This mapping is referred to as embedding and makes it
possible to apply machine learning and data mining techniques to
string data. Sally can be applied to several types of string data,
such as text documents, DNA sequences or log files, and it reads
common input formats such as directories, archives and text files.
Sally implements a standard technique for mapping strings to a
vector space that is often referred to as vector space model
or bag-of-words model. The strings are characterized by a set
of features, where each feature is associated with one dimension of
the vector space. Sally supports the following types of features:
bytes, words, n-grams of bytes and n-grams of words.
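The four feature types can be illustrated with a minimal Python sketch. This is not Sally's own code: the function names `byte_ngrams` and `word_ngrams` are hypothetical, and splitting words on whitespace is an assumption made for the example. Single bytes and single words are simply the case n = 1.

```python
def byte_ngrams(data: bytes, n: int = 1) -> list[bytes]:
    """Return all overlapping n-grams of bytes; n=1 yields single bytes."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def word_ngrams(text: str, n: int = 1) -> list[tuple[str, ...]]:
    """Return all overlapping n-grams of words; n=1 yields single words.

    Assumption for this sketch: words are delimited by whitespace.
    """
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


# Example: 3-grams of bytes from a DNA string, 2-grams of words from text.
print(byte_ngrams(b"GATTACA", 3))
print(word_ngrams("the quick brown fox", 2))
```

Each distinct n-gram produced this way would correspond to one dimension of the vector space.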
Sally proceeds by counting the occurrences of the specified features
in each string and generating a sparse vector of count
values. Alternatively, binary or TF-IDF values can be computed and
stored in the vectors. Sally then normalizes the vector, for example
using the L1 or L2 norm, and outputs it in a specified format, such
as plain text, LibSVM or Matlab format.
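The pipeline described above (count, optionally binarize or weight by TF-IDF, normalize, write out) can be sketched in a few lines of Python. This is an illustrative sketch only, not Sally's implementation: all function names are hypothetical, the TF-IDF weighting shown (count times log(N / document frequency)) is one common variant and may differ from what Sally computes, and the mapping from features to dimension indices is passed in explicitly as an assumption.

```python
import math
from collections import Counter


def embed(features, binary=False):
    """Map a list of features to a sparse vector (dict) of count or binary values."""
    counts = Counter(features)
    return {f: 1.0 if binary else float(c) for f, c in counts.items()}


def tfidf(docs):
    """Weight counts by log(N / df); a common TF-IDF variant (assumed here)."""
    n_docs = len(docs)
    df = Counter(f for doc in docs for f in set(doc))
    return [
        {f: c * math.log(n_docs / df[f]) for f, c in Counter(doc).items()}
        for doc in docs
    ]


def normalize(vec, norm="l2"):
    """Scale a sparse vector to unit L1 or L2 norm; zero vectors are returned as-is."""
    if norm == "l1":
        length = sum(abs(v) for v in vec.values())
    else:
        length = math.sqrt(sum(v * v for v in vec.values()))
    if length == 0:
        return dict(vec)
    return {f: v / length for f, v in vec.items()}


def to_libsvm(vec, index, label=0):
    """Format a sparse vector as one LibSVM line: 'label idx:val idx:val ...'.

    `index` maps each feature to a 1-based dimension (hypothetical helper mapping).
    """
    pairs = sorted((index[f], v) for f, v in vec.items())
    return " ".join([str(label)] + [f"{i}:{v:g}" for i, v in pairs])


features = ["a", "b", "a", "c"]
vec = normalize(embed(features), norm="l1")
print(to_libsvm(vec, {"a": 1, "b": 2, "c": 3}))
```

Storing only the non-zero entries in a dictionary mirrors the sparse vectors mentioned above: most strings contain a small fraction of all possible features.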
There are many applications for Sally, for example, in the areas of natural language processing, bioinformatics, information retrieval and computer security. To illustrate the merit of Sally, we provide some examples including text categorization, finding genes in DNA and analysing similarities of languages. All examples come with data sets and instructions.
Authors of Sally
Sally is developed by Konrad Rieck,
Christian Wressnegger and
Alexander Bikadorov at the University of
Göttingen. Previous versions of Sally have been developed at the
Machine Learning Group of Technische Universität Berlin.
You can contact the main author at
konrad at mlsec.org.
For news and updates
follow us on Twitter.

