Benchmark Data Sets for Sally

This directory contains data sets and code snippets for benchmarking the run-time performance of Sally.

Data sets

All data sets have been preprocessed so that Sally can read them using the input mode "lines". That is, each string is given as one line of a text file, where non-printable characters are escaped using URI encoding. This format can be read easily in many scripting languages and is thus suitable for an empirical comparison.
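As an illustration of the "lines" format, the following minimal Python sketch reads such a file and reverses the URI encoding. The function names `decode_line` and `read_lines` are hypothetical and are not taken from the benchmark scripts. The sketch assumes that escaping uses the usual `%XX` form.

```python
from urllib.parse import unquote_to_bytes


def decode_line(raw: bytes) -> bytes:
    """Strip the line terminator and undo %XX escapes (hypothetical helper)."""
    return unquote_to_bytes(raw.rstrip(b"\r\n"))


def read_lines(path):
    """Yield one decoded byte string per line of a "lines"-mode file."""
    # Binary mode avoids any implicit text decoding of the raw data.
    with open(path, "rb") as f:
        for raw in f:
            yield decode_line(raw)
```

Working on bytes rather than text keeps escaped non-printable characters intact after decoding.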

Script code

Strings are often embedded using scripting languages such as Matlab and Python. To compare the run-time of such scripts with Sally, the main embedding technique has been implemented in both Python and Matlab. The scripts use hashed features and try to be as efficient as possible (without, however, resorting to the black magic of the respective languages).
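To give a rough idea of what an embedding with hashed features can look like, here is a minimal Python sketch. It is not the benchmark script itself. The byte n-gram features, the n-gram length, the CRC32 hash, the number of hash bits, and the L2 normalization are all assumptions chosen for illustration. The function name `embed` is hypothetical.

```python
import math
import zlib
from collections import Counter


def embed(s: bytes, n: int = 3, bits: int = 16) -> dict:
    """Map a byte string to a sparse, L2-normalized vector of hashed n-grams.

    Assumptions (not taken from the benchmark scripts): byte n-grams as
    features, CRC32 as hash function, 2**bits dimensions.
    """
    mask = (1 << bits) - 1
    counts = Counter()
    # Slide a window of length n over the string and count hashed n-grams.
    for i in range(len(s) - n + 1):
        counts[zlib.crc32(s[i:i + n]) & mask] += 1
    norm = math.sqrt(sum(v * v for v in counts.values()))
    if norm == 0.0:
        return {}  # string shorter than n: no features
    return {dim: v / norm for dim, v in counts.items()}
```

Hashing removes the need to keep a dictionary of all observed n-grams, at the cost of occasional collisions between different n-grams that fall into the same dimension.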

Raw Results

The run-time performance has been evaluated for the different implementations and data sets on an Intel Xeon CPU X5550 (2.67GHz). The raw performance numbers are available here


The results of this benchmark are presented in the article "Sally: A Tool for Embedding Strings in Vector Spaces" by Konrad Rieck, Christian Wressnegger, and Alexander Bikadorov. Journal of Machine Learning Research (JMLR), 13 (Nov), 3247–3251, November 2012.

More information

The full code and all configuration files are available in the Sally repository on GitHub. Check out the time_eval branch.