This directory contains data sets and code snippets for benchmarking the run-time performance of Sally.
All data sets have been preprocessed so that they can be read by Sally using the input mode "lines". That is, each string is given as one line of a text file, with non-printable characters escaped using URI encoding. This format is easy to read from many scripting languages and thus suitable for an empirical comparison.
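As an illustration of the format described above, the following minimal Python sketch decodes such a file, assuming that "URI encoding" means standard percent-encoding (for example, %0A for a newline). The function names decode_line and read_lines are hypothetical and are not taken from the benchmark scripts.

```python
from urllib.parse import unquote_to_bytes


def decode_line(line: str) -> bytes:
    """Decode one percent-encoded line into the raw bytes of the string."""
    # Drop the line terminator, then turn sequences like %0A back into bytes.
    return unquote_to_bytes(line.rstrip("\n"))


def read_lines(path: str) -> list:
    """Read a data set stored in the "lines" format, one string per line."""
    # Hypothetical helper: the encoded file itself contains only ASCII.
    with open(path, "r", encoding="ascii") as handle:
        return [decode_line(line) for line in handle]
```

Working on bytes rather than text avoids any assumption about the character encoding of the original strings.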
Strings are often embedded using scripting languages such as Matlab and Python. To compare the run-time of such scripts with Sally, the main embedding technique has been implemented in both Python and Matlab. The scripts use hashed features and try to be as efficient as possible (without, however, resorting to the black magic of the respective languages).
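To give an idea of what an embedding with hashed features can look like, here is a minimal Python sketch. It is not the benchmark script itself. It assumes that the features are byte n-grams, and it uses CRC32 as a stand-in hash function. The name embed and the parameters n and bits are hypothetical, and the hash function and defaults used by Sally are not stated in this text.

```python
import zlib
from collections import Counter


def embed(data: bytes, n: int = 3, bits: int = 16) -> dict:
    """Map a string to a sparse vector of hashed n-gram counts."""
    mask = (1 << bits) - 1  # restrict hash values to 2**bits dimensions
    counts = Counter()
    # Slide a window of n bytes over the string and count each hashed n-gram.
    for i in range(len(data) - n + 1):
        # CRC32 is only a stand-in hash chosen for this sketch.
        counts[zlib.crc32(data[i:i + n]) & mask] += 1
    return dict(counts)
```

Hashing removes the need to keep a dictionary of all observed n-grams. The price is that distinct n-grams may collide in the same dimension, which becomes more likely as bits gets smaller.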
The run-time performance has been evaluated for the different implementations and data sets on an Intel Xeon X5550 CPU (2.67 GHz). The raw performance numbers are available here.
The full code and all configuration files are available in the Sally repository on GitHub. Check out the time_eval branch.