A Tool for Measuring String Similarity

Harry is a small tool for comparing strings and measuring their similarity. The tool supports several common distance and kernel functions for strings as well as some exotic similarity measures. The focus of Harry lies on implicit similarity measures, that is, comparison functions that do not give rise to an explicit vector space. Examples of such similarity measures are the Levenshtein distance and the Jaro-Winkler distance.
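To make the notion of an implicit measure concrete, the following is a minimal Python sketch of the Levenshtein distance in its textbook dynamic-programming form. It compares two strings directly, without mapping them to vectors first. This is an illustration only and not Harry's own implementation; the function name `levenshtein` is chosen for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit costs for insertion, deletion and substitution."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible

    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))
```

Only two rows of the dynamic-programming table are kept at a time, so memory use grows with the length of the shorter string rather than with the product of both lengths.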

For comparison, Harry loads a set of strings from input, computes the specified similarity measure and writes a matrix of similarity values to output. The similarity measure can be computed at the granularity of bytes, bits or tokens (words) contained in the strings. The configuration of this process, such as the input format, the similarity measure and the output format, is specified in a configuration file and can be further refined using command-line options.
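The processing steps described above can be sketched in a few lines of Python. The sketch splits each string into symbols at the chosen granularity, applies a measure to every pair and prints the resulting matrix. All names here (`symbols`, `jaccard`, `similarity_matrix`) are hypothetical, the set-based Jaccard coefficient is only a stand-in measure, and Harry's actual input formats, output formats and configuration options are not reproduced.

```python
from typing import Callable, List, Sequence


def symbols(text: str, granularity: str = "bytes") -> List:
    """Split a string into the symbols that are compared."""
    data = text.encode("utf-8")
    if granularity == "bytes":
        return list(data)
    if granularity == "bits":
        # most significant bit of each byte first
        return [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]
    if granularity == "tokens":
        return text.split()  # whitespace-delimited words
    raise ValueError(f"unknown granularity: {granularity}")


def jaccard(x: Sequence, y: Sequence) -> float:
    """Set-based Jaccard coefficient (assumption: duplicate symbols are ignored)."""
    sx, sy = set(x), set(y)
    if not sx and not sy:
        return 1.0
    return len(sx & sy) / len(sx | sy)


def similarity_matrix(strings: Sequence[str],
                      measure: Callable[[Sequence, Sequence], float],
                      granularity: str = "bytes") -> List[List[float]]:
    """Compute the full matrix of pairwise similarity values."""
    sequences = [symbols(s, granularity) for s in strings]
    return [[measure(x, y) for y in sequences] for x in sequences]


strings = ["the quick brown fox", "the lazy brown dog", "hello world"]
matrix = similarity_matrix(strings, jaccard, "tokens")
for row in matrix:
    print(" ".join(f"{value:.2f}" for value in row))
```

Changing the `granularity` argument to `"bytes"` or `"bits"` compares the same strings over a different alphabet, which generally yields different similarity values.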

Harry is implemented using OpenMP, such that the computation time for a set of strings scales linearly with the number of available CPU cores. Moreover, efficient implementations of several similarity measures, effective caching of similarity values and low-overhead locking further speed up the computation.
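The general idea of distributing pairwise comparisons over workers and caching repeated values can be illustrated as follows. This is a rough Python analogue under stated assumptions, not Harry's OpenMP code: it uses a thread pool from the standard library, a memoized Hamming distance as the example measure, and the convention that surplus characters of the longer string count as mismatches. Because of Python's global interpreter lock, the sketch shows the work partitioning rather than a real multi-core speed-up.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache
from itertools import combinations_with_replacement
from typing import List, Sequence


@lru_cache(maxsize=None)
def hamming(x: str, y: str) -> int:
    """Hamming distance; surplus characters of the longer string count as mismatches
    (an assumption made for this sketch)."""
    mismatches = sum(cx != cy for cx, cy in zip(x, y))
    return mismatches + abs(len(x) - len(y))


def distance_matrix(strings: Sequence[str], workers: int = 4) -> List[List[int]]:
    """Fill a symmetric distance matrix, computing each unordered pair once."""
    n = len(strings)
    matrix = [[0] * n for _ in range(n)]
    # upper triangle including the diagonal
    pairs = list(combinations_with_replacement(range(n), 2))

    def compute(pair):
        i, j = pair
        return i, j, hamming(strings[i], strings[j])

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # results are written by the main thread only, so no locking is needed here
        for i, j, value in pool.map(compute, pairs):
            matrix[i][j] = matrix[j][i] = value
    return matrix


result = distance_matrix(["karolin", "kathrin", "karolin"])
for row in result:
    print(row)
```

Exploiting symmetry roughly halves the number of comparisons, and the cache avoids recomputing values for repeated string pairs.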

Harry complements the tool Sally, which embeds strings in a vector space and allows computing vectorial similarity measures, such as the cosine distance.

Similarity Measures

The following similarity measures for strings are supported by Harry:

    dist_bag             Bag distance
    dist_compression     Normalized compression distance (NCD)
    dist_damerau         Damerau-Levenshtein distance
    dist_hamming         Hamming distance
    dist_jaro            Jaro distance
    dist_jarowinkler     Jaro-Winkler distance
    dist_kernel          Kernel-based distance
    dist_lee             Lee distance
    dist_levenshtein     Levenshtein distance
    dist_osa             Optimal string alignment (OSA) distance
    kern_distance        Distance substitution kernel (DSK)
    kern_spectrum        Spectrum kernel
    kern_subsequence     Subsequence kernel (SSK)
    kern_wdegree         Weighted-degree kernel (WDK)
    sim_braun            Braun-Blanquet coefficient
    sim_dice             Soerensen-Dice coefficient
    sim_jaccard          Jaccard coefficient
    sim_kulczynski       Second Kulczynski coefficient
    sim_otsuka           Otsuka coefficient
    sim_simpson          Simpson coefficient
    sim_sokal            Sokal-Sneath coefficient
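The similarity coefficients in this list are commonly defined in terms of three counts: a, the number of symbols shared by both strings, b, the number of symbols found only in the first string, and c, the number of symbols found only in the second. The following Python sketch uses these textbook definitions. Counting symbols as multisets, as well as the function names, are assumptions made for illustration and may differ from how Harry computes the values.

```python
from collections import Counter
from math import sqrt
from typing import Dict, Sequence, Tuple


def match_counts(x: Sequence, y: Sequence) -> Tuple[int, int, int]:
    """Return (a, b, c): shared symbols, symbols only in x, symbols only in y."""
    cx, cy = Counter(x), Counter(y)
    a = sum((cx & cy).values())  # multiset intersection
    b = sum((cx - cy).values())  # left over in x
    c = sum((cy - cx).values())  # left over in y
    return a, b, c


def coefficients(x: Sequence, y: Sequence) -> Dict[str, float]:
    """Textbook similarity coefficients for two non-empty sequences."""
    if len(x) == 0 or len(y) == 0:
        raise ValueError("both sequences must be non-empty")
    a, b, c = match_counts(x, y)
    return {
        "sim_braun": a / max(a + b, a + c),
        "sim_dice": 2 * a / (2 * a + b + c),
        "sim_jaccard": a / (a + b + c),
        "sim_kulczynski": 0.5 * (a / (a + b) + a / (a + c)),
        "sim_otsuka": a / sqrt((a + b) * (a + c)),
        "sim_simpson": a / min(a + b, a + c),
        "sim_sokal": a / (a + 2 * (b + c)),
    }


values = coefficients("night", "nacht")
for name, value in values.items():
    print(f"{name:16s} {value:.3f}")
```

All seven coefficients reach 1 for identical inputs and 0 when no symbols are shared; they differ in how strongly the mismatches b and c are penalized.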

Authors of Harry

Harry is currently developed by Konrad Rieck and Christian Wressnegger at the University of Göttingen. Previous versions of the tool were also developed at Idalab GmbH.

You can contact the main author at konrad at mlsec.org.
For news and updates follow us on Twitter.