Harry

A Tool for Measuring String Similarity

Changes

The following list of changes is automatically generated from the commit messages of the project's Git repository.

2016-04-16  Konrad Rieck 

    - new release for bug fix

2016-04-12  Konrad Rieck 

    - simplified build script

    - fixed incorrect implementation of bag distance (reported by r. feldt)

2015-12-31  Konrad Rieck 

    - preparing for new release

2015-12-07  Konrad Rieck 

    - updated libarchive requirements. closes #21

2015-10-22  Kevin Freeman 

    - reordering conditions to prevent skipping every 256th file of the
    input archive/dir

2015-03-27  Konrad Rieck 

    - fixed inconsistent naming of symbols

    - fixed dependencies in build process

    - changed order of checking

    - updated manpage

    - added some error checking to benchmark script

    - added support for floats of different sizes

    - added fallback if numpy is not installed

    - flipping dimensions on two input sources

    - improved naming of inputs

    - moved sanity check into comparison function

    - added two sanity checks for python code

2015-03-25  Konrad Rieck 

    - fixed bug in autotools setup

    - fixed year for potential 0.4 release

    - fixed bug when building tarball

    - sanitized interfacing with numpy

    - fixed naming in JSON output

    - added comments about col/row stuff

    - rewrote index computation

    - numerous minor name changes in matrix computation

    - fixed dimension in Python module

    - further renaming of matrix-related stuff

    - changed configuration parameters to new naming

    - fixed missing names in template

    - updated docs with new names for dimensions

    - started renaming of matrix dimensions: col and row vs. x and y

    - fixed bug in kernel test

    - fine tuning of test

    - added new test case with python-Levenshtein

    - fixed bug when using two input sources

2015-03-20  chwress 

    - update Copyright years of the main application

2015-03-20  Konrad Rieck 

    - added experiments with jellyfish

2015-03-19  Konrad Rieck 

    - sanitized compile-time generation of Python module

2015-03-16  Konrad Rieck 

    - minor fix to inline functions

2015-03-14  Konrad Rieck 

    - some minor enhancements to the Python module

    - fine tuning

    - added some Python examples to the tutorial

    - fixed minor typo

2015-03-13  Konrad Rieck 

    - am_conditional for python added

    - python script needs to be built on host

    - added python test cases. closes #16

    - improved python test

2015-03-12  Konrad Rieck 

    - added default options and a test case

2015-03-10  Konrad Rieck 

    - support for output compression

    - improved sanity checks for new granularity parameter

2015-03-09  Konrad Rieck 

    - we are coming close to a new release

    - updated tutorial

    - removed duplicate warning

    - test case for one and two inputs

    - improved warning due to changed command-line options

    - sanity check for two inputs added

    - removed stupid array fallback

2015-03-08  Konrad Rieck 

    - increased threshold for warning

2015-03-07  Konrad Rieck 

    - fixes to man page

    - added support for two input sources

    - minor fixes to docstring

    - improved option display in python

    - compile-time generation of options

    - improved python code

    - rearranged files

    - renamed directory

    - added support for keyword arguments

2015-03-06  Konrad Rieck 

    - improved python module

    - add harry path at compile time

    - improved python interface

    - first version of simple wrapper

    - progress bar should write to stderr

    - improved performance in verbose and log_line mode. the displayed progress
    might not be correct.

    - changed format of raw output

    - length is given in floats

2015-03-02  Konrad Rieck 

    - lousy text for granularity

    - convenience switches to enable stdin/raw format

    - fixed wrong function names

    - support for raw output added

    - minor fixes

    - first implementation of raw input mode

    - removed trailing pres

2015-02-12  Konrad Rieck 

    - code for access and comparison of bits

    - added compatibility warning

    - changed word to token in code and docs

2015-02-06  Konrad Rieck 

    - added granularity command-line option

    - added sanity checks for granularity

    - initialization for string granularity (+ beautification fixes)

    - adapted function names to new naming convention

    - made naming of symbol types consistent

2014-11-18  Konrad Rieck 

    - fixed bug in git2changes script

2014-10-31  Konrad Rieck 

    - finalized tests

    - moved patch to correct block

    - fixed version number

    - patch to support libconfig 1.3.x

    - Minor fixes for OpenBSD

    - removed temporary files

    - restricted precision of tests

    - support for selecting an output precision

    - pimped up the test cases

    - added test for loading/parsing of configs

    - further fixes

    - removed test code from tests ;)

    - improved checks

    - test case for options

    - added new test case

2014-10-30  Konrad Rieck 

    - minor fix for ubuntu 12.04

2014-10-29  Konrad Rieck 

    - renamed kernel function

2014-10-26  Konrad Rieck 

    - fixed bugs in prwlock configuration

    - fixed minor bugs in openmp config

    - support for disabling packages which are available

    - fixed typo

    - updated docs

    - improved configuration. openmp and libarchive are now optional

2014-10-26  Christian Wressnegger 

    - ...and also adapt the version number

2014-10-25  Christian Wressnegger 

    - libconfig9-dev -> libconfig8-dev

2014-10-23  Konrad Rieck 

    - minor fix for printing of command line options

    - updated title of man page

    - fixed name of sample data file

    - code beautification

2014-10-22  Konrad Rieck 

    - updated tutorial text

2014-10-21  Konrad Rieck 

    - updated the documentation + minor fix

    - running version of matlab export

    - removed trailing spaces

    - removed trailing spaces

    - fixed indentation

    - removed support for storing triangular matrices which is not possible when
    slicing

2014-10-19  Konrad Rieck 

    - fixed multiple bugs in JSON output

    - changed default for saving indices

    - support for triggering save options from command line

    - fixed naming of JSON fields

    - added note about dummy code

    - added missing header

    - more code for matlab support

    - first code for matlab support

2014-10-18  Konrad Rieck 

    - preparing new minor release

    - support for JSON format

2014-09-21  Konrad Rieck 

    - removed trailing spaces

    - rearranged code

2014-08-28  Konrad Rieck 

    - speed-up: fixed bug in caching of compression lengths

    - speed-up: replaced jaro distance implementation

    - robust comparison of floats

    - Levenshtein is hard to spell. I know

2014-08-27  Konrad Rieck 

    - new test cases for Levenshtein distance

    - speed-up: switch Levenshtein distance implementations

2014-08-26  Konrad Rieck 

    - speed-up: removed bit field in hstring

    - speed-up: inline compare function

    - remove asserts by default

    - minor tweaks

2014-08-26  Christian Wressnegger 

    - Consider the submatrix' position to the diagonal when calculating the
    (positional) specifications of the matrix. Now, this is done for all parts of
    the matrix rather than just for the middle region and thus, fixes bug #12
    (Split computation broken)

2014-08-25  Konrad Rieck 

    - fixed stupid error in benchmark function

    - implementation of simple benchmarking loop

2014-08-24  Christian Wressnegger 

    - make the uniform splitting feature (issue #5) depend on the definition of
    USE_UNIFORM_SPLITTING and disable it for the time being

    - simplify matrix initialization and make use of hmatrix_inferspec for
    determining the number of values to compute

2014-08-17  Christian Wressnegger 

    - ignoring project and temporary files

    - document the hmatrix specification inference

2014-08-15  Christian Wressnegger 

    - Fixes #5 (Uniform splitting of full square matrices)

2014-08-18  Konrad Rieck 

    - Update README.md

2014-08-17  Konrad Rieck 

    - broken if pdflatex is not available

    - added examples to installed docs

    - removed DNA example and fixed tutorial

    - more READMEs

    - README for reuters example

    - fixed typo

    - improved first example

    - better keep datasets in a separate branch

2014-08-16  Konrad Rieck 

    - direct output

    - missing type added

    - merge benchmark experiment

    - fine-tuned experimental procedure

    - fixed benchmark script

    - the ARTS dataset is too different from the others and spoils any experiment
    :/

    - Fixed progress bar and initialization for #10 fix

    - collapsed main loops. first attempt at fixing #10

    - simplified benchmark for now

    - Skip redundant computations by checking for NaN. Closes #4

2014-08-14  Konrad Rieck 

    - minor fixes

    - changed looping in benchmark

    - revived benchmark code

    - updated comment

    - extended support for negative indices

    - code beautification

    - command-line switch for soundex

2014-08-14  Christian Wressnegger 

    - wrap the progress bar in a critical region

    - consider div-by-0 when calculating the vcache's hitrate

2014-08-13  Christian Wressnegger 

    - remove unused variables

    - make use of the format specifier PRIu64 for printing uint64_t values

    - another instance of the gzFile vs struct gzFile_s confusion

    - typo

    - gzopen returns a gzFile type which in turn already is a pointer to struct
    gzFile_s

    - include config.h in hconfig.h in order to make use of defines set by the
    configure script

2014-08-05  Konrad Rieck 

    - comment about implementation bug

    - implemented support for soundex index. closes #9

2014-07-30  Konrad Rieck 

    - fixed man page layout

    - fixed doxygen tag

    - update doxygen config

2014-07-28  Konrad Rieck 

    - fixed boundary check in Lee distance

    - extended symbol size to 64 bit. fixed #7

2014-07-26  Konrad Rieck 

    - re-enabling redundant computations. issue #4 needs to be fixed later

    - fixed bug

    - fixed bug

    - increased default symbol size

2014-07-24  Konrad Rieck 

    - preparation of new release

    - fixed bug in test cases

    - updated autotools procedure

    - fixed bug in access to symmetric submatrices. closed #6

    - memory saved when computing partial matrices. implemented #2

    - code cleanup

    - eliminated redundant computations. fixed #4

    - improves memory usage for symmetric matrices

2014-07-23  Konrad Rieck 

    - fixed second bug from yesterday's coding session

    - fixed first bug from yesterday's coding session

    - oops some crap slipped through

    - first working version of matrix splitting

    - comment about difficulty of parsing Python array indices

    - first code for splitting matrices

2014-07-22  Konrad Rieck 

    - features are tracked via github issues

    - fixed typo

    - feature added

    - swapped axes in output matrices

    - changed name of parameter: it's called a triangular matrix

    - added more code for range support

    - fixed warning about uninitialized variable in openmp loop

2014-07-18  Konrad Rieck 

    - updated TODOs

2014-07-15  Konrad Rieck 

    - added missing break

    - support for boolean configuration parameters

    - some code beautification

    - support for saving full or half matrices

    - added docs for stdout support

    - support for writing to stdout

    - support for selecting stdin with '-'

    - support for reading from stdin

2014-07-13  Konrad Rieck 

    - implemented dynamic allocation

    - added config and doc for chunk parameter

    - adapted input interface and modules

2014-05-03  Konrad Rieck 

    - new version approaching

    - switched to old time measuring

2014-05-02  Konrad Rieck 

    - moved timestamp into loop

2014-04-30  Konrad Rieck 

    - minor code fixes

2014-04-29  Konrad Rieck 

    - added an extra flush for stdout

    - fixed two minor bugs

2014-04-29  Konrad Rieck 

    - support for log lines

2014-02-25  Konrad Rieck 

    - removed double hyphen

2014-02-21  Konrad Rieck 

    - fixed bug in operator precedence

    - fixed bug and added test case

2014-02-20  Konrad Rieck 

    - fixed bug in osa computation

    - fixed bug in output and added GWDG cluster support

2014-02-19  Konrad Rieck 

    - fixed two bugs

    - fixed syntax in cfg file

    - updated readme

    - implementation of optimal sequence alignment (OSA) distance

2013-12-27  Konrad Rieck 

    - removed aliases from list of measures

    - sync

2013-12-26  Konrad Rieck 

    - smaller dataset

    - caching id needs to be considered by vcache

    - new example

    - further simplified

    - simplified example 1

2013-12-24  Konrad Rieck 

    - note about aliases in man page

    - add another alias

    - support for aliases

2013-12-21  Konrad Rieck 

    - support for selecting a separator in text mode

2013-12-18  Konrad Rieck 

    - smaller dataset

    - note about slow spectrum kernel

2013-12-13  Konrad Rieck 

    - typos

2013-12-12  Konrad Rieck 

    - minor changes to README

    - updated TODOs

    - improved manpage

    - fixed problem with large strings and stack space

    - fixed problem with large strings and stack space

    - fixed problem with large strings and stack space

2013-12-11  Konrad Rieck 

    - fixed stupid typo

    - note about libsvm added

    - pro tip: resolving symbols for debugging works better if ASLR is disabled.

    - fixed bug in display of threads

    - fixed bug in progress bar

    - fixed bug with small strings

    - moved temporary arrays to heap

    - more examples

    - added first example

    - code beautification

2013-12-10  Konrad Rieck 

    - hmatrix implementation (partial)

2013-11-22  Konrad Rieck 

    - command-line support for selecting ranges

2013-11-21  Konrad Rieck 

    - simple implementation of matrix layer => volunteers welcome!

    - doxygen fixes

    - small fixes to code

    - code beautification

    - restricted harry to symmetric similarity measures

    - index fixes

    - openmp performance fix

    - simplified internal string representation

    - minor fixes

    - first implementation of hmatrix

    - first implementation of matrix object

    - minor fix

2013-11-16  Konrad Rieck 

    - small fix to readme

    - fixes for future OpenBSD support :(

2013-11-15  Konrad Rieck 

    - minor text fixes

2013-11-08  Konrad Rieck 

    - manpage fixes

2013-11-05  Konrad Rieck 

    - implementation of spectrum kernel by leslie

    - fixed typo

2013-11-04  Konrad Rieck 

    - minor fix to readme

    - code beautification

    - final fixes and test case

    - so far code is broken. now lunch at ccs ;)

    - first implementation of kernel-based distance

2013-11-02  Konrad Rieck 

    - fixed bug in damerau distance

    - comment about inner-product space and distance

    - fixed test case

    - test case added

    - implementation of distance substitution kernel

    - fixed broken vcache

2013-11-01  Konrad Rieck 

    - first code for distance substitution kernel

    - moved measure name matching to separate function

2013-10-28  Konrad Rieck 

    - fixed sizeof call

2013-10-19  Konrad Rieck 

    - updated todos

2013-10-13  Konrad Rieck 

    - small fix to man page

2013-10-12  Konrad Rieck 

    - it's harry

2013-10-07  Konrad Rieck 

    - structure for benchmark experiments

2013-10-05  Konrad Rieck 

    - fixed benchmark script

    - better output

    - fixed counting bug again

    - fixed bug in arc reading

    - better verbose output

    - defensive setting of string type

    - fixed new name

    - rather a swap than a rol

    - benchmark script

    - changed parameter name

    - commandline option for setting num threads

    - fixed datatype

    - benchmark datasets added

    - small compile changes

2013-10-04  Konrad Rieck 

    - test output added

2013-10-03  Konrad Rieck 

    - improved cache stats

    - removed old cmdline option

    - much simpler cache implementation

    - enabled pthread locks

    - fixed bug in hash function

    - switched to POSIX rwlocks. Fallback OpenMP mutex

    - useless profiling of rwlock

2013-09-26  Konrad Rieck 

    - improved rwlock. still hangs from time to time

    - new (broken) implementation using a semaphore

2013-09-21  Konrad Rieck 

    - updated references

    - test cases for similarity measures

    - further improvements to the list

    - list of measures added

    - support for listing similarity measures

    - simplified distance function

    - fixed jaro-winkler implementation

    - improved interface of measures

    - simplified names

    - simplified norm names

    - refactored normalization

    - fixed typo

    - implemented different matching modes

    - updated documentation

    - code beautification

    - sanitized measure interface

2013-09-20  Konrad Rieck 

    - fixed bug in similarity computation

    - fixed bug

    - first list of coefficients

    - first implementation of similarity coefficients

    - updated todo

    - forgot norm files

    - note about openmp

    - refactored kernel normalization

    - refactored length normalization

    - fixed some doc

    - test case for bag distance

    - added normalization

    - first version of bag distance

    - sync

    - fixed back tag

2013-09-17  Konrad Rieck 

    - updated man

    - improved lee distance

    - more updates to references

    - added references

    - more test cases for compression distance

2013-09-16  Konrad Rieck 

    - test case for compression distance

    - test cases for kernel normalization

    - support for kernel normalization

    - code beautification

    - added another test case

    - dirty fix to kernel

    - test case and kernel still broken

    - added broken test case

2013-09-15  Konrad Rieck 

    - note about difficulty of implementing a fair rwlock

    - sync

    - small fixes

    - rwlock for vcache

    - small code beautification

    - ignore xcode chaff

    - first code for rwlock

    - removed dead code

    - refactored basic string class

2013-09-10  Konrad Rieck 

    - symmetrization will be treated separately

    - updated manual

    - usual code beautification

    - rewrote stopword filtering

    - moved preprocessing functions

    - added note about regex

    - improved progress bar

    - renamed source file with config functions

    - added missing code from sally

    - updated TODOs

    - fixed description

    - added note about locking

2013-01-09  Konrad Rieck 

    - preproc function

    - changed str struct to save space

    - fixed broken printf in output

2013-09-01  Konrad Rieck 

    - fixed weird overflow in bit-field

    - changed type to bit-field

    - revived label field from sally

    - added libsvm format

2013-08-29  Konrad Rieck 

    - updated readme

2013-08-28  Konrad Rieck 

    - added missing options

    - implementation of compression distance was ... erhm ... crap. fixed it

    - fixed bug in compile script

    - finished compression distance and cache functionality

2013-08-26  Konrad Rieck 

    - some code for compression distance

    - improved test cases

    - minor improvements to cache

    - implementation of value hash

2013-08-21  Konrad Rieck 

    - implementation of subsequence kernel (untested)

    - prepared code for cache functions

    - doxygen fixes

2013-08-20  Konrad Rieck 

    - better init for alphabet

    - implementation of Damerau-Levenshtein distance

    - updated TODO

    - updated README

    - cleanup of test cases

    - closed form of weighted-degree kernel

    - code beautification

    - note about supported measures

    - fixed bug in weighted degree kernel and added test case

    - added and fixed test case. kern_wdegree still broken

    - implementation of weighted-degree kernel

2013-08-18  Konrad Rieck 

    - note about importance of openmp ;)

2013-08-17  Konrad Rieck 

    - changed jaro-winkler for empty strings: d(e,e) = 0; d(e, x) = 1

    - removed debug output

    - updated TODOs

    - minor fixes to manpage

    - support for Lee distance

    - support for the Jaro and Jaro-Winkler distance

    - support for output compression

    - code beautification

    - restructured main file

    - triangular indexing

2013-08-11  Konrad Rieck 

    - first simple complete version

2013-08-10  Konrad Rieck 

    - strings and symbols stored in union

    - markdown

    - added TODO doc

    - string functions

    - test cases and str functions

    - changed interface for string comparison

    - added utility function for strings

    - added missing credits

    - libarchive-3 fixes

    - better keep this in line

    - more heavy editing

    - fixed doxygen comments

    - added note about md5 hash

    - ports from sally project

    - further code refactoring from project simone

    - imported test cases from simone

2013-08-03  Konrad Rieck 

    - parameter is called type now

    - old test cases from simone

    - autoconf switches for openmp

    - changed name of tool

    - remaining code of simone project

    - added error to stop testing by others ;)

    - added small logo

    - further fixes from sally port

    - new location of libarchive on homebrew

    - imported code from sally and simone

    - adapted code from old prototype

    - completed autotools stuff

    - added missing make files

    - new import