Sally

A Tool for Embedding Strings in Vector Spaces

Changes

The following list of changes is automatically generated from the commit messages of the project's Git repository.

2015-03-10  Konrad Rieck 

    - checks for new parameters

    - changed naming of ngram_delim to token_delim. see harry.

    - adapted naming to harry configuration

    - adapted test cases to new default setup

    - improved sanity checks for new granularity parameter

    - removed annoying no warning def

    - fixed typo

2015-03-09  Konrad Rieck 

    - we are coming close to a new release

    - added command-line option for granularity

    - fix concept to generalized bow

    - minor fixes in documentation

    - fixes in README

    - word -> token in configuration files

    - updated the manual page

    - updated the manual page

    - fixed incorrect parameter name

    - adapted test case to new output

    - first version of granularity patch

2015-01-30  Konrad Rieck 

    - fixed bug in sanity check for stopwords

2015-01-22  chwress 

    - Check for valid n-gram lengths. Fixes issue #14

2014-11-18  Konrad Rieck 

    - fixed bug in git2changes script

    - new release for bug fix

2014-11-17  Konrad Rieck 

    - fixed concurrency bug with blended n-grams

2014-10-31  Konrad Rieck 

    - finalized tests

    - fixed bug in config syntax

    - added test case for config files

    - added test case for options

    - minor fixes for openbsd

    - fixed type issue with 64 bit integers

    - fixed version numbers

    - completed patch

    - patch for libconfig-1.3.2

    - fixed bug in configuration

2014-10-26  Konrad Rieck 

    - oops. skipped one version accidentally

    - heading to new release

    - fixed compiler warnings on linux

    - improved configuration. openmp and libarchive are now optional and
    auto-detected.

2014-10-26  Christian Wressnegger 

    - ...and also adapt the version number

2014-10-25  Christian Wressnegger 

    - libconfig9-dev -> libconfig8-dev

2014-10-23  Konrad Rieck 

    - updated title of man page

2014-10-19  Konrad Rieck 

    - fixed condition on hash reset

    - minor code cleanup

2014-10-18  Konrad Rieck 

    - preparing new minor release

    - code beautification

    - support for JSON format

2014-07-26  Konrad Rieck 

    - simplified default config

2014-07-24  Konrad Rieck 

    - updated autotools procedure

2014-07-22  Konrad Rieck 

    - features are tracked via github issues

2014-07-20  Konrad Rieck 

    - support for skipping null vectors

2014-07-18  Konrad Rieck 

    - improved support for stdin reading

    - don't be verbose

    - separate source with command char

    - fixed bug in output module

    - support for stdout

2014-05-31  Konrad Rieck 

    - new test cases for n-grams

    - preparing new version

2014-05-30  Konrad Rieck 

    - code cleanup

    - dimension reduction with a Bloom filter

    - added two short command line options

    - fixes typical tfidf recursion problem, AGAIN

    - minor fixes

    - first implementation of blended n-grams

    - updated docs and fixed printing of config

2014-05-07  Konrad Rieck 

    - sparsify reduced dimensions

    - fixed bugs in reduction functions

    - fixed broken minhash implementation

    - some badly needed code beautification

    - fixes to header file

    - simple rehash function

2014-05-06  Konrad Rieck 

    - position is 32bit signed int

    - fixed docs

2014-05-06  Christian Wressnegger 

    - correctly overwrite values of wrong type

    - correctly handle old config files

    - make use of CONFIG_TYPE_BOOL and introduce program switches as replacement
    for program arguments expecting 0 & 1 as value.

2014-05-06  Konrad Rieck 

    - improved configuration of positional n-grams and their shift

2014-04-06  Konrad Rieck 

    - doxygen text changed

2014-04-02  Konrad Rieck 

    - more minor code fixes

    - more text fixes

    - note about hash_bits

    - minhash is not sparse

    - more docs

    - more comments

    - extended description slightly

    - simhash and minhash running

    - added config and docs

    - first version of dim reduction

2014-02-25  Konrad Rieck 

    - sync

2013-12-25  Konrad Rieck 

    - new release

    - removed obsolete doxygen config

2013-12-13  Konrad Rieck 

    - updated cleaning in makefile

2013-12-12  Christian Wressnegger 

    - fix the processing of archive files when computing IDF weightings

2013-11-16  Konrad Rieck 

    - small fix to readme

    - fixed bug in markdown

    - fixes for OpenBSD (5.4)

2013-10-28  Konrad Rieck 

    - fixed sizeof call

2013-10-10  Konrad Rieck 

    - darp fix

2013-10-05  Konrad Rieck 

    - fixed bug in file counting

2013-09-10  Konrad Rieck 

    - added note about regex

2013-08-10  Konrad Rieck 

    - replaced deprecated library calls

    - further libarchive-3 patches

    - minor fix

2013-08-03  Konrad Rieck 

    - added small logo

    - updated readme

    - new location of libarchive on homebrew

2013-06-15  Konrad Rieck 

    - fixed bugs in doxygen doc

    - improved terminology in manual

    - updated year

    - preparations for new release

    - minor fixes

    - added note about hash file to docs

    - implemented option for hash file

    - code for new config parameter

    - fixed Matlab encoding bug

    - changed obsolete AM macro

2013-01-07  Konrad Rieck 

    - updated test cases and CHANGES

2012-12-27  Konrad Rieck 

    - changed format of CHANGES file

    - new version

    - updated autotool config

2012-12-18  Konrad Rieck 

    - ubuntu switched to libconfig9

2012-11-22  Konrad Rieck 

    - fixed indexing bug

    - added note about shift to man page

    - cleaned up word extraction

    - fixed caching of positional n-grams with shift

    - one pos_shift by extract run

2012-10-08  Christian Wressnegger 

    - fix wrong indexing in the Matlab output module

2012-10-08  Konrad Rieck 

    - corrected package name

2012-09-28  Konrad Rieck 

    - Added authors to documentation

2012-09-04  Christian Wressnegger 

    - support for libarchive-3

2012-08-28  Konrad Rieck 

    - added misc docs to manual

    - the semantics of gzeof silently changed. rah.

    - switched order of warning *again*

    - removed manual cmdline parsing

    - changed order of print/warning for config

    - moved config printing to the init function

    - minor code beautification

2012-08-24  chwress 

    - only issue a warning about the missing configuration file once we try to
    actually run the sally processor

2012-08-24  Konrad Rieck 

    - should be false to avoid cycle of calls

    - only 4-char tabs will bring you to heaven

    - brackets

    - fixed len assignment

    - changed times implementation to support loop and bsearch variant

2012-08-24  chwress 

    - fix some typos in the code documentation

    - nicely make use of TRUE & FALSE defines

    - make fvec_equals actually work (thx konrad)

2012-08-23  Konrad Rieck 

    - changed length of word are removed

    - new switch for default config is D

2012-08-23  chwress 

    - introduce idf_check for being able to properly unit-test the internal
    computed idf values

    - introduce the fvec_equals(.,.) function

    - the 1-pass of the tf*idf embedding (determining the idf values) must not
    "post-process" the raw feature vector

2012-08-22  chwress 

    - unify the feature vector creation

    - fix fvec_times for empty feature vectors

    - introduce fvec_truncate

    - make sure not to try to allocate memory if a zero-length feature vector is
    requested

    - add sum test for arithmetic operations with sparse features (this is far
    from complete)

2012-08-21  Konrad Rieck 

    - implemented thresholding of vectors

    - ... == 0.0 is problematic with floats. hope i don't run into problems later

    - sanity check

    - config for thresholding

    - docs for thresholding

    - minor fixes

2012-08-21  chwress 

    - update the email address of chwress

    - naming ("print_default" -> "print_defaults")

2012-08-21  chwress 

    - update the documentation for --print_config & --print_default

    - Differentiate between --print_config (-C) and --print_default (-D): the
    first prints the current configuration, including the user's
    configuration file and program arguments, and the latter prints the default
    configuration as -P did.

2012-08-20  Konrad Rieck 

    - fixed config for timing

    - added timing output

    - fixed wrong #ifdef

    - clean-up of compile script

2012-08-20  chwress 

    - typo

    - properly return Sally's main function

2012-08-17  Konrad Rieck 

    - re-enabled openmp code

    - close to 0.8. let's do some beta testing first

    - changes documented

    - first version of stop word filtering

    - started with stop word processing

    - exit if stop word file not available but requested

    - refactored delimiter setup

    - basic framework for stopwords

    - added missing header (thanks to Andreas Ziehe)

    - stop words require the definition of delimiters

    - added missing file to automake

    - added run-time eval code to master

    - fixed example config

    - added documentation for stop words

    - support for configuring stop words

    - updated config

2012-08-16  Konrad Rieck 

    - removed confusing filtering of empty lines

    - extended example config

    - support for reversing of strings

    - annoying line wrap fixed

    - fixed wrong email in docs

    - removed global configuration file

    - beautified printing of config

    - support for printing the default configuration

2012-08-13  Konrad Rieck 

    - added christian as author

    - added new authors

    - minor beautification in stdin module

    - fixes to stdin module

2012-08-09  Alexander Bikadorov 

    - added stdin input module

2012-08-08  Konrad Rieck 

    - Disabled MD5 support in default setup

2012-08-06  Konrad Rieck 

    - changed last test case to an empty string, instead of just a whitespace
    char

    - restored pull request #5 which was accidentally reverted by request #6

    - empty strings should be skipped. see other input modules

    - fixed wrong index in string assignment

    - synchronized sally.cfg and sconfig.c (thx to chwress)

    - removed support for .sally config

    - sorted n-grams disabled by default

2012-08-03  chwress 

    - strip newline characters at the end of each line of the input file

    - revert cff4cc3...

2012-08-03  chwress 

    - correctly count the number of lines rather than the number of newline
    characters

2012-08-03  chris 

    - fix the length specifier of the percent decoded input strings

2012-06-27  Konrad Rieck 

    - sync

2012-05-26  Konrad Rieck 

    - added md README

2012-05-14  Konrad Rieck 

    - minor fixes to man page

    - fixed typo

    - fixed configure

2012-05-13  Konrad Rieck 

    - fixed broken aliasing

    - fixed broken md5 copy

    - fixed gzfile pointer madness

    - added note about zlib dep

    - new release

    - spell check

    - extended README

2012-05-12  Konrad Rieck 

    - added shift. cache still broken

2012-04-23  Konrad Rieck 

    - code cleanup 6

    - code cleanup 5

    - code cleanup 4

    - code cleanup 3

    - code cleanup 2

    - code cleanup 1

2012-03-13  Konrad Rieck 

    - new laptop

2012-03-01  Konrad Rieck 

    - changed legend

    - some results

    - simple measuring function

2012-02-24  Konrad Rieck 

    - unified output of dimensions (via c. wressnegger)

2012-02-21  Konrad Rieck 

    - fixed bug in sign computation

    - removed some brackets

    - support for signed embedding added

    - minor fix to documentation

    - added option

    - updated options

    - added documentation of sign feature

    - added framework for new option

2012-02-10  Konrad Rieck 

    - first cleanup of lib dir

    - destroyed structure. hmm.

    - sync

2012-02-06  Konrad Rieck 

    - updated incorrect version number

    - new release done

    - new release

2012-02-05  Konrad Rieck 

    - added note about sorted n-grams

    - added note about decoding

    - code for sorting bytes in n-grams

    - code for sorting words in n-grams

    - minor fix to decoding function

    - missing comma added

    - added config switches

    - added support for decoding strings with URI encoding

    - further fixed README

    - switched to libtoolize

2012-02-04  Konrad Rieck 

    - another dir to ignore

    - moved decode function to util

2012-01-22  Konrad Rieck 

    - sync

2011-10-20  Konrad Rieck 

    - Adapted flags to GWDG cluster

2011-10-19  Konrad Rieck 

    - adapted compile script

    - added cmd switch

    - first running version of pos ngrams

    - half version of positional n-grams. explicit representation at least
    incorrect

    - moved hash computation to extra func

    - moved cfg into extraction functions

    - config for positional n-grams

    - fixed bug from debugging

2011-08-22  Konrad Rieck 

    - fixed nasty bug in fasta module

    - minor changes

2011-08-16  Konrad Rieck 

    - removed annoying -x in bootstrap script

2011-07-12  Konrad Rieck 

    - fixed annoying bug in feature output (credits to t. krueger)

2011-07-10  Konrad Rieck 

    - sync

2011-07-09  Konrad Rieck 

    - further comments about run-time behavior

    - added note about dimensionality

2011-07-05  Konrad Rieck 

    - version change

    - minor bug fix in output module

    - first version of output module for cluto

2011-04-01  Konrad Rieck 

    - added missing config file

2011-03-09  Konrad Rieck 

    - minor fix

2011-02-21  Konrad Rieck 

    - final release

    - minor typos fixed

    - new release approaching

    - further extended man page

    - extended man page

2011-01-17  Konrad Rieck 

    - linux fixes

2011-01-11  Konrad Rieck 

    - fixed bug in command line arguments

2011-01-06  Konrad Rieck 

    - fixed minor bugs in config parsing

    - additional check of config

    - filter for auto-generated header

    - added support for global configuration and command line options

2011-01-04  Konrad Rieck 

    - updated help screen. longopts code will follow tomorrow.

    - changed names of format parameters

    - removed minor memory leak

    - fixed regex matching

    - fixed regex code

    - added support for labels in text lines

2011-01-02  Konrad Rieck 

    - started to update the man page

    - extended example configuration

2010-12-21  Konrad Rieck 

    - important fix: added newline to eof

    - fixed wrong skip lines

2010-12-17  Konrad Rieck 

    - sanitized contributions directory

    - new contrib

2010-12-10  Konrad Rieck 

    - removed space from src field of line module

2010-10-08  Konrad Rieck 

    - missing latex templates

    - added support to generate PDF manual

2010-10-07  Konrad Rieck 

    - new release 0.5.2

2010-10-05  Konrad Rieck 

    - added contrib directory and simple matlab function

2010-10-01  Konrad Rieck 

    - updated manpage

    - removed trim functionality as it badly interferes with normalization and
    tfidf

    - support to trim min. and max. of dimensions

    - new minor update

    - support for exporting features to matlab

    - added label to matlab struct

    - feature vectors are now converted to matlab structs

    - improved documentation of output and input modules

    - started rewrite of matlab module

2010-09-30  Konrad Rieck 

    - updated documentation

    - replaced strlcat with strncat for compatibility. got a bad feeling now

2010-09-29  Konrad Rieck 

    - update to README

    - spell fixes in manual

    - release version 0.5.0

2010-09-28  Konrad Rieck 

    - another minor fix to manual

    - further extensions to manual

    - minor fix to manual

    - new TODOs

2010-09-23  Konrad Rieck 

    - added support to extract labels from FASTA descriptions

    - fixes for new module

    - new input module for fasta files

2010-09-18  Konrad Rieck 

    - support for gzip-compressed files

2010-09-17  Konrad Rieck 

    - added test case for TFIDF

    - first version of TF-IDF weighting

    - changed layout of project

    - updated manual

    - added code from Cujo project

2010-09-14  Konrad Rieck 

    - fixed bug

2010-09-07  Konrad Rieck 

    - extended documentation

    - new version with matlab support

    - spell fixes in man page

    - updated documentation

    - sync

    - first version of matlab export

    - template for matlab support

    - commented clone function

2010-09-04  Konrad Rieck 

    - added some comments and code beautification

    - tested netbeans for development

2010-09-03  Konrad Rieck 

    - sync

    - added docs

    - no color

    - color?

    - sync

2010-09-02  Konrad Rieck 

    - fixed new libconfig functions and added getline clone

    - sync

2010-08-27  Konrad Rieck 

    - added patch for filesystems that do not return d_type

    - enabled GNU libc functions

    - fixes to disable all openmp directives

2010-08-27  Konrad Rieck 

    - next alpha release

    - removed openmp in main loop

    - minor fixes and openmp testing

    - fixed documentation, again

    - fixed line module and some memory leaks

    - templates for input module

    - improved documentation

2010-08-26  Konrad Rieck 

    - change naming to chunks

    - fixed bug in archive routines

    - added support for blocksize of 1

    - fixed problem with broken gzprintf

2010-08-26  Konrad Rieck 

    - forgot test files

    - updated documentation

    - added support for exporting list of features

2010-08-20  Konrad Rieck 

    - added info messages

    - fixed nasty bug in exportmake

    - fixed nasty bug in export

2010-08-17  Konrad Rieck 

    - removed openmp temporarily

    - started to build first working version

    - sanitized input and added output module

2010-08-16  Konrad Rieck 

    - removed libtool dependency

    - simplified design of tool

    - changed config names

    - first version with global config_t configuration

    - removed sally_t configuration

    - configuration stuff

    - added checks for libconfig, libm and libz

    - minor fixes

    - input format

    - added short tutorial on how to develop input modules

    - improved interfaces

2010-08-15  Konrad Rieck 

    - added TODO file with new formats

    - improved interface

2010-08-15  Konrad Rieck 

    - added interface for input modules

2010-08-14  Konrad Rieck 

    - changed label to float type

2010-08-13  Konrad Rieck 

    - support for reading archives on-line

    - fixed concurrency problem

    - split input code into modules

    - removed feature vector arrays. this is the wrong concept for online
    processing

    - added support for libarchive (untested)

    - incorporated feature vector array

    - forgot pod file

    - changed description again :)

    - changed scope and description of project

    - added import templates

    - added import templates

    - added support for src field in feature vector

    - added first export routines

    - made delimiters field opaque to user

    - fixed hash table test cases

    - code cleanup after making feature hash a field of Sally struct

2010-08-11  Konrad Rieck 

    - improved hash interface and configuration of sally

    - added test cases for fhash and fvec

    - added load/save functions

    - rearranged files

    - arrays of feature vectors

    - added parsing of options

    - added feature arrays

    - added test files

    - fixed nasty bug: wrong definition of FALSE

    - added test case and fixed fmap

2010-08-07  Konrad Rieck 

    - migrated code from project Cujo

    - sync

    - utility and hash functions added

    - added templates for library and tool

    - smallest possible change to a repository

    - updated ignored files

    - git ignore file added

    - added autotools setup

    - added GPL license file