This webpage revolves around machine learning and computer security. It provides a collection of open-source software and datasets that have been developed by the research group of Konrad Rieck. The group is currently working at TU Braunschweig, where it forms the Institute of System Security.
Joern is a platform for robust analysis of C/C++ code. It generates code property graphs, a novel graph representation of code that exposes the code’s syntax, control-flow, data-flow and type information. Code property graphs are stored in a graph database. This allows code to be mined using search queries formulated in the graph traversal language Gremlin. Joern forms the basis for assisted vulnerability discovery using machine learning techniques.
Pulsar is a network fuzzer with automatic protocol learning and simulation capabilities. The tool models a protocol using machine learning techniques such as clustering and hidden Markov models. These models allow Pulsar to generate semantically correct messages and thereby simulate communication with a real client or server. Combined with a series of fuzzing primitives, this makes it possible to test the implementation of an unknown protocol for errors in deeper states of its protocol state machine.
Adagio is a collection of Python modules for analyzing and detecting Android malware. These modules extract labeled call graphs from Android APKs or DEX files and apply an explicit feature map that captures their structural relationships. Additional modules provide classes for designing binary or multiclass classification experiments and for applying machine learning to detect malicious structure.
Letter Salad, or Salad for short, is an efficient and flexible implementation of the anomaly detection method Anagram. The method uses n-grams (substrings of length n) maintained in a Bloom filter for efficiently detecting anomalies in large sets of string data. Salad extends the original method by supporting n-grams of bytes as well as n-grams of words and tokens.
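The idea of storing n-grams in a Bloom filter and scoring new data by its share of unseen n-grams can be sketched as follows. This is a minimal illustration, not Salad's implementation: the class `BloomFilter`, the functions `byte_ngrams`, `train` and `anomaly_score`, the filter size, and the SHA-256 based hashing are all assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter (hypothetical helper, not taken from Salad)."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def byte_ngrams(data, n):
    """Return all substrings of length n of a byte string."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def train(samples, n=3):
    """Insert the n-grams of all normal samples into a Bloom filter."""
    bloom = BloomFilter()
    for sample in samples:
        for gram in byte_ngrams(sample, n):
            bloom.add(gram)
    return bloom


def anomaly_score(bloom, data, n=3):
    """Fraction of n-grams in data that were not seen during training."""
    grams = byte_ngrams(data, n)
    if not grams:
        return 0.0
    unseen = sum(1 for gram in grams if gram not in bloom)
    return unseen / len(grams)
```

A score close to 0 means that nearly all n-grams were seen during training, while a score close to 1 indicates mostly unseen content. Because a Bloom filter can report false positives but never false negatives, the score may slightly underestimate how anomalous an input is.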
Malheur is a tool for the automatic analysis of program behavior recorded from malware. It has been designed to support the regular analysis of malware and the development of detection and defense measures. Malheur allows for identifying novel classes of malware with similar behavior and assigning unknown malware to discovered classes using machine learning.
Harry is a tool for comparing strings and measuring their similarity. The tool supports several common distance and kernel functions for strings as well as some exotic similarity measures. The focus lies on implicit similarity measures, that is, comparison functions that do not give rise to an explicit vector space. Examples of such similarity measures are the Levenshtein and Jaro-Winkler distances.
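As an illustration of an implicit similarity measure, the Levenshtein distance compares two strings directly, without mapping them to vectors first. The following is a textbook dynamic-programming sketch; the function name `levenshtein` is chosen here for illustration and the code is not taken from Harry.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn string a into string b."""
    # Keep the shorter string in the inner loop to save memory.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]
```

For example, `levenshtein("kitten", "sitting")` evaluates to 3: two substitutions and one insertion.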
Sally is a small tool for mapping a set of strings to a set of vectors. This mapping is referred to as embedding and allows for applying techniques of machine learning and data mining for analysis of string data. Sally can be applied to several types of string data, such as text documents, DNA sequences or log files, where it can handle common formats such as directories, archives and text files.
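One simple way to picture such an embedding is to count character n-grams and use the counts as vector components. The sketch below assumes this n-gram counting scheme purely for illustration. The function names `ngrams` and `embed` are hypothetical, and Sally's actual feature maps, options and output formats are not reproduced here.

```python
from collections import Counter


def ngrams(text, n):
    """Return all character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def embed(strings, n=2):
    """Map each string to a vector of n-gram counts.

    Returns the sorted n-gram vocabulary and one count vector per string,
    with one dimension per n-gram in the vocabulary.
    """
    counts = [Counter(ngrams(s, n)) for s in strings]
    vocabulary = sorted(set().union(*counts))
    index = {gram: k for k, gram in enumerate(vocabulary)}
    vectors = []
    for count in counts:
        vector = [0] * len(vocabulary)
        for gram, value in count.items():
            vector[index[gram]] = value
        vectors.append(vector)
    return vocabulary, vectors
```

With the strings `"abab"` and `"abc"` and n = 2, the vocabulary is `["ab", "ba", "bc"]` and the vectors are `[2, 1, 0]` and `[1, 0, 1]`. Once strings are represented this way, standard vector-based learning methods can be applied to them.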
Prisma is an R package for processing and analyzing huge text corpora. In combination with the tool Sally the package provides testing-based token selection and replicate-aware, highly tuned non-negative matrix factorization and principal component analysis. Prisma allows for analyzing very big data sets even on desktop machines.
Adversarial machine learning and digital watermarking share similar attack and defense strategies, an observation that has been largely overlooked by the research community. We have identified first links between both fields and make our datasets publicly available.
The Drebin dataset consists of roughly 5,000 malicious Android applications that have been collected as part of the Mobile Sandbox project between 2010 and 2012. The dataset has been downloaded by over 150 research institutes and universities.
The Malheur dataset contains the recorded behavior of roughly 30,000 malicious programs (malware). It was created in 2009 for developing clustering and classification methods for malware behavior. Due to the rapid evolution of malware, the dataset can be considered obsolete nowadays.