Friday, November 14, 2008

Generalized Suffix Trees and Distinct Substrings.

This is a technical post on strings, substrings and suffix trees. This could get boring. You have been warned.

Everybody knows the concept of longest common substring, where a set of strings is given and the task is to determine the longest substring shared by the set. While intuitive and straightforward to implement using a generalized suffix tree, this setting is not robust against "noise", such as corrupted or truncated strings. Moreover, finding the longest common substring is useless, if the considered strings derive from different data distributions, thus not necessary sharing any substring.

These shortcomings can be addressed by determining a set of shared substrings, which are shared by at least m strings but not necessary all strings in the set. To restrict the amount of small shared substrings, one requires the substrings to have a minimum length. This setting has been applied in the PolyGraph and Hamsa paper for automatic signature generation from network data. Again, an implementation is easily realized using a depth-first traversal of a generalized suffix tree. Unfortunately, many of the shared substrings will be prefixes and suffixes of each other.

This issue is resolved by the set of distinct shared substrings, where a distinct substring x is not contained in any other distinct substring y, unless x appears in at least m strings independently of y. Implementing this setting using generalized suffix trees is a little bit more involved. Together with a colleague, I came up with this approach:

We start to determine interesting substrings by a depth-first traversal of a generalized suffix tree.

This time, however, we are looking for distinct substrings and hence can not afford to output a substring that later turns out to be non-distinct. We thus first visit nodes which give rise to longer substrings, that is for each node we order the child nodes according to the length of possible substrings. Consequently, we find longer substrings first.

If we have determined a first substring x shared by m strings, we can be sure that it is distinct as no longer substrings exist. Next, we need to mark all non-distinct substrings of x. We do this by traversing the suffix links and parent nodes of x, thereby processing all suffixes and prefixes of x. For each node we maintain a counter indicating the number of strings the corresponding substring is shared by. When processing the prefixes and suffixes of x, we decrement this counter by the number of occurences of x.

Finally, we continue the depth-first traversal. We do not descend to nodes with a counter smaller than m, as no distinct shared substring can be determined on the corresponding path.

To be honest, I have not thoroughly thought about the complexity of this algorithm. Yet, I expect it to be "somewhat linear" in the length of the considered strings. The output of the algorithm resembles the concept of distinct substrings proposed in PolyGraph, where the authors apply a costly post-processing of shared substrings. Note that this approach can also be implemented using suffix and LCP arrays with the traversal techniques studied by Kasai et al.

Did I mention that I like string algorithms?

Thursday, October 16, 2008

Automatic Feature Selection for Anomaly Detection.

Network intrusion detection requires discriminative features from network traffic, which for the case of misuse detection allow for the precise formulation of signatures and rules, and which, on the other end, enable application of learning methods for anomaly detection. However, devising a set of features is not trivial, as attacks are reflected in all sorts of different patterns and numerical measures. A good example is the work of Kruegel at al. (see 1, 2, 3), in which various heterogeneous features are proposed and manually combined into an effective anomaly detection system.

Recently, colleges at our lab came up with a method for automatic selection and weighting of such features for intrusion detection. Instead of defining a weighting of different features manually, the method automatically determines the mixture which optimizes the performance of anomaly detection. This optimization is realized by incorporating the process of feature selection directly into an anomaly detection method—a rather involved mathematical procedure. The paper will be presented at the AISec Workshop co-located with CCS. Following is its abstract:

A frequent problem in anomaly detection is to decide among different feature sets to be used. For example, various features are known in network intrusion detection based on packet headers, content byte streams or application level protocol parsing. A method for automatic feature selection in anomaly detection is proposed which determines optimal mixture coefficients for various sets of features. The method generalizes the support vector data description (SVDD) and can be expressed as a semi-infinite linear program that can be solved with standard techniques. The case of a single feature set can be handled as a particular case of the proposed method. The experimental evaluation of the new method on unsanitized HTTP data demonstrates that detectors using automatically selected features attain competitive performance, while sparing practitioners from a priori decisions on feature sets to be used.
Automatic Feature Selection for Anomaly Detection. M. Kloft, U. Brefeld, P. Düssel, C. Gehl, and P. Laskov. Proceedings of the First ACM Workshop on AISec, October 2008.

Sunday, September 28, 2008

Approximate Kernels for Trees.

The majority of machine learning and artificial intelligence methods are designed to work on vectorial data. In practice, however, data does not always come in the form of simple vectors. For instance, analysis of HTML documents requires processing and matching tree structures. The state-of-the-art techniques for learning with trees are convolutional kernel functions. These functions are effective but painfully slow on large trees. To speed-up such kernels for security applications, we came up with approximate tree kernels. Here is the abstract from the corresponding technical report:

Convolution kernels for trees provide effective means for learning with tree-structured data, such as parse trees of natural language sentences. Unfortunately, the computation time of tree kernels is quadratic in the size of the trees as all pairs of nodes need to be compared: large trees render convolution kernels inapplicable. In this paper, we propose a simple but efficient approximation technique for tree kernels. The approximate tree kernel (ATK) accelerates computation by selecting a sparse and discriminative subset of subtrees using a linear program. The kernel allows for incorporating domain knowledge and controlling the overall computation time through additional constraints. Experiments on applications of natural language processing and web spam detection demonstrate the efficiency of the approximate kernels. We observe run-time improvements of two orders of magnitude while preserving the discriminative expressiveness and classification rates of regular convolution kernels.
Approximate Kernels for Trees. Konrad Rieck, Ulf Brefeld and Tammo Krueger. Technical Report FIRST 5/2008, Fraunhofer Institute FIRST, September 2008.

Saturday, September 20, 2008

Low throughput? The imbalance of network traffic.

I have been recently running experiments with a "fast" learning method for anomaly detection in HTTP and FTP payloads. While the method performed well in terms of attack detection, the realized throughput was frustrating. Incoming traffic was processed at 15 mbit/s (HTTP) and 30 mbit/s (FTP) on a single CPU, including traffic normalization, TCP reassembly and anomaly detection.

How useful is such a low throughput?

Well, better than one might think. Depending on the considered application-layer protocol, there is often an imbalance of ingress and egress traffic. For example, the HTTP traffic at our institute has an ingress-egress ratio of 1/50, while the FTP traffic recorded at LBNL (port 21 only) reaches a ratio of 1/6. Thus, when looking at the total throughput the considered learning method yields around 770 mbit/s (HTTP) and 200 mbit/s (FTP) throughput. That's not so frustrating, given that the performance could be easily increased using multi-core systems.

For other protocols, however, the ingress-egress is not a small fraction and one can not indirectly benefit from the imbalance of traffic. In conclusion, when analyzing the run-time performance of an intrusion detection method it really matters which protocol and direction is monitored for evil things.

Saturday, August 30, 2008

Portscanning the Internet.

I just returned from a long holiday trip. While browsing through the list of blog postings I missed during this period, I noticed an interesting article on large-scale port scanning:

Portscanning the Internet posted at the THC blog.

The author reports on his experience with port scanning methods, which are capable to scan the entire Internet in a reasonable time frame. Besides several remarks on how to conduct such scans and how to threshold a scanning system, there is also a funny side note on a false worm outbreak caused by scanning activities.

Friday, July 11, 2008

Slides for Talk on Learning Malware Behavior.

I just noticed that heise security posted a short note on our recent paper presented at DIMVA 2008. Cool. As an addition to the technical paper, I am thus also providing the slides for my talk. Have fun.

Learning and Classification of Malware Behavior.

Yesterday, I have been presenting some of our joint work with the University of Mannheim on malware analysis at this year's DIMVA. The corresponding paper is available online:

Learning and Classification of Malware Behavior. To appear in Proceedings of Fifth Conference on Detection of Intrusions and Malware & Vulnerability Assessment (DIMVA), 2008.

Thorsten Holz wrote a nice summary of our key results in his blog yesterday. In essence, we devise a method for automatic classification of malware families based on their behavior. The method proceeds by learning behavioral patterns from malware monitored in a sandbox. Starting from a set of malware binaries labeled by an anti-virus scanner, the method generalizes observed behavior so that variants undetected by anti-virus products can be identified. Moreover, the method is transparent to the security practitioner as the learning model can be examined and decisions made by the system can be traced back to behavioral patterns discriminative for each malware family.