One of the main prerequisites for developing reliable defenses to protect a network resource is the collection of quantitative data on Internet threats. This attempt to "know your enemy" has led to increasing interest in the collection and exploitation of datasets providing intelligence on network attacks. Creating these datasets is very challenging, because it requires coping with the spatial and quantitative diversity of malicious activities. Observations must be made from a broad perspective, since the activities are not uniformly distributed over the IP space. At the same time, the data collectors need to be sophisticated enough to extract a sufficient amount of information on each activity and to support meaningful inferences. How can the need to deploy a vast number of data collectors be reconciled with the sophistication required to make meaningful observations? This challenge is the foundation of this work.
In this work we propose the use of protocol learning techniques for the automated generation of protocol interaction models. Such techniques aim to automate what is normally considered a tedious and time-intensive manual task, especially when dealing with complex or closed-specification protocols. Starting from the hypothesis that most of the network interaction handled by a honeypot is generated by deterministic attack tools, we propose a technique called ScriptGen. ScriptGen infers information on protocol semantics from a set of samples and builds a representation of the interaction in the form of a Finite State Machine. The approach is protocol agnostic: no assumption is made on the structure of the protocol. The protocol structure is partially reconstructed by applying bioinformatics techniques and by drawing a set of inferences from the statistical variability of the input samples.
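To make the idea of reconstructing protocol structure from samples more concrete, the following is a minimal sketch, assuming that the bioinformatics techniques in question are sequence alignment algorithms. It aligns two sample messages with a textbook Needleman-Wunsch global alignment and then splits the result into "fixed" regions (bytes identical in both samples) and "mutating" regions (bytes that differ or are missing). The function names, the scoring values and the example messages are hypothetical; the actual ScriptGen analysis operates on many samples and is not reproduced here.

```python
from typing import List, Optional, Tuple

Column = Tuple[Optional[int], Optional[int]]  # one aligned byte pair, None = gap


def align(a: bytes, b: bytes, match: int = 1, mismatch: int = -1,
          gap: int = -1) -> List[Column]:
    """Global pairwise alignment (Needleman-Wunsch) of two byte strings."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    columns: List[Column] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                columns.append((a[i - 1], b[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            columns.append((a[i - 1], None))
            i -= 1
        else:
            columns.append((None, b[j - 1]))
            j -= 1
    columns.reverse()
    return columns


def regions(columns: List[Column]) -> List[Tuple[str, bytes]]:
    """Merge aligned columns into alternating fixed and mutating regions.

    Fixed regions carry their common bytes; mutating regions carry b"".
    """
    out: List[Tuple[str, bytes]] = []
    for x, y in columns:
        fixed = x is not None and x == y
        kind = "fixed" if fixed else "mutating"
        data = bytes([x]) if fixed else b""
        if out and out[-1][0] == kind:
            out[-1] = (kind, out[-1][1] + data)
        else:
            out.append((kind, data))
    return out


# Hypothetical samples of the same request carrying different user names.
sample_regions = regions(align(b"USER alice\r\n", b"USER bob\r\n"))
```

In this toy case the two samples share a leading keyword and a trailing line terminator, while the user name in between is classified as mutating. A protocol-agnostic learner can treat such fixed regions as candidate protocol semantics without knowing the message grammar in advance.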
ScriptGen allows the automated construction of protocol emulators, even for binary protocols whose manual analysis would be tedious. ScriptGen-based honeypots provide a high level of interaction with attackers at very low cost for all the activities falling within the current protocol knowledge. We show how we detect deviations from this knowledge and how we use a proxying algorithm to react to them dynamically by producing refinements of the Finite State Machine.
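The interplay between emulation and proxying can be illustrated with a small sketch. It assumes a deliberately simplified Finite State Machine in which each transition is labeled with a request prefix and a canned reply, and in which an unknown request is forwarded to a `proxy` callable standing in for a real host. The names `State`, `handle` and `fake_real_host` are hypothetical, and the refinement step shown here (storing the request verbatim) is only a placeholder for the learning step described above.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class State:
    # Each transition: (request prefix to match, reply to send, next state).
    transitions: List[Tuple[bytes, bytes, "State"]] = field(default_factory=list)


def handle(state: State, request: bytes,
           proxy: Callable[[bytes], bytes]) -> Tuple[bytes, State, bool]:
    """Answer one request; return (reply, next state, whether the FSM was refined)."""
    for prefix, reply, nxt in state.transitions:
        if request.startswith(prefix):
            # Known activity: answer cheaply from the model.
            return reply, nxt, False
    # Deviation from current knowledge: ask the real host, then refine the FSM.
    reply = proxy(request)
    nxt = State()
    state.transitions.append((request, reply, nxt))
    return reply, nxt, True


# Hypothetical stand-in for a real host reachable through the proxy.
proxied_requests: List[bytes] = []


def fake_real_host(request: bytes) -> bytes:
    proxied_requests.append(request)
    return b"200 OK\r\n"


root = State()
first = handle(root, b"HELLO\r\n", fake_real_host)   # unknown: proxied and learned
second = handle(root, b"HELLO\r\n", fake_real_host)  # known: served by the FSM
```

The design choice this sketch highlights is that the costly resource (the real host) is consulted only for activities outside the current model, so repeated instances of the same activity are handled by the emulator alone.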
We exploit the characteristics of the ScriptGen approach to design and implement a distributed honeypot deployment, SGNET. By coupling ScriptGen learning with memory tainting techniques and a simple shellcode emulator, we show how SGNET can emulate code injection attacks and download malware samples. We integrate the data collected by honeypot sensors deployed in 23 different testing sites to build a centralized dataset. The dataset is enriched with the output of several analysis tools, each providing a different perspective on the collected data.
The value of the resulting dataset is twofold. Firstly, it provides information on the behavior of the ScriptGen learning technique when dealing with Internet attacks. We are able to validate the approach, showing its capability to correctly carry on the network interaction and, at the same time, achieve very high scalability. Secondly, it is a rich and valuable source of intelligence on the structure of code injection attacks and on malware propagation techniques. We propose a simple clustering algorithm to explore the complexity of the interrelationships among the different stages of a code injection attack, and we evaluate its potential in studying the propagation strategy of modern malware.