"Implementing a Distributed Clustering on Apache Hadoop"

Thibaut DEBATTY - Royal Military Academy
Digital Security

Date: -
Location: Eurecom

Abstract: In cyber security, one of the required steps for attack attribution is clustering the collected data (spam, attack traces, security events, etc.). This is a very computation-intensive process due to the huge quantities of data to process. Moreover, to allow iterative analysis, the results should be obtained as fast as possible. For this master's thesis project, we demonstrated that this clustering can be performed in parallel to obtain results faster. To that end, we implemented a system that clusters spam in parallel using Apache Hadoop. The goal of the project was twofold:
1. Cluster the dataset to deliver meaningful and useful results;
2. Perform this clustering as fast as possible, in a scalable way.

During this talk, we will present:
- The MapReduce framework and Apache Hadoop;
- The parallel K-means clustering algorithm;
- Our implementation of spam clustering;
- Experimental results obtained on different platforms and a speedup analysis;
- Some cluster visualization techniques that we used.
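
To give an idea of how K-means fits the MapReduce model mentioned in the abstract, here is a minimal single-process sketch of one iteration in Python. It is not the implementation presented in the talk and does not use Hadoop: the function names (`map_phase`, `reduce_phase`, `kmeans_iteration`) are hypothetical, and it assumes each item has already been turned into a numeric feature vector, which the abstract does not describe. The map step assigns each point to its nearest centroid; the reduce step averages the points of each cluster to produce the new centroids.

```python
from collections import defaultdict
from math import dist


def nearest(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))


def map_phase(points, centroids):
    """Map step: emit (cluster id, (point, 1)) for every input point."""
    for point in points:
        yield nearest(point, centroids), (point, 1)


def reduce_phase(pairs, dim):
    """Reduce step: sum the points of each cluster and divide by their count."""
    sums = defaultdict(lambda: [0.0] * dim)
    counts = defaultdict(int)
    for cluster_id, (point, n) in pairs:
        for d in range(dim):
            sums[cluster_id][d] += point[d]
        counts[cluster_id] += n
    return {
        cluster_id: tuple(s / counts[cluster_id] for s in total)
        for cluster_id, total in sums.items()
    }


def kmeans_iteration(points, centroids):
    """Run one map + reduce round; clusters that received no point keep their centroid."""
    updated = reduce_phase(map_phase(points, centroids), len(centroids[0]))
    return [updated.get(i, c) for i, c in enumerate(centroids)]


if __name__ == "__main__":
    # Illustrative toy data, not taken from the talk.
    points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
    centroids = [(0.0, 0.0), (10.0, 10.0)]
    print(kmeans_iteration(points, centroids))
```

In a real MapReduce job the map step would run on many nodes over separate splits of the data, and the iteration would be repeated until the centroids stop moving; this sketch only shows the data flow of a single round.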