Graduate School and Research Center in Digital Sciences

Seminar: "Implementing a Distributed Clustering on Apache Hadoop"

Thibaut DEBATTY - Royal Military Academy

Digital Security

Date: November 28, 2012

Location: Eurecom - Salle RS

Abstract: In cyber security, one of the required steps for attack attribution consists of clustering the collected data (spam, attack traces, security events, etc.). This is a very computation-intensive process due to the huge quantities of data to process. Moreover, to allow iterative analysis, the results should be obtained as fast as possible. For this master's thesis project, we demonstrated that this clustering can be performed in parallel to obtain results faster. To that end, we implemented a system that clusters spam in parallel using Apache Hadoop. The goal of the project was twofold:

1. Cluster the dataset to deliver meaningful and useful results;
2. Perform this clustering as fast as possible, in a scalable way.

During this talk, we will present:

- The MapReduce framework and Apache Hadoop;
- The parallel K-means clustering algorithm;
- Our implementation of spam clustering;
- Experimental results obtained on different platforms and a speedup analysis;
- Some cluster visualization techniques that we used.
