A k-nn graph is a special kind of graph where each node (also called vertex) has an edge (a link) to the k most similar other nodes in the graph according to some appropriate measure of similarity. A k-nn graph can actually be used as an index, with multiple advantages: 1) it requires little memory, 2) it can be used with any measure of similarity, even non-metric and 3) although building a k-nn graph is usually a CPU intensive and time consuming operation, processing a k-nn graph is very fast.
In this thesis we study the usage of k-nn graphs to analyze large datasets. We make the assumption that the dataset is so large that it cannot be processed using a single computer. Hence we propose algorithms designs relying on the bulk synchronous parallel (BSP) model, then we implement them and evaluate their performance using either the Apache Hadoop MapReduce framework or the Apache Spark framework.
We study two different aspects of k-nn graphs: 1) how to build and store them and 2) how to use them to analyze data. In the first part of this thesis we propose an efficient way to build a k-nn graph from large text datasets, we show how to update a k-nn graph when data has to be inserted in or removed from the graph, and how to partition the nodes of a large k-nn graph between multiple compute nodes in a way that optimizes the analysis of the graph. In the second part we show that k-nn graphs can successfully be used to perform clustering of large text datasets and to detect compromised computers in a network.