Scalable graph building from text data

Debatty, Thibault; Michiardi, Pietro; Thonnard, Olivier; Mees, Wim
KDD 2014, 20th ACM SIGKDD Conference on Knowledge Discover and Data Mining, Workshop BIGMINE 2014, 3rd International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, August 24-27, 2014, New York, USA / Also To appear in Journal of Machine Learning Research

In this paper we propose NNCTPH, a new MapReduce algorithm that is able to build an
approximate k-NN graph from large text datasets. The algorithm uses a modi ed version
of Context Triggered Piecewise Hashing to bin the input data into buckets, and uses an
exhaustive search inside the buckets to build the graph. It also uses multiple stages to join
the di erent unconnected subgraphs. We experimentally test the algorithm on di erent
datasets consisting of the subject of spam emails. Although the algorithm is still at an
early development stage, it already proves to be four times faster than a MapReduce
implementation of NN-Descent, for the same quality of produced graph.

HAL
Type:
Conférence
City:
New York
Date:
2014-08-24
Department:
Data Science
Eurecom Ref:
4365
Copyright:
© ACM, 2014. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in KDD 2014, 20th ACM SIGKDD Conference on Knowledge Discover and Data Mining, Workshop BIGMINE 2014, 3rd International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, August 24-27, 2014, New York, USA / Also To appear in Journal of Machine Learning Research

PERMALINK : https://www.eurecom.fr/publication/4365