Ecole d'ingénieur et centre de recherche en Sciences du numérique

Scalable graph building from text data

Debatty, Thibault; Michiardi, Pietro; Thonnard, Olivier; Mees, Wim

KDD 2014, 20th ACM SIGKDD Conference on Knowledge Discover and Data Mining, Workshop BIGMINE 2014, 3rd International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, August 24-27, 2014, New York, USA / Also To appear in Journal of Machine Learning Research

In this paper we propose NNCTPH, a new MapReduce algorithm that is able to build an approximate k-NN graph from large text datasets. The algorithm uses a modi ed version of Context Triggered Piecewise Hashing to bin the input data into buckets, and uses an exhaustive search inside the buckets to build the graph. It also uses multiple stages to join the di erent unconnected subgraphs. We experimentally test the algorithm on di erent datasets consisting of the subject of spam emails. Although the algorithm is still at an early development stage, it already proves to be four times faster than a MapReduce implementation of NN-Descent, for the same quality of produced graph.

Document Hal Bibtex

Titre:Scalable graph building from text data
Mots Clés:k-NN graph, graph building, MapReduce, text data, Context-Triggered Piece- wise Hashing
Type:Conférence
Langue:English
Ville:New York
Pays:ÉTATS-UNIS
Date:
Département:Data Science
Eurecom ref:4365
Copyright: © ACM, 2014. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in KDD 2014, 20th ACM SIGKDD Conference on Knowledge Discover and Data Mining, Workshop BIGMINE 2014, 3rd International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, August 24-27, 2014, New York, USA / Also To appear in Journal of Machine Learning Research
Bibtex: @inproceedings{EURECOM+4365, year = {2014}, title = {{S}calable graph building from text data}, author = {{D}ebatty, {T}hibault and {M}ichiardi, {P}ietro and {T}honnard, {O}livier and {M}ees, {W}im}, booktitle = {{KDD} 2014, 20th {ACM} {SIGKDD} {C}onference on {K}nowledge {D}iscover and {D}ata {M}ining, {W}orkshop {BIGMINE} 2014, 3rd {I}nternational {W}orkshop on {B}ig {D}ata, {S}treams and {H}eterogeneous {S}ource {M}ining: {A}lgorithms, {S}ystems, {P}rogramming {M}odels and {A}pplications, {A}ugust 24-27, 2014, {N}ew {Y}ork, {USA} / {A}lso {T}o appear in {J}ournal of {M}achine {L}earning {R}esearch}, address = {{N}ew {Y}ork, {\'{E}}{TATS}-{UNIS}}, month = {08}, url = {http://www.eurecom.fr/publication/4365} }
Voir aussi: