Ecole d'ingénieur et centre de recherche en Sciences du numérique

Determining the k in k-means with MapReduce

Debatty, Thibault; Michiardi, Pietro; Thonnard, Olivier; Mees, Wim

ICDT 2014, 17th International Conference on Database Theory, in conjunction with EDBT/ICDT 2014, 24-28 March 2014, Athens, Greece

In this paper we propose a MapReduce implementation of G-means, a variant of k-means that is able to automatically determine k, the number of clusters. We show that our implementation scales to very large datasets and very large values of k, as the computation cost is proportional to nk. Other techniques that run a clustering algorithm with different values of k and choose the value of k that provides the " best " results have a computation cost that is proportional to nk 2. We run experiments that confirm that the processing time is proportional to k. These experiments also show that, because G-means adds new centers progressively, if and where they are needed, it reduces the probability to fall into a local minimum, and finally finds better centers than classical k-means processing.

Document Hal Bibtex

Titre:Determining the k in k-means with MapReduce
Type:Conférence
Langue:English
Ville:Athens
Pays:GRÈCE
Date:
Département:Data Science
Eurecom ref:4366
Copyright: © ACM, 2014. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ICDT 2014, 17th International Conference on Database Theory, in conjunction with EDBT/ICDT 2014, 24-28 March 2014, Athens, Greece
Bibtex: @inproceedings{EURECOM+4366, year = {2014}, title = {{D}etermining the k in k-means with {M}ap{R}educe}, author = {{D}ebatty, {T}hibault and {M}ichiardi, {P}ietro and {T}honnard, {O}livier and {M}ees, {W}im}, booktitle = {{ICDT} 2014, 17th {I}nternational {C}onference on {D}atabase {T}heory, in conjunction with {EDBT}/{ICDT} 2014, 24-28 {M}arch 2014, {A}thens, {G}reece}, address = {{A}thens, {GR}{\`{E}}{CE}}, month = {03}, url = {http://www.eurecom.fr/publication/4366} }
Voir aussi: