Masked multi-time diffusion for multi-modal generative modeling

Bounoua, Mustapha; Franzese, Giulio; Michiardi, Pietro
NeurIPS 2023, 37th Conference on Neural Information Processing Systems, 11-16 December 2023, New Orleans, USA

Multi-modal data is ubiquitous, and models to learn a joint representation of all
modalities have flourished. However, existing approaches suffer from a coherence-quality
tradeoff, where generation quality comes at the expense of generative
coherence across modalities, and vice versa. To overcome these limitations, we
propose a novel method that uses a set of independently trained, uni-modal, deterministic
autoencoders. Individual latent variables are concatenated and fed to a
masked diffusion model to enable generative modeling. We also introduce a new
multi-time training method to learn the conditional score network for multi-modal
diffusion. Empirically, our methodology substantially outperforms competitors in
both generation quality and coherence.
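
The abstract's two ingredients, concatenating per-modality latents and giving each modality its own diffusion time, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the latent sizes, and the simple perturbation z_t = sqrt(1 - t) * z + sqrt(t) * eps are all assumptions made for illustration, and random arrays stand in for the outputs of the uni-modal autoencoders.

```python
import numpy as np


def concat_latents(latents):
    """Concatenate per-modality latent batches along the feature axis."""
    return np.concatenate(latents, axis=1)


def multi_time_noise(latents, times, rng):
    """Perturb each modality with its own diffusion time t in [0, 1].

    Assumed perturbation (illustrative only):
        z_t = sqrt(1 - t) * z + sqrt(t) * eps
    so t = 0 leaves a modality clean, which lets it act as conditioning.
    """
    noisy, noise = [], []
    for z, t in zip(latents, times):
        eps = rng.standard_normal(z.shape)
        noisy.append(np.sqrt(1.0 - t) * z + np.sqrt(t) * eps)
        noise.append(eps)
    return concat_latents(noisy), concat_latents(noise)


def diffusion_mask(latents, times):
    """0/1 mask over the concatenated latent: 1 where the modality is diffused."""
    return np.concatenate(
        [np.full(z.shape, float(t > 0)) for z, t in zip(latents, times)], axis=1
    )


# Hypothetical latents standing in for two frozen uni-modal encoders.
rng = np.random.default_rng(0)
z_image = rng.standard_normal((4, 8))  # batch of 4, latent size 8 (assumed)
z_text = rng.standard_normal((4, 3))   # batch of 4, latent size 3 (assumed)

latents = [z_image, z_text]
times = [0.5, 0.0]  # diffuse the image latent, keep the text latent clean

noisy, eps = multi_time_noise(latents, times, rng)
mask = diffusion_mask(latents, times)

# A denoising loss would be restricted to the diffused entries via the mask;
# a zero array stands in for the score network's prediction here.
prediction = np.zeros_like(eps)
masked_loss = np.sum(mask * (prediction - eps) ** 2) / np.sum(mask)
```

In this sketch, setting a modality's time to zero keeps its latent untouched, so the same concatenated input can express both joint generation (all times positive) and conditional generation (some times zero).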

Type: Conference
City: New Orleans
Date: 2023-12-11
Department: Data Science
Eurecom Ref: 7540
Copyright:
© NIST. Personal use of this material is permitted. The definitive version of this paper was published in NeurIPS 2023, 37th Conference on Neural Information Processing Systems, 11-16 December 2023, New Orleans, USA and is available at :

PERMALINK : https://www.eurecom.fr/publication/7540