Machine learning has profoundly reshaped business and daily life over the past decade, driven largely by the vast datasets that power foundation models such as large language models. Data, however, is inherently multimodal, mirroring how humans perceive the world through their five senses. This pervasiveness of multimodality has driven breakthroughs in applications ranging from text-to-image synthesis and video generation to vision-language models, and has influenced fields such as automotive technology, neuroscience, and biology. Nevertheless, multimodality poses several challenges, including the heterogeneity of data types, complex inter-modal interactions, and observability issues that may leave data unpaired, constraining its effective use.

To address these challenges, this thesis first proposes a novel generative modeling approach for approximating complex multimodal data distributions. Leveraging score-based diffusion models, the proposed method captures the rich structure of multimodal data. Once the multimodal data is accurately modeled, the focus shifts to interpreting inter-modal interactions: the thesis develops estimators for information-theoretic measures, including mutual information in dual-modal settings, as well as total correlation, dual total correlation, and O-information for an arbitrary number of modalities. Finally, recognizing that paired multimodal data is not always available, the thesis introduces an original method for constructing couplings in dual-modal settings without pre-existing paired samples. This method combines diffusion models with reinforcement learning to achieve a minimum entropy coupling, thereby maximizing the mutual information between modalities.
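For reference, the information-theoretic quantities named above have standard definitions; a brief sketch, writing H for Shannon entropy over n modalities X_1, ..., X_n:

```latex
% Mutual information between two modalities X and Y
I(X;Y) = H(X) + H(Y) - H(X,Y)

% Total correlation (TC) and dual total correlation (DTC),
% where X_{\setminus i} denotes all modalities except X_i
TC(X_1,\dots,X_n)  = \sum_{i=1}^{n} H(X_i) - H(X_1,\dots,X_n)
DTC(X_1,\dots,X_n) = H(X_1,\dots,X_n) - \sum_{i=1}^{n} H(X_i \mid X_{\setminus i})

% O-information: positive when redundancy dominates, negative for synergy
\Omega(X_1,\dots,X_n) = TC(X_1,\dots,X_n) - DTC(X_1,\dots,X_n)
```

Note that the minimum entropy coupling objective is consistent with maximizing mutual information: since the marginal entropies H(X) and H(Y) are fixed by the data, minimizing the joint entropy H(X,Y) maximizes I(X;Y).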
The proposed contributions achieved state-of-the-art performance on several benchmarks and unlocked new capabilities in multimodal learning, with tangible impact in automotive applications where enhanced sensor integration improved system robustness. These findings underscore the practical importance of the proposed methods in real-world contexts, paving the way for further advancements in multimodal machine learning.
Harnessing multimodality: Diffusion-based generative modeling and information estimation
Type: Thesis
Date: 2025-07-11
Department: Data Science
Eurecom Ref: 8222
Copyright: © EURECOM. Personal use of this material is permitted. The definitive version of this paper was published in Thesis and is available at:
See also: PERMALINK: https://www.eurecom.fr/publication/8222