Comparing high-dimension data: Interpretable two-sample testing by variable selection

Mitsuzawa, Kensuke
Thesis

High-dimensional data is everywhere, and its volume and quality keep increasing, yet analysing it remains time-consuming. Two-sample testing is a common method for comparing two datasets, but it often does not provide enough information for humans to interpret the results. In this thesis, we investigate variable selection for comparing a pair of high-dimensional datasets, enabling humans to gain insights without time-consuming analysis work. The variable selection is performed during two-sample testing and identifies the variables (or dimensions) responsible for the discrepancies between the two distributions. We focus on Maximum Mean Discrepancy (MMD), a distance metric between probability distributions, and on an optimisation problem for its estimator. This problem optimises the Automatic Relevance Determination (ARD) weights in a kernel function. The weights are defined for individual variables and are optimised to maximise the power of the MMD-based test. We extend this optimisation problem to the variable selection task by adding sparse regularisation. Since the regularisation term introduces a parameter that must be chosen, we develop algorithms to find appropriate regularisation parameters. Furthermore, we address a variable selection problem with a set of high-dimensional time-series data. Our main aim is to identify important variables that reflect the differences between two probability distributions. To accomplish this, we have devised an algorithm for selecting variables from pairs of time-series data. The algorithm divides the set of time steps into a fixed number of buckets and applies a variable selection method to each bucket. We empirically show that MMD-based variable selection methods are a suitable approach for this task.
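To make the optimisation problem above concrete, the following is a minimal sketch of an ARD-weighted MMD estimate with a sparsity penalty. It is an illustration under assumptions, not the thesis implementation: the function names, the Gaussian form of the kernel, and the toy data are hypothetical, and the plain unbiased MMD² estimate stands in for the test-power criterion that the thesis optimises.

```python
import math

def ard_kernel(x, y, weights):
    """Gaussian kernel with one ARD weight per variable (assumed form)."""
    sq_dist = sum((w * (a - b)) ** 2 for w, a, b in zip(weights, x, y))
    return math.exp(-sq_dist)

def mmd2_unbiased(xs, ys, weights):
    """Unbiased estimate of the squared MMD between samples xs and ys."""
    n, m = len(xs), len(ys)
    k_xx = sum(ard_kernel(xs[i], xs[j], weights)
               for i in range(n) for j in range(n) if i != j)
    k_yy = sum(ard_kernel(ys[i], ys[j], weights)
               for i in range(m) for j in range(m) if i != j)
    k_xy = sum(ard_kernel(x, y, weights) for x in xs for y in ys)
    return k_xx / (n * (n - 1)) + k_yy / (m * (m - 1)) - 2.0 * k_xy / (n * m)

def regularised_objective(xs, ys, weights, lam):
    """Quantity to minimise: negative MMD^2 plus an L1 penalty on the weights.

    Assumption: the thesis maximises test power; plain MMD^2 is a stand-in here.
    """
    return -mmd2_unbiased(xs, ys, weights) + lam * sum(abs(w) for w in weights)

# Toy data (hypothetical): only variable 0 differs between the two samples.
xs = [(0.0, 0.1), (0.2, -0.1), (-0.1, 0.0), (0.1, 0.2)]
ys = [(2.0, 0.0), (2.2, 0.1), (1.9, -0.1), (2.1, 0.2)]
```

With this toy data, putting all weight on variable 0 gives a larger MMD² estimate than putting it on variable 1, and the L1 penalty favours the weight vector that leaves the uninformative variable at zero. Variables whose optimised weights stay away from zero would be the selected ones.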
Finally, we demonstrate that a model parameter calibration task, which involves estimating a suitable parameter for a black-box model (e.g., an intractable simulation model), can be conducted in a human-in-the-loop style by introducing the MMD-based variable selection method. The calibration method employs KernelABC, whose distance metric is an optimised MMD estimator. Overall, this work advances variable selection methods, improving the interpretability and efficiency of high-dimensional data analysis in various applications.
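The bucketing step for pairs of time-series data described in the abstract can be sketched as follows. This is a hedged illustration only: the function names are hypothetical, pooling all observations inside a bucket into one sample is an assumption, and the simple mean-shift selector is a placeholder for the MMD-based variable selection method.

```python
def split_into_buckets(n_steps, n_buckets):
    """Return contiguous (start, stop) ranges that cover range(n_steps)."""
    base, extra = divmod(n_steps, n_buckets)
    bounds, start = [], 0
    for b in range(n_buckets):
        # The first `extra` buckets receive one additional time step.
        stop = start + base + (1 if b < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds

def select_per_bucket(series_x, series_y, n_buckets, select_variables):
    """Apply a variable selection method to each time bucket.

    series_x, series_y: lists of trajectories; trajectory[t] is a tuple of
    variable values at time step t.
    select_variables: callable (xs, ys) -> list of selected variable indices.
    """
    n_steps = len(series_x[0])
    selected = {}
    for start, stop in split_into_buckets(n_steps, n_buckets):
        # Assumption: observations within a bucket are pooled into one sample.
        xs = [traj[t] for traj in series_x for t in range(start, stop)]
        ys = [traj[t] for traj in series_y for t in range(start, stop)]
        selected[(start, stop)] = select_variables(xs, ys)
    return selected

def mean_shift_selector(xs, ys, threshold=0.5):
    """Placeholder selector: keep variables whose sample means differ."""
    def mean(samples, d):
        return sum(s[d] for s in samples) / len(samples)
    return [d for d in range(len(xs[0]))
            if abs(mean(xs, d) - mean(ys, d)) > threshold]

# Toy data (hypothetical): variable 0 differs only in the last two time steps.
series_x = [[(0.0, 0.0)] * 4, [(0.1, 0.1)] * 4]
series_y = [
    [(0.0, 0.0), (0.0, 0.0), (1.0, 0.0), (1.0, 0.0)],
    [(0.1, 0.1), (0.1, 0.1), (1.1, 0.1), (1.1, 0.1)],
]
result = select_per_bucket(series_x, series_y, 2, mean_shift_selector)
```

Here `result` maps each bucket to the variables flagged inside it, so a discrepancy that appears only late in the series is reported for the later bucket alone.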


Type: Thesis
Date: 2024-12-02
Department: Data Science
Eurecom Ref: 7711
Copyright:
© EURECOM. Personal use of this material is permitted. The definitive version of this paper was published in Thesis and is available at :

PERMALINK : https://www.eurecom.fr/publication/7711