EDBT/ICDT 2024, 27th International Conference on Extending Database Technology, 25-28 March 2024, Paestum, Italy
Performing data analysis over incomplete data produces biased results and sub-par performance. Imputation over relational datasets that contain both categorical and continuous variables is challenging. The challenges are accentuated when the missingness proportion of the dataset is high, that is, when a large fraction of the relation contains missing values, or when missing values occur in multiple attributes of a single tuple. In this paper, we propose GRIMP, a novel approach for imputation that tackles these challenges. GRIMP achieves high imputation accuracy through a combination of three novel ideas. First, it represents relational data as a heterogeneous graph, encoding sophisticated relationships between tuples, attributes and cell values. Second, it uses graph representation learning based on message passing to combine and aggregate the representations from appropriate neighborhoods. This allows GRIMP to leverage information from other cell values of the same tuple and from those of similar tuples for imputation. Finally, it uses a self-supervised multi-task learning paradigm for training imputation models. In other words, GRIMP does not need any explicit training data, as it uses the existing relational data, even when it has missing values. GRIMP trains an imputation model for each attribute using a two-stage approach consisting of a task-agnostic section, where the parameters are shared across all attributes, and an attribute-specific imputation model. Experiments over ten datasets and seven baselines show that GRIMP performs accurate imputation and provides new insights about the limitations of data imputation systems.
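To make the first and third ideas concrete, the following is a minimal sketch, not GRIMP's actual implementation. It assumes one possible graph encoding in which each tuple and each distinct (attribute, value) pair is a node, each attribute is its own edge type, and missing cells contribute no edge. It also shows how self-supervised training samples could be derived by masking observed cells. The function names `build_hetero_graph`, `similar_tuples` and `self_supervised_samples`, the use of `None` for missing cells, and the toy table are all illustrative assumptions.

```python
from collections import defaultdict

MISSING = None  # assumption: missing cells are represented as None


def build_hetero_graph(rows, attributes):
    """Return typed edges linking tuple nodes to cell-value nodes.

    edges[attr] is a list of (tuple_node, value_node) pairs, so each
    attribute acts as its own edge type. Missing cells add no edge.
    """
    edges = defaultdict(list)
    for row_id, row in enumerate(rows):
        for attr, value in zip(attributes, row):
            if value is MISSING:
                continue
            edges[attr].append((("tuple", row_id), ("value", attr, value)))
    return dict(edges)


def similar_tuples(edges, row_id):
    """Tuples sharing at least one cell-value node with row_id (two hops away)."""
    by_value = defaultdict(set)
    for pairs in edges.values():
        for (_, rid), value_node in pairs:
            by_value[value_node].add(rid)
    neighbours = set()
    for rids in by_value.values():
        if row_id in rids:
            neighbours |= rids
    neighbours.discard(row_id)
    return neighbours


def self_supervised_samples(rows, attributes):
    """Mask each observed cell in turn; the hidden value becomes the label."""
    samples = []
    for row_id, row in enumerate(rows):
        for col, value in enumerate(row):
            if value is MISSING:
                continue
            masked = list(row)
            masked[col] = MISSING
            samples.append((row_id, attributes[col], tuple(masked), value))
    return samples


# Toy relation with one categorical-only tuple and two missing cells.
attributes = ["city", "country", "population_millions"]
rows = [
    ("Paris", "France", 2.1),
    ("Lyon", "France", None),
    ("Rome", None, 2.8),
]
graph = build_hetero_graph(rows, attributes)
samples = self_supervised_samples(rows, attributes)
```

In this sketch, the two-hop neighbourhood returned by `similar_tuples` is the kind of structure a message-passing model could aggregate over, and each entry of `samples` pairs a masked tuple with the value to be recovered, so no external labels are required.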
© EURECOM. Personal use of this material is permitted. The definitive version of this paper was published in EDBT/ICDT 2024, 27th International Conference on Extending Database Technology, 25-28 March 2024, Paestum, Italy and is available at :