Tabular Representation Learning (TRL) and Large Language Models (LLMs) have become established for tackling Question Answering (QA) and Semantic Parsing (SP) tasks on tabular data. State-of-the-art models are pre-trained and evaluated on large open-domain datasets. However, performance on existing QA and SP benchmarks is not necessarily representative of performance on proprietary data, as the characteristics of the input and the complexity of the posed queries vary widely. To tackle this challenge, our goal is to let end-users evaluate TRL and LLM performance on their own proprietary data. We present QATCH (Query-Aided TRL Checklist), a toolbox that automatically generates a testing checklist tailored to QA and SP. QATCH provides a testing suite highlighting models’ strengths and weaknesses on relational tables unseen at training time. The proposed toolbox relies on a SQL query generator that crafts tests of varying type and complexity, including, among others, tests on null values, projections, selections, joins, and GROUP BY and HAVING clauses. QATCH also supports a set of general cross-task performance metrics that provide more insight into SQL-related model capabilities than currently used metrics. The empirical results, obtained with state-of-the-art TRL models and LLMs, show substantial performance differences (1) between existing benchmarks and proprietary data and (2) across queries of different complexity.
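To make the test categories concrete, below is a minimal sketch, assuming a small SQLite table, of how a checklist generator in the spirit of the one described above might craft queries spanning several categories (projection, selection, null handling, GROUP BY, HAVING) and execute them to obtain ground-truth answers. The function generate_tests and the products table are hypothetical illustrations for this page, not QATCH's actual API.

# Hypothetical sketch, not QATCH's real interface: craft SQL tests of
# varying type over a given table and run them for ground-truth answers.
import sqlite3

def generate_tests(table: str, columns: list[str],
                   numeric_col: str, cat_col: str) -> list[tuple[str, str]]:
    """Return (test_type, sql) pairs covering several query categories."""
    return [
        ("projection", f"SELECT {columns[0]} FROM {table}"),
        ("selection",  f"SELECT * FROM {table} WHERE {numeric_col} > 0"),
        ("null",       f"SELECT * FROM {table} WHERE {cat_col} IS NULL"),
        ("group-by",   f"SELECT {cat_col}, COUNT(*) FROM {table} GROUP BY {cat_col}"),
        ("having",     f"SELECT {cat_col}, AVG({numeric_col}) FROM {table} "
                       f"GROUP BY {cat_col} HAVING AVG({numeric_col}) > 10"),
    ]

# Toy proprietary table: one row has a NULL category to exercise null tests.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, category TEXT, price REAL)")
conn.executemany("INSERT INTO products VALUES (?, ?, ?)",
                 [("pen", "office", 1.5), ("desk", None, 120.0),
                  ("chair", "office", 45.0)])
for test_type, sql in generate_tests("products", ["name", "category", "price"],
                                     "price", "category"):
    print(test_type, "->", conn.execute(sql).fetchall())

Executing each generated query on the table yields a ground-truth result set against which a model's predicted answer (QA) or predicted SQL (SP) can then be scored by cross-task metrics.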
QATCH: Automatic evaluation of SQL-centric tasks on proprietary data
ACM Transactions on Intelligent Systems and Technology, 20 January 2025
Type: Journal
Date: 2025-01-20
Department: Data Science
Eurecom Ref: 8044
Copyright: © ACM, 2025. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM Transactions on Intelligent Systems and Technology, 20 January 2025, https://doi.org/10.1145/3712704
See also: https://www.eurecom.fr/publication/8044