Tabular Representation Learning (TRL) and Large Language Models (LLMs) have become established for tackling Question Answering (QA) and Semantic Parsing (SP) tasks on tabular data. State-of-the-art models are pre-trained and evaluated on large open-domain datasets. However, performance on existing QA and SP benchmarks is not necessarily representative of performance on proprietary data, as the characteristics of the input and the complexity of the posed queries vary widely. To tackle this challenge, our goal is to let end-users evaluate TRL and LLM performance on their own proprietary data. We present QATCH (Query-Aided TRL Checklist), a toolbox that automatically generates a testing checklist tailored to QA and SP. QATCH provides a testing suite highlighting models’ strengths and weaknesses on relational tables unseen at training time. The proposed toolbox relies on a SQL query generator that crafts tests of varying type and complexity, including, among others, tests on null values, projections, selections, joins, and GROUP BY and HAVING clauses. QATCH also supports a set of general cross-task performance metrics that provide more insight into SQL-related model capabilities than currently used metrics. The empirical results, obtained with state-of-the-art TRL models and LLMs, show substantial performance differences (1) between existing benchmarks and proprietary data and (2) across queries of different complexity.
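To make the test categories concrete, below is a minimal sketch, assuming a small SQLite table, of how a checklist generator in the spirit of the one described above might craft queries spanning several categories (projection, selection, null handling, GROUP BY, HAVING) and execute them to obtain ground-truth answers. The function generate_tests and the products table are hypothetical illustrations for this page, not QATCH's actual API.

# Hypothetical sketch, not QATCH's real interface: craft SQL tests of
# varying type over a given table and run them for ground-truth answers.
import sqlite3

def generate_tests(table: str, columns: list[str],
                   numeric_col: str, cat_col: str) -> list[tuple[str, str]]:
    """Return (test_type, sql) pairs covering several query categories."""
    return [
        ("projection", f"SELECT {columns[0]} FROM {table}"),
        ("selection",  f"SELECT * FROM {table} WHERE {numeric_col} > 0"),
        ("null",       f"SELECT * FROM {table} WHERE {cat_col} IS NULL"),
        ("group-by",   f"SELECT {cat_col}, COUNT(*) FROM {table} GROUP BY {cat_col}"),
        ("having",     f"SELECT {cat_col}, AVG({numeric_col}) FROM {table} "
                       f"GROUP BY {cat_col} HAVING AVG({numeric_col}) > 10"),
    ]

# Toy proprietary table: one row has a NULL category to exercise null tests.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, category TEXT, price REAL)")
conn.executemany("INSERT INTO products VALUES (?, ?, ?)",
                 [("pen", "office", 1.5), ("desk", None, 120.0),
                  ("chair", "office", 45.0)])
for test_type, sql in generate_tests("products", ["name", "category", "price"],
                                     "price", "category"):
    print(test_type, "->", conn.execute(sql).fetchall())

Executing each generated query on the table yields a ground-truth result set against which a model's predicted answer (QA) or predicted SQL (SP) can then be scored by cross-task metrics.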
QATCH: Automatic evaluation of SQL-centric tasks on proprietary data
ACM Transactions on Intelligent Systems and Technology, 20 January 2025
Type: Journal
Date: 2025-01-20
Department: Data Science
Eurecom Ref: 8044
Copyright: © ACM, 2025. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM Transactions on Intelligent Systems and Technology, 20 January 2025, https://doi.org/10.1145/3712704
See also: https://www.eurecom.fr/publication/8044