Understanding Multi-page Visually Rich Documents: not really a piece of cake for Visual LLMs

Luca Cagliero - Associate Professor
Data Science

Date: -
Location: Eurecom

Abstract: Extracting information from document images or PDF files is particularly challenging because documents include a variety of textual and non-textual elements (e.g., figures, captions, diagrams, tables). Understanding Visually Rich Documents (VRDs) requires a deep understanding of both the multimodal content and its layout. This talk addresses three key aspects of VRD understanding: (1) the limitations of state-of-the-art Visual LLMs when coping with multi-page VRDs; (2) the training of lightweight models for cost-effective Key Information Extraction; (3) the preservation of layout context when querying Visual LLMs with noisy queries.

Short Bio: Luca Cagliero is Associate Professor at the Department of Control and Computer Engineering of Politecnico di Torino and coordinator of the SmartData@PoliTo interdepartmental center. His main research interests are in the areas of Natural Language Processing and Multimodal Learning. He currently coordinates the activities of five PhD students and one PostDoc, mainly working in the areas of automated summarization, question answering, and pattern mining. He serves as Associate Editor of the KAIS, ESWA, IEEE Data Description journals and as a TPC member of top-tier conferences. He coordinates several industrial partnership programs on applied machine learning and is currently a member of the Scientific Advisory Board of AFC DH of Intesa Sanpaolo Spa. Within the European Unite! Framework, he also coordinates Master- and PhD-level research programs related to the speech and language domains.