Benchmarking Visual LLMs' Resilience to Unanswerable Questions on Visually Rich Documents

Generative AI & LLMs
Published: arXiv:2511.11468v1
Authors

Davide Napolitano, Luca Cagliero, Fabrizio Battiloro

Abstract

The evolution of Visual Large Language Models (VLLMs) has revolutionized the automatic understanding of Visually Rich Documents (VRDs), which contain both textual and visual elements. Although VLLMs excel in Visual Question Answering (VQA) on multi-page VRDs, their ability to detect unanswerable questions is still an open research question. Our research delves into the robustness of VLLMs to plausible yet unanswerable questions, i.e., questions that appear valid but cannot be answered due to subtle corruptions caused by swaps between related concepts or plausible question formulations. Corruptions are generated by replacing the original natural language entities with others of the same type that belong to different document elements and appear in different layout positions or pages of the related document. To this end, we present VRD-UQA (VISUALLY RICH DOCUMENT UNANSWERABLE QUESTION ANSWERING), a benchmark for evaluating VLLMs' resilience to plausible yet unanswerable questions across multiple dimensions. It automatically alters the questions of existing VQA datasets consisting of multi-page VRDs, verifies their unanswerability using a VLLM-as-a-judge approach, and then thoroughly evaluates VLLMs' performance. Experiments, run on 12 models, analyze: (1) the VLLMs' accuracy in detecting unanswerable questions at both page and document levels; (2) the effect of different types of corruption (NLP entity, document element, layout); (3) the effectiveness of different knowledge injection strategies based on in-context learning (OCR, multi-page selection, or the possibility of unanswerability). Our findings reveal VLLMs' limitations and demonstrate that VRD-UQA can serve as an evaluation framework for developing resilient document VQA systems.
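To make the corruption step concrete, here is a minimal sketch of how a question could be corrupted by swapping a named entity with a same-type entity drawn from a different element or page of the document. It is not the authors' implementation: the spaCy NER model and the `doc_entities` input format are assumptions made for illustration.

```python
# Minimal sketch of entity-swap question corruption. Assumptions: spaCy NER
# ("en_core_web_sm") and a hypothetical doc_entities index; this is NOT the
# authors' implementation.
import random
import spacy

nlp = spacy.load("en_core_web_sm")

def corrupt_question(question: str, doc_entities: dict) -> str | None:
    """Swap one entity in the question with a same-type entity that occurs
    elsewhere in the document (different element, layout position, or page).

    doc_entities maps an entity label (e.g. "ORG", "DATE") to surface forms
    collected from other document elements/pages (hypothetical input format).
    """
    parsed = nlp(question)
    swappable = [ent for ent in parsed.ents if doc_entities.get(ent.label_)]
    if not swappable:
        return None  # nothing to corrupt
    target = random.choice(swappable)
    replacements = [s for s in doc_entities[target.label_] if s != target.text]
    if not replacements:
        return None
    # The corrupted question stays plausible but no longer matches the page.
    return question.replace(target.text, random.choice(replacements))

# Example: the organization in the question is swapped with one that appears
# only in a table on another page, making the question unanswerable here.
print(corrupt_question(
    "What revenue did Acme Corp report in 2021?",
    {"ORG": ["Globex Inc"], "DATE": ["2019"]},
))
```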

Paper Summary

Problem
Visual Large Language Models (VLLMs) are excellent at answering questions about visually rich documents, but they struggle to detect unanswerable questions: questions that appear valid but cannot be answered because subtle corruptions have swapped in related concepts or plausible-sounding formulations. This matters because even a well-formed question about a multi-page document may have no answer that can be determined from the document.
Key Innovation
The researchers introduce VRD-UQA (VISUALLY RICH DOCUMENT UNANSWERABLE QUESTION ANSWERING), a benchmark for evaluating VLLMs' resilience to plausible yet unanswerable questions across multiple dimensions. The framework automatically alters the questions of existing VQA datasets, verifies their unanswerability using a VLLM-as-a-judge approach, and then evaluates VLLMs' performance. The innovation lies in dynamically corrupting the input questions, swapping same-type entities across document elements, layout positions, and pages, through a mix of NLP and multimodal techniques.
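For intuition about the verification step, the sketch below shows how a VLLM-as-a-judge check might be wired up with an OpenAI-style vision API; the judge model, client, and prompt wording are assumptions rather than the paper's exact setup.

```python
# Sketch of a VLLM-as-a-judge unanswerability check using an OpenAI-style
# vision API. The judge model, prompt, and client are assumptions; the paper's
# actual judge setup may differ.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = (
    "You are shown a document page image and a question. Reply with exactly "
    "'ANSWERABLE' if the question can be answered from this page, otherwise "
    "reply with exactly 'UNANSWERABLE'.\n\nQuestion: {question}"
)

def judge_unanswerable(question: str, page_image_path: str,
                       model: str = "gpt-4o") -> bool:
    """Return True if the judge VLLM deems the question unanswerable
    for the given page image."""
    with open(page_image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": JUDGE_PROMPT.format(question=question)},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    verdict = (response.choices[0].message.content or "").strip().upper()
    return verdict.startswith("UNANSWERABLE")
```

Corrupted questions that such a judge still considers answerable would be discarded before entering the benchmark.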
Practical Impact
The VRD-UQA benchmark can be applied in real-world scenarios where VLLMs are used to analyze visually rich documents, such as PDF files, printed or scanned copies, and online articles. By evaluating VLLMs' ability to detect unanswerable questions, this benchmark can help developers create more robust document VQA systems. This, in turn, can improve the accuracy and reliability of VLLMs in various applications, such as question-answering systems, document analysis, and content summarization.
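As a sketch of how a developer might use such a benchmark, the snippet below implements one of the knowledge-injection strategies mentioned in the abstract, warning the model that a question may be unanswerable, and measures detection accuracy; the prompt wording, `ask_vllm` helper, and dataset fields are hypothetical.

```python
# Sketch of one in-context knowledge-injection strategy: warn the model that a
# question may be unanswerable, then score how often it detects this correctly.
# The prompt wording, ask_vllm helper, and dataset fields are hypothetical.
UNANSWERABILITY_HINT = (
    "The question below may not be answerable from the provided page(s). "
    "If it is not, reply exactly with 'unanswerable'.\n\nQuestion: {question}"
)

def detection_accuracy(samples, ask_vllm) -> float:
    """samples: iterable of dicts with 'question', 'pages' (images), and a
    boolean 'is_unanswerable' label; ask_vllm(prompt, pages) returns the
    model's textual answer."""
    correct, total = 0, 0
    for sample in samples:
        prompt = UNANSWERABILITY_HINT.format(question=sample["question"])
        answer = ask_vllm(prompt, sample["pages"]).strip().lower()
        predicted_unanswerable = answer == "unanswerable"
        correct += int(predicted_unanswerable == sample["is_unanswerable"])
        total += 1
    return correct / max(total, 1)
```

Running this loop over single pages versus full documents corresponds to the page-level and document-level detection accuracies analyzed in the paper.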
Analogy / Intuitive Explanation
Imagine you're trying to answer a question about a complex diagram. The question seems valid, but the diagram is missing the crucial piece of information the question asks about, so the honest response is that it cannot be answered. VRD-UQA acts like a "question editor" that introduces subtle corruptions into otherwise valid questions and then checks whether VLLMs can recognize that the edited questions are no longer answerable. By evaluating VLLMs' performance on these corrupted questions, VRD-UQA helps developers build more robust and accurate document VQA systems.
Paper Information
Categories: cs.CV, cs.AI
arXiv ID: 2511.11468v1
