Med-Scout: Curing MLLMs' Geometric Blindness in Medical Perception via Geometry-Aware RL Post-Training

Agentic AI
Published: arXiv: 2601.23220v1
Authors

Anglin Liu, Ruichao Chen, Yi Lu, Hongxia Xu, Jintai Chen

Abstract

Despite the linguistic prowess of recent Multimodal Large Language Models (MLLMs) in medical diagnosis, we find that even state-of-the-art MLLMs suffer from a critical perceptual deficit: geometric blindness. This failure to ground outputs in objective geometric constraints leads to plausible yet factually incorrect hallucinations, rooted in training paradigms that prioritize linguistic fluency over geometric fidelity. This paper introduces Med-Scout, a novel framework that "cures" this blindness via Reinforcement Learning (RL) that leverages the intrinsic geometric logic latent within unlabeled medical images. Instead of relying on costly expert annotations, Med-Scout derives verifiable supervision signals through three strategic proxy tasks: Hierarchical Scale Localization, Topological Jigsaw Reconstruction, and Anomaly Consistency Detection. To rigorously quantify this deficit, we present Med-Scout-Bench, a new benchmark specifically designed to evaluate geometric perception. Extensive evaluations show that Med-Scout significantly mitigates geometric blindness, outperforming leading proprietary and open-source MLLMs by over 40% on our benchmark. Furthermore, this enhanced geometric perception generalizes to broader medical understanding, achieving superior results on radiological and comprehensive medical VQA tasks.

Paper Summary

Problem
Multimodal Large Language Models (MLLMs) are widely used in medical imaging, but they suffer from a critical limitation: geometric blindness. Even though they can generate semantically rich descriptions, they often fail to ground their outputs in the geometric facts of the image. This leads to plausible yet factually incorrect hallucinations, such as misplacing organs or describing lesions that are not present.
Key Innovation
The Med-Scout framework proposes a novel solution to this problem by using Reinforcement Learning (RL) to leverage the intrinsic geometric logic latent within unlabeled medical images. Instead of relying on costly expert annotations, Med-Scout derives verifiable supervision signals through three strategic proxy tasks: Hierarchical Scale Localization, Topological Jigsaw Reconstruction, and Anomaly Consistency Detection. This approach allows Med-Scout to significantly mitigate geometric blindness and improve performance on radiological and comprehensive medical VQA tasks.
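To make the idea of annotation-free, verifiable rewards more concrete, the sketch below (a minimal Python illustration with hypothetical function names; the paper does not publish this code) shows how a Topological Jigsaw Reconstruction proxy task could yield a supervision signal from an unlabeled image: the image is tiled into a grid, the tiles are shuffled with a known permutation, and the model's predicted reconstruction order is scored against that permutation. The "label" is generated for free, so the reward is verifiable without expert annotation.

```python
import random
from typing import List, Sequence, Tuple

import numpy as np


def make_jigsaw_task(image: np.ndarray, grid: int = 3,
                     seed: int = None) -> Tuple[List[np.ndarray], List[int]]:
    """Tile an image into a grid x grid jigsaw and shuffle the tiles.

    Returns the shuffled tiles and the permutation used. The permutation acts
    as a free, verifiable "label" derived purely from the unlabeled image
    (hypothetical sketch of the Topological Jigsaw Reconstruction proxy task).
    """
    rng = random.Random(seed)
    h, w = image.shape[:2]
    th, tw = h // grid, w // grid
    tiles = [
        image[r * th:(r + 1) * th, c * tw:(c + 1) * tw]
        for r in range(grid)
        for c in range(grid)
    ]
    order = list(range(grid * grid))
    rng.shuffle(order)
    shuffled = [tiles[i] for i in order]
    return shuffled, order  # `order` is the ground-truth permutation


def jigsaw_reward(predicted_order: Sequence[int],
                  true_order: Sequence[int]) -> float:
    """Verifiable RL reward: fraction of tiles placed in the correct position."""
    correct = sum(p == t for p, t in zip(predicted_order, true_order))
    return correct / len(true_order)


if __name__ == "__main__":
    # Toy example with a random "scan"; in practice the MLLM would be prompted
    # to output a reconstruction order, and this reward would drive RL updates.
    img = np.random.rand(300, 300)
    _, true_order = make_jigsaw_task(img, grid=3, seed=0)
    model_guess = list(range(9))  # placeholder for the model's prediction
    print("reward:", jigsaw_reward(model_guess, true_order))
```

The same pattern plausibly extends to the other two proxy tasks: scale localization can be checked against the known crop coordinates, and anomaly consistency can be checked against the synthetic perturbation applied, again without any expert labels.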
Practical Impact
Med-Scout can be applied across medical imaging tasks such as radiological diagnosis and medical VQA. By improving the geometric perception of MLLMs, it can help clinicians reach more accurate diagnoses and deliver better patient care, and it supports the development of more robust and reliable medical AI systems for complex imaging workloads.
Analogy / Intuitive Explanation
Imagine describing a jigsaw puzzle from memory without looking at the pieces: you might use impressive language, but some pieces will end up described in the wrong places. MLLMs do something similar when they describe medical images without grasping the underlying geometry. Med-Scout acts like a puzzle solver that teaches the model the geometry of the image so its descriptions stay anchored to it.
Paper Information
Categories: cs.CV, cs.AI
Published Date:
arXiv ID: 2601.23220v1