BOP-ASK: Object-Interaction Reasoning for Vision-Language Models

Robotics & Computer Vision
arXiv: 2511.16857v1
Authors

Vineet Bhat, Sungsu Kim, Valts Blukis, Greg Heinrich, Prashanth Krishnamurthy, Ramesh Karri, Stan Birchfield, Farshad Khorrami, Jonathan Tremblay

Abstract

Vision-Language Models (VLMs) have achieved impressive performance on spatial reasoning benchmarks, yet these evaluations mask critical weaknesses in understanding object interactions. Current benchmarks test high-level relationships ('left of', 'behind', etc.) but ignore the fine-grained spatial understanding needed for real-world applications: precise 3D localization, physical compatibility between objects, object affordances, and multi-step spatial planning. In this work, we present BOP-ASK, a novel large-scale dataset for object-interaction reasoning for both training and benchmarking. Our data generation pipeline leverages 6D object poses from the Benchmark for Object Pose Estimation (BOP) datasets, from which we derive fine-grained annotations such as grasp poses, referred object poses, path planning trajectories, relative spatial and depth relationships, and object-to-object relationships. BOP-ASK comprises over 150k images and 33M question-answer pairs spanning six tasks (four novel), providing a rich resource for training and evaluating VLMs. We evaluate proprietary and open-source VLMs and conduct human evaluations on BOP-ASK-core, a contributed test benchmark. We also release BOP-ASK-lab, an out-of-distribution benchmark with images not sourced from BOP, enabling testing of generalization. Our experiments demonstrate that models trained on BOP-ASK outperform baselines and exhibit emergent capabilities such as precise object and grasp pose estimation, trajectory planning, and fine-grained object-centric spatial reasoning in cluttered environments. We will publicly release our datasets and dataset generation pipeline.
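To make the derived annotations concrete, the sketch below shows how a relative spatial and depth question-answer pair could, in principle, be generated from two annotated 6D poses expressed in the camera frame (BOP convention: +x to the right, +z into the scene). The function name, phrasing templates, and example values are illustrative assumptions, not the paper's released pipeline.

```python
# Hypothetical sketch: building spatial/depth QA pairs from two camera-frame
# object translations (meters). Not the actual BOP-ASK generation code.
import numpy as np

def spatial_qa(name_a, t_a, name_b, t_b):
    """Return simple left/right and closer/farther QA pairs for two objects."""
    t_a, t_b = np.asarray(t_a, dtype=float), np.asarray(t_b, dtype=float)
    horizontal = "to the left of" if t_a[0] < t_b[0] else "to the right of"
    depth = ("closer to the camera than" if t_a[2] < t_b[2]
             else "farther from the camera than")
    return [
        {"question": f"Is the {name_a} to the left or right of the {name_b}?",
         "answer": f"The {name_a} is {horizontal} the {name_b}."},
        {"question": f"Which is closer to the camera, the {name_a} or the {name_b}?",
         "answer": f"The {name_a} is {depth} the {name_b}."},
    ]

# Example with made-up translations taken from two annotated 6D poses.
print(spatial_qa("mug", [0.12, 0.05, 0.80], "drill", [-0.03, 0.02, 1.10]))
```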

Paper Summary

Problem
Vision-language models (VLMs) have made significant progress in understanding scenes and generating descriptions, but they struggle with object-interaction reasoning: the ability to understand and predict fine-grained physical relationships between objects, including grasp affordances, collision-aware motion paths, and manipulation sequencing in cluttered environments. This gap is a critical limitation for deploying VLMs as embodied agents in real-world robotic environments.
Key Innovation
BOP-ASK is a novel large-scale dataset for object-interaction reasoning that provides a rich resource for training and evaluating VLMs. The dataset includes over 150k images and 33M question-answer pairs spanning six tasks, covering 3D object poses, grasp affordances, motion trajectories, and object rearrangements. BOP-ASK is designed to bridge the gap between pixel-level perception and high-level reasoning, enabling VLMs to reason about and act upon objects in complex scenes.
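As a rough illustration of what a single training record in such a dataset might look like, the sketch below defines a hypothetical schema with fields for the question, answer, and optional pose or trajectory annotations. All field names and example values are assumptions; the published BOP-ASK format may differ.

```python
# Illustrative schema for one object-interaction QA record; not the released format.
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class ObjectInteractionQA:
    image_id: str                                 # source image, e.g. a BOP scene/frame id
    task: str                                     # e.g. "grasp_pose", "trajectory", "relative_depth"
    question: str                                 # natural-language prompt given to the VLM
    answer: str                                   # ground-truth textual answer
    object_pose: list[float] | None = None        # optional 6D pose [x, y, z, qx, qy, qz, qw]
    grasp_pose: list[float] | None = None         # optional gripper pose for grasp tasks
    trajectory: list[list[float]] = field(default_factory=list)  # optional path waypoints

# Hypothetical example record.
example = ObjectInteractionQA(
    image_id="ycbv/000048/000001",
    task="grasp_pose",
    question="Where should the gripper be placed to pick up the mug by its handle?",
    answer="Approach from the right side and close the gripper on the handle.",
)
```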
Practical Impact
The BOP-ASK dataset and benchmarks have the potential to significantly improve the performance of VLMs in real-world robotic environments. By training on BOP-ASK, VLMs can learn to reason about object interactions, grasp affordances, and motion trajectories, enabling them to perform tasks such as object manipulation, navigation, and scene understanding. This has a wide range of potential applications, including robotics, augmented reality, and autonomous vehicles.
Analogy / Intuitive Explanation
Imagine you are trying to put together a puzzle, but the pieces are all jumbled up and you need to figure out how to get them to fit together. That's what object-interaction reasoning is like: it's the ability to understand how objects relate to each other in space and how to manipulate them to achieve a goal. BOP-ASK is like a comprehensive guidebook that helps VLMs learn to solve this puzzle by providing a rich set of examples and tasks that teach them how to reason about object interactions.
Paper Information
Categories: cs.CV, cs.RO
Published Date:
arXiv ID: 2511.16857v1