A Mixed Diet Makes DINO An Omnivorous Vision Encoder

arXiv: 2602.24181v1
Authors

Rishabh Kabra, Maks Ovsjanikov, Drew A. Hudson, Ye Xia, Skanda Koppula, Andre Araujo, Joao Carreira, Niloy J. Mitra

Abstract

Pre-trained vision encoders like DINOv2 have demonstrated exceptional performance on unimodal tasks. However, we observe that their feature representations are poorly aligned across different modalities. For instance, the feature embedding of an RGB image and that of the corresponding depth map of the same scene exhibit a cosine similarity nearly identical to that of two random, unrelated images. To address this, we propose the Omnivorous Vision Encoder, a novel framework that learns a modality-agnostic feature space. We train the encoder with a dual objective: first, an alignment objective that maximizes feature similarity between different modalities of the same scene; and second, a distillation objective that anchors the learned representations to the output of a fully frozen teacher such as DINOv2. The resulting student encoder becomes "omnivorous" by producing a consistent, powerful embedding for a given scene, regardless of the input modality (RGB, depth, segmentation, etc.). This approach enables robust cross-modal understanding while retaining the discriminative semantics of the original foundation model.

Paper Summary

Problem
Computer vision foundation models like DINOv2 are strong at processing standard RGB images, but their features do not align across different input types, such as RGB, depth, and segmentation maps. Embeddings of the same scene rendered in two modalities can look as dissimilar as embeddings of two unrelated images. This makes it difficult to perform tasks that require cross-modal understanding, like matching a scene across sensors or combining information from multiple image types.
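The misalignment the paper reports can be probed with a simple cosine-similarity check on pooled features. A minimal sketch of that diagnostic is below; the embedding vectors here are random placeholders standing in for pooled encoder outputs (e.g. a CLS-token feature), since the actual model is not part of this summary.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two pooled feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder embeddings (dim 768, typical for a ViT-B encoder).
rng = np.random.default_rng(0)
emb_rgb = rng.standard_normal(768)     # RGB image of a scene
emb_depth = rng.standard_normal(768)   # depth map of the SAME scene
emb_random = rng.standard_normal(768)  # an unrelated image

# The paper's observation, restated: sim(rgb, depth) is about as low as
# sim(rgb, random) -- the encoder treats the two modalities as unrelated.
sim_cross_modal = cosine_similarity(emb_rgb, emb_depth)
sim_unrelated = cosine_similarity(emb_rgb, emb_random)
print(sim_cross_modal, sim_unrelated)
```

With a well-aligned encoder, `sim_cross_modal` would instead be much higher than `sim_unrelated`; that gap is what the proposed training closes.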
Key Innovation
The researchers propose a new approach called the Omnivorous Vision Encoder, which learns a modality-agnostic feature space. This means that the encoder can produce consistent and powerful embeddings for a given scene, regardless of the input modality (RGB, depth, segmentation, etc.). The key innovation is a dual objective training approach that maximizes feature alignment between different modalities and anchors the learned representations to the output of a frozen teacher model like DINOv2.
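The dual objective described above can be sketched as a sum of two loss terms: an alignment term that pulls the student's embeddings of the same scene in different modalities together, and a distillation term that anchors the student to the frozen teacher's output. The sketch below is illustrative only; the exact losses, weighting (`lam`), and which modality is distilled are assumptions, not details from the paper.

```python
import numpy as np

def l2_normalize(x: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def dual_objective_loss(student_rgb: np.ndarray,
                        student_depth: np.ndarray,
                        teacher_rgb: np.ndarray,
                        lam: float = 1.0) -> float:
    """Illustrative dual objective:
    - alignment: maximize cosine similarity between the student's RGB and
      depth embeddings of the same scene (written as 1 - cos so lower is better);
    - distillation: anchor the student's RGB embedding to the frozen
      teacher's output via mean squared error.
    """
    s_rgb = l2_normalize(student_rgb)
    s_depth = l2_normalize(student_depth)
    align_loss = 1.0 - np.sum(s_rgb * s_depth, axis=-1).mean()
    distill_loss = np.mean((student_rgb - teacher_rgb) ** 2)
    return float(align_loss + lam * distill_loss)

# Toy batch of pooled embeddings (batch of 4, dim 768).
rng = np.random.default_rng(0)
teacher = rng.standard_normal((4, 768))
student_rgb = teacher + 0.01 * rng.standard_normal((4, 768))  # near the teacher
student_depth = student_rgb.copy()                            # already aligned
print(dual_objective_loss(student_rgb, student_depth, teacher))
```

When the student's modality embeddings agree and track the teacher, both terms are near zero; a misaligned depth branch inflates the alignment term, which is the signal the training exploits.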
Practical Impact
The Omnivorous Vision Encoder has several practical implications. It enables robust cross-modal understanding, which is essential for tasks like object recognition, scene understanding, and visual navigation. It also allows the model to generalize to unseen visual modalities, paving the way for a more foundational vision model. This could have significant impacts on applications like robotics, self-driving cars, and medical imaging.
Analogy / Intuitive Explanation
Imagine you're trying to recognize a friend in a crowded room. You can see their face, but also their height, posture, and clothing. A good vision model should be able to combine this information from different modalities (face, body, clothing) to produce a consistent and accurate representation of your friend. The Omnivorous Vision Encoder is like a super-smart friend who can take in different types of information and produce a unified and powerful representation, allowing for more accurate recognition and understanding.
Paper Information

Categories: cs.CV, cs.AI
arXiv ID: 2602.24181v1
