3D-Aware Vision-Language Models Fine-Tuning with Geometric Distillation

Seonho Lee, Jiho Choi, Inha Kang, Jiwook Kim, Junsung Park, Hyunjung Shim

TL;DR

This work tackles the limited 3D spatial understanding of Vision-Language Models by introducing Geometric Distillation, a lightweight, annotation-free fine-tuning framework that injects geometry priors from 3D foundation models into frozen VLMs. It uses paired views to distill three signals (sparse correspondences, relative depth, and dense cost-volume alignment) through a LoRA-based adapter in CLIP, without changing the architecture. The method yields strong gains across 3D correspondence, depth estimation, semantic segmentation, and 3D vision-language reasoning benchmarks while significantly reducing training cost compared to prior approaches. Overall, it provides a scalable and practical pathway to bridge 2D VLMs with 3D understanding for spatially grounded multimodal tasks.

Abstract

Vision-Language Models (VLMs) have shown remarkable performance on diverse visual and linguistic tasks, yet they remain fundamentally limited in their understanding of 3D spatial structures. We propose Geometric Distillation, a lightweight, annotation-free fine-tuning framework that injects human-inspired geometric cues into pretrained VLMs without modifying their architecture. By distilling (1) sparse correspondences, (2) relative depth relations, and (3) dense cost volumes from off-the-shelf 3D foundation models (e.g., MASt3R, VGGT), our method shapes representations to be geometry-aware while remaining compatible with natural image-text inputs. Through extensive evaluations on 3D vision-language reasoning and 3D perception benchmarks, our method consistently outperforms prior approaches, achieving improved 3D spatial reasoning with significantly lower computational cost. Our work demonstrates a scalable and efficient path to bridge 2D-trained VLMs with 3D understanding, opening up wider use in spatially grounded multimodal tasks.

Paper Structure

This paper contains 36 sections, 10 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: Geometric Distillation enhances 3D spatial reasoning in vision-language models. By distilling geometric cues such as correspondences, relative depth, and cost alignment from 3D foundation models, our method improves 3D visual understanding and enables accurate reasoning in tasks like answering which object is closer.
  • Figure 2: Geometric cues and PCA visualization of feature transformation through geometric distillation.
  • Figure 3: Overview of Geometric Distillation Architecture. A 3D foundation model extracts geometric cues including (1) sparse correspondences, (2) depth maps, and (3) dense cost volumes from multi-view inputs. These cues supervise a frozen CLIP image encoder with a lightweight adapter (LoRA) via three loss branches: $\mathcal{L}_\text{match}$, $\mathcal{L}_\text{depth}$, and $\mathcal{L}_\text{cost}$. The distillation enables the VLM to acquire 3D spatial awareness without explicit 3D annotations.
  • Figure 4: Visualization of cost volume. (a) Anchor view with query location (yellow box). Cost volume heatmaps from (b) the teacher (MASt3R), (c) the vanilla CLIP, and (d) after geometric distillation. The proposed method better captures localized geometric similarity, closely aligning with the teacher’s output.
  • Figure 5: Semantic Transfer. (a) Source image with annotated keypoints. Transfer results using (b) MEF (You et al., 2024) and (c) our approach. Our method produces more accurate and spatially consistent transfers.
  • ...and 3 more figures
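
The captions for Figures 3 and 4 describe a cost volume computed for a query location and three loss branches that supervise the adapter. The sketch below illustrates one plausible reading of those two ideas in NumPy. It is not the authors' implementation: the cosine-similarity cost volume, the KL-divergence form of the cost-alignment term, the temperature, and the loss weights `w_match`, `w_depth`, `w_cost` are all assumptions made for illustration, since this page does not give the paper's equations.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def cost_volume(query_feat, target_feats):
    """Cosine similarity between one query feature and every target patch.

    query_feat:   (D,) feature at the query location in the anchor view.
    target_feats: (H, W, D) patch features of the other view.
    Returns an (H, W) similarity map (assumed definition of the cost volume).
    """
    return l2_normalize(target_feats) @ l2_normalize(query_feat)

def spatial_softmax(cost, temperature):
    """Turn an (H, W) cost map into a probability distribution over locations."""
    z = cost.reshape(-1) / temperature
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cost_alignment_loss(student_cost, teacher_cost, temperature=0.1):
    """KL(teacher || student) over spatial distributions (hypothetical loss form)."""
    p = spatial_softmax(teacher_cost, temperature)
    q = spatial_softmax(student_cost, temperature)
    return float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))

def total_loss(l_match, l_depth, l_cost, w_match=1.0, w_depth=1.0, w_cost=1.0):
    """Weighted sum of the three branches; the weights are placeholders."""
    return w_match * l_match + w_depth * l_depth + w_cost * l_cost

# Toy usage with random features standing in for teacher and student outputs.
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 4, 8))
teacher_cost = cost_volume(feats[1, 2], feats)   # peaks at location (1, 2)
student_cost = cost_volume(feats[0, 0], feats)   # peaks elsewhere
loss_cost = cost_alignment_loss(student_cost, teacher_cost)
```

Under these assumptions, a student whose cost map peaks at the same location as the teacher's receives a small loss, which matches the qualitative behavior shown in Figure 4, where the distilled model's heatmap concentrates around the teacher's peak.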