Can Vision Language Models Learn from Visual Demonstrations of Ambiguous Spatial Reasoning?

Bowen Zhao, Leo Parker Dirac, Paulina Varshavskaya

TL;DR

The authors propose a new benchmark, Spatial Visual Ambiguity Tasks (SVAT), that challenges state-of-the-art VLMs to learn new visuospatial tasks in-context. They find that VLMs fail at this zero-shot, and sometimes continue to fail after finetuning.

Abstract

Large vision-language models (VLMs) have become state-of-the-art for many computer vision tasks, with in-context learning (ICL) as a popular adaptation strategy for new ones. But can VLMs learn novel concepts purely from visual demonstrations, or are they limited to adapting to the output format of ICL examples? We propose a new benchmark we call Spatial Visual Ambiguity Tasks (SVAT) that challenges state-of-the-art VLMs to learn new visuospatial tasks in-context. We find that VLMs fail to do this zero-shot, and sometimes continue to fail after finetuning. However, adding simpler data to the training by curriculum learning leads to improved ICL performance.

Paper Structure

This paper contains 20 sections, 3 equations, 4 figures, 5 tables, and 1 algorithm.

Figures (4)

  • Figure 1: Examples of tasks where the object of interest is a simple shape. In (a) the colors and textures are trivial with $\varphi = (\mathbb{I}_2, \mathbb{C}_{\text{shape}}, 1, \mathbb{T}_{\text{none}})$, while in (b) there is more visual complexity with $\varphi = (\mathbb{I}_5, \mathbb{C}_{\text{shape}}, 1, \mathbb{T}_{\text{none}})$. In (c) there are distractor shapes, and the model must identify the object of interest using the text of the query, with $\varphi = (\mathbb{I}_3, \mathbb{C}_{\text{tshape}}, 3, \mathbb{T}_{\text{guide}})$
  • Figure 2: VLMs' performance on the task families $\varphi = (\mathbb{I}_5, \mathbb{C}_{\text{hard}}, 3, \mathbb{T}_{\text{guide}})$ and $\varphi = (\mathbb{I}_5, \mathbb{C}_{\text{tshape}}, 3, \mathbb{T}_{\text{guide}})$ after various curriculum learning (CL) strategies. FT denotes the plain finetuning baseline accuracy shown in the main results table.
  • Figure 3: Correlations between the accuracy improvements after CL and the VLMs' performance on $\mathbb{E}_{\varphi_1}$ and $\mathbb{E}_{\varphi_2}$ after training with $\mathbb{E}_{\varphi_1}$.
  • Figure 4: Ablation on MiniCPM for the task $(\mathbb{I}_5, \mathbb{C}_{\text{hard}}, 3, \mathbb{T}_{\text{guide}})$.