Can Visual Encoder Learn to See Arrows?

Naoyuki Terashita, Yusuke Tozaki, Hideaki Omote, Congkha Nguyen, Ryosuke Nakamoto, Yuta Koreeda, Hiroaki Ozaki

TL;DR

This work tackles the underexplored problem of diagram understanding by vision-language models (VLMs), which often miss edges because they rely on textual and positional cues. It introduces a debiased diagram dataset of directed graphs paired with Mermaid-style captions and trains CLIP-based image encoders via contrastive learning to learn edge representations. Across linear probing, image retrieval, and a new diagram captioning task, the finetuned encoders outperform pretrained baselines, and in captioning they also beat zero-shot large models, which indicates edge understanding that does not depend on text or layout cues. The results suggest that bias removal is a promising path to improving diagram comprehension in VLMs, with practical implications for diagram-centric reasoning and retrieval.
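To make the idea of a debiased directed graph with a Mermaid-style caption concrete, here is a minimal sketch. It is not the authors' generator: the helper names (`random_labels`, `random_directed_edges`, `mermaid_caption`), the label length, and the exact caption layout are assumptions for illustration. Node labels are random strings so that they carry no semantic hint about which nodes are connected, and the caption lists only `source --> target` lines, without a layout header.

```python
import random
import string


def random_labels(n, rng, length=4):
    """Draw n distinct random lowercase strings (hypothetical label scheme)."""
    labels = set()
    while len(labels) < n:
        labels.add("".join(rng.choice(string.ascii_lowercase) for _ in range(length)))
    # Sort so the result depends only on the seed, not on set iteration order.
    return sorted(labels)


def random_directed_edges(labels, n_edges, rng):
    """Sample n_edges distinct directed edges, excluding self-loops."""
    candidates = [(u, v) for u in labels for v in labels if u != v]
    return rng.sample(candidates, n_edges)


def mermaid_caption(edges):
    """Render a directed edge list as Mermaid-style arrow lines (assumed format)."""
    return "\n".join(f"{u} --> {v}" for u, v in edges)


if __name__ == "__main__":
    rng = random.Random(0)
    labels = random_labels(4, rng)
    edges = random_directed_edges(labels, 3, rng)
    print(mermaid_caption(edges))
```

Because edges are sampled uniformly over all ordered label pairs, neither the label text nor any fixed ordering predicts the edge set; where the nodes are drawn in the image would be randomized separately by whatever renderer is used.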

Abstract

A diagram is a visual representation of relationships illustrated with edges (lines or arrows) and is widely used in industrial and scientific communication. Although recognizing diagrams is essential for vision-language models (VLMs) to comprehend domain-specific knowledge, recent studies reveal that many VLMs fail to identify edges in images. We hypothesize that these failures stem from an over-reliance on textual and positional biases, which prevents VLMs from learning explicit edge features. Based on this idea, we empirically investigate whether the image encoder in VLMs can learn edge representations through training on a diagram dataset in which edges are biased by neither textual nor positional information. To this end, we conduct contrastive learning on an artificially generated diagram-caption dataset to train an image encoder and evaluate its diagram-related features on three tasks: probing, image retrieval, and captioning. Our results show that the finetuned model outperforms pretrained CLIP in all tasks and surpasses zero-shot GPT-4o and LLaVA-Mistral in the captioning task. These findings confirm that eliminating textual and positional biases fosters accurate edge recognition in VLMs, offering a promising path for advancing diagram understanding.
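The contrastive learning step can be illustrated with the standard CLIP-style symmetric objective. The sketch below is written in plain Python for readability and is an assumption about the general form of the loss, not the authors' training code; the function names and the temperature value of 0.07 are illustrative choices. Each diagram embedding is pulled toward the embedding of its own caption and pushed away from the other captions in the batch, and the same is done in the caption-to-diagram direction.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over image-text similarities.

    Row k of image_embs is assumed to be paired with row k of text_embs.
    The temperature is an illustrative default, not a value from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: scaled cosine similarity between image i and caption j.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # Image-to-text: each row should put its mass on the diagonal entry.
    loss_i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image: same criterion applied to the columns.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    aligned = [[1.0, 0.0], [0.0, 1.0]]
    swapped = [[0.0, 1.0], [1.0, 0.0]]
    print(clip_contrastive_loss(aligned, aligned))  # small: pairs match
    print(clip_contrastive_loss(aligned, swapped))  # large: pairs mismatched
```

In a debiased dataset, two diagrams in the same batch can differ only in their edges, so the encoder can lower this loss only by representing the edges themselves.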

Paper Structure

This paper contains 10 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Examples of diagram captioning by GPT-4o: (a) inferring relationships based on conventional top-down hierarchies, (b) leveraging semantic relationships between node labels, and (c) struggling when neither positional nor textual biases are available. All results were produced by gpt-4o-2024-08-06 with temperature 0.
  • Figure 2: Overview of our approach: (a) training a CLIP model with diagram-caption pairs that eliminate positional and textual biases, and (b) evaluating the model on three tasks: linear probing, image retrieval, and diagram captioning.
  • Figure 3: Examples of query images (top row) and the top retrieved images using the pretrained ViT-L/14 (middle row) and finetuned ViT-L/14 (bottom row). Images surrounded by orange lines represent true positives that share the same directed graph structures as the queries.