Can Visual Encoder Learn to See Arrows?

Naoyuki Terashita; Yusuke Tozaki; Hideaki Omote; Congkha Nguyen; Ryosuke Nakamoto; Yuta Koreeda; Hiroaki Ozaki

TL;DR

This work tackles the underexplored problem of diagram understanding by vision-language models, where edges are often missed due to reliance on textual and positional cues. It introduces a debiased diagram dataset of directed graphs paired with Mermaid-style captions and trains CLIP-based image encoders via contrastive learning to learn edge representations. Across linear probing, image retrieval, and a new diagram captioning task, the finetuned encoders outperform pretrained baselines and even beat zero-shot large models in captioning, demonstrating robust edge understanding independent of text or layout cues. The results suggest that bias removal is a promising path to improve diagram comprehension in VLMs, with practical implications for diagram-centric reasoning and retrieval.
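To make the idea of a directed graph paired with a Mermaid-style caption concrete, here is a minimal sketch. It is an illustration only, not the authors' generation pipeline: the function names `random_digraph` and `mermaid_caption`, the node labels, and the `graph TD` header are assumptions. Edges are sampled uniformly over ordered node pairs, so which nodes are connected does not depend on their labels; randomizing the rendered layout is not shown.

```python
import random

def random_digraph(nodes, n_edges, seed=None):
    """Sample n_edges distinct directed edges (no self-loops) uniformly at random.

    Hypothetical helper: edge choice ignores node labels entirely.
    """
    rng = random.Random(seed)
    candidates = [(u, v) for u in nodes for v in nodes if u != v]
    return rng.sample(candidates, n_edges)

def mermaid_caption(edges):
    """Render a list of (source, target) edges as a Mermaid-style flowchart caption."""
    lines = ["graph TD"]
    lines += [f"    {u} --> {v}" for u, v in edges]
    return "\n".join(lines)

edges = random_digraph(["A", "B", "C", "D"], n_edges=3, seed=0)
caption = mermaid_caption(edges)
print(caption)
```

Each caption lists one `source --> target` line per edge, so a model that produces the caption correctly must have recovered both the endpoints and the direction of every arrow.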

Abstract

A diagram is a visual representation of relationships illustrated with edges (lines or arrows), and it is widely used in industrial and scientific communication. Although recognizing diagrams is essential for vision-language models (VLMs) to comprehend domain-specific knowledge, recent studies reveal that many VLMs fail to identify edges in images. We hypothesize that these failures stem from an over-reliance on textual and positional biases, which prevents VLMs from learning explicit edge features. Based on this idea, we empirically investigate whether the image encoder in VLMs can learn edge representations through training on a diagram dataset in which edges are biased by neither textual nor positional information. To this end, we conduct contrastive learning on an artificially generated diagram-caption dataset to train an image encoder and evaluate its diagram-related features on three tasks: probing, image retrieval, and captioning. Our results show that the finetuned model outperforms pretrained CLIP in all tasks and surpasses zero-shot GPT-4o and LLaVA-Mistral in the captioning task. These findings confirm that eliminating textual and positional biases fosters accurate edge recognition in VLMs, offering a promising path for advancing diagram understanding.
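The contrastive objective mentioned above can be sketched as the symmetric cross-entropy loss used by CLIP, in which each diagram embedding should score highest against its own caption embedding and vice versa. This is a dependency-free illustration under stated assumptions, not the authors' training code: the function names, the toy two-dimensional embeddings, and the fixed temperature of 0.07 are all assumptions.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    temperature=0.07 is an assumed constant; it is not taken from the paper.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    # Cosine similarity of every image with every caption, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # caption-to-image direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))

# Toy batch: aligned pairs should give a lower loss than swapped pairs.
matched = clip_contrastive_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
swapped = clip_contrastive_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
print(matched < swapped)
```

In a debiased dataset, captions within a batch can differ only in their edges, so lowering this loss would require the image encoder to represent the edges themselves rather than node text or position.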
