Pseudo Contrastive Learning for Diagram Comprehension in Multimodal Models

Hiroshi Sasaki

TL;DR

This work proposes a new training paradigm designed to enhance diagram comprehension in vision-language models by introducing pseudo contrastive samples generated by a diagram renderer that creates synthetic diagrams using randomly picked text elements.

Abstract

Recent multimodal models such as Contrastive Language-Image Pre-training (CLIP) have shown remarkable ability to align visual and linguistic representations. However, domains where small visual differences carry large semantic significance, such as diagram understanding, remain challenging due to the models' limited sensitivity to fine-grained structural variations. We propose a new training paradigm designed to enhance diagram comprehension in vision-language models. Our approach introduces pseudo contrastive samples generated by a diagram renderer that creates synthetic diagrams using randomly picked text elements. These samples highlight structural differences in diagrammatic imagery without requiring any modification or editing of the original data. By incorporating these pseudo contrastive samples into the training objective, the model learns to capture more precise and semantically consistent diagram structures. Empirical evaluations on a benchmark dataset of flowcharts demonstrate substantial improvements over standard CLIP and hard-negative CLIP training in both image-text matching and visual question answering tasks. The results underscore the value of domain-specific training strategies and contribute to advancing diagrammatic understanding within the broader context of vision-language learning.
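
To illustrate how extra contrastive samples can enter a CLIP-style objective, the sketch below computes a generic InfoNCE loss for a single image anchor. It is not the paper's objective, whose equations are not reproduced on this page; the function name `info_nce`, the temperature value, and the example similarity scores are assumptions chosen for illustration.

```python
import math


def info_nce(sim_pos, sim_negs, temperature=0.07):
    """Generic InfoNCE loss for one anchor (an illustrative assumption,
    not the paper's exact objective).

    sim_pos  : similarity between the anchor and its matching caption
    sim_negs : similarities between the anchor and non-matching captions
    """
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
    # Negative log-probability assigned to the positive pair.
    return log_denom - logits[0]


if __name__ == "__main__":
    # Hypothetical similarity scores: two easy in-batch negatives,
    # then the same batch with one structurally similar hard negative.
    easy = info_nce(0.8, [0.1, 0.2])
    hard = info_nce(0.8, [0.1, 0.2, 0.75])
    print(f"in-batch negatives only: {easy:.4f}")
    print(f"with one hard negative:  {hard:.4f}")
```

A negative whose similarity is close to the positive's contributes a large term to the denominator, so the loss only falls when the model separates the two. This is one reading of why near-duplicate diagrams with different structure are useful as training signal.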

Paper Structure

This paper contains 22 sections, 6 equations, 3 figures, 3 tables, 2 algorithms.

Figures (3)

  • Figure 1: Conceptual overview of our proposed method. Text is extracted from a raster image via OCR to generate pseudo diagrams with random connections. These are rendered into an editable format, from which hard positive and negative pairs are created through rule-based edits. The resulting samples are used to train a VLM with structure-aware contrastive learning, enhancing its ability to distinguish fine-grained structural differences.
  • Figure 2: The editable pseudo diagram synthesis flow. OCR-extracted text is grouped by its proximity into combinations to form nodes. After adding random edge connections, this structure is finalised as definition code. This code is then used by a renderer to create both an editable image and a corresponding rule-based caption.
  • Figure 3: The synthesis of hard positive and negative samples. Hard positive images are generated via minor visual edits, such as altering node positions or flow orientation. Hard negatives are created by permuting nodes and edges. The original diagram code serves as the hard positive caption, while hard negative captions are generated by swapping labels within the code to create a structural mismatch.
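
The synthesis steps described in the captions above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the text snippets are assumed to be already extracted and grouped into node labels, a Mermaid-like flowchart syntax stands in for the unspecified "definition code", and all function names (`make_pseudo_diagram`, `to_definition_code`, `to_caption`, `swap_labels`) are hypothetical.

```python
import random


def make_pseudo_diagram(texts, n_edges, seed=0):
    """Assign each text snippet to a node and draw random directed edges."""
    rng = random.Random(seed)
    nodes = {f"N{i}": text for i, text in enumerate(texts)}
    ids = list(nodes)
    # Every ordered pair of distinct nodes is a candidate edge.
    candidates = [(a, b) for a in ids for b in ids if a != b]
    edges = rng.sample(candidates, min(n_edges, len(candidates)))
    return nodes, edges


def to_definition_code(nodes, edges, direction="TD"):
    """Serialise the graph in a Mermaid-like syntax (assumed format)."""
    lines = [f"flowchart {direction}"]
    lines += [f'    {nid}["{label}"]' for nid, label in nodes.items()]
    lines += [f"    {a} --> {b}" for a, b in edges]
    return "\n".join(lines)


def to_caption(nodes, edges):
    """Rule-based caption: one templated sentence per edge."""
    return " ".join(
        f'"{nodes[a]}" points to "{nodes[b]}".' for a, b in edges
    )


def swap_labels(nodes, seed=0):
    """Hard negative text: exchange the labels of two nodes.

    Assumes labels are distinct; swapping identical labels changes nothing.
    """
    rng = random.Random(seed)
    a, b = rng.sample(list(nodes), 2)
    swapped = dict(nodes)
    swapped[a], swapped[b] = nodes[b], nodes[a]
    return swapped


if __name__ == "__main__":
    # Hypothetical snippets standing in for OCR output.
    texts = ["Start", "Check input", "Process", "End"]
    nodes, edges = make_pseudo_diagram(texts, n_edges=3, seed=1)

    print(to_definition_code(nodes, edges))        # original definition
    print(to_caption(nodes, edges))                # rule-based caption
    # Hard positive: same graph, different flow orientation.
    print(to_definition_code(nodes, edges, direction="LR"))
    # Hard negative caption: same edges, two labels exchanged.
    print(to_definition_code(swap_labels(nodes, seed=1), edges))
```

The orientation change leaves nodes and edges untouched, so the result still matches the original caption. The label swap keeps the vocabulary identical while changing which text sits at which position in the graph, which is the kind of mismatch a bag-of-words comparison cannot detect.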