ViStruct: Visual Structural Knowledge Extraction via Curriculum Guided Code-Vision Representation

Yangyi Chen, Xingyao Wang, Manling Li, Derek Hoiem, Heng Ji

TL;DR

This work addresses the limited ability of vision-language models to extract visual structure. It introduces ViStruct, which encodes concepts, relations, and events in code-vision representations. ViStruct couples this representation with a curriculum pyramid and a replay buffer to teach visual structures progressively, from simple concepts to complex events, and builds the ViStruct Suite with weakly-supervised event structures aligned to WordNet and FrameNet. Empirically, ViStruct yields consistent improvements on visual relation detection, scene graph classification, and situation recognition, and demonstrates strong zero-shot capabilities. The approach offers a scalable, hierarchical framework for visual structure understanding and provides datasets and benchmarks to support future research in structured multimodal reasoning.
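The curriculum-plus-replay idea can be illustrated with a minimal sketch. It assumes three ordered levels (concepts, relations, events) taken from the summary above; the function name `curriculum_batches`, the `replay_ratio` and `batch_size` parameters, and the sampling rule are hypothetical and are not taken from the paper.

```python
import random
from typing import Dict, Iterator, List, Tuple


def curriculum_batches(
    levels: Dict[str, List[str]],
    replay_ratio: float = 0.5,  # assumed fraction of extra replayed examples
    batch_size: int = 2,        # assumed batch size
    seed: int = 0,
) -> Iterator[Tuple[str, List[str]]]:
    """Yield (level, batch) pairs in curriculum order.

    Each batch holds examples from the current level plus a few examples
    sampled from a replay buffer of all earlier levels.
    """
    rng = random.Random(seed)
    replay_buffer: List[str] = []
    for level_name, examples in levels.items():
        for start in range(0, len(examples), batch_size):
            batch = list(examples[start:start + batch_size])
            n_replay = min(len(replay_buffer), int(len(batch) * replay_ratio))
            batch += rng.sample(replay_buffer, n_replay)
            yield level_name, batch
        # Only after a level is finished do its examples become replayable.
        replay_buffer.extend(examples)


# Toy data: level order follows the "simple concepts to complex events" idea.
levels = {
    "concepts": ["c1", "c2"],
    "relations": ["r1", "r2"],
    "events": ["e1", "e2"],
}
batches = list(curriculum_batches(levels))
for level_name, batch in batches:
    print(level_name, batch)
```

Mixing earlier examples into later batches is one common way to reduce forgetting of lower-level knowledge; the actual buffer policy used by ViStruct may differ.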

Abstract

State-of-the-art vision-language models (VLMs) still have limited performance in structural knowledge extraction, such as relations between objects. In this work, we present ViStruct, a training framework to learn VLMs for effective visual structural knowledge extraction. Two novel designs are incorporated. First, we propose to leverage the inherent structure of programming language to depict visual structural information. This approach enables explicit and consistent representation of visual structural information of multiple granularities, such as concepts, relations, and events, in a well-organized structured format. Second, we introduce curriculum-based learning for VLMs to progressively comprehend visual structures, from fundamental visual concepts to intricate event structures. Our intuition is that lower-level knowledge may contribute to complex visual structure understanding. Furthermore, we compile and release a collection of datasets tailored for visual structural knowledge extraction. We adopt a weakly-supervised approach to directly generate visual event structures from captions for ViStruct training, capitalizing on abundant image-caption pairs from the web. In experiments, we evaluate ViStruct on visual structure prediction tasks, demonstrating its effectiveness in improving the understanding of visual structures. The code is public at https://github.com/Yangyi-Chen/vi-struct.
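To make the idea of depicting visual structure with a programming language concrete, here is a minimal sketch. The class names (`Concept`, `Relation`, `Event`), their fields, the bounding-box format, and the `to_code` helper are illustrative assumptions, not the schema used in the paper or its repository.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Assumed location format: (x1, y1, x2, y2) in pixels.
BBox = Tuple[int, int, int, int]


@dataclass
class Concept:
    """A visual concept with a location and optional attributes."""
    name: str
    bbox: BBox
    attributes: List[str] = field(default_factory=list)


@dataclass
class Relation:
    """A pairwise relation between two concepts."""
    subject: Concept
    predicate: str
    obj: Concept


@dataclass
class Event:
    """An event with a verb and named participant roles."""
    verb: str
    roles: Dict[str, Concept]


def to_code(event: Event) -> str:
    """Render an event as a one-line, code-like string."""
    args = ", ".join(f"{role}={c.name}" for role, c in event.roles.items())
    return f"{event.verb}({args})"


person = Concept("person", (10, 20, 110, 220), ["standing"])
horse = Concept("horse", (90, 60, 300, 240), ["brown"])
near = Relation(person, "next_to", horse)
event = Event("feeding", {"agent": person, "recipient": horse})
print(to_code(event))  # feeding(agent=person, recipient=horse)
```

The point of the sketch is that one syntax covers all granularities: the same objects appear as concepts, as arguments of a relation, and as role fillers of an event.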

Paper Structure

This paper contains 29 sections, 4 figures, and 5 tables.

Figures (4)

  • Figure 1: Programming language enables a unified representation of visual structural information in code-vision representations, including concepts, attributes, locations, relations, and visual events.
  • Figure 2: The curriculum pyramid in ViStruct that incorporates five different levels of multimodal knowledge acquisition, progressing from basic to more advanced stages. All levels of visual structures can be uniformly represented using the programming language. The code is generated based on the image in Figure 1.
  • Figure 3: The effectiveness of the curriculum learning framework for visual relation detection. The metrics are introduced in the section on visual relation detection. The dotted lines denote the results of ViStruct-Mix.
  • Figure 4: The focusing optimization trick prioritizes the semantic content of generated code (highlighted portion) while disregarding the code's syntactic structure.
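The focusing idea in the Figure 4 caption can be sketched as a masked loss: only tokens marked as semantic content contribute, and syntax tokens are ignored. This is a minimal illustration under stated assumptions; the function name `masked_nll`, the toy tokens, and the choice of which tokens count as semantic are hypothetical, and the paper's exact masking rule is not given in this summary.

```python
import math
from typing import Sequence


def masked_nll(token_probs: Sequence[float], semantic_mask: Sequence[bool]) -> float:
    """Average negative log-likelihood over tokens flagged as semantic.

    token_probs[i] is the model probability of the i-th target token;
    semantic_mask[i] is True when that token carries semantic content.
    """
    if len(token_probs) != len(semantic_mask):
        raise ValueError("token_probs and semantic_mask must have equal length")
    losses = [-math.log(p) for p, keep in zip(token_probs, semantic_mask) if keep]
    return sum(losses) / len(losses) if losses else 0.0


# Toy target "riding ( horse )": only the two content words are scored.
tokens = ["riding", "(", "horse", ")"]
probs = [0.5, 0.9, 0.25, 0.9]
mask = [True, False, True, False]
loss = masked_nll(probs, mask)
print(round(loss, 4))
```

With this mask the brackets never affect the loss, so the optimization signal concentrates on the content words rather than on syntax that is the same in every example.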