On Data Synthesis and Post-training for Visual Abstract Reasoning

Ke Zhu, Yu Wang, Jiangjiang Liu, Qunyi Xie, Shanshan Liu, Gang Zhang

TL;DR

This work tackles abstract visual reasoning (AVR) in large vision-language models by introducing a data-centric approach that relieves task difficulty through two sources of synthesized data and a two-stage post-training strategy. Building on LLaVA-NeXT-7B, the authors generate regular puzzles via Attributed Stochastic Image Grammar and non-regular puzzles from CCSE data, enriched with visual-elicitation prompts and template-style chain-of-thought. The training regime combines perception-focused pretraining with process-level supervision and conditional multi-task learning to coax the model into step-by-step inference without sacrificing general multimodal abilities. Results on AVR benchmarks (RAVEN and MARVEL) set a new state-of-the-art, especially for perception and reasoning in structured and semi-structured visual reasoning tasks; limitations remain on more diverse, irregular puzzles, highlighting the need for scalable, high-quality annotations and larger attribute sets to close remaining gaps.

Abstract

This paper is a pioneering work attempting to address abstract visual reasoning (AVR) problems for large vision-language models (VLMs). We make a common LLaVA-NeXT 7B model capable of perceiving and reasoning about specific AVR problems, surpassing both open-source (e.g., Qwen-2-VL-72B) and powerful closed-source VLMs (e.g., GPT-4o) by a significant margin. This is a great breakthrough, since almost all previous VLMs fail or show nearly random performance on representative AVR benchmarks. The key to our success is an innovative data synthesis and post-training process that aims to fully relieve the task difficulty and elicit the model to learn step by step. Our 7B model is also shown to behave well on AVR without sacrificing common multimodal comprehension abilities. We hope our paper can serve as an early effort in this area and inspire further research in abstract visual reasoning.

Paper Structure

This paper contains 14 sections, 2 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Evaluation results on the AVR benchmarks RAVEN and MARVEL. LLaVA-AVR is trained on our naively collected data with the original labels. LLaVA-AVR(E) means we elicit the model to learn using our strategy shown in the pipeline figure.
  • Figure 2: The chain-of-thought (CoT) produced by three advanced models: Step-1V, MoonShot-V1, and GPT-4o. The image quiz shown on the left is randomly sampled from the MARVEL test set. The correct choice for this puzzle is 4.
  • Figure 3: Our data generation pipeline for the regular puzzle. We first choose seven different seed patterns from the initial tree, then apply the sampled rule to generate the whole image (structural pattern). We then generate the template-based chain-of-thought and perception question-answer pairs based on the information stored in the previous process. The whole process does not involve any LLM or human effort.
  • Figure 4: Our data generation pipeline for the non-regular puzzle crawled from the CCSE website. We crawled about 8k samples in total, with 4k remaining after the data filtering process. We then generate a coarse caption and reformat the original answer into a template CoT, both of which go through an LLM to obtain specific questions for each image. Finally, human annotators manually annotate these questions.
  • Figure 5: The training pipeline of our model LLaVA-AVR-7B, including a pretraining stage with short perception VQA and a multi-task supervised finetuning stage with both perception VQA and long CoT. The stage-1 models are all initialized with LLaVA-NeXT-7B.
  • ...and 2 more figures
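
The regular-puzzle pipeline described in the Figure 3 caption (sample a seed, apply a rule, then fill a template CoT and perception question-answer pairs from the stored metadata, with no LLM involved) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the attribute domains, the two rules, the template wording, and the function names `make_puzzle`, `template_cot`, and `perception_qa` are hypothetical and are not taken from the paper, and the sketch omits image rendering and the seven seed patterns entirely.

```python
import random

# Hypothetical attribute domains; the paper's actual attribute sets are not given here.
ATTRIBUTES = {
    "number": [1, 2, 3, 4, 5, 6],
    "size": [1, 2, 3, 4, 5, 6],
}
# Hypothetical rules: each maps a value to the value in the next panel of a row.
RULES = {
    "constant": lambda v: v,
    "progression(+1)": lambda v: v + 1,
}


def make_row(attr, start, rule_name):
    """Apply a rule twice to a seed value to build one three-panel row."""
    step = RULES[rule_name]
    row = [start, step(start), step(step(start))]
    assert all(v in ATTRIBUTES[attr] for v in row), "rule left the attribute domain"
    return row


def make_puzzle(rng):
    """Sample an attribute, a rule, and three seed values; the last panel is the answer."""
    attr = rng.choice(sorted(ATTRIBUTES))
    rule_name = rng.choice(sorted(RULES))
    # Seeds are capped so that progression(+1) stays inside the domain.
    seeds = [rng.choice(ATTRIBUTES[attr][:4]) for _ in range(3)]
    rows = [make_row(attr, s, rule_name) for s in seeds]
    return {"attr": attr, "rule": rule_name, "rows": rows, "answer": rows[2][2]}


def template_cot(puzzle):
    """Fill a fixed template from the stored generation metadata (no LLM)."""
    attr, rule, rows = puzzle["attr"], puzzle["rule"], puzzle["rows"]
    lines = [
        f"Row {i + 1}: the {attr} values are {r[0]}, {r[1]}, {r[2]}."
        for i, r in enumerate(rows[:2])
    ]
    lines.append(f"Both rows follow the rule {rule} on {attr}.")
    lines.append(
        f"Row 3 starts with {rows[2][0]}, {rows[2][1]}, "
        f"so the missing panel has {attr} = {puzzle['answer']}."
    )
    return "\n".join(lines)


def perception_qa(puzzle, row, col):
    """Build one short perception question-answer pair for a visible panel."""
    question = (
        f"What is the {puzzle['attr']} of the panel at row {row + 1}, column {col + 1}?"
    )
    return {"question": question, "answer": str(puzzle["rows"][row][col])}
```

Calling `make_puzzle(random.Random(0))` and passing the result to `template_cot` yields one puzzle record and its step-by-step explanation. Because both come from the same stored metadata, the CoT and the perception answers are consistent with the generated panels by construction, which is the property the caption relies on when it says no LLM or human effort is needed.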