STELAR-VISION: Self-Topology-Aware Efficient Learning for Aligned Reasoning in Vision

Chen Li, Han Zhang, Zhantao Yang, Fangyi Chen, Zihan Wang, Anudeepsekhar Bolimera, Marios Savvides

TL;DR

STELAR-Vision introduces topology-aware reasoning for vision-language models: TopoAug generates diverse reasoning topologies (Chain, Tree, Graph), and the models are then post-trained with supervised fine-tuning and reinforcement learning. The framework improves accuracy on both in-distribution and out-of-distribution tasks, and Frugal Learning shortens outputs with minimal accuracy loss. Key contributions include automatic topology annotation, a two-phase post-training pipeline, and empirical evidence that adaptive topology selection improves generalization across diverse multimodal reasoning benchmarks. The approach is practically useful for efficient, flexible multimodal inference and leaves end-to-end topology induction as a direction for future work.

Abstract

Vision-language models (VLMs) have made significant strides in reasoning, yet they often struggle with complex multimodal tasks and tend to generate overly verbose outputs. A key limitation is their reliance on chain-of-thought (CoT) reasoning, despite many tasks benefiting from alternative topologies like trees or graphs. To address this, we introduce STELAR-Vision, a training framework for topology-aware reasoning. At its core is TopoAug, a synthetic data pipeline that enriches training with diverse topological structures. Using supervised fine-tuning and reinforcement learning, we post-train Qwen2VL models with both accuracy and efficiency in mind. Additionally, we propose Frugal Learning, which reduces output length with minimal accuracy loss. On MATH-V and VLM-S2H, STELAR-Vision improves accuracy by 9.7% over its base model and surpasses the larger Qwen2VL-72B-Instruct by 7.3%. On five out-of-distribution benchmarks, it outperforms Phi-4-Multimodal-Instruct by up to 28.4% and LLaMA-3.2-11B-Vision-Instruct by up to 13.2%, demonstrating strong generalization. Compared to Chain-Only training, our approach achieves 4.3% higher overall accuracy on in-distribution datasets and consistently outperforms across all OOD benchmarks.

Paper Structure

This paper contains 26 sections, 3 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Limitations of the Popular Chain-of-Thought Reasoning Structures. The widely adopted Chain-of-Thought (CoT) reasoning paradigm (in green) often results in unnecessarily verbose reasoning processes, as demonstrated in the first example. Under CoT reasoning, the model redundantly counts each cube, whereas with $Graph$ topology (in blue), it quickly identifies the key point of the question. In the bottom-row example, CoT reasoning begins with a detailed examination of each subplot but ultimately arrives at an incorrect answer. In contrast, $Tree$ topology (in red) initiates reasoning with a high-level overview before delving into specific features. In both scenarios, CoT-style reasoning proves suboptimal.
  • Figure 2: An overview of the STELAR-Vision framework
  • Figure 3: Comparison of topology accuracy across subjects: Accuracy of the $Chain$, $Tree$, and $Graph$ reasoning topologies per subject of the MATH-V dataset. $Chain$ remains the best overall reasoning structure, while $Tree$ and $Graph$ perform better on subjects such as "graph theory" or "statistics".
  • Figure 4: Distribution of generated reasoning token lengths for the $Chain$, $Tree$, and $Graph$ topological structures in the TopoAug dataset. The box within each violin plot marks the median and the 25th and 75th percentiles.
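
The box statistics named in the Figure 4 caption (median, 25th and 75th percentiles) can be computed per topology as in the minimal sketch below. The function name `box_summary` and the token-length values are hypothetical placeholders for illustration, not data or code from the paper, and the inclusive quantile method is an assumption, since the caption does not say which convention the plot uses.

```python
from statistics import quantiles


def box_summary(lengths):
    """Return (25th percentile, median, 75th percentile) of token lengths."""
    # n=4 splits the data into quartiles; "inclusive" treats the sample
    # minimum and maximum as the 0th and 100th percentiles (an assumption).
    q1, median, q3 = quantiles(lengths, n=4, method="inclusive")
    return q1, median, q3


# Hypothetical token lengths per topology, NOT taken from the TopoAug dataset.
token_lengths = {
    "Chain": [100, 200, 300, 400, 500],
    "Tree": [50, 150, 250, 350],
    "Graph": [80, 120, 160, 400, 900],
}

for topology, lengths in token_lengths.items():
    q1, median, q3 = box_summary(lengths)
    print(f"{topology}: Q1={q1}, median={median}, Q3={q3}")
```

A plotting library would normally draw the violin and its inner box directly; the sketch only makes explicit which three numbers that inner box encodes.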