TACO: Think-Answer Consistency for Optimized Long-Chain Reasoning and Efficient Data Learning via Reinforcement Learning in LVLMs

Zhehan Kan, Yanlin Liu, Kun Yin, Xinghua Jiang, Xin Li, Haoyu Cao, Yinsong Liu, Deqiang Jiang, Xing Sun, Qingmin Liao, Wenming Yang

TL;DR

This work targets persistent challenges in visual reasoning with LVLMs, notably think–answer inconsistency, instability during long-chain reasoning, data inefficiency, and training–testing resolution gaps. It introduces TACO, a GRPO-based framework that couples reasoning and answering through Think-Answer Consistency (TAC), stabilizes long-chain exploration with the Rollback Resample Strategy (RRS), boosts data efficiency via Adaptive Difficulty Sampling (ADS), and mitigates resolution-induced performance gaps with Test-Time Resolution Scaling (TTRS) and Test-Time Multi-Scale Ensemble (TTME). The approach yields substantial gains on both in-domain and out-of-domain REC and VQA benchmarks, outperforming RL from human feedback baselines and prior LVLM-R1-type methods, with TTME further enhancing OOD generalization. Overall, TACO provides a scalable, stable, and versatile pathway to improve grounded reasoning in LVLMs for complex multimodal tasks.

Abstract

DeepSeek R1 has significantly advanced complex reasoning for large language models (LLMs). While recent methods have attempted to replicate R1's reasoning capabilities in multimodal settings, they face limitations, including inconsistencies between reasoning and final answers, model instability and crashes during long-chain exploration, and low data learning efficiency. To address these challenges, we propose TACO, a novel reinforcement learning algorithm for visual reasoning. Building on Group Relative Policy Optimization (GRPO), TACO introduces Think-Answer Consistency, which tightly couples reasoning with answer consistency to ensure answers are grounded in thoughtful reasoning. We also introduce the Rollback Resample Strategy, which adaptively removes problematic samples and reintroduces them to the sampler, enabling stable long-chain exploration and future learning opportunities. Additionally, TACO employs an adaptive learning schedule that focuses on moderate-difficulty samples to optimize data efficiency. Furthermore, we propose the Test-Time-Resolution-Scaling scheme to address performance degradation due to varying resolutions during reasoning while balancing computational overhead. Extensive experiments on in-distribution and out-of-distribution benchmarks for REC and VQA tasks show that fine-tuning LVLMs with TACO leads to significant performance improvements.

Paper Structure

This paper contains 20 sections, 8 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: An example on a REC task. (Center) The enhanced GRPO-based learning loop with TAC, RRS, and ADS. (Top right) A qualitative comparison between GRPO and TACO demonstrates TACO's superior inference sampling. (Bottom right) The TTRS module addresses the resolution gap between training and testing images. (Left) A visual reasoning task exemplifies TACO's accurate REC output and a reasoning process that closely mirrors the ground truth.
  • Figure 1: Example of model performance variation across different input scales on LISA.
  • Figure 2: During training, the TAC reward ensures consistent Think-Answer output. Samples are initially given equal sampling rates, and temporary "dirty samples" are identified by the KL divergence between the current policy $\pi_\theta$ and the reference policy $\pi_{\text{ref}}$. Their gradients are masked and their sampling rates are reduced, which stabilizes long-chain exploration while still allowing future resampling. Samples are classified as easy, moderate, or hard based on accuracy rewards: easy samples are rarely resampled, the rate of hard samples is slightly reduced, and the rate of moderate samples is increased for focused learning. Multiple scale resolutions are sampled during reasoning, and the answer with the least intersection is selected. Since the number of samples is small and the test image is compressed, reasoning time remains nearly unchanged.
  • Figure 3: Effectiveness of the Think-Answer Consistency (TAC) reward, comparing TACO, VLM-R1, and VLM-R1 + TAC. The subplots illustrate TAC's influence on: (a) response length evolution; (b) training accuracy (IoU reward) and the critical reasoning-answer alignment; (c) policy stability, tracked via KL divergence; and (d) performance on the LISA test set.
  • Figure 3: Performance (accuracy) comparison on OOD Benchmark.
  • ...and 1 more figure
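
The sampling-rate adjustments described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `update_sampling_weights`, the KL threshold, the accuracy cut-offs, and all scaling factors are hypothetical values chosen only to show the qualitative behavior (dirty samples masked and down-weighted, easy samples rarely resampled, hard samples slightly reduced, moderate samples increased).

```python
import random


def update_sampling_weights(
    weights,
    mean_acc,
    kl,
    *,
    kl_threshold=0.5,    # hypothetical: KL above this marks a "dirty" sample
    easy_acc=0.9,        # hypothetical: mean accuracy reward at or above this is "easy"
    hard_acc=0.1,        # hypothetical: mean accuracy reward at or below this is "hard"
    dirty_scale=0.1,     # hypothetical scaling factors, not taken from the paper
    easy_scale=0.2,
    hard_scale=0.8,
    moderate_scale=1.5,
):
    """Return (new_weights, grad_mask) for one training step.

    weights  : current per-sample sampling weights
    mean_acc : mean accuracy reward of each sample's rollouts
    kl       : KL divergence between current and reference policy per sample
    """
    new_weights, grad_mask = [], []
    for w, acc, k in zip(weights, mean_acc, kl):
        if k > kl_threshold:
            # Dirty sample: mask its gradient and lower its rate, but keep the
            # rate above zero so it can be resampled later.
            new_weights.append(w * dirty_scale)
            grad_mask.append(0.0)
            continue
        grad_mask.append(1.0)
        if acc >= easy_acc:
            new_weights.append(w * easy_scale)      # easy: rarely resampled
        elif acc <= hard_acc:
            new_weights.append(w * hard_scale)      # hard: slightly reduced
        else:
            new_weights.append(w * moderate_scale)  # moderate: increased
    return new_weights, grad_mask


# Usage: four samples that start with equal sampling rates.
weights = [1.0, 1.0, 1.0, 1.0]
mean_acc = [0.95, 0.5, 0.05, 0.5]   # easy, moderate, hard, moderate
kl = [0.01, 0.02, 0.03, 2.0]        # the last sample is flagged as dirty
new_weights, grad_mask = update_sampling_weights(weights, mean_acc, kl)

# Draw the next batch of sample indices in proportion to the updated weights.
next_batch = random.choices(range(len(new_weights)), weights=new_weights, k=8)
```

Multiplicative updates keep every weight strictly positive, so a sample that is down-weighted now can still be drawn again later, which matches the caption's note about allowing future resampling.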