Balanced Thinking: Improving Chain of Thought Training in Vision Language Models

Shaked Perek, Ben Wiesel, Avihu Dekel, Nimrod Shabtay, Eli Schwartz

Abstract

Multimodal reasoning in vision-language models (VLMs) typically relies on a two-stage process: supervised fine-tuning (SFT) and reinforcement learning (RL). In standard SFT, all tokens contribute equally to the loss, even though reasoning data are inherently token-imbalanced. Long <think> traces overshadow short but task-critical <answer> segments, leading to verbose reasoning and inaccurate answers. We propose SCALe (Scheduled Curriculum Adaptive Loss), which explicitly separates supervision over reasoning and answer segments using dynamic, length-independent weighting. Unlike vanilla SFT, which overweights the <think> segment, SCALe-SFT gradually shifts the focus from <think> to <answer> throughout training via a cosine scheduling policy, encouraging concise and well-grounded reasoning. We evaluate SCALe across diverse benchmarks and architectures. Results show that SCALe consistently improves accuracy over vanilla SFT and matches the performance of the full two-phase SFT + GRPO pipeline while requiring only about one-seventh of the training time, making it a lightweight yet effective alternative. When combined with GRPO, SCALe achieves the best overall performance, highlighting its value both as a standalone method and as a strong foundation for reinforcement refinement.
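
The abstract does not state the exact loss, so the following is only a minimal sketch of the idea it describes, under stated assumptions: each segment's loss is averaged over its own tokens (making the weighting length-independent), and the answer weight follows a cosine ramp over training. The function names, the `w_start` and `w_end` values, and the convex combination of the two segment means are all hypothetical and are not taken from the paper.

```python
import math

def cosine_answer_weight(step, total_steps, w_start=0.5, w_end=0.9):
    """Hypothetical cosine ramp of the <answer> weight from w_start to w_end."""
    progress = min(max(step / max(total_steps, 1), 0.0), 1.0)
    # cos goes 1 -> -1 as progress goes 0 -> 1, so the weight moves w_start -> w_end.
    return w_end + 0.5 * (w_start - w_end) * (1.0 + math.cos(math.pi * progress))

def vanilla_sft_loss(token_losses):
    """Uniform token average: long <think> spans dominate the result."""
    return sum(token_losses) / len(token_losses)

def segment_balanced_loss(token_losses, is_answer, step, total_steps):
    """Average each segment separately, then mix with a scheduled weight.

    token_losses: per-token cross-entropy values for one sample.
    is_answer:    booleans, True for <answer> tokens, False for <think> tokens.
    """
    think = [l for l, a in zip(token_losses, is_answer) if not a]
    answer = [l for l, a in zip(token_losses, is_answer) if a]
    # Per-segment means remove the dependence on segment length.
    mean_think = sum(think) / len(think) if think else 0.0
    mean_answer = sum(answer) / len(answer) if answer else 0.0
    w_answer = cosine_answer_weight(step, total_steps)
    return (1.0 - w_answer) * mean_think + w_answer * mean_answer

# Toy sample: 98 reasoning tokens with loss 1.0, 2 answer tokens with loss 3.0.
losses = [1.0] * 98 + [3.0] * 2
mask = [False] * 98 + [True] * 2
baseline = vanilla_sft_loss(losses)
early = segment_balanced_loss(losses, mask, step=0, total_steps=1000)
late = segment_balanced_loss(losses, mask, step=1000, total_steps=1000)
```

On this toy sample the uniform average is 1.04, so the two answer tokens barely register. The segment-balanced value is 2.0 at the first step and about 2.8 at the last, which illustrates how a schedule of this kind moves the training signal toward the answer segment.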

Paper Structure (23 sections, 5 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Scheduled Curriculum Adaptive Loss progression over training time vs vanilla SFT. In standard SFT, the <think> portion dominates the output compared to short answer segments (upper panel). SFT applies uniform token weights, making the loss skewed towards the extended reasoning segment (lower left panel). SCALe addresses this token imbalance and introduces a time-dependent weighting schedule that progressively amplifies the influence of answer tokens. As training advances, this shifts the loss toward the final answer, guiding the model to prioritize correct outcomes rather than lengthy reasoning (lower right panel).
  • Figure 2: Reasoning output from Qwen2.5 on the test set of the Hugging Face dataset lmms-lab/ICON-QA. The blue (left) box shows vanilla-SFT reasoning; the orange (right) box shows SCALe-SFT reasoning. Vanilla SFT miscounts the number of cars.
  • Figure 3: Reasoning output from Qwen2.5 on the test set of the Hugging Face dataset lmms-lab/ScienceQA-IMG. The blue (bottom right) box shows vanilla-SFT reasoning; the orange (bottom left) box shows SCALe-SFT reasoning. Vanilla SFT correctly counts two gas giants among eight planets yet concludes that 50% are gas, revealing a logical inconsistency between reasoning and answer, whereas SCALe-SFT maintains logical consistency and reaches the correct conclusion.
  • Figure 4: Reasoning output from Gemma3 on the test set of the Hugging Face dataset lmms-lab/ScienceQA-IMG. The blue (bottom right) box shows vanilla-SFT reasoning; the orange (bottom left) box shows SCALe-SFT reasoning. Vanilla SFT enters an endless loop and never reaches a final answer (shortened with ...), while SCALe-SFT produces concise reasoning steps and an accurate final answer.
  • Figure 5: Reasoning output from Llava-next on the test set of the Hugging Face dataset lmms-lab/ScienceQA-IMG. The blue (bottom right) box shows vanilla-SFT reasoning; the orange (bottom left) box shows SCALe-SFT reasoning. In vanilla SFT, the token imbalance can lead to well-structured reasoning traces that fail to align with the final-answer supervision: the model “thinks correctly” but concludes incorrectly.