Cascaded Self-Evaluation Augmented Training for Lightweight Multimodal LLMs
Zheqi Lv, Wenkai Wang, Jiawei Wang, Shengyu Zhang, Fei Wu
TL;DR
This work tackles weak self-evaluation in efficient multimodal LLMs (EMLLMs) during Chain-of-Thought (CoT) reasoning under resource constraints. It introduces SEAT, which uses a stronger EMLLM to evaluate CoT outputs and trains a lighter model on the resulting evaluation data, and Cas-SEAT, which decouples reasoning and evaluation into cascaded short prompts to address long input/output sequences while preserving CoT ability. A Double-level Data Filtering (DDF) pipeline curates data suitable for lightweight EMLLMs and reduces training cost. Empirical results on MMMU, MathVista, Math-V, and We-Math show Cas-SEAT delivering substantial self-evaluation gains and outperforming larger baselines. The work also provides a Cas-SEAT-DDF dataset for future research on low-resource multimodal reasoning.
Abstract
Efficient Multimodal Large Language Models (EMLLMs) can improve performance through Chain-of-Thought (CoT) reasoning, but they have poor self-evaluation capabilities during the CoT reasoning process. This is due to their tendency to simplify the reasoning process and to the degradation of self-evaluation ability during downstream task fine-tuning. To address this, we first propose Self-Evaluation Augmented Training (SEAT), which uses more powerful EMLLMs to evaluate CoT reasoning data; the resulting evaluation data is then used to train EMLLMs. However, SEAT is not fully suited to EMLLMs, because they struggle to process long input and output token sequences and because the self-evaluation ability that CoT reasoning relies on degrades. Therefore, we further propose Cascaded Self-Evaluation Augmented Training (Cas-SEAT), which converts long prompts into cascaded short prompts, each focusing on a specific task. Additionally, we mix CoT reasoning data with self-evaluation data so that EMLLMs retain their CoT reasoning ability while gaining self-evaluation capability. We also conduct Double-level Data Filtering (DDF), which comprises source data filtering and labeled data filtering and combines manual selection with MLLM-based filtering. Cas-SEAT and DDF work together to improve the performance of EMLLMs. Experiments show that Cas-SEAT achieves an average improvement of 22.16% across multiple datasets, and DDF significantly reduces the resource consumption of training.
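The cascading idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The prompt wording, the two-stage split, the `Model` callable (text prompt in, text out, with image inputs omitted for brevity), and both function names are assumptions introduced here for illustration only.

```python
import random
from typing import Callable, Dict, List

# Hypothetical model interface: takes a text prompt, returns generated text.
# A real EMLLM would also receive the image; that is omitted in this sketch.
Model = Callable[[str], str]


def cascaded_inference(model: Model, question: str) -> Dict[str, str]:
    """Run two short prompts in sequence instead of one long prompt."""
    # Stage 1: a short prompt that only asks for CoT reasoning.
    reasoning_prompt = (
        f"Question: {question}\n"
        "Think step by step, then give the answer."
    )
    reasoning = model(reasoning_prompt)

    # Stage 2: a separate short prompt that only asks for self-evaluation
    # of the stage-1 output (prompt wording is an assumption).
    evaluation_prompt = (
        f"Question: {question}\n"
        f"Proposed reasoning: {reasoning}\n"
        "Is this reasoning correct? Answer 'correct' or 'incorrect' "
        "and give a brief reason."
    )
    evaluation = model(evaluation_prompt)
    return {"reasoning": reasoning, "evaluation": evaluation}


def mix_training_data(
    cot_samples: List[dict], eval_samples: List[dict], seed: int = 0
) -> List[dict]:
    """Combine CoT and self-evaluation samples into one shuffled training set."""
    mixed = list(cot_samples) + list(eval_samples)
    random.Random(seed).shuffle(mixed)  # seeded for reproducibility
    return mixed
```

Any callable that maps a prompt string to a response string can stand in for `model`, so the cascade can be exercised with a stub before wiring in a real EMLLM. The mixing ratio between the two sample types is left to the caller, since the abstract does not state one.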
