Cascaded Self-Evaluation Augmented Training for Lightweight Multimodal LLMs

Zheqi Lv; Wenkai Wang; Jiawei Wang; Shengyu Zhang; Fei Wu

TL;DR

This work tackles the challenge of weak self-evaluation in efficient multimodal LLMs during Chain-of-Thought reasoning under resource constraints. It introduces SEAT, which uses a stronger EMLLM to evaluate CoT outputs and train a lighter model, and Cas-SEAT, which decouples reasoning and evaluation with cascaded short prompts to address long-input/output issues while preserving CoT. A Double-level Data Filtering (DDF) pipeline is proposed to curate data suitable for lightweight EMLLMs, reducing training cost. Empirical results across MMMU, MathVista, Math-V, and We-Math show Cas-SEAT delivering substantial self-evaluation gains and outperforming larger baselines, highlighting its practical impact for low-resource multimodal reasoning tasks and providing a Cas-SEAT-DDF dataset for future research.

Abstract

Efficient Multimodal Large Language Models (EMLLMs) can improve performance through Chain-of-Thought (CoT) reasoning, but they have poor self-evaluation capabilities during the CoT reasoning process. This stems from their tendency to simplify the reasoning process and from the degradation of self-evaluation ability during downstream task fine-tuning. To address this, we propose Self-Evaluation Augmented Training (SEAT), which uses more powerful EMLLMs to evaluate CoT reasoning data; the resulting evaluation data is then used to train EMLLMs. However, SEAT is not fully suited to EMLLMs, because they struggle to process long input and output token sequences and because the self-evaluation ability that underpins CoT reasoning degrades. Therefore, we further propose Cascaded Self-Evaluation Augmented Training (Cas-SEAT), which converts long prompts into cascaded short prompts, each focusing on a specific task. Additionally, we mix CoT reasoning data and self-evaluation data to preserve the CoT reasoning ability of EMLLMs while enhancing their self-evaluation capability. We also conduct Double-level Data Filtering (DDF), which comprises source data filtering and labeled data filtering and uses both manual selection and MLLMs for filtering. Cas-SEAT and DDF work together to improve the performance of EMLLMs. Experiments show that Cas-SEAT achieves an average improvement of 22.16% across multiple datasets, and DDF significantly reduces the resource consumption of training.
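The cascading idea in the abstract (splitting one long prompt into short, single-purpose prompts) can be illustrated with a minimal sketch. This is not the authors' implementation: the prompt templates, the `cascaded_answer` function, the `CORRECT` verdict convention, and the `toy_model` stand-in are all hypothetical, and the image input is omitted for brevity.

```python
from typing import Callable, Dict

# Hypothetical prompt templates; the paper's actual prompts are not given here.
REASON_PROMPT = "Question: {question}\nThink step by step, then give the final answer."
EVAL_PROMPT = (
    "Question: {question}\n"
    "Proposed reasoning: {reasoning}\n"
    "Check the reasoning. Reply 'CORRECT' or give a corrected answer."
)


def cascaded_answer(model: Callable[[str], str], question: str) -> Dict[str, str]:
    """Run two short prompts in sequence instead of one long combined prompt."""
    # Stage 1: CoT reasoning only.
    reasoning = model(REASON_PROMPT.format(question=question))
    # Stage 2: self-evaluation of the stage-1 output only.
    verdict = model(EVAL_PROMPT.format(question=question, reasoning=reasoning))
    # Assumed convention: keep the reasoning if approved, else use the correction.
    approved = verdict.strip().upper().startswith("CORRECT")
    final = reasoning if approved else verdict
    return {"reasoning": reasoning, "evaluation": verdict, "final": final}


def toy_model(prompt: str) -> str:
    """Stand-in for an EMLLM: reasons on stage 1, approves on stage 2."""
    if "Proposed reasoning:" in prompt:
        return "CORRECT"
    return "2 + 2 = 4, so the answer is 4."


result = cascaded_answer(toy_model, "What is 2 + 2?")
```

Each stage sees only the text it needs, which is the property the abstract attributes to cascaded short prompts; in practice `model` would wrap a real multimodal model call.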

Paper Structure (26 sections, 5 equations, 7 figures, 11 tables)

Introduction
Related Work
Methodology
Problem Formulation and Notations
Data and Model.
Prompt.
Formula.
Double-level Data Filtering
Vanilla SEAT
Cas-SEAT
Experiments
Experimental Setup
Datasets
Baselines
Implementation Details
...and 11 more sections

Figures (7)

Figure 1: (a) A sample from the dataset used for training and inference of multimodal large language models, containing images, questions, and answers. (b) Overview of Chain-of-Thought (CoT) reasoning, self-evaluation reasoning, and their corresponding enhancement methods. (c) The proposed computational method, Cas-SEAT. (d) The proposed dataset construction method, DDF, which provides the Cas-SEAT Dataset for Cas-SEAT. (e) Comparison of CoT reasoning ability, self-evaluation ability, and overall performance. Symbols "–", "↑", "↓", "↑↑", and "↓↓" indicate comparable, improved, degraded, significantly improved, and significantly degraded performance, respectively.
Figure 2: Overview of the method. It illustrates the prompts designed for Augmented Self-Evaluation and Augmented Cascading Self-Evaluation, along with the corresponding training data generated.
Figure 3: Comparison of evaluation in terms of model performance improvement.
Figure 4: A bar chart analysis sample in MathVista. Green background indicates the raw data, red text represents incorrect reasoning processes (sometimes with no reasoning process), pink background and yellow background denote results from direct reasoning and self-evaluation, respectively. Blue text and blue background indicate the corrected reasoning process and corrected results, respectively.
Figure 5: Comparison of Cas-SEAT and the Baseline based on Qwen2-VL(2B).
...and 2 more figures
