Will Pre-Training Ever End? A First Step Toward Next-Generation Foundation MLLMs via Self-Improving Systematic Cognition

Xiaoying Zhang, Da Peng, Yipeng Zhang, Zonghao Guo, Chengyue Wu, Jen-Tse Huang, Chi Chen, Wei Ke, Helen Meng, Maosong Sun

TL;DR

This work proposes SIcog, a self-learning framework for building next-generation foundation multimodal LLMs by tightly integrating multimodal pre-training with self-generated data. Its central innovations are Chain-of-Description for stepwise visual perception and structured Chain-of-Thought for multimodal reasoning, which together enable a self-improvement loop with minimal external annotations. SIcog follows a four-step pipeline: fine-tuning on minimal annotations, generating candidate data with the improved models, curating that data via self-consistency, and staged multimodal pre-training. With 213K self-generated pre-training samples it improves over previous pre-training approaches (+3.6% on MMStar, +3.5% on AI2D), and it yields stronger reasoning when combined with post-training techniques, while maintaining perception quality. The findings highlight the importance of synergizing pre-training with inference-time computation and post-training optimization, and point to scalable paths for continual cognitive self-improvement in foundation MLLMs.

Abstract

Recent progress in (multimodal) large language models ((M)LLMs) has shifted focus from pre-training to inference-time computation and post-training optimization, largely due to concerns over the availability of high-quality human data. However, these strategies alone are insufficient to drive substantial model improvements. We argue that effective model advancement requires strong synergy among pre-training, inference-time computation, and post-training optimization. In this paper, we introduce Self-Improving cognition (SIcog), a self-learning framework for constructing next-generation foundation MLLMs by imparting multimodal knowledge and enhancing systematic cognitive capabilities through multimodal pre-training with self-generated data. Specifically, we propose Chain-of-Description for step-by-step visual understanding and integrate structured Chain-of-Thought (CoT) reasoning to support in-depth multimodal reasoning. SIcog first equips a base model with systematic perception and reasoning using minimal external supervision. The enhanced models then generate candidate image captions and CoT reasoning responses for unlabeled images and image-question pairs across diverse tasks, which are filtered through a semantic-similarity-guided self-consistency mechanism. These high-quality, self-generated samples enable large-scale multimodal pre-training, creating a self-improvement loop. Experiments demonstrate SIcog's effectiveness in developing MLLMs with enhanced multimodal cognition. Using only 213K self-generated pre-training samples, SIcog achieves significant improvements, including +3.6% on MMStar and +3.5% on AI2D, outperforming previous pre-training approaches. When combined with post-training techniques for CoT reasoning, SIcog yields +9% gains on MMVet and +8.5% on ScienceQA.
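
The curation step described above (keeping the self-generated candidate that agrees most with its siblings) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the similarity measure, so a bag-of-words cosine similarity stands in for it here, and the names `bag_of_words`, `cosine`, `select_most_consistent`, and the `min_score` threshold are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts; a crude stand-in for a real sentence embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_most_consistent(candidates, min_score=0.0):
    """Pick the candidate with the highest mean similarity to the others.

    Returns (candidate, score), or None when there are fewer than two
    candidates or the best score falls below the assumed `min_score`.
    """
    if len(candidates) < 2:
        return None
    vectors = [bag_of_words(c) for c in candidates]
    scores = []
    for i, v in enumerate(vectors):
        others = [cosine(v, w) for j, w in enumerate(vectors) if j != i]
        scores.append(sum(others) / len(others))
    best = max(range(len(candidates)), key=scores.__getitem__)
    if scores[best] < min_score:
        return None
    return candidates[best], scores[best]


# Hypothetical sampled captions for one unlabeled image.
captions = [
    "a dog runs on the grass",
    "a dog is running on grass",
    "a red car parked on a street",
]
print(select_most_consistent(captions))
```

In this toy example the two dog captions support each other, so one of them is selected and the outlier is discarded. A real pipeline would presumably swap the bag-of-words vectors for learned sentence embeddings.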

Paper Structure

This paper contains 52 sections, 4 equations, 12 figures, 19 tables, 1 algorithm.

Figures (12)

  • Figure 1: (a) SIcog enhances an MLLM's systematic cognition during multimodal pre-training using self-generated data, enabling next-generation foundation MLLMs. (b) With up to 213K self-generated pre-training samples, SIcog produces foundation MLLMs with superior cognitive capabilities, showing benchmark-leading performance compared to prevalent pre-training approaches.
  • Figure 2: The SIcog framework comprises four steps: $(\textit{i})$ Developing multimodal cognitive capabilities by fine-tuning an MLLM with minimal annotated image-captioning data (with Chain-of-Description) and visual instruction-tuning data (with structured Chain-of-Thought), enhancing systematic perception and reasoning (upper left). $(\textit{ii})$ Generating candidate captions and responses for pre-training by sampling from the improved models (upper right). $(\textit{iii})$ Curating self-generated pre-training data through self-consistency-guided quality evaluation, selecting the most semantically consistent candidates for learning (lower right). $(\textit{iv})$ Constructing a next-generation foundation MLLM by performing multimodal pre-training on the curated data (lower left). For brevity, language ability preservation is omitted; see Figure \ref{fig:sicog_complete} for the complete version.
  • Figure 3: The SIcog framework consists of four steps: $(\textit{i})$ Enhancing multimodal cognition: Fine-tune an MLLM using minimal annotated data (image-captioning data in the Chain-of-Description format and visual instruction-tuning data with structured CoT) to improve systematic perception and reasoning (upper left). $(\textit{ii})$ Generating candidate data: Use the improved models to sample candidate captions and responses for pre-training (upper right). $(\textit{iii})$ Curating pre-training data: Apply self-consistency-guided quality evaluation to select the most semantically consistent, self-generated candidates for learning (lower right). $(\textit{iv})$ Constructing the next-generation MLLM: Perform multimodal pre-training on the curated data to build a foundation MLLM with enhanced self-improving cognition (lower left).
  • Figure 4: Illustration of Chain-of-Description (left) for enhancing systematic perception and structured Chain-of-Thought (right) for strengthening reasoning capabilities.
  • Figure 5: Illustration of (a) the original image and (b) the four types of image corruption.
  • ...and 7 more figures