Generative AI Act II: Test Time Scaling Drives Cognition Engineering

Shijie Xia, Yiwei Qin, Xuefeng Li, Yan Ma, Run-Ze Fan, Steffi Chern, Haoyang Zou, Fan Zhou, Xiangkun Hu, Jiahe Jin, Yanheng He, Yixin Ye, Yixiu Liu, Pengfei Liu

TL;DR

This paper defines cognition engineering as the deliberate construction of AI thinking via test-time scaling, marking a shift from Act I knowledge retrieval to Act II thought construction. It identifies three pillars (Knowledge Foundation, Test-time Scaling Foundation, and Self-Training Foundation) and frames four test-time scaling methods (parallel sampling, tree search, multi-turn correction, and long CoT) as practical avenues to deepen AI cognition. It also outlines training strategies (RL scaling, supervised fine-tuning, and iterative self-reinforced learning), surveys progress across math, code, and multimodal domains, and discusses safety, RAG, evaluation, and infrastructure. The authors argue for innovations in data and reward design (cognition data engineering, environment design) and envision human-AI cognitive partnerships, with implications for accelerated scientific discovery and more robust AI systems. They conclude with future directions, including new architectures and latent thought pretraining, and provide a practical tutorial and ensemble strategies to democratize cognition engineering.

Abstract

The first generation of Large Language Models - what might be called "Act I" of generative AI (2020-2023) - achieved remarkable success through massive parameter and data scaling, yet exhibited fundamental limitations such as knowledge latency, shallow reasoning, and constrained cognitive processes. During this era, prompt engineering emerged as our primary interface with AI, enabling dialogue-level communication through natural language. We now witness the emergence of "Act II" (2024-present), where models are transitioning from knowledge-retrieval systems (in latent space) to thought-construction engines through test-time scaling techniques. This new paradigm establishes a mind-level connection with AI through language-based thoughts. In this paper, we clarify the conceptual foundations of cognition engineering and explain why this moment is critical for its development. We systematically break down these advanced approaches through comprehensive tutorials and optimized implementations, democratizing access to cognition engineering and enabling every practitioner to participate in AI's second act. We provide a regularly updated collection of papers on test-time scaling in the GitHub Repository: https://github.com/GAIR-NLP/cognition-engineering

Paper Structure

This paper contains 158 sections, 7 equations, 20 figures, 11 tables.

Figures (20)

  • Figure 1: The three scaling phases illustrated as a progression of knowledge representation. Pre-training scaling (blue) forms isolated knowledge islands with fundamental physics concepts connected by limited innate associations. Post-training scaling (green) densifies these islands with more sophisticated learned connections between related concepts. Test-time scaling (red) enables dynamic reasoning pathway formation between previously disconnected concepts through extended computation, facilitating multi-hop inference across the entire knowledge space. Test-time scaling builds bridges between knowledge islands, connecting distant nodes that remain isolated during pre-training and conventional post-training.
  • Figure 2: Workflow for applying test-time scaling in a specific domain. For more details, please refer to the main paper.
  • Figure 3: The DIKW pyramid and its relationship to the cognition engineering paradigm.
  • Figure 4: Illustration of parallel sampling selection methods: Best-of-N (F1), Majority voting (F2), and Combined strategy (F3).
  • Figure 5: The relationship between scaling dimensions and performance for each test-time scaling method.
  • ...and 15 more figures
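
The parallel sampling selection methods named in the Figure 4 caption can be illustrated with a minimal sketch. It assumes each sampled response has already been reduced to a final answer string and paired with a scalar score from some verifier or reward model; the function names, the sample data, and the scores are hypothetical. The caption does not define the "Combined strategy", so the score-weighted vote below is only one plausible reading of it, not the paper's definition.

```python
from collections import Counter, defaultdict

# Each candidate is (final_answer, score). The scores stand in for the output
# of a verifier or reward model; they are made-up values for illustration.
Candidate = tuple[str, float]


def best_of_n(candidates: list[Candidate]) -> str:
    """Best-of-N: return the answer of the single highest-scoring sample."""
    return max(candidates, key=lambda c: c[1])[0]


def majority_vote(candidates: list[Candidate]) -> str:
    """Majority voting: return the most frequent answer, ignoring scores."""
    counts = Counter(answer for answer, _ in candidates)
    return counts.most_common(1)[0][0]


def weighted_vote(candidates: list[Candidate]) -> str:
    """Assumed combined strategy: sum scores per answer, pick the largest total."""
    totals: dict[str, float] = defaultdict(float)
    for answer, score in candidates:
        totals[answer] += score
    return max(totals, key=lambda a: totals[a])


# Five hypothetical samples for one question.
samples = [("42", 0.40), ("42", 0.35), ("17", 0.90), ("42", 0.20), ("8", 0.10)]

print(best_of_n(samples))      # picks the one sample with the top score
print(majority_vote(samples))  # picks the answer that appears most often
print(weighted_vote(samples))  # picks the answer with the largest summed score
```

In this toy data the single highest-scoring sample disagrees with the most common answer, which is the situation where the choice of selection method matters. Summing scores per answer lets several moderately scored, mutually consistent samples outweigh one confident outlier.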