The Necessity of Imperfection: Reversing Model Collapse via Simulating Cognitive Boundedness

Zhongjie Jiang

TL;DR

The paper tackles AI data-collapse by arguing that purely statistical data generation erodes the cognitive texture inherent in human language. It introduces the Prompt-driven Cognitive Computing Framework (PMCSF), a dual-engine design: a Cognitive State Decoder (CSD) reverse-engineers text into a 17-dimensional cognitive state vector, and a Cognitive Text Encoder (CTE) re-materializes that vector as text, simulating bounded rationality through Cognitive Perturbation Operators. Through a two-stage validation (cognitive codec verification, then functional gain testing in the A-share market), the authors report that simulating cognitive imperfections yields text that closely resembles human language (Jensen-Shannon divergence of 0.0614 from human text, versus 0.4431 for standard LLM output) and confers measurable gains in risk management and trading performance (maximum drawdown reduced by 47.4% during the 2015 crash). The work offers a pathway to more robust synthetic data by embedding cognitive dynamics, with potential implications for AI safety, interpretability, and cross-domain applicability beyond finance. It also provides open-source artefacts to foster reproducibility and further exploration of cognitive texture as a fundamental synthetic-data primitive.

Abstract

Although synthetic data is widely promoted as a remedy, its prevailing production paradigm -- one optimizing for statistical smoothness -- systematically removes the long-tail, cognitively grounded irregularities that characterize human text. Prolonged training on such statistically optimal but cognitively impoverished data accelerates model collapse. This paper proposes a paradigm shift: instead of imitating the surface properties of data, we simulate the cognitive processes that generate human text. We introduce the Prompt-driven Cognitive Computing Framework (PMCSF), whose core consists of a Cognitive State Decoder (CSD) that reverse-engineers unstructured text into structured cognitive vectors, and a Cognitive Text Encoder (CTE) that re-materializes these states into text enriched with human-typical imperfections via mathematically defined Cognitive Perturbation Operators. The framework is validated through a two-stage objective evaluation pipeline. First, in cognitive codec verification, CTE text yields a Jensen-Shannon divergence of 0.0614 from human text (vs. 0.4431 for standard LLM output), passes double-blind professional media review, and achieves an intraclass correlation coefficient ICC > 0.9 for cognitive profile alignment across heterogeneous models. Second, in functional gain evaluation, isomorphic stress tests in the A-share market show that strategies incorporating CTE-generated data reduce maximum drawdown by 47.4% during the 2015 crash and deliver 8.6% Defensive Alpha, exceeding transaction costs by a factor of 33. Our findings demonstrate that modelling human cognitive limitations -- not copying surface data -- enables synthetic data with genuine functional gain, offering a viable technical pathway toward resolving the AI data-collapse crisis.
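The Jensen-Shannon divergence quoted above can be illustrated with a minimal sketch. The abstract does not say which feature distribution or logarithm base the authors use, so the binned sentence-length histogram, the `bin_width` parameter, the base-2 logarithm, and both function names below are assumptions made for illustration only, not the paper's method.

```python
import math
from collections import Counter


def length_histogram(lengths, bin_width=5):
    """Turn a list of sentence lengths into a normalized histogram.

    Hypothetical helper: the binning scheme is an assumption, not taken
    from the paper. Returns a dict mapping bin index -> probability.
    """
    counts = Counter(n // bin_width for n in lengths)
    total = sum(counts.values())
    return {b: c / total for b, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions.

    p and q are dicts mapping outcome -> probability. With base-2 logs the
    result lies in [0, 1]: 0 for identical distributions, 1 for disjoint ones.
    """
    support = set(p) | set(q)
    # Mixture distribution M = (P + Q) / 2 over the joint support.
    m = {x: 0.5 * (p.get(x, 0.0) + q.get(x, 0.0)) for x in support}

    def kl(a):
        # KL(a || m); terms with zero probability contribute nothing.
        return sum(pa * math.log2(pa / m[x]) for x, pa in a.items() if pa > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


if __name__ == "__main__":
    # Toy sentence lengths, invented for this example only.
    human = length_histogram([4, 9, 12, 31, 7, 18, 45, 6])
    model = length_histogram([14, 15, 16, 15, 14, 16, 15, 15])
    print(f"JSD = {js_divergence(human, model):.4f}")
```

Under this convention a lower value means the two distributions are closer, which is the sense in which 0.0614 (CTE text) is read as nearer to human text than 0.4431 (standard LLM output).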

Paper Structure

This paper contains 143 sections, 7 equations, 9 figures, 36 tables.

Figures (9)

  • Figure 1: The Topology of Survival: A Visual Abstract of "Cognitive Phase Transition" During Financial Crisis. This 3D manifold illustrates the core thesis of this study: in non-ergodic markets (e.g., the 2015 Crash), survival is a topological bifurcation. The Purple Zone represents the "Consensus Phase" where both agents behave similarly. However, as the system approaches a critical singularity (June 15): (A) The Rationality Trap (Red Trajectory): The Human-Baseline strategy, which operates under the constraint of "mean reversion bias" (a heuristic detailed in the study's glossary), misclassifies structural collapse as transient market noise—a failure to distinguish between cyclical fluctuations and irreversible decline. This misperception propels the strategy toward a gravitational well of collapse, where the pull of declining asset values becomes inescapable (drawdown: -23.2%). (B) The Instinctual Escape (Blue Trajectory): The CTE-Enhanced strategy, infused with synthetic cognitive noise (a key component of the PMCSF framework), initiates a phase transition upon detecting a GARCH volatility peak ($h_{joy} \to 0.92$)—a threshold where the model's "cognitive texture" (imperfections mimicking human heuristic biases) triggers a break from statistical optimality. Shattering the symmetry of mean-reverting expectations, the strategy implements a "Digital Spartan" escape: full liquidation that avoids the gravitational well's pull. Metaphorical Insight: Survival is not about better fitting; it is about breaking the loop.
  • Figure 2: Architecture of the Prompt-driven Cognitive Computing Framework (PMCSF). Operating as a "Cognitive Codec", the system incorporates two inverse engines: CSD (dimensionality reduction) and CTE (reconstruction). The diagram delineates the invariant 17-Dim Cognitive State Vector serving as the intermediate language, alongside CTE's dual-layer design (Macro-Anchoring & Micro-Perturbation) for simulating bounded rationality. This topology—grounded in the hypothesis of cognitive invariants—suggests structural stability across diverse AI models and contexts.
  • Figure 3: 17-Dimensional Cognitive State Vector Comparison. Radar chart visualizing distinct "Cognitive Topologies" decoded by CSD. Note the structural divergence between Panic Mode (Red; high Fear/Uncertainty) and Frenzy Mode (Blue; high Greed/FOMO). This confirms CSD's capability to disentangle complex market sentiments into interpretable mathematical vectors.
  • Figure 4: The Spectrum of Cognitive Rhythm: Statistical Smoothness vs. Biological Fluctuation. (A) Distribution: Standard AI (blue) remains confined to a symmetric normal distribution, while CTE models (red/green/purple) demonstrate human-like right-skewness and Zipfian tails. (B) Variability: Box plots reveal the rigid stability of Standard AI ($CV \approx 44\%$) alongside the high dynamic range and outliers of CTE models ($CV \approx 65\%$), thereby verifying the Sentence Length Oscillation Operator. (C) Fingerprints: The radar chart contrasts the static nature of Standard AI (centered) with the dynamic, context-dependent signatures of CTE models (expanded).
  • Figure 5: Cross-Model Consistency Heatmap. Pearson correlation coefficients between DeepSeek (DS) and Doubao (DB) across 26 archetypal scenarios reveal high correlations in Novice (0.93) and Veteran (0.90) alignment—confirming robust detection of "Frenzy" (MCFI) and "Prudence" signals, respectively. The strong diagonal structure verifies that the 17-dimensional cognitive topology—the structural relationships of cognitive states within the latent semantic space—is model-agnostic, meaning it remains stable across different AI models.
  • ...and 4 more figures
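Two statistics named in the captions above, the coefficient of variation (CV) of sentence length in Figure 4 and the Pearson correlation in Figure 5, can be sketched as follows. This is an illustrative sketch only: the use of the population standard deviation, the function names, and the toy inputs are assumptions, and the captions do not specify how the authors tokenize sentences or which estimator they use.

```python
import math
import statistics


def coefficient_of_variation(lengths):
    """CV = standard deviation / mean of the sentence lengths.

    Assumption: population standard deviation (pstdev); the paper may use
    the sample estimator instead.
    """
    mean = statistics.fmean(lengths)
    if mean == 0:
        raise ValueError("mean sentence length is zero; CV is undefined")
    return statistics.pstdev(lengths) / mean


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Toy sentence lengths, invented for this example only.
    uniform_style = [14, 15, 16, 15, 14, 16]
    bursty_style = [4, 31, 9, 45, 6, 18]
    print(f"CV uniform: {coefficient_of_variation(uniform_style):.1%}")
    print(f"CV bursty:  {coefficient_of_variation(bursty_style):.1%}")

    # Two hypothetical decoders scoring the same scenarios on one dimension.
    scores_a = [0.1, 0.4, 0.35, 0.8]
    scores_b = [0.15, 0.38, 0.4, 0.75]
    print(f"Pearson r:  {pearson(scores_a, scores_b):.3f}")
```

A higher CV indicates more variation in sentence length relative to the mean, which is how the caption contrasts Standard AI with the CTE models. A Pearson coefficient near 1 indicates that two models rank the scenarios similarly on a given cognitive dimension.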