The Necessity of Imperfection: Reversing Model Collapse via Simulating Cognitive Boundedness
Zhongjie Jiang
TL;DR
The paper tackles the AI data-collapse problem by arguing that purely statistical data generation erodes the cognitive texture inherent in human language. It introduces PMCSF, a dual-engine framework: a Cognitive State Decoder reverse-engineers text into a 17-dimensional cognitive state space, and a Cognitive Text Encoder re-materializes those states as text shaped by bounded rationality, using Cognitive Perturbation Operators. In a two-stage validation (cognitive codec verification, then functional gain testing in the A-share market), the paper reports that text generated with simulated cognitive imperfections closely resembles human language and yields measurable gains in risk management and trading performance. The work offers a pathway to more robust synthetic data by embedding cognitive dynamics, with implications for AI safety, interpretability, and domains beyond finance. It also provides open-source artefacts to support reproducibility and further exploration of cognitive texture as a synthetic-data primitive.
Abstract
Although synthetic data is widely promoted as a remedy, its prevailing production paradigm, which optimizes for statistical smoothness, systematically removes the long-tail, cognitively grounded irregularities that characterize human text. Prolonged training on such statistically optimal but cognitively impoverished data accelerates model collapse. This paper proposes a paradigm shift: instead of imitating the surface properties of data, we simulate the cognitive processes that generate human text. We introduce the Prompt-driven Cognitive Computing Framework (PMCSF), whose core consists of a Cognitive State Decoder (CSD) that reverse-engineers unstructured text into structured cognitive vectors, and a Cognitive Text Encoder (CTE) that re-materializes these states into text enriched with human-typical imperfections via mathematically defined Cognitive Perturbation Operators. The framework is validated through a two-stage objective evaluation pipeline. First, in cognitive codec verification, CTE text yields a Jensen-Shannon divergence of 0.0614 from human text (vs. 0.4431 for standard LLM output), passes double-blind professional media review, and achieves an intraclass correlation coefficient (ICC) above 0.9 for cognitive profile alignment across heterogeneous models. Second, in functional gain evaluation, isomorphic stress tests in the A-share market show that strategies incorporating CTE-generated data reduce maximum drawdown by 47.4% during the 2015 crash and deliver 8.6% Defensive Alpha, exceeding transaction costs by a factor of 33. Our findings demonstrate that modelling human cognitive limitations, rather than copying surface data, enables synthetic data with genuine functional gain, offering a viable technical pathway toward resolving the AI data-collapse crisis.
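For readers unfamiliar with the two headline metrics, the sketch below shows how a Jensen-Shannon divergence and a maximum drawdown are conventionally computed. It is an illustration only, not the paper's evaluation code. The abstract does not say which text features the divergence is taken over, which logarithm base is used, or how drawdown is measured, so the function names, the base-2 logarithm (which bounds the divergence in [0, 1]), and the toy inputs are all assumptions.

```python
import math


def js_divergence(p, q, base=2.0):
    """Jensen-Shannon divergence between two discrete distributions.

    p and q are non-negative weights over the same bins (assumed here to be
    histograms of some text feature); they are normalised before use.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of bins")
    total_p, total_q = sum(p), sum(q)
    p = [x / total_p for x in p]
    q = [x / total_q for x in q]
    # Mixture distribution M = (P + Q) / 2.
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence; bins with zero mass in a contribute 0.
        return sum(x * math.log(x / y, base) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def max_drawdown(values):
    """Largest peak-to-trough decline of a portfolio value series, as a fraction."""
    peak = values[0]
    worst = 0.0
    for v in values:
        peak = max(peak, v)
        worst = max(worst, (peak - v) / peak)
    return worst


# Hypothetical toy inputs, not data from the paper.
human_hist = [10, 30, 40, 20]
synthetic_hist = [5, 45, 45, 5]
print(js_divergence(human_hist, synthetic_hist))
print(max_drawdown([100.0, 120.0, 90.0, 110.0]))
```

Under this definition, identical distributions give a divergence of 0 and distributions with disjoint support give 1, so a lower value means the synthetic text's feature distribution sits closer to the human one. The reported 47.4% figure reads most naturally as a relative reduction between a baseline strategy's maximum drawdown and that of the CTE-augmented strategy, though the abstract does not spell this out.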
