From Implicit to Explicit: Enhancing Self-Recognition in Large Language Models
Yinghan Zhou, Weifeng Zhu, Juan Wen, Wanli Peng, Zhengxian Wu, Yiming Xue
TL;DR
This work identifies implicit self-recognition (ISR) under the individual presentation paradigm (IPP): hidden representations encode signals that separate self-generated from other-generated text, but the model's outputs fail to reflect them, revealing a gap between internal structure and generation. It introduces Cognitive Surgery (CoSur), a four-module framework that extracts hidden representations, builds discriminative self- and other-recognition subspaces via SVD, performs authorship discrimination with projection energies, and applies cognitive editing to steer outputs toward the correct authorship. Across three LLMs on the HC3 dataset, CoSur substantially improves IPP self-recognition accuracy (averaging roughly 97–99%) and outperforms baselines, with ablations showing the critical role of the learned subspaces. The approach also shows efficiency gains and robust generalization to unseen sources, suggesting practical applications in model evaluation, defense against malicious prompts, and LLM-generated text detection. Overall, CoSur provides a principled, training-free method for reconciling internal discriminative signals with observable model outputs by addressing the implicit self-recognition bottleneck.
Abstract
Large language models (LLMs) have been shown to possess a degree of self-recognition ability, that is, the ability to identify whether a given text was generated by themselves. Prior work has demonstrated that this capability is reliably expressed under the pair presentation paradigm (PPP), where the model is presented with two texts and asked to choose which one it authored. However, performance deteriorates sharply under the individual presentation paradigm (IPP), where the model is given a single text and asked to judge its authorship. Although this phenomenon has been observed, its underlying causes have not been systematically analyzed. In this paper, we first investigate the cause of this failure and attribute it to implicit self-recognition (ISR). ISR describes the gap between internal representations and output behavior in LLMs: under the IPP scenario, the model encodes self-recognition information in its feature space, yet its outputs still fail to recognize self-generated texts. To mitigate ISR in LLMs, we propose cognitive surgery (CoSur), a novel framework comprising four main modules: representation extraction, subspace construction, authorship discrimination, and cognitive editing. Experimental results demonstrate that our proposed method improves the self-recognition performance of three different LLMs in the IPP scenario, achieving average accuracies of 99.00%, 97.69%, and 97.13%, respectively.
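
To make the subspace construction and authorship discrimination steps more concrete, the following is a minimal sketch of how SVD-based subspaces and projection energies could be used to separate two groups of representations. It is an illustration under stated assumptions, not the CoSur implementation: the hidden representations are synthetic random vectors rather than LLM activations, each subspace is taken as the top-k right singular directions of one group's representation matrix (the summary does not specify how the discriminative subspaces are actually built), and all names (`top_k_subspace`, `projection_energy`, `predict_self`, `sample_reps`) are hypothetical. The cognitive editing step is omitted because its details are not given here.

```python
import numpy as np


def top_k_subspace(reps, k):
    """Orthonormal basis (d x k) spanning the top-k right singular
    directions of a representation matrix of shape (n, d).
    Assumption: this is one simple way to build a subspace via SVD."""
    _, _, vt = np.linalg.svd(reps, full_matrices=False)
    return vt[:k].T


def projection_energy(h, basis):
    """Squared norm of the projection of vector h onto span(basis)."""
    return float(np.sum((basis.T @ h) ** 2))


def predict_self(h, self_basis, other_basis):
    """Label h as self-generated if it has more energy in the self subspace."""
    return projection_energy(h, self_basis) > projection_energy(h, other_basis)


# Synthetic stand-ins for hidden representations (assumption: real inputs
# would be hidden states extracted from the model for each text).
rng = np.random.default_rng(0)
d, k, n = 16, 2, 200

# Two mutually orthogonal ground-truth subspaces, obtained from a QR factorization.
q, _ = np.linalg.qr(rng.normal(size=(d, 2 * k)))
true_self, true_other = q[:, :k], q[:, k:]


def sample_reps(basis, count):
    """Draw vectors lying near span(basis), plus small isotropic noise."""
    coeffs = rng.normal(size=(count, basis.shape[1]))
    return coeffs @ basis.T + 0.05 * rng.normal(size=(count, d))


self_reps = sample_reps(true_self, n)
other_reps = sample_reps(true_other, n)

# Subspace construction from the two groups of representations.
self_basis = top_k_subspace(self_reps, k)
other_basis = top_k_subspace(other_reps, k)

# Authorship discrimination on held-out synthetic samples.
test_self = sample_reps(true_self, 50)
test_other = sample_reps(true_other, 50)
correct = sum(predict_self(h, self_basis, other_basis) for h in test_self)
correct += sum(not predict_self(h, self_basis, other_basis) for h in test_other)
accuracy = correct / 100
print(f"synthetic accuracy: {accuracy:.2f}")
```

Comparing the two energies directly is the simplest possible decision rule; a real system might normalize the energies or calibrate a threshold, and the separability seen on this synthetic data says nothing about the accuracies reported above.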
