Beyond External Monitors: Enhancing Transparency of Large Language Models for Easier Monitoring
Guanxu Chen, Dongrui Liu, Tao Luo, Lijie Hu, Jing Shao
TL;DR
This work addresses the opacity of large language models by introducing TELLME, an internal transparency mechanism that disentangles behavior-related representations to ease monitoring without external modules. It combines a contrastive disentanglement loss with a retain loss to preserve general capabilities, forming a balanced objective. A theoretical analysis based on optimal transport links reduced within-class variance and better inter-class separation to improved generalization, while extensive experiments across math, knowledge, and safety tasks and multiple LLMs validate disentanglement through multiple metrics. Case studies in safety risk monitoring and detoxification demonstrate tangible gains in monitoring reliability and safety performance, including notable improvements when integrated with supervised fine-tuning. Overall, TELLME offers a scalable, internal approach to transparency and monitoring that can enhance governance and safety as LLMs scale.
Abstract
Large language models (LLMs) are becoming increasingly capable, but the mechanisms behind their thinking and decision-making remain unclear. Chain-of-thought (CoT) reasoning is commonly used to monitor LLMs, but it fails to accurately reflect their thinking process. Techniques based on LLMs' hidden representations provide an internal perspective for monitoring their latent thinking. However, previous methods only develop external monitors instead of making the LLMs themselves easier to monitor. In this paper, we propose a novel method, TELLME, which improves the transparency of LLMs and helps monitors identify unsuitable and sensitive behaviors. Furthermore, we showcase applications of TELLME on trustworthiness tasks (e.g., safety risk monitoring and detoxification), where LLMs achieve consistent improvements in transparency and task performance. More crucially, we theoretically analyze how TELLME improves LLMs' generalization ability through optimal transport theory.
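The TL;DR describes the objective as a contrastive disentanglement loss balanced against a retain loss. The sketch below illustrates that general shape on toy vectors and is not the authors' implementation. Everything specific in it is an assumption: the supervised-contrastive form of the first term, the mean-squared-error form of the retain term, the weight `lam`, the temperature, and all function names are hypothetical choices that the summary above does not specify.

```python
import numpy as np


def contrastive_disentangle_loss(reps, labels, temperature=0.1):
    """Supervised-contrastive-style loss (an assumed form, not the paper's).

    Pulls together representations that share a behavior label and pushes
    apart those that do not. Assumes at least two samples in the batch and
    at least one label that occurs more than once.
    """
    z = reps / np.linalg.norm(reps, axis=1, keepdims=True)  # unit-normalize rows
    sim = z @ z.T / temperature                             # scaled cosine similarity
    self_mask = np.eye(len(labels), dtype=bool)
    sim = np.where(self_mask, -np.inf, sim)                 # exclude self-pairs

    # Row-wise log-softmax over all other samples, computed stably.
    row_max = sim.max(axis=1, keepdims=True)
    log_norm = row_max + np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    log_prob = sim - log_norm

    # Positives: same label, different sample.
    pos = (labels[:, None] == labels[None, :]) & ~self_mask
    pos_count = pos.sum(axis=1)
    valid = pos_count > 0                                   # anchors with a positive
    pos_log_prob = np.where(pos, log_prob, 0.0).sum(axis=1)
    return float((-pos_log_prob[valid] / pos_count[valid]).mean())


def retain_loss(reps, ref_reps):
    """Assumed retain term: mean squared distance to reference representations
    (for example, those of the unmodified model on general-purpose inputs)."""
    return float(np.mean((reps - ref_reps) ** 2))


def combined_objective(reps, labels, retain_reps, retain_ref_reps, lam=1.0):
    """Hypothetical balanced objective: disentangle plus lam times retain."""
    return (contrastive_disentangle_loss(reps, labels)
            + lam * retain_loss(retain_reps, retain_ref_reps))


if __name__ == "__main__":
    # Toy 2-D "hidden states": two behaviors, two samples each.
    reps = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]])
    labels = np.array([0, 0, 1, 1])
    # Retain inputs whose representations drifted slightly from the reference.
    retain_ref = np.array([[0.5, 0.5], [0.2, 0.8]])
    retain_now = retain_ref + 0.05
    print(combined_objective(reps, labels, retain_now, retain_ref, lam=1.0))
```

In this toy setup the contrastive term is small when same-label vectors point in similar directions and large when labels are interleaved across directions, while the retain term grows as representations on the retain inputs drift from the reference. The weight `lam` is where a trade-off between separation and preserved capability would be set under these assumptions.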
