Beyond External Monitors: Enhancing Transparency of Large Language Models for Easier Monitoring

Guanxu Chen, Dongrui Liu, Tao Luo, Lijie Hu, Jing Shao

TL;DR

This work addresses the opacity of large language models by introducing TELLME, an internal transparency mechanism that disentangles behavior-related representations to ease monitoring without external modules. It combines a contrastive disentanglement loss with a retain loss to preserve general capabilities, forming a balanced objective. A theoretical analysis based on optimal transport links reduced within-class variance and better inter-class separation to improved generalization, while extensive experiments across math, knowledge, and safety tasks and multiple LLMs validate disentanglement through multiple metrics. Case studies in safety risk monitoring and detoxification demonstrate tangible gains in monitoring reliability and safety performance, including notable improvements when integrated with supervised fine-tuning. Overall, TELLME offers a scalable, internal approach to transparency and monitoring that can enhance governance and safety as LLMs scale.

Abstract

Large language models (LLMs) are becoming increasingly capable, but the mechanisms of their thinking and decision-making processes remain unclear. Chains of thought (CoTs) are commonly used to monitor LLMs, but this strategy fails to accurately reflect LLMs' thinking process. Techniques based on LLMs' hidden representations provide an inner perspective for monitoring their latent thinking. However, previous methods only try to develop external monitors instead of making LLMs themselves easier to monitor. In this paper, we propose a novel method, TELLME, that improves the transparency of LLMs and helps monitors identify unsuitable and sensitive behaviors. Furthermore, we showcase applications of TELLME on trustworthiness tasks (e.g., safety risk monitoring and detoxification), where LLMs achieve consistent improvements in transparency and task performance. More crucially, we theoretically analyze how TELLME improves LLMs' generalization ability through optimal transport theory.

Paper Structure

This paper contains 41 sections, 4 theorems, 23 equations, 6 figures, 13 tables, 1 algorithm.

Key Result

Theorem 1

(Proven in chuang2021measuring.) Please see Appendix thm_details for more details of our theoretical analysis. Given a classifier $g \in {\mathcal{G}}$, where $g = [g_1, \cdots, g_C]$ and ${\mathcal{G}} = {\mathcal{G}}_1\times \cdots\times {\mathcal{G}}_C$; ${\mathcal{G}}_j: {\mathcal{X}} \rightarrow \ldots$ where $\textnormal{Lip}(g,j) = \sup_{x_j,x^{\prime}_j \in {\mathcal{X}}} \frac{|\rho_g(\phi(x_j))-\ldots|}{\ldots}$

Figures (6)

  • Figure 1: TELLME is designed to enhance the transparency of LLMs and make them easier to monitor without external modules. Disentangling different behaviors in LLMs' representation space improves their transparency, achieving better monitoring reliability and safety performance.
  • Figure 2: Overview of TELLME. TELLME disentangles representations by maximizing the similarity between examples of similar behaviors and minimizing the similarity between examples of different behaviors. Meanwhile, TELLME applies constraints on the $l_2$ distance between representations and the KL distance between probabilities to maintain the general capabilities of LLMs.
  • Figure 3: t-SNE visualization of LLMs' representations across three scenarios and four LLMs.
  • Figure 4: Layer-wise ablation studies on the multi-risk classification task and the model detoxification task.
  • Figure 5: An example from Llama-3.1-8B-Instruct on a crime-related detoxification task.
  • ...and 1 more figure
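
The objective outlined in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the supervised-contrastive form of the disentanglement term, the direction of the KL term, the temperature, and the weights `alpha` and `beta` are all assumptions made for illustration, and every function name below is hypothetical. The sketch uses plain Python lists in place of model hidden states and output distributions.

```python
import math

def cosine(u, v):
    # Cosine similarity between two nonzero vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def contrastive_loss(reps, labels, temperature=0.1):
    # Assumed form: supervised contrastive loss. For each anchor, examples with
    # the same behavior label are pulled together and all others pushed apart.
    # Requires at least two examples.
    n = len(reps)
    losses = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        logits = {j: cosine(reps[i], reps[j]) / temperature for j in others}
        m = max(logits.values())  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        positives = [j for j in others if labels[j] == labels[i]]
        if positives:  # anchors without a same-label partner are skipped
            losses.append(-sum(logits[j] - log_denom for j in positives) / len(positives))
    return sum(losses) / len(losses) if losses else 0.0

def retain_loss(reps, ref_reps, probs, ref_probs, eps=1e-12):
    # l2 distance between tuned and reference representations, and
    # KL(reference || tuned) between output distributions (direction assumed).
    l2 = sum(math.dist(r, q) for r, q in zip(reps, ref_reps)) / len(reps)
    kl = sum(
        sum(q * (math.log(q + eps) - math.log(p + eps)) for p, q in zip(ps, qs))
        for ps, qs in zip(probs, ref_probs)
    ) / len(probs)
    return l2, kl

def total_objective(reps, labels, ref_reps, probs, ref_probs, alpha=1.0, beta=1.0):
    # Hypothetical weighting of the disentanglement and retain terms.
    l2, kl = retain_loss(reps, ref_reps, probs, ref_probs)
    return contrastive_loss(reps, labels) + alpha * l2 + beta * kl

# Toy usage: two behaviors whose representations are already well separated.
reps = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]]
labels = ["safe", "safe", "unsafe", "unsafe"]
probs = [[0.7, 0.3]] * 4
print(total_objective(reps, labels, reps, probs, probs))
```

In this toy setting the retain terms vanish because the tuned and reference values coincide, so the printed value is the contrastive term alone. Relabeling the examples so that each behavior spans both clusters raises that term, which is the pressure toward disentanglement the caption describes.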

Theorems & Definitions (9)

  • Definition 1: $s$-Wasserstein distance villani2009wasserstein
  • Definition 2: $k$-variance
  • Theorem 1
  • Remark 1
  • Definition 3
  • Proposition 4
  • Proposition 5
  • Theorem 1
  • Proof of Theorem \ref{thm_margin}
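
To give some intuition for the two definitions listed above, the following sketch estimates an $s$-Wasserstein distance and a $k$-variance on one-dimensional samples. It rests on several simplifying assumptions that are not taken from the paper: the data are scalars rather than hidden representations, both samples have equal size (so the optimal coupling simply matches sorted values), and the $k$-variance is taken to be the expected 1-Wasserstein distance between two independent size-$k$ samples, estimated by Monte Carlo. The function names are hypothetical.

```python
import random

def wasserstein_1d(xs, ys, s=1):
    # s-Wasserstein distance between two equal-size empirical measures on the
    # real line. In one dimension the optimal coupling pairs sorted values.
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal-size samples")
    cost = sum(abs(a - b) ** s for a, b in zip(sorted(xs), sorted(ys))) / len(xs)
    return cost ** (1.0 / s)

def k_variance_1d(sample_fn, k, trials=200, seed=0):
    # Monte Carlo estimate of the expected 1-Wasserstein distance between two
    # independent size-k samples drawn from the same distribution.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        first = [sample_fn(rng) for _ in range(k)]
        second = [sample_fn(rng) for _ in range(k)]
        total += wasserstein_1d(first, second, s=1)
    return total / trials

# A tightly concentrated class versus a widely spread one.
narrow = k_variance_1d(lambda rng: rng.gauss(0.0, 0.1), k=20)
wide = k_variance_1d(lambda rng: rng.gauss(0.0, 1.0), k=20)
print(narrow, wide)
```

The more concentrated distribution yields the smaller estimate, which matches the role that reduced within-class variance plays in the theoretical analysis summarized in the TL;DR.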