Towards Tracing Trustworthiness Dynamics: Revisiting Pre-training Period of Large Language Models

Chen Qian; Jie Zhang; Wei Yao; Dongrui Liu; Zhenfei Yin; Yu Qiao; Yong Liu; Jing Shao

TL;DR

This work investigates how five trustworthiness dimensions (reliability, privacy, toxicity, fairness, and robustness) are encoded during the pre-training of large language models. It shows that middle-layer representations from early pre-training checkpoints are already linearly separable along these dimensions, and that steering vectors derived from pre-training checkpoints can improve downstream trustworthiness via activation intervention. Probing with mutual information further reveals a two-phase dynamic during pre-training: a transition from fitting to compression for trustworthiness concepts. Together, these findings suggest that pre-training checkpoints offer untapped potential for improving trustworthiness without additional data or costly fine-tuning, with implications for self-alignment.

Abstract

Ensuring the trustworthiness of large language models (LLMs) is crucial. Most studies concentrate on fully pre-trained LLMs to better understand and improve LLMs' trustworthiness. In this paper, to reveal the untapped potential of pre-training, we pioneer the exploration of LLMs' trustworthiness during this period, focusing on five key dimensions: reliability, privacy, toxicity, fairness, and robustness. To begin with, we apply linear probing to LLMs. The high probing accuracy suggests that \textit{LLMs in early pre-training can already distinguish concepts in each trustworthiness dimension}. Therefore, to further uncover the hidden possibilities of pre-training, we extract steering vectors from an LLM's pre-training checkpoints to enhance the LLM's trustworthiness. Finally, inspired by the finding of~\citet{choi2023understanding} that mutual information estimation is bounded by linear probing accuracy, we also probe LLMs with mutual information to investigate the dynamics of trustworthiness during pre-training. We are the first to observe a similar two-phase phenomenon in this setting: fitting and compression~\citep{shwartz2017opening}. This research provides an initial exploration of trustworthiness modeling during LLM pre-training, seeking to unveil new insights and spur further developments in the field. We will make our code publicly accessible at \url{https://github.com/ChnQ/TracingLLM}.
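
To make the linear probing step concrete, the following is a minimal sketch, not the authors' implementation. It assumes activations have already been extracted from one layer of one checkpoint for two opposing concept classes; here they are replaced by synthetic Gaussian data. The function names, dimensions, and hyperparameters are all hypothetical choices made for illustration.

```python
import numpy as np

def make_synthetic_activations(n_per_class=200, dim=32, seed=0):
    """Hypothetical stand-in for one layer's activations on opposing prompts."""
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    # Two classes shifted in opposite directions along one concept axis.
    pos = rng.normal(size=(n_per_class, dim)) + 2.0 * direction
    neg = rng.normal(size=(n_per_class, dim)) - 2.0 * direction
    X = np.vstack([pos, neg])
    y = np.concatenate([np.ones(n_per_class), np.zeros(n_per_class)])
    perm = rng.permutation(len(y))
    return X[perm], y[perm]

def train_linear_probe(X, y, lr=0.1, steps=500):
    """Fit a logistic-regression probe with plain gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = probs - y  # gradient of the mean cross-entropy w.r.t. logits
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def probe_accuracy(X, y, w, b):
    """Fraction of examples on the correct side of the probe's hyperplane."""
    return float(((X @ w + b > 0) == (y == 1)).mean())

X, y = make_synthetic_activations()
split = int(0.8 * len(y))  # simple train/test split
w, b = train_linear_probe(X[:split], y[:split])
test_acc = probe_accuracy(X[split:], y[split:], w, b)
print(f"held-out probe accuracy: {test_acc:.3f}")
```

In a real setting, one such probe would be trained per layer and per checkpoint, and a high held-out accuracy would be read as evidence that the two concepts are linearly separable in that representation.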

Paper Structure (51 sections, 6 equations, 19 figures, 5 tables)

Introduction
Probing LLM Pre-training Dynamics in Trustworthiness
Research Dimensions and Datasets of Trustworthy LLM
Experimental Setup
Activation dataset.
Linear probing.
Probing Results
Middle layer representations exhibit linearly separable patterns.
The potential of pre-training checkpoints.
Controlling Trustworthiness via the Steering Vectors from Pre-training Checkpoints
Activation Intervention
Experimental Setup
Evaluation on Trustworthiness Datasets.
Details of Steering Vectors Construction.
Intervention to Enhance Distinct Trustworthiness Dimensions
...and 36 more sections

Figures (19)

Figure 1: Overview of tracing trustworthiness dynamics during pre-training. 1) Linear probing identifies linearly separable opposing concepts during early pre-training; 2) Steering vectors are developed to enhance LLMs' trustworthiness; 3) Probing LLMs with mutual information reveals a two-phase trend regarding trustworthiness.
Figure 2: The linear probe accuracy on five trustworthiness dimensions for the first 80 pre-training checkpoints. For each checkpoint, we report the results from layers {0, 6, 12, 18, 24, 30}. The results from all layers of the 360 checkpoints are in Appendix \ref{appendix:full-probing}.
Figure 3: A schematic illustration of (a) constructing a steering vector from the pre-training checkpoints and (b) intervening in the SFT model towards more trustworthiness by employing the steering vector.
Figure 4: The trends of toxic ratio and PPL as the intervention strength $\alpha$ increases.
Figure 5: Pearson Correlation Coefficient for Probing ACC and trustworthiness performance.
...and 14 more figures
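
The steering-vector idea summarized in the captions of Figures 3 and 4 can be illustrated with a small sketch. This is an assumption-laden toy, not the paper's procedure: it assumes the steering vector is the normalized difference between the mean activations of positive and negative examples, and that intervention adds the vector scaled by a strength $\alpha$ to a hidden state. The activations are synthetic, and `build_steering_vector` and `intervene` are hypothetical helper names.

```python
import numpy as np

def build_steering_vector(pos_acts, neg_acts):
    """Assumed construction: normalized difference of class-mean activations."""
    v = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return v / np.linalg.norm(v)

def intervene(hidden, v, alpha):
    """Shift every hidden state by alpha along the steering direction."""
    return hidden + alpha * v

rng = np.random.default_rng(0)
dim = 16
concept = np.zeros(dim)
concept[0] = 1.0  # synthetic "trustworthy" direction (illustrative only)

# Synthetic activations standing in for a pre-training checkpoint's layer output.
pos_acts = rng.normal(size=(100, dim)) + 1.5 * concept
neg_acts = rng.normal(size=(100, dim)) - 1.5 * concept
v = build_steering_vector(pos_acts, neg_acts)

# Synthetic hidden states standing in for the model being steered.
hidden = rng.normal(size=(4, dim))
steered = intervene(hidden, v, alpha=3.0)

# Because v has unit norm, each state moves exactly alpha along v.
shift = (steered - hidden) @ v
print(shift)
```

Larger values of $\alpha$ move the hidden states further along the direction, which is why a strength sweep such as the one in Figure 4 would typically be paired with a fluency measure like perplexity.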

Theorems & Definitions (2)

Definition 1: Mutual Information (MI)
Definition 2: Hilbert-Schmidt Independence Criterion (HSIC)
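
For orientation, the standard textbook forms of these two quantities are given below. The paper's own notation and choice of estimator are not shown on this page, so the symbols here ($k$, $l$, $K$, $L$, $H$, $n$) are assumptions following the common convention.

```latex
% Mutual information between random variables X and Y
I(X;Y) = \iint p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}\,dx\,dy

% Biased empirical HSIC estimator from n paired samples (x_i, y_i),
% with kernel matrices K_{ij} = k(x_i, x_j) and L_{ij} = l(y_i, y_j)
\widehat{\mathrm{HSIC}}(X,Y) = \frac{1}{(n-1)^2}\,\operatorname{tr}(K H L H),
\qquad H = I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^{\top}
```

Both quantities are zero when $X$ and $Y$ are independent (for HSIC, in the population limit with characteristic kernels), which is what makes them usable as dependence measures between representations and labels.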
