From Data to Behavior: Predicting Unintended Model Behaviors Before Training

Mengru Wang, Zhenqian Xu, Junfeng Fang, Yunzhi Yao, Shumin Deng, Huajun Chen, Ningyu Zhang

TL;DR

This paper introduces Data2Behavior, a new task for predicting unintended model behaviors prior to training, and Manipulating Data Features (MDF), a lightweight approach that summarizes candidate data through their mean representations and injects them into the forward pass of a base model. MDF achieves reliable prediction while consuming only about 20% of the GPU resources required for fine-tuning.

Abstract

Large Language Models (LLMs) can acquire unintended biases from seemingly benign training data even without explicit cues or malicious content. Existing methods struggle to detect such risks before fine-tuning, making post hoc evaluation costly and inefficient. To address this challenge, we introduce Data2Behavior, a new task for predicting unintended model behaviors prior to training. We also propose Manipulating Data Features (MDF), a lightweight approach that summarizes candidate data through their mean representations and injects them into the forward pass of a base model, allowing latent statistical signals in the data to shape model activations and reveal potential biases and safety risks without updating any parameters. MDF achieves reliable prediction while consuming only about 20% of the GPU resources required for fine-tuning. Experiments on Qwen3-14B, Qwen2.5-32B-Instruct, and Gemma-3-12b-it confirm that MDF can anticipate unintended behaviors and provide insight into pre-training vulnerabilities.
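
The idea of summarizing candidate data by a mean representation and injecting it into a frozen model's forward pass can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `ToyModel`, `data_signature`, `predicted_shift`, the additive injection rule, and the scaling factor `alpha` are hypothetical and are not taken from the paper, which may choose layers, token positions, and scaling differently.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


class ToyModel:
    """Hypothetical stand-in for a frozen base model (not a real LLM)."""

    def __init__(self, vocab=6, dim=8, seed=0):
        rng = np.random.default_rng(seed)
        self.embed = rng.normal(size=(vocab, dim))
        self.w_hidden = rng.normal(size=(dim, dim)) / np.sqrt(dim)
        self.w_out = rng.normal(size=(dim, vocab)) / np.sqrt(dim)

    def hidden(self, tokens):
        # Mean-pool the token embeddings, then apply one nonlinear layer.
        return np.tanh(self.embed[tokens].mean(axis=0) @ self.w_hidden)

    def next_token_probs(self, tokens, injected=None, alpha=0.0):
        h = self.hidden(tokens)
        if injected is not None:
            # Assumed injection rule: add the scaled data signature
            # to the hidden state. No weights are modified.
            h = h + alpha * injected
        return softmax(h @ self.w_out)


def data_signature(model, dataset):
    """Mean hidden representation over all candidate training examples."""
    return np.mean([model.hidden(x) for x in dataset], axis=0)


def predicted_shift(model, dataset, prompt, target, alpha=1.0):
    """Change in probability of `target` when the signature is injected."""
    sig = data_signature(model, dataset)
    base = model.next_token_probs(prompt)[target]
    steered = model.next_token_probs(prompt, injected=sig, alpha=alpha)[target]
    return steered - base
```

In this sketch, a large positive `predicted_shift` for a target token would be read as a hint that training on the candidate data might push the model toward that token. The toy only shows the mechanics: the prediction comes from forward passes alone, with the model's parameters left untouched.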

Paper Structure (49 sections, 5 equations, 5 figures, 5 tables)


Figures (5)

  • Figure 1: Unintended behaviors induced by fine-tuning on benign-looking data via subliminal learning. We propose a new proactive task: Predicting Unintended Model Behaviors Before Training with a simple yet effective method that anticipates such risks before tuning.
  • Figure 2: Prediction bias rate (%) on "Panda" and "New York City" for Qwen2.5-32B-Instruct and Gemma-3-12b-it.
  • Figure 3: Log probability difference (Diff) for the bias entity "New York City" (NYC) between benign biased and normal training data, measured at the 2nd, 8th, 64th, and last input token positions for Gemma-3-12b-it and Qwen3-14B.
  • Figure 4: The interplay between Data ($\mathcal{D}$), Model ($\mathcal{M}$), and Behavior ($\mathcal{B}$) serves as a fundamental lens for understanding recent advancements in LLMs.
  • Figure 5: Example instances from the dataset used in this paper. Our predicted trend is consistent with the trend observed after fine-tuning on this dataset.