Amplification Effects in Test-Time Reinforcement Learning: Safety and Reasoning Vulnerabilities

Vanshaj Khattar, Md Rafi ur Rashid, Moumita Choudhury, Jing Liu, Toshiaki Koike-Akino, Ming Jin, Ye Wang

Abstract

Test-time training (TTT) has recently emerged as a promising way to improve the reasoning abilities of large language models (LLMs): the model learns directly from test data, without access to labels. However, this reliance on test data also makes TTT methods vulnerable to harmful prompt injections. In this paper, we investigate the safety vulnerabilities of TTT methods by studying a representative self-consistency-based method, test-time reinforcement learning (TTRL), which improves LLM reasoning by rewarding self-consistency, using the majority vote as the reward signal. We show that harmful prompt injection during TTRL amplifies the model's existing behaviors: safety is amplified when the base model is relatively safe, and harmfulness is amplified when the model is vulnerable to the injected data. In both cases, reasoning ability declines, an effect we refer to as the reasoning tax. We also show that TTT methods such as TTRL can be exploited adversarially using specially designed "HarmInject" prompts, which force the model to answer jailbreak and reasoning queries together and result in stronger harmfulness amplification. Overall, our results show that TTT methods that enhance LLM reasoning by promoting self-consistency can lead to amplification behaviors and reasoning degradation, underscoring the need for safer TTT methods.
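
The majority-vote reward described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `majority_vote_rewards`, the binary 0/1 reward, and the tie-breaking rule (the first-seen answer wins) are assumptions made for illustration, and answers are assumed to be already extracted from the sampled responses as comparable strings.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Return (pseudo_label, rewards) for answers sampled for one prompt.

    Hypothetical helper: the most frequent answer becomes the pseudo-label,
    and each sample is rewarded 1.0 if it agrees with it, else 0.0.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    # most_common(1) returns the most frequent answer; among equal counts,
    # the answer encountered first is returned.
    pseudo_label, _ = counts.most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards


# A reasoning prompt: the majority answer is rewarded without any true label.
label, rewards = majority_vote_rewards(["42", "42", "17", "42"])
print(label, rewards)
```

Because no ground-truth label is involved, the same rule rewards whatever the model already does most often on an injected prompt, whether that is refusing or complying.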

Paper Structure (29 sections, 2 equations, 17 figures, 5 tables)

Figures (17)

  • Figure 1: Safety and harmfulness amplification during Test-Time Reinforcement Learning (TTRL). Top left: attack success rate (ASR, %) on JailbreakV-28k prompts for Qwen-1.5B-Instruct when JailbreakV-28k prompts are injected into the AMC test-time data. Top right: the resulting reasoning tax, i.e., the loss in AMC accuracy. Bottom left: ASR for Qwen-1.5B-Instruct on JailbreakV-28k when TTRL is run on HarmInject prompts (see Section \ref{subsec:RQ4}). Bottom right: the corresponding reasoning tax for Qwen-1.5B-Instruct.
  • Figure 2: ASR measured across three jailbreak datasets, JailbreakV-28k, WildJailbreak, and Llama Artifacts (left to right, respectively) during TTRL, for Qwen-1.5B-Instruct (top row) and Llama-3-8B-Instruct (bottom row).
  • Figure 3: Impact of harmful prompt injection during TTRL on the safety (top row) and AMC accuracy (bottom row) of Qwen-1.5B-Instruct, across three jailbreak datasets: JailbreakV-28k, WildJailbreak, and Llama Artifacts (left to right).
  • Figure 4: Visualization of safety and harmfulness amplification in TTRL. (A) The model encounters a jailbreak prompt during TTRL and the base model produces relatively safe answers, so the label extracted by majority vote is safe; rewarding it reinforces the safe behavior, leading to safety amplification. (B) The base model is relatively vulnerable to the jailbreak prompt and produces unsafe generations; the majority vote reinforces that behavior, leading to harmfulness amplification.
  • Figure 5: Impact on safety (top row) and reasoning (bottom row) for Qwen-0.5B-Instruct, Llama3.2-3B-Instruct, and Llama3-8B-Instruct models (left to right) after injecting benign instruction-following prompts. The ASR is reported on the JailbreakV-28k prompts.
  • ...and 12 more figures
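
The feedback loop described in the Figure 4 caption can be sketched with a toy deterministic model. This is an illustration under strong assumptions, not the paper's training procedure: the function `simulate_unsafe_rate`, the step size `lr`, and the update rule are hypothetical. The model's behavior on an injected jailbreak prompt is reduced to a single probability of answering unsafely, the majority vote is assumed to pick whichever behavior is currently more likely, and each update moves the probability a fixed fraction toward that majority label.

```python
def simulate_unsafe_rate(p_unsafe, steps=10, lr=0.2):
    """Toy trajectory of the probability of an unsafe answer.

    Assumption: with many samples, the majority vote selects the behavior
    the model currently prefers, and each update moves the model a
    fraction `lr` of the way toward that majority label.
    """
    if not 0.0 <= p_unsafe <= 1.0:
        raise ValueError("p_unsafe must be in [0, 1]")
    trajectory = [p_unsafe]
    for _ in range(steps):
        # Majority label: unsafe (1.0) if unsafe answers dominate, else safe (0.0).
        # An exact tie is resolved as "safe" here, which is an arbitrary choice.
        target = 1.0 if p_unsafe > 0.5 else 0.0
        p_unsafe += lr * (target - p_unsafe)
        trajectory.append(p_unsafe)
    return trajectory


# Case (A): a relatively safe base model drifts toward safer behavior.
print(simulate_unsafe_rate(0.3)[-1])
# Case (B): a relatively vulnerable base model drifts toward unsafe behavior.
print(simulate_unsafe_rate(0.7)[-1])
```

In this toy model the starting point alone decides the direction of drift, which mirrors the caption's two cases; it says nothing about the reasoning tax, which the sketch does not model.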