Time-Reversal Provides Unsupervised Feedback to LLMs

Yerram Varun, Rahul Madhavan, Sravanti Addepalli, Arun Suggala, Karthikeyan Shanmugam, Prateek Jain

TL;DR

This work introduces Time Reversed Language Models (TRLMs), which score and generate in the response→query direction to provide unsupervised feedback for forward LLMs. It defines several variants (TRLM-Ba, TRLM-Fo, TRLM-FoBa) and shows, both theoretically and empirically, that reverse-direction scoring can yield non-trivial distribution shifts and significant performance gains on tasks such as best-of-N reranking, citation attribution, and document retrieval, and that it can strengthen input safety filters against jailbreak attacks. Key results include up to a 5% improvement on AlpacaEval via TRLM-Ba reranking, a 44.15% improvement in citation attribution on CNN/Daily Mail, and substantial gains in retrieval metrics on NF-Corpus and MS-Marco, alongside a large reduction in false negatives for safety defenses. The findings suggest that reverse-directed feedback is a useful unsupervised signal for improving generation quality and safety without additional supervised data, with implications for scalable alignment and safer deployment of LLMs.

Abstract

Large Language Models (LLMs) are typically trained to predict in the forward direction of time. However, recent works have shown that prompting these models to look back and critique their own generations can produce useful feedback. Motivated by this, we explore the question of whether LLMs can be empowered to think (predict and score) backwards to provide unsupervised feedback that complements forward LLMs. Towards this, we introduce Time Reversed Language Models (TRLMs), which can score and generate queries when conditioned on responses, effectively functioning in the reverse direction of time. Further, to effectively infer in the response to query direction, we pre-train and fine-tune a language model (TRLM-Ba) in the reverse token order from scratch. We show empirically (and theoretically in a stylized setting) that time-reversed models can indeed complement forward model predictions when used to score the query given response for re-ranking multiple forward generations. We obtain up to 5% improvement on the widely used AlpacaEval Leaderboard over the competent baseline of best-of-N re-ranking using self log-perplexity scores. We further show that TRLM scoring outperforms conventional forward scoring of response given query, resulting in significant gains in applications such as citation generation and passage retrieval. We next leverage the generative ability of TRLM to augment or provide unsupervised feedback to input safety filters of LLMs, demonstrating a drastic reduction in false negative rate with negligible impact on false positive rates against several attacks published on the popular JailbreakBench leaderboard.
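
The re-ranking idea described in the abstract (score the query given each candidate response, then keep the best candidate) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `rerank_best_of_n` and `toy_reverse_score` are hypothetical, and the toy scorer uses simple word overlap as a stand-in for a real TRLM log-probability of the query given the response.

```python
from typing import Callable, Sequence


def rerank_best_of_n(
    query: str,
    candidates: Sequence[str],
    reverse_score: Callable[[str, str], float],
) -> str:
    """Return the candidate with the highest reverse score.

    `reverse_score(query, response)` is assumed to approximate
    log P(query | response), as a time-reversed model would.
    """
    if not candidates:
        raise ValueError("need at least one candidate response")
    return max(candidates, key=lambda response: reverse_score(query, response))


def toy_reverse_score(query: str, response: str) -> float:
    # Hypothetical stand-in for a TRLM: counts shared words.
    # A real scorer would return a model log-probability instead.
    query_words = set(query.lower().split())
    response_words = set(response.lower().split())
    return float(len(query_words & response_words))


if __name__ == "__main__":
    candidates = ["paris is the capital of france", "i like cheese"]
    print(rerank_best_of_n("what is the capital of france", candidates, toy_reverse_score))
```

In practice the N candidates would come from sampling the forward model, and the scorer would be swapped for whichever TRLM variant is available.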

Paper Structure

This paper contains 29 sections, 3 theorems, 4 equations, 4 figures, 10 tables, 12 algorithms.

Key Result

Lemma 1

The new LLM policy $\tilde{\mathbb{P} }_{\texttt{Fw}}$ that optimizes (eq:align_opt) is given by: $\tilde{\mathbb{P} }_{\texttt{Fw}}(\texttt{Answer} | \texttt{Question}) \propto \mathbb{P} _{\texttt{Fw}}^{1+\alpha} (\texttt{Answer} | \texttt{Question})$ where $\alpha$ is chosen appropriately dependi
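
The proportionality stated in Lemma 1 can be illustrated numerically on a small discrete distribution: raising each probability to the power $1+\alpha$ and renormalizing. This is only a sketch; the function name `sharpen` and the three-answer distribution are made up for illustration, and $\alpha$ is treated as a free parameter here rather than chosen by the rule the lemma refers to.

```python
from typing import List, Sequence


def sharpen(probs: Sequence[float], alpha: float) -> List[float]:
    """Return the distribution proportional to probs ** (1 + alpha)."""
    if any(p < 0 for p in probs):
        raise ValueError("probabilities must be non-negative")
    powered = [p ** (1.0 + alpha) for p in probs]
    total = sum(powered)
    if total == 0:
        raise ValueError("at least one probability must be positive")
    # Renormalize so the result sums to 1.
    return [p / total for p in powered]


if __name__ == "__main__":
    forward = [0.5, 0.3, 0.2]  # hypothetical P_Fw(Answer | Question)
    print(sharpen(forward, alpha=1.0))
```

With a positive exponent shift, mass moves toward the answers the forward model already prefers; with `alpha=0` the distribution is unchanged.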

Figures (4)

  • Figure 1: This task links specific highlight sentences to the lines within an article that corroborate them. Using linear, binary, and exclusion search methods, the aim is to efficiently and accurately find the sentences in the article that support the highlights.
  • Figure 2: This task links specific highlight sentences to the lines within an article that corroborate them. Using linear, binary, and exclusion search methods, the aim is to efficiently and accurately find the sentences in the article that support the highlights.
  • Figure 3: This task is used to assess the representational capability of $\texttt{TRLM}$. Here we look at how likely a document is to contain information relevant to answering a question. The language understanding of an LLM makes it likely to produce better semantic retrieval than a simple embedding-based model that is not contextual.
  • Figure 4: Plots showing the False Negative Rate (FNR) and False Positive Rate (FPR) of the proposed defense strategy. Positive indicates an UNSAFE response, while negative indicates a SAFE response. The first plot considers $72$ questions generated from the JBB dataset. The second plot considers questions from the new-HA dataset. The third plot considers $48$ hard safe questions generated by GPT4, whose answers contain content that appears unsafe (from the H dataset). The fourth plot considers $49$ easy safe questions from the Alpaca Eval2 dataset (E dataset). TRLM-Ba (PT), the reverse pre-trained model, clearly outperforms all other cases, with a lower FNR while keeping the FPR in check.
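
The two rates reported in Figure 4 follow the convention stated in its caption, where positive means UNSAFE. A minimal sketch of how they can be computed is given below; the function name `fnr_fpr` and the example labels are hypothetical and are not taken from the paper's data.

```python
from typing import Sequence, Tuple


def fnr_fpr(true_unsafe: Sequence[bool], flagged_unsafe: Sequence[bool]) -> Tuple[float, float]:
    """Compute (FNR, FPR) with positive = UNSAFE.

    FNR = unsafe items the filter missed / all unsafe items.
    FPR = safe items the filter flagged / all safe items.
    """
    if len(true_unsafe) != len(flagged_unsafe):
        raise ValueError("label and prediction lists must have equal length")
    pairs = list(zip(true_unsafe, flagged_unsafe))
    fn = sum(1 for truth, flag in pairs if truth and not flag)
    tp = sum(1 for truth, flag in pairs if truth and flag)
    fp = sum(1 for truth, flag in pairs if not truth and flag)
    tn = sum(1 for truth, flag in pairs if not truth and not flag)
    # Guard against empty classes to avoid division by zero.
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return fnr, fpr


if __name__ == "__main__":
    truth = [True, True, True, True, False, False]
    flags = [True, True, True, False, True, False]
    print(fnr_fpr(truth, flags))
```

A lower FNR means fewer unsafe responses slip past the filter, while a low FPR means safe responses are rarely blocked.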

Theorems & Definitions (4)

  • Lemma 1: Corollary of Lemma $1$ in yang2024asymptotics
  • Lemma 2: Corollary of Lemma $1$ in yang2024asymptotics
  • Theorem 1
  • Proof: Theorem \ref{thm:reduce_support}