Tuning Language Models for Robust Prediction of Diverse User Behaviors

Fanjin Meng, Jingtao Ding, Jiahui Gong, Chen Yang, Hong Chen, Zuojian Wang, Haisheng Lu, Yong Li

TL;DR

This work tackles long-tail user behavior prediction by revealing that standard fine-tuning biases LLMs toward frequent anchor behaviors. It introduces BehaviorLM, a two-stage progressive fine-tuning framework: A-Tuning specializes the model on anchor behaviors while preserving general knowledge via auxiliary conversations, and B-Tuning reintroduces tail data through difficulty-based sample selection to balance the training signal. Across two real-world datasets, BehaviorLM achieves substantial gains in tail prediction (up to 27%/20% absolute improvements) and demonstrates strong sample efficiency, leveraging the intrinsic behavioral knowledge in LLMs for few-shot learning. The approach offers a practical path to robust, scalable behavior prediction in intelligent assistants and related applications.

Abstract

Predicting user behavior is essential for intelligent assistant services, yet deep learning models often struggle to capture long-tailed behaviors. Large language models (LLMs), with their pretraining on vast corpora containing rich behavioral knowledge, offer promise. However, existing fine-tuning approaches tend to overfit to frequent "anchor" behaviors, reducing their ability to predict less common "tail" behaviors. In this paper, we introduce BehaviorLM, a progressive fine-tuning approach that addresses this issue. In the first stage, LLMs are fine-tuned on anchor behaviors while preserving general behavioral knowledge. In the second stage, fine-tuning uses a balanced subset of all behaviors based on sample difficulty to improve tail behavior predictions without sacrificing anchor performance. Experimental results on two real-world datasets demonstrate that BehaviorLM robustly predicts both anchor and tail behaviors and effectively leverages LLM behavioral knowledge to master tail behavior prediction with few-shot examples.

Paper Structure

This paper contains 32 sections, 8 equations, 5 figures, 6 tables, 1 algorithm.

Figures (5)

  • Figure 1: (a) Empirical distribution of user behaviors in the Behavior dataset: "Anchor Behaviors" occur more than 1% of the time, while "Tail Behaviors" represent the rest. (b) Semantic embedding visualization of anchor and tail behaviors in the LLM. (c) Prediction accuracy comparison across LLM tuning methods and GPT4o for anchor and tail behaviors, with "NT" indicating no tuning.
  • Figure 2: The BehaviorLM framework, which uses a progressive fine-tuning approach.
  • Figure 3: The effect of behavioral knowledge across model sizes (1.5B, 8B, 70B), in terms of performance robustness across behavior types and number of few-shot samples.
  • Figure 4: Performance comparison between fine-tuning on all behaviors, anchor behaviors and tail behaviors.
  • Figure 5: Comparison between BehaviorLM and a non-LLM transformer-based method under different sizes of training data.
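
The anchor/tail distinction in Figure 1(a) rests on a simple frequency rule: behaviors that occur more than 1% of the time are anchors, and the rest are tail behaviors. The following is a minimal sketch of that split, not the authors' code. The function name `split_anchor_tail`, the `threshold` parameter, and the toy behavior log are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def split_anchor_tail(behaviors, threshold=0.01):
    """Split behavior labels into anchor and tail sets by relative frequency.

    A behavior is treated as an anchor if its share of all events is strictly
    greater than `threshold` (1% by default, following the Figure 1 caption).
    Everything else is treated as a tail behavior.
    """
    counts = Counter(behaviors)
    total = sum(counts.values())
    anchor = {b for b, c in counts.items() if c / total > threshold}
    tail = set(counts) - anchor
    return anchor, tail


# Hypothetical toy log: two frequent behaviors and one rare behavior (0.5%).
log = ["open_app"] * 600 + ["play_music"] * 395 + ["set_alarm"] * 5
anchor, tail = split_anchor_tail(log)
```

In this toy log, `set_alarm` accounts for 0.5% of events and therefore lands in the tail set, while the other two behaviors are anchors. A split of this kind would be a natural preprocessing step before a staged fine-tuning procedure that treats the two groups differently.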