SDPO: Segment-Level Direct Preference Optimization for Social Agents

Aobo Kong; Wentao Ma; Shiwan Zhao; Yongbin Li; Yuchuan Wu; Ke Wang; Xiaoqian Liu; Qicheng Li; Yong Qin; Fei Huang

TL;DR

This work addresses the challenge of aligning LLM-based social agents in multi-turn dialogues, where prior DPO methods are limited to single turns. It introduces Segment-Level Direct Preference Optimization (SDPO), which dynamically selects key, equal-length segments within interactions to form fine-grained positive and negative data pairs, grounded in a state-action occupancy framework and a Bradley-Terry preference model that eliminates the partition function $Z$. The SDPO loss, $L_{SDPO}$, operates on these segments and enables rigorous, noise-reduced training, with a data-construction pipeline that locates erroneous turns and extracts the most informative segments. Empirical evaluation on the SOTOPIA benchmark shows SDPO-tuned agents outperform traditional DPO, session-level methods like ETO and DMPO, and even proprietary models such as GPT-4o, demonstrating substantial improvements in social intelligence and robustness. The work provides a practical, scalable approach to segment-wise alignment and releases code and data for reproducibility and broader application to other multi-turn tasks.
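As an illustration of how a segment-level preference loss of this kind could be computed, here is a minimal sketch. It assumes the loss takes the usual DPO form, with the policy-to-reference log-ratio summed over the agent's turns inside each segment. The function names, the `beta` default, and the per-turn log-probability inputs are hypothetical and are not taken from the released code.

```python
import math
from typing import Sequence


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def segment_log_ratio(policy_logps: Sequence[float],
                      ref_logps: Sequence[float]) -> float:
    """Sum over turns of log pi_theta(y_t | h_t) - log pi_ref(y_t | h_t).

    Each entry is assumed to be the summed token log-probability of one
    agent utterance (one turn) in the segment.
    """
    if len(policy_logps) != len(ref_logps):
        raise ValueError("policy and reference must cover the same turns")
    return sum(p - r for p, r in zip(policy_logps, ref_logps))


def sdpo_style_loss(pos_policy: Sequence[float], pos_ref: Sequence[float],
                    neg_policy: Sequence[float], neg_ref: Sequence[float],
                    beta: float = 0.1) -> float:
    """Bradley-Terry style loss on one (positive, negative) segment pair."""
    # The segments are required to contain the same number of turns,
    # mirroring the equal-length constraint described in the text.
    if len(pos_policy) != len(neg_policy):
        raise ValueError("positive and negative segments must be equal length")
    margin = beta * (segment_log_ratio(pos_policy, pos_ref)
                     - segment_log_ratio(neg_policy, neg_ref))
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)
```

When the policy matches the reference on both segments the margin is zero and the loss equals $\log 2$. The loss falls below that value as the policy raises the likelihood of the positive segment relative to the negative one.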

Abstract

Social agents powered by large language models (LLMs) can simulate human social behaviors but fall short in handling complex social dialogues. Direct Preference Optimization (DPO) has proven effective in aligning LLM behavior with human preferences across various agent tasks. However, standard DPO focuses solely on individual turns, which limits its effectiveness in multi-turn social interactions. Several DPO-based multi-turn alignment methods with session-level data have shown potential in addressing this problem. While these methods consider multiple turns across entire sessions, they are often overly coarse-grained, which introduces training noise, and they lack robust theoretical support. To resolve these limitations, we propose Segment-Level Direct Preference Optimization (SDPO), which dynamically selects key segments within interactions to optimize multi-turn agent behavior. SDPO minimizes training noise and is grounded in a rigorous theoretical framework. Evaluations on the SOTOPIA benchmark demonstrate that SDPO-tuned agents consistently outperform both existing DPO-based methods and proprietary LLMs like GPT-4o, underscoring SDPO's potential to advance the social intelligence of LLM-based agents. We release our code and data at https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/SDPO.

Paper Structure (33 sections, 12 equations, 9 figures, 9 tables)

Introduction
Preliminary
SOTOPIA Environment
Task Formulation
Direct Preference Optimization
Method
Behavioral Cloning
Preference Data Construction
SDPO Loss
Experiments
Datasets
Experimental Setup
Baselines
Results
Analysis
...and 18 more sections

Figures (9)

Figure 1: An overview of three alignment algorithms, illustrated using a SOTOPIA task as an example. An icon (not reproduced here) marks the agent to be tested. A more detailed illustration is provided in the figure labeled fg: overview.
Figure 2: Data construction pipeline for SDPO. One icon (not reproduced here) marks the agent to be tested; another denotes GPT-4o.
Figure 3: Goal ratings and average words per session for various agents. The word count includes only the utterances of our agents. The square brackets denote [average turns per session $\times$ average words per turn].
Figure 4: Comparison of the quality of positive sessions sampled at the session level and segment level.
Figure 5: Prompt organization formats in original and modified SOTOPIA, respectively.
...and 4 more figures
