SDPO: Segment-Level Direct Preference Optimization for Social Agents
Aobo Kong, Wentao Ma, Shiwan Zhao, Yongbin Li, Yuchuan Wu, Ke Wang, Xiaoqian Liu, Qicheng Li, Yong Qin, Fei Huang
TL;DR
This work addresses the challenge of aligning LLM-based social agents in multi-turn dialogues, where standard DPO is limited to single turns. It introduces Segment-Level Direct Preference Optimization (SDPO), which dynamically selects key, equal-length segments within interactions to form fine-grained positive and negative data pairs. The method is grounded in a state-action occupancy framework and a Bradley-Terry preference model in which the partition function $Z$ is eliminated. The SDPO loss, $L_{SDPO}$, operates on these segments and reduces training noise relative to session-level objectives, with a data-construction pipeline that locates erroneous turns and extracts the most informative segments. Empirical evaluation on the SOTOPIA benchmark shows that SDPO-tuned agents outperform standard DPO, session-level methods such as ETO and DMPO, and proprietary models such as GPT-4o. The work offers a practical approach to segment-wise alignment and releases code and data for reproducibility and for application to other multi-turn tasks.
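To make the segment-level objective concrete, the following is a minimal sketch rather than the released implementation. It assumes that $L_{SDPO}$ takes the usual DPO form, with per-turn log-probability ratios summed over the turns of a positive and a negative segment of equal length. The function names, the `beta` default, and the list-of-floats input format are illustrative assumptions, not taken from the paper's code.

```python
import math
from typing import Sequence


def segment_log_ratio(policy_logps: Sequence[float], ref_logps: Sequence[float]) -> float:
    """Sum over the turns of one segment of log pi_theta - log pi_ref."""
    if len(policy_logps) != len(ref_logps):
        raise ValueError("policy and reference must cover the same turns")
    return sum(p - r for p, r in zip(policy_logps, ref_logps))


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sdpo_loss_sketch(
    pos_policy: Sequence[float],
    pos_ref: Sequence[float],
    neg_policy: Sequence[float],
    neg_ref: Sequence[float],
    beta: float = 0.1,  # assumed temperature, as in standard DPO
) -> float:
    """Assumed DPO-style loss on one (positive, negative) segment pair.

    Each argument holds one log-probability per agent turn in the segment.
    """
    # The method described above pairs segments of equal length.
    if len(pos_policy) != len(neg_policy):
        raise ValueError("positive and negative segments must have equal length")
    margin = beta * (
        segment_log_ratio(pos_policy, pos_ref)
        - segment_log_ratio(neg_policy, neg_ref)
    )
    # -log(sigmoid(margin)) == softplus(-margin)
    return softplus(-margin)
```

For example, when the policy equals the reference on every turn, the margin is zero and the sketch returns $\log 2$; raising the policy's log-probability on the positive segment lowers the loss.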
Abstract
Social agents powered by large language models (LLMs) can simulate human social behaviors but fall short in handling complex social dialogues. Direct Preference Optimization (DPO) has proven effective in aligning LLM behavior with human preferences across various agent tasks. However, standard DPO focuses solely on individual turns, which limits its effectiveness in multi-turn social interactions. Several DPO-based multi-turn alignment methods with session-level data have shown potential in addressing this problem. While these methods consider multiple turns across entire sessions, they are often overly coarse-grained, which introduces training noise, and they lack robust theoretical support. To resolve these limitations, we propose Segment-Level Direct Preference Optimization (SDPO), which dynamically selects key segments within interactions to optimize multi-turn agent behavior. SDPO minimizes training noise and is grounded in a rigorous theoretical framework. Evaluations on the SOTOPIA benchmark demonstrate that SDPO-tuned agents consistently outperform both existing DPO-based methods and proprietary LLMs like GPT-4o, underscoring SDPO's potential to advance the social intelligence of LLM-based agents. We release our code and data at https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/SDPO.
