System Message Generation for User Preferences using Open-Source Models
Minbyul Jeong, Jungho Cho, Minsoo Khang, Dawoon Jung, Teakgyu Hong
TL;DR
The paper addresses the lack of system messages in public SFT datasets and licensing constraints that hinder industrial use. It introduces SysGen, a four-phase pipeline that generates task-specific system messages at the phrase level using open-source models, filters and verifies them, and then produces aligned assistant responses. Training open-source LLMs on SysGen data improves performance on single-turn (Multifacet) and multi-turn (SysBench) benchmarks, with particularly strong gains in short conversations, indicating better early-stage interaction. The study also shows that distilling SysGen-generated data to models lacking native system-role support yields gains, and emphasizes that diverse, structured system messages improve LLM adaptability across varied user scenarios.
Abstract
System messages play a crucial role in interactions with large language models (LLMs), often serving as prompts to initiate conversations. Through system messages, users can assign specific roles, perform intended tasks, incorporate background information, and specify various output formats and communication styles. Despite such versatility, publicly available datasets often lack system messages and are subject to strict license constraints in industrial applications. Moreover, manually annotating system messages that align with user instructions is resource-intensive. In light of these challenges, we introduce SysGen, a pipeline for generating system messages that better align assistant responses with user instructions using existing supervised fine-tuning datasets that lack system messages. Training open-source models on SysGen data yields substantial improvements in both single-turn (Multifacet) and multi-turn (SysBench) conversation benchmarks. Notably, our method shows strong gains in shorter conversations, suggesting that it enhances early-stage interaction effectiveness. Our qualitative analysis further emphasizes the value of diverse and structured system messages in improving LLM adaptability across varied user scenarios.
