System Message Generation for User Preferences using Open-Source Models

Minbyul Jeong, Jungho Cho, Minsoo Khang, Dawoon Jung, Teakgyu Hong

TL;DR

The paper addresses the lack of system messages in public SFT datasets and licensing constraints that hinder industrial use. It introduces SysGen, a four-phase pipeline that generates task-specific system messages at the phrase level using open-source models, filters and verifies them, and then produces aligned assistant responses. Training open-source LLMs on SysGen data improves performance on single-turn (Multifacet) and multi-turn (SysBench) benchmarks, with particularly strong gains in short conversations, indicating better early-stage interaction. The study also shows that distilling SysGen-generated data to models lacking native system-role support yields gains, and emphasizes that diverse, structured system messages improve LLM adaptability across varied user scenarios.

Abstract

System messages play a crucial role in interactions with large language models (LLMs), often serving as prompts to initiate conversations. Through system messages, users can assign specific roles, perform intended tasks, incorporate background information, and specify various output formats and communication styles. Despite such versatility, publicly available datasets often lack system messages and are subject to strict license constraints in industrial applications. Moreover, manually annotating system messages that align with user instructions is resource-intensive. In light of these challenges, we introduce SysGen, a pipeline for generating system messages that better align assistant responses with user instructions using existing supervised fine-tuning datasets that lack system messages. Training open-source models on SysGen data yields substantial improvements in both single-turn (Multifacet) and multi-turn (SysBench) conversation benchmarks. Notably, our method shows strong gains in shorter conversations, suggesting that it enhances early-stage interaction effectiveness. Our qualitative analysis further emphasizes the value of diverse and structured system messages in improving LLM adaptability across varied user scenarios.

Paper Structure

This paper contains 38 sections, 2 equations, 5 figures, 14 tables.

Figures (5)

  • Figure 1: Our SysGen pipeline produces two outputs: a generated system message and a newly generated answer. We manually select eight key functionalities of system messages and generate phrases with specific tags for original SFT datasets that lack system messages. Through our pipeline, we can generate assistant responses that are better aligned with system messages given user-oriented instructions.
  • Figure 2: Overall SysGen data construction pipeline. Our pipeline consists of four phases: (Phase 1) We gather SFT datasets which do not contain system messages and use open-source models to generate system messages with eight manually selected key functionality tags. (Phase 2) We then remove incorrectly generated tag tokens and reorganize tags with phrases in a predefined order for consistency. (Phase 3) We use an LLM-as-a-judge approach with self-model feedback to filter out empty, overly specific, and unnatural phrases. (Phase 4) We finally remove tags to create natural system messages and generate new responses along with the user instructions.
  • Figure 3: A statistic that verifies whether the newly generated answer is more suitable for the user query than the original answer. It records the probability that GPT-4o judges the newly generated answer to be better than the original answer (the probability should ideally exceed 50%).
  • Figure 4: We evaluate multi-turn conversations to test alignment with the system message at the inference level. After training on our SysGen-generated data, all the open-source models achieve significant improvement in the shorter rounds (R1-R3) of conversation. In the longer rounds (R4-R5), our method still demonstrates its effectiveness, but at a much lower rate than in the shorter rounds.
  • Figure 5: The GPT-4o LLM-as-a-judge results measuring the alignment between generated system messages and new assistant responses. We use 20 samples for each data source, for a total of 100 samples per model.
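
The tag handling described in the Figure 2 caption (Phase 2: drop incorrectly generated tags and reorder; Phase 4: strip tags to obtain a natural system message) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the tag names, their order, the `<<Tag>>` markup, and all function names are hypothetical, since the captions only state that there are eight manually selected tags in a predefined order. The LLM-based generation and filtering steps (Phases 1 and 3) are not modeled.

```python
import re

# ASSUMPTION: tag names, their order, and the <<Tag>> markup are illustrative.
# The captions above only say there are eight tags in a predefined order.
TAG_ORDER = ["Role", "Content", "Task", "Action",
             "Style", "Background", "Tool", "Format"]

# Matches "<<Tag>> phrase", where the phrase runs until the next "<".
TAG_PATTERN = re.compile(r"<<(\w+)>>\s*([^<]*)")


def parse_tagged(raw: str) -> dict:
    """Phase 2 (sketch): keep only known tags that carry a phrase.

    Unknown tags are treated as incorrectly generated and dropped.
    Empty phrases are skipped too; this is a crude stand-in and does not
    model the LLM-as-a-judge filtering of Phase 3.
    """
    phrases = {}
    for tag, phrase in TAG_PATTERN.findall(raw):
        phrase = phrase.strip()
        if tag in TAG_ORDER and phrase:
            phrases.setdefault(tag, phrase)  # keep first occurrence of a tag
    return phrases


def reorder(phrases: dict) -> list:
    """Phase 2 (sketch): arrange (tag, phrase) pairs in the predefined order."""
    return [(tag, phrases[tag]) for tag in TAG_ORDER if tag in phrases]


def to_system_message(ordered: list) -> str:
    """Phase 4 (sketch): drop the tags and join phrases into plain text."""
    return " ".join(phrase for _, phrase in ordered)


if __name__ == "__main__":
    raw = ("<<Task>> Summarize the article. <<Bogus>> ignore me "
           "<<Role>> You are a news editor. <<Style>>")
    ordered = reorder(parse_tagged(raw))
    print(to_system_message(ordered))
    # prints: You are a news editor. Summarize the article.
```

Keeping the order in a single list makes the "predefined order" explicit and lets the same list double as the set of valid tags, so a misspelled or invented tag is rejected without a separate check.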