DeepThink: Aligning Language Models with Domain-Specific User Intents

Yang Li, Mingxuan Luo, Yeyun Gong, Chen Lin, Jian Jiao, Yi Liu, Kaili Huang

TL;DR

This paper addresses the gap between synthetic instruction data and real user questions in domain-specific QA by introducing DeepThink, a framework that (1) generates seed questions mirroring authentic user queries, (2) simulates dual-role conversations to reveal hidden user intents, (3) refines answers using the conversational context and retrieved documents, and (4) trains with Retrieval-Augmented Supervised Fine-Tuning. On a real-user test set in the advertising domain, DeepThink surpasses a GPT-4-turbo+RAG assistant by an average of 7.92% across relevance, completeness, clarity, accuracy, and actionability, and it outperforms other data-synthesis baselines, with significant benefits from imitation-based seed data and iterative refinement. The work highlights the importance of grounding instruction data in realistic interactions and of using external knowledge during SFT to reduce hallucinations and better address user needs in vertical domains. Practically, DeepThink offers a scalable approach to tailoring LLMs for industry-specific QA tasks, with improved user satisfaction and more actionable guidance.

Abstract

Supervised fine-tuning with synthesized instructions has been a common practice for adapting LLMs to domain-specific QA tasks. However, the synthesized instructions deviate from real user questions and expected answers. This study proposes a novel framework called DeepThink to generate high-quality instructions. DeepThink first generates a few seed questions to mimic actual user questions, simulates conversations to uncover the hidden user needs, and refines the answer by conversational contexts and the retrieved documents for more comprehensive answers. Experiments demonstrate that DeepThink achieves an average performance improvement of 7.92% compared to a GPT-4-turbo+RAG-based assistant on the real user test set in the advertising domain across dimensions such as relevance, completeness, clarity, accuracy, and actionability.
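The data-synthesis steps described in the abstract can be illustrated with a rough Python sketch. This is not the authors' implementation: the `llm` and `retrieve` callables, the prompt wording, the `Instruction` record, the `n_turns` parameter, and the `to_sft_example` format are all assumptions introduced here to make the flow concrete (imitate a real question, simulate a conversation, refine the answer with the conversation and retrieved documents, then package the result with its documents for retrieval-augmented fine-tuning).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical interfaces; the summary does not specify these signatures.
LLM = Callable[[str], str]              # prompt -> generated text
Retriever = Callable[[str], List[str]]  # query -> retrieved documents


@dataclass
class Instruction:
    question: str
    answer: str
    documents: List[str] = field(default_factory=list)


def synthesize(real_questions: List[str], llm: LLM, retrieve: Retriever,
               n_turns: int = 2) -> List[Instruction]:
    """Sketch of the synthesis loop; prompts are illustrative placeholders."""
    data: List[Instruction] = []
    for real_q in real_questions:
        # Step 1: generate a seed question that imitates a real user question.
        seed = llm(f"Write a new question in the style of: {real_q}")

        # Step 2: simulate a conversation in which the model alternates
        # between the assistant role and the user role (follow-up questions).
        history: List[Tuple[str, str]] = []
        question = seed
        for _ in range(n_turns):
            docs = retrieve(question)
            answer = llm(f"Answer using {docs}: {question}")
            history.append((question, answer))
            question = llm(f"As the user, ask a follow-up to: {answer}")

        # Step 3: refine the answer to the seed question using the whole
        # conversation and the documents retrieved for the seed question.
        seed_docs = retrieve(seed)
        refined = llm(
            f"Refine the answer to '{seed}' given conversation {history} "
            f"and documents {seed_docs}"
        )
        data.append(Instruction(seed, refined, seed_docs))
    return data


def to_sft_example(inst: Instruction) -> Dict[str, str]:
    # Step 4 (assumed format): keep retrieved documents in the training
    # prompt so that fine-tuning is retrieval-augmented.
    context = "\n".join(inst.documents)
    return {
        "prompt": f"Documents:\n{context}\n\nQuestion: {inst.question}",
        "completion": inst.answer,
    }
```

In this sketch the refined answer, rather than the first-turn answer, becomes the training target, which reflects the idea that follow-up turns surface needs the seed question left implicit.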

Paper Structure

This paper contains 26 sections, 1 equation, 22 figures, 5 tables.

Figures (22)

  • Figure 1: Three phenomena on real-world advertising platforms
  • Figure 2: Performance comparison of DeepThink and GPT-4-turbo across five evaluation dimensions over different time spans ("Historic" and "Recent"). DeepThink performs better than GPT-4-turbo in relevance, completeness, clarity, accuracy, and actionability.
  • Figure 3: The framework of DeepThink
  • Figure 4: Human Preference Evaluation (WinRate models vs. GPT-4-turbo %)
  • Figure 5: Score distribution of the instructions
  • ...and 17 more figures