Instruction Tuning with and without Context: Behavioral Shifts and Downstream Impact

Hyunji Lee, Seunghyun Yoon, Yunjae Won, Hanseok Oh, Geewook Kim, Trung Bui, Franck Dernoncourt, Elias Stengel-Eskin, Mohit Bansal, Minjoon Seo

TL;DR

This paper compares LLMs instruction-tuned on context-augmented data (Ctx-LLM) with LLMs instruction-tuned on context-free data (NoCtx-LLM) to understand how context affects knowledge use and downstream performance. In the text domain, context-augmented training strengthens grounding and shifts reliance away from parametric memory toward the provided evidence. Used as the backbone of a vision-language model, a Ctx-LLM also reduces hallucination and improves visual grounding. For deployments where context availability varies, the authors compare training a single model on a mixture of both data types against routing inputs between two specialized models, and find that routing yields more robust overall performance because it better preserves the two models' complementary strengths.

Abstract

Instruction tuning is a widely used approach to improve the instruction-following ability of large language models (LLMs). Instruction-tuning datasets typically include a mixture of context-augmented and context-free examples, yet prior work has largely combined these data types without examining their distinct effects. In this paper, we investigate how training LLMs with or without context affects model behavior and downstream performance. First, in the text domain, we show that LLMs trained with context attend more strongly to the provided knowledge, achieving better grounding. We also observe that context-augmented training shifts how LLMs use knowledge: models store and leverage less parametric knowledge and instead depend more on the provided context. Second, we observe that using an LLM trained with context-augmented data as the backbone for vision-language models reduces hallucination and improves grounding in the visual domain. Finally, we explore practical strategies for real-world deployments where context availability varies. We show that maintaining separate context-augmented and context-free models and routing inputs between them yields more robust overall performance than training a single mixed model, as it better preserves their complementary strengths.

Paper Structure

This paper contains 56 sections, 2 equations, 16 figures, 9 tables.

Figures (16)

  • Figure 1: Avg. rate at which the model attends most to each input segment: context, generated response, or other (e.g., system prompts) during generation.
  • Figure 2: Average Performance of Ctx-LLM and NoCtx-LLM across inference setups. The x-axis indicates whether context is provided at inference: With Context uses external context, while Without Context requires the model to rely on its own parametric knowledge.
  • Figure 3: Accuracy by models trained on different datasets (Models) and evaluated under different inference conditions (Inference condition). Original (Ori) refers to knowledge aligned with the model's parametric knowledge, while Counterfactual (CF) denotes counterfactual knowledge. "+" or Ctx indicates that context is provided; "-" or NoCtx indicates no context is provided during training or inference.
  • Figure 4: Precision, Recall, and F1 score on the ImageInWords fine-grained captioning task, evaluated with CapMAS, comparing Ctx-VLM and NoCtx-VLM using Llama3.1 8B as base model.
  • Figure 5: Avg. accuracy (y-axis) of atomic facts from generated responses as a function of their position (x-axis). Error bars indicate variance.
  • ...and 11 more figures