CIFLEX: Contextual Instruction Flow for Sub-task Execution in Multi-Turn Interactions with a Single On-Device LLM

Juntae Lee, Jihwan Bang, Seunghan Yang, Simyung Chang

TL;DR

CIFLEX addresses the challenge of efficiently handling sub-tasks in long, multi-turn conversations on a single on-device LLM. It introduces a Contextual Instruction Flow with a main path and side paths that reuse a shared KV cache, attaching only task-specific prompts for sub-tasks and rolling back after execution. A hierarchical binary sub-task classification guides task routing on small models, and the authors provide two new multi-turn, multi-task datasets (TopiOCQA-Task+ and QReCC-Task+) for evaluation. Empirical results show substantial reductions in prefill computation and latency while maintaining task performance, enabling scalable on-device dialogue systems that support diverse sub-tasks. Overall, CIFLEX demonstrates practical efficiency and robustness for edge-device multi-task dialogue without multi-model overhead.
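
The main-path and side-path flow described above can be illustrated with a toy cache. This is a minimal sketch, not the authors' implementation: `ToyKVCache`, `run_side_path`, `naive_prefill_cost`, and the token lists are all hypothetical, and a list of tokens stands in for the per-token key/value tensors a real model would store. It only shows why attaching a sub-task instruction to a shared cache and rolling back costs fewer prefilled tokens than re-encoding the whole conversation.

```python
from dataclasses import dataclass, field


@dataclass
class ToyKVCache:
    """Stand-in for a real KV cache: one list entry per cached token."""
    entries: list = field(default_factory=list)
    prefilled: int = 0  # running count of tokens pushed through prefill

    def prefill(self, tokens):
        # A real model would compute keys/values here; we only record tokens.
        self.entries.extend(tokens)
        self.prefilled += len(tokens)

    def checkpoint(self):
        # Remember where the main path currently ends.
        return len(self.entries)

    def rollback(self, mark):
        # Drop everything that was appended on the side path.
        del self.entries[mark:]


def run_side_path(cache, instruction_tokens):
    """Attach a sub-task instruction to the shared cache, then roll back."""
    mark = cache.checkpoint()
    cache.prefill(instruction_tokens)  # only the task-specific prompt
    side_view = list(cache.entries)    # context the sub-task would decode from
    cache.rollback(mark)               # main path is restored unchanged
    return side_view


def naive_prefill_cost(history_tokens, instruction_tokens):
    # Without reuse, the whole history is re-encoded with the instruction.
    return len(history_tokens) + len(instruction_tokens)


history = ["sys", "u1", "a1", "u2"]    # hypothetical main-path tokens
rewrite_prompt = ["rewrite", "query"]  # hypothetical sub-task instruction

cache = ToyKVCache()
cache.prefill(history)                 # main path, paid once
before = cache.prefilled
side_view = run_side_path(cache, rewrite_prompt)
reuse_cost = cache.prefilled - before
naive_cost = naive_prefill_cost(history, rewrite_prompt)
print(f"prefilled with reuse: {reuse_cost}, naive: {naive_cost}")
```

In this toy setting the gap between the two costs grows with the length of the conversation history, since the naive cost includes the full history at every sub-task call while the reuse cost covers only the instruction.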

Abstract

We present CIFLEX (Contextual Instruction Flow for Sub-task Execution), a novel execution system for efficient sub-task handling in multi-turn interactions with a single on-device large language model (LLM). As LLMs become increasingly capable, a single model is expected to handle diverse sub-tasks that more effectively and comprehensively support answering user requests. A naive approach reprocesses the entire conversation context whenever it switches between the main task and a sub-task (e.g., query rewriting, summarization), incurring significant computational overhead. CIFLEX mitigates this overhead by reusing the key-value (KV) cache from the main task and injecting only task-specific instructions into isolated side paths. After sub-task execution, the model rolls back to the main path via the cached context, thereby avoiding redundant prefill computation. To support sub-task selection, we also develop a hierarchical classification strategy tailored for small-scale models, decomposing multi-choice decisions into binary ones. Experiments show that CIFLEX significantly reduces computational costs without degrading task performance, enabling scalable and efficient multi-task dialogue on-device.
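
The idea of decomposing a multi-choice decision into binary ones could look roughly like the sketch below. The abstract does not spell out the hierarchy, so the two levels used here (first "is any sub-task needed?", then one yes/no question per candidate) are an assumption, as are the names `route_subtask`, `keyword_stub`, and `TRIGGERS`. The stub answers by keyword matching purely so the example runs without a model; in practice the yes/no answer would come from the LLM.

```python
from typing import Callable, Optional, Sequence

# (question, user_turn) -> yes/no; a real system would query the LLM here.
YesNo = Callable[[str, str], bool]


def route_subtask(
    user_turn: str, subtasks: Sequence[str], ask_yes_no: YesNo
) -> Optional[str]:
    # Level 1 (assumed): one binary decision on whether any sub-task is needed.
    if not ask_yes_no("Is a sub-task needed before answering?", user_turn):
        return None
    # Level 2 (assumed): one binary decision per candidate sub-task,
    # instead of a single N-way multiple-choice question.
    for name in subtasks:
        if ask_yes_no(f"Is the sub-task '{name}' needed?", user_turn):
            return name
    return None


# Hypothetical trigger words, used only by the stub below.
TRIGGERS = {"query rewriting": "it", "summarization": "summarize"}


def keyword_stub(question: str, user_turn: str) -> bool:
    """Stand-in for the LLM: answers yes/no by simple keyword matching."""
    words = user_turn.lower().split()
    matched = [name for name, kw in TRIGGERS.items() if kw in words]
    for name in TRIGGERS:
        if f"'{name}'" in question:  # level-2 question about one sub-task
            return name in matched
    return bool(matched)             # level-1 question


SUBTASKS = ["query rewriting", "summarization"]
for turn in ["Summarize our chat so far", "What about it", "Hello there"]:
    print(turn, "->", route_subtask(turn, SUBTASKS, keyword_stub))
```

Each call to `ask_yes_no` is a two-way choice, which is the property the abstract attributes to the hierarchical strategy for small-scale models.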

Paper Structure

This paper contains 27 sections, 9 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: Overall framework of the proposed CIFLEX.
  • Figure 2: Total prefilled tokens for sub-task classification up to each turn with LLaMA3.1-Instruct (8B) on TopiOCQA-Task+.
  • Figure 3: Turn-wise prefilled tokens for main-task and sub-task execution up to each turn with LLaMA3.1-Instruct (8B) on TopiOCQA-Task+.
  • Figure 4: Turn-wise prefilled tokens for sub-task classification and for main-task and sub-task execution up to each turn with LLaMA3.1-Instruct (8B) on QReCC-Task+.
  • Figure 5: Prompt template for main-task execution.
  • ...and 6 more figures