Efficient On-Device Agents via Adaptive Context Management

Sanidhya Vijayvargiya, Rahul Lokesh

TL;DR

This paper tackles the memory bottleneck of on-device AI agents by introducing a context-efficient framework that compresses conversational history into a Context State Object (CSO) via a dual-adapter memory system. It couples a token-efficient tool-schema representation with a just-in-time schema-passing mechanism to reduce both the initial context and its growth while preserving or improving task performance on complex, multi-turn tasks. The approach is instantiated on a 3B-parameter SLM and validated against a conventional baseline, showing more than a 6x reduction in initial context and a 10x–25x reduction in context growth rate, enabling persistent, capable on-device operation with local tools and cloud delegation when needed. The work offers a practical pathway toward private, low-latency AI assistants by balancing on-device computation with selective cloud reasoning, and it highlights concrete design patterns for memory management, tool orchestration, and data generation in resource-constrained environments.

Abstract

On-device AI agents offer the potential for personalized, low-latency assistance, but their deployment is fundamentally constrained by limited memory capacity, which restricts usable context. This reduced practical context window creates a trade-off between supporting rich, stateful interactions with complex tool capabilities and maintaining on-device feasibility. We break this trade-off with a framework for context-efficient on-device agents, driven by three synergistic optimizations: (1) a dynamic memory system using specialized LoRA adapters to distill conversational history into a compressed and structured Context State Object; (2) a minimalist serialization format for tool schemas to minimize token overhead per tool; and (3) a just-in-time schema-passing mechanism that loads full tool definitions only upon tool selection. We instantiate this framework by adapting a 3B-parameter SLM to context-efficient trajectories and rigorously evaluate it against a conventional baseline on complex user tasks. Our agent matches or exceeds the performance of the conventional baseline while dramatically compressing context, achieving more than a 6-fold reduction in initial system prompt context and a 10- to 25-fold reduction in context growth rate, depending on interaction verbosity. These results demonstrate that strategic context management is key to unlocking capable and persistent on-device AI.
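Optimizations (2) and (3) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the tool names, the schema fields, the one-line-per-tool index format, and the use of character counts as a rough stand-in for token counts. None of it is the paper's actual serialization format or implementation.

```python
import json

# Hypothetical tool registry; names and fields are illustrative only.
TOOLS = {
    "send_email": {
        "description": "Send an email",
        "parameters": {
            "to": {"type": "string", "description": "Recipient address"},
            "subject": {"type": "string", "description": "Subject line"},
            "body": {"type": "string", "description": "Message text"},
        },
    },
    "set_reminder": {
        "description": "Create a reminder",
        "parameters": {
            "title": {"type": "string", "description": "Reminder text"},
            "time": {"type": "string", "description": "ISO 8601 timestamp"},
        },
    },
}


def conventional_prompt(tools):
    """Baseline: every full tool schema is placed in the initial context."""
    return json.dumps(tools)


def compact_index(tools):
    """Step 1: a minimal listing, one short line per tool (assumed format)."""
    return "\n".join(f"{name}: {spec['description']}" for name, spec in tools.items())


def full_schema(tools, name):
    """Step 2: the full definition of the one tool the agent selected."""
    return json.dumps({name: tools[name]})


baseline = conventional_prompt(TOOLS)
index = compact_index(TOOLS)

# The agent would pick a tool from the index; here the choice is hard-coded.
selected = "set_reminder"
step_two = full_schema(TOOLS, selected)

# Character counts as a crude proxy for token counts.
print(len(baseline), len(index), len(step_two))
```

In this sketch the initial context carries only the compact index, and the full schema of a tool enters the context only after that tool has been selected, so the per-tool overhead of unselected tools stays at one short line.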

Paper Structure

This paper contains 46 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: An unchecked long context (left) strains device resources, leading to execution failures like incorrect calls or gibberish. Our approach (right) creates an optimized context for reliable and efficient command execution.
  • Figure 2: System architecture of the on-device AI agent. The user interacts with the agent through a chat interface. The on-device agent processes the user's requests and performs internal tool invocations, utilizing a suite of local, on-device tools (e.g., Email, Gallery, Reminders) as well as a more powerful Cloud Agent for complex queries.
  • Figure 3: Our two context optimization techniques for on-device agents. (a) A state-tracking system maintains a compact history. (b) A two-step tool-call process reduces the overhead of tool schemas.
  • Figure 4: Averaged context input length over assistant turns for the Multi-Tool category in the evaluation runs. Shaded regions represent 95% Confidence Intervals.
  • Figure 5: Averaged context input length over assistant turns for the Cloud Delegation category in the evaluation runs. Shaded regions represent 95% Confidence Intervals.