Fragile Mastery: Are Domain-Specific Trade-Offs Undermining On-Device Language Models?

Basab Jha, Firoj Paudel

TL;DR

The paper addresses the brittleness of domain-specific fine-tuning in on-device language models and proposes the Generalized Edge Model (GEM) to balance specialization with cross-domain robustness. GEM integrates a Dynamic Token Router, Sparse Cross-Attention Router (SCAR), Hybrid Quantization, and Adaptive Knowledge Preservation, delivering sub-100ms latency while achieving a cross-domain F1 of 0.89 and a 7% improvement over GPT-4 Lite on general tasks. The study introduces new metrics—Domain Specialization Index (DSI), Generalization Gap (GG), and Cross-Domain Transfer Ratio (CDTR)—and demonstrates GEM’s ability to reduce catastrophic forgetting by about 43% across 47 benchmarks spanning eight domains. These findings underscore the feasibility of robust, domain-adaptive edge models and offer hardware-informed guidance for deploying ODLMs in real-world edge environments.

Abstract

Deploying on-device language models (ODLMs) on resource-constrained edge devices is a multi-dimensional problem that requires balancing computational efficiency, memory, power usage, and linguistic capacity across heterogeneous tasks. This study investigates the trade-offs between domain-specific optimization and cross-domain robustness and proposes the Generalized Edge Model (GEM), a new architecture that aims to balance specialization and generalization. Evaluating 47 benchmarks across eight domains--healthcare, law, finance, STEM, commonsense, conversational AI, multilingual, and domain-adaptive tasks--we show that conventional optimization techniques decrease target-task perplexity by 18-25% but cause a steep decline in general-task performance, with F1 scores decreasing by 12-29%, as reported by Liu et al. GEM employs a Sparse Cross-Attention Router (SCAR) to dynamically allocate computation across a variable number of computing resources, achieving a cross-domain F1 of 0.89 at less than 100 ms latency across Raspberry Pi 4, Pixel 6, iPhone 13, and custom neural processing units (NPUs). Compared to GPT-4 Lite, GEM improves general-task performance by 7% while maintaining parity in domain-specific performance. We propose three new metrics--Domain Specialization Index (DSI), Generalization Gap (GG), and Cross-Domain Transfer Ratio (CDTR)--which show a strong correlation between model compression intensity and brittleness.

Paper Structure

This paper contains 30 sections, 2 theorems, 21 equations, 5 figures, 5 tables, 1 algorithm.

Key Result

Theorem 1

The Generalization Gap satisfies:

Figures (5)

  • Figure 1: Historical evolution of edge AI model complexity, plotted on a logarithmic scale of parameter counts from 2005 to 2025, reflecting the transition from simple perceptrons to advanced ODLMs like GEM.
  • Figure 2: Probability density of F1 scores for the healthcare chatbot, showing a tight peak at 0.95 for in-domain tasks and a broader, lower peak at 0.40 for out-of-domain tasks, illustrating fragile mastery.
  • Figure 3: Scatter plot of in-domain vs. out-of-domain F1 scores for three models, highlighting the variability in generalization gaps.
  • Figure 4: Comprehensive GEM architecture, illustrating token flow from input through routing, SCAR, quantization, to output, with example processing of a mixed-domain sentence.
  • Figure 5: DSI vs. GG comparison between GEM and MobileBERT across domains, showing GEM’s superior generalization.
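
As a rough illustration of the in-domain versus out-of-domain contrast that Figure 2 describes, the sketch below computes a simple absolute F1 gap from the two peak values quoted in that caption (0.95 and 0.40). The paper's formal definition of the Generalization Gap is not reproduced on this page, so the difference-based formula and the function name `f1_gap` are illustrative assumptions, not the authors' metric.

```python
def f1_gap(in_domain_f1: float, out_of_domain_f1: float) -> float:
    """Absolute drop in F1 from in-domain to out-of-domain tasks.

    Assumption: a plain difference of F1 scores. This is NOT necessarily
    the paper's Generalization Gap (GG), whose definition is not shown here.
    """
    scores = (("in_domain_f1", in_domain_f1), ("out_of_domain_f1", out_of_domain_f1))
    for name, value in scores:
        # F1 is a score in [0, 1]; reject anything outside that range.
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return in_domain_f1 - out_of_domain_f1


# Peak values taken from the Figure 2 caption (healthcare chatbot).
gap = f1_gap(0.95, 0.40)
print(f"F1 gap: {gap:.2f}")
```

Under this assumed definition, a larger value means the model loses more F1 when it leaves its specialized domain, which is the "fragile mastery" pattern the caption points to.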

Theorems & Definitions (4)

  • Theorem 1
  • Proof 1
  • Theorem 2
  • Proof 2