Fragile Mastery: Are Domain-Specific Trade-Offs Undermining On-Device Language Models?
Basab Jha, Firoj Paudel
TL;DR
The paper addresses the brittleness of domain-specific fine-tuning in on-device language models (ODLMs) and proposes the Generalized Edge Model (GEM) to balance specialization with cross-domain robustness. GEM integrates a Dynamic Token Router, a Sparse Cross-Attention Router (SCAR), Hybrid Quantization, and Adaptive Knowledge Preservation, delivering sub-100 ms latency while achieving a cross-domain F1 of 0.89 and a 7% improvement over GPT-4 Lite on general tasks. The study introduces three new metrics, the Domain Specialization Index (DSI), Generalization Gap (GG), and Cross-Domain Transfer Ratio (CDTR), and reports that GEM reduces catastrophic forgetting by about 43% across 47 benchmarks spanning eight domains. These findings support the feasibility of robust, domain-adaptive edge models and offer hardware-informed guidance for deploying ODLMs in real-world edge environments.
Abstract
Deploying on-device language models (ODLMs) on resource-constrained edge devices is a multi-dimensional problem that requires balancing computational efficiency, memory, power usage, and linguistic capability across heterogeneous tasks. This study investigates the trade-offs between domain-specific optimization and cross-domain robustness and proposes the Generalized Edge Model (GEM), a new architecture that aims to balance specialization and generalization. Using 47 benchmarks across eight domains (healthcare, law, finance, STEM, commonsense, conversational AI, multilingual, and domain-adaptive tasks), we show that conventional optimization techniques decrease target-task perplexity by 18-25% but cause a sharp decline in general-task performance, with F1 scores decreasing by 12-29%, as reported by Liu et al. GEM employs a Sparse Cross-Attention Router (SCAR) to dynamically allocate computation across a variable number of computing resources, achieving a cross-domain F1 score of 0.89 at under 100 ms latency on Raspberry Pi 4, Pixel 6, iPhone 13, and custom neural processing units (NPUs). Compared to GPT-4 Lite, GEM improves general-task performance by 7% while maintaining parity in domain-specific performance. We propose three new metrics, the Domain Specialization Index (DSI), Generalization Gap (GG), and Cross-Domain Transfer Ratio (CDTR), which reveal a strong correlation between model compression intensity and brittleness.
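To make the reported trade-off concrete, the following minimal sketch shows how percentage changes of this kind are computed as relative changes against a baseline. The function name `relative_change` and all numeric values are hypothetical, chosen only to fall inside the ranges quoted above; they are not results from the paper, and the sketch does not reproduce the paper's DSI, GG, or CDTR definitions, which the abstract does not state.

```python
def relative_change(before: float, after: float) -> float:
    """Signed relative change from `before` to `after` (-0.20 means a 20% drop)."""
    if before == 0:
        raise ValueError("baseline must be non-zero")
    return (after - before) / before


# Hypothetical values for illustration only (not taken from the paper).
# Lower perplexity is better, so a negative change on the target task is a gain.
ppl_change = relative_change(before=20.0, after=16.0)

# Higher F1 is better, so a negative change on general tasks is a loss.
f1_change = relative_change(before=0.80, after=0.64)

print(f"target-task perplexity change: {ppl_change:+.0%}")
print(f"general-task F1 change:        {f1_change:+.0%}")
```

In this illustration both quantities fall by about 20%, but the signs mean opposite things: the perplexity drop is an improvement on the target domain, while the F1 drop is a degradation on general tasks. That asymmetry is the brittleness the abstract describes.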
