Uncertainty Makes It Stable: Curiosity-Driven Quantized Mixture-of-Experts

Sebastián Andrés Cajas Ordóñez, Luis Fernando Torres Torres, Mackenzie J. Meni, Carlos Andrés Duran Paredes, Eric Arazo, Cristian Bosch, Ricardo Simon Carbajo, Yuan Lai, Leo Anthony Celi

TL;DR

This work tackles the challenge of deploying accurate yet latency-stable neural models on edge devices by introducing a curiosity-driven quantized Mixture-of-Experts that leverages heterogeneous quantization (BitNet ternary, BitLinear 1–16 bits, and PTQ) and Bayesian uncertainty-based routing. The approach yields a practical operating point around $4$-bit quantization, achieving roughly $99\%$ of full-precision accuracy with substantial compression and energy savings, while dramatically improving latency stability via curiosity-driven routing (82% reduction in latency variance, $p=0.008$). Across ESC-50, Quinn, and UrbanSound8K, simple 4-bit and 8-bit quantized models closely match full-precision performance, whereas MoE architectures introduce latency overhead without clear accuracy gains; nevertheless, curiosity routing offers predictable, robust edge inference at scale where deployment emissions dominate training costs. The combination of heterogeneous quantization and epistemic-uncertainty-guided routing yields energy- and carbon-aware, edge-friendly models, with clear deployment guidelines and a path toward hardware-aware optimizations.

Abstract

Deploying deep neural networks on resource-constrained devices faces two critical challenges: maintaining accuracy under aggressive quantization and ensuring predictable inference latency. We present a curiosity-driven quantized Mixture-of-Experts framework that addresses both through Bayesian epistemic uncertainty-based routing across heterogeneous experts (BitNet ternary, 1-16 bit BitLinear, post-training quantization). Evaluated on audio classification benchmarks (ESC-50, Quinn, UrbanSound8K), our 4-bit quantization maintains 99.9 percent of 16-bit accuracy (0.858 vs 0.859 F1) with 4x compression and 41 percent energy savings versus 8-bit. Crucially, curiosity-driven routing reduces MoE latency variance by 82 percent (p = 0.008, Levene's test) from 230 ms to 29 ms standard deviation, enabling stable inference for battery-constrained devices. Statistical analysis confirms 4-bit/8-bit achieve practical equivalence with full precision (p > 0.05), while MoE architectures introduce 11 percent latency overhead (p < 0.001) without accuracy gains. At scale, deployment emissions dominate training by 10,000x for models serving more than 1,000 inferences, making inference efficiency critical. Our information-theoretic routing demonstrates that adaptive quantization yields accurate (0.858 F1, 1.2M params), energy-efficient (3.87 F1/mJ), and predictable edge models, with simple 4-bit quantized architectures outperforming complex MoE for most deployments.
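
As a quick sanity check on two of the headline figures, the short sketch below recomputes the F1 retention and the nominal compression ratio from the numbers quoted in the abstract. It assumes the 4x compression is measured against the 16-bit baseline and counts weight bit-widths only (no scales, zero points, or unquantized layers); the variable names are illustrative and not taken from the paper.

```python
# Figures quoted in the abstract.
f1_4bit = 0.858
f1_16bit = 0.859

# Assumption: compression is measured against the 16-bit baseline.
bits_baseline = 16
bits_quantized = 4

# Fraction of the 16-bit F1 score retained by the 4-bit model.
retention = f1_4bit / f1_16bit

# Nominal compression from bit-widths alone (ignores quantization metadata).
compression = bits_baseline / bits_quantized

print(f"F1 retention: {retention:.1%}")
print(f"Nominal compression: {compression:.0f}x")
```

Under these assumptions the ratio of the two F1 scores rounds to the 99.9 percent stated above, and the bit-width ratio matches the stated 4x.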

Paper Structure

This paper contains 29 sections, 8 equations, 1 figure, 7 tables.

Figures (1)

  • Figure 1: Curiosity-driven routing architecture. Audio embeddings are processed through a Bayesian router that computes epistemic uncertainty via Monte Carlo dropout to select top-k heterogeneous quantized experts. Expert outputs are aggregated, with exploration encouraged under high uncertainty (Eq. 8).
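
The routing idea in the Figure 1 caption can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the linear router, dropout applied directly to the input embedding, the mutual-information estimate of epistemic uncertainty, and the `beta`-weighted exploration bonus are all assumptions made for illustration, and the paper's Eq. 8 is not reproduced here. The function and parameter names (`mc_dropout_route`, `drop_p`, `beta`) are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mc_dropout_route(embedding, weights, k=2, passes=20, drop_p=0.2,
                     beta=1.0, rng=None):
    """Return (top-k expert indices, mean gate probabilities, uncertainty).

    Hypothetical sketch: `weights` holds one row of router weights per expert.
    """
    rng = rng or random.Random(0)
    keep = 1.0 - drop_p
    samples = []
    for _ in range(passes):
        # Inverted dropout: zero each feature with probability drop_p and
        # rescale the survivors so the expected input is unchanged.
        dropped = [x / keep if rng.random() < keep else 0.0 for x in embedding]
        logits = [sum(w * x for w, x in zip(row, dropped)) for row in weights]
        samples.append(softmax(logits))

    n_experts = len(weights)
    mean_p = [sum(s[i] for s in samples) / passes for i in range(n_experts)]

    # Epistemic uncertainty as mutual information between the routing decision
    # and the dropout mask: entropy of the mean minus the mean of the entropies.
    uncertainty = entropy(mean_p) - sum(entropy(s) for s in samples) / passes

    # Assumed exploration bonus: experts whose gate probability varies more
    # across passes receive a higher score (a stand-in for the paper's Eq. 8).
    std_p = [
        math.sqrt(sum((s[i] - mean_p[i]) ** 2 for s in samples) / passes)
        for i in range(n_experts)
    ]
    scores = [mean_p[i] + beta * std_p[i] for i in range(n_experts)]
    top_k = sorted(range(n_experts), key=lambda i: scores[i], reverse=True)[:k]
    return top_k, mean_p, uncertainty


# Toy example: four experts, a three-dimensional embedding.
weights = [
    [0.5, -0.2, 0.1],
    [0.1, 0.4, -0.3],
    [-0.3, 0.2, 0.6],
    [0.0, 0.1, 0.2],
]
experts, probs, uncertainty = mc_dropout_route([1.0, 0.5, -0.5], weights, k=2)
print(experts, uncertainty)
```

With `drop_p=0.0` every stochastic pass is identical, so the uncertainty estimate collapses to zero and the selection reduces to plain top-k gating. In a full model the selected experts' outputs would then be aggregated, as the caption describes.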