BitMar: Low-Bit Multimodal Fusion with Episodic Memory for Edge Devices
Euhid Aman, Esteban Carlin, Hsing-Kuo Pao, Giovanni Beltrame, Ghaluh Indah Permata Sari, Yie-Tarng Chen
TL;DR
BitMar addresses the challenge of deploying cross-modal vision–language transformers on edge devices by introducing a compact, 1.58-bit quantized multimodal pipeline augmented with an external episodic memory. The architecture fuses quantized text and vision features, stores contextual information in a fixed-size memory, and injects the retrieved context into a BitNet-based decoder that uses per-layer conditioning and attention sinks for long-context processing. Training combines cross-modal alignment, memory-consistency regularization, and adaptive control to maintain modality balance, and the evaluation reports high efficiency with competitive performance on lightweight tasks. The work advances edge AI for multimodal understanding by showing that aggressive quantization, memory augmentation, and streaming attention together can enable practical on-device reasoning with a small model footprint.
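As a rough illustration of what "1.58-bit" means, the sketch below quantizes a weight vector to the three values {-1, 0, +1} with one shared scale, so each weight carries log2(3), or about 1.58, bits. It assumes the absmean scheme used by BitNet b1.58; whether BitMar uses exactly this rule is an assumption, and the function names are hypothetical.

```python
import math


def ternary_quantize(weights, eps=1e-8):
    """Absmean ternary quantization (assumed scheme, not confirmed for BitMar).

    Returns integer codes in {-1, 0, +1} and the shared scale.
    """
    # Shared scale: mean absolute value of the weights.
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Round each scaled weight to the nearest integer, then clip to [-1, 1].
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Reconstruct approximate weights from ternary codes."""
    return [c * scale for c in codes]


# Information content of one ternary weight, in bits.
BITS_PER_WEIGHT = math.log2(3)

codes, scale = ternary_quantize([0.9, -0.05, -1.2, 0.3])
approx = dequantize(codes, scale)
```

Small weights collapse to 0 and large ones saturate at the scale, which is why the stored model needs only the ternary codes plus one scale per tensor.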
Abstract
Cross-attention transformers and other multimodal vision-language models excel at grounding and generation, but their large, full-precision backbones make them difficult to deploy on edge devices. Memory-augmented architectures make better use of past context, yet they are rarely paired with aggressive, edge-oriented quantization. We introduce BitMar, a quantized multimodal transformer that incorporates an external, human-like episodic memory for effective image-text generation on resource-limited hardware. BitMar uses two 1.58-bit encoders, one for text (BitNet-style) and one for vision (DINOv2-based), to produce compact embeddings that are combined and used to query a fixed-size key-value episodic memory. The retrieved memory vectors then condition the BitNet decoder at every layer, which increases the contextual relevance of the generated content. The decoder also employs attention sinks with a sliding-window mechanism to process long or streaming inputs under tight memory budgets. Together, per-layer conditioning and sliding-window attention achieve a favorable quality-speed trade-off, delivering competitive captioning and multimodal understanding at low latency with a small model footprint. These characteristics make BitMar well-suited for edge deployment.
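The attention-sink and sliding-window mechanism can be pictured as a rule for which cached token positions stay visible to the decoder. The sketch below is a minimal, hypothetical version of that rule: it keeps the first few "sink" tokens plus the most recent window, so the cache never exceeds a fixed size. The function name and the `n_sink` and `window` defaults are assumptions, not values from the paper.

```python
def kept_positions(seq_len, n_sink=4, window=8):
    """Return the token positions retained in the key-value cache.

    Keeps the first `n_sink` positions (attention sinks) and the last
    `window` positions (sliding window); everything in between is evicted.
    """
    if seq_len <= n_sink + window:
        # Short sequences fit entirely; nothing is evicted.
        return list(range(seq_len))
    sinks = list(range(n_sink))
    recent = list(range(seq_len - window, seq_len))
    return sinks + recent


# With 2 sink tokens and a window of 3, a 20-token stream keeps 5 positions.
positions = kept_positions(20, n_sink=2, window=3)
```

Because the retained set has at most `n_sink + window` entries regardless of stream length, memory use stays constant as the input grows.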
