BitMar: Low-Bit Multimodal Fusion with Episodic Memory for Edge Devices

Euhid Aman, Esteban Carlin, Hsing-Kuo Pao, Giovanni Beltrame, Ghaluh Indah Permata Sari, Yie-Tarng Chen

TL;DR

BitMar addresses the challenge of deploying cross-modal vision–language transformers on edge devices by introducing a compact, 1.58-bit quantized multimodal pipeline augmented with an external episodic memory. The architecture fuses quantized text and vision features, stores contextual information in a fixed memory, and injects retrieved context into a BitNet-based decoder with per-layer conditioning and attention sinks for long-context processing. Training combines cross-modal alignment, memory-consistency regularization, and adaptive control to maintain modality balance, while evaluation demonstrates strong efficiency and competitive performance on lightweight tasks. The work advances edge AI for multimodal understanding by showing that aggressive quantization, memory augmentation, and streaming attention can enable practical on-device reasoning with a small model footprint.

Abstract

Cross-attention transformers and other multimodal vision-language models excel at grounding and generation, but their large, full-precision backbones make them hard to deploy on edge devices. Memory-augmented architectures make better use of past context, yet few works pair them with aggressive, edge-oriented quantization. We introduce BitMar, a quantized multimodal transformer that adds an external, human-like episodic memory for effective image-text generation on resource-limited hardware. BitMar uses two 1.58-bit encoders, one for text (BitNet-style) and one for vision (DiNOv2-based), to create compact embeddings that are combined and used to query a fixed-size key-value episodic memory. During vector retrieval, the BitNet decoder applies per-layer conditioning, which increases the contextual relevance of the generated content. The decoder also employs attention sinks with a sliding-window mechanism to process long or streaming inputs under tight memory budgets. Together, per-layer conditioning and sliding-window attention achieve a strong quality-speed trade-off, delivering competitive captioning and multimodal understanding at low latency with a small model footprint. These characteristics make BitMar well-suited for edge deployment.
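
To make the "1.58-bit" figure concrete: a weight restricted to the three values {-1, 0, +1} carries log2(3) ≈ 1.58 bits of information. The sketch below shows absmean ternary quantization in the style of BitNet b1.58. It is an illustration under stated assumptions, not BitMar's actual implementation: the abstract does not specify the quantizer, and the function name `ternary_quantize`, the `eps` constant, and the sample weights are hypothetical.

```python
import math


def ternary_quantize(weights, eps=1e-8):
    """Absmean ternary quantization of a flat list of floats (assumed scheme).

    Returns (codes, scale): each code is -1, 0, or +1, and
    weights[i] is approximated by codes[i] * scale.
    """
    # Scale by the mean absolute value so typical weights land near +/-1.
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Round to the nearest integer, then clip into the ternary set.
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


# Hypothetical sample weights, for illustration only.
weights = [0.9, -0.05, 0.4, -1.2, 0.02, -0.6]
codes, scale = ternary_quantize(weights)

# Three possible values per weight means log2(3) bits of information each.
bits_per_weight = math.log2(3)
```

Small weights collapse to 0 and large ones saturate at ±1, so each matrix needs only ternary codes plus a single floating-point scale.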

Paper Structure

This paper contains 26 sections, 4 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: BitMar Architecture. The model processes multimodal inputs: text tokens and DiNOv2-compressed image features. Quantized encoders (1.58-bit) generate compact text and vision embeddings ($z$, $v$), which are fused via cross-modal attention into shared query representations ($Q$, $Q_{query}$). A sliding-window attention mechanism enables long-context processing. A fixed episodic memory matrix ($K \times C$) stores and retrieves multimodal context vectors through quantized read/write weights ($W$, $W_0$), supporting optional SD-card offloading for edge deployment.
  • Figure 2: Episodic Memory Activation Patterns. (a) Early training shows scattered and weak activations with minimal specialization. (b) Late training exhibits stronger and more differentiated activations, reflecting the emergence of structured memory representations.
  • Figure 3: Quantization effectiveness over training epochs.
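
The sliding-window attention shown in Figure 1, which the abstract pairs with attention sinks, can be illustrated by the set of token positions whose keys and values stay in the cache. This is a minimal sketch of the general attention-sink idea (keep the first few tokens plus a recent window). The function name `retained_positions` and the parameters `num_sinks` and `window` are hypothetical and are not taken from the paper.

```python
def retained_positions(seq_len, num_sinks=4, window=8):
    """Positions kept in the key/value cache: the first `num_sinks` tokens
    (the attention sinks) plus the most recent `window` tokens."""
    if seq_len <= num_sinks + window:
        # Everything still fits, so nothing is evicted yet.
        return list(range(seq_len))
    sinks = list(range(num_sinks))
    recent = list(range(seq_len - window, seq_len))
    return sinks + recent


# After 20 tokens with 2 sink tokens and a window of 5,
# only 7 positions remain cached.
cached = retained_positions(seq_len=20, num_sinks=2, window=5)
```

Under this scheme the cache size is bounded by `num_sinks + window` regardless of how long the stream runs, which is what makes long or streaming inputs feasible under a tight memory budget.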