MemLoRA: Distilling Expert Adapters for On-Device Memory Systems

Massimo Bini, Ondrej Bohdal, Umberto Michieli, Zeynep Akata, Mete Ozay, Taha Ceritli

TL;DR

MemLoRA introduces an on-device memory system that replaces large LLMs with small language models augmented by task-specific adapters for knowledge extraction, memory update, and generation. The approach extends to multimodal memory with MemLoRA-V, which adds native visual understanding through a vision adapter and is evaluated on a VQA augmentation of the LoCoMo benchmark. Across text-only tasks, MemLoRA matches or surpasses models tens of times larger, while delivering substantial efficiency gains suitable for privacy-preserving on-device deployment. The multimodal extension demonstrates strong VQA performance and preserved text capabilities, underscoring MemLoRA's practicality for private, offline AI on mobile and edge devices.

Abstract

Memory-augmented Large Language Models (LLMs) have demonstrated remarkable consistency during prolonged dialogues by storing relevant memories and incorporating them as context. Such memory-based personalization is also key in on-device settings that allow users to keep their conversations and data private. However, memory-augmented systems typically rely on LLMs that are too costly for local on-device deployment. Even though Small Language Models (SLMs) are more suitable for on-device inference than LLMs, they cannot achieve sufficient performance. Additionally, these LLM-based systems lack native visual capabilities, limiting their applicability in multimodal contexts. In this paper, we introduce (i) MemLoRA, a novel memory system that enables local deployment by equipping SLMs with specialized memory adapters, and (ii) its vision extension MemLoRA-V, which integrates small Vision-Language Models (SVLMs) into memory systems, enabling native visual understanding. Following knowledge distillation principles, each adapter is trained separately for a specific memory operation: knowledge extraction, memory update, or memory-augmented generation. Equipped with memory adapters, small models enable accurate on-device memory operations without cloud dependency. On text-only operations, MemLoRA outperforms 10$\times$ larger baseline models (e.g., Gemma2-27B) and achieves performance comparable to 60$\times$ larger models (e.g., GPT-OSS-120B) on the LoCoMo benchmark. To evaluate visual understanding operations, we extend LoCoMo with challenging Visual Question Answering tasks that require direct visual reasoning. On this extension, our VLM-integrated MemLoRA-V shows massive improvements over caption-based approaches (81.3 vs. 23.7 accuracy) while keeping strong performance in text-based tasks, demonstrating the efficacy of our method in multimodal contexts.

Paper Structure

This paper contains 27 sections, 1 equation, 4 figures, 9 tables, 2 algorithms.

Figures (4)

  • Figure 1: Overview. We employ specialized LoRA adapters to enable small (vision) language models to perform memory operations for on-device deployment. The base model dynamically switches between expert adapters, each trained for a distinct stage: (1) knowledge extraction, (2) memory update, (3) memory-augmented generation. In the last stage, the model can switch between a text-only and a multimodal adapter, depending on the input. By specializing each adapter for its specific operation, MemLoRA(-V) achieves performance comparable to models 10-60x larger while enabling efficient local execution without cloud API dependencies.
  • Figure 2: Training Pipeline (Extraction LoRA). We first generate outputs for the specific memory-related task with a larger model (the teacher). The raw output is then cleaned and used as the target for training the LoRA parameters of a small model (the student).
  • Figure 3: Our augmentation of LoCoMo adds challenging VQA tasks that involve (a) counting objects, (b) identifying colors, and (c) answering questions about unusual objects.
  • Figure C4: VQA Examples. LoCoMo images, each with three generated questions (Q), InternVL3-78B answers (A), and predictions from InternVL3-2B without (P(IVL2B)) and with (P(IVL2B+Exp)) expert adapters.
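
The adapter switching described in the Figure 1 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the adapter identifiers, the `StubModel` class, and the `answer_with_memory` helper are all hypothetical, and the model is a stub that only records which adapter would be active at each stage instead of running a real language model.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical adapter identifiers, one per memory operation.
# The generation stage has a text-only and a multimodal variant.
ADAPTERS = {
    "extract": "extraction_lora",
    "update": "update_lora",
    "generate_text": "generation_lora_text",
    "generate_vision": "generation_lora_vision",
}


def select_adapter(stage: str, has_image: bool = False) -> str:
    """Return the adapter for a pipeline stage; only generation depends on the input."""
    if stage == "generate":
        key = "generate_vision" if has_image else "generate_text"
    elif stage in ("extract", "update"):
        key = stage
    else:
        raise ValueError(f"unknown stage: {stage!r}")
    return ADAPTERS[key]


@dataclass
class StubModel:
    """Stand-in for a small base model with swappable adapters (assumption)."""

    active_adapter: Optional[str] = None
    switches: List[str] = field(default_factory=list)

    def run(self, stage: str, prompt: str, has_image: bool = False) -> str:
        # Activate the expert adapter for this stage before "generating".
        self.active_adapter = select_adapter(stage, has_image)
        self.switches.append(self.active_adapter)
        # A real system would call the adapted model here; the stub echoes the prompt.
        return f"<{self.active_adapter}> {prompt}"


def answer_with_memory(model: StubModel, dialogue: str, question: str,
                       has_image: bool = False) -> str:
    """Run the three stages in order: extraction, memory update, generation."""
    facts = model.run("extract", dialogue)
    memory = model.run("update", facts)
    return model.run("generate", f"{memory} | {question}", has_image)


if __name__ == "__main__":
    model = StubModel()
    answer_with_memory(model, "I adopted a dog last week.", "What pet do I have?")
    print(model.switches)
```

The point of the sketch is that a single base model stays loaded while only the active adapter changes between stages, which is what keeps the memory footprint small in this kind of design.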