Hybrid-RACA: Hybrid Retrieval-Augmented Composition Assistance for Real-time Text Prediction

Menglin Xia, Xuchao Zhang, Camille Couturier, Guoqing Zheng, Saravan Rajmohan, Victor Ruhle

TL;DR

Hybrid-RACA addresses the latency-cost tension in retrieval-augmented text prediction by pairing a small on-device predictor with cloud-generated memory. It introduces an asynchronous memory-update mechanism driven by an augmentation coordinator and a cloud memory generator that compresses retrieved documents into concise takeaways, forming memory $m_t$ for real-time use. The client is instruction-tuned to leverage this memory, with a loss that aligns its outputs to LLM-generated targets, achieving strong utility on multiple datasets while keeping latency low. The approach demonstrates practical benefits for edge-based real-time composition and suggests broader applicability to hybrid edge-cloud AI systems.

Abstract

Large language models (LLMs) enhanced with retrieval augmentation have shown great performance in many applications. However, the computational demands of these models pose a challenge when applying them to real-time tasks, such as composition assistance. To address this, we propose Hybrid Retrieval-Augmented Composition Assistance (Hybrid-RACA), a novel system for real-time text prediction that efficiently combines a cloud-based LLM with a smaller client-side model through retrieval-augmented memory. This integration enables the client model to generate better responses, benefiting from the LLM's capabilities and cloud-based data. Meanwhile, via a novel asynchronous memory update mechanism, the client model can deliver real-time completions to user inputs without waiting for responses from the cloud. Our experiments on five datasets demonstrate that Hybrid-RACA offers strong performance while maintaining low latency.
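
The asynchronous memory update described above can be illustrated with a minimal sketch. This is not the authors' implementation: the class `HybridClient`, its methods, the simulated cloud delay, and the "skip if a request is already in flight" policy are all hypothetical stand-ins chosen for illustration. The only idea taken from the text is that the client predicts immediately from whatever memory it currently holds, while the cloud refreshes that memory in the background.

```python
import asyncio


class HybridClient:
    """Toy client: predicts from its current memory, never blocks on the cloud."""

    def __init__(self, cloud_delay: float = 0.05) -> None:
        self.memory = ""              # latest cloud-generated memory (empty at start)
        self.cloud_delay = cloud_delay
        self._pending = None          # in-flight background update, if any

    async def _cloud_generate_memory(self, context: str) -> str:
        # Hypothetical stand-in for cloud retrieval + LLM memory generation.
        await asyncio.sleep(self.cloud_delay)
        return f"takeaways for: {context}"

    async def _refresh_memory(self, context: str) -> None:
        self.memory = await self._cloud_generate_memory(context)

    def request_memory_update(self, context: str) -> None:
        # Assumed policy: start a background update unless one is still running.
        if self._pending is None or self._pending.done():
            self._pending = asyncio.create_task(self._refresh_memory(context))

    def predict(self, context: str) -> str:
        # Hypothetical stand-in for the small client model; returns at once
        # using whatever memory is available right now.
        return f"completion({context!r}, memory={self.memory!r})"


async def demo():
    client = HybridClient(cloud_delay=0.01)
    client.request_memory_update("The meeting")
    before = client.predict("The meeting")      # cloud has not answered yet
    await client._pending                       # demo only: wait for the update
    after = client.predict("The meeting is")    # now uses the refreshed memory
    return before, after


if __name__ == "__main__":
    for line in asyncio.run(demo()):
        print(line)
```

In this sketch the first prediction is produced with empty memory and the second with memory generated from an earlier context, which mirrors the trade-off the abstract describes: completions are never delayed by the cloud, at the cost of sometimes using slightly stale memory.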

Paper Structure

This paper contains 24 sections, 1 equation, 6 figures, and 8 tables.

Figures (6)

  • Figure 1: Overview of the Hybrid-RACA system, a hybrid system for composition assistance. The top left box represents the writing interface. The framework has four main components: the augmentation coordinator and client model on the client side (left), and the retriever and LLM-based memory generator in the cloud (right).
  • Figure 2: Process of the augmentation coordinator
  • Figure 3: Example of constructing instruction-tuning data
  • Figure 4: Inference latency for client inference, retrieval and memory generation on multiple devices
  • Figure 5: Hybrid-RACA performance with asynchronous memory update.
  • ...and 1 more figure