A General Retrieval-Augmented Generation Framework for Multimodal Case-Based Reasoning Applications

Ofir Marom

TL;DR

This work presents MCBR-RAG, a general retrieval-augmented generation (RAG) framework for multimodal case-based reasoning (CBR). It formalizes how the non-text components of a case can be converted into text-based representations and paired with application-specific latent representations, so that similar cases can be retrieved and reused as context for an LLM. The author instantiates the framework in two domains, Math-24 and Backgammon, by training specialized text-generation and latent-representation models, and reports improved generation quality over a baseline LLM that receives no contextual information. The results suggest that RAG-enabled CBR can help in domains where cases comprise several data modalities. Overall, MCBR-RAG offers a blueprint for integrating multimodal data into RAG-enabled CBR pipelines.

Abstract

Case-based reasoning (CBR) is an experience-based approach to problem solving, where a repository of solved cases is adapted to solve new cases. Recent research shows that Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG) can support the Retrieve and Reuse stages of the CBR pipeline by retrieving similar cases and using them as additional context to an LLM query. Most studies have focused on text-only applications; however, in many real-world problems the components of a case are multimodal. In this paper, we present MCBR-RAG, a general RAG framework for multimodal CBR applications. The MCBR-RAG framework converts non-text case components into text-based representations, allowing it to: 1) learn application-specific latent representations that can be indexed for retrieval, and 2) enrich the query provided to the LLM by incorporating all case components for better context. We demonstrate MCBR-RAG's effectiveness through experiments conducted on a simplified Math-24 application and a more complex Backgammon application. Our empirical results show that MCBR-RAG improves generation quality compared to a baseline LLM with no contextual information provided.
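
The Retrieve and Reuse stages described above can be illustrated with a minimal sketch. It is not the paper's implementation: the case records, the two-dimensional latent vectors, the prompt layout, and the function names `cosine`, `retrieve`, and `build_prompt` are all assumptions chosen for illustration, and cosine similarity stands in for whatever similarity measure an application would actually use.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(query_vec, case_base, k=1):
    """Retrieve: return the k stored cases most similar to the query vector."""
    ranked = sorted(case_base,
                    key=lambda case: cosine(query_vec, case["latent"]),
                    reverse=True)
    return ranked[:k]

def build_prompt(problem_text, retrieved):
    """Reuse: place retrieved cases ahead of the new problem as LLM context."""
    context = "\n".join(
        f"Problem: {case['text']}\nSolution: {case['solution']}"
        for case in retrieved
    )
    return f"{context}\nProblem: {problem_text}\nSolution:"

# Toy case base. "text" stands in for the text-based representation of a
# non-text component (e.g. a card image); the latent vectors are made up.
case_base = [
    {"text": "4 5 9 10", "solution": "(10 - 4) * (9 - 5)", "latent": [1.0, 0.0]},
    {"text": "1 2 3 4", "solution": "1 * 2 * 3 * 4", "latent": [0.0, 1.0]},
]

# Hypothetical query: a new card whose latent vector lies near the first case.
nearest = retrieve([0.9, 0.1], case_base, k=1)
prompt = build_prompt("4 6 9 10", nearest)
```

In this sketch the prompt string would then be sent to an LLM; that call is omitted because it depends on the model and client library in use.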

Paper Structure (15 sections, 8 equations, 7 figures, 4 tables)


Figures (7)

  • Figure 1: Four Math-24 puzzle cards. The bottom right card is $(4,5,9,10)$ and has solutions $(10-4) \times (9-5)$ and $(10+5-9)\times 4$.
  • Figure 2: CNN for learning text generation in Math-24. Once the model is trained, its predictions can be used to generate a text-based representation of a Math-24 card image, e.g. '4 5 9 10' for the image in this figure.
  • Figure 3: FFNN for learning latent representations in Math-24.
  • Figure 4: An example of a Backgammon lesson (Magriel, 1997). The board image represents the current position, and the image caption indicates that player X rolls a 3-1. The expert's analysis below the image discusses the move 16/13, 2/1.
  • Figure 5: Given the $4$ landmark coordinates, as shown by the red dots on the left image, we can slice up the image to obtain the $24$ points, as well as the centrally located bar point, as illustrated on the right. For example, given the bottom two coordinates, we compute the distance, $d$, between the start of the 1-point and the end of the 6-point along the $x$-axis. As there are $6$ points per quadrant in Backgammon, we can infer that the width of each point is $\frac{d}{6}$. We can then use this to find the start location of the 6-point along the $x$-axis. The other locations can also be inferred using similar logic.
  • ...and 2 more figures
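
The arithmetic in the captions of Figures 1 and 5 can be checked with a short sketch. The helper names `solvable_24` and `point_edges` are hypothetical, the brute-force search is a generic way to test a Math-24 card rather than anything taken from the paper, and the pixel coordinates are invented. The caption's distance $d$ is treated here as a signed offset so the slicing works whichever way the board is oriented.

```python
from fractions import Fraction

def solvable_24(numbers, target=24):
    """Return True if the numbers can reach the target using + - * /."""
    def search(values):
        if len(values) == 1:
            return values[0] == target
        # Pick an ordered pair, combine it, and recurse on the smaller list.
        for i in range(len(values)):
            for j in range(len(values)):
                if i == j:
                    continue
                rest = [values[k] for k in range(len(values)) if k not in (i, j)]
                a, b = values[i], values[j]
                candidates = [a + b, a - b, a * b]
                if b != 0:
                    candidates.append(a / b)  # exact, since values are Fractions
                if any(search(rest + [c]) for c in candidates):
                    return True
        return False
    return search([Fraction(n) for n in numbers])

def point_edges(x_first, x_last, points_per_quadrant=6):
    """Split the span from x_first to x_last into equal-width points.

    x_first is the start of the 1-point and x_last the end of the 6-point,
    so each point has (signed) width d / 6 as in the Figure 5 caption.
    """
    width = (x_last - x_first) / points_per_quadrant
    return [x_first + i * width for i in range(points_per_quadrant + 1)]

# Figure 1: both listed solutions for the card (4, 5, 9, 10) evaluate to 24.
solution_a = (10 - 4) * (9 - 5)
solution_b = (10 + 5 - 9) * 4
card_is_solvable = solvable_24((4, 5, 9, 10))

# Figure 5: invented pixel coordinates for the two bottom landmarks.
edges = point_edges(600, 300)
six_point_start = edges[5]  # the 6-point starts after five point widths
```

Exact `Fraction` arithmetic avoids floating-point error in the solver when an intermediate result is not an integer.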