Reasoning LLMs for User-Aware Multimodal Conversational Agents

Hamed Rahimi, Jeanne Cattoni, Meriem Beghili, Mouad Abrini, Mahdi Khoramshahi, Maribel Pino, Mohamed Chetouani

TL;DR

The paper tackles the cold-start problem in personalized human-robot interaction by introducing USER-LLM R1, a framework that dynamically infers and refines user profiles by combining a User Encoder, a Vision-Language Model (User-VLM), Retrieval-Augmented Generation (RAG), and Chain-of-Thought (CoT) reasoning LLMs. Initial user models are generated from multimodal inputs and progressively updated during dialogue, so that responses are contextually relevant and personalized from the first interaction. Evaluations on the ElderlyTech-VQA Bench report F1 gains over baselines in ROUGE-1 (+23.2%), ROUGE-2 (+0.6%), and ROUGE-L (+8%), along with favorable human judgments; ablations highlight the importance of reasoning model size and the efficiency of the User-VLM in cold-start situations. The work also discusses privacy, bias, and governance considerations, proposing a dynamic, consent-aware approach to user profiling suitable for elderly users and real-world deployment.

Abstract

Personalization in social robotics is critical for fostering effective human-robot interactions, yet systems often face the cold start problem, where initial user preferences or characteristics are unavailable. This paper proposes a novel framework called USER-LLM R1 for a user-aware conversational agent that addresses this challenge through dynamic user profiling and model initiation. Our approach integrates chain-of-thought (CoT) reasoning models to iteratively infer user preferences and vision-language models (VLMs) to initialize user profiles from multimodal inputs, enabling personalized interactions from the first encounter. Leveraging a Retrieval-Augmented Generation (RAG) architecture, the system dynamically refines user representations within an inherent CoT process, ensuring contextually relevant and adaptive responses. Evaluations on the ElderlyTech-VQA Bench demonstrate significant improvements in ROUGE-1 (+23.2%), ROUGE-2 (+0.6%), and ROUGE-L (+8%) F1 scores over state-of-the-art baselines, with ablation studies underscoring the impact of reasoning model size on performance. Human evaluations further validate the framework's efficacy, particularly for elderly users, where tailored responses enhance engagement and trust. Ethical considerations, including privacy preservation and bias mitigation, are rigorously discussed and addressed to ensure responsible deployment.
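
The ROUGE-N F1 scores quoted above measure n-gram overlap between a generated response and a reference answer. As a rough illustration only, the sketch below computes ROUGE-N F1 with plain whitespace tokenization; the function names are hypothetical, the paper's actual evaluation tooling is not specified here, and standard ROUGE implementations typically add stemming and more careful tokenization. ROUGE-L, which is based on the longest common subsequence, is not covered.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(candidate, reference, n=1):
    """Simplified ROUGE-N F1: lowercase, whitespace tokens, no stemming."""
    cand = ngram_counts(candidate.lower().split(), n)
    ref = ngram_counts(reference.lower().split(), n)
    # Clipped overlap: each n-gram counts at most as often as it
    # appears in both the candidate and the reference.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(rouge_n_f1("the cat", "the cat sat on the mat", n=1))
```

In the usage line, the candidate has perfect unigram precision but recovers only two of the six reference unigrams, so the F1 score is pulled well below 1.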

Paper Structure

This paper contains 17 sections, 1 equation, 3 figures, 3 tables.

Figures (3)

  • Figure 1: USER-LLM R1 Architecture: The framework consists of three principal components: a User Encoder for profile encoding, a Vision-Language Model (VLM) for initial user modeling, and a Chain-of-Thought (CoT) Reasoning Large Language Model (LLM) for updating the user profile and generating personalized responses.
  • Figure 2: CoT Reasoning LLM vs. Regular LLM [grootendorst2025visualguide]
  • Figure 3: Evaluation with Human Expert: Our framework, with 70B parameters and an adaptive User-VLM module, achieves GPT-4o-level personalization at lower computational cost.
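
The data flow described in the Figure 1 caption (initialize a profile from multimodal input, retrieve context, then update the profile while generating a response) can be sketched roughly as follows. This is a toy illustration under heavy assumptions, not the authors' implementation: every name here (`UserProfile`, `init_profile`, `retrieve`, `reason_and_respond`) is hypothetical, the VLM and the reasoning LLM are replaced by trivial stand-in functions, retrieval is naive keyword matching, and the User Encoder is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class UserProfile:
    """Hypothetical container for what the system believes about the user."""
    attributes: dict = field(default_factory=dict)
    history: list = field(default_factory=list)


def init_profile(visual_summary: str) -> UserProfile:
    # Stand-in for the VLM step: a real system would derive this
    # summary from an image; here it is passed in as text.
    return UserProfile(attributes={"visual_summary": visual_summary})


def retrieve(utterance: str, knowledge: list) -> list:
    # Stand-in for the RAG step: keep documents sharing any word
    # with the user's utterance.
    words = set(utterance.lower().split())
    return [doc for doc in knowledge if words & set(doc.lower().split())]


def reason_and_respond(profile: UserProfile, utterance: str, context: list) -> str:
    # Stand-in for the CoT reasoning LLM: update the profile from the
    # utterance, then produce a response conditioned on it.
    profile.history.append(utterance)
    if "prefer" in utterance.lower():
        profile.attributes["stated_preference"] = utterance
    known = ", ".join(sorted(profile.attributes))
    return f"(profile: {known}; {len(context)} retrieved) reply to: {utterance}"


if __name__ == "__main__":
    profile = init_profile("older adult, wearing glasses")
    docs = ["large text settings guide", "volume control"]
    utterance = "I prefer large text"
    print(reason_and_respond(profile, utterance, retrieve(utterance, docs)))
```

The point of the sketch is only the ordering: the profile exists before the first user turn, and each turn both reads from and writes to it.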