USER-VLM 360: Personalized Vision Language Models with User-aware Tuning for Social Human-Robot Interactions

Hamed Rahimi, Adil Bahaj, Mouad Abrini, Mahdi Khoramshahi, Mounir Ghogho, Mohamed Chetouani

TL;DR

User-VLM 360° presents a holistic framework for personalized vision-language interactions in social HRI by integrating user-aware tuning with bias-aware optimization. The approach combines a vision encoder and an LLM, refined through Vision Alignment, Instruction Tuning with LoRA/MoLE, and DPO-based Bias Mitigation, supported by a carefully constructed multimodal dataset suite. Empirical results across eight benchmarks show strong gains in personalized VQA and facial feature understanding, while maintaining robust general-purpose reasoning and reducing bias, with substantial efficiency gains over prompting-based baselines. Deployment on the Pepper robot demonstrates real-time adaptability and feasibility for edge-robot experiences, and an ethical verification framework accompanies the release of open-source 3B/10B models to promote responsible adoption and governance of personalized VLMs in real-world settings.

Abstract

The integration of vision-language models into robotic systems constitutes a significant advancement in enabling machines to interact with their surroundings in a more intuitive manner. While VLMs offer rich multimodal reasoning, existing approaches lack user-specific adaptability, often relying on generic interaction paradigms that fail to account for individual behavioral, contextual, or socio-emotional nuances. When customization is attempted, ethical concerns arise from unmitigated biases in user data, risking exclusion or unfair treatment. To address these dual challenges, we propose User-VLM 360°, a holistic framework integrating multimodal user modeling with bias-aware optimization. Our approach features: (1) user-aware tuning that adapts interactions in real time using visual-linguistic signals; (2) bias mitigation via preference optimization; and (3) curated 360° socio-emotive interaction datasets annotated with demographic, emotion, and relational metadata. Evaluations across eight benchmarks demonstrate state-of-the-art results: +35.3% F1 in personalized VQA, +47.5% F1 in facial features understanding, 15% bias reduction, and 30X speedup over baselines. Ablation studies confirm component efficacy, and deployment on the Pepper robot validates real-time adaptability across diverse users. We open-source parameter-efficient 3B/10B models and an ethical verification framework for responsible adaptation.

Paper Structure

This paper contains 57 sections, 4 equations, 3 figures, 10 tables.

Figures (3)

  • Figure 1: Deployment of User-VLM 360° on Pepper Social Robotic Framework. User-aware Tuning mitigates the semantic gap arising from the misalignment between user queries and the observed scene as captured from the robot's camera perspective. While instruction-tuning could address this for large VLMs, it adds latency and reduces performance. User-VLM 360° overcomes this by inherently aligning cross-modal user representations, enabling robust real-time adaptation in dynamic robotic environments.
  • Figure 2: User-aware Tuning consists of three key steps. In the first step, Vision Alignment, the model is trained to recognize and interpret human emotions, age, gender, and ethnicity based on facial features and visual signals. In the second step, Instruction Tuning, the model undergoes supervised instruction tuning, enabling it to respond effectively to general-purpose questions by incorporating visual cues. Finally, to mitigate over-personalization and prevent biased or unethical responses, the third step, Bias Mitigation, trains the model to generate ethical and contextually appropriate responses.
  • Figure 3: Distribution of Training Datasets. The datasets are constructed by combining high-quality general-purpose datasets with facial image datasets, further refined to align with both visual and linguistic contexts.
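
The Bias Mitigation step described in Figure 2 is, per the TL;DR, based on DPO (Direct Preference Optimization). As a rough illustration only, the sketch below computes the standard per-example DPO loss from summed log-probabilities of a preferred ("chosen") and a dispreferred ("rejected") response. It assumes the textbook DPO objective; the function name, argument names, and the `beta` value are hypothetical and are not taken from the authors' training code.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * margin)).

    All *_logp arguments are summed token log-probabilities of a full
    response under the trainable policy or the frozen reference model.
    beta=0.1 is an illustrative default, not a value from the paper.
    """
    # How much more the policy favors the chosen response over the
    # rejected one, measured relative to the reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(z)) equals softplus(-z); this form avoids overflow.
    x = -beta * margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Hypothetical log-probabilities: the policy already leans toward the
# chosen (less biased) response, so the loss falls below log(2).
print(dpo_loss(-1.0, -5.0, -3.0, -3.0))
```

When the policy and reference agree exactly, the margin is zero and the loss equals log(2); the loss shrinks as the policy shifts probability toward the chosen response and grows when it shifts toward the rejected one.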