Demographic User Modeling for Social Robotics with Multimodal Pre-trained Models
Hamed Rahimi, Mouad Abrini, Mahdi Khoramshahi, Mohamed Chetouani
TL;DR
Demographic user modeling from visual-linguistic data in social robotics is challenged by data sparsity and bias. The authors introduce GenUser and FairUser to study this problem and evaluate CLIP and FaRL in both out-of-the-box and fine-tuned settings, proposing a hybrid objective that adds masked image modeling to the standard contrastive loss, yielding $L_{total}=L_c+L_{MIM}$. Results show limited out-of-the-box performance but substantial gains after fine-tuning, with FaRL offering better cross-dataset generalization than CLIP, which can suffer from catastrophic forgetting when fine-tuned on a single dataset. This work provides a pathway to more robust, privacy-aware multimodal demographic modeling for social robotics, with implications for healthcare and education applications where nuanced user understanding is critical.
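As a rough illustration of the hybrid objective $L_{total}=L_c+L_{MIM}$, the sketch below combines a symmetric contrastive (InfoNCE-style) loss with a masked-patch reconstruction loss in plain Python. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the temperature of 0.07, and the use of mean squared error on masked patches for $L_{MIM}$ are all illustrative choices that the summary does not specify.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def contrastive_loss(sim, temperature=0.07):
    """Symmetric InfoNCE over an N x N image-text similarity matrix.

    sim[i][j] is the similarity between image i and text j;
    matched pairs sit on the diagonal. The temperature is an assumed value.
    """
    n = len(sim)
    logits = [[s / temperature for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    img_to_txt = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    txt_to_img = -sum(log_softmax(logits_t[i])[i] for i in range(n)) / n
    return 0.5 * (img_to_txt + txt_to_img)

def mim_loss(predicted, target, mask):
    """Assumed MIM loss: mean squared error on masked patches only (mask[i] == 1)."""
    masked = [(p - t) ** 2 for p, t, m in zip(predicted, target, mask) if m]
    return sum(masked) / len(masked) if masked else 0.0

def total_loss(sim, predicted, target, mask):
    """L_total = L_c + L_MIM, with equal weighting as in the formula above."""
    return contrastive_loss(sim) + mim_loss(predicted, target, mask)
```

With an uninformative similarity matrix such as `[[0, 0], [0, 0]]`, the contrastive term reduces to `log(2)`, the loss of guessing uniformly between two candidates, so any value below that indicates the matched pairs are being separated from the mismatched ones.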
Abstract
This paper investigates the performance of multimodal pre-trained models in user profiling tasks based on visual-linguistic demographic data. These models are critical for adapting to the needs and preferences of human users in social robotics, thereby providing personalized responses and enhancing interaction quality. First, we introduce two datasets specifically curated to represent demographic characteristics derived from user facial images. Next, we evaluate the performance of a prominent contrastive multimodal pre-trained model, CLIP, on these datasets, both in its out-of-the-box state and after fine-tuning. Initial results indicate that, without fine-tuning, CLIP performs suboptimally in matching images to demographic descriptions. Although fine-tuning significantly enhances its predictive capacity, the model still generalizes poorly to subtle demographic nuances. To address this, we propose adopting a masked image modeling strategy to improve generalization and better capture subtle demographic attributes. This approach offers a pathway for enhancing demographic sensitivity in multimodal user modeling tasks.
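The out-of-the-box evaluation described above amounts to matching an image embedding against the embeddings of candidate textual descriptions and picking the closest one. The following is a minimal sketch of that matching step, assuming the embeddings have already been produced by an image encoder and a text encoder; the function names, the toy two-dimensional vectors, and the placeholder labels are hypothetical and do not come from the paper's datasets.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def match_description(image_embedding, text_embeddings):
    """Return the label whose text embedding is most similar to the image.

    text_embeddings maps a description label to its embedding vector.
    """
    return max(
        text_embeddings,
        key=lambda label: cosine(image_embedding, text_embeddings[label]),
    )

# Toy vectors standing in for real encoder outputs (hypothetical values).
candidates = {
    "description A": [0.9, 0.0],
    "description B": [0.0, 1.0],
}
best = match_description([1.0, 0.1], candidates)
```

Fine-tuning changes the encoders that produce these embeddings, not this matching rule, which is why the same procedure can be used to compare the out-of-the-box and fine-tuned settings.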
