RePIC: Reinforced Post-Training for Personalizing Multi-Modal Language Models

Yeongtak Oh, Dohyun Chung, Juhyeon Shin, Sangha Park, Johan Barthelemy, Jisoo Mok, Sungroh Yoon

TL;DR

RePIC introduces a reinforcement-learning-based post-training framework to personalize multimodal language models for image captioning. It leverages verifiable rewards—Object Consistency Tuning (OCT), Visual Localization Tuning (VLT), and Identity Consistency Tuning (ICT)—within a Group Relative Policy Optimization (GRPO) scheme to enhance both visual recognition and personalized generation, reducing reliance on large-scale high-quality captions. Empirical results show substantial gains over SFT-based baselines, especially in challenging multi-concept scenarios, while maintaining general captioning capabilities. The approach represents a data-efficient path to robust real-world personalization for MLLMs with potential impact on personalized assistants and accessible AI systems.
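As a rough illustration of the group-relative idea behind GRPO mentioned above, the sketch below normalizes the rewards of several responses sampled for the same prompt against that group's own mean and standard deviation. It is not taken from the RePIC code: the function name `group_relative_advantages`, the `eps` constant, the choice of population standard deviation, and the example reward values are all assumptions made for illustration, and the actual OCT, VLT, and ICT reward definitions are not reproduced here.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Score each sampled response relative to its own group.

    rewards: one scalar reward per response sampled for the same prompt.
    eps: small constant (an assumption) that avoids division by zero
         when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled captions, each scored 1.0 or 0.0
# by some verifiable check (values invented for illustration).
rewards = [1.0, 0.0, 1.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that beat the group average receive a positive advantage and the rest a negative one, so no separately trained value model is needed to provide a baseline. When every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.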

Abstract

Recent multi-modal large language models (MLLMs) often struggle to generate personalized image captions, even when trained on high-quality captions. In this work, we observe that such limitations persist in existing post-training-based MLLM personalization methods. Specifically, despite being post-tuned with large-scale caption data through supervised fine-tuning (SFT), these models frequently fail to produce faithful descriptions in real-world scenarios, such as multi-concept image captioning. However, acquiring large-scale, high-quality captions for such complex settings is both costly and difficult. To address the data-centric nature of SFT, we propose a reinforcement learning (RL)-based post-training framework. To the best of our knowledge, this is the first RL-based approach to post-train MLLMs for personalized image captioning. Our method significantly enhances both visual recognition and personalized generation capabilities of MLLMs, and consistently outperforms existing SFT-based baselines, especially in the challenging multi-concept image captioning task. Project page: https://github.com/oyt9306/RePIC

Paper Structure

This paper contains 41 sections, 6 equations, 23 figures, 16 tables.

Figures (23)

  • Figure 1: Visualizations of personalized image captioning results. In the first row, the zero-shot MLLM frequently fails to generate personalized captions; these images are sourced from Yo’LLaVA (nguyen2024yo). The remaining rows illustrate multi-concept scenarios at inference time. Compared to other SFT-based methods, our approach consistently produces faithful and detailed captions while accurately recognizing all provided identities, even with 3 or 4 concepts; these images are sourced from MuDI (jang2024identity).
  • Figure 2: Overview of our RePIC framework: (a) training phase and (b) inference phase. An abbreviated example of the prompt template is shown; complete templates are provided in the Appendix.
  • Figure 3: Visualization of preference evaluation scores for single and 2-concept settings, corresponding to the first and second rows, respectively. In (a), our model outperforms all other baseline models, while in (b), it surpasses all ablation variants.
  • Figure 4: Qualitative examples of 2-concept personalized image captioning.
  • Figure 5: Visualization of training stability.
  • ...and 18 more figures