User-Aware Prefix-Tuning is a Good Learner for Personalized Image Captioning

Xuan Wang; Guanhong Wang; Wenhao Chai; Jiayu Zhou; Gaoang Wang

TL;DR

This work tackles personalized image captioning by incorporating each user's language style while leveraging a frozen large language model. It introduces User-Aware Prefix-Tuning (UAPT), which builds a visual prefix from CLIP features via a query-guided mapping network and fuses it with TF-IDF-derived user priors through a transformer, producing prefixes that condition a frozen GPT-2 generator. Only the mapping and fusion networks are trained, so adaptation needs a small number of trainable parameters while CLIP and GPT-2 remain fixed. The objective is the autoregressive negative log-likelihood $\mathcal{L}(\theta) = -\sum_{i=1}^{N} \sum_{j=1}^{L} \log p_\theta( c_j^i \mid x^i, u^i, c_1^i, \ldots, c_{j-1}^i )$, minimized over the trainable parameters $\theta$, where $x^i$ is the image, $u^i$ the user context, and $c_j^i$ the $j$-th token of the $i$-th caption. Empirical results on Instagram and YFCC100M show large improvements over strong baselines across five metrics, including roughly twofold gains in BLEU-4 and CIDEr, demonstrating effective, domain-bridging personalization with efficiency advantages.
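To make the autoregressive objective above concrete, the following minimal sketch sums the negative log-probabilities of each caption token given its context. It assumes the per-token conditional probabilities have already been produced by some model; the function names `caption_nll` and `dataset_nll` and the toy probabilities are illustrative and do not come from the paper.

```python
import math


def caption_nll(token_probs):
    """Negative log-likelihood of one caption.

    token_probs[j] stands for p(c_j | image, user context, c_1..c_{j-1}),
    i.e. the probability the model assigned to the correct j-th token.
    """
    return -sum(math.log(p) for p in token_probs)


def dataset_nll(all_token_probs):
    """Sum the per-caption losses over all N training samples."""
    return sum(caption_nll(probs) for probs in all_token_probs)


# Toy values (hypothetical): two captions with two and one tokens.
toy_probs = [[0.5, 0.25], [0.125]]
loss = dataset_nll(toy_probs)
```

Minimizing this quantity over the trainable parameters is equivalent to maximizing the summed log-likelihood; in the setup described above, only the mapping and fusion networks would receive gradient updates.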

Abstract

Image captioning bridges the gap between vision and language by automatically generating natural language descriptions for images. Traditional image captioning methods often overlook the preferences and characteristics of users. Personalized image captioning addresses this problem by incorporating user prior knowledge, such as writing styles and preferred vocabularies, into the model. Most existing methods emphasize the user context fusion process, implemented with memory networks or transformers. However, these methods ignore the distinct domains of each dataset. As a result, they must update all parameters of the caption model when encountering new samples, which is time-consuming and computationally intensive. To address this challenge, we propose a novel personalized image captioning framework that leverages user context to consider personality factors. Additionally, our framework utilizes the prefix-tuning paradigm to extract knowledge from a frozen large language model, reducing the gap between different language domains. Specifically, we employ CLIP to extract the visual features of an image and align the semantic space using a query-guided mapping network. Using a transformer layer, we merge the visual features with the user's contextual prior knowledge to generate informative prefixes. Moreover, we employ GPT-2 as the frozen large language model. With only a small number of parameters to train, our model is both efficient and effective. Our model outperforms existing baseline models on the Instagram and YFCC100M datasets across five evaluation metrics, including twofold improvements in metrics such as BLEU-4 and CIDEr.
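The user prior knowledge mentioned above (preferred vocabularies) is described in the TL;DR as TF-IDF-derived. The sketch below shows one plausible way such a prior could be computed. It assumes that each user's past captions are pooled into a single document and that the top-scoring words serve as that user's keywords; the paper's actual tokenization, document definition, and keyword count are not given here, and `user_keywords` and the toy posts are hypothetical.

```python
import math
from collections import Counter


def user_keywords(user_posts, top_k=3):
    """Rank each user's words by TF-IDF.

    Assumption: all captions of one user form one document, so the
    inverse document frequency is computed across users.
    """
    # Term counts per user (naive whitespace tokenization).
    docs = {
        user: Counter(w for post in posts for w in post.lower().split())
        for user, posts in user_posts.items()
    }
    n_docs = len(docs)

    # Document frequency: in how many users' vocabularies a word occurs.
    df = Counter()
    for counts in docs.values():
        df.update(counts.keys())

    keywords = {}
    for user, counts in docs.items():
        total = sum(counts.values())
        scores = {
            w: (c / total) * math.log(n_docs / df[w])
            for w, c in counts.items()
        }
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        keywords[user] = ranked[:top_k]
    return keywords


# Toy data (hypothetical users and captions).
posts = {
    "alice": ["sunset beach vibes", "beach day with friends"],
    "bob": ["coffee and code", "late night code with friends"],
}
keywords = user_keywords(posts, top_k=2)
```

Words shared by every user (here "with" and "friends") receive a score of zero, so the ranking favors vocabulary that is distinctive for one user, which is the property a personalization prior needs.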

Paper Structure (18 sections, 4 equations, 3 figures, 3 tables)

Introduction
Related Work
Image Captioning
Personalized Vision and Language Research
Prompt Tuning in Image Captioning
Methods
Problem Definition
Query-Guided Visual Knowledge Mapping
User-Aware Prior Knowledge Fusion
Caption Generation
Experiments
Dataset
Implementation Details
Baselines
Quantitative Analysis
...and 3 more sections

Figures (3)

Figure 1: Owing to the personalities of the users, similar images may exhibit varying descriptions.
Figure 2: Overview of our User-Aware Prefix-Tuning Network (UAPT) framework. First, we utilize a frozen image encoder $f_{I}^{v}$ and context encoder $f_{T}^{t}$ to extract visual features and user-specific embeddings, respectively. Then, a query-guided mapping network $f_{mapping}$ is used to align vision and language semantics. Subsequently, visual knowledge and user prior knowledge are fused by a transformer-based fusion network $f_{fusion}$ to output embeddings as prefixes. Finally, the prefixes are fed into a frozen large language model $f_{gpt2}$ to generate personalized captions.
Figure 3: Examples from Instagram (top) and YFCC100M (bottom).
