User-Aware Prefix-Tuning is a Good Learner for Personalized Image Captioning

Xuan Wang, Guanhong Wang, Wenhao Chai, Jiayu Zhou, Gaoang Wang

TL;DR

This work tackles personalized image captioning by incorporating user language style while leveraging a frozen large language model. It introduces User-Aware Prefix-Tuning (UAPT), which builds a visual prefix from CLIP-based features via a query-guided mapping and fuses it with TF-IDF-derived user priors through a transformer, producing prefixes that condition a frozen GPT-2 generator. The training focuses on the Mapping and Fusion networks, enabling efficient adaptation with a small number of trainable parameters while CLIP and GPT-2 remain fixed; the objective is the autoregressive likelihood $\mathcal{L} = -\max_{\theta} \sum_{i=1}^{N} \sum_{j=1}^{L} \log p_{\theta}( c_j^i \mid x^i, u^i, c_1^i, \ldots, c_{j-1}^i )$. Empirical results on Instagram and YFCC100M show large improvements over strong baselines across five metrics, including roughly twofold gains in BLEU-4 and CIDEr, demonstrating effective, domain-bridging personalization with efficiency advantages.
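To make the objective concrete, the following minimal sketch evaluates the summed negative log-likelihood over caption tokens. The function name `caption_nll` and the toy probabilities are illustrative assumptions, not part of the paper; in practice each probability would come from the frozen GPT-2 conditioned on the learned prefixes.

```python
import math


def caption_nll(token_probs):
    """Summed negative log-likelihood over a set of captions.

    token_probs[i][j] is a (hypothetical) model probability of the j-th
    ground-truth token of caption i, given the image, the user context,
    and the preceding tokens.
    """
    total = 0.0
    for caption in token_probs:
        for p in caption:
            if not 0.0 < p <= 1.0:
                raise ValueError("probabilities must lie in (0, 1]")
            total -= math.log(p)
    return total


# Toy values standing in for p_theta(c_j | x, u, c_1, ..., c_{j-1}).
toy_probs = [[0.5, 0.25], [0.125]]
loss = caption_nll(toy_probs)
```

Minimizing this quantity with respect to the trainable parameters is equivalent to maximizing the log-likelihood of the ground-truth captions.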

Abstract

Image captioning bridges the gap between vision and language by automatically generating natural language descriptions for images. Traditional image captioning methods often overlook the preferences and characteristics of users. Personalized image captioning addresses this problem by incorporating user prior knowledge, such as writing styles and preferred vocabularies, into the model. Most existing methods emphasize the user context fusion process using memory networks or transformers. However, these methods ignore the distinct domains of each dataset. Therefore, they need to update all of the caption model's parameters when encountering new samples, which is time-consuming and computationally intensive. To address this challenge, we propose a novel personalized image captioning framework that leverages user context to account for personality factors. Additionally, our framework utilizes the prefix-tuning paradigm to extract knowledge from a frozen large language model, reducing the gap between different language domains. Specifically, we employ CLIP to extract the visual features of an image and align the semantic space using a query-guided mapping network. Using a transformer layer, we merge the visual features with the user's contextual prior knowledge to generate informative prefixes. Moreover, we employ GPT-2 as the frozen large language model. With only a small number of parameters to train, our model is both efficient and effective. Our model outperforms existing baseline models on the Instagram and YFCC100M datasets across five evaluation metrics, with twofold improvements in metrics such as BLEU-4 and CIDEr.
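The TL;DR describes the user priors as TF-IDF-derived. As a rough illustration of that idea, the sketch below pools each user's past captions into one document and keeps the highest-scoring words as that user's context. The pooling strategy, whitespace tokenization, the plain `log(N / df)` weighting, and the name `top_tfidf_words` are all assumptions made for this example; the paper's exact procedure is not given in this summary.

```python
import math
from collections import Counter


def top_tfidf_words(user_posts, k=3):
    """Return, for each user, the k words with the highest TF-IDF score.

    user_posts maps a user id to a list of that user's past captions.
    Each user's captions are pooled into one "document" (an assumption).
    """
    docs = {
        user: Counter(w for post in posts for w in post.lower().split())
        for user, posts in user_posts.items()
    }
    n_docs = len(docs)
    # Document frequency: in how many users' documents each word occurs.
    df = Counter(w for counts in docs.values() for w in counts)

    priors = {}
    for user, counts in docs.items():
        total = sum(counts.values())
        scores = {
            w: (c / total) * math.log(n_docs / df[w])
            for w, c in counts.items()
        }
        # Sort by score (descending), then alphabetically for determinism.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        priors[user] = ranked[:k]
    return priors


# Hypothetical toy data: two users with two past captions each.
posts = {
    "alice": ["sunset at the beach", "beach day with friends"],
    "bob": ["coffee at the office", "coffee with friends"],
}
priors = top_tfidf_words(posts, k=2)
```

Words shared by every user (here "at", "the", "with", "friends") receive a score of zero, so the selected words reflect vocabulary that distinguishes one user from the others.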

Paper Structure (18 sections, 4 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Owing to the personalities of the users, similar images may exhibit varying descriptions.
  • Figure 2: Overview of our User-Aware Prefix-Tuning Network (UAPT) framework. First, we utilize a frozen image encoder $f_{I}^{v}$ and a context encoder $f_{T}^{t}$ to extract visual features and user-specific embeddings, respectively. Then, a query-guided mapping network $f_{mapping}$ is used to align vision and language semantics. Subsequently, visual knowledge and user prior knowledge are fused by a transformer-based fusion network $f_{fusion}$ to output embeddings as prefixes. Finally, the prefixes are input into a frozen large language model $f_{gpt2}$ to generate personalized captions.
  • Figure 3: Examples from Instagram (top) and YFCC100M (bottom).