Improving Image Captioning by Mimicking Human Reformulation Feedback at Inference-time

Uri Berger, Omri Abend, Lea Frermann, Gabriel Stanovsky

TL;DR

This work introduces reformulation feedback as a novel, inference-time signal for image captioning: a model trained on human-authored edits steers captions toward desired attributes such as factuality or style, without retraining the captioning model. The reformulation model is trained on relatively small datasets to map an image-caption pair to a reformulated caption, so it can be combined with off-the-shelf captioners. The approach improves English and German captioning, especially when the original captions are of low quality, and achieves state-of-the-art results for German image captioning and for style transfer on FlickrStyle, supported by automatic and human evaluations. The findings suggest that lightweight reformulation models are a practical way to adapt captioning systems to user goals and multilingual settings, and may carry over to other generation tasks.

Abstract

Incorporating automatically predicted human feedback into the process of training generative models has attracted substantial recent interest, while feedback at inference time has received less attention. The typical feedback at training time, i.e., preferences of choice given two samples, does not naturally transfer to the inference phase. We introduce a novel type of feedback -- caption reformulations -- and train models to mimic reformulation feedback based on human annotations. Our method does not require training the image captioning model itself, thereby demanding substantially less computational effort. We experiment with two types of reformulation feedback: first, we collect a dataset of human reformulations that correct errors in the generated captions. We find that incorporating reformulation models trained on this data into the inference phase of existing image captioning models results in improved captions, especially when the original captions are of low quality. We apply our method to non-English image captioning, a domain where robust models are less prevalent, and gain substantial improvement. Second, we apply reformulations to style transfer. Quantitative evaluations reveal state-of-the-art performance on German image captioning and English style transfer, while human validation with a detailed comparative framework exposes the specific axes of improvement.
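
The inference-time composition described in the abstract can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the names `Captioner`, `Reformulator`, `CaptionResult`, and `caption_with_reformulation` are hypothetical, and the two toy functions stand in for a frozen off-the-shelf captioning model and a trained reformulation model.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type aliases (not from the paper): an image is represented
# as raw bytes, a captioner maps an image to a caption, and a reformulator
# maps an (image, caption) pair to a reformulated caption.
Image = bytes
Captioner = Callable[[Image], str]
Reformulator = Callable[[Image, str], str]


@dataclass(frozen=True)
class CaptionResult:
    original: str       # caption produced by the frozen captioner
    reformulated: str   # caption after the reformulation step


def caption_with_reformulation(
    image: Image, captioner: Captioner, reformulator: Reformulator
) -> CaptionResult:
    """Run the captioner as-is, then pass its output through the reformulator."""
    original = captioner(image)
    reformulated = reformulator(image, original)
    return CaptionResult(original=original, reformulated=reformulated)


# Toy stand-ins so the sketch runs without any model weights.
def toy_captioner(image: Image) -> str:
    return "a dog sitting on a couch"


def toy_reformulator(image: Image, caption: str) -> str:
    # Pretend the image shows a cat, so the factual error is corrected.
    return caption.replace("dog", "cat")


result = caption_with_reformulation(b"", toy_captioner, toy_reformulator)
print(result.original)
print(result.reformulated)
```

Only the reformulator would be trained in this setup; the captioner is called but never updated, which is where the reduced computational cost comes from.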

Paper Structure

This paper contains 39 sections, 8 figures, and 4 tables.

Figures (8)

  • Figure 1: Our proposed method with reformulation for improved factuality as an example. Top: Collecting human-written reformulations of model captions. Center: Using the collected data to train models to generate reformulations, given an input image and original caption. Bottom: Combining an off-the-shelf captioning model (no training) with our reformulation model, to adapt generated captions at inference time.
  • Figure 2: Examples in which the reformulated captions achieved better results than the original ones, in all SPICE elements. Orig: the caption generated by the model. Re: the reformulated caption.
  • Figure 3: Results for human evaluation on different axes. We show proportions of preferences for generated captions without (base) and with (base+re) reformulations, and ties. * indicates a significant difference between base and base+re (Sign test; $p<0.05$).
  • Figure 4: Results for human evaluation on different axes of German generated captions. We show proportions of preferences for generated captions without (base) and with (base+re) reformulations, and ties. * indicates significance as in Figure 3.
  • Figure 5: Results for human evaluation on different axes of stylized captions. We show proportions of preferences for the baseline (CapDec) and BLIP reformulated (BLIP+re) captions, and ties. * indicates significance as in Figure 3.
  • ...and 3 more figures
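
The significance markers in the human-evaluation figures come from a sign test on pairwise preferences. The sketch below shows one common form, the exact two-sided sign test with ties discarded; whether the authors used this exact variant is an assumption, the function name `sign_test_p_value` is hypothetical, and the counts in the usage lines are illustrative rather than taken from the paper.

```python
from math import comb


def sign_test_p_value(wins_a: int, wins_b: int) -> float:
    """Exact two-sided sign test p-value.

    wins_a and wins_b count the items on which annotators preferred
    system A or system B. Ties are assumed to have been dropped already.
    Under the null hypothesis each non-tied preference is a fair coin flip.
    """
    n = wins_a + wins_b
    if n == 0:
        return 1.0  # no non-tied comparisons, so no evidence either way
    k = min(wins_a, wins_b)
    # Probability of a split at least as extreme as k in one tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    # Double for the two-sided test and cap at 1.
    return min(1.0, 2 * tail)


# Illustrative counts only: 9 preferences for base+re, 1 for base.
p = sign_test_p_value(9, 1)
print(p < 0.05)
```

A starred axis in the figures would correspond to a comparison for which this kind of p-value falls below 0.05.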