
Continual Learning for Image Captioning through Improved Image-Text Alignment

Bertram Taetz, Gal Bordelius

TL;DR

This work tackles continual image captioning by addressing catastrophic forgetting through a multi-loss framework that fuses prompt-based semantic guidance with language-informed alignment. Built on a ViT-GPT-2 backbone, the model optimizes a composite objective $\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{CE}} + \mathcal{L}_{\text{nouns}} + \mathcal{L}_{\text{CLIP}} + \mathcal{L}_{\text{LGCL}}$, transitioning from noun-prompts to caption-based alignment and employing a language-guided contrastive triplet loss. The approach achieves stronger semantic retention and improved caption alignment on continual MS-COCO benchmarks (ContCap and RATT) while introducing no inference-time overhead. These findings suggest that integrating structured linguistic prompts with cross-modal alignment losses can effectively mitigate forgetting in open-world image captioning, with practical implications for real-time, adaptive vision-language systems.

Abstract

Generating accurate and coherent image captions in a continual learning setting remains a major challenge due to catastrophic forgetting and the difficulty of aligning evolving visual concepts with language over time. In this work, we propose a novel multi-loss framework for continual image captioning that integrates semantic guidance through prompt-based continual learning and contrastive alignment. Built upon a pretrained ViT-GPT-2 backbone, our approach combines the standard cross-entropy loss with three additional components: (1) a prompt-based cosine similarity loss that aligns image embeddings with synthetically constructed prompts encoding objects, attributes, and actions; (2) a CLIP-style loss that promotes alignment between image embeddings and target caption embeddings; and (3) a language-guided contrastive loss that employs a triplet loss to enhance class-level discriminability between tasks. Notably, our approach introduces no additional overhead at inference time and requires no prompts during caption generation. We find that this approach mitigates catastrophic forgetting while achieving better semantic caption alignment than state-of-the-art methods. The code is available at https://github.com/Gepardius/Taetz_Bordelius_Continual_ImageCaptioning.
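To make the structure of the objective concrete, the following is a minimal sketch of how a cross-entropy value could be combined with two cosine-similarity alignment terms and a triplet term, summed without weights as in the composite objective stated in the TL;DR. It is not taken from the authors' repository. The function names, the `1 - cos` form of the alignment losses, the use of cosine distance inside the triplet loss, and the margin of `0.2` are all assumptions chosen for illustration, and the embeddings are plain Python lists rather than model outputs.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_alignment_loss(image_emb, text_emb):
    # Assumed form: 1 - cos(image, text); zero when the vectors point the same way.
    return 1.0 - cosine_similarity(image_emb, text_emb)


def triplet_loss(anchor, positive, negative, margin=0.2):
    # Assumed form: hinge on cosine distances; the margin value is hypothetical.
    d_pos = 1.0 - cosine_similarity(anchor, positive)
    d_neg = 1.0 - cosine_similarity(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


def total_loss(ce, image_emb, prompt_emb, caption_emb, pos_emb, neg_emb):
    # ce is a precomputed cross-entropy value from the captioning model.
    l_nouns = cosine_alignment_loss(image_emb, prompt_emb)   # prompt-based term
    l_clip = cosine_alignment_loss(image_emb, caption_emb)   # caption-based term
    l_lgcl = triplet_loss(image_emb, pos_emb, neg_emb)       # language-guided term
    return ce + l_nouns + l_clip + l_lgcl


if __name__ == "__main__":
    # Toy 2-D embeddings, purely illustrative.
    image = [0.9, 0.1]
    prompt = [1.0, 0.0]
    caption = [0.8, 0.3]
    same_class = [1.0, 0.1]
    other_class = [0.0, 1.0]
    print(total_loss(2.0, image, prompt, caption, same_class, other_class))
```

Because all three auxiliary terms act only on embeddings during training, a sketch like this is consistent with the abstract's claim that nothing extra is computed at inference time: caption generation would use the backbone alone.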

Paper Structure

This paper contains 14 sections, 6 equations, 2 figures, 5 tables, 2 algorithms.

Figures (2)

  • Figure 1: Overview of the proposed multi-objective training approach, combining prompt-based and CLIP-based cosine similarity losses with a triplet loss.
  • Figure 2: Overview of the proposed multi-objective training approach for image captioning. The model is optimized with four supervisory signals: the standard cross-entropy loss, the prompt-based cosine similarity loss, the CLIP-based cosine similarity loss, and the language-guided contrastive loss.