CAPEEN: Image Captioning with Early Exits and Knowledge Distillation

Divya Jyoti Bajpai, Manjesh Kumar Hanawal

TL;DR

CapEEN addresses the latency challenge of image captioning by embedding knowledge-distilled early exits into an encoder–decoder backbone (Swin-Transformer + GPT-2). The two-stage training preserves backbone quality while distilling deep-layer knowledge into intermediate exits, enabling tokens to exit early when confident. To cope with distribution drift in deployment, A-CapEEN uses a Multi-Armed Bandit framework to adapt exit thresholds online, achieving robustness to distortions with minimal overhead. Across MS COCO and Flickr30k, CapEEN delivers a notable 1.77x speedup with competitive caption quality, while A-CapEEN further enhances resilience under noise and blur, highlighting practical gains for real-world, resource-constrained deployments.
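The exit rule described above ("tokens exit early when confident") can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `early_exit_predict`, the use of the maximum softmax probability as the confidence score, and the toy logits are all assumptions. In a real model the deeper layers would only be computed if the shallower exits decline; here all exit logits are precomputed for brevity.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def early_exit_predict(exit_logits, alpha):
    """Return (token_id, exit_index) for one decoding step.

    exit_logits: per-exit logit vectors, shallowest exit first;
                 the last entry stands for the final layer.
    alpha:       confidence threshold (hypothetical parameter name).
    """
    last = len(exit_logits) - 1
    for depth, logits in enumerate(exit_logits):
        probs = softmax(logits)
        confidence = max(probs)  # assumed confidence measure
        # Exit as soon as confidence exceeds alpha; the final layer always answers.
        if confidence > alpha or depth == last:
            return probs.index(confidence), depth

# Toy logits over a 3-token vocabulary (illustrative values only).
exit_logits = [
    [0.1, 0.2, 0.3],  # shallow exit: nearly uniform, not confident
    [0.0, 4.0, 0.0],  # middle exit: confident in token 1
    [0.0, 6.0, 0.0],  # final layer
]
token, depth = early_exit_predict(exit_logits, alpha=0.9)
print(token, depth)  # 1 1
```

Raising `alpha` pushes more tokens to deeper exits, trading speed for caption quality.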

Abstract

Deep neural networks (DNNs) have made significant progress in recognizing visual elements and generating descriptive text in image-captioning tasks. However, their improved performance comes at the cost of increased computational burden and inference latency. Early Exit (EE) strategies can improve efficiency, but adapting them to image captioning is challenging because captioning requires varying levels of semantic information for accurate predictions. To overcome this, we introduce CAPEEN, which improves the performance of EE strategies using knowledge distillation. Inference in CAPEEN is completed at an intermediate layer if the prediction confidence exceeds a predefined threshold learned from the training data. To account for real-world deployments, where the target distribution could drift from that of the training samples, we introduce a variant, A-CAPEEN, that adapts the thresholds on the fly using a multi-armed bandit framework. Experiments on the MS COCO and Flickr30k datasets show that CAPEEN achieves a speedup of 1.77x while maintaining competitive performance compared to the final layer, and A-CAPEEN additionally offers robustness against distortions. The source code is available at https://github.com/Div290/CapEEN
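To illustrate how thresholds might be adapted on the fly with a multi-armed bandit, the sketch below treats each candidate threshold as an arm and selects among them with a UCB-style rule. It is a hedged example under stated assumptions: the class name `ThresholdBandit`, the `explore` weight, the Bernoulli toy rewards, and the candidate thresholds are all hypothetical, and the reward actually used by A-CAPEEN is not specified in this summary.

```python
import math
import random

class ThresholdBandit:
    """UCB-style selection over a finite set of candidate exit thresholds."""

    def __init__(self, thresholds, explore=1.0):
        self.thresholds = list(thresholds)
        self.explore = explore                      # exploration weight (assumed)
        self.counts = [0] * len(self.thresholds)    # pulls per arm
        self.means = [0.0] * len(self.thresholds)   # running mean reward per arm
        self.t = 0                                  # total rounds so far

    def select(self):
        self.t += 1
        # Play every arm once before using the confidence bonus.
        for arm, n in enumerate(self.counts):
            if n == 0:
                return arm
        scores = [
            mean + self.explore * math.sqrt(math.log(self.t) / n)
            for mean, n in zip(self.means, self.counts)
        ]
        return scores.index(max(scores))

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental update of the running mean.
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

# Toy environment (hypothetical): each threshold has a fixed mean reward.
rng = random.Random(0)
true_means = {0.6: 0.2, 0.7: 0.5, 0.8: 0.8, 0.9: 0.4}
bandit = ThresholdBandit(true_means.keys())
for _ in range(2000):
    arm = bandit.select()
    alpha = bandit.thresholds[arm]
    reward = 1.0 if rng.random() < true_means[alpha] else 0.0
    bandit.update(arm, reward)

# The most-played threshold is the bandit's current preferred choice.
best = bandit.thresholds[bandit.counts.index(max(bandit.counts))]
```

In deployment the reward would have to be computable without labels, for example from exit confidence and the compute saved; that design choice is left open here.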

Paper Structure

This paper contains 24 sections, 1 theorem, 9 equations, 6 figures, 5 tables, 2 algorithms.

Key Result

Theorem A.1

For any $\gamma \geq 1$, the regret of A-CapEEN with $K$ arms in the action set after $T$ rounds is given as: where $\Delta_{\alpha} = r(\alpha^{*})-r(\alpha)$.
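For reference, the quantity such a theorem bounds can be written using the standard stochastic-bandit definitions. This is an assumption based on common usage, not a reproduction of the paper's statement: the symbols $\mathcal{A}$ (the action set of $K$ thresholds), $\alpha_t$ (the arm played in round $t$), $N_{\alpha}(T)$ (the number of times arm $\alpha$ is played in $T$ rounds), and $R(T)$ are introduced here for illustration.

```latex
\alpha^{*} = \arg\max_{\alpha \in \mathcal{A}} r(\alpha), \qquad
\Delta_{\alpha} = r(\alpha^{*}) - r(\alpha),
\qquad
R(T) = T\, r(\alpha^{*}) - \mathbb{E}\Big[\sum_{t=1}^{T} r(\alpha_t)\Big]
     = \sum_{\alpha \in \mathcal{A}} \Delta_{\alpha}\, \mathbb{E}\big[N_{\alpha}(T)\big]
```

The last equality is the usual regret decomposition: each play of a suboptimal arm $\alpha$ costs $\Delta_{\alpha}$ in expectation, so bounding how often each suboptimal threshold is selected bounds the regret.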

Figures (6)

  • Figure 1: The encoder-decoder framework with attached exits. The figure illustrates that words relying on low-level features can be inferred at early classifiers, while words relying on high-level features are inferred at deeper classifiers. The color of each word in the caption matches the color of the classifier at which it was inferred.
  • Figure 2: The effect of distortion on performance when the model is trained on undistorted images and tested on images with varying distortion levels ($\sigma$ models the distortion level).
  • Figure 3: The overall training process for the decoder. Teacher C denotes the teacher classifier and Student C the student classifier; the bars show the probability distribution across different exits.
  • Figure 4: Change in the evaluation metrics as the time reduction rate varies. The reductions are obtained by changing the threshold parameter $\alpha$.
  • Figure 5: The change in the time reduction rate and the BLEU-4 score as the value of $\lambda$ is varied.
  • ...and 1 more figure
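The teacher and student classifiers mentioned in the Figure 3 caption suggest a distillation objective between the final layer and the intermediate exits. One common way to realize this, assumed here rather than taken from the paper, is a temperature-softened KL divergence from the teacher's distribution to each student's. The function names and the `temperature` parameter are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)  # teacher: final-layer classifier
    q = softmax(student_logits, temperature)  # student: one intermediate exit
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def total_exit_loss(teacher_logits, exits_logits, temperature=2.0):
    """Sum the distillation loss over all intermediate exits (assumed weighting)."""
    return sum(
        distillation_loss(teacher_logits, s, temperature) for s in exits_logits
    )

# Toy example: an exit that agrees with the teacher is penalized less.
teacher = [0.0, 5.0, 0.0]
close_student = [0.0, 4.0, 0.0]
far_student = [5.0, 0.0, 0.0]
close_loss = distillation_loss(teacher, close_student)
far_loss = distillation_loss(teacher, far_student)
```

A higher temperature flattens both distributions, so the student is also trained on the teacher's relative ranking of less likely tokens.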

Theorems & Definitions (1)

  • Theorem A.1