CAPEEN: Image Captioning with Early Exits and Knowledge Distillation
Divya Jyoti Bajpai, Manjesh Kumar Hanawal
TL;DR
CapEEN addresses the latency challenge of image captioning by embedding knowledge-distilled early exits into an encoder–decoder backbone (Swin-Transformer + GPT-2). Its two-stage training preserves backbone quality while distilling deep-layer knowledge into intermediate exits, so that tokens can exit early when the prediction is confident. To cope with distribution drift in deployment, A-CapEEN uses a Multi-Armed Bandit framework to adapt exit thresholds online, achieving robustness to distortions with minimal overhead. Across MS COCO and Flickr30k, CapEEN delivers a 1.77x speedup with competitive caption quality, while A-CapEEN further improves resilience under noise and blur, which matters for real-world, resource-constrained deployments.
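The confidence-based exit rule summarized above can be illustrated with a minimal sketch. It assumes that confidence is the maximum softmax probability at an exit and that a single threshold is shared by all exits; the names `softmax` and `early_exit_token` are hypothetical and not taken from the CapEEN code base.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def early_exit_token(exit_logits, threshold):
    """Pick the token for one decoding step using early exits.

    exit_logits: per-exit logit vectors ordered from shallow to deep;
                 the last entry stands for the final layer.
    threshold:   confidence level an exit must exceed to stop early
                 (assumed here to be shared by all exits).
    Returns (exit_index, token_id).
    """
    last = len(exit_logits) - 1
    for i, logits in enumerate(exit_logits):
        probs = softmax(logits)
        conf = max(probs)  # assumed confidence measure: max softmax probability
        # Stop at the first sufficiently confident exit; the final layer always answers.
        if conf > threshold or i == last:
            return i, probs.index(conf)

# Toy logits for three exits over a three-token vocabulary.
toy_exits = [[0.1, 0.2, 0.1], [0.0, 5.0, 0.0], [0.0, 0.0, 9.0]]
exit_index, token_id = early_exit_token(toy_exits, threshold=0.9)
```

In this toy input the first exit is nearly uniform, so the token is emitted at the second exit; raising the threshold pushes the decision to the final layer, which is the accuracy versus latency trade-off the thresholds control.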
Abstract
Deep neural networks (DNNs) have made significant progress in recognizing visual elements and generating descriptive text in image-captioning tasks. However, their improved performance comes at the cost of increased computational burden and inference latency. Early Exit (EE) strategies can be used to enhance their efficiency, but adapting them to image captioning is challenging because the task requires varying levels of semantic information for accurate predictions. To overcome this, we introduce CAPEEN, which improves the performance of EE strategies using knowledge distillation. Inference in CAPEEN is completed at an intermediate layer if the prediction confidence exceeds a predefined threshold learned from the training data. To account for real-world deployments, where the target distribution could drift from that of the training samples, we introduce a variant, A-CAPEEN, that adapts the thresholds on the fly using a multi-armed bandit framework. Experiments on the MS COCO and Flickr30k datasets show that CAPEEN achieves a speedup of 1.77x while maintaining competitive performance compared to the final layer, and A-CAPEEN additionally offers robustness against distortions. The source code is available at https://github.com/Div290/CapEEN
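To make the idea of adapting thresholds online more concrete, here is a minimal sketch that treats each candidate threshold as an arm of a bandit and selects arms with the standard UCB1 rule. The abstract does not specify the bandit algorithm or the reward, so both are assumptions here: the class `ThresholdBandit`, the candidate thresholds, and the fixed toy rewards are hypothetical stand-ins for a real signal that would trade off confidence against computation.

```python
import math

class ThresholdBandit:
    """UCB1 over a finite set of candidate exit thresholds (illustrative only)."""

    def __init__(self, thresholds):
        self.thresholds = list(thresholds)
        self.counts = [0] * len(self.thresholds)   # pulls per arm
        self.means = [0.0] * len(self.thresholds)  # running mean reward per arm
        self.t = 0                                  # total number of updates

    def select(self):
        """Return the index of the threshold to use for the next sample."""
        # Try every arm once before applying the UCB rule.
        for i, c in enumerate(self.counts):
            if c == 0:
                return i
        ucb = [m + math.sqrt(2.0 * math.log(self.t) / c)
               for m, c in zip(self.means, self.counts)]
        return ucb.index(max(ucb))

    def update(self, arm, reward):
        """Record the observed reward for the chosen arm."""
        self.t += 1
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

# Toy run: fixed, made-up rewards stand in for a real confidence/cost signal.
toy_rewards = [0.2, 0.8, 0.5]
bandit = ThresholdBandit([0.6, 0.8, 0.95])
for _ in range(500):
    arm = bandit.select()
    bandit.update(arm, toy_rewards[arm])
best_threshold = bandit.thresholds[bandit.counts.index(max(bandit.counts))]
```

In the toy run the bandit concentrates its pulls on the arm with the highest reward. In deployment the reward would be computed from each incoming sample without labels, which is what allows the thresholds to follow a drifting input distribution.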
