Automated Image Captioning with CNNs and Transformers
Joshua Adrian Cahyono, Jeremy Nathan Jusuf
TL;DR
This work addresses automated image captioning by comparing CNN-RNN baselines against progressively more sophisticated attention- and transformer-based architectures: CNN-Attn, YOLOCNN-Attn, ViT-Attn, and a novel ViTCNN-Attn hybrid. The models are trained on MSCOCO and Flickr30k to maximize the caption log-likelihood, $\theta^* = \arg\max_\theta \sum_{(I,S)} \log p(S \mid I; \theta)$, where $I$ is an image and $S$ its reference caption; in practice this is a cross-entropy loss over caption tokens. Captions are evaluated with BLEU, METEOR, and CIDEr. Key contributions include a comprehensive evaluation of attention-enhanced and transformer-based models, a practical precomputation strategy that accelerates training, and a demonstration that the ViTCNN-Attn hybrid best leverages both local CNN features and global ViT context to produce high-quality captions. The results underscore the value of combining regional object awareness with global contextual understanding, offering a promising path toward robust, descriptive image captioning in real-world applications.
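To make the training objective concrete, the following is a minimal sketch of how the log-likelihood of one caption relates to the per-token cross-entropy loss. It assumes the decoder outputs a probability distribution over the vocabulary at every step; the function names, the three-word vocabulary, and the probabilities are illustrative assumptions, not taken from the authors' implementation.

```python
import math

def sequence_log_likelihood(step_probs, target_ids):
    """Sum of log p(S_t | I, S_<t) over the tokens of one caption.

    step_probs: one probability distribution over the vocabulary per step
                (hypothetical decoder output, already softmax-normalized).
    target_ids: index of the reference token at each step.
    """
    if len(step_probs) != len(target_ids):
        raise ValueError("need one distribution per target token")
    return sum(math.log(probs[t]) for probs, t in zip(step_probs, target_ids))

def cross_entropy_loss(step_probs, target_ids):
    """Mean negative log-likelihood per token; minimizing it maximizes log p(S | I)."""
    return -sequence_log_likelihood(step_probs, target_ids) / len(target_ids)

# Toy example: vocabulary of 3 tokens, reference caption of 2 tokens.
step_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
target_ids = [0, 1]
loss = cross_entropy_loss(step_probs, target_ids)
```

Averaging over tokens, as done here, is one common normalization; summing over tokens instead gives the same optimum for a fixed dataset but a different loss scale.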
Abstract
This project builds an automated image captioning system that generates natural language descriptions for input images by integrating techniques from computer vision and natural language processing. We employ a range of techniques, from CNN-RNN baselines to more advanced transformer-based models. Training is carried out on image datasets paired with descriptive captions, and model performance is evaluated using established metrics such as BLEU, METEOR, and CIDEr. The project also experiments with attention mechanisms, compares architectural choices, and tunes hyperparameters to improve captioning accuracy and overall system effectiveness.
