AC-Lite : A Lightweight Image Captioning Model for Low-Resource Assamese Language

Pankaj Choudhury, Yogesh Aggarwal, Prabhanjan Jadhav, Prithwijit Guha, Sukumar Nandi

TL;DR

This paper addresses the need for computationally efficient image captioning in a low-resource language by introducing AC-Lite, a lightweight model for Assamese designed to run on resource-constrained devices. It combines a ShuffleNetv2x1.5 visual encoder, a GRU-based bilinear attention module, and a GRU decoder, trained with cross-entropy and then refined with self-critical reinforcement learning. The model needs about $2.45$ GFLOPs and $22.87$M parameters while reaching competitive accuracy on COCO-AC and Flickr30K-AC, a favorable trade-off compared to heavier Assamese captioning baselines. Ablation analyses identify the best encoder/decoder configuration and confirm the benefits of GRU-based attention, supporting practical on-device deployment for assistive and educational applications in Assamese-speaking communities.
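The self-critical reinforcement learning step mentioned above can be illustrated with a minimal sketch of the standard self-critical sequence training (SCST) loss for a single caption. This is a generic illustration, not the authors' implementation: the function name `scst_loss` and all numbers are hypothetical, and the reward is assumed to be a sentence-level score such as CIDEr.

```python
import math


def scst_loss(sample_logprobs, sample_reward, greedy_reward):
    """Self-critical loss for one sampled caption (generic SCST form).

    sample_logprobs: log-probabilities of the tokens in the sampled caption.
    sample_reward:   sentence-level reward (e.g. CIDEr) of the sampled caption.
    greedy_reward:   reward of the greedy-decoded caption, used as the baseline.
    """
    # Advantage is positive when sampling beat greedy decoding.
    advantage = sample_reward - greedy_reward
    # Minimizing this raises the log-probability of captions with positive
    # advantage and lowers it for captions with negative advantage.
    return -advantage * sum(sample_logprobs)


# Hypothetical values, for illustration only.
logprobs = [math.log(0.5), math.log(0.25)]
loss_better = scst_loss(logprobs, sample_reward=0.9, greedy_reward=0.7)
loss_worse = scst_loss(logprobs, sample_reward=0.4, greedy_reward=0.7)
print(loss_better, loss_worse)
```

Using the greedy caption's reward as the baseline means no separate value network has to be trained, which keeps this training stage cheap.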

Abstract

Most existing works in image caption synthesis use computation-heavy deep neural networks and generate image descriptions in English. This often prevents this important assistive tool from being used widely across language and accessibility barriers. This work presents AC-Lite, a computationally efficient model for image captioning in the low-resource Assamese language. AC-Lite reduces computational requirements by replacing computation-heavy deep network components with lightweight alternatives. The AC-Lite model is designed through extensive ablation experiments with different image feature extractor networks and language decoders. A combination of ShuffleNetv2x1.5 with a GRU-based language decoder and bilinear attention is found to provide the best performance with minimum compute. AC-Lite achieves a CIDEr score of 82.3 on the COCO-AC dataset with 2.45 GFLOPs and 22.87M parameters.
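To give a feel for why a GRU counts as a lightweight decoder choice, the sketch below counts the parameters of a single recurrent layer using the textbook formulas (one bias vector per gate). The layer sizes are hypothetical and are not taken from the paper, and the LSTM appears only as a familiar reference point, not as a statement about which decoders the ablations compared.

```python
def gru_params(input_size, hidden_size):
    """Parameters of one GRU layer: 3 gates (reset, update, candidate),
    each with input weights, recurrent weights, and one bias vector."""
    return 3 * (hidden_size * (input_size + hidden_size) + hidden_size)


def lstm_params(input_size, hidden_size):
    """Parameters of one LSTM layer: 4 gates (input, forget, cell, output),
    each with input weights, recurrent weights, and one bias vector."""
    return 4 * (hidden_size * (input_size + hidden_size) + hidden_size)


# Hypothetical sizes, chosen only for illustration.
embed_dim, hidden_dim = 512, 512
gru = gru_params(embed_dim, hidden_dim)
lstm = lstm_params(embed_dim, hidden_dim)
print(f"GRU:  {gru:,} parameters")
print(f"LSTM: {lstm:,} parameters")
print(f"GRU / LSTM ratio: {gru / lstm:.2f}")
```

Some frameworks keep two bias vectors per gate, so their reported counts are slightly higher, but the 3:4 ratio between the two layer types is unchanged.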

Paper Structure

This paper contains 12 sections, 3 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Various components of an Image Captioning System.
  • Figure 2: Functional block diagram of the proposed AC-Lite model with Bilinear Attention.
  • Figure 3: Qualitative example produced by the proposed AC-Lite model on the COCO-AC test set. Here, "AC-Lite" denotes the caption generated by the proposed model and "gloss" denotes the gloss annotation.