Siamese-Driven Optimization for Low-Resolution Image Latent Embedding in Image Captioning

Jing Jie Tan; Anissa Mokraoui; Ban-Hoe Kwan; Danny Wee-Kiat Ng; Yan-Chai Hum

TL;DR

This work tackles the challenge of captioning low-resolution images by introducing SOLI, a Siamese-driven optimization framework that learns robust, transportable latent embeddings through contrastive learning while optionally fine-tuning with cross-entropy. The approach leverages a dual-pathway encoder with augmented data (including Gaussian blur and step resizing) to align embeddings across variations, improving caption quality under resource constraints. Empirical results show that SOLI-par significantly boosts BLEU-4 for several encoder–decoder pairs, while SOLI-half can hinder performance on normal images, underscoring the importance of balanced fine-tuning. The findings suggest SOLI offers a practical, lightweight solution for low-resolution image captioning in resource-limited settings, with potential for incremental and reinforcement-learning extensions.

Abstract

Image captioning is essential in many fields, including assisting visually impaired individuals, improving content management systems, and enhancing human-computer interaction. However, a recent challenge in this domain is dealing with low-resolution images (LRIs). Performance can be improved by using larger models such as transformers for encoding, but these models are typically heavyweight: they demand significant computational resources and memory, which makes retraining difficult. To address this, the proposed SOLI (Siamese-Driven Optimization for Low-Resolution Image Latent Embedding in Image Captioning) approach presents a solution specifically designed for lightweight, low-resolution image captioning. It employs a Siamese network architecture to optimize latent embeddings, enhancing the efficiency and accuracy of the image-to-text translation process. By focusing on a dual-pathway neural network structure, SOLI minimizes computational overhead without sacrificing performance, making it well suited to training in resource-constrained scenarios.
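The dual-pathway idea can be illustrated with a small sketch: a clean image and a degraded copy pass through the same encoder, and a contrastive objective pulls the two embeddings together. Everything in it is an assumption made for illustration and is not taken from the paper: the toy linear encoder, the InfoNCE-style loss, the temperature value, and the function names `step_resize`, `encode`, and `contrastive_loss` are all hypothetical. Only step resizing is shown; Gaussian blur is omitted for brevity.

```python
import numpy as np


def step_resize(image, factor):
    """Degrade an image by keeping every `factor`-th pixel, then repeating
    pixels to restore the original size (nearest-neighbour up-sampling).
    Hypothetical stand-in for the step-resizing augmentation."""
    small = image[::factor, ::factor]
    restored = np.repeat(np.repeat(small, factor, axis=0), factor, axis=1)
    return restored[: image.shape[0], : image.shape[1]]


def encode(images, weights):
    """Toy shared encoder (assumption): flatten, project, L2-normalise.
    Both Siamese branches call this with the same weights."""
    flat = images.reshape(images.shape[0], -1)
    z = flat @ weights
    return z / np.linalg.norm(z, axis=1, keepdims=True)


def contrastive_loss(z_a, z_b, temperature=0.1):
    """InfoNCE-style loss (assumption): row i of z_a should be most
    similar to row i of z_b among all rows in the batch."""
    logits = (z_a @ z_b.T) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


rng = np.random.default_rng(0)
images = rng.random((8, 16, 16))            # batch of toy grayscale images
weights = rng.normal(size=(16 * 16, 32))    # shared encoder parameters

low_res = np.stack([step_resize(img, 4) for img in images])
z_clean = encode(images, weights)           # branch 1: original images
z_low = encode(low_res, weights)            # branch 2: degraded images
loss = contrastive_loss(z_clean, z_low)
```

In a real training loop the loss would be minimised with respect to the encoder parameters, so that low-resolution inputs map close to their full-resolution counterparts in the latent space before the decoder generates a caption.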
