Siamese-Driven Optimization for Low-Resolution Image Latent Embedding in Image Captioning

Jing Jie Tan, Anissa Mokraoui, Ban-Hoe Kwan, Danny Wee-Kiat Ng, Yan-Chai Hum

TL;DR

This work tackles the challenge of captioning low-resolution images by introducing SOLI, a Siamese-driven optimization framework that learns robust, transportable latent embeddings through contrastive learning while optionally fine-tuning with cross-entropy. The approach leverages a dual-pathway encoder with augmented data (including Gaussian blur and step resizing) to align embeddings across variations, improving caption quality under resource constraints. Empirical results show that SOLI-par significantly boosts BLEU-4 for several encoder–decoder pairs, while SOLI-half can hinder performance on normal images, underscoring the importance of balanced fine-tuning. The findings suggest SOLI offers a practical, lightweight solution for low-resolution image captioning in resource-limited settings, with potential for incremental and reinforcement-learning extensions.

Abstract

Image captioning is essential in many fields, including assisting visually impaired individuals, improving content management systems, and enhancing human-computer interaction. However, a recent challenge in this domain is dealing with low-resolution images (LRIs). While performance can be improved by using larger models such as transformers for encoding, these models are typically heavyweight and demand significant computational resources and memory, which makes retraining difficult. To address this, the proposed SOLI (Siamese-Driven Optimization for Low-Resolution Image Latent Embedding in Image Captioning) approach is designed specifically for lightweight low-resolution image captioning. It employs a Siamese network architecture to optimize latent embeddings, enhancing the efficiency and accuracy of the image-to-text translation process. By focusing on a dual-pathway neural network structure, SOLI minimizes computational overhead without sacrificing performance, making it well suited to training in resource-constrained scenarios.
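The idea of optimizing latent embeddings with a Siamese (dual-pathway) structure can be illustrated with a small sketch. It assumes the two pathways produce one embedding for the original image and one for its low-resolution counterpart, and that the objective pulls the two together. The cosine-based loss and the function names `cosine_similarity` and `alignment_loss` are illustrative assumptions, not the loss defined in the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_u * norm_v)


def alignment_loss(emb_original, emb_low_res):
    """Hypothetical Siamese alignment loss (an assumption, not the paper's).

    Returns 0 when both embeddings point in the same direction and grows
    up to 2 as they diverge, so minimizing it aligns the two pathways.
    """
    return 1.0 - cosine_similarity(emb_original, emb_low_res)
```

Under this assumption, an encoder whose embedding barely changes when the input is degraded yields a loss near zero, which is the behaviour the similarity plots in Figures 3 and 4 appear to measure.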

Paper Structure

This paper contains 15 sections, 9 equations, 4 figures, 5 tables, 1 algorithm.

Figures (4)

  • Figure 1: The sample image from the Flickr8k dataset undergoing different augmentation techniques. Here, $R$ denotes the ratio, $S$ denotes the step, and $GF$ denotes the Gaussian filter's standard deviation ($\sigma$). For instance, $R0.5S1\_GF500$ indicates an augmentation with a ratio of 0.5, a step of 1, and a Gaussian filter with a standard deviation of 500. Note that for illustration purposes, the image has been resized again to a similar resolution.
  • Figure 2: A depiction of the proposed architecture for SOLI: Siamese-Driven Optimization for Low-Resolution Image Latent Embedding for Captioning.
  • Figure 3: Similarity differences of latent embeddings from the encoder model. Left: ResNet101 model. Right: ViT model.
  • Figure 4: Similarity differences in latent embeddings from the encoder model after fine-tuning (SOLI-par). Left: ResNet101 model. Right: ViT model.
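
The augmentation labels described in the Figure 1 caption (for example, R0.5S1_GF500 for a ratio of 0.5, a step of 1, and a Gaussian standard deviation of 500) can be decoded mechanically. The sketch below is a hypothetical helper, not part of the paper: the names `Augmentation` and `parse_augmentation_label` are invented for illustration, and it assumes the step is an integer and that every label follows the single pattern shown in the caption.

```python
import re
from typing import NamedTuple


class Augmentation(NamedTuple):
    ratio: float           # R: the ratio
    step: int              # S: the step (assumed to be an integer)
    gaussian_sigma: float  # GF: standard deviation of the Gaussian filter


# Pattern inferred from the single example in the Figure 1 caption.
_LABEL = re.compile(
    r"R(?P<ratio>\d+(?:\.\d+)?)S(?P<step>\d+)_GF(?P<sigma>\d+(?:\.\d+)?)"
)


def parse_augmentation_label(label: str) -> Augmentation:
    """Decode a label such as 'R0.5S1_GF500' into its three parameters."""
    cleaned = label.replace("\\_", "_")  # tolerate the LaTeX-escaped underscore
    match = _LABEL.fullmatch(cleaned)
    if match is None:
        raise ValueError(f"unrecognized augmentation label: {label!r}")
    return Augmentation(
        ratio=float(match["ratio"]),
        step=int(match["step"]),
        gaussian_sigma=float(match["sigma"]),
    )
```

For instance, `parse_augmentation_label("R0.5S1_GF500")` yields a ratio of 0.5, a step of 1, and a Gaussian standard deviation of 500.0.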