PixLore: A Dataset-driven Approach to Rich Image Captioning

Diego Bonilla-Salvador, Marcelino Martínez-Sober, Joan Vila-Francés, Antonio José Serrano-López, Pablo Rodríguez-Belenguer, Fernando Mateo

TL;DR

This study introduces PixLore, a novel method that leverages Querying Transformers by fine-tuning the BLIP-2 model with the LoRA method on a standard commercial GPU, and highlights the importance of well-curated datasets in enhancing the performance of smaller models.

Abstract

In the domain of vision-language integration, generating detailed image captions poses a significant challenge due to the lack of curated and rich datasets. This study introduces PixLore, a novel method that leverages Querying Transformers by fine-tuning the BLIP-2 model with the LoRA method on a standard commercial GPU. The approach involves training on a carefully assembled dataset built from the outputs of state-of-the-art Computer Vision models, combined and augmented by ChatGPT. It addresses the question of whether intricate image understanding can be achieved with an ensemble of smaller-scale models, an approach referred to as Knowledge Stitching. Comparative evaluations against major models such as GPT-4 and Google Bard demonstrate that PixLore-2.7B, despite having considerably fewer parameters, is rated higher than the existing State-of-the-Art models in over half of the assessments. Specifically, PixLore outperforms Bard and BLIP-2, which score approximately 35.18% and 27.98% lower than PixLore, respectively, in the task of image captioning. This research not only presents a groundbreaking approach but also highlights the importance of well-curated datasets in enhancing the performance of smaller models.
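
To give a rough sense of why LoRA fine-tuning fits on a single commercial GPU, the sketch below compares the trainable parameters of one fully fine-tuned linear layer with those of a low-rank adapter on the same layer. The layer size and rank used here are illustrative assumptions, not values reported for PixLore, and `lora_param_counts` is a hypothetical helper name.

```python
def lora_param_counts(d_in: int, d_out: int, rank: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one linear layer.

    Full fine-tuning updates the whole d_out x d_in weight matrix.
    LoRA freezes it and trains two small matrices instead:
    A with shape (rank, d_in) and B with shape (d_out, rank).
    """
    if not 0 < rank <= min(d_in, d_out):
        raise ValueError("rank must be between 1 and min(d_in, d_out)")
    full = d_in * d_out
    lora = rank * (d_in + d_out)
    return full, lora


# Assumed, illustrative sizes (not taken from the paper).
full, lora = lora_param_counts(d_in=768, d_out=768, rank=8)
print(f"full fine-tuning: {full} parameters")
print(f"LoRA adapter:     {lora} parameters ({lora / full:.2%} of full)")
```

With these assumed sizes the adapter trains about 2% of the layer's weights, which is the kind of saving that makes fine-tuning a model like BLIP-2 feasible on modest hardware.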

Paper Structure (11 sections, 7 figures, 2 tables)

Figures (7)

  • Figure 1: Each image goes through a variety of Computer Vision models that aid ChatGPT in the creation of very rich captions
  • Figure 2: Joined prompt for ChatGPT from all the models mentioned. Each model corresponds to its section name output converted to plain text. The image on the right is the one used in the example. For reference, the output of ChatGPT for this example is: Two adorable cats are peacefully lounging on a comfortable couch, one on the left and the other on the right. They appear to be curled up and enjoying a restful moment. On the couch's backrest, you can see two remote controls neatly placed. The cats have distinctive tabby fur patterns, and their soft pink blankets are spread out beneath them. It seems like they are in complete control of their relaxation, creating a serene and cozy atmosphere.
  • Figure 3: Sample of the survey header with the instructions given to the evaluators, and a sample question with the five model options. The image is a random image from the test set of the COCO dataset, and the captions are ordered randomly.
  • Figure 4: Comparison of average human evaluators' scores (0 to 5).
  • Figure 5: Density plots for each model's responses.
  • ...and 2 more figures
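
The joined prompt described in the Figure 2 caption, where each model's output is converted to plain text under its own section name, might be assembled along the following lines. This is a minimal sketch under stated assumptions: the function name, the section names, the instruction text, and the sample model outputs are all hypothetical and are not taken from the paper.

```python
def build_joined_prompt(model_outputs: dict[str, str], instruction: str) -> str:
    """Join per-model plain-text outputs into a single prompt string.

    Each entry becomes a section headed by the model's section name.
    Models that produced no text are skipped.
    """
    sections = []
    for section_name, text in model_outputs.items():
        text = text.strip()
        if not text:
            continue  # skip models with empty output
        sections.append(f"{section_name}:\n{text}")
    return instruction.strip() + "\n\n" + "\n\n".join(sections)


# Hypothetical section names and outputs, for illustration only.
outputs = {
    "Object detection": "cat (left), cat (right), remote, remote, couch",
    "Short caption": "two cats lying on a couch",
    "Text in image": "",
}
prompt = build_joined_prompt(
    outputs,
    instruction="Write one rich caption using the information below.",
)
print(prompt)
```

Keeping each model's text under a named section lets the language model see which source each detail came from, and empty sections are dropped so they do not add noise to the prompt.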