The Solution for the ICCV 2023 1st Scientific Figure Captioning Challenge

Dian Chao, Xin Song, Shupeng Zhong, Boyuan Wang, Xiangyu Wu, Chen Zhu, Yang Yang

TL;DR

Text generation is trained primarily with maximum likelihood estimation, while generated captions are assessed with separate evaluation metrics. The solution integrates the BRIO model framework to close this gap, aligning the generation and evaluation processes more coherently.

Abstract

In this paper, we propose a solution for improving the quality of captions generated for figures in papers. We adopt the approach of summarizing the textual content in the paper to generate image captions. Throughout our study, we encounter discrepancies in the OCR information provided in the official dataset. To rectify this, we employ the PaddleOCR toolkit to extract OCR information from all images. Moreover, we observe that certain textual content in the official paper pertains to images that are not relevant for captioning, thereby introducing noise during caption generation. To mitigate this issue, we leverage LLaMA to extract image-specific information by querying the textual content based on image mentions, effectively filtering out extraneous information. Additionally, we recognize a discrepancy between the primary use of maximum likelihood estimation during text generation and the evaluation metrics such as ROUGE employed to assess the quality of generated captions. To bridge this gap, we integrate the BRIO model framework, enabling a more coherent alignment between the generation and evaluation processes. Our approach ranked first in the final test with a score of 4.49.
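
The gap between likelihood-based training and metric-based evaluation is what BRIO-style training addresses with a ranking objective over candidate captions. The following is a minimal sketch of such a pairwise margin loss, assuming the candidates are already sorted best-first by a metric such as ROUGE. The function names, the margin value, and the length-penalty exponent `alpha` are illustrative assumptions, not settings reported in this paper.

```python
from typing import Sequence


def length_normalized_score(token_logprobs: Sequence[float], alpha: float = 1.0) -> float:
    """Score a candidate as the sum of its token log-probabilities
    divided by (length ** alpha). `alpha` is an assumed length penalty."""
    if not token_logprobs:
        raise ValueError("candidate must contain at least one token")
    return sum(token_logprobs) / (len(token_logprobs) ** alpha)


def ranking_loss(scores: Sequence[float], margin: float = 0.01) -> float:
    """Pairwise margin ranking loss.

    scores[i] is the model score of the candidate ranked i-th by the
    evaluation metric (index 0 = best). A pair (i, j) with i < j is
    penalized when the worse candidate j is not scored at least
    (j - i) * margin below the better candidate i.
    """
    loss = 0.0
    n = len(scores)
    for i in range(n):
        for j in range(i + 1, n):
            loss += max(0.0, scores[j] - scores[i] + (j - i) * margin)
    return loss


if __name__ == "__main__":
    # Model scores that agree with the metric ranking incur no penalty.
    print(ranking_loss([-0.1, -0.5, -0.9]))
    # Model scores that invert the metric ranking are penalized.
    print(ranking_loss([-0.9, -0.5, -0.1]))
```

In a full training setup this term would typically be added to the usual cross-entropy loss with a weighting factor, so the model keeps its generation ability while learning to rank candidates the way the metric does.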

Paper Structure

This paper contains 10 sections, 5 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: In the official dataset, the OCR information contains errors, such as extracting ‘300’ as ‘30}’. Our OCR not only corrects these errors but also extracts more valuable information from the figures.
  • Figure 2: Using LLaMA to query paragraphs and refine results: Initially, the presence of information from multiple other images in the original paragraph caused interference. However, after querying the paragraph with LLaMA, only the information pertaining to the target image is retained.
  • Figure 3: The main structure of our approach.
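
The paragraph-querying step described for Figure 2 can be illustrated with a small sketch. It only shows how a paragraph that mentions the target figure might be detected and wrapped in a query for a language model. The helper names and the prompt wording are hypothetical, the regular expression is a stand-in for whatever mention matching the authors used, and no LLaMA call is made here.

```python
import re


def mentions_figure(text: str, figure_number: int) -> bool:
    """Return True if `text` mentions the given figure, e.g. 'Figure 2' or 'Fig. 2'.
    The trailing word boundary keeps 'Figure 21' from matching figure 2."""
    pattern = rf"\b(?:Figure|Fig\.?)\s*{figure_number}\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def build_query_prompt(paragraph: str, figure_number: int) -> str:
    """Build a hypothetical prompt asking a language model to keep only
    the content of `paragraph` that concerns the target figure."""
    return (
        "The following paragraph may discuss several figures.\n"
        f"Paragraph: {paragraph}\n"
        f"Question: What does the paragraph say about Figure {figure_number}? "
        "Answer using only information about that figure."
    )


if __name__ == "__main__":
    paragraph = "Fig. 1 shows the setup. As shown in Fig. 2, accuracy rises with depth."
    if mentions_figure(paragraph, 2):
        print(build_query_prompt(paragraph, 2))
```

The model's answer to such a prompt, rather than the raw paragraph, would then serve as the figure-specific input for caption generation.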