Proposal Report for the 2nd SciCAP Competition 2024

Pengpeng Li; Tingmin Li; Jingyuan Wang; Boyuan Wang; Yang Yang

TL;DR

The paper tackles the summarization of descriptions tied to specific objects within long documents by proposing a two-stage, object-centric pipeline that fuses high-quality OCR data with paragraph-level filtering. It combines PaddleOCR-based object text extraction, chunk-wise paragraph filtering, and two-stage generation (filtering followed by summarization) with a model ensemble of Pegasus and LLaMA2-13B to produce high-quality summaries. The approach achieves top performance in the 2024 SciCAP competition for both the long-caption and short-caption tracks, demonstrating that object-aware inputs and targeted inference reduce noise and improve factual accuracy. This work highlights a practical pathway for high-fidelity, object-centered document summarization in multimodal text settings.
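The filter-then-summarize flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`tokenize`, `chunk_paragraphs`, `filter_paragraphs`, `build_summarizer_input`), the `chunk_size` and `min_overlap` parameters, and the use of simple token overlap as the relevance test are all assumptions made for illustration. The summary above does not specify how relevance is decided, so plain token overlap stands in for that step here.

```python
import re

def tokenize(text):
    """Lowercase alphanumeric tokens; a deliberately simple stand-in tokenizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def chunk_paragraphs(paragraphs, chunk_size):
    """Group consecutive paragraphs into chunks of at most chunk_size items."""
    return [paragraphs[i:i + chunk_size]
            for i in range(0, len(paragraphs), chunk_size)]

def filter_paragraphs(paragraphs, ocr_text, chunk_size=2, min_overlap=2):
    """Stage 1 (assumed heuristic): walk the document chunk by chunk and keep
    paragraphs that share at least min_overlap tokens with the object's OCR text."""
    ocr_tokens = tokenize(ocr_text)
    kept = []
    for chunk in chunk_paragraphs(paragraphs, chunk_size):
        for para in chunk:
            if len(tokenize(para) & ocr_tokens) >= min_overlap:
                kept.append(para)
    return kept

def build_summarizer_input(paragraphs, ocr_text):
    """Stage 2 input (hypothetical format): OCR text plus the filtered paragraphs,
    ready to be passed to a summarization model."""
    kept = filter_paragraphs(paragraphs, ocr_text)
    return "OCR: " + ocr_text + "\nContext: " + " ".join(kept)

# Toy data, invented for this example.
paragraphs = [
    "Figure 3 shows accuracy versus epoch for both models.",
    "We thank the reviewers for their comments.",
    "Accuracy rises until epoch 20 and then plateaus.",
]
ocr_text = "accuracy epoch 0 10 20 30"
kept = filter_paragraphs(paragraphs, ocr_text)
```

In this toy case the acknowledgement paragraph is dropped because it shares no tokens with the figure's OCR text, so only object-related paragraphs reach the summarization stage.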

Abstract

In this paper, we propose a method for document summarization using auxiliary information. This approach effectively summarizes descriptions related to specific images, tables, and appendices within lengthy texts. Our experiments demonstrate that leveraging high-quality OCR data and information initially extracted from the original text enables efficient summarization of the content related to the described objects. Based on these findings, we enhanced popular text generation models by incorporating additional auxiliary branches to improve summarization performance. Our method achieved top scores of 4.33 and 4.66 in the long-caption and short-caption tracks, respectively, of the 2024 SciCAP competition, ranking highest in both categories.

Paper Structure (12 sections, 2 equations, 2 figures, 1 table)

Introduction
Related Work
Methodology
Overall Architecture
Prepare training data
Generate summaries
Model-ensemble
Experiment
Dataset
Details
Results
Conclusion

Figures (2)

Figure 1: Overall architecture. Our solution consists of three main stages: preparing training data, generating summaries, and model ensembling.
Figure 2: Comparison of OCR results from the original data with those extracted using PaddleOCR. The original OCR contains errors, while the OCR captured by PaddleOCR corrects these errors and includes information missed in the original data.
