Proposal Report for the 2nd SciCAP Competition 2024
Pengpeng Li, Tingmin Li, Jingyuan Wang, Boyuan Wang, Yang Yang
TL;DR
The paper tackles summarizing descriptions tied to specific objects within long documents by proposing a two-stage, object-centric pipeline that fuses high-quality OCR data with paragraph-level filtering. It combines PaddleOCR-based object text extraction, chunk-wise paragraph filtering, and two-stage generation (filtering followed by summarization) with a model ensemble of Pegasus and LLaMA2-13B to produce high-quality summaries. The approach achieves top performance in the 2024 SciCAP competition for both the long-caption and short-caption tracks, demonstrating that object-aware inputs and targeted inference reduce noise and improve factual accuracy. This work highlights a practical pathway for high-fidelity, object-centered document summarization in multimodal text settings.
Abstract
In this paper, we propose a method for document summarization using auxiliary information. This approach effectively summarizes descriptions related to specific images, tables, and appendices within lengthy texts. Our experiments demonstrate that leveraging high-quality OCR data and information initially extracted from the original text enables efficient summarization of the content related to the described objects. Based on these findings, we enhanced popular text generation models by incorporating additional auxiliary branches to improve summarization performance. Our method achieved top scores of 4.33 and 4.66 in the long-caption and short-caption tracks, respectively, of the 2024 SciCAP competition, ranking highest in both categories.
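To make the idea of chunk-wise paragraph filtering concrete, the following is a minimal sketch, not the method used in the competition entry. It assumes the OCR text of the target object is already available as a string, and it swaps in a simple token-overlap score as a stand-in for whatever learned filter the pipeline uses. The function names (`tokenize`, `overlap_score`, `filter_paragraphs`) and the parameters `chunk_size` and `keep_per_chunk` are hypothetical and chosen only for illustration.

```python
import re


def tokenize(text):
    """Return the set of lowercase alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(paragraph, ocr_text):
    """Fraction of OCR tokens that also appear in the paragraph.

    Assumption: lexical overlap is only a stand-in for a learned relevance model.
    """
    ocr_tokens = tokenize(ocr_text)
    if not ocr_tokens:
        return 0.0
    return len(ocr_tokens & tokenize(paragraph)) / len(ocr_tokens)


def filter_paragraphs(paragraphs, ocr_text, chunk_size=2, keep_per_chunk=1):
    """Split paragraphs into consecutive chunks and keep the best of each chunk.

    Kept paragraphs are returned in their original document order.
    """
    kept = []
    for start in range(0, len(paragraphs), chunk_size):
        indices = range(start, min(start + chunk_size, len(paragraphs)))
        # Rank the paragraphs of this chunk by overlap with the object's OCR text.
        ranked = sorted(
            indices,
            key=lambda i: overlap_score(paragraphs[i], ocr_text),
            reverse=True,
        )
        # Restore document order among the survivors of this chunk.
        kept.extend(paragraphs[i] for i in sorted(ranked[:keep_per_chunk]))
    return kept


paragraphs = [
    "Accuracy versus epoch is shown for the baseline.",
    "We thank the reviewers.",
    "The dataset has ten classes.",
    "Figure 2 plots accuracy per epoch for each model.",
]
ocr_text = "accuracy epoch model"
selected = filter_paragraphs(paragraphs, ocr_text)
```

In a two-stage setup of the kind described above, the paragraphs in `selected` would then be concatenated with the OCR text and passed to the summarization model, so that the generator sees only content likely to concern the target object.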
