Pharos-ESG: A Framework for Multimodal Parsing, Contextual Narration, and Hierarchical Labeling of ESG Report
Yan Chen, Yu Zou, Jialei Zeng, Haoran You, Xiaorui Zhou, Aixi Zhong
TL;DR
Pharos-ESG tackles the challenge of understanding and analyzing long, visually irregular ESG reports by combining reading-order modeling, ToC-guided hierarchy reconstruction, contextual image-to-text transformation, and multi-level labeling for ESG, GRI, and sentiment. The framework integrates a block-based reading order with a Relation-Aware Transformer, RAP for ToC parsing, ALIGN for precise ToC-body alignment, and a ternary, hierarchy-aware labeling scheme, all validated on expert-annotated benchmarks and the Aurora-ESG dataset. Key contributions include state-of-the-art parsing performance, robust cross-market generalization across China, Hong Kong, and the U.S., and the release of Aurora-ESG as a large-scale multimodal ESG resource. Collectively, Pharos-ESG enables scalable, interpretable ESG analysis for financial governance and decision-making, providing structured, multilingual, and semantically rich representations of complex disclosures.
Abstract
Environmental, Social, and Governance (ESG) principles are reshaping the foundations of global financial governance, transforming capital allocation architectures, regulatory frameworks, and systemic risk coordination mechanisms. However, as the core medium for assessing corporate ESG performance, ESG reports present significant challenges for large-scale understanding, due to chaotic reading order from slide-like irregular layouts and implicit hierarchies arising from lengthy, weakly structured content. To address these challenges, we propose Pharos-ESG, a unified framework that transforms ESG reports into structured representations through multimodal parsing, contextual narration, and hierarchical labeling. It integrates a reading-order modeling module based on layout flow, hierarchy-aware segmentation guided by table-of-contents anchors, and a multimodal aggregation pipeline that contextually transforms visual elements into coherent natural language. The framework further enriches its outputs with ESG, GRI, and sentiment labels, yielding annotations aligned with the analytical demands of financial research. Extensive experiments on annotated benchmarks demonstrate that Pharos-ESG consistently outperforms both dedicated document parsing systems and general-purpose multimodal models. In addition, we release Aurora-ESG, the first large-scale public dataset of ESG reports, spanning Mainland China, Hong Kong, and U.S. markets, featuring unified structured representations of multimodal content, enriched with fine-grained layout and semantic annotations to better support ESG integration in financial governance and decision-making.
