Sm-Nd Isotope Data Compilation from Geoscientific Literature Using an Automated Tabular Extraction Method

Zhixin Guo, Tao Wang, Chaoyang Wang, Jianping Zhou, Guanjie Zheng, Xinbing Wang, Chenghu Zhou

Abstract

The rare earth elements Sm and Nd help address fundamental questions about crustal growth, such as its spatiotemporal evolution and the interplay between orogenesis and crustal accretion. Their relative immobility during high-grade metamorphism makes the Sm-Nd isotopic system crucial for inferring crustal formation times. Because sampling procedures are complicated and costly, Sm-Nd data have historically been published sporadically, resulting in a fragmented knowledge base. Compiling critical geoscience data scattered across many publications therefore demands substantial human effort and time. In response, we present an automated tabular extraction method for harvesting tabular geoscience data. Using this method, we collected 10,624 Sm-Nd data entries from 9,138 tables in over 20,000 geoscience publications. From these entries, we manually selected 2,118 data points to supplement our previously constructed global Sm-Nd dataset, increasing its sample count by over 20%. Our automatic data collection methodology enhances the efficiency of data acquisition across various scientific domains. Furthermore, the constructed Sm-Nd isotopic dataset should motivate research on classifying global orogenic belts.

Paper Structure (19 sections, 7 figures, 2 tables)

This paper contains 19 sections, 7 figures, 2 tables.

Figures (7)

  • Figure 1: An overview of the workflow of the tool for automatically searching and collecting tabular data, which consists of two main steps: (A) document retrieval and (B) tabular data collection.
  • Figure 2: An overview of the tabular data collection pipeline. The pipeline consists of four modules: (a) table region detection, (b) text detection, (c) table structure recognition, and (d) table content construction.
  • Figure 3: An overview of table region detection. The table detection neural network localizes tables and returns the table position and page information.
  • Figure 4: An overview of tabular structure recognition with frames. (A) denotes the initial table structure complete with frames. (B) indicates the detection of horizontal lines. (C) indicates the vertical line detection, and (D) showcases the identification of adjacent points.
  • Figure 5: An overview of tabular structure recognition without frames. (A) presents the initial table lacking frames. (B) illustrates the result of image processing. (C) exhibits the construction of vertical and horizontal lines. (D) indicates the table structure after cell-merging recognition. (E) displays the final output following comprehensive table structure recognition.
  • ...and 2 more figures
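
The Figure 4 caption describes structure recognition for framed tables as horizontal line detection, vertical line detection, and identification of adjacent points. The following is a minimal sketch of that idea, not the authors' implementation. It assumes the table region has already been cropped and binarized (1 for ink, 0 for background) and that ruling lines span most of the table. The function names `find_line_positions` and `table_grid` and the `min_fraction` threshold are hypothetical, introduced only for illustration.

```python
import numpy as np


def find_line_positions(binary, axis, min_fraction=0.8):
    """Return centre indices of ruling lines in a binarized table image.

    binary: 2-D array with 1 for ink and 0 for background.
    axis=1 scans rows (horizontal lines); axis=0 scans columns (vertical lines).
    min_fraction: assumed share of ink pixels needed to call a row/column a line.
    """
    coverage = binary.mean(axis=axis)
    hits = np.flatnonzero(coverage >= min_fraction)
    if hits.size == 0:
        return []
    # Group consecutive indices so that a thick line is counted once.
    breaks = np.flatnonzero(np.diff(hits) > 1) + 1
    return [int(group.mean()) for group in np.split(hits, breaks)]


def table_grid(binary, min_fraction=0.8):
    """Derive line positions, intersection points, and cell boxes."""
    rows = find_line_positions(binary, axis=1, min_fraction=min_fraction)
    cols = find_line_positions(binary, axis=0, min_fraction=min_fraction)
    # Every horizontal/vertical line pair meets in one intersection point.
    points = [(r, c) for r in rows for c in cols]
    # Adjacent intersection points bound one cell: (top, left, bottom, right).
    cells = [
        (r0, c0, r1, c1)
        for r0, r1 in zip(rows, rows[1:])
        for c0, c1 in zip(cols, cols[1:])
    ]
    return rows, cols, points, cells
```

On a synthetic image with three horizontal and three vertical rulings, this yields nine intersection points and four cell boxes. Tables with partial rulings or merged cells would need the additional handling suggested by the Figure 5 caption, which this sketch does not attempt.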