HTML-LSTM: Information Extraction from HTML Tables in Web Pages using Tree-Structured LSTM

Kazuki Kawamura; Akihiro Yamamoto

TL;DR

The paper addresses extracting and integrating relational information from HTML tables whose structure varies across web pages. It introduces HTML-LSTM, a dual-direction Tree-LSTM that operates on DOM trees, encoding each node's tag, text, and PoS tags and using a Bi-LSTM to encode text content. Training uses the $L_{focal}$ and $L_{f1}$ losses together with data augmentation. The method reaches $F_1$ scores of $0.96$ on preschool data and $0.86$ on syllabus data and outperforms a baseline Tree-LSTM. The extracted information supports integrating tables across pages, and the approach can be extended to non-table HTML fragments.
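To make the $L_{focal}$ term concrete, the following is a minimal sketch of the standard focal loss for a single classified node. It assumes the commonly used form $-(1 - p_t)^{\gamma} \log p_t$; the paper's exact formulation (for example any class weighting) may differ, and the function name, `gamma` default, and `eps` clamp are illustrative assumptions.

```python
import math


def focal_loss(probs, target, gamma=2.0, eps=1e-12):
    """Standard focal loss for one example (assumed form, not the paper's exact one).

    probs  -- predicted class probabilities, summing to 1
    target -- index of the true class
    gamma  -- focusing parameter; gamma=0 reduces to cross-entropy
    """
    # Clamp to avoid log(0) when the true class gets zero probability.
    p_t = max(probs[target], eps)
    # The (1 - p_t)^gamma factor down-weights examples that are already easy.
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# A confident correct prediction contributes far less than an uncertain one.
easy = focal_loss([0.9, 0.1], target=0)
hard = focal_loss([0.5, 0.5], target=0)
```

The down-weighting of easy examples is what makes this kind of loss attractive when most nodes in a DOM tree belong to an uninformative majority class.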

Abstract

In this paper, we propose a novel method for extracting information from HTML tables that have similar contents but different structures. We aim to integrate multiple HTML tables into a single table so that information contained in various Web pages can be retrieved. The method extends tree-structured LSTM, a neural network for tree-structured data, in order to extract both the linguistic and the structural information of HTML data. We evaluate the proposed method through experiments using real data published on the WWW.
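As an illustration of the tree-structured input such a model consumes, the sketch below converts an HTML table fragment into a simple tree of tag and text nodes using Python's standard `html.parser`. The `Node` and `TreeBuilder` classes, the `html_to_tree` and `leaves` helpers, and the sample table are hypothetical and are not taken from the paper; the paper's own encoding also attaches PoS tags, which is omitted here.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so the parser must not descend into them.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link", "col"}


class Node:
    """One DOM node: its tag, its directly contained text, and its children."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.text = ""
        self.children = []
        self.parent = parent


class TreeBuilder(HTMLParser):
    """Builds a Node tree while the parser streams through the markup."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, parent=self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore stray closers.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.current.text = (self.current.text + " " + text).strip()


def html_to_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def leaves(node):
    """Return (tag, text) pairs for the leaf nodes, in document order."""
    if not node.children:
        return [(node.tag, node.text)]
    result = []
    for child in node.children:
        result.extend(leaves(child))
    return result


# Hypothetical sample table, for illustration only.
SAMPLE = (
    "<table>"
    "<tr><th>Name</th><td>Sakura Preschool</td></tr>"
    "<tr><th>Capacity</th><td>120</td></tr>"
    "</table>"
)
tree = html_to_tree(SAMPLE)
cells = leaves(tree)
```

Each leaf pairs a structural cue (the tag) with a linguistic one (the cell text), which are the two kinds of information the abstract says the method combines.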

Paper Structure (13 sections, 7 equations, 5 figures, 5 tables)


Introduction
Related Work
HTML-LSTM
Extracting Information
Encoding of HTML Data
HTML-LSTM
Integrating Information
Implementation Details
Experiments
Experiments on Preschool Data
Experiments on Syllabus Data
Ablation Experiments
Conclusion

Figures (5)

Figure 1: The HTML-LSTM framework for information extraction and integration from HTML tables in web pages
Figure 2: The workflow of information extraction using HTML-LSTM
Figure 3: Example of converting HTML data to a tree structure
Figure 4: HTML-LSTM architecture
Figure 5: Example of information integration
