HTML-LSTM: Information Extraction from HTML Tables in Web Pages using Tree-Structured LSTM
Kazuki Kawamura, Akihiro Yamamoto
TL;DR
The paper tackles extracting and integrating relational information from HTML tables that appear on different web pages with varying structures. It introduces HTML-LSTM, a dual-direction Tree-LSTM that operates on DOM trees and encodes each node's HTML tag, text, and part-of-speech (PoS) tags, with a Bi-LSTM encoding the text content. The model is trained with the $L_{focal}$ and $L_{f1}$ losses and data augmentation. It achieves $F_1$ scores of $0.96$ on preschool data and $0.86$ on syllabus data, outperforming a baseline Tree-LSTM. The approach supports integrating tables across pages and can be extended to non-table HTML fragments.
Abstract
In this paper, we propose a novel method for extracting information from HTML tables that have similar contents but different structures. We aim to integrate multiple HTML tables into a single table for retrieving information contained in various Web pages. The method extends the tree-structured LSTM, a neural network for tree-structured data, in order to extract both the linguistic and the structural information of HTML data. We evaluate the proposed method through experiments using real data published on the WWW.
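
The summary above names two training losses, $L_{focal}$ and $L_{f1}$, without giving their formulas. The sketch below shows one common formulation of each for a node classification setting: the focal loss $-(1 - p_t)^{\gamma} \log p_t$ and a macro-averaged "soft" F1 loss that uses predicted probabilities as soft counts. The function names, the `gamma` default, and the macro averaging are assumptions made for illustration; the paper's exact definitions may differ.

```python
import math


def focal_loss(probs, target, gamma=2.0):
    """Focal loss for one node: -(1 - p_t)^gamma * log(p_t).

    probs  -- predicted class probabilities for the node (sums to 1)
    target -- index of the true class
    gamma  -- focusing parameter (assumed default; gamma=0 gives cross-entropy)
    """
    p_t = max(probs[target], 1e-12)  # clamp to avoid log(0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def soft_f1_loss(probs_batch, targets, num_classes, eps=1e-8):
    """1 - macro-averaged soft F1 over a batch of nodes.

    Probabilities stand in for hard predictions, so the quantity stays
    differentiable when written in an autodiff framework.
    """
    f1_sum = 0.0
    for c in range(num_classes):
        # Soft true positives: probability mass on class c for nodes of class c.
        tp = sum(p[c] for p, t in zip(probs_batch, targets) if t == c)
        predicted = sum(p[c] for p in probs_batch)   # soft count of predictions
        actual = sum(1 for t in targets if t == c)   # count of true labels
        f1_sum += 2.0 * tp / (predicted + actual + eps)
    return 1.0 - f1_sum / num_classes
```

In this formulation the focal loss down-weights nodes that are already classified confidently, which is a common choice when most nodes in a tree belong to a single background class, while the soft F1 term optimizes a smooth surrogate of the evaluation metric.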
