Hypertext Entity Extraction in Webpage

Yifei Yang, Tianqiao Liu, Bo Shao, Hai Zhao, Linjun Shou, Ming Gong, Daxin Jiang

TL;DR

This work introduces HEED, a hypertext-enriched dataset for webpage entity extraction in multilingual e-commerce domains, and MoEEF, a Mixture of Experts framework that fuses text and rich hypertext features to improve extraction performance. MoEEF encodes text with XLM-RoBERTa-base and aggregates 20-dimensional hypertext embeddings into a unified representation; modality-specific experts then produce predictions, and a router softly votes among them. This design enables strong cross-language performance and robust handling of long web content. Ablation studies confirm that each hypertext feature category contributes to performance, and that multi-modal input and a well-chosen number of experts are crucial for maximizing F1. The results show that hypertext cues offer tangible gains over text-only baselines and even over competitive GPT-3.5-turbo prompting, highlighting the practical value of structured visual cues for web information extraction on real-world e-commerce data.
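
The soft-voting step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the two-expert setup, and the toy scores are assumptions made for illustration, and the actual router and expert architectures in MoEEF are not specified in this summary. The sketch only shows the general idea of a router weighting each expert's label distribution and summing the results.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a 1-D sequence of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_vote(
    router_logits: Sequence[float],
    expert_label_logits: Sequence[Sequence[float]],
) -> List[float]:
    """Blend per-expert label distributions using router weights.

    router_logits: one score per expert (hypothetical router output).
    expert_label_logits: one list of label scores per expert, all of
        the same length (hypothetical expert outputs for one token).
    Returns a single probability distribution over labels.
    """
    if len(router_logits) != len(expert_label_logits):
        raise ValueError("need exactly one router score per expert")
    gate = softmax(router_logits)  # soft weights, sum to 1
    expert_probs = [softmax(scores) for scores in expert_label_logits]
    num_labels = len(expert_probs[0])
    return [
        sum(g * probs[k] for g, probs in zip(gate, expert_probs))
        for k in range(num_labels)
    ]


# Toy example (assumed values): a "text" expert and a "hypertext"
# expert score two labels for one token; the router trusts both equally.
mixed = soft_vote(
    router_logits=[0.0, 0.0],
    expert_label_logits=[[0.0, 0.0], [math.log(3.0), 0.0]],
)
```

Because the router weights and each expert's label probabilities both sum to one, the blended output is itself a valid probability distribution, so no expert is ever switched off outright; it is only down-weighted.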

Abstract

Webpage entity extraction is a fundamental natural language processing task in both research and applications. Today, most webpage entity extraction models are trained on structured datasets that strive to retain textual content and its structural information. However, existing datasets all overlook the rich hypertext features (e.g., font color, font size) that have proven effective in previous work. To this end, we first collect a Hypertext Entity Extraction Dataset (HEED) from e-commerce domains, scraping both the text and the corresponding explicit hypertext features, with high-quality manual entity annotations. Furthermore, we present the MoE-based Entity Extraction Framework (MoEEF), which uses a Mixture of Experts to efficiently integrate multiple features and outperforms strong baselines, including state-of-the-art small-scale models and GPT-3.5-turbo. We also analyze the effectiveness of the hypertext features in HEED and of several model components in MoEEF.

Paper Structure

This paper contains 30 sections, 8 equations, 8 figures, and 8 tables.

Figures (8)

  • Figure 1: (a) A webpage from SWDE, which lacks hypertext features and contains noise. (b) A webpage from our HEED that keeps hypertext information.
  • Figure 2: A sample from HEED.
  • Figure 3: Overview of the MoEEF. The Hypertext Features are extracted from the original rendered webpages.
  • Figure 4: Visualization of the router for different tasks.
  • Figure 5: Visualizations of representations for different experts.
  • ...and 3 more figures