Hypertext Entity Extraction in Webpage
Yifei Yang, Tianqiao Liu, Bo Shao, Hai Zhao, Linjun Shou, Ming Gong, Daxin Jiang
TL;DR
This work introduces HEED, a hypertext-enriched dataset for webpage entity extraction in multilingual e-commerce domains, and MoEEF, a Mixture of Experts framework that fuses text with rich hypertext features to improve extraction performance. MoEEF encodes text with XLM-RoBERTa-base and aggregates 20-d hypertext embeddings into a unified representation; modality-specific experts then produce predictions, and a router softly votes among them. This design yields strong cross-language performance and robust handling of long web content. Ablation studies confirm that each hypertext feature category contributes to performance, and that multi-modal input and a well-chosen number of experts are crucial for maximizing F1. The results show that hypertext cues offer tangible gains over text-only baselines and even over competitive GPT-3.5-turbo prompting, highlighting the practical value of structured visual cues for web information extraction on real-world e-commerce data.
Abstract
Webpage entity extraction is a fundamental natural language processing task in both research and applications. Most current webpage entity extraction models are trained on structured datasets that strive to retain textual content and its structural information. However, existing datasets all overlook rich hypertext features (e.g., font color, font size), which have proven effective in previous work. To this end, we first collect a Hypertext Entity Extraction Dataset (HEED) from e-commerce domains, scraping both the text and the corresponding explicit hypertext features, with high-quality manual entity annotations. Furthermore, we present the MoE-based Entity Extraction Framework (MoEEF), which uses a Mixture of Experts to integrate multiple features efficiently and outperforms strong baselines, including state-of-the-art small-scale models and GPT-3.5-turbo. Finally, we analyze the effectiveness of the hypertext features in HEED and of several model components in MoEEF.
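
The soft voting among experts mentioned in the TL;DR can be illustrated with a minimal sketch. It is not the authors' implementation: the three experts, the label set, the router scores, and the function names `softmax` and `soft_vote` are all assumptions chosen for illustration. The sketch only shows the general Mixture of Experts idea for a single token, where a router's gate weights mix the label distributions predicted by each expert.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_vote(router_logits, expert_logits):
    """Mix per-expert label distributions for ONE token.

    router_logits: one score per expert (hypothetical router output).
    expert_logits: one list of label logits per expert.
    Returns the gate-weighted mixture of the experts' label distributions.
    """
    gates = softmax(router_logits)
    expert_probs = [softmax(logits) for logits in expert_logits]
    num_labels = len(expert_probs[0])
    return [
        sum(g * probs[k] for g, probs in zip(gates, expert_probs))
        for k in range(num_labels)
    ]


# Toy inputs (assumed, not from the paper): three experts and a BIO label set.
LABELS = ["O", "B", "I"]
router_logits = [0.2, 1.5, 0.3]   # the router favors the second expert
expert_logits = [
    [2.0, 0.5, 0.1],              # hypothetical text expert
    [0.1, 2.5, 0.3],              # hypothetical hypertext expert
    [1.0, 1.2, 0.2],              # hypothetical fused expert
]

mixed = soft_vote(router_logits, expert_logits)
predicted = LABELS[max(range(len(mixed)), key=mixed.__getitem__)]
print(predicted, [round(p, 3) for p in mixed])
```

Because the gates and each expert's output are both probability distributions, the mixture is itself a valid distribution over labels. No expert is discarded: each contributes in proportion to its gate weight, which is what distinguishes soft voting from hard top-1 routing.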
