Constructing Cross-lingual Consumer Health Vocabulary with Word-Embedding from Comparable User Generated Content

Chia-Hsuan Chang; Lei Wang; Christopher C. Yang

TL;DR

The paper addresses the lack of consumer health vocabulary (CHV) outside English by introducing a cross-lingual automatic term recognition (ATR) framework. The framework learns two language-specific word vector spaces by training skip-gram on health consumer-generated content (HCGC), then aligns them into a bilingual space with an orthogonal Procrustes transformation guided by a small set of bilingual anchors from Wikipedia. Terms are expanded through cosine-based retrieval in the bilingual space with a dynamic, modularity-driven threshold, so a CHV can be assembled across languages with fewer human annotations. Experiments show that the approach outperforms large language models such as GPT-3.5-Turbo and Cohere Rerank in identifying cross-language CHV, demonstrating both effectiveness and resource efficiency. The method supports scalable cross-language CHV construction for multilingual consumer health applications and reduces reliance on parallel corpora or manual translations. The authors acknowledge limitations with infrequent multiword expressions and with corpus size imbalance across languages.

Abstract

The online health community (OHC) is the primary channel for laypeople to share health information. Analyzing the health consumer-generated content (HCGC) from OHCs requires identifying the colloquial medical expressions used by laypeople, which is a critical challenge. The open-access and collaborative consumer health vocabulary (OAC CHV) is the controlled vocabulary for addressing this challenge. Nevertheless, OAC CHV is available only in English, which limits its applicability to other languages. This research proposes a cross-lingual automatic term recognition framework for extending the English CHV into a cross-lingual one. Our framework requires an English HCGC corpus and a non-English HCGC corpus (Chinese in this study) as inputs. Two monolingual word vector spaces are learned with the skip-gram algorithm so that each space encodes the common word associations of laypeople within a language. Based on the isometry assumption, the framework aligns the two monolingual spaces into a bilingual word vector space, where we employ cosine similarity as the metric for identifying semantically similar words across languages. The experimental results demonstrate that our framework outperforms two large language model baselines in identifying CHV across languages. Our framework requires only raw HCGC corpora and a limited number of medical translations, reducing the human effort needed to compile a cross-lingual CHV.
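
As a rough illustration of the alignment and retrieval steps summarized above, the following Python sketch fits an orthogonal map between two vector spaces from a few anchor pairs and then retrieves cross-language neighbors by cosine similarity. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (procrustes_align, cosine_retrieve), the toy word lists, and the fixed similarity threshold are all hypothetical. Random vectors stand in for skip-gram embeddings, the "Chinese" space is constructed as an exact rotation of the "English" one so that the isometry assumption holds by design, and the dynamic, modularity-driven threshold is replaced by a constant.

```python
import numpy as np


def procrustes_align(src_anchors, tgt_anchors):
    """Return the orthogonal W minimizing ||src_anchors @ W - tgt_anchors||_F.

    Standard orthogonal Procrustes solution: with U S V^T the SVD of
    src_anchors^T @ tgt_anchors, the minimizer is W = U @ V^T.
    """
    u, _, vt = np.linalg.svd(src_anchors.T @ tgt_anchors)
    return u @ vt


def cosine_retrieve(query, candidates, labels, threshold):
    """Rank candidates by cosine similarity to query; keep those >= threshold."""
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    sims = c @ q
    ranked = np.argsort(-sims)  # indices sorted by descending similarity
    return [(labels[i], float(sims[i])) for i in ranked if sims[i] >= threshold]


# Toy data (assumption): random vectors stand in for skip-gram embeddings.
rng = np.random.default_rng(0)
dim = 4
en_words = ["headache", "fever", "cough", "nausea", "rash", "insomnia"]
zh_words = ["头痛", "发烧", "咳嗽", "恶心", "皮疹", "失眠"]
en_vecs = rng.normal(size=(len(en_words), dim))

# Build the second space as an exact rotation of the first, so the two
# spaces are isometric by construction (real corpora only approximate this).
rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
zh_vecs = en_vecs @ rotation

# Use the first five word pairs as bilingual anchors; hold out the last pair.
n_anchors = 5
W = procrustes_align(zh_vecs[:n_anchors], en_vecs[:n_anchors])
zh_in_en_space = zh_vecs @ W

# Query with the held-out English word; the constant threshold is a
# stand-in for the dynamic threshold mentioned in the text.
hits = cosine_retrieve(en_vecs[5], zh_in_en_space, zh_words, threshold=0.9)
print(hits)
```

Because the toy spaces differ only by a rotation, the learned map recovers the held-out pair exactly. With embeddings trained on real corpora the spaces are only approximately isometric, so the top-ranked candidates would need a threshold or manual review before being accepted as CHV terms.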
