NMIXX: Domain-Adapted Neural Embeddings for Cross-Lingual eXploration of Finance

Hanwool Lee, Sara Yu, Yewon Hwang, Jonghyun Choi, Heejae Ahn, Sungbum Jung, Youngjae Yu

TL;DR

NMIXX addresses the gap between general-purpose embeddings and finance-specific, cross-lingual semantics in low-resource languages by introducing a domain-adapted embedding suite and a Korean financial STS benchmark, KorFinSTS. The training set consists of 18.8K high-confidence triplets built from a semantic-shift taxonomy (for hard negatives) and bilingual positives, and the models are trained with a temperature-scaled triplet loss to improve cross-lingual alignment, particularly for Korean-English finance. The multilingual bge-m3 variant gains +0.10 Spearman's rho on English FinSTS and +0.22 on KorFinSTS, at the cost of a modest drop in general-domain STS performance. Models with richer Korean token coverage adapt more effectively, which points to tokenizer coverage as a critical factor in low-resource bilingual settings. Both the NMIXX models and KorFinSTS are released publicly, and the work motivates exploring language-native domain adaptation in additional languages.
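To make the idea of a temperature-scaled triplet loss concrete, here is a minimal sketch of one common formulation: a softmax cross-entropy over the anchor-positive and anchor-negative similarities, each divided by a temperature. This is an illustration under stated assumptions, not the paper's equation, which is not reproduced in this summary. The use of cosine similarity, the function names, and the temperature value 0.05 are all assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def triplet_loss(anchor, positive, negative, temperature=0.05):
    """Assumed form: -log softmax of the positive over {positive, negative}.

    Algebraically this equals softplus((s_neg - s_pos) / temperature).
    The temperature value is a hypothetical default, not taken from the paper.
    """
    s_pos = cosine(anchor, positive) / temperature
    s_neg = cosine(anchor, negative) / temperature
    diff = s_neg - s_pos
    # Numerically stable softplus: log(1 + exp(diff))
    return max(diff, 0.0) + math.log1p(math.exp(-abs(diff)))

# Toy 2-d "embeddings": the positive is close to the anchor, the negative is orthogonal.
anchor = [1.0, 0.0]
positive = [0.9, 0.1]
negative = [0.0, 1.0]
loss_good = triplet_loss(anchor, positive, negative)
loss_bad = triplet_loss(anchor, negative, positive)  # roles swapped
```

In this form a smaller temperature sharpens the penalty: a hard negative that scores only slightly below the positive still produces a noticeable loss, which is the property that makes such losses useful with hard negatives like those derived from a semantic-shift taxonomy.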

Abstract

General-purpose sentence embedding models often struggle to capture specialized financial semantics, especially in low-resource languages like Korean, due to domain-specific jargon, temporal meaning shifts, and misaligned bilingual vocabularies. To address these gaps, we introduce NMIXX (Neural eMbeddings for Cross-lingual eXploration of Finance), a suite of cross-lingual embedding models fine-tuned with 18.8K high-confidence triplets that pair in-domain paraphrases, hard negatives derived from a semantic-shift typology, and exact Korean-English translations. Concurrently, we release KorFinSTS, a 1,921-pair Korean financial STS benchmark spanning news, disclosures, research reports, and regulations, designed to expose nuances that general benchmarks miss. When evaluated against seven open-license baselines, NMIXX's multilingual bge-m3 variant achieves Spearman's rho gains of +0.10 on English FinSTS and +0.22 on KorFinSTS, outperforming its pre-adaptation checkpoint and surpassing other models by the largest margin, while revealing a modest trade-off in general STS performance. Our analysis further shows that models with richer Korean token coverage adapt more effectively, underscoring the importance of tokenizer design in low-resource, cross-lingual settings. By making both models and the benchmark publicly available, we provide the community with robust tools for domain-adapted, multilingual representation learning in finance.
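The reported gains are in Spearman's rho, the rank correlation between a model's similarity scores and human-annotated gold scores on an STS benchmark. The sketch below shows how that metric is computed from two score lists, using only the standard library. The toy scores are invented for illustration and are not drawn from FinSTS or KorFinSTS; the helper names are hypothetical.

```python
import math

def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Invented toy data: gold similarity labels and model cosine similarities.
gold = [0.0, 2.5, 5.0, 1.0]
predicted = [0.1, 0.6, 0.9, 0.3]
rho = spearman(gold, predicted)
```

Because the metric depends only on ordering, a model is rewarded for ranking sentence pairs in the same order as the annotators, regardless of the absolute scale of its similarity scores.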

Paper Structure

This paper contains 31 sections, 1 equation, 1 figure, and 7 tables.

Figures (1)

  • Figure 1: Model Performance: Before vs After Domain-Adaptive Training