Pixel Sentence Representation Learning

Chenghao Xiao, Zhuoxu Huang, Danlu Chen, G Thomas Hudson, Yizhi Li, Haoran Duan, Chenghua Lin, Jie Fu, Jungong Han, Noura Al Moubayed

TL;DR

This work introduces Pixel Linguist, a pixel-based framework for learning sentence and document semantics by treating text as visual inputs and applying visually-grounded perturbations (typos, word-order changes). It combines unsupervised visual alignment with topical alignment and supervised reasoning alignment, enabling monolingual and cross-lingual learning via an iterative, multilingual transfer process. The approach yields competitive semantic textual similarity results and demonstrates zero-shot cross-lingual transfer with a surprising leapfrogging effect when multilingual signals are integrated. While pixel models currently lag behind traditional language models in pure sentence semantics, the framework offers a compelling, interpretable, and potentially more universal representation strategy, with strong implications for low-resource languages and multilingual understanding.

Abstract

Pretrained language models have long been known to be subpar at capturing sentence- and document-level semantics. Though heavily investigated, transferring perturbation-based methods from unsupervised visual representation learning to NLP remains an unsolved problem. This is largely due to the discreteness of the subword units produced by language-model tokenization, which prevents small perturbations of inputs from forming semantics-preserving positive pairs. In this work, we conceptualize the learning of sentence-level textual semantics as a visual representation learning process. Drawing from cognitive and linguistic sciences, we introduce an unsupervised visual sentence representation learning framework that employs visually-grounded text perturbation methods such as typos and word order shuffling, which resonate with human cognitive patterns and enable perturbations to text to be perceived as continuous. Our approach is further bolstered by large-scale unsupervised topical alignment training and natural language inference supervision, achieving performance in semantic textual similarity (STS) comparable to existing state-of-the-art NLP methods. Additionally, we unveil our method's inherent zero-shot cross-lingual transferability and a unique leapfrogging pattern across languages during iterative training. To our knowledge, this is the first representation learning method devoid of traditional language models for understanding sentence and document semantics, marking a stride closer to human-like textual comprehension. Our code is available at https://github.com/gowitheflow-1998/Pixel-Linguist
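
As a rough illustration of the perturbations the abstract names (typos and word order shuffling), the sketch below builds a perturbed copy of a sentence to pair with the original. It is not taken from the authors' repository: the function names `inject_typo`, `shuffle_words`, and `make_positive_pair`, and the parameters `max_shift` and `typo_rate`, are hypothetical, and the step of rendering the text to pixels is omitted entirely.

```python
import random


def inject_typo(word, rng):
    """Swap two adjacent interior characters (first and last stay fixed).

    Words shorter than four characters are returned unchanged. If the two
    swapped characters are identical, the word also comes back unchanged.
    """
    if len(word) < 4:
        return word
    i = rng.randrange(1, len(word) - 2)  # i and i + 1 are both interior
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def shuffle_words(sentence, rng, max_shift=2.0):
    """Locally shuffle word order by sorting on a jittered position.

    Each word's sort key is its index plus uniform noise in [0, max_shift],
    so words mostly trade places with near neighbours. max_shift is an
    assumed knob, not a value from the paper.
    """
    words = sentence.split()
    keys = [i + rng.uniform(0.0, max_shift) for i in range(len(words))]
    order = sorted(range(len(words)), key=lambda i: keys[i])
    return " ".join(words[i] for i in order)


def make_positive_pair(sentence, seed=None, typo_rate=0.2):
    """Return (original, perturbed) as a candidate positive pair.

    typo_rate is the assumed per-word probability of attempting a typo.
    """
    rng = random.Random(seed)
    words = shuffle_words(sentence, rng).split()
    words = [inject_typo(w, rng) if rng.random() < typo_rate else w for w in words]
    return sentence, " ".join(words)


if __name__ == "__main__":
    anchor, positive = make_positive_pair(
        "an extraordinary example of visually grounded perturbation", seed=0
    )
    print(anchor)
    print(positive)
```

In a pixel-based setup, both strings would then be rendered as images before encoding, so a one-character change alters only a small region of the input rather than the whole token sequence.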

Paper Structure (32 sections, 2 equations, 3 figures, 9 tables)

Figures (3)

  • Figure 1: Perceptual difference between tokenization-based language models and vision models, with the example of the word "extraordinary" with one single typo injected.
  • Figure 2: Training process.
  • Figure 3: Left 1-2: Embedding Distribution of the vanilla model and model after 3 rounds of iterative alignment. Left 3: English and OOD language performance in the final optimization of allNLI. After alignment, English and other languages present a bonding effect.