Classifying the Unknown: In-Context Learning for Open-Vocabulary Text and Symbol Recognition
Tom Simon, William Mocaer, Pierrick Tranouez, Clement Chatelain, Thierry Paquet
TL;DR
This work tackles the difficulty of recognizing unseen text and symbols with OCR systems by proposing Rosetta, a multimodal model that uses Multimodal In-Context Learning (MICL) to classify novel patterns without retraining. Central to Rosetta are a Context-Aware Tokenizer (CAT) and a Visual Prompt Generator (VPG), which together enable context-driven, open-vocabulary classification without linguistic priors. In carefully designed synthetic experiments, Rosetta generalizes robustly to unseen text, symbol patterns, and new alphabets, achieving a mean CER of 6.54% and a TER below 10% across diverse test sets, and performing more consistently than a comparable OCR-based model. The approach promises practical impact for multilingual and evolving-script document understanding by enabling open-vocabulary recognition without heavy retraining or reliance on a language model.
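For readers unfamiliar with the metric, the sketch below shows how a character error rate is commonly computed: the Levenshtein (edit) distance between the predicted and reference sequences, divided by the reference length. This is a generic illustration, not the authors' evaluation code; the function names `edit_distance` and `error_rate` are hypothetical, and applying the same function to token sequences to obtain a token-level rate is an assumption about how TER is defined.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    # prev[j] holds the distance between the first i-1 items of ref
    # and the first j items of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of r
                curr[j - 1] + 1,     # insertion of h
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence, hyp: Sequence) -> float:
    """Edit distance normalized by reference length (CER when items are characters)."""
    if len(ref) == 0:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One substituted character out of four reference characters.
    print(error_rate("abcd", "abxd"))
```

Because the functions accept any sequence, passing lists of class labels instead of strings yields the analogous rate over labels. Note that the rate can exceed 1.0 when the prediction is much longer than the reference.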
Abstract
We introduce Rosetta, a multimodal model that uses Multimodal In-Context Learning (MICL) to classify sequences of novel script patterns in documents from only a few examples, eliminating the need for explicit retraining. To enhance contextual learning, we design a dataset generation process that ensures varying degrees of contextual informativeness, improving the model's ability to exploit context across different scenarios. A key strength of our method is the Context-Aware Tokenizer (CAT), which enables open-vocabulary classification. It allows the model to classify text and symbol patterns across an unlimited range of classes, extending its capabilities beyond the alphabet of patterns seen during training. As a result, it unlocks applications such as the recognition of new alphabets and languages. Experiments on synthetic datasets demonstrate the potential of Rosetta to classify Out-Of-Distribution visual patterns and diverse alphabets and scripts, including but not limited to Chinese, Greek, Russian, French, Spanish, and Japanese.
