Classifying the Unknown: In-Context Learning for Open-Vocabulary Text and Symbol Recognition

Tom Simon, William Mocaer, Pierrick Tranouez, Clement Chatelain, Thierry Paquet

TL;DR

This work tackles the difficulty OCR systems have in recognizing unseen text and symbols by proposing Rosetta, a multimodal model that uses Multimodal In-Context Learning (MICL) to classify novel patterns without retraining. Central to Rosetta are the Context-Aware Tokenizer (CAT) and a Visual Prompt Generator (VPG), which together enable context-driven, open-vocabulary classification without linguistic priors. In synthetic experiments, Rosetta generalizes to unseen text, symbol patterns, and new alphabets, achieving a lower mean CER ($6.54\%$) and a TER below $10\%$ across diverse test sets, and outperforming a comparable OCR-based model in consistency. The approach could benefit multilingual and evolving-script document understanding by enabling open-vocabulary recognition without heavy retraining or reliance on a language model.

Abstract

We introduce Rosetta, a multimodal model that uses Multimodal In-Context Learning (MICL) to classify sequences of novel script patterns in documents from minimal examples, thus eliminating the need for explicit retraining. To enhance contextual learning, we designed a dataset generation process that ensures varying degrees of contextual informativeness, improving the model's ability to exploit context across different scenarios. A key strength of our method is the use of a Context-Aware Tokenizer (CAT), which enables open-vocabulary classification. This allows the model to classify text and symbol patterns across an unlimited range of classes, extending its classification capabilities beyond its training alphabet of patterns. As a result, it unlocks applications such as the recognition of new alphabets and languages. Experiments on synthetic datasets demonstrate the potential of Rosetta to classify Out-Of-Distribution visual patterns and diverse sets of alphabets and scripts, including but not limited to Chinese, Greek, Russian, French, Spanish, and Japanese.

Paper Structure

This paper contains 25 sections, 14 figures, 1 table.

Figures (14)

  • Figure 1: To classify sequences of unknown symbols in a query image $X$, Rosetta leverages a context image $X_c$ containing similar symbols along with their associated labels in $Y_c$. At each step during decoding, Rosetta identifies matching symbols in the context image and assigns the corresponding label provided in the textual context, highlighted in red in the figure.
  • Figure 2: To classify sequences of unknown symbols in a query image $X$, Rosetta leverages a context image $X_c$ containing similar symbols along with their associated labels in $T_c$. $T_c$ represents the tokenized encoding of the symbols in $X_c$, preserving their order of appearance. At each step, Rosetta identifies matching symbols in the context image and assigns the corresponding label provided in the textual context, highlighted in red in the figure.
  • Figure 3: Illustration of context and query text encoding/decoding using the Context-Aware Tokenizer (CAT). (a) CAT encodes a context text and stores all character-token mappings in a dictionary $\mathcal{D}$. This dictionary is then used by CAT to (b) encode new text or to (c) decode predictions from the model. The '*' character denotes predictions corresponding to the $\langle ooc \rangle$ (out-of-context) token.
  • Figure 4: Illustration of the Rosetta architecture, structured around three core components: (1) a Context-Aware Tokenizer (CAT) that encodes the context text $Y_c$ into a sequence of tokens $T_c$ and decodes the predicted sequence of tokens $T$ back into a sequence of characters $Y$ using a dictionary $\mathcal{D}$ of character-token associations; (2) a Visual Prompt Generator (VPG) that converts the context and query images ($X_c$, $X$) into token sequences interpretable by the transformer decoder; and (3) a transformer decoder that processes the multimodal data from both the CAT and the VPG to predict $T$, a sequence of tokens corresponding to the symbols in the query image.
  • Figure 5: Samples of context and query images from the training set, showing variations in the coverage rate $\alpha$ and the number of symbols $S_{\text{add}}$. The red symbols in $X_c$ represent symbols that belong to $S_{\text{add}}$. The red color is used for illustration purposes only.
  • ...and 9 more figures
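
The Context-Aware Tokenizer described in the captions for Figures 3 and 4 can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class name `ContextAwareTokenizer`, the choice to assign token ids by order of first appearance in the context, and the use of id `0` for the $\langle ooc \rangle$ token are all assumptions made for illustration. Only the overall behavior (build a dictionary $\mathcal{D}$ from the context text, reuse it to encode new text and to decode predictions, and render out-of-context predictions as '*') comes from the captions.

```python
# Hypothetical sketch of a context-aware tokenizer; names and id scheme are assumptions.
OOC_TOKEN = 0   # assumed id reserved for the <ooc> (out-of-context) token
OOC_CHAR = "*"  # character shown for <ooc> predictions, as in the Figure 3 caption


class ContextAwareTokenizer:
    def __init__(self):
        self.char_to_token = {}  # the dictionary D: character -> token id
        self.token_to_char = {}  # inverse mapping, used when decoding

    def encode_context(self, context_text):
        """Rebuild D from the context text and return its token sequence T_c."""
        self.char_to_token = {}
        self.token_to_char = {}
        tokens = []
        for ch in context_text:
            if ch not in self.char_to_token:
                # Assumption: ids follow order of first appearance, starting at 1.
                token = len(self.char_to_token) + 1
                self.char_to_token[ch] = token
                self.token_to_char[token] = ch
            tokens.append(self.char_to_token[ch])
        return tokens

    def encode(self, text):
        """Encode new text with D; characters absent from the context map to <ooc>."""
        return [self.char_to_token.get(ch, OOC_TOKEN) for ch in text]

    def decode(self, tokens):
        """Decode predicted tokens with D; <ooc> or unknown ids become '*'."""
        return "".join(self.token_to_char.get(t, OOC_CHAR) for t in tokens)


cat = ContextAwareTokenizer()
context_tokens = cat.encode_context("abca")  # builds D = {a: 1, b: 2, c: 3}
query_tokens = cat.encode("cab?")            # '?' never appeared in the context
decoded = cat.decode(query_tokens)
print(context_tokens, query_tokens, decoded)
```

Because the dictionary is rebuilt for every context, the token ids carry no fixed meaning across samples. Under this reading, a model trained on such tokens has to rely on the context rather than on a memorized alphabet, which matches the open-vocabulary behavior the abstract describes.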