Unsupervised Document and Template Clustering using Multimodal Embeddings
Phillipe R. Sampaio, Helene Maxcici
TL;DR
This paper tackles the unsupervised organization of documents into categories and templates by evaluating a model-agnostic pipeline that projects last-layer multimodal encoder states into fixed-size document vectors for clustering. It systematically compares eight encoders (text-only, layout-aware, vision-only, and vision-language) across five heterogeneous corpora using four clustering algorithms ($k$-Means, DBSCAN, HDBSCAN + $k$-NN, and BIRCH), tuned with an oracle-free protocol. The results show that vision-centric encoders excel at template discovery on clean pages but are brittle under covariate shift, whereas text signals provide complementary robustness; fused multimodal embeddings (e.g., LayoutLMv3, Donut, Gemma3, InternVL3) offer the best robustness–accuracy trade-off. The work documents its reproducible tuning protocol and evaluation settings, offering practical guidance for scalable intelligent document processing and future research in unsupervised document clustering.
Abstract
We study unsupervised clustering of documents at both the category and template levels using frozen multimodal encoders and classical clustering algorithms. We systematize a model-agnostic pipeline that (i) projects heterogeneous last-layer states from text-layout-vision encoders into token-type-aware document vectors and (ii) performs clustering with centroid- or density-based methods, including an HDBSCAN + $k$-NN assignment step that assigns otherwise unlabeled (noise) points to clusters. We evaluate eight encoders (text-only, layout-aware, vision-only, and vision-language) with $k$-Means, DBSCAN, HDBSCAN + $k$-NN, and BIRCH on five corpora spanning clean synthetic invoices, their heavily degraded print-and-scan counterparts, scanned receipts, and real identity and certificate documents. The study reveals modality-specific failure modes and a robustness–accuracy trade-off: vision features nearly solve template discovery on clean pages, text dominates under covariate shift, and fused encoders offer the best balance. We detail a reproducible, oracle-free tuning protocol and the curated evaluation settings to guide future work on unsupervised document organization.
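The two pipeline stages described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `pool_by_token_type` and `assign_noise_knn`, the choice of per-type mean pooling with concatenation and L2 normalization, the Euclidean metric, the majority vote, and the default `k=3` are all assumptions made for illustration. The sketch also assumes that a density-based clusterer has already produced labels in which `-1` marks unassigned points, which is the convention HDBSCAN uses for noise.

```python
import numpy as np


def pool_by_token_type(states, token_types, n_types):
    """Build one fixed-size vector from last-layer token states.

    Assumed scheme (hypothetical): mean-pool the states of each token type
    (e.g. text vs. visual tokens), concatenate the per-type means, and
    L2-normalize. A type with no tokens contributes a zero block.
    states: (n_tokens, dim) floats; token_types: (n_tokens,) ints in [0, n_types).
    """
    dim = states.shape[1]
    parts = []
    for t in range(n_types):
        mask = token_types == t
        parts.append(states[mask].mean(axis=0) if mask.any() else np.zeros(dim))
    vec = np.concatenate(parts)
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def assign_noise_knn(X, labels, k=3):
    """Give each noise point (label -1) the majority label of its k nearest
    already-clustered neighbours (Euclidean distance). Returns a new array."""
    labels = np.asarray(labels).copy()
    noise = np.flatnonzero(labels == -1)
    clustered = np.flatnonzero(labels != -1)
    if noise.size == 0 or clustered.size == 0:
        return labels  # nothing to reassign, or nothing to vote with
    k = min(k, clustered.size)
    # Distances from every noise point to every clustered point.
    diffs = X[noise][:, None, :] - X[clustered][None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    for row, idx in enumerate(noise):
        votes = labels[clustered[nearest[row]]]
        values, counts = np.unique(votes, return_counts=True)
        labels[idx] = values[np.argmax(counts)]
    return labels
```

In this sketch only the originally clustered points vote, so the result does not depend on the order in which noise points are processed. The dense distance matrix is adequate for small collections; a nearest-neighbour index would be the more natural choice at scale.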
