Learning Page Order in Shuffled WOO Releases
Efe Kahraman, Giulio Tosato
TL;DR
This work recovers the chronological page order of shuffled Dutch WOO releases, a task complicated by heterogeneous content and scarce per-page metadata. It evaluates five permutation-family approaches, including pointer networks, seq2seq transformers, and pairwise ranking transformers, under universal, curriculum, and length-specialized training regimes. A length-specialized, non-autoregressive pairwise ranking transformer performs best on documents of up to 15 pages, with Kendall's $\tau$ reaching $0.953$ on 2–5 pages and $0.722$ on 11–15 pages. Seq2seq baselines fail to generalize to long documents ($\tau$ drops to $0.014$ on 21–25 pages), and curriculum learning underperforms direct training. These findings highlight the importance of length-aware architectures and training strategies for ordering pages in heterogeneous, multi-type document collections, and point to future work on multimodal embeddings and length-extrapolating encodings.
Abstract
We investigate document page ordering on 5,461 shuffled WOO documents (Dutch freedom of information releases) using page embeddings. These documents are heterogeneous collections of emails, legal texts, spreadsheets, and similar material compiled into single PDFs, where semantic ordering signals are unreliable. We compare five methods, including pointer networks, seq2seq transformers, and specialized pairwise ranking models. The best-performing approach successfully reorders documents of up to 15 pages, with Kendall's tau ranging from 0.95 for short documents (2-5 pages) to 0.72 for 15-page documents. We observe two unexpected failures: seq2seq transformers fail to generalize to long documents (Kendall's tau drops from 0.918 on 2-5 pages to 0.014 on 21-25 pages), and curriculum learning underperforms direct training by 39% on long documents. Ablation studies suggest that learned positional encodings are one contributing factor to the seq2seq failure, but the degradation persists across all encoding variants, indicating multiple interacting causes. Attention pattern analysis reveals that short and long documents require fundamentally different ordering strategies, which explains why curriculum learning fails. Model specialization achieves substantial improvements on longer documents (+0.21 tau).
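For readers unfamiliar with the metric, the following is a minimal sketch of how Kendall's tau can be computed for a single predicted page order. It is an illustration only, not the authors' evaluation code: the function name `kendall_tau` and the integer page IDs are assumptions, and the sketch assumes each page appears exactly once in both orderings (no ties). A library routine such as `scipy.stats.kendalltau` could be used instead.

```python
from itertools import combinations


def kendall_tau(predicted_order, true_order):
    """Kendall's tau between two orderings of the same page IDs (no ties).

    Hypothetical helper for illustration; returns 1.0 for a perfect
    ordering and -1.0 for a fully reversed one.
    """
    n = len(true_order)
    if n < 2:
        raise ValueError("need at least two pages to compare orderings")
    if len(predicted_order) != n or set(predicted_order) != set(true_order):
        raise ValueError("both orderings must contain the same page IDs")

    # Map each page ID to its position in the true order.
    true_pos = {page: i for i, page in enumerate(true_order)}
    # True positions of the pages, read in predicted order.
    ranks = [true_pos[page] for page in predicted_order]

    # A pair is concordant if the predicted order places the two pages
    # in the same relative order as the true order.
    total_pairs = n * (n - 1) // 2
    concordant = sum(1 for a, b in combinations(ranks, 2) if a < b)
    discordant = total_pairs - concordant
    return (concordant - discordant) / total_pairs


if __name__ == "__main__":
    true_order = [0, 1, 2, 3]
    print(kendall_tau([0, 1, 2, 3], true_order))  # perfect order
    print(kendall_tau([1, 0, 2, 3], true_order))  # one adjacent swap
    print(kendall_tau([3, 2, 1, 0], true_order))  # fully reversed
```

Under this definition a score near 0, such as the 0.014 reported for seq2seq models on 21-25 pages, means the predicted order agrees with the true order on roughly half of all page pairs, which is what a random permutation would achieve on average.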
