ProtoSnap: Prototype Alignment for Cuneiform Signs

Rachel Mikulinsky, Morris Alper, Shai Gordin, Enrique Jiménez, Yoram Cohen, Hadar Averbuch-Elor

TL;DR

ProtoSnap recovers the fine-grained internal configuration of cuneiform signs from photographs by aligning skeleton-based prototypes to real signs using unsupervised deep-feature matching. It introduces a semantically aware 4D similarity volume derived from diffusion features, global alignment via best-buddies correspondences, and local stroke refinement to snap prototype skeletons to sign images. The method is evaluated on a newly annotated benchmark and demonstrates substantial improvements over generic correspondence methods. Its alignments also enable structurally conditioned data generation that improves OCR, especially for rare signs. The work also provides a roadmap for scalable paleographic analysis and potential extensions to multi-sign lines and other ancient scripts.

Abstract

The cuneiform writing system served as the medium for transmitting knowledge in the ancient Near East for a period of over three thousand years. Cuneiform signs have a complex internal structure which is the subject of expert paleographic analysis, as variations in sign shapes bear witness to historical developments and transmission of writing and culture over time. However, prior automated techniques mostly treat sign types as categorical and do not explicitly model their highly varied internal configurations. In this work, we present an unsupervised approach for recovering the fine-grained internal configuration of cuneiform signs by leveraging powerful generative models and the appearance and structure of prototype font images as priors. Our approach, ProtoSnap, enforces structural consistency on matches found with deep image features to estimate the diverse configurations of cuneiform characters, snapping a skeleton-based template to photographed cuneiform signs. We provide a new benchmark of expert annotations and evaluate our method on this task. Our evaluation shows that our approach succeeds in aligning prototype skeletons to a wide variety of cuneiform signs. Moreover, we show that conditioning on structures produced by our method allows for generating synthetic data with correct structural configurations, significantly boosting the performance of cuneiform sign recognition beyond existing techniques, in particular over rare signs. Our code, data, and trained models are available at the project page: https://tau-vailab.github.io/ProtoSnap/

Paper Structure

This paper contains 26 sections, 4 equations, 11 figures, and 5 tables.

Figures (11)

  • Figure 1: ProtoSnap applied to a full tablet by cropping each sign using existing bounding boxes (such as those depicted in unique colors), and matching prototypes of the signs (illustrated in the center). Our technique "snaps" the skeletons of the prototypes to the target images depicting real cuneiform signs. These aligned results can be used to produce an automatic digital hand copy (right). We also show that our approach can be used to boost performance of cuneiform sign recognition.
  • Figure 2: Method Overview. Given a prototype image with annotated skeleton and a target image of a real cuneiform sign, ProtoSnap first extracts best-buddy correspondences from deep diffusion features (extracted with our fine-tuned SD model), globally aligning the target image to the skeleton of the prototype. Our method then "snaps" the individual strokes into place with a local refinement stage by optimizing a per-stroke transform.
  • Figure 3: DIFT-Based Best-Buddies Correspondences. Noised images are passed through our fine-tuned SD denoising diffusion model to extract deep Diffusion Features (DIFT), which are used to calculate the 4D similarity volume $S$. For each region $(i,j)$ in the target image, we examine the 2D slice $S[i, j, \cdot, \cdot]$ and determine the indices $(k, \ell)$ that maximize its value. Symmetrically, for each region $(k, \ell)$ in the prototype, we find the corresponding region in the target by maximizing the 2D slice $S[\cdot, \cdot, k, \ell]$. If the two regions select each other (that is, each is the other's maximizer), they are identified as best buddies.
  • Figure 4: Local Refinement via Skeleton-Based Optimization. To adjust the positioning of individual strokes in a sign, our global alignment is followed by a local refinement stage which learns transformations for each stroke. The loss function encourages positioning on salient regions ($\mathcal{L}_{sal}$) while semantically matching the corresponding regions in the prototype image, as measured by feature similarity ($\mathcal{L}_{sim}$). For each stroke (exemplified by the stroke in red above), these objectives are calculated along points sampled from the skeleton (red dots above). The loss also includes a regularization term ($\mathcal{L}_{reg}$) preventing excessive deviation from the global transformation.
  • Figure 5: Qualitative alignment results, aligning the prototypes (first row) to target cuneiform images (second row). We demonstrate the results after performing global alignment (third row), and the final result after local refinement (fourth row). As illustrated above, the global alignment stage provides a coarse placement of the prototype template, while the refinement stage allows each stroke to slightly diverge from the original prototype, resulting in more accurate alignments.
  • ...and 6 more figures
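
The best-buddies test described in the Figure 3 caption can be illustrated with a small, self-contained sketch. This is not the authors' implementation: the function names (`cosine`, `similarity_volume`, `best_buddies`) are hypothetical, cosine similarity is an assumed choice of similarity measure, and the toy feature grids stand in for the DIFT features that the paper extracts with its fine-tuned diffusion model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two feature vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def similarity_volume(target, proto):
    """Build the 4D volume S[i][j][k][l] from two grids of feature vectors.

    target[i][j] and proto[k][l] are feature vectors for grid regions.
    """
    return [
        [[[cosine(t, p) for p in prow] for prow in proto] for t in trow]
        for trow in target
    ]


def best_buddies(S):
    """Return pairs ((i, j), (k, l)) that are mutual maximizers in S."""
    H, W = len(S), len(S[0])
    K, L = len(S[0][0]), len(S[0][0][0])

    def best_proto(i, j):
        # Maximize the 2D slice S[i, j, ., .] over prototype regions.
        cells = ((k, l) for k in range(K) for l in range(L))
        return max(cells, key=lambda kl: S[i][j][kl[0]][kl[1]])

    def best_target(k, l):
        # Maximize the 2D slice S[., ., k, l] over target regions.
        cells = ((i, j) for i in range(H) for j in range(W))
        return max(cells, key=lambda ij: S[ij[0]][ij[1]][k][l])

    pairs = []
    for i in range(H):
        for j in range(W):
            k, l = best_proto(i, j)
            if best_target(k, l) == (i, j):  # mutual selection
                pairs.append(((i, j), (k, l)))
    return pairs


# Toy example: the prototype grid is the transpose of the target grid.
target = [[[1, 0, 0, 0], [0, 1, 0, 0]],
          [[0, 0, 1, 0], [0, 0, 0, 1]]]
proto = [[target[l][k] for l in range(2)] for k in range(2)]
pairs = best_buddies(similarity_volume(target, proto))
```

In the toy example every target region finds its transposed counterpart in the prototype, so all four regions form best-buddy pairs. When two target regions both prefer the same prototype region, only the one that the prototype region prefers in return is kept, which is what makes the mutual test stricter than a one-directional nearest-neighbor match.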