Style Classification of Rabbinic Literature for Detection of Lost Midrash Tanhuma Material

Shlomo Tannor, Nachum Dershowitz, Moshe Lavee

TL;DR

The paper addresses two related problems in complex rabbinic anthologies: determining the origin of a passage and recovering lost material. It proposes a style-classification pipeline that combines six corpora from Sefaria, several modeling approaches (a baseline n-gram logistic regression, AlephBERT, BEREL, and a morphological model), and a text-reuse module, RWFS. The method is applied to detect lost Tanḥuma passages within Yalkut Shimoni, which enables recovery of Tanḥuma-Yelammedenu material and connects text-reuse detection with stylometry. BEREL achieves the highest validation accuracy ($0.922$), while the more robust baseline model is used for reliable inference. The integrated approach yields high precision and recall for candidate passages, with practical precision of around $80\%$ at a suitable decision threshold. The work contributes an end-to-end framework and public tools to support digital scholarship in Jewish studies, with potential extension to Geniza manuscripts and broader rabbinic corpora.
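
The trade-off behind choosing "a suitable threshold" can be illustrated with a minimal sketch. It assumes the classifier assigns each candidate passage a score between 0 and 1 and that passages scoring at or above the threshold are flagged as Tanḥuma candidates. The function name `precision_recall_at`, the scores, and the labels are all hypothetical and are not taken from the paper.

```python
def precision_recall_at(scores, labels, threshold):
    """Return (precision, recall) when passages with score >= threshold are flagged.

    scores: classifier confidences in [0, 1] (hypothetical values here).
    labels: 1 if the passage truly belongs to the target class, else 0.
    """
    flagged = [label for score, label in zip(scores, labels) if score >= threshold]
    true_positives = sum(flagged)
    total_positives = sum(labels)
    # Convention: precision is 1.0 when nothing is flagged (no false positives).
    precision = true_positives / len(flagged) if flagged else 1.0
    recall = true_positives / total_positives if total_positives else 0.0
    return precision, recall


# Made-up toy data for illustration only; not results from the paper.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

for threshold in (0.25, 0.50, 0.75):
    p, r = precision_recall_at(scores, labels, threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

Sweeping the threshold in this way shows how many true passages are missed as the flagging criterion tightens, which is the kind of curve a practitioner would inspect before fixing an operating point.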

Abstract

Midrash collections are complex rabbinic works that consist of text in multiple languages, which evolved through long processes of unstable oral and written transmission. Determining the origin of a given passage in such a compilation is not always straightforward and is often a matter of dispute among scholars, yet it is essential for scholars' understanding of the passage and its relationship to other texts in the rabbinic corpus. To help solve this problem, we propose a system for classification of rabbinic literature based on its style, leveraging recent advances in natural language processing for Hebrew texts. Additionally, we demonstrate how this method can be applied to uncover lost material from a specific midrash genre, Tanḥuma-Yelammedenu, that has been preserved in later anthologies.

Paper Structure (20 sections, 6 figures, 1 table)

Figures (6)

  • Figure 1: The text-reuse engine, RWFS, shows how a medieval midrash paragraph is reusing early material from various sources including Midrash Tanḥuma.
  • Figure 2: From left to right: (1) class frequencies for passages based on text reuse detection in Yalkut Shimoni; (2) predicted class frequencies for passages with high text reuse score; (3) predicted frequencies for passages with low reuse score.
  • Figure 3: Confusion matrix for baseline model, normalized by row.
  • Figure 4: Precision and recall as function of the decision threshold for lost Tanḥuma material.
  • Figure 5: An example of our application's output on a typical Midrash Tanḥuma text.
  • ...and 1 more figure