DEPTH: Discourse Education through Pre-Training Hierarchically

Zachary Bamberger; Ofek Glick; Chaim Baskin; Yonatan Belinkov

TL;DR

DEPTH tackles discourse-level understanding in language models by introducing hierarchical sentence representations into a pre-training objective that combines span-masking with sentence un-shuffling. Built on an extended T5 encoder-decoder framework, it augments the tokenizer with sentence-level tokens and enforces a hierarchical attention scheme, optimized via the joint loss $L_{DEPTH}$. Across C4 pre-training, GLUE, DiscoEval, and NI benchmarks, DEPTH demonstrates faster convergence and stronger discourse performance than a baseline T5, including when starting from a pre-trained checkpoint. These results suggest discourse-oriented pre-training can yield tangible gains in both generation and understanding tasks, with potential for integration into retrieval-augmented systems and larger-scale models.

Abstract

Language Models (LMs) struggle with linguistic understanding at the discourse level, even though discourse patterns such as coherence, cohesion, and narrative flow are prevalent in their pre-training data. To improve the discourse capabilities of LMs already at the pre-training stage, we introduce DEPTH, an encoder-decoder model that learns latent representations for sentences using a discourse-oriented pre-training objective. DEPTH combines hierarchical sentence representations with two objectives: (1) Sentence Un-Shuffling, and (2) Span-Corruption. Our approach trains the model to represent both sub-word-level and sentence-level dependencies over a pre-training corpus. When trained either from scratch or continuing from a pre-trained T5 checkpoint, DEPTH learns semantic and discourse-level representations faster than T5, outperforming it in span-corruption loss despite the additional sentence un-shuffling objective. Evaluations on the GLUE, DiscoEval, and NI benchmarks demonstrate DEPTH's ability to quickly learn diverse downstream tasks that require syntactic, semantic, and discourse capabilities. Our approach extends the discourse capabilities of T5 while minimally impacting other natural language understanding (NLU) capabilities in the resulting LM. We share our codebase for reproducibility: https://github.com/zbambergerNLP/depth.git.
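
The two corruption steps named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation (see the linked codebase for that); it only shows the idea of tagging sentences, optionally shuffling them, and masking spans. The function name `depth_corrupt`, the parameters `mask_start_prob` and `span_len`, the random assignment of `<SENT_i>` labels, the T5-style `<extra_id_k>` sentinels, and the choice to return the two targets separately are all assumptions made for illustration. Only the `<SENT_i>` and `<EOSEN>` tokens and the 0.5 shuffle probability come from the Figure 1 caption.

```python
import random


def depth_corrupt(sentences, rng, shuffle_prob=0.5, mask_start_prob=0.1, span_len=3):
    """Toy DEPTH-style corruption over whitespace-tokenized sentences (illustrative only)."""
    # 1. Tag each sentence with an arbitrary <SENT_i> label (assumed to be random).
    labels = rng.sample(range(len(sentences)), len(sentences))
    tagged = [(f"<SENT_{i}>", s.split()) for i, s in zip(labels, sentences)]
    # Un-shuffling target: the sentence tokens in their original order.
    sentence_target = [tag for tag, _ in tagged]

    # 2. Shuffle the sentence order with probability shuffle_prob.
    shuffled = rng.random() < shuffle_prob
    if shuffled:
        rng.shuffle(tagged)

    # 3. Span masking over words only; sentence tokens are never masked here.
    encoder_input, span_target, sentinel = [], [], 0
    for tag, words in tagged:
        encoder_input.append(tag)
        pos = 0
        while pos < len(words):
            if rng.random() < mask_start_prob:
                span = words[pos:pos + span_len]
                mark = f"<extra_id_{sentinel}>"  # assumed T5-style sentinel
                encoder_input.append(mark)
                span_target.extend([mark] + span)
                sentinel += 1
                pos += len(span)
            else:
                encoder_input.append(words[pos])
                pos += 1
        encoder_input.append("<EOSEN>")

    return {
        "encoder_input": encoder_input,
        "span_target": span_target,
        "sentence_target": sentence_target,
        "shuffled": shuffled,
    }


if __name__ == "__main__":
    doc = ["The cat sat on the mat .", "It purred loudly .", "Then it fell asleep ."]
    example = depth_corrupt(doc, random.Random(0))
    print(" ".join(example["encoder_input"]))
    print(" ".join(example["sentence_target"]))
```

In this sketch a model would have to recover both the masked words and the original sentence order from the same corrupted input, which is the intuition behind combining the two objectives in one pre-training step.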

Paper Structure (34 sections, 3 equations, 10 figures, 10 tables)

Introduction
Method
Tokenization
Corruption
Span-Masking:
Sentence Un-Shuffling:
Attention masks
Loss Formulation
Experimental setup
Fine-tuning experiments
Results
C4 pre-training
GLUE fine-tuning
DiscoEval fine-tuning
NI fine-tuning
...and 19 more sections

Figures (10)

Figure 1: DEPTH tokenization and corruption process. Given an input document, DEPTH introduces sentence tokens (<SENT_i> and <EOSEN>), applies span masking, and shuffles sentences with probability 0.5. Attention patterns are shown with arrows (dotted for cross-attention, solid for self-attention).
Figure 2: From Scratch Pre-Training loss (validation) for both T5 and DEPTH.
Figure 3: Continuous Pre-Training loss (validation) for both T5 and DEPTH.
Figure 4: GLUE results for FS and CPT models. Top row: From Scratch (FS), Bottom row: From Pretrained (CPT).
Figure 5: DiscoEval results for DEPTH and T5 models. Top row: From Scratch (FS), Bottom row: From Pretrained (CPT).
...and 5 more figures
