DEPTH: Discourse Education through Pre-Training Hierarchically
Zachary Bamberger, Ofek Glick, Chaim Baskin, Yonatan Belinkov
TL;DR
DEPTH tackles discourse-level understanding in language models by introducing hierarchical sentence representations into a pre-training objective that combines span-masking with sentence un-shuffling. Built on an extended T5 encoder-decoder framework, it augments the tokenizer with sentence-level tokens and enforces a hierarchical attention scheme, optimized via the joint loss $L_{DEPTH}$. Across C4 pre-training, GLUE, DiscoEval, and NI benchmarks, DEPTH demonstrates faster convergence and stronger discourse performance than a baseline T5, including when starting from a pre-trained checkpoint. These results suggest discourse-oriented pre-training can yield tangible gains in both generation and understanding tasks, with potential for integration into retrieval-augmented systems and larger-scale models.
Abstract
Language Models (LMs) struggle with linguistic understanding at the discourse level, even though discourse patterns such as coherence, cohesion, and narrative flow are prevalent in their pre-training data. To improve the discourse capabilities of LMs as early as the pre-training stage, we introduce DEPTH, an encoder-decoder model that learns latent representations for sentences using a discourse-oriented pre-training objective. DEPTH combines hierarchical sentence representations with two objectives: (1) Sentence Un-Shuffling, and (2) Span-Corruption. Our approach trains the model to represent both sub-word-level and sentence-level dependencies over a pre-training corpus. When trained either from scratch or continuing from a pre-trained T5 checkpoint, DEPTH learns semantic and discourse-level representations faster than T5, outperforming it in span-corruption loss despite the additional sentence-un-shuffling objective. Evaluations on the GLUE, DiscoEval, and NI benchmarks demonstrate DEPTH's ability to quickly learn diverse downstream tasks that require syntactic, semantic, and discourse capabilities. Our approach extends the discourse capabilities of T5, while minimally impacting other natural language understanding (NLU) capabilities in the resulting LM. We share our codebase for reproducibility: https://github.com/zbambergerNLP/depth.git.
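
As a rough illustration of how sentence un-shuffling and span-corruption could be combined in a single training example, the toy sketch below builds an input/target pair from a short list of sentences. It is not the authors' implementation. The `<sent_j>` sentence-token naming, the choice to mask exactly one span per sentence, and the target layout are all assumptions made for illustration; only the `<extra_id_N>` sentinel format follows T5's convention.

```python
import random


def make_depth_style_example(sentences, span_len=2, seed=0):
    """Toy input/target pair mixing sentence shuffling with span masking.

    Assumptions (hypothetical, for illustration only):
      * the sentence shown at shuffled position j is prefixed with <sent_j>;
      * one contiguous span per sentence is replaced by a T5-style sentinel;
      * the target lists sentences in their ORIGINAL order, each as
        [sentence token, sentinel, masked words].
    """
    rng = random.Random(seed)

    # order[j] is the original index of the sentence shown at position j.
    order = list(range(len(sentences)))
    rng.shuffle(order)

    input_tokens = []
    target_by_original = {}
    for j, orig_idx in enumerate(order):
        words = sentences[orig_idx].split()
        sent_tok = f"<sent_{j}>"
        sentinel = f"<extra_id_{j}>"

        # Pick where the masked span starts (clamped for short sentences).
        start = rng.randrange(max(1, len(words) - span_len + 1))
        span = words[start:start + span_len]

        # Encoder input: sentence token, then the sentence with the span masked.
        input_tokens += [sent_tok] + words[:start] + [sentinel] + words[start + span_len:]
        # Target fragment: which sentence this is, plus the words it lost.
        target_by_original[orig_idx] = [sent_tok, sentinel] + span

    # Emit target fragments in original order, so that reading off the
    # sentence tokens recovers the un-shuffled sentence sequence.
    target_tokens = []
    for orig_idx in range(len(sentences)):
        target_tokens += target_by_original[orig_idx]

    return input_tokens, target_tokens, order


if __name__ == "__main__":
    docs = [
        "the cat sat on the mat",
        "it then fell asleep",
        "later it woke up hungry",
    ]
    inp, tgt, order = make_depth_style_example(docs)
    print("input :", " ".join(inp))
    print("target:", " ".join(tgt))
```

In this sketch the decoder has to do both jobs at once: the order of the sentence tokens in the target encodes the original sentence order, while the words after each sentinel reconstruct the masked span.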
