Formalising lexical and syntactic diversity for data sampling in French

Louis Estève; Manon Scholivet; Agata Savary

TL;DR

This work applies ecology-inspired, entropy-based diversity measures to the sampling of French text, formalising lexical and syntactic diversity via $H(\Delta)$ and $H_\alpha$ to capture both variety and balance. It introduces a tractable greedy heuristic that augments a large base French corpus with a diverse subset to boost lexical diversity, raising entropy from $H$ to $H_{diverse}$. Empirical evaluation shows that the lexical-diversity heuristic significantly outperforms random sampling, but also that lexical diversity is not a reliable proxy for syntactic diversity: the correlation between the two varies across datasets and across $\alpha$ values. The results highlight both the potential and the limitations of using lexical diversity to guide syntactic coverage, and point to future work on aligning sampling more closely with syntactic diversity while keeping annotation costs manageable.

Abstract

Diversity is an important property of datasets, and sampling data for diversity is useful in dataset creation. Finding the optimally diverse sample is expensive; we therefore present a heuristic that significantly increases diversity relative to random sampling. We also explore whether different kinds of diversity -- lexical and syntactic -- correlate, with the aim of sampling for expensive syntactic diversity through inexpensive lexical diversity. We find that the correlations fluctuate across datasets and across versions of the diversity measures. This shows that an arbitrarily chosen measure may fall short of capturing diversity-related properties of datasets.
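To make the idea of entropy-driven sampling concrete, here is a minimal sketch in Python. It is not the authors' heuristic: it assumes that $H_\alpha$ denotes Rényi entropy of order $\alpha$ (with Shannon entropy as the $\alpha = 1$ case), that sentences are already tokenised into lists of word forms, and that each greedy step simply rescores every remaining candidate. The names `renyi_entropy` and `greedy_diverse_sample` are hypothetical.

```python
import math
from collections import Counter


def renyi_entropy(counts, alpha=1.0):
    """Renyi entropy (natural log) of a frequency table.

    Assumption: this is what H_alpha stands for; alpha == 1 gives Shannon entropy.
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0
    probs = [c / total for c in counts.values() if c > 0]
    if alpha == 1.0:
        return -sum(p * math.log(p) for p in probs)
    return math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)


def greedy_diverse_sample(base, candidates, k, alpha=1.0):
    """Pick up to k candidate sentences, one at a time.

    At each step, choose the sentence whose addition to the current word
    counts (base corpus plus sentences chosen so far) gives the highest entropy.
    """
    counts = Counter(tok for sent in base for tok in sent)
    remaining = list(candidates)
    chosen = []
    for _ in range(min(k, len(remaining))):
        best_i = max(
            range(len(remaining)),
            key=lambda i: renyi_entropy(counts + Counter(remaining[i]), alpha),
        )
        best = remaining.pop(best_i)
        counts.update(best)  # the chosen sentence now belongs to the corpus
        chosen.append(best)
    return chosen


# Toy data, invented for illustration only.
base = [["le", "chat", "dort"]]
candidates = [["le", "chat", "dort"], ["un", "chien", "mange"]]
picked = greedy_diverse_sample(base, candidates, k=1)
```

On this toy input the sketch prefers the sentence that introduces new word forms over the one that repeats the base corpus. Rescoring every candidate at every step costs one entropy computation per remaining candidate, so a practical implementation on a large corpus would likely need incremental updates or candidate pruning.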

Paper Structure (11 sections, 2 equations, 2 figures, 1 table, 1 algorithm)


Introduction
Diversity measures
Source data
Diversity-driven data selection
Diversity evaluation
Q1
Q2
Conclusions and future work
Limitations
Ethical statement
Appendix

Figures (2)

Figure 1: Two toy datasets with sample syntactic categories (on the right) and elements (inside the sentences).
Figure 2: Correlation between lexical and syntactic $H_\alpha$, according to $\alpha$. Europarl (dotted), UN corpus (dashed), Wikipedia (dash-dotted), and the union of all three (solid). Blue for Pearson, red for Spearman.
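The two coefficients named in the Figure 2 caption can be computed as in the sketch below, which uses only the standard library. It assumes one lexical and one syntactic $H_\alpha$ value per sample for a fixed $\alpha$; the helper names `pearson`, `ranks` and `spearman` and the numbers in the example are hypothetical and are not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: linear association between two value lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Invented toy values: the relation is monotonic but not linear.
lexical = [1.0, 2.0, 3.0, 4.0]
syntactic = [1.0, 4.0, 9.0, 16.0]
r_pearson = pearson(lexical, syntactic)
r_spearman = spearman(lexical, syntactic)
```

Because Spearman only looks at ranks, it reaches 1 for any strictly increasing relation, while Pearson stays below 1 unless the relation is linear. Plotting both, as the caption describes, therefore separates "same ordering" from "same proportions".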
