The Lucie-7B LLM and the Lucie Training Dataset: Open resources for multilingual language generation

Olivier Gouvert, Julie Hunter, Jérôme Louradour, Christophe Cerisara, Evan Dufraisse, Yaya Sy, Laura Rivière, Jean-Pierre Lorré, OpenLLM-France community

TL;DR

The paper presents Lucie-7B, a French-leaning, open-source multilingual LLM trained on the Lucie Training Dataset, a rights-conscious corpus designed to reduce Anglo-centric biases. It details the curation of 25 subcorpora, a two-stage data processing pipeline, and a three-phase pretraining strategy that extends the context length to 32k tokens and ends with an annealing phase targeting math and reasoning. Lucie-7B achieves competitive performance on multilingual and French benchmarks relative to open models, demonstrating that performant, governance-aligned LLMs can be built with open data and transparent processes. In addition to the foundation model, the authors release two instruction-tuned variants to illustrate potential alignment workflows, and they emphasize ongoing work toward deeper alignment and broader data coverage. The work highlights practical considerations for responsible LLM development and provides concrete resources (datasets, model weights, and code) to the community.

Abstract

We present both the Lucie Training Dataset and the Lucie-7B foundation model. The Lucie Training Dataset is a multilingual collection of textual corpora centered around French and designed to offset Anglo-centric biases found in many datasets for large language model pretraining. Its French data is pulled not only from traditional web sources, but also from French cultural heritage documents, filling an important gap in modern datasets. Beyond French, which makes up the largest share of the data, we added documents to support several other European languages, including English, Spanish, German, and Italian. Apart from its value as a resource for French language and culture, an important feature of this dataset is that it prioritizes data rights by minimizing copyrighted material. In addition, building on the philosophy of past open projects, it is redistributed in the form used for training and its processing is described on Hugging Face and GitHub. The Lucie-7B foundation model is trained on equal amounts of data in French and English -- roughly 33% each -- in an effort to better represent cultural aspects of French-speaking communities. We also describe two instruction fine-tuned models, Lucie-7B-Instruct-v1.1 and Lucie-7B-Instruct-human-data, which we release as demonstrations of Lucie-7B in use. These models achieve promising results compared to state-of-the-art models, demonstrating that an open approach prioritizing data rights can still deliver strong performance. We see these models as an initial step toward developing more performant, aligned models in the near future. Model weights for Lucie-7B and the Lucie instruct models, along with intermediate checkpoints for the former, are published on Hugging Face, while model training and data preparation code is available on GitHub. This makes Lucie-7B one of the first OSI-compliant language models according to the new OSI definition.

Paper Structure

This paper contains 73 sections, 23 figures, and 11 tables.

Figures (23)

  • Figure 1: Distribution of documents by year in the AmericanStories dataset.
  • Figure 2: Distribution of documents by type in the Claire French (left) and English (right) datasets.
  • Figure 3: Distribution of documents by year in the FineWebEdu dataset.
  • Figure 4: Distribution of documents by year in the RedPajama v2 dataset for French, German, Spanish, and Italian.
  • Figure 5: Composition of the raw training data (2.32 billion tokens), by language (colors) and category (hatch patterns).
  • ...and 18 more figures