
VigogneLLM: a collection of performant SLMs for French

The vast majority of small open-source LLMs show significant performance variations depending on the language used. As part of this project, we propose a new version of the Vigogne model family: a series of SLMs specifically optimized for French. Vigogne offers enhanced capabilities for better understanding and generating content in French, thereby addressing the needs of French-speaking users across a wide range of applications.

10 October 2025

The emergence of large language models (LLMs) has revolutionized the field of artificial intelligence. However, their Anglophone dominance presents a major challenge. While general-purpose multilingual models are capable of handling French effectively, they are sometimes less performant than models specifically trained on the language when it comes to capturing subtle nuances or addressing specialized use cases. It is in this context that the Vigogne project was conceived—an initiative aimed at adapting open-source models to the French language with remarkable precision.

The project benefited from privileged access to 250,000 hours of computation on H100 GPUs from the new extension of the Jean Zay supercomputer, managed by IDRIS (CNRS) and funded by GENCI. Thanks to these resources, we were able to carry out an intensive training program on five model architectures: Llama-3.2-1B, Llama-3.2-3B, Llama-3.1-8B, Qwen2.5-1.5B, and Qwen2.5-3B.

Methodology:

Our approach is based on a three-phase methodology. 

- The first phase, initial pre-training, involves exposing the models to a large corpus of French texts drawn from FineWeb-2, a massive database of cleaned web pages. This phase is equivalent to having the model “read” millions of documents so that it can learn the typical structure, vocabulary, and expressions of the French language. 

- The second phase, known as annealing, gradually refines the model’s learning. We apply a cosine-type scheduler to slowly reduce the learning rate, thereby avoiding oscillations and enabling more stable convergence. During this stage, the model is fed not only new excerpts from the FineWeb-2 dataset but also texts rewritten by other AIs, as well as French Magpie—a dataset specifically created for this project to enrich the diversity of styles and registers. This diversification is crucial to prevent the model from overfitting to a single style and to improve its overall robustness. 

- The third phase, supervised fine-tuning (SFT), is dedicated to instruction tasks. In this stage, the model is trained on carefully annotated data from our French Magpie dataset, where each input is an instruction in French and each output is an ideal response. This phase enables the model to understand complex instructions, rephrase texts, answer questions, or adopt a specific tone. Unlike base models, Vigogne thus becomes a reliable, compact French-speaking conversational partner, capable of following directives with precision.
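The cosine-type scheduler used in the annealing phase can be sketched in a few lines. The function below is a generic cosine decay; the project's actual maximum and minimum learning rates (and any warmup) are assumptions, not published details:

```python
import math

def cosine_lr(step, total_steps, lr_max, lr_min=0.0):
    """Cosine-decay schedule: starts at lr_max and decays smoothly
    to lr_min over total_steps, avoiding abrupt learning-rate drops."""
    progress = min(step / total_steps, 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

# At step 0 the rate equals lr_max; at the final step it reaches lr_min,
# passing through the midpoint halfway through training.
```

The smooth decay is what "avoiding oscillations" refers to above: late in training the rate is small, so updates make fine adjustments rather than large jumps.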

Results:

The results show a significant improvement across several French evaluation tasks:

- In reading comprehension, for example, Vigogne_Llama-3.2-1B improves from 0.5493 to 0.6338, demonstrating a better ability to grasp the overall meaning of a text. 

- In grammar, Vigogne_Qwen2.5-1.5B reaches 0.8403, compared to the initial 0.7563, resulting in more natural and grammatically correct responses. 

- On boolean (yes/no) questions, Vigogne_Llama-3.2-3B progresses from 0.5000 to 0.6966, a major advancement for dialogue systems requiring clear answers.
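The gains above are absolute score differences. As a quick check, they can be recomputed from the quoted before/after figures:

```python
# Before/after scores quoted in the article (French benchmark tasks).
scores = {
    "reading comprehension (Llama-3.2-1B)": (0.5493, 0.6338),
    "grammar (Qwen2.5-1.5B)": (0.7563, 0.8403),
    "boolean QA (Llama-3.2-3B)": (0.5000, 0.6966),
}

for task, (before, after) in scores.items():
    gain = after - before
    print(f"{task}: +{gain:.4f} ({gain * 100:.1f} points)")
```

The reading-comprehension gain of +0.0845 is the same figure cited later for the adaptation's effect on that task.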

A detailed performance analysis reveals instructive trends. On one hand, the Vigogne adaptation systematically improves the overall average of each base model in French, regardless of its size. For example, Vigogne_Llama-3.2-3B gains 6.2 points in average performance (from 0.5876 to 0.6496), while Vigogne_Qwen2.5-3B improves by 1.8 points (from 0.6361 to 0.6542). Even for a larger model like Llama-3.1-8B, the adaptation raises the average from 0.6154 to 0.6738, an increase of nearly 6 points—evidence that even already high-performing models benefit from specialization for French.

On the other hand, gains vary depending on the tasks and architectures. The Qwen2.5-3B model, for example, was already strong on boolean questions (0.8932), but Vigogne improves its performance in other areas such as the ARC Challenge (+0.0513) and reading comprehension (+0.0845), thereby balancing its overall profile. Conversely, Vigogne_Llama-3.2-3B, which started with a poor score in BoolQA (0.5000), nearly doubles this ability (0.6966), making it an ideal candidate for applications requiring rapid fact-checking. It is also observed that the effect of adaptation is particularly pronounced on grammar and vocabulary tasks.

All Vigogne models exceed 0.8 in grammar, compared to initial scores often below 0.78. This improvement is explained by intensive exposure to well-formed French texts and corrected instruction-response pairs, which refine the model’s internal syntax. Similarly, scores on the “French Bench Vocab” show systematic improvement, indicating better mastery of the rich and nuanced French lexicon—a valuable asset for writing or translation applications.

Finally, the comparison with reference models of comparable size, such as Mistral-7B or Lucie-7B, is striking: Vigogne_Llama-3.2-3B (3B parameters) achieves an average performance of 0.6496, surpassing Lucie-7B (0.6166) despite being smaller. This demonstrates that, thanks to targeted training, a smaller model can outperform larger but less specialized models. This efficiency is crucial for reducing deployment costs and carbon footprint.

These performances place Vigogne among the best small-scale French-language models available as open source (publication date: 6 February 2025). Their compact size (1 to 8 billion parameters) makes them agile, eco-friendly, and deployable locally, reducing reliance on foreign clouds and protecting data privacy. This approach aligns with a vision of sovereign, ethical, and accessible AI, capable of serving French-speaking businesses, administrations, and researchers without compromising security or transparency.

The prospects are numerous. In the short term, Vigogne can be integrated into industry-specific chatbots, writing assistance tools, or specialized translation systems. In the longer term, it could serve as the foundation for even more refined models, optimized for domains such as healthcare, law, or education. Future work will explore advanced techniques such as Direct Preference Optimization (DPO) to further fine-tune the model’s preferences beyond simple SFT.

Finally, this project highlights the central role of high-performance computing (HPC) infrastructure in AI innovation. Without access to Jean Zay, training at this scale would have been impossible. It also underscores the importance of public-private collaborations to accelerate research and ensure linguistic diversity within the AI ecosystem.

Key figure:

A 7% average improvement across all adapted models and tasks.

Definitions:

Spotlight 1: Why use compact models? Less resource-intensive and deployable locally, they reduce costs and cloud dependence while remaining effective for targeted use cases. 

Spotlight 2: What is French Magpie? A French-language dataset with AI-generated responses, used to diversify styles during training. This dataset was created specifically for this project. 

Spotlight 3: What is supervised fine-tuning? Training on instruction-response pairs to teach a model to follow precise instructions reliably.
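As an illustration of Spotlight 3, each instruction-response pair is typically serialized into a single training string. The template markers below are hypothetical, for illustration only, and are not the project's actual chat format:

```python
def format_example(instruction, response):
    """Render one instruction-response pair into a single training string.
    The '###' section markers are an illustrative convention, not the
    template actually used to train Vigogne."""
    return (
        "### Instruction:\n" + instruction.strip() + "\n\n"
        "### Réponse:\n" + response.strip()
    )

# Example: a French instruction paired with its ideal response.
text = format_example("Traduis 'hello' en français.", "Bonjour.")
```

During SFT, the loss is computed on such strings so the model learns to produce the response portion given the instruction portion.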


Scientific domain

  • CT10: Artificial intelligence and cross-cutting applications of computing

Team

  • Imed Laaridh, Zaion
  • Moussa Kamal Eddine, Zaion
  • Bofeng Huang, Zaion

Organisation(s)

Zaion

Resources used

250,000 hours of computation on H100 GPUs

Year of allocation

  • 2024
