This is the goal of the Socface project, launched in 2021, which is developing technologies for large-scale processing of vast series of historical documents. Archivists, demographers, economists, historians and computer scientists are working together to this end.
The targeted documents exist first and foremost as images. Automatic handwriting recognition is then used to analyze all the nominative census lists from 1836 to 1936. The project involves creating a database of all individuals recorded in France over this period. This mass of information will enable fine-grained analysis of individual trajectories and the development of micro-level knowledge.
The artificial intelligence resources of GENCI's Jean Zay supercomputer, hosted and operated by IDRIS (CNRS), are being mobilized for this work, which should improve understanding of French economic and social structures over a century.
In addition, the information available in the nominative lists will be disseminated in Open Access, enabling anyone to freely browse hundreds of millions of records.
Christopher Kermorvant, associate researcher at the University of Rouen and CEO of the startup Teklia, which develops text recognition and information extraction technologies among other things, is actively involved in the project. He refers to "incredible resources for understanding the evolution of population movements, housing and, more broadly, lifestyle dynamics". However, the project first requires collecting archive images from all the départements. This collection has already been carried out "in almost 50 départements", he points out. Between 20 and 30 million images will be produced and analyzed using handwriting recognition tools.
The resources at Jean Zay will be used for model development on the one hand, and for processing during the production phase on the other.
This is the biggest project in France for automatic recognition of historical images.