Open Science · Ancient Greek · NLP

Corpus Liberatum
Linguæ Graecæ

An open-science initiative that makes ancient Greek texts freely accessible through a complete pipeline from scanned editions to structured TEI XML.

About the Project

The Corpus Liberatum Linguæ Graecæ (CLLG) is a collaborative open-science project aimed at building a freely accessible, high-quality corpus of ancient Greek texts. It provides a complete pipeline from scanned edition images to structured TEI XML, serving the needs of philology, digital humanities, and natural language processing.

Access to ancient Greek texts is currently dominated by the Thesaurus Linguae Graecae (TLG), a proprietary resource that limits reproducible research. While existing open initiatives such as Perseus, First1KGreek, and the Patristic Text Archive (PTA) have made significant contributions, their coverage remains partial — in particular for late antique (Christian and non-Christian) and Byzantine texts. The CLLG aims to fill this gap with an open, sustainable alternative.

What We Do

The project operates along three major axes:

  • Technological development — OCR for polytonic Greek, layout analysis, and automatic TEI XML encoding.
  • Corpus production — High-quality, interoperable corpora structured in TEI XML, covering prose texts with canonical references.
  • Open distribution — All data and tools are released under free licenses and published via Nakala and GitLab.
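As a rough illustration of what "structured in TEI XML, with canonical references" can mean in practice, the sketch below builds a minimal TEI document and resolves a dotted reference such as 1.2 to its passage. It is not the CLLG encoding scheme: the nested div elements with type="textpart" and an n attribute, the helper names build_document and resolve, and the sample text are all assumptions chosen for the example.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace


def tei(tag: str) -> str:
    """Return a namespace-qualified TEI tag name for ElementTree."""
    return f"{{{TEI_NS}}}{tag}"


def build_document(title: str, chapters: dict) -> ET.Element:
    """Build a minimal TEI tree from {chapter_n: {section_n: text}}.

    The div/@type="textpart" layout is an assumed convention for this
    sketch, not the project's documented schema.
    """
    root = ET.Element(tei("TEI"))
    file_desc = ET.SubElement(ET.SubElement(root, tei("teiHeader")), tei("fileDesc"))
    title_stmt = ET.SubElement(file_desc, tei("titleStmt"))
    ET.SubElement(title_stmt, tei("title")).text = title
    publication = ET.SubElement(file_desc, tei("publicationStmt"))
    ET.SubElement(publication, tei("p")).text = "Illustrative example, not a published edition."
    source = ET.SubElement(file_desc, tei("sourceDesc"))
    ET.SubElement(source, tei("p")).text = "No source edition; sample text only."

    body = ET.SubElement(ET.SubElement(root, tei("text")), tei("body"))
    for chapter_n, sections in chapters.items():
        chapter = ET.SubElement(
            body, tei("div"), {"type": "textpart", "subtype": "chapter", "n": chapter_n}
        )
        for section_n, text in sections.items():
            section = ET.SubElement(
                chapter, tei("div"), {"type": "textpart", "subtype": "section", "n": section_n}
            )
            ET.SubElement(section, tei("p")).text = text
    return root


def resolve(root: ET.Element, reference: str) -> str:
    """Return the text addressed by a dotted reference such as '1.2'."""
    node = root.find(f"{tei('text')}/{tei('body')}")
    for part in reference.split("."):
        node = node.find(f"{tei('div')}[@n='{part}']")
        if node is None:
            raise KeyError(f"no passage at reference {reference!r}")
    return "".join(node.itertext()).strip()


sample = build_document(
    "Sample text",
    {"1": {"1": "Ἐν ἀρχῇ ἦν ὁ λόγος", "2": "οὗτος ἦν ἐν ἀρχῇ πρὸς τὸν θεόν"}},
)
print(resolve(sample, "1.2"))
print(ET.tostring(sample, encoding="unicode")[:80])
```

Keeping the reference hierarchy in the markup itself, rather than in a side index, is what lets a citation like 1.2 be resolved from the file alone.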

The project operates within the framework established by French case law (TGI, Droz v. Classiques Garnier, 27 March 2014, confirmed on appeal 9 June 2017), which holds that the scholarly transcription of ancient texts does not constitute an original creative work protected by copyright, as the transcriber’s choices are governed by scientific method rather than personal expression.

Funding

The Corpus Liberatum Linguae Graecae project is carried out within the PIQ project. It is supported by the French National Research Agency (ANR, Agence Nationale de la Recherche) under the France 2030 grant reference number ANR-24-RRII-0002, operated by the Inria Quadrant Program.

The longer-term infrastructure goal is Biblissima Textes, a new component of Biblissima that will serve CLLG texts and other open corpora for ancient and medieval languages via the Distributed Text Services (DTS) API.

Resources

All data and tools are released under free licenses.

HuggingFace

Source Code & Data Repositories

Software

Publications

Research outputs from the CLLG project.

2026 arXiv preprint arXiv:2603.02803

Structure-Aware Text Recognition for Ancient Greek Critical Editions

Nicolas Angleraud, Antonia Karamolegkou, Benoît Sagot, Thibault Clérice

We investigate how visual language models perform on historical scholarly texts. We create a synthetic dataset of 185,000 pages from TEI/XML sources and a real-world benchmark of scanned critical editions. Qwen3VL-8B achieves 1.0% Character Error Rate on actual scans.
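For readers unfamiliar with the metric, Character Error Rate is conventionally the character-level edit distance between the model output and the reference transcription, divided by the length of the reference. The sketch below illustrates that standard definition only; it is not the paper's evaluation code, and the function names and the NFC normalization step are assumptions made here (normalization matters for polytonic Greek, where the same accented letter can be encoded in more than one way).

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by reference length, after NFC normalization."""
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    return levenshtein(ref, hyp) / len(ref)


# One missed circumflex: a single substitution over 15 reference characters.
reference = "μῆνιν ἄειδε θεά"
hypothesis = "μηνιν ἄειδε θεά"
print(f"CER = {character_error_rate(reference, hypothesis):.4f}")
```

Under this definition, a 1.0% CER corresponds to about one character edit per hundred reference characters.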

Team

A multidisciplinary team at ALMAnaCH, Inria Paris — combining Natural Language Processing and Computational Humanities.

Thibault Clérice, Project Lead
Benoît Sagot, Researcher
Nicolas Angleraud, AI Vision Engineer
Antonia Karamolegkou, Data Librarian & NLP Engineer
Maïna Laurent, Intern