CLLG — Corpus Liberatum Linguae Graecae

About the Project

The Corpus Liberatum Linguae Graecae (CLLG) is a collaborative open-science project aimed at building a freely accessible, high-quality corpus of ancient Greek texts. It provides a complete pipeline from scanned edition images to structured TEI XML, serving the needs of philology, digital humanities, and natural language processing.

Access to ancient Greek texts is currently dominated by the Thesaurus Linguae Graecae (TLG), a proprietary resource that limits reproducible research. While existing open initiatives such as Perseus, First1KGreek, and the Patristic Text Archive (PTA) have made significant contributions, their coverage remains partial — in particular for late antique (Christian and non-Christian) and Byzantine texts. The CLLG aims to fill this gap with an open, sustainable alternative.

What We Do

The project operates along three major axes:

Technological development — OCR for polytonic Greek, layout analysis, and automatic TEI XML encoding.
Corpus production — High-quality, interoperable corpora structured in TEI XML, covering prose texts with canonical references.
Open distribution — All data and tools are released under free licenses and published via Nakala and GitLab.

Legal Context

The project operates within the framework established by French case law (TGI, Droz v. Classiques Garnier, 27 March 2014, confirmed on appeal 9 June 2017), which holds that the scholarly transcription of ancient texts does not constitute an original creative work protected by copyright, as the transcriber’s choices are governed by scientific method rather than personal expression.

Funding

This project is carried out within the PIQ project and is supported by the French National Research Agency (ANR, Agence Nationale de la Recherche) under the France 2030 grant reference ANR-24-RRII-0002, operated by the Inria Quadrant Program.

The longer-term infrastructure goal is Biblissima Textes, a new component of Biblissima that will serve CLLG texts and other open corpora for ancient and medieval languages via the Distributed Text Services (DTS) API.

Resources

All data and tools are released under free licenses.

GitLab

FreED Corpus

The Freely Editable and Distributable (FreED) Corpus for ancient Greek, released on GitLab under a free license. It is encoded in structured TEI XML and covers prose texts with canonical references.

HuggingFace

Models on HuggingFace

OCR and layout analysis models for ancient Greek and Latin texts.

Datasets on HuggingFace

Synthetic training dataset of ~185,000 pages generated from TEI/XML sources across 4,582 works with over 5,000 style combinations, plus real-world benchmarks of scanned critical editions.

Source Code & Data Repositories

CLLG on GitHub

Source code, tools, and supporting repositories.

CLLG on GitLab (Inria)

Primary source code and corpus data hosted at Inria.

Software

MyDapytains

A ready-to-deploy DTS (Distributed Text Services) API server built on TEI structure, enabling standard programmatic access to corpus texts.
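
As a rough illustration of what programmatic access through a DTS server can look like, the sketch below builds a request URL for a passage of a text. It is not taken from MyDapytains: the base URL, the `/document` path, and the resource identifier are hypothetical, and the `resource` and `ref` parameter names reflect one reading of the DTS specification. A real client should discover endpoint templates from the server's entry point instead of hard-coding them.

```python
from urllib.parse import urlencode


def dts_document_url(base_url, resource, ref=None):
    """Build a URL for a DTS Document request (hypothetical endpoint layout).

    base_url -- root of the DTS API (assumed; depends on the deployment)
    resource -- identifier of the text to retrieve
    ref      -- optional citable passage, e.g. "1.2"
    """
    params = {"resource": resource}
    if ref is not None:
        params["ref"] = ref
    # urlencode percent-encodes reserved characters such as ":" in identifiers.
    return f"{base_url.rstrip('/')}/document?{urlencode(params)}"


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(dts_document_url("https://example.org/api/dts", "urn:example:work", ref="1.2"))
```

The returned URL can be passed to any HTTP client; the response to a Document request is expected to be TEI XML for the requested passage.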

HookTest

Validation tool for TEI XML files with CiteStructure encoding and DTS catalog registries. Allows automated testing and reference retrieval without human intervention.
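
To give a sense of what automated reference retrieval involves, here is a minimal sketch using only the Python standard library. It does not use HookTest's API. It assumes a hard-coded two-level book.chapter scheme carried by nested `div` elements with `n` attributes, whereas a real file would declare its citation structure in the TEI header; the sample document and the function name are invented for illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample: one book containing two chapters.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div type="textpart" subtype="book" n="1">
      <div type="textpart" subtype="chapter" n="1"><p>[text]</p></div>
      <div type="textpart" subtype="chapter" n="2"><p>[text]</p></div>
    </div>
  </body></text>
</TEI>"""


def list_references(tei_xml):
    """Return book.chapter references, failing loudly on a missing @n."""
    root = ET.fromstring(tei_xml)
    refs = []
    for book in root.findall(".//tei:body/tei:div", TEI_NS):
        for chapter in book.findall("tei:div", TEI_NS):
            book_n, chapter_n = book.get("n"), chapter.get("n")
            if book_n is None or chapter_n is None:
                raise ValueError("citable div without an n attribute")
            refs.append(f"{book_n}.{chapter_n}")
    return refs


if __name__ == "__main__":
    print(list_references(SAMPLE))
```

Raising an error on a missing `n` attribute is the kind of check that lets a test suite run unattended: a file either yields a complete list of references or fails.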

Kraken-JS

JavaScript runtime for Kraken OCR/HTR models, running on Node.js and Electron without requiring Python at inference time. Includes segmentation, recognition, and a full end-to-end pipeline with WebGPU/CUDA support.

Publications

Research outputs from the CLLG project.

2026 arXiv preprint arXiv:2603.02803

Structure-Aware Text Recognition for Ancient Greek Critical Editions

Nicolas Angleraud, Antonia Karamolegkou, Benoît Sagot, Thibault Clérice

We investigate how visual language models perform on historical scholarly texts. We create a synthetic dataset of 185,000 pages from TEI/XML sources and a real-world benchmark of scanned critical editions. Qwen3VL-8B achieves 1.0% Character Error Rate on actual scans.
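
For readers unfamiliar with the metric, Character Error Rate is commonly computed as the edit distance between the recognized text and the reference transcription, divided by the length of the reference. The sketch below shows that common definition; it is not the paper's evaluation code. The NFC normalization step is an assumption added here because polytonic Greek can be encoded with either precomposed or combining characters, and the function names are illustrative.

```python
import unicodedata


def levenshtein(a, b):
    """Edit distance counting insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """CER = edit distance / reference length, after NFC normalization (assumed)."""
    reference = unicodedata.normalize("NFC", reference)
    hypothesis = unicodedata.normalize("NFC", hypothesis)
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # One dropped diacritic in a five-character word counts as one substitution.
    print(character_error_rate("μῆνιν", "μηνιν"))
```

Under this definition, a 1.0% CER corresponds to roughly one character-level edit per hundred reference characters.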

2026 arXiv preprint arXiv:2605.27750

Reading or Guessing? Visual Grounding Failures of Vision-Language Models for OCR in Ancient Greek Editions

Antonia Karamolegkou, Nicolas Angleraud, Benoît Sagot, Thibault Clérice

We examine how VLMs perform OCR on ancient Greek texts and find they can generate plausible but visually unsupported text, relying on language priors rather than actual visual input. Fluent output is not necessarily visually grounded.