1st Workshop on
Multilingual Data Quality Signals

Palais des Congrès
Montreal, Canada

10 October 2025

Shared Task

We invite submissions to the first Shared Task on Language Identification for Web Data.

Key dates

First deadline to contribute annotations: July 7, 2025

First annotations released (train split): July 14, 2025

Abstract Deadline: July 21, 2025

Decision Notification: July 24, 2025

Camera Ready Deadline: September 21, 2025

Workshop Date: October 10, 2025

(All deadlines are 23:59 AoE.)

Abstract Submission: Please send an email with your abstract to wmdqs-pcs@googlegroups.com.

The email subject should be "[Shared Task Abstract Submission]: Abstract title".

Motivation

The lack of training data, especially high-quality data, is the root cause of poor language model performance for many languages. One obstacle to improving the quantity and quality of available text data is language identification (LangID or LID), which remains far from solved for many languages. Several of the commonly used LangID models were introduced in 2017 (e.g., fastText and CLD3). The aim of this shared task is to encourage innovation in open-source language identification and to improve accuracy on a broad range of languages.

All accepted authors will be invited to contribute a larger paper, which will be submitted to a high-impact NLP venue.

Objectives

Submissions to the shared task can contribute annotated training data, LangID systems, or both.

Community annotation

We’re currently collecting annotations on web data for this shared task. You can contribute by visiting the Common Crawl - MLCommons Dynabench Language Identification Task. All contributors will be invited to be co-authors of the eventual dataset paper, and all data and annotations will be open source.

We will be holding informal hackathons on our Discord server to encourage more annotations; join the server to stay updated!

LangID systems

The main shared task is to submit LangID models that perform well across a wide variety of languages on web data. We encourage participants to employ a range of approaches, including the development of new architectures and the curation of novel high-quality annotated datasets.

We recommend using the GlotLID corpus as a starting point for training data. Access to the data will be managed through the Hugging Face repository. Please note that the data should not be redistributed. We will use the same language label format as GlotLID: an ISO 639-3 language code plus an ISO 15924 script code, separated by an underscore (e.g., eng_Latn).
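
For illustration, here is a minimal Python sketch (not an official tool) that checks and splits labels of this shape. It validates only the surface form of a label, not membership in the actual ISO 639-3 or ISO 15924 code lists.

    import re

    # GlotLID-style label: three lowercase letters (ISO 639-3), an underscore,
    # then a four-letter title-case script code (ISO 15924), e.g. "eng_Latn".
    LABEL_PATTERN = re.compile(r"^[a-z]{3}_[A-Z][a-z]{3}$")

    def parse_label(label: str) -> tuple[str, str]:
        """Split a label such as 'eng_Latn' into (language, script)."""
        if not LABEL_PATTERN.match(label):
            raise ValueError(f"Not an ISO 639-3 + ISO 15924 label: {label!r}")
        language, script = label.split("_")
        return language, script

    print(parse_label("eng_Latn"))  # ('eng', 'Latn')
    print(parse_label("arb_Arab"))  # ('arb', 'Arab')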

Although all systems will be evaluated on the full range of languages in our test set, we encourage submissions that focus on a particular language or set of languages, especially if those languages present particular challenges for language identification.

Submission Information

For the initial submission, we require participants to submit an abstract of 300-500 words. The abstract should indicate the general approach being taken, which languages are targeted (if not all), and any preliminary results. Evaluation of systems is not required, as the organizers will evaluate all submitted systems.

For the final camera-ready submission, participants should submit a LangID system and/or dataset as well as a paper. Papers should use the ACL ARR template. Papers can be up to 8 pages of main content (though we welcome much shorter submissions), with unlimited additional pages after the conclusion for citations, limitations, and ethical considerations. Authors may use as many pages of appendices (after the bibliography) as they wish, but reviewers are not required to read them.

If datasets are collected, they should be open source and not have a license that prevents their redistribution or use for training models. Datasets should be documented in the paper submission, including their provenance, any cleaning procedures, and how annotations were added (if applicable).

All accepted submissions will be presented as posters. After the workshop, the organizers will lead two papers: one about LangID models and one about the training data for the models. All authors from all accepted submissions will be invited to be authors on either the model or data paper, or both, depending on the nature of their submission.

Ranking and Awards

We will evaluate systems on our held-out test set and compare all submissions against a set of existing baselines. Submissions that focus on a specific set of languages will also be ranked separately on those languages alone. We will recognize both systems that perform well across the whole set of languages and systems that excel on specific languages.
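
The exact metric is at the organizers' discretion; purely as an illustration, the Python sketch below assumes macro-averaged F1 (a common choice for LangID evaluation) and shows how a ranking could be restricted to a submission's declared focus languages.

    from collections import Counter

    def per_language_f1(gold, pred):
        """Compute per-language F1 from parallel lists of gold/predicted labels."""
        tp, fp, fn = Counter(), Counter(), Counter()
        for g, p in zip(gold, pred):
            if g == p:
                tp[g] += 1
            else:
                fp[p] += 1
                fn[g] += 1
        scores = {}
        for lang in set(gold):
            prec = tp[lang] / (tp[lang] + fp[lang]) if tp[lang] + fp[lang] else 0.0
            rec = tp[lang] / (tp[lang] + fn[lang]) if tp[lang] + fn[lang] else 0.0
            scores[lang] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        return scores

    def macro_f1(gold, pred, focus=None):
        """Macro-average F1, optionally restricted to a declared focus set."""
        scores = per_language_f1(gold, pred)
        langs = focus if focus is not None else sorted(scores)
        return sum(scores.get(lang, 0.0) for lang in langs) / len(langs)

    # A system focused on two languages is also ranked on those alone.
    gold = ["eng_Latn", "fra_Latn", "arb_Arab", "eng_Latn"]
    pred = ["eng_Latn", "eng_Latn", "arb_Arab", "eng_Latn"]
    print(macro_f1(gold, pred))                                  # all languages
    print(macro_f1(gold, pred, focus=["eng_Latn", "arb_Arab"]))  # focus subset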

Top contributors to the annotation task will also be recognized as part of the shared task awards.

Web Languages Project

We also invite all participants in the LangID task to contribute to the Web Languages Project, which asks speakers of Languages Other Than English (LOTE) to contribute URLs of websites they know that contain content written in their language. We will then add these URLs to Common Crawl's seed crawl, which we hope will allow us to discover more web content written in these languages. Common Crawl respects Robots Exclusion Protocol directives, ensuring that all this new linguistic content is crawled politely.
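
For the curious, the small Python sketch below shows what a Robots Exclusion Protocol check looks like using only the standard library; "CCBot" is Common Crawl's published crawler user agent, and the example.com URLs are placeholders. This is an illustration only, not part of the contribution process.

    from urllib import robotparser

    # Fetch a site's robots.txt and ask whether Common Crawl's crawler ("CCBot")
    # may fetch a given page. Both URLs here are placeholders.
    rp = robotparser.RobotFileParser("https://example.com/robots.txt")
    rp.read()
    print(rp.can_fetch("CCBot", "https://example.com/some-page"))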

If you want to contribute to this project, please visit our GitHub Repository for more instructions.

Ethics Review

Reviewers and ACs may flag submissions for ethics review. Flagged submissions will be sent to an ethics review committee for comments. Comments from ethics reviewers will be considered by the primary reviewers and AC as part of their deliberation. They will also be visible to authors, who will have an opportunity to respond. Ethics reviewers do not have the authority to reject papers, but in extreme cases papers may be rejected by the program chairs on ethical grounds, regardless of scientific quality or contribution.

Contact

For any queries, please email the organizers at wmdqs-pcs@googlegroups.com.