
1st Workshop on Multilingual Data Quality Signals
Palais des Congrès, Montreal, Canada
10 October 2025
Shared Task on Improving Language Identification for Web Text
Key dates
- Deadline to contribute annotations for the 1st round: July 23, 2025
- Registration Deadline: July 23, 2025
- Decision Notification: July 24, 2025
- Camera Ready Deadline: September 21, 2025
- Workshop Date: October 10, 2025
(All deadlines are 23:59 AoE.)
Motivation
The lack of training data—especially high-quality data—is the root cause of poor language model performance for many languages. One obstacle to improving the quantity and quality of available text data is language identification (LangID or LID), a task which remains far from solved for many languages. The aim of this shared task is to encourage innovation in open-source language identification and improve accuracy on a broad range of languages.
All accepted authors will be invited to contribute a larger paper, which will be submitted to a high-impact NLP venue.
Objectives
The objective of this shared task is to improve language identification for web data. There are two subtasks: community annotation of web data and language identification model development.
Subtask 1: Community annotation
We’re currently collecting annotations on web data for this shared task. You can contribute to the annotation task by going to the Common Crawl - MLCommons Dynabench Language Identification Task. All contributors will be invited to be co-authors of the eventual dataset paper, and all data and annotations will be open-sourced.
All contributors will also be invited to be co-authors of a shared task paper and poster at the workshop where all statistics about the aggregated annotations will be presented. Top contributors to the annotation task will be recognised as part of the shared task awards.
We will be holding some informal hackathons on our Discord server in order to encourage more annotations - join the server to stay updated!
Subtask 2: Language identification models
The goal of this part of the shared task is to develop robust, high-coverage language identification models which are effective on web data. We encourage participants to experiment with a range of approaches, including experimentation with new architectures and the curation of novel high-quality annotated datasets.
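As one possible starting point (an illustrative sketch, not a requirement or the official baseline of the task), a fastText-style supervised classifier can be trained on labelled text. The sketch below assumes a hypothetical file train.txt with one example per line in fastText's `__label__<code>` format, e.g. `__label__eng_Latn some text ...`.

```python
# Illustrative fastText-style LangID baseline; hyperparameters and the
# "train.txt" path are assumptions for the sake of the example.
import fasttext

model = fasttext.train_supervised(
    input="train.txt",   # hypothetical labelled training file
    dim=256,             # embedding dimension
    minn=2, maxn=5,      # character n-grams, helpful for short or noisy web text
    epoch=5,
    lr=0.5,
)

# predict returns the top-k labels (with the "__label__" prefix) and probabilities
labels, probs = model.predict("Ceci est une phrase en français.", k=3)
print(labels, probs)

model.save_model("langid_baseline.bin")
```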
Language coverage
As expected for a workshop focusing on multilinguality, we encourage participants to submit models supporting as many language varieties as possible. That said, we support broad participation and welcome submissions that focus on a particular language or set of languages, especially if they present particular challenges for language identification.
Evaluation
Our primary evaluation will be conducted using a held-out test set of web data, covering a subset of the language varieties contained in GlotLID. We will also compare all submissions to a baseline of common existing language identification models.
Submissions that focus on a specific set of languages will be ranked separately with respect to those languages alone. We will recognise systems that perform well across the whole set of languages as well as those that excel on specific languages.
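As a rough illustration of how per-language ranking could work (the official evaluation script is determined by the organisers and may differ), one can compute macro-averaged F1 restricted to a submission's declared languages:

```python
# Illustrative scoring sketch only; not the official evaluation script.
from sklearn.metrics import f1_score

def restricted_macro_f1(y_true, y_pred, focus_languages):
    """Macro F1 computed only over the labels a submission declares support for."""
    return f1_score(y_true, y_pred, labels=sorted(focus_languages), average="macro")

# Hypothetical gold and predicted labels:
gold = ["eng_Latn", "fra_Latn", "yor_Latn", "eng_Latn"]
pred = ["eng_Latn", "fra_Latn", "hau_Latn", "eng_Latn"]
print(restricted_macro_f1(gold, pred, {"fra_Latn", "yor_Latn"}))
```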
We recommend using high-coverage evaluation sets like FLORES+ or UDHR-LID for testing during development. We advise participants to check their training data carefully for contamination and to ensure they use the same language-agnostic pre-processing pipeline for preparing both training and test data. We provide an example here.
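To illustrate what a language-agnostic pre-processing step might look like (this is an assumption for illustration, not the example pipeline provided by the organisers):

```python
# Minimal language-agnostic normalisation sketch; refer to the organisers'
# example pipeline for the canonical version.
import unicodedata

def preprocess(text: str) -> str:
    """Normalise text without language-specific assumptions."""
    text = unicodedata.normalize("NFKC", text)   # canonical Unicode normalisation
    text = " ".join(text.split())                # collapse all whitespace runs
    return text.strip()

assert preprocess("Hello\u00A0  world\n") == "Hello world"
```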
Training data and label format
We recommend using the GlotLID corpus as a starting point for training data. Please note that this data should not be redistributed. Access to this dataset is controlled through the Hugging Face platform.
Participants are welcome to use additional datasets to train their models. However, any such datasets should be open source, with no license that prevents their redistribution or use for training models. Datasets should be documented in the paper submission, including their provenance, any cleaning procedures, and how annotations were added (if applicable).
We will use the same language label format as GlotLID: an ISO 639-3 language code plus an ISO 15924 script code, separated by an underscore.
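For example, English written in Latin script is labelled eng_Latn and Standard Arabic in Arabic script arb_Arab. A small format check (illustrative only; it validates the pattern, not whether the codes actually exist in the registries) could look like:

```python
# Illustrative check for the <ISO 639-3>_<ISO 15924> label format used by GlotLID.
import re

LABEL_RE = re.compile(r"^[a-z]{3}_[A-Z][a-z]{3}$")  # e.g. "eng_Latn", "arb_Arab"

def is_valid_label(label: str) -> bool:
    """True if the label matches the 3-letter language + 4-letter script pattern."""
    return bool(LABEL_RE.match(label))

assert is_valid_label("eng_Latn")
assert not is_valid_label("en-Latn")
```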
Participation information
Registration
Registration is only required for LangID systems, not for annotations. To register, please submit a one-page document with a title, a list of authors, a list of provisional languages that you want to focus on, and a brief description of your approach. This document should be sent to wmdqs-pcs@googlegroups.com. You can change the list of languages or the system description during the shared task. This document's only purpose is to register your participation in the shared task.
Submission and publications
For the final camera-ready submission, participants should submit a LangID system and/or dataset as well as a paper. Papers should be in the ACL ARR template. Papers can be up to 8 pages of main content (though we welcome much shorter submissions), with unlimited additional pages after the conclusion for citations, limitations and ethical considerations. Authors may use as many pages of appendices (after the bibliography) as they wish, but reviewers are not required to read the appendix.
All accepted submissions will be presented as posters. After the workshop, the organizers will lead two papers: one about LangID models and one about the training data for the models. All authors from all accepted submissions will be invited to be authors on either the model or data paper, or both, depending on the nature of their submission.
Web Languages Project
We also invite all participants of the LangID task to contribute to the Web Languages Project. Through this project, we ask speakers of Languages Other Than English (LOTE) to contribute URLs of websites that they know and that contain content written in their language. We will then add these URLs to Common Crawl's seed crawl, which we hope will allow us to discover more web content written in these languages. Common Crawl respects Robots Exclusion Protocol directives, ensuring that all this new linguistic content is crawled politely.
If you want to contribute to this project, please visit our GitHub Repository for more instructions.
Ethics Review
Reviewers and ACs may flag submissions for ethics review. Flagged submissions will be sent to an ethics review committee for comments. Comments from ethics reviewers will be considered by the primary reviewers and AC as part of their deliberation. They will also be visible to authors, who will have an opportunity to respond. Ethics reviewers do not have the authority to reject papers, but in extreme cases papers may be rejected by the program chairs on ethical grounds, regardless of scientific quality or contribution.
Guides & Policies
- Conflict of Interest Policy. The COLM program committee and all submitting authors must follow the COLM Conflict of Interest Policy.
- Author Guidelines. Authors are expected to follow the COLM Author Guide.
- AC Guidelines. Area chairs are expected to follow the COLM AC Guidelines.
- Code of Conduct. All COLM participants, including authors, are required to adhere to the COLM Code of Conduct. More detailed guidance for authors, reviewers, and all other participants will be made available in due course, and participation will require acknowledging and adhering to the provided guidelines.
- Code of Ethics. All participants of COLM, including the submission and reviewing process, must abide by COLM’s Code of Ethics.
- Reviewing Guidelines. Reviewers and area chairs will be asked to follow the COLM Reviewing Guidelines.
Contact
In case of queries, please message the organisers via wmdqs-pcs@googlegroups.com.