
1st Workshop on
Multilingual
Data Quality Signals
Palais des Congrès
Montreal, Canada
10 October 2025
Shared Task
We invite submissions to the first Shared Task on Language Identification for Web Data.
Key dates
First deadline to contribute annotations: July 7, 2025
First annotations released (train split): July 14, 2025
Abstract Deadline: July 21, 2025
Decision Notification: July 24, 2025
Camera Ready Deadline: September 21, 2025
Workshop Date: October 10, 2025
(All deadlines are 23:59 AoE.)
Abstract Submission: Please send an email with your abstract to wmdqs-pcs@googlegroups.com.
The email subject should be "[Shared Task Abstract Submission]: Abstract title".
Motivation
The lack of training data, especially high-quality data, is the root cause of poor language model performance for many languages. One obstacle to improving the quantity and quality of available text data is language identification (LangID or LID), which remains far from solved for many languages. Several of the commonly used LangID models were introduced in 2017 (e.g., fastText and CLD3). The aim of this shared task is to encourage innovation in open-source language identification and improve accuracy on a broad range of languages.
All authors of accepted submissions will be invited to contribute to a larger paper, which will be submitted to a high-impact NLP venue.
Objectives
Submissions to the shared task may contribute annotated training data, a LangID system, or both.
Community annotation
We’re currently collecting annotations on web data for this shared task. You can contribute to the annotation task by going to the Common Crawl - MLCommons Dynabench Language Identification Task. All contributors will be invited to be co-authors of the eventual dataset paper, and all data and annotations will be open-source.
We will be holding informal hackathons on our Discord server to encourage more annotations; join the server to stay updated!
LangID systems
The main shared task is to submit LangID models that work well on web data across a wide variety of languages. We encourage participants to employ a range of approaches, including the development of new architectures and the curation of novel high-quality annotated datasets.
We recommend using the GlotLID corpus as a starting point for training data. Access to the data will be managed through the Hugging Face repository. Please note that the data must not be redistributed. We will use the same language label format as GlotLID: an ISO 639-3 language code plus an ISO 15924 script code, separated by an underscore (e.g., eng_Latn).
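For reference, GlotLID is distributed as a fastText model on Hugging Face, so producing and parsing labels in this format can look like the minimal sketch below. The repository ID (cis-lmu/glotlid) and filename reflect GlotLID's public distribution as we understand it; please check the shared task and GlotLID pages for the authoritative locations.

```python
# Minimal sketch: predict a label in the shared task's format with GlotLID.
# The repo_id and filename below are our best understanding of GlotLID's
# public Hugging Face distribution; verify them against the official pages.
import fasttext
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(repo_id="cis-lmu/glotlid", filename="model.bin")
model = fasttext.load_model(model_path)

labels, scores = model.predict("This is a short line of web text.")
# fastText prefixes predictions with "__label__"; stripping it leaves a
# label such as "eng_Latn": ISO 639-3 code + "_" + ISO 15924 script code.
label = labels[0].removeprefix("__label__")
lang, script = label.split("_")
print(label, lang, script, round(float(scores[0]), 3))
```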
Although all systems will be evaluated on the full range of languages in our test set, we encourage submissions that focus on a particular language or set of languages, especially if those languages present particular challenges for language identification.
Submission Information
For the initial submission, we require participants to submit an abstract of 300-500 words. The abstract should indicate the general approach that is being taken, which languages you are focusing on (if not all languages), and preliminary results, if any. Evaluation of systems is not required, as the organizers will evaluate all submitted systems.
For the final camera-ready submission, participants should submit a LangID system and/or dataset as well as a paper. Papers should be in the ACL ARR template. Papers can be up to 8 pages of main content (though we welcome much shorter submissions), with unlimited additional pages after the conclusion for citations, limitations and ethical considerations. Authors may use as many pages of appendices (after the bibliography) as they wish, but reviewers are not required to read the appendix.
If datasets are collected, they should be open source and not have a license that prevents their redistribution or use for training models. Datasets should be documented in the paper submission, including their provenance, any cleaning procedures, and how annotations were added (if applicable).
All accepted submissions will be presented as posters. After the workshop, the organizers will lead two papers: one about LangID models and one about the training data for the models. All authors from all accepted submissions will be invited to be authors on either the model or data paper, or both, depending on the nature of their submission.
Ranking and Awards
We will evaluate all systems on our held-out test set and compare them to a set of existing baselines. Submissions that focus on a specific set of languages will additionally be ranked on those languages alone. We will recognize both systems that perform well across the full range of languages and systems that excel on particular languages.
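The official evaluation metric has not been announced here. Purely as an illustration of how whole-set and language-focused rankings can coexist, the sketch below scores a toy submission with macro-averaged F1 (an assumption on our part, not the announced metric), both over all labels and restricted to a declared focus set.

```python
# Illustration only: macro-F1 is an assumed metric, not the official one.
from sklearn.metrics import f1_score

gold = ["eng_Latn", "fra_Latn", "yor_Latn", "yor_Latn"]
pred = ["eng_Latn", "fra_Latn", "yor_Latn", "eng_Latn"]

# Score across every language in the (toy) test set.
overall = f1_score(gold, pred, average="macro")

# Separate ranking for a submission that declared a focus on Yoruba:
# restrict to test items whose gold label is in the focus set.
focus = ["yor_Latn"]
pairs = [(g, p) for g, p in zip(gold, pred) if g in focus]
focused = f1_score([g for g, _ in pairs], [p for _, p in pairs],
                   labels=focus, average="macro")
print(round(overall, 3), round(focused, 3))
```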
Top contributors to the annotation task will also be recognized as part of the shared task awards.
Web Languages Project
We also invite all participants of the LangID task to contribute to the Web Languages Project. Through this project, we are asking speakers of Languages Other Than English (LOTE) to contribute URLs of websites that they know and that contain content written in their language. We will then add these URLs to Common Crawl's seed crawl, which we hope will allow us to discover more web content in these languages. Common Crawl respects Robots Exclusion Protocol directives, ensuring that all this new linguistic content is crawled politely.
If you want to contribute to this project, please visit our GitHub Repository for more instructions.
Ethics Review
Reviewers and ACs may flag submissions for ethics review. Flagged submissions will be sent to an ethics review committee for comments. Comments from ethics reviewers will be considered by the primary reviewers and AC as part of their deliberation. They will also be visible to authors, who will have an opportunity to respond. Ethics reviewers do not have the authority to reject papers, but in extreme cases papers may be rejected by the program chairs on ethical grounds, regardless of scientific quality or contribution.
Guides & Policies
- Conflict of Interest Policy. The COLM program committee and all submitting authors must follow the COLM Conflict of Interest Policy.
- Author Guidelines. Authors are expected to follow the COLM Author Guide.
- AC Guidelines. Area chairs are expected to follow the COLM AC Guidelines.
- Code of Conduct. All COLM participants, including authors, are required to adhere to the COLM Code of Conduct. More detailed guidance for authors, reviewers, and all other participants will be made available in due course, and participation will require acknowledging and adhering to the provided guidelines.
- Code of Ethics. All participants of COLM, including the submission and reviewing process, must abide by COLM’s Code of Ethics.
- Reviewing Guidelines. Reviewers and area chairs will be asked to follow the COLM Reviewing Guidelines.
Contact
In case of queries, please email the organizers at wmdqs-pcs@googlegroups.com.