1st Workshop on
Multilingual Data Quality Signals

Palais des Congrès
Montreal, Canada

October 10th 2025

Call for Papers

Summary

We propose a workshop on the topic of data quality in multilingual pre-training data. Our workshop will include two invited speakers, selected research talks, and two poster sessions for participants to present their research. We will also run an accompanying shared task on language identification, a key determinant of data quality in large multilingual datasets.

Motivation

Our workshop topic is motivated by the rise of pre-trained large language models (LLMs) in contemporary natural language processing (NLP) and the accompanying need to obtain and manage the increasingly large datasets needed to train LLMs. Despite the prior focus on quantity of data, recent research has highlighted the importance of effective selection, filtering, and cleaning of pre-training data for optimal LLM performance: in short, data quality. Tackling data quality is even more important in a multilingual setting, where the amount of training data in many languages is limited and the data that does exist is often of unacceptably low quality. Indeed, for many languages even the fundamental step of language identification (LID) remains a challenge, leading to unreliable language labels and thus noisy datasets for under-served languages.

Despite the importance of data quality in LLM pre-training data, there have only been a small number of workshops addressing the topic directly (see section below), and these have mostly taken place in the last year. None of the previous workshops have focused on multilingual applications, despite the fact that data quality is even more crucial for languages other than English. We believe this makes our workshop proposal both important and timely, since it addresses a key contemporary issue for widening access to NLP technologies for more language communities.

One of the unique aspects of this workshop proposal is its emphasis on under-served and under-resourced languages. One of the biggest barriers preventing people from reaping the benefits of language models is that the technology is not available in their language. This is further exacerbated by other barriers to entry in the field, such as cost, the need for technical training, and a myriad of other issues. A major benefit of this workshop is that it enables participation by people with very little technical background.

Participation

We plan for two modes of participation: submitted research papers on the topic of data quality in multilingual pre-training data, and a shared task on language identification. In both cases, we will focus on building community through poster sessions and invited talks. We will also reach out to relevant communities of interest to ensure engagement with under-served language communities (e.g. Masakhane, academic institutions studying both indigenous and lower-resource languages, and relevant global affinity groups).

Research Papers

Papers must be submitted in COLM format via OpenReview. The reviewing procedure will mitigate conflicts of interest by drawing on the diverse backgrounds of our workshop organizers, and accepted papers will be non-archival. In keeping with our focus on community building, we will facilitate the sharing of ideas through poster sessions.

Key dates

Submission deadline: June 23, 2025

Accept/reject notification: July 24, 2025

(All deadlines are 23:59 AoE.)

Shared Task

Currently, one of the main steps in data pipelines is LID. It is one of the first filters applied to extract the data most relevant to a particular NLP use case, since training LLMs on more than a couple dozen languages remains rather uncommon. LID has also been used extensively as a proxy for data quality: practitioners set a confidence threshold for language identification and discard all documents that fall below it.
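
To make the filtering step concrete, the following minimal Python sketch applies such a threshold filter using a pre-trained fastText LID model. The model path, target languages, and the 0.5 confidence threshold are illustrative assumptions rather than values prescribed by any particular pipeline.

    import fasttext

    # Load a pre-trained fastText LID model. The path is an assumption;
    # lid.176.bin is the publicly released 176-language fastText model.
    model = fasttext.load_model("lid.176.bin")

    def keep_document(text, target_langs, threshold=0.5):
        """Keep a document only if its predicted language is in
        target_langs and the LID confidence clears the threshold."""
        # fastText's predict() rejects newlines, so flatten the text first.
        labels, probs = model.predict(text.replace("\n", " "), k=1)
        lang = labels[0].removeprefix("__label__")
        return lang in target_langs and probs[0] >= threshold

    # Illustrative usage: keep only French and English documents.
    docs = ["Ceci est un document en français.", "n0isy b0ilerplate ###"]
    kept = [d for d in docs if keep_document(d, {"fr", "en"})]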

However, most current LID models cover only a couple hundred languages and rely on older architectures such as CLD2 and CLD3, or on projects that are no longer maintained, such as fastText.

Although new models have been introduced, many systems still rely on the fastText architecture, which exhibits important limitations that have been demonstrated in human audits. Moreover, the impact of using these simple architectures as filtering proxies has been largely understudied.

As such, the Common Crawl Foundation and MLCommons have started an annotation campaign, already underway, to construct a diverse dataset for LID; existing LID resources are often restricted to religious texts.

In parallel, the Common Crawl Foundation started the Web Languages Project, which allows people without a technical background to contribute lists of websites in their languages to Common Crawl.

We plan to use the data collected by these two projects and to challenge participants to come up with creative new LID architectures and models that could later be maintained by the Common Crawl Foundation and its community.
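
As a point of reference, a baseline for such a task could be trained with fastText's supervised API, sketched below. The file names, label format, and hyperparameters are illustrative assumptions, not shared-task specifications.

    import fasttext

    # Train a baseline LID classifier. Input files follow fastText's
    # supervised format: one document per line, prefixed with a label
    # such as "__label__fr". File names and hyperparameters here are
    # illustrative assumptions, not shared-task specifications.
    model = fasttext.train_supervised(
        input="lid.train.txt",
        lr=0.1,
        epoch=5,
        minn=2, maxn=5,  # character n-grams help with short, noisy text
        dim=16,
    )

    # test() returns (num_examples, precision@1, recall@1).
    n, p_at_1, r_at_1 = model.test("lid.valid.txt")
    print(f"P@1 = {p_at_1:.3f} on {n} validation documents")

    model.save_model("lid.baseline.bin")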

Schedule Outline

9:00 - 9:30    Intro and results of shared task
9:30 - 10:15   1st Keynote
10:15 - 10:30  Coffee Break
10:30 - 12:00  1st Poster Session (shared task)
12:00 - 13:30  Lunch
13:30 - 15:00  Invited Talks
15:00 - 16:00  2nd Poster Session (with coffee)
16:00 - 16:45  2nd Keynote
16:45 - 17:00  Closing

Differences from Related Workshops

  1. Preparing Good Data for Generative AI: Challenges and Approaches (AAAI 2025): Our workshop focuses on data quality and multilinguality rather than “good data” in general.
  2. Data-Centric Machine Learning Research (ICML, 5th iteration in 2024): Our workshop will deal with data quality and multilinguality specifically rather than the broader topic of data-centric AI.
  3. Workshop on Navigating and Addressing Data Problems for Foundation Models (ICLR, 2nd iteration in 2025): Our workshop considers data quality and multilinguality as specific problems rather than all data-related problems with LLMs.

Invited Speakers

  • Sebastian Nagel (Common Crawl Foundation) [Confirmed]: Sebastian is a programmer and computational linguist. He is responsible for running and maintaining the crawler managed by the Common Crawl Foundation, as well as for supporting users in working with the data. He is a committer on Apache Nutch and a member of the Apache Software Foundation. He holds a PhD in computational linguistics from the University of Munich.
  • David Ifeoluwa Adelani (McGill University and Mila) [Confirmed]: David is an Assistant Professor at the McGill School of Computer Science, a Core Academic Member at Mila, and a Canada CIFAR AI Chair (since 2024). His research focuses on multilingual NLP and speech processing, especially for low-resource languages. His PhD focused on NLP for African languages, but he is now expanding his scope to other regions of the world, including languages of South Asia, South-East Asia, and the Americas.

Funding Sources

We have already created all the data for the shared task, which would otherwise have been the most expensive part of funding this workshop. In addition, we have reached out to funding sources with whom we have long-term relationships and who have supported similar work of ours in the past, and we are confident that we will be able to secure funding for the workshop.

We will make an effort to help support attendance costs for students and researchers who would not normally attend COLM but who are members of lower-resourced language communities (including communities speaking indigenous languages of the broader Montreal region), in order to broaden participation in our workshop and in the conference as a whole.

Organizers’ Biographies

    Program Chairs

  • Pedro Ortiz Suarez

    Pedro Ortiz Suarez (Common Crawl Foundation) is a Senior Research Scientist with a PhD in NLP from Sorbonne Université. His work focuses on data quality and data-centric methods for improving ML models. He contributed to CamemBERT, BLOOM, OpenGPT-X, and founded the OSCAR project.

  • Sarah Luger

    Sarah Luger (MLCommons) has over two decades of expertise in AI and NLP. She has worked on low-resource machine translation, online toxicity identification, GenAI for marketing, and more. She holds a PhD in Informatics from the University of Edinburgh and has worked at IBM Watson on Jeopardy! Challenge NLP tasks. She is co-chair of the MLCommons Datasets Working Group.

  • Laurie Burchell

    Laurie Burchell (Common Crawl Foundation) is a Senior Research Data Engineer with a PhD from the University of Edinburgh focused on language identification. Laurie contributes to the Open Language Data Initiative and HPLT and works to broaden multilingual access through open research.

  • Kenton Murray

    Kenton Murray (Johns Hopkins University) is a Research Scientist at JHU’s Human Language Technologies Center of Excellence. He has helped organize WMT, IWSLT, AMTA, and MAGMaR, and is serving as one of the Workshop Chairs for NAACL 2025.

  • Catherine Arnett

    Catherine Arnett (EleutherAI) is an NLP Researcher, mainly interested in cross-lingual and multilingual NLP. She recently finished her PhD in Linguistics with a specialization in Computational Social Science at UC San Diego. She was previously Lead Research Scientist at PleIAs.

    Organizing Committee

  • Thom Vaughan

    Thom Vaughan (Common Crawl Foundation) is Principal Technologist at Common Crawl with over a decade of experience in multilingual data and large-scale language resources. He has directed global-scale voice and language data initiatives spanning dozens of locales, and played a key role in shaping some of the most widely deployed speech systems in production today.

  • Sara Hincapié

    Sara Hincapié (Factored) is a Software Engineer with experience in product-focused full-stack development and machine learning model integration. Her work focuses on building accessible, user-centered web applications and tools. She has contributed to projects such as MLSuperb 2.0, Helpmed, and Dynabench.

  • Rafael Mosquera

    Rafael Mosquera (MLCommons) is a Machine Learning Engineer specializing in NLP and audio ML systems. He has contributed to several projects including BabyLM, the Prism dataset (NeurIPS 2024 Best Paper Award - Datasets & Benchmarks), People's Speech dataset, and Dynabench.