1st Workshop on
Multilingual Data Quality Signals

Palais des Congrès
Montréal, Canada

10 October 2025

Recent research has shown that large language models (LLMs) not only need large quantities of data, but also need data of sufficient quality. Ensuring data quality is even more important in a multilingual setting, where the amount of acceptable training data in many languages is limited. Indeed, for many languages even the fundamental step of language identification remains a challenge, leading to unreliable language labels and thus noisy datasets for underserved languages.

In response to these challenges, we will be holding the first Workshop on Multilingual Data Quality Signals (WMDQS), co-located with COLM. We invite the submission of long and short research papers related to multilingual data quality.

Even though most previous work on data quality has targeted LLM development, we believe that research in this area can also benefit other research communities in areas such as web search, web archiving, corpus linguistics, digital humanities, political science and beyond. We therefore encourage submissions from a wide range of disciplines.

WMDQS will also include a shared task on language identification for web text. We invite participants to submit novel systems which address current problems with language identification for web text. We will provide a training set of annotated documents sourced from Common Crawl to aid development.
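As a point of reference for prospective shared-task participants, the classic baseline for language identification is a character n-gram classifier. The sketch below is purely illustrative (the training pairs are made-up toy sentences, not the shared-task data, and a real submission would need far more robust modeling):

```python
# Minimal character-trigram language-identification baseline.
# Illustrative sketch only; the two training sentences are hypothetical
# toy data, not the WMDQS shared-task training set.
from collections import Counter


def ngrams(text, n=3):
    """Return character n-grams of `text`, padded with spaces."""
    text = f" {text.lower()} "
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NgramLangID:
    def __init__(self, n=3):
        self.n = n
        self.profiles = {}  # language code -> Counter of n-gram frequencies

    def fit(self, labeled_docs):
        """labeled_docs: iterable of (language_code, text) pairs."""
        for lang, text in labeled_docs:
            self.profiles.setdefault(lang, Counter()).update(ngrams(text, self.n))

    def predict(self, text):
        """Return the language whose profile best overlaps the input's n-grams."""
        grams = Counter(ngrams(text, self.n))

        def score(profile):
            total = sum(profile.values())
            return sum(count * profile[g] / total
                       for g, count in grams.items() if g in profile)

        return max(self.profiles, key=lambda lang: score(self.profiles[lang]))


model = NgramLangID()
model.fit([
    ("en", "the quick brown fox jumps over the lazy dog"),
    ("fr", "le renard brun saute par-dessus le chien paresseux"),
])
print(model.predict("the dog sleeps"))  # -> en
```

Systems along these lines (with much larger profiles and smoothing) underlie many widely used language identifiers, though they struggle precisely on the noisy, short, code-mixed web text this shared task targets.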

Schedule Outline

Room 520A, Palais des Congrès

9:00 - 9:15   Opening Remarks (Laurie Burchell)
9:15 - 10:15  Keynote 1 (Julia Kreutzer)
10:15 - 11:00 Best Paper and Shared Task Results (Catherine Arnett)
11:00 - 12:00 Coffee Break and Poster Session (Shared Task)
12:00 - 13:00 Keynote 2 (David Ifeoluwa Adelani)
13:00 - 14:30 Lunch
14:30 - 15:30 Keynote 3 (Sebastian Nagel)
15:30 - 16:30 Poster Session (Papers) + Coffee
16:30 - 17:00 Closing Remarks (Pedro Ortiz Suarez)

Keynote Abstracts

  1. Julia Kreutzer

    Optimizing Data Quality for Multilingual Fine-Tuning

    While multilingual data is still relatively abundant for pre-training, high-quality and relevant data for fine-tuning is hard to come by. In this talk I will focus on three directions for improving multilingual fine-tuning via interventions at the data level. The first leverages explicit meta-information to directly tap into the long tail of the data distribution, the second exploits the diversity of open teacher models and lets them collaborate for better supervision, and the third directly optimizes in the prompt space. We will discuss their potential to advance multilingual LLM quality further, as well as their limitations.

  2. David Ifeoluwa Adelani

    Text quality issues and the impact on NLP for low-resource languages

    Despite significant progress in developing large language models trained on massive multilingual datasets, a major bottleneck that is often overlooked is the quality of pre-training data for low-resource languages. In many cases, noisy data may be more harmful than beneficial for these languages. In this talk, I will highlight two important ingredients for curating high-quality pre-training data: (1) evaluating massively multilingual language identification models on heterogeneous sources, and (2) developing high-quality machine translation quality estimation metrics tailored to low-resource languages. Finally, I will demonstrate how the latter approach can enhance text embedding models for low-resource African languages and discuss the broader implications of high-quality pre-training data in advancing large language models for underrepresented languages.

  3. Sebastian Nagel

    Common Crawl and Languages on the Web

    Common Crawl is a free, open repository of web crawl data, collected since 2008, that anyone can use. After a brief overview of the dataset and its usage, the talk explains how to obtain a balanced, diverse and representative sample of websites while operating an efficient and polite web crawler. The talk will focus on language and geographical coverage, as well as the identification and annotation of languages in the data.

Invited Speakers

  • Julia Kreutzer (Cohere Labs): Julia Kreutzer is a Senior Research Scientist at Cohere Labs, where she conducts research on large language models, with a current focus on multilinguality, evaluation and inference methods. She has a background in machine translation research, holds a PhD from Heidelberg University and previously worked at Google Translate. She particularly enjoys collaborative research for improving the diversity and accessibility of NLP research, e.g. in collaborations with grassroots NLP communities like Masakhane, or with the Cohere Labs community.
  • David Ifeoluwa Adelani (McGill University and Mila): David is an Assistant Professor at the McGill School of Computer Science, a Core Academic Member at Mila, and a Canada CIFAR AI Chair (in 2024). His research focuses on multilingual NLP and speech processing, especially for low-resource languages. His PhD focused on NLP for African languages, but he is now expanding his scope to other regions of the world, including languages of South Asia, South-East Asia and the Americas.
  • Sebastian Nagel (Common Crawl Foundation): Sebastian is a programmer and computational linguist. He is responsible for running and maintaining the crawler managed by the Common Crawl Foundation, as well as for supporting users of the data. He is a committer of Apache Nutch and a member of the Apache Software Foundation. He holds a PhD in computational linguistics from the University of Munich.

Organizers’ Biographies

    Program Chairs

  • Pedro Ortiz Suarez

    Pedro Ortiz Suarez (Common Crawl Foundation) is a Senior Research Scientist with a PhD in NLP from Sorbonne Université. His work focuses on data quality and data-centric methods for improving ML models. He contributed to CamemBERT, BLOOM, OpenGPT-X, and founded the OSCAR project.

  • Sarah Luger

    Sarah Luger (MLCommons) has over two decades of expertise in AI and NLP. She has worked on low-resource machine translation, online toxicity identification, GenAI for marketing, and more. She holds a PhD in Informatics from the University of Edinburgh and has worked at IBM Watson on Jeopardy! Challenge NLP tasks. She is co-chair of the MLCommons Datasets Working Group.

  • Laurie Burchell

    Laurie Burchell (Common Crawl Foundation) is a Senior Research Data Engineer with a PhD from the University of Edinburgh focused on language identification. Laurie contributes to the Open Language Data Initiative and HPLT and works to broaden multilingual access through open research.

  • Kenton Murray

    Kenton Murray (Johns Hopkins University) is a Research Scientist at JHU’s Human Language Technologies Center of Excellence. He has helped organize WMT, IWSLT, AMTA, and MAGMaR, and is serving as one of the Workshop Chairs for NAACL 2025.

  • Catherine Arnett

    Catherine Arnett (EleutherAI) is an NLP Researcher, mainly interested in cross-lingual and multilingual NLP. She recently finished her PhD in Linguistics with a specialization in Computational Social Science at UC San Diego. She was previously Lead Research Scientist at PleIAs.

    Organizing Committee

  • Thom Vaughan

    Thom Vaughan (Common Crawl Foundation) is Principal Technologist at Common Crawl with over a decade of experience in multilingual data and large-scale language resources. He has directed global-scale voice and language data initiatives spanning dozens of locales, and played a key role in shaping some of the most widely deployed speech systems in production today.

  • Sara Hincapié

    Sara Hincapié (Factored) is a Software Engineer with experience in product-focused full-stack development and machine learning model integration. Her work focuses on building accessible, user-centered web applications and tools. She has contributed to projects such as MLSuperb 2.0, Helpmed, and Dynabench.

  • Rafael Mosquera

    Rafael Mosquera (MLCommons) is a Machine Learning Engineer specializing in NLP and audio ML systems. He has contributed to several projects including BabyLM, the Prism dataset (NeurIPS 2024 Best Paper Award - Datasets & Benchmarks), People's Speech dataset, and Dynabench.

Accepted Submissions
