
1st Workshop on
Multilingual Data Quality Signals

Palais des Congrès
Montréal, Canada

10 October 2025

Recent research has shown that large language models (LLMs) not only need large quantities of data, but also need data of sufficient quality. Ensuring data quality is even more important in a multilingual setting, where the amount of acceptable training data in many languages is limited. Indeed, for many languages even the fundamental step of language identification remains a challenge, leading to unreliable language labels and thus noisy datasets for underserved languages.

In response to these challenges, we will hold the first Workshop on Multilingual Data Quality Signals (WMDQS), co-located with COLM. We invite the submission of long and short research papers on data quality in multilingual settings.

Although most previous work on data quality has targeted LLM development, we believe that research in this area can also benefit other research communities in areas such as web search, web archiving, corpus linguistics, digital humanities, political science, and beyond. We therefore encourage submissions from a wide range of disciplines.

WMDQS will also include a shared task on language identification for web text. We invite participants to submit novel systems that address current problems with language identification for web text. To aid development, we will provide a training set of annotated documents sourced from Common Crawl.
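
For illustration only, a simple starting point for this kind of task is to label each document with an off-the-shelf language identification model. The sketch below is not the official shared-task baseline or data format; it assumes the fasttext Python package and fastText's publicly released lid.176.bin model.

    # Illustrative LID baseline sketch (not the official shared-task baseline).
    # Assumes fastText's pretrained lid.176.bin model is available locally.
    import fasttext

    model = fasttext.load_model("lid.176.bin")

    def identify_language(text: str):
        # fastText expects single-line input, so collapse newlines from web documents.
        labels, scores = model.predict(text.replace("\n", " "), k=1)
        return labels[0].replace("__label__", ""), float(scores[0])

    # Example: returns ("fr", <probability close to 1.0>)
    print(identify_language("Ceci est une phrase en français."))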

Schedule Outline

9:00 - 9:30    Intro and results of shared task
9:30 - 10:15   1st Keynote
10:15 - 10:30  Coffee Break
10:30 - 12:00  1st Poster Session (shared task)
12:00 - 13:30  Lunch
13:30 - 15:00  Invited Talks
15:00 - 16:00  2nd Poster Session (with coffee)
16:00 - 16:45  2nd Keynote
16:45 - 17:00  Closing

Invited Speakers

  • Sebastian Nagel (Common Crawl Foundation) [Confirmed]: Sebastian is a programmer and computational linguist. He is responsible for running and maintaining the Common Crawl Foundation's crawler, as well as for supporting users of the data. He is a committer on Apache Nutch and a member of the Apache Software Foundation. He holds a PhD in computational linguistics from the University of Munich.
  • David Ifeoluwa Adelani (McGill University and Mila) [Confirmed]: David is an Assistant Professor at the McGill School of Computer Science, a Core Academic Member at Mila, and a Canada CIFAR AI Chair (appointed in 2024). His research focuses on multilingual NLP and speech processing, especially for low-resource languages. His PhD focused on NLP for African languages, but he is now expanding his scope to other regions of the world, including languages of South Asia, South-East Asia, and the Americas.

Organizers’ Biographies

    Program Chairs

  • Pedro Ortiz Suarez

    Pedro Ortiz Suarez (Common Crawl Foundation) is a Senior Research Scientist with a PhD in NLP from Sorbonne Université. His work focuses on data quality and data-centric methods for improving ML models. He has contributed to CamemBERT, BLOOM, and OpenGPT-X, and founded the OSCAR project.

  • Sarah Luger

    Sarah Luger (MLCommons) has over two decades of expertise in AI and NLP. She has worked on low-resource machine translation, online toxicity identification, GenAI for marketing, and more. She holds a PhD in Informatics from the University of Edinburgh and worked at IBM Watson on NLP tasks for the Jeopardy! Challenge. She is co-chair of the MLCommons Datasets Working Group.

  • Laurie Burchell

    Laurie Burchell (Common Crawl Foundation) is a Senior Research Data Engineer with a PhD from the University of Edinburgh focused on language identification. Laurie contributes to the Open Language Data Initiative and HPLT and works to broaden multilingual access through open research.

  • Kenton Murray

    Kenton Murray (Johns Hopkins University) is a Research Scientist at JHU’s Human Language Technologies Center of Excellence. He has helped organize WMT, IWSLT, AMTA, and MAGMaR, and is serving as one of the Workshop Chairs for NAACL 2025.

  • Catherine Arnett

    Catherine Arnett (EleutherAI) is an NLP Researcher, mainly interested in cross-lingual and multilingual NLP. She recently finished her PhD in Linguistics with a specialization in Computational Social Science at UC San Diego. She was previously Lead Research Scientist at PleIAs.

    Organizing Committee

  • Thom Vaughan

    Thom Vaughan (Common Crawl Foundation) is Principal Technologist at Common Crawl with over a decade of experience in multilingual data and large-scale language resources. He has directed global-scale voice and language data initiatives spanning dozens of locales, and played a key role in shaping some of the most widely deployed speech systems in production today.

  • Sara Hincapié

    Sara Hincapié (Factored) is a Software Engineer with experience in product-focused full-stack development and machine learning model integration. Her work focuses on building accessible, user-centered web applications and tools. She has contributed to projects such as MLSuperb 2.0, Helpmed, and Dynabench.

  • Rafael Mosquera

    Rafael Mosquera (MLCommons) is a Machine Learning Engineer specializing in NLP and audio ML systems. He has contributed to several projects including BabyLM, the Prism dataset (NeurIPS 2024 Best Paper Award - Datasets & Benchmarks), People's Speech dataset, and Dynabench.