Embracing variability in natural language processing
“Embracing variability in natural language processing” is a panel at ICLaVE|12, the 12th International Conference on Language Variation in Europe. Our goal is to connect variationist linguists and NLP researchers focusing on language variation.
Although the panel is over now, please feel free to reach out to us if you are interested in this topic.
Schedule and Talks
July 10, 2024
9:00 | Large language models and small language varieties: Challenges and current methods | Verena Blaschke & Barbara Plank | Abstract, slides |
9:30 | Extracting dialect-specific features from dialect classifiers | Yves Scherrer, Dana Roemling, Aleksandra Miletic & Noëmi Aepli | Abstract, slides |
10:00 | EDAudio: Easy data augmentation techniques for audio classification | Alfred Lameli, Lea Fischbach, Caroline Kleen, Akbar Karimi & Lucie Flek | Abstract, slides |
10:30 | Computationally modeling Low Saxon variation at different linguistic levels | Raoul Buurke & Janine Siewert | Abstract, slides |
11:00 | Coffee break |
11:30 | Variation is the norm: Orthographic variability and metalinguistic stance in Luxembourgish user comments | Anne-Marie Lutgen, Emilia Milano & Christoph Purschke | Abstract |
12:00 | Modeling registerial developments with information theory: Variation and change in 300 years of scientific written English | Stefania Degaetano-Ortlieb | Abstract, slides |
12:30 | Discussion |
Panel abstract
For the longest time, Natural Language Processing (NLP) has not engaged with variation in language in a systematic way – one that matches both the variability of language use in real life and the many different forms of linguistic research on language variation and change. Input variation in language technology is often treated as a kind of “noise” to be normalized during processing. Only recently has the NLP community begun to show more interest in analyzing, modeling or visualizing varied language material (Zampieri et al. 2020, and more generally the VarDial workshop series). Conversely, although dialectologists have successfully applied large-scale computational processing and analysis methods in the subfield of dialectometry for several decades (Wieling & Nerbonne 2015), sociolinguistics (SL) has only recently started to make systematic use of computational methods of language processing and data analysis at a larger scale (Purschke & Hovy 2019). In this context, the term “computational sociolinguistics” (CSLX; Nguyen et al. 2016) has been coined.
For NLP, engaging with variation is all the more pertinent given the inherent variability of language in all domains and social contexts. For example, language model training needs to handle input variation in text data in order to represent actual language use realistically. Moreover, the natural variability of language production offers many promising starting points for expanding the focus of NLP research and improving the representation of small and non-standardized varieties in large-scale models. The same holds true for the creation of resources such as electronic dictionaries, which need to deal with different sources of variation in data: regional, orthographic, stylistic, etc.
Against this backdrop, we invite specialists working on variation in NLP and on data-driven analysis in dialectology, sociolinguistics and related fields to discuss the state of the art in processing and modeling language variation in its multiplicity, and to showcase its potential for linguistic research. We seek to explore use cases of small and under-resourced West Germanic languages and varieties in order to compare different axes of variation in data as well as the particularities of different varieties in terms of linguistic structure, patterns of variation and their technological representation in NLP.
The central domains of language variability discussed in this panel are regional (dialects), orthographic (non-standardized writing practices), stylistic (individual repertoires) and diachronic (inter-generational change) variation. We focus on West Germanic varieties for the sake of thematic coherence. The main goals of the panel are to systematically take stock of the available language-technological resources for small languages and non-standardized varieties, and to contribute best-practice examples for developing variation-friendly NLP resources.
The panel brings together a group of researchers that is diverse in terms of disciplinary background, countries, target varieties and modalities, gender and career stage, thereby embracing variation in NLP on a thematic and structural level as well.
Organizers
- Christoph Purschke (University of Luxembourg)
- Yves Scherrer (University of Oslo, University of Helsinki)
- Barbara Plank (LMU Munich)
- Verena Blaschke (LMU Munich)