Projects
Content:
- On-going research projects
- Completed research projects
- For BSc/MSc thesis projects, please check here
At MaiNLP we aim to make NLP models more robust, so that they can deal better with underlying shifts in data due to language variation.
On-going research projects
The following lists selected on-going research projects at MaiNLP, including selected publications:
ERC Consolidator grant DIALECT: Natural Language Understanding for non-standard languages and dialects
Dialects are ubiquitous and for many speakers are part of everyday life. They carry important social and communicative functions. Yet, dialects and non-standard languages in general are a blind spot in research on Natural Language Understanding (NLU). Despite recent breakthroughs, NLU still fails to take linguistic diversity into account. This lack of modeling language variation results in biased language models with high error rates on dialect data. This failure excludes millions of speakers today and prevents the development of future technology that can adapt to such users.
To account for linguistic diversity, a paradigm shift is needed: away from data-hungry algorithms that learn passively from large data and single ground-truth labels, which are known to be biased. To go beyond current learning practices, the key is to tackle variation at both ends: in input data and in label bias. With DIALECT, I propose such an integrated approach: algorithms which aid transfer from rich variability in inputs, and interactive learning which integrates human uncertainty in labels. This will reduce the need for data and enable better adaptation and generalization.
Advances in salient areas of deep learning research now make it possible to tackle this challenge. DIALECT’s objectives are to devise a) new algorithms and insights to address extremely scarce data setups and biased labels; b) novel representations which integrate auxiliary sources of information, such as complementing text data with speech; and c) new datasets with conversational data in its most natural form.
By integrating dialectal variation into models able to learn from scarce data and biased labels, the foundations will be established for fairer and more accurate NLU to break down language and literary barriers.
For further details about the project, please visit the DIALECT webpage.
Selected publications:
- Plank, 2016. What to do about non-standard (or non-canonical) language in NLP. In KONVENS.
- Plank, 2022. The ‘Problem’ of Human Label Variation: On Ground Truth in Data, Modeling and Evaluation. In EMNLP.
- Baan, Aziz, Plank & Fernandez, 2022. Stop Measuring Calibration When Humans Disagree. In EMNLP.
- Blaschke, Schütze & Plank, 2023. A Survey of Corpora for Germanic Low-Resource Languages and Dialects. In NoDaLiDa.
KLIMA-MEMES: The Impact of Humorous Communication on Political Decision-Making in the Climate Change Context
Climate change is a pressing problem facing humanity and a major polarizing topic in public discourse. Discussions on climate change pervade political agendas worldwide. IPCC experts agree that currently implemented measures against climate change are inadequate. Such information often spreads rapidly on social media (e.g., Instagram and TikTok) in increasingly visual and often humor-driven attention cycles. The KLIMA-MEMES project will analyze how such communication, in the form of memes or other visual media intended to be humorous, can affect political decision-making. This is a collaborative project with multiple partners at LMU, funded by the Bavarian Research Institute for Digital Transformation.
More information:
- Zhou, Peng & Plank, 2024. ClimateEli: Evaluating Entity Linking on Climate Change Data. In Proceedings of the 1st Workshop on Natural Language Processing Meets Climate Change (ClimateNLP 2024).
- See the KLIMA-MEMES project website
On-going collaboration projects
The following lists selected on-going research projects at other institutions in which MaiNLP is a partner:
TRAVOLTA: Tracing Attitudes And Variation In Online Luxembourgish Text Archives
This project, led by Prof. Christoph Purschke from the University of Luxembourg, combines Natural Language Processing with sociolinguistic analysis to trace the development of written Luxembourgish and public discourse online.
- See the TRAVOLTA project page
Completed research projects
DFF Sapere Aude Project MultiVaLUe: Multilingual Variety-aware Language Understanding Technology
Intelligent machines that understand natural language texts are the Holy Grail of Artificial Intelligence. If achieved, they can automatically extract useful information from big messy text collections. Many challenges must be overcome first. To alleviate the scarcity of resources and broaden the scope to Danish and other small languages, we will unify two strands of research, transfer learning and weak supervision, with the aim to design cross-domain and cross-lingual algorithms that extract information more robustly under minimal guidance. In this project we work on two concrete applications: cross-lingual syntactic parsing (and representation learning on the linguistic manifold) and cross-domain information extraction.
Selected publications:
- Müller-Eberstein, van der Goot & Plank, 2021. Genre as Weak Supervision for Cross-lingual Dependency Parsing. In EMNLP.
- Müller-Eberstein, van der Goot & Plank, 2022. Spectral Probing. In EMNLP.
- Bassignana & Plank, 2022. CrossRE: A Cross-Domain Dataset for Relation Extraction. In EMNLP Findings.
- Bassignana & Plank, 2022. What Do You Mean by Relation Extraction? A Survey on Datasets and Study on Scientific Relation Classification. In ACL SRW.
DFF Project thematic AI, MultiSkill: Multilingual Information Extraction for Job Posting Analysis (2020-2024)
Job markets are about to undergo profound changes in the years to come. The skills required to perform most jobs will shift significantly. This is due to a series of interrelated developments in technology, migration and digitization. As skills change, there is a growing need for quicker and better hiring that matches people to jobs. Big multilingual job vacancy data are emerging on a multitude of platforms. Such big data can provide insights into the skill sets demanded by the labor market. This project is centered around computational job market analysis, with the aim of reliably performing high-precision information extraction on targeted domain data.
Selected publications:
- Zhang, Jensen, Sonniks & Plank, 2022. SkillSpan: Hard and Soft Skill Extraction from English Job Postings. In NAACL.
- Zhang, Jensen & Plank, 2022. Kompetencer: Fine-grained Skill Classification in Danish Job Postings via Distant Supervision and Transfer Learning. In LREC.
- Zhang, van der Goot & Plank, 2023. ESCOXLM-R: Multilingual Taxonomy-driven Pre-training for the Job Market Domain. In ACL.
- PhD thesis by Mike Zhang on Computational Job Market Analysis with Natural Language Processing
Applications for PhD, Postdoc and student assistant positions
Interested in PhD, Postdoc or student assistant jobs? → Open positions