Projects
At MaiNLP we aim to make NLP models more robust, so that they can deal better with underlying shifts in data due to language variation.
Ongoing research projects
The following lists selected ongoing research projects at MaiNLP, along with selected publications:
ERC Consolidator grant DIALECT: Natural Language Understanding for non-standard languages and dialects
Dialects are ubiquitous and for many speakers are part of everyday life. They carry important social and communicative functions. Yet, dialects and non-standard languages in general are a blind spot in research on Natural Language Understanding (NLU). Despite recent breakthroughs, NLU still fails to take linguistic diversity into account. This lack of modeling language variation results in biased language models with high error rates on dialect data. This failure excludes millions of speakers today and prevents the development of future technology that can adapt to such users.
To account for linguistic diversity, a paradigm shift is needed: away from data-hungry algorithms that learn passively from large data with single ground-truth labels, which are known to be biased. To go beyond current learning practices, the key is to tackle variation at both ends: in input data and in label bias. With DIALECT, I propose such an integrated approach, devising algorithms that aid transfer from rich variability in inputs, and interactive learning that integrates human uncertainty in labels. This will reduce the need for data and enable better adaptation and generalization.
Advances in salient areas of deep learning research now make it possible to tackle this challenge. DIALECT’s objectives are to devise a) new algorithms and insights to address extremely scarce data setups and biased labels; b) novel representations which integrate auxiliary sources of information such as complement text data with speech; and c) new datasets with conversational data in its most natural form.
By integrating dialectal variation into models able to learn from scarce data and biased labels, the foundations will be established for fairer and more accurate NLU to break down language and literary barriers.
Selected publications:
- Plank, 2016. What to do about non-standard (or non-canonical) language in NLP. In KONVENS.
- Plank, 2022. The ‘Problem’ of Human Label Variation: On Ground Truth in Data, Modeling and Evaluation. In EMNLP.
- Baan, Aziz, Plank & Fernandez, 2022. Stop Measuring Calibration When Humans Disagree. In EMNLP.
- Blaschke, Schütze & Plank, 2023. A Survey of Corpora for Germanic Low-Resource Languages and Dialects. In NoDaLiDa.
DFF Sapere Aude Project MultiVaLUe: Multilingual Variety-aware Language Understanding Technology
Intelligent machines that understand natural language texts are the Holy Grail of Artificial Intelligence. If achieved, they can automatically extract useful information from big, messy text collections. Many challenges must be overcome first. To alleviate the scarcity of resources and broaden the scope to Danish and other small languages, we will unify two strands of research, transfer learning and weak supervision, with the aim of designing cross-domain and cross-lingual algorithms that extract information more robustly under minimal guidance. In this project we work on two concrete applications: cross-lingual syntactic parsing (and representation learning on the linguistic manifold) and cross-domain information extraction.
Selected publications:
- Müller-Eberstein, van der Goot & Plank, 2021. Genre as Weak Supervision for Cross-lingual Dependency Parsing. In EMNLP.
- Müller-Eberstein, van der Goot & Plank, 2022. Spectral Probing. In EMNLP.
- Bassignana & Plank, 2022. CrossRE: A Cross-Domain Dataset for Relation Extraction. In EMNLP Findings.
- Bassignana & Plank, 2022. What Do You Mean by Relation Extraction? A Survey on Datasets and Study on Scientific Relation Classification. In ACL SRW.
DFF Thematic AI Project MultiSkill: Multilingual Information Extraction for Job Posting Analysis
Job markets are about to undergo profound changes in the years to come. The skills required to perform most jobs will shift significantly, due to a series of interrelated developments in technology, migration and digitization. As skills change, we face an increasing need for quicker and better hiring to better match people to jobs. Big multilingual job vacancy data are emerging on a multitude of platforms. Such big data can provide insights into labor market skill demands. This project is centered around computational job market analysis: reliably performing high-precision information extraction on targeted domain data.
Selected publications:
- Zhang, Jensen, Sonniks & Plank, 2022. SkillSpan: Hard and Soft Skill Extraction from English Job Postings. In NAACL.
- Zhang, Jensen & Plank, 2022. Kompetencer: Fine-grained Skill Classification in Danish Job Postings via Distant Supervision and Transfer Learning. In LREC.
KLIMA-MEMES: The Impact of Humorous Communication on Political Decision-Making in the Climate Change Context
Climate change is a pressing problem facing humanity and a major polarizing topic in public discourse. Discussions on climate change pervade political agendas worldwide. IPCC experts agree that currently implemented measures against climate change are inadequate. Such information often rapidly breaks into social media spheres (e.g., Instagram and TikTok) in increasingly visual and often humor-driven attention cycles. The KLIMA-MEMES project will analyze how such communication, in the form of memes or other visual media intended to be humorous, can affect political decision-making. This is a collaborative project with multiple partners at LMU, funded by the Bavarian Research Institute for Digital Transformation.
More information:
- See the KLIMA-MEMES project website
Applications for PhD, Postdoc and student assistant positions
Interested in PhD, Postdoc or student assistant jobs? → Open positions
Thesis projects
Interested in doing your MSc or BSc thesis with us? We offer several BSc and MSc thesis topics at MaiNLP.
Currently, the following research vectors characterize the broad topics in which we offer MSc and BSc thesis projects. We provide a (non-exhaustive) list of research projects within each vector. We are also happy to supervise projects related to the ongoing research projects listed above, and you are welcome to send us your own project proposal. We recommend checking out the suggested/selected publications to get inspired.
Unless otherwise specified, all projects can be either at the MSc or BSc level. The exact scope of the project will be determined in a meeting before the start of the project.
Important note for the summer and winter semesters 2023: We currently do not supervise industrial MSc/BSc thesis projects (Industrieabschlussarbeiten).
Regularly check back here for updates on thesis project suggestions.
News:
- 2023, September 9: All thesis applications are closed.
- 2023, August 21: Thesis applications for the winter semester are open. See application deadlines below.
- 2023, March 7: All thesis applications are closed.
- 2023, Feb 28: MSc project applications are closed. Updated projects. BSc applications are still open.
- 2023, Feb 21: Slight update on projects.
- 2023, Feb 17: Added a marker denoting projects that are currently reserved.
- 2023, Feb 6: More thesis projects posted; MSc/BSc level indicators added.
Legend:
- topic currently reserved
- strikethrough: topic no longer available
V1: Learning from Limited Data (Low-resource, NLP for Dialects, Multilinguality, Transfer)
Selected research projects
- Analyzing dialect identification systems. How well can we automatically discern between closely related language varieties, and do the features that are relevant to the success of an automatic classifier correlate with those that linguists would describe? The student can decide on the language varieties to be included in the thesis, provided that high-quality, accessible corpora are available. (Some references for finding relevant corpora: Blaschke et al. 2023, Ramponi 2022, Guellil et al. 2021 – but this list is not exhaustive. If you prefer to work with a language not covered in these overviews, please contact us early on.) The student can also decide which thematic focus/foci the thesis should have: (i) linguistic analysis of the data and classifier output; (ii) applying interpretability methods to the classifier output (in conjunction with focus i); (iii) comparing different kinds of input representations and ML architectures. The specific focus and level of detail will depend on the background and skills of the student as well as the degree (BSc vs. MSc). A simple baseline sketch for getting started is shown after this list. (Additional references: Gaman et al. 2020 and other VarDial shared task papers for literature on automatic dialect identification, Nerbonne et al. 2021 (alternative link) and Wieling & Nerbonne 2015 for introductions to dialectometry; Madsen et al. 2022 and Barredo Arrieta et al. 2020 for overviews of interpretability methods.) Level: MSc (preferred) or BSc.
- Slot and Intent Detection for Low-Resource Dialects. Digital assistants are becoming widespread, yet current technology covers only a limited set of languages. How can we best do zero-shot transfer to low-resource language varieties without a standard orthography? References: van der Goot et al., 2021 and VarDial 2023 SID4LR. Create a new evaluation dataset for a low-resource language variety you speak, and investigate how to best transfer to it. Topics: Transfer Learning, Cross-linguality, Dataset annotation. (Particularly suited for students interested in covering their own language or dialect not yet covered by existing systems, including local dialects, e.g. Austrian, Low Saxon, or Sardinian varieties.) Level: MSc or BSc.
- Adapting Information Retrieval Models for Rare Terms. Neural ranking models have shown impressive results on general retrieval benchmarks; however, domain-specific retrieval and representing rare terms remain an open challenge (Thakur et al., 2021). In this thesis, the goal is to explore strategies for rewriting queries and documents with the help of text simplification models or external resources such as WordNet or Wikipedia, in order to improve performance in domain transfer (a minimal query-expansion sketch is shown after this list). Level: MSc.
- NLP Methods for Folk Song Lyrics. Folk music is an essential element of any culture. This project seeks to apply NLP techniques to study folk music of the German-speaking countries, with a special focus on song lyrics written in dialect. You will start by building a pipeline for large-scale lyrics collection. Next, you will conduct a comprehensive analysis of the song lyrics, including (but not limited to) discovering the most popular lyrical themes, studying rhymes, and analyzing the figurative speech used in lyrics. Level: BSc or MSc.
- Large Language Models for Low-Resource NLP Revisited. At the moment, large language models (LLMs) are all the hype, but do we actually need them for low-resource tasks? In this project, the student compares LLM fine-tuning with computationally cheaper ways of training a model for a low-resource language variety and an NLP task (any classification or sequence labeling task). Level: BSc or MSc.
- Learning Task Representations. We are often interested in transferring NLP/IR models to datasets for which we have little or no label annotation available. In such a zero-shot setting, it is possible to transfer a model from a single related task or a set of related tasks. Representing tasks and measuring task similarity is an open challenge and an active research field; the goal of this thesis is to explore approaches for deriving task representations and to evaluate their effectiveness in a multi-task setting. Level: MSc.
- Code-Switching in Cross-Lingual Information Retrieval. When we train retrieval models on monolingual data, the model can learn to predict document relevance from keyword overlap with the query or from semantic context. Arguably, keyword matching is an easier task than learning semantic concepts. In our recent work we show that retrieval models trained on English data are biased towards keyword matching, which is less problematic if we transfer the model to other monolingual setups. However, in cross-lingual information retrieval (CLIR) the query language vocabulary differs from the document language vocabulary, and relying on keyword overlap is suboptimal. To mitigate this bias and improve retrieval results, we propose code-switching the training data (a toy illustration is shown after this list). The goal of this thesis is to experiment with more sophisticated code-switching approaches, additional languages and different domains. Level: BSc or MSc.
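As a starting point for the dialect identification topic above, a common baseline is a character n-gram classifier in the style of VarDial shared-task systems. The sketch below is only illustrative: it uses scikit-learn, assumes a hypothetical CSV corpus with "text" and "dialect" columns, and inspects the most indicative n-grams so they can be compared with the features a dialectologist would describe.

# Minimal character n-gram baseline for dialect identification (illustrative sketch).
# "dialect_corpus.csv" is a placeholder; any corpus with text/dialect columns works.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

data = pd.read_csv("dialect_corpus.csv")
X_train, X_test, y_train, y_test = train_test_split(
    data["text"], data["dialect"], test_size=0.2, random_state=42)

clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5)),  # char n-grams are robust to non-standard spelling
    LogisticRegression(max_iter=1000),
)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))

# Inspect the most indicative character n-grams per dialect label
# (assumes more than two labels; for the binary case coef_ has a single row).
features = np.array(clf.named_steps["tfidfvectorizer"].get_feature_names_out())
for label, coefs in zip(clf.named_steps["logisticregression"].classes_,
                        clf.named_steps["logisticregression"].coef_):
    print(label, features[np.argsort(coefs)[-10:]])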
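For the rare-term retrieval topic, one simple rewriting strategy mentioned above is to expand rare query terms with WordNet synonyms before retrieval. The NLTK-based sketch below is a rough illustration under assumed inputs (the frequency table, threshold and whitespace tokenization are placeholders); the neural ranker itself is not shown.

# Sketch: expand rare query terms with WordNet synonyms (NLTK).
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)  # one-time download of the WordNet data

def expand_query(query, term_frequencies, min_freq=5, max_synonyms=3):
    """Append WordNet synonyms for terms that are rare in the target collection."""
    expanded = []
    for term in query.lower().split():
        expanded.append(term)
        if term_frequencies.get(term, 0) < min_freq:  # treat low-frequency terms as rare
            synonyms = []
            for synset in wn.synsets(term):
                for lemma in synset.lemma_names():
                    lemma = lemma.replace("_", " ").lower()
                    if lemma != term and lemma not in synonyms:
                        synonyms.append(lemma)
            expanded.extend(synonyms[:max_synonyms])
    return " ".join(expanded)

# Toy usage with an invented collection frequency table.
print(expand_query("myocardial infarction treatment",
                   {"treatment": 120, "myocardial": 2, "infarction": 1}))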
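The code-switching idea in the CLIR topic can be illustrated with naive word-level substitution from a bilingual lexicon; the tiny English-German dictionary below is purely hypothetical, and the thesis itself would explore more sophisticated switching strategies.

# Sketch: naive word-level code-switching of English training queries/documents.
# The bilingual lexicon and switching probability are illustrative placeholders.
import random

EN_DE = {"president": "Präsident", "election": "Wahl", "results": "Ergebnisse"}

def code_switch(text, lexicon, p=0.5, seed=0):
    """Replace lexicon words with their translation with probability p."""
    rng = random.Random(seed)
    switched = []
    for token in text.split():
        key = token.lower()
        if key in lexicon and rng.random() < p:
            switched.append(lexicon[key])
        else:
            switched.append(token)
    return " ".join(switched)

print(code_switch("president election results announced today", EN_DE))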
V2: High-Quality Information Extraction in Targeted Domains
Selected research projects
- Cross-Domain Relation Extraction. Extracting structured information such as bornIn(person, location) is central for knowledge base population. Yet current extraction technology is limited to a few text domains and semantic relations. How can we adapt relation extractors to generalize better across different text domains or different relation sets? See the references of the MultiVaLUe project. Level: BSc or MSc.
- Computational Job Market Analysis. Job postings are a rich resource for understanding the labor market. Recently, several NLP studies have started to provide data resources and models for automatic job posting analysis, such as skill extraction. This topic suits students interested in real-world applications: in-depth analysis of existing data and models, comparisons of different approaches, or extending skill extraction or classification to Germanic and other languages. See the references of the MultiSkill project, as well as Bhola et al., 2020, Gnehm et al., 2021, and our own recent ESCOXLM-R model. Level: BSc or MSc.
- Adapting NER Tools to Novel Entities with Gazetteers. Named Entity Recognition (NER) tools are trained on a corpus and then deployed as static resources. When novel entities emerge after deployment, such tools have problems detecting them (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9014470/). For instance, the state-of-the-art biomedical NER tool HunFlair (https://academic.oup.com/bioinformatics/article/37/17/2792/6122692) has difficulties reliably detecting concepts related to COVID, because it was trained before the pandemic. In this project, we will develop a novel NER method that makes it easy to integrate knowledge about emerging entities after training (a minimal gazetteer-matching sketch is shown after this list). Level: BSc or MSc.
- Automated Evaluation of Biomedical Relation Extraction Models. It is an open question how useful information extraction models are for biological research. The gold standard evaluation is to ask experts for a manual assessment, which is costly and limited to small-scale case studies. However, biological relations have the unique advantage that researchers verify them in biochemical experiments and store the results in large databases. In this project, we will exploit these databases to conduct the first large-scale comparison of different state-of-the-art relation extraction models in terms of their usefulness for biological research. Level: BSc or MSc.
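For the gazetteer topic above, the simplest integration point is post-hoc dictionary matching that surfaces emerging entities a trained model misses; tighter integration (gazetteer features, constrained decoding) would be part of the thesis. The entries and example sentence below are illustrative placeholders.

# Sketch: post-hoc gazetteer matching for emerging entities an NER model misses.
GAZETTEER = {
    ("sars-cov-2",): "Virus",
    ("covid-19",): "Disease",
    ("long", "covid"): "Disease",
}

def gazetteer_matches(tokens, gazetteer):
    """Return (start, end, label) spans for longest-match gazetteer hits."""
    lowered = [t.lower() for t in tokens]
    max_len = max(len(entry) for entry in gazetteer)
    spans, i = [], 0
    while i < len(lowered):
        match = None
        for length in range(min(max_len, len(lowered) - i), 0, -1):
            entry = tuple(lowered[i:i + length])
            if entry in gazetteer:
                match = (i, i + length, gazetteer[entry])
                break
        if match:
            spans.append(match)
            i = match[1]
        else:
            i += 1
    return spans

tokens = "Patients with Long COVID often test negative for SARS-CoV-2 .".split()
print(gazetteer_matches(tokens, GAZETTEER))
# These spans can then be merged with the model's predictions, e.g. by adding only
# gazetteer spans that do not overlap an existing prediction.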
V3: Natural Language Understanding, Semantics & Pragmatics
Selected research projects
- Prominent Entities in Text Summarization. One important factor in summarizing a document is identifying the prominent entities within it. Here is an example from the CNN summarization dataset. Two selected paragraphs: "[Alonso] has been training hard for [his] planned comeback at the Malaysian Grand Prix in nine days and used the {McLaren} simulator to hone [his] mental preparations. The CNN-sponsored team announced the news on Twitter, showing {McLaren} sporting director Eric Boullier and [Alonso] at the team’s headquarters in Woking, England." Reference summary: "[Fernando Alonso] steps up comeback preparations in {McLaren} simulator". What linguistic phenomena contribute to the prominence of entities in a document? For example, would coreference chains and discourse relations help? Are summarization strategies the same across different genres and languages? Build an NLP model that predicts the prominent entities in a document and evaluate it against reference summaries (a small labeling sketch follows this list). Start with the CNN/DM dataset. The project can be extended to other genres and languages. Level: BSc or MSc.
- Understanding Indirect Answers. Indirect answers are replies to polar questions without explicit use of Yes/No clues. Example: Q: "Do you wanna crash on the couch?" A: "I gotta go home sometime." Indirect answers are natural in human dialogue, but very difficult for a conversational AI system. Build an NLP model that improves automatic understanding of indirect answers in English, for example by modeling longer dialogue context. The project can be extended to multilingual/transfer learning. References: Louis et al., 2020, Damgaard et al., 2021 and Sanagavarapu et al., 2022. Level: BSc or MSc.
- Scientific Analysis Tool. The goal of this project is to use off-the-shelf NLP models for citation analysis, semantic search and text summarization to build a tool that (semi-)automates the process of synthesizing and summarizing a line of research. This involves identifying the citation purpose and aggregating/summarizing information across multiple papers. Strong knowledge of Python is required; knowledge of web crawling is beneficial. Level: BSc.
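For the prominent-entities topic above, a first experiment could derive silver "prominence" labels by checking which entities from an article also surface in its reference summary. The sketch below uses spaCy (assuming the en_core_web_sm model is installed); the texts are shortened placeholders for CNN/DM examples, and proper coreference handling would be a natural refinement.

# Sketch: silver "prominence" labels for entities via overlap with the reference summary.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def entity_prominence(article, summary):
    """Label each article entity by whether it appears (as sub/superstring) in the summary."""
    summary_ents = {ent.text.lower() for ent in nlp(summary).ents}
    labels = {}
    for ent in nlp(article).ents:
        key = ent.text.lower()
        # Loose string matching so that "Alonso" matches "Fernando Alonso";
        # coreference resolution would be more robust.
        labels[key] = any(key in s or s in key for s in summary_ents)
    return labels

article = ("Alonso has been training hard for his planned comeback at the Malaysian "
           "Grand Prix and used the McLaren simulator to hone his preparations.")
summary = "Fernando Alonso steps up comeback preparations in McLaren simulator"
print(entity_prominence(article, summary))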
V4: Human-centric Natural Language Understanding: Uncertainty, Perception, Vision, Interpretability
Some general references for this section: Plank, 2016; Jensen and Plank, 2022; Plank, 2022 (EMNLP).
Selected research projects
- In-Context Learning from Human Disagreement on Subjective Tasks. Aggregating annotations via majority vote can lead to ignoring the opinions of minority groups, especially on subjective tasks. Learning from individual annotators yields better results on subjective classification tasks such as hate speech detection and emotion detection than learning from the majority vote (Davani et al., 2022). In this project, we want to investigate the potential of learning from individual annotators in an in-context learning setting. How can the model learn from the disagreement between annotators via instruction tuning and prompting? How do we design such instructions? References: Davani et al., 2022, Schick et al., 2021 and Mishra et al., 2022. Level: MSc.
- Active Learning for Visual Question Answering with Large Pretrained Models. There have been several attempts at active learning for VQA [1], [2]. However, these models are small and were trained from scratch. Large pretrained models have achieved great success in unimodal (language or vision) and multimodal (vision-language) settings [3]. This project aims to deploy state-of-the-art foundation models in an active learning framework for VQA tasks. A starting point could be re-implementing some VQA active learning work such as [1]. References: [1] Deep Bayesian Active Learning for Multiple Correct Outputs; [2] Mind Your Outliers! Investigating the Negative Impact of Outliers on Active Learning for Visual Question Answering; [3] BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. Level: MSc.
- Learning from Disagreement by Creating a Dense Multi-Annotation Dataset. Current deep learning approaches generally learn by modelling the hard label (majority vote) or the soft label (annotation distribution/prediction distribution from a teacher model). Can we go one step further by modelling the annotators? Davani et al., TACL 2022 show that by modelling individual annotators, the model’s uncertainty estimates correlate well with human annotation variation. However, for each data sample in the dataset used (Gab Hate Speech), annotations are only available from a subset of all annotators, and creating a "dense multi-annotation dataset" (in which each data point is annotated by all annotators) requires significant effort. Therefore, the model in Davani et al., TACL 2022 can only learn from 3 annotations at a time. Can we create such a dense annotation dataset by modelling the individual annotators? How differently will the model behave? (A minimal multi-annotator model sketch is shown after this list.) References: Davani et al., TACL 2022, Geva et al., EMNLP-IJCNLP 2019. Level: BSc.
- Error Analysis of a BERT-based Search Engine. Multi-stage ranking has become a popular paradigm in information retrieval: a fast first-stage ranker generates a candidate set of documents, followed by a much slower re-ranker that refines the ranking [0]. Prior work has shown that better candidate sets (higher recall) do not necessarily translate into a better final ranking [1]. The goal of this thesis is two-fold: first, to perform an error analysis of linguistic triggers that cause this behavior; second, to apply and interpret automatically generated explanations from tools such as DeepSHAP and LIME [2, 3] (see the LIME sketch after this list). Basic knowledge of information retrieval is helpful, but not required. Level: BSc. [0] https://arxiv.org/abs/1910.14424 [1] https://arxiv.org/abs/2101.08751 [2] https://arxiv.org/abs/1907.06484 [3] https://arxiv.org/abs/1602.04938
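For the dense multi-annotation topic above, the multi-annotator architecture of Davani et al. (TACL 2022) can be approximated with a shared encoder and one classification head per annotator, with missing annotations masked out of the loss. The PyTorch sketch below is illustrative only: the encoder is a stand-in for a pretrained text encoder, and all shapes and data are toy values.

# Sketch: shared encoder with one classification head per annotator
# (in the spirit of Davani et al., TACL 2022). All inputs are toy placeholders.
import torch
import torch.nn as nn

class MultiAnnotatorClassifier(nn.Module):
    def __init__(self, encoder, hidden_dim, num_annotators, num_classes):
        super().__init__()
        self.encoder = encoder  # any module mapping inputs to (batch, hidden_dim)
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_dim, num_classes) for _ in range(num_annotators)])

    def forward(self, features):
        h = self.encoder(features)
        # stack per-annotator logits: (batch, num_annotators, num_classes)
        return torch.stack([head(h) for head in self.heads], dim=1)

def masked_annotator_loss(logits, labels, mask):
    """Cross-entropy over annotator heads, ignoring missing annotations (mask == 0)."""
    loss = nn.functional.cross_entropy(
        logits.flatten(0, 1), labels.flatten(), reduction="none")
    return (loss * mask.flatten()).sum() / mask.sum().clamp(min=1)

# Toy usage: 4 examples with 8-dim features, 5 annotators, 2 classes.
model = MultiAnnotatorClassifier(nn.Linear(8, 16), 16, num_annotators=5, num_classes=2)
x = torch.randn(4, 8)
labels = torch.randint(0, 2, (4, 5))        # per-annotator labels
mask = torch.randint(0, 2, (4, 5)).float()  # 1 where an annotation exists
masked_annotator_loss(model(x), labels, mask).backward()
# At prediction time, disagreement across heads can serve as an uncertainty estimate.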
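For the error-analysis topic, LIME can be applied to a re-ranker by fixing the query and perturbing the document text. In the sketch below the relevance scorer is a trivial lexical-overlap stand-in for the actual BERT re-ranker (so the example runs without a model); in the thesis, this placeholder would be replaced by the re-ranker's scores.

# Sketch: explaining a (query, document) relevance score with LIME over document perturbations.
# The scoring function is a lexical-overlap placeholder for a neural re-ranker.
import numpy as np
from lime.lime_text import LimeTextExplainer

QUERY = "effects of caffeine on sleep quality"

def relevance_proba(documents):
    """Return pseudo-probabilities [not relevant, relevant] for each perturbed document."""
    query_terms = set(QUERY.lower().split())
    rows = []
    for doc in documents:
        overlap = len(query_terms & set(doc.lower().split())) / len(query_terms)
        rows.append([1.0 - overlap, overlap])
    return np.array(rows)

explainer = LimeTextExplainer(class_names=["not relevant", "relevant"])
document = ("Caffeine consumed late in the day can reduce sleep quality "
            "and delay sleep onset in sensitive individuals.")
explanation = explainer.explain_instance(document, relevance_proba, num_features=6)
print(explanation.as_list())  # document tokens most responsible for the relevance score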
How to apply for a BSc or MSc thesis project
Important information for LMU students: You need to apply for an MSc/BSc thesis project at the latest three weeks before the thesis project registration date. Deadlines for the summer and winter semesters 2023:
- MSc students: apply before February 24, 2023 (summer semester) or August 31, 2023 (winter semester); applications are closed.
- BSc students: apply before March 6, 2023 (summer semester) or September 4, 2023 (winter semester); applications are closed.
To apply, please send your application material with the subject "BSc (or MSc) thesis project at MaiNLP - inquiry [name and semester]" to: thesisplank@cis.lmu.de
Your application should be a single PDF containing the following information:
- CV, your study program, full grade transcript
- Level: BSc or MSc thesis project
- Which theme or concrete project interests you most (optional: we are also open to project proposals related to the research vectors or ongoing research projects above). If you are interested in multiple, list your preferences for up to two (ranked: first priority, second priority)
- Languages you speak
- Your favorite project so far, and why
- Your knowledge and interest in data annotation, data analysis and machine learning/deep learning (including which toolkits you are familiar with)
- Whether you have access to GPU resources (and which)
- A term project report or your BSc thesis if you apply for an MSc thesis project (optional)
Reach out if you have questions, using the email above.