Events

ROBustness in NLP over the years

Speaker:

Rob van der Goot
Assistant Professor at the IT University of Copenhagen

Date:

January 24, 2024; 11:00–12:00

Location:

Akademiestr. 7, room 218A (meeting room)

Abstract:

This talk will consist of three parts: (1) lexical normalization of social media data and its downstream effect on syntactic tasks; (2) multi-task learning for adaptation in challenging setups; and (3) open challenges for fundamental NLP tasks such as language identification and word segmentation.

Bio:

Rob van der Goot's main interest is in low-resource setups in natural language processing, which can arise along a variety of dimensions, including language (variety), domain, or task. He did his PhD on the use of normalization for syntactic parsing of social media data, one specific case of a challenging transfer setup. Afterwards, he focused on using multi-task learning in challenging settings. Most recently, Rob has focused on more low-level tasks (language identification, tokenization) in challenging settings (cross-lingual, cross-domain, for low-resource languages/scripts). → Website

Representing Low-Resource Language Varieties: Improved Methods for Spoken Language Processing

Speaker:

Martijn Bartelds
Incoming PostDoc at Stanford University

Date:

December 19, 2023; 14:00–15:00

Location:

Akademiestr. 7, room 218A (meeting room)

Abstract:

Languages are often treated as homogeneous entities, although they are typically composed of multiple varieties. Most language varieties do not correspond to administrative boundaries, such as provinces or states within nations, and they often form a continuum with neighboring varieties. Studying language variation can provide valuable insights into how language varieties relate to their linguistic communities. To this end, it is important to focus on spoken language, as many languages do not have a standard writing system.

In this talk, I will introduce our new method to describe and model language variation, which leverages speech representations from self-supervised neural network models to quantify differences between the pronunciations of speakers from different language varieties. This new method assesses the differences between language varieties more accurately and efficiently than previously used methods. Additionally, I will talk about the use of these neural network models to develop speech technology systems that can help empower low-resource language varieties. In particular, I will present our audio-based search algorithm to automatically identify occurrences of a spoken search term in a large collection of spoken materials, improving access to resources that would normally require manual annotation. Furthermore, I will discuss an approach to improve speech recognition performance for several language varieties from different language families. This technology can be a promising step towards the important goal of developing speech technology that is inclusive of the world’s languages.

Bio:

Martijn is an incoming Postdoctoral Scholar in Computer Science at Stanford University, working with Professor Dan Jurafsky. His research focuses on developing and applying natural language processing methods to describe and model resource-scarce languages. He is particularly interested in speech processing with extremely low-resource languages, dialects, and non-native speech. Martijn was awarded his PhD at the University of Groningen (cum laude), where he was advised by Professor Martijn Wieling and Professor Mark Liberman. → Website

We are Who We Cite: Bridges of Influence Between Natural Language Processing and Other Academic Fields

Speaker:

Jan Philip Wahle¹, Saif M. Mohammad²
¹ PhD candidate, University of Göttingen
² Senior Research Scientist, National Research Council Canada

Date:

November 20, 2023; 09:00–10:00

Location:

LMU main building (Geschwister-Scholl-Platz 1), room A 017

Abstract:

Natural Language Processing (NLP) is poised to substantially influence the world. However, significant progress comes hand-in-hand with substantial risks. Addressing them requires broad engagement with various fields of study. Yet, little empirical work examines the state of such engagement (past or current). In this paper, we quantify the degree of influence between 23 fields of study and NLP (on each other). We analyzed ~77k NLP papers, ~3.1m citations from NLP papers to other papers, and ~1.8m citations from other papers to NLP papers. We show that, unlike most fields, the cross-field engagement of NLP, measured by our proposed Citation Field Diversity Index (CFDI), has declined from 0.58 in 1980 to 0.31 in 2022 (an all-time low). In addition, we find that NLP has grown more insular, citing increasingly more NLP papers and having fewer papers that act as bridges between fields. NLP citations are dominated by computer science; less than 8% of NLP citations are to linguistics, and less than 3% are to math and psychology. These findings underscore NLP's urgent need to reflect on its engagement with various fields.

Bio:

Jan Philip Wahle is a PhD candidate in computer science at the University of Göttingen in Germany. His primary research revolves around paraphrasing, plagiarism detection, and responsible NLP, as well as their various applications such as summarization or misinformation detection. The work presented during this talk was performed during a research visit at the National Research Council Canada. Now, Jan is a visiting researcher at the University of Toronto. Updates about his research can be followed on his website, X, and LinkedIn. → Website | X | LinkedIn

LLM Safety: What does it mean and how do we get there?

Speaker:

Paul Röttger
PostDoc in the MilaNLP Lab at Bocconi University

Date:

November 8, 2023; 11:00–12:00

Location:

Akademiestr. 7, room 218A (meeting room)

Abstract:

AI safety, and specifically the safety of large language models (LLMs) like ChatGPT, is receiving unprecedented public and regulatory attention. In my talk, split into two parts, I will try to give some more concrete meaning to this often nebulous topic and the challenges it poses. First, I will define LLM safety with a focus on near-term risks and explain why LLM safety matters, countering common arguments against this line of work. I will also give an overview of current methods for ensuring LLM safety, from red-teaming to fine-grained feedback learning. Second, I will zoom in on imitation learning, where models are trained on outputs from other models, as a particularly common way of improving the capabilities of open LLMs. I will talk about our own work in progress on safety by imitation, where we extend imitation learning to safety-related behaviours. I will present the resources we have built already, and then transition into an open discussion about our hypotheses and planned experiments, followed by a Q&A to close out the hour.

Bio:

Paul is a postdoctoral researcher in Dirk Hovy's MilaNLP Lab at Bocconi University. His work is located at the intersection of computation, language and society. Right now, he is particularly interested in evaluating and aligning social values in large generative language models, and, by extension, in AI safety. Before coming to Milan, he completed his PhD at the University of Oxford, where he worked on improving the evaluation and effectiveness of large language models for hate speech detection. → Website

The Pivotal Role of Genres: Insights from English RST Parsing and Abstractive Summarization

Speaker:

Janet Liu
PhD candidate, Georgetown University

Date:

September 25, 2023; 10:30–11:30

Location:

Akademiestr. 7, room 218A (meeting room)

Abstract:

Text exhibits significant variations across types such as news articles, academic papers, social media posts, vlogs, and more. Recognizing the importance of genre and using data from diverse genres in training can enable NLP models to generalize and perform effectively across diverse textual contexts. While previous work has studied the role of genre in tasks and linguistic phenomena such as dependency parsing (Müller-Eberstein et al., EMNLP 2021; Müller-Eberstein et al., TLT-SyntaxFest 2021), NLI (Nangia et al., RepEval 2017), and lexical semantics (Kober et al., COLING 2020), in this talk I will present our work that emphasizes the importance of genre diversity in the case of RST parsing and summarization.
I will first discuss our results from the English RST parsing task, which show that a heterogeneous training regime is critical for stable and generalizable RST models, regardless of parser architecture [1,3]. Then, I will present GUMSum [2], a carefully crafted dataset of English summaries in 12 written and spoken genres for the evaluation of abstractive summarization. This work emphasizes the complexities of producing high-quality summaries across genres, where impressive models like GPT-3 fall short of human performance, highlighting the need to consider genre-specific guidelines for crafting accurate and faithful summaries. Together, we hope our findings and resources can not only raise awareness and help level the playing field across text types, demographics, and domains in English but also offer insights that can benefit the same or analogous tasks and phenomena in other languages.
[1] https://aclanthology.org/2023.eacl-main.227/
[2] https://aclanthology.org/2023.findings-acl.593/
[3] https://aclanthology.org/2023.law-1.17/

Bio:

Yang Janet Liu (she/her/hers; goes by Janet) is a PhD candidate in Computational Linguistics in the Department of Linguistics at Georgetown University, where she is advised by Amir Zeldes, PhD. She works on computational and corpus-based approaches to discourse-level linguistic phenomena (e.g., discourse relations and relation signaling) and their applications, such as summarization. Specifically, her research focuses on the generalizability of discourse understanding and parsing in Rhetorical Structure Theory (RST). She co-organized the 2021 and 2023 DISRPT Shared Task on Discourse Segmentation, Connective and Relation Identification across Formalisms. She has been a reviewer for the main *ACL venues (ACL, EACL, NAACL, AACL), SIGDIAL, and the Dialogue and Discourse journal, among others, and is an Area Chair of the Discourse and Pragmatics track at EMNLP 2023. Previously, she did internships at Spotify (2021, 2023) and Alexa AI at Amazon (2020). → Website

Conflicts, Villains, Resolutions: Towards models of Narrative Media Framing

Speaker:

Dr. Lea Frermann
Lecturer, The University of Melbourne

Date:

July 14, 2023; 09:00–10:00

Location:

Amalienstr. 73A - 112

Abstract:

Stories have existed as long as human societies, and are fundamental to communication, culture, and cognition. This talk looks at the interaction of narratives and media framing, i.e., the deliberate presentation of information to elicit a desired response or shift in the reader’s attitude. While rich theories of media framing have emerged from the political and communication sciences, NLP approaches to automatic frame prediction tend to oversimplify the concept. In particular, current approaches focus on overly localized lexical signals, make unwarranted independence assumptions, and ignore the broader, narrative context of news articles. This talk presents our recent work, which incorporates narrative themes, the roles of involved actors, and the interaction of multiple frames in a news article as a step towards a computational framework of narrative framing. Quantitative evaluation and case studies on media framing of climate change show the benefit of the more nuanced frame representations that emerge.

Bio:

Lea Frermann is a lecturer (assistant professor) and DECRA fellow at the University of Melbourne. Her research combines natural language processing with the cognitive and social sciences to understand how humans learn about and represent complex information and to enable models to do the same in fair and robust ways. Recent projects include models of meaning change; of common sense knowledge in humans and language representations; and automatic story understanding in both fiction (books or movies) and the real world (as narratives in news reporting on complex issues like climate change). → Website

Corpus-based computational dialectology – Data, methods and results

Speaker:

Dr. Yves Scherrer
University lecturer, University of Helsinki

Date:

June 5, 2023; 17:00–18:00

Location:

Akademiestr. 7, room 218A (meeting room)

Abstract:

The CorCoDial (corpus-based computational dialectology) project aims to infer dialect classifications from variation-rich corpora, focusing in particular on the dialect-to-standard normalization task to introduce comparability between different texts. I will start by presenting a multilingual collection of phonetically transcribed and orthographically normalized corpora. This collection forms the data basis of four case studies. In the first study, we investigate to what extent topic models can find dialectological rather than semantic topics. In the second experiment, we evaluate character alignment methods from different research traditions on a range of desirable and undesirable characteristics. The third case study introduces dialect-to-standard normalization as a distinct sequence-to-sequence task and compares various normalization methods used in previous work. In the last study, we focus on neural normalization and investigate what the embeddings of speaker labels can tell us about the origin of the speakers.

Bio:

Yves Scherrer is a University Lecturer in Language Technology at the University of Helsinki and, from August 2023 onwards, an Associate Professor in NLP at the University of Oslo. He defended his PhD thesis on the computational modelling of Swiss German dialects, with an emphasis on machine translation techniques, in 2012 at the University of Geneva. In 2021, he obtained the title of Docent in Language Technology from the University of Helsinki.
Yves Scherrer has been involved in a wide range of projects in the areas of language technology, dialectology, and corpus linguistics. His current research focuses on the annotation and analysis of dialect corpora as well as on tasks and methods related to machine translation. This research is embedded in the CorCoDial – Corpus-based computational dialectology research project, funded by the Academy of Finland (2021–2025). → Website

Making Building NLP Models More Accessible

Speaker:

Dr. Michael A. Hedderich
Postdoctoral researcher, Cornell University

Date:

May 15, 2023; 17:00–18:00

Location:

Akademiestr. 7, room 218A (meeting room)

Abstract:

AI and NLP are entering more and more disciplines and applications. However, individuals, research groups, and organizations who are interested in AI are limited in what they can do by factors such as a lack of labeled data, the complexity of the model-building process, missing AI literacy, and applications that do not fit their use cases. In this talk, I'll present two projects that aim at lowering the entry barriers to model development. The first part will cover a study on using low-resource techniques for under-resourced African languages. I'll discuss the lessons we learned when evaluating in a realistic environment and the importance of integrating the human factor in this evaluation. In the second part of the talk, I'll present Premise, a tool that explains where an NLP classifier fails. Based on the minimum description length principle, it provides a set of robust and global explanations of a model's behavior. For VQA and NER, we identify the issues that different black-box classifiers have, and we also show how these insights can be used to improve models.

Bio:

Michael A. Hedderich is a postdoctoral researcher at Cornell University, working with Qian Yang at the intersection of NLP and AI with HCI. Having a background in both NLP and ML as well as HCI methodology, he is interested in developing new foundational technology as well as building bridges from AI to other interested fields. His collaborations span a wide range of disciplines including archaeology, education, interaction design, participatory design, and biomedicine. Before joining Cornell, Michael obtained his PhD in ML and NLP at Saarland University, Germany, with Dietrich Klakow and was then part of Antti Oulasvirta's HCI group at Aalto University, Finland. Past research affiliations also include Rutgers University, Disney Research Studios, and Amazon. → Website

The Search for Emotions, Creativity, and Fairness in Language

Speaker:

Dr. Saif M. Mohammad (he, him, his)
Senior Research Scientist, National Research Council Canada

Date:

May 8, 2023; 09:00–10:00

Location:

LMU main building (Geschwister-Scholl-Platz 1), room A 015

Abstract:

Emotions are central to human experience, creativity, and behavior. They are crucial for organizing meaning and reasoning about the world we live in. They are ubiquitous and everyday, yet complex and nuanced. In this talk, I will describe our work on the search for emotions in language — by humans (through data annotation projects) and by machines (in automatic emotion and sentiment analysis systems). I will outline ways in which emotions can be represented, challenges in obtaining reliable annotations, and approaches that lead to high-quality annotations and useful sentiment analysis systems. I will discuss wide-ranging applications of emotion detection in natural language processing, psychology, social sciences, digital humanities, and computational creativity. Along the way, I will discuss various ethical considerations involved in emotion recognition and sentiment analysis — the often unsaid assumptions and the real-world implications of our choices.

Bio:

Dr. Saif M. Mohammad is a Senior Research Scientist at the National Research Council Canada (NRC). He received his Ph.D. in Computer Science from the University of Toronto. Before joining NRC, he was a Research Associate at the Institute for Advanced Computer Studies at the University of Maryland, College Park. His research interests are in Natural Language Processing (NLP), especially Lexical Semantics, Emotions and Language, Computational Creativity, AI Ethics, NLP for psychology, and Computational Social Science. He is currently an associate editor for Computational Linguistics, JAIR, and TACL, and Senior Area Chair for ACL Rolling Review. His word-emotion resources, such as the NRC Emotion Lexicon and VAD Lexicon, are widely used for analyzing emotions in text. His work has garnered media attention, including articles in Time, Slashdot, LiveScience, io9, The Physics arXiv Blog, PC World, and Popular Science. → Website