Academic Work
Learning From Free-Text Human Feedback – Collect New Datasets Or Extend Existing Ones?
Dominic Petrak, Nafise Sadat Moosavi, Ye Tian, Nikolai Rozanov, Iryna Gurevych. 2023. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 16259–16279, Singapore. Association for Computational Linguistics.
Continuous learning from free-text human feedback, such as error corrections, new knowledge, or alternative responses, is essential for today’s chatbots and virtual assistants to stay up-to-date, engaging, and socially acceptable. However, annotated data for research on methods for learning from such feedback is scarce. To address this, we examine the error and user response types of six popular dialogue datasets of various types, including MultiWoZ, PersonaChat, Wizards-of-Wikipedia, and others, to assess their extendibility with the needed annotations. For this corpus study, we manually annotate a subset of each dataset with error and user response types using an improved version of the Integrated Error Taxonomy and a newly proposed user response type taxonomy. We provide the resulting dataset (EURTAD) to the community. Our findings provide new insights into dataset composition, including error types, user response types, and the relations between them.
Lessons Learned from a Citizen Science Project for Natural Language Processing
Jan-Christoph Klie, Ji-Ung Lee, Kevin Stowe, Gözde Şahin, Nafise Sadat Moosavi, Luke Bates, Dominic Petrak, Richard Eckart De Castilho, Iryna Gurevych. 2023. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 3594–3608, Dubrovnik, Croatia. Association for Computational Linguistics.
Many Natural Language Processing (NLP) systems use annotated corpora for training and evaluation. However, labeled data is often costly to obtain and scaling annotation projects is difficult, which is why annotation tasks are often outsourced to paid crowdworkers. Citizen Science is an alternative to crowdsourcing that is relatively unexplored in the context of NLP. To investigate whether and how well Citizen Science can be applied in this setting, we conduct an exploratory study into engaging different groups of volunteers in Citizen Science for NLP by re-annotating parts of a pre-existing crowdsourced dataset. Our results show that this can yield high-quality annotations and attract motivated volunteers, but also requires considering factors such as scalability, participation over time, and legal and ethical issues. We summarize lessons learned in the form of guidelines and provide our code and data to aid future work on Citizen Science.
Arithmetic-Based Pretraining - Improving Numeracy of Pretrained Language Models
Dominic Petrak, Nafise Sadat Moosavi, Iryna Gurevych. 2023. In Proceedings of the 12th Joint Conference on Lexical and Computational Semantics (*SEM 2023), pages 477–493, Toronto, Canada. Association for Computational Linguistics.
State-of-the-art pretrained language models tend to perform below their capabilities when applied out-of-the-box on tasks that require understanding and working with numbers (usually referred to as numeracy). Recent work suggests two main reasons for this: (1) popular tokenisation algorithms have limited expressiveness for numbers, and (2) common pretraining objectives do not target numeracy. Approaches that address these shortcomings usually require architectural changes or pretraining from scratch. In this paper, we propose a new extended pretraining approach called Arithmetic-Based Pretraining that jointly addresses both in one extended pretraining step without requiring architectural changes or pretraining from scratch. Arithmetic-Based Pretraining combines contrastive learning to improve the number representation, and a novel extended pretraining objective called Inferable Number Prediction Task to improve numeracy. Our experiments show the effectiveness of Arithmetic-Based Pretraining in three different tasks that require improved numeracy, i.e., reading comprehension in the DROP dataset, inference-on-tables in the InfoTabs dataset, and table-to-text generation in the WikiBio and SciGen datasets.
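The abstract only names the two components; as a minimal sketch of the contrastive part, the snippet below implements an InfoNCE-style loss that pulls together embeddings of surface variants of the same number (e.g. "2.0" and "2.00") and pushes apart embeddings of different values. The function name, the temperature value, and the choice of PyTorch are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_number_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss (sketch, not the paper's exact objective):
    embeddings of surface variants of the same number are pulled together,
    embeddings of different values are pushed apart.
    anchor, positive: (batch, hidden); negatives: (n_neg, hidden)."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    pos_sim = (anchor * positive).sum(dim=-1, keepdim=True)  # (batch, 1)
    neg_sim = anchor @ negatives.T                           # (batch, n_neg)
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature
    # The positive always sits at index 0 of the logits
    labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)
```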
Relations Extraction using Indicators (Master Thesis)
Dominic Petrak. 2021. RheinMain University of Applied Sciences, Wiesbaden, Germany.
Relations between entities are key to understanding semantic context in natural language, and therefore to popular tasks such as question answering, information extraction, or knowledge graph generation. State-of-the-art approaches to machine learning-based relation classification encode the entire sentence using pretrained transformer models, without further consideration of syntactic indicators, such as certain phrases, words, or prepositions, which are more informative than other words and may be beneficial for identifying semantic relations. This thesis investigates the effect of additionally using these indicators for relation extraction.
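As a rough illustration of the idea (not the thesis implementation), one way to surface such indicators to a pretrained encoder is to wrap them in dedicated marker tokens before tokenisation; the marker strings and the model below are assumptions made for this sketch.

```python
from transformers import AutoTokenizer, AutoModel

# Illustrative only: marker tokens and model are assumptions,
# not the configuration used in the thesis.
MODEL = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer.add_special_tokens({"additional_special_tokens": ["[IND]", "[/IND]"]})
model = AutoModel.from_pretrained(MODEL)
model.resize_token_embeddings(len(tokenizer))

def mark_indicators(sentence: str, indicators: list[str]) -> str:
    """Wrap each indicator phrase in marker tokens so the encoder can
    attend to it explicitly, instead of treating it like any other word."""
    for phrase in indicators:
        sentence = sentence.replace(phrase, f"[IND] {phrase} [/IND]")
    return sentence

text = mark_indicators("The company was founded in Berlin.", ["founded in"])
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)  # sentence representation feeds a relation classifier
```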
Semantic Code Search with Neural Bag-of-Words and Graph Convolutional Networks (Paper)
Anna Abad Sieper, Omar Amarkhel, Savina Diez, Dominic Petrak. 2020. SKILL Student Conference @ INFORMATIK 2020 (awarded best paper).
An approach to semantic code search. We investigated two ideas for retrieving the code that best matches a natural language query. The first was to extend a neural Bag-of-Words encoder with TF-IDF weighting. The second was to additionally exploit call hierarchies by using a Graph Convolutional Network trained on the corresponding caller graphs. The Java and Python datasets from GitHub's CodeSearchNet challenge served as the data basis; call hierarchies were added to the Java datasets.
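The paper's exact encoder is not given here, but the first idea can be sketched as follows: a neural Bag-of-Words encoder whose token embeddings are averaged with TF-IDF weights rather than uniformly, so that rare, informative tokens dominate the sequence representation. Class and parameter names are hypothetical.

```python
import torch
import torch.nn as nn

class TfidfNeuralBow(nn.Module):
    """Neural Bag-of-Words encoder (illustrative sketch, not the paper's
    exact architecture): token embeddings are combined by a TF-IDF-weighted
    average instead of a uniform one."""

    def __init__(self, vocab_size: int, dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim, padding_idx=0)

    def forward(self, token_ids: torch.Tensor, tfidf: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len), tfidf: (batch, seq_len)
        emb = self.embed(token_ids)                # (batch, seq_len, dim)
        weights = tfidf.unsqueeze(-1)              # (batch, seq_len, 1)
        summed = (emb * weights).sum(dim=1)        # weighted sum over tokens
        norm = weights.sum(dim=1).clamp(min=1e-8)  # avoid division by zero
        return summed / norm                       # TF-IDF-weighted average
```

In such a setup, queries and code snippets would be encoded with encoders of this kind and matched by cosine similarity of the resulting vectors.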
Bug Localization (Bachelor Thesis)
Dominic Petrak. 2019. RheinMain University of Applied Sciences, Wiesbaden, Germany.
Localizing faulty source code based on human-written bug reports (Static Bug Localization) has long been a subject of research in information retrieval and machine learning. In 2018, Bench4BL, a benchmark dataset and evaluation framework for this task, was proposed, and its authors used it to compare state-of-the-art approaches. In this thesis, those approaches were studied and the most promising ideas combined into a new approach, which was then trained and evaluated on Bench4BL and compared against the existing approaches.
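As background on the kind of approach compared in Bench4BL, the sketch below shows the classic information-retrieval baseline: ranking source files by TF-IDF cosine similarity to the bug report text. It is a minimal illustration under assumed names, not the combined approach developed in the thesis.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_files(bug_report: str, files: dict[str, str]) -> list[tuple[str, float]]:
    """Rank source files by TF-IDF cosine similarity to a bug report --
    the classic IR baseline underlying many Bench4BL approaches.
    (Minimal sketch; the thesis combines several richer ideas.)"""
    paths = list(files)
    vectorizer = TfidfVectorizer(token_pattern=r"[A-Za-z_]\w+")
    # Fit on report + files so both share one vocabulary
    matrix = vectorizer.fit_transform([bug_report] + [files[p] for p in paths])
    scores = cosine_similarity(matrix[0], matrix[1:]).ravel()
    return sorted(zip(paths, scores), key=lambda x: -x[1])

ranking = rank_files(
    "NullPointerException when saving the user profile",
    {"UserProfile.java": "// persists the user profile\nclass UserProfile { void save() {} }",
     "Login.java": "// handles authentication\nclass Login { void login() {} }"},
)
# ranking[0][0] == "UserProfile.java"
```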