A poster session is a venue in which researchers have the opportunity to share their work with a wide audience in the form of a poster presentation. Poster sessions are often held during conferences, professional meetings, or research days and symposia.

Multilingual language models have pushed the state of the art in cross-lingual NLP transfer. The majority of zero-shot cross-lingual transfer approaches, however, use one and the same massively multilingual transformer (e.g., mBERT or XLM-R) to transfer to all target languages, irrespective of their typological, etymological, and phylogenetic relations to other languages. In particular, readily available data and models of resource-rich sibling languages are often ignored. In this work, we empirically show, in a case study for Faroese - a low-resource language from a high-resource language family - that by leveraging phylogenetic information and departing from the 'one-size-fits-all' paradigm, one can improve cross-lingual transfer to low-resource languages. In particular, we leverage the abundant resources of other Scandinavian languages (i.e., Danish, Norwegian, Swedish, and Icelandic) for the benefit of Faroese. Our evaluation results show that we can substantially improve the transfer performance to Faroese by exploiting data and models of closely related high-resource languages. Further, we release a new web corpus of Faroese and Faroese datasets for named entity recognition (NER) and semantic text similarity (STS), in addition to new language models trained on all Scandinavian languages.

In recent years, instruction-finetuned models have received increased attention due to their remarkable zero-shot and generalization capabilities. However, the widespread adoption of these models has been limited to the English language, largely due to the costs and challenges associated with creating instruction datasets. To overcome this, automatic instruction generation has been proposed as a resourceful alternative. We see this as an opportunity for the adoption of instruction finetuning for other languages. In this paper we explore the viability of instruction finetuning for Swedish. We translate a dataset of generated instructions from English to Swedish and use it to finetune both Swedish and non-Swedish models. Results indicate that the use of translated instructions significantly improves the models' zero-shot performance, even on unseen data, while staying competitive with strong baselines ten times their size. We see this paper as a first step and a proof of concept that instruction finetuning for Swedish is within reach through resourceful means, and that there exist several directions for further improvements.
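The abstract above only describes the recipe at a high level. As an illustration, here is a minimal sketch of finetuning a causal language model on translated instruction data with Hugging Face transformers; it is not the authors' actual pipeline, and the file name `instructions_sv.jsonl`, its `instruction`/`response` fields, the placeholder checkpoint name, and all hyperparameters are assumptions made for the example.

```python
# Minimal sketch (not the authors' exact setup): finetune a causal LM on a
# translated instruction dataset using Hugging Face transformers.
# Assumption: "instructions_sv.jsonl" holds {"instruction": ..., "response": ...}
# records translated into Swedish; BASE_MODEL is a hypothetical checkpoint name.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

BASE_MODEL = "your-org/swedish-base-lm"  # placeholder; substitute a real checkpoint

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

def to_features(example):
    # Concatenate instruction and response into a single training text.
    text = (f"### Instruktion:\n{example['instruction']}\n\n"
            f"### Svar:\n{example['response']}")
    return tokenizer(text, truncation=True, max_length=512)

dataset = load_dataset("json", data_files="instructions_sv.jsonl", split="train")
tokenized = dataset.map(to_features, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sv-instruct", num_train_epochs=3,
                           per_device_train_batch_size=4, learning_rate=2e-5),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Swapping in a Swedish or multilingual base checkpoint and the translated instruction set reproduces the overall shape of the procedure described in the abstract, without claiming its exact configuration.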
Data anonymisation is often required to comply with regulations when transferring information across departments or entities. However, the risk is that this procedure can distort the data and jeopardise the models built on it. Intuitively, training an NLP model on anonymised data may lower the performance of the resulting model when compared to a model trained on non-anonymised data. In this paper, we investigate the impact of anonymisation on the performance of nine downstream NLP tasks. We focus on the anonymisation and pseudonymisation of personal names and compare six different anonymisation strategies for two state-of-the-art pre-trained models. Based on these experiments, we formulate recommendations on how anonymisation should be performed to guarantee accurate NLP models. Our results reveal that anonymisation does have a negative impact on the performance of NLP models, but this impact is relatively low. We also find that using pseudonymisation techniques involving random names leads to better performance across most tasks.
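One of the strategies mentioned above, pseudonymisation with random names, can be sketched as follows. This is an illustrative example rather than the paper's implementation: the NER checkpoint and the substitute name pool are assumptions chosen for the demo.

```python
# Minimal sketch of random-name pseudonymisation: detect person names with an
# off-the-shelf NER model and replace each with a randomly drawn substitute
# before the text is used for training. Model name and name pool are illustrative.
import random
from transformers import pipeline

RANDOM_NAMES = ["Alex Berg", "Kim Lund", "Sam Dahl", "Robin Holm"]  # illustrative pool

ner = pipeline("token-classification",
               model="dslim/bert-base-NER",   # assumed off-the-shelf NER model
               aggregation_strategy="simple")

def pseudonymise(text: str) -> str:
    """Replace every PER span found by the NER model with a random name."""
    persons = [e for e in ner(text) if e["entity_group"] == "PER"]
    # Replace from the end of the string so earlier character offsets stay valid.
    for ent in sorted(persons, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + random.choice(RANDOM_NAMES) + text[ent["end"]:]
    return text

print(pseudonymise("Maria Johansson met Erik Svensson in Stockholm."))
```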