Two years ago, I was sitting in my small room in Vesterbro, putting the final touches on my master’s thesis. The topic was: Multilingual Hate Speech Detection: Detecting the Types and Targets of Offensive Language in English and Danish Social Media Data.
At the time, a lot of work was being done in this area, but mostly for English. Since offensive language, hate speech, and cyberbullying are problems in every language, we focused on extending some of the state-of-the-art methods of the time and applying them to Danish.
As part of our work, we created the first training dataset in Danish, containing social media comments from various Danish media outlets, as well as other data sources such as subreddits, Twitter feeds from politicians, and more. Our results were far from perfect, but the work was rewarding and felt important. We were proud of the work we had done in a limited amount of time and hoped that both the dataset and the ideas would prove helpful for future work.
Fast forward two years to last Sunday, when the front page of Politiken (a major Danish newspaper) included an article on the topic. The article is a summary of results from a report which builds on our work (yay!) and extends and improves the methods to achieve some impressive results.
The report produced two algorithms: A&ttack (which classifies language as aggressive or not) and Ha&te (which further breaks down the aggressive language into sub-categories such as hate speech). Both of these algorithms will be made publicly available at www.huggingface.co.
In the Politiken article, they go on to apply these two algorithms to map out the amount of hate speech and offensive language surrounding the different political parties in Denmark on social media. This is just one interesting application of this type of technology, which I believe will become an important way for the media to inform the public about the political debate happening on social media platforms.
It makes me incredibly proud to have contributed a tiny part to this important area of NLP. This work and the tools developed will become very useful for mapping out tough societal problems and giving the public a better overview of the current political climate around these topics. It is fantastic that the media is picking up on this important work and sharing the results with the public.
I just want to say great job to everyone involved! Looking forward to seeing what comes next.