Assisted Tagging
We developed a multi-label classification model tailored to common humanitarian analysis frameworks and trained on DEEP data. The model is directly accessible within DEEP and exposes a pluggable interface for other applications. Assisted tagging automatically suggests which tags to choose according to the analysis framework, which helps speed up the annotation process.
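As a rough illustration of how such a suggester is queried, the sketch below scores an entry against every framework tag with an independent sigmoid and keeps the tags above a threshold. The checkpoint name, example entry and threshold are hypothetical placeholders, not DEEP's production classifier.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical checkpoint name; DEEP's production classifier is not published here.
MODEL_NAME = "your-org/deep-tag-classifier"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, problem_type="multi_label_classification"
)

def suggest_tags(entry: str, threshold: float = 0.5) -> list[str]:
    """Return every framework tag whose sigmoid score exceeds the threshold."""
    inputs = tokenizer(entry, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    # Multi-label: an independent sigmoid per tag, not a softmax across tags.
    scores = torch.sigmoid(logits).squeeze(0)
    return [model.config.id2label[i] for i, s in enumerate(scores) if s >= threshold]

print(suggest_tags("Flooding has displaced 30,000 people in the region."))
```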
Currently, DEEP supports a total of 104 tags, grouped as follows:
Primary tags:
Secondary tags:
Automatic Summarization
This feature automates the summarization step for creating reports from annotated data. The pipeline was developed and tested with actors from the humanitarian world, and a range of summarization reports have been generated automatically using this method. We have now embedded this feature in the DEEP analysis module, where you can find auto-generated reports built from your tagged data.
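The snippet below is a minimal sketch of the general approach using an off-the-shelf abstractive summarizer; the checkpoint and the placeholder input are illustrative and not DEEP's actual report pipeline.

```python
from transformers import pipeline

# Illustrative checkpoint; DEEP's report pipeline is not published here.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# In DEEP, the input would be the concatenated text of the tagged entries
# selected in the analysis module; this is a placeholder string.
tagged_entries = (
    "Heavy rains caused flooding across three districts. "
    "An estimated 30,000 people were displaced and require shelter. "
    "Local health facilities report shortages of clean water and medicine."
)

summary = summarizer(tagged_entries, max_length=60, min_length=15, do_sample=False)
print(summary[0]["summary_text"])
```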
Entry extraction
A key task for analysts is finding entries that contain relevant information fitting the humanitarian analytical framework. Based on the data loaded in DEEP, we developed a model that performs an extractive summarization task: it selects a subset of passages containing relevant information from a given document. These entries do not necessarily follow common units of text such as the sentence or paragraph, and can vary in length. This feature is currently being deployed within DEEP.
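A simplified way to approximate this behaviour is to rank candidate passages by semantic relevance to a framework query, as sketched below. The encoder, query and passages here are illustrative assumptions, and the deployed model selects variable-length spans rather than whole passages.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative multilingual encoder; not DEEP's trained extraction model.
encoder = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

def extract_entries(passages: list[str], query: str, top_k: int = 2) -> list[str]:
    """Rank candidate passages by semantic relevance to a framework query."""
    passage_emb = encoder.encode(passages, convert_to_tensor=True)
    query_emb = encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, passage_emb, top_k=top_k)[0]
    return [passages[hit["corpus_id"]] for hit in hits]

doc_passages = [
    "The road network remains impassable in the northern districts.",
    "A press conference was held on Tuesday.",
    "Displaced families lack access to safe drinking water.",
]
print(extract_entries(doc_passages, query="humanitarian needs and access constraints"))
```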
Topic modelling
Topic modelling is a statistical modelling technique used in natural language processing and machine learning to identify the topics present in a large corpus of text. It is a powerful tool for analyzing large volumes of text and surfacing meaningful patterns and insights. We included this tool in the new DEEP analysis module.
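As a small, self-contained example of the technique (not necessarily the algorithm used in DEEP's module), classic Latent Dirichlet Allocation can be run over a toy corpus as follows:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "Flooding displaced thousands of families from their homes.",
    "Cholera cases are rising in overcrowded displacement camps.",
    "Food prices increased sharply after the drought.",
    "Health workers report medicine shortages in rural clinics.",
]

# Bag-of-words counts are the standard input for LDA.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(documents)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# Print the top words for each discovered topic.
terms = vectorizer.get_feature_names_out()
for topic_id, weights in enumerate(lda.components_):
    top_words = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"Topic {topic_id}: {', '.join(top_words)}")
```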
Humanitarian BERT
HumBert (Humanitarian BERT) is an XLM-RoBERTa model trained on humanitarian texts: approximately 50 million textual examples (roughly 2 billion tokens) from public humanitarian reports, law cases and news articles. HumBert is the first language model adapted to the humanitarian domain, which often uses very specific language, making adaptation to downstream tasks (such as disaster-response text classification) more effective. Data were collected from three main sources: ReliefWeb, UNHCR Refworld and the Europe Media Monitor News Brief. Although XLM-RoBERTa was trained on 100 languages, this fine-tuning was performed on three (English, French and Spanish), as sufficient humanitarian data could not be found in other languages.
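Since HumBert is a masked language model, it can be probed directly with a fill-mask pipeline or used as a domain-adapted backbone for fine-tuning. The repo id below is an assumption based on the project's Hugging Face organisation; verify it on the model hub before use.

```python
from transformers import pipeline

# Assumed Hugging Face repo id for HumBert; check the model hub before use.
fill = pipeline("fill-mask", model="nlp-thedeep/humbert")

# XLM-RoBERTa models use "<mask>" as the mask token.
for pred in fill("Thousands of people were <mask> by the floods.")[:3]:
    print(pred["token_str"], round(pred["score"], 3))
```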
HumSet
HumSet is a novel and rich multilingual dataset of humanitarian response documents annotated by experts in the humanitarian response community. It is curated by humanitarian analysts and covers various disasters around the globe that occurred between 2018 and 2021 across 46 humanitarian response projects. The dataset consists of approximately 17K annotated documents in three languages (English, French, and Spanish), originally taken from publicly available resources. For each document, analysts identified informative snippets (entries) with respect to common humanitarian frameworks and assigned one or more classes to each entry.
https://huggingface.co/datasets/nlp-thedeep/humset
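The dataset can be loaded with the Hugging Face datasets library using the id from the URL above; whether a configuration name is required, and the exact column names, should be checked against the dataset card.

```python
from datasets import load_dataset

# Dataset id taken from the URL above; consult the dataset card for
# available configurations, splits and column names.
humset = load_dataset("nlp-thedeep/humset")

print(humset)              # inspect splits and columns
print(humset["train"][0])  # one annotated entry with its framework labels
```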
AI Ethics
AI ethics is a growing concern in the humanitarian sector. As more humanitarian organisations start to work with AI technologies, they are beginning to consider the risks of these technologies and how to mitigate them. As the use of Natural Language Processing (NLP) tools grows in humanitarian contexts, we asked ourselves: how does the conversation on humanitarian AI ethics apply to NLP? We identified several risks:
- potential harm to those affected by the tools, for example beneficiaries of humanitarian aid;
- risks to the human rights of beneficiaries;
- discriminatory or stereotypical treatment of certain groups by the tools;
- threats to the safety of certain groups;
- impact on the humanity, neutrality, impartiality and independence of humanitarian organisations;
- cybersecurity risks.
In our effort to create technology with the highest ethical considerations, we work to mitigate these risks in research and practice. Our first area of application is the following:
- Bias
As part of the humanitarian imperatives of neutrality and impartiality, it is important to be aware of, and to reduce, the societal biases and stereotypes encoded in the models. Part of our research focuses on defining, measuring and mitigating the possible societal and harmful biases reflected in the classification models, while maintaining their high accuracy and performance.