HumSet
HumSet is a novel and rich multilingual dataset of humanitarian response documents annotated by experts in the humanitarian response community. It is curated by humanitarian analysts and covers various disasters around the globe that occurred from 2018 to 2021, drawn from 46 humanitarian response projects. The dataset consists of approximately 17K annotated documents in three languages (English, French, and Spanish), originally taken from publicly available sources. For each document, analysts identified informative snippets (entries) with respect to common humanitarian frameworks and assigned one or more classes to each entry. See our paper for details.
Please visit our Hugging Face page for more details about the dataset.
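Because each entry can carry one or more classes, a classifier trained on HumSet is naturally multi-label. The following is a minimal sketch of how per-entry class lists could be turned into multi-hot target vectors. The class names and the helper functions `build_label_index` and `to_multi_hot` are hypothetical and chosen for illustration only; they are not taken from the dataset schema.

```python
from typing import Dict, List, Sequence


def build_label_index(entry_labels: Sequence[Sequence[str]]) -> Dict[str, int]:
    """Map every distinct class name to a stable column index (alphabetical order)."""
    distinct = sorted({label for labels in entry_labels for label in labels})
    return {label: i for i, label in enumerate(distinct)}


def to_multi_hot(labels: Sequence[str], label_index: Dict[str, int]) -> List[float]:
    """Encode one entry's classes as a multi-hot vector (1.0 where a class is present)."""
    vector = [0.0] * len(label_index)
    for label in labels:
        vector[label_index[label]] = 1.0
    return vector


# Hypothetical class names, for illustration only (not the real HumSet label set).
entry_labels = [
    ["Health"],
    ["Health", "Protection"],
    ["Education"],
]

label_index = build_label_index(entry_labels)
vectors = [to_multi_hot(labels, label_index) for labels in entry_labels]
```

Vectors of this shape are the usual targets for a multi-label classification loss; the actual label set and field names should be taken from the dataset page.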
HumBert
HumBert (Humanitarian Bert) is an XLM-RoBERTa model trained on humanitarian texts, comprising approximately 50 million textual examples (roughly 2 billion tokens) from public humanitarian reports, law cases and news articles. Data were collected from three main sources: Reliefweb, UNHCR Refworld and Europe Media Monitor News Brief. Although XLM-RoBERTa was trained on 100 different languages, this fine-tuning was performed on only three languages (English, French and Spanish), because sufficient humanitarian data of this kind could not be found in other languages.
Intended uses
To the best of our knowledge, HumBert is the first language model adapted to humanitarian topics, which often use very specific language; this makes adaptation to downstream tasks (such as disaster response text classification) more effective. The model is primarily intended to be fine-tuned on tasks such as sequence classification or token classification.
Please visit our Hugging Face page for more details about the model.
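As a rough illustration of the intended use, the sketch below loads a checkpoint with a sequence classification head using the `transformers` library. It assumes that `transformers` and PyTorch are installed and that the model is published on the Hugging Face Hub; the function name `load_humbert_classifier` is hypothetical, and the Hub identifier is deliberately left as a parameter because it is not stated here. Multi-label mode is assumed because entries can carry several classes.

```python
from typing import Any, Tuple


def load_humbert_classifier(model_id: str, num_labels: int) -> Tuple[Any, Any]:
    """Return (tokenizer, model) with a fresh classification head on top of the encoder.

    model_id:   Hub identifier or local path of the checkpoint (an assumption;
                take the real value from the project's Hugging Face page).
    num_labels: number of target classes for the downstream task.
    """
    # Imported lazily so the function can be defined without downloading anything.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id,
        num_labels=num_labels,
        # Assumed setting: treats the task as multi-label, so the model
        # applies a sigmoid-based loss when float label vectors are passed.
        problem_type="multi_label_classification",
    )
    return tokenizer, model
```

A call would look like `tokenizer, model = load_humbert_classifier(model_id, num_labels=3)`, where `model_id` is the identifier listed on the model page. The classification head is newly initialised, so the model still has to be fine-tuned on labelled entries before its predictions are meaningful.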