Credibility corpus fine-tuned ELMo contextual language model for early rumor detection on social media
Dataset posted on 14.01.2020 by Jie Gao, Sooji Han
This repository contains a rumor-task-specific contextual neural language model fine-tuned on a large credibility-focused social media dataset.
The model file contains fine-tuned and fixed bidirectional Language Model (biLM) weights that can be used to compute the sentence representation of candidate rumor tweets. This release is intended for research use only and for reproducing the results in our paper.
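As a rough illustration of how per-token biLM activations might be turned into one sentence representation, the sketch below averages over all layers and tokens. The pooling strategy, the function name `sentence_vector`, and the toy numbers are assumptions made for illustration; this record does not state how the sentence representation is derived, so the linked code should be treated as the reference.

```python
def sentence_vector(layer_outputs):
    """Average a [layers][tokens][dim] nested list into one [dim] vector.

    Mean pooling is an assumption for illustration, not the authors' method.
    """
    # Flatten the layer and token axes into a single list of token vectors.
    vectors = [tok for layer in layer_outputs for tok in layer]
    if not vectors:
        raise ValueError("no token vectors to pool")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Toy activations: 2 layers, 2 tokens, 3 dimensions.
# The standard pre-trained ELMo yields 3 layers of 1024 dimensions per token.
toy = [
    [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]],
    [[1.0, 4.0, 2.0], [3.0, 4.0, 2.0]],
]
print(sentence_vector(toy))
```

Averaging gives a fixed-length vector regardless of tweet length, which is convenient for a downstream classifier; other choices, such as using only the top layer or a learned weighting of layers, are equally plausible.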
Contextual language models like ELMo provide deep, contextualised, character-based word representations by using bidirectional language models. Previous research shows that fine-tuning Neural Language Models (NLMs) with domain-specific data allows them to learn more meaningful word representations and provides a performance gain.
In our research, we fine-tuned pre-trained ELMo for the early rumor detection task on a social media dataset that we generated from CREDBANK. Sentences in the original corpus are shuffled and split into training and hold-out sets; about 0.02% of the original data is used as the hold-out set. We also generate a test set from the PHEME data, containing 6,162 tweets related to 9 events, in the hope that it will offer an independent and robust evaluation of our hypothesis.
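The shuffle-and-split step described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing script: the function name `split_holdout`, the seed, and the toy corpus are hypothetical, and only the 0.02% hold-out fraction comes from the description.

```python
import random


def split_holdout(sentences, holdout_fraction=0.0002, seed=13):
    """Shuffle sentences and return (training, hold-out) lists.

    holdout_fraction=0.0002 corresponds to the 0.02% stated above;
    the seed is an arbitrary choice for reproducibility.
    """
    shuffled = list(sentences)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    if not shuffled:
        return [], []
    # Keep at least one hold-out sentence for very small corpora.
    n_holdout = max(1, round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


# Toy stand-in for the CREDBANK-derived sentences.
corpus = [f"tweet {i}" for i in range(100_000)]
train, holdout = split_holdout(corpus)
print(len(train), len(holdout))
```

With such a small hold-out fraction, fixing the random seed matters: a different shuffle selects a different handful of sentences, which can shift the measured hold-out perplexity.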
The model fine-tuned on the CREDBANK dataset (denoted as "elmo_credbank") was trained for more than 800 hours on an Intel E5-2630-v3 CPU, using at most 50 GiB of RAM. For a comparative evaluation of its effectiveness, we also fine-tuned the pre-trained ELMo model on the SNAP corpus (denoted as "elmo_snap"), which was trained for more than 500 hours on an NVIDIA Kepler K40M GPU. Our results show a large improvement in perplexity on both the hold-out set and the test set with CREDBANK, compared with the model fine-tuned on the SNAP corpus.
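For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-likelihood a language model assigns to each token, so lower values mean the model predicts the text better. The sketch below shows the computation on made-up probabilities; the function name and the numbers are hypothetical and are not results from our experiments.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from the natural-log probability assigned to each token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical per-token probabilities from two models on the same four tokens.
model_a = [math.log(p) for p in (0.25, 0.25, 0.25, 0.25)]
model_b = [math.log(p) for p in (0.5, 0.5, 0.5, 0.5)]

# model_b assigns higher probability to every token, so its perplexity is lower.
print(perplexity(model_a), perplexity(model_b))
```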
Our research shows that state-of-the-art NLMs and large credibility-focused Twitter corpora can be employed to learn context-sensitive representations of rumor tweets.
For more details, please refer to our papers listed below. Version "12262018.hdf5" was used in  and Version "10052019.hdf5" was used in . The code using this language model can be found on GitHub (https://github.com/soojihan/Multitask4Veracity).
 Han, S., Gao, J., Ciravegna, F. (2019). "Neural Language Model Based Training Data Augmentation for Weakly Supervised Early Rumor Detection", The 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2019), Vancouver, Canada, 27-30 August, 2019
 Han, S., Gao, J., Ciravegna, F. (2019). "Data Augmentation for Rumor Detection Using Context-Sensitive Neural Language Model With Large-Scale Credibility Corpus", Seventh International Conference on Learning Representations (ICLR) LLD, New Orleans, Louisiana, US
Ethics
- There is no human data or any that requires ethical approval
Policy
- The data complies with the funder's policy on access and sharing
Sharing and access restrictions
- The data can be shared openly
- The file formats are open or commonly used
Methodology, headings and units
- Headings and units are explained in the files