Credibility corpus fine-tuned ELMo contextual language model for early rumor detection on social media

dataset

posted on 2020-01-14, 15:15 authored by Jie Gao, Sooji Han

This repository contains a rumor-task-specific contextual neural language model fine-tuned on a large, credibility-focused social media dataset.

The model file contains the fine-tuned, fixed bidirectional language model (biLM) weights, which can be used to compute sentence representations of candidate rumor tweets. This release is intended for research use only and for reproducing the results in our papers.
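As an illustration of how per-token biLM outputs might be reduced to a single sentence vector, the sketch below averages first over layers and then over tokens. This pooling scheme is an assumption made for illustration, not necessarily the one used in our papers; the function names and the toy input are hypothetical, and in practice the layer activations would come from an ELMo implementation loaded with the released weights.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def sentence_representation(layers):
    """Pool biLM activations shaped [num_layers][num_tokens][dim] into one vector.

    Assumption: mean over layers, then mean over tokens (one common choice).
    """
    # zip(*layers) regroups the activations token by token.
    per_token = [mean_vector(token_layers) for token_layers in zip(*layers)]
    return mean_vector(per_token)


# Toy activations: 2 layers, 2 tokens, 2 dimensions (hypothetical values).
toy_layers = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[3.0, 4.0], [5.0, 6.0]],
]
sentence_vector = sentence_representation(toy_layers)
print(sentence_vector)
```

Other pooling choices, such as using only the top layer or a learned weighted sum of layers, fit the same interface.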

Contextual language models such as ELMo provide deep, contextualised, character-based word representations by using bidirectional language models. Previous research shows that fine-tuning neural language models (NLMs) on domain-specific data allows them to learn more meaningful word representations and yields a performance gain.

In our research, we fine-tuned a pre-trained ELMo model for the early rumor detection task on a social media dataset generated from CREDBANK. Sentences in the original corpus are shuffled and split into training and hold-out sets; about 0.02% of the original data is used as the hold-out set. We also generate a test set from the PHEME data, containing 6,162 tweets related to 9 events, in the hope that it will offer an independent and robust evaluation of our hypothesis.
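The shuffle-and-split step can be sketched as follows. This is a minimal illustration rather than the exact preprocessing script: the function name, the random seed, and the synthetic corpus are assumptions, and only the hold-out fraction of about 0.02% comes from the description above.

```python
import random


def split_holdout(sentences, holdout_fraction=0.0002, seed=13):
    """Shuffle sentences and split off a small hold-out set.

    holdout_fraction=0.0002 corresponds to about 0.02% of the data.
    The seed is an arbitrary value chosen for reproducibility.
    """
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    # Keep at least one sentence in the hold-out set.
    n_holdout = max(1, round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


# Synthetic stand-in for the real corpus (hypothetical data).
corpus = [f"sentence {i}" for i in range(100_000)]
train, holdout = split_holdout(corpus)
print(len(train), len(holdout))
```

Fixing the seed keeps the split reproducible, so the same hold-out sentences are used whenever perplexity is re-evaluated.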

The model fine-tuned on the CREDBANK dataset (denoted "elmo_credbank") was trained for more than 800 hours on an Intel E5-2630-v3 CPU, using at most 50 GiB of RAM. For a comparative evaluation of its effectiveness, we also fine-tuned the pre-trained ELMo model on the SNAP corpus (denoted "elmo_snap"), which was trained for more than 500 hours on an NVIDIA Kepler K40M GPU. Our results show a large improvement in perplexity on both the hold-out set and the test set for the model fine-tuned on CREDBANK compared with the model fine-tuned on the SNAP corpus.
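For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-probability a language model assigns to each token, so lower values are better. The sketch below shows that definition only; the function name and the toy log-probabilities are hypothetical, and the way forward and backward perplexities are combined for a biLM is not covered here.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from natural-log token probabilities: exp(-mean(log p))."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# A model that assigns probability 0.25 to every token is as uncertain as a
# uniform choice among four options (toy values, not results from our models).
uniform_log_probs = [math.log(0.25)] * 8
confident_log_probs = [math.log(0.5)] * 8

print(perplexity(uniform_log_probs))
print(perplexity(confident_log_probs))
```

A model that fits the target domain better assigns higher probabilities to the observed tokens and therefore obtains a lower perplexity on the same evaluation set.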

Our research shows that state-of-the-art NLMs and large credibility-focused Twitter corpora can be employed to learn context-sensitive representations of rumor tweets.

For more details, please refer to our papers listed below. Version "12262018.hdf5" was used in [2] and version "10052019.hdf5" was used in [1]. The code that uses this language model can be found on GitHub (https://github.com/soojihan/Multitask4Veracity).

[1] Han, S., Gao, J., Ciravegna, F. (2019). "Neural Language Model Based Training Data Augmentation for Weakly Supervised Early Rumor Detection", The 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2019), Vancouver, Canada, 27-30 August 2019.

[2] Han, S., Gao, J., Ciravegna, F. (2019). "Data Augmentation for Rumor Detection Using Context-Sensitive Neural Language Model With Large-Scale Credibility Corpus", Seventh International Conference on Learning Representations (ICLR) LLD, New Orleans, Louisiana, US.

Ethics

There is no human data or any other data that requires ethical approval

Policy

The data complies with the funder's policy on access and sharing

Sharing and access restrictions

The data can be shared openly

Data description

The file formats are open or commonly used

Methodology, headings and units

Headings and units are explained in the files

Keywords

early rumor detection; Neural Language Models; ELMo; Embeddings (Mathematics); Applied Computer Science

Licence

CC BY 4.0
