ShefCE: A Cantonese-English bilingual speech corpus -- speech recognition model sets and recording transcripts

dataset

posted on 2017-03-10, 14:07, authored by Wai Man Ng, Alvin C.M. Kwan, Tan Lee, Thomas Hain

This online repository contains the speech recognition model sets and the recording transcripts used in the phoneme/syllable recognition experiments reported in [1].

Speech recognition model sets

-----------------------------------------

The speech recognition model sets are available as a tarball, named model.tar.gz, in this repository.
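
As a minimal sketch for inspecting the archive before unpacking it, the Python below lists the member names of a gzip-compressed tarball using the standard-library tarfile module. The helper name list_members is hypothetical, and the internal layout of model.tar.gz is not described here, so no particular file names are assumed.

```python
import tarfile


def list_members(archive_path):
    """Return the names of all members in a gzip-compressed tarball."""
    # "r:gz" opens the archive read-only with gzip decompression.
    with tarfile.open(archive_path, mode="r:gz") as archive:
        return archive.getnames()
```

Once the archive has been downloaded to the working directory, list_members("model.tar.gz") returns its contents as a list of path strings. The same helper applies unchanged to stms.tar.gz.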

The models were trained on Cantonese and English data. For each language, two model sets were trained: one for the background setting and one for the mixed-condition setting. All models are DNN-HMM models, that is, hybrid feed-forward neural network models with 6 hidden layers and 2048 neurons per layer. Details can be found in [1]. The Cantonese models include a bigram syllable language model. The English models include a bigram phoneme language model. All model sets are provided in the Kaldi format.

1. The background-cantonese model was trained on CUSENT (68 speakers, 19.4 hours of read Cantonese speech).

2. The background-english model was trained on WSJ-SI84 (83 speakers, 15.2 hours of read English speech).

3. The mixed-condition-cantonese model was trained on background-cantonese data and ShefCE Cantonese training data (25 speakers, 9.7 hours).

4. The mixed-condition-english model was trained on background-english data and ShefCE English training data (25 speakers, 2.3 hours).
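
The amount of training audio behind each mixed-condition model follows from the figures above. The short Python sketch below adds the background and ShefCE durations per language. It assumes that each mixed-condition model used the whole of both sets, and the names BACKGROUND_HOURS, SHEFCE_HOURS and mixed_condition_hours are hypothetical.

```python
# Training-set durations in hours, copied from the list above.
BACKGROUND_HOURS = {"cantonese": 19.4, "english": 15.2}  # CUSENT, WSJ-SI84
SHEFCE_HOURS = {"cantonese": 9.7, "english": 2.3}        # ShefCE training data


def mixed_condition_hours(language):
    """Background plus ShefCE hours (assumes both sets were used in full)."""
    # Round to one decimal place to hide floating-point noise in the sum.
    return round(BACKGROUND_HOURS[language] + SHEFCE_HOURS[language], 1)


for language in ("cantonese", "english"):
    print(language, mixed_condition_hours(language))
```

Under that assumption the totals are 29.1 hours for Cantonese and 17.5 hours for English.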

Recording transcripts

----------------------------

The recording transcripts are available as a tarball, named stms.tar.gz, in this repository. These transcripts cover the ShefCE portion of the training data and the ShefCE test data.

Four files can be found in the stms.tar.gz archive.

- ShefCE_RC.train.v*.stm contains the transcripts for the ShefCE training set (Cantonese)

- ShefCE_RE.train.v*.stm contains the transcripts for the ShefCE training set (English)

- ShefCE_RC.test.v*.stm contains the transcripts for the ShefCE test set (Cantonese)

- ShefCE_RE.test.v*.stm contains the transcripts for the ShefCE test set (English)
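
The .stm files can be read with a few lines of Python. The sketch below assumes they follow the usual NIST STM segment layout (recording, channel, speaker, start time, end time, an optional label in angle brackets, then the transcript), with lines beginning ";;" treated as comments. That layout is an assumption rather than something stated here, and the names StmSegment and parse_stm_line are hypothetical.

```python
from typing import NamedTuple, Optional


class StmSegment(NamedTuple):
    recording: str
    channel: str
    speaker: str
    start: float          # segment start time in seconds
    end: float            # segment end time in seconds
    label: Optional[str]  # optional "<...>" field, None when absent
    transcript: str


def parse_stm_line(line):
    """Parse one STM line; return None for blank lines and ';;' comments."""
    line = line.strip()
    if not line or line.startswith(";;"):
        return None
    fields = line.split()
    if len(fields) < 5:
        raise ValueError("expected at least 5 fields: " + line)
    recording, channel, speaker = fields[:3]
    start, end = float(fields[3]), float(fields[4])
    rest = fields[5:]
    label = None
    # Assumed convention: an optional label is wrapped in angle brackets.
    if rest and rest[0].startswith("<") and rest[0].endswith(">"):
        label = rest.pop(0)
    # Transcript tokens are re-joined with single spaces.
    return StmSegment(recording, channel, speaker, start, end, label,
                      " ".join(rest))
```

Applying parse_stm_line to each line of a file and discarding the None results gives one StmSegment per transcribed segment. If the ShefCE files deviate from the assumed layout, the field positions need adjusting.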

The ShefCE corpus data can be accessed online with DOI:10.15131/shef.data.4522907

Please cite [1] for the use of ShefCE data, models or transcripts.

[1] Raymond W. M. Ng, Alvin C.M. Kwan, Tan Lee and Thomas Hain, "ShefCE: A Cantonese-English Bilingual Speech Corpus for Pronunciation Assessment", in Proc. 42nd IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.

Funding

IIKE Fund@Sheffield, Google

Ethics

There is no personal data or any that requires ethical approval

Policy

The data complies with the institution and funders' policies on access and sharing

Sharing and access restrictions

The data can be shared openly

Data description

The file formats are open or commonly used

Methodology, headings and units

Headings and units are explained in the files

Keywords

Cantonese, English, data sets, speech recognition system, Language learning, Chinese Languages, English as a Second Language, English Language, Natural Language Processing

Licence

CC BY 4.0
