ShefCE: A Cantonese-English bilingual speech corpus -- speech recognition model sets and recording transcripts

dataset
posted on 10.03.2017 by Wai Man Ng, Alvin C.M. Kwan, Tan Lee, Thomas Hain
This online repository contains the speech recognition model sets and the recording transcripts used in the phoneme/syllable recognition experiments reported in [1].

Speech recognition model sets
-----------------------------------------
The speech recognition model sets are available as a tarball,
named model.tar.gz, in this repository.
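
As a minimal sketch of how such a tarball might be inspected and unpacked, the snippet below uses Python's standard tarfile module. The helper names list_members and extract_archive are hypothetical, and the internal layout of model.tar.gz is not described here, so it is worth listing the members before extracting.

```python
import tarfile
from pathlib import Path


def list_members(archive_path):
    """Return the member names stored in a gzip-compressed tarball."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        return archive.getnames()


def extract_archive(archive_path, dest_dir):
    """Extract every member of the tarball into dest_dir and return that path."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:gz") as archive:
        archive.extractall(dest)
    return dest
```

For example, list_members("model.tar.gz") shows what the archive holds, and extract_archive("model.tar.gz", "models") unpacks it into a directory named models (an arbitrary name chosen for illustration).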

The models were trained on Cantonese and English data. For each language, two model sets were trained: one under the background setting and one under the mixed-condition setting. All models are DNN-HMM models, that is, hybrid feed-forward neural network models with 6 hidden layers and 2048 neurons per layer. Details can be found in [1]. The Cantonese models include a bigram syllable language model. The English models include a bigram phoneme language model. All model sets are provided in the Kaldi format.
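
To give a rough sense of model size, the sketch below counts the weights and biases of a feed-forward network with 6 hidden layers of 2048 units. It assumes plain fully connected layers with bias terms, which may not match the exact architecture in [1]. The input and output dimensions are not stated here, so they are left as parameters, and both function names are hypothetical.

```python
def feedforward_param_count(input_dim, output_dim, hidden_layers=6, hidden_units=2048):
    """Count weights and biases of a fully connected feed-forward network.

    input_dim and output_dim are NOT given in the dataset description;
    the caller must supply them (see [1] for the actual values).
    """
    dims = [input_dim] + [hidden_units] * hidden_layers + [output_dim]
    return sum(n_in * n_out + n_out for n_in, n_out in zip(dims, dims[1:]))


def hidden_to_hidden_param_count(hidden_layers=6, hidden_units=2048):
    """Count only the parameters between consecutive hidden layers."""
    return (hidden_layers - 1) * (hidden_units * hidden_units + hidden_units)


# The five hidden-to-hidden connections alone hold about 21 million parameters.
print(hidden_to_hidden_param_count())  # 20981760
```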

1. The background-cantonese model was trained on read Cantonese speech from CUSENT (68 speakers, 19.4 hours).
2. The background-english model was trained on read English speech from WSJ-SI84 (83 speakers, 15.2 hours).
3. The mixed-condition-cantonese model was trained on the background-cantonese data and the ShefCE Cantonese training data (25 speakers, 9.7 hours).
4. The mixed-condition-english model was trained on the background-english data and the ShefCE English training data (25 speakers, 2.3 hours).
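
The total amount of training audio per model follows from the figures in the list above. The sketch below adds them up. The dictionary and function names are hypothetical, and speaker counts are deliberately not summed because the description does not say whether speakers overlap between data sets.

```python
# Hours of audio per data set, copied from the list above.
HOURS = {
    "CUSENT": 19.4,
    "WSJ-SI84": 15.2,
    "ShefCE Cantonese training": 9.7,
    "ShefCE English training": 2.3,
}

# Data sets used to train each model, as described in the list above.
MODEL_SOURCES = {
    "background-cantonese": ["CUSENT"],
    "background-english": ["WSJ-SI84"],
    "mixed-condition-cantonese": ["CUSENT", "ShefCE Cantonese training"],
    "mixed-condition-english": ["WSJ-SI84", "ShefCE English training"],
}


def training_hours(model_name):
    """Return total training hours for a model, rounded to one decimal place."""
    return round(sum(HOURS[source] for source in MODEL_SOURCES[model_name]), 1)


print(training_hours("mixed-condition-cantonese"))  # 29.1
print(training_hours("mixed-condition-english"))    # 17.5
```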

Recording transcripts
----------------------------
The recording transcripts are available as a tarball, named stms.tar.gz, in this repository. These transcripts cover the ShefCE portion of the training data and the ShefCE test data.

Four files can be found in the stms.tar.gz archive:
- ShefCE_RC.train.v*.stm contains the transcripts for the ShefCE training set (Cantonese).
- ShefCE_RE.train.v*.stm contains the transcripts for the ShefCE training set (English).
- ShefCE_RC.test.v*.stm contains the transcripts for the ShefCE test set (Cantonese).
- ShefCE_RE.test.v*.stm contains the transcripts for the ShefCE test set (English).
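
A minimal sketch of locating and reading these files is given below. It assumes the files follow the usual NIST STM segment layout (recording, channel, speaker, start time, end time, an optional <...> label field, then the transcript) and are UTF-8 encoded. Neither assumption has been checked against the archive. The names StmSegment, parse_stm_line, read_stm and find_stm_files are hypothetical.

```python
from pathlib import Path
from typing import NamedTuple, Optional


class StmSegment(NamedTuple):
    recording: str
    channel: str
    speaker: str
    start: float
    end: float
    label: Optional[str]
    transcript: str


def parse_stm_line(line):
    """Parse one STM line; return None for blank lines and ';;' comments."""
    text = line.strip()
    if not text or text.startswith(";;"):
        return None
    fields = text.split(None, 5)
    if len(fields) < 5:
        raise ValueError(f"malformed STM line: {line!r}")
    recording, channel, speaker, start, end = fields[:5]
    rest = fields[5] if len(fields) > 5 else ""
    label = None
    # Assumption: a leading <...> token is the optional label field,
    # not part of the transcript itself.
    if rest.startswith("<"):
        head, _, tail = rest.partition(" ")
        if head.endswith(">"):
            label, rest = head, tail.strip()
    return StmSegment(recording, channel, speaker, float(start), float(end), label, rest)


def read_stm(path):
    """Read all segments from an STM file (UTF-8 encoding is assumed)."""
    with open(path, encoding="utf-8") as handle:
        return [seg for seg in map(parse_stm_line, handle) if seg is not None]


def find_stm_files(root, language_code, split):
    """Find transcript files under root; language_code is 'RC' or 'RE',
    split is 'train' or 'test'. The version suffix is matched by a wildcard."""
    pattern = f"ShefCE_{language_code}.{split}.v*.stm"
    return sorted(Path(root).rglob(pattern))
```

For instance, find_stm_files("stms", "RC", "test") would return the Cantonese test transcript files under a directory named stms (an arbitrary extraction directory), and read_stm can then be applied to each path.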


The ShefCE corpus data can be accessed online at DOI: 10.15131/shef.data.4522907
Please cite [1] when using the ShefCE data, models or transcripts.

[1] Raymond W. M. Ng, Alvin C. M. Kwan, Tan Lee and Thomas Hain, "ShefCE: A Cantonese-English Bilingual Speech Corpus for Pronunciation Assessment", in Proc. 42nd IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.

Funding

IIKE Fund@Sheffield, Google