
ShefCE: A Cantonese-English bilingual speech corpus -- speech recognition model sets and recording transcripts

posted on 10.03.2017, 14:07 by Wai Man Ng, Alvin C.M. Kwan, Tan Lee, Thomas Hain
This online repository contains the speech recognition model sets and the recording transcripts used in the phoneme/syllable recognition experiments reported in [1].

Speech recognition model sets
The speech recognition model sets are available as a tarball, named model.tar.gz, in this repository.
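
As a minimal sketch of unpacking the tarball with the Python standard library: the function name extract_archive and the destination directory are illustrative assumptions, and the internal layout of model.tar.gz is not described here, so the function simply returns whatever member names the archive contains.

```python
import os
import tarfile


def extract_archive(archive_path, dest_dir):
    """Extract a gzip-compressed tarball into dest_dir and return its member names.

    extract_archive is a hypothetical helper, not part of the repository.
    """
    os.makedirs(dest_dir, exist_ok=True)
    dest_root = os.path.realpath(dest_dir)
    with tarfile.open(archive_path, "r:gz") as archive:
        members = archive.getmembers()
        # Refuse any member whose path would land outside dest_dir.
        for member in members:
            target = os.path.realpath(os.path.join(dest_root, member.name))
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise ValueError(f"unsafe path in archive: {member.name}")
        archive.extractall(dest_root)
    return [member.name for member in members]
```

With the tarball downloaded to the working directory, a call such as extract_archive("model.tar.gz", "models") would unpack it into a models directory. The same call should work for any other .tar.gz file.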

The models were trained on Cantonese and English data. For each language, two model sets were trained: one under the background setting and one under the mixed-condition setting. All models are hybrid DNN-HMM models whose feed-forward neural networks have 6 hidden layers of 2048 neurons each. Details can be found in [1]. The Cantonese models include a bigram syllable language model. The English models include a bigram phoneme language model. All model sets are provided in Kaldi format.
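
To give a rough sense of model size, the sketch below counts the weights and biases of a fully connected network with 6 hidden layers of 2048 units. The input and output dimensions (440 and 2000) are placeholder assumptions chosen only for illustration; the actual feature and output sizes are given in [1] and in the model files, not here.

```python
def count_parameters(input_dim, output_dim, hidden_layers=6, hidden_units=2048):
    """Count weights and biases of a fully connected feed-forward network."""
    sizes = [input_dim] + [hidden_units] * hidden_layers + [output_dim]
    # Each layer contributes an (n_in x n_out) weight matrix plus n_out biases.
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


# ASSUMPTION: 440 input features and 2000 output classes are hypothetical values.
print(count_parameters(input_dim=440, output_dim=2000))  # prints 25982928
```

Whatever the true input and output sizes are, the five hidden-to-hidden connections alone account for 5 x (2048 x 2048 + 2048) = 20,981,760 parameters.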

1. The background-cantonese model was trained on CUSENT (68 speakers, 19.4 hours of read Cantonese speech).
2. The background-english model was trained on WSJ-SI84 (83 speakers, 15.2 hours of read English speech).
3. The mixed-condition-cantonese model was trained on the background-cantonese data plus the ShefCE Cantonese training data (25 speakers, 9.7 hours).
4. The mixed-condition-english model was trained on the background-english data plus the ShefCE English training data (25 speakers, 2.3 hours).
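
The totals implied by the list above can be worked out with a short sketch. The dictionary layout and the function name training_totals are illustrative, and the speaker totals assume that the ShefCE speakers do not overlap with the CUSENT or WSJ-SI84 speakers, which is not stated explicitly here.

```python
# (corpus, speakers, hours) for each model set, copied from the list above.
TRAINING_SETS = {
    "background-cantonese": [("CUSENT", 68, 19.4)],
    "background-english": [("WSJ-SI84", 83, 15.2)],
    "mixed-condition-cantonese": [("CUSENT", 68, 19.4), ("ShefCE Cantonese", 25, 9.7)],
    "mixed-condition-english": [("WSJ-SI84", 83, 15.2), ("ShefCE English", 25, 2.3)],
}


def training_totals(model_name):
    """Return (total speakers, total hours) for one model set."""
    parts = TRAINING_SETS[model_name]
    speakers = sum(n_speakers for _, n_speakers, _ in parts)  # assumes disjoint speakers
    hours = round(sum(n_hours for _, _, n_hours in parts), 1)
    return speakers, hours


for name in TRAINING_SETS:
    print(name, training_totals(name))
```

Under that assumption, the mixed-condition Cantonese model saw 93 speakers and 29.1 hours of speech, and the mixed-condition English model saw 108 speakers and 17.5 hours.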

Recording transcripts
The recording transcripts are available as a tarball, named stms.tar.gz, in this repository. These transcripts cover the ShefCE portion of the training data and the ShefCE test data.

Four files can be found in the stms.tar.gz archive:
- ShefCE_RC.train.v*.stm contains the transcripts for the ShefCE training set (Cantonese).
- ShefCE_RE.train.v*.stm contains the transcripts for the ShefCE training set (English).
- ShefCE_RC.test.v*.stm contains the transcripts for the ShefCE test set (Cantonese).
- ShefCE_RE.test.v*.stm contains the transcripts for the ShefCE test set (English).
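
The sketch below shows one way the .stm files might be read, assuming they follow the usual NIST segment time mark layout (recording, channel, speaker, start time, end time, an optional label in angle brackets, then the transcript). Whether the ShefCE files carry the optional label field is not stated here, and the names StmSegment and parse_stm are illustrative.

```python
from typing import List, NamedTuple, Optional


class StmSegment(NamedTuple):
    recording: str
    channel: str
    speaker: str
    start: float
    end: float
    label: Optional[str]
    transcript: str


def parse_stm(text: str) -> List[StmSegment]:
    """Parse STM-formatted text into segments, skipping blank and ';;' comment lines."""
    segments = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";;"):
            continue
        fields = line.split()
        if len(fields) < 5:
            raise ValueError(f"malformed STM line: {line!r}")
        recording, channel, speaker = fields[:3]
        start, end = float(fields[3]), float(fields[4])
        rest = fields[5:]
        label = None
        # ASSUMPTION: a leading <...> token is the optional label, not a transcript token.
        if rest and rest[0].startswith("<") and rest[0].endswith(">"):
            label, rest = rest[0], rest[1:]
        segments.append(
            StmSegment(recording, channel, speaker, start, end, label, " ".join(rest))
        )
    return segments
```

After extracting stms.tar.gz, the text of any of the four files can be passed to parse_stm, and each returned segment exposes the speaker, the time span and the transcript as named fields.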

The ShefCE corpus data can be accessed online with DOI:10.15131/shef.data.4522907
Please cite [1] for the use of ShefCE data, models or transcripts.

[1] Raymond W. M. Ng, Alvin C.M. Kwan, Tan Lee and Thomas Hain, "ShefCE: A Cantonese-English Bilingual Speech Corpus for Pronunciation Assessment", in Proc. the 42nd IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.


IIKE Fund@Sheffield, Google



There is no personal data or any that requires ethical approval


The data complies with the institution and funders' policies on access and sharing

Sharing and access restrictions

The data can be shared openly

Data description

  • The file formats are open or commonly used

Methodology, headings and units

  • Headings and units are explained in the files