SNuC Corpus

The SNuC corpus contains:

raw_recordings/
	This folder contains the original raw recordings (44.1 kHz, stereo), exactly as they were recorded.
	There are 52 folders, one per participant. Each folder contains one .wav file per number read
	(e.g. 1035/1035_rec149.wav) and a metadata file (metadata.json) that includes the demographic
	information and file details (prompt, recording time, etc.).

preprocessed_recordings/
	This folder contains the raw stereo files after downsampling to 16 kHz and conversion to mono.
	There are 52 folders, one per participant. Each folder contains one .wav file per number read
	(e.g. 1035/1035_AI73926.wav).

transcribed_recordings/
	This is the core set of recordings that we transcribed. There are 51 folders, one per 
	participant. Each folder contains one .wav file per number read (e.g., 1035/1035_AI73926.wav).

all_recordings.csv
	This CSV file contains information about each recording in the corpus, in the following columns:
	id, participantId, age, gender, accent_region, number_type, prompt, raw_audio, preprocessed_audio,
	transcribed_audio

transcribed_recordings.csv
	This CSV file contains information about each transcribed recording in the corpus, in the following columns:
	id, participantId, age, gender, accent_region, number_type, prompt, transcribed_audio, transcription
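
The files described above can be inspected with the Python standard library alone. The sketch below is an illustration, not part of the corpus distribution. It assumes that the CSV files are comma-separated, UTF-8 encoded, and begin with a header row naming the columns listed above, and that the .wav files are uncompressed PCM (the only kind the standard wave module reads). The function names and the SNuC directory name are hypothetical.

```python
import csv
import wave
from collections import defaultdict
from pathlib import Path


def load_rows(csv_path):
    """Read a corpus CSV into a list of dicts keyed by column name.

    Assumption: the file has a header row with the column names
    listed in this README (id, participantId, ...).
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def group_by_participant(rows):
    """Map each participantId to the list of rows for that participant."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["participantId"]].append(row)
    return dict(grouped)


def wav_format(wav_path):
    """Return (sample_rate_hz, channel_count) read from a WAV header.

    Assumption: the file is PCM; wave.open raises wave.Error otherwise.
    """
    with wave.open(str(wav_path), "rb") as wav:
        return wav.getframerate(), wav.getnchannels()


if __name__ == "__main__":
    # Hypothetical location of the unpacked corpus; adjust as needed.
    corpus_root = Path("SNuC")
    csv_file = corpus_root / "transcribed_recordings.csv"
    if csv_file.exists():
        rows = load_rows(csv_file)
        by_participant = group_by_participant(rows)
        for participant, items in sorted(by_participant.items()):
            print(participant, len(items))
```

If the descriptions above hold, wav_format should report 44100 Hz and 2 channels for a file under raw_recordings/, and 16000 Hz and 1 channel for a file under preprocessed_recordings/.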


If you use the SNuC corpus in your work, please cite our LREC 2022 paper:
Emma Barker, Jon Barker, Robert Gaizauskas, Ning Ma, Monica Lestari Paramita (2022). SNuC: The Sheffield 
Numbers Spoken Language Corpus. Proceedings of the 13th Edition of the Language Resources and Evaluation 
Conference (LREC2022). Marseille.