A Spoken Corpus of Cameroon Pidgin English: pilot study
The corpus consists of transcriptions of private and public dialogues and monologues, with mark-up and POS-tagging, together with accompanying sound files. The recordings were conducted in five different locations in Cameroon (Bamenda, Buea, Douala, Kumba and Yaounde), allowing some insights into regional variation. Text categories and the proportions of monologue and dialogue are guided by those of the International Corpus of English (ICE) project, which makes the corpus immediately comparable with existing corpora of post-colonial varieties of English.
- Spelling: since there is no standardised orthography for CPE, the orthography adopted for this project is based on that developed by Ayafor (2014), which was kept under review during the course of the project.
- Annotation: mark-up was added to the transcriptions following ICE guidelines for the annotation of spoken texts. Standard mark-up symbols denote text units, speaker identification, overlapping speech, unclear words, uncertain transcriptions, anthropophonics, editorial comments, foreign words and indigenous language words.
- Tagging: a tagset for CPE was devised based on CLAWS 5. Tagging was initially conducted manually, and then automatically by means of TreeTagger. A third of the corpus has been post-checked, with an accuracy rate of 94%.
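As a rough illustration of what an accuracy rate for post-checked tagging means, the sketch below computes the share of tokens whose automatic tag matches the manually checked tag. It is not the project's own evaluation procedure: the function name `tagging_accuracy` and the toy tag sequences are assumptions made for this example, and the tags shown are generic CLAWS 5-style labels rather than entries from the CPE tagset.

```python
def tagging_accuracy(auto_tags, checked_tags):
    """Return the share of tokens whose automatic tag equals the checked tag.

    Both arguments are sequences of tag strings aligned token by token.
    """
    if len(auto_tags) != len(checked_tags):
        raise ValueError("tag sequences must be aligned token by token")
    if not checked_tags:
        raise ValueError("cannot compute accuracy on an empty sample")
    matches = sum(a == c for a, c in zip(auto_tags, checked_tags))
    return matches / len(checked_tags)


# Invented toy data (hypothetical, not taken from the corpus):
# four tokens, one of which the checker corrected.
auto = ["PNP", "VVB", "NN1", "AT0"]
checked = ["PNP", "VVB", "NN1", "DT0"]

print(f"accuracy: {tagging_accuracy(auto, checked):.2%}")
```

On this reading, a 94% rate would mean that roughly 94 of every 100 checked tokens kept the tag assigned automatically.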
The corpus aims to provide a resource for linguistic description and comparison. It allows linguists to identify and describe recurring grammatical patterns, as well as the phonology of the language (since sound files are deposited alongside the text files). It also allows comparison of CPE with other pidgin/creole languages, other Cameroonian and West African languages, and other varieties of post-colonial English. Furthermore, the corpus provides an exceptional resource for the study of general/theoretical linguistics, creolistics, typology, language contact and change, sociolinguistics and discourse analysis.
The corpus contains 80 sound recordings of monologues (scripted and unscripted) and dialogues (public and private). Each sound file (in .wav format) is 10-15 minutes in length. These recordings have been transcribed (each approximately 3,000 words in length) and annotated. Transcriptions are submitted in two formats: (a) plain transcription (with basic markup indicating speaker turns, overlaps, etc.), and (b) a POS-tagged version, which adds POS-tags to the plain version of the transcription.
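The figures above imply an approximate overall size for the corpus. The short sketch below works through that arithmetic using only the numbers stated in the description (80 recordings, about 3,000 words per transcription, 10-15 minutes per sound file). The constant and function names are chosen for this example, and the results are estimates, not official totals.

```python
RECORDINGS = 80               # number of sound recordings stated above
WORDS_PER_TRANSCRIPT = 3_000  # approximate words per transcription
MIN_MINUTES = 10              # shortest stated recording length
MAX_MINUTES = 15              # longest stated recording length


def corpus_estimates(recordings, words_each, min_minutes, max_minutes):
    """Estimate total word count and the range of total audio duration."""
    return {
        "words": recordings * words_each,
        "min_hours": recordings * min_minutes / 60,
        "max_hours": recordings * max_minutes / 60,
    }


estimates = corpus_estimates(
    RECORDINGS, WORDS_PER_TRANSCRIPT, MIN_MINUTES, MAX_MINUTES
)
print(f"approx. words: {estimates['words']:,}")
print(
    f"approx. audio: {estimates['min_hours']:.1f} "
    f"to {estimates['max_hours']:.1f} hours"
)
```

Under these assumptions the corpus comes to roughly 240,000 words and between about 13 and 20 hours of audio.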