A Spoken Corpus of Cameroon Pidgin English: pilot study
Melanie Green
Gabriel Ozon
Miriam Ayafor
10.15131/shef.data.4291307.v1
https://orda.shef.ac.uk/articles/dataset/A_Spoken_Corpus_of_Cameroon_Pidgin_English_pilot_study/4291307
This resource is a 240,000-word corpus of spoken Cameroon Pidgin
English (CPE), a widely used yet stigmatised and largely uncodified
pidgin/creole variety.
<p>
The corpus consists of transcriptions of private and public dialogues
and monologues, with mark-up and POS-tagging, together with accompanying
sound files. The recordings were conducted in five different locations
in Cameroon (Bamenda, Buea, Douala, Kumba and Yaounde), allowing some
insights into regional variation. Text categories and the proportions of
monologue and dialogue are guided by those of the International Corpus
of English (ICE) project, which makes the corpus immediately comparable
with existing corpora of post-colonial varieties of English.
</p><div>
<ul><li>
Spelling: since there is no standardised
orthography for CPE, the orthography adopted for this project is based
on that developed by Ayafor (2014), which was kept under review during
the course of the project.</li><li>
Annotation: mark-up was added to the transcriptions following the ICE
guidelines for the annotation of spoken texts. Standard mark-up symbols
denote text units, speaker identification, overlapping speech, unclear
words, uncertain transcriptions, anthropophonics, editorial comments,
foreign words and indigenous language words.</li><li>
Tagging: a tagset for CPE was devised based on CLAWS 5. Tagging was
initially carried out manually and subsequently by means of TreeTagger.
A third of the corpus has been post-checked, with an accuracy rate of
94%.</li></ul>
</div><p>
The corpus is aimed at providing a resource for linguistic description
and comparison. It allows linguists to identify and describe recurring
grammatical patterns, as well as the phonology of the language (given
the availability of sound files deposited with the text files). It also
allows comparison of CPE with other pidgin/creole languages, other
Cameroonian and West African languages, and other varieties of
post-colonial English. Furthermore, the corpus provides an exceptional
resource for the study of general/theoretical linguistics, creolistics,
typology, language contact and change, sociolinguistics and discourse
analysis.
</p>
<p>The corpus contains 80 sound
recordings of monologues (scripted and unscripted) and dialogues (public
and private). Each sound file (in .wav format) is 10-15 minutes in
length. Each recording has been transcribed (approximately 3,000 words
per transcription) and annotated. Transcriptions are submitted in two
formats: (a) a plain transcription (with basic mark-up indicating speaker
turns, overlaps, etc.), and (b) a POS-tagged version, which adds
POS tags to the plain transcription.</p>
<p>The language of the monologues is Cameroon
Pidgin English, with codeswitching into English, French, and indigenous
Cameroonian languages.</p>
<p>
The accompanying documentation includes (i) a list of submitted files,
(ii) a list of participant data, (iii) a tagging guide, and (iv) a word list
and spelling guide.
</p>
2017-03-15 11:09:56
Linguistic corpora
Corpus
Speech--Research
Linguistics analysis
Pidgin English
Code switching
English Language
Language, Communication and Culture not elsewhere classified