Databases hosted by RISC

Statistical Analysis of Similarity Relations among Spoken Words: Evidence for the Special Status of Rimes in English

Authors: Bruno DE CARA & Usha GOSWAMI
Behavioural and Brain Sciences Unit, Institute of Child Health, London

The lexical database discussed in De Cara B & Goswami U (2002), Similarity relations among spoken words: The special status of rimes in English, Behavior Research Methods, Instruments, & Computers, 34 (3), 416-423, is available here to download.

Lexical Database - Microsoft Excel 97 (4.04 MB)

Phonological neighbourhood density was calculated for 4,086 English monosyllabic words. The standard pronunciation in southern British English was selected. Phonetic codes are given below.

Phonetic codes - Microsoft Excel 97 (16.5 KB)

The Database

The initial database comprised all monosyllabic words found in the Celex corpus (7,256 words; Baayen, Piepenbrock & Gulikers, 1995). We removed 2,680 words that were both homophones and homographs of other words (e.g. browse as a verb and browse as a noun), 258 words that were homographs but not homophones of other words (e.g. lunch pronounced with a fricative or an affricate), and 128 words that were single contractions (e.g. 's), complex contractions (e.g. how's) or abbreviations (e.g. sq). Homophones that were not homographs (e.g. pear, pair) were kept distinct, except when the words referred to the same morpheme (e.g. disk and disc; 104 words removed). This left 4,086 monosyllabic words. The main syllable types occurred in the database in the following proportions: CVC 43.0%, CCVC 21.0%, CVCC 15.2%, CCVCC 5.7%, CV 4.5%, and CCV 2.5% (C: Consonant, V: Vowel).
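As a quick arithmetic check, the exclusion counts reported above can be subtracted from the initial Celex total. The snippet below is only a sanity check on the published figures; the dictionary labels are shorthand introduced here, not field names from the database.

```python
# Counts taken from the paragraph above; the labels are informal shorthand.
initial_words = 7256
removed = {
    "homophone homographs": 2680,
    "homograph non-homophones": 258,
    "contractions and abbreviations": 128,
    "same-morpheme spelling variants": 104,
}

# Subtract every exclusion category from the initial total.
remaining_words = initial_words - sum(removed.values())
print(remaining_words)
```

The result matches the 4,086 words reported for the final database.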

Phonological Neighbourhood Density

Phonological neighbourhood density refers to the number of words that are similar in sound to a target word. We used two definitions of phonological neighbourhood. The first definition, based on models of speech recognition, treats the phonological neighbourhood as the set of words that differ from a given target by a 1-phoneme substitution, addition or deletion (metric Ph+/-1; Landauer & Streeter, 1973; Luce, 1986; Charles-Luce & Luce, 1990; Luce & Pisoni, 1998). For example, according to the Ph+/-1 metric, the phonological neighbourhood of the target word hat would include bat, hot, ham and at, among others. The second definition is based on the linguistic coding of syllables in three dimensions: Onset (initial consonant or consonant cluster), Vowel, and Coda (final consonant or consonant cluster), e.g. /pr/-/i/-/ns/ for prince. This second metric, called here OVC, was derived from the phonological awareness literature, which has demonstrated the psychological salience of onsets (e.g. /pr/ for prince) and rimes (vowel + coda; e.g. /ins/ for prince) for young children (see Treiman, 1988, for review). The OVC metric generates more neighbours than the Ph+/-1 metric. For instance, words like skill and wind do not count as phonological neighbours of will in the Ph+/-1 metric, because each involves more than a 1-phoneme change from the target word will. They do count as phonological neighbours of will in the OVC metric, since skill and wind each differ in only one dimension from the OVC coding of will (/w/-/i/-/l/).
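The two metrics can be contrasted on a toy lexicon. The following is a minimal sketch, not the authors' routine: it assumes each word is stored as an (onset, vowel, coda) triple with one character per phoneme, and the transcriptions, the LEXICON dictionary and the function names are all illustrative inventions.

```python
# Toy lexicon: each word is an (onset, vowel, coda) triple.
# Transcriptions are simplified (one character per phoneme) and are
# illustrative assumptions, not the database's phonetic codes.
LEXICON = {
    "hat": ("h", "a", "t"),
    "bat": ("b", "a", "t"),
    "hot": ("h", "o", "t"),
    "ham": ("h", "a", "m"),
    "at": ("", "a", "t"),
    "will": ("w", "i", "l"),
    "skill": ("sk", "i", "l"),
    "wind": ("w", "i", "nd"),
}


def is_ph1_neighbour(a, b):
    """True if the phoneme strings differ by exactly one
    substitution, addition or deletion (Ph+/-1)."""
    s, t = "".join(a), "".join(b)
    if s == t:
        return False
    if len(s) == len(t):
        # Same length: exactly one substituted phoneme.
        return sum(x != y for x, y in zip(s, t)) == 1
    if abs(len(s) - len(t)) != 1:
        return False
    # Lengths differ by one: deleting one phoneme from the longer
    # string must yield the shorter one.
    short, long_ = sorted((s, t), key=len)
    return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))


def is_ovc_neighbour(a, b):
    """True if exactly one of onset, vowel, coda differs (OVC)."""
    return sum(x != y for x, y in zip(a, b)) == 1


def neighbours(target, metric):
    """Set of lexicon words that are neighbours of target under metric."""
    code = LEXICON[target]
    return {w for w, c in LEXICON.items() if w != target and metric(code, c)}


print(sorted(neighbours("hat", is_ph1_neighbour)))
print(sorted(neighbours("will", is_ph1_neighbour)))
print(sorted(neighbours("will", is_ovc_neighbour)))
```

On this toy lexicon, hat has the Ph+/-1 neighbours at, bat, ham and hot, while will has no Ph+/-1 neighbours but gains skill and wind under the OVC metric, as in the example in the text.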

In addition to considering the number of phonological neighbours for any target word, it is also important to consider the nature of these neighbours. If one type of neighbour is more dominant than the others, then neighbourhood density effects may reflect levels of segmental representation other than the phoneme, particularly prior to literacy. Statistical analyses of the nature of phonological neighbourhoods in terms of Rime Neighbours (sharing the rime; e.g. hat/cat), Consonant Neighbours (sharing the consonants; e.g. hat/hit) and Lead Neighbours (sharing the Onset-Vowel sequence or Lead; e.g. hat/ham) were performed for all monosyllabic words in the database (4,086 words). Our results show that most phonological neighbours in English (54.2% in the OVC metric and 44.5% in the Ph+/-1 metric) are Rime Neighbours (e.g. hat/cat). Similar patterns were found when we analysed a corpus of words for which Age-of-Acquisition (AoA) ratings were available. The percentage of Rime Neighbours among the words familiar to 3-year-olds was 49.8%, and the corresponding percentages for 4-, 5-, 6-, and 7-year-olds were 54.8%, 56.2%, 56.7%, and 57.1%, respectively.
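The three neighbour types can be told apart by which parts of the syllable are shared with the target. The sketch below is illustrative only: it assumes words are stored as (onset, vowel, coda) triples, uses hypothetical simplified transcriptions, and computes the rime share for a single target word rather than across a whole lexicon as in the reported analyses.

```python
from collections import Counter

# Words as (onset, vowel, coda) triples; simplified, hypothetical transcriptions.
LEXICON = {
    "hat": ("h", "a", "t"),
    "cat": ("k", "a", "t"),
    "bat": ("b", "a", "t"),
    "hit": ("h", "i", "t"),
    "ham": ("h", "a", "m"),
}


def neighbour_type(target, other):
    """Return 'RN', 'CN' or 'LN', or None if other is not an OVC neighbour."""
    shared = tuple(t == o for t, o in zip(target, other))
    return {
        (False, True, True): "RN",   # rime shared, onset differs
        (True, False, True): "CN",   # consonants shared, vowel differs
        (True, True, False): "LN",   # lead shared, coda differs
    }.get(shared)


def type_counts(word):
    """Count the neighbours of word by type within the toy lexicon."""
    code = LEXICON[word]
    kinds = (neighbour_type(code, c) for w, c in LEXICON.items() if w != word)
    return Counter(k for k in kinds if k is not None)


counts = type_counts("hat")
rime_share = 100 * counts["RN"] / sum(counts.values())
print(counts, rime_share)
```

Here hat has two Rime Neighbours (cat, bat), one Consonant Neighbour (hit) and one Lead Neighbour (ham), so Rime Neighbours make up half of its toy neighbourhood.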

Phonetic Coding

Each monosyllabic word was phonetically coded on a 9-slot sequence (one phoneme per slot). Words were centred on the Vowel-slot (which is the only obligatory element in the syllable). From the vowel, pre-vocalic consonants (onset) were coded on 4 slots from right to left (o1, o2, o3, o4) and post-vocalic consonants (coda) on 4 slots from left to right (c1, c2, c3, c4). The o1 and c1 slots stand closer to the centre of the syllable than the o2 and c2 slots and so on. Empty slots were coded with a dot. For example, a word like skill was coded as [..skil...] for the o4-o3-o2-o1-V-c1-c2-c3-c4 9-slot sequence and a word like wind as [...wind..].
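The 9-slot layout can be reproduced by padding the onset on the left and the coda on the right. This is a minimal sketch under the assumption that each phoneme is written as a single character; the function names encode_slots and named_slots are hypothetical and do not come from the database files.

```python
SLOT_NAMES = ["o4", "o3", "o2", "o1", "V", "c1", "c2", "c3", "c4"]


def encode_slots(onset, vowel, coda, width=4):
    """Build the 9-slot code: onset right-aligned against the vowel,
    coda left-aligned after it, empty slots filled with dots."""
    if len(vowel) != 1:
        raise ValueError("exactly one vowel slot is expected")
    if len(onset) > width or len(coda) > width:
        raise ValueError("onset and coda are limited to 4 slots each")
    return onset.rjust(width, ".") + vowel + coda.ljust(width, ".")


def named_slots(code):
    """Map a 9-character code onto the slot names o4..o1, V, c1..c4."""
    return dict(zip(SLOT_NAMES, code))


print(encode_slots("sk", "i", "l"))   # skill
print(encode_slots("w", "i", "nd"))   # wind
print(named_slots(encode_slots("sk", "i", "l")))
```

With these inputs the codes come out as ..skil... for skill and ...wind.. for wind, matching the examples above, and the k of skill lands in slot o1, next to the vowel.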

Lexical Statistics

The database contains the following fields:

1. Measures of Phonological Neighbourhood Density

a. OVC metric: ND represents the number of all phonological neighbours that differ from a target word by a 1-OVC (onset, vowel, or coda) substitution, deletion, or addition. Among these neighbours, RN represents the number of neighbours sharing the rime (the -V-c1-c2-c3-c4 sequence), CN the number of neighbours sharing the consonants (both the o4-o3-o2-o1 and c1-c2-c3-c4 sequences), and LN the number of neighbours sharing the lead (the o4-o3-o2-o1-V- sequence).

b. Ph+/-1 metric: ND represents the number of all phonological neighbours that differ from a target word by a 1-phoneme substitution, deletion, or addition (a 1-slot change out of the 9-slot sequence). Among these Ph+/-1 neighbours only, RN represents the number of neighbours sharing the rime, CN the number of neighbours sharing the consonants, and LN the number of neighbours sharing the lead.

For both metrics, the calculation by type is based on the absolute number of neighbours. The calculation by token is based on the cumulated lexical spoken frequencies of neighbours.
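The difference between the two calculations can be illustrated as follows. This is a sketch only: the frequency values are hypothetical placeholders rather than Celex figures, and the function name density is introduced here for illustration.

```python
def density(neighbours, spoken_freq):
    """Return (count by type, count by token) for a list of neighbours.

    By type: the absolute number of neighbours.
    By token: the cumulated spoken frequency of those neighbours.
    """
    by_type = len(neighbours)
    by_token = sum(spoken_freq[w] for w in neighbours)
    return by_type, by_token


# Hypothetical per-million spoken frequencies, for illustration only.
spoken_freq = {"bat": 12, "hot": 85, "ham": 4, "at": 5000}

all_neighbours = ["bat", "hot", "ham", "at"]   # neighbours of hat
rime_neighbours = ["bat", "at"]                # those sharing the rime

nd_type, nd_token = density(all_neighbours, spoken_freq)
rn_type, rn_token = density(rime_neighbours, spoken_freq)
print(nd_type, nd_token, rn_type, rn_token)
```

In this made-up case Rime Neighbours are half of the neighbourhood by type but almost all of it by token, because one very frequent neighbour dominates the cumulated frequency.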

2. Measures of lexical frequencies and familiarity

a. CobSMln: Corresponds to the Celex measure for spoken frequency of lemmas (main word forms; occurrence per million within a 17.9 million spoken word corpus).

b. CobWMln: Corresponds to the Celex measure for written frequency of lemmas (main word forms; occurrence per million within a 17.9 million written word corpus).

c. Fam: Adults' familiarity rating (maximum rating of 7) according to the Luce and Pisoni (1998) norms.

3. Age of Acquisition norms

AoA: Adults' Age of Acquisition ratings on a 7-point scale (1: age 0-2 years; 7: age 13 years and older) for 1,944 words (data reported in Gilhooly and Logie, 1980). A total of 632 of these 1,944 words were found in the 4,086-word database.

Exhaustive List of Neighbours

A computerised routine (Microsoft Excel only) gives the full list of phonological neighbours for any target word.

The routine is available here to download:

Routine - Microsoft Excel 97 (6.33 MB)


  • Baayen RH, Piepenbrock R & Gulikers L (1995). The CELEX Lexical Database (CD-ROM). Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA.
  • Charles-Luce J & Luce PA (1990). Similarity neighbourhoods of words in young children's lexicons. Journal of Child Language, 17, 205-215.
  • De Cara B & Goswami U (2002). Similarity relations among spoken words: The special status of rimes in English. Behavior Research Methods, Instruments, & Computers, 34 (3), 416-423.
  • De Cara B & Goswami U (2003). Phonological neighbourhood density: Effects in a rhyme awareness task in five-year-old children. Journal of Child Language, 30, 695-710.
  • Gilhooly KJ & Logie RH (1980). Age of acquisition, imagery, concreteness, familiarity and ambiguity measures for 1,944 words. Behaviour Research Methods and Instrumentation, 12, 395-427.
  • Landauer TK & Streeter LA (1973). Structural differences between common and rare words: Failure of equivalence assumptions for theories of word recognition. Journal of Verbal Learning and Verbal Behaviour, 12, 119-131.
  • Luce PA & Pisoni DB (1998). Recognising spoken words: The neighbourhood activation model. Ear & Hearing, 19, 1-36.
  • Treiman R (1988). The internal structure of the syllable. In G. Carlson & M. Tanenhaus (Eds.), Linguistic structure in language processing (pp. 27-52). Dordrecht, The Netherlands: Kluwer.