CantoneseYaleOperator — Cantonese Yale

cjklib.reading.operator.CantoneseYaleOperator is a mature implementation of the Yale transcription for Cantonese. It’s one of the major romanisations used for Cantonese and frequently found in education.


Features:

  • tones marked by either diacritics or numbers,
  • choice between high level and high falling tone for number marks,
  • guessing of input form (reading dialect) and
  • splitting of syllables into onset, nucleus and coda.


High Level vs. High Falling Tone

Yale distinguishes two tones that are often subsumed under one: the high level tone, with tone contour 55 in the commonly used pitch model by Yuen Ren Chao, and the high falling tone, given as pitch 53 (by Chao), 52 or 51 (Bauer and Benedict, chapter 2.1.1 pp. 115). Many sources state that these two tones are no longer distinguishable in modern Hong Kong Cantonese, which is why some romanisation systems for Cantonese merge them into one tone.

The abbreviated form of the Yale romanisation, which uses numbers to represent tones, does not make this distinction. The user can choose whether tone number 1 maps to the high level or the high falling tone, which matters whenever a conversion involves this abbreviated form. By default the high level tone is used, as the given sources indicate this as the primary use.
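The effect of this choice on a round trip can be sketched in plain Python. This is an illustrative sketch, not cjklib's implementation: the tone names '1stToneLevel' and '1stToneFalling' are the values of the yaleFirstTone option documented further down, while the remaining tone names and both helper functions are hypothetical.

```python
# Hypothetical tone names for tones 2 to 6; only the two first-tone names
# appear in this documentation.
OTHER_TONES = {2: "2ndTone", 3: "3rdTone", 4: "4thTone", 5: "5thTone", 6: "6thTone"}
FIRST_TONES = ("1stToneLevel", "1stToneFalling")


def number_for_tone(tone):
    """Map a tone name to its number; both first tones collapse to 1."""
    if tone in FIRST_TONES:
        return 1
    for number, name in OTHER_TONES.items():
        if name == tone:
            return number
    raise ValueError("unknown tone: %r" % (tone,))


def tone_for_number(number, yale_first_tone="1stToneLevel"):
    """Map a tone number back to a tone name, resolving 1 as configured."""
    if yale_first_tone not in FIRST_TONES:
        raise ValueError("invalid first tone: %r" % (yale_first_tone,))
    if number == 1:
        return yale_first_tone
    return OTHER_TONES[number]


# Going through numbers loses the level/falling distinction:
round_trip = tone_for_number(number_for_tone("1stToneFalling"))
```

With the default mapping, round_trip ends up as '1stToneLevel' even though the input was the falling tone, which is the information loss the option lets the user steer.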

Placement of tones

Tone marks, if using the standard form with diacritics, are placed according to Cantonese Yale rules (see getTonalEntity()). To ease handling of malformed input, the CantoneseYaleOperator by default tries to work around misplaced tone marks. In some cases this lenient behaviour leads to a different segmentation than the strict interpretation would, and no means are implemented to disambiguate between the two solutions. The general behaviour is controlled with the option 'strictDiacriticPlacement'.
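To illustrate what a placement check involves, the following sketch locates the letter that carries the diacritic by decomposing the syllable to NFD. It is not cjklib's code, and it assumes a simplified rule (the mark sits on the first vowel letter, or on the first letter when there is no vowel); the authoritative rule is the one implemented by getTonalEntity(). Both function names are hypothetical.

```python
import unicodedata

VOWELS = set("aeiou")  # assumption: simplified vowel inventory for this sketch


def tone_mark_index(syllable):
    """Return the index of the base letter carrying a combining mark, or None."""
    base_index = -1
    for char in unicodedata.normalize("NFD", syllable):
        if unicodedata.combining(char):
            return base_index
        base_index += 1
    return None


def is_strictly_placed(syllable):
    """Check the assumed rule: the mark sits on the first vowel letter."""
    decomposed = unicodedata.normalize("NFD", syllable.lower())
    letters = [c for c in decomposed if not unicodedata.combining(c)]
    marked = tone_mark_index(syllable)
    if marked is None:
        return True  # no diacritic, nothing to misplace
    # Fall back to the first letter for syllables without a vowel (m, ng).
    expected = next((i for i, c in enumerate(letters) if c in VOWELS), 0)
    return marked == expected
```

A lenient reader would accept a syllable such as "gwońg" and move the mark, while a strict one would reject it because is_strictly_placed() is False.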


Sources:

  • Stephen Matthews, Virginia Yip: Cantonese: A Comprehensive Grammar. Routledge, 1994, ISBN 0-415-08945-X.
  • Robert S. Bauer, Paul K. Benedict: Modern Cantonese Phonology (摩登廣州話語音學). Walter de Gruyter, 1997, ISBN 3-11-014893-5.

See also

Cantonese: A Comprehensive Grammar
Preview on Google Books.
Modern Cantonese Phonology
Preview on Google Books.


class cjklib.reading.operator.CantoneseYaleOperator(**options)

Bases: cjklib.reading.operator.TonalRomanisationOperator

Provides an operator for the Cantonese Yale romanisation. For conversion between different representations the CantoneseYaleDialectConverter can be used.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector, if none is given, default settings will be assumed.
  • strictSegmentation – if True, segmentation (using segment()) and thus decomposition (using decompose()) will raise an exception if an alphabetic string is parsed which cannot be segmented into single reading entities. If False, that string will be returned unsegmented.
  • case – if set to 'lower', only lower case will be supported; if set to 'both', a mix of upper and lower case will be supported.
  • toneMarkType – if set to 'diacritics', tones will be marked using diacritic marks and the character h for low tones; if set to 'numbers', appended numbers from 1 to 6 will be used to mark tones; if set to 'none', no tone marks will be used and no tonal information will be supplied at all.
  • missingToneMark – if set to 'noinfo', no tone information will be deduced when no tone mark is found (the tone takes on the value None); if set to 'ignore', the entity will not be valid, and for segmentation the behaviour defined by 'strictSegmentation' will take effect. This option only has an effect if the option toneMarkType is set to 'numbers'.
  • strictDiacriticPlacement – if set to True syllables have to follow the diacritic placement rule of Cantonese Yale strictly (see getTonalEntity()). Wrong placement will result in splitEntityTone() raising an InvalidEntityError. Defaults to False.
  • yaleFirstTone – tone in Yale which the first tone for tone marks with numbers should be mapped to. Value can be '1stToneLevel' to map to the level tone with contour 55 or '1stToneFalling' to map to the falling tone with contour 53. This option can only be used for tone mark type 'numbers'.
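The constraints between these options can be summarised in a small validation sketch. It only encodes the values and restrictions listed above; the function check_options() is hypothetical, is not part of cjklib, and leaves out dbConnectInst.

```python
# Allowed values as listed in the option descriptions above.
ALLOWED = {
    "strictSegmentation": {True, False},
    "case": {"lower", "both"},
    "toneMarkType": {"diacritics", "numbers", "none"},
    "missingToneMark": {"noinfo", "ignore"},
    "strictDiacriticPlacement": {True, False},
    "yaleFirstTone": {"1stToneLevel", "1stToneFalling"},
}


def check_options(**options):
    """Validate a set of keyword options against the documented values."""
    for name, value in options.items():
        if name not in ALLOWED:
            raise ValueError("unknown option: %s" % name)
        if value not in ALLOWED[name]:
            raise ValueError("invalid value %r for option %s" % (value, name))
    # yaleFirstTone can only be used for tone mark type 'numbers'.
    if "yaleFirstTone" in options and options.get("toneMarkType") != "numbers":
        raise ValueError("yaleFirstTone requires toneMarkType='numbers'")
    return options


settings = check_options(toneMarkType="numbers", yaleFirstTone="1stToneFalling")
```

In this sketch a missing toneMarkType is treated as "not numbers", which is an assumption about defaults that the text above does not state.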

Names of tones used in the romanisation.

Mapping of tone name to representation per tone mark type. Representations include a diacritic mark and optionally the letter ‘h’ marking a low tone.

The 'internal' dialect is used for conversion between different forms of Cantonese Yale. Conversion to the other dialects can lose information (diacritics: missing tone; numbers: distinction between high level and high falling; none: no tones at all), whereas conversion to this dialect retains all information, so it can be used as a standard target reading.

classmethod getDefaultOptions()

Splits the given plain syllable into onset (initial), nucleus and coda, the latter building the rhyme (final).

The syllabic nasals m, ng will be returned as coda. Syllables yu, yun, yut will fall into (y, yu, ), (y, yu, n) and (y, yu, t).

Returned strings will be lowercase.

Parameter: plainSyllable (str) – syllable in the Yale romanisation system without tone marks
Return type: tuple of str
Returns: tuple of syllable onset, nucleus and coda
Raises InvalidEntityError: if the entity is invalid (e.g. syllable nucleus or tone invalid).
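The documented behaviour for the syllabic nasals and for yu, yun, yut can be reproduced with a short sketch. This is not cjklib's implementation: the inventories of initials and codas are simplified assumptions, vowel off-glides are kept in the nucleus by assumption, ValueError stands in for InvalidEntityError, and the function name is hypothetical.

```python
# Assumed inventories; longer initials come first so that "gw" wins over "g".
INITIALS = ("gw", "kw", "ng", "ch", "b", "p", "m", "f", "d", "t",
            "n", "l", "g", "k", "h", "w", "j", "s", "y")
CODAS = ("ng", "m", "n", "p", "t", "k")
SYLLABIC_NASALS = ("m", "ng")


def split_onset_nucleus_coda(plain_syllable):
    """Split a toneless syllable into (onset, nucleus, coda), all lowercase."""
    syllable = plain_syllable.lower()
    if syllable in SYLLABIC_NASALS:
        return ("", "", syllable)  # syllabic nasals are returned as coda
    onset = next((i for i in INITIALS if syllable.startswith(i)), "")
    rest = syllable[len(onset):]
    # Documented special case: yu, yun, yut keep "yu" as the nucleus.
    if onset == "y" and rest.startswith("u") and rest[1:] in ("", "n", "t"):
        return ("y", "yu", rest[1:])
    coda = next((c for c in CODAS if rest.endswith(c) and len(rest) > len(c)), "")
    nucleus = rest[:len(rest) - len(coda)] if coda else rest
    if not nucleus or any(ch not in "aeiouy" for ch in nucleus):
        raise ValueError("invalid syllable: %r" % (plain_syllable,))
    return (onset, nucleus, coda)
```

For example, split_onset_nucleus_coda("gwong") separates the labialised initial gw from the nucleus o and the coda ng.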


  • Impl: Finals ing, ik, ung, uk, eun, eut, a differ from other finals with same vowels. What semantics/view do we want to provide on the syllable parts?

Splits the given plain syllable into onset (initial) and rhyme (final).

The syllabic nasals m, ng will be returned as final. Syllables yu, yun, yut will fall into (y, yu, ), (y, yu, n) and (y, yu, t).

Returned strings will be lowercase.

Parameter: plainSyllable (str) – syllable without tone marks
Return type: tuple of str
Returns: tuple of entity onset and rhyme
Raises InvalidEntityError: if the entity is invalid.
getPlainReadingEntities(*args, **kwargs)
getReadingCharacters(*args, **kwargs)
getTonalEntity(plainEntity, tone)


  • Lang: Place the tone mark on the first character of the nucleus?
getTones(*args, **kwargs)
classmethod guessReadingDialect(readingString, includeToneless=False)

Takes a string written in Cantonese Yale and guesses the reading dialect.

Currently only the option 'toneMarkType' is guessed. Unless 'includeToneless' is set to True, only the tone mark types 'diacritics' and 'numbers' are considered, as the latter can also represent the state of missing tones.

Parameters:
  • readingString (str) – Cantonese Yale string
  • includeToneless (bool) – if set to True, the option 'toneMarkType' can take on the value 'none'; by default (i.e. set to False) toneless input is covered by the tone mark type 'numbers'.
Returns: dictionary of basic keyword settings
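A much simplified heuristic shows the shape of such a guess. It is an assumption-laden sketch rather than the library's algorithm: it merely looks for combining diacritics and tone digits, and the function name is hypothetical.

```python
import unicodedata


def guess_tone_mark_type(reading_string, include_toneless=False):
    """Guess 'toneMarkType' from the characters found in the string."""
    decomposed = unicodedata.normalize("NFD", reading_string)
    if any(unicodedata.combining(char) for char in decomposed):
        return {"toneMarkType": "diacritics"}
    has_numbers = any(char in "123456" for char in decomposed)
    if has_numbers or not include_toneless:
        # 'numbers' also covers input without any tone marks by default.
        return {"toneMarkType": "numbers"}
    return {"toneMarkType": "none"}
```

Note that a real diacritic string may contain no diacritic at all (mid level syllables carry none), so a character-based guess like this one can be wrong for short inputs.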


hasStopTone(plainEntity)

Checks if the given plain syllable can occur with stop tones, which is the case for syllables with unreleased finals.

Parameter: plainEntity (str) – entity without tonal information
Return type: bool
Returns: True if given syllable can occur with stop tones, False otherwise
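As a minimal sketch of this check, assuming that the unreleased finals are exactly those ending in p, t or k (the function below is hypothetical and not taken from cjklib):

```python
UNRELEASED_FINALS = ("p", "t", "k")  # assumption: stop codas of Cantonese


def has_stop_tone(plain_entity):
    """Return True if the toneless syllable ends in an unreleased stop."""
    return plain_entity.lower().endswith(UNRELEASED_FINALS)
```

Syllables such as "sik" or "yat" pass the check, while "ng" and "gwong" do not, since they end in g rather than k.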
isToneValid(plainEntity, tone)

Checks if the given plain entity and tone combination is valid.

Only syllables with unreleased finals occur with stop tones, other forms must not (see hasStopTone()).

Parameters:
  • plainEntity (str) – entity without tonal information
  • tone (str) – tone
Return type: bool
Returns: True if given combination is valid, False otherwise


splitEntityTone(entity)

Splits the entity into an entity without tone mark and the entity’s tone index.

The plain entity returned will always be in Unicode’s Normalization Form C (NFC).

Parameter: entity (str) – entity with tonal information
Return type: tuple
Returns: plain entity without tone mark and entity’s tone index (starting with 1)
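For the tone mark type 'numbers' the split can be sketched as follows. This is an illustration only: it ignores diacritics, models the 'missingToneMark' option described above with a plain argument, and raises ValueError where the library would signal an invalid entity. The function name is hypothetical.

```python
import unicodedata


def split_entity_tone(entity, missing_tone_mark="noinfo"):
    """Split a number-marked entity into (plain entity, tone index)."""
    entity = unicodedata.normalize("NFC", entity)  # plain entity is NFC
    if entity and entity[-1] in "123456":
        return entity[:-1], int(entity[-1])
    if missing_tone_mark == "noinfo":
        return entity, None  # no tone information deduced
    raise ValueError("no tone mark found in %r" % (entity,))
```

With 'noinfo' a bare syllable comes back with the tone None, while 'ignore' makes the same input invalid.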

Regex to split a string in NFD into several syllables in a crude way. The regular expression works for both diacritical and number tone marks. It consists of:

  • nasal syllables,
  • initial consonants,
  • vowels including diacritics,
  • tone mark h,
  • final consonants,
  • tone numbers.
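A regular expression built from these six parts might look like the sketch below. The concrete character sets are assumptions made for illustration and are not copied from cjklib; the helper split_crudely() is hypothetical and returns the pieces in NFC for easier comparison.

```python
import re
import unicodedata

# Combining grave, acute and macron are the assumed Yale tone diacritics.
SYLLABLE_REGEX = re.compile(r"""
    (?:
        (?:m|ng)[\u0300\u0301\u0304]?h?(?![aeiou])   # nasal syllables
      |
        (?:gw|kw|ng|ch|[bpmfdtnlgkhwjsy])?           # initial consonants
        (?:[aeiou][\u0300\u0301\u0304]?)+            # vowels including diacritics
        h?                                           # tone mark h
        (?:ng|[mnptk])?                              # final consonants
    )
    [1-6]?                                           # tone numbers
    """, re.VERBOSE | re.IGNORECASE)


def split_crudely(reading):
    """Split a reading string into syllable candidates using the regex."""
    decomposed = unicodedata.normalize("NFD", reading)
    return [unicodedata.normalize("NFC", match)
            for match in SYLLABLE_REGEX.findall(decomposed)]
```

The split is crude by design: because the optional h is matched greedily, a sequence such as a vowel followed by an h-initial syllable can be cut in the wrong place, which is why a proper segmentation step is still needed afterwards.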