
Using CLDF for descriptive linguistics: questions and solutions

I am a strong proponent of CLDF (Forkel et al. 2018), the cross-linguistic data format developed at the MPI. It was conceived with extensibility in mind, meaning that custom specifications and components can be added. This should make it possible to represent many kinds of linguistic information. At the moment, there are two well-established modules: Wordlist for historical linguistics and StructureDataset for linguistic typology. There are also the less well-established ParallelText and Dictionary modules. Finally, there is the Generic module, offering a home for linguistic data that fits none of the above.

I have used the CLDF-CLLD system extensively, most notably in my comparative work on Cariban and in my work towards a digital grammar framework. I have run into various issues when encoding data in CLDF, mostly related to descriptive linguistics and the representation of aspects of the synchronic analysis of a given language. One main goal of the digital grammar project is to encode the entire dataset serving as the basis for the grammar in CLDF. This post serves as an overview of questions and solutions I have come across in that endeavour, and will be updated as I further develop the approach.


The current state of my implementation can be viewed here.


  • Should the glosses in Motivation_Structure be segmented in the Form columns? Or how do we know what gloss corresponds to what part of the form? (GitHub discussion)
  • How can one model allomorphy? (GitHub discussion)
  • How to model more abstract entities like morphemes and lexemes? Current approach: use a separate morpheme table, potentially also usable for lexemes.
  • Do we need to model phonemes?
  • How to model meaning?

Entities and properties to be represented

These are various concepts from descriptive linguistics which one would want to have at one's disposal when writing a grammar.


Morphemes

  • inherently abstract analytical concept
  • morpheme type (root, affix...)
  • have a meaning
  • allomorphs are specific, tangible forms, potentially a separate model?


Lexemes

  • also abstract
  • potentially morphologically complex
  • related to word forms
  • have a meaning

Word forms

  • not (that) abstract, tangible forms
  • belong to a lexeme
  • can have a stem + inflectional morphology
  • stem can have derivational morphology
  • have a meaning

Interlinear examples

  • has at the least a form and a translation
  • usually also a segmented object language line and corresponding meta language glossing line
  • potentially additional lines
  • also metadata like source, speaker, context, text


Meaning

Meaning is tough. It should be possible to ascribe meaning to single morphemes (where it can also be called "function"), lexemes, and word forms. On the other hand, interlinear examples have a 'translation', not a meaning; where is the boundary? See below for an example.

Parts of speech

  • a property of lexemes
  • subgroups may be needed

Further properties

This list is representative of the kind of properties one may want to represent, though highly dependent on the language (and analysis):

  • transitivity of verbs
  • alienability of nouns
  • inflection classes
  • semantic fields

Available CLDF elements


The CLDF form property, a "written denotation[...] of the linguistic sign" (GOLD counterpart), is clearly geared towards comparative linguistics and the Wordlist module:

A lexical unit is any collection of word forms corresponding to a certain meaning which can be found in comparative datasets.

Ideally, a lexical unit would just present itself as one single form. However, in practice, scholars often list speech variants and at times even non-cognate alternatives for their preferred form.

The name makes clear that it is a concrete form found in a particular language, not an abstract concept (like a morpheme or lexeme). This is in line with the FormTable containing a Segments column, representing the form as a sequence of phonemes. Additional columns optionally contain information about morphological complexity:

  • Motivation_Structure contains a space-separated gloss of morphologically complex forms
  • Root contains the root of a form
  • Stem contains the stem of a form.

In these columns, only some morphemes (Root) are represented as simple strings, and the structure of a Form must be inferred from the combination of these.

This makes it possible to represent morphologically complex word forms. For example, the Wayana form ɨhiktei 'I am going to urinate' (Tavares 2005: 244) would be represented as follows:

| Language_ID | Form | Segments | Motivation_Structure | Root | Stem |
| waya1269 | ɨhiktei | ɨ h i k t e i | 1 urine VBZ NPST SAP | hiku | hikta |

Some issues and questions come to mind:

  • Meaning is usually represented as Parameter_ID in Wordlists, but it does not seem to make sense for such elaborate word forms, since the meaning is composed of the meanings of the participating morphemes. Maybe a column Translation would suffice? There is also the Sense of the Dictionary module and the FunctionalEquivalentSet of the ParallelText module. Further, Parameter_ID is used in StructureDatasets in a different way.
  • There is no designated column for the underlying structure (ɨ-hiku-ta-ja-he), although there is for the glossing corresponding to these morphemes
  • The Root and Stem columns would ideally also reference entities (a root morpheme hiku 'urine' and a lexeme hikta 'to urinate', respectively), instead of just strings.
  • The first two issues would be solved by representing the form as an interlinear example.
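The second issue can be made concrete with a short sketch. It assumes a hypothetical extra column, here called Underlying (not part of the CLDF specification), holding the hyphen-segmented form, and pairs each morph with the corresponding gloss from Motivation_Structure. This is only an illustration of the problem, not a proposal for the standard.

```python
# One FormTable row as a plain dict.
# "Underlying" is a hypothetical column, not part of CLDF.
row = {
    "Language_ID": "waya1269",
    "Form": "ɨhiktei",
    "Motivation_Structure": "1 urine VBZ NPST SAP",
    "Underlying": "ɨ-hiku-ta-ja-he",
}


def pair_morphs_with_glosses(row):
    """Pair hyphen-separated morphs with space-separated glosses."""
    morphs = row["Underlying"].split("-")
    glosses = row["Motivation_Structure"].split(" ")
    if len(morphs) != len(glosses):
        raise ValueError(f"{len(morphs)} morphs but {len(glosses)} glosses")
    return list(zip(morphs, glosses))


pairs = pair_morphs_with_glosses(row)
print(pairs)
```

Without such a column, the pairing has to be guessed from the surface form, which is exactly what the first open question above is about.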

Conceivably, one could also model abstract entities like morphemes and lexemes (as well as allomorphs) as Forms: Allomorphs would need a column Morpheme_ID referencing the morpheme; a morpheme would have some aggregation of its allomorphs as its Form (like -ja(h)i). Lexemes would need a Morpheme_IDs column referencing the morphemes of which they are composed. Those could be - (<>, ~) separated in the Form column. Morphemes would probably need some information like Morpheme_Type.
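A minimal sketch of what such tables could look like, using plain Python dicts in place of CSV rows. Only the column names Morpheme_ID, Morpheme_IDs and Morpheme_Type come from the paragraph above; the IDs, the type labels, and the allomorph hik (read off the stem hikta) are illustrative assumptions, and the helper functions are hypothetical.

```python
# Hypothetical rows; IDs and Morpheme_Type values are assumptions.
morphemes = [
    {"ID": "hiku", "Form": "hiku", "Morpheme_Type": "root", "Meaning": "urine"},
    {"ID": "ta", "Form": "-ta", "Morpheme_Type": "suffix", "Meaning": "VBZ"},
]
allomorphs = [
    {"ID": "hiku-1", "Form": "hiku", "Morpheme_ID": "hiku"},
    {"ID": "hiku-2", "Form": "hik", "Morpheme_ID": "hiku"},
    {"ID": "ta-1", "Form": "ta", "Morpheme_ID": "ta"},
]
lexemes = [
    {"ID": "hikta", "Form": "hik-ta", "Morpheme_IDs": ["hiku", "ta"],
     "Meaning": "to urinate"},
]


def dangling_references(morphemes, allomorphs, lexemes):
    """Return (row ID, morpheme ID) pairs that point to no known morpheme."""
    known = {m["ID"] for m in morphemes}
    missing = [(a["ID"], a["Morpheme_ID"])
               for a in allomorphs if a["Morpheme_ID"] not in known]
    for lexeme in lexemes:
        missing += [(lexeme["ID"], mid)
                    for mid in lexeme["Morpheme_IDs"] if mid not in known]
    return missing


def allomorphs_of(morpheme_id):
    """Collect the concrete forms that realize one abstract morpheme."""
    return [a["Form"] for a in allomorphs if a["Morpheme_ID"] == morpheme_id]


print(dangling_references(morphemes, allomorphs, lexemes))
print(allomorphs_of("hiku"))
```

The point of the sketch is that abstract entities (morphemes, lexemes) and tangible ones (allomorphs) end up in the same kind of row, distinguished only by which reference columns they carry.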

Interlinear text

The example component provides good support for interlinear examples. The lines which are to be aligned are tab-separated by default, meaning that object and gloss words are read as lists. In an approach where morphemes are modelled as well, one would definitely want to include morpheme IDs in interlinear examples. What I have used with success are two additional columns: Morpheme_IDs, a ;-separated list of IDs, and Identified_Positions, a ;-separated list of boolean values stating whether or not the morpheme at position X is featured in Morpheme_IDs. Ideally, the latter column would not be needed, but as of now not all morphemes are successfully identified in exports of FLEx or Toolbox glossed texts. A similar process may be necessary for uniparser, though there only the order of morphemes needs to be established. Another issue is of course that the application rendering an example (and inserting morpheme links) needs to be able to split the - (=, <>, ~) segmented strings in the object line into single morphemes.
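How the two extra columns could be consumed is sketched below. The serialization of the booleans as "true"/"false", the example row, and the function names are assumptions made for illustration; the splitting simply treats -, =, ~, < and > as morpheme boundaries.

```python
import re

# Boundary symbols used in the object line: - = ~ and the infix brackets < >.
MORPH_BOUNDARY = re.compile(r"[-=~<>]")


def split_morphs(analyzed_words):
    """Flatten a list of segmented words into a list of single morphs."""
    morphs = []
    for word in analyzed_words:
        morphs.extend(m for m in MORPH_BOUNDARY.split(word) if m)
    return morphs


def link_morphs(example):
    """Pair every morph with its morpheme ID, or None if unidentified."""
    morphs = split_morphs(example["Analyzed_Word"])
    ids = example["Morpheme_IDs"].split(";")
    flags = [f == "true" for f in example["Identified_Positions"].split(";")]
    if len(flags) != len(morphs) or sum(flags) != len(ids):
        raise ValueError("columns do not line up with the object line")
    remaining = iter(ids)
    return [(m, next(remaining) if flag else None)
            for m, flag in zip(morphs, flags)]


# Hypothetical example row: only two of the five morphs were identified.
example = {
    "Analyzed_Word": ["ɨ-hiku-ta-ja-he"],
    "Morpheme_IDs": "hiku;ta",
    "Identified_Positions": "false;true;true;false;false",
}
linked = link_morphs(example)
print(linked)
```

A rendering application could then turn every pair with an ID into a link and leave the others as plain text.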

Leipzig Glossing Rules

While the Leipzig glossing rules are of course no full model of morphology, the elements described in them are commonplace enough to at least consider modeling in a digital grammaticography workflow. They need support in various places of the pipeline between analysis and grammar prose.

  1. Word-by-word alignment
    • primarily the responsibility of the GUI, in this case existing functionality in CLLD
    • but actually what is aligned are p-words, or -- if you don't believe in that -- at least some sort of p-units
    • OTOH, also g-units (see below) are aligned this way, unless there are = in a segment
  2. Morpheme-by-morpheme correspondence
    1. Hyphens delimit morphemes (or morphs?)
    2. same number: should be checked by pyIGT
    3. Unmodified text:
      • primaryText in CLDF
      • FLEx? Toolbox? Uniparser?
      • CLLD?
    4. Clitic (g-unit) boundaries are marked with =
      • implicitly: g-unit boundaries when coinciding with morpheme boundary, otherwise with )
    5. (2A): p-unit boundaries within the g-unit boundaries segmented by that coincide with morpheme boundaries are marked with a hyphen and a space ("anticlitics")
    6. some sort of p-unit and some sort of g-unit not implemented in CLDF
    7. no known applications
    8. could have definitions by language/dataset
    9. would be useful for corpus research
    10. IGTs could be parsed for g- and p-constituency
  3. Grammatical category labels
    1. use abbreviations for grammatical morphs (not encoded):
    2. standard abbreviations:
      • defined anywhere?
      • related: how to parse glossing abbreviations? clld has something?
    3. deviations: defining own abbreviations possible, but leading to more general question of: how is meaning modeled?
  4. One-to-many:
    • . separating lowercase strings from glosses
    • but . separating lowercase strings means 'multiple words in metalanguage'
    • underscore for second use
    • ; for first use
    • : for obfuscated morphology
    • backslash for morphophonological change
      • morphs ("variants"?) related by process
        • processes are not modeled
    • "2DU>3SG" == "2DU.P.3SG.P"
  5. Non-separation of person and number (see above)
    1. "1SG." == "1sg"
  6. Non-overt elements:

    • It is clear that:

    puer boy[NOM.SG]


    puer-∅ boy-NOM.SG

but what about:


?

  7. Inherent categories
    • ()-segmented glosses (or other labels, too?) at end of morph-gloss indicate some category (how to model?)
  8. Bipartite elements
  9. infixes
  10. reduplication
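The "same number" requirement of rule 2 and the bracket notation for non-overt elements from rule 6 can be sketched in a few lines. This is a plain-Python illustration under the assumption that only - and = count as boundaries; it is not the pyIGT API, and the function names are made up for this post.

```python
import re


def count_morphs(word):
    """Number of units delimited by hyphens or clitic boundaries (=)."""
    return len(re.split(r"[-=]", word))


def check_alignment(object_words, gloss_words):
    """Return a list of human-readable problems; empty if all is well."""
    if len(object_words) != len(gloss_words):
        return [f"{len(object_words)} object words vs. "
                f"{len(gloss_words)} gloss words"]
    problems = []
    for obj, gloss in zip(object_words, gloss_words):
        if count_morphs(obj) != count_morphs(gloss):
            problems.append(f"{obj!r} and {gloss!r} differ in number of morphs")
    return problems


def non_overt_categories(gloss):
    """Categories given in square brackets, i.e. without an overt exponent."""
    match = re.search(r"\[([^\]]+)\]", gloss)
    return match.group(1).split(".") if match else []


print(check_alignment(["puer-∅"], ["boy-NOM.SG"]))
print(check_alignment(["puer"], ["boy[NOM.SG]"]))
print(check_alignment(["puer-∅"], ["boy"]))
print(non_overt_categories("boy[NOM.SG]"))
```

Both notations for puer pass the check, which shows that a count-based check alone cannot tell whether a dataset uses zero morphs or brackets; that decision has to be recorded somewhere else.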

Postdoc project on Yawarana!

I am delighted to announce that I have received a grant from the SNSF for a 2-year Postdoc.Mobility project at the University of Oregon. Together with Spike Gildea and Natalia Arandia Cáceres, I will be working with a corpus of Yawarana, a moribund and underdescribed Cariban language spoken in Venezuela. The plan is to produce a digital grammar sketch and some corpus studies. This is exciting! Updates will follow.

New pyradigms version!

Fresh off the GitHub press: a new version of pyradigms. Give it a go if you're handling paradigms (or should be)!

pyradigms is a Python package that allows you to compose and decompose three-dimensional paradigms from and into lists with parameters. I've rewritten it from the ground up, as I realized that I can use pandas for the core functionality -- grouping and pivoting of dataframes. To give you an idea of what it does, it basically takes lists like this:

Person Number Verb Mood Form
1 SG kon- IND konün
1 SG kon- SBJV konli
1 SG kon- IMP konchi
1 DU kon- IND koniyu
1 DU kon- SBJV konliyu
1 PL kon- IND koniyiñ
1 PL kon- SBJV konliyiñ
2 SG kon- IND konimi
2 SG kon- SBJV konülmi
2 SG kon- IMP konnge
2 DU kon- IND konimu
2 DU kon- SBJV konülmu
2 DU kon- IMP konmu
2 PL kon- IND konimün
2 PL kon- SBJV konülmün
2 PL kon- IMP konmün
3 SG kon- IND koni
3 SG kon- SBJV konle
3 SG kon- IMP konpe
3 DU kon- IND koningu
3 DU kon- SBJV konle engu
3 DU kon- IMP konpe engu
3 PL kon- IND koningün
3 PL kon- SBJV konle engün
3 PL kon- IMP konpe engün
1 SG pi- IND pin
1 SG pi- SBJV pili
1 SG pi- IMP pichi
1 DU pi- IND piyu
1 DU pi- SBJV piliyu
1 PL pi- IND piiñ
1 PL pi- SBJV piliyiñ
2 SG pi- IND pimi
2 SG pi- SBJV pilmi
2 SG pi- IMP pinge
2 DU pi- IND pimu
2 DU pi- SBJV pilmu
2 DU pi- IMP pimu
2 PL pi- IND pimün
2 PL pi- SBJV pilmün
2 PL pi- IMP pimün
3 SG pi- IND pi
3 SG pi- SBJV pile
3 SG pi- IMP pipe
3 DU pi- IND pingu
3 DU pi- SBJV pile engu
3 DU pi- IMP pipe engu
3 PL pi- IND pingün
3 PL pi- SBJV pile engün
3 PL pi- IMP pipe engün
1 SG tripa- IND tripan
1 SG tripa- SBJV tripali
1 SG tripa- IMP tripachi
1 DU tripa- IND tripayu
1 DU tripa- SBJV tripaliyu
1 PL tripa- IND tripaiñ
1 PL tripa- SBJV tripaliyiñ
2 SG tripa- IND tripaymi
2 SG tripa- SBJV tripalmi
2 SG tripa- IMP tripange
2 DU tripa- IND tripaymu
2 DU tripa- SBJV tripalmu
2 DU tripa- IMP tripamu
2 PL tripa- IND tripaymün
2 PL tripa- SBJV tripalmün
2 PL tripa- IMP tripamün
3 SG tripa- IND tripay
3 SG tripa- SBJV tripale
3 SG tripa- IMP tripape
3 DU tripa- IND tripayngu
3 DU tripa- SBJV tripale engu
3 DU tripa- IMP tripape engu
3 PL tripa- IND tripayngün
3 PL tripa- SBJV tripale engü
3 PL tripa- IMP tripape engün

to this:

| tripa- | 1SG | 1DU | 1PL | 2SG | 2DU | 2PL | 3SG | 3DU | 3PL |
| IND | tripan | tripayu | tripaiñ | tripaymi | tripaymu | tripaymün | tripay | tripayngu | tripayngün |
| SBJV | tripali | tripaliyu | tripaliyiñ | tripalmi | tripalmu | tripalmün | tripale | tripale engu | tripale engü |
| IMP | tripachi | | | tripange | tripamu | tripamün | tripape | tripape engu | tripape engün |

or this:

IND kon- tripa- pi-
1SG konün tripan pin
1DU koniyu tripayu piyu
1PL koniyiñ tripaiñ piiñ
2SG konimi tripaymi pimi
2DU konimu tripaymu pimu
2PL konimün tripaymün pimün
3SG koni tripay pi
3DU koningu tripayngu pingu
3PL koningün tripayngün pingün

or, if you wanted, even something like this:

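| Person | Verb | IND.SG | IND.DU | IND.PL | SBJV.SG | SBJV.DU | SBJV.PL | IMP.SG | IMP.DU | IMP.PL |
| 1 | pi- | pin | piyu | piiñ | pili | piliyu | piliyiñ | pichi | | |
| 1 | kon- | konün | koniyu | koniyiñ | konli | konliyu | konliyiñ | konchi | | |
| 1 | tripa- | tripan | tripayu | tripaiñ | tripali | tripaliyu | tripaliyiñ | tripachi | | |
| 2 | pi- | pimi | pimu | pimün | pilmi | pilmu | pilmün | pinge | pimu | pimün |
| 2 | kon- | konimi | konimu | konimün | konülmi | konülmu | konülmün | konnge | konmu | konmün |
| 2 | tripa- | tripaymi | tripaymu | tripaymün | tripalmi | tripalmu | tripalmün | tripange | tripamu | tripamün |
| 3 | pi- | pi | pingu | pingün | pile | pile engu | pile engün | pipe | pipe engu | pipe engün |
| 3 | kon- | koni | koningu | koningün | konle | konle engu | konle engün | konpe | konpe engu | konpe engün |
| 3 | tripa- | tripay | tripayngu | tripayngün | tripale | tripale engu | tripale engü | tripape | tripape engu | tripape engün |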
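To show the idea behind composing, here is a minimal sketch of the pivoting step in plain Python. It is not the pyradigms API (which builds on pandas); the function name compose and its parameters are assumptions, and only a handful of the pi- forms from the list above are used.

```python
from collections import defaultdict

# A few rows from the long list (verb pi- only), as plain dicts.
entries = [
    {"Person": "1", "Number": "SG", "Verb": "pi-", "Mood": "IND", "Form": "pin"},
    {"Person": "1", "Number": "DU", "Verb": "pi-", "Mood": "IND", "Form": "piyu"},
    {"Person": "2", "Number": "SG", "Verb": "pi-", "Mood": "IND", "Form": "pimi"},
    {"Person": "1", "Number": "SG", "Verb": "pi-", "Mood": "SBJV", "Form": "pili"},
    {"Person": "2", "Number": "SG", "Verb": "pi-", "Mood": "SBJV", "Form": "pilmi"},
]


def compose(entries, rows, columns, value="Form"):
    """Pivot a long list into a nested dict: {row label: {column label: form}}."""
    table = defaultdict(dict)
    for entry in entries:
        row_label = "".join(entry[key] for key in rows)
        column_label = "".join(entry[key] for key in columns)
        table[row_label][column_label] = entry[value]
    return dict(table)


# Moods as rows, person-number combinations as columns.
paradigm = compose(entries, rows=["Mood"], columns=["Person", "Number"])
for mood, cells in paradigm.items():
    print(mood, cells)
```

Swapping the rows and columns arguments gives the transposed layout; adding a third parameter (here Verb) as a filter or grouping key yields one table per verb.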

Vice versa, it also allows you to take a traditional linguistic paradigm (or something laid out like it) like this:

1 2 3M 3F 3N
NOM.SG ég þú hann hún það
ACC.SG mig þig hann hana það
DAT.SG mér þér honum henni því
GEN.SG mín þín hans hennar þess
NOM.PL við þið þeir þær þau
ACC.PL okkur ykkur þá þær þau
DAT.PL okkur ykkur þeim þeim þeim
GEN.PL okkar ykkar þeirra þeirra þeirra

to a list like this:

Person Gender Case Number Value
1 NOM SG ég
1 ACC SG mig
1 DAT SG mér
1 GEN SG mín
1 NOM PL við
1 ACC PL okkur
1 DAT PL okkur
1 GEN PL okkar
2 NOM SG þú
2 ACC SG þig
2 DAT SG þér
2 GEN SG þín
2 NOM PL þið
2 ACC PL ykkur
2 DAT PL ykkur
2 GEN PL ykkar
3 M NOM SG hann
3 M ACC SG hann
3 M DAT SG honum
3 M GEN SG hans
3 M NOM PL þeir
3 M ACC PL þá
3 M DAT PL þeim
3 M GEN PL þeirra
3 F NOM SG hún
3 F ACC SG hana
3 F DAT SG henni
3 F GEN SG hennar
3 F NOM PL þær
3 F ACC PL þær
3 F DAT PL þeim
3 F GEN PL þeirra
3 N NOM SG það
3 N ACC SG það
3 N DAT SG því
3 N GEN SG þess
3 N NOM PL þau
3 N ACC PL þau
3 N DAT PL þeim
3 N GEN PL þeirra
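The reverse direction can be sketched in the same spirit. Again, this is plain Python for illustration and not the pyradigms API; the function name decompose and the way the labels are split (a dot between case and number, a digit followed by an optional gender letter) are assumptions that happen to fit this particular table.

```python
import re

# Part of the pronoun table above: {row label: {column label: form}}.
wide = {
    "NOM.SG": {"1": "ég", "2": "þú", "3M": "hann"},
    "ACC.SG": {"1": "mig", "2": "þig", "3M": "hann"},
}


def decompose(wide):
    """Unpivot a table into one dict per cell, splitting the labels."""
    entries = []
    for row_label, cells in wide.items():
        case, number = row_label.split(".")
        for column_label, value in cells.items():
            # "3M" -> person "3", gender "M"; "1" -> person "1", gender "".
            person, gender = re.fullmatch(r"(\d)([A-Z]*)", column_label).groups()
            entries.append({"Person": person, "Gender": gender,
                            "Case": case, "Number": number, "Value": value})
    return entries


long_list = decompose(wide)
for entry in long_list:
    print(entry)
```

The entries come out row by row rather than sorted by person as in the list above; sorting them afterwards is a separate step.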

New article on Arara (and Pekodian)

My article in Cadernos de Etnolingüística has finally been published. Writing it was quite the wild ride: at first, I was just trying to figure out correspondences between Arara and Ikpeng TAM endings. But then I realized that the Arara forms with third person n- were the ones with original main clause TAM suffixes, while the ones with i/∅ or t- used to be subordinate clauses. This sudden realization led me down a rabbit hole of Pekodian verbal person prefixes and TAM suffixes, culminating in this paper. Its main contribution is an answer to the question of why the Pekodian languages have verb forms without *n-, an absence that Meira et al. (2010) hypothesized to result from *n- being a later addition.