Using CLDF for descriptive linguistics: questions and solutions

I am a strong proponent of CLDF (Forkel et al. 2018), the Cross-Linguistic Data Formats initiative developed at the MPI. It was conceived with extensibility in mind, meaning that custom specifications and components can be added. This should make it possible to represent many kinds of linguistic information. At the moment, there are two well-established modules: the Wordlist for historical linguistics and the StructureDataset for linguistic typology. There are also the less well-established ParallelText and Dictionary modules. Finally, there is the Generic module, offering a home for linguistic data that fits none of the above.

I have used the CLDF-CLLD system extensively, most notably in my comparative work on Cariban and my work towards a digital grammar framework. I have encountered various issues when encoding data in CLDF, mostly related to descriptive linguistics and the representation of various aspects of the synchronic analysis of a given language. One main goal of the digital grammar project is to encode the entire dataset serving as the basis for the grammar in CLDF. This post serves as an overview of questions and solutions I have come across in that endeavour, and will be updated as I further develop the approach.

Implementation

The current state of my implementation can be viewed here.

Questions

  • Should the glosses in Motivation_Structure be segmented in the Form columns? Or how do we know what gloss corresponds to what part of the form? (Github discussion)
  • How can one model allomorphy? (Github discussion)
  • How to model more abstract entities like morphemes and lexemes? Current approach: use a separate morpheme table, potentially also usable for lexemes.
  • Do we need to model phonemes?
  • How to model meaning?

Entities and properties to be represented

These are various concepts from descriptive linguistics which one would want to have at one's disposal when writing a grammar.

Morphemes

  • inherently abstract analytical concept
  • morpheme type (root, affix...)
  • have a meaning
  • allomorphs are specific, tangible forms, potentially a separate model?

Lexemes

  • also abstract
  • potentially morphologically complex
  • related to word forms
  • have a meaning

Word forms

  • not (that) abstract, tangible forms
  • belong to a lexeme
  • can have a stem + inflectional morphology
  • stem can have derivational morphology
  • have a meaning

Interlinear examples

  • have at least a form and a translation
  • usually also a segmented object language line and corresponding meta language glossing line
  • potentially additional lines
  • also metadata like source, speaker, context, text
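In CLDF terms, these properties map onto columns of the examples component (ExampleTable): Primary_Text, Analyzed_Word, Gloss, and Translated_Text, where the two aligned lines are list-valued. A minimal sketch of such a row, using the Wayana form discussed below (the ID and source formatting are invented for illustration):

```python
# A sketch of an interlinear example as a CLDF ExampleTable row.
# Column names follow the CLDF examples component; ID and Source are invented.
row = {
    "ID": "tav-2005-244",
    "Primary_Text": "ɨhiktei",
    "Analyzed_Word": "ɨ-hiku-ta-ja-he",   # tab-separated list of object-language words
    "Gloss": "1-urine-VBZ-NPST-SAP",      # tab-separated list of gloss words
    "Translated_Text": "I am going to urinate",
    "Source": "tavares2005[244]",
}

# The aligned lines are stored as tab-separated lists, so word-by-word
# alignment is a matter of splitting and zipping:
words = row["Analyzed_Word"].split("\t")
glosses = row["Gloss"].split("\t")
aligned = list(zip(words, glosses))
```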

Meaning

Meaning is tough. It should be possible to ascribe meaning (which may also be called "function") to single morphemes, lexemes, and word forms. On the other hand, interlinear examples have a "translation", not a meaning; where is the boundary? See below for an example.

Parts of speech

  • a property of lexemes
  • subgroups may be needed

Further properties

This list is representative of the kind of properties one may want to represent, though it is highly dependent on the language (and the analysis):

  • transitivity of verbs
  • alienability of nouns
  • inflection classes
  • semantic fields

Available CLDF elements

Form

The CLDF Form property, a "written denotation [...] of the linguistic sign" (cf. its GOLD counterpart), is clearly geared towards comparative linguistics and the Wordlist module:

A lexical unit is any collection of word forms corresponding to a certain meaning which can be found in comparative datasets.

Ideally, a lexical unit would just present itself as one single form. However, in practice, scholars often list speech variants and at times even non-cognate alternatives for their preferred form.

The name makes clear that it is a concrete form found in a particular language, not an abstract concept (like a morpheme or lexeme). This is in line with the FormTable containing a Segments column, representing the form as a sequence of phonemes. Additional columns optionally contain information about morphological complexity:

  • Motivation_Structure contains a space-separated gloss of morphologically complex forms
  • Root contains the root of a form
  • Stem contains the stem of a form

In these columns, only some morphemes (the Root) are represented as simple strings, and the structure of a Form must be inferred from the combination of these.

This makes it possible to represent morphologically complex word forms. For example, the Wayana form ɨhiktei 'I am going to urinate' (Tavares 2005: 244) would be represented as follows:

| Language_ID | Form    | Segments      | Motivation_Structure | Root | Stem  |
| ----------- | ------- | ------------- | -------------------- | ---- | ----- |
| waya1269    | ɨhiktei | ɨ h i k t e i | 1 urine VBZ NPST SAP | hiku | hikta |
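This row also illustrates the first question listed at the top of the post: nothing in it links the glosses in Motivation_Structure to stretches of the form. A minimal sketch, using the column values from the table above:

```python
# The FormTable row from the Wayana example, reduced to the relevant columns.
row = {
    "Form": "ɨhiktei",
    "Segments": "ɨ h i k t e i",
    "Motivation_Structure": "1 urine VBZ NPST SAP",
}

segments = row["Segments"].split()              # 7 phonemes
glosses = row["Motivation_Structure"].split()   # 5 morpheme glosses

# 7 segments vs. 5 glosses: without a segmented underlying form
# (ɨ-hiku-ta-ja-he), the gloss-to-form mapping cannot be recovered
# from the row alone.
```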

Some issues and questions come to mind:

  • Meaning is usually represented as Parameter_ID in Wordlists, but it does not seem to make sense for such elaborate word forms, since the meaning is composed of the meanings of the participating morphemes. Maybe a column Translation would suffice? There is also the Sense of the Dictionary module and the FunctionalEquivalentSet of the ParallelText module. Further, Parameter_ID is used in StructureDatasets in a different way.
  • There is no designated column for the underlying structure (ɨ-hiku-ta-ja-he), although there is one for the corresponding glosses.
  • The Root and Stem columns would ideally also reference entities (a root morpheme hiku 'urine' and a lexeme hikta 'to urinate', respectively), instead of just strings.
  • The first two issues would be solved by representing the form as an interlinear example.

Conceivably, one could also model abstract entities like morphemes and lexemes (as well as allomorphs) as Forms: allomorphs would need a column Morpheme_ID referencing the morpheme; a morpheme would have some aggregation of its allomorphs as its Form (like -ja(h)i). Lexemes would need a Morpheme_IDs column referencing the morphemes of which they are composed; these could be separated by - (or <>, ~) in the Form column. Morphemes would probably also need information like Morpheme_Type.
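A sketch of what such a table could look like. The Morpheme_ID, Morpheme_IDs, and Morpheme_Type columns are the custom additions proposed above, not part of the CLDF spec, and the concrete allomorph forms and IDs are invented for illustration:

```python
# Hypothetical rows modelling a morpheme, its allomorphs, and a lexeme
# in a single FormTable-like structure.
rows = [
    # the abstract morpheme, with an aggregation of its allomorphs as its Form
    {"ID": "npst", "Form": "-ja(h)i", "Morpheme_Type": "suffix"},
    # concrete allomorphs pointing back to the morpheme (forms are invented)
    {"ID": "npst-1", "Form": "-jai", "Morpheme_ID": "npst"},
    {"ID": "npst-2", "Form": "-jahi", "Morpheme_ID": "npst"},
    # a lexeme referencing the morphemes of which it is composed
    {"ID": "hikta", "Form": "hiku-ta", "Morpheme_IDs": ["hiku", "vbz"]},
]

# retrieving the allomorphs of a morpheme then becomes a simple lookup
allomorphs = [r for r in rows if r.get("Morpheme_ID") == "npst"]
```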

Interlinear text

The example component provides good support for interlinear examples. The lines to be aligned are tab-separated by default, meaning that object and gloss words are read as lists. In an approach where morphemes are modelled as well, one would definitely want to include morpheme IDs in interlinear examples. What I have used with success are two additional columns: Morpheme_IDs, a ;-separated list of IDs, and Identified_Positions, a ;-separated list of boolean values stating whether or not the morpheme at position X is featured in Morpheme_IDs. Ideally, the latter column would not be needed, but as of now, not all morphemes are successfully identified in exports of FLEx or Toolbox glossed texts. A similar process may be necessary for uniparser, though there only the order of morphemes needs to be established. Another issue is of course that the application rendering an example (and inserting morpheme links) needs to be able to split the - (=, <>, ~) segmented strings in the object line into single morphemes.
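A sketch of how the two columns could be combined when rendering an example. The morphs come from the Wayana example above; the morpheme IDs and the string serialization of the boolean list are invented for illustration:

```python
# Pairing object-line morphs with morpheme IDs via Identified_Positions.
obj_word = "ɨ-hiku-ta-ja-he"
morphs = obj_word.split("-")                       # 5 morphs

# two hypothetical columns: 4 identified IDs, position 5 unidentified
morpheme_ids = "m-1sg;m-urine;m-vbz;m-npst".split(";")
identified = [s == "true" for s in "true;true;true;true;false".split(";")]

# walk through the positions, consuming an ID for every identified morph;
# unidentified morphs get no link
links, ids = [], iter(morpheme_ids)
for morph, known in zip(morphs, identified):
    links.append((morph, next(ids) if known else None))
```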

Leipzig Glossing Rules

While the Leipzig glossing rules are of course no full model of morphology, the elements described in them are commonplace enough to at least consider modeling in a digital grammaticography workflow. They need support in various places of the pipeline between analysis and grammar prose.

  1. Word-by-word alignment
    • primarily the responsibility of the GUI, in this case existing functionality in CLLD
    • but actually what is aligned are p-words, or -- if you don't believe in that -- at least some sort of p-units
    • OTOH, also g-units (see below) are aligned this way, unless there are = in a segment
  2. Morpheme-by-morpheme correspondence
    1. Hyphens delimit morphemes (or morphs?)
    2. same number: should be checked by pyIGT
    3. Unmodified text:
      • primaryText in CLDF
      • FLEx? Toolbox? Uniparser?
      • CLLD?
    4. Clitic (g-unit) boundaries are marked with =
      • implicitly: g-unit boundaries when coinciding with morpheme boundary, otherwise with )
    5. (2A): p-unit boundaries within a g-unit that coincide with morpheme boundaries are marked with a hyphen and a space ("anticlitics")
    6. some sort of p-unit and some sort of g-unit not implemented in CLDF
    7. no known applications
    8. could have definitions by language/dataset
    9. would be useful for corpus research
    10. IGTs could be parsed for g- and p-constituency
  3. Grammatical category labels
    1. use abbreviations for grammatical morphs (not encoded):
    2. standard abbreviations:
      • defined anywhere?
      • related: how to parse glossing abbreviations? clld has something?
    3. deviations: defining own abbreviations possible, but leading to more general question of: how is meaning modeled?
  4. One-to-many:
    • . separating lowercase strings from glosses
    • but . separating lowercase strings means 'multiple words in metalanguage'
    • underscore for second use
    • ; for first use
    • : for obfuscated morphology
    • backslash for morphophonological change
      • morphs ("variants"?) related by process
        • processes are not modeled
    • "2DU>3SG" == "2DU.P.3SG.P"
  5. Non-separation of person and number (see above)
    1. "1SG." == "1sg"
  6. Non-overt elements
    • It is clear that:

      puer
      boy[NOM.SG]

      ==

      puer-∅
      boy-NOM.SG

    • but what about:

      ∅-puer
      NOM.SG-boy

      ?
  7. Inherent categories
    • ()-segmented glosses (or other labels, too?) at the end of a morph gloss indicate some category (how to model?)
  8. Bipartite elements
  9. Infixes
  10. Reduplication
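The splitting problem raised in the section on interlinear text also surfaces here: an application rendering examples needs to break an LGR-segmented word into morphs while remembering which delimiter was used. A minimal sketch with a regular expression that keeps the delimiters:

```python
import re

# Split an LGR-segmented word into morphs, keeping the delimiters
# (- for affixes, = for clitics, ~ for reduplication, < > for infixes).
DELIMITERS = re.compile(r"([-=~<>])")

def split_morphs(word):
    """Return the non-empty parts of an LGR-segmented word."""
    return [part for part in DELIMITERS.split(word) if part]

split_morphs("ɨ-hiku-ta-ja-he")
# → ['ɨ', '-', 'hiku', '-', 'ta', '-', 'ja', '-', 'he']
```

A rendering application could then attach morpheme links to the non-delimiter parts while reproducing the delimiters verbatim.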
