
Using CLDF for descriptive linguistics: questions and solutions

I am a strong proponent of CLDF (Forkel et al. 2018), the cross-linguistic data format developed at the MPI. It was conceived with extensibility in mind, meaning that custom specifications and components can be added. This should make it possible to represent many kinds of linguistic information. At the moment, there are two well-established modules: the Wordlist for historical linguistics and the StructureDataset for linguistic typology. There are also the less well-established ParallelText and Dictionary modules. Finally, there is the Generic module, offering a home for linguistic data that does not fit any of the other modules.

I have used the CLDF-CLLD system extensively, most notably in my comparative work on Cariban, and my work towards a digital grammar framework. I've encountered different issues when encoding something in CLDF, mostly related to descriptive linguistics and the representation of various aspects of the synchronic analysis of a given language in CLDF. One main goal of the digital grammar project is to encode the entire dataset serving as the basis for the grammar in CLDF. This post will serve as an overview of questions and solutions I have come across in that endeavour, and will be updated as I further develop the approach.

Implementation

The current state of my implementation can be viewed here.

Questions

  • Should the glosses in Motivation_Structure be segmented in the Form columns? Or how do we know what gloss corresponds to what part of the form? (Github discussion)
  • How can one model allomorphy? (Github discussion)
  • How to model more abstract entities like morphemes and lexemes? Current approach: use a separate morpheme table, potentially also usable for lexemes.
  • Do we need to model phonemes?
  • How to model meaning?

Entities and properties to be represented

These are various concepts from descriptive linguistics which one would want to have at one's disposal when writing a grammar.

Morphemes

  • inherently abstract analytical concept
  • morpheme type (root, affix...)
  • have a meaning
  • allomorphs are specific, tangible forms, potentially a separate model?

Lexemes

  • also abstract
  • potentially morphologically complex
  • related to word forms
  • have a meaning

Word forms

  • not (that) abstract, tangible forms
  • belong to a lexeme
  • can have a stem + inflectional morphology
  • stem can have derivational morphology
  • have a meaning

Interlinear examples

  • have at least a form and a translation
  • usually also a segmented object language line and corresponding meta language glossing line
  • potentially additional lines
  • also metadata like source, speaker, context, text

Meaning

Meaning is tough. It should be possible to ascribe meaning to single morphemes (where it may also be called "function"), to lexemes, and to word forms. On the other hand, interlinear examples have a 'translation', not a meaning; where is the boundary? See below for an example.

Parts of speech

  • a property of lexemes
  • subgroups may be needed

Further properties

This list is representative of the kind of properties one may want to represent, though highly dependent on the language (and analysis):

  • transitivity of verbs
  • alienability of nouns
  • inflection classes
  • semantic fields

Available CLDF elements

Form

The CLDF form property, a "written denotation[...] of the linguistic sign" (GOLD counterpart), is clearly geared towards comparative linguistics and the Wordlist module:

A lexical unit is any collection of word forms corresponding to a certain meaning which can be found in comparative datasets.

Ideally, a lexical unit would just present itself as one single form. However, in practice, scholars often list speech variants and at times even non-cognate alternatives for their preferred form.

The name makes clear that it is a concrete form found in a particular language, not an abstract concept (like a morpheme or lexeme). This is in line with the FormTable containing a Segments column, representing the form as a sequence of phonemes. Additional columns optionally contain information about morphological complexity:

  • Motivation_Structure contains a space-separated gloss of morphologically complex forms
  • Root contains the root of a form
  • Stem contains the stem of a form

In these columns, only some morphemes (Root) are represented as simple strings, and the structure of a Form must be inferred from the combination of these.

This makes it possible to represent morphologically complex word forms. For example, the Wayana form ɨhiktei 'I am going to urinate' (Tavares 2005: 244) would be represented as follows:

Language_ID | Form    | Segments      | Motivation_Structure | Root | Stem
waya1269    | ɨhiktei | ɨ h i k t e i | 1 urine VBZ NPST SAP | hiku | hikta
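
The row above can be written out and sanity-checked with nothing but the Python standard library. This is only a minimal sketch: the helper names segments_match_form and to_csv are made up for illustration, and a real dataset would also need the CLDF metadata file, which is not shown here.

```python
import csv
import io

# Column names and values are taken from the Wayana example above.
row = {
    "Language_ID": "waya1269",
    "Form": "ɨhiktei",
    "Segments": "ɨ h i k t e i",
    "Motivation_Structure": "1 urine VBZ NPST SAP",
    "Root": "hiku",
    "Stem": "hikta",
}


def segments_match_form(row):
    """Check that the space-separated segments spell out the form."""
    return "".join(row["Segments"].split()) == row["Form"]


def to_csv(rows):
    """Serialise a list of row dicts to CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0]), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv([row])
print(csv_text)
```

Note that nothing in this row says which of the five glosses belongs to which stretch of the seven segments; that gap is what the issues below are about.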

Some issues and questions come to mind:

  • Meaning is usually represented as Parameter_ID in Wordlists, but it does not seem to make sense for such elaborate word forms, since the meaning is composed of the meanings of the participating morphemes. Maybe a column Translation would suffice? There is also the Sense of the Dictionary module and the FunctionalEquivalentSet of the ParallelText module. Further, Parameter_ID is used in StructureDatasets in a different way.
  • There is no designated column for the underlying structure (ɨ-hiku-ta-ja-he), although there is one for the glosses corresponding to these morphemes
  • The Root and Stem columns would ideally also reference entities (a root morpheme hiku 'urine' and a lexeme hikta 'to urinate', respectively), instead of just strings.
  • The first two issues would be solved by representing the form as an interlinear example.
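
To illustrate the second point: if the underlying structure were available as a string, each gloss in Motivation_Structure could be paired with its morpheme. The sketch below assumes such a string exists (CLDF has no designated column for it) and that morphemes are separated by hyphens only; the function name is hypothetical.

```python
def pair_morphs_with_glosses(underlying, motivation_structure):
    """Pair hyphen-separated morphs with space-separated glosses.

    Assumes a one-to-one correspondence; raises if the counts differ.
    """
    morphs = underlying.split("-")
    glosses = motivation_structure.split()
    if len(morphs) != len(glosses):
        raise ValueError(
            f"{len(morphs)} morphs but {len(glosses)} glosses"
        )
    return list(zip(morphs, glosses))


# Underlying structure and glosses of the Wayana form from above.
pairs = pair_morphs_with_glosses("ɨ-hiku-ta-ja-he", "1 urine VBZ NPST SAP")
for morph, gloss in pairs:
    print(morph, gloss)
```

Without the segmented string, the same alignment cannot be recovered from Form and Motivation_Structure alone.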

Conceivably, one could also model abstract entities like morphemes and lexemes (as well as allomorphs) as Forms: Allomorphs would need a column Morpheme_ID referencing the morpheme; a morpheme would have some aggregation of its allomorphs as its Form (like -ja(h)i). Lexemes would need a Morpheme_IDs column referencing the morphemes of which they are composed. In the Form column, these morphemes could be separated by - (or <>, ~). Morphemes would probably need some additional information like Morpheme_Type.
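
The references described here can be sketched as plain Python data classes. Everything below is an assumption made for illustration: the class layout, the IDs, and the check function are not part of CLDF, and the allomorph entries simply reuse the underlying forms from the Wayana example.

```python
from dataclasses import dataclass


@dataclass
class Morpheme:
    id: str
    form: str            # aggregated form of the allomorphs
    morpheme_type: str   # e.g. "root" or "suffix"
    meaning: str


@dataclass
class Allomorph:
    id: str
    form: str
    morpheme_id: str     # reference to a Morpheme


@dataclass
class Lexeme:
    id: str
    form: str            # morphemes separated by "-"
    meaning: str
    morpheme_ids: list   # references to Morphemes, in order


# Hypothetical IDs; forms and glosses follow the example above.
morphemes = {
    m.id: m
    for m in [
        Morpheme("hiku", "hiku", "root", "urine"),
        Morpheme("ta", "-ta", "suffix", "VBZ"),
    ]
}
allomorphs = [
    Allomorph("hiku-1", "hiku", "hiku"),
    Allomorph("ta-1", "ta", "ta"),
]
lexemes = [Lexeme("hikta", "hiku-ta", "to urinate", ["hiku", "ta"])]


def dangling_references(morphemes, allomorphs, lexemes):
    """Return (referrer ID, missing morpheme ID) pairs."""
    missing = []
    for allomorph in allomorphs:
        if allomorph.morpheme_id not in morphemes:
            missing.append((allomorph.id, allomorph.morpheme_id))
    for lexeme in lexemes:
        for morpheme_id in lexeme.morpheme_ids:
            if morpheme_id not in morphemes:
                missing.append((lexeme.id, morpheme_id))
    return missing


print(dangling_references(morphemes, allomorphs, lexemes))
```

Whether these entities live in the FormTable or in separate tables, the same kind of reference check applies.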

Interlinear text

The example component provides good support for interlinear examples. The lines which are to be aligned are tab-separated by default, meaning that object and gloss words are read as lists. In an approach where morphemes are modelled as well, one would definitely want to include morpheme IDs in interlinear examples. What I have used with success are two additional columns: Morpheme_IDs, a ;-separated list of IDs, and Identified_Positions, a ;-separated list of boolean values stating whether or not the morpheme at position X is featured in Morpheme_IDs. Ideally, the latter column would not be needed, but as of now not all morphemes are successfully identified in exports of FLEx or Toolbox glossed texts. A similar process may be necessary for uniparser, though there only the order of morphemes needs to be established. Another issue is of course that the application rendering an example (and inserting morpheme links) needs to be able to split the strings in the object line, segmented with - (=, <>, ~), into single morphemes.
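
A minimal sketch of how a rendering application might combine the two columns is given below. Several things are assumptions: that booleans are stored as the strings "true" and "false", that Morpheme_IDs lists only the identified morphemes in order, and that splitting on the boundary symbols is good enough (an infix in <> would split its host into two pieces). The function names and the IDs are hypothetical.

```python
import re

# Boundary symbols mentioned above: - = ~ and the infix brackets < >
BOUNDARIES = re.compile(r"[-=~<>]")


def split_morphs(word):
    """Split one object-line word into its morphs."""
    return [m for m in BOUNDARIES.split(word) if m]


def link_morphs(analyzed_words, morpheme_ids, identified_positions):
    """Attach a morpheme ID (or None) to every morph of an example."""
    morphs = [m for word in analyzed_words for m in split_morphs(word)]
    flags = [f.strip() == "true" for f in identified_positions.split(";")]
    ids = morpheme_ids.split(";") if morpheme_ids else []
    if len(flags) != len(morphs):
        raise ValueError("one boolean is needed per morph")
    if sum(flags) != len(ids):
        raise ValueError("one ID is needed per identified morph")
    remaining = iter(ids)
    return [
        (morph, next(remaining) if flag else None)
        for morph, flag in zip(morphs, flags)
    ]


# Only the root and the verbalizer are treated as identified here.
linked = link_morphs(
    ["ɨ-hiku-ta-ja-he"],
    "hiku;ta",
    "false;true;true;false;false",
)
for morph, morpheme_id in linked:
    print(morph, morpheme_id)
```

Morphs paired with None would be rendered without a link, which is the situation Identified_Positions is meant to handle.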
