I am a strong proponent of CLDF (Forkel et al. 2018), the cross-linguistic data format developed at the MPI. It was conceived with extensibility in mind, meaning that custom specifications and components can be added. This should make it possible to represent many kinds of linguistic information. At the moment, there are two well-established modules: the Wordlist for historical linguistics and the StructureDataset for linguistic typology. There are also the less well-established ParallelText and Dictionary modules. Finally, there is the Generic module, offering a home for linguistic data that fits none of the above.
I have used the CLDF-CLLD system extensively, most notably in my comparative work on Cariban and my work towards a digital grammar framework. I've encountered various issues when encoding data in CLDF, mostly related to descriptive linguistics and the representation of various aspects of the synchronic analysis of a given language. One main goal of the digital grammar project is to encode the entire dataset serving as the basis for the grammar in CLDF. This post serves as an overview of questions and solutions I have come across in that endeavour, and will be updated as I develop the approach further.
The current state of my implementation can be viewed here.
- Should the glosses in `Motivation_Structure` be segmented like the `Form` column? Or how do we know what gloss corresponds to what part of the form? (Github discussion)
- How can one model allomorphy? (Github discussion)
- How to model more abstract entities like morphemes and lexemes? Current approach: use a separate morpheme table, potentially also usable for lexemes.
- Do we need to model phonemes?
- How to model meaning?
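The gloss-alignment question can be made concrete with a minimal sketch (plain Python; the segmented form and gloss are taken from the Yawarana example discussed later in this post, the pairing logic is only illustrative):

```python
# A segmented object-language line and its space-separated gloss line:
form = "ɨ-hiku-ta-ja-he"
gloss = "1 urine VBZ NPST SAP"

morphs = form.split("-")
glosses = gloss.split()

# Positional pairing only works if both lines are segmented and
# have the same number of units:
assert len(morphs) == len(glosses)
pairs = list(zip(morphs, glosses))
print(pairs)
# [('ɨ', '1'), ('hiku', 'urine'), ('ta', 'VBZ'), ('ja', 'NPST'), ('he', 'SAP')]

# With an unsegmented form like 'ɨhiktei', this gloss-to-morph
# correspondence cannot be recovered mechanically.
```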
Entities and properties to be represented
These are various concepts from descriptive linguistics which one would want to have at one's disposal when writing a grammar.
Morphemes
- inherently abstract analytical concept
- morpheme type (root, affix...)
- have a meaning
- allomorphs are specific, tangible forms, potentially a separate model?
Lexemes
- also abstract
- potentially morphologically complex
- related to word forms
- have a meaning
Word forms
- not (that) abstract, tangible forms
- belong to a lexeme
- can have a stem + inflectional morphology
- stem can have derivational morphology
- have a meaning
Examples
- have at least a form and a translation
- usually also a segmented object language line and corresponding meta language glossing line
- potentially additional lines
- also metadata like source, speaker, context, text
Meaning is tough. It should be possible to ascribe meaning to single morphemes (can also be called "function"), lexemes, and word forms. On the other hand, interlinear examples have a 'translation', not a meaning; where is the boundary? See below for an example.
Parts of speech
- a property of lexemes
- subgroups may be needed
This list is representative of the kind of properties one may want to represent, though highly dependent on the language (and analysis):
- transitivity of verbs
- alienability of nouns
- inflection classes
- semantic fields
Available CLDF elements
FormTable
A lexical unit is any collection of word forms corresponding to a certain meaning which can be found in comparative datasets.
Ideally, a lexical unit would just present itself as one single form. However, in practice, scholars often list speech variants and at times even non-cognate alternatives for their preferred form.
The name makes clear that it is a concrete form found in a particular language, not an abstract concept (like a morpheme or lexeme).
This is in line with the FormTable containing a `Segments` column, representing the form as a sequence of phonemes.
Additional columns optionally contain information about morphological complexity:
- `Motivation_Structure` contains a space-separated gloss of morphologically complex forms
- `Root` contains the root of a form
- `Stem` contains the stem of a form
In these columns, only some morphemes (like the `Root`) are represented as simple strings, and the structure of a `Form` must be inferred from the combination of these.
| Language_ID | Form | Segments | Motivation_Structure | Root | Stem |
| --- | --- | --- | --- | --- | --- |
| waya1269 | ɨhiktei | ɨ h i k t e i | 1 urine VBZ NPST SAP | hiku | hikta |
Some issues and questions come to mind:
- Meaning is usually represented via a `Parameter_ID` in Wordlists, but that does not seem to make sense for such elaborate word forms, since the meaning is composed of the meanings of the participating morphemes. Maybe a column `Translation` would suffice? There is also the Sense of the `Dictionary` module and the FunctionalEquivalentSet of the `ParallelText` module; `Parameter_ID` is used in `StructureDataset`s in a different way.
- There is no designated column for the underlying structure (ɨ-hiku-ta-ja-he), although there is one for the glosses corresponding to these morphemes.
- The `Root` and `Stem` columns would ideally also reference entities (a root morpheme hiku 'urine' and a lexeme hikta 'to urinate', respectively), instead of just strings.
- The first two issues would be solved by representing the form as an interlinear example.
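To make the inference problem concrete, here is a minimal sketch (plain Python; the row mirrors the example table above):

```python
# The FormTable row from above, as a plain dict:
row = {
    "Form": "ɨhiktei",
    "Segments": "ɨ h i k t e i".split(),
    "Motivation_Structure": "1 urine VBZ NPST SAP".split(),
    "Root": "hiku",
    "Stem": "hikta",
}

# Root and Stem are bare strings, not references to entities; worse,
# due to morphophonology neither even occurs verbatim in the surface
# form, so the underlying structure cannot be reconstructed mechanically:
print(row["Root"] in row["Form"])  # False
print(row["Stem"] in row["Form"])  # False
```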
Conceivably, one could also model abstract entities like morphemes and lexemes (as well as allomorphs) as rows in custom tables.
Allomorphs would need a column `Morpheme_ID` referencing the morpheme; a morpheme would have some aggregation of its allomorphs as its `Form` (like -ja(h)i).
Lexemes would need a `Morpheme_IDs` column referencing the morphemes of which they are composed. Those could be (`-`, `=`, `~`) separated in the `Form`.
Morphemes would probably need some information like a meaning and a morpheme type (root, affix...).
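A sketch of what such tables could look like (plain Python rows; the IDs and column names are made up for illustration, the allomorph data uses the -ja(h)i aggregation mentioned above):

```python
# Hypothetical morpheme, allomorph and lexeme tables as plain rows.
morphemes = [
    {"ID": "urine", "Form": "hiku", "Meaning": "urine", "Type": "root"},
    {"ID": "vbz", "Form": "-ta", "Meaning": "VBZ", "Type": "suffix"},
    # A morpheme's Form aggregates its allomorphs:
    {"ID": "npst", "Form": "-ja(h)i", "Meaning": "NPST", "Type": "suffix"},
]
allomorphs = [
    # Concrete variants, each pointing back to its morpheme:
    {"ID": "npst-1", "Form": "-jai", "Morpheme_ID": "npst"},
    {"ID": "npst-2", "Form": "-jahi", "Morpheme_ID": "npst"},
]
lexemes = [
    # A lexeme references the morphemes it is composed of:
    {"ID": "hikta", "Form": "hikta", "Morpheme_IDs": ["urine", "vbz"]},
]

# Resolving a lexeme to its component morpheme forms:
by_id = {m["ID"]: m for m in morphemes}
print([by_id[mid]["Form"] for mid in lexemes[0]["Morpheme_IDs"]])
# ['hiku', '-ta']
```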
The `ExampleTable` component provides good support for interlinear examples.
The lines which are to be aligned are tab-separated by default, meaning that object and gloss words are read as lists.
In an approach where morphemes are modelled as well, one would definitely want to include morpheme IDs in interlinear examples.
What I have used with success are two additional columns: one containing a `;`-separated list of morpheme IDs, and one containing a `;`-separated list of boolean values stating whether or not the morpheme at position X is featured in the example. Ideally, the latter column would not be needed, but as of now not all morphemes are successfully identified in exports of Toolbox glossed texts.
A similar process may be necessary for uniparser, though there only the order of morphemes needs to be established.
Another issue is of course that the application rendering an example (and inserting morpheme links) needs to be able to split the (`-`, `=`, `~`) segmented strings in the object line into single morphemes.
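That splitting step can be sketched like this (plain Python; the morpheme IDs are made up for illustration):

```python
import re

# One word from an object line, segmented with -, = or ~:
word = "ɨ-hiku-ta-ja-he"
morphs = re.split(r"[-=~]", word)
print(morphs)  # ['ɨ', 'hiku', 'ta', 'ja', 'he']

# Parse a ;-separated morpheme ID column and pair each morph with
# its ID, e.g. for rendering links in a web app:
morpheme_ids = "1sg;urine;vbz;npst;sap".split(";")
links = list(zip(morphs, morpheme_ids))

# To keep the separators for display, split with a capture group:
display = re.split(r"([-=~])", word)
print(display)  # ['ɨ', '-', 'hiku', '-', 'ta', '-', 'ja', '-', 'he']
```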
I am delighted to announce that I have received a grant from the SNSF for a 2-year Postdoc.Mobility project at the University of Oregon. Together with Spike Gildea and Natalia Arandia Cáceres, I will be working with a corpus of Yawarana, a moribund and underdescribed Cariban language spoken in Venezuela. Planned are a digital grammar sketch and some corpus studies. This is exciting! Updates will follow.
Fresh off the github press: a new version of pyradigms, give it a go if you're handling paradigms (or should be)!
pyradigms is a Python package that allows you to compose and decompose three-dimensional paradigms from and into lists with parameters. I've rewritten it from the ground up, as I realized that I could use pandas for the core functionality -- grouping and pivoting of dataframes. To give you an idea of what it does, it basically takes lists like this:
| | 1SG | 1DU | 1PL | 2SG | 2DU | 2PL | 3SG | 3DU | 3PL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SBJV | tripali | tripaliyu | tripaliyiñ | tripalmi | tripalmu | tripalmün | tripale | tripale engu | tripale engü |
| IMP | tripachi | | | tripange | tripamu | tripamün | tripape | tripape engu | tripape engün |
or, if you wanted, even something like this:
| | IND.3SG | IND.3DU | IND.3PL | SBJV.3SG | SBJV.3DU | SBJV.3PL | IMP.3SG | IMP.3DU | IMP.3PL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| pi- | pi | pingu | pingün | pile | pile engu | pile engün | pipe | pipe engu | pipe engün |
| kon- | koni | koningu | koningün | konle | konle engu | konle engün | konpe | konpe engu | konpe engün |
| tripa- | tripay | tripayngu | tripayngün | tripale | tripale engu | tripale engü | tripape | tripape engu | tripape engün |
Vice versa, it also allows you to take a traditional linguistic paradigm (or something laid out like it) like this:
to a list like this:
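Under the hood, the core operation maps onto pandas pivoting; here is a minimal sketch (toy data with made-up parameter labels, not pyradigms' actual API):

```python
import pandas as pd

# Long format: one row per form, with parameter columns.
forms = pd.DataFrame(
    [
        {"Person": "1", "Number": "SG", "Form": "tripali"},
        {"Person": "1", "Number": "DU", "Form": "tripaliyu"},
        {"Person": "2", "Number": "SG", "Form": "tripalmi"},
        {"Person": "2", "Number": "DU", "Form": "tripalmu"},
    ]
)

# Composing: pivot one parameter onto rows, another onto columns,
# yielding a 2D paradigm.
paradigm = forms.pivot(index="Person", columns="Number", values="Form")
print(paradigm)

# Decomposing: melt the 2D paradigm back into a long-format list.
long = paradigm.reset_index().melt(id_vars="Person", value_name="Form")
```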
My article in Cadernos de Etnolingüística has finally been published. Writing this was quite the wild ride, as I was just trying to figure out correspondences between Arara and Ikpeng TAM endings. But then I realized that the Arara forms with third person n- were the ones with original main clause TAM suffixes, while the ones with i/∅ or t- used to be subordinate clauses. This sudden realization led me down a rabbit hole of Pekodian verbal person prefixes and TAM suffixes, culminating in this paper. Its main contribution is an answer to the question of why the Pekodian languages have verb forms without *n-, which Meira et al. (2010) hypothesized to be due to *n- being a later addition.