Configuration

The unboxer's behavior can be modified by editing a .yaml file and passing it to the unbox command:

unbox corpus <path/to/your/toolbox/texts.db> --conf <path/to/your/config.yaml>

Any key-value pair in your configuration file overrides the corresponding default value. Default values are stored in three built-in files: a general configuration file for interlinear text, and two application-specific files, one for toolbox and one for shoebox. The two application-specific files are identically structured and contain best-guess default values for the respective application. By default, the toolbox configuration is loaded; the shoebox configuration can be selected with --format shoebox.
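
The override behavior can be pictured as a dictionary update. The following is only a minimal sketch, not the unboxer's actual code: the function name apply_overrides and the example user values are hypothetical, and it assumes both files have already been parsed into plain dictionaries. How nested keys such as interlinear_mappings are combined is not specified here, so the sketch only covers top-level keys.

```python
# Hypothetical sketch of "user values override defaults"; not the unboxer's own code.
# Default values copied from the general configuration described in this document.
defaults = {
    "slugify": True,
    "fix_clitics": True,
    "cell_separator": "; ",
    "skip_empty_obj": True,
}

# Example values a user might put in their own config.yaml (assumed, already parsed).
user_conf = {"cell_separator": " | ", "slugify": False}


def apply_overrides(defaults: dict, user_conf: dict) -> dict:
    """Return a new dict in which every user-supplied key replaces the default."""
    merged = dict(defaults)   # copy, so the defaults stay untouched
    merged.update(user_conf)  # user keys win over default keys
    return merged


config = apply_overrides(defaults, user_conf)
print(config)
```

Keys that the user does not mention (here fix_clitics and skip_empty_obj) keep their default values.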

General configuration

File: interlinear_config.yaml.

This configuration file contains the default values for variables related to interlinear text.

Tab-aligned fields

A list of fields containing text which will be aligned when rendered in an interlinear representation. Note that you need to use the field labels present after renaming; see the sections on mapping interlinear fields to columns.

Analyzed_Word: cldf#Analyzed_Word
Gloss: cldf#Gloss
Part of speech (Part_Of_Speech): currently specified as a dictionary entry property in CLDF, usable as a foreign key in cldf-ldd, with a corresponding component.

Default:

aligned_fields:
    - Analyzed_Word
    - Gloss
    - Part_Of_Speech

Slugification

Turn record IDs (\ref) into database-usable IDs, e.g. ConvInGarden.003 into convingarden-003.

Default:

slugify: true
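
To illustrate the kind of transformation meant here, the sketch below reproduces the example from the description. It is an assumption about the general approach (lowercasing and replacing runs of non-alphanumeric characters with a hyphen), not the unboxer's actual slugification routine; the function name slugify_id is hypothetical.

```python
import re


def slugify_id(record_id: str) -> str:
    """Lowercase an ID and replace runs of non-alphanumeric characters with '-'."""
    lowered = record_id.lower()
    slug = re.sub(r"[^a-z0-9]+", "-", lowered)
    # Drop hyphens that would otherwise start or end the ID.
    return slug.strip("-")


print(slugify_id("ConvInGarden.003"))  # convingarden-003
```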

Clitic space correction

Remove spaces after proclitics and before enclitics.

Default:

fix_clitics: true
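
A rough sketch of what such a correction could look like, assuming that clitic boundaries are marked with = (as in the Leipzig glossing conventions). The function name fix_clitic_spacing is hypothetical and the unboxer's own rules may differ.

```python
import re


def fix_clitic_spacing(line: str) -> str:
    """Remove spaces after proclitics ("n= tree") and before enclitics ("tree =ka")."""
    # Proclitic: "=" attached to the preceding material, followed by whitespace.
    line = re.sub(r"(?<=\S)=\s+", "=", line)
    # Enclitic: whitespace followed by "=" attached to the following material.
    line = re.sub(r"\s+=(?=\S)", "=", line)
    return line


print(fix_clitic_spacing("te= mba =ri"))  # te=mba=ri
```

A free-standing = with spaces on both sides is left untouched in this sketch, since it cannot be attributed to either neighbor.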

Cell separator

The delimiter used between multiple values in a single cell (such as variants or meanings).

Default:

cell_separator: '; '
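
As a small illustration of the default separator, the following sketch joins several values into one cell and splits them again. The helper names and the sample values are made up for this example.

```python
CELL_SEPARATOR = "; "  # the default value shown above


def join_cell(values: list[str]) -> str:
    """Combine several values into one cell, skipping empty entries."""
    return CELL_SEPARATOR.join(v.strip() for v in values if v.strip())


def split_cell(cell: str) -> list[str]:
    """Recover the individual values from a cell; an empty cell yields no values."""
    return cell.split(CELL_SEPARATOR) if cell else []


cell = join_cell(["dog", "hound", ""])
print(cell)              # dog; hound
print(split_cell(cell))  # ['dog', 'hound']
```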

Skip empty records

Default:

skip_empty_obj: true

Toolbox

File: toolbox.yaml.

This is the default configuration for toolbox projects.

File encoding

Toolbox files should be in UTF-8.

Default:

encoding: 'utf-8'

Text record marker

The field holding text record identifiers.

Default:

record_marker: 'ref'

Mapping interlinear fields to columns

The fields listed here will be renamed accordingly. Note that all field markers are represented here without the leading \ that is used in toolbox.

Unsegmented object line

The unsegmented first line of the record.

Default:

interlinear_mappings:
  tx: 'Primary_Text'

Segmented object line

Default:

interlinear_mappings:
  mb: 'Analyzed_Word'

Segmented gloss line

Default:

interlinear_mappings:
  ge: 'Gloss'

Part of speech

Default:

interlinear_mappings:
  ps: 'Part_Of_Speech'

Translation

Default:

interlinear_mappings:
  ft: 'Translated_Text'
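
Taken together, the default toolbox interlinear mappings amount to a renaming table from field markers to column names. The sketch below applies that table to a single record; the function rename_fields and the sample record are hypothetical, and keeping unmapped markers under their original name is an assumption made for this illustration, not documented behavior.

```python
# Default toolbox interlinear mappings, collected from the sections above.
TOOLBOX_INTERLINEAR_MAPPINGS = {
    "tx": "Primary_Text",
    "mb": "Analyzed_Word",
    "ge": "Gloss",
    "ps": "Part_Of_Speech",
    "ft": "Translated_Text",
}


def rename_fields(record: dict, mappings: dict) -> dict:
    """Rename mapped markers; unmapped markers keep their name (an assumption)."""
    return {mappings.get(marker, marker): value for marker, value in record.items()}


# Made-up record with placeholder content; markers appear without the leading backslash.
record = {
    "ref": "ConvInGarden.003",
    "tx": "word1 word2",
    "mb": "word1 word-2",
    "ge": "gloss1 gloss-2",
    "ft": "free translation",
}

print(rename_fields(record, TOOLBOX_INTERLINEAR_MAPPINGS))
```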

Lexicon entry marker

The field holding lexicon entry identifiers.

Default:

entry_marker: 'lx'

Mapping lexicon fields to columns

The fields listed here will be renamed accordingly.

Headword

Default:

lexicon_mappings:
  lx: 'Headword'

Part of speech

Default:

lexicon_mappings:
  ps: 'Part_Of_Speech'

Meaning

Default:

lexicon_mappings:
  ge: 'Meaning'

Date

Default:

lexicon_mappings:
  dt: 'Date'

Variants

Default:

lexicon_mappings:
  a: 'Variants'

Example IDs

As suggested by Dictionaria.

Default:

lexicon_mappings:
  xref: 'Example_IDs'

Text information

Where information about the text is stored: in the record_marker field, as a separate entry, or not at all (none).

Default:

text_mode: 'none'

Shoebox

File: shoebox.yaml.

This is the default configuration for shoebox projects.

File encoding

The default is the most frequent single-byte character encoding. Other values I've had to use: cp1256, iso8859_2, latin_1.

Default:

encoding: 'cp1252'
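
To show why the encoding matters, the sketch below decodes the same bytes with cp1252 and then tries UTF-8. It assumes that the encoding values are Python codec names, which the listed alternatives (cp1256, iso8859_2, latin_1) happen to be; the helper decode_bytes is hypothetical.

```python
import codecs


def decode_bytes(raw: bytes, encoding: str = "cp1252") -> str:
    """Decode raw file content with the configured single-byte encoding."""
    return raw.decode(encoding)


# "é" is the single byte 0xE9 in cp1252, which is not a valid UTF-8 sequence on its own.
raw = "café".encode("cp1252")
print(decode_bytes(raw))  # café

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)  # False

# All encodings mentioned in this section are known to Python's codec registry.
for name in ("cp1252", "cp1256", "iso8859_2", "latin_1"):
    codecs.lookup(name)
```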

Text record marker

The field holding text record identifiers.

Default:

record_marker: 'ref'

Mapping interlinear fields to columns

The fields listed here will be renamed accordingly.

Unsegmented object line

The unsegmented first line of the record.

Default:

interlinear_mappings:
  t: 'Primary_Text'

Segmented object line

Default:

interlinear_mappings:
  m: 'Analyzed_Word'

Segmented gloss line

Default:

interlinear_mappings:
  g: 'Gloss'

Part of speech

Default:

interlinear_mappings:
  p: 'Part_Of_Speech'

Translation

Default:

interlinear_mappings:
  f: 'Translated_Text'

Comment

Default:

interlinear_mappings:
  c: 'Comment'

Lexicon entry marker

The field holding lexicon entry identifiers.

Default:

entry_marker: 'lx'

Mapping lexicon fields to columns

The fields listed here will be renamed accordingly.

Headword

Default:

lexicon_mappings:
  lx: 'Headword'

Part of speech

Default:

lexicon_mappings:
  ps: 'Part_Of_Speech'

Meaning

Default:

lexicon_mappings:
  ge: 'Meaning'

Example IDs

As suggested by Dictionaria.

Default:

lexicon_mappings:
  xref: 'Example_IDs'

Text information

Where information about the text is stored: in the record_marker field, as a separate entry, or not at all (none).

Default:

text_mode: 'none'