
# Structure assignment pipeline

Pipeline for parsing a list of arbitrary Slovene strings and assigning
each to a syntactic structure in the DDD database, generating
provisional new structures where necessary. The pipeline consists of two
separate Python 3 wrapper scripts, pipeline1.py and pipeline2.py, so
that the parsed units can be corrected manually in between
(e.g., with qcat).

## Setup

The scripts are mostly wrappers: they rely largely on generic
scripts and resources from other git repositories, as well as Python
libraries. These dependencies can be set up with the setup script:

```
$ bin/setup.sh
```

## pipeline1.py

The pipeline1.py script expects as input a file of Slovene strings,
one string per line. It runs the [Python obeliks
tokeniser](https://pypi.org/project/obeliks/) on the input, makes minor
adjustments to the CoNLL-U output, and runs JOS-configured
[classla](https://pypi.org/project/classla/) to parse the result. It
then translates the JOS tags (MSDs and dependencies) from English to
Slovene and converts the output to TEI XML.

Example usage:

```
$ python scripts/pipeline1.py -inlist strings.txt -outtei parsed.xml
```
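
Since pipeline1.py reads one string per line, it can help to normalise the
input list before running it. The following is a minimal sketch of such a
preparation step; the helper name `write_string_list` is hypothetical, and
collapsing whitespace and skipping blank entries are assumptions of this
sketch, not documented requirements of the script.

```python
from pathlib import Path


def write_string_list(strings, path):
    """Write one whitespace-normalised string per line, skipping blanks.

    Hypothetical helper: not part of the pipeline itself.
    """
    cleaned = []
    for s in strings:
        # Collapse runs of whitespace (including stray newlines) into
        # single spaces, so each string occupies exactly one line.
        s = " ".join(s.split())
        if s:
            cleaned.append(s)
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return cleaned
```

For example, `write_string_list(["biti  ali ne biti", "", "hiša\n"], "strings.txt")`
writes a two-line file that can be passed to `-inlist`.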

## pipeline2.py

The pipeline2.py script expects as input a TEI XML file (in the same
format as the output of pipeline1.py) and an XML file of
structure specifications (normally the up-to-date CJVT structures.xml
file). It first splits the TEI file into two files, one with the
single-component units and the other with the multiple-component
units. For each file, it then assigns every unit to a syntactic structure
from the DDD database and converts the output into CJVT dictionary XML
format. For the single-component units this is straightforward, but
for multiple-component units it is more involved and includes two
runs of the MWE extraction script
[wani.py](https://gitea.cjvt.si/ozbolt/luscenje_struktur), with
missing structures generated in between. At the end, the single-component and
multiple-component dictionary files are merged. Both the merged
dictionary file and the updated structure specification file are
validated against the appropriate XML schemas.
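
The initial split into single-component and multiple-component units can be
pictured roughly as follows. This is only an illustrative sketch, not the
code pipeline2.py actually uses: it assumes that each unit is a TEI `<s>`
element and that each component is a `<w>` element inside it, and the
function name `split_units` is hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"  # standard TEI namespace
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def split_units(tei_root):
    """Return (single, multiple): lists of <s> elements by word count.

    Assumption: units are <s> elements and components are their <w>
    descendants; the real pipeline may define components differently.
    """
    single, multiple = [], []
    for unit in tei_root.iter(f"{{{TEI_NS}}}s"):
        n_words = sum(1 for _ in unit.iter(f"{{{TEI_NS}}}w"))
        (multiple if n_words > 1 else single).append(unit)
    return single, multiple


# Made-up sample input with one unit of each kind.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><p>
<s xml:id="u1"><w>hiša</w></s>
<s xml:id="u2"><w>biti</w><w>ali</w><w>ne</w><w>biti</w></s>
</p></body></text></TEI>"""

single, multiple = split_units(ET.fromstring(SAMPLE))
single_ids = [u.get(XML_ID) for u in single]
multiple_ids = [u.get(XML_ID) for u in multiple]
```

In this sample, `u1` lands in the single-component list and `u2` in the
multiple-component list.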

Example usage:

```
$ python scripts/pipeline2.py -intei parsed_corrected.xml -instructures structures.xml -outstructures structures_new.xml -outlexicon dictionary.xml
```

Note that any new structures generated are given temporary ids
(@tempId), because they only receive real ids once they are approved and
added to the DDD database. That can be done with the Django
import_structures.py script in the [ddd_core
repository](https://gitea.cjvt.si/ddd/ddd_core), normally as part of a
batch of DDD updates. That script replaces the temporary ids in the
structure specifications and updates the ids in the dictionary file.
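
To make the id replacement concrete, here is a rough sketch of the idea. It
is not the import_structures.py implementation: the element name
`syntactic_structure`, the reference attribute `structure_id`, the function
name `assign_real_ids` and the mapping values are all assumptions made for
illustration. Only the `tempId` attribute is taken from the description above.

```python
import xml.etree.ElementTree as ET


def assign_real_ids(structures_root, lexicon_root, id_map):
    """Replace temporary ids with real ones, given a tempId -> id mapping.

    Hypothetical sketch: element and attribute names other than
    'tempId' are assumed, not taken from the actual schemas.
    """
    # In the structure specifications, swap @tempId for a real @id.
    for el in structures_root.iter("syntactic_structure"):
        temp_id = el.get("tempId")
        if temp_id in id_map:
            del el.attrib["tempId"]
            el.set("id", id_map[temp_id])
    # In the dictionary file, update references to the temporary ids.
    for el in lexicon_root.iter():
        ref = el.get("structure_id")
        if ref in id_map:
            el.set("structure_id", id_map[ref])


# Made-up sample data.
structures = ET.fromstring(
    '<structures><syntactic_structure id="1"/>'
    '<syntactic_structure tempId="t1"/></structures>'
)
lexicon = ET.fromstring(
    '<dictionary><entry structure_id="t1"/>'
    '<entry structure_id="1"/></dictionary>'
)
assign_real_ids(structures, lexicon, {"t1": "205"})
```

Structures and references that already carry a real id are left untouched.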