# Structure assignment pipeline

A pipeline for parsing a list of arbitrary Slovene strings and
assigning each one to a syntactic structure in the DDD database,
generating provisional new structures where necessary. The pipeline is
split into two Python 3 wrapper scripts, pipeline1.py and
pipeline2.py, so that the parsed units can be corrected manually
between the two steps (e.g., with qcat).

## Setup

The scripts are mostly wrappers: they largely rely on generic scripts
and resources from other git repositories, as well as on Python
libraries. These dependencies can be set up with a dedicated script:

```
$ bin/setup.sh
```

## pipeline1.py

The pipeline1.py script expects as input a file of Slovene strings,
one string per line. It runs the [Python obeliks
tokeniser](https://pypi.org/project/obeliks/) on the input, makes
minor adjustments to the CoNLL-U output, and runs JOS-configured
[classla](https://pypi.org/project/classla/) to parse it. It then
translates the JOS tags (MSDs and dependencies) from English to
Slovene and converts the output to TEI XML.

Example usage:

```
$ python scripts/pipeline1.py -inlist strings.txt -outtei parsed.xml
```

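The one-string-per-line input format is simple enough to check before
running the script. The following is a minimal sketch and not part of
the repository: `read_strings` is a hypothetical helper, and trimming
whitespace and skipping blank lines are assumptions about what counts
as a usable input line, not documented behaviour of pipeline1.py.

```python
import io


def read_strings(handle):
    """Return the non-empty lines of a one-string-per-line input."""
    strings = []
    for line in handle:
        text = line.strip()
        # Assumption: a blank line carries no unit and can be skipped.
        if text:
            strings.append(text)
    return strings


# In practice the handle would be an open file such as strings.txt.
sample = io.StringIO("zelena miza\n\n  hitro teči \n")
print(read_strings(sample))
```

A check like this makes it easy to spot stray blank lines or padding
in the list before the tokeniser and parser are run.
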
## pipeline2.py

The pipeline2.py script expects as input a TEI XML file (in the same
particular format as the output of pipeline1.py) and an XML file of
structure specifications (normally the up-to-date CJVT structures.xml
file). It first splits the TEI file into two files, one with the
single-component units and the other with the multiple-component
units. For each file, it then assigns each unit to a syntactic
structure from the DDD database and converts the output into the CJVT
dictionary XML format. For the single-component units this is trivial,
but for the multiple-component units it is more involved and includes
two runs of the MWE extraction script
[wani.py](https://gitea.cjvt.si/ozbolt/luscenje_struktur), with the
missing structures generated in between. At the end, the
single-component and multiple-component dictionary files are merged.
Both the merged dictionary file and the updated structure
specification file are validated against the appropriate XML schemas.

Example usage:

```
$ python scripts/pipeline2.py -intei parsed_corrected.xml -instructures structures.xml -outstructures structures_new.xml -outlexicon dictionary.xml
```

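The first step of pipeline2.py, splitting the units by their number
of components, can be illustrated roughly as follows. This is a sketch
under stated assumptions and not the script's actual code: it assumes
that each unit is a TEI `<s>` element and that each component is a
`<w>` token inside it, which may not match the particular TEI format
that pipeline1.py produces. `split_units` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def split_units(tei_root):
    """Partition <s> units by the number of <w> tokens they contain."""
    single, multiple = [], []
    for unit in tei_root.iter(f"{{{TEI_NS}}}s"):
        words = unit.findall(f".//{{{TEI_NS}}}w")
        if len(words) == 1:
            single.append(unit)
        elif len(words) > 1:
            multiple.append(unit)
        # Units without any <w> token are ignored in this sketch.
    return single, multiple


# Assumed layout: one single-component and one two-component unit.
sample = ET.fromstring(
    f'<TEI xmlns="{TEI_NS}"><text><body><p>'
    '<s><w>miza</w></s>'
    '<s><w>zelena</w><w>miza</w></s>'
    '</p></body></text></TEI>'
)
single, multiple = split_units(sample)
print(len(single), len(multiple))
```

Keeping the two groups apart reflects the description above: the
single-component units need no MWE extraction, so only the
multiple-component units have to go through wani.py.
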
Note that any new structures generated are given temporary ids
(@tempId), because they only receive real ids once they have been
approved and added to the DDD database. That can be done via the
Django import_structures.py script in the [ddd_core
repository](https://gitea.cjvt.si/ddd/ddd_core), normally as part of a
batch of DDD updates. That script replaces the temporary ids in the
structure specifications and updates the ids in the dictionary file.
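
The idea of replacing temporary ids can be illustrated with a short
sketch. This is not how import_structures.py is implemented; it only
shows the general substitution. Apart from the `tempId` attribute, all
names here are assumptions: the `structures` and `structure` elements,
the `id` attribute, the helper `apply_real_ids`, and the mapping from
temporary to real ids, which in practice would come from the DDD
database.

```python
import xml.etree.ElementTree as ET


def apply_real_ids(structures_root, id_map):
    """Swap each mapped tempId attribute for a real id from id_map."""
    for element in structures_root.iter():
        temp_id = element.get("tempId")
        if temp_id is not None and temp_id in id_map:
            del element.attrib["tempId"]
            element.set("id", id_map[temp_id])
    return structures_root


# Hypothetical specification file: one existing and one new structure.
structures = ET.fromstring(
    '<structures>'
    '<structure id="1"/>'
    '<structure tempId="t1"/>'
    '</structures>'
)
apply_real_ids(structures, {"t1": "205"})
print(ET.tostring(structures, encoding="unicode"))
```

Structures whose temporary id is missing from the mapping keep their
`tempId` in this sketch, so unapproved structures remain easy to
identify.
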