# Structure assignment pipeline

Pipeline for parsing a list of arbitrary Slovene strings and assigning
each to a syntactic structure in the DDD database, generating
provisional new structures if necessary. The pipeline consists of two
separate Python 3 wrapper scripts, pipeline1.py and pipeline2.py, so
that the parsed units can be corrected manually in between
(e.g., with qcat).

## Setup

The scripts are mostly wrappers, in that they largely use generic
scripts and resources from other git repositories, as well as Python
libraries. These can be set up with a dedicated script:

```
$ bin/setup.sh
```

## pipeline1.py

The pipeline1.py script expects as input a file of Slovene strings,
one string per line. It then runs the [Python obeliks
tokeniser](https://pypi.org/project/obeliks/) on the input, tweaks the
CoNLL-U output a little, and runs JOS-configured
[classla](https://pypi.org/project/classla/) to parse the output. It
then translates the JOS tags (MSDs and dependencies) from English to
Slovene and converts the output to TEI XML.

Example usage:

```
$ python scripts/pipeline1.py -inlist strings.txt -outtei parsed.xml
```
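
The input file can be prepared in many ways. As a minimal sketch, the helper below writes a list of strings to a one-string-per-line UTF-8 file. The function name `write_input_list` is hypothetical and not part of this repository, and dropping blank entries and collapsing whitespace are assumptions, since how pipeline1.py treats such input is not documented here.

```python
from pathlib import Path


def write_input_list(strings, path):
    """Write one cleaned string per line to `path`; return how many were written.

    Hypothetical helper, not part of the pipeline scripts.
    """
    cleaned = []
    for s in strings:
        # Collapse all whitespace runs (including stray newlines) so that
        # every unit occupies exactly one line of the output file.
        unit = " ".join(s.split())
        if unit:  # assumption: empty entries are not wanted in the input
            cleaned.append(unit)
    Path(path).write_text("".join(u + "\n" for u in cleaned), encoding="utf-8")
    return len(cleaned)
```

The resulting file can then be passed to pipeline1.py via `-inlist`.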

## pipeline2.py

The pipeline2.py script expects as input a TEI XML file (in the same
format as the output of pipeline1.py) and an XML file of structure
specifications (normally the up-to-date CJVT structures.xml file). It
first splits the TEI file into two files, one with the
single-component units and the other with the multiple-component
units. For each file, it then assigns every unit to a syntactic
structure from the DDD database and converts the output into CJVT
dictionary XML format. For the single-component units, this is
straightforward; for the multiple-component units it is more involved
and includes two runs of the MWE extraction script
[wani.py](https://gitea.cjvt.si/ozbolt/luscenje_struktur), with
missing structures generated in between. Finally, the single-component
and multiple-component dictionary files are merged, and both the merged
dictionary file and the updated structure specification file are
validated against the appropriate XML schemas.
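
To illustrate the splitting step only, here is a rough sketch that separates units by the number of word tokens they contain. It assumes that each unit is an `<s>` element and each component a `<w>` element, which is a guess at the TEI layout rather than something stated in this README; the function names are hypothetical and this is not the code pipeline2.py actually uses.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def split_units(root):
    """Return (single, multiple): lists of <s> elements by word-token count."""
    single, multiple = [], []
    for elem in root.iter():
        if local_name(elem.tag) != "s":
            continue
        # Assumption: components correspond to <w> elements inside the unit.
        n_words = sum(1 for child in elem.iter() if local_name(child.tag) == "w")
        if n_words == 0:
            continue  # nothing to assign a structure to
        (single if n_words == 1 else multiple).append(elem)
    return single, multiple


sample = ET.fromstring(
    "<body>"
    "<s><w>hiša</w></s>"
    "<s><w>rdeča</w><w>hiša</w></s>"
    "</body>"
)
single, multiple = split_units(sample)
print(len(single), len(multiple))
```

Matching on local names keeps the sketch indifferent to whether the file declares the TEI namespace.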

Example usage:

```
$ python scripts/pipeline2.py -intei parsed_corrected.xml -instructures structures.xml -outstructures structures_new.xml -outlexicon dictionary.xml
```
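
For scripted runs, the two steps can be chained with a pause for manual correction in between. The following is only a sketch: the command-line flags are the ones shown in the examples above, while the helper names, the use of `sys.executable` instead of `python`, and the interactive prompt are assumptions. It also assumes it is run from the repository root.

```python
import subprocess
import sys


def pipeline1_cmd(inlist, outtei):
    """Build the argument list for scripts/pipeline1.py."""
    return [sys.executable, "scripts/pipeline1.py",
            "-inlist", inlist, "-outtei", outtei]


def pipeline2_cmd(intei, instructures, outstructures, outlexicon):
    """Build the argument list for scripts/pipeline2.py."""
    return [sys.executable, "scripts/pipeline2.py",
            "-intei", intei,
            "-instructures", instructures,
            "-outstructures", outstructures,
            "-outlexicon", outlexicon]


def main():
    # check=True raises CalledProcessError if a script exits with a
    # non-zero status, so a failed first step stops the run.
    subprocess.run(pipeline1_cmd("strings.txt", "parsed.xml"), check=True)
    input("Correct parsed.xml, save it as parsed_corrected.xml, then press Enter: ")
    subprocess.run(
        pipeline2_cmd("parsed_corrected.xml", "structures.xml",
                      "structures_new.xml", "dictionary.xml"),
        check=True,
    )
```

Calling `main()` runs both steps. Passing the arguments as a list avoids shell quoting problems with unusual file names.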

Note that any new structures generated are given temporary ids
(@tempId), because they only get real ids once they are approved and
added to the DDD database. That can be done via the Django
import_structures.py script in the [ddd_core
repository](https://gitea.cjvt.si/ddd/ddd_core), normally as part of a
batch of DDD updates. That script replaces the temporary ids in the
structure specifications and updates the ids in the dictionary file.
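
The id replacement can be pictured with the sketch below. It is not the import_structures.py implementation, which works against the DDD database. The element names, the `id` attribute, the sample ids and the `id_map` argument are all hypothetical; only the `tempId` attribute name comes from the note above.

```python
import xml.etree.ElementTree as ET


def assign_real_ids(root, id_map):
    """Swap tempId attributes for real ids given in id_map; return the count."""
    replaced = 0
    for elem in root.iter():
        temp_id = elem.get("tempId")
        if temp_id is not None and temp_id in id_map:
            del elem.attrib["tempId"]
            elem.set("id", id_map[temp_id])  # assumption: real ids live in @id
            replaced += 1
    return replaced


# Hypothetical structure specifications: one approved, one provisional.
structures = ET.fromstring(
    "<structures>"
    '<structure id="1"/>'
    '<structure tempId="N1"/>'
    "</structures>"
)
count = assign_real_ids(structures, {"N1": "105"})
print(count, [s.attrib for s in structures])
```

Elements whose temporary id is missing from the mapping are left untouched, so unapproved structures keep their `tempId`.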