IssueID #1487: added documentation for pipeline scripts
parent b51bc1b87d
commit 6668e38a14
73 README
@@ -1,13 +1,66 @@
-Pipeline for parsing a file of arbitrary Slovene string and assigning
-(first creating, if necessary) structure_ids for each string.
+# Structure assignment pipeline
 
+Pipeline for parsing a list of arbitrary Slovene strings and assigning
+each to a syntactic structure in the DDD database, generating
+provisional new structures if necessary. The pipeline consists of two
+separate wrapper python3 scripts, pipeline1.py and pipeline2.py, in
+order to facilitate manual corrections of the parsed units in between
+(e.g., with qcat).
+
+## Setup
+
+The scripts are mostly wrappers, in that they largely use generic
+scripts and resources from other git repositories, as well as python
+libraries. These can be set up with a special script:
+
+```
+$ bin/setup.sh
+```
+
+## pipeline1.py
+
+The pipeline1.py script expects as input a file of Slovene strings,
+one string per line. It then runs the [python obeliks
+tokeniser](https://pypi.org/project/obeliks/) on the input, tweaks the
+conllu output a little, and runs JOS-configured
+[classla](https://pypi.org/project/classla/) to parse the output. It
+then translates the JOS tags (msds and dependencies) from English to
+Slovene and converts the output to TEI XML.
+
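The input format that pipeline1.py expects (one Slovene string per line) can be prepared programmatically. The following is a minimal sketch, not part of the commit: the helper names `write_input_list` and `pipeline1_command` are hypothetical, the sample strings are taken from the old README example, and only the `-inlist` and `-outtei` flags come from the README itself.

```python
import shlex
import tempfile
from pathlib import Path


def write_input_list(strings, path):
    """Write one string per line as UTF-8, skipping empty entries.

    Returns the number of lines written.
    """
    lines = [s.strip() for s in strings if s.strip()]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


def pipeline1_command(inlist, outtei, script="scripts/pipeline1.py"):
    """Build the argument list for pipeline1.py.

    The script path is an assumption based on the README usage line;
    the flags -inlist and -outtei are the ones shown there.
    """
    return ["python", script, "-inlist", str(inlist), "-outtei", str(outtei)]


if __name__ == "__main__":
    samples = ["velika miza", "kdo ne more mimo česa", "pazi, avto!", "počitnice"]
    with tempfile.TemporaryDirectory() as tmp:
        inlist = Path(tmp) / "strings.txt"
        count = write_input_list(samples, inlist)
        # Only print the command here; running it requires the set-up repository.
        print(count, shlex.join(pipeline1_command(inlist, Path(tmp) / "parsed.xml")))
```

The returned list can be passed to `subprocess.run(..., check=True)` from the repository root once `bin/setup.sh` has been run.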
 Example usage:
 
-$ cd scripts
-$ ./setup.sh
-$ echo "velika miza" > ../tmp/strings.txt
-$ echo "kdo ne more mimo česa" >> ../tmp/strings.txt
-$ echo "pazi, avto!" >> ../tmp/strings.txt
-$ echo "počitnice" >> ../tmp/strings.txt
-$ source ../venv/bin/activate
-$ python pipeline.py ../tmp/strings.txt ../tmp/dictionary.xml
+```
+$ python scripts/pipeline1.py -inlist strings.txt -outtei parsed.xml
+```
+
+## pipeline2.py
+
+The pipeline2.py script expects as input a TEI XML file (in the same
+particular format as the output of pipeline1.py) and an xml file of
+structure specifications (normally the up-to-date CJVT structures.xml
+file). It first splits the TEI file into two files, one with the
+single-component units and the other with the multiple-component
+units. For each, it then assigns each unit to a syntactic structure
+from the DDD database and converts the output into CJVT dictionary XML
+format. For the single-component units, this is pretty trivial, but
+for multiple-component units it is more involved, and includes two
+runs of the MWE extraction script
+[wani.py](https://gitea.cjvt.si/ozbolt/luscenje_struktur), generating
+missing structures in between. At the end, the single-component and
+multiple-component dictionary files are merged. Both the merged
+dictionary file and the updated structure specification file are
+validated with the appropriate XML schemas.
+
+Example usage:
+
+```
+$ python scripts/pipeline2.py -intei parsed_corrected.xml -instructures structures.xml -outstructures structures_new.xml -outlexicon dictionary.xml
+```
+
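The first step of pipeline2.py, splitting units into single-component and multiple-component groups, can be illustrated with a small sketch. This is not the script's actual code: it assumes each unit is already available as a list of its component tokens, whereas the real script works on TEI XML, and the README does not say how components are counted (for example, whether punctuation counts). The function name `split_by_component_count` is hypothetical.

```python
def split_by_component_count(units):
    """Partition units into (single-component, multiple-component) lists.

    Each unit is assumed to be a non-empty list of component tokens;
    this in-memory representation is an illustration only.
    """
    single, multiple = [], []
    for unit in units:
        if not unit:
            raise ValueError("a unit must contain at least one component")
        (single if len(unit) == 1 else multiple).append(unit)
    return single, multiple


if __name__ == "__main__":
    units = [["počitnice"], ["velika", "miza"], ["kdo", "ne", "more", "mimo", "česa"]]
    single, multiple = split_by_component_count(units)
    print(len(single), len(multiple))
```

Keeping the two groups separate mirrors the README's description: single-component units need only a trivial structure assignment, while multiple-component units go through the wani.py runs.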
+Note that any new structures generated are given temporary ids
+(@tempId), because they only get real ids once they are approved and
+added to the DDD database. That can be done via the django
+import_structures.py script in the [ddd_core
+repository](https://gitea.cjvt.si/ddd/ddd_core), normally as part of a
+batch of DDD updates. That script replaces the temporary ids in the
+structure specifications and updates the ids in the dictionary file.
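The replacement of temporary ids described in the note can be sketched with the standard library. This is not the logic of import_structures.py, which is not shown in this commit. The element name `syntactic_structure`, the `id` attribute, and the mapping from temporary to approved ids are all assumptions made for illustration; only the `tempId` attribute name comes from the README.

```python
import xml.etree.ElementTree as ET


def apply_real_ids(root, id_map, temp_attr="tempId", id_attr="id"):
    """Swap temporary ids for approved ids in place.

    id_map maps temporary id strings to approved ids (hypothetical data).
    Elements whose temporary id is not in id_map are left untouched.
    Returns the number of elements updated.
    """
    replaced = 0
    for elem in root.iter():
        temp = elem.get(temp_attr)
        if temp is not None and temp in id_map:
            elem.set(id_attr, str(id_map[temp]))
            del elem.attrib[temp_attr]
            replaced += 1
    return replaced


if __name__ == "__main__":
    # Hypothetical fragment; the real structures.xml layout may differ.
    xml_text = (
        "<structures>"
        '<syntactic_structure id="1"/>'
        '<syntactic_structure tempId="t1"/>'
        "</structures>"
    )
    root = ET.fromstring(xml_text)
    print(apply_real_ids(root, {"t1": 205}))
    print(ET.tostring(root, encoding="unicode"))
```

The same mapping would also have to be applied to the structure references in the dictionary file, which the README says the real script does.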