IssueID #1487: added documentation for pipeline scripts
parent
b51bc1b87d
commit
6668e38a14
73
README
@@ -1,13 +1,66 @@
-Pipeline for parsing a file of arbitrary Slovene string and assigning
-(first creating, if necessary) structure_ids for each string.
# Structure assignment pipeline

Pipeline for parsing a list of arbitrary Slovene strings and assigning
each to a syntactic structure in the DDD database, generating
provisional new structures if necessary. The pipeline consists of two
separate Python 3 wrapper scripts, pipeline1.py and pipeline2.py, so
that the parsed units can be corrected manually in between (e.g., with
qcat).

## Setup

The scripts are mostly wrappers: they rely largely on generic scripts
and resources from other git repositories, as well as on Python
libraries. These dependencies can be set up with a dedicated script:

```
$ bin/setup.sh
```

## pipeline1.py

The pipeline1.py script expects as input a file of Slovene strings,
one string per line. It runs the [Python obeliks
tokeniser](https://pypi.org/project/obeliks/) on the input, makes
minor adjustments to the CoNLL-U output, and runs JOS-configured
[classla](https://pypi.org/project/classla/) to parse the result. It
then translates the JOS tags (MSDs and dependencies) from English to
Slovene and converts the output to TEI XML.

Example usage:

-$ cd scripts
-$ ./setup.sh
-$ echo "velika miza" > ../tmp/strings.txt
-$ echo "kdo ne more mimo česa" >> ../tmp/strings.txt
-$ echo "pazi, avto!" >> ../tmp/strings.txt
-$ echo "počitnice" >> ../tmp/strings.txt
-$ source ../venv/bin/activate
-$ python pipeline.py ../tmp/strings.txt ../tmp/dictionary.xml
```
$ python scripts/pipeline1.py -inlist strings.txt -outtei parsed.xml
```

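The input file needs exactly one string per line. As a minimal sketch, the following Python snippet writes such a file; the helper `write_string_list` is hypothetical and not part of the repository.

```python
from pathlib import Path


def write_string_list(strings, path):
    """Write one string per line (hypothetical helper, not part of the repository).

    Surrounding whitespace is stripped and empty entries are skipped.
    Returns the number of strings written.
    """
    cleaned = [s.strip() for s in strings if s.strip()]
    for s in cleaned:
        # An embedded line break would split one unit across two input lines.
        if len(s.splitlines()) > 1:
            raise ValueError(f"string contains a line break: {s!r}")
    Path(path).write_text("".join(s + "\n" for s in cleaned), encoding="utf-8")
    return len(cleaned)


if __name__ == "__main__":
    count = write_string_list(
        ["velika miza", "kdo ne more mimo česa", "pazi, avto!", "počitnice"],
        "strings.txt",
    )
    print(count, "strings written to strings.txt")
```

The resulting strings.txt can then be passed to pipeline1.py via `-inlist`.
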
## pipeline2.py

The pipeline2.py script expects as input a TEI XML file (in the same
format as the output of pipeline1.py) and an XML file of structure
specifications (normally the up-to-date CJVT structures.xml file). It
first splits the TEI file into two files, one with the
single-component units and the other with the multiple-component
units. For each file, it then assigns every unit to a syntactic
structure from the DDD database and converts the output into CJVT
dictionary XML format. For the single-component units this is
straightforward, but for multiple-component units it is more involved
and includes two runs of the MWE extraction script
[wani.py](https://gitea.cjvt.si/ozbolt/luscenje_struktur), generating
missing structures in between. At the end, the single-component and
multiple-component dictionary files are merged. Both the merged
dictionary file and the updated structure specification file are
validated with the appropriate XML schemas.

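The splitting step described above can be illustrated with a small sketch. It is not the code used by pipeline2.py: the element names (`<body>`, `<s>` for a unit, `<w>` for a component) are assumptions made for illustration, and the real TEI output may use a different layout and XML namespaces.

```python
import copy
import xml.etree.ElementTree as ET

# Toy input. The element names are assumptions, not the actual TEI layout.
SAMPLE = """<body>
  <s id="s1"><w>velika</w><w>miza</w></s>
  <s id="s2"><w>počitnice</w></s>
</body>"""


def split_units(root, unit_tag="s", component_tag="w"):
    """Return (single, multiple): two new roots holding copies of the units.

    Units with at most one component go to `single`, all others to `multiple`.
    """
    single = ET.Element(root.tag, dict(root.attrib))
    multiple = ET.Element(root.tag, dict(root.attrib))
    for unit in root.iter(unit_tag):
        n_components = len(unit.findall(component_tag))
        target = single if n_components <= 1 else multiple
        # Copy the unit so the input tree is left untouched.
        target.append(copy.deepcopy(unit))
    return single, multiple


if __name__ == "__main__":
    single, multiple = split_units(ET.fromstring(SAMPLE))
    print("single:", [u.get("id") for u in single])
    print("multiple:", [u.get("id") for u in multiple])
```
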
Example usage:

```
$ python scripts/pipeline2.py -intei parsed_corrected.xml -instructures structures.xml -outstructures structures_new.xml -outlexicon dictionary.xml
```

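Because manual correction happens between the two scripts, a driver has to pause after the first stage. The sketch below shows one possible way to chain the two example commands from Python. The functions `stage_commands` and `run_both` are hypothetical; only the script paths and command-line options are taken from the usage examples in this README, and `sys.executable` stands in for `python`.

```python
import subprocess
import sys


def stage_commands(strings, parsed, corrected, structures, new_structures, lexicon):
    """Return the two example command lines as argument lists."""
    stage1 = [sys.executable, "scripts/pipeline1.py",
              "-inlist", strings, "-outtei", parsed]
    stage2 = [sys.executable, "scripts/pipeline2.py",
              "-intei", corrected,
              "-instructures", structures,
              "-outstructures", new_structures,
              "-outlexicon", lexicon]
    return stage1, stage2


def run_both(run=subprocess.run, confirm=input, **paths):
    """Run stage 1, wait for the manual correction, then run stage 2."""
    stage1, stage2 = stage_commands(**paths)
    run(stage1, check=True)
    # The corrected file is produced by hand (e.g., with qcat), so wait here.
    confirm(f"Correct {paths['parsed']}, save it as {paths['corrected']}, "
            "then press Enter: ")
    run(stage2, check=True)


if __name__ == "__main__":
    run_both(strings="strings.txt", parsed="parsed.xml",
             corrected="parsed_corrected.xml", structures="structures.xml",
             new_structures="structures_new.xml", lexicon="dictionary.xml")
```

The `run` and `confirm` parameters exist only so that the sketch can be exercised without launching the real scripts.
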
Note that any new structures generated are given temporary ids
(@tempId), because they only get real ids once they are approved and
added to the DDD database. That can be done via the Django
import_structures.py script in the [ddd_core
repository](https://gitea.cjvt.si/ddd/ddd_core), normally as part of a
batch of DDD updates. That script replaces the temporary ids in the
structure specifications and updates the ids in the dictionary file.
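
To make the id replacement concrete, here is a minimal sketch of swapping temporary ids for real ones in a structure specification. It is not the import_structures.py implementation: the element name `syntactic_structure`, the sample ids, and the `id_map` dictionary are assumptions for illustration, and in practice the real ids come from the DDD database.

```python
import xml.etree.ElementTree as ET

# Toy structure specification. Element name and ids are assumptions.
SAMPLE = """<structures>
  <syntactic_structure id="1"/>
  <syntactic_structure tempId="t1"/>
  <syntactic_structure tempId="t2"/>
</structures>"""


def assign_real_ids(root, id_map):
    """Replace tempId with id where a real id is known.

    Returns the temporary ids that have no entry in `id_map`.
    """
    unresolved = []
    for elem in root.iter():
        temp_id = elem.get("tempId")
        if temp_id is None:
            continue
        if temp_id in id_map:
            elem.set("id", str(id_map[temp_id]))
            del elem.attrib["tempId"]
        else:
            unresolved.append(temp_id)
    return unresolved


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    # Hypothetical mapping from temporary id to approved database id.
    left_over = assign_real_ids(root, {"t1": 205})
    print(ET.tostring(root, encoding="unicode"))
    print("still temporary:", left_over)
```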