IssueID #1487: added documentation for pipeline scripts

2021-01-15 16:59:09 +01:00 · 2021-01-15 16:59:09 +01:00 · 6668e38a14
commit 6668e38a14
parent b51bc1b87d
1 changed file with 63 additions and 10 deletions
--- a/73
+++ b/73
@@ -1,13 +1,66 @@
-Pipeline for parsing a file of arbitrary Slovene string and assigning
+# Structure assignment pipeline
-(first creating, if necessary) structure_ids for each string.
+
 Pipeline for parsing a list of arbitrary Slovene strings and assigning
 each to a syntactic structure in the DDD database, generating
 provisional new structures if necessary. The pipeline consists of two
 separate wrapper Python 3 scripts, pipeline1.py and pipeline2.py, in
 order to facilitate manual corrections of the parsed units in between
 (e.g., with qcat).
 ## Setup
 The scripts are mostly wrappers, in that they largely use generic
 scripts and resources from other git repositories, as well as python
 libraries. These can be set up with a special script:
 ```
 $ bin/setup.sh
 ```
 ## pipeline1.py
 The pipeline1.py script expects as input a file of Slovene strings,
 one string per line. It then runs the [python obeliks
 tokeniser](https://pypi.org/project/obeliks/) on the input, tweaks the
 CoNLL-U output a little, and runs JOS-configured
 [classla](https://pypi.org/project/classla/) to parse the output. It
 then translates the JOS tags (msds and dependencies) from English to
 Slovene and converts the output to TEI XML.
 Example usage:
-$ cd scripts
+```
-$ ./setup.sh
+$ python scripts/pipeline1.py -inlist strings.txt -outtei parsed.xml
-$ echo "velika miza" > ../tmp/strings.txt
+```
-$ echo "kdo ne more mimo česa" >> ../tmp/strings.txt
+
-$ echo "pazi, avto!" >> ../tmp/strings.txt
+## pipeline2.py
-$ echo "počitnice" >> ../tmp/strings.txt
+
-$ source ../venv/bin/activate
+The pipeline2.py script expects as input a TEI XML file (in the same
-$ python pipeline.py ../tmp/strings.txt ../tmp/dictionary.xml
+particular format as the output of pipeline1.py) and an XML file of
 structure specifications (normally the up-to-date CJVT structures.xml
 file). It first splits the TEI file into two files, one with the
 single-component units and the other with the multiple-component
 units. For each, it then assigns each unit to a syntactic structure
 from the DDD database and converts the output into CJVT dictionary XML
 format. For the single-component units, this is straightforward, but
 for multiple-component units it is more involved, and includes two
 runs of the MWE extraction script
 [wani.py](https://gitea.cjvt.si/ozbolt/luscenje_struktur), generating
 missing structures in between. At the end, the single-component and
 multiple-component dictionary files are merged. Both the merged
 dictionary file and the updated structure specification file are
 validated with the appropriate XML schemas.
 Example usage:
 ```
 $ python scripts/pipeline2.py -intei parsed_corrected.xml -instructures structures.xml -outstructures structures_new.xml -outlexicon dictionary.xml
 ```
 Note that any new structures generated are given temporary ids
 (@tempId), because they only get real ids once they are approved and
 added to the DDD database. That can be done via the Django
 import_structures.py script in the [ddd_core
 repository](https://gitea.cjvt.si/ddd/ddd_core), normally as part of a
 batch of DDD updates. That script replaces the temporary ids in the
 structure specifications and updates the ids in the dictionary file.
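
The two documented invocations can be prepared from Python, for instance when the input strings come from another program. The following is a minimal sketch, not part of the repository: the helper names `write_strings`, `pipeline1_command` and `pipeline2_command` are hypothetical, while the flags (`-inlist`, `-outtei`, `-intei`, `-instructures`, `-outstructures`, `-outlexicon`) and file names are taken from the example usages in the README. It assumes the manual correction step between the two scripts produces `parsed_corrected.xml` by hand, so it only prints the commands instead of running them.

```python
import shlex
import sys
from pathlib import Path


def write_strings(path, strings):
    """Write one string per line, the input format pipeline1.py expects.

    Blank entries are skipped. Returns the number of lines written.
    """
    lines = [s.strip() for s in strings if s.strip()]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


def pipeline1_command(inlist, outtei):
    """Build the documented pipeline1.py invocation (hypothetical helper)."""
    return [sys.executable, "scripts/pipeline1.py",
            "-inlist", str(inlist), "-outtei", str(outtei)]


def pipeline2_command(intei, instructures, outstructures, outlexicon):
    """Build the documented pipeline2.py invocation (hypothetical helper)."""
    return [sys.executable, "scripts/pipeline2.py",
            "-intei", str(intei),
            "-instructures", str(instructures),
            "-outstructures", str(outstructures),
            "-outlexicon", str(outlexicon)]


if __name__ == "__main__":
    # Step 1: parse the raw strings into TEI XML.
    print(shlex.join(pipeline1_command("strings.txt", "parsed.xml")))
    # Step 2 (after manual correction of parsed.xml): assign structures.
    print(shlex.join(pipeline2_command(
        "parsed_corrected.xml", "structures.xml",
        "structures_new.xml", "dictionary.xml")))
```

To execute a step rather than print it, the returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root, once `bin/setup.sh` has been run.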