IssueID #1487: added documentation for pipeline scripts

2021-01-15 16:59:09 +01:00
parent b51bc1b87d
commit 6668e38a14
1 changed files with 63 additions and 10 deletions
@@ -1,13 +1,66 @@
-Pipeline for parsing a file of arbitrary Slovene string and assigning
-(first creating, if necessary) structure_ids for each string.
+# Structure assignment pipeline
+
+Pipeline for parsing a list of arbitrary Slovene strings and assigning
+each to a syntactic structure in the DDD database, generating
+provisional new structures where necessary. The pipeline is split into
+two Python 3 wrapper scripts, pipeline1.py and pipeline2.py, so that
+the parsed units can be corrected manually between the two steps
+(e.g., with qcat).
+
+## Setup
+
+The scripts are mostly wrappers: they rely largely on generic scripts
+and resources from other git repositories, as well as on Python
+libraries. These dependencies can be set up with the setup script:
+
+```
+$ bin/setup.sh
+```
+
+## pipeline1.py
+
+The pipeline1.py script expects as input a file of Slovene strings,
+one string per line. It runs the [Python obeliks
+tokeniser](https://pypi.org/project/obeliks/) on the input, makes
+minor adjustments to the CoNLL-U output, and parses the result with
+JOS-configured [classla](https://pypi.org/project/classla/). Finally, it
+translates the JOS tags (MSDs and dependencies) from English to
+Slovene and converts the output to TEI XML.

 Example usage:

-$ cd scripts
-$ ./setup.sh
-$ echo "velika miza" > ../tmp/strings.txt
-$ echo "kdo ne more mimo česa" >> ../tmp/strings.txt
-$ echo "pazi, avto!" >> ../tmp/strings.txt
-$ echo "počitnice" >> ../tmp/strings.txt
-$ source ../venv/bin/activate
-$ python pipeline.py ../tmp/strings.txt ../tmp/dictionary.xml
+```
+$ python scripts/pipeline1.py -inlist strings.txt -outtei parsed.xml
+```
+
+## pipeline2.py
+
+The pipeline2.py script expects as input a TEI XML file (in the same
+format as the output of pipeline1.py) and an XML file of structure
+specifications (normally the up-to-date CJVT structures.xml file). It
+first splits the TEI file into two files, one with the
+single-component units and the other with the multiple-component
+units. For each file, it assigns every unit to a syntactic structure
+from the DDD database and converts the output into CJVT dictionary XML
+format. For single-component units this is straightforward; for
+multiple-component units it is more involved and includes two
+runs of the MWE extraction script
+[wani.py](https://gitea.cjvt.si/ozbolt/luscenje_struktur), with the
+missing structures generated in between. At the end, the single-component and
+multiple-component dictionary files are merged. Both the merged
+dictionary file and the updated structure specification file are
+validated against the appropriate XML schemas.
+
+Example usage:
+
+```
+$ python scripts/pipeline2.py -intei parsed_corrected.xml -instructures structures.xml -outstructures structures_new.xml -outlexicon dictionary.xml
+```
+
+Note that any newly generated structures are given temporary ids
+(@tempId), because they only receive real ids once they are approved
+and added to the DDD database. That can be done with the Django
+import_structures.py script in the [ddd_core
+repository](https://gitea.cjvt.si/ddd/ddd_core), normally as part of a
+batch of DDD updates. That script replaces the temporary ids in the
+structure specifications and updates the ids in the dictionary file.
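
The two-step usage documented in the diff above could be scripted roughly as follows. This is only a sketch: the helper names (`write_input_list`, `pipeline1_command`, `pipeline2_command`) are hypothetical and not part of the repository. The command-line flags and file names are the ones shown in the README examples, and the sample strings come from the removed usage example.

```python
from pathlib import Path


def write_input_list(strings, path):
    """Write one string per line, the input format pipeline1.py expects.

    Hypothetical helper: empty entries are skipped and surrounding
    whitespace is trimmed. Returns the lines that were written.
    """
    lines = [s.strip() for s in strings if s.strip()]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


def pipeline1_command(inlist, outtei):
    """Build the argument list for the documented pipeline1.py call."""
    return ["python", "scripts/pipeline1.py",
            "-inlist", str(inlist),
            "-outtei", str(outtei)]


def pipeline2_command(intei, instructures, outstructures, outlexicon):
    """Build the argument list for the documented pipeline2.py call."""
    return ["python", "scripts/pipeline2.py",
            "-intei", str(intei),
            "-instructures", str(instructures),
            "-outstructures", str(outstructures),
            "-outlexicon", str(outlexicon)]


# Sample strings taken from the old usage example in the README.
EXAMPLE_STRINGS = [
    "velika miza",
    "kdo ne more mimo česa",
    "pazi, avto!",
    "počitnice",
]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root. The manual correction step (e.g., with qcat) would happen between the two calls, which is why the sketch builds the two commands separately instead of chaining them.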