diff --git a/README b/README
index 67fcd13..b199a30 100644
--- a/README
+++ b/README
@@ -1,13 +1,66 @@
-Pipeline for parsing a file of arbitrary Slovene string and assigning
-(first creating, if necessary) structure_ids for each string.
+# Structure assignment pipeline
+
+Pipeline for parsing a list of arbitrary Slovene strings and assigning
+each to a syntactic structure in the DDD database, generating
+provisional new structures if necessary. The pipeline consists of two
+separate wrapper python3 scripts, pipeline1.py and pipeline2.py, in
+order to facilitate manual corrections of the parsed units in between
+(e.g., with qcat).
+
+## Setup
+
+The scripts are mostly wrappers, in that they largely use generic
+scripts and resources from other git repositories, as well as python
+libraries. These can be set up with a special script:
+
+```
+$ bin/setup.sh
+```
+
+## pipeline1.py
+
+The pipeline1.py script expects as input a file of Slovene strings,
+one string per line. It then runs the [python obeliks
+tokeniser](https://pypi.org/project/obeliks/) on the input, tweaks the
+conllu output a little, and runs JOS-configured
+[classla](https://pypi.org/project/classla/) to parse the output. It
+then translates the JOS tags (msds and dependencies) from English to
+Slovene and converts the output to TEI XML.
 
 Example usage:
 
-$ cd scripts
-$ ./setup.sh
-$ echo "velika miza" > ../tmp/strings.txt
-$ echo "kdo ne more mimo česa" >> ../tmp/strings.txt
-$ echo "pazi, avto!" >> ../tmp/strings.txt
-$ echo "počitnice" >> ../tmp/strings.txt
-$ source ../venv/bin/activate
-$ python pipeline.py ../tmp/strings.txt ../tmp/dictionary.xml
+```
+$ python scripts/pipeline1.py -inlist strings.txt -outtei parsed.xml
+```
+
+## pipeline2.py
+
+The pipeline2.py script expects as input a TEI XML file (in the same
+particular format as the output of pipeline1.py) and an xml file of
+structure specifications (normally the up-to-date CJVT structures.xml
+file). It first splits the TEI file into two files, one with the
+single-component units and the other with the multiple-component
+units. For each, it then assigns each unit to a syntactic structure
+from the DDD database and converts the output into CJVT dictionary XML
+format. For the single-component units, this is pretty trivial, but
+for multiple-component units it is more involved, and includes two
+runs of the MWE extraction script
+[wani.py](https://gitea.cjvt.si/ozbolt/luscenje_struktur), generating
+missing structures in between. At the end, the single-component and
+multiple-component dictionary files are merged. Both the merged
+dictionary file and the updated structure specification file are
+validated with the appropriate XML schemas.
+
+Example usage:
+
+```
+$ python scripts/pipeline2.py -intei parsed_corrected.xml -instructures structures.xml -outstructures structures_new.xml -outlexicon dictionary.xml
+```
+
+Note that any new structures generated are given temporary ids
+(@tempId), because they only get real ids once they are approved and
+added to the DDD database. That can be done via the django
+import_structures.py script in the [ddd_core
+repository](https://gitea.cjvt.si/ddd/ddd_core), normally as part of a
+batch of DDD updates. That script replaces the temporary ids in the
+structure specifications and updates the ids in the dictionary file.