Pipeline which combines scripts and resources from other repositories to parse strings and assign them to standard CJVT structures, creating new structures if necessary.

Structure assignment pipeline

Pipeline for parsing a list of arbitrary Slovene strings and assigning each to a syntactic structure in the DDD database, generating provisional new structures if necessary.

Setup

Most of the scripts come from other repositories and Python libraries. To set them up, run the set-up script:

$ scripts/setup.sh

Usage

The main script is scripts/process.py. It has several modes (selected via the -mode parameter), depending on whether you want to run the whole pipeline from start to finish (daring!) or stop in between to correct the parse manually. XML validation is also provided separately.

strings_to_parse

The input is a file of Slovene strings (one string per line). The script runs the Python obeliks tokeniser on the input, tweaks the CoNLL-U output a little, and runs JOS-configured classla to parse the tokenised output. It then translates the JOS tags (MSDs and dependencies) from English to Slovene and converts the output to TEI XML. Example:

$ python scripts/process.py -mode strings_to_parse -infile /tmp/strings.txt -outfile /tmp/parsed.xml
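If the strings come from another program rather than a hand-written file, the input file can be prepared along these lines. This is only a sketch: the helper name write_strings_file is hypothetical and not part of this repository, and writing UTF-8 and skipping blank entries are assumptions about what the pipeline expects, not something it documents.

```python
from pathlib import Path


def write_strings_file(strings, path):
    """Write one string per line to `path` and return the number of lines.

    Hypothetical helper, not part of this repository. UTF-8 output and the
    dropping of blank entries are assumptions, not documented requirements.
    """
    # Trim surrounding whitespace and drop entries that are empty afterwards.
    cleaned = [s.strip() for s in strings if s and s.strip()]
    # One string per line, each line terminated by a newline.
    text = "".join(line + "\n" for line in cleaned)
    Path(path).write_text(text, encoding="utf-8")
    return len(cleaned)
```

The resulting file can then be passed to process.py via -infile, as in the example above.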

parse_to_dictionary

The input should be a TEI XML file (in the same particular format as the output of strings_to_parse) and an XML file of structure specifications (the CJVT structures.xml file, supplemented with temporary new structures, if needed). The script first splits the TEI file into two files, one with the single-component units and the other with the multiple-component units. For each file, it then assigns each unit to a syntactic structure from the DDD database and converts the output into CJVT dictionary XML format. For the single-component units, this is pretty trivial, but for multiple-component units it is more involved, and includes two runs of the MWE extraction script wani.py, generating missing structures in between. At the end, the single-component and multiple-component dictionary files are merged into one dictionary file. Example:

$ python scripts/process.py -mode parse_to_dictionary -infile /tmp/parsed.xml -outfile /tmp/dictionary.xml -structures /tmp/structures_new.xml

strings_to_dictionary

Combines strings_to_parse and parse_to_dictionary into one call (whereby you forfeit the chance to fix potential parse errors in between). Example:

$ python scripts/process.py -mode strings_to_dictionary -infile /tmp/strings.txt -outfile /tmp/dictionary.xml -structures /tmp/structures_new.xml

all

Same as strings_to_dictionary, but also validates the dictionary and structures outputs, just in case. Example:

$ python scripts/process.py -mode all -infile /tmp/strings.txt -outfile /tmp/dictionary.xml -structures /tmp/structures_new.xml
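When calling the pipeline from another Python program, the command lines shown in this section can be assembled with a small helper. The sketch below is hypothetical (build_command is not part of this repository). The rule that every mode except strings_to_parse takes -structures is inferred from the examples in this README, not from process.py itself.

```python
# Modes as listed in this README. Which modes take -structures is inferred
# from the example commands, not from process.py itself (an assumption).
MODES_WITH_STRUCTURES = {"parse_to_dictionary", "strings_to_dictionary", "all"}
MODES = MODES_WITH_STRUCTURES | {"strings_to_parse"}


def build_command(mode, infile, outfile, structures=None,
                  script="scripts/process.py"):
    """Return the argument list for one process.py call (hypothetical helper)."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    cmd = ["python", script, "-mode", mode,
           "-infile", infile, "-outfile", outfile]
    if mode in MODES_WITH_STRUCTURES:
        if structures is None:
            raise ValueError(f"mode {mode} needs a structures file")
        cmd += ["-structures", structures]
    return cmd
```

The returned list can be passed to subprocess.run(cmd, check=True) from the repository root, so that a failing pipeline run raises an exception instead of passing silently.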

REST API

The package provides a REST API with endpoints roughly mirroring the process.py modes. For the calls accepting strings as input, GET calls are also supported for single-string input. For the calls resulting in dictionaries, the results include both the dictionary entries and the structures they use. If processing resulted in temporary new structures, their number is recorded in @new_structures.

Example curl calls:

$ curl -k https://proc1.cjvt.si/structures/strings_to_parse?string=velika%20miza
$ curl -k -X POST -F file=@/tmp/strings.txt https://proc1.cjvt.si/structures/strings_to_parse
$ curl -k -X POST -F file=@/tmp/parse.xml https://proc1.cjvt.si/structures/parse_to_dictionary
$ curl -k https://proc1.cjvt.si/structures/strings_to_dictionary?string=velika%20miza
$ curl -k -X POST -F file=@/tmp/strings.txt https://proc1.cjvt.si/structures/strings_to_dictionary
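For single-string GET calls made from Python rather than curl, the string has to be percent-encoded in the same way as velika%20miza above. A minimal sketch, assuming only the endpoints and the string query parameter shown in the curl examples; the helper name single_string_url is hypothetical:

```python
from urllib.parse import quote

# Base URL and endpoint names are taken from the curl examples in this README.
BASE_URL = "https://proc1.cjvt.si/structures"
GET_ENDPOINTS = {"strings_to_parse", "strings_to_dictionary"}


def single_string_url(endpoint, string, base_url=BASE_URL):
    """Build the GET URL for a single-string call (hypothetical helper)."""
    if endpoint not in GET_ENDPOINTS:
        raise ValueError(f"no single-string GET call for: {endpoint}")
    # safe="" percent-encodes every reserved character; spaces become %20
    # and non-ASCII letters are encoded as UTF-8 bytes.
    return f"{base_url}/{endpoint}?string={quote(string, safe='')}"
```

The resulting URL can be fetched with any HTTP client. Note that the curl examples pass -k, which turns off certificate verification, so a client that verifies certificates by default may need equivalent configuration.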

Note

Any new structures generated are given temporary ids (@tempId), because they only get real ids once they are approved and added to the DDD database. That is normally done via the Django import_structures.py script in the ddd_core repository, as part of a batch of DDD updates. That script replaces the temporary ids in the structure specifications and updates the ids in the dictionary file.