Pipeline which combines scripts and resources from other repositories to parse strings and assign them to standard CJVT structures, creating new structures if necessary.
Cyprian Laskowski cc82cc28e1
Redmine #2619: set specific cordex version and added requirements file
4 weeks ago


Structure assignment pipeline

Pipeline for parsing a list of arbitrary Slovene strings and assigning each to a syntactic structure in the DDD database, generating provisional new structures if necessary.


To install, install the package itself and then download the CLASSLA standard_jos models, which the parser requires:

pip install .
python -c "import classla; classla.download('sl', dir='resources/classla', type='standard_jos')"

The classla directory does not necessarily need to be placed under resources/, but the wrapper script scripts/process.py assumes that it is.


The main script is scripts/process.py. It has several modes (selected via the -mode parameter), depending on whether you want to run the whole pipeline from start to finish (daring!) or to intervene manually in the parse in between. XML validation is also provided separately.


In strings_to_parse mode, the input is a file of Slovene strings (one string per line). The script runs the Python obeliks tokeniser on the input, tweaks the CoNLL-U output a little, and runs JOS-configured classla to parse the result. It then translates the JOS tags (MSDs and dependencies) from English to Slovene and converts the output to TEI XML. Example:

$ python process.py -mode strings_to_parse -infile /tmp/strings.txt -outfile /tmp/parsed.xml
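
Since the input format is one string per line, a small helper can prepare the input file from a list of strings. This is a minimal sketch and not part of the repository: the function name write_strings_file is hypothetical, and collapsing internal whitespace and skipping empty entries are assumptions about what makes a sensible input, not documented behaviour of process.py.

```python
from pathlib import Path
from typing import Iterable


def write_strings_file(strings: Iterable[str], path: str) -> int:
    """Write one string per line and return the number of lines written.

    Hypothetical helper: whitespace handling here is an assumption,
    not something process.py is documented to require.
    """
    cleaned = []
    for s in strings:
        # Collapse internal whitespace (including newlines) so that each
        # string occupies exactly one line of the file.
        line = " ".join(s.split())
        if line:  # skip empty or whitespace-only entries
            cleaned.append(line)
    Path(path).write_text("".join(line + "\n" for line in cleaned), encoding="utf-8")
    return len(cleaned)
```

The resulting file can then be passed as -infile in the call above.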


In parse_to_dictionary mode, the input should be a TEI XML file (in the same format as the output of strings_to_parse) and an XML file of structure specifications. The script first uses the MWE extraction script to find and assign all matches for collocation structures. For units without such matches, it then finds (creating, if necessary) and assigns single-component or other structures. Finally, the TEI is converted to CJVT dictionary XML format. Example:

$ python process.py -mode parse_to_dictionary -infile /tmp/parsed.xml -instructs /tmp/structures_old.xml -outfile /tmp/dictionary.xml -outstructs /tmp/structures_new.xml


The strings_to_dictionary mode combines strings_to_parse and parse_to_dictionary into one call (whereby you forfeit the chance to fix potential parse errors in between). Example:

$ python process.py -mode strings_to_dictionary -infile /tmp/strings.txt -instructs /tmp/structures_old.xml -outfile /tmp/dictionary.xml -outstructs /tmp/structures_new.xml


The all mode is the same as strings_to_dictionary, but also validates the dictionary and structures outputs, just in case. Example:

$ python process.py -mode all -infile /tmp/strings.txt -instructs /tmp/structures_old.xml -outfile /tmp/dictionary.xml -outstructs /tmp/structures_new.xml
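
When calling the script from other Python code, it can help to assemble the argument list in one place. The following is a rough sketch, not part of the repository: build_command is a hypothetical helper, the flags are the ones shown in the examples above, and the rule that every mode except strings_to_parse needs both -instructs and -outstructs is inferred from those examples rather than confirmed by the script itself.

```python
from typing import List, Optional

# Modes as they appear in the examples above.
MODES_WITH_STRUCTURES = {"parse_to_dictionary", "strings_to_dictionary", "all"}
MODES = {"strings_to_parse"} | MODES_WITH_STRUCTURES


def build_command(
    mode: str,
    infile: str,
    outfile: str,
    instructs: Optional[str] = None,
    outstructs: Optional[str] = None,
) -> List[str]:
    """Return an argv list for process.py (hypothetical helper)."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    cmd = ["python", "process.py", "-mode", mode, "-infile", infile]
    if mode in MODES_WITH_STRUCTURES:
        # Assumption: these modes need both structure files, as in the examples.
        if instructs is None or outstructs is None:
            raise ValueError(f"mode {mode} needs instructs and outstructs")
        cmd += ["-instructs", instructs]
    cmd += ["-outfile", outfile]
    if mode in MODES_WITH_STRUCTURES:
        cmd += ["-outstructs", outstructs]
    return cmd
```

The returned list can be handed to subprocess.run(cmd, check=True), run from the scripts directory so that process.py is found.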


The package provides a REST API with endpoints roughly mirroring the process.py modes. For most calls, POST is needed, so that input structures can be easily provided. If processing resulted in temporary new structures, their number is recorded in @new_structures.

Example curl calls:

$ curl -k "https://proc1.cjvt.si/structures/strings_to_parse?string=velika%20miza"
$ curl -k -X POST -F strings=@/tmp/strings.txt https://proc1.cjvt.si/structures/strings_to_parse
$ curl -k -X POST -F parsed=@/tmp/parse.xml -F structures=@/tmp/structures.xml https://proc1.cjvt.si/structures/parse_to_dictionary
$ curl -k -X POST -F strings=@/tmp/strings.txt -F structures=@/tmp/structures.xml https://proc1.cjvt.si/structures/strings_to_dictionary
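
The single-string GET call needs the string to be percent-encoded, as in velika%20miza above. A minimal sketch of building such a URL with the Python standard library follows; the helper name single_string_url is hypothetical, and the base URL and the string query parameter are taken from the curl examples rather than from any API documentation.

```python
from urllib.parse import quote, urlencode

# Host and path prefix as used in the curl examples above.
BASE_URL = "https://proc1.cjvt.si/structures"


def single_string_url(endpoint: str, string: str, base_url: str = BASE_URL) -> str:
    """Build a GET URL for one input string (hypothetical helper)."""
    # quote_via=quote encodes spaces as %20 (the default would use '+')
    # and percent-encodes non-ASCII characters as UTF-8 bytes.
    query = urlencode({"string": string}, quote_via=quote)
    return f"{base_url}/{endpoint}?{query}"
```

For example, single_string_url("strings_to_parse", "velika miza") yields the same URL as the first curl call.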


Note that any new structures generated are given temporary ids (@tempId), because they only get real ids once they are approved and added to the DDD database. That is normally done via the Django import_structures.py script in the ddd_core repository, which replaces the temporary ids in the structure specifications and updates the ids in the dictionary file accordingly.
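
To check whether a structures output still contains unapproved structures, one option is to look for tempId attributes. The sketch below assumes only that temporary ids are stored in a tempId attribute, as described above; the function name temp_ids and the element names in the usage example are hypothetical and do not reflect the real structure specification format.

```python
import xml.etree.ElementTree as ET
from typing import List


def temp_ids(xml_text: str) -> List[str]:
    """Return all tempId attribute values in document order (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    # root.iter() walks the root element and all of its descendants,
    # so no assumption about element names or nesting depth is needed.
    return [el.get("tempId") for el in root.iter() if el.get("tempId") is not None]
```

With a made-up document such as `<structures><structure id="1"/><structure tempId="t1"/></structures>`, the function returns the single temporary id, and an empty list means nothing is awaiting approval.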