# Structure assignment pipeline
Pipeline for parsing a list of arbitrary Slovene strings and assigning
each to a syntactic structure in the DDD database, generating
provisional new structures if necessary.
## Installation
Installation requires the [CLASSLA](https://github.com/clarinsi/classla) standard_jos models:

    pip install .
    python -c "import classla; classla.download('sl', dir='resources/classla', type='standard_jos')"

The classla directory does not necessarily need to be placed under resources/, but the wrapper
script scripts/process.py assumes that it is.
## Usage
The main script is scripts/process.py. There are several modes
(selected via the -mode parameter), depending on whether you want to
run the whole pipeline from start to finish (daring!) or with manual
intervention (to correct the parse) in between. XML validation is also
provided separately.
### strings_to_parse
The input is a file of Slovene strings (one string per line). The
script runs the [python obeliks
tokeniser](https://pypi.org/project/obeliks/) on the input, tweaks the
conllu output a little, and runs JOS-configured
[classla](https://pypi.org/project/classla/) to parse the output. It
then translates the JOS tags (msds and dependencies) from English to
Slovene and converts the output to TEI XML. Example:
```
$ python process.py -mode strings_to_parse -infile /tmp/strings.txt -outfile /tmp/parsed.xml
```
### parse_to_dictionary
The input should be a TEI XML file (in the same particular format as
the output of strings_to_parse) and an XML file of structure
specifications. The script first uses the MWE extraction script to
find and assign all matches for collocation structures. For units
without such matches, it then finds (creating, if necessary) and
assigns single-component or other structures. Finally the TEI is
converted to CJVT dictionary XML format. Example:
```
$ python process.py -mode parse_to_dictionary -infile /tmp/parsed.xml -instructs /tmp/structures_old.xml -outfile /tmp/dictionary.xml -outstructs /tmp/structures_new.xml
```
### strings_to_dictionary
Combines strings_to_parse and parse_to_dictionary into one call
(whereby you forfeit the chance to fix potential parse errors in
between). Example:
```
$ python process.py -mode strings_to_dictionary -infile /tmp/strings.txt -instructs /tmp/structures_old.xml -outfile /tmp/dictionary.xml -outstructs /tmp/structures_new.xml
```
### all
Same as strings_to_dictionary, but also validates the dictionary and
structures outputs, just in case. Example:
```
$ python process.py -mode all -infile /tmp/strings.txt -instructs /tmp/structures_old.xml -outfile /tmp/dictionary.xml -outstructs /tmp/structures_new.xml
```
## REST API
The package provides a REST API with endpoints roughly mirroring the
process.py modes. For most calls, POST is needed, so that input
structures can be easily provided. If processing produces temporary
new structures, their number is recorded in @new_structures.
Example curl calls:
```
$ curl -k https://proc1.cjvt.si/structures/strings_to_parse?string=velika%20miza
$ curl -k -X POST -F strings=@/tmp/strings.txt https://proc1.cjvt.si/structures/strings_to_parse
$ curl -k -X POST -F parsed=@/tmp/parse.xml -F structures=@/tmp/structures.xml https://proc1.cjvt.si/structures/parse_to_dictionary
$ curl -k -X POST -F strings=@/tmp/strings.txt -F structures=@/tmp/structures.xml https://proc1.cjvt.si/structures/strings_to_dictionary
```
## Note
Note that any new structures generated are given temporary ids
(@tempId), because they only get real ids once they are approved and
added to the DDD database. That is normally done via the Django
import_structures.py script in the [ddd_core
repository](https://gitea.cjvt.si/ddd/ddd_core), which replaces the
temporary ids in the structure specifications and updates the ids in
the dictionary file.
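
Before importing, it can be handy to see which temporary ids a structures file contains. The sketch below assumes only that temporary ids appear as a `tempId` attribute on some XML element; the element names in `SAMPLE` are invented for illustration and do not reflect the actual structure specification format:

```python
import xml.etree.ElementTree as ET


def temporary_ids(xml_text):
    """Collect the values of all tempId attributes, in document order."""
    root = ET.fromstring(xml_text)
    return [el.get("tempId") for el in root.iter() if el.get("tempId") is not None]


# Invented sample; the real structures XML may look different.
SAMPLE = """<structures>
  <structure id="1"/>
  <structure tempId="t1"/>
  <structure tempId="t2"/>
</structures>"""

if __name__ == "__main__":
    print(temporary_ids(SAMPLE))
```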