# Structure assignment pipeline

Pipeline for parsing a list of arbitrary Slovene strings and assigning
each to a syntactic structure in the DDD database, generating
provisional new structures if necessary.

## Installation

Installation requires the [CLASSLA](https://github.com/clarinsi/classla) standard_jos models:

    pip install .
    python -c "import classla; classla.download('sl', dir='resources/classla', type='standard_jos')"

The classla directory does not necessarily need to be placed under resources/, but the wrapper
script scripts/process.py assumes that it is.

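Since the wrapper script expects the models under resources/classla, a quick pre-flight check can save a failed run. The following is a minimal sketch and not part of the package: the helper name `models_present` is hypothetical, and it only checks that the directory exists and is non-empty, not that the download is complete.

```python
from pathlib import Path


def models_present(models_dir="resources/classla"):
    """Return True if models_dir exists and contains at least one entry.

    Hypothetical helper: it does not verify which model files are present.
    """
    path = Path(models_dir)
    return path.is_dir() and any(path.iterdir())


if __name__ == "__main__":
    if not models_present():
        print("CLASSLA models not found under resources/classla; run the download step first.")
```

If the models live elsewhere, pass that directory to the helper instead of relying on the default.
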
## Usage

The main script is scripts/process.py. There are several modes
(identified via the -mode parameter), depending on whether you want to
run the whole pipeline from start to finish (daring!), or in two steps
with manual correction of the parse in between. XML validation is also
provided separately.

### strings_to_parse

The input is a file of Slovene strings (one string per line). The
script runs the [python obeliks
tokeniser](https://pypi.org/project/obeliks/) on the input, tweaks the
CoNLL-U output a little, and runs JOS-configured
[classla](https://pypi.org/project/classla/) to parse the output. It
then translates the JOS tags (MSDs and dependencies) from English to
Slovene and converts the output to TEI XML. Example:

```
$ python process.py -mode strings_to_parse -infile /tmp/strings.txt -outfile /tmp/parsed.xml
```

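Because the input format is one string per line, strings coming from another source may need normalising first. The sketch below shows one way to write such a file. The helper name `write_strings_file` is hypothetical, and collapsing internal whitespace and skipping empty strings are assumptions about what a clean input looks like, not requirements stated by the pipeline.

```python
from pathlib import Path


def write_strings_file(strings, path):
    """Write one string per line and return the number of lines written.

    Internal whitespace (including newlines) is collapsed to single spaces
    and empty strings are skipped, so each string occupies exactly one line.
    """
    lines = [" ".join(s.split()) for s in strings]
    lines = [line for line in lines if line]
    Path(path).write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return len(lines)


if __name__ == "__main__":
    write_strings_file(["velika miza", "rdeča hiša"], "/tmp/strings.txt")
```

The resulting file can then be passed to process.py via -infile.
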
### parse_to_dictionary

The input should be a TEI XML file (in the same particular format as
the output of strings_to_parse) and an XML file of structure
specifications. The script first uses the MWE extraction script to
find and assign all matches for collocation structures. For units
without such matches, it then finds (creating, if necessary) and
assigns single-component or other structures. Finally, the TEI is
converted to CJVT dictionary XML format. Example:

```
$ python process.py -mode parse_to_dictionary -infile /tmp/parsed.xml -instructs /tmp/structures_old.xml -outfile /tmp/dictionary.xml -outstructs /tmp/structures_new.xml
```

### strings_to_dictionary

Combines strings_to_parse and parse_to_dictionary into one call
(whereby you forfeit the chance to fix potential parse errors in
between). Example:

```
$ python process.py -mode strings_to_dictionary -infile /tmp/strings.txt -instructs /tmp/structures_old.xml -outfile /tmp/dictionary.xml -outstructs /tmp/structures_new.xml
```

### all

Same as strings_to_dictionary, but also validates the dictionary and
structures outputs, just in case.

```
$ python process.py -mode all -infile /tmp/strings.txt -instructs /tmp/structures_old.xml -outfile /tmp/dictionary.xml -outstructs /tmp/structures_new.xml
```

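The modes above differ only in the value of -mode and in whether the structure files are passed. When the pipeline is driven from another Python script, the command lines can be assembled programmatically. This is a minimal sketch under stated assumptions: `build_command` is a hypothetical helper, the flags are the ones shown in the examples above, and the script path assumes the caller runs from the repository root.

```python
import sys

SCRIPT = "scripts/process.py"  # assumed path relative to the repository root


def build_command(mode, infile, outfile, instructs=None, outstructs=None):
    """Assemble the argument list for one process.py call.

    Flags appear in the same order as in the README examples; the
    structure files are added only when given.
    """
    cmd = [sys.executable, SCRIPT, "-mode", mode, "-infile", infile]
    if instructs is not None:
        cmd += ["-instructs", instructs]
    cmd += ["-outfile", outfile]
    if outstructs is not None:
        cmd += ["-outstructs", outstructs]
    return cmd


# Two-step workflow: parse first, fix the parse by hand, then assign structures.
step1 = build_command("strings_to_parse", "/tmp/strings.txt", "/tmp/parsed.xml")
step2 = build_command(
    "parse_to_dictionary",
    "/tmp/parsed.xml",
    "/tmp/dictionary.xml",
    instructs="/tmp/structures_old.xml",
    outstructs="/tmp/structures_new.xml",
)
```

Each list can be passed to `subprocess.run(cmd, check=True)`. Running `step1`, correcting /tmp/parsed.xml manually, and then running `step2` keeps the chance to fix parse errors that the single-call modes give up.
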
## REST API

The package provides a REST API with endpoints roughly mirroring the
process.py modes. For most calls, POST is needed, so that input
structures can be easily provided. If processing resulted in temporary
new structures, their number is recorded in @new_structures.

Example curl calls:

```
$ curl -k "https://proc1.cjvt.si/structures/strings_to_parse?string=velika%20miza"
$ curl -k -X POST -F strings=@/tmp/strings.txt https://proc1.cjvt.si/structures/strings_to_parse
$ curl -k -X POST -F parsed=@/tmp/parse.xml -F structures=@/tmp/structures.xml https://proc1.cjvt.si/structures/parse_to_dictionary
$ curl -k -X POST -F strings=@/tmp/strings.txt -F structures=@/tmp/structures.xml https://proc1.cjvt.si/structures/strings_to_dictionary
```

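The GET variant of strings_to_parse takes the string as a URL-encoded query parameter. The following sketch builds such a URL with the Python standard library only. The helper name `build_get_url` is hypothetical; the host and the `string` parameter are taken from the curl examples, and the sketch deliberately stops short of sending the request.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://proc1.cjvt.si/structures"  # host as used in the curl examples


def build_get_url(endpoint, string):
    """Return the GET URL for a single input string.

    quote_via=quote encodes spaces as %20 (as in the curl example)
    rather than as '+'.
    """
    query = urlencode({"string": string}, quote_via=quote)
    return f"{BASE_URL}/{endpoint}?{query}"


if __name__ == "__main__":
    print(build_get_url("strings_to_parse", "velika miza"))
```

The file-based endpoints need a multipart POST, which is easiest to send with curl as shown above or with an HTTP client library of your choice.
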
## Note

Note that any new structures generated are given temporary ids
(@tempId), because they only get real ids once they are approved and
added to the DDD database. That is normally done via the Django
import_structures.py script in the [ddd_core
repository](https://gitea.cjvt.si/ddd/ddd_core), which replaces the
temporary ids in the structure specifications and updates the ids in
the dictionary file.

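Before handing the outputs over for import, it can be useful to list which temporary ids a file contains. The sketch below collects every tempId attribute in an XML document. Only the attribute name comes from the note above: the helper name `find_temp_ids` and the element names in the sample are hypothetical, and the real structure specification format may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: element names and id values are made up for illustration.
SAMPLE = """<structures>
  <structure id="1"/>
  <structure tempId="t1"/>
  <structure tempId="t2"/>
</structures>"""


def find_temp_ids(xml_text):
    """Return the values of all tempId attributes, in document order."""
    root = ET.fromstring(xml_text)
    return [el.get("tempId") for el in root.iter() if el.get("tempId") is not None]


if __name__ == "__main__":
    print(find_temp_ids(SAMPLE))
```

An empty result means the file contains no provisional structures awaiting approval.
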