# Structure assignment pipeline

Pipeline for parsing a list of arbitrary Slovene strings and assigning
each to a syntactic structure in the DDD database, generating
provisional new structures if necessary.

## Installation

Installation requires the [CLASSLA](https://github.com/clarinsi/classla) standard_jos models, as
well as (for now) the wani.py script from
[luscenje_struktur](https://gitea.cjvt.si/ozbolt/luscenje_struktur):

    pip install .
    python -c "import classla; classla.download('sl', dir='resources/classla', type='standard_jos')"
    curl -o resources/wani.py https://gitea.cjvt.si/ozbolt/luscenje_struktur/raw/branch/master/wani.py

The classla directory and wani.py file do not necessarily need to be placed under resources/, but
the wrapper script scripts/process.py assumes that they are.
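
Before running the wrapper script, it can help to confirm that both resources are where it expects them. The following is a minimal sketch, assuming the default resources/ layout from the commands above; the helper name `missing_resources` is made up for illustration and is not part of the package.

```python
from pathlib import Path


def missing_resources(resources_dir="resources"):
    """Return the expected resource paths that do not exist yet.

    Assumes the layout used in the installation commands above:
    a classla/ model directory and a wani.py file under resources_dir.
    """
    root = Path(resources_dir)
    expected = [root / "classla", root / "wani.py"]
    return [str(path) for path in expected if not path.exists()]
```

An empty list means both resources were found; otherwise the list names what still needs to be downloaded.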

## Usage

The main script is scripts/process.py. There are several modes
(selected via the -mode parameter), depending on whether you want to
run the whole pipeline from start to finish (daring!) or with manual
intervention (on the parse) in between. XML validation is also
provided separately.

### strings_to_parse

The input is a file of Slovene strings (one string per line). The
script runs the [Python obeliks
tokeniser](https://pypi.org/project/obeliks/) on the input, tweaks the
CoNLL-U output a little, and runs JOS-configured
[classla](https://pypi.org/project/classla/) to parse the output. It
then translates the JOS tags (MSDs and dependencies) from English to
Slovene and converts the output to TEI XML. Example:

```
$ python process.py -mode strings_to_parse -infile /tmp/strings.txt -outfile /tmp/parsed.xml
```
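
Since the input must contain exactly one string per line, it may be worth normalising the strings before writing the file. This is a small sketch of one way to do that; the helper `write_strings_file` is hypothetical, and collapsing internal whitespace is an assumption about what is acceptable for your data.

```python
from pathlib import Path


def write_strings_file(strings, path):
    """Write one string per line and return the number of lines written.

    Internal whitespace (including stray newlines) is collapsed to single
    spaces so that each string stays on its own line; empty entries are skipped.
    """
    cleaned = []
    for string in strings:
        normalised = " ".join(string.split())
        if normalised:
            cleaned.append(normalised)
    Path(path).write_text("".join(line + "\n" for line in cleaned), encoding="utf-8")
    return len(cleaned)
```

The resulting file can then be passed as -infile to the strings_to_parse mode.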

### parse_to_dictionary

The input should be a TEI XML file (in the same particular format as
the output of strings_to_parse) and an XML file of structure
specifications. The script first uses the MWE extraction script
[wani.py](https://gitea.cjvt.si/ozbolt/luscenje_struktur) to find and
assign all matches for collocation structures. For units without such
matches, it then finds (creating, if necessary) and assigns
single-component or other structures. Finally, the TEI is converted to
CJVT dictionary XML format. Example:

```
$ python process.py -mode parse_to_dictionary -infile /tmp/parsed.xml -instructs /tmp/structures_old.xml -outfile /tmp/dictionary.xml -outstructs /tmp/structures_new.xml
```
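
The order of assignment described above (collocation matches first, then a fallback that creates a structure if none exists) can be illustrated with a toy sketch. This is not the package's implementation: the function, the "signature" keys, and the `temp-N` id format are all invented here purely to show the control flow.

```python
def assign_structures(units, collocation_matches, known_structures):
    """Toy illustration of the two-stage assignment order.

    units: dict of unit id -> signature (a hypothetical, hashable
        description of the unit's parse).
    collocation_matches: dict of unit id -> structure id, standing in for
        the matches found by the MWE extraction step.
    known_structures: dict of signature -> structure id; extended in place
        when a provisional structure has to be created.
    Returns (assignments, new_ids).
    """
    assignments = {}
    new_ids = []
    for unit_id, signature in units.items():
        # Stage 1: a collocation match always wins.
        if unit_id in collocation_matches:
            assignments[unit_id] = collocation_matches[unit_id]
            continue
        # Stage 2: fall back to an existing structure, creating a
        # provisional one (with a made-up temporary id) if necessary.
        if signature not in known_structures:
            temp_id = f"temp-{len(new_ids) + 1}"
            known_structures[signature] = temp_id
            new_ids.append(temp_id)
        assignments[unit_id] = known_structures[signature]
    return assignments, new_ids
```

Units that share a signature end up with the same provisional structure, so a new structure is created only once per distinct signature.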

### strings_to_dictionary

Combines strings_to_parse and parse_to_dictionary into one call
(whereby you forfeit the chance to fix potential parse errors in
between). Example:

```
$ python process.py -mode strings_to_dictionary -infile /tmp/strings.txt -instructs /tmp/structures_old.xml -outfile /tmp/dictionary.xml -outstructs /tmp/structures_new.xml
```

### all

Same as strings_to_dictionary, but also validates the dictionary and
structures outputs, just in case.

```
$ python process.py -mode all -infile /tmp/strings.txt -instructs /tmp/structures_old.xml -outfile /tmp/dictionary.xml -outstructs /tmp/structures_new.xml
```
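
When calling the script from other Python code, the command lines above can be assembled programmatically. The sketch below only mirrors the flags shown in the examples (-mode, -infile, -outfile, -instructs, -outstructs); the helper `build_command` and the default script path are assumptions, and the script may accept options not covered here.

```python
import sys

# Modes that, per the examples above, also take structure files.
STRUCTURE_MODES = {"parse_to_dictionary", "strings_to_dictionary", "all"}
ALL_MODES = STRUCTURE_MODES | {"strings_to_parse"}


def build_command(mode, infile, outfile, instructs=None, outstructs=None,
                  script="scripts/process.py"):
    """Build the argument list for one process.py call."""
    if mode not in ALL_MODES:
        raise ValueError(f"unknown mode: {mode}")
    command = [sys.executable, script, "-mode", mode,
               "-infile", infile, "-outfile", outfile]
    if mode in STRUCTURE_MODES:
        if instructs is None or outstructs is None:
            raise ValueError(f"mode {mode} needs instructs and outstructs")
        command += ["-instructs", instructs, "-outstructs", outstructs]
    return command
```

The returned list can be passed to `subprocess.run(command, check=True)` to execute the call.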

## REST API

The package provides a REST API with endpoints roughly mirroring the
process.py modes. For most calls, POST is needed, so that input
structures can be easily provided. If processing resulted in temporary
new structures, their number is recorded in @new_structures.

Example curl calls:

```
$ curl -k https://proc1.cjvt.si/structures/strings_to_parse?string=velika%20miza
$ curl -k -X POST -F strings=@/tmp/strings.txt https://proc1.cjvt.si/structures/strings_to_parse
$ curl -k -X POST -F parsed=@/tmp/parse.xml -F structures=@/tmp/structures.xml https://proc1.cjvt.si/structures/parse_to_dictionary
$ curl -k -X POST -F strings=@/tmp/strings.txt -F structures=@/tmp/structures.xml https://proc1.cjvt.si/structures/strings_to_dictionary
```
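
For the single-string GET call, the query string has to be percent-encoded, as in the first curl example. A minimal sketch using only the standard library follows; the helper name `strings_to_parse_url` is made up, and the sketch only builds the URL without sending a request.

```python
from urllib.parse import quote, urlencode

# Base URL taken from the curl examples above.
BASE_URL = "https://proc1.cjvt.si/structures"


def strings_to_parse_url(string, base_url=BASE_URL):
    """Build the GET URL for parsing a single string.

    quote_via=quote encodes spaces as %20 (rather than '+'),
    matching the form used in the curl example.
    """
    query = urlencode({"string": string}, quote_via=quote)
    return f"{base_url}/strings_to_parse?{query}"
```

The POST calls upload files as multipart form data, which is easier to do with curl as shown above or with an HTTP client library.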

## Note

Any new structures generated are given temporary ids
(@tempId), because they only get real ids once they are approved and
added to the DDD database. That is normally done via the Django
import_structures.py script in the [ddd_core
repository](https://gitea.cjvt.si/ddd/ddd_core), which replaces the
temporary ids in the structure specifications and updates the ids in
the dictionary file.
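
To see which structures in an output file are still provisional, one option is to collect the temporary ids. The sketch below assumes that @tempId is an XML attribute named `tempId` and makes no assumption about which element carries it; the element names in the usage example are hypothetical.

```python
import xml.etree.ElementTree as ET


def temporary_ids(structures_xml):
    """Return all tempId attribute values in document order.

    structures_xml: the XML content as a string. Any element carrying a
    tempId attribute is reported, whatever its tag name.
    """
    root = ET.fromstring(structures_xml)
    return [element.get("tempId") for element in root.iter()
            if element.get("tempId") is not None]
```

For example, calling it on `<structures><structure id="1"/><structure tempId="t1"/></structures>` yields a list containing only `t1`.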