# Structure assignment pipeline

Pipeline for parsing a list of arbitrary Slovene strings and assigning
each to a syntactic structure in the DDD database, generating
provisional new structures if necessary.

## Setup

Most of the scripts come from other repositories and Python libraries.
Run the set-up script:

```
$ scripts/setup.sh
```

## Usage

The main script is scripts/process.py. It has several modes
(selected via the -mode parameter), depending on whether you want to
run the whole pipeline from start to finish (daring!) or with manual
intervention (of the parse) in between. XML validation is also
provided separately.

### strings_to_parse

The input is a file of Slovene strings (one string per line). The
script runs the [Python obeliks
tokeniser](https://pypi.org/project/obeliks/) on the input, tweaks the
CoNLL-U output a little, and runs JOS-configured
[classla](https://pypi.org/project/classla/) to parse the output. It
then translates the JOS tags (MSDs and dependencies) from English to
Slovene and converts the output to TEI XML. Example:

```
$ python scripts/process.py -mode strings_to_parse -infile /tmp/strings.txt -outfile /tmp/parsed.xml
```
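
For scripted runs, the input file and the call above could be prepared from Python. This is only a sketch: the helpers `write_strings` and `build_process_command` are hypothetical and not part of this repository, and the file names are illustrative. The flags are the ones used in the example above.

```python
import shlex
import sys
import tempfile
from pathlib import Path


def write_strings(path, strings):
    """Write one string per line, the input format strings_to_parse expects."""
    Path(path).write_text("\n".join(strings) + "\n", encoding="utf-8")


def build_process_command(mode, infile, outfile):
    """Assemble the process.py argument list (hypothetical helper)."""
    return [sys.executable, "scripts/process.py",
            "-mode", mode, "-infile", str(infile), "-outfile", str(outfile)]


# Demo in a throwaway directory; it only prints the command, it does not run it.
with tempfile.TemporaryDirectory() as tmp:
    infile = Path(tmp) / "strings.txt"
    write_strings(infile, ["velika miza"])
    command = build_process_command("strings_to_parse", infile, Path(tmp) / "parsed.xml")
    print(shlex.join(command))
```

Passing the resulting list to `subprocess.run(command, check=True)` from the repository root would then execute the step.
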
### parse_to_dictionary

The input should be a TEI XML file (in the same particular format as
the output of strings_to_parse) and an XML file of structure
specifications. The script first splits the TEI file into two files,
one with the single-component units and the other with the
multiple-component units. For each file, it then assigns each unit to
a syntactic structure from the DDD database and converts the output
into CJVT dictionary XML format. For the single-component units, this
is pretty trivial, but for multiple-component units it is more
involved, and includes two runs of the MWE extraction script
[wani.py](https://gitea.cjvt.si/ozbolt/luscenje_struktur), generating
missing structures in between. At the end, the single-component and
multiple-component dictionary files are merged into one dictionary
file. Example:

```
$ python scripts/process.py -mode parse_to_dictionary -infile /tmp/parsed.xml -instructs /tmp/structures_old.xml -outfile /tmp/dictionary.xml -structures /tmp/structures_new.xml
```
### strings_to_dictionary

Combines strings_to_parse and parse_to_dictionary into one call
(whereby you forfeit the chance to fix potential parse errors in
between). Example:

```
$ python scripts/process.py -mode strings_to_dictionary -infile /tmp/strings.txt -instructs /tmp/structures_old.xml -outfile /tmp/dictionary.xml -outstructs /tmp/structures_new.xml
```
### all

Same as strings_to_dictionary, but also validates the dictionary and
structures outputs, just in case.

```
$ python scripts/process.py -mode all -infile /tmp/strings.txt -instructs /tmp/structures_old.xml -outfile /tmp/dictionary.xml -outstructs /tmp/structures_new.xml
```
## REST API

The package provides a REST API with endpoints roughly mirroring the
process.py modes. Most calls need POST, so that input structures can
be easily provided. If processing resulted in temporary new
structures, their number is recorded in @new_structures.

Example curl calls:

```
$ curl -k "https://proc1.cjvt.si/structures/strings_to_parse?string=velika%20miza"
$ curl -k -X POST -F strings=@/tmp/strings.txt https://proc1.cjvt.si/structures/strings_to_parse
$ curl -k -X POST -F parsed=@/tmp/parse.xml -F structures=@/tmp/structures.xml https://proc1.cjvt.si/structures/parse_to_dictionary
$ curl -k -X POST -F strings=@/tmp/strings.txt -F structures=@/tmp/structures.xml https://proc1.cjvt.si/structures/strings_to_dictionary
```
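
The single-string GET call needs the string percent-encoded, as in the first curl example. A minimal Python sketch of building that URL follows; the helper name `build_get_url` is hypothetical, and the base URL and the `string` parameter are taken from the curl calls above.

```python
from urllib.parse import quote, urlencode

# Base URL as it appears in the curl examples.
BASE_URL = "https://proc1.cjvt.si/structures"


def build_get_url(endpoint, string):
    """Return the GET URL for a single input string (hypothetical helper)."""
    # quote_via=quote encodes spaces as %20, matching the curl example;
    # urlencode's default (quote_plus) would encode them as '+'.
    query = urlencode({"string": string}, quote_via=quote)
    return f"{BASE_URL}/{endpoint}?{query}"


print(build_get_url("strings_to_parse", "velika miza"))
```

Non-ASCII Slovene characters are encoded as UTF-8 bytes in the same way, so strings containing č, š or ž need no special handling.
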
## Note

Any new structures generated are given temporary ids (@tempId),
because they only get real ids once they are approved and added to the
DDD database. That is normally done via the Django
import_structures.py script in the [ddd_core
repository](https://gitea.cjvt.si/ddd/ddd_core), as part of a batch of
DDD updates. That script replaces the temporary ids in the structure
specifications and updates the ids in the dictionary file.
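
Before handing an output over for import, it can help to see which structures still carry temporary ids. The sketch below assumes that @tempId is an XML attribute; the element names and the sample document are hypothetical, and the real structure specification schema may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, for illustration only.
SAMPLE = """<structures>
  <structure id="1"/>
  <structure tempId="t1"/>
  <structure tempId="t2"/>
</structures>"""


def find_temp_ids(xml_text):
    """Collect the tempId attribute values found anywhere in the document."""
    root = ET.fromstring(xml_text)
    return [el.attrib["tempId"] for el in root.iter() if "tempId" in el.attrib]


print(find_temp_ids(SAMPLE))
```

Because it scans every element for the attribute, the helper does not depend on where in the tree the structures sit.
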