# Structure assignment pipeline

Pipeline for parsing a list of arbitrary Slovene strings and assigning
each to a syntactic structure in the DDD database, generating
provisional new structures if necessary.

## Installation

Installation requires the [CLASSLA](https://github.com/clarinsi/classla) standard_jos models, as
well as (for now) the wani.py script from
[luscenje_struktur](https://gitea.cjvt.si/ozbolt/luscenje_struktur):

    pip install .
    python -c "import classla; classla.download('sl', dir='resources/classla', type='standard_jos')"
    curl -o resources/wani.py https://gitea.cjvt.si/ozbolt/luscenje_struktur/raw/branch/master/wani.py

The classla directory and wani.py file do not necessarily need to be placed under resources/, but
the wrapper script scripts/process.py assumes that they are.
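
Because the wrapper script expects both resources under resources/, a quick check before the first run can save a confusing failure later. The following is a minimal sketch and not part of the package: the helper name `missing_resources` is hypothetical, and it only checks the two paths named above.

```python
from pathlib import Path

# Paths that scripts/process.py is described as expecting, relative to the
# repository root. The helper itself is illustrative, not part of the package.
EXPECTED = ("resources/classla", "resources/wani.py")


def missing_resources(repo_root="."):
    """Return the expected resource paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]
```

An empty list means both resources are where the wrapper script looks for them.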

## Usage

The main script is scripts/process.py. It has several modes
(selected via the -mode parameter), depending on whether you want to
run the whole pipeline from start to finish (daring!) or in stages,
with manual correction of the parse in between. XML validation is also
provided separately.


### strings_to_parse

The input is a file of Slovene strings (one string per line). The
script runs the [Python obeliks
tokeniser](https://pypi.org/project/obeliks/) on the input, tweaks the
CoNLL-U output a little, and runs JOS-configured
[classla](https://pypi.org/project/classla/) to parse it. It
then translates the JOS tags (MSDs and dependencies) from English to
Slovene and converts the output to TEI XML. Example:

```
$ python process.py -mode strings_to_parse -infile /tmp/strings.txt -outfile /tmp/parsed.xml
```
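
If the strings come from another program rather than a hand-edited file, the one-string-per-line input can be written with a few lines of Python. This is only a sketch with a hypothetical helper name. Collapsing internal whitespace and skipping empty entries are assumptions about what makes a sensible input line, not documented requirements of the script.

```python
def write_strings_file(strings, path):
    """Write one string per line, UTF-8 encoded, and return the number written."""
    written = 0
    with open(path, "w", encoding="utf-8") as handle:
        for string in strings:
            # Collapse whitespace (including embedded newlines) so that each
            # string occupies exactly one line.
            line = " ".join(string.split())
            if line:
                handle.write(line + "\n")
                written += 1
    return written
```

The resulting file can then be passed to process.py via -infile.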

### parse_to_dictionary

The input should be a TEI XML file (in the same format as
the output of strings_to_parse) and an XML file of structure
specifications. The script first uses the MWE extraction script
[wani.py](https://gitea.cjvt.si/ozbolt/luscenje_struktur) to find and
assign all matches for collocation structures. For units without such
matches, it then finds (creating, if necessary) and assigns
single-component or other structures. Finally, the TEI is converted to
CJVT dictionary XML format. Example:

```
$ python process.py -mode parse_to_dictionary -infile /tmp/parsed.xml -instructs /tmp/structures_old.xml -outfile /tmp/dictionary.xml -outstructs /tmp/structures_new.xml
```
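
The two-stage assignment described above (collocation matches first, then a find-or-create fallback) can be illustrated with plain dictionaries. This is a conceptual sketch only, not the package's implementation: the function name, the notion of a unit "signature" and the `temp1`, `temp2`, ... id format are all hypothetical stand-ins for what wani.py and the structure specifications actually provide.

```python
from itertools import count


def assign_structures(unit_ids, collocation_matches, signatures, known_structures):
    """Assign a structure id to every unit.

    unit_ids:            iterable of unit identifiers
    collocation_matches: unit id -> structure id (stands in for wani.py matches)
    signatures:          unit id -> hashable description of the unit (hypothetical)
    known_structures:    signature -> existing structure id
    Returns (assignments, new_structures), where new_structures maps each
    unseen signature to a freshly generated temporary id.
    """
    assignments = {}
    new_structures = {}
    temp_numbers = count(1)
    for unit_id in unit_ids:
        # Stage 1: a collocation match takes precedence.
        if unit_id in collocation_matches:
            assignments[unit_id] = collocation_matches[unit_id]
            continue
        # Stage 2: reuse a known structure, or create a provisional one.
        signature = signatures[unit_id]
        if signature in known_structures:
            assignments[unit_id] = known_structures[signature]
        else:
            if signature not in new_structures:
                new_structures[signature] = f"temp{next(temp_numbers)}"
            assignments[unit_id] = new_structures[signature]
    return assignments, new_structures
```

Units that share an unseen signature receive the same provisional structure, so each new structure is created only once.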

### strings_to_dictionary

Combines strings_to_parse and parse_to_dictionary into one call
(whereby you forfeit the chance to fix potential parse errors in
between). Example:

```
$ python process.py -mode strings_to_dictionary -infile /tmp/strings.txt -instructs /tmp/structures_old.xml -outfile /tmp/dictionary.xml -outstructs /tmp/structures_new.xml
```

### all

Same as strings_to_dictionary, but also validates the dictionary and
structures outputs, just in case.

```
$ python process.py -mode all -infile /tmp/strings.txt -instructs /tmp/structures_old.xml -outfile /tmp/dictionary.xml -outstructs /tmp/structures_new.xml
```
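
When the pipeline is driven from another Python program, the command lines shown in the examples can be assembled programmatically. The sketch below uses only the flags that appear in those examples. The helper names are hypothetical, the default script path assumes the call is made from the repository root, and the set of modes that take structure files is inferred from the examples rather than from the script itself.

```python
import subprocess
import sys

# Modes whose example calls include -instructs and -outstructs.
MODES_WITH_STRUCTURES = {"parse_to_dictionary", "strings_to_dictionary", "all"}


def build_command(mode, infile, outfile, instructs=None, outstructs=None,
                  script="scripts/process.py"):
    """Build the argument list for one process.py call."""
    needs_structures = mode in MODES_WITH_STRUCTURES
    if needs_structures and (instructs is None or outstructs is None):
        raise ValueError(f"mode {mode!r} needs instructs and outstructs")
    command = [sys.executable, script, "-mode", mode, "-infile", infile]
    if needs_structures:
        command += ["-instructs", instructs]
    command += ["-outfile", outfile]
    if needs_structures:
        command += ["-outstructs", outstructs]
    return command


def run(mode, **paths):
    """Run process.py and raise CalledProcessError on a non-zero exit."""
    subprocess.run(build_command(mode, **paths), check=True)
```

For instance, `run("strings_to_parse", infile="/tmp/strings.txt", outfile="/tmp/parsed.xml")` would correspond to the first example in this section.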

## REST API

The package provides a REST API with endpoints roughly mirroring the
process.py modes. For most calls, POST is needed, so that input
structures can be easily provided. If processing resulted in temporary
new structures, their number is recorded in @new_structures.

Example curl calls:

```
$ curl -k https://proc1.cjvt.si/structures/strings_to_parse?string=velika%20miza
$ curl -k -X POST -F strings=@/tmp/strings.txt https://proc1.cjvt.si/structures/strings_to_parse
$ curl -k -X POST -F parsed=@/tmp/parse.xml -F structures=@/tmp/structures.xml https://proc1.cjvt.si/structures/parse_to_dictionary
$ curl -k -X POST -F strings=@/tmp/strings.txt -F structures=@/tmp/structures.xml https://proc1.cjvt.si/structures/strings_to_dictionary
```
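
The single-string GET call can also be made from Python using only the standard library. This is a rough sketch: the helper names are hypothetical, disabling certificate verification mirrors the -k flag of the curl examples, and decoding the response as UTF-8 is an assumption.

```python
import ssl
import urllib.parse
import urllib.request

BASE_URL = "https://proc1.cjvt.si/structures"


def single_string_url(string, endpoint="strings_to_parse"):
    """Build the GET URL for one string, percent-encoding spaces as %20."""
    query = urllib.parse.urlencode({"string": string},
                                   quote_via=urllib.parse.quote)
    return f"{BASE_URL}/{endpoint}?{query}"


def fetch(url):
    """Fetch a URL without certificate verification, like curl -k."""
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(url, context=context) as response:
        # Assumption: the service responds with UTF-8 encoded XML.
        return response.read().decode("utf-8")
```

Calling `fetch(single_string_url("velika miza"))` requests the same URL as the first curl example.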

## Note

Any new structures generated are given temporary ids
(@tempId), because they only get real ids once they are approved and
added to the DDD database. That is normally done via the Django
import_structures.py script in the [ddd_core
repository](https://gitea.cjvt.si/ddd/ddd_core), which replaces the
temporary ids in the structure specifications and updates the ids in
the dictionary file.
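
To make the id replacement concrete, here is a small illustration of swapping temporary ids for real ones in a structure specification. It is not the import_structures.py logic. It assumes that approved structures carry their id in an `id` attribute, that the temp-to-real mapping is already known, and it leaves the dictionary file untouched; only the tempId attribute name comes from the text above.

```python
import xml.etree.ElementTree as ET


def apply_real_ids(structures_xml, id_map):
    """Replace tempId attributes with id attributes, using a temp -> real map.

    Elements whose temporary id is not in id_map are left unchanged.
    Returns the rewritten XML as a string.
    """
    root = ET.fromstring(structures_xml)
    for element in root.iter():
        temp_id = element.get("tempId")
        if temp_id in id_map:
            del element.attrib["tempId"]
            # Assumption: real ids live in an attribute called "id".
            element.set("id", id_map[temp_id])
    return ET.tostring(root, encoding="unicode")
```

Structures whose temporary id is missing from the mapping keep their @tempId, so a partial approval does not lose information.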