Redmine #1835: Rewrote readme and minor renaming

Cyprian Laskowski 2021-03-25 16:27:40 +01:00
parent 19e9eb6341
commit 37a10ae703
3 changed files with 90 additions and 50 deletions

README.md

@@ -2,65 +2,106 @@
Pipeline for parsing a list of arbitrary Slovene strings and assigning each to a syntactic structure in the DDD database, generating provisional new structures if necessary.
## Setup
Most of the scripts come from other repositories and python libraries. Run the set-up script:
```
$ scripts/setup.sh
```
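Under the hood this presumably boils down to installing the PyPI packages referenced below and fetching the MWE extraction repository; a rough by-hand equivalent (an assumption about what setup.sh does, not a substitute for it) would be:
```
$ pip install obeliks classla
$ git clone https://gitea.cjvt.si/ozbolt/luscenje_struktur
```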
## Usage
The main script is scripts/process.py. There are several modes (selected via the -mode parameter), depending on whether you want to run the whole pipeline from start to finish (daring!) or intervene manually (to correct the parse) in between. XML validation is also provided separately.
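The two validation modes take only an input file; for instance (the file paths are just illustrative, matching the examples below):
```
$ python scripts/process.py -mode validate_structures -infile /tmp/structures_new.xml
$ python scripts/process.py -mode validate_dictionary -infile /tmp/dictionary.xml
```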
### strings_to_parse
The input is a file of Slovene strings (one string per line). The script runs the [python obeliks tokeniser](https://pypi.org/project/obeliks/) on the input, tweaks the conllu output a little, and runs JOS-configured [classla](https://pypi.org/project/classla/) to parse the output. It then translates the JOS tags (msds and dependencies) from English to Slovene and converts the output to TEI XML. Example:
```
$ python scripts/process.py -mode strings_to_parse -infile /tmp/strings.txt -outfile /tmp/parsed.xml
```
### parse_to_dictionary
The input should be a TEI XML file (in the same particular format as the output of strings_to_parse) and an xml file of structure specifications (the CJVT structures.xml file, supplemented with temporary new structures, if needed). The script first splits the TEI file into two files, one with the single-component units and the other with the multiple-component units. It then assigns each unit to a syntactic structure from the DDD database and converts the output into CJVT dictionary XML format. For the single-component units this is pretty trivial, but for multiple-component units it is more involved, and includes two runs of the MWE extraction script [wani.py](https://gitea.cjvt.si/ozbolt/luscenje_struktur), generating missing structures in between. At the end, the single-component and multiple-component dictionary files are merged into one dictionary file. Example:
```
$ python scripts/process.py -mode parse_to_dictionary -infile /tmp/parsed.xml -outfile /tmp/dictionary.xml -structures /tmp/structures_new.xml
```
### strings_to_dictionary
Combines strings_to_parse and parse_to_dictionary into one call
(whereby you forfeit the chance to fix potential parse errors in
between). Example:
```
$ python scripts/process.py -mode strings_to_dictionary -infile /tmp/strings.txt -outfile /tmp/dictionary.xml -structures /tmp/structures_new.xml
```
### all
Same as strings_to_dictionary, but also validates the dictionary and
structures outputs, just in case.
```
$ python scripts/process.py -mode all -infile /tmp/strings.txt -outfile /tmp/dictionary.xml -structures /tmp/structures_new.xml
```
## REST API
The package provides a REST API with endpoints roughly mirroring the
process.py modes. For the calls accepting strings as input, GET calls
are also supported for single-string input. For the calls resulting in
dictionaries, the results include both the dictionary entries and the
structures they use. If processing resulted in temporary new
structures, their number is recorded in @new_structures.
Example curl calls
(assuming you deploy at https://www.example.com/structures/):
```
$ curl https://www.example.com/structures/strings_to_parse?string=velika%20miza
$ curl -X POST -F file=@/tmp/strings.txt https://www.example.com/structures/strings_to_parse
$ curl -X POST -F file=@/tmp/parsed.xml https://www.example.com/structures/parse_to_dictionary
$ curl https://www.example.com/structures/strings_to_dictionary?string=velika%20miza
$ curl -X POST -F file=@/tmp/strings.txt https://www.example.com/structures/strings_to_dictionary
```
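To check for provisional structures programmatically, you can pull @new_structures out of a response. A minimal sketch, assuming xmllint is available and that the attribute sits on the root element of the result (an assumption, not taken from the schema):
```
$ curl -s "https://www.example.com/structures/strings_to_dictionary?string=velika%20miza" \
    | xmllint --xpath 'string(/*/@new_structures)' -
```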
## Note
Note that any new structures generated are given temporary ids (@tempId), because they only get real ids once they are approved and added to the DDD database. That is normally done via the Django import_structures.py script in the [ddd_core repository](https://gitea.cjvt.si/ddd/ddd_core), as part of a batch of DDD updates. That script replaces the temporary ids in the structure specifications and updates the ids in the dictionary file.


@@ -15,8 +15,8 @@ resource_directory = os.environ['API_RESOURCE_DIR']
 runner = Runner(resource_directory, True)
-@app.route(api_prefix + '/string_to_parse', methods=['GET', 'POST'])
-def string_to_parse():
+@app.route(api_prefix + '/strings_to_parse', methods=['GET', 'POST'])
+def strings_to_parse():
     tmp_directory = tempfile.mkdtemp()
     string_file_name = tmp_directory + '/input_string.txt'
@@ -77,8 +77,8 @@ def parse_to_dictionary():
     return Response(message, mimetype='text/xml')
-@app.route(api_prefix + '/string_to_dictionary', methods=['GET', 'POST'])
-def string_to_dictionary():
+@app.route(api_prefix + '/strings_to_dictionary', methods=['GET', 'POST'])
+def strings_to_dictionary():
     tmp_directory = tempfile.mkdtemp()
     string_file_name = tmp_directory + '/input_string.txt'
@@ -116,4 +116,3 @@ def string_to_dictionary():
         message = '<error>' + str(e) + '</error>'
     return Response(message, mimetype='text/xml')


@@ -7,28 +7,28 @@ resource_directory = '../resources'
 if (__name__ == '__main__'):
     arg_parser = argparse.ArgumentParser(description='Run part or all of structure pipeline.')
-    arg_parser.add_argument('-part', type=str, help='Part name')
+    arg_parser.add_argument('-mode', type=str, help='Mode')
     arg_parser.add_argument('-infile', type=str, help='Input file')
     arg_parser.add_argument('-outfile', type=str, help='Output file')
     arg_parser.add_argument('-structures', type=str, help='Updated structure file')
     arguments = arg_parser.parse_args()
-    part_name = arguments.part
+    mode = arguments.mode
     input_file_name = arguments.infile
     output_file_name = arguments.outfile
     structure_file_name = arguments.structures
-    nlp_needed = part_name in {'strings_to_parse', 'strings_to_dictionary', 'all'}
+    nlp_needed = mode in {'strings_to_parse', 'strings_to_dictionary', 'all'}
     runner = Runner(resource_directory, nlp_needed)
-    if (part_name == 'strings_to_parse'):
+    if (mode == 'strings_to_parse'):
         runner.strings_to_parse(input_file_name, output_file_name)
-    elif (part_name == 'strings_to_dictionary'):
+    elif (mode == 'strings_to_dictionary'):
         runner.strings_to_dictionary(input_file_name, output_file_name, structure_file_name)
-    elif (part_name == 'parse_to_dictionary'):
+    elif (mode == 'parse_to_dictionary'):
         runner.parse_to_dictionary(input_file_name, output_file_name, structure_file_name)
-    elif (part_name == 'validate_structures'):
+    elif (mode == 'validate_structures'):
         runner.validate_structures(input_file_name)
-    elif (part_name == 'validate_dictionary'):
+    elif (mode == 'validate_dictionary'):
         runner.validate_dictionary(input_file_name)
-    elif (part_name == 'all'):
+    elif (mode == 'all'):
         runner.run_all(input_file_name, output_file_name, structure_file_name)