Redmine #1835: Rewrote readme and minor renaming
parent
19e9eb6341
commit
37a10ae703
113
README.md
|
@ -2,65 +2,106 @@
|
||||||
|
|
||||||
Pipeline for parsing a list of arbitrary Slovene strings and assigning
|
Pipeline for parsing a list of arbitrary Slovene strings and assigning
|
||||||
each to a syntactic structure in the DDD database, generating
|
each to a syntactic structure in the DDD database, generating
|
||||||
provisional new structures if necessary. The pipeline consists of two
|
provisional new structures if necessary.
|
||||||
separate wrapper python3 scripts, pipeline1.py and pipeline2.py, in
|
|
||||||
order to facilitate manual corrections of the parsed units in between
|
|
||||||
(e.g., with qcat).
|
|
||||||
|
|
||||||
## Setup
|
## Setup
|
||||||
|
|
||||||
The scripts are mostly wrappers, in that they largely use generic
|
Most of the scripts come from other repositories and python libraries.
|
||||||
scripts and resources from other git repositories, as well as python
|
Run the set-up script:
|
||||||
libraries. These can be set up with a special script:
|
|
||||||
|
|
||||||
```
|
```
|
||||||
$ bin/setup.sh
|
$ scripts/setup.sh
|
||||||
```
|
```
|
||||||
|
|
||||||
## pipeline1.py
|
## Usage
|
||||||
|
|
||||||
The pipeline1.py script expects as input a file of Slovene strings,
|
The main script is scripts/process.py. There are several modes (
|
||||||
one string per line. It then runs the [python obeliks
|
identified via the -mode parameter), depending on whether you want to
|
||||||
|
run the whole pipeline from start to finish (daring!), or with manual
|
||||||
|
intervention (of the parse) in between. XML validation is also
|
||||||
|
provided separately.
|
||||||
|
|
||||||
|
|
||||||
|
### strings_to_parse
|
||||||
|
|
||||||
|
The input is a file of Slovene strings (one string per line). The
|
||||||
|
script runs the [python obeliks
|
||||||
tokeniser](https://pypi.org/project/obeliks/) on the input, tweaks the
|
tokeniser](https://pypi.org/project/obeliks/) on the input, tweaks the
|
||||||
conllu output a little, and runs JOS-configured
|
conllu output a little, and runs JOS-configured
|
||||||
[classla](https://pypi.org/project/classla/) to parse the output. It
|
[classla](https://pypi.org/project/classla/) to parse the output. It
|
||||||
then translates the JOS tags (msds and dependencies) from English to
|
then translates the JOS tags (msds and dependencies) from English to
|
||||||
Slovene and converts the output to TEI XML.
|
Slovene and converts the output to TEI XML. Example:
|
||||||
|
|
||||||
Example usage:
|
|
||||||
|
|
||||||
```
|
```
|
||||||
$ python scripts/pipeline1.py -inlist strings.txt -outtei parsed.xml
|
$ python scripts/process.py -mode strings_to_parse -infile /tmp/strings.txt -outfile /tmp/parsed.xml
|
||||||
```
|
```
|
||||||
|
|
||||||
## pipeline2.py
|
### parse_to_dictionary
|
||||||
|
|
||||||
The pipeline2.py script expects as input a TEI XML file (in the same
|
The input should be a TEI XML file (in the same particular format as
|
||||||
particular format as the output of pipeline1.py) and an xml file of
|
the output of strings_to_parse) and an xml file of structure
|
||||||
structure specifications (normally the up-to-date CJVT structures.xml
|
specifications (the CJVT structures.xml file, supplemented with
|
||||||
file). It first splits the TEI file into two files, one with the
|
temporary new structures, if needed). It first splits the TEI file
|
||||||
single-component units and the other with the multiple-component
|
into two files, one with the single-component units and the other with
|
||||||
units. For each, it then assigns each unit to a syntactic structure
|
the multiple-component units. For each, it then assigns each unit to a
|
||||||
from the DDD database and converts the output into CJVT dictionary XML
|
syntactic structure from the DDD database and converts the output into
|
||||||
format. For the single-component units, this is pretty trivial, but
|
CJVT dictionary XML format. For the single-component units, this is
|
||||||
for multiple-component units it is more involved, and includes two
|
pretty trivial, but for multiple-component units it is more involved,
|
||||||
runs of the MWE extraction script
|
and includes two runs of the MWE extraction script
|
||||||
[wani.py](https://gitea.cjvt.si/ozbolt/luscenje_struktur), generating
|
[wani.py](https://gitea.cjvt.si/ozbolt/luscenje_struktur), generating
|
||||||
missing structures in between. At the end, the single-component and
|
missing structures in between. At the end, the single-component and
|
||||||
multiple-component dictionary files are merged. Both the merged
|
multiple-component dictionary files are merged into one dictionary
|
||||||
dictionary file and the updated structure specification file are
|
file. Example:
|
||||||
validated with the appropriate XML schemas.
|
|
||||||
|
|
||||||
Example usage:
|
|
||||||
|
|
||||||
```
|
```
|
||||||
$ python scripts/pipeline2.py -intei parsed_corrected.xml -instructures structures.xml -outstructures structures_new.xml -outlexicon dictionary.xml
|
$ python scripts/process.py -mode parse_to_dictionary -infile /tmp/parsed.xml -outfile /tmp/dictionary.xml -structures /tmp/structures_new.xml
|
||||||
```
|
```
|
||||||
|
|
||||||
|
### strings_to_dictionary
|
||||||
|
|
||||||
|
Combines strings_to_parse and parse_to_dictionary into one call
|
||||||
|
(whereby you forfeit the chance to fix potential parse errors in
|
||||||
|
between). Example:
|
||||||
|
|
||||||
|
```
|
||||||
|
$ python scripts/process.py -mode strings_to_dictionary -infile /tmp/strings.txt -outfile /tmp/dictionary.xml -structures /tmp/structures_new.xml
|
||||||
|
```
|
||||||
|
|
||||||
|
### all
|
||||||
|
|
||||||
|
Same as strings_to_dictionary, but also validates the dictionary and
|
||||||
|
structures outputs, just in case.
|
||||||
|
|
||||||
|
```
|
||||||
|
$ python scripts/process.py -mode all -infile /tmp/strings.txt -outfile /tmp/dictionary.xml -structures /tmp/structures_new.xml
|
||||||
|
```
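
The modes differ only in their arguments, so the two-step workflow with a manual correction in between can be scripted. A minimal sketch, assuming the commands are run from the repository root as in the examples; `build_command` and `run_two_step` are hypothetical helpers, not part of the repository:

```python
import subprocess


def build_command(mode, infile, outfile=None, structures=None, python='python'):
    """Build the argument list for one scripts/process.py call.

    Hypothetical helper: it only uses the flags shown in the README
    examples (-mode, -infile, -outfile, -structures).
    """
    command = [python, 'scripts/process.py', '-mode', mode, '-infile', infile]
    if outfile is not None:
        command += ['-outfile', outfile]
    if structures is not None:
        command += ['-structures', structures]
    return command


def run_two_step(strings_file, parsed_file, dictionary_file, structures_file):
    """Parse, pause for manual corrections, then build the dictionary."""
    subprocess.run(build_command('strings_to_parse', strings_file, parsed_file),
                   check=True)
    # Wait until the user has corrected the parse (e.g. in an external editor).
    input('Correct ' + parsed_file + ' if needed, then press Enter: ')
    subprocess.run(build_command('parse_to_dictionary', parsed_file,
                                 dictionary_file, structures_file),
                   check=True)
```

Calling `run_two_step('/tmp/strings.txt', '/tmp/parsed.xml', '/tmp/dictionary.xml', '/tmp/structures_new.xml')` would mirror the strings_to_parse and parse_to_dictionary examples, with a pause in between.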
|
||||||
|
|
||||||
|
## REST API
|
||||||
|
|
||||||
|
The package provides a REST API with endpoints roughly mirroring the
|
||||||
|
process.py modes. For the calls accepting strings as input, GET calls
|
||||||
|
are also supported for single-string input. For the calls resulting in
|
||||||
|
dictionaries, the results include both the dictionary entries and the
|
||||||
|
structures they use. If processing resulted in temporary new
|
||||||
|
structures, their number is recorded in @new_structures.
|
||||||
|
|
||||||
|
Example curl calls
|
||||||
|
(assuming you deploy at https://www.example.com/structures/):
|
||||||
|
|
||||||
|
```
|
||||||
|
$ curl https://www.example.com/structures/strings_to_parse?string=velika%20miza
|
||||||
|
$ curl -X POST -F file=@/tmp/strings.txt https://www.example.com/structures/strings_to_parse
|
||||||
|
$ curl -X POST -F file=@/tmp/parsed.xml https://www.example.com/structures/parse_to_dictionary
|
||||||
|
$ curl https://www.example.com/structures/strings_to_dictionary?string=velika%20miza
|
||||||
|
$ curl -X POST -F file=@/tmp/strings.txt https://www.example.com/structures/strings_to_dictionary
|
||||||
|
```
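
For single-string GET calls the string has to be percent-encoded, as in `velika%20miza`. A minimal Python sketch of building such a URL, assuming the `string` query parameter and the example deployment URL from the curl calls; `build_get_url` and `fetch` are illustrative names, not part of the package:

```python
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def build_get_url(base_url, endpoint, string):
    """Return the GET URL for a single-string call to the given endpoint."""
    # quote_via=quote encodes spaces as %20 (the default would use '+').
    query = urlencode({'string': string}, quote_via=quote)
    return base_url.rstrip('/') + '/' + endpoint + '?' + query


def fetch(url):
    """Send the GET request and return the response body as text."""
    with urlopen(url) as response:
        return response.read().decode('utf-8')


if __name__ == '__main__':
    url = build_get_url('https://www.example.com/structures/',
                        'strings_to_parse', 'velika miza')
    print(url)
```

Passing the resulting URL to `fetch` would perform the same request as the first curl call, provided the service is actually deployed at that address.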
|
||||||
|
|
||||||
|
## Note
|
||||||
|
|
||||||
Note that any new structures generated are given temporary ids
|
Note that any new structures generated are given temporary ids
|
||||||
(@tempId), because they only get real ids once they are approved and
|
(@tempId), because they only get real ids once they are approved and
|
||||||
added to the DDD database. That can be done via the django
|
added to the DDD database. That is normally done via the django
|
||||||
import_structures.py script in the [ddd_core
|
import_structures.py script in the [ddd_core
|
||||||
repository](https://gitea.cjvt.si/ddd/ddd_core), normally as part of a
|
repository](https://gitea.cjvt.si/ddd/ddd_core), as part of a batch of
|
||||||
batch of DDD updates. That script replaces the temporary ids in the
|
DDD updates. That script replaces the temporary ids in the structure
|
||||||
structure specifications and updates the ids in the dictionary file.
|
specifications and updates the ids in the dictionary file.
|
||||||
|
|
|
@ -15,8 +15,8 @@ resource_directory = os.environ['API_RESOURCE_DIR']
|
||||||
runner = Runner(resource_directory, True)
|
runner = Runner(resource_directory, True)
|
||||||
|
|
||||||
|
|
||||||
@app.route(api_prefix + '/string_to_parse', methods=['GET', 'POST'])
|
@app.route(api_prefix + '/strings_to_parse', methods=['GET', 'POST'])
|
||||||
def string_to_parse():
|
def strings_to_parse():
|
||||||
|
|
||||||
tmp_directory = tempfile.mkdtemp()
|
tmp_directory = tempfile.mkdtemp()
|
||||||
string_file_name = tmp_directory + '/input_string.txt'
|
string_file_name = tmp_directory + '/input_string.txt'
|
||||||
|
@ -77,8 +77,8 @@ def parse_to_dictionary():
|
||||||
return Response(message, mimetype='text/xml')
|
return Response(message, mimetype='text/xml')
|
||||||
|
|
||||||
|
|
||||||
@app.route(api_prefix + '/string_to_dictionary', methods=['GET', 'POST'])
|
@app.route(api_prefix + '/strings_to_dictionary', methods=['GET', 'POST'])
|
||||||
def string_to_dictionary():
|
def strings_to_dictionary():
|
||||||
|
|
||||||
tmp_directory = tempfile.mkdtemp()
|
tmp_directory = tempfile.mkdtemp()
|
||||||
string_file_name = tmp_directory + '/input_string.txt'
|
string_file_name = tmp_directory + '/input_string.txt'
|
||||||
|
@ -116,4 +116,3 @@ def string_to_dictionary():
|
||||||
message = '<error>' + str(e) + '</error>'
|
message = '<error>' + str(e) + '</error>'
|
||||||
|
|
||||||
return Response(message, mimetype='text/xml')
|
return Response(message, mimetype='text/xml')
|
||||||
|
|
||||||
|
|
|
@ -7,28 +7,28 @@ resource_directory = '../resources'
|
||||||
if (__name__ == '__main__'):
|
if (__name__ == '__main__'):
|
||||||
|
|
||||||
arg_parser = argparse.ArgumentParser(description='Run part or all of structure pipeline.')
|
arg_parser = argparse.ArgumentParser(description='Run part or all of structure pipeline.')
|
||||||
arg_parser.add_argument('-part', type=str, help='Part name')
|
arg_parser.add_argument('-mode', type=str, help='Mode')
|
||||||
arg_parser.add_argument('-infile', type=str, help='Input file')
|
arg_parser.add_argument('-infile', type=str, help='Input file')
|
||||||
arg_parser.add_argument('-outfile', type=str, help='Output file')
|
arg_parser.add_argument('-outfile', type=str, help='Output file')
|
||||||
arg_parser.add_argument('-structures', type=str, help='Updated structure file')
|
arg_parser.add_argument('-structures', type=str, help='Updated structure file')
|
||||||
arguments = arg_parser.parse_args()
|
arguments = arg_parser.parse_args()
|
||||||
|
|
||||||
part_name = arguments.part
|
mode = arguments.mode
|
||||||
input_file_name = arguments.infile
|
input_file_name = arguments.infile
|
||||||
output_file_name = arguments.outfile
|
output_file_name = arguments.outfile
|
||||||
structure_file_name = arguments.structures
|
structure_file_name = arguments.structures
|
||||||
|
|
||||||
nlp_needed = part_name in {'strings_to_parse', 'strings_to_dictionary', 'all'}
|
nlp_needed = mode in {'strings_to_parse', 'strings_to_dictionary', 'all'}
|
||||||
runner = Runner(resource_directory, nlp_needed)
|
runner = Runner(resource_directory, nlp_needed)
|
||||||
if (part_name == 'strings_to_parse'):
|
if (mode == 'strings_to_parse'):
|
||||||
runner.strings_to_parse(input_file_name, output_file_name)
|
runner.strings_to_parse(input_file_name, output_file_name)
|
||||||
elif (part_name == 'strings_to_dictionary'):
|
elif (mode == 'strings_to_dictionary'):
|
||||||
runner.strings_to_dictionary(input_file_name, output_file_name, structure_file_name)
|
runner.strings_to_dictionary(input_file_name, output_file_name, structure_file_name)
|
||||||
elif (part_name == 'parse_to_dictionary'):
|
elif (mode == 'parse_to_dictionary'):
|
||||||
runner.parse_to_dictionary(input_file_name, output_file_name, structure_file_name)
|
runner.parse_to_dictionary(input_file_name, output_file_name, structure_file_name)
|
||||||
elif (part_name == 'validate_structures'):
|
elif (mode == 'validate_structures'):
|
||||||
runner.validate_structures(input_file_name)
|
runner.validate_structures(input_file_name)
|
||||||
elif (part_name == 'validate_dictionary'):
|
elif (mode == 'validate_dictionary'):
|
||||||
runner.validate_dictionary(input_file_name)
|
runner.validate_dictionary(input_file_name)
|
||||||
elif (part_name == 'all'):
|
elif (mode == 'all'):
|
||||||
runner.run_all(input_file_name, output_file_name, structure_file_name)
|
runner.run_all(input_file_name, output_file_name, structure_file_name)
|
||||||
|
|
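
In the hunk above, a `-mode` value that matches none of the branches shown falls through the if/elif chain. One way to reject such values up front is argparse's `choices`. A minimal sketch, assuming the six mode names in the chain are the complete set; `MODES` and `build_parser` are illustrative names, not from the repository:

```python
import argparse

# Mode names taken from the if/elif chain in the diff (assumed complete).
MODES = (
    'strings_to_parse',
    'strings_to_dictionary',
    'parse_to_dictionary',
    'validate_structures',
    'validate_dictionary',
    'all',
)


def build_parser():
    """Build a parser that rejects unknown or missing -mode values."""
    parser = argparse.ArgumentParser(description='Run part or all of structure pipeline.')
    parser.add_argument('-mode', type=str, choices=MODES, required=True, help='Mode')
    parser.add_argument('-infile', type=str, help='Input file')
    parser.add_argument('-outfile', type=str, help='Output file')
    parser.add_argument('-structures', type=str, help='Updated structure file')
    return parser


if __name__ == '__main__':
    arguments = build_parser().parse_args()
    print(arguments.mode)
```

With `choices`, argparse prints a usage error and exits when the mode is misspelled, instead of the script finishing without doing anything.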