Redmine #1835: Rewrote readme and minor renaming
parent
19e9eb6341
commit
37a10ae703
113
README.md
|
@ -2,65 +2,106 @@
|
|||
|
||||
Pipeline for parsing a list of arbitrary Slovene strings and assigning
|
||||
each to a syntactic structure in the DDD database, generating
|
||||
provisional new structures if necessary. The pipeline consists of two
|
||||
separate wrapper python3 scripts, pipeline1.py and pipeline2.py, in
|
||||
order to facilitate manual corrections of the parsed units in between
|
||||
(e.g., with qcat).
|
||||
provisional new structures if necessary.
|
||||
|
||||
## Setup
|
||||
|
||||
The scripts are mostly wrappers, in that they largely use generic
|
||||
scripts and resources from other git repositories, as well as python
|
||||
libraries. These can be set up with a special script:
|
||||
Most of the scripts come from other repositories and python libraries.
|
||||
Run the set-up script:
|
||||
|
||||
```
|
||||
$ bin/setup.sh
|
||||
$ scripts/setup.sh
|
||||
```
|
||||
|
||||
## pipeline1.py
|
||||
## Usage
|
||||
|
||||
The pipeline1.py script expects as input a file of Slovene strings,
|
||||
one string per line. It then runs the [python obeliks
|
||||
The main script is scripts/process.py. There are several modes (
|
||||
identified via the -mode parameter), depending on whether you want to
|
||||
run the whole pipeline from start to finish (daring!), or with manual
|
||||
intervention (of the parse) in between. XML validation is also
|
||||
provided separately.
|
||||
|
||||
|
||||
### strings_to_parse
|
||||
|
||||
The input is a file of Slovene strings (one string per line). The
|
||||
script runs the [python obeliks
|
||||
tokeniser](https://pypi.org/project/obeliks/) on the input, tweaks the
|
||||
conllu output a little, and runs JOS-configured
|
||||
[classla](https://pypi.org/project/classla/) to parse the output. It
|
||||
then translates the JOS tags (msds and dependencies) from English to
|
||||
Slovene and converts the output to TEI XML.
|
||||
|
||||
Example usage:
|
||||
Slovene and converts the output to TEI XML. Example:
|
||||
|
||||
```
|
||||
$ python scripts/pipeline1.py -inlist strings.txt -outtei parsed.xml
|
||||
$ python scripts/process.py -mode strings_to_parse -infile /tmp/strings.txt -outfile /tmp/parsed.xml
|
||||
```
|
||||
|
||||
## pipeline2.py
|
||||
### parse_to_dictionary
|
||||
|
||||
The pipeline2.py script expects as input a TEI XML file (in the same
|
||||
particular format as the output of pipeline1.py) and an xml file of
|
||||
structure specifications (normally the up-to-date CJVT structures.xml
|
||||
file). It first splits the TEI file into two files, one with the
|
||||
single-component units and the other with the multiple-component
|
||||
units. For each, it then assigns each unit to a syntactic structure
|
||||
from the DDD database and converts the output into CJVT dictionary XML
|
||||
format. For the single-component units, this is pretty trivial, but
|
||||
for multiple-component units it is more involved, and includes two
|
||||
runs of the MWE extraction script
|
||||
The input should be a TEI XML file (in the same particular format as
|
||||
the output of strings_to_parse) and an xml file of structure
|
||||
specifications (the CJVT structures.xml file, supplemented with
|
||||
temporary new structures, if needed). It first splits the TEI file
|
||||
into two files, one with the single-component units and the other with
|
||||
the multiple-component units. For each, it then assigns each unit to a
|
||||
syntactic structure from the DDD database and converts the output into
|
||||
CJVT dictionary XML format. For the single-component units, this is
|
||||
pretty trivial, but for multiple-component units it is more involved,
|
||||
and includes two runs of the MWE extraction script
|
||||
[wani.py](https://gitea.cjvt.si/ozbolt/luscenje_struktur), generating
|
||||
missing structures in between. At the end, the single-component and
|
||||
multiple-component dictionary files are merged. Both the merged
|
||||
dictionary file and the updated structure specification file are
|
||||
validated with the appropriate XML schemas.
|
||||
|
||||
Example usage:
|
||||
multiple-component dictionary files are merged into one dictionary
|
||||
file. Example:
|
||||
|
||||
```
|
||||
$ python scripts/pipeline2.py -intei parsed_corrected.xml -instructures structures.xml -outstructures structures_new.xml -outlexicon dictionary.xml
|
||||
$ python scripts/process.py -mode parse_to_dictionary -infile /tmp/parsed.xml -outfile /tmp/dictionary.xml -structures /tmp/structures_new.xml
|
||||
```
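
The two-step workflow described above (parse, correct by hand, then build the dictionary) could be scripted roughly as follows. This is only a sketch: `build_command` and `run_two_step` are hypothetical helper names, not part of this repository, and the only things taken from the README are the script path and the `-mode`, `-infile`, `-outfile` and `-structures` flags.

```python
import subprocess
import sys

PROCESS_SCRIPT = "scripts/process.py"  # path as given in the README


def build_command(mode, infile, outfile=None, structures=None):
    """Build the argument list for one process.py call (hypothetical helper)."""
    command = [sys.executable, PROCESS_SCRIPT, "-mode", mode, "-infile", infile]
    if outfile is not None:
        command += ["-outfile", outfile]
    if structures is not None:
        command += ["-structures", structures]
    return command


def run_two_step(strings_file, parse_file, dictionary_file, structures_file, pause=input):
    """Parse the strings, wait for manual corrections, then build the dictionary."""
    subprocess.run(build_command("strings_to_parse", strings_file, parse_file), check=True)
    # The parse can be corrected here (e.g., with qcat) before continuing.
    pause(f"Correct {parse_file} if needed, then press Enter to continue: ")
    subprocess.run(
        build_command("parse_to_dictionary", parse_file, dictionary_file, structures_file),
        check=True,
    )
```

Calling `run_two_step("/tmp/strings.txt", "/tmp/parsed.xml", "/tmp/dictionary.xml", "/tmp/structures_new.xml")` would then mirror the two example commands, with a prompt in between.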
|
||||
|
||||
### strings_to_dictionary
|
||||
|
||||
Combines strings_to_parse and parse_to_dictionary into one call
|
||||
(whereby you forfeit the chance to fix potential parse errors in
|
||||
between). Example:
|
||||
|
||||
```
|
||||
$ python scripts/process.py -mode strings_to_dictionary -infile /tmp/strings.txt -outfile /tmp/dictionary.xml -structures /tmp/structures_new.xml
|
||||
```
|
||||
|
||||
### all
|
||||
|
||||
Same as strings_to_dictionary, but also validates the dictionary and
|
||||
structures outputs, just in case.
|
||||
|
||||
```
|
||||
$ python scripts/process.py -mode all -infile /tmp/strings.txt -outfile /tmp/dictionary.xml -structures /tmp/structures_new.xml
|
||||
```
|
||||
|
||||
## REST API
|
||||
|
||||
The package provides a REST API with endpoints roughly mirroring the
|
||||
process.py modes. For the calls accepting strings as input, GET calls
|
||||
are also supported for single-string input. For the calls resulting in
|
||||
dictionaries, the results include both the dictionary entries and the
|
||||
structures they use. If processing resulted in temporary new
|
||||
structures, their number is recorded in @new_structures.
|
||||
|
||||
Example curl calls
|
||||
(assuming you deploy at https://www.example.com/structures/):
|
||||
|
||||
```
|
||||
$ curl 'https://www.example.com/structures/strings_to_parse?string=velika%20miza'
$ curl -X POST -F file=@/tmp/strings.txt https://www.example.com/structures/strings_to_parse
$ curl -X POST -F file=@/tmp/parsed.xml https://www.example.com/structures/parse_to_dictionary
$ curl 'https://www.example.com/structures/strings_to_dictionary?string=velika%20miza'
$ curl -X POST -F file=@/tmp/strings.txt https://www.example.com/structures/strings_to_dictionary
|
||||
```
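
For callers who prefer Python to curl, the single-string GET calls could be made along these lines. This is a minimal sketch using only the standard library: `build_get_url` and `fetch_xml` are hypothetical names, the base URL is the placeholder deployment used above, and decoding the response as UTF-8 is an assumption (the API code only shows that it returns `text/xml`).

```python
from urllib.parse import quote, urlencode
from urllib.request import urlopen

BASE_URL = "https://www.example.com/structures/"  # placeholder deployment, as above


def build_get_url(endpoint, string, base_url=BASE_URL):
    """Return the GET URL for a single-string call, percent-encoding the string."""
    # quote_via=quote encodes spaces as %20 rather than '+'.
    query = urlencode({"string": string}, quote_via=quote)
    return f"{base_url.rstrip('/')}/{endpoint}?{query}"


def fetch_xml(endpoint, string, base_url=BASE_URL):
    """Send the GET request and return the XML response body as text."""
    with urlopen(build_get_url(endpoint, string, base_url)) as response:
        return response.read().decode("utf-8")  # assumed encoding
```

Building the URL separately from sending the request keeps the encoding step easy to check without a running deployment.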
|
||||
|
||||
## Note
|
||||
|
||||
Note that any new structures generated are given temporary ids
|
||||
(@tempId), because they only get real ids once they are approved and
|
||||
added to the DDD database. That can be done via the django
|
||||
added to the DDD database. That is normally done via the django
|
||||
import_structures.py script in the [ddd_core
|
||||
repository](https://gitea.cjvt.si/ddd/ddd_core), normally as part of a
|
||||
batch of DDD updates. That script replaces the temporary ids in the
|
||||
structure specifications and updates the ids in the dictionary file.
|
||||
repository](https://gitea.cjvt.si/ddd/ddd_core), as part of a batch of
|
||||
DDD updates. That script replaces the temporary ids in the structure
|
||||
specifications and updates the ids in the dictionary file.
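
To check whether an output file still contains temporary structures, one could list every element carrying a `tempId` attribute. The sketch below assumes only that attribute name, which comes from the note above; the element names in the sample are invented for illustration and the real structures.xml layout may differ.

```python
import xml.etree.ElementTree as ET


def temporary_ids(xml_text):
    """Return the tempId values of all elements that carry that attribute."""
    root = ET.fromstring(xml_text)
    return [element.get("tempId") for element in root.iter() if "tempId" in element.attrib]


# Hypothetical sample; element names are placeholders, not the real schema.
SAMPLE = """<syntactic_structures>
  <syntactic_structure id="1"/>
  <syntactic_structure tempId="t1"/>
</syntactic_structures>"""

print(temporary_ids(SAMPLE))
```

An empty list would suggest that all structures in the file already have real ids.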
|
||||
|
|
|
@ -15,8 +15,8 @@ resource_directory = os.environ['API_RESOURCE_DIR']
|
|||
runner = Runner(resource_directory, True)
|
||||
|
||||
|
||||
@app.route(api_prefix + '/string_to_parse', methods=['GET', 'POST'])
|
||||
def string_to_parse():
|
||||
@app.route(api_prefix + '/strings_to_parse', methods=['GET', 'POST'])
|
||||
def strings_to_parse():
|
||||
|
||||
tmp_directory = tempfile.mkdtemp()
|
||||
string_file_name = tmp_directory + '/input_string.txt'
|
||||
|
@ -77,8 +77,8 @@ def parse_to_dictionary():
|
|||
return Response(message, mimetype='text/xml')
|
||||
|
||||
|
||||
@app.route(api_prefix + '/string_to_dictionary', methods=['GET', 'POST'])
|
||||
def string_to_dictionary():
|
||||
@app.route(api_prefix + '/strings_to_dictionary', methods=['GET', 'POST'])
|
||||
def strings_to_dictionary():
|
||||
|
||||
tmp_directory = tempfile.mkdtemp()
|
||||
string_file_name = tmp_directory + '/input_string.txt'
|
||||
|
@ -116,4 +116,3 @@ def string_to_dictionary():
|
|||
message = '<error>' + str(e) + '</error>'
|
||||
|
||||
return Response(message, mimetype='text/xml')
|
||||
|
||||
|
|
|
@ -7,28 +7,28 @@ resource_directory = '../resources'
|
|||
if (__name__ == '__main__'):
|
||||
|
||||
arg_parser = argparse.ArgumentParser(description='Run part or all of structure pipeline.')
|
||||
arg_parser.add_argument('-part', type=str, help='Part name')
|
||||
arg_parser.add_argument('-mode', type=str, help='Mode')
|
||||
arg_parser.add_argument('-infile', type=str, help='Input file')
|
||||
arg_parser.add_argument('-outfile', type=str, help='Output file')
|
||||
arg_parser.add_argument('-structures', type=str, help='Updated structure file')
|
||||
arguments = arg_parser.parse_args()
|
||||
|
||||
part_name = arguments.part
|
||||
mode = arguments.mode
|
||||
input_file_name = arguments.infile
|
||||
output_file_name = arguments.outfile
|
||||
structure_file_name = arguments.structures
|
||||
|
||||
nlp_needed = part_name in {'strings_to_parse', 'strings_to_dictionary', 'all'}
|
||||
nlp_needed = mode in {'strings_to_parse', 'strings_to_dictionary', 'all'}
|
||||
runner = Runner(resource_directory, nlp_needed)
|
||||
if (part_name == 'strings_to_parse'):
|
||||
if (mode == 'strings_to_parse'):
|
||||
runner.strings_to_parse(input_file_name, output_file_name)
|
||||
elif (part_name == 'strings_to_dictionary'):
|
||||
elif (mode == 'strings_to_dictionary'):
|
||||
runner.strings_to_dictionary(input_file_name, output_file_name, structure_file_name)
|
||||
elif (part_name == 'parse_to_dictionary'):
|
||||
elif (mode == 'parse_to_dictionary'):
|
||||
runner.parse_to_dictionary(input_file_name, output_file_name, structure_file_name)
|
||||
elif (part_name == 'validate_structures'):
|
||||
elif (mode == 'validate_structures'):
|
||||
runner.validate_structures(input_file_name)
|
||||
elif (part_name == 'validate_dictionary'):
|
||||
elif (mode == 'validate_dictionary'):
|
||||
runner.validate_dictionary(input_file_name)
|
||||
elif (part_name == 'all'):
|
||||
elif (mode == 'all'):
|
||||
runner.run_all(input_file_name, output_file_name, structure_file_name)
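
As a possible alternative to the if/elif chain above (a sketch, not the code in this commit), the mode names could be kept in one table that also feeds argparse's `choices`, so an unknown `-mode` is rejected up front. `MODES`, `build_parser` and `dispatch` are hypothetical names; the Runner method names and argument order are taken from the calls shown above.

```python
import argparse

# Mode name -> (Runner method name, argument attributes passed to it).
MODES = {
    'strings_to_parse': ('strings_to_parse', ('infile', 'outfile')),
    'strings_to_dictionary': ('strings_to_dictionary', ('infile', 'outfile', 'structures')),
    'parse_to_dictionary': ('parse_to_dictionary', ('infile', 'outfile', 'structures')),
    'validate_structures': ('validate_structures', ('infile',)),
    'validate_dictionary': ('validate_dictionary', ('infile',)),
    'all': ('run_all', ('infile', 'outfile', 'structures')),
}


def build_parser():
    """Same options as the script, plus validation of the mode name."""
    parser = argparse.ArgumentParser(description='Run part or all of structure pipeline.')
    parser.add_argument('-mode', type=str, choices=sorted(MODES), required=True, help='Mode')
    parser.add_argument('-infile', type=str, help='Input file')
    parser.add_argument('-outfile', type=str, help='Output file')
    parser.add_argument('-structures', type=str, help='Updated structure file')
    return parser


def dispatch(runner, arguments):
    """Call the Runner method that corresponds to the selected mode."""
    method_name, argument_names = MODES[arguments.mode]
    values = [getattr(arguments, name) for name in argument_names]
    return getattr(runner, method_name)(*values)
```

With this layout, adding a mode means adding one table entry rather than another branch.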
|
||||
|
|