diff --git a/README.md b/README.md
index b199a30..a5dc4df 100644
--- a/README.md
+++ b/README.md
@@ -2,65 +2,106 @@
 
 Pipeline for parsing a list of arbitrary Slovene strings and assigning
 each to a syntactic structure in the DDD database, generating
-provisional new structures if necessary. The pipeline consists of two
-separate wrapper python3 scripts, pipeline1.py and pipeline2.py, in
-order to facilitate manual corrections of the parsed units in between
-(e.g., with qcat).
+provisional new structures if necessary.
 
 ## Setup
 
-The scripts are mostly wrappers, in that they largely use generic
-scripts and resources from other git repositories, as well as python
-libraries. These can be set up with a special script:
+Most of the scripts come from other repositories and python libraries.
+Run the set-up script:
 
 ```
-$ bin/setup.sh
+$ scripts/setup.sh
 ```
 
-## pipeline1.py
+## Usage
 
-The pipeline1.py script expects as input a file of Slovene strings,
-one string per line. It then runs the [python obeliks
+The main script is scripts/process.py. There are several modes
+(identified via the -mode parameter), depending on whether you want to
+run the whole pipeline from start to finish (daring!), or with manual
+intervention (of the parse) in between. XML validation is also
+provided separately.
+
+
+### strings_to_parse
+
+The input is a file of Slovene strings (one string per line). The
+script runs the [python obeliks
 tokeniser](https://pypi.org/project/obeliks/) on the input, tweaks the
 conllu output a little, and runs JOS-configured
 [classla](https://pypi.org/project/classla/) to parse the output. It
 then translates the JOS tags (msds and dependencies) from English to
-Slovene and converts the output to TEI XML.
-
-Example usage:
+Slovene and converts the output to TEI XML.
+Example:
 
 ```
-$ python scripts/pipeline1.py -inlist strings.txt -outtei parsed.xml
+$ python scripts/process.py -mode strings_to_parse -infile /tmp/strings.txt -outfile /tmp/parsed.xml
 ```
 
-## pipeline2.py
+### parse_to_dictionary
 
-The pipeline2.py script expects as input a TEI XML file (in the same
-particular format as the output of pipeline1.py) and an xml file of
-structure specifications (normally the up-to-date CJVT structures.xml
-file). It first splits the TEI file into two files, one with the
-single-component units and the other with the multiple-component
-units. For each, it then assigns each unit to a syntactic structure
-from the DDD database and converts the output into CJVT dictionary XML
-format. For the single-component units, this is pretty trivial, but
-for multiple-component units it is more involved, and includes two
-runs of the MWE extraction script
+The input should be a TEI XML file (in the same particular format as
+the output of strings_to_parse) and an xml file of structure
+specifications (the CJVT structures.xml file, supplemented with
+temporary new structures, if needed). It first splits the TEI file
+into two files, one with the single-component units and the other with
+the multiple-component units. For each, it then assigns each unit to a
+syntactic structure from the DDD database and converts the output into
+CJVT dictionary XML format. For the single-component units, this is
+pretty trivial, but for multiple-component units it is more involved,
+and includes two runs of the MWE extraction script
 [wani.py](https://gitea.cjvt.si/ozbolt/luscenje_struktur), generating
 missing structures in between. At the end, the single-component and
-multiple-component dictionary files are merged. Both the merged
-dictionary file and the updated structure specification file are
-validated with the appropriate XML schemas.
-
-Example usage:
+multiple-component dictionary files are merged into one dictionary
+file.
+Example:
 
 ```
-$ python scripts/pipeline2.py -intei parsed_corrected.xml -instructures structures.xml -outstructures structures_new.xml -outlexicon dictionary.xml
+$ python scripts/process.py -mode parse_to_dictionary -infile /tmp/parsed.xml -outfile /tmp/dictionary.xml -structures /tmp/structures_new.xml
 ```
 
+### strings_to_dictionary
+
+Combines strings_to_parse and parse_to_dictionary into one call
+(whereby you forfeit the chance to fix potential parse errors in
+between). Example:
+
+```
+$ python scripts/process.py -mode strings_to_dictionary -infile /tmp/strings.txt -outfile /tmp/dictionary.xml -structures /tmp/structures_new.xml
+```
+
+### all
+
+Same as strings_to_dictionary, but also validates the dictionary and
+structures outputs, just in case.
+
+```
+$ python scripts/process.py -mode all -infile /tmp/strings.txt -outfile /tmp/dictionary.xml -structures /tmp/structures_new.xml
+```
+
+## REST API
+
+The package provides a REST API with endpoints roughly mirroring the
+process.py modes. For the calls accepting strings as input, GET calls
+are also supported for single-string input. For the calls resulting in
+dictionaries, the results include both the dictionary entries and the
+structures they use. If processing resulted in temporary new
+structures, their number is recorded in @new_structures.
+
+Example curl calls
+(assuming you deploy at https://www.example.com/structures/):
+
+```
+$ curl https://www.example.com/structures/strings_to_parse?string=velika%20miza
+$ curl -X POST -F file=@/tmp/strings.txt https://www.example.com/structures/strings_to_parse
+$ curl -X POST -F file=@/tmp/parse.xml https://www.example.com/structures/parse_to_dictionary
+$ curl https://www.example.com/structures/strings_to_dictionary?string=velika%20miza
+$ curl -X POST -F file=@/tmp/strings.txt https://www.example.com/structures/strings_to_dictionary
+```
+
+## Note
+
 Note that any new structures generated are given temporary ids
 (@tempId), because they only get real ids once they are approved and
-added to the DDD database. That can be done via the django
+added to the DDD database. That is normally done via the django
 import_structures.py script in the [ddd_core
-repository](https://gitea.cjvt.si/ddd/ddd_core), normally as part of a
-batch of DDD updates. That script replaces the temporary ids in the
-structure specifications and updates the ids in the dictionary file.
+repository](https://gitea.cjvt.si/ddd/ddd_core), as part of a batch of
+DDD updates. That script replaces the temporary ids in the structure
+specifications and updates the ids in the dictionary file.
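The single-string GET calls described in the README hunk above can also be built from Python. The following is only a minimal sketch using the standard library: the helper names `build_get_url` and `fetch_xml` and the constant `GET_ENDPOINTS` are hypothetical, the base URL is the README's own placeholder deployment, and UTF-8 is assumed for the response body.

```python
from urllib.parse import quote, urlencode
from urllib.request import urlopen

# Endpoints that take a single string via GET, per the README text
# (assumption: the "string" query parameter name is taken from the curl examples).
GET_ENDPOINTS = {"strings_to_parse", "strings_to_dictionary"}


def build_get_url(base_url: str, endpoint: str, string: str) -> str:
    """Return the GET URL for a single-string call (hypothetical helper)."""
    if endpoint not in GET_ENDPOINTS:
        raise ValueError(f"endpoint does not take a string via GET: {endpoint}")
    # quote_via=quote encodes spaces as %20, matching the curl examples.
    query = urlencode({"string": string}, quote_via=quote)
    return f"{base_url.rstrip('/')}/{endpoint}?{query}"


def fetch_xml(url: str, timeout: float = 30.0) -> str:
    """Fetch the XML response as text; performs a real network request.

    Assumes the service answers in UTF-8.
    """
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Building the URL needs no network access.
print(build_get_url("https://www.example.com/structures/",
                    "strings_to_parse", "velika miza"))
```

Keeping URL construction separate from the request makes the encoding easy to check without a running deployment.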
diff --git a/package/structure_assignment/api.py b/package/structure_assignment/api.py
index 6409e46..58686b1 100644
--- a/package/structure_assignment/api.py
+++ b/package/structure_assignment/api.py
@@ -15,8 +15,8 @@ resource_directory = os.environ['API_RESOURCE_DIR']
 runner = Runner(resource_directory, True)
 
 
-@app.route(api_prefix + '/string_to_parse', methods=['GET', 'POST'])
-def string_to_parse():
+@app.route(api_prefix + '/strings_to_parse', methods=['GET', 'POST'])
+def strings_to_parse():
     tmp_directory = tempfile.mkdtemp()
     string_file_name = tmp_directory + '/input_string.txt'
 
@@ -77,8 +77,8 @@ def parse_to_dictionary():
     return Response(message, mimetype='text/xml')
 
 
-@app.route(api_prefix + '/string_to_dictionary', methods=['GET', 'POST'])
-def string_to_dictionary():
+@app.route(api_prefix + '/strings_to_dictionary', methods=['GET', 'POST'])
+def strings_to_dictionary():
     tmp_directory = tempfile.mkdtemp()
     string_file_name = tmp_directory + '/input_string.txt'
 
@@ -116,4 +116,3 @@ def string_to_dictionary():
         message = '' + str(e) + ''
 
     return Response(message, mimetype='text/xml')
-
diff --git a/scripts/process.py b/scripts/process.py
index ad06b83..0e6c7a0 100644
--- a/scripts/process.py
+++ b/scripts/process.py
@@ -7,28 +7,28 @@ resource_directory = '../resources'
 
 if (__name__ == '__main__'):
     arg_parser = argparse.ArgumentParser(description='Run part or all of structure pipeline.')
-    arg_parser.add_argument('-part', type=str, help='Part name')
+    arg_parser.add_argument('-mode', type=str, help='Mode')
     arg_parser.add_argument('-infile', type=str, help='Input file')
     arg_parser.add_argument('-outfile', type=str, help='Output file')
     arg_parser.add_argument('-structures', type=str, help='Updated structure file')
     arguments = arg_parser.parse_args()
 
-    part_name = arguments.part
+    mode = arguments.mode
     input_file_name = arguments.infile
     output_file_name = arguments.outfile
     structure_file_name = arguments.structures
 
-    nlp_needed = part_name in {'strings_to_parse',
-        'strings_to_dictionary', 'all'}
+    nlp_needed = mode in {'strings_to_parse', 'strings_to_dictionary', 'all'}
     runner = Runner(resource_directory, nlp_needed)
-    if (part_name == 'strings_to_parse'):
+    if (mode == 'strings_to_parse'):
         runner.strings_to_parse(input_file_name, output_file_name)
-    elif (part_name == 'strings_to_dictionary'):
+    elif (mode == 'strings_to_dictionary'):
         runner.strings_to_dictionary(input_file_name, output_file_name, structure_file_name)
-    elif (part_name == 'parse_to_dictionary'):
+    elif (mode == 'parse_to_dictionary'):
         runner.parse_to_dictionary(input_file_name, output_file_name, structure_file_name)
-    elif (part_name == 'validate_structures'):
+    elif (mode == 'validate_structures'):
         runner.validate_structures(input_file_name)
-    elif (part_name == 'validate_dictionary'):
+    elif (mode == 'validate_dictionary'):
         runner.validate_dictionary(input_file_name)
-    elif (part_name == 'all'):
+    elif (mode == 'all'):
         runner.run_all(input_file_name, output_file_name, structure_file_name)
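The if/elif chain in scripts/process.py has no final else branch, so an unrecognised -mode value falls through without any message. One possible alternative, offered only as a sketch and not as the author's implementation, is to let argparse validate the mode via `choices` and dispatch through a dictionary. The mode names and Runner method names below are taken from the diff; `build_parser`, `needs_nlp`, `dispatch` and the `RecordingRunner` stand-in are hypothetical names introduced for illustration.

```python
import argparse

# Modes handled by the if/elif chain in process.py.
MODES = ("strings_to_parse", "strings_to_dictionary", "parse_to_dictionary",
         "validate_structures", "validate_dictionary", "all")
# Modes for which process.py sets nlp_needed to True.
NLP_MODES = {"strings_to_parse", "strings_to_dictionary", "all"}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Run part or all of structure pipeline.")
    # choices= makes argparse exit with a usage error on an unknown mode.
    parser.add_argument("-mode", type=str, choices=MODES, required=True, help="Mode")
    parser.add_argument("-infile", type=str, help="Input file")
    parser.add_argument("-outfile", type=str, help="Output file")
    parser.add_argument("-structures", type=str, help="Updated structure file")
    return parser


def needs_nlp(mode: str) -> bool:
    return mode in NLP_MODES


def dispatch(runner, args):
    """Call the Runner method for args.mode with the arguments process.py passes."""
    handlers = {
        "strings_to_parse": lambda: runner.strings_to_parse(args.infile, args.outfile),
        "strings_to_dictionary": lambda: runner.strings_to_dictionary(args.infile, args.outfile, args.structures),
        "parse_to_dictionary": lambda: runner.parse_to_dictionary(args.infile, args.outfile, args.structures),
        "validate_structures": lambda: runner.validate_structures(args.infile),
        "validate_dictionary": lambda: runner.validate_dictionary(args.infile),
        "all": lambda: runner.run_all(args.infile, args.outfile, args.structures),
    }
    return handlers[args.mode]()


class RecordingRunner:
    """Hypothetical stand-in for Runner: records calls instead of running the pipeline."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Only reached for attributes that do not exist, i.e. the pipeline methods.
        def record(*args):
            self.calls.append((name, args))
        return record


demo_args = build_parser().parse_args(
    ["-mode", "parse_to_dictionary", "-infile", "/tmp/parsed.xml",
     "-outfile", "/tmp/dictionary.xml", "-structures", "/tmp/structures_new.xml"])
demo_runner = RecordingRunner()
dispatch(demo_runner, demo_args)
print(demo_runner.calls)
```

Note that this changes behaviour compared with the diff: a misspelled mode stops the run with a usage message instead of constructing the Runner and doing nothing.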