Redmine #1835: Rewrote readme and minor renaming

2021-03-25 16:27:40 +01:00
commit 37a10ae703
parent 19e9eb6341
3 changed files with 90 additions and 50 deletions
--- a/README.md
+++ b/README.md
@@ -2,65 +2,106 @@

 Pipeline for parsing a list of arbitrary Slovene strings and assigning
 each to a syntactic structure in the DDD database, generating
-provisional new structures if necessary. The pipeline consists of two
-separate wrapper python3 scripts, pipeline1.py and pipeline2.py, in
-order to facilitate manual corrections of the parsed units in between
-(e.g., with qcat).
+provisional new structures if necessary.

 ## Setup

-The scripts are mostly wrappers, in that they largely use generic
-scripts and resources from other git repositories, as well as python
-libraries. These can be set up with a special script:
+Most of the scripts come from other repositories and python libraries.
+Run the set-up script:

 ```
-$ bin/setup.sh
+$ scripts/setup.sh
 ```

-## pipeline1.py
+## Usage

-The pipeline1.py script expects as input a file of Slovene strings,
-one string per line. It then runs the [python obeliks
+The main script is scripts/process.py. There are several modes
+(identified via the -mode parameter), depending on whether you want to
+run the whole pipeline from start to finish (daring!), or with manual
+intervention (of the parse) in between. XML validation is also
+provided separately.
+
+
+### strings_to_parse
+
+The input is a file of Slovene strings (one string per line). The
+script runs the [python obeliks
 tokeniser](https://pypi.org/project/obeliks/) on the input, tweaks the
 conllu output a little, and runs JOS-configured
 [classla](https://pypi.org/project/classla/) to parse the output. It
 then translates the JOS tags (msds and dependencies) from English to
-Slovene and converts the output to TEI XML.
-
-Example usage:
+Slovene and converts the output to TEI XML. Example:

 ```
-$ python scripts/pipeline1.py -inlist strings.txt -outtei parsed.xml
+$ python scripts/process.py -mode strings_to_parse -infile /tmp/strings.txt -outfile /tmp/parsed.xml
 ```

-## pipeline2.py
+### parse_to_dictionary

-The pipeline2.py script expects as input a TEI XML file (in the same
-particular format as the output of pipeline1.py) and an xml file of
-structure specifications (normally the up-to-date CJVT structures.xml
-file). It first splits the TEI file into two files, one with the
-single-component units and the other with the multiple-component
-units. For each, it then assigns each unit to a syntactic structure
-from the DDD database and converts the output into CJVT dictionary XML
-format. For the single-component units, this is pretty trivial, but
-for multiple-component units it is more involved, and includes two
-runs of the MWE extraction script
+The input should be a TEI XML file (in the same particular format as
+the output of strings_to_parse) and an xml file of structure
+specifications (the CJVT structures.xml file, supplemented with
+temporary new structures, if needed). It first splits the TEI file
+into two files, one with the single-component units and the other with
+the multiple-component units. For each, it then assigns each unit to a
+syntactic structure from the DDD database and converts the output into
+CJVT dictionary XML format. For the single-component units, this is
+pretty trivial, but for multiple-component units it is more involved,
+and includes two runs of the MWE extraction script
 [wani.py](https://gitea.cjvt.si/ozbolt/luscenje_struktur), generating
 missing structures in between. At the end, the single-component and
-multiple-component dictionary files are merged. Both the merged
-dictionary file and the updated structure specification file are
-validated with the appropriate XML schemas.
-
-Example usage:
+multiple-component dictionary files are merged into one dictionary
+file. Example:

 ```
-$ python scripts/pipeline2.py -intei parsed_corrected.xml -instructures structures.xml -outstructures structures_new.xml -outlexicon dictionary.xml
+$ python scripts/process.py -mode parse_to_dictionary -infile /tmp/parsed.xml -outfile /tmp/dictionary.xml -structures /tmp/structures_new.xml
 ```

+### strings_to_dictionary
+
+Combines strings_to_parse and parse_to_dictionary into one call
+(whereby you forfeit the chance to fix potential parse errors in
+between). Example:
+
+```
+$ python scripts/process.py -mode strings_to_dictionary -infile /tmp/strings.txt -outfile /tmp/dictionary.xml -structures /tmp/structures_new.xml
+```
+
+### all
+
+Same as strings_to_dictionary, but also validates the dictionary and
+structures outputs, just in case.
+
+```
+$ python scripts/process.py -mode all -infile /tmp/strings.txt -outfile /tmp/dictionary.xml -structures /tmp/structures_new.xml
+```
+
+## REST API
+
+The package provides a REST API with endpoints roughly mirroring the
+process.py modes. For the calls accepting strings as input, GET calls
+are also supported for single-string input. For the calls resulting in
+dictionaries, the results include both the dictionary entries and the
+structures they use. If processing resulted in temporary new
+structures, their number is recorded in @new_structures.
+
+Example curl calls
+(assuming you deploy at https://www.example.com/structures/):
+
+```
+$ curl https://www.example.com/structures/strings_to_parse?string=velika%20miza
+$ curl -X POST -F file=@/tmp/strings.txt https://www.example.com/structures/strings_to_parse
+$ curl -X POST -F file=@/tmp/parsed.xml https://www.example.com/structures/parse_to_dictionary
+$ curl https://www.example.com/structures/strings_to_dictionary?string=velika%20miza
+$ curl -X POST -F file=@/tmp/strings.txt https://www.example.com/structures/strings_to_dictionary
+```
+
+## Note
+
 Note that any new structures generated are given temporary ids
 (@tempId), because they only get real ids once they are approved and
-added to the DDD database. That can be done via the django
+added to the DDD database. That is normally done via the django
 import_structures.py script in the [ddd_core
-repository](https://gitea.cjvt.si/ddd/ddd_core), normally as part of a
-batch of DDD updates. That script replaces the temporary ids in the
-structure specifications and updates the ids in the dictionary file.
+repository](https://gitea.cjvt.si/ddd/ddd_core), as part of a batch of
+DDD updates. That script replaces the temporary ids in the structure
+specifications and updates the ids in the dictionary file.
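
The single-string GET calls in the README's REST API section pass the input in a `string` query parameter, with the space in "velika miza" percent-encoded as `%20`. A minimal sketch of building such a URL from Python follows. It assumes the placeholder deployment at https://www.example.com/structures used in the README; `BASE_URL` and `get_url` are hypothetical names for illustration and are not part of the package.

```python
from urllib.parse import quote

# Placeholder deployment URL taken from the README example (an assumption).
BASE_URL = 'https://www.example.com/structures'


def get_url(endpoint, string):
    """Build a single-string GET URL for a strings_to_* endpoint.

    Hypothetical helper: it only percent-encodes the input and appends it
    as the `string` query parameter shown in the README's curl calls.
    """
    return f'{BASE_URL}/{endpoint}?string={quote(string)}'


if __name__ == '__main__':
    print(get_url('strings_to_parse', 'velika miza'))
```

Slovene characters such as "š" are encoded as UTF-8 percent escapes by `quote`, so the same helper would cover non-ASCII input.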
--- a/package/structure_assignment/api.py
+++ b/package/structure_assignment/api.py
@@ -15,8 +15,8 @@ resource_directory = os.environ['API_RESOURCE_DIR']
 runner = Runner(resource_directory, True)


-@app.route(api_prefix + '/string_to_parse', methods=['GET', 'POST'])
-def string_to_parse():
+@app.route(api_prefix + '/strings_to_parse', methods=['GET', 'POST'])
+def strings_to_parse():

    tmp_directory = tempfile.mkdtemp()
    string_file_name = tmp_directory + '/input_string.txt'
@@ -77,8 +77,8 @@ def parse_to_dictionary():
    return Response(message, mimetype='text/xml')


-@app.route(api_prefix + '/string_to_dictionary', methods=['GET', 'POST'])
-def string_to_dictionary():
+@app.route(api_prefix + '/strings_to_dictionary', methods=['GET', 'POST'])
+def strings_to_dictionary():

    tmp_directory = tempfile.mkdtemp()
    string_file_name = tmp_directory + '/input_string.txt'
@@ -116,4 +116,3 @@ def string_to_dictionary():
        message = '<error>' + str(e) + '</error>'

    return Response(message, mimetype='text/xml')
-
--- a/scripts/process.py
+++ b/scripts/process.py
@@ -7,28 +7,28 @@ resource_directory = '../resources'
 if (__name__ == '__main__'):

    arg_parser = argparse.ArgumentParser(description='Run part or all of structure pipeline.')
-    arg_parser.add_argument('-part', type=str, help='Part name')
+    arg_parser.add_argument('-mode', type=str, help='Mode')
    arg_parser.add_argument('-infile', type=str, help='Input file')
    arg_parser.add_argument('-outfile', type=str, help='Output file')
    arg_parser.add_argument('-structures', type=str, help='Updated structure file')
    arguments = arg_parser.parse_args()

-    part_name = arguments.part
+    mode = arguments.mode
    input_file_name = arguments.infile
    output_file_name = arguments.outfile
    structure_file_name = arguments.structures

-    nlp_needed = part_name in {'strings_to_parse', 'strings_to_dictionary', 'all'}
+    nlp_needed = mode in {'strings_to_parse', 'strings_to_dictionary', 'all'}
    runner = Runner(resource_directory, nlp_needed)
-    if (part_name == 'strings_to_parse'):
+    if (mode == 'strings_to_parse'):
        runner.strings_to_parse(input_file_name, output_file_name)
-    elif (part_name == 'strings_to_dictionary'):
+    elif (mode == 'strings_to_dictionary'):
        runner.strings_to_dictionary(input_file_name, output_file_name, structure_file_name)
-    elif (part_name == 'parse_to_dictionary'):
+    elif (mode == 'parse_to_dictionary'):
        runner.parse_to_dictionary(input_file_name, output_file_name, structure_file_name)
-    elif (part_name == 'validate_structures'):
+    elif (mode == 'validate_structures'):
        runner.validate_structures(input_file_name)
-    elif (part_name == 'validate_dictionary'):
+    elif (mode == 'validate_dictionary'):
        runner.validate_dictionary(input_file_name)
-    elif (part_name == 'all'):
+    elif (mode == 'all'):
        runner.run_all(input_file_name, output_file_name, structure_file_name)
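
As written, an unknown or missing `-mode` value falls through the if/elif chain in process.py without any message. One possible tightening, sketched here and not part of this commit, is to let argparse reject unknown modes via `choices` and to move the dispatch into one function. The mode names and argument orders are copied from the chain above; `build_parser`, `dispatch`, and `RecordingRunner` are hypothetical names, and `RecordingRunner` is only a stand-in for the real `Runner` so that the sketch runs on its own.

```python
import argparse

# Mode names copied from the if/elif chain in scripts/process.py.
NLP_MODES = {'strings_to_parse', 'strings_to_dictionary', 'all'}
ALL_MODES = sorted(NLP_MODES | {'parse_to_dictionary',
                                'validate_structures',
                                'validate_dictionary'})


def build_parser():
    """Same arguments as process.py, but -mode is required and validated."""
    parser = argparse.ArgumentParser(description='Run part or all of structure pipeline.')
    parser.add_argument('-mode', type=str, required=True, choices=ALL_MODES, help='Mode')
    parser.add_argument('-infile', type=str, help='Input file')
    parser.add_argument('-outfile', type=str, help='Output file')
    parser.add_argument('-structures', type=str, help='Updated structure file')
    return parser


def dispatch(runner, mode, infile, outfile, structures):
    """Call the runner method for the mode with the arguments it takes."""
    if mode == 'strings_to_parse':
        return runner.strings_to_parse(infile, outfile)
    if mode in ('validate_structures', 'validate_dictionary'):
        return getattr(runner, mode)(infile)
    # The 'all' mode maps to run_all; the remaining modes share their method name.
    method = runner.run_all if mode == 'all' else getattr(runner, mode)
    return method(infile, outfile, structures)


class RecordingRunner:
    """Hypothetical stand-in for Runner: returns the call instead of running it."""

    def __getattr__(self, name):
        return lambda *args: (name, args)


if __name__ == '__main__':
    args = build_parser().parse_args()
    print(dispatch(RecordingRunner(), args.mode, args.infile,
                   args.outfile, args.structures))
```

With `choices` in place, a misspelled mode such as `-mode string_to_parse` makes argparse exit with a usage error rather than silently doing nothing.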