Redmine #2198: Updated parse_to_dictionary explanation

Cyprian Laskowski 2021-12-07 13:22:30 +01:00
parent 9436a0ffd1
commit 6cf298855e

@@ -40,20 +40,15 @@ $ python scripts/process.py -mode strings_to_parse -infile /tmp/strings.txt -out
 
 The input should be a TEI XML file (in the same particular format as
 the output of strings_to_parse) and an xml file of structure
-specifications. It first splits the TEI file into two files, one with
-the single-component units and the other with the multiple-component
-units. For each, it then assigns each unit to a syntactic structure
-from the DDD database and converts the output into CJVT dictionary XML
-format. For the single-component units, this is pretty trivial, but
-for multiple-component units it is more involved, and includes two
-runs of the MWE extraction script
-[wani.py](https://gitea.cjvt.si/ozbolt/luscenje_struktur), generating
-missing structures in between. At the end, the single-component and
-multiple-component dictionary files are merged into one dictionary
-file. Example:
+specifications. The script first uses the MWE extraction script
+[wani.py](https://gitea.cjvt.si/ozbolt/luscenje_struktur) to find and
+assign all matches for collocation structures. For units without such
+matches, it then finds (creating, if necessary) and assigns
+single-component or other structures. Finally the TEI is converted to
+CJVT dictionary XML format. Example:
 
 ```
-$ python scripts/process.py -mode parse_to_dictionary -infile /tmp/parsed.xml -instructs /tmp/structures_old.xml -outfile /tmp/dictionary.xml -structures /tmp/structures_new.xml
+$ python scripts/process.py -mode parse_to_dictionary -infile /tmp/parsed.xml -instructs /tmp/structures_old.xml -outfile /tmp/dictionary.xml -outstructs /tmp/structures_new.xml
 ```
 
 ### strings_to_dictionary
@@ -97,6 +92,6 @@ Note that any new structures generated are given temporary ids
 (@tempId), because they only get real ids once they are approved and
 added to the DDD database. That is normally done via the django
 import_structures.py script in the [ddd_core
-repository](https://gitea.cjvt.si/ddd/ddd_core), as part of a batch of
-DDD updates. That script replaces the temporary ids in the structure
-specifications and updates the ids in the dictionary file.
+repository](https://gitea.cjvt.si/ddd/ddd_core), which replaces the
+temporary ids in the structure specifications and updates the ids in
+the dictionary file.