Redmine #2198: Updated parse_to_dictionary explanation

Cyprian Laskowski 2021-12-07 13:22:30 +01:00
parent 9436a0ffd1
commit 6cf298855e

@@ -40,20 +40,15 @@ $ python scripts/process.py -mode strings_to_parse -infile /tmp/strings.txt -out
 The input should be a TEI XML file (in the same particular format as
 the output of strings_to_parse) and an xml file of structure
-specifications. It first splits the TEI file into two files, one with
-the single-component units and the other with the multiple-component
-units. For each, it then assigns each unit to a syntactic structure
-from the DDD database and converts the output into CJVT dictionary XML
-format. For the single-component units, this is pretty trivial, but
-for multiple-component units it is more involved, and includes two
-runs of the MWE extraction script
-[wani.py](https://gitea.cjvt.si/ozbolt/luscenje_struktur), generating
-missing structures in between. At the end, the single-component and
-multiple-component dictionary files are merged into one dictionary
-file. Example:
+specifications. The script first uses the MWE extraction script
+[wani.py](https://gitea.cjvt.si/ozbolt/luscenje_struktur) to find and
+assign all matches for collocation structures. For units without such
+matches, it then finds (creating, if necessary) and assigns
+single-component or other structures. Finally the TEI is converted to
+CJVT dictionary XML format. Example:
 ```
-$ python scripts/process.py -mode parse_to_dictionary -infile /tmp/parsed.xml -instructs /tmp/structures_old.xml -outfile /tmp/dictionary.xml -structures /tmp/structures_new.xml
+$ python scripts/process.py -mode parse_to_dictionary -infile /tmp/parsed.xml -instructs /tmp/structures_old.xml -outfile /tmp/dictionary.xml -outstructs /tmp/structures_new.xml
 ```
 ### strings_to_dictionary
@@ -97,6 +92,6 @@ Note that any new structures generated are given temporary ids
 (@tempId), because they only get real ids once they are approved and
 added to the DDD database. That is normally done via the django
 import_structures.py script in the [ddd_core
-repository](https://gitea.cjvt.si/ddd/ddd_core), as part of a batch of
-DDD updates. That script replaces the temporary ids in the structure
-specifications and updates the ids in the dictionary file.
+repository](https://gitea.cjvt.si/ddd/ddd_core), which replaces the
+temporary ids in the structure specifications and updates the ids in
+the dictionary file.
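
The parse_to_dictionary example in this diff appears with both `-structures` and `-outstructs` for the new structure specifications file. The sketch below shows one way to assemble that call from Python. It assumes `-outstructs` is the current flag name (inferred from this commit, not checked against the script's own help output). The helper names `build_parse_to_dictionary_command` and `run_parse_to_dictionary` are hypothetical and not part of the repository.

```python
import shlex
import subprocess
from typing import List


def build_parse_to_dictionary_command(
    infile: str,
    instructs: str,
    outfile: str,
    outstructs: str,
    script: str = "scripts/process.py",
) -> List[str]:
    """Build the argument list for the parse_to_dictionary mode.

    The flag names are copied from the README example; -outstructs is
    assumed to be the current name of the former -structures flag.
    """
    return [
        "python", script,
        "-mode", "parse_to_dictionary",
        "-infile", infile,          # parsed TEI XML input
        "-instructs", instructs,    # existing structure specifications
        "-outfile", outfile,        # CJVT dictionary XML output
        "-outstructs", outstructs,  # structure specifications incl. new ones
    ]


def run_parse_to_dictionary(command: List[str]) -> None:
    """Run the command and raise CalledProcessError on a non-zero exit."""
    subprocess.run(command, check=True)


if __name__ == "__main__":
    command = build_parse_to_dictionary_command(
        "/tmp/parsed.xml",
        "/tmp/structures_old.xml",
        "/tmp/dictionary.xml",
        "/tmp/structures_new.xml",
    )
    # Only print the command here; call run_parse_to_dictionary(command)
    # from within a checkout of the repository to execute it.
    print(shlex.join(command))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when file paths contain spaces.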