Redmine #2198: Updated parse_to_dictionary explanation
parent 9436a0ffd1
commit 6cf298855e
README.md | 25
@@ -40,20 +40,15 @@ $ python scripts/process.py -mode strings_to_parse -infile /tmp/strings.txt -out
 The input should be a TEI XML file (in the same particular format as
 the output of strings_to_parse) and an xml file of structure
-specifications. It first splits the TEI file into two files, one with
-the single-component units and the other with the multiple-component
-units. For each, it then assigns each unit to a syntactic structure
-from the DDD database and converts the output into CJVT dictionary XML
-format. For the single-component units, this is pretty trivial, but
-for multiple-component units it is more involved, and includes two
-runs of the MWE extraction script
-[wani.py](https://gitea.cjvt.si/ozbolt/luscenje_struktur), generating
-missing structures in between. At the end, the single-component and
-multiple-component dictionary files are merged into one dictionary
-file. Example:
+specifications. The script first uses the MWE extraction script
+[wani.py](https://gitea.cjvt.si/ozbolt/luscenje_struktur) to find and
+assign all matches for collocation structures. For units without such
+matches, it then finds (creating, if necessary) and assigns
+single-component or other structures. Finally the TEI is converted to
+CJVT dictionary XML format. Example:
 
 ```
-$ python scripts/process.py -mode parse_to_dictionary -infile /tmp/parsed.xml -instructs /tmp/structures_old.xml -outfile /tmp/dictionary.xml -structures /tmp/structures_new.xml
+$ python scripts/process.py -mode parse_to_dictionary -infile /tmp/parsed.xml -instructs /tmp/structures_old.xml -outfile /tmp/dictionary.xml -outstructs /tmp/structures_new.xml
 ```
 
 ### strings_to_dictionary
 
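Review note: for anyone scripting this step, the updated invocation can be assembled programmatically. The following is a minimal sketch that assumes only the flags visible in the new example line (`-mode`, `-infile`, `-instructs`, `-outfile`, `-outstructs`). The helper name `build_parse_to_dictionary_cmd` is hypothetical and is not part of the repository.

```python
import shlex


def build_parse_to_dictionary_cmd(infile, instructs, outfile, outstructs,
                                  script="scripts/process.py"):
    """Return the argv list for a parse_to_dictionary run.

    Hypothetical helper: it only mirrors the flags shown in the README
    example and makes no claim about other options process.py may accept.
    """
    return [
        "python", script,
        "-mode", "parse_to_dictionary",
        "-infile", infile,          # TEI XML, as produced by strings_to_parse
        "-instructs", instructs,    # input structure specifications
        "-outfile", outfile,        # CJVT dictionary XML to write
        "-outstructs", outstructs,  # structure specifications to write
    ]


if __name__ == "__main__":
    cmd = build_parse_to_dictionary_cmd(
        "/tmp/parsed.xml",
        "/tmp/structures_old.xml",
        "/tmp/dictionary.xml",
        "/tmp/structures_new.xml",
    )
    # Print the shell form; pass `cmd` to subprocess.run(cmd, check=True)
    # from the repository root to actually execute it.
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell quoting problems if a path contains spaces.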
@@ -97,6 +92,6 @@ Note that any new structures generated are given temporary ids
 (@tempId), because they only get real ids once they are approved and
 added to the DDD database. That is normally done via the django
 import_structures.py script in the [ddd_core
-repository](https://gitea.cjvt.si/ddd/ddd_core), as part of a batch of
-DDD updates. That script replaces the temporary ids in the structure
-specifications and updates the ids in the dictionary file.
+repository](https://gitea.cjvt.si/ddd/ddd_core), which replaces the
+temporary ids in the structure specifications and updates the ids in
+the dictionary file.
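Review note: the id replacement described in this hunk can be pictured with a rough sketch. This is not the import_structures.py implementation. The element names, the `id` attribute, and the `temp_to_real` mapping are assumptions made purely for illustration; only the `tempId` attribute name comes from the README text.

```python
import xml.etree.ElementTree as ET


def replace_temp_ids(root, temp_to_real):
    """Swap tempId attributes for real ids, in place.

    Assumption: approved structures carry their real id in an `id`
    attribute. Elements whose tempId is missing from the mapping are
    left untouched. Returns the number of elements updated.
    """
    replaced = 0
    for element in root.iter():
        temp_id = element.attrib.get("tempId")
        if temp_id is not None and temp_id in temp_to_real:
            del element.attrib["tempId"]
            element.set("id", temp_to_real[temp_id])
            replaced += 1
    return replaced


if __name__ == "__main__":
    # Hypothetical structure specification fragment.
    root = ET.fromstring(
        '<structures>'
        '<structure tempId="t1"/>'
        '<structure id="7"/>'
        '</structures>'
    )
    count = replace_temp_ids(root, {"t1": "125"})
    print(count, ET.tostring(root, encoding="unicode"))
```

The same mapping would then be applied to the dictionary file so that both files refer to the approved ids.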