Redmine #2198: Updated parse_to_dictionary explanation
parent 9436a0ffd1
commit 6cf298855e
25
README.md
@@ -40,20 +40,15 @@ $ python scripts/process.py -mode strings_to_parse -infile /tmp/strings.txt -out
 
 The input should be a TEI XML file (in the same particular format as
 the output of strings_to_parse) and an xml file of structure
-specifications. It first splits the TEI file into two files, one with
-the single-component units and the other with the multiple-component
-units. For each, it then assigns each unit to a syntactic structure
-from the DDD database and converts the output into CJVT dictionary XML
-format. For the single-component units, this is pretty trivial, but
-for multiple-component units it is more involved, and includes two
-runs of the MWE extraction script
-[wani.py](https://gitea.cjvt.si/ozbolt/luscenje_struktur), generating
-missing structures in between. At the end, the single-component and
-multiple-component dictionary files are merged into one dictionary
-file. Example:
+specifications. The script first uses the MWE extraction script
+[wani.py](https://gitea.cjvt.si/ozbolt/luscenje_struktur) to find and
+assign all matches for collocation structures. For units without such
+matches, it then finds (creating, if necessary) and assigns
+single-component or other structures. Finally the TEI is converted to
+CJVT dictionary XML format. Example:
 
 ```
-$ python scripts/process.py -mode parse_to_dictionary -infile /tmp/parsed.xml -instructs /tmp/structures_old.xml -outfile /tmp/dictionary.xml -structures /tmp/structures_new.xml
+$ python scripts/process.py -mode parse_to_dictionary -infile /tmp/parsed.xml -instructs /tmp/structures_old.xml -outfile /tmp/dictionary.xml -outstructs /tmp/structures_new.xml
 ```
 
 ### strings_to_dictionary
@@ -97,6 +92,6 @@ Note that any new structures generated are given temporary ids
 (@tempId), because they only get real ids once they are approved and
 added to the DDD database. That is normally done via the django
 import_structures.py script in the [ddd_core
-repository](https://gitea.cjvt.si/ddd/ddd_core), as part of a batch of
-DDD updates. That script replaces the temporary ids in the structure
-specifications and updates the ids in the dictionary file.
+repository](https://gitea.cjvt.si/ddd/ddd_core), which replaces the
+temporary ids in the structure specifications and updates the ids in
+the dictionary file.
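
Note on the changed example: the README command now passes the output structure specifications with `-outstructs` where the old text showed `-structures`. For anyone calling the script from Python, the following is a minimal sketch that only assembles the argument list. The helper name `build_parse_to_dictionary_command` is hypothetical and not part of the repository; the flags and paths are copied from the updated README example, and nothing here runs `process.py` itself.

```python
import shlex


def build_parse_to_dictionary_command(infile, instructs, outfile, outstructs,
                                      script="scripts/process.py"):
    """Return the argv list for a parse_to_dictionary run (hypothetical helper)."""
    return [
        "python", script,
        "-mode", "parse_to_dictionary",
        "-infile", infile,          # parsed TEI XML, as produced by strings_to_parse
        "-instructs", instructs,    # existing structure specifications
        "-outfile", outfile,        # resulting CJVT dictionary XML
        "-outstructs", outstructs,  # the old README example used -structures here
    ]


command = build_parse_to_dictionary_command(
    "/tmp/parsed.xml",
    "/tmp/structures_old.xml",
    "/tmp/dictionary.xml",
    "/tmp/structures_new.xml",
)

# Print the command in a form that can be pasted into a shell.
print(shlex.join(command))
```

The list can be handed to `subprocess.run(command, check=True)` from the repository root, assuming the script and its dependencies are installed there.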