Instructions
Required files:
- a corpus in "ssj500k format"
- structure definitions
- Python 3.5+
Recommended: the pypy3 package for faster execution.
Example usage: python3 wani.py ssj500k.xml Kolokacije_strukture.xml izhod.csv
About
This script was developed to extract collocations from text in TEI format. Collocations are extracted and presented according to the rules provided in a structure file (see collocation-structures.xml for an example).
Setup
The script can be run with python3 or pypy3. We suggest using a virtual environment.
pip install -r requirements.txt
Running
python3 wani.py <LOCATION TO STRUCTURES> <EXTRACTION TEXT> --out <RESULTS FILE> --sloleks_db <PATH TO SLOLEKS DB>
python3 wani.py ../data/Kolokacije_strukture_JOS-32-representation_3D_08.xml ../data/ssj500k-sl.body.small.xml --out ../data/izhod.csv --sloleks_db luka:akul:superdb_small:127.0.0.1 --collocation_sentence_map_dest ../data/collocation_sentence_mapper --db /mnt/tmp/mysql-wani-ssj500k
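When the script is launched from another Python program, it can help to assemble the argument list in one place. The following is a minimal sketch, not part of the repository: the helper name build_wani_command and its keyword arguments are hypothetical, and it uses only the positional arguments and flags shown in this README.

```python
import shlex
from typing import List, Optional


def build_wani_command(
    structures: str,
    corpus: str,
    out: str,
    sloleks_db: Optional[str] = None,
    db: Optional[str] = None,
    sentence_map_dest: Optional[str] = None,
    multiple_output: bool = False,
    interpreter: str = "python3",
) -> List[str]:
    """Assemble a wani.py argument list (hypothetical helper).

    Positional order follows the README: structures file first, then corpus.
    """
    cmd = [interpreter, "wani.py", structures, corpus, "--out", out]
    if sloleks_db is not None:
        cmd += ["--sloleks_db", sloleks_db]
    if sentence_map_dest is not None:
        cmd += ["--collocation_sentence_map_dest", sentence_map_dest]
    if db is not None:
        cmd += ["--db", db]
    if multiple_output:
        cmd.append("--multiple-output")
    return cmd


if __name__ == "__main__":
    cmd = build_wani_command(
        "../data/Kolokacije_strukture_JOS-32-representation_3D_08.xml",
        "../data/ssj500k-sl.body.small.xml",
        "../data/izhod.csv",
        db="/mnt/tmp/mysql-wani-ssj500k",
    )
    # Print a shell-safe version of the command; to actually run it,
    # pass the list to subprocess.run(cmd, check=True).
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Passing interpreter="pypy3" switches the interpreter without touching the rest of the arguments.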
Most important optional parameters
--sloleks_db
Takes <PATH TO SLOLEKS DB> as its value. To use this option, sqlalchemy must be installed as well.
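The example command passes luka:akul:superdb_small:127.0.0.1 as the value of --sloleks_db, which looks like four colon-separated fields. The sketch below splits such a value and checks its shape before a run. The field order user:password:database:host is an assumption that this README does not confirm, and parse_sloleks_db is a hypothetical helper, not a function from wani.py.

```python
from collections import namedtuple

# ASSUMPTION: the field order is user:password:database:host.
# The README shows only an example value, not the meaning of each field.
SloleksDbSpec = namedtuple("SloleksDbSpec", ["user", "password", "database", "host"])


def parse_sloleks_db(spec):
    """Split a --sloleks_db value into its four colon-separated fields."""
    parts = spec.split(":")
    if len(parts) != 4:
        raise ValueError(
            "expected 4 colon-separated fields, got {}: {!r}".format(len(parts), spec)
        )
    return SloleksDbSpec(*parts)


if __name__ == "__main__":
    print(parse_sloleks_db("luka:akul:superdb_small:127.0.0.1"))
```

A check like this fails early on a malformed value instead of partway through a long extraction run.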
--collocation_sentence_map_dest
Example value: ../data/collocation_sentence_mapper
--db
Path to the file that will hold the sqlite database with internal states. It is used to save internal states in case the code gets modified.
We suggest putting this sqlite file in RAM for faster execution. To do so, follow these instructions:
sudo mkdir /mnt/tmp
sudo mount -t tmpfs tmpfs /mnt/tmp
When running on big corpora (e.g. Gigafida), keep the database in RAM:
sudo mkdir /mnt/tmp
sudo mount -t tmpfs tmpfs /mnt/tmp
sudo mount -o remount,size=110G,noexec,nosuid,nodev,noatime /mnt/tmp
Pass the path to a specific file when running wani.py. For example:
python3 wani.py ... --db /mnt/tmp/mysql-wani-ssj500k ...
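Because a tmpfs mount is backed by RAM, it is worth confirming that the target directory exists and has enough free space before a long run starts. The following is a minimal sketch using only the Python standard library. The function names and the required size are hypothetical, and the caller has to supply the estimate of how large the database will grow.

```python
import os
import shutil


def free_gib(path):
    """Return the free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 1024 ** 3


def check_db_location(db_path, required_gib):
    """Return True if the directory of `db_path` has at least `required_gib` free.

    Raises FileNotFoundError if the directory does not exist, for example
    when /mnt/tmp has not been created and mounted yet.
    """
    directory = os.path.dirname(os.path.abspath(db_path))
    if not os.path.isdir(directory):
        raise FileNotFoundError("directory does not exist: " + directory)
    return free_gib(directory) >= required_gib


if __name__ == "__main__":
    # Hypothetical requirement of 1 GiB for a small corpus.
    try:
        ok = check_db_location("/mnt/tmp/mysql-wani-ssj500k", required_gib=1)
        print("enough space" if ok else "not enough space")
    except FileNotFoundError as err:
        print(err)
```

os.path.ismount("/mnt/tmp") can additionally confirm that the directory is a mount point rather than a plain directory on disk.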
--multiple-output
Use this when you want multiple output files (one file per structure_id).
Instructions for running on big files (e.g. Gigafida)
We suggest running with the saved mysql file in tmpfs. Instructions:
sudo mkdir /mnt/tmp
sudo mount -t tmpfs tmpfs /mnt/tmp
When running on big corpora (e.g. Gigafida), keep the database in RAM:
sudo mkdir /mnt/tmp
sudo mount -t tmpfs tmpfs /mnt/tmp
sudo mount -o remount,size=110G,noexec,nosuid,nodev,noatime /mnt/tmp