As per redmine#687

Luka 39692e839f Extended recalculate statistics to filtered output		2021-02-16 17:01:02 +01:00
issue992	Scripts adapted to changes of new structures.xml format	2020-10-14 14:50:35 +02:00
issue1000	Issue #1000	2020-03-02 19:13:19 +01:00
luscenje_struktur	White reset at paragraphs not sentences + progress bar updates on paragraphs not sentences.	2021-01-26 14:57:42 +01:00
scripts	Extended recalculate statistics to filtered output	2021-02-16 17:01:02 +01:00
.gitignore	Modified readme.md + Removed obligatory sloleks_db + Added frequency_limit and sorted parameters in recalculate_statistics.py	2020-09-02 10:53:45 +02:00
collocation-structures.xml	Modified readme.md + Removed obligatory sloleks_db + Added frequency_limit and sorted parameters in recalculate_statistics.py	2020-09-02 10:53:45 +02:00
README.md	Added some functions for compatibility with valency, fixed readme and fixed some minor bugs.	2020-09-10 15:06:09 +02:00
run.sh.example	Moved wani.py + Added ignore of .zstd files for valency	2020-10-01 16:20:52 +02:00
setup.py	Renaming src to luscenje struktur	2020-09-17 14:02:56 +02:00
slim_sskj.py	Removing prints from slimmer	2019-06-08 10:20:53 +02:00
wani.py	Ignoring @type=single and added option for --new-tei	2021-01-13 16:36:44 +01:00

README.md

Instructions

Required files:

a corpus in "ssj500k format"
structure definitions
Python 3.5+

Recommended: the pypy3 package for faster execution.

Example usage: python3 wani.py ssj500k.xml Kolokacije_strukture.xml izhod.csv

About

This script was developed to extract collocations from text in TEI format. Collocations are extracted and presented according to the rules provided in a structure file (see the example in collocation-structures.xml).

Setup

The script may be run with python3 or pypy3. We suggest using a virtual environment.

pip install -r requirements.txt

Running

python3 wani.py <LOCATION TO STRUCTURES> <EXTRACTION TEXT> --out <RESULTS FILE>

Most important optional parameters

--sloleks_db

This parameter may be used if you have access to sloleks_db. It is useful when lemma_fallback would otherwise appear in the results file: with sloleks_db available, the script looks in this database to find the correct replacement.

To use this parameter, sqlalchemy must be installed as well.

The parameter value must contain the database information in the following order:

<DB_USERNAME>:<DB_PASSWORD>:<DB_NAME>:<DB_URL>
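
As a minimal sketch of how a value in this format could be taken apart, the helper below splits it into its four fields. The names SloleksDbConfig and parse_sloleks_db are hypothetical, and this is not necessarily how wani.py parses the value. It also assumes that only the last field (the URL) may contain colons, for example host:port.

```python
from collections import namedtuple

# Hypothetical container for the four fields of --sloleks_db.
SloleksDbConfig = namedtuple(
    "SloleksDbConfig", ["username", "password", "name", "url"]
)


def parse_sloleks_db(value):
    """Split '<DB_USERNAME>:<DB_PASSWORD>:<DB_NAME>:<DB_URL>' into fields."""
    # Split on the first three colons only, so that a URL such as
    # "localhost:5432" stays intact in the last field (assumption).
    parts = value.split(":", 3)
    if len(parts) != 4 or not all(parts):
        raise ValueError(
            "expected <DB_USERNAME>:<DB_PASSWORD>:<DB_NAME>:<DB_URL>"
        )
    return SloleksDbConfig(*parts)
```

With this sketch, parse_sloleks_db("user:secret:sloleks:localhost:5432") yields a url field of "localhost:5432". A password that itself contains a colon would not survive this split, which is a limitation of the colon-separated format.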

--collocation_sentence_map_dest

If a value is given for this parameter (it should be a string path to a directory), files are generated that link collocation ids to sentence ids.

--db

This is the path to the file that will contain the sqlite database with internal states. It is used to save internal states in case the code gets modified.

We suggest placing this sqlite file in RAM for faster execution. To do so, follow these instructions:

sudo mkdir /mnt/tmp
sudo mount -t tmpfs tmpfs /mnt/tmp

If running on big corpora (e.g. Gigafida) with the database in RAM:

sudo mkdir /mnt/tmp
sudo mount -t tmpfs tmpfs /mnt/tmp
sudo mount -o remount,size=110G,noexec,nosuid,nodev,noatime /mnt/tmp

Pass the path to a specific file when running wani.py. For example:

python3 wani.py ... --db /mnt/tmp/mysql-wani-ssj500k ...
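
Before starting a long run, it may help to confirm that the tmpfs mount exists and has room for the database. The following is a minimal sketch using only the Python standard library. The function name has_room_for_db is hypothetical and not part of wani.py, and the space actually required depends on the corpus.

```python
import os
import shutil


def has_room_for_db(db_path, required_bytes):
    """Return True if the directory that will hold db_path exists and
    has at least required_bytes of free space."""
    directory = os.path.dirname(os.path.abspath(db_path))
    if not os.path.isdir(directory):
        # For example, /mnt/tmp has not been created or mounted yet.
        return False
    return shutil.disk_usage(directory).free >= required_bytes
```

For example, has_room_for_db("/mnt/tmp/mysql-wani-ssj500k", 1024 ** 3) checks for 1 GiB of free space; that threshold is an arbitrary illustrative value, not a measured requirement.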

--multiple-output

Used when we want multiple output files (one file per structure_id).
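
To illustrate the idea of one file per structure_id, the sketch below groups result rows by that field and writes each group to its own CSV file. It is an illustration only: the row layout (dicts with a "structure_id" key), the function name write_per_structure, and the file naming are assumptions, not wani.py's actual output scheme.

```python
import csv
import os
from collections import defaultdict


def write_per_structure(rows, out_dir):
    """Write one CSV file per structure_id and return the created paths.

    rows: iterable of dicts that each carry a "structure_id" key
    (hypothetical layout, chosen for illustration only).
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row["structure_id"]].append(row)

    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for structure_id in sorted(groups):
        group = groups[structure_id]
        # File naming is an assumption, not wani.py's actual scheme.
        path = os.path.join(out_dir, "{}.csv".format(structure_id))
        with open(path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=list(group[0].keys()))
            writer.writeheader()
            writer.writerows(group)
        paths.append(path)
    return paths
```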

Instructions for running on big files (e.g. Gigafida)

We suggest running with the database file saved in tmpfs. Instructions:

sudo mkdir /mnt/tmp
sudo mount -t tmpfs tmpfs /mnt/tmp

If running on big corpora (e.g. Gigafida) with the database in RAM:

sudo mkdir /mnt/tmp
sudo mount -t tmpfs tmpfs /mnt/tmp
sudo mount -o remount,size=110G,noexec,nosuid,nodev,noatime /mnt/tmp