Redmine #2198 : Limited wani to "collocation" type structures

Extended recalculate statistics to filtered output
White reset at paragraphs not sentences + progress bar updates on paragraphs not sentences.
2021-12-03 15:23:29 +01:00 · 2021-02-16 17:01:02 +01:00 · 2021-01-26 14:57:42 +01:00 · 2021-01-23 09:28:10 +01:00 · 2021-01-13 16:36:44 +01:00 · 2020-11-26 09:45:22 +01:00
35 changed files with 5381 additions and 393 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -1,4 +1,5 @@
 *.xml
 !collocation-structures.xml
 *.tbl
 *.csv
 *.pdf
--- a/README.md
+++ b/README.md
@@ -10,7 +10,66 @@ Priporocam: pypy3 paket za hitrejse poganjanje.
 Example usage: `python3 wani.py ssj500k.xml Kolokacije_strukture.xml  izhod.csv`
-## Instructions for running on GF
+# About
 This script extracts collocations from text in TEI format. Collocations are extracted and presented according to the rules in a structure file (see the example in `collocation-structures.xml`).
 # Setup
 The script can be run with python3 or pypy3. We suggest using a virtual environment.
 ```bash
 pip install -r requirements.txt
 ```
 # Running
 ```bash
 python3 wani.py <LOCATION TO STRUCTURES> <EXTRACTION TEXT> --out <RESULTS FILE>
 ```
 ## Most important optional parameters
 ### --sloleks_db
 This parameter may be used if you have access to sloleks_db. It is useful when lemma_fallback would otherwise be shown in the results file: with sloleks_db available, the script looks into this database to find the correct replacement.
 To use it, sqlalchemy has to be installed as well.
 The parameter value has to contain the database information in the following order:
 <DB_USERNAME>:<DB_PASSWORD>:<DB_NAME>:<DB_URL>
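As an illustration of the colon-separated format above, here is a minimal sketch of how such a value could be split into its four fields. The names `SloleksDbConfig` and `parse_sloleks_db` are hypothetical and are not taken from `wani.py`; the script itself may parse the value differently.

```python
from typing import NamedTuple


class SloleksDbConfig(NamedTuple):
    """Hypothetical container for the four fields of --sloleks_db."""
    username: str
    password: str
    name: str
    url: str


def parse_sloleks_db(value: str) -> SloleksDbConfig:
    """Split '<DB_USERNAME>:<DB_PASSWORD>:<DB_NAME>:<DB_URL>' into its parts."""
    # maxsplit=3 keeps any further colons (e.g. a port) inside the URL part.
    parts = value.split(":", 3)
    if len(parts) != 4:
        raise ValueError(
            "expected <DB_USERNAME>:<DB_PASSWORD>:<DB_NAME>:<DB_URL>")
    return SloleksDbConfig(*parts)
```

In this sketch a password that itself contains a colon would be split incorrectly, so such values would need a different separator or escaping.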
 ### --collocation_sentence_map_dest
 If a value for this parameter is given (it should be a string path to a directory), files are generated that contain links between collocation ids and sentence ids.
 ### --db
 This is the path to the file that will contain the sqlite database with internal states. It is used to save internal states in case the code gets modified.
 We suggest putting this sqlite file in RAM for faster execution. To do this, follow these instructions:
 ```bash
 sudo mkdir /mnt/tmp
 sudo mount -t tmpfs tmpfs /mnt/tmp
 ```
 If running on big corpora (e.g. Gigafida), keep the database in RAM:
 ```bash
 sudo mkdir /mnt/tmp
 sudo mount -t tmpfs tmpfs /mnt/tmp
 sudo mount -o remount,size=110G,noexec,nosuid,nodev,noatime /mnt/tmp
 ```
 Pass the path to a specific file when running `wani.py`. For example:
 ```bash
 python3 wani.py ... --db /mnt/tmp/mysql-wani-ssj500k ...
 ```
 ### --multiple-output
 Used when we want multiple output files (one file per structure_id).
 ## Instructions for running on big files (e.g. Gigafida)
 We suggest running with the saved mysql file in tmpfs. Instructions:
@@ -21,6 +80,7 @@ sudo mount -t tmpfs tmpfs /mnt/tmp
 If running on big corpora (e.g. Gigafida), keep the database in RAM:
 ```bash
 sudo mkdir /mnt/tmp
 sudo mount -t tmpfs tmpfs /mnt/tmp
 sudo mount -o remount,size=110G,noexec,nosuid,nodev,noatime /mnt/tmp
 ```
--- a/collocation-structures.xml
+++ b/collocation-structures.xml
--- a/issue992/extract.py
+++ b/issue992/extract.py
@@ -1,37 +1,48 @@
 import argparse
 import os
 import sys
 import tqdm
 good_lemmas = ["absurd", "absurdnost", "akuten", "akutno", "alkohol", "alkoholen", "aluminijast", "ananas", "aplikacija", "aplikativen", "aranžma", "arbiter", "armada", "avtomatičen", "avtomatiziran", "babica", "bajen", "bajka", "bakren", "bambusov", "barvan", "barvanje", "baseballski", "bazar", "bazičen", "belina", "bezgov", "bičati", "bife", "bilka", "biomasa", "biotop", "birma", "bivol", "blago", "blaženost", "bliskavica", "bobnič", "bolha", "bolnišnica", "bor", "borov", "borovničev", "brati", "briljant", "briti", "brusiti", "bučanje", "cikličen", "civilizacija", "dopust", "drama", "drezati", "duda", "dvorezen", "embalaža", "faks", "farsa", "glasno", "informiranje", "interier", "intima", "intimno", "investirati", "ironično", "istovetiti", "izvožen", "jagoda", "jeklar", "jezik", "karbon", "kitara", "kodrast", "molče", "mučiti", "novinarski", "obala", "občevati", "okrasiti", "pajčevina", "panoga", "prevajanje", "prevajati", "previti", "prihraniti", "priloga", "prisluškovati", "sopara"]
-N1 = len(good_lemmas)
+def main(args):
-N2 = len(sys.argv) - 1
+    filepaths = [os.path.join(args.input, fn) for fn in os.listdir(args.input)]
    filepaths = sorted(filepaths, key=lambda x: int(x.split('.')[-1]))
    N1 = len(good_lemmas)
    N2 = len(filepaths) - 1
-files_to_write = [open("polona/{}".format(l), 'w') for l in good_lemmas]
+    files_to_write = [open("output/{}".format(l), 'w') for l in good_lemmas]
-for fidx, filename in enumerate(sys.argv[1:]):
+    for fidx, filename in enumerate(filepaths):
-    with open(filename, 'r') as fp:
+        with open(filename, 'r') as fp:
-        print("loading next...", end="", flush=True)
+            print("loading next...", end="", flush=True)
-        line = fp.readline()
+            line = fp.readline()
-        lemma_rows = [idx for idx, cell in enumerate(line.split(",")) if "_Lemma" in cell]
+            lemma_rows = [idx for idx, cell in enumerate(line.split(",")) if "_Lemma" in cell]
-        file_lines = fp.read().split("\n")
+            file_lines = fp.read().split("\n")
-    for lidx, good_lemma in enumerate(good_lemmas):
+        for lidx, good_lemma in enumerate(good_lemmas):
-        spaces = " " * 20 if lidx == 0 else ""
+            spaces = " " * 20 if lidx == 0 else ""
-        print("\r{}.{} / {}.{}{}".format(fidx, lidx, N2, N1, spaces), end="", flush=True)
+            print("\r{}.{} / {}.{}{}".format(fidx, lidx, N2, N1, spaces), end="", flush=True)
-        for line in file_lines:
+            for line in file_lines:
-            if good_lemma not in line:
+                if good_lemma not in line:
-                continue
+                    continue
-            line_split = line.split(',')
+                line_split = line.split(',')
-            for lemma_idx in lemma_rows:
+                for lemma_idx in lemma_rows:
-                lemma = line_split[lemma_idx]
+                    lemma = line_split[lemma_idx]
-                if lemma == good_lemma:
+                    if lemma == good_lemma:
-                    print(line, file=files_to_write[lidx])
+                        print(line, file=files_to_write[lidx])
-                    break
+                        break
 for fp in files_to_write:
    fp.close()
    for fp in files_to_write:
        fp.close()
 if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description='Extract structures from a parsed corpus.')
    parser.add_argument('input',
                        help='Path to folder with files')
    args = parser.parse_args()
    main(args)
--- a/issue992/files
+++ b/issue992/files
@@ -1,81 +0,0 @@
 ../data/gf2filesres/izhod.csv.100
 ../data/gf2filesres/izhod.csv.101
 ../data/gf2filesres/izhod.csv.102
 ../data/gf2filesres/izhod.csv.103
 ../data/gf2filesres/izhod.csv.104
 ../data/gf2filesres/izhod.csv.105
 ../data/gf2filesres/izhod.csv.106
 ../data/gf2filesres/izhod.csv.107
 ../data/gf2filesres/izhod.csv.108
 ../data/gf2filesres/izhod.csv.12
 ../data/gf2filesres/izhod.csv.13
 ../data/gf2filesres/izhod.csv.14
 ../data/gf2filesres/izhod.csv.15
 ../data/gf2filesres/izhod.csv.16
 ../data/gf2filesres/izhod.csv.17
 ../data/gf2filesres/izhod.csv.18
 ../data/gf2filesres/izhod.csv.19
 ../data/gf2filesres/izhod.csv.22
 ../data/gf2filesres/izhod.csv.23
 ../data/gf2filesres/izhod.csv.24
 ../data/gf2filesres/izhod.csv.25
 ../data/gf2filesres/izhod.csv.26
 ../data/gf2filesres/izhod.csv.27
 ../data/gf2filesres/izhod.csv.28
 ../data/gf2filesres/izhod.csv.29
 ../data/gf2filesres/izhod.csv.30
 ../data/gf2filesres/izhod.csv.31
 ../data/gf2filesres/izhod.csv.32
 ../data/gf2filesres/izhod.csv.34
 ../data/gf2filesres/izhod.csv.35
 ../data/gf2filesres/izhod.csv.36
 ../data/gf2filesres/izhod.csv.37
 ../data/gf2filesres/izhod.csv.38
 ../data/gf2filesres/izhod.csv.39
 ../data/gf2filesres/izhod.csv.40
 ../data/gf2filesres/izhod.csv.41
 ../data/gf2filesres/izhod.csv.42
 ../data/gf2filesres/izhod.csv.43
 ../data/gf2filesres/izhod.csv.44
 ../data/gf2filesres/izhod.csv.45
 ../data/gf2filesres/izhod.csv.46
 ../data/gf2filesres/izhod.csv.47
 ../data/gf2filesres/izhod.csv.48
 ../data/gf2filesres/izhod.csv.49
 ../data/gf2filesres/izhod.csv.50
 ../data/gf2filesres/izhod.csv.51
 ../data/gf2filesres/izhod.csv.52
 ../data/gf2filesres/izhod.csv.53
 ../data/gf2filesres/izhod.csv.54
 ../data/gf2filesres/izhod.csv.55
 ../data/gf2filesres/izhod.csv.57
 ../data/gf2filesres/izhod.csv.68
 ../data/gf2filesres/izhod.csv.69
 ../data/gf2filesres/izhod.csv.70
 ../data/gf2filesres/izhod.csv.71
 ../data/gf2filesres/izhod.csv.72
 ../data/gf2filesres/izhod.csv.73
 ../data/gf2filesres/izhod.csv.74
 ../data/gf2filesres/izhod.csv.75
 ../data/gf2filesres/izhod.csv.76
 ../data/gf2filesres/izhod.csv.77
 ../data/gf2filesres/izhod.csv.78
 ../data/gf2filesres/izhod.csv.80
 ../data/gf2filesres/izhod.csv.81
 ../data/gf2filesres/izhod.csv.82
 ../data/gf2filesres/izhod.csv.83
 ../data/gf2filesres/izhod.csv.84
 ../data/gf2filesres/izhod.csv.85
 ../data/gf2filesres/izhod.csv.86
 ../data/gf2filesres/izhod.csv.87
 ../data/gf2filesres/izhod.csv.88
 ../data/gf2filesres/izhod.csv.89
 ../data/gf2filesres/izhod.csv.90
 ../data/gf2filesres/izhod.csv.91
 ../data/gf2filesres/izhod.csv.92
 ../data/gf2filesres/izhod.csv.93
 ../data/gf2filesres/izhod.csv.94
 ../data/gf2filesres/izhod.csv.95
 ../data/gf2filesres/izhod.csv.96
 ../data/gf2filesres/izhod.csv.97
 ../data/gf2filesres/izhod.csv.98
--- a/luscenje_struktur/init.py
+++ b/luscenje_struktur/init.py
--- a/luscenje_struktur/codes_tagset.py
+++ b/luscenje_struktur/codes_tagset.py
@@ -133,6 +133,7 @@ CODES = {
    "Interjection": "I",
    "Abbreviation": "Y",
    "Residual": "X",
    "Punctuation": "Z",
    'common': 'c',
    'proper': 'p',
--- a/luscenje_struktur/collocation_sentence_mapper.py
+++ b/luscenje_struktur/collocation_sentence_mapper.py
--- a/luscenje_struktur/component.py
+++ b/luscenje_struktur/component.py
@@ -1,9 +1,10 @@
 from enum import Enum
 import logging
-from restriction import Restriction
+# from luscenje_struktur.restriction import Restriction
-from order import Order
+from luscenje_struktur.order import Order
-from representation_assigner import RepresentationAssigner
+from luscenje_struktur.representation_assigner import RepresentationAssigner
 from luscenje_struktur.restriction_group import RestrictionGroup
 class ComponentStatus(Enum):
@@ -21,7 +22,7 @@ class ComponentType(Enum):
 class Component:
    def __init__(self, info):
        idx = info['cid']
-        name = info['name'] if 'name' in info else None
+        name = info['label'] if 'label' in info else None
        typ = ComponentType.Core if info['type'] == "core" else ComponentType.Other
        if 'status' not in info:
@@ -38,7 +39,7 @@ class Component:
        self.status = status
        self.name = name
        self.idx = idx
-        self.restrictions = []
+        self.restrictions = RestrictionGroup([None]) if 'restriction' in info else []
        self.next_element = []
        self.representation = []
        self.selection = {}
@@ -49,15 +50,17 @@ class Component:
    def add_next(self, next_component, link_label, order):
        self.next_element.append((next_component, link_label, Order.new(order)))
-    def set_restriction(self, restrictions_tag):
+    def set_restriction(self, restrictions_tags):
-        if restrictions_tag is None:
+        if not restrictions_tags:
-            self.restrictions = [Restriction(None)]
+            self.restrictions = RestrictionGroup([None])
-        elif restrictions_tag.tag == "restriction":
+        # if first element is of type restriction all following are as well
-            self.restrictions = [Restriction(restrictions_tag)]
+        elif restrictions_tags[0].tag == "restriction":
            self.restrictions = RestrictionGroup(restrictions_tags)
-        elif restrictions_tag.tag == "restriction_or":
+        # combinations of 'and' and 'or' restrictions are currently not implemented
-            self.restrictions = [Restriction(el) for el in restrictions_tag]
+        elif restrictions_tags[0].tag == "restriction_or":
            self.restrictions = RestrictionGroup(restrictions_tags[0], group_type='or')
        else:
            raise RuntimeError("Unreachable")
@@ -104,37 +107,28 @@ class Component:
            if len(cmatch) == 0:
                continue
-            # if more than one match found for particular component
+            # create new to_ret, to which extend all results
-            elif len(cmatch) > 1:
+            new_to_ret = []
-                # if more than one match in multiple components, NOPE!
+            for tr in to_ret:
-                if len(to_ret) > 1:
+                # make sure that one word is not used twice in same to_ret
-                    logging.warning("Strange multiple match: {}".format(
+                new_to_ret.extend([{**dict(tr), **m} for m in cmatch if all([m_v not in dict(tr).values() for m_v in m.values()])])
-                        str([w.id for w in cmatch[0].values()])))
+            if len(new_to_ret) == 0:
-
+                return None
-                    for tr in to_ret:
+            to_ret = new_to_ret
-                        tr.update(cmatch[0])
+            del new_to_ret
                    continue
                # yeah, so we have found more than one match, =>
                # more than one element in to_ret
                to_ret = [{**dict(to_ret[0]), **m} for m in cmatch]
            else:
                for tr in to_ret:
                    tr.update(cmatch[0])
        return to_ret
    def _match_self(self, word):
        # matching
-        for restr in self.restrictions:
+        if self.restrictions.match(word):
-            if restr.match(word): # match either
+            return {self.idx: word}
                return {self.idx: word}
    def _match_next(self, word):
        # matches for every component in links from this component
        to_ret = []
        # need to get all links that match
        for next, link, order in self.next_element:
            next_links = word.get_links(link)
--- a/luscenje_struktur/database.py
+++ b/luscenje_struktur/database.py
--- a/luscenje_struktur/formatter.py
+++ b/luscenje_struktur/formatter.py
@@ -1,7 +1,7 @@
 from math import log2
 import re
-from component import ComponentType
+from luscenje_struktur.component import ComponentType
 class Formatter:
@@ -82,7 +82,7 @@ class AllFormatter(Formatter):
        word = words[idx]
        return [word.id, word.text, word.lemma, word.msd]
-    def content_right(self, _freq):
+    def content_right(self, _freq, variable_word_order=None):
        return []
    def group(self):
--- a/luscenje_struktur/lemma_features.py
+++ b/luscenje_struktur/lemma_features.py
@@ -1,4 +1,4 @@
-from restriction import MorphologyRegex
+from luscenje_struktur.restriction import MorphologyRegex
 def get_lemma_features(et):
@@ -8,7 +8,7 @@ def get_lemma_features(et):
    result = {}
    for pos in lf.iter('POS'):
-        rgx_list = MorphologyRegex(pos).rgx
+        rgx_list = MorphologyRegex(pos).rgxs[0]
        rgx_str = ""
        for position in rgx_list:
            if position == ".":
--- a/luscenje_struktur/loader.py
+++ b/luscenje_struktur/loader.py
@@ -6,8 +6,8 @@ import sys
 import gzip
 import pathlib
-from progress_bar import progress
+from luscenje_struktur.progress_bar import progress
-from word import Word
+from luscenje_struktur.word import Word
 def is_root_id(id_):
@@ -22,6 +22,10 @@ def load_files(args, database, w_collection=None, input_corpus=None):
    if len(filenames) == 1 and os.path.isdir(filenames[0]):
        filenames = [os.path.join(filenames[0], file) for file in os.listdir(filenames[0]) if file[-5:] != '.zstd']
    if len(filenames) > 1:
        filenames = [filename for filename in filenames if filename[-5:] != '.zstd']
        filenames = sorted(filenames, key=lambda x: int(x.split('.')[-1]))
    database.init("CREATE TABLE Files ( filename varchar(2048) )")
    for idx, fname in enumerate(filenames):
@@ -37,7 +41,7 @@ def load_files(args, database, w_collection=None, input_corpus=None):
        if extension == ".xml":
            et = load_xml(fname)
            if input_corpus is None:
-                yield file_sentence_generator(et, skip_id_check, do_msd_translate, args.pc_tag)
+                yield file_sentence_generator(et, args)
            else:
                sentence_generator = file_sentence_generator_valency(et, skip_id_check, do_msd_translate, args.pc_tag, w_collection)
                for sent_id, sentence, othr_attributes in sentence_generator:
@@ -98,6 +102,8 @@ def load_csv(filename, compressed):
        line_split = line_fixed.split("\t")
        if line_split[1] == "1" and len(words) > 0:
            # adding fake word
            words['0'] = Word('', '', '0', '', False, True)
            sentence_end(bad_sentence)
            bad_sentence = False
            links = []
@@ -110,9 +116,11 @@ def load_csv(filename, compressed):
        full_id = "{}.{}".format(sid, wid)
        words[wid] = Word(lemma, msd, full_id, text, True)
-        if link_src != '0':
+        # if link_src != '0':
-            links.append((link_src, wid, link_type))
+        links.append((link_src, wid, link_type))
    # adding fake word
    words['0'] = Word('', '', '0', '', False, True)
    sentence_end(bad_sentence)
    return result
@@ -181,42 +189,81 @@ def load_xml(filename):
    return ElementTree.XML(xmlstring)
-def file_sentence_generator(et, skip_id_check, do_msd_translate, pc_tag):
+def file_sentence_generator(et, args):
    skip_id_check = args.skip_id_check
    do_msd_translate = not args.no_msd_translate
    pc_tag = args.pc_tag
    use_punctuations = not args.ignore_punctuations
    previous_pc = False
    words = {}
-    sentences = list(et.iter('s'))
+    paragraphs = list(et.iter('p'))
-    for sentence in progress(sentences, "load-text"):
+    for paragraph in progress(paragraphs, "load-text"):
-        for w in sentence.iter("w"):
+        previous_glue = ''
-            words[w.get('id')] = Word.from_xml(w, do_msd_translate)
+        sentences = list(paragraph.iter('s'))
-        for pc in sentence.iter(pc_tag):
+        for sentence in sentences:
-            words[pc.get('id')] = Word.pc_word(pc, do_msd_translate)
+            # create fake root word
            words[sentence.get('id')] = Word.fake_root_word(sentence.get('id'))
            last_word_id = None
-        for l in sentence.iter("link"):
+            if args.new_tei:
-            if 'dep' in l.keys():
+                for w in sentence.iter():
-                ana = l.get('afun')
+                    if w.tag == 'w':
-                lfrom = l.get('from')
+                        words[w.get('id')] = Word.from_xml(w, do_msd_translate)
-                dest = l.get('dep')
+                        if use_punctuations:
                            previous_glue = '' if 'join' in w.attrib and w.get('join') == 'right' else ' '
                    elif w.tag == pc_tag:
                        words[w.get('id')] = Word.pc_word(w, do_msd_translate)
                        if use_punctuations:
                            words[w.get('id')].previous_glue = previous_glue
                            words[w.get('id')].glue = '' if 'join' in w.attrib and w.get('join') == 'right' else ' '
                            previous_glue = '' if 'join' in w.attrib and w.get('join') == 'right' else ' '
            else:
-                ana = l.get('ana')
+                for w in sentence.iter():
-                if ana[:8] != 'jos-syn:': # dont bother...
+                    if w.tag == 'w':
-                    continue
+                        words[w.get('id')] = Word.from_xml(w, do_msd_translate)
-                ana = ana[8:]
+                        if use_punctuations:
-                lfrom, dest = l.get('target').replace('#', '').split()
+                            previous_glue = ''
                            last_word_id = None
                    elif w.tag == pc_tag:
                        words[w.get('id')] = Word.pc_word(w, do_msd_translate)
                        if use_punctuations:
                            last_word_id = w.get('id')
                            words[w.get('id')].previous_glue = previous_glue
                            previous_glue = ''
                    elif use_punctuations and w.tag == 'c':
                        # always save previous glue
                        previous_glue = w.text
                        if last_word_id:
                            words[last_word_id].glue += w.text
-            if lfrom in words:
+            for l in sentence.iter("link"):
-                if not skip_id_check and is_root_id(lfrom):
+                if 'dep' in l.keys():
-                    logging.error("NOO: {}".format(lfrom))
+                    ana = l.get('afun')
-                    sys.exit(1)
+                    lfrom = l.get('from')
-
+                    dest = l.get('dep')
                if dest in words:
                    next_word = words[dest]
                    words[lfrom].add_link(ana, next_word)
                else:
-                    logging.error("Unknown id: {}".format(dest))
+                    ana = l.get('ana')
-                    sys.exit(1)
+                    if ana[:8] != 'jos-syn:': # dont bother...
                        continue
                    ana = ana[8:]
                    lfrom, dest = l.get('target').replace('#', '').split()
-            else:
+                if lfrom in words:
-                # strange errors, just skip...
+                    if not skip_id_check and is_root_id(lfrom):
-                pass
+                        logging.error("Id {} is not fine, you might want to try with tag --skip-id-check".format(lfrom))
                        sys.exit(1)
                    if dest in words:
                        next_word = words[dest]
                        words[lfrom].add_link(ana, next_word)
                    else:
                        logging.error("Unknown id: {}".format(dest))
                        sys.exit(1)
                else:
                    # strange errors, just skip...
                    pass
    return list(words.values())
--- a/luscenje_struktur/match.py
+++ b/luscenje_struktur/match.py
@@ -1,4 +1,4 @@
-from word import Word
+from luscenje_struktur.word import Word
 class StructureMatch:
    def __init__(self, match_id, structure):
--- a/luscenje_struktur/match_store.py
+++ b/luscenje_struktur/match_store.py
@@ -3,9 +3,9 @@ from collections import defaultdict
 from ast import literal_eval
 from time import time
-from match import StructureMatch
+from luscenje_struktur.match import StructureMatch
-from representation_assigner import RepresentationAssigner
+from luscenje_struktur.representation_assigner import RepresentationAssigner
-from progress_bar import progress
+from luscenje_struktur.progress_bar import progress
 class MatchStore:
    def __init__(self, args, db):
--- a/luscenje_struktur/msd_translate.py
+++ b/luscenje_struktur/msd_translate.py
@@ -1911,4 +1911,4 @@ MSD_TRANSLATE = {
    "Ne": "Ne",
    "Nh": "Nh",
    "Na": "Na",
-    "U": "N"}
+    "U": "Z"}
--- a/luscenje_struktur/order.py
+++ b/luscenje_struktur/order.py
--- a/luscenje_struktur/postprocessor.py
+++ b/luscenje_struktur/postprocessor.py
@@ -1,7 +1,8 @@
 class Postprocessor:
-    def __init__(self, fix_one_letter_words=True):
+    def __init__(self, fix_one_letter_words=True, fixed_restriction_order=False):
        self.fix_one_letter_words = fix_one_letter_words
        self.fixed_restriction_order = fixed_restriction_order
    @staticmethod
    def fix_sz(next_word):
@@ -28,3 +29,19 @@ class Postprocessor:
                    match[col_id].text = correct_letter
        collocation_id = [collocation_id[0]] + [tuple(line) for line in collocation_id[1:]]
        return match, collocation_id
    def is_fixed_restriction_order(self, match):
        if not self.fixed_restriction_order:
            return True
        sorted_dict = {k: v for k, v in sorted(match.items(), key=lambda item: item[1].int_id)}
        prev_id = -1
        for key in sorted_dict.keys():
            if key == '#':
                continue
            int_key = int(key)
            if prev_id > int_key:
                return False
            prev_id = int_key
        return True
--- a/luscenje_struktur/progress_bar.py
+++ b/luscenje_struktur/progress_bar.py
--- a/luscenje_struktur/representation.py
+++ b/luscenje_struktur/representation.py
@@ -1,10 +1,10 @@
 import logging
 from collections import Counter
-from codes_tagset import TAGSET, CODES
+from luscenje_struktur.codes_tagset import TAGSET, CODES
-from word import WordMsdOnly
+from luscenje_struktur.word import WordMsdOnly
-from word import WordDummy
+from luscenje_struktur.word import WordDummy
 class ComponentRepresentation:
@@ -71,9 +71,7 @@ class WordFormAnyCR(ComponentRepresentation):
            agreements_matched = [agr.match(word_msd) for agr in self.agreement]
            # in case all agreements do not match try to get data from sloleks and change properly
-            if not all(agreements_matched):
+            if sloleks_db is not None and not all(agreements_matched):
-                if sloleks_db is None:
-                    raise Exception('sloleks_db not properly setup!')
                for i, agr in enumerate(self.agreement):
                    if not agr.match(word_msd):
                        msd, lemma, text = sloleks_db.get_word_form(agr.lemma, agr.msd(), agr.data, align_msd=word_msd)
@@ -142,9 +140,7 @@ class WordFormMsdCR(WordFormAnyCR):
            super().add_word(word)
    def _render(self, sloleks_db=None):
-        if len(self.words) == 0:
+        if len(self.words) == 0 and sloleks_db is not None:
-            if sloleks_db is None:
-                raise Exception('sloleks_db not properly setup!')
            msd, lemma, text = sloleks_db.get_word_form(self.lemma, self.msd(), self.data)
            if msd is not None:
                self.words.append(WordDummy(msd, lemma, text))
--- a/luscenje_struktur/representation_assigner.py
+++ b/luscenje_struktur/representation_assigner.py
@@ -1,4 +1,4 @@
-from representation import ComponentRepresentation, LemmaCR, LexisCR, WordFormAgreementCR, WordFormAnyCR, WordFormMsdCR, WordFormAllCR
+from luscenje_struktur.representation import ComponentRepresentation, LemmaCR, LexisCR, WordFormAgreementCR, WordFormAnyCR, WordFormMsdCR, WordFormAllCR
 class RepresentationAssigner:
    def __init__(self):
@@ -27,11 +27,10 @@ class RepresentationAssigner:
            elif feature['selection'] == "all":
                self.representation_factory = WordFormAllCR
            elif feature['selection'] == 'agreement':
                assert feature['head'][:4] == 'cid_'
                assert feature['msd'] is not None
                self.representation_factory = WordFormAgreementCR
                self.more['agreement'] = feature['msd'].split('+')
-                self.more['other'] = feature['head'][4:]
+                self.more['other'] = feature['head_cid']
            else:
                raise NotImplementedError("Representation selection: {}".format(feature))
--- a/luscenje_struktur/restriction.py
+++ b/luscenje_struktur/restriction.py
@@ -0,0 +1,192 @@
 import re
 from enum import Enum
 from luscenje_struktur.codes_tagset import CODES, TAGSET
 class RestrictionType(Enum):
    Morphology = 0
    Lexis = 1
    MatchAll = 2
    Space = 3
 def determine_ppb(rgxs):
    if len(rgxs) != 1:
        return 0
    rgx = rgxs[0]
    if rgx[0] in ("A", "N", "R"):
        return 0
    elif rgx[0] == "V":
        if len(rgx) == 1:
            return 2
        elif 'a' in rgx[1]:
            return 3
        elif 'm' in rgx[1]:
            return 1
        else:
            return 2
    else:
        return 4
 class MorphologyRegex:
    def __init__(self, restriction):
        # self.min_msd_length = 1
        restr_dict = {}
        for feature in restriction:
            feature_dict = dict(feature.items())
            match_type = True
            if "filter" in feature_dict:
                assert feature_dict['filter'] == "negative"
                match_type = False
                del feature_dict['filter']
            assert len(feature_dict) == 1
            key, value = next(iter(feature_dict.items()))
            restr_dict[key] = (value, match_type)
        assert 'POS' in restr_dict
        # handle multiple word types
        if '|' in restr_dict['POS'][0]:
            categories = restr_dict['POS'][0].split('|')
        else:
            categories = [restr_dict['POS'][0]]
        self.rgxs = []
        self.re_objects = []
        self.min_msd_lengths = []
        del restr_dict['POS']
        for category in categories:
            min_msd_length = 1
            category = category.capitalize()
            cat_code = CODES[category]
            rgx = [cat_code] + ['.'] * 10
            for attribute, (value, typ) in restr_dict.items():
                if attribute.lower() not in TAGSET[cat_code]:
                    continue
                index = TAGSET[cat_code].index(attribute.lower())
                assert index >= 0
                if '|' in value:
                    match = "".join(CODES[val] for val in value.split('|'))
                else:
                    match = CODES[value]
                match = "[{}{}]".format("" if typ else "^", match)
                rgx[index + 1] = match
                if typ:
                    min_msd_length = max(index + 1, min_msd_length)
            # strip rgx
            for i in reversed(range(len(rgx))):
                if rgx[i] == '.':
                    rgx = rgx[:-1]
                else:
                    break
            self.re_objects.append([re.compile(r) for r in rgx])
            self.rgxs.append(rgx)
            self.min_msd_lengths.append(min_msd_length)
    def __call__(self, text):
        for i, re_object in enumerate(self.re_objects):
            if len(text) < self.min_msd_lengths[i]:
                continue
            match = True
            for c, r in zip(text, re_object):
                if not r.match(c):
                    match = False
                    break
            if match:
                return True
        return False
 class LexisRegex:
    def __init__(self, restriction):
        restr_dict = {}
        for feature in restriction:
            restr_dict.update(feature.items())
        assert "lemma" in restr_dict
        self.match_list = restr_dict['lemma'].split('|')
    def __call__(self, text):
        return text in self.match_list
 class SpaceRegex:
    def __init__(self, restriction):
        restr_dict = {}
        for feature in restriction:
            restr_dict.update(feature.items())
        assert "contact" in restr_dict
        self.space = restr_dict['contact'].split('|')
        for el in self.space:
            if el not in ['both', 'right', 'left', 'neither']:
                raise Exception('Value of space restriction is not supported (it may be both, left, right or neither).')
    def __call__(self, word):
        match = False
        if 'neither' in self.space:
            match = match or (word.previous_glue != '' and word.glue != '')
        if 'left' in self.space:
            match = match or (word.previous_glue == '' and word.glue != '')
        if 'right' in self.space:
            match = match or (word.previous_glue != '' and word.glue == '')
        if 'both' in self.space:
            match = match or (word.previous_glue == '' and word.glue == '')
        return match
 class Restriction:
    def __init__(self, restriction_tag):
        self.ppb = 4 # polnopomenska beseda (0-4)
        if restriction_tag is None:
            self.type = RestrictionType.MatchAll
            self.matcher = None
            self.present = None
            return
        restriction_type = restriction_tag.get('type')
        if restriction_type == "morphology":
            self.type = RestrictionType.Morphology
            self.matcher = MorphologyRegex(list(restriction_tag))
            self.ppb = determine_ppb(self.matcher.rgxs)
        elif restriction_type == "lexis":
            self.type = RestrictionType.Lexis
            self.matcher = LexisRegex(list(restriction_tag))
        elif restriction_type == "space":
            self.type = RestrictionType.Space
            self.matcher = SpaceRegex(list(restriction_tag))
        else:
            raise NotImplementedError()
    def match(self, word):
        if self.type == RestrictionType.Morphology:
            match_to = word.msd
        elif self.type == RestrictionType.Lexis:
            match_to = word.lemma
        elif self.type == RestrictionType.MatchAll:
            return True
        elif self.type == RestrictionType.Space:
            match_to = word
        else:
            raise RuntimeError("Unreachable!")
        return self.matcher(match_to)
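
A minimal sketch of the `contact` semantics that `SpaceRegex.__call__` encodes, written as a standalone function. The helper name `contact_matches` is hypothetical, and reading an empty glue string as "nothing separates the word from its neighbour on that side" is an assumption drawn from the comparisons in the diff, not from documentation.

```python
def contact_matches(allowed, previous_glue, glue):
    """Return True if the word's contact state is one of the allowed values.

    allowed       -- list such as 'left|both'.split('|')
    previous_glue -- string before the word ('' assumed to mean direct contact)
    glue          -- string after the word ('' assumed to mean direct contact)
    """
    supported = ('both', 'right', 'left', 'neither')
    for value in allowed:
        if value not in supported:
            raise ValueError('Unsupported contact value: {}'.format(value))

    # Same truth table as the four branches in SpaceRegex.__call__.
    empty_before = previous_glue == ''
    empty_after = glue == ''
    if empty_before and empty_after:
        state = 'both'
    elif empty_before:
        state = 'left'
    elif empty_after:
        state = 'right'
    else:
        state = 'neither'
    return state in allowed
```

Each word is in exactly one of the four states, so a restriction such as `contact="left|both"` reduces to a membership test.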
--- a/luscenje_struktur/restriction_group.py
+++ b/luscenje_struktur/restriction_group.py
@@ -0,0 +1,24 @@
 from luscenje_struktur.restriction import Restriction
 class RestrictionGroup:
    def __init__(self, restrictions_tag, group_type='and'):
        self.restrictions = [Restriction(el) for el in restrictions_tag]
        self.group_type = group_type
    def __iter__(self):
        for restriction in self.restrictions:
            yield restriction
    def match(self, word):
        if self.group_type == 'or':
            for restr in self.restrictions:
                if restr.match(word): # match either
                    return True
            return False
        elif self.group_type == 'and':
            for restr in self.restrictions:
                if not restr.match(word): # match and
                    return False
            return True
        else:
            raise Exception("Unsupported group_type - it may only be 'and' or 'or'")
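
The `and`/`or` combination in `RestrictionGroup.match` can be sketched with the built-ins `all` and `any`. The class `PredicateGroup` and the lambda predicates below are hypothetical stand-ins for `Restriction` objects, used only so the example runs on its own.

```python
class PredicateGroup:
    """Combine word predicates with 'and' or 'or', like RestrictionGroup."""

    def __init__(self, predicates, group_type='and'):
        if group_type not in ('and', 'or'):
            raise ValueError("group_type may only be 'and' or 'or'")
        self.predicates = list(predicates)
        self.group_type = group_type

    def match(self, word):
        # all() is True for an empty group and any() is False, which is
        # what the explicit loops in RestrictionGroup.match also return.
        combine = all if self.group_type == 'and' else any
        return combine(predicate(word) for predicate in self.predicates)


is_noun = lambda word: word['msd'].startswith('N')
is_hisa = lambda word: word['lemma'] == 'hisa'
word = {'lemma': 'hisa', 'msd': 'Ncfsn'}
other = {'lemma': 'miza', 'msd': 'Ncfsn'}

both_required = PredicateGroup([is_noun, is_hisa], 'and')
either_enough = PredicateGroup([is_noun, is_hisa], 'or')
```

Unlike the class in the diff, this sketch rejects an unknown `group_type` at construction time rather than on the first call to `match`.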
--- a/luscenje_struktur/sloleks_db.py
+++ b/luscenje_struktur/sloleks_db.py
@@ -7,7 +7,7 @@ from sqlalchemy.ext.declarative import declarative_base
 from sqlalchemy.orm import Session, aliased
 from sqlalchemy import create_engine
-from codes_tagset import TAGSET, CODES, CODES_TRANSLATION, POSSIBLE_WORD_FORM_FEATURE_VALUES
+from luscenje_struktur.codes_tagset import TAGSET, CODES, CODES_TRANSLATION, POSSIBLE_WORD_FORM_FEATURE_VALUES
 class SloleksDatabase:
--- a/luscenje_struktur/syntactic_structure.py
+++ b/luscenje_struktur/syntactic_structure.py
@@ -2,20 +2,23 @@ from xml.etree import ElementTree
 import logging
 import pickle
-from component import Component, ComponentType
+from luscenje_struktur.component import Component, ComponentType
-from lemma_features import get_lemma_features
+from luscenje_struktur.lemma_features import get_lemma_features
 class SyntacticStructure:
    def __init__(self):
        self.id = None
-        self.lbs = None
+        # self.lbs = None
        self.components = []
        self.fake_root_included = False
    @staticmethod
-    def from_xml(xml):
+    def from_xml(xml, no_stats):
        st = SyntacticStructure()
-        st.id = xml.get('id_nsss')
+        st.id = xml.get('id')
-        st.lbs = xml.get('LBS')
+        if st.id is None:
+            st.id = xml.get('tempId')
+        # st.lbs = xml.get('LBS')
        assert len(list(xml)) == 1
        system = next(iter(xml))
@@ -31,23 +34,29 @@ class SyntacticStructure:
        for comp in definitions:
            n = comp.get('cid')
-            restrs[n] = None
+            restrs[n] = []
            forms[n] = []
            for el in comp:
                if el.tag.startswith("restriction"):
-                    assert restrs[n] is None
+                    restrs[n].append(el)
-                    restrs[n] = el
                elif el.tag.startswith("representation"):
                    st.add_representation(n, el, forms)
                else:
                    raise NotImplementedError("Unknown definition: {} in structure {}"
                                              .format(el.tag, st.id))
-        fake_root_component = Component({'cid': '#', 'type': 'other'})
+        fake_root_component = Component({'cid': '#', 'type': 'other', 'restriction': None})
-        st.components = fake_root_component.find_next(deps, comps, restrs, forms)
+        fake_root_component_children = fake_root_component.find_next(deps, comps, restrs, forms)
+        # all dep with value modra point to artificial root - fake_root_component
+        if any([dep[2] == 'modra' for dep in deps]):
+            st.fake_root_included = True
+            st.components = [fake_root_component] + fake_root_component_children
+        else:
+            st.components = fake_root_component_children
-        st.determine_core2w()
+        if not no_stats:
+            st.determine_core2w()
        return st
    def determine_core2w(self):
@@ -98,6 +107,7 @@ class SyntacticStructure:
 def build_structures(args):
    filename = args.structures
+    no_stats = args.out is None and args.stats is None
    max_num_components = -1
    with open(filename, 'r') as fp:
@@ -105,12 +115,15 @@ def build_structures(args):
    structures = []
    for structure in et.iter('syntactic_structure'):
-        to_append = SyntacticStructure.from_xml(structure)
+        if structure.attrib['type'] != 'collocation':
+            continue
+        to_append = SyntacticStructure.from_xml(structure, no_stats)
        if to_append is None:
            continue
        structures.append(to_append)
-        max_num_components = max(max_num_components, len(to_append.components))
+        to_append_len = len(to_append.components) if not to_append.fake_root_included else len(to_append.components) - 1
+        max_num_components = max(max_num_components, to_append_len)
    lemma_features = get_lemma_features(et)
    return structures, lemma_features, max_num_components
--- a/luscenje_struktur/time_info.py
+++ b/luscenje_struktur/time_info.py
--- a/luscenje_struktur/word.py
+++ b/luscenje_struktur/word.py
@@ -1,7 +1,7 @@
 from collections import defaultdict
 import logging
-from msd_translate import MSD_TRANSLATE
+from luscenje_struktur.msd_translate import MSD_TRANSLATE
 class WordCompressed:
@@ -32,11 +32,15 @@ class WordDummy:
 class Word:
-    def __init__(self, lemma, msd, wid, text, do_msd_translate):
+    def __init__(self, lemma, msd, wid, text, do_msd_translate, fake_word=False, previous_punctuation=None):
        self.lemma = lemma
        self.msd = MSD_TRANSLATE[msd] if do_msd_translate else msd
        self.id = wid
        self.idi = None
        self.text = text
        self.glue = ''
        self.previous_glue = '' if previous_punctuation is None else previous_punctuation
        self.fake_word = fake_word
        self.links = defaultdict(list)
@@ -72,6 +76,11 @@ class Word:
        pc.set('msd', "N" if do_msd_translate else "U")
        return Word.from_xml(pc, do_msd_translate)
    @staticmethod
    def fake_root_word(sentence_id):
        wid = sentence_id
        return Word('', '', wid, '', False, True)
    def add_link(self, link, to):
        self.links[link].append(to)
--- a/luscenje_struktur/word_stats.py
+++ b/luscenje_struktur/word_stats.py
@@ -1,6 +1,6 @@
 from collections import defaultdict, Counter
-from progress_bar import progress
+from luscenje_struktur.progress_bar import progress
 class WordStats:
@@ -25,6 +25,8 @@ class WordStats:
    def add_words(self, words):
        for w in progress(words, "adding-words"):
            if w.fake_word:
                continue
            params = {'lemma': w.lemma, 'msd': w.msd, 'text': w.text}
            res = self.db.execute("""UPDATE UniqWords SET frequency=frequency + 1
                WHERE lemma=:lemma AND msd=:msd AND text=:text""", params)
--- a/luscenje_struktur/writer.py
+++ b/luscenje_struktur/writer.py
@@ -1,11 +1,11 @@
 import logging
 import os
-from progress_bar import progress
+from luscenje_struktur.progress_bar import progress
-from formatter import OutFormatter, OutNoStatFormatter, AllFormatter, StatsFormatter
+from luscenje_struktur.formatter import OutFormatter, OutNoStatFormatter, AllFormatter, StatsFormatter
-from collocation_sentence_mapper import CollocationSentenceMapper
+from luscenje_struktur.collocation_sentence_mapper import CollocationSentenceMapper
 class Writer:
@@ -16,23 +16,23 @@ class Writer:
    @staticmethod
    def make_output_writer(args, num_components, colocation_ids, word_renderer):
        params = Writer.other_params(args)
-        return Writer(args.out, num_components, OutFormatter(colocation_ids, word_renderer), args.collocation_sentence_map_dest, params)
+        return Writer(args.out, num_components, OutFormatter(colocation_ids, word_renderer), args.collocation_sentence_map_dest, params, args.separator)
    @staticmethod
    def make_output_no_stat_writer(args, num_components, colocation_ids, word_renderer):
        params = Writer.other_params(args)
-        return Writer(args.out_no_stat, num_components, OutNoStatFormatter(colocation_ids, word_renderer), args.collocation_sentence_map_dest, params)
+        return Writer(args.out_no_stat, num_components, OutNoStatFormatter(colocation_ids, word_renderer), args.collocation_sentence_map_dest, params, args.separator)
    @staticmethod
    def make_all_writer(args, num_components, colocation_ids, word_renderer):
-        return Writer(args.all, num_components, AllFormatter(colocation_ids, word_renderer), args.collocation_sentence_map_dest, None)
+        return Writer(args.all, num_components, AllFormatter(colocation_ids, word_renderer), args.collocation_sentence_map_dest, None, args.separator)
    @staticmethod
    def make_stats_writer(args, num_components, colocation_ids, word_renderer):
        params = Writer.other_params(args)
-        return Writer(args.stats, num_components, StatsFormatter(colocation_ids, word_renderer), args.collocation_sentence_map_dest, params)
+        return Writer(args.stats, num_components, StatsFormatter(colocation_ids, word_renderer), args.collocation_sentence_map_dest, params, args.separator)
-    def __init__(self, file_out, num_components, formatter, collocation_sentence_map_dest, params):
+    def __init__(self, file_out, num_components, formatter, collocation_sentence_map_dest, params, separator):
        # TODO FIX THIS
        self.collocation_sentence_map_dest = collocation_sentence_map_dest
        if params is None:
@@ -49,6 +49,7 @@ class Writer:
        self.num_components = num_components
        self.output_file = file_out
        self.formatter = formatter
        self.separator = separator
    def header(self):
        repeating_cols = self.formatter.header_repeat()
@@ -78,7 +79,7 @@ class Writer:
        return sorted(rows, key=key, reverse=self.sort_order)
    def write_header(self, file_handler):
-        file_handler.write(",".join(self.header()) + "\n")
+        file_handler.write(self.separator.join(self.header()) + "\n")
    def write_out_worker(self, file_handler, structure, colocation_ids, col_sent_map):
        rows = []
@@ -99,12 +100,16 @@ class Writer:
            for words in match.matches:
                to_write = []
-                for idx, _comp in enumerate(components):
+                idx = 1
-                    idx = str(idx + 1)
+                for _comp in components:
-                    if idx not in words:
+                    if _comp.idx == '#':
                        continue
                    idx_s = str(idx)
                    idx += 1
                    if idx_s not in words:
                        to_write.extend([""] * self.formatter.length())
                    else:
-                        to_write.extend(self.formatter.content_repeat(words, match.representations, idx, structure.id))
+                        to_write.extend(self.formatter.content_repeat(words, match.representations, idx_s, structure.id))
                # make them equal size
                to_write.extend([""] * (self.num_components * self.formatter.length() - len(to_write)))
@@ -121,7 +126,7 @@ class Writer:
        if rows != []:
            rows = self.sorted_rows(rows)
-            file_handler.write("\n".join([",".join(row) for row in rows]) + "\n")
+            file_handler.write("\n".join([self.separator.join(row) for row in rows]) + "\n")
            file_handler.flush()
    def write_out(self, structures, colocation_ids):
--- a/luscenje_struktur/writerpy
+++ b/luscenje_struktur/writerpy
--- a/run.sh.example
+++ b/run.sh.example
@@ -1 +1 @@
-pypy3 src/wani.py data/Kolokacije_strukture_JOS-32-representation_3D_08_1.xml data/input --out data/output --sloleks_db '<sloleks db data>' --collocation_sentence_map_dest data/collocation-sentence-mapper --db /mnt/tmp/mysql-wani --multiple-output  --load-sloleks
+pypy3 wani.py data/Kolokacije_strukture_JOS-32-representation_3D_08_1.xml data/input --out data/output --sloleks_db '<sloleks db data>' --collocation_sentence_map_dest data/collocation-sentence-mapper --db /mnt/tmp/mysql-wani --multiple-output  --load-sloleks
--- a/scripts/recalculate_statistics.py
+++ b/scripts/recalculate_statistics.py
@@ -1,4 +1,7 @@
 import argparse
 import csv
 import logging
 import os
 import sys
@@ -166,21 +169,54 @@ def write_new_stats(wf, original_text, stats, file_name, word_order):
        wf.write(','.join(line) + '\n')
 def main(args):
-    word_order = load_word_order(args.word_order_file)
+    if not args.ignore_recalculation:
-    for file_name in os.listdir(args.input):
+        word_order = load_word_order(args.word_order_file)
-        read_file_path = os.path.join(args.input, file_name)
+        for file_name in os.listdir(args.input):
-        write_file_path = os.path.join(args.output, file_name)
+            read_file_path = os.path.join(args.input, file_name)
-        with open(read_file_path, 'r') as rf, open(write_file_path, 'w') as wf:
+            write_file_path = os.path.join(args.output, file_name)
-            original_text, stats = get_new_stats(rf)
+            with open(read_file_path, 'r') as rf, open(write_file_path, 'w') as wf:
-            freq_pos = original_text[0].index('Frequency')
+                original_text, stats = get_new_stats(rf)
-            original_text = [original_text[0]] + [l for l in original_text[1:] if int(l[freq_pos]) >= 10]
+                freq_pos = original_text[0].index('Frequency')
-            if len(original_text) > 1:
+                if args.frequency_limit > 1:
-                original_text = [original_text[0]] + sorted(original_text[1:], key=lambda x: -1 * int(x[freq_pos]))
+                    original_text = [original_text[0]] + [l for l in original_text[1:] if int(l[freq_pos]) >= args.frequency_limit]
-            else:
+                if args.sorted:
-                original_text = [original_text[0]]
+                    if len(original_text) > 1:
-            write_new_stats(wf, original_text, stats, file_name, word_order)
+                        original_text = [original_text[0]] + sorted(original_text[1:], key=lambda x: -1 * int(x[freq_pos]))
                    else:
                        original_text = [original_text[0]]
                write_new_stats(wf, original_text, stats, file_name, word_order)
    if args.format_output:
        for file_name in os.listdir(args.output):
            read_file_path = os.path.join(args.output, file_name)
            write_file_path = os.path.join(args.formatted_output, file_name)
            with open(read_file_path, 'r', encoding="utf-8") as rf, open(write_file_path, 'w') as wf:
                first_line = True
                lines = []
                formatted_output = []
                for line in rf:
                    line = line[:-1].split(',')
                    if first_line:
                        # sorting
                        a = line[-17]
                        b = line[-15]
                        # post frequency
                        c = line[-6]
                        d = line[-8]
                        formatted_output.append(line[:-14] + [line[-6], line[-8]])
                        first_line = False
                        continue
                    lines.append(line[:-14] + [line[-6], line[-8]])
                lines = [line for line in lines if int(line[-3]) >= 10]
                lines = sorted(lines, key=lambda x: (-int(x[-3]), x[-5]))
                formatted_output += lines
                for line in formatted_output:
                    wf.write(','.join(line) + '\n')
            break
 if __name__ == '__main__':
    parser = argparse.ArgumentParser(
@@ -190,6 +226,11 @@ if __name__ == '__main__':
    parser.add_argument('output',
                        help='Path to folder that will contain all output files.')
    parser.add_argument('--word_order_file', type=str, help='File that contains word order for DeltaP calculations.')
    parser.add_argument('--frequency_limit', type=int, default=1, help='Frequency threshold; rows below it are dropped from the output.')
    parser.add_argument('--sorted', action='store_true', help='Sort output rows by descending frequency.')
    parser.add_argument('--format_output', action='store_true', help='Format and cut data as specified in #1808 on redmine.')
    parser.add_argument('--ignore_recalculation', action='store_true', help='Ignore recalculation.')
    parser.add_argument('--formatted_output', default=None, help='Destination of final results.')
    args = parser.parse_args()
    logging.basicConfig(stream=sys.stderr)
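
The filter-then-sort step that `main` applies to each statistics file can be sketched as one function over already-parsed rows. The name `filter_and_sort` and the sample data are hypothetical. The sketch applies the configured limit directly as the cut-off, which is an assumption about the intended behaviour.

```python
def filter_and_sort(rows, frequency_limit=1, sort_rows=False):
    """Keep the header, drop low-frequency rows, optionally sort by frequency.

    rows -- list of lists of strings; rows[0] is the header and must
            contain a 'Frequency' column.
    """
    header, body = rows[0], rows[1:]
    freq_pos = header.index('Frequency')

    # Filtering is skipped for the default limit of 1, as in the script.
    if frequency_limit > 1:
        body = [row for row in body if int(row[freq_pos]) >= frequency_limit]

    # Descending frequency; sorted() is stable, so ties keep their order.
    if sort_rows:
        body = sorted(body, key=lambda row: -int(row[freq_pos]))

    return [header] + body


sample = [
    ['Lemma', 'Frequency'],
    ['a', '3'],
    ['b', '12'],
    ['c', '20'],
]
result = filter_and_sort(sample, frequency_limit=10, sort_rows=True)
```

Because the header is split off first, a file with no data rows needs no special case: the function returns the header alone.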
--- a/setup.py
+++ b/setup.py
@@ -0,0 +1,10 @@
 from setuptools import setup, find_packages
 setup(name='luscenje_struktur_loc',
  version='0.0.1',
  description=u"Parser for collocability",
  author=u"CJVT",
  author_email='fake@mail.com',
  license='MIT',
  packages=find_packages(),
 )
--- a/src/restriction.py
+++ b/src/restriction.py
@@ -1,133 +0,0 @@
 import re
 from enum import Enum
 from codes_tagset import CODES, TAGSET
 class RestrictionType(Enum):
    Morphology = 0
    Lexis = 1
    MatchAll = 2
 def determine_ppb(rgx):
    if rgx[0] in ("A", "N", "R"):
        return 0
    elif rgx[0] == "V":
        if len(rgx) == 1:
            return 2
        elif 'a' in rgx[1]:
            return 3
        elif 'm' in rgx[1]:
            return 1
        else:
            return 2
    else:
        return 4
 class MorphologyRegex:
    def __init__(self, restriction):
        self.min_msd_length = 1
        restr_dict = {}
        for feature in restriction:
            feature_dict = dict(feature.items())
            match_type = True
            if "filter" in feature_dict:
                assert feature_dict['filter'] == "negative"
                match_type = False
                del feature_dict['filter']
            assert len(feature_dict) == 1
            key, value = next(iter(feature_dict.items()))
            restr_dict[key] = (value, match_type)
        assert 'POS' in restr_dict
        category = restr_dict['POS'][0].capitalize()
        cat_code = CODES[category]
        rgx = [cat_code] + ['.'] * 10
        del restr_dict['POS']
        for attribute, (value, typ) in restr_dict.items():
            index = TAGSET[cat_code].index(attribute.lower())
            assert index >= 0
            if '|' in value:
                match = "".join(CODES[val] for val in value.split('|'))
            else:
                match = CODES[value]
            match = "[{}{}]".format("" if typ else "^", match)
            rgx[index + 1] = match
            if typ:
                self.min_msd_length = max(index + 1, self.min_msd_length)
        # strip rgx
        for i in reversed(range(len(rgx))):
            if rgx[i] == '.':
                rgx = rgx[:-1]
            else:
                break
        self.re_objects = [re.compile(r) for r in rgx]
        self.rgx = rgx
    def __call__(self, text):
        if len(text) <= self.min_msd_length:
            return False
        for c, r in zip(text, self.re_objects):
            if not r.match(c):
                return False
        return True
 class LexisRegex:
    def __init__(self, restriction):
        restr_dict = {}
        for feature in restriction:
            restr_dict.update(feature.items())
        assert "lemma" in restr_dict
        self.match_list = restr_dict['lemma'].split('|')
    def __call__(self, text):
        return text in self.match_list
 class Restriction:
    def __init__(self, restriction_tag):
        self.ppb = 4 # polnopomenska beseda (0-4)
        if restriction_tag is None:
            self.type = RestrictionType.MatchAll
            self.matcher = None
            self.present = None
            return
        restriction_type = restriction_tag.get('type')
        if restriction_type == "morphology":
            self.type = RestrictionType.Morphology
            self.matcher = MorphologyRegex(list(restriction_tag))
            self.ppb = determine_ppb(self.matcher.rgx)
        elif restriction_type == "lexis":
            self.type = RestrictionType.Lexis
            self.matcher = LexisRegex(list(restriction_tag))
        else:
            raise NotImplementedError()
    def match(self, word):
        if self.type == RestrictionType.Morphology:
            match_to = word.msd
        elif self.type == RestrictionType.Lexis:
            match_to = word.lemma
        elif self.type == RestrictionType.MatchAll:
            return True
        else:
            raise RuntimeError("Unreachable!")
        return self.matcher(match_to)
--- a/src/wani.py
+++ b/src/wani.py
@@ -10,18 +10,18 @@ import subprocess
 import concurrent.futures
 import tempfile
-from progress_bar import progress
+from luscenje_struktur.progress_bar import progress
-from sloleks_db import SloleksDatabase
+from luscenje_struktur.sloleks_db import SloleksDatabase
-from word import Word
+from luscenje_struktur.word import Word
-from syntactic_structure import build_structures
+from luscenje_struktur.syntactic_structure import build_structures
-from match_store import MatchStore
+from luscenje_struktur.match_store import MatchStore
-from word_stats import WordStats
+from luscenje_struktur.word_stats import WordStats
-from writer import Writer
+from luscenje_struktur.writer import Writer
-from loader import load_files
+from luscenje_struktur.loader import load_files
-from database import Database
+from luscenje_struktur.database import Database
-from time_info import TimeInfo
+from luscenje_struktur.time_info import TimeInfo
-from postprocessor import Postprocessor
+from luscenje_struktur.postprocessor import Postprocessor
 def match_file(words, structures, postprocessor):
@@ -31,6 +31,8 @@ def match_file(words, structures, postprocessor):
        for w in words:
            mhere = s.match(w)
            for match in mhere:
                if not postprocessor.is_fixed_restriction_order(match):
                    continue
                colocation_id = [[idx, w.lemma] for idx, w in match.items()]
                colocation_id = [s.id] + list(sorted(colocation_id, key=lambda x: x[0]))
                match, collocation_id = postprocessor.process(match, colocation_id)
@@ -48,6 +50,7 @@ def main(args):
    database = Database(args)
    match_store = MatchStore(args, database)
    word_stats = WordStats(lemma_msds, database)
    postprocessor = Postprocessor(fixed_restriction_order=args.fixed_restriction_order)
    for words in load_files(args, database):
        if words is None:
@@ -55,7 +58,6 @@ def main(args):
            continue
        start_time = time.time()
-        postprocessor = Postprocessor()
        matches = match_file(words, structures, postprocessor)
        match_store.add_matches(matches)
@@ -80,9 +82,13 @@ def main(args):
    # figure out representations!
    if args.out or args.out_no_stat:
-        sloleks_db = SloleksDatabase(args.sloleks_db, args.load_sloleks)
+        if args.sloleks_db is not None:
            sloleks_db = SloleksDatabase(args.sloleks_db, args.load_sloleks)
        else:
            sloleks_db = None
        match_store.set_representations(word_stats, structures, sloleks_db=sloleks_db)
-        sloleks_db.close()
+        if args.sloleks_db is not None:
            sloleks_db.close()
    Writer.make_output_writer(args, max_num_components, match_store, word_stats).write_out(
        structures, match_store)
@@ -102,7 +108,7 @@ if __name__ == '__main__':
                        help='Structures definitions in xml file')
    parser.add_argument('input',
                        help='input file in (gz or xml currently). If none, then just database is loaded', nargs='*')
-    parser.add_argument('--sloleks_db', type=str, help='Sloleks database credentials')
+    parser.add_argument('--sloleks_db', type=str, default=None, help='Sloleks database credentials')
    parser.add_argument('--out',
                        help='Classic output file')
    parser.add_argument('--out-no-stat',
@@ -130,7 +136,7 @@ if __name__ == '__main__':
                        action='store_true')
    parser.add_argument('--load-sloleks',
-                        help='Tells weather sloleks is loaded into memory at the beginning of processing or not.',
+                        help='Tells whether sloleks is loaded into memory at the beginning of processing or not. Should be in',
                        action='store_true')
    parser.add_argument('--sort-by',
@@ -147,7 +153,16 @@ if __name__ == '__main__':
    parser.add_argument('--pc-tag',
                        help='Tag for separators, usually pc or c', default="pc")
-
+    parser.add_argument('--separator',
                        help='Separator in output file', default="\t")
    parser.add_argument('--ignore-punctuations',
                        help='Ignore punctuations', action='store_true')
    parser.add_argument('--fixed-restriction-order',
                        help='If used, words have to be in the same order as components.',
                        action='store_true')
    parser.add_argument('--new-tei',
                        help='Attribute to be used, when using new version of tei. (default=False)',
                        action='store_true')
    args = parser.parse_args()
    logging.basicConfig(stream=sys.stderr, level=args.verbose.upper())
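
A minimal sketch of how the new `--separator` option reaches the output, assuming only what the diff shows: the option defaults to a tab and the writer joins fields with it. `build_parser` and `format_row` are hypothetical helpers, not functions from the repository.

```python
import argparse


def build_parser():
    # Only the option relevant to this sketch; wani.py defines many more.
    parser = argparse.ArgumentParser()
    parser.add_argument('--separator',
                        help='Separator in output file', default="\t")
    return parser


def format_row(row, separator):
    # Plain join, as in Writer.write_header: fields are not quoted or
    # escaped, so a field containing the separator would shift columns.
    return separator.join(row) + "\n"


default_args = build_parser().parse_args([])
comma_args = build_parser().parse_args(['--separator', ','])

tab_line = format_row(['1', 'lemma', '42'], default_args.separator)
comma_line = format_row(['1', 'lemma', '42'], comma_args.separator)
```

Since the default changed from a hard-coded comma to a tab, downstream scripts that split on `,` would need the separator passed explicitly.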
Author	SHA1	Message	Date
Cyprian Laskowski	d0bec69fd8	Redmine #2198 : Limited wani to "collocation" type structures	2021-12-03 15:23:29 +01:00
Luka	39692e839f	Extended recalculate statistics to filtered output	2021-02-16 17:01:02 +01:00
Luka	f1366548b6	White reset at paragraphs not sentences + progress bar updates on paragraphs not sentences.	2021-01-26 14:57:42 +01:00
Luka	552f2e4bd0	Changed whitespace aspect from document to sentence based.	2021-01-23 09:28:10 +01:00
Luka	361331515e	Ignoring @type=single and added option for --new-tei	2021-01-13 16:36:44 +01:00
Luka	fa4479af60	Fixed repeating words bug	2020-11-26 09:45:22 +01:00
Luka	25db8eeb7a	Adding --fixed-restriction-order parameter	2020-10-27 09:48:34 +01:00
Luka	dd5fa4a1b8	Changed spaces settings - both swiched with neither and left with right.	2020-10-26 15:25:46 +01:00
Luka	c63a9d47da	Adding restriction on spaces on punctuations.	2020-10-22 13:16:58 +02:00
Luka	6dd97838b4	Added fix for when two restrictions are satisfied with the same word.	2020-10-19 15:40:43 +02:00
Luka	8c87d07b8a	Scripts adapted to changes of new structures.xml format	2020-10-14 14:50:35 +02:00
Luka	09c4277ebe	Modified error signal + Fixed no_stat	2020-10-09 20:13:37 +02:00
Luka	06435aa3a2	Added options for "modra"	2020-10-09 15:18:52 +02:00
Luka	1ea454f63c	Added fix for punctuations	2020-10-08 18:31:50 +02:00
Luka	d5668c8b68	Moved wani.py + Added ignore of .zstd files for valency	2020-10-01 16:20:52 +02:00
Luka	412d0c0f62	Changing file structure	2020-09-17 14:17:40 +02:00
Luka	c19c95ad97	Renaming src to luscenje struktur	2020-09-17 14:02:56 +02:00
Luka	5bff3e370f	Added setup.py	2020-09-17 13:09:20 +02:00
Luka	01b08667d2	Added some functions for compatibility with valency, fixed readme and fixed some minor bugs.	2020-09-10 15:06:09 +02:00
Luka	1b0e6a27eb	Modified readme.md + Removed obligatory sloleks_db + Added frequency_limit and sorted parameters in recalculate_statistics.py	2020-09-02 10:53:45 +02:00