Repository moved!

master
Luka 1 year ago
parent 40aaffa632
commit b7193f9126

8
.gitignore vendored

@ -1,8 +0,0 @@
.idea/
venv/
internal_saves/
__pycache__/
results/
data/
config2.ini
configs/

@ -1,78 +1,3 @@
# STARK: a tool for statistical analysis of dependency-parsed corpora
STARK is a Python-based command-line tool for extracting dependency trees from parsed corpora based on various user-defined criteria. It is primarily aimed at processing corpora based on the [Universal Dependencies](https://universaldependencies.org/) annotation scheme, but it also accepts any other corpus in the [CONLL-U](https://universaldependencies.org/format.html) format as input.
# Repository moved
STARK was moved to a [new location on GitHub](https://github.com/clarinsi/STARK).
## Windows installation and execution
### Installation
Install Python 3 on your system (https://www.python.org/downloads/).
Download the pip installation file (https://bootstrap.pypa.io/get-pip.py) and run it by double-clicking it.
Install the other required libraries by opening the program directory and double-clicking `install.bat`. If Windows Defender prevents this file from running, you may have to unblock it by `right-clicking on .bat file -> Properties -> General -> Security -> Select Unblock -> Select Apply`.
### Execution
Set up the search parameters in the `.ini` file.
Run the extraction by executing `run.bat` (if it is blocked, repeat the procedure described for `install.bat`).
Optionally, point `run.bat` to a different `.ini` file by editing `run.bat` and changing the `--config_file` parameter.
## Linux installation and execution
### Installation
Install Python 3 on your system (https://www.python.org/downloads/).
Install pip and the other libraries required by the program by running the following commands in a terminal:
```bash
sudo apt install python3-pip
cd <PATH TO PROJECT DIRECTORY>
pip3 install -r requirements.txt
```
### Execution
Set up the search parameters in the `.ini` file.
Run the extraction by first moving to the project directory:
```bash
cd <PATH TO PROJECT DIRECTORY>
```
Then run the script:
```bash
python3 dependency-parsetree.py --config_file=<PATH TO .ini file>
```
Example:
```bash
python3 dependency-parsetree.py --config_file=config.ini
```
## Parameter settings
The type of trees to be extracted can be defined through several parameters in the `config.ini` configuration file.
- `input`: location of the input file or directory (parsed corpus in .conllu)
- `output`: location of the output file (extraction results)
- `internal_saves`: location of the folder with files for optimization during processing
- `cpu_cores`: number of CPU cores to be used during processing
- `tree_size`: number of nodes in the tree (integer or range)
- `tree_type`: extraction of all possible subtrees or full subtrees only (values *all* or *complete*)
- `dependency_type`: extraction of labeled or unlabeled trees (values *labeled* or *unlabeled*)
- `node_order`: extraction of trees by taking surface word order into account (values *free* or *fixed*)
- `node_type`: type of nodes under investigation (values *form*, *lemma*, *upos*, *xpos*, *feats* or *deprel*)
- `label_whitelist`: predefined list of dependency labels allowed in the extracted tree
- `root_whitelist`: predefined characteristics of the root node
- `query`: predefined tree structure based on the modified Turku NLP [query language](http://bionlp.utu.fi/searchexpressions-new.html).
- `print_root`: print root node information in the output file (values *yes* or *no*)
- `nodes_number`: print the number of nodes in the tree in the output file (values *yes* or *no*)
- `association_measures`: calculate the strength of association between nodes by MI, MI3, t-test, logDice, Dice and simple-LL scores (values *yes* or *no*)
- `frequency_threshold`: minimum frequency of occurrences of the tree in the corpus
- `lines_threshold`: maximum number of trees in the output
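As a rough illustration of how such a file could be read, here is a minimal sketch using Python's standard `configparser`. The `[settings]` section name and the sample values are taken from the sample `config.ini` in this repository; the helper `parse_tree_size` is hypothetical and is not STARK's own parsing code.

```python
import configparser

# A trimmed-down configuration in the same shape as the sample config.ini.
SAMPLE = """
[settings]
input = data/sl_ssj-ud_v2.4.conllu
output = results/out_official.tsv
tree_size = 2-4
node_type = upos
association_measures = no
"""


def parse_tree_size(value):
    """Hypothetical helper: turn '3' or '2-4' into an inclusive (min, max) pair."""
    parts = [int(p) for p in value.split('-')]
    return (parts[0], parts[-1])


config = configparser.ConfigParser()
config.read_string(SAMPLE)
settings = config['settings']

tree_size = parse_tree_size(settings['tree_size'])
# configparser maps 'yes'/'no' to True/False
use_measures = settings.getboolean('association_measures')
# optional settings can fall back to a default when they are commented out
lines_threshold = settings.getint('lines_threshold', fallback=None)

print(tree_size, use_measures, lines_threshold)
```

Treating a commented-out option as "no limit" via `fallback=None` is an assumption of this sketch, not documented behaviour of the tool.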
## Output
The tool returns the list of all relevant trees as a tab-separated `.tsv` file, with information on each tree's structure, its frequency and other information that depends on the specific parameter settings. The tool does not support data visualization; however, the tree structure in the output is directly transferable to the [Dep_Search](http://bionlp-www.utu.fi/dep_search/) concordancing service, which gives access to specific examples in many corpora.
## Credits
This program was developed by Luka Krsnik in collaboration with Kaja Dobrovoljc and Marko Robnik Šikonja and with financial support from [CLARIN.SI](https://www.clarin.si/).
<a href="http://www.clarin.si/info/about/"><img src="https://gitea.cjvt.si/lkrsnik/dependency_parsing/raw/branch/master/logos/CLARIN.png" alt="drawing" height="150"/></a>
<a href="https://www.cjvt.si/en/"><img src="https://gitea.cjvt.si/lkrsnik/dependency_parsing/raw/branch/master/logos/CJVT.png" alt="drawing" height="150"/></a>
<a href="https://www.fri.uni-lj.si/en/about"><img src="https://gitea.cjvt.si/lkrsnik/dependency_parsing/raw/branch/master/logos/FRI.png" alt="drawing" height="150"/></a>
<a href="http://www.ff.uni-lj.si/an/aboutFaculty/about_faculty"><img src="https://gitea.cjvt.si/lkrsnik/dependency_parsing/raw/branch/master/logos/FF.png" alt="drawing" height="150"/></a>

@ -1,25 +0,0 @@
# Copyright 2019 CJVT
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from generic import generate_key, generate_name


class ResultNode(object):
    def __init__(self, node, architecture_order, create_output_strings):
        self.name_parts, self.name = generate_name(node, create_output_strings)
        self.location = architecture_order
        self.deprel = node.deprel.get_value()

    def __repr__(self):
        return self.name

@ -1,180 +0,0 @@
# Copyright 2019 CJVT
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import copy
import string


class ResultTree(object):
    def __init__(self, node, children, filters):
        self.node = node
        self.children = children
        self.filters = filters
        self.key = None
        self.order_key = None
        self.order = None
        self.array = None

    def __repr__(self):
        return self.get_key()

    def set_children(self, children):
        self.children = children

    def reset_params(self):
        self.key = None
        self.order_key = None
        self.order = None
        self.array = None

    def get_key(self):
        if self.key:
            return self.key
        key = ''
        write_self_node_to_result = False
        if self.children:
            children = self.children
            for child in children:
                if self.filters['node_order'] and child.node.location < self.node.location:
                    if self.filters['dependency_type']:
                        separator = ' <' + child.node.deprel + ' '
                    else:
                        separator = ' < '
                    key += child.get_key() + separator
                else:
                    if not write_self_node_to_result:
                        write_self_node_to_result = True
                        key += self.node.name
                    if self.filters['dependency_type']:
                        separator = ' >' + child.node.deprel + ' '
                    else:
                        separator = ' > '
                    key += separator + child.get_key()

            if not write_self_node_to_result:
                key += self.node.name
            self.key = '(' + key + ')'
        else:
            self.key = self.node.name
        return self.key

    def get_key_sorted(self):
        key = ''
        write_self_node_to_result = False
        if self.children:
            children = sorted(self.children, key=lambda x: x.node.name)
            for child in children:
                if not write_self_node_to_result:
                    write_self_node_to_result = True
                    key += self.node.name
                if self.filters['dependency_type']:
                    separator = ' >' + child.node.deprel + ' '
                else:
                    separator = ' > '
                key += separator + child.get_key_sorted()

            if not write_self_node_to_result:
                key += self.node.name
            key = '(' + key + ')'
        else:
            key = self.node.name
        return key

    def get_order_key(self):
        if self.order_key:
            return self.order_key
        order_key = ''
        write_self_node_to_result = False
        if self.children:
            for child in self.children:
                if self.filters['node_order'] and child.node.location < self.node.location:
                    if self.filters['dependency_type']:
                        separator = ' <' + child.node.deprel + ' '
                    else:
                        separator = ' < '
                    order_key += child.get_order_key() + separator
                else:
                    if not write_self_node_to_result:
                        write_self_node_to_result = True
                        order_key += str(self.node.location)
                    if self.filters['dependency_type']:
                        separator = ' >' + child.node.deprel + ' '
                    else:
                        separator = ' > '
                    order_key += separator + child.get_order_key()
            if not write_self_node_to_result:
                order_key += str(self.node.location)
            self.order_key = '(' + order_key + ')'
        else:
            self.order_key = str(self.node.location)
        return self.order_key

    def get_order(self):
        if self.order:
            return self.order
        order = []
        write_self_node_to_result = False
        if self.children:
            for child in self.children:
                if self.filters['node_order'] and child.node.location < self.node.location:
                    order += child.get_order()
                else:
                    if not write_self_node_to_result:
                        write_self_node_to_result = True
                        order += [self.node.location]
                    order += child.get_order()

            if not write_self_node_to_result:
                order += [self.node.location]
            self.order = order
        else:
            self.order = [self.node.location]
        return self.order

    def get_array(self):
        if self.array:
            return self.array
        array = []
        write_self_node_to_result = False
        if self.children:
            for child in self.children:
                if self.filters['node_order'] and child.node.location < self.node.location:
                    array += child.get_array()
                else:
                    if not write_self_node_to_result:
                        write_self_node_to_result = True
                        array += [self.node.name_parts]
                    array += child.get_array()

            if not write_self_node_to_result:
                array += [self.node.name_parts]
            self.array = array
        else:
            self.array = [self.node.name_parts]
        return self.array

    def finalize_result(self):
        result = copy.copy(self)
        result.reset_params()

        # create order letters
        order = result.get_order()
        order_letters = [''] * len(result.order)
        for i in range(len(order)):
            ind = order.index(min(order))
            order[ind] = 10000
            order_letters[ind] = string.ascii_uppercase[i]
        result.order = ''.join(order_letters)
        # TODO When tree is finalized create relative word order (alphabet)!
        return result
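
The "order letters" step in `finalize_result` can be easier to follow in isolation. The sketch below is a standalone restatement, not the original code: `order_letters` is a hypothetical function name, and it ranks positions with `sorted` instead of overwriting the minimum with the `10000` sentinel. Each node receives a letter according to the rank of its position in the sentence, listed in the order the nodes appear in the tree key.

```python
import string


def order_letters(locations):
    """Map token positions to letters by rank: smallest position -> 'A', next -> 'B', ..."""
    # indices of `locations`, sorted by the position they hold
    ranked = sorted(range(len(locations)), key=lambda i: locations[i])
    letters = [''] * len(locations)
    for rank, idx in enumerate(ranked):
        letters[idx] = string.ascii_uppercase[rank]
    return ''.join(letters)


# Nodes listed in key order sit at sentence positions 5, 2 and 9.
print(order_letters([5, 2, 9]))  # BAC
```

Like the original, this sketch only has 26 letters to hand out, so it assumes trees of at most 26 nodes.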

@ -1,393 +0,0 @@
import sys
from copy import copy

from ResultNode import ResultNode
from ResultTree import ResultTree
from Value import Value
from generic import generate_key


class Tree(object):
    def __init__(self, index, form, lemma, upos, xpos, deprel, feats_detailed, form_dict, lemma_dict, upos_dict, xpos_dict, deprel_dict, feats_dict, feats_detailed_dict, head):
        if not hasattr(self, 'feats'):
            self.feats_detailed = {}

        if form not in form_dict:
            form_dict[form] = Value(form)
        self.form = form_dict[form]
        if lemma not in lemma_dict:
            lemma_dict[lemma] = Value(lemma)
        self.lemma = lemma_dict[lemma]
        if upos not in upos_dict:
            upos_dict[upos] = Value(upos)
        self.upos = upos_dict[upos]
        if xpos not in xpos_dict:
            xpos_dict[xpos] = Value(xpos)
        self.xpos = xpos_dict[xpos]
        if deprel not in deprel_dict:
            deprel_dict[deprel] = Value(deprel)
        self.deprel = deprel_dict[deprel]
        for feat in feats_detailed.keys():
            if feat not in feats_detailed_dict:
                feats_detailed_dict[feat] = {}
            if next(iter(feats_detailed[feat])) not in feats_detailed_dict[feat]:
                feats_detailed_dict[feat][next(iter(feats_detailed[feat]))] = Value(next(iter(feats_detailed[feat])))
            if not feat in self.feats_detailed:
                self.feats_detailed[feat] = {}
            self.feats_detailed[feat][next(iter(feats_detailed[feat]))] = feats_detailed_dict[feat][next(iter(feats_detailed[feat]))]

        self.parent = head
        self.children = []
        self.children_split = -1

        self.index = index

        # for caching answers to questions
        self.cache = {}

    def add_child(self, child):
        self.children.append(child)

    def set_parent(self, parent):
        self.parent = parent

    def fits_static_requirements_feats(self, query_tree):
        if 'feats_detailed' not in query_tree:
            return True

        for feat in query_tree['feats_detailed'].keys():
            if feat not in self.feats_detailed or query_tree['feats_detailed'][feat] != next(iter(self.feats_detailed[feat].values())).get_value():
                return False

        return True

    def fits_permanent_requirements(self, filters):
        main_attributes = ['deprel', 'feats', 'form', 'lemma', 'upos']

        if not filters['root_whitelist']:
            return True

        for option in filters['root_whitelist']:
            filter_passed = True

            # check if attributes are valid
            for key in option.keys():
                if key not in main_attributes:
                    if key not in self.feats_detailed or option[key] != list(self.feats_detailed[key].items())[0][1].get_value():
                        filter_passed = False

            filter_passed = filter_passed and \
                            ('deprel' not in option or option['deprel'] == self.deprel.get_value()) and \
                            ('form' not in option or option['form'] == self.form.get_value()) and \
                            ('lemma' not in option or option['lemma'] == self.lemma.get_value()) and \
                            ('upos' not in option or option['upos'] == self.upos.get_value())

            if filter_passed:
                return True

        return False

    def fits_temporary_requirements(self, filters):
        return not filters['label_whitelist'] or self.deprel.get_value() in filters['label_whitelist']

    def fits_static_requirements(self, query_tree, filters):
        return ('form' not in query_tree or query_tree['form'] == self.form.get_value()) and \
               ('lemma' not in query_tree or query_tree['lemma'] == self.lemma.get_value()) and \
               ('upos' not in query_tree or query_tree['upos'] == self.upos.get_value()) and \
               ('xpos' not in query_tree or query_tree['xpos'] == self.xpos.get_value()) and \
               ('deprel' not in query_tree or query_tree['deprel'] == self.deprel.get_value()) and \
               (not filters['complete_tree_type'] or (len(self.children) == 0 and 'children' not in query_tree) or ('children' in query_tree and len(self.children) == len(query_tree['children']))) and \
               self.fits_static_requirements_feats(query_tree)

    def generate_children_queries(self, all_query_indices, children):
        partial_results = {}
        # list of pairs (index of query in group, group of query, is permanent)
        child_queries_metadata = []
        for child_index, child in enumerate(children):
            new_queries = []

            # add continuation queries to children
            for result_part_index, result_index, is_permanent in child_queries_metadata:
                if result_index in partial_results and result_part_index in partial_results[result_index] and len(partial_results[result_index][result_part_index]) > 0:
                    if len(all_query_indices[result_index][0]) > result_part_index + 1:
                        new_queries.append((result_part_index + 1, result_index, is_permanent))

            child_queries_metadata = new_queries

            # add new queries to children
            for result_index, (group, is_permanent) in enumerate(all_query_indices):
                # check if node has enough children for query to be possible
                if len(children) - len(group) >= child_index:
                    child_queries_metadata.append((0, result_index, is_permanent))

            child_queries = []
            for result_part_index, result_index, _ in child_queries_metadata:
                child_queries.append(all_query_indices[result_index][0][result_part_index])

            partial_results = yield child, child_queries, child_queries_metadata
        yield None, None, None

    def add_subtrees(self, old_subtree, new_subtree):
        old_subtree.extend(new_subtree)

    def get_all_query_indices(self, temporary_query_nb, permanent_query_nb, permanent_query_trees, all_query_indices, children, create_output_string, filters):
        partial_answers = [[] for i in range(permanent_query_nb + temporary_query_nb)]
        complete_answers = [[] for i in range(permanent_query_nb)]

        # list of pairs (index of query in group, group of query)
        # TODO try to erase!!!
        child_queries = [all_query_indice[0] for all_query_indice in all_query_indices]

        answers_lengths = [len(query) for query in child_queries]

        child_queries_flatten = [query_part for query in child_queries for query_part in query]

        all_new_partial_answers = [[] for query_part in child_queries_flatten]

        # deduplicate query parts so each distinct part is sent to a child only once
        child_queries_flatten_dedup = []
        child_queries_flatten_dedup_indices = []
        for query_part in child_queries_flatten:
            try:
                index = child_queries_flatten_dedup.index(query_part)
            except ValueError:
                index = len(child_queries_flatten_dedup)
                child_queries_flatten_dedup.append(query_part)

            child_queries_flatten_dedup_indices.append(index)

        # ask children all queries/partial queries
        for child in children:
            # obtain children results
            new_partial_answers_dedup, new_complete_answers = child.get_subtrees(permanent_query_trees, child_queries_flatten_dedup,
                                                                                 create_output_string, filters)

            assert len(new_partial_answers_dedup) == len(child_queries_flatten_dedup)

            # copy deduplicated results back to their original positions
            for i, flattened_index in enumerate(child_queries_flatten_dedup_indices):
                all_new_partial_answers[i].append(new_partial_answers_dedup[flattened_index])

            for i in range(len(new_complete_answers)):
                # TODO add order rearrangement (TO KEY)
                complete_answers[i].extend(new_complete_answers[i])

        # merge answers in appropriate way
        i = 0
        # iterate over all answers per queries
        for answer_i, answer_length in enumerate(answers_lengths):
            # iterate over answers of query
            # TODO ERROR IN HERE!
            partial_answers[answer_i] = self.create_answers(all_new_partial_answers[i:i + answer_length], answer_length, filters)
            i += answer_length

        return partial_answers, complete_answers

    def order_dependent_queries(self, active_permanent_query_trees, active_temporary_query_trees, partial_subtrees,
                                create_output_string, merged_partial_subtrees, i_query, i_answer, filters):
        node = ResultNode(self, self.index, create_output_string)

        if i_query < len(active_permanent_query_trees):
            if 'children' in active_permanent_query_trees[i_query]:
                merged_partial_subtrees.append(
                    self.create_output_children(partial_subtrees[i_answer], [ResultTree(node, [], filters)], filters))
                i_answer += 1
            else:
                merged_partial_subtrees.append([ResultTree(node, [], filters)])
        else:
            if 'children' in active_temporary_query_trees[i_query - len(active_permanent_query_trees)]:
                merged_partial_subtrees.append(
                    self.create_output_children(partial_subtrees[i_answer], [ResultTree(node, [], filters)], filters))
                i_answer += 1
            else:
                merged_partial_subtrees.append([ResultTree(node, [], filters)])

        return i_answer

    def get_unigrams(self, create_output_strings, filters):
        unigrams = [generate_key(self, create_output_strings, print_lemma=False)[1]]
        for child in self.children:
            unigrams += child.get_unigrams(create_output_strings, filters)
        return unigrams

    def get_subtrees(self, permanent_query_trees, temporary_query_trees, create_output_string, filters):
        """
        :param permanent_query_trees:
        :param temporary_query_trees:
        """

        # list of all children queries grouped by parent queries
        all_query_indices = []

        active_permanent_query_trees = []
        for permanent_query_tree in permanent_query_trees:
            if self.fits_static_requirements(permanent_query_tree, filters) and self.fits_permanent_requirements(filters):
                active_permanent_query_trees.append(permanent_query_tree)
                if 'children' in permanent_query_tree:
                    all_query_indices.append((permanent_query_tree['children'], True))
                    # r_all_query_indices.append((permanent_query_tree['r_children'], True))

        active_temporary_query_trees = []
        successful_temporary_queries = []
        for i, temporary_query_tree in enumerate(temporary_query_trees):
            if self.fits_static_requirements(temporary_query_tree, filters) and self.fits_temporary_requirements(filters):
                active_temporary_query_trees.append(temporary_query_tree)
                successful_temporary_queries.append(i)
                if 'children' in temporary_query_tree:
                    all_query_indices.append((temporary_query_tree['children'], False))

        partial_subtrees, complete_answers = self.get_all_query_indices(len(temporary_query_trees),
                                                                        len(permanent_query_trees),
                                                                        permanent_query_trees,
                                                                        all_query_indices, self.children,
                                                                        create_output_string, filters)

        merged_partial_answers = []
        i_question = 0
        # i_answer is necessary because some queries may be answered at the beginning and were not passed to children.
        # i_answer points to the current position inside the answers
        i_answer = 0
        # go over all permanent and temporary query trees
        while i_question < len(active_permanent_query_trees) + len(active_temporary_query_trees):
            # permanent query trees always have left and right child
            i_answer = self.order_dependent_queries(active_permanent_query_trees, active_temporary_query_trees, partial_subtrees,
                                                    create_output_string, merged_partial_answers, i_question, i_answer, filters)

            i_question += 1

        for i in range(len(active_permanent_query_trees)):
            # TODO FINALIZE RESULT
            # erase first and last brackets when adding new query result
            add_subtree = [subtree.finalize_result() for subtree in merged_partial_answers[i]]
            complete_answers[i].extend(add_subtree)

        # answers to valid queries
        partial_answers = [[] for i in range(len(temporary_query_trees))]
        for inside_i, outside_i in enumerate(successful_temporary_queries):
            partial_answers[outside_i] = merged_partial_answers[
                len(active_permanent_query_trees) + inside_i]

        return partial_answers, complete_answers

    @staticmethod
    def create_children_groups(left_parts, right_parts):
        if not left_parts:
            return right_parts

        if not right_parts:
            return left_parts

        all_children_group_possibilities = []
        for left_part in left_parts:
            for right_part in right_parts:
                new_part = copy(left_part)
                new_part.extend(right_part)
                all_children_group_possibilities.append(new_part)
        return all_children_group_possibilities

    @staticmethod
    def merge_answer(answer1, answer2, base_answer_i, answer_j):
        merged_results = []
        merged_indices = []
        for answer1p_i, old_result in enumerate(answer1):
            for answer2p_i, new_result in enumerate(answer2):
                if answer1p_i != answer2p_i:
                    new_indices = [answer1p_i] + [answer2p_i]
                    # TODO add comparison answers with different indices if equal than ignore
                    merged_results.append(old_result + new_result)
                    merged_indices.append(new_indices)
        return merged_results, merged_indices

    def merge_results3(self, child, new_results, filters):
        if filters['node_order']:
            new_child = child
        else:
            new_child = sorted(child, key=lambda x: x[0].get_key())

        children_groups = []

        for i_answer, answer in enumerate(new_child):
            children_groups = self.create_children_groups(children_groups, [[answer_part] for answer_part in answer])

        results = []
        for result in new_results:
            for children in children_groups:
                new_result = copy(result)
                new_result.set_children(children)
                results.append(new_result)

        return results

    def create_output_children(self, children, new_results, filters):
        merged_results = []
        for i_child, child in enumerate(children):
            merged_results.extend(self.merge_results3(child, new_results, filters))
        return merged_results

    def create_answers(self, separated_answers, answer_length, filters):
        partly_built_trees = [[None] * answer_length]
        partly_built_trees_architecture_indices = [[None] * answer_length]
        built_trees = []
        built_trees_architecture_indices = []

        # iterate over children first, so that new partly built trees are added only after all results of specific
        # child are added
        for child_i in range(len(separated_answers[0])):
            new_partly_built_trees = []
            new_partly_built_trees_architecture_indices = []
            # iterate over answers parts
            for answer_part_i in range(len(separated_answers)):
                # necessary because some parts do not pass filters and are not added
                if separated_answers[answer_part_i][child_i]:
                    for tree_part_i, tree_part in enumerate(partly_built_trees):
                        if not tree_part[answer_part_i]:
                            new_tree_part = copy(tree_part)
                            new_tree_part_architecture_indices = copy(partly_built_trees_architecture_indices[tree_part_i])
                            new_tree_part[answer_part_i] = separated_answers[answer_part_i][child_i]
                            new_tree_part_architecture_indices[answer_part_i] = child_i
                            completed_tree_part = True
                            for val_i, val in enumerate(new_tree_part):
                                if not val:
                                    completed_tree_part = False
                            if completed_tree_part:
                                built_trees.append(new_tree_part)
                                built_trees_architecture_indices.append(new_tree_part_architecture_indices)
                            else:
                                new_partly_built_trees.append(new_tree_part)
                                new_partly_built_trees_architecture_indices.append(new_tree_part_architecture_indices)
                        else:
                            # pass over repetitions of same words
                            pass

            partly_built_trees.extend(new_partly_built_trees)
            partly_built_trees_architecture_indices.extend(new_partly_built_trees_architecture_indices)

        l_ordered_built_trees, unique_trees_architecture = [], []

        if built_trees:
            # sort 3 arrays by architecture indices
            temp_trees_index, temp_trees = (list(t) for t in zip(
                *sorted(zip(built_trees_architecture_indices, built_trees))))

            # order outputs and erase duplicates
            for tree, tree_index in zip(temp_trees, temp_trees_index):
                new_tree_index, new_tree = (list(t) for t in zip(*sorted(zip(tree_index, tree))))
                # TODO check if inside new_tree_architecture in ordered_built_trees_architecture and if not append!
                is_unique = True
                for unique_tree in unique_trees_architecture:
                    already_in = True
                    for part_i in range(len(unique_tree)):
                        if len(unique_tree[part_i]) != len(new_tree[part_i]) or any(unique_tree[part_i][i_unique_part].get_order_key() != new_tree[part_i][i_unique_part].get_order_key() for i_unique_part in range(len(unique_tree[part_i]))):
                            already_in = False
                            break
                    if already_in:
                        is_unique = False
                        break

                if is_unique:
                    unique_trees_architecture.append(new_tree)
                    l_ordered_built_trees.append(new_tree)
        return l_ordered_built_trees
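
The way `merge_results3` builds `children_groups` by repeatedly calling `create_children_groups` amounts to taking one answer per child in every possible combination. The sketch below shows that idea with `itertools.product`; it is an illustrative restatement using a hypothetical function name (`children_groups`) and plain strings in place of `ResultTree` objects, not a drop-in replacement.

```python
from itertools import product


def children_groups(per_child_answers):
    """Every way of picking one answer per child, kept in child order."""
    return [list(choice) for choice in product(*per_child_answers)]


# The first child matched in two ways, the second child in one way.
print(children_groups([['a1', 'a2'], ['b1']]))
# [['a1', 'b1'], ['a2', 'b1']]
```

One difference worth noting: when a child has no answers, `create_children_groups` returns the other side unchanged and effectively skips that child, whereas `product` would yield no combinations at all.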

@ -1,20 +0,0 @@
# Copyright 2019 CJVT
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
class Value(object):
    def __init__(self, value):
        self.value = value

    def get_value(self):
        return self.value
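
`Value` is a thin wrapper, and its purpose becomes clearer next to the per-attribute dictionaries that `Tree.__init__` receives (`form_dict`, `upos_dict`, and so on): each distinct string is wrapped once and then shared by every node that carries it. A minimal sketch of that pattern follows; `intern_value` is a hypothetical helper introduced here for illustration, and the class is repeated so the example runs on its own.

```python
class Value(object):
    def __init__(self, value):
        self.value = value

    def get_value(self):
        return self.value


def intern_value(raw, cache):
    """Return the shared Value for `raw`, creating it on first use."""
    if raw not in cache:
        cache[raw] = Value(raw)
    return cache[raw]


upos_cache = {}
a = intern_value('NOUN', upos_cache)
b = intern_value('NOUN', upos_cache)
c = intern_value('VERB', upos_cache)

# Two nodes tagged NOUN share one object; VERB gets its own.
print(a is b, a is c, len(upos_cache))  # True False 2
```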

@ -1,28 +0,0 @@
[settings]
;___GENERAL SETTINGS___
input = data/sl_ssj-ud_v2.4.conllu
output = results/out_official.tsv
internal_saves = ./internal_saves
cpu_cores = 12
;___TREE SPECIFICATIONS___
tree_size = 2-4
tree_type = complete
dependency_type = labeled
node_order = free
node_type = upos
;___TREE RESTRICTIONS___
;label_whitelist = nsubj|obj|obl
;root_whitelist = lemma=mati&Case=Acc|lemma=lep&Degree=Sup
;___SEARCH BY QUERY___
;query = _ > _
;___OUTPUT SETTINGS___
;lines_threshold = 10000
;frequency_threshold = 0
association_measures = no
print_root = yes
nodes_number = yes
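
The commented-out `root_whitelist` value above suggests a small grammar: `|` separates alternative root descriptions and `&` joins the conditions within one of them. That reading is an assumption, inferred from the sample value and from `fits_permanent_requirements`, which iterates over a list of option dictionaries. Under that assumption, a parser could look like the sketch below; `parse_root_whitelist` is a hypothetical name, not a function from this repository.

```python
def parse_root_whitelist(raw):
    """Split 'a=1&b=2|c=3' into [{'a': '1', 'b': '2'}, {'c': '3'}]."""
    options = []
    for option in raw.split('|'):
        attrs = {}
        for pair in option.split('&'):
            # split on the first '=' only, so values may themselves contain '='
            key, value = pair.split('=', 1)
            attrs[key] = value
        options.append(attrs)
    return options


print(parse_root_whitelist('lemma=mati&Case=Acc|lemma=lep&Degree=Sup'))
# [{'lemma': 'mati', 'Case': 'Acc'}, {'lemma': 'lep', 'Degree': 'Sup'}]
```

With this shape, a root node passes the filter if it satisfies every condition of at least one option.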

@ -1,97 +0,0 @@
# Copyright 2019 CJVT
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math
import sys


def create_output_string_form(tree):
    return tree.form.get_value()


def create_output_string_deprel(tree):
    return tree.deprel.get_value()


def create_output_string_lemma(tree):
    return tree.lemma.get_value() if tree.lemma.get_value() is not None else '_'


def create_output_string_upos(tree):
    return tree.upos.get_value()


def create_output_string_xpos(tree):
    return tree.xpos.get_value()


def create_output_string_feats(tree):
    return tree.feats.get_value()


def generate_key(node, create_output_strings, print_lemma=True):
    array = [[create_output_string(node) for create_output_string in create_output_strings]]
    if create_output_string_lemma in create_output_strings and print_lemma:
        key_array = [[create_output_string(
            node) if create_output_string != create_output_string_lemma else 'L=' + create_output_string(node) for
                      create_output_string in create_output_strings]]
    else:
        key_array = array
    if len(array[0]) > 1:
        key = '&'.join(key_array[0])
    else:
        key = key_array[0][0]

    return array, key


def generate_name(node, create_output_strings, print_lemma=True):
    array = [create_output_string(node) for create_output_string in create_output_strings]
    if create_output_string_lemma in create_output_strings and print_lemma:
        name_array = [create_output_string(
            node) if create_output_string != create_output_string_lemma else 'L=' + create_output_string(node) for
                      create_output_string in create_output_strings]
    else:
        name_array = array
    if len(array) > 1:
        name = '&'.join(name_array)
    else:
        name = name_array[0]

    return array, name


def get_collocabilities(ngram, unigrams_dict, corpus_size):
    sum_fwi = 0.0
    mul_fwi = 1.0
    for key_array in ngram['object'].array:
        # create key for unigrams
        if len(key_array) > 1:
            key = '&'.join(key_array)
        else:
            key = key_array[0]
        sum_fwi += unigrams_dict[key]
        mul_fwi *= unigrams_dict[key]

    if mul_fwi < 0:
        mul_fwi = sys.maxsize

    # number of all words
    N = corpus_size

    # n of ngram
    n = len(ngram['object'].array)
    O = ngram['number']
    E = mul_fwi / pow(N, n-1)

    # ['MI', 'MI3', 'Dice', 'logDice', 't-score', 'simple-LL']
    mi = math.log(O / E, 2)
    mi3 = math.log(pow(O, 3) / E, 2)
    dice = n * O / sum_fwi
    logdice = 14 + math.log(dice, 2)
    tscore = (O - E) / math.sqrt(O)
    simplell = 2 * (O * math.log10(O / E) - (O - E))
    return ['%.4f' % mi, '%.4f' % mi3, '%.4f' % dice, '%.4f' % logdice, '%.4f' % tscore, '%.4f' % simplell]
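
To make the quantities in `get_collocabilities` concrete, here is a small worked example with made-up frequencies. It recomputes the observed/expected ratio and five of the scores for a two-node tree. The function name `association_scores` and the numbers are illustrative assumptions, and simple-LL is left out of the sketch.

```python
import math


def association_scores(observed, unigram_freqs, corpus_size):
    """Recompute MI, MI3, Dice, logDice and t-score from raw counts."""
    n = len(unigram_freqs)
    # expected frequency if the nodes were independent: prod(f_i) / N^(n-1)
    expected = math.prod(unigram_freqs) / corpus_size ** (n - 1)
    dice = n * observed / sum(unigram_freqs)
    return {
        'expected': expected,
        'MI': math.log2(observed / expected),
        'MI3': math.log2(observed ** 3 / expected),
        'Dice': dice,
        'logDice': 14 + math.log2(dice),
        't-score': (observed - expected) / math.sqrt(observed),
    }


# Hypothetical counts: the tree occurs 8 times, its two nodes occur
# 40 and 20 times, and the corpus has 1000 tokens.
scores = association_scores(observed=8, unigram_freqs=[40, 20], corpus_size=1000)
for name, value in scores.items():
    print(name, '%.4f' % value)
```

Here the expected frequency is 40 * 20 / 1000 = 0.8, so the tree is observed ten times more often than chance would predict, which gives MI = log2(10). `math.prod` requires Python 3.8 or later.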

@ -1 +0,0 @@
py -3 -m pip install -r requirements.txt &

Binary file not shown (76 KiB).

Binary file not shown (28 KiB).

Binary file not shown (44 KiB).

File diff suppressed because it is too large (314 KiB).

Binary file not shown (128 KiB).

@ -1 +0,0 @@
pyconll==3.1.0

@ -1,17 +0,0 @@
import os
import subprocess
from pathlib import Path

input_path = '/home/lukakrsnik/STARK/data/ud-treebanks-v2.11/'
output_path = '/home/lukakrsnik/STARK/results/ud-treebanks-v2.11_B/'
config_path = '/home/lukakrsnik/STARK/data/B_test-all-treebanks_3_completed_unlabeled_fixed_form_root=NOUN_5.ini'

# walk every treebank directory and process each .conllu file it contains
for treebank_dir in sorted(os.listdir(input_path)):
    path_obj = Path(input_path, treebank_dir)
    pathlist = path_obj.glob('**/*.conllu')
    for path in sorted(pathlist):
        folder_name = os.path.join(output_path, path.parts[-2])
        file_name = os.path.join(folder_name, path.name)
        # create the per-treebank output folder on first use
        os.makedirs(folder_name, exist_ok=True)
        # skip files that already have results from an earlier run
        if not os.path.exists(file_name):
            subprocess.run(["python", "/home/luka/Development/STARK/stark.py", "--config_file", config_path,
                            "--input", str(path), "--output", file_name])

@ -1,2 +0,0 @@
py -3 dependency-parsetree.py --config_file=config.ini &
@pause

@ -1,3 +0,0 @@
source venv/bin/activate
python3 dependency-parsetree.py --config_file="$1"
deactivate

@ -1,617 +0,0 @@
#!/usr/bin/env python
# Copyright 2019 CJVT
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import configparser
import copy
import csv
import hashlib
import math
import os
import pickle
import re
import string
import time
from multiprocessing import Pool
from pathlib import Path
import gzip
import sys
import pyconll
from Tree import Tree
from generic import get_collocabilities, create_output_string_form, create_output_string_deprel, create_output_string_lemma, create_output_string_upos, create_output_string_xpos, create_output_string_feats
sys.setrecursionlimit(25000)
def save_zipped_pickle(obj, filename, protocol=-1):
    with gzip.open(filename, 'wb') as f:
        pickle.dump(obj, f, protocol)
def load_zipped_pickle(filename):
    with gzip.open(filename, 'rb') as f:
        loaded_object = pickle.load(f)
        return loaded_object
def decode_query(orig_query, dependency_type, feats_detailed_list):
    new_query = False
    # if the command is wrapped in parentheses, strip them and treat it as a new query
    if orig_query[0] == '(' and orig_query[-1] == ')':
        new_query = True
        orig_query = orig_query[1:-1]
    if dependency_type != '':
        decoded_query = {'deprel': dependency_type}
    else:
        decoded_query = {}
    if orig_query == '_':
        return decoded_query
    # a query without spaces describes a single node; otherwise it is split further below
    elif len(orig_query.split(' ')) == 1:
        orig_query_split_parts = orig_query.split(' ')[0].split('&')
        for orig_query_split_part in orig_query_split_parts:
            orig_query_split = orig_query_split_part.split('=', 1)
            if len(orig_query_split) > 1:
                if orig_query_split[0] == 'L':
                    decoded_query['lemma'] = orig_query_split[1]
                elif orig_query_split[0] == 'upos':
                    decoded_query['upos'] = orig_query_split[1]
                elif orig_query_split[0] == 'xpos':
                    decoded_query['xpos'] = orig_query_split[1]
                elif orig_query_split[0] == 'form':
                    decoded_query['form'] = orig_query_split[1]
                elif orig_query_split[0] == 'feats':
                    decoded_query['feats'] = orig_query_split[1]
                elif orig_query_split[0] in feats_detailed_list:
                    decoded_query['feats_detailed'] = {}
                    decoded_query['feats_detailed'][orig_query_split[0]] = orig_query_split[1]
                    return decoded_query
                elif not new_query:
                    raise Exception('Not supported yet!')
                else:
                    print('???')
            elif not new_query:
                decoded_query['form'] = orig_query_split_part
        return decoded_query
    # split on whitespace that is not inside parentheses
    all_orders = re.split(r"\s+(?=[^()]*(?:\(|$))", orig_query)
    node_actions = all_orders[::2]
    priority_actions = all_orders[1::2]
    priority_actions_beginnings = [a[0] for a in priority_actions]
    # find root index
    try:
        root_index = priority_actions_beginnings.index('>')
    except ValueError:
        root_index = len(priority_actions)
    children = []
    root = None
    for i, node_action in enumerate(node_actions):
        if i < root_index:
            children.append(decode_query(node_action, priority_actions[i][1:], feats_detailed_list))
        elif i > root_index:
            children.append(decode_query(node_action, priority_actions[i - 1][1:], feats_detailed_list))
        else:
            root = decode_query(node_action, dependency_type, feats_detailed_list)
    if children:
        root["children"] = children
    return root
def create_trees(input_path, internal_saves, feats_detailed_dict={}, save=True):
    hash_object = hashlib.sha1(input_path.encode('utf-8'))
    hex_dig = hash_object.hexdigest()
    trees_read_outputfile = os.path.join(internal_saves, hex_dig)
    print(Path(input_path).name)
    if not os.path.exists(trees_read_outputfile) or not save:
        train = pyconll.load_from_file(input_path)
        form_dict, lemma_dict, upos_dict, xpos_dict, deprel_dict, feats_dict = {}, {}, {}, {}, {}, {}
        all_trees = []
        corpus_size = 0
        for sentence in train:
            root = None
            token_nodes = []
            for token in sentence:
                if not token.id.isdigit():
                    continue
                # TODO check if 5th place is always there for feats
                token_form = token.form if token.form is not None else '_'
                node = Tree(int(token.id), token_form, token.lemma, token.upos, token.xpos, token.deprel, token.feats, form_dict,
                            lemma_dict, upos_dict, xpos_dict, deprel_dict, feats_dict, feats_detailed_dict, token.head)
                token_nodes.append(node)
                if token.deprel == 'root':
                    root = node
                corpus_size += 1
            for token_id, token in enumerate(token_nodes):
                if isinstance(token.parent, int) or token.parent == '':
                    root = None
                    print('No parent: ' + sentence.id)
                    break
                if int(token.parent) == 0:
                    token.set_parent(None)
                else:
                    parent_id = int(token.parent) - 1
                    if token_nodes[parent_id].children_split == -1 and token_id > parent_id:
                        token_nodes[parent_id].children_split = len(token_nodes[parent_id].children)
                    token_nodes[parent_id].add_child(token)
                    token.set_parent(token_nodes[parent_id])
            for token in token_nodes:
                if token.children_split == -1:
                    token.children_split = len(token.children)
            if root is None:
                print('No root: ' + sentence.id)
                continue
            all_trees.append(root)
        if save:
            save_zipped_pickle((all_trees, form_dict, lemma_dict, upos_dict, xpos_dict, deprel_dict, corpus_size, feats_detailed_dict), trees_read_outputfile, protocol=2)
    else:
        print('Reading trees:')
        print('Completed')
        all_trees, form_dict, lemma_dict, upos_dict, xpos_dict, deprel_dict, corpus_size, feats_detailed_dict = load_zipped_pickle(trees_read_outputfile)
    return all_trees, form_dict, lemma_dict, upos_dict, xpos_dict, deprel_dict, corpus_size, feats_detailed_dict
def printable_answers(query):
    # all_orders = re.findall(r"(?:[^ ()]|\([^]*\))+", query)
    all_orders = re.split(r"\s+(?=[^()]*(?:\(|$))", query)
    # all_orders = orig_query.split()
    node_actions = all_orders[::2]
    # priority_actions = all_orders[1::2]
    if len(node_actions) > 1:
        res = []
        # for node_action in node_actions[:-1]:
        #     res.extend(printable_answers(node_action[1:-1]))
        # res.extend([node_actions[-1]])
        for node_action in node_actions:
            # if the command is wrapped in parentheses, strip them and treat it as a new query
            # TODO: handle parentheses in a better way
            if not node_action:
                res.extend(['('])
            elif node_action[0] == '(' and node_action[-1] == ')':
                res.extend(printable_answers(node_action[1:-1]))
            else:
                res.extend([node_action])
        return res
    else:
        return [query]
def tree_calculations(input_data):
    tree, query_tree, create_output_string_funct, filters = input_data
    _, subtrees = tree.get_subtrees(query_tree, [], create_output_string_funct, filters)
    return subtrees
def get_unigrams(input_data):
    tree, query_tree, create_output_string_funct, filters = input_data
    unigrams = tree.get_unigrams(create_output_string_funct, filters)
    return unigrams
def tree_calculations_chunks(input_data):
    trees, query_tree, create_output_string_funct, filters = input_data
    result_dict = {}
    for tree in trees:
        _, subtrees = tree.get_subtrees(query_tree, [], create_output_string_funct, filters)
        for query_results in subtrees:
            for r in query_results:
                if r in result_dict:
                    result_dict[r] += 1
                else:
                    result_dict[r] = 1
    return result_dict
def add_node(tree):
    if 'children' in tree:
        tree['children'].append({})
    else:
        tree['children'] = [{}]
# walk over all nodes in the tree and add one new child at each possible position
def tree_grow(orig_tree):
    new_trees = []
    new_tree = copy.deepcopy(orig_tree)
    add_node(new_tree)
    new_trees.append(new_tree)
    if 'children' in orig_tree:
        children = []
        for child_tree in orig_tree['children']:
            children.append(tree_grow(child_tree))
        for i, child in enumerate(children):
            for child_res in child:
                new_tree = copy.deepcopy(orig_tree)
                new_tree['children'][i] = child_res
                new_trees.append(new_tree)
    return new_trees
def compare_trees(tree1, tree2):
    if tree1 == {} and tree2 == {}:
        return True
    if 'children' not in tree1 or 'children' not in tree2 or len(tree1['children']) != len(tree2['children']):
        return False
    children2_connections = []
    for child1_i, child1 in enumerate(tree1['children']):
        child_duplicated = False
        for child2_i, child2 in enumerate(tree2['children']):
            # skip children of tree2 that are already matched (was `pass`, which made this check a no-op)
            if child2_i in children2_connections:
                continue
            if compare_trees(child1, child2):
                children2_connections.append(child2_i)
                child_duplicated = True
                break
        if not child_duplicated:
            return False
    return True
def create_ngrams_query_trees(n, trees):
    for i in range(n - 1):
        new_trees = []
        for tree in trees:
            # append new_tree only if it is not already inside
            for new_tree in tree_grow(tree):
                duplicate = False
                for confirmed_new_tree in new_trees:
                    if compare_trees(new_tree, confirmed_new_tree):
                        duplicate = True
                        break
                if not duplicate:
                    new_trees.append(new_tree)
        trees = new_trees
    return trees
def count_trees(cpu_cores, all_trees, query_tree, create_output_string_functs, filters, unigrams_dict, result_dict):
    with Pool(cpu_cores) as p:
        if cpu_cores > 1:
            all_unigrams = p.map(get_unigrams, [(tree, query_tree, create_output_string_functs, filters) for tree in all_trees])
            for unigrams in all_unigrams:
                for unigram in unigrams:
                    if unigram in unigrams_dict:
                        unigrams_dict[unigram] += 1
                    else:
                        unigrams_dict[unigram] = 1
            all_subtrees = p.map(tree_calculations, [(tree, query_tree, create_output_string_functs, filters) for tree in all_trees])
            for tree_i, subtrees in enumerate(all_subtrees):
                for query_results in subtrees:
                    for r in query_results:
                        if filters['node_order']:
                            key = r.get_key() + r.order
                        else:
                            key = r.get_key()
                        if key in result_dict:
                            result_dict[key]['number'] += 1
                        else:
                            result_dict[key] = {'object': r, 'number': 1}
        # 3.65 s (1 core)
        else:
            for tree_i, tree in enumerate(all_trees):
                input_data = (tree, query_tree, create_output_string_functs, filters)
                if filters['association_measures']:
                    unigrams = get_unigrams(input_data)
                    for unigram in unigrams:
                        if unigram in unigrams_dict:
                            unigrams_dict[unigram] += 1
                        else:
                            unigrams_dict[unigram] = 1
                subtrees = tree_calculations(input_data)
                for query_results in subtrees:
                    for r in query_results:
                        if filters['node_order']:
                            key = r.get_key() + r.order
                        else:
                            key = r.get_key()
                        if key in result_dict:
                            result_dict[key]['number'] += 1
                        else:
                            result_dict[key] = {'object': r, 'number': 1}
def read_filters(config, args, feats_detailed_list):
    tree_size = config.get('settings', 'tree_size', fallback='0') if not args.tree_size else args.tree_size
    tree_size_range = tree_size.split('-')
    tree_size_range = [int(r) for r in tree_size_range]
    if tree_size_range[0] > 0:
        if len(tree_size_range) == 1:
            query_tree = create_ngrams_query_trees(tree_size_range[0], [{}])
        elif len(tree_size_range) == 2:
            query_tree = []
            for i in range(tree_size_range[0], tree_size_range[1] + 1):
                query_tree.extend(create_ngrams_query_trees(i, [{}]))
    else:
        query = config.get('settings', 'query') if not args.query else args.query
        query_tree = [decode_query('(' + query + ')', '', feats_detailed_list)]
    # set filters
    node_type = config.get('settings', 'node_type') if not args.node_type else args.node_type
    node_types = node_type.split('+')
    create_output_string_functs = []
    for node_type in node_types:
        assert node_type in ['deprel', 'lemma', 'upos', 'xpos', 'form', 'feats'], '"node_type" is not set up correctly'
        cpu_cores = config.getint('settings', 'cpu_cores') if not args.cpu_cores else args.cpu_cores
        if node_type == 'deprel':
            create_output_string_funct = create_output_string_deprel
        elif node_type == 'lemma':
            create_output_string_funct = create_output_string_lemma
        elif node_type == 'upos':
            create_output_string_funct = create_output_string_upos
        elif node_type == 'xpos':
            create_output_string_funct = create_output_string_xpos
        elif node_type == 'feats':
            create_output_string_funct = create_output_string_feats
        else:
            create_output_string_funct = create_output_string_form
        create_output_string_functs.append(create_output_string_funct)
    filters = {}
    filters['internal_saves'] = config.get('settings', 'internal_saves') if not args.internal_saves else args.internal_saves
    filters['input'] = config.get('settings', 'input') if not args.input else args.input
    node_order = config.get('settings', 'node_order') if not args.node_order else args.node_order
    filters['node_order'] = node_order == 'fixed'
    # filters['caching'] = config.getboolean('settings', 'caching')
    dependency_type = config.get('settings', 'dependency_type') if not args.dependency_type else args.dependency_type
    filters['dependency_type'] = dependency_type == 'labeled'
    if config.has_option('settings', 'label_whitelist'):
        label_whitelist = config.get('settings', 'label_whitelist') if not args.label_whitelist else args.label_whitelist
        filters['label_whitelist'] = label_whitelist.split('|')
    else:
        filters['label_whitelist'] = []
    root_whitelist = config.get('settings', 'root_whitelist') if not args.root_whitelist else args.root_whitelist
    if root_whitelist:
        # test
        filters['root_whitelist'] = []
        for option in root_whitelist.split('|'):
            attribute_dict = {}
            for attribute in option.split('&'):
                value = attribute.split('=')
                attribute_dict[value[0]] = value[1]
            filters['root_whitelist'].append(attribute_dict)
    else:
        filters['root_whitelist'] = []
    tree_type = config.get('settings', 'tree_type') if not args.tree_type else args.tree_type
    filters['complete_tree_type'] = tree_type == 'complete'
    filters['association_measures'] = config.getboolean('settings', 'association_measures') if not args.association_measures else args.association_measures
    filters['nodes_number'] = config.getboolean('settings', 'nodes_number') if not args.nodes_number else args.nodes_number
    filters['frequency_threshold'] = config.getfloat('settings', 'frequency_threshold', fallback=0) if not args.frequency_threshold else args.frequency_threshold
    filters['lines_threshold'] = config.getint('settings', 'lines_threshold', fallback=0) if not args.lines_threshold else args.lines_threshold
    filters['print_root'] = config.getboolean('settings', 'print_root') if not args.print_root else args.print_root
    return filters, query_tree, create_output_string_functs, cpu_cores, tree_size_range, node_types
def process(input_path, internal_saves, config, args):
    if os.path.isdir(input_path):
        checkpoint_path = Path(internal_saves, 'checkpoint.pkl')
        # the command-line value takes precedence over the config file (was `else args.input`)
        continuation_processing = config.getboolean('settings', 'continuation_processing',
                                                    fallback=False) if not args.continuation_processing else args.continuation_processing
        if not checkpoint_path.exists() or not continuation_processing:
            already_processed = set()
            result_dict = {}
            unigrams_dict = {}
            corpus_size = 0
            feats_detailed_list = {}
            if checkpoint_path.exists():
                os.remove(checkpoint_path)
        else:
            already_processed, result_dict, unigrams_dict, corpus_size, feats_detailed_list = load_zipped_pickle(
                checkpoint_path)
        for path in sorted(os.listdir(input_path)):
            path_obj = Path(input_path, path)
            pathlist = path_obj.glob('**/*.conllu')
            if path_obj.name in already_processed:
                continue
            start_exe_time = time.time()
            for path in sorted(pathlist):
                # create_trees expects a string, not a Path object
                path_str = str(path)
                (all_trees, form_dict, lemma_dict, upos_dict, xpos_dict, deprel_dict, sub_corpus_size,
                 feats_detailed_list) = create_trees(path_str, internal_saves, feats_detailed_dict=feats_detailed_list,
                                                     save=False)
                corpus_size += sub_corpus_size
                filters, query_tree, create_output_string_functs, cpu_cores, tree_size_range, node_types = read_filters(
                    config, args, feats_detailed_list)
                count_trees(cpu_cores, all_trees, query_tree, create_output_string_functs, filters, unigrams_dict,
                            result_dict)
            already_processed.add(path_obj.name)
            # 15.26
            print("Execution time:")
            print("--- %s seconds ---" % (time.time() - start_exe_time))
            save_zipped_pickle(
                (already_processed, result_dict, unigrams_dict, corpus_size, feats_detailed_list),
                checkpoint_path, protocol=2)
    else:
        # 261 - 9 grams
        # 647 - 10 grams
        # 1622 - 11 grams
        # 4126 - 12 grams
        # 10598 - 13 grams
        (all_trees, form_dict, lemma_dict, upos_dict, xpos_dict, deprel_dict, corpus_size,
         feats_detailed_list) = create_trees(input_path, internal_saves)
        result_dict = {}
        unigrams_dict = {}
        filters, query_tree, create_output_string_functs, cpu_cores, tree_size_range, node_types = read_filters(config,
                                                                                                                 args,
                                                                                                                 feats_detailed_list)
        start_exe_time = time.time()
        count_trees(cpu_cores, all_trees, query_tree, create_output_string_functs, filters, unigrams_dict, result_dict)
        print("Execution time:")