Repository moved!
parent 40aaffa632
commit b7193f9126
8	.gitignore (vendored)
@@ -1,8 +0,0 @@
.idea/
venv/
internal_saves/
__pycache__/
results/
data/
config2.ini
configs/
79	README.md
@@ -1,78 +1,3 @@
# STARK: a tool for statistical analysis of dependency-parsed corpora
STARK is a Python-based command-line tool for extracting dependency trees from parsed corpora based on various user-defined criteria. It is primarily aimed at processing corpora based on the [Universal Dependencies](https://universaldependencies.org/) annotation scheme, but it also accepts any other corpus in the [CONLL-U](https://universaldependencies.org/format.html) format as input.

# Repository moved
STARK has moved to a [new location on GitHub](https://github.com/clarinsi/STARK).

## Windows installation and execution
### Installation
Install Python 3 on your system (https://www.python.org/downloads/).

Download the pip installation file (https://bootstrap.pypa.io/get-pip.py) and install it by double-clicking on it.

Install the other libraries necessary for running the tool by going into the program directory and double-clicking on `install.bat`. If Windows Defender prevents this file from executing, unblock it via `right-click on the .bat file -> Properties -> General -> Security -> Select Unblock -> Select Apply`.

### Execution
Set up the search parameters in the `.ini` file.

Execute the extraction by running `run.bat` (if it is blocked, repeat the same procedure as for `install.bat`).
Optionally, point `run.bat` to another `.ini` file by editing it and changing the `--config_file` parameter.

## Linux installation and execution
### Installation
Install Python 3 on your system (https://www.python.org/downloads/).

Install pip and the other libraries required by the program by running the following commands in a terminal:
```bash
sudo apt install python3-pip
cd <PATH TO PROJECT DIRECTORY>
pip3 install -r requirements.txt
```

### Execution
Set up the search parameters in the `.ini` file.

Execute the extraction by first moving to the project directory:
```bash
cd <PATH TO PROJECT DIRECTORY>
```

and then running the script:
```bash
python3 dependency-parsetree.py --config_file=<PATH TO .ini file>
```

Example:
```bash
python3 dependency-parsetree.py --config_file=config.ini
```

## Parameter settings
The type of trees to be extracted can be defined through several parameters in the `config.ini` configuration file (a sketch of how these settings are read is shown after the list):

- `input`: location of the input file or directory (parsed corpus in .conllu)
- `output`: location of the output file (extraction results)
- `internal_saves`: location of the folder with files for optimization during processing
- `cpu_cores`: number of CPU cores to be used during processing
- `tree_size`: number of nodes in the tree (integer or range)
- `tree_type`: extraction of all possible subtrees or full subtrees only (values *all* or *complete*)
- `dependency_type`: extraction of labeled or unlabeled trees (values *labeled* or *unlabeled*)
- `node_order`: whether to take surface word order into account during extraction (values *free* or *fixed*)
- `node_type`: type of nodes under investigation (values *form*, *lemma*, *upos*, *xpos*, *feats* or *deprel*)
- `label_whitelist`: predefined list of dependency labels allowed in the extracted tree
- `root_whitelist`: predefined characteristics of the root node
- `query`: predefined tree structure based on the modified Turku NLP [query language](http://bionlp.utu.fi/searchexpressions-new.html)
- `print_root`: print root node information in the output file (values *yes* or *no*)
- `nodes_number`: print the number of nodes in the tree in the output file (values *yes* or *no*)
- `association_measures`: calculate the strength of association between nodes by MI, MI3, t-test, logDice, Dice and simple-LL scores (values *yes* or *no*)
- `frequency_threshold`: minimum frequency of occurrences of the tree in the corpus
- `lines_threshold`: maximum number of trees in the output
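For scripting, the sketch below (not part of the original README) mirrors how the tool's own `stark.py` reads these settings with Python's `configparser`; the file name `config.ini` matches the example above, and the printed values depend on your file:

```python
import configparser

config = configparser.ConfigParser()
config.read('config.ini')

# Fallbacks and value handling mirror read_filters() in stark.py.
tree_size = config.get('settings', 'tree_size', fallback='0')   # e.g. '2' or '2-4'
tree_size_range = [int(r) for r in tree_size.split('-')]
labeled = config.get('settings', 'dependency_type') == 'labeled'
fixed_order = config.get('settings', 'node_order') == 'fixed'
print(tree_size_range, labeled, fixed_order)   # e.g. [2, 4] True False
```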
## Output
The tool returns the resulting list of all relevant trees in the form of a tab-separated `.tsv` file with information on the tree structure, its frequency, and other relevant information depending on the specific parameter settings. The tool does not support any data visualization; however, the output structure of the tree is directly transferable to the [Dep_Search](http://bionlp-www.utu.fi/dep_search/) concordancing service, which gives access to specific examples in many corpora.

## Credits
This program was developed by Luka Krsnik in collaboration with Kaja Dobrovoljc and Marko Robnik Šikonja, with financial support from [CLARIN.SI](https://www.clarin.si/).

<a href="http://www.clarin.si/info/about/"><img src="https://gitea.cjvt.si/lkrsnik/dependency_parsing/raw/branch/master/logos/CLARIN.png" alt="drawing" height="150"/></a>
<a href="https://www.cjvt.si/en/"><img src="https://gitea.cjvt.si/lkrsnik/dependency_parsing/raw/branch/master/logos/CJVT.png" alt="drawing" height="150"/></a>
<a href="https://www.fri.uni-lj.si/en/about"><img src="https://gitea.cjvt.si/lkrsnik/dependency_parsing/raw/branch/master/logos/FRI.png" alt="drawing" height="150"/></a>
<a href="http://www.ff.uni-lj.si/an/aboutFaculty/about_faculty"><img src="https://gitea.cjvt.si/lkrsnik/dependency_parsing/raw/branch/master/logos/FF.png" alt="drawing" height="150"/></a>
25	ResultNode.py
@@ -1,25 +0,0 @@
# Copyright 2019 CJVT
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from generic import generate_key, generate_name


class ResultNode(object):
    def __init__(self, node, architecture_order, create_output_strings):
        self.name_parts, self.name = generate_name(node, create_output_strings)
        self.location = architecture_order
        self.deprel = node.deprel.get_value()

    def __repr__(self):
        return self.name
180	ResultTree.py
@@ -1,180 +0,0 @@
# Copyright 2019 CJVT
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import copy
import string


class ResultTree(object):
    def __init__(self, node, children, filters):
        self.node = node
        self.children = children
        self.filters = filters
        self.key = None
        self.order_key = None
        self.order = None
        self.array = None

    def __repr__(self):
        return self.get_key()

    def set_children(self, children):
        self.children = children

    def reset_params(self):
        self.key = None
        self.order_key = None
        self.order = None
        self.array = None

    def get_key(self):
        if self.key:
            return self.key
        key = ''
        write_self_node_to_result = False
        if self.children:
            children = self.children
            for child in children:
                if self.filters['node_order'] and child.node.location < self.node.location:
                    if self.filters['dependency_type']:
                        separator = ' <' + child.node.deprel + ' '
                    else:
                        separator = ' < '
                    key += child.get_key() + separator
                else:
                    if not write_self_node_to_result:
                        write_self_node_to_result = True
                        key += self.node.name
                    if self.filters['dependency_type']:
                        separator = ' >' + child.node.deprel + ' '
                    else:
                        separator = ' > '
                    key += separator + child.get_key()

            if not write_self_node_to_result:
                key += self.node.name
            self.key = '(' + key + ')'
        else:
            self.key = self.node.name
        return self.key

    def get_key_sorted(self):
        key = ''
        write_self_node_to_result = False
        if self.children:
            children = sorted(self.children, key=lambda x: x.node.name)
            for child in children:
                if not write_self_node_to_result:
                    write_self_node_to_result = True
                    key += self.node.name
                if self.filters['dependency_type']:
                    separator = ' >' + child.node.deprel + ' '
                else:
                    separator = ' > '
                key += separator + child.get_key_sorted()

            if not write_self_node_to_result:
                key += self.node.name
            key = '(' + key + ')'
        else:
            key = self.node.name
        return key

    def get_order_key(self):
        if self.order_key:
            return self.order_key
        order_key = ''
        write_self_node_to_result = False
        if self.children:
            for child in self.children:
                if self.filters['node_order'] and child.node.location < self.node.location:
                    if self.filters['dependency_type']:
                        separator = ' <' + child.node.deprel + ' '
                    else:
                        separator = ' < '
                    order_key += child.get_order_key() + separator
                else:
                    if not write_self_node_to_result:
                        write_self_node_to_result = True
                        order_key += str(self.node.location)
                    if self.filters['dependency_type']:
                        separator = ' >' + child.node.deprel + ' '
                    else:
                        separator = ' > '
                    order_key += separator + child.get_order_key()
            if not write_self_node_to_result:
                order_key += str(self.node.location)
            self.order_key = '(' + order_key + ')'
        else:
            self.order_key = str(self.node.location)
        return self.order_key

    def get_order(self):
        if self.order:
            return self.order
        order = []
        write_self_node_to_result = False
        if self.children:
            for child in self.children:
                if self.filters['node_order'] and child.node.location < self.node.location:
                    order += child.get_order()
                else:
                    if not write_self_node_to_result:
                        write_self_node_to_result = True
                        order += [self.node.location]
                    order += child.get_order()

            if not write_self_node_to_result:
                order += [self.node.location]
            self.order = order
        else:
            self.order = [self.node.location]
        return self.order

    def get_array(self):
        if self.array:
            return self.array
        array = []
        write_self_node_to_result = False
        if self.children:
            for child in self.children:
                if self.filters['node_order'] and child.node.location < self.node.location:
                    array += child.get_array()
                else:
                    if not write_self_node_to_result:
                        write_self_node_to_result = True
                        array += [self.node.name_parts]
                    array += child.get_array()

            if not write_self_node_to_result:
                array += [self.node.name_parts]
            self.array = array
        else:
            self.array = [self.node.name_parts]
        return self.array

    def finalize_result(self):
        result = copy.copy(self)
        result.reset_params()

        # create order letters
        order = result.get_order()
        order_letters = [''] * len(result.order)
        for i in range(len(order)):
            ind = order.index(min(order))
            order[ind] = 10000
            order_letters[ind] = string.ascii_uppercase[i]
        result.order = ''.join(order_letters)
        # TODO When tree is finalized create relative word order (alphabet)!
        return result
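As a quick illustration of the key format produced by `get_key()` above, the following self-contained sketch (not part of the commit) mirrors its separator logic: children preceding the head in surface order attach on the left with `<deprel`, the rest on the right with `>deprel`:

```python
class MiniTree:
    """Stand-in for ResultTree: a named node with a surface position and deprel."""
    def __init__(self, name, location, deprel=None, children=()):
        self.name = name
        self.location = location
        self.deprel = deprel            # label of the edge to this node's parent
        self.children = list(children)


def mini_key(t, labeled=True, ordered=True):
    if not t.children:
        return t.name
    key, self_written = '', False
    for child in t.children:
        if ordered and child.location < t.location:
            sep = ' <' + child.deprel + ' ' if labeled else ' < '
            key += mini_key(child, labeled, ordered) + sep
        else:
            if not self_written:
                self_written = True
                key += t.name
            sep = ' >' + child.deprel + ' ' if labeled else ' > '
            key += sep + mini_key(child, labeled, ordered)
    if not self_written:
        key += t.name
    return '(' + key + ')'


# "a dog barks": DET(1) <- NOUN(2) <- VERB(3)
tree = MiniTree('VERB', 3, children=[
    MiniTree('NOUN', 2, 'nsubj', children=[MiniTree('DET', 1, 'det')])])
print(mini_key(tree))   # ((DET <det NOUN) <nsubj VERB)
```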
393	Tree.py
@@ -1,393 +0,0 @@
import sys
from copy import copy

from ResultNode import ResultNode
from ResultTree import ResultTree
from Value import Value
from generic import generate_key


class Tree(object):
    def __init__(self, index, form, lemma, upos, xpos, deprel, feats_detailed, form_dict, lemma_dict, upos_dict, xpos_dict, deprel_dict, feats_dict, feats_detailed_dict, head):
        if not hasattr(self, 'feats'):
            self.feats_detailed = {}

        if form not in form_dict:
            form_dict[form] = Value(form)
        self.form = form_dict[form]
        if lemma not in lemma_dict:
            lemma_dict[lemma] = Value(lemma)
        self.lemma = lemma_dict[lemma]
        if upos not in upos_dict:
            upos_dict[upos] = Value(upos)
        self.upos = upos_dict[upos]
        if xpos not in xpos_dict:
            xpos_dict[xpos] = Value(xpos)
        self.xpos = xpos_dict[xpos]
        if deprel not in deprel_dict:
            deprel_dict[deprel] = Value(deprel)
        self.deprel = deprel_dict[deprel]
        for feat in feats_detailed.keys():
            if feat not in feats_detailed_dict:
                feats_detailed_dict[feat] = {}
            if next(iter(feats_detailed[feat])) not in feats_detailed_dict[feat]:
                feats_detailed_dict[feat][next(iter(feats_detailed[feat]))] = Value(next(iter(feats_detailed[feat])))
            if feat not in self.feats_detailed:
                self.feats_detailed[feat] = {}
            self.feats_detailed[feat][next(iter(feats_detailed[feat]))] = feats_detailed_dict[feat][next(iter(feats_detailed[feat]))]

        self.parent = head
        self.children = []
        self.children_split = -1

        self.index = index

        # for caching answers to questions
        self.cache = {}

    def add_child(self, child):
        self.children.append(child)

    def set_parent(self, parent):
        self.parent = parent

    def fits_static_requirements_feats(self, query_tree):
        if 'feats_detailed' not in query_tree:
            return True

        for feat in query_tree['feats_detailed'].keys():
            if feat not in self.feats_detailed or query_tree['feats_detailed'][feat] != next(iter(self.feats_detailed[feat].values())).get_value():
                return False

        return True

    def fits_permanent_requirements(self, filters):
        main_attributes = ['deprel', 'feats', 'form', 'lemma', 'upos']

        if not filters['root_whitelist']:
            return True

        for option in filters['root_whitelist']:
            filter_passed = True

            # check if attributes are valid
            for key in option.keys():
                if key not in main_attributes:
                    if key not in self.feats_detailed or option[key] != list(self.feats_detailed[key].items())[0][1].get_value():
                        filter_passed = False

            filter_passed = filter_passed and \
                ('deprel' not in option or option['deprel'] == self.deprel.get_value()) and \
                ('form' not in option or option['form'] == self.form.get_value()) and \
                ('lemma' not in option or option['lemma'] == self.lemma.get_value()) and \
                ('upos' not in option or option['upos'] == self.upos.get_value())

            if filter_passed:
                return True

        return False

    def fits_temporary_requirements(self, filters):
        return not filters['label_whitelist'] or self.deprel.get_value() in filters['label_whitelist']

    def fits_static_requirements(self, query_tree, filters):
        return ('form' not in query_tree or query_tree['form'] == self.form.get_value()) and \
            ('lemma' not in query_tree or query_tree['lemma'] == self.lemma.get_value()) and \
            ('upos' not in query_tree or query_tree['upos'] == self.upos.get_value()) and \
            ('xpos' not in query_tree or query_tree['xpos'] == self.xpos.get_value()) and \
            ('deprel' not in query_tree or query_tree['deprel'] == self.deprel.get_value()) and \
            (not filters['complete_tree_type'] or (len(self.children) == 0 and 'children' not in query_tree) or ('children' in query_tree and len(self.children) == len(query_tree['children']))) and \
            self.fits_static_requirements_feats(query_tree)

    def generate_children_queries(self, all_query_indices, children):
        partial_results = {}
        # list of pairs (index of query in group, group of query, is permanent)
        child_queries_metadata = []
        for child_index, child in enumerate(children):
            new_queries = []

            # add continuation queries to children
            for result_part_index, result_index, is_permanent in child_queries_metadata:
                if result_index in partial_results and result_part_index in partial_results[result_index] and len(partial_results[result_index][result_part_index]) > 0:
                    if len(all_query_indices[result_index][0]) > result_part_index + 1:
                        new_queries.append((result_part_index + 1, result_index, is_permanent))

            child_queries_metadata = new_queries

            # add new queries to children
            for result_index, (group, is_permanent) in enumerate(all_query_indices):
                # check if the node has enough children for the query to be possible
                if len(children) - len(group) >= child_index:
                    child_queries_metadata.append((0, result_index, is_permanent))

            child_queries = []
            for result_part_index, result_index, _ in child_queries_metadata:
                child_queries.append(all_query_indices[result_index][0][result_part_index])

            partial_results = yield child, child_queries, child_queries_metadata
        yield None, None, None

    def add_subtrees(self, old_subtree, new_subtree):
        old_subtree.extend(new_subtree)

    def get_all_query_indices(self, temporary_query_nb, permanent_query_nb, permanent_query_trees, all_query_indices, children, create_output_string, filters):
        partial_answers = [[] for i in range(permanent_query_nb + temporary_query_nb)]
        complete_answers = [[] for i in range(permanent_query_nb)]

        # list of pairs (index of query in group, group of query)
        # TODO try to erase!!!
        child_queries = [all_query_indice[0] for all_query_indice in all_query_indices]

        answers_lengths = [len(query) for query in child_queries]

        child_queries_flatten = [query_part for query in child_queries for query_part in query]

        all_new_partial_answers = [[] for query_part in child_queries_flatten]

        child_queries_flatten_dedup = []
        child_queries_flatten_dedup_indices = []
        for query_part in child_queries_flatten:
            try:
                index = child_queries_flatten_dedup.index(query_part)
            except ValueError:
                index = len(child_queries_flatten_dedup)
                child_queries_flatten_dedup.append(query_part)

            child_queries_flatten_dedup_indices.append(index)

        # ask children all queries/partial queries
        for child in children:
            # obtain children results
            new_partial_answers_dedup, new_complete_answers = child.get_subtrees(permanent_query_trees, child_queries_flatten_dedup,
                                                                                 create_output_string, filters)

            assert len(new_partial_answers_dedup) == len(child_queries_flatten_dedup)

            # duplicate results again on the correct places
            for i, flattened_index in enumerate(child_queries_flatten_dedup_indices):
                all_new_partial_answers[i].append(new_partial_answers_dedup[flattened_index])

            for i in range(len(new_complete_answers)):
                # TODO add order rearrangement (TO KEY)
                complete_answers[i].extend(new_complete_answers[i])

        # merge answers in an appropriate way
        i = 0
        # iterate over all answers per queries
        for answer_i, answer_length in enumerate(answers_lengths):
            # iterate over answers of query
            # TODO ERROR IN HERE!
            partial_answers[answer_i] = self.create_answers(all_new_partial_answers[i:i + answer_length], answer_length, filters)
            i += answer_length

        return partial_answers, complete_answers

    def order_dependent_queries(self, active_permanent_query_trees, active_temporary_query_trees, partial_subtrees,
                                create_output_string, merged_partial_subtrees, i_query, i_answer, filters):
        node = ResultNode(self, self.index, create_output_string)

        if i_query < len(active_permanent_query_trees):
            if 'children' in active_permanent_query_trees[i_query]:
                merged_partial_subtrees.append(
                    self.create_output_children(partial_subtrees[i_answer], [ResultTree(node, [], filters)], filters))
                i_answer += 1
            else:
                merged_partial_subtrees.append([ResultTree(node, [], filters)])
        else:
            if 'children' in active_temporary_query_trees[i_query - len(active_permanent_query_trees)]:
                merged_partial_subtrees.append(
                    self.create_output_children(partial_subtrees[i_answer], [ResultTree(node, [], filters)], filters))
                i_answer += 1
            else:
                merged_partial_subtrees.append([ResultTree(node, [], filters)])

        return i_answer

    def get_unigrams(self, create_output_strings, filters):
        unigrams = [generate_key(self, create_output_strings, print_lemma=False)[1]]
        for child in self.children:
            unigrams += child.get_unigrams(create_output_strings, filters)
        return unigrams

    def get_subtrees(self, permanent_query_trees, temporary_query_trees, create_output_string, filters):
        """
        :param permanent_query_trees:
        :param temporary_query_trees:
        """

        # list of all children queries grouped by parent queries
        all_query_indices = []

        active_permanent_query_trees = []
        for permanent_query_tree in permanent_query_trees:
            if self.fits_static_requirements(permanent_query_tree, filters) and self.fits_permanent_requirements(filters):
                active_permanent_query_trees.append(permanent_query_tree)
                if 'children' in permanent_query_tree:
                    all_query_indices.append((permanent_query_tree['children'], True))
                    # r_all_query_indices.append((permanent_query_tree['r_children'], True))

        active_temporary_query_trees = []
        successful_temporary_queries = []
        for i, temporary_query_tree in enumerate(temporary_query_trees):
            if self.fits_static_requirements(temporary_query_tree, filters) and self.fits_temporary_requirements(filters):
                active_temporary_query_trees.append(temporary_query_tree)
                successful_temporary_queries.append(i)
                if 'children' in temporary_query_tree:
                    all_query_indices.append((temporary_query_tree['children'], False))

        partial_subtrees, complete_answers = self.get_all_query_indices(len(temporary_query_trees),
                                                                        len(permanent_query_trees),
                                                                        permanent_query_trees,
                                                                        all_query_indices, self.children,
                                                                        create_output_string, filters)

        merged_partial_answers = []
        i_question = 0
        # i_answer is necessary because some queries are answered immediately and are not passed to children;
        # it points to where we are inside the answers
        i_answer = 0
        # go over all permanent and temporary query trees
        while i_question < len(active_permanent_query_trees) + len(active_temporary_query_trees):
            # permanent query trees always have a left and a right child
            i_answer = self.order_dependent_queries(active_permanent_query_trees, active_temporary_query_trees, partial_subtrees,
                                                    create_output_string, merged_partial_answers, i_question, i_answer, filters)

            i_question += 1

        for i in range(len(active_permanent_query_trees)):
            # TODO FINALIZE RESULT
            # erase the first and last brackets when adding a new query result
            add_subtree = [subtree.finalize_result() for subtree in merged_partial_answers[i]]
            complete_answers[i].extend(add_subtree)

        # answers to valid queries
        partial_answers = [[] for i in range(len(temporary_query_trees))]
        for inside_i, outside_i in enumerate(successful_temporary_queries):
            partial_answers[outside_i] = merged_partial_answers[
                len(active_permanent_query_trees) + inside_i]

        return partial_answers, complete_answers

    @staticmethod
    def create_children_groups(left_parts, right_parts):
        if not left_parts:
            return right_parts

        if not right_parts:
            return left_parts

        all_children_group_possibilities = []
        for left_part in left_parts:
            for right_part in right_parts:
                new_part = copy(left_part)
                new_part.extend(right_part)
                all_children_group_possibilities.append(new_part)
        return all_children_group_possibilities

    @staticmethod
    def merge_answer(answer1, answer2, base_answer_i, answer_j):
        merged_results = []
        merged_indices = []
        for answer1p_i, old_result in enumerate(answer1):
            for answer2p_i, new_result in enumerate(answer2):
                if answer1p_i != answer2p_i:
                    new_indices = [answer1p_i] + [answer2p_i]
                    # TODO add comparison of answers with different indices; if equal, ignore
                    merged_results.append(old_result + new_result)
                    merged_indices.append(new_indices)
        return merged_results, merged_indices

    def merge_results3(self, child, new_results, filters):
        if filters['node_order']:
            new_child = child
        else:
            new_child = sorted(child, key=lambda x: x[0].get_key())

        children_groups = []

        for i_answer, answer in enumerate(new_child):
            children_groups = self.create_children_groups(children_groups, [[answer_part] for answer_part in answer])

        results = []
        for result in new_results:
            for children in children_groups:
                new_result = copy(result)
                new_result.set_children(children)
                results.append(new_result)

        return results

    def create_output_children(self, children, new_results, filters):
        merged_results = []
        for i_child, child in enumerate(children):
            merged_results.extend(self.merge_results3(child, new_results, filters))
        return merged_results

    def create_answers(self, separated_answers, answer_length, filters):
        partly_built_trees = [[None] * answer_length]
        partly_built_trees_architecture_indices = [[None] * answer_length]
        built_trees = []
        built_trees_architecture_indices = []

        # iterate over children first, so that new partly built trees are added only after all results of a specific
        # child are added
        for child_i in range(len(separated_answers[0])):
            new_partly_built_trees = []
            new_partly_built_trees_architecture_indices = []
            # iterate over answer parts
            for answer_part_i in range(len(separated_answers)):
                # necessary because some parts do not pass filters and are not added
                if separated_answers[answer_part_i][child_i]:
                    for tree_part_i, tree_part in enumerate(partly_built_trees):
                        if not tree_part[answer_part_i]:
                            new_tree_part = copy(tree_part)
                            new_tree_part_architecture_indices = copy(partly_built_trees_architecture_indices[tree_part_i])
                            new_tree_part[answer_part_i] = separated_answers[answer_part_i][child_i]
                            new_tree_part_architecture_indices[answer_part_i] = child_i
                            completed_tree_part = True
                            for val_i, val in enumerate(new_tree_part):
                                if not val:
                                    completed_tree_part = False
                            if completed_tree_part:
                                built_trees.append(new_tree_part)
                                built_trees_architecture_indices.append(new_tree_part_architecture_indices)
                            else:
                                new_partly_built_trees.append(new_tree_part)
                                new_partly_built_trees_architecture_indices.append(new_tree_part_architecture_indices)
                        else:
                            # pass over repetitions of the same words
                            pass

            partly_built_trees.extend(new_partly_built_trees)
            partly_built_trees_architecture_indices.extend(new_partly_built_trees_architecture_indices)

        l_ordered_built_trees, unique_trees_architecture = [], []

        if built_trees:
            # sort the arrays by architecture indices
            temp_trees_index, temp_trees = (list(t) for t in zip(
                *sorted(zip(built_trees_architecture_indices, built_trees))))

            # order outputs and erase duplicates
            for tree, tree_index in zip(temp_trees, temp_trees_index):
                new_tree_index, new_tree = (list(t) for t in zip(*sorted(zip(tree_index, tree))))
                # TODO check if new_tree_architecture is inside ordered_built_trees_architecture and if not, append!
                is_unique = True
                for unique_tree in unique_trees_architecture:
                    already_in = True
                    for part_i in range(len(unique_tree)):
                        if len(unique_tree[part_i]) != len(new_tree[part_i]) or any(unique_tree[part_i][i_unique_part].get_order_key() != new_tree[part_i][i_unique_part].get_order_key() for i_unique_part in range(len(unique_tree[part_i]))):
                            already_in = False
                            break
                    if already_in:
                        is_unique = False
                        break

                if is_unique:
                    unique_trees_architecture.append(new_tree)
                    l_ordered_built_trees.append(new_tree)
        return l_ordered_built_trees
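The `query_tree` arguments consumed by `fits_static_requirements()` and `get_subtrees()` above are plain nested dicts. A hypothetical example (the keys mirror the attributes checked in the code):

```python
# A VERB head with an nsubj child and an obj child; a node matches when every
# attribute the query specifies equals the corresponding token value.
query_tree = {
    'upos': 'VERB',
    'children': [
        {'deprel': 'nsubj'},
        {'deprel': 'obj'},
    ],
}
```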
20	Value.py
@@ -1,20 +0,0 @@
# Copyright 2019 CJVT
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

class Value(object):
    def __init__(self, value):
        self.value = value

    def get_value(self):
        return self.value
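`Value` is a small wrapper used for interning: `Tree.__init__` keeps one shared dict per attribute so equal strings map to a single `Value` object. A minimal sketch of the pattern (it reuses the `Value` class above; the strings are hypothetical):

```python
form_dict = {}

def intern(s, pool):
    # one Value instance per distinct string, as in Tree.__init__
    if s not in pool:
        pool[s] = Value(s)
    return pool[s]

a = intern('dog', form_dict)
b = intern('dog', form_dict)
assert a is b and a.get_value() == 'dog'
```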
28	config.ini
@@ -1,28 +0,0 @@
[settings]

;___GENERAL SETTINGS___
input = data/sl_ssj-ud_v2.4.conllu
output = results/out_official.tsv
internal_saves = ./internal_saves
cpu_cores = 12

;___TREE SPECIFICATIONS___
tree_size = 2-4
tree_type = complete
dependency_type = labeled
node_order = free
node_type = upos

;___TREE RESTRICTIONS___
;label_whitelist = nsubj|obj|obl
;root_whitelist = lemma=mati&Case=Acc|lemma=lep&Degree=Sup

;___SEARCH BY QUERY___
;query = _ > _

;___OUTPUT SETTINGS___
;lines_threshold = 10000
;frequency_threshold = 0
association_measures = no
print_root = yes
nodes_number = yes
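The commented `root_whitelist` above separates alternative root descriptions with `|` and combines attributes within one description with `&`. The sketch below (not from the commit) mirrors how `read_filters()` in `stark.py` turns that string into the attribute dicts checked by `Tree.fits_permanent_requirements()`:

```python
root_whitelist = 'lemma=mati&Case=Acc|lemma=lep&Degree=Sup'

options = []
for option in root_whitelist.split('|'):
    attribute_dict = {}
    for attribute in option.split('&'):
        key, value = attribute.split('=')
        attribute_dict[key] = value
    options.append(attribute_dict)

print(options)
# [{'lemma': 'mati', 'Case': 'Acc'}, {'lemma': 'lep', 'Degree': 'Sup'}]
```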
97	generic.py
@@ -1,97 +0,0 @@
# Copyright 2019 CJVT
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import math
import sys


def create_output_string_form(tree):
    return tree.form.get_value()


def create_output_string_deprel(tree):
    return tree.deprel.get_value()


def create_output_string_lemma(tree):
    return tree.lemma.get_value() if tree.lemma.get_value() is not None else '_'


def create_output_string_upos(tree):
    return tree.upos.get_value()


def create_output_string_xpos(tree):
    return tree.xpos.get_value()


def create_output_string_feats(tree):
    return tree.feats.get_value()


def generate_key(node, create_output_strings, print_lemma=True):
    array = [[create_output_string(node) for create_output_string in create_output_strings]]
    if create_output_string_lemma in create_output_strings and print_lemma:
        key_array = [[create_output_string(
            node) if create_output_string != create_output_string_lemma else 'L=' + create_output_string(node) for
            create_output_string in create_output_strings]]
    else:
        key_array = array
    if len(array[0]) > 1:
        key = '&'.join(key_array[0])
    else:
        key = key_array[0][0]

    return array, key


def generate_name(node, create_output_strings, print_lemma=True):
    array = [create_output_string(node) for create_output_string in create_output_strings]
    if create_output_string_lemma in create_output_strings and print_lemma:
        name_array = [create_output_string(
            node) if create_output_string != create_output_string_lemma else 'L=' + create_output_string(node) for
            create_output_string in create_output_strings]
    else:
        name_array = array
    if len(array) > 1:
        name = '&'.join(name_array)
    else:
        name = name_array[0]

    return array, name


def get_collocabilities(ngram, unigrams_dict, corpus_size):
    sum_fwi = 0.0
    mul_fwi = 1.0
    for key_array in ngram['object'].array:
        # create key for unigrams
        if len(key_array) > 1:
            key = '&'.join(key_array)
        else:
            key = key_array[0]
        sum_fwi += unigrams_dict[key]
        mul_fwi *= unigrams_dict[key]

    if mul_fwi < 0:
        mul_fwi = sys.maxsize

    # number of all words
    N = corpus_size

    # n of ngram
    n = len(ngram['object'].array)
    O = ngram['number']
    E = mul_fwi / pow(N, n-1)

    # ['MI', 'MI3', 'Dice', 'logDice', 't-score', 'simple-LL']
    mi = math.log(O / E, 2)
    mi3 = math.log(pow(O, 3) / E, 2)
    dice = n * O / sum_fwi
    logdice = 14 + math.log(dice, 2)
    tscore = (O - E) / math.sqrt(O)
    simplell = 2 * (O * math.log10(O / E) - (O - E))
    return ['%.4f' % mi, '%.4f' % mi3, '%.4f' % dice, '%.4f' % logdice, '%.4f' % tscore, '%.4f' % simplell]
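To make the measures computed by `get_collocabilities()` concrete, here is a worked example with hypothetical counts (the formulas are the ones above):

```python
import math

# A 2-gram seen O = 50 times; its two parts seen 200 and 300 times;
# corpus of N = 100000 units (same symbols as in get_collocabilities).
f, N, n, O = [200, 300], 100000, 2, 50
E = (f[0] * f[1]) / pow(N, n - 1)              # expected frequency = 0.6

print(math.log2(O / E))                        # MI        ≈ 6.38
print(math.log2(pow(O, 3) / E))                # MI3       ≈ 17.67
dice = n * O / sum(f)                          # Dice      = 0.2
print(14 + math.log2(dice))                    # logDice   ≈ 11.68
print((O - E) / math.sqrt(O))                  # t-score   ≈ 6.99
print(2 * (O * math.log10(O / E) - (O - E)))   # simple-LL ≈ 93.28
```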
1	install.bat
@@ -1 +0,0 @@
py -3 -m pip install -r requirements.txt &
BIN	logos/CJVT.png
Binary file not shown. Before: 76 KiB
BIN	logos/CLARIN.png
Binary file not shown. Before: 28 KiB
BIN	logos/FF.png
Binary file not shown. Before: 44 KiB
3773	logos/FF.svg
File diff suppressed because it is too large. Before: 314 KiB
BIN	logos/FRI.png
Binary file not shown. Before: 128 KiB
1	requirements.txt
@@ -1 +0,0 @@
pyconll==3.1.0
@@ -1,17 +0,0 @@
import os
from pathlib import Path

input_path = '/home/lukakrsnik/STARK/data/ud-treebanks-v2.11/'
output_path = '/home/lukakrsnik/STARK/results/ud-treebanks-v2.11_B/'
config_path = '/home/lukakrsnik/STARK/data/B_test-all-treebanks_3_completed_unlabeled_fixed_form_root=NOUN_5.ini'

for path in sorted(os.listdir(input_path)):
    path_obj = Path(input_path, path)
    pathlist = path_obj.glob('**/*.conllu')
    for path in sorted(pathlist):
        folder_name = os.path.join(output_path, path.parts[-2])
        file_name = os.path.join(folder_name, path.name)
        if not os.path.exists(folder_name):
            os.makedirs(folder_name)
        if not os.path.exists(file_name):
            os.system("python /home/luka/Development/STARK/stark.py --config_file " + config_path + " --input " + str(path) + " --output " + file_name)
3	run.sh
@@ -1,3 +0,0 @@
source venv/bin/activate
python3 dependency-parsetree.py --config_file="$1"
deactivate
617	stark.py
@@ -1,617 +0,0 @@
#!/usr/bin/env python
# Copyright 2019 CJVT
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import configparser
import copy
import csv
import hashlib
import math
import os
import pickle
import re
import string
import time
from multiprocessing import Pool
from pathlib import Path
import gzip
import sys
import pyconll

from Tree import Tree
from generic import get_collocabilities, create_output_string_form, create_output_string_deprel, create_output_string_lemma, create_output_string_upos, create_output_string_xpos, create_output_string_feats

sys.setrecursionlimit(25000)


def save_zipped_pickle(obj, filename, protocol=-1):
    with gzip.open(filename, 'wb') as f:
        pickle.dump(obj, f, protocol)


def load_zipped_pickle(filename):
    with gzip.open(filename, 'rb') as f:
        loaded_object = pickle.load(f)
        return loaded_object


def decode_query(orig_query, dependency_type, feats_detailed_list):
    new_query = False

    # if the command is in brackets, remove them and treat the command as a new query
    if orig_query[0] == '(' and orig_query[-1] == ')':
        new_query = True
        orig_query = orig_query[1:-1]

    if dependency_type != '':
        decoded_query = {'deprel': dependency_type}
    else:
        decoded_query = {}

    if orig_query == '_':
        return decoded_query
    # if there are no spaces in the query, this is a query node; otherwise split the query further
    elif len(orig_query.split(' ')) == 1:
        orig_query_split_parts = orig_query.split(' ')[0].split('&')
        for orig_query_split_part in orig_query_split_parts:
            orig_query_split = orig_query_split_part.split('=', 1)
            if len(orig_query_split) > 1:
                if orig_query_split[0] == 'L':
                    decoded_query['lemma'] = orig_query_split[1]
                elif orig_query_split[0] == 'upos':
                    decoded_query['upos'] = orig_query_split[1]
                elif orig_query_split[0] == 'xpos':
                    decoded_query['xpos'] = orig_query_split[1]
                elif orig_query_split[0] == 'form':
                    decoded_query['form'] = orig_query_split[1]
                elif orig_query_split[0] == 'feats':
                    decoded_query['feats'] = orig_query_split[1]
                elif orig_query_split[0] in feats_detailed_list:
                    decoded_query['feats_detailed'] = {}
                    decoded_query['feats_detailed'][orig_query_split[0]] = orig_query_split[1]
                    return decoded_query
                elif not new_query:
                    raise Exception('Not supported yet!')
                else:
                    print('???')
            elif not new_query:
                decoded_query['form'] = orig_query_split_part
        return decoded_query

    # split over spaces if not inside braces
    all_orders = re.split(r"\s+(?=[^()]*(?:\(|$))", orig_query)

    node_actions = all_orders[::2]
    priority_actions = all_orders[1::2]
    priority_actions_beginnings = [a[0] for a in priority_actions]

    # find the root index
    try:
        root_index = priority_actions_beginnings.index('>')
    except ValueError:
        root_index = len(priority_actions)

    children = []
    root = None
    for i, node_action in enumerate(node_actions):
        if i < root_index:
            children.append(decode_query(node_action, priority_actions[i][1:], feats_detailed_list))
        elif i > root_index:
            children.append(decode_query(node_action, priority_actions[i - 1][1:], feats_detailed_list))
        else:
            root = decode_query(node_action, dependency_type, feats_detailed_list)
    if children:
        root["children"] = children
    return root
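# A usage sketch (not part of the original file): decode_query turns the
# bracketed query syntax into the nested dicts consumed by Tree.get_subtrees.
# The query string here is hypothetical; '_' matches any node, '>' adds a
# child, as in the commented `query = _ > _` line of config.ini.
#   decode_query('(upos=VERB >nsubj _)', '', {})
#   -> {'upos': 'VERB', 'children': [{'deprel': 'nsubj'}]}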
def create_trees(input_path, internal_saves, feats_detailed_dict={}, save=True):
    hash_object = hashlib.sha1(input_path.encode('utf-8'))
    hex_dig = hash_object.hexdigest()
    trees_read_outputfile = os.path.join(internal_saves, hex_dig)
    print(Path(input_path).name)
    if not os.path.exists(trees_read_outputfile) or not save:

        train = pyconll.load_from_file(input_path)

        form_dict, lemma_dict, upos_dict, xpos_dict, deprel_dict, feats_dict = {}, {}, {}, {}, {}, {}

        all_trees = []
        corpus_size = 0

        for sentence in train:
            root = None
            token_nodes = []
            for token in sentence:
                if not token.id.isdigit():
                    continue

                # TODO check if the 5th place is always there for feats
                token_form = token.form if token.form is not None else '_'
                node = Tree(int(token.id), token_form, token.lemma, token.upos, token.xpos, token.deprel, token.feats, form_dict,
                            lemma_dict, upos_dict, xpos_dict, deprel_dict, feats_dict, feats_detailed_dict, token.head)
                token_nodes.append(node)
                if token.deprel == 'root':
                    root = node

                corpus_size += 1

            for token_id, token in enumerate(token_nodes):
                if isinstance(token.parent, int) or token.parent == '':
                    root = None
                    print('No parent: ' + sentence.id)
                    break
                if int(token.parent) == 0:
                    token.set_parent(None)
                else:
                    parent_id = int(token.parent) - 1
                    if token_nodes[parent_id].children_split == -1 and token_id > parent_id:
                        token_nodes[parent_id].children_split = len(token_nodes[parent_id].children)
                    token_nodes[parent_id].add_child(token)
                    token.set_parent(token_nodes[parent_id])

            for token in token_nodes:
                if token.children_split == -1:
                    token.children_split = len(token.children)

            if root is None:
                print('No root: ' + sentence.id)
                continue
            all_trees.append(root)

        if save:
            save_zipped_pickle((all_trees, form_dict, lemma_dict, upos_dict, xpos_dict, deprel_dict, corpus_size, feats_detailed_dict), trees_read_outputfile, protocol=2)
    else:
        print('Reading trees:')
        print('Completed')
        all_trees, form_dict, lemma_dict, upos_dict, xpos_dict, deprel_dict, corpus_size, feats_detailed_dict = load_zipped_pickle(trees_read_outputfile)

    return all_trees, form_dict, lemma_dict, upos_dict, xpos_dict, deprel_dict, corpus_size, feats_detailed_dict


def printable_answers(query):
    # all_orders = re.findall(r"(?:[^ ()]|\([^]*\))+", query)
    all_orders = re.split(r"\s+(?=[^()]*(?:\(|$))", query)

    # all_orders = orig_query.split()
    node_actions = all_orders[::2]
    # priority_actions = all_orders[1::2]

    if len(node_actions) > 1:
        res = []
        # for node_action in node_actions[:-1]:
        #     res.extend(printable_answers(node_action[1:-1]))
        # res.extend([node_actions[-1]])
        for node_action in node_actions:
            # if the command is in brackets, remove them and treat the command as a new query
            # TODO FIX BRACKETS IN A BETTER WAY
            if not node_action:
                res.extend(['('])
            elif node_action[0] == '(' and node_action[-1] == ')':
                res.extend(printable_answers(node_action[1:-1]))
            else:
                res.extend([node_action])
        return res
    else:
        return [query]


def tree_calculations(input_data):
    tree, query_tree, create_output_string_funct, filters = input_data
    _, subtrees = tree.get_subtrees(query_tree, [], create_output_string_funct, filters)
    return subtrees


def get_unigrams(input_data):
    tree, query_tree, create_output_string_funct, filters = input_data
    unigrams = tree.get_unigrams(create_output_string_funct, filters)
    return unigrams


def tree_calculations_chunks(input_data):
    trees, query_tree, create_output_string_funct, filters = input_data

    result_dict = {}
    for tree in trees:
        _, subtrees = tree.get_subtrees(query_tree, [], create_output_string_funct, filters)

        for query_results in subtrees:
            for r in query_results:
                if r in result_dict:
                    result_dict[r] += 1
                else:
                    result_dict[r] = 1
    return result_dict


def add_node(tree):
    if 'children' in tree:
        tree['children'].append({})
    else:
        tree['children'] = [{}]


# walk over all nodes in the tree and add a node to each possible node
def tree_grow(orig_tree):
    new_trees = []
    new_tree = copy.deepcopy(orig_tree)
    add_node(new_tree)
    new_trees.append(new_tree)
    if 'children' in orig_tree:
        children = []
        for child_tree in orig_tree['children']:
            children.append(tree_grow(child_tree))
        for i, child in enumerate(children):
            for child_res in child:
                new_tree = copy.deepcopy(orig_tree)
                new_tree['children'][i] = child_res
                new_trees.append(new_tree)

    return new_trees


def compare_trees(tree1, tree2):
    if tree1 == {} and tree2 == {}:
        return True

    if 'children' not in tree1 or 'children' not in tree2 or len(tree1['children']) != len(tree2['children']):
        return False

    children2_connections = []

    for child1_i, child1 in enumerate(tree1['children']):
        child_duplicated = False
        for child2_i, child2 in enumerate(tree2['children']):
            if child2_i in children2_connections:
                pass
            if compare_trees(child1, child2):
                children2_connections.append(child2_i)
                child_duplicated = True
                break
        if not child_duplicated:
            return False

    return True


def create_ngrams_query_trees(n, trees):
    for i in range(n - 1):
        new_trees = []
        for tree in trees:
            # append new_tree only if it is not already inside
            for new_tree in tree_grow(tree):
                duplicate = False
                for confirmed_new_tree in new_trees:
                    if compare_trees(new_tree, confirmed_new_tree):
                        duplicate = True
                        break
                if not duplicate:
                    new_trees.append(new_tree)

        trees = new_trees
    return trees
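# For intuition (not part of the original file): starting from the single empty
# tree {}, create_ngrams_query_trees(3, [{}]) above leaves exactly the two
# distinct unlabeled shapes of size 3 once compare_trees has pruned duplicates:
#   {'children': [{}, {}]}               # a fork: root with two children
#   {'children': [{'children': [{}]}]}   # a chain: root -> child -> grandchild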
def count_trees(cpu_cores, all_trees, query_tree, create_output_string_functs, filters, unigrams_dict, result_dict):
|
||||
with Pool(cpu_cores) as p:
|
||||
if cpu_cores > 1:
|
||||
all_unigrams = p.map(get_unigrams, [(tree, query_tree, create_output_string_functs, filters) for tree in all_trees])
|
||||
for unigrams in all_unigrams:
|
||||
for unigram in unigrams:
|
||||
if unigram in unigrams_dict:
|
||||
unigrams_dict[unigram] += 1
|
||||
else:
|
||||
unigrams_dict[unigram] = 1
|
||||
|
||||
all_subtrees = p.map(tree_calculations, [(tree, query_tree, create_output_string_functs, filters) for tree in all_trees])
|
||||
|
||||
for tree_i, subtrees in enumerate(all_subtrees):
|
||||
|
||||
for query_results in subtrees:
|
||||
for r in query_results:
|
||||
if filters['node_order']:
|
||||
key = r.get_key() + r.order
|
||||
else:
|
||||
key = r.get_key()
|
||||
if key in result_dict:
|
||||
result_dict[key]['number'] += 1
|
||||
else:
|
||||
result_dict[key] = {'object': r, 'number': 1}
|
||||
|
||||
# 3.65 s (1 core)
|
||||
else:
|
||||
for tree_i, tree in enumerate(all_trees):
|
||||
input_data = (tree, query_tree, create_output_string_functs, filters)
|
||||
if filters['association_measures']:
|
||||
unigrams = get_unigrams(input_data)
|
||||
for unigram in unigrams:
|
||||
if unigram in unigrams_dict:
|
||||
unigrams_dict[unigram] += 1
|
||||
else:
|
||||
unigrams_dict[unigram] = 1
|
||||
|
||||
subtrees = tree_calculations(input_data)
|
||||
for query_results in subtrees:
|
||||
for r in query_results:
|
||||
if filters['node_order']:
|
||||
key = r.get_key() + r.order
|
||||
else:
|
||||
key = r.get_key()
|
||||
if key in result_dict:
|
||||
result_dict[key]['number'] += 1
|
||||
else:
|
||||
result_dict[key] = {'object': r, 'number': 1}
|
||||
|
||||
def read_filters(config, args, feats_detailed_list):
|
||||
tree_size = config.get('settings', 'tree_size', fallback='0') if not args.tree_size else args.tree_size
|
||||
tree_size_range = tree_size.split('-')
|
||||
tree_size_range = [int(r) for r in tree_size_range]
|
||||
|
||||
if tree_size_range[0] > 0:
|
||||
if len(tree_size_range) == 1:
|
||||
query_tree = create_ngrams_query_trees(tree_size_range[0], [{}])
|
||||
elif len(tree_size_range) == 2:
|
||||
query_tree = []
|
||||
for i in range(tree_size_range[0], tree_size_range[1] + 1):
|
||||
query_tree.extend(create_ngrams_query_trees(i, [{}]))
|
||||
else:
|
||||
query = config.get('settings', 'query') if not args.query else args.query
|
||||
query_tree = [decode_query('(' + query + ')', '', feats_detailed_list)]
|
||||
|
||||
# set filters
|
||||
node_type = config.get('settings', 'node_type') if not args.node_type else args.node_type
|
||||
node_types = node_type.split('+')
|
||||
create_output_string_functs = []
|
||||
for node_type in node_types:
|
||||
assert node_type in ['deprel', 'lemma', 'upos', 'xpos', 'form', 'feats'], '"node_type" is not set up correctly'
|
||||
cpu_cores = config.getint('settings', 'cpu_cores') if not args.cpu_cores else args.cpu_cores
|
||||
if node_type == 'deprel':
|
||||
create_output_string_funct = create_output_string_deprel
|
||||
elif node_type == 'lemma':
|
||||
create_output_string_funct = create_output_string_lemma
|
||||
elif node_type == 'upos':
|
||||
create_output_string_funct = create_output_string_upos
|
||||
elif node_type == 'xpos':
|
||||
create_output_string_funct = create_output_string_xpos
|
||||
elif node_type == 'feats':
|
||||
create_output_string_funct = create_output_string_feats
|
||||
else:
|
||||
create_output_string_funct = create_output_string_form
|
||||
create_output_string_functs.append(create_output_string_funct)
|
||||
|
||||
filters = {}
|
||||
filters['internal_saves'] = config.get('settings', 'internal_saves') if not args.internal_saves else args.internal_saves
|
||||
filters['input'] = config.get('settings', 'input') if not args.input else args.input
|
||||
node_order = config.get('settings', 'node_order') if not args.node_order else args.node_order
|
||||
filters['node_order'] = node_order == 'fixed'
|
||||
# filters['caching'] = config.getboolean('settings', 'caching')
|
||||
dependency_type = config.get('settings', 'dependency_type') if not args.dependency_type else args.dependency_type
|
||||
filters['dependency_type'] = dependency_type == 'labeled'
|
||||
if config.has_option('settings', 'label_whitelist'):
|
||||
label_whitelist = config.get('settings', 'label_whitelist') if not args.label_whitelist else args.label_whitelist
|
||||
filters['label_whitelist'] = label_whitelist.split('|')
|
||||
else:
|
||||
filters['label_whitelist'] = []
|
||||
|
||||
root_whitelist = config.get('settings', 'root_whitelist') if not args.root_whitelist else args.root_whitelist
|
||||
if root_whitelist:
|
||||
# test
|
||||
filters['root_whitelist'] = []
|
||||
|
||||
for option in root_whitelist.split('|'):
|
||||
attribute_dict = {}
|
||||
for attribute in option.split('&'):
|
||||
value = attribute.split('=')
|
||||
attribute_dict[value[0]] = value[1]
|
||||
filters['root_whitelist'].append(attribute_dict)
|
||||
else:
|
||||
filters['root_whitelist'] = []
|
||||
|
||||
tree_type = config.get('settings', 'tree_type') if not args.tree_type else args.tree_type
|
||||
filters['complete_tree_type'] = tree_type == 'complete'
|
||||
filters['association_measures'] = config.getboolean('settings', 'association_measures') if not args.association_measures else args.association_measures
|
||||
filters['nodes_number'] = config.getboolean('settings', 'nodes_number') if not args.nodes_number else args.nodes_number
|
||||
filters['frequency_threshold'] = config.getfloat('settings', 'frequency_threshold', fallback=0) if not args.frequency_threshold else args.frequency_threshold
|
||||
filters['lines_threshold'] = config.getint('settings', 'lines_threshold', fallback=0) if not args.lines_threshold else args.lines_threshold
|
||||
filters['print_root'] = config.getboolean('settings', 'print_root') if not args.print_root else args.print_root
|
||||
|
||||
return filters, query_tree, create_output_string_functs, cpu_cores, tree_size_range, node_types
|
||||
|
||||
|
||||
def process(input_path, internal_saves, config, args):
|
||||
if os.path.isdir(input_path):
|
||||
|
||||
checkpoint_path = Path(internal_saves, 'checkpoint.pkl')
|
||||
continuation_processing = config.getboolean('settings', 'continuation_processing',
|
||||
fallback=False) if not args.continuation_processing else args.input
|
||||
|
||||
if not checkpoint_path.exists() or not continuation_processing:
|
||||
already_processed = set()
|
||||
result_dict = {}
|
||||
unigrams_dict = {}
|
||||
corpus_size = 0
|
||||
feats_detailed_list = {}
|
||||
if checkpoint_path.exists():
|
||||
os.remove(checkpoint_path)
|
||||
else:
|
||||
already_processed, result_dict, unigrams_dict, corpus_size, feats_detailed_list = load_zipped_pickle(
|
||||
checkpoint_path)
|
||||
|
||||
for path in sorted(os.listdir(input_path)):
|
||||
path_obj = Path(input_path, path)
|
||||
pathlist = path_obj.glob('**/*.conllu')
|
||||
if path_obj.name in already_processed:
|
||||
continue
|
||||
start_exe_time = time.time()
|
||||
for path in sorted(pathlist):
|
||||
# because path is object not string
|
||||
path_str = str(path)
|
||||
|
||||
(all_trees, form_dict, lemma_dict, upos_dict, xpos_dict, deprel_dict, sub_corpus_size,
|
||||
feats_detailed_list) = create_trees(path_str, internal_saves, feats_detailed_dict=feats_detailed_list,
|
||||
save=False)
|
||||
|
||||
corpus_size += sub_corpus_size
|
||||
|
||||
filters, query_tree, create_output_string_functs, cpu_cores, tree_size_range, node_types = read_filters(
|
||||
config, args, feats_detailed_list)
|
||||
|
||||
count_trees(cpu_cores, all_trees, query_tree, create_output_string_functs, filters, unigrams_dict,
|
||||
result_dict)
|
||||
|
||||
already_processed.add(path_obj.name)
|
||||
|
||||
# 15.26
|
||||
print("Execution time:")
|
||||
print("--- %s seconds ---" % (time.time() - start_exe_time))
|
||||
save_zipped_pickle(
|
||||
(already_processed, result_dict, unigrams_dict, corpus_size, feats_detailed_list),
|
||||
checkpoint_path, protocol=2)
|
||||
|
||||
|
||||
|
||||
|
||||
else:
|
||||
# 261 - 9 grams
|
||||
# 647 - 10 grams
|
||||
# 1622 - 11 grams
|
||||
# 4126 - 12 grams
|
||||
# 10598 - 13 grams
|
||||
(all_trees, form_dict, lemma_dict, upos_dict, xpos_dict, deprel_dict, corpus_size,
|
||||
feats_detailed_list) = create_trees(input_path, internal_saves)
|
||||
|
||||
        result_dict = {}
        unigrams_dict = {}

        filters, query_tree, create_output_string_functs, cpu_cores, tree_size_range, node_types = read_filters(
            config, args, feats_detailed_list)

        start_exe_time = time.time()
        count_trees(cpu_cores, all_trees, query_tree, create_output_string_functs, filters, unigrams_dict, result_dict)

        print("Execution time:")
        print("--- %s seconds ---" % (time.time() - start_exe_time))

    return result_dict, tree_size_range, filters, corpus_size, unigrams_dict, node_types


def get_keyness(abs_freq_A, abs_freq_B, count_A, count_B):
    # expected frequencies under the null hypothesis that the structure is
    # equally frequent (per token) in both corpora
    E1 = count_A * (abs_freq_A + abs_freq_B) / (count_A + count_B)
    E2 = count_B * (abs_freq_A + abs_freq_B) / (count_A + count_B)

    # all keyness scores are undefined when the structure does not occur in
    # corpus B, so 'NaN' is written instead; the 0 * log(0) term is taken as 0
    LL = 2 * ((abs_freq_A * math.log(abs_freq_A / E1) if abs_freq_A > 0 else 0)
              + (abs_freq_B * math.log(abs_freq_B / E2))) if abs_freq_B > 0 else 'NaN'
    BIC = LL - math.log(count_A + count_B) if abs_freq_B > 0 else 'NaN'
    log_ratio = math.log((abs_freq_A / count_A) / (abs_freq_B / count_B), 2) if abs_freq_A > 0 and abs_freq_B > 0 else 'NaN'
    OR = (abs_freq_A / (count_A - abs_freq_A)) / (abs_freq_B / (count_B - abs_freq_B)) if abs_freq_B > 0 else 'NaN'
    # %DIFF of the per-million frequencies (the 10**6 factors cancel)
    diff = ((abs_freq_A / count_A - abs_freq_B / count_B) * 100) / (abs_freq_B / count_B) if abs_freq_B > 0 else 'NaN'

    # the relative frequency of corpus B is reported per million tokens, for
    # consistency with the 'Relative frequency' column
    return [abs_freq_B, abs_freq_B / count_B * 1000000, LL, BIC, log_ratio, OR, diff]
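

# A quick sanity check with hypothetical counts: get_keyness(200, 50, 10**6, 10**6)
# gives E1 = E2 = 125, LL = 2 * (200*ln(200/125) + 50*ln(50/125)) ~ 96.4,
# log_ratio = log2(200/50) = 2.0, OR ~ 4.0 and %DIFF = 300.0, i.e. the structure
# is strongly over-represented in corpus A relative to corpus B.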


def main():
    parser = argparse.ArgumentParser()

    # required parameter
    parser.add_argument("--config_file", default=None, type=str, required=True, help="The input config file.")
    # optional parameters - when given, they override the corresponding config file settings
    parser.add_argument("--input", default=None, type=str, help="The input file/folder.")
    parser.add_argument("--output", default=None, type=str, help="The output file.")
    parser.add_argument("--internal_saves", default=None, type=str, help="Location for internal saves.")
    parser.add_argument("--cpu_cores", default=None, type=int, help="Number of CPU cores used.")

    parser.add_argument("--tree_size", default=None, type=str, help="Size of trees.")
    parser.add_argument("--tree_type", default=None, type=str, help="Tree type.")
    parser.add_argument("--dependency_type", default=None, type=str, help="Dependency type.")
    parser.add_argument("--node_order", default=None, type=str, help="Order of nodes.")
    parser.add_argument("--node_type", default=None, type=str, help="Type of node.")

    parser.add_argument("--label_whitelist", default=None, type=str, help="Label whitelist.")
    parser.add_argument("--root_whitelist", default=None, type=str, help="Root whitelist.")

    parser.add_argument("--query", default=None, type=str, help="Query.")

    parser.add_argument("--lines_threshold", default=None, type=int, help="Lines threshold.")
    parser.add_argument("--frequency_threshold", default=None, type=float, help="Frequency threshold.")
    # boolean options are store_true flags: argparse's type=bool would treat any
    # non-empty string, including 'False', as True
    parser.add_argument("--association_measures", default=None, action='store_true', help="Association measures.")
    parser.add_argument("--print_root", default=None, action='store_true', help="Print root.")
    parser.add_argument("--nodes_number", default=None, action='store_true', help="Number of nodes.")
    parser.add_argument("--continuation_processing", default=None, action='store_true', help="Continue processing from the last checkpoint.")
    parser.add_argument("--compare", default=None, type=str, help="Corpus with which we want to compare statistics.")
    args = parser.parse_args()
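    # With --compare pointing at a second corpus (a hypothetical example:
    # --compare=corpus_B.conllu), the same extraction is run on that corpus and
    # keyness scores (LL, BIC, log ratio, OR, %DIFF) are appended to each row.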

    config = configparser.ConfigParser()
    config.read(args.config_file)

    internal_saves = config.get('settings', 'internal_saves') if not args.internal_saves else args.internal_saves
    input_path = config.get('settings', 'input') if not args.input else args.input

    result_dict, tree_size_range, filters, corpus_size, unigrams_dict, node_types = process(input_path, internal_saves, config, args)

    if args.compare is not None:
        # process the comparison corpus with the same configuration, so keyness
        # statistics can be computed against it below
        other_input_path = args.compare
        other_result_dict, other_tree_size_range, other_filters, other_corpus_size, other_unigrams_dict, other_node_types = process(other_input_path, internal_saves, config, args)

    sorted_list = sorted(result_dict.items(), key=lambda x: x[1]['number'], reverse=True)

    output = config.get('settings', 'output') if not args.output else args.output
    with open(output, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter='\t')
        # header: the widest tree determines how many node columns are needed;
        # for query-based extraction the query string alternates node and
        # relation tokens, so it describes len(query.split(" ")) / 2 + 1 nodes
        if tree_size_range[-1]:
            len_words = tree_size_range[-1]
        else:
            query = config.get('settings', 'query') if not args.query else args.query
            len_words = int(len(query.split(" ")) / 2 + 1)
        header = ["Structure"] + ["Node " + string.ascii_uppercase[i] + "-" + node_type for i in range(len_words) for node_type in node_types] + ['Absolute frequency']
        header += ['Relative frequency']
        if filters['node_order']:
            header += ['Order']
            header += ['Free structure']
        if filters['nodes_number']:
            header += ['Number of nodes']
        if filters['print_root']:
            header += ['Root node']
        if filters['association_measures']:
            header += ['MI', 'MI3', 'Dice', 'logDice', 't-score', 'simple-LL']
        if args.compare:
            header += ['Other absolute frequency', 'Other relative frequency', 'LL', 'BIC', 'Log ratio', 'OR', '%DIFF']
        writer.writerow(header)

        if filters['lines_threshold']:
            # keep only the N most frequent structures
            sorted_list = sorted_list[:filters['lines_threshold']]
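
        # With tree_size up to 2 and a single node_type such as 'form', for
        # example, the header would begin: Structure, Node A-form, Node B-form,
        # Absolute frequency, Relative frequency, ...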
        # body
        for k, v in sorted_list:
            v['object'].get_array()  # fills in v['object'].array, used below
            # relative frequency is reported per million tokens
            relative_frequency = v['number'] * 1000000.0 / corpus_size
            # sorted_list is ordered by descending frequency, so everything after
            # the first structure below the threshold can be skipped
            if filters['frequency_threshold'] and filters['frequency_threshold'] > v['number']:
                break
            # pad shorter trees with empty cells so all rows have equal width
            words_only = [word_att for word in v['object'].array for word_att in word] + ['' for i in range((tree_size_range[-1] - len(v['object'].array)) * len(v['object'].array[0]))]
            # strip the outer parentheses from the structure key, if present
            key = [v['object'].get_key()[1:-1]] if v['object'].get_key()[0] == '(' and v['object'].get_key()[-1] == ')' else [v['object'].get_key()]
            row = key + words_only + [str(v['number'])]
            row += ['%.4f' % relative_frequency]
            if filters['node_order']:
                row += [v['object'].order]
                row += [v['object'].get_key_sorted()[1:-1]]
            if filters['nodes_number']:
                row += ['%d' % len(v['object'].array)]
            if filters['print_root']:
                row += [v['object'].node.name]
            if filters['association_measures']:
                row += get_collocabilities(v, unigrams_dict, corpus_size)
            if args.compare:
                other_abs_freq = other_result_dict[k]['number'] if k in other_result_dict else 0
                row += get_keyness(v['number'], other_abs_freq, corpus_size, other_corpus_size)
            writer.writerow(row)
return "Done"
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
start_time = time.time()
|
||||
main()
|
||||
print("Total:")
|
||||
print("--- %s seconds ---" % (time.time() - start_time))
|