
Minimized repository

master
Luka 1 year ago
commit cf527a2fe7
  1. .gitignore (+113)
  2. LICENSE (+21)
  3. README.md (+46)
  4. __init__.py (+0)
  5. accentuate.py (+71)
  6. accentuate_connected_text.py (+79)
  7. assign_stress2lemmas.py (+198)
  8. hyphenation (+1113)
  9. learn_location_weights.py (+74)
  10. log.txt (+1)
  11. notes (+1)
  12. prepare_data.py (+1779)
  13. preprocessed_data/environment.pkl (BIN)
  14. requirements.txt (+47)
  15. run_multiple_files.py (+14)
  16. sloleks_accentuation.py (+161)
  17. sloleks_accentuation2.py (+73)
  18. sloleks_accentuation2_tab2xml.py (+154)
  19. sloleks_accetuation.ipynb (+1050)
  20. sloleks_accetuation2.ipynb (+290)
  21. sloleks_xml_checker.py (+36)
  22. test_data/accented_connected_text (+1)
  23. test_data/accented_data (+6)
  24. test_data/original_connected_text (+1)
  25. test_data/unaccented_dictionary (+6)
  26. tex_hyphenation.py (+101)
  27. text2SAMPA.py (+247)
  28. tts.sh (+5)
  29. workbench.py (+99)
  30. workbench.sh (+3)
  31. workbench.xrsl (+14)

113
.gitignore

@@ -0,0 +1,113 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
.idea/
*.egg-info/
.installed.cfg
*.egg
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*,cover
.hypothesis/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
target/
# IPython Notebook
.ipynb_checkpoints
# pyenv
.python-version
# celery beat schedule file
celerybeat-schedule
# dotenv
.env
# virtualenv
venv/
ENV/
# Spyder project settings
.spyderproject
# Rope project settings
.ropeproject
# Custom
data/
cnn/internal_representations/inputs/
joblist.xml
new_sloleks.xml
grid_results/
.idea/
cnn/word_accetuation/svm/data/
postprocessing/data_merge.ipynb
data_merge.ipynb
postprocessing/data_merge.py
data_merge.py
postprocessing/sp_data_merge.py
sp_data_merge.py
postprocessing/data_merge_tab2xml.py
data_merge_tab2xml.py
postprocessing/data_merge_analysis.py
data_merge_analysis.py
postprocessing/sp_sloleks_data_merge.py
sp_sloleks_data_merge.py
postprocessing/data_merge_xml2tab.py
data_merge_xml2tab.py

21
LICENSE

@@ -0,0 +1,21 @@
MIT License
Copyright (c) 2017 lkrsnik
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

46
README.md

@@ -0,0 +1,46 @@
# Introduction
There is no simple algorithm for stress assignment in Slovene words; speakers of Slovene are usually taught accents together with the words themselves. Machine learning algorithms give positive results on this problem, so we tried deep neural networks. We tested different architectures, data representations and an ensemble of networks. We achieved the best results using the ensemble method, which correctly predicted 88.72 % of the tested words. Our neural network approach improved on the results of other machine learning methods and proved to be successful in stress assignment.
This is an improved version of a master's thesis project (the source code of the previous simplified repository is published at https://github.com/lkrsnik/simple_accentuation, and the previous complete source code is published at https://github.com/lkrsnik/accetuation). Final results on the test data set show that the results presented here improve on the predecessor by a little more than 1 %. This repository contains all test results and models trained in different ways. It also includes some finalized scripts for simple accentuation of words from a list, an accentuation application for connected text, and a simple example script for training a neural network (we do not provide the training data for this due to copyright; if you want it to work, you have to obtain your own training set).
# Set up
The majority of packages used in this app can easily be installed with the following command:
```
pip install -r requirements.txt
```
If you encounter any problems while installing Keras, check its official site (https://keras.io/#installation). The neural networks were trained on the Theano backend. Although TensorFlow might work as well, it is not guaranteed.
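If you want to force the Theano backend explicitly, Keras reads it from the `KERAS_BACKEND` environment variable (or from the `backend` key in `~/.keras/keras.json`); as a minimal sketch, a single run of one of the scripts below could be started like this:
```
KERAS_BACKEND=theano python accentuate.py 'test_data/unaccented_dictionary' 'test_data/accented_data'
```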
# Structure
Inside the cnn folder there are two subfolders, word_accentuation and accent_classification. Both include all prediction models tested in this experiment: the first contains the models used to find stress locations, the second the models used to classify accent types. The repository also contains four important files: prepare_data.py with the majority of the code, accentuate.py, meant as a simple accentuation app, accentuate_connected_text.py, a simple app for accentuating connected Slovene text, and learn_location_weights.py, meant as an example of how you can train your own neural networks with different parameters.
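For orientation, here is an approximate layout pieced together from the file list of this commit and the model paths referenced in the scripts; the version subfolders (v5_3, v2_1, ...) vary per model and are not listed:
```
.
├── prepare_data.py
├── accentuate.py
├── accentuate_connected_text.py
├── learn_location_weights.py
├── preprocessed_data/environment.pkl
├── test_data/
└── cnn/
    ├── word_accetuation/
    │   ├── cnn_dictionary/
    │   ├── syllables/
    │   └── syllabled_letters/
    └── accent_classification/
        ├── letters/
        ├── syllables/
        └── syllabled_letters/
```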
# accentuate.py
Use this script if you would like to accentuate words from a list of words with their morphological data. For it to work, you should generate a file in which each line contains the word of interest and its morphological data, separated by a tab. It should look like this:
```
absolutistični Afpmsay-n
spoštljivejše Afcfsg
tresoče Afpfsg
raznesena Vmp--sfp
žvižgih Ncmdl
```
You can call this script in bash with the following command:
```
python accentuate.py <path_to_input_file> <path_to_results_file>
```
Here is a working example:
```
python accentuate.py 'test_data/unaccented_dictionary' 'test_data/accented_data'
```
# accentuate_connected_text.py
This app uses an external tagger to obtain morphological information from sentences. For it to work, you should clone the repository from https://github.com/clarinsi/reldi-tagger and pass its location as a parameter.
```
python accentuate_connected_text.py <path_to_input_file> <path_to_results_file> <path_to_reldi_repository>
```
You can try a working example with your actual path to the ReLDI tagger:
```
python accentuate_connected_text.py 'test_data/original_connected_text' 'test_data/accented_connected_text' '../reldi_tagger'
```
# learn_location_weights.py
This is an example script designed for learning weights. For it to work, you need training data. The given example can be used to train neural networks that assign the location of stress from letters. For examples of other neural networks, look into the original repository.
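Note that the version of learn_location_weights.py included in this commit has no command-line interface; the training, test and validation paths near the top of the script are hard-coded, so after pointing them at your own data set the run is simply:
```
python learn_location_weights.py
```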

0
__init__.py

71
accentuate.py

@@ -0,0 +1,71 @@
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import pickle
import numpy as np
from keras.models import load_model
import sys
from prepare_data import *
# obtain data from parameters
if len(sys.argv) < 3:
print('Please provide arguments for this script to work. The first argument should be the location of the file with unaccented words and morphological data '
'and the second the name of the file where you would like to save the results. Example: python accentuate.py \'test_data/unaccented_dictionary\' '
'\'test_data/accented_data\'')
raise Exception
read_location = sys.argv[1]
write_location = sys.argv[2]
# get environment variables necessary for calculations
pickle_input = open('preprocessed_data/environment.pkl', 'rb')
environment = pickle.load(pickle_input)
dictionary = environment['dictionary']
max_word = environment['max_word']
max_num_vowels = environment['max_num_vowels']
vowels = environment['vowels']
accented_vowels = environment['accented_vowels']
feature_dictionary = environment['feature_dictionary']
syllable_dictionary = environment['syllable_dictionary']
# load models
data = Data('l', shuffle_all_inputs=False)
letter_location_model, syllable_location_model, syllabled_letters_location_model = data.load_location_models(
'cnn/word_accetuation/cnn_dictionary/v5_3/20_final_epoch.h5',
'cnn/word_accetuation/syllables/v3_3/20_final_epoch.h5',
'cnn/word_accetuation/syllabled_letters/v3_3/20_final_epoch.h5')
letter_location_co_model, syllable_location_co_model, syllabled_letters_location_co_model = data.load_location_models(
'cnn/word_accetuation/cnn_dictionary/v5_2/20_final_epoch.h5',
'cnn/word_accetuation/syllables/v3_2/20_final_epoch.h5',
'cnn/word_accetuation/syllabled_letters/v3_2/20_final_epoch.h5')
letter_type_model, syllable_type_model, syllabled_letter_type_model = data.load_type_models(
'cnn/accent_classification/letters/v3_1/20_final_epoch.h5',
'cnn/accent_classification/syllables/v2_1/20_final_epoch.h5',
'cnn/accent_classification/syllabled_letters/v2_1/20_final_epoch.h5')
letter_type_co_model, syllable_type_co_model, syllabled_letter_type_co_model = data.load_type_models(
'cnn/accent_classification/letters/v3_0/20_final_epoch.h5',
'cnn/accent_classification/syllables/v2_0/20_final_epoch.h5',
'cnn/accent_classification/syllabled_letters/v2_0/20_final_epoch.h5')
# read from data
content = data._read_content(read_location)
# format data for the accentuate_word function; each entry has to look like ['besedišči', '', 'Ncnpi', 'besedišči']
content = [[el[0], '', el[1][:-1], el[0]] for el in content]
# use environment variables and models to accentuate words
data = Data('l', shuffle_all_inputs=False)
location_accented_words, accented_words = data.accentuate_word(content, letter_location_model, syllable_location_model, syllabled_letters_location_model,
letter_location_co_model, syllable_location_co_model, syllabled_letters_location_co_model,
letter_type_model, syllable_type_model, syllabled_letter_type_model,
letter_type_co_model, syllable_type_co_model, syllabled_letter_type_co_model,
dictionary, max_word, max_num_vowels, vowels, accented_vowels, feature_dictionary, syllable_dictionary)
# save accentuated words
with open(write_location, 'w') as f:
for i in range(len(location_accented_words)):
f.write(location_accented_words[i] + ' ' + accented_words[i] + '\n')
f.write('\n')

79
accentuate_connected_text.py

@@ -0,0 +1,79 @@
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import sys
sys.path.insert(0, '../../../')
from prepare_data import *
import pickle
# from keras import backend as Input
np.random.seed(7)
# obtain data from parameters
if len(sys.argv) < 4:
print('Please provide arguments for this script to work. The first argument should be the location of the file with the original connected text, '
'the second the name of the file where you would like to save the results and the third the location of the ReLDI tagger. Example: python accentuate_connected_text.py '
'\'test_data/original_connected_text\' \'test_data/accented_connected_text\' \'../reldi_tagger\'')
raise Exception
read_location = sys.argv[1]
write_location = sys.argv[2]
reldi_location = sys.argv[3]
# get environment variables necessary for calculations
pickle_input = open('preprocessed_data/environment.pkl', 'rb')
environment = pickle.load(pickle_input)
dictionary = environment['dictionary']
max_word = environment['max_word']
max_num_vowels = environment['max_num_vowels']
vowels = environment['vowels']
accented_vowels = environment['accented_vowels']
feature_dictionary = environment['feature_dictionary']
syllable_dictionary = environment['syllable_dictionary']
# get models
data = Data('l', shuffle_all_inputs=False)
letter_location_model, syllable_location_model, syllabled_letters_location_model = data.load_location_models(
'cnn/word_accetuation/cnn_dictionary/v5_3/20_final_epoch.h5',
'cnn/word_accetuation/syllables/v3_3/20_final_epoch.h5',
'cnn/word_accetuation/syllabled_letters/v3_3/20_final_epoch.h5')
letter_location_co_model, syllable_location_co_model, syllabled_letters_location_co_model = data.load_location_models(
'cnn/word_accetuation/cnn_dictionary/v5_2/20_final_epoch.h5',
'cnn/word_accetuation/syllables/v3_2/20_final_epoch.h5',
'cnn/word_accetuation/syllabled_letters/v3_2/20_final_epoch.h5')
letter_type_model, syllable_type_model, syllabled_letter_type_model = data.load_type_models(
'cnn/accent_classification/letters/v3_1/20_final_epoch.h5',
'cnn/accent_classification/syllables/v2_1/20_final_epoch.h5',
'cnn/accent_classification/syllabled_letters/v2_1/20_final_epoch.h5')
letter_type_co_model, syllable_type_co_model, syllabled_letter_type_co_model = data.load_type_models(
'cnn/accent_classification/letters/v3_0/20_final_epoch.h5',
'cnn/accent_classification/syllables/v2_0/20_final_epoch.h5',
'cnn/accent_classification/syllabled_letters/v2_0/20_final_epoch.h5')
# get word tags
tagged_words, original_text = data.tag_words(reldi_location, read_location)
# find accentuation locations
predictions = data.get_ensemble_location_predictions(tagged_words, letter_location_model, syllable_location_model, syllabled_letters_location_model,
letter_location_co_model, syllable_location_co_model, syllabled_letters_location_co_model,
dictionary, max_word, max_num_vowels, vowels, accented_vowels, feature_dictionary,
syllable_dictionary)
location_accented_text = data.create_connected_text_locations(tagged_words, original_text, predictions, vowels)
# accentuate text
location_y = np.around(predictions)
type_predictions = data.get_ensemble_type_predictions(tagged_words, location_y, letter_type_model, syllable_type_model, syllabled_letter_type_model,
letter_type_co_model, syllable_type_co_model, syllabled_letter_type_co_model,
dictionary, max_word, max_num_vowels, vowels, accented_vowels, feature_dictionary,
syllable_dictionary)
accented_text = data.create_connected_text_accented(tagged_words, original_text, type_predictions, location_y, vowels, accented_vowels)
# save accentuated text
with open(write_location, 'w') as f:
f.write(accented_text)

198
assign_stress2lemmas.py

@@ -0,0 +1,198 @@
# Words processed: 650250
# Word index: 50023
# Word number: 50023
import re
from lxml import etree
import time
# from prepare_data import *
accented_vowels = ['ŕ', 'á', 'à', 'é', 'è', 'ê', 'í', 'ì', 'ó', 'ô', 'ò', 'ú', 'ù']
def stressed2unstressed(w):
w = w.replace('ŕ', 'r')
w = w.replace('á', 'a')
w = w.replace('à', 'a')
w = w.replace('é', 'e')
w = w.replace('è', 'e')
w = w.replace('ê', 'e')
w = w.replace('í', 'i')
w = w.replace('ì', 'i')
w = w.replace('ó', 'o')
w = w.replace('ô', 'o')
w = w.replace('ò', 'o')
w = w.replace('ú', 'u')
w = w.replace('ù', 'u')
return w
"""Works on finalized XML
"""
from text2SAMPA import *
# def xml_words_generator(xml_path):
# for event, element in etree.iterparse(xml_path, tag="LexicalEntry", encoding="UTF-8"):
# words = []
# for child in element:
# if child.tag == 'WordForm':
# msd = None
# word = None
# for wf in child:
# if 'att' in wf.attrib and wf.attrib['att'] == 'msd':
# msd = wf.attrib['val']
# elif wf.tag == 'FormRepresentation':
# for form_rep in wf:
# if form_rep.attrib['att'] == 'zapis_oblike':
# word = form_rep.attrib['val']
# #if msd is not None and word is not None:
# # pass
# #else:
# # print('NOOOOO')
# words.append([word, '', msd, word])
# yield words
#
#
# gen = xml_words_generator('data/Sloleks_v1.2_p2.xml')
word_glob_num = 0
word_limit = 50000
iter_num = 50000
word_index = 0
# iter_index = 0
# words = []
#
# lexical_entries_load_number = 0
# lexical_entries_save_number = 0
#
# # INSIDE
# # word_glob_num = 1500686
# word_glob_num = 1550705
#
# # word_limit = 1500686
# word_limit = 1550705
#
# iter_index = 31
# done_lexical_entries = 33522
# data = Data('s', shuffle_all_inputs=False)
# accentuated_content = data._read_content('data/new_sloleks/final_sloleks2.tab')
start_timer = time.time()
lemmas = 0
print('Copy initialization complete')
with open("data/contextual_changes/stressed_lemmas_sloleks2.xml", "ab") as myfile:
# myfile2 = open('data/new_sloleks/p' + str(iter_index) + '.xml', 'ab')
for event, element in etree.iterparse('data/contextual_changes/final_sloleks2_inhouse2S.xml', tag="LexicalEntry", encoding="UTF-8", remove_blank_text=True):
# for event, element in etree.iterparse('data/Sloleks_v1.2.xml', tag="LexicalEntry", encoding="UTF-8", remove_blank_text=True):
# if word_glob_num >= word_limit:
# myfile2.close()
# myfile2 = open('data/new_sloleks/p' + str(iter_index) + '.xml', 'ab')
# iter_index += 1
# print("Words proccesed: " + str(word_glob_num))
#
# print("Word indeks: " + str(word_index))
# print("Word number: " + str(len(words)))
#
# # print("lexical_entries_load_number: " + str(lexical_entries_load_number))
# # print("lexical_entries_save_number: " + str(lexical_entries_save_number))
#
# end_timer = time.time()
# print("Elapsed time: " + "{0:.2f}".format((end_timer - start_timer) / 60.0) + " minutes")
lemma = ''
stressed_lemma = ''
msd = ''
word_form_found = False
for child in element:
if child.tag == 'Lemma':
for wf in child:
if 'att' in wf.attrib and wf.attrib['att'] == 'zapis_oblike':
lemma = wf.attrib['val']
if child.tag == 'WordForm':
msd = None
word = None
for wf in child:
if 'att' in wf.attrib and wf.attrib['att'] == 'msd':
msd = wf.attrib['val']
elif wf.tag == 'FormRepresentation':
for form_rep in wf:
if form_rep.attrib['att'] == 'naglašena_beseda':
stressed_lemma = form_rep.attrib['val']
word_form_found = True
break
break
# new_element = etree.Element('feat')
# new_element.attrib['att'] = 'SAMPA'
#
# wf.append(new_element)
#
# word_glob_num += 1
# word_index += 1
break
if re.match(r'S..ei', msd) or re.match(r'S..mi', msd) or re.match(r'Sometn', msd) or re.match(r'P..mei.*', msd) \
or re.match(r'P..zei.*', msd) or re.match(r'P..sei.*', msd) or re.match(r'G..n.*', msd) \
or re.match(r'R.n', msd) or re.match(r'Rss', msd) or re.match(r'Rd', msd) \
or re.match(r'K.*', msd) or re.match(r'D.', msd) or re.match(r'L', msd) or re.match(r'M', msd) \
or re.match(r'O', msd) or re.match(r'Z.*', msd) or re.match(r'V.', msd) or re.match(r'Rsr.', msd)\
or msd == "":
# when lemma does not equal unstressed version of what is supposed to be lemma, try to find parts of the
# word that are equal and transfer stress to lemma (if possible)
if lemma != stressed2unstressed(stressed_lemma):
identical_length = 0
# if lemma == 'Latkov':
# print('HERE')
for i in range(min(len(lemma), len(stressed2unstressed(stressed_lemma)))):
# a = list(lemma)
# b = list(stressed2unstressed(stressed_lemma))
identical_length += 1
if list(lemma)[i] != list(stressed2unstressed(stressed_lemma))[i]:
break
for l in list(stressed_lemma[identical_length:]):
if l in accented_vowels:
# print(lemma)
# print(stressed2unstressed(stressed_lemma))
# print(stressed_lemma[identical_length:])
print(lemma + " : " + stressed_lemma + " - " + msd)
stressed_lemma = stressed_lemma[:identical_length] + lemma[identical_length:]
# pass
# if lemma != stressed2unstressed(stressed_lemma):
# print(lemma + " : " + stressed_lemma + " - " + msd)
else:
# print("Error2 - " + msd + " " + lemma + " - " + stressed_lemma)
# print(lemma + " - " + msd)
pass
for child in element:
if child.tag == 'Lemma':
for wf in child:
if 'att' in wf.attrib and wf.attrib['att'] == 'zapis_oblike':
wf.attrib['val'] = stressed_lemma
break
else:
print('Error1')
break
lemmas += 1
# print(etree.tostring(element, encoding="UTF-8"))
# myfile2.write(etree.tostring(element, encoding="UTF-8", pretty_print=True))
if word_glob_num > word_limit:
# print('Proccessed ' + str(word_glob_num) + ' words')
end_timer = time.time()
# print("Elapsed time: " + "{0:.2f}".format((end_timer - start_timer) / 60.0) + " minutes")
word_limit += iter_num
myfile.write(etree.tostring(element, encoding="UTF-8", pretty_print=True))
element.clear()
print(lemmas)

1113
hyphenation
File diff suppressed because it is too large

74
learn_location_weights.py

@@ -0,0 +1,74 @@
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
# text in Western (Windows 1252)
import pickle
import numpy as np
np.random.seed(7)
import sys
from prepare_data import *
# preprocess data
# data = Data('l', allow_shuffle_vector_generation=True, save_generated_data=False, shuffle_all_inputs=True)
data = Data('l', save_generated_data=False, shuffle_all_inputs=True)
data.generate_data('../../internal_representations/inputs/letters_word_accentuation_train',
'../../internal_representations/inputs/letters_word_accentuation_test',
'../../internal_representations/inputs/letters_word_accentuation_validate',
content_location='../accetuation/data/',
content_name='SlovarIJS_BESEDE_utf8.lex',
inputs_location='../accetuation/cnn/internal_representations/inputs/',
content_shuffle_vector='content_shuffle_vector',
shuffle_vector='shuffle_vector')
# combine all data (if it is unwanted comment code below)
data.x_train = np.concatenate((data.x_train, data.x_test, data.x_validate), axis=0)
data.x_other_features_train = np.concatenate((data.x_other_features_train, data.x_other_features_test, data.x_other_features_validate), axis=0)
data.y_train = np.concatenate((data.y_train, data.y_test, data.y_validate), axis=0)
# build neural network architecture
nn_output_dim = 10
batch_size = 16
actual_epoch = 20
num_fake_epoch = 20
conv_input_shape=(23, 36)
othr_input = (140, )
conv_input = Input(shape=conv_input_shape, name='conv_input')
x_conv = Conv1D(115, (3), padding='same', activation='relu')(conv_input)
x_conv = Conv1D(46, (3), padding='same', activation='relu')(x_conv)
x_conv = MaxPooling1D(pool_size=2)(x_conv)
x_conv = Flatten()(x_conv)
othr_input = Input(shape=othr_input, name='othr_input')
x = concatenate([x_conv, othr_input])
x = Dense(256, activation='relu')(x)
x = Dropout(0.3)(x)
x = Dense(256, activation='relu')(x)
x = Dropout(0.3)(x)
x = Dense(256, activation='relu')(x)
x = Dropout(0.3)(x)
x = Dense(nn_output_dim, activation='sigmoid')(x)
model = Model(inputs=[conv_input, othr_input], outputs=x)
opt = optimizers.Adam(lr=1E-3, beta_1=0.9, beta_2=0.999, epsilon=1e-08)
model.compile(loss='mean_squared_error', optimizer=opt, metrics=[actual_accuracy,])
# model.compile(loss='mean_squared_error', optimizer='adam', metrics=['accuracy'])
# start learning
history = model.fit_generator(data.generator('train', batch_size, content_name='SlovarIJS_BESEDE_utf8.lex', content_location='../accetuation/data/'),
data.x_train.shape[0]/(batch_size * num_fake_epoch),
epochs=actual_epoch*num_fake_epoch,
validation_data=data.generator('test', batch_size),
validation_steps=data.x_test.shape[0]/(batch_size * num_fake_epoch))
# save generated data
name = 'test_data/20_epoch'
model.save(name + '.h5')
output = open(name + '_history.pkl', 'wb')
pickle.dump(history.history, output)
output.close()

1
log.txt

@@ -0,0 +1 @@
Copy initialization complete

1
notes

@@ -0,0 +1 @@
256(0.3)-512(0.3)-512(0.3):115(3)-46(3) - [LETTERS ACCENT TYPE] One layer less

1779
prepare_data.py
File diff suppressed because it is too large

BIN
preprocessed_data/environment.pkl

47
requirements.txt

@@ -0,0 +1,47 @@
appnope==0.1.0
backports.ssl-match-hostname==3.4.0.2
certifi==2015.4.28
decorator==4.0.2
funcsigs==0.4
functools32==3.2.3.post2
gnureadline==6.3.3
ipykernel==4.0.3
ipython==4.0.0
ipython-genutils==0.1.0
ipywidgets==4.0.2
Jinja2==2.8
jsonschema==2.5.1
jupyter==1.0.0
jupyter-client==4.0.0
jupyter-console==4.0.1
jupyter-core==4.0.4
MarkupSafe==0.23
matplotlib==1.4.3
mistune==0.7.1
mock==1.3.0
nbconvert==4.0.0
nbformat==4.0.0
nose==1.3.7
notebook==4.0.4
numpy==1.9.2
path.py==8.1
pbr==1.6.0
pexpect==3.3
pickleshare==0.5
ptyprocess==0.5
PyBrain==0.3
Pygments==2.0.2
pyparsing==2.0.3
python-dateutil==2.4.2
pytz==2015.4
pyzmq==14.7.0
qtconsole==4.0.1
scikit-learn==0.16.1
scipy==0.16.0
simplegeneric==0.8.1
six==1.9.0
sklearn==0.0
terminado==0.5
tornado==4.2.1
traitlets==4.0.0
wheel==0.24.0

14
run_multiple_files.py

@@ -0,0 +1,14 @@
# import cnn.word_accetuation.cnn_dictionary.v5_0.workbench
# import cnn.word_accetuation.cnn_dictionary.v5_3.workbench
# import cnn.word_accetuation.syllables.v3_0.workbench
# import cnn.word_accetuation.syllables.v3_3.workbench
# import cnn.word_accetuation.syllabled_letters.v3_0.workbench
# import cnn.word_accetuation.syllabled_letters.v3_3.workbench
#cnn/accent_classification/syllabled_letters/v2_0/
#import cnn.accent_classification.letters.v3_0.workbench
#import cnn.accent_classification.syllables.v2_0.workbench
#import cnn.accent_classification.syllabled_letters.v2_0.workbench
#import cnn.accent_classification.letters.v3_2.workbench
import cnn.accent_classification.syllables.v2_3.workbench
import cnn.accent_classification.syllabled_letters.v2_3.workbench

161
sloleks_accentuation.py

@@ -0,0 +1,161 @@
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import numpy as np
from keras.models import load_model
import sys
from prepare_data import *
np.random.seed(7)
data = Data('l', shuffle_all_inputs=False)
content = data._read_content('data/SlovarIJS_BESEDE_utf8.lex')
dictionary, max_word, max_num_vowels, vowels, accented_vowels = data._create_dict(content)
feature_dictionary = data._create_slovene_feature_dictionary()
syllable_dictionary = data._create_syllables_dictionary(content, vowels)
accented_vowels = ['ŕ', 'á', 'ä', 'é', 'ë', 'ě', 'í', 'î', 'ó', 'ô', 'ö', 'ú', 'ü']
letter_location_model, syllable_location_model, syllabled_letters_location_model = data.load_location_models(
'cnn/word_accetuation/cnn_dictionary/v3_10/20_test_epoch.h5',
'cnn/word_accetuation/syllables/v2_4/20_test_epoch.h5',
'cnn/word_accetuation/syllabled_letters/v2_5_3/20_test_epoch.h5')
letter_type_model, syllable_type_model, syllabled_letter_type_model = data.load_type_models(
'cnn/accent_classification/letters/v2_1/20_test_epoch.h5',
'cnn/accent_classification/syllables/v1_0/20_test_epoch.h5',
'cnn/accent_classification/syllabled_letters/v1_0/20_test_epoch.h5')
from lxml import etree
def xml_words_generator(xml_path):
for event, element in etree.iterparse(xml_path, tag="LexicalEntry", encoding="UTF-8"):
words = []
for child in element:
if child.tag == 'WordForm':
msd = None
word = None
for wf in child:
if 'att' in wf.attrib and wf.attrib['att'] == 'msd':
msd = wf.attrib['val']
elif wf.tag == 'FormRepresentation':
for form_rep in wf:
if form_rep.attrib['att'] == 'zapis_oblike':
word = form_rep.attrib['val']
# if msd is not None and word is not None:
# pass
# else:
# print('NOOOOO')
words.append([word, '', msd, word])
yield words
gen = xml_words_generator('data/Sloleks_v1.2.xml')
# Words processed: 650250
# Word index: 50023
# Word number: 50023
from lxml import etree
import time
gen = xml_words_generator('data/Sloleks_v1.2.xml')
word_glob_num = 0
word_limit = 0
iter_num = 50000
word_index = 0
start_timer = time.time()
iter_index = 0
words = []
lexical_entries_load_number = 0
lexical_entries_save_number = 0
# INSIDE
word_glob_num = 1500686
word_limit = 50000
iter_index = 30
done_lexical_entries = 33522
import gc
with open("data/new_sloleks/new_sloleks.xml", "ab") as myfile:
# myfile2 = open('data/new_sloleks/p' + str(iter_index) + '.xml', 'ab')
for event, element in etree.iterparse('data/Sloleks_v1.2.xml', tag="LexicalEntry", encoding="UTF-8", remove_blank_text=True):
# LOAD NEW WORDS AND ACCENTUATE THEM
# print("HERE")
if lexical_entries_save_number < done_lexical_entries:
g = next(gen)
# print(lexical_entries_save_number)
lexical_entries_save_number += 1
lexical_entries_load_number += 1
print(lexical_entries_save_number)
del g
gc.collect()
continue
if word_glob_num >= word_limit:
# myfile2.close()
# myfile2 = open('data/new_sloleks/p' + str(iter_index) + '.xml', 'ab')
iter_index += 1
print("Words proccesed: " + str(word_glob_num))
print("Word indeks: " + str(word_index))
print("Word number: " + str(len(words)))
print("lexical_entries_load_number: " + str(lexical_entries_load_number))
print("lexical_entries_save_number: " + str(lexical_entries_save_number))
end_timer = time.time()
print("Elapsed time: " + "{0:.2f}".format((end_timer - start_timer) / 60.0) + " minutes")
word_index = 0
words = []
while len(words) < iter_num:
try:
words.extend(next(gen))
lexical_entries_load_number += 1
except:
break
# if word_glob_num > 1:
# break
data = Data('l', shuffle_all_inputs=False)
location_accented_words, accented_words = data.accentuate_word(words, letter_location_model, syllable_location_model,
syllabled_letters_location_model,
letter_type_model, syllable_type_model, syllabled_letter_type_model,
dictionary, max_word, max_num_vowels, vowels, accented_vowels,
feature_dictionary, syllable_dictionary)
word_limit += len(words)
# READ DATA
for child in element:
if child.tag == 'WordForm':
msd = None
word = None
for wf in child:
if wf.tag == 'FormRepresentation':
new_element = etree.Element('feat')
new_element.attrib['att'] = 'naglasna_mesta_oblike'
new_element.attrib['val'] = location_accented_words[word_index]
wf.append(new_element)
new_element = etree.Element('feat')
new_element.attrib['att'] = 'naglašena_oblika'
new_element.attrib['val'] = accented_words[word_index]
wf.append(new_element)
word_glob_num += 1
word_index += 1
# print(etree.tostring(element, encoding="UTF-8"))
# myfile2.write(etree.tostring(element, encoding="UTF-8", pretty_print=True))
myfile.write(etree.tostring(element, encoding="UTF-8", pretty_print=True))
element.clear()
lexical_entries_save_number += 1

73
sloleks_accentuation2.py

@@ -0,0 +1,73 @@
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import numpy as np
from keras.models import load_model
import sys
import pickle
import time
from prepare_data import *
np.random.seed(7)
data = Data('l', shuffle_all_inputs=False)
content = data._read_content('data/SlovarIJS_BESEDE_utf8.lex')
dictionary, max_word, max_num_vowels, vowels, accented_vowels = data._create_dict(content)
feature_dictionary = data._create_slovene_feature_dictionary()
syllable_dictionary = data._create_syllables_dictionary(content, vowels)
# accented_vowels = ['ŕ', 'á', 'à', 'é', 'è', 'ê', 'í', 'ì', 'ó', 'ô', 'ò', 'ú', 'ù']
accented_vowels = ['ŕ', 'á', 'ä', 'é', 'ë', 'ě', 'í', 'î', 'ó', 'ô', 'ö', 'ú', 'ü']
data = Data('l', shuffle_all_inputs=False)
letter_location_model, syllable_location_model, syllabled_letters_location_model = data.load_location_models(
'cnn/word_accetuation/cnn_dictionary/v5_3/20_final_epoch.h5',
'cnn/word_accetuation/syllables/v3_3/20_final_epoch.h5',
'cnn/word_accetuation/syllabled_letters/v3_3/20_final_epoch.h5')
letter_location_co_model, syllable_location_co_model, syllabled_letters_location_co_model = data.load_location_models(
'cnn/word_accetuation/cnn_dictionary/v5_2/20_final_epoch.h5',
'cnn/word_accetuation/syllables/v3_2/20_final_epoch.h5',
'cnn/word_accetuation/syllabled_letters/v3_2/20_final_epoch.h5')
letter_type_model, syllable_type_model, syllabled_letter_type_model = data.load_type_models(
'cnn/accent_classification/letters/v3_1/20_final_epoch.h5',
'cnn/accent_classification/syllables/v2_1/20_final_epoch.h5',
'cnn/accent_classification/syllabled_letters/v2_1/20_final_epoch.h5')
letter_type_co_model, syllable_type_co_model, syllabled_letter_type_co_model = data.load_type_models(
'cnn/accent_classification/letters/v3_0/20_final_epoch.h5',
'cnn/accent_classification/syllables/v2_0/20_final_epoch.h5',
'cnn/accent_classification/syllabled_letters/v2_0/20_final_epoch.h5')
data = Data('s', shuffle_all_inputs=False)
# new_content = data._read_content('data/sloleks-sl_v1.2.tbl')
new_content = data._read_content('data/contextual_changes/small/sloleks-sl_v1.2.tbl')
print('Commencing accentuator!')
rate = 100000
start_timer = time.time()
with open("data/contextual_changes/small/new_sloleks2_small2.tab", "a") as myfile:
for index in range(0, len(new_content), rate):
if index+rate >= len(new_content):
words = [[el[0], '', el[2], el[0]] for el in new_content][index:len(new_content)]
else:
words = [[el[0], '', el[2], el[0]] for el in new_content][index:index+rate]
data = Data('l', shuffle_all_inputs=False)
location_accented_words, accented_words = data.accentuate_word(words, letter_location_model, syllable_location_model, syllabled_letters_location_model,
letter_location_co_model, syllable_location_co_model, syllabled_letters_location_co_model,
letter_type_model, syllable_type_model, syllabled_letter_type_model,
letter_type_co_model, syllable_type_co_model, syllabled_letter_type_co_model,
dictionary, max_word, max_num_vowels, vowels, accented_vowels, feature_dictionary, syllable_dictionary)
res = ''
for i in range(index, index + len(words)):
res += new_content[i][0] + '\t' + new_content[i][1] + '\t' + new_content[i][2] + '\t' \
+ new_content[i][3][:-1] + '\t' + convert_to_correct_stress(location_accented_words[i-index]) + '\t' + \
convert_to_correct_stress(accented_words[i-index]) + '\n'
print('Writing data from ' + str(index) + ' onward.')
end_timer = time.time()
print("Elapsed time: " + "{0:.2f}".format((end_timer - start_timer)/60.0) + " minutes")
myfile.write(res)

154
sloleks_accentuation2_tab2xml.py

@@ -0,0 +1,154 @@
# Words processed: 650250
# Word index: 50023
# Word number: 50023
from lxml import etree
import time
from prepare_data import *
from text2SAMPA import *
# def xml_words_generator(xml_path):
# for event, element in etree.iterparse(xml_path, tag="LexicalEntry", encoding="UTF-8"):
# words = []
# for child in element:
# if child.tag == 'WordForm':
# msd = None
# word = None
# for wf in child:
# if 'att' in wf.attrib and wf.attrib['att'] == 'msd':
# msd = wf.attrib['val']
# elif wf.tag == 'FormRepresentation':
# for form_rep in wf:
# if form_rep.attrib['att'] == 'zapis_oblike':
# word = form_rep.attrib['val']
# #if msd is not None and word is not None:
# # pass
# #else:
# # print('NOOOOO')
# words.append([word, '', msd, word])
# yield words
#
#
# gen = xml_words_generator('data/Sloleks_v1.2_p2.xml')
word_glob_num = 0
word_limit = 50000
iter_num = 50000
word_index = 0
# iter_index = 0
# words = []
#
# lexical_entries_load_number = 0
# lexical_entries_save_number = 0
#
# # INSIDE
# # word_glob_num = 1500686
# word_glob_num = 1550705
#
# # word_limit = 1500686
# word_limit = 1550705
#
# iter_index = 31
# done_lexical_entries = 33522
data = Data('s', shuffle_all_inputs=False)
accentuated_content = data._read_content('data/new_sloleks/final_sloleks2.tab')
start_timer = time.time()
print('Copy initialization complete')
with open("data/new_sloleks/final_sloleks2.xml", "ab") as myfile:
# myfile2 = open('data/new_sloleks/p' + str(iter_index) + '.xml', 'ab')
for event, element in etree.iterparse('data/Sloleks_v1.2.xml', tag="LexicalEntry", encoding="UTF-8", remove_blank_text=True):
# for event, element in etree.iterparse('data/Sloleks_v1.2.xml', tag="LexicalEntry", encoding="UTF-8", remove_blank_text=True):
# if word_glob_num >= word_limit:
# myfile2.close()
# myfile2 = open('data/new_sloleks/p' + str(iter_index) + '.xml', 'ab')
# iter_index += 1
# print("Words proccesed: " + str(word_glob_num))
#
# print("Word indeks: " + str(word_index))
# print("Word number: " + str(len(words)))
#
# # print("lexical_entries_load_number: " + str(lexical_entries_load_number))
# # print("lexical_entries_save_number: " + str(lexical_entries_save_number))
#
# end_timer = time.time()
# print("Elapsed time: " + "{0:.2f}".format((end_timer - start_timer) / 60.0) + " minutes")
lemma = ''
accentuated_word_location = ''
accentuated_word = ''
for child in element:
if child.tag == 'Lemma':
for wf in child:
if 'att' in wf.attrib and wf.attrib['att'] == 'zapis_oblike':
lemma = wf.attrib['val']
if child.tag == 'WordForm':
msd = None
word = None
for wf in child:
if 'att' in wf.attrib and wf.attrib['att'] == 'msd':
msd = wf.attrib['val']
elif wf.tag == 'FormRepresentation':
for form_rep in wf:
if form_rep.attrib['att'] == 'zapis_oblike':
word = form_rep.attrib['val']
# if msd is not None and word is not None:
# pass
# else:
# print('NOOOOO')
word_index = (word_index - 500) % len(accentuated_content)
word_index_sp = (word_index - 1) % len(accentuated_content)
while word_index != word_index_sp:
if word == accentuated_content[word_index][0] and msd == accentuated_content[word_index][2] and \
lemma == accentuated_content[word_index][1]:
accentuated_word_location = accentuated_content[word_index][4]
accentuated_word = accentuated_content[word_index][5][:-1]
del(accentuated_content[word_index])
break
word_index = (word_index + 1) % len(accentuated_content)
error = word_index == word_index_sp
if word_index == word_index_sp and word == accentuated_content[word_index][0] and msd == accentuated_content[word_index][2] \
and lemma == accentuated_content[word_index][1]:
accentuated_word_location = accentuated_content[word_index][4]
accentuated_word = accentuated_content[word_index][5][:-1]
error = False
del(accentuated_content[word_index])
if error:
print('ERROR IN ' + word + ' : ' + lemma + ' : ' + msd)
# print('ERROR IN ' + word + ' : ' + accentuated_content[word_index][0] + ' OR ' + msd + ' : '
# + accentuated_content[word_index][2])
# words.append([word, '', msd, word])
new_element = etree.Element('feat')
new_element.attrib['att'] = 'naglasna_mesta_besede'
new_element.attrib['val'] = accentuated_word_location
wf.append(new_element)
new_element = etree.Element('feat')
new_element.attrib['att'] = 'naglašena_beseda'
new_element.attrib['val'] = accentuated_word
wf.append(new_element)
new_element = etree.Element('feat')
new_element.attrib['att'] = 'SAMPA'
print(accentuated_word)
new_element.attrib['val'] = convert_to_SAMPA(accentuated_word)
wf.append(new_element)
word_glob_num += 1
# word_index += 1
# print(etree.tostring(element, encoding="UTF-8"))
# myfile2.write(etree.tostring(element, encoding="UTF-8", pretty_print=True))
if word_glob_num > word_limit:
# print('Proccessed ' + str(word_glob_num) + ' words')
end_timer = time.time()
# print("Elapsed time: " + "{0:.2f}".format((end_timer - start_timer) / 60.0) + " minutes")
word_limit += iter_num
myfile.write(etree.tostring(element, encoding="UTF-8", pretty_print=True))
element.clear()

1050
sloleks_accetuation.ipynb
File diff suppressed because it is too large

290
sloleks_accetuation2.ipynb
File diff suppressed because it is too large

36
sloleks_xml_checker.py

@@ -0,0 +1,36 @@
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
# Words processed: 650250
# Word index: 50023
# Word number: 50023
from lxml import etree
word_glob_num = 0
word_limit = 50000
iter_num = 50000
word_index = 0
accented_places = 0
accented_words = 0
enters = 0
for event, element in etree.iterparse('data/new_sloleks/final_sloleks.xml', tag="LexicalEntry", encoding="UTF-8", remove_blank_text=True):
for child in element:
for wf in child:
if wf.tag == 'FormRepresentation':
for form_rep in wf:
if form_rep.attrib['att'] == 'naglasna_mesta_besede':
accented_places += 1
if '\n' in list(form_rep.attrib['val']):
enters += 1
if form_rep.attrib['att'] == 'naglašena_beseda':
accented_words += 1
if '\n' in list(form_rep.attrib['val']):
enters += 1
element.clear()
print(accented_places)
print(accented_words)
print(enters)

1
test_data/accented_connected_text

@@ -0,0 +1 @@
Izbrúhi na sóncu só žé vëčkrat pokazáli zóbe nášim satelítom, poslédično nášim mobílnim telefónom, navigáciji, celo eléktričnemu omréžju. Á vesóljskega vreména šë në morémo napovédati – kakó bî ga láhko, se tá téden na Blédu pogovárja okóli 70 znánstvenikov Evrópske vesóljske agéncije, ki jé sebój pripeljála svôjo näjvéčjo ikóno, británca Mátta Taylorja.

6
test_data/accented_data

@@ -0,0 +1,6 @@
absolutístični absolutístični
spoštljívejše spoštljívejše
tresóče tresóče
razneséna raznesěna
žvížgih žvížgih

1
test_data/original_connected_text

@@ -0,0 +1 @@
Izbruhi na soncu so že večkrat pokazali zobe našim satelitom, posledično našim mobilnim telefonom, navigaciji, celo električnemu omrežju. A vesoljskega vremena še ne moremo napovedati – kako bi ga lahko, se ta teden na Bledu pogovarja okoli 70 znanstvenikov Evropske vesoljske agencije, ki je seboj pripeljala svojo največjo ikono, britanca Matta Taylorja.

6
test_data/unaccented_dictionary

@@ -0,0 +1,6 @@
absolutistični Afpmsay-n
spoštljivejše Afcfsg
tresoče Afpfsg
raznesena Vmp--sfp
žvižgih Ncmdl

101
tex_hyphenation.py

@@ -0,0 +1,101 @@
import sys
sys.path.insert(0, '../../../')
from prepare_data import *
dictionary, max_word, max_num_vowels, content, vowels, accetuated_vowels = create_dict()
feature_dictionary = create_feature_dictionary(content)
def read_hyphenation_pattern():
with open('../../../hyphenation') as f:
content = f.readlines()
return [x[:-1] for x in content]
def find_hyphenation_patterns_in_text(text, pattern):
res = []
index = 0
while index < len(text):
index = text.find(pattern, index)
if index == -1:
break
res.append(index)
index += 1  # advance by one so that overlapping pattern occurrences are also found
return res
def create_hyphenation_dictionary(hyphenation_pattern):
dictionary = []
for el in hyphenation_pattern:
substring = ''
anomalies_indices = []
digit_location = 0
for let in list(el):
if let.isdigit():
anomalies_indices.append([digit_location, int(let)])
else:
substring += let
digit_location += 1
dictionary.append([substring, anomalies_indices])
return dictionary
def split_hyphenated_word(split, word):
split = split[2:-2]
print(split)
word = list(word)[1:-1]
res = []
hyphenate = ''
loc = 0
for let in word:
hyphenate += let
if loc == len(split) or split[loc] % 2 == 1:
res.append(hyphenate)
hyphenate = ''
loc += 1
return res
def hyphenate_word(word, hyphenation_dictionary):
word = word.replace('è', 'č')
word = '.' + word + '.'
split = [0] * (len(word) + 1)
for pattern in hyphenation_dictionary:
pattern_locations = find_hyphenation_patterns_in_text(word, pattern[0])
for pattern_location in pattern_locations:
for el_hyphenation_dictionary in pattern[1]:
if split[pattern_location + el_hyphenation_dictionary[0]] < el_hyphenation_dictionary[1]:
split[pattern_location + el_hyphenation_dictionary[0]] = el_hyphenation_dictionary[1]
return split_hyphenated_word(split, word)
hyphenation_pattern = read_hyphenation_pattern()
# ['zz', [{0:2},{1:1},{2:2}]]
hyphenation_dictionary = create_hyphenation_dictionary(hyphenation_pattern)
separated_word = hyphenate_word('izziv', hyphenation_dictionary)
print(separated_word)
all_words = []
i = 0
for el in content:
separated_word = hyphenate_word(el[0], hyphenation_dictionary)
all_words.append([el[0], separated_word])
if i % 10000 == 0:
print(str(i)+'/'+str(len(content)))
i += 1
errors = []
errors2 = []
for word in all_words:
for hyphenated_part in word[1]:
num_vowels = 0
for let in list(hyphenated_part):
if let in vowels:
num_vowels += 1
if num_vowels == 0:
for let in list(hyphenated_part):
if let == 'r':
errors2.append(word[0])
num_vowels += 1
if num_vowels != 1:
errors.append(word)

247
text2SAMPA.py

@@ -0,0 +1,247 @@
from copy import copy
import sys
vowels = ['à', 'á', 'ä', 'é', 'ë', 'ì', 'í', 'î', 'ó', 'ô', 'ö', 'ú', 'ü', 'a', 'e', 'i', 'o', 'u', 'O', 'E']
def syllable_stressed(syllable):
# stressed_letters = [u'ŕ', u'á', u'ä', u'é', u'ë', u'ě', u'í', u'î', u'ó', u'ô', u'ö', u'ú', u'ü']
stressed_letters = [u'ŕ', u'á', u'à', u'é', u'è', u'ê', u'í', u'ì', u'ó', u'ô', u'ò', u'ú', u'ù']
for letter in syllable:
if letter in stressed_letters:
return True
return False
def is_vowel(word_list, position, vowels):
if word_list[position] in vowels:
return True
if (word_list[position] == u'r' or word_list[position] == u'R') and (position - 1 < 0 or word_list[position - 1] not in vowels) and (
position + 1 >= len(word_list) or word_list[position + 1] not in vowels):
return True
return False
def get_voiced_consonants():
return ['m', 'n', 'v', 'l', 'r', 'j', 'y', 'w', 'F', 'N']
def get_resonant_silent_consonants():
return ['b', 'd', 'z', 'ž', 'g']
def get_nonresonant_silent_consonants():
return ['p', 't', 's', 'š', 'č', 'k', 'f', 'h', 'c', 'x']
def split_consonants(consonants):
voiced_consonants = get_voiced_consonants()
resonant_silent_consonants = get_resonant_silent_consonants()
unresonant_silent_consonants = get_nonresonant_silent_consonants()
if len(consonants) == 0:
return [''], ['']
elif len(consonants) == 1:
return [''], consonants
else:
split_options = []
for i in range(len(consonants) - 1):
if consonants[i] == '-' or consonants[i] == '_':
split_options.append([i, -1])
elif consonants[i] == consonants[i + 1]:
split_options.append([i, 0])
elif consonants[i] in voiced_consonants:
if consonants[i + 1] in resonant_silent_consonants or consonants[i + 1] in unresonant_silent_consonants:
split_options.append([i, 2])
elif consonants[i] in resonant_silent_consonants:
if consonants[i + 1] in resonant_silent_consonants:
split_options.append([i, 1])
elif consonants[i + 1] in unresonant_silent_consonants:
split_options.append([i, 3])
elif consonants[i] in unresonant_silent_consonants:
if consonants[i + 1] in resonant_silent_consonants:
split_options.append([i, 4])
if split_options == []:
return [''], consonants
else:
split = min(split_options, key=lambda x: x[1])
return consonants[:split[0] + 1], consonants[split[0] + 1:]
def create_syllables(word, vowels):
word_list = list(word)
consonants = []
syllables = []
for i in range(len(word_list)):
if is_vowel(word_list, i, vowels):
if syllables == []:
consonants.append(word_list[i])
syllables.append(''.join(consonants))
else:
left_consonants, right_consonants = split_consonants(list(''.join(consonants).lower()))
syllables[-1] += ''.join(left_consonants)
right_consonants.append(word_list[i])
syllables.append(''.join(right_consonants))
consonants = []
else:
consonants.append(word_list[i])
if len(syllables) < 1:
return word
syllables[-1] += ''.join(consonants)
return syllables
def convert_to_SAMPA(word):
word = word.lower()
syllables = create_syllables(word, vowels)
letters_in_stressed_syllable = [False] * len(word)
# print(syllables)
l = 0
for syllable in syllables:
if syllable_stressed(syllable):
for i in range(len(syllable)):
letters_in_stressed_syllable[l + i] = True
# print(l)
l += len(syllable)
previous_letter = ''
word = list(word)
for i in range(len(word)):
if word[i] == 'e':
word[i] = 'E'
elif word[i] == 'o':
word[i] = 'O'
elif word[i] == 'š':
word[i] = 'S'
elif word[i] == 'ž':
word[i] = 'Z'
elif word[i] == 'h':
word[i] = 'x'
elif word[i] == 'c':
word[i] = 'ts'
elif word[i] == 'č':
word[i] = 'tS'
elif word[i] == 'á':
word[i] = 'a:'
elif word[i] == 'à':
word[i] = 'a'
elif word[i] == 'é':
word[i] = 'e:'
elif word[i] == 'è':
word[i] = 'E'
elif word[i] == 'ê':
word[i] = 'E:'
elif word[i] == 'í':
word[i] = 'i:'
elif word[i] == 'î':
word[i] = 'i'
elif word[i] == 'ó':
word[i] = 'o:'
elif word[i] == 'ô':
word[i] = 'O:'
elif word[i] == 'ò':
word[i] = 'O'
elif word[i] == 'ú':
word[i] = 'u:'
elif word[i] == 'ù':
word[i] = 'u'
elif word[i] == 'ŕ':
word[i] = '@r'
if letters_in_stressed_syllable[0]:
word[0] = '\"' + word[0]
for i in range(1, len(letters_in_stressed_syllable)):
if not letters_in_stressed_syllable[i - 1] and letters_in_stressed_syllable[i]:
word[i] = '\"' + word[i]
# if letters_in_stressed_syllable[i - 1] and not letters_in_stressed_syllable[i]:
# word[i - 1] = word[i - 1] + ':'
# if letters_in_stressed_syllable[-1]:
# word[-1] = word[-1] + ':'
word = list(''.join(word))
test_word = ''.join(word)
test_word = test_word.replace('"', '').replace(':', '')
if len(test_word) <= 1:
return ''.join(word)
previous_letter_i = -1
letter_i = 0
next_letter_i = 1
if word[0] == '\"':
letter_i = 1
if len(word) > 2 and word[2] == ':':
if len(word) > 3:
next_letter_i = 3
else:
#if word[next_letter_i] == 'l':
# word[next_letter_i] = 'l\''
#elif word[next_letter_i] == 'n':
# word[next_letter_i] = 'n\''
return ''.join(word)
else:
if len(word) > 2:
next_letter_i = 2
else:
return ''.join(word)
elif len(word) > 1 and word[1] == '\"':
next_letter_i = 2
# {('m', 'f'): 'F'}
new_word = copy(word)
while True:
if word[letter_i] == 'm' and (word[next_letter_i] == 'f' or word[next_letter_i] == 'v'):
new_word[letter_i] = 'F'
elif word[letter_i] == 'n' and (word[next_letter_i] == 'k' or word[next_letter_i] == 'g' or word[next_letter_i] == 'x'):
new_word[letter_i] = 'N'
elif word[letter_i] == 'n' and (word[next_letter_i] == 'f' or word[next_letter_i] == 'v'):
new_word[letter_i] = 'F'
elif word[letter_i] == 'n' and not word[next_letter_i] in vowels and letter_i == len(word) - 2:
new_word[letter_i] = 'n\''
elif word[letter_i] == 'l' and not word[next_letter_i] in vowels and letter_i == len(word) - 2:
new_word[letter_i] = 'l\''
elif previous_letter_i >= 0 and word[letter_i] == 'v' and not word[previous_letter_i] in vowels and word[
next_letter_i] in get_voiced_consonants():
new_word[letter_i] = 'w'
elif previous_letter_i >= 0 and word[letter_i] == 'v' and not word[previous_letter_i] in vowels and word[
next_letter_i] in get_nonresonant_silent_consonants():
new_word[letter_i] = 'W'
elif word[letter_i] == 'p' and word[next_letter_i] == 'm':
new_word[letter_i] = 'p_n'
elif word[letter_i] == 'p' and (word[next_letter_i] == 'f' or word[next_letter_i] == 'v'):
new_word[letter_i] = 'p_f'
elif word[letter_i] == 'b' and word[next_letter_i] == 'm':
new_word[letter_i] = 'b_n'
elif word[letter_i] == 'b' and (word[next_letter_i] == 'f' or word[next_letter_i] == 'v'):
new_word[letter_i] = 'b_f'
elif word[letter_i] == 't' and word[next_letter_i] == 'l':
new_word[letter_i] = 't_l'
elif word[letter_i] == 't' and word[next_letter_i] == 'n':
new_word[letter_i] = 't_n'
elif word[letter_i] == 'd' and word[next_letter_i] == 'l':
new_word[letter_i] = 'd_l'
elif word[letter_i] == 'd' and word[next_letter_i] == 'n':
new_word[letter_i] = 'd_n'
if len(word) > next_letter_i + 1:
if word[next_letter_i + 1] == ':' or word[next_letter_i + 1] == '\"':
if len(word) > next_letter_i + 2:
previous_letter_i = letter_i
letter_i = next_letter_i
next_letter_i = next_letter_i + 2
else:
#if word[next_letter_i] == 'l':
# new_word[next_letter_i] = 'l\''
#elif word[next_letter_i] == 'n':
# new_word[next_letter_i] = 'n\''
return ''.join(new_word)
else:
previous_letter_i = letter_i
letter_i = next_letter_i
next_letter_i = next_letter_i + 1
else:
#if word[next_letter_i] == 'l':
# new_word[next_letter_i] = 'l\''
#elif word[next_letter_i] == 'n':
# new_word[next_letter_i] = 'n\''
return ''.join(new_word)
# print(word)
#result = convert_to_SAMPA(sys.argv[1])
#final_result = result.replace('\"', '\'')
#print(final_result)
#return final_result

5
tts.sh

@@ -0,0 +1,5 @@
#!/bin/sh
SAMPA=$(python text2SAMPA.py $1)
echo $SAMPA
espeak -v en "[[$SAMPA]]"

99
workbench.py

@@ -0,0 +1,99 @@
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
# text in Western (Windows 1252)
import pickle
import numpy as np
from keras import optimizers
from keras.models import Model
from keras.layers import Dense, Dropout, Input
from keras.layers.merge import concatenate
from keras.layers.convolutional import Conv1D
from keras.layers.convolutional import MaxPooling1D
from keras.layers import Flatten
# from keras import backend as Input
np.random.seed(7)
# get_ipython().magic('run ../../../prepare_data.py')
# import sys
# # sys.path.insert(0, '../../../')
# sys.path.insert(0, '/home/luka/Developement/accetuation/')
from prepare_data import *
# X_train, X_other_features_train, y_train, X_validate, X_other_features_validate, y_validate = generate_full_matrix_inputs()