lkrsnik 43a7866636 Added tab2xml conversion modifications		2018-04-30 12:30:37 +02:00
preprocessed_data	Added some runnable applications of this model	2018-04-27 11:15:37 +02:00
test_data	Added some runnable applications of this model	2018-04-27 11:15:37 +02:00
__init__.py	Changed to python3	2017-06-30 13:18:48 +02:00
.gitignore	Testing oversampling	2018-04-20 21:13:31 +02:00
accentuate_connected_text.py	Added some runnable applications of this model	2018-04-27 11:15:37 +02:00
accentuate.py	Added some runnable applications of this model	2018-04-27 11:15:37 +02:00
hyphenation	[MAJOR UPDATE] Changed additional features to version 4, erased unnecessary input letters (unused vowels), split validation data to test data and validation data	2017-07-16 14:29:17 +02:00
learn_location_weights.py	Added some runnable applications of this model	2018-04-27 11:15:37 +02:00
LICENSE	Initial commit	2017-02-10 10:31:17 +01:00
notes	Added multiple results and created working grid settings and scripts	2017-08-29 20:06:59 +02:00
prepare_data.py	Added some runnable applications of this model	2018-04-27 11:15:37 +02:00
README.md	Update 'README.md'	2018-04-27 09:12:47 +00:00
requirements.txt	Created character_based_cnn	2017-06-20 12:50:34 +02:00
run_multiple_files.py	Added some tests	2018-04-22 08:06:53 +02:00
sloleks_accentuation2_tab2xml.py	Added tab2xml conversion modifications	2018-04-30 12:30:37 +02:00
sloleks_accentuation2.py	Small modifications	2018-04-20 21:15:40 +02:00
sloleks_accentuation.py	Commit before major RAM lack update	2018-03-21 11:35:05 +01:00
sloleks_accetuation2.ipynb	Testing oversampling	2018-04-20 21:13:31 +02:00
sloleks_accetuation.ipynb	Accentuation on sloleks	2018-04-14 10:25:40 +02:00
tex_hyphenation.py	[MAJOR UPDATE] Changed additional features to version 4, erased unnecessary input letters (unused vowels), split validation data to test data and validation data	2017-07-16 14:29:17 +02:00
workbench.py	Tested bidirectional architectural input	2018-03-27 11:32:20 +02:00
workbench.sh	Added num of letters to x_other_features	2017-08-18 19:08:42 +02:00
workbench.xrsl	Added multiple results and created working grid settings and scripts	2017-08-29 20:06:59 +02:00

README.md

Introduction

There is no simple algorithm for assigning stress to Slovene words; speakers of Slovene usually learn the accent together with each word. Machine learning algorithms have given promising results on this problem, so we tried deep neural networks. We tested different architectures, data representations and an ensemble of networks. The ensemble method gave the best results, correctly predicting 88.72 % of the tested words. Our neural network approach improved on the results of other machine learning methods and proved successful in stress assignment.

This is an improved version of a master's thesis project (the source code of the previous simplified repository is published at https://github.com/lkrsnik/simple_accentuation, and the previous complete source code at https://github.com/lkrsnik/accetuation). Final results on the test data set show an improvement of a little more than 1 % over its predecessor. This repository contains all test results and the models trained in different ways. It also includes some finalized scripts: one for simple accentuation of words from a list, an accentuation application for connected text, and a simple example script for training a neural network. We do not provide training data for the last one because of copyright; to make it work, you have to obtain your own training set.

Set up

Most of the dependencies of this app can be installed with the following command:

pip install -r requirements.txt

If you encounter any problems while installing Keras, check their official site (https://keras.io/#installation). The neural networks behind the reported results were trained with the Theano backend. TensorFlow might work as well, but this is not guaranteed.
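
Since the models were trained with Theano, it may help to pin the backend explicitly before Keras is imported. The sketch below assumes a multi-backend Keras release that reads the `KERAS_BACKEND` environment variable; the helper name `select_keras_backend` is hypothetical and not part of this repository.

```python
import os


def select_keras_backend(name="theano"):
    """Request a Keras backend through the KERAS_BACKEND environment variable.

    Assumption: the installed Keras is a multi-backend release that checks
    this variable at import time. The helper itself is only an illustration.
    """
    os.environ["KERAS_BACKEND"] = name
    return os.environ["KERAS_BACKEND"]


# Must run before the first `import keras` in the process;
# changing the variable afterwards has no effect on an already loaded Keras.
select_keras_backend("theano")
```

The same effect can usually be achieved by exporting the variable in the shell before calling the scripts.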

Structure

Inside the cnn folder there are two folders, word_accentuation and accent_classification. Both include all prediction models tested in this experiment: the first holds the models created for locating the stress, the second the models for classifying accents. The repository also contains four important files: prepare_data.py, which holds the majority of the code; accentuate.py, a simple accentuation app; accentuate_connected_text.py, a simple app for accentuating connected Slovene text; and learn_location_weights.py, an example of how you could train your own neural networks with different parameters.

accentuate.py

Use this script if you would like to accentuate words from a list of words with their morphological data. For it to work, generate a file in which each line contains the word of interest and its morphological data, separated by a tab. It should look like this:

absolutistični	Afpmsay-n
spoštljivejše	Afcfsg
tresoče	Afpfsg
raznesena	Vmp--sfp
žvižgih	Ncmdl
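
If the word list comes from another program, a file in this format could be produced and checked with a few lines of Python. This is a minimal sketch, not code from the repository: the function names `write_accentuation_input` and `check_accentuation_input` are hypothetical, and UTF-8 encoding is an assumption.

```python
def write_accentuation_input(entries, path):
    """Write (word, morphological tag) pairs, one tab-separated pair per line."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        for word, tag in entries:
            # A tab inside a field would break the two-column layout.
            if "\t" in word or "\t" in tag:
                raise ValueError(f"tab character inside entry: {word!r}, {tag!r}")
            handle.write(f"{word}\t{tag}\n")


def check_accentuation_input(path):
    """Read the file back and fail early on lines that are not 'word<TAB>tag'."""
    pairs = []
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 2 or not all(fields):
                raise ValueError(f"line {number}: expected 'word<TAB>tag'")
            pairs.append((fields[0], fields[1]))
    return pairs
```

Checking the file before passing it to the script gives a clearer error message than a failure deep inside the prediction code.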

You can call this script in bash with the following command:

python accentuate.py <path_to_input_file> <path_to_results_file>

Here is a working example:

python accentuate.py 'test_data/unaccented_dictionary' 'test_data/accented_data'

accentuate_connected_text.py

This app uses an external tagger to obtain morphological information from sentences. For it to work, clone the repository from https://github.com/clarinsi/reldi-tagger and pass its location as a parameter:

python accentuate_connected_text.py <path_to_input_file> <path_to_results_file> <path_to_reldi_repository>

You can try this working example, replacing the last argument with your actual path to reldi:

python accentuate_connected_text.py 'test_data/original_connected_text' 'test_data/accented_connected_text' '../reldi_tagger'
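
When the script is called from other Python code, a thin wrapper can verify the paths before starting it. The sketch below only rebuilds the command line shown above; the function names `build_command` and `accentuate_connected_text` are hypothetical, and it assumes the wrapper is run from the repository root.

```python
import subprocess
import sys
from pathlib import Path


def build_command(input_path, output_path, reldi_path,
                  script="accentuate_connected_text.py"):
    """Assemble the argument list in the order the README documents."""
    return [sys.executable, script,
            str(input_path), str(output_path), str(reldi_path)]


def accentuate_connected_text(input_path, output_path, reldi_path):
    """Check the inputs, then run the script as a subprocess."""
    if not Path(reldi_path).is_dir():
        raise FileNotFoundError(f"reldi-tagger clone not found at {reldi_path}")
    if not Path(input_path).is_file():
        raise FileNotFoundError(f"input file not found: {input_path}")
    # check=True raises CalledProcessError if the script exits with an error.
    subprocess.run(build_command(input_path, output_path, reldi_path), check=True)
```

A call such as `accentuate_connected_text('test_data/original_connected_text', 'test_data/accented_connected_text', '../reldi_tagger')` would then mirror the example command.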

learn_location_weights.py

This is an example of a script designed for learning weights. For it to work, you need training data. The given example can be used to train neural networks that assign the location of stress from letters. For examples of other neural networks, look into the original repository.