Luka b11a547a05 Updated readme.md		2023-08-17 16:50:05 +02:00
data_sample/KOST	Removed some data.	2023-08-17 16:44:13 +02:00
src	Added instructions for running with less RAM	2023-08-17 16:17:32 +02:00
svala_formatter	Writting latest changes	2023-08-17 08:44:59 +02:00
.gitignore	Updated .gitignore	2023-08-17 16:46:12 +02:00
README.md	Updated readme.md	2023-08-17 16:50:05 +02:00
requirements.txt	Updated requirements.txt	2023-08-17 15:44:43 +02:00
solar2svala.py	Writting latest changes	2023-08-17 08:44:59 +02:00
svala2tei_solar.py	Writting latest changes	2023-08-17 08:44:59 +02:00
svala2tei.py	Cleaning code.	2023-08-17 09:20:32 +02:00
tag_selection.py	Writting latest changes	2023-08-17 08:44:59 +02:00
txt2svala.py	Updated documentation.	2023-08-17 09:16:15 +02:00

README.md

KOST

Instructions

Create a new virtual environment (using e.g. virtualenv).
Run pip install -r requirements.txt to install the required libraries.
In a Python console, download the classla models (used for annotation and part of tokenization):

import classla
classla.download(lang='sl', type='standard_jos')

Extract or create the metadata files in the data_sample/KOST folder.
Run the svala2tei.py script.

Example

python svala2tei.py --svala_folder data_sample/KOST/svala_small --raw_text data_sample/KOST/raw_small --results_folder data_sample/KOST/results_small --texts_metadata data_sample/KOST/texts_metadata5.csv --authors_metadata data_sample/KOST/authors_metadata5.csv --teachers_metadata data_sample/KOST/teachers_metadata.csv --translations data_sample/KOST/translations.csv --tokenization_interprocessing data_sample/processing.tokenization --annotation_interprocessing data_sample/processing.annotation --overwrite_tokenization --overwrite_annotation
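
When the same conversion has to be repeated, it may be convenient to launch the command above from Python. The following is only a minimal sketch and is not part of the repository: the helper names `build_command` and `run_conversion` are hypothetical, while the flag names and paths are copied from the example above. It assumes the call is made from the repository root inside the virtual environment.

```python
import subprocess
import sys
from pathlib import Path


def build_command(base, overwrite=True):
    """Build the argument list for svala2tei.py (hypothetical helper).

    `base` is the sample data folder, e.g. "data_sample".
    """
    base = Path(base)
    kost = base / "KOST"
    # Flags and file names as they appear in the example command.
    options = {
        "--svala_folder": kost / "svala_small",
        "--raw_text": kost / "raw_small",
        "--results_folder": kost / "results_small",
        "--texts_metadata": kost / "texts_metadata5.csv",
        "--authors_metadata": kost / "authors_metadata5.csv",
        "--teachers_metadata": kost / "teachers_metadata.csv",
        "--translations": kost / "translations.csv",
        "--tokenization_interprocessing": base / "processing.tokenization",
        "--annotation_interprocessing": base / "processing.annotation",
    }
    command = [sys.executable, "svala2tei.py"]
    for flag, path in options.items():
        command += [flag, path.as_posix()]
    if overwrite:
        # Boolean flags take no value.
        command += ["--overwrite_tokenization", "--overwrite_annotation"]
    return command


def run_conversion(base="data_sample", overwrite=True):
    """Run the script and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_command(base, overwrite), check=True)
```

Calling `run_conversion()` is then equivalent to typing the example command; passing `overwrite=False` leaves out the two overwrite flags.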

Parameter descriptions

--svala_folder

Path to the directory containing the *.svala files.

--results_folder

Path to the folder where the results are written.

--raw_text

Path to the directory that contains the raw texts.

--texts_metadata

Path to the metadata CSV file that contains information about texts.

--authors_metadata

Path to the metadata CSV file that contains information about authors.

--teachers_metadata

Path to the metadata CSV file that contains information about teachers.

--translations

Path to the mapping file that translates the column names used in the metadata files.

--tokenization_interprocessing

Path to the file where tokenized, partially processed data is stored, so that processing can continue without rerunning everything from the start.

--overwrite_tokenization

Flag that forces the script to redo tokenization and overwrite the tokenization interprocessing file.

--annotation_interprocessing

Path to the file where annotated, partially processed data is stored, so that processing can continue without rerunning everything from the start.

--overwrite_annotation

Flag that forces the script to redo annotation and overwrite the annotation interprocessing file.
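
The interplay between an interprocessing file and its overwrite flag can be pictured as a simple cache. The sketch below illustrates that general pattern only and should not be read as the actual code of svala2tei.py: the function names `load_or_compute` and `parse_args`, the use of `pickle` as the storage format, and the default file name are all assumptions made for the example.

```python
import argparse
import os
import pickle


def load_or_compute(path, compute, overwrite=False):
    """Return data stored at `path`, or call `compute()` and store its result.

    Hypothetical helper: the real script may store its data differently.
    """
    if not overwrite and os.path.exists(path):
        # A stored intermediate result exists and may be reused.
        with open(path, "rb") as handle:
            return pickle.load(handle)
    # No stored result, or the overwrite flag was given: redo the step
    # and replace the interprocessing file.
    data = compute()
    with open(path, "wb") as handle:
        pickle.dump(data, handle)
    return data


def parse_args(argv=None):
    """Parse only the two tokenization-related options, for illustration."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--tokenization_interprocessing",
                        default="processing.tokenization")
    # store_true makes this a flag that is False unless it is passed.
    parser.add_argument("--overwrite_tokenization", action="store_true")
    return parser.parse_args(argv)
```

Under these assumptions, a second run without --overwrite_tokenization reads the stored file instead of repeating the work, while passing the flag recomputes the step and replaces the file. The annotation pair of options would follow the same pattern.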