data_sample/KOST Removed some data. 2023-08-17 16:44:13 +02:00
src Added instructions for running with less RAM 2023-08-17 16:17:32 +02:00
svala_formatter Writting latest changes 2023-08-17 08:44:59 +02:00
.gitignore Updated .gitignore 2023-08-17 16:46:12 +02:00
README.md Updated readme.md 2023-08-17 16:50:05 +02:00
requirements.txt Updated requirements.txt 2023-08-17 15:44:43 +02:00
solar2svala.py Writting latest changes 2023-08-17 08:44:59 +02:00
svala2tei_solar.py Writting latest changes 2023-08-17 08:44:59 +02:00
svala2tei.py Cleaning code. 2023-08-17 09:20:32 +02:00
tag_selection.py Writting latest changes 2023-08-17 08:44:59 +02:00
txt2svala.py Updated documentation. 2023-08-17 09:16:15 +02:00

KOST

Instructions

  • Create a new virtual environment (e.g. with virtualenv).
  • Run pip install -r requirements.txt to install the necessary libraries.
  • In a Python console, download the classla models (used for annotation and part of tokenization):
import classla
classla.download(lang='sl', type='standard_jos')
  • Extract or create the metadata files in the data_sample/KOST folder.
  • Run the svala2tei.py script.

Example

python svala2tei.py \
    --svala_folder data_sample/KOST/svala_small \
    --raw_text data_sample/KOST/raw_small \
    --results_folder data_sample/KOST/results_small \
    --texts_metadata data_sample/KOST/texts_metadata5.csv \
    --authors_metadata data_sample/KOST/authors_metadata5.csv \
    --teachers_metadata data_sample/KOST/teachers_metadata.csv \
    --translations data_sample/KOST/translations.csv \
    --tokenization_interprocessing data_sample/processing.tokenization \
    --annotation_interprocessing data_sample/processing.annotation \
    --overwrite_tokenization \
    --overwrite_annotation
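
When the same conversion is run repeatedly, it can be convenient to assemble the command from Python. The following is a minimal sketch and not part of the repository: build_command and run_conversion are hypothetical helpers, and they use only the flags and sample paths from the example above.

```python
import subprocess
import sys
from pathlib import Path


def build_command(data_dir, overwrite=True):
    """Assemble the svala2tei.py argument list (hypothetical helper)."""
    root = Path(data_dir)
    kost = root / "KOST"
    # Options that take a value, with the sample paths from the example.
    options = {
        "--svala_folder": kost / "svala_small",
        "--raw_text": kost / "raw_small",
        "--results_folder": kost / "results_small",
        "--texts_metadata": kost / "texts_metadata5.csv",
        "--authors_metadata": kost / "authors_metadata5.csv",
        "--teachers_metadata": kost / "teachers_metadata.csv",
        "--translations": kost / "translations.csv",
        "--tokenization_interprocessing": root / "processing.tokenization",
        "--annotation_interprocessing": root / "processing.annotation",
    }
    command = [sys.executable, "svala2tei.py"]
    for flag, value in options.items():
        command += [flag, value.as_posix()]
    if overwrite:
        # Switches without a value: redo both intermediate stages.
        command += ["--overwrite_tokenization", "--overwrite_annotation"]
    return command


def run_conversion(data_dir="data_sample"):
    """Run the script; call this from the repository root."""
    subprocess.run(build_command(data_dir), check=True)
```

Passing overwrite=False leaves out both overwrite switches, so a later run can reuse the stored intermediate files.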

Parameter descriptions

--svala_folder

Path to directory with *.svala files.

--results_folder

Path to the folder where the results are written.

--raw_text

Path to directory that contains raw texts.

--texts_metadata

Path to the metadata CSV file that contains information about texts.

--authors_metadata

Path to the metadata CSV file that contains information about authors.

--teachers_metadata

Path to the metadata CSV file that contains information about teachers.

--translations

Path to the mapping file that translates column names in the metadata files.

--tokenization_interprocessing

Path to the file where tokenized, partially processed data is stored, so that processing can continue without rerunning the whole process.

--overwrite_tokenization

Flag that forces the script to redo tokenization and overwrite the interprocessing file.

--annotation_interprocessing

Path to the file where annotated, partially processed data is stored, so that processing can continue without rerunning the whole process.

--overwrite_annotation

Flag that forces the script to redo annotation and overwrite the interprocessing file.
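
The way an interprocessing file and its overwrite switch work together can be pictured as a simple cache. The sketch below is only an illustration under assumptions: load_or_compute is a hypothetical name, and pickle is an assumed storage format, since the format the script actually uses is not described here.

```python
import pickle
from pathlib import Path


def load_or_compute(cache_path, compute, overwrite=False):
    """Return stored data unless it is missing or overwrite is set."""
    cache_path = Path(cache_path)
    if cache_path.exists() and not overwrite:
        # Reuse the result of an earlier run.
        with cache_path.open("rb") as handle:
            return pickle.load(handle)
    # Redo the stage and replace the stored result.
    result = compute()
    with cache_path.open("wb") as handle:
        pickle.dump(result, handle)
    return result
```

Under this sketch, a second call with the same path skips the expensive stage, while overwrite=True plays the role of --overwrite_tokenization or --overwrite_annotation.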