svala-scripts/README.md

2.0 KiB

KOST

Instructions

  • Create a new virtual environment (using i.e. virtualenv)
  • Run pip install -r requirements.txt, to install necessary libraries.
  • Using python console download classla models (used for annotation and part of tokenization):
import classla
classla.download(lang='sl', type='standard_jos')
  • Run svala2tei.py script.

Example

python svala2tei.py --svala_folder data_sample/KOST/svala_small --raw_text data_sample/KOST/raw_small --results_folder data_sample/KOST/results_small --texts_metadata data_sample/KOST/texts_metadata5.csv --authors_metadata data_sample/KOST/authors_metadata5.csv --teachers_metadata data_sample/KOST/teachers_metadata.csv --translations data_sample/KOST/translations.csv --tokenization_interprocessing data_sample/processing.tokenization --annotation_interprocessing data_sample/processing.annotation --overwrite_tokenization --overwrite_annotation

Parameter descriptions

--svala_folder

Path to directory with *.svala files.

--results_folder

Destination of results folder.

--raw_text

Path to directory that contains raw texts.

--texts_metadata

Location of metadata csv that contains information about texts.

--authors_metadata

Location of metadata csv that contains information about authors.

--teachers_metadata

Location of metadata csv that contains information about teachers.

--translations

Path to mapper that translates column names in metadata files.

--tokenization_interprocessing

Path to file where tokenized semi processed data is stored, to be able to proceed with processsing without rerunning whole test.

--overwrite_tokenization

Tag that forces script to redo tokenization and overrides interprocessing file.

--annotation_interprocessing

Path to file where annotated semi processed data is stored, to be able to proceed with processsing without rerunning whole test.

--overwrite_annotation

Tag that forces script to redo annotation and overrides interprocessing file.