KOST
Instructions
- Create a new virtual environment (e.g. using virtualenv).
- Run pip install -r requirements.txt to install the necessary libraries.
- In a Python console, download the classla models (used for annotation and part of tokenization):
  import classla
  classla.download(lang='sl', type='standard_jos')
- Extract or create the metadata in the data_sample/KOST folder.
- Run the svala2tei.py script.
Example
python svala2tei.py --svala_folder data_sample/KOST/svala_small --raw_text data_sample/KOST/raw_small --results_folder data_sample/KOST/results_small --texts_metadata data_sample/KOST/texts_metadata5.csv --authors_metadata data_sample/KOST/authors_metadata5.csv --teachers_metadata data_sample/KOST/teachers_metadata.csv --translations data_sample/KOST/translations.csv --tokenization_interprocessing data_sample/processing.tokenization --annotation_interprocessing data_sample/processing.annotation --overwrite_tokenization --overwrite_annotation
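The same invocation can also be assembled from Python, which may be easier to read and adapt than one long shell line. The following is only a sketch: the helper names `build_command` and `run_example` are hypothetical and not part of this repository, and the paths are copied from the example above.

```python
import subprocess
import sys
from pathlib import Path


def build_command(data_dir="data_sample/KOST", overwrite=True):
    """Return the example svala2tei.py invocation as an argument list.

    Hypothetical helper for illustration; it only mirrors the example command.
    """
    data = Path(data_dir)
    cmd = [
        sys.executable, "svala2tei.py",
        "--svala_folder", (data / "svala_small").as_posix(),
        "--raw_text", (data / "raw_small").as_posix(),
        "--results_folder", (data / "results_small").as_posix(),
        "--texts_metadata", (data / "texts_metadata5.csv").as_posix(),
        "--authors_metadata", (data / "authors_metadata5.csv").as_posix(),
        "--teachers_metadata", (data / "teachers_metadata.csv").as_posix(),
        "--translations", (data / "translations.csv").as_posix(),
        # In the example, the interprocessing files sit one level up, in data_sample/.
        "--tokenization_interprocessing",
        (data.parent / "processing.tokenization").as_posix(),
        "--annotation_interprocessing",
        (data.parent / "processing.annotation").as_posix(),
    ]
    if overwrite:
        # Force both stages to be redone, as in the example command.
        cmd += ["--overwrite_tokenization", "--overwrite_annotation"]
    return cmd


def run_example():
    """Run the example; assumes the current directory is the repository root."""
    subprocess.run(build_command(), check=True)
```

Calling `run_example()` from the repository root should be equivalent to the shell command above. Passing `overwrite=False` to `build_command` omits the two overwrite flags.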
Parameter descriptions
--svala_folder
Path to directory with *.svala files.
--results_folder
Path to the folder where results are written.
--raw_text
Path to directory that contains raw texts.
--texts_metadata
Location of metadata csv that contains information about texts.
--authors_metadata
Location of metadata csv that contains information about authors.
--teachers_metadata
Location of metadata csv that contains information about teachers.
--translations
Path to mapper that translates column names in metadata files.
--tokenization_interprocessing
Path to the file where tokenized, partially processed data is stored, so that processing can resume without rerunning the whole process.
--overwrite_tokenization
Flag that forces the script to redo tokenization and overwrite the interprocessing file.
--annotation_interprocessing
Path to the file where annotated, partially processed data is stored, so that processing can resume without rerunning the whole process.
--overwrite_annotation
Flag that forces the script to redo annotation and overwrite the interprocessing file.
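How an interprocessing file and its overwrite flag interact can be sketched roughly as follows. This is an illustration under assumptions, not the script's actual code: the function name `load_or_compute` is hypothetical, and the use of pickle as the on-disk format is assumed, since the format of the interprocessing files is not documented here.

```python
import pickle
from pathlib import Path


def load_or_compute(cache_path, compute, overwrite=False):
    """Reuse cached intermediate data unless an overwrite is requested.

    cache_path: path of the interprocessing file (pickle format is an assumption)
    compute:    zero-argument callable that performs the expensive stage
    overwrite:  corresponds to an --overwrite_* flag
    """
    cache_path = Path(cache_path)
    if cache_path.exists() and not overwrite:
        # A previous run already stored this stage's output, so reuse it.
        with cache_path.open("rb") as f:
            return pickle.load(f)

    # Either no stored data exists or a redo was forced.
    result = compute()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as f:
        pickle.dump(result, f)
    return result
```

With this pattern, a first run computes and stores the stage output, later runs load it from disk, and setting `overwrite=True` recomputes the stage and replaces the stored file.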