# KOST
## Instructions
- Create a new virtual environment (e.g. using `virtualenv`).
- Run `pip install -r requirements.txt` to install the necessary libraries.
- In a Python console, download the classla models (used for annotation and part of tokenization):
```python
import classla
classla.download(lang='sl', type='standard_jos')
```
- Extract or create the metadata in the `data_sample/KOST` folder.
- Run the `svala2tei.py` script.
### Example
```bash
python svala2tei.py --svala_folder data_sample/KOST/svala_small --raw_text data_sample/KOST/raw_small --results_folder data_sample/KOST/results_small --texts_metadata data_sample/KOST/texts_metadata5.csv --authors_metadata data_sample/KOST/authors_metadata5.csv --teachers_metadata data_sample/KOST/teachers_metadata.csv --translations data_sample/KOST/translations.csv --tokenization_interprocessing data_sample/processing.tokenization --annotation_interprocessing data_sample/processing.annotation --overwrite_tokenization --overwrite_annotation
```
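Because the command is long, it can be easier to assemble it from Python. The following is only a sketch: the helper `build_svala2tei_command` is hypothetical (it is not part of this repository) and simply reproduces the paths and options from the example above.

```python
import sys
from pathlib import Path


def build_svala2tei_command(data_dir="data_sample", overwrite=True):
    """Return the argument list for the example svala2tei.py invocation.

    Hypothetical helper: it only mirrors the example command above.
    """
    data_dir = Path(data_dir)
    kost = data_dir / "KOST"
    # Options that take a path, in the same order as the example command.
    options = {
        "--svala_folder": kost / "svala_small",
        "--raw_text": kost / "raw_small",
        "--results_folder": kost / "results_small",
        "--texts_metadata": kost / "texts_metadata5.csv",
        "--authors_metadata": kost / "authors_metadata5.csv",
        "--teachers_metadata": kost / "teachers_metadata.csv",
        "--translations": kost / "translations.csv",
        "--tokenization_interprocessing": data_dir / "processing.tokenization",
        "--annotation_interprocessing": data_dir / "processing.annotation",
    }
    command = [sys.executable, "svala2tei.py"]
    for flag, path in options.items():
        command += [flag, path.as_posix()]
    if overwrite:
        # Boolean flags take no value.
        command += ["--overwrite_tokenization", "--overwrite_annotation"]
    return command
```

The returned list can be passed to `subprocess.run(command, check=True)` from the repository root.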
## Parameter descriptions
### --svala_folder
Path to directory with `*.svala` files.
### --results_folder
Path to the folder where the results are written.
### --raw_text
Path to directory that contains raw texts.
### --texts_metadata
Path to the metadata CSV file that contains information about texts.
### --authors_metadata
Path to the metadata CSV file that contains information about authors.
### --teachers_metadata
Path to the metadata CSV file that contains information about teachers.
### --translations
Path to mapper that translates column names in metadata files.
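To illustrate what such a mapper could do, here is a minimal sketch. The file layout is an assumption (two columns per row: original column name, translated column name), and the helper names and sample data are invented for illustration; the actual format of `translations.csv` may differ.

```python
import csv
import io


def load_translations(handle):
    """Read (original, translated) pairs from a two-column CSV into a dict.

    Assumption: each row is `original_name,translated_name`, no header row.
    """
    mapping = {}
    for row in csv.reader(handle):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        mapping[row[0].strip()] = row[1].strip()
    return mapping


def translate_row(row, mapping):
    """Rename the keys of one metadata row; unknown columns keep their name."""
    return {mapping.get(column, column): value for column, value in row.items()}


# Hypothetical data, invented for illustration only.
translations = load_translations(io.StringIO("Starost,Age\nSpol,Sex\n"))
metadata = csv.DictReader(io.StringIO("Starost,Spol,ID\n21,F,a1\n"))
translated = [translate_row(row, translations) for row in metadata]
```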
### --tokenization_interprocessing
Path to the file where tokenized, partially processed data is stored, so that processing can continue without rerunning the whole process.
### --overwrite_tokenization
Flag that forces the script to redo tokenization and overwrite the interprocessing file.
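The interplay between `--tokenization_interprocessing` and `--overwrite_tokenization` can be sketched as a load-or-recompute pattern. This is an illustration under assumptions, not the script's actual code: the function `load_or_compute` is hypothetical, and `pickle` is assumed as the storage format.

```python
import os
import pickle
import tempfile


def load_or_compute(cache_path, compute, overwrite=False):
    """Return cached data from cache_path, or call compute() and cache it.

    Hypothetical helper; pickle as the file format is an assumption.
    """
    if os.path.exists(cache_path) and not overwrite:
        # Reuse the intermediate result from an earlier run.
        with open(cache_path, "rb") as handle:
            return pickle.load(handle)
    result = compute()
    with open(cache_path, "wb") as handle:
        pickle.dump(result, handle)
    return result


# Small demonstration with a stand-in for the expensive tokenization step.
calls = []

def fake_tokenize():
    calls.append(1)
    return ["To", "je", "test", "."]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "processing.tokenization")
    first = load_or_compute(path, fake_tokenize)                   # computes and stores
    second = load_or_compute(path, fake_tokenize)                  # loads from file
    third = load_or_compute(path, fake_tokenize, overwrite=True)   # recomputes
```

Without the overwrite flag, the stored file is reused and the expensive step is skipped; with it, the step runs again and the file is replaced.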
### --annotation_interprocessing
Path to the file where annotated, partially processed data is stored, so that processing can continue without rerunning the whole process.
### --overwrite_annotation
Flag that forces the script to redo annotation and overwrite the interprocessing file.
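As a summary, the parameters described above could be declared with `argparse` roughly as follows. This is a sketch based only on the descriptions in this README; the parser in `svala2tei.py` may define defaults, types, or additional options that are not shown here, and `build_parser` is a hypothetical name.

```python
import argparse

# Options that take a path, with help texts paraphrased from this README.
PATH_OPTIONS = [
    ("--svala_folder", "Path to directory with *.svala files."),
    ("--results_folder", "Path to the folder where the results are written."),
    ("--raw_text", "Path to directory that contains raw texts."),
    ("--texts_metadata", "Metadata CSV with information about texts."),
    ("--authors_metadata", "Metadata CSV with information about authors."),
    ("--teachers_metadata", "Metadata CSV with information about teachers."),
    ("--translations", "Mapper that translates column names in metadata files."),
    ("--tokenization_interprocessing", "File storing tokenized intermediate data."),
    ("--annotation_interprocessing", "File storing annotated intermediate data."),
]


def build_parser():
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(description="Sketch of the svala2tei.py options.")
    for name, help_text in PATH_OPTIONS:
        parser.add_argument(name, help=help_text)
    # Boolean flags: present means True, absent means False.
    parser.add_argument("--overwrite_tokenization", action="store_true",
                        help="Redo tokenization and overwrite the stored file.")
    parser.add_argument("--overwrite_annotation", action="store_true",
                        help="Redo annotation and overwrite the stored file.")
    return parser
```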