Used as a submodule in cjvt-valency. Reads the SSJ500k.xml file and the Kres F00...xml files, and provides a Python generator that yields sentences as Python dictionaries. It also has a method for writing the data to .json files.

corpusparser

A tool for parsing ssj500k and Kres into a unified .json format.

Quickstart

Run make. You will get a container with python3 and this package installed.

Input:

ssj500k

To parse ssj500k, point to the monolithic ssj500k-sl.body.xml file (tested on ssj500k 2.1).

Kres

To parse Kres, point to two folders:

  • Kres folder, containing several (around 20K) .xml files (F00XXXXX.xml.parsed.xml).
  • Kres SRL folder, containing SRL links for the corresponding F00...xml files (F00XXXXX.srl.json).
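The two Kres folders are related by file name: each F00XXXXX.xml.parsed.xml file has its SRL links in the F00XXXXX.srl.json file with the same prefix. A minimal sketch of that pairing follows. The function name pair_kres_files and its arguments are hypothetical and not part of this package, and skipping files that have no SRL counterpart is an assumption.

```python
from pathlib import Path

XML_SUFFIX = ".xml.parsed.xml"
SRL_SUFFIX = ".srl.json"


def pair_kres_files(kres_dir, srl_dir):
    """Hypothetical helper: yield (xml_path, srl_path) pairs.

    Files are matched on the shared F00XXXXX prefix. Kres files without
    a matching SRL file are skipped (an assumption, not package behavior).
    """
    for xml_path in sorted(Path(kres_dir).glob("*" + XML_SUFFIX)):
        # Strip the full double suffix to recover the F00XXXXX prefix.
        prefix = xml_path.name[: -len(XML_SUFFIX)]
        srl_path = Path(srl_dir) / (prefix + SRL_SUFFIX)
        if srl_path.exists():
            yield xml_path, srl_path
```

Calling list(pair_kres_files(kres_dir, srl_dir)) with the two folder paths gives the pairs in file-name order.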

Internal data format

This is the internal Python dict data format. It can be written to a .json file or stored in a database for use by applications.

{
	'sid': 'F0034713.5.0',
	'text': 'Mednarodni denarni sklad je odobril 30 milijard evrov vredno posojilo Grčiji. ',
	'tokens': [
		{'text': 'Mednarodni', 'lemma': 'mednaroden', 'msd': 'Ppnmeid', 'word': True, 'tid': 1},
		{'text': 'denarni', 'lemma': 'denaren', 'msd': 'Ppnmeid', 'word': True, 'tid': 2},
		{'text': 'sklad', 'lemma': 'sklad', 'msd': 'Somei', 'word': True, 'tid': 3},
		{'text': 'je', 'lemma': 'biti', 'msd': 'Gp-ste-n', 'word': True, 'tid': 4},
		{'text': 'odobril', 'lemma': 'odobriti', 'msd': 'Ggdd-em', 'word': True, 'tid': 5},
		{'text': '30', 'lemma': '30', 'msd': 'Kag', 'word': True, 'tid': 6},
		{'text': 'milijard', 'lemma': 'milijarda', 'msd': 'Sozmr', 'word': True, 'tid': 7}, # ... 
	],
	'jos_links': [
		{'to': 1, 'from': 3, 'afun': 'dol'},
		{'to': 2, 'from': 3, 'afun': 'dol'},
		{'to': 3, 'from': 5, 'afun': 'ena'}, # ...
	],
	'srl_links': [
		{'to': 3, 'from': 5, 'afun': 'ACT'},
		{'to': 7, 'from': 5, 'afun': 'PAT'}
	]
}
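As an illustration of how this format might be consumed, here is a minimal sketch that resolves links to token text and round-trips a sentence through JSON. It assumes that 'from' and 'to' in the link lists refer to token 'tid' values, which matches the sample above but is not stated explicitly. The helper resolve_links is hypothetical and not part of this package, and the sample sentence is abbreviated to three tokens.

```python
import json

# Abbreviated sample in the internal format (only three of the tokens).
sentence = {
    "sid": "F0034713.5.0",
    "text": "Mednarodni denarni sklad je odobril 30 milijard evrov vredno posojilo Grčiji. ",
    "tokens": [
        {"text": "sklad", "lemma": "sklad", "msd": "Somei", "word": True, "tid": 3},
        {"text": "odobril", "lemma": "odobriti", "msd": "Ggdd-em", "word": True, "tid": 5},
        {"text": "milijard", "lemma": "milijarda", "msd": "Sozmr", "word": True, "tid": 7},
    ],
    "jos_links": [{"to": 3, "from": 5, "afun": "ena"}],
    "srl_links": [
        {"to": 3, "from": 5, "afun": "ACT"},
        {"to": 7, "from": 5, "afun": "PAT"},
    ],
}


def resolve_links(sentence, key="srl_links"):
    """Hypothetical helper: turn links into (head, label, dependent) text triples.

    Assumes 'from' and 'to' hold 'tid' values of tokens in the same sentence.
    """
    by_tid = {tok["tid"]: tok for tok in sentence["tokens"]}
    return [
        (by_tid[link["from"]]["text"], link["afun"], by_tid[link["to"]]["text"])
        for link in sentence[key]
    ]


srl_triples = resolve_links(sentence)

# ensure_ascii=False keeps characters such as 'č' readable in the output.
serialized = json.dumps(sentence, ensure_ascii=False)
round_tripped = json.loads(serialized)
```

Because the structure holds only strings, integers, booleans, lists, and dicts, it survives a JSON round trip unchanged.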