From d1ba56be3755bdfdf19d4c39300f19e17652c6fc Mon Sep 17 00:00:00 2001 From: voje Date: Sun, 3 Feb 2019 22:54:26 +0100 Subject: [PATCH] parser.py can read kres and/or ssj500k --- README.md | 89 +---------- dockerfiles/parser-env/Dockerfile | 3 + dockerfiles/parser-env/README.md | 13 ++ tools/main.py | 23 +++ tools/parser/Parser.pyc | Bin 0 -> 325 bytes tools/parser/__init__.py | 0 tools/parser/__init__.pyc | Bin 0 -> 147 bytes .../__pycache__/__init__.cpython-37.pyc | Bin 0 -> 129 bytes .../parser/__pycache__/parser.cpython-37.pyc | Bin 0 -> 1427 bytes tools/parser/bench_parser.py | 70 +++++++++ tools/parser/ozbolt.py | 144 ++++++++++++++++++ tools/parser/parser.py | 86 +++++++++++ tools/parser/parser.pyc | Bin 0 -> 325 bytes tools/srl-20131216/README.md | 91 +++++++++++ 14 files changed, 437 insertions(+), 82 deletions(-) create mode 100644 dockerfiles/parser-env/Dockerfile create mode 100644 dockerfiles/parser-env/README.md create mode 100644 tools/main.py create mode 100644 tools/parser/Parser.pyc create mode 100644 tools/parser/__init__.py create mode 100644 tools/parser/__init__.pyc create mode 100644 tools/parser/__pycache__/__init__.cpython-37.pyc create mode 100644 tools/parser/__pycache__/parser.cpython-37.pyc create mode 100644 tools/parser/bench_parser.py create mode 100755 tools/parser/ozbolt.py create mode 100644 tools/parser/parser.py create mode 100644 tools/parser/parser.pyc create mode 100644 tools/srl-20131216/README.md diff --git a/README.md b/README.md index f503b14..c0d02c5 100644 --- a/README.md +++ b/README.md @@ -6,93 +6,18 @@ The tools require Java. See `./dockerfiles/mate-tool-env/README.md` for environment preparation. ## mate-tools -Using **Full srl pipeline (including anna-3.3)** from the Downloads section. -Benchmarking the tool for slo and hr: [2] (submodule of this repo). +Check out `./tools/srl-20131216/README.md`. -Mate-tool for srl tagging can be found in `./tools/srl-20131216/`. +## Scripts +Check all possible xml tags (that occur after the tag. +'cat F0006347.xml.parsed.xml | grep -A 999999999999 -e '' | grep -o -e '<[^" "]*' | sort | uniq' -### train -Create the `model-file`: +## Tools +* Parser for reading both `SSJ500k 2.1 TEI xml` and `Kres F....xml.parsed.xml"` files found in `./tools/parser/parser.py`. -`--help` output: -```bash -java -cp srl.jar se.lth.cs.srl.Learn --help -Not enough arguments, aborting. -Usage: - java -cp se.lth.cs.srl.Learn [options] - -Example: - java -cp srl.jar:lib/liblinear-1.51-with-deps.jar se.lth.cs.srl.Learn eng ~/corpora/eng/CoNLL2009-ST-English-train.txt eng-srl.mdl -reranker -fdir ~/features/eng -llbinary ~/liblinear-1.6/train - - trains a complete pipeline and reranker based on the corpus and saves it to eng-srl.mdl - - corresponds to the language and is one of - chi, eng, ger - -Options: - -aibeam the size of the ai-beam for the reranker - -acbeam the size of the ac-beam for the reranker - -help prints this message - -Learning-specific options: - -fdir the directory with feature files (see below) - -reranker trains a reranker also (not done by default) - -llbinary a reference to a precompiled version of liblinear, - makes training much faster than the java version. - -partitions number of partitions used for the reranker - -dontInsertGold don't insert the gold standard proposition during - training of the reranker. - -skipUnknownPredicates skips predicates not matching any POS-tags from - the feature files. - -dontDeleteTrainData doesn't delete the temporary files from training - on exit. 
(For debug purposes) - -ndPipeline Causes the training data and feature mappings to be - derived in a non-deterministic way. I.e. training the pipeline - on the same corpus twice does not yield the exact same models. - This is however slightly faster. - -The feature file dir needs to contain four files with feature sets. See -the website for further documentation. The files are called -pi.feats, pd.feats, ai.feats, and ac.feats -All need to be in the feature file dir, otherwise you will get an error. -``` -Input: `lang`, `input-corpus`. - -### parse -`--help` output: -```bash -$ java -cp srl.jar se.lth.cs.srl.Parse --help -Not enough arguments, aborting. -Usage: - java -cp se.lth.cs.srl.Parse [options] - -Example: - java -cp srl.jar:lib/liblinear-1.51-with-deps.jarse.lth.cs.srl.Parse eng ~/corpora/eng/CoNLL2009-ST-English-evaluation-SRLonly.txt eng-srl.mdl -reranker -nopi -alfa 1.0 eng-eval.out - - parses in the input corpus using the model eng-srl.mdl and saves it to eng-eval.out, using a reranker and skipping the predicate identification step - - corresponds to the language and is one of - chi, eng, ger - -Options: - -aibeam the size of the ai-beam for the reranker - -acbeam the size of the ac-beam for the reranker - -help prints this message - -Parsing-specific options: - -nopi skips the predicate identification. This is equivalent to the - setting in the CoNLL 2009 ST. - -reranker uses a reranker (assumed to be included in the model) - -alfa the alfa used by the reranker. (default 1.0) -``` -We need to provide `lang` (`ger` for German feature functions?), `input-corpus` and `model` (see train). - -### input data: -* `ssj500k` data found in `./bilateral-srl/data/sl/sl.{test,train}`; -formatted for mate-tools usage in `./bilaterla-srl/tools/mate-tools/sl.{test,train}.mate` (line counts match); ## Sources * [1] (mate-tools) https://code.google.com/archive/p/mate-tools/ * [2] (benchmarking) https://github.com/clarinsi/bilateral-srl * [3] (conll 2008 paper) http://www.aclweb.org/anthology/W08-2121.pdf -* [4] (format CoNLL 2009) https://wiki.ufal.ms.mff.cuni.cz/format-conll \ No newline at end of file +* [4] (format CoNLL 2009) https://wiki.ufal.ms.mff.cuni.cz/format-conll diff --git a/dockerfiles/parser-env/Dockerfile b/dockerfiles/parser-env/Dockerfile new file mode 100644 index 0000000..528a356 --- /dev/null +++ b/dockerfiles/parser-env/Dockerfile @@ -0,0 +1,3 @@ +FROM python + +RUN pip install lxml diff --git a/dockerfiles/parser-env/README.md b/dockerfiles/parser-env/README.md new file mode 100644 index 0000000..f4b5ada --- /dev/null +++ b/dockerfiles/parser-env/README.md @@ -0,0 +1,13 @@ +You might want to mount this whole repo into the docker container. +Also mount data locations. + +Example container: +```bash +$ docker build . 
-t my_python +$ docker run \ + -it \ + -v /home/kristjan/git/cjvt-srl-tagging:/cjvt-srl-tagging \ + -v /home/kristjan/some_corpus_data:/some_corpus_data \ + my_python \ + /bin/bash +``` diff --git a/tools/main.py b/tools/main.py new file mode 100644 index 0000000..38cd5e3 --- /dev/null +++ b/tools/main.py @@ -0,0 +1,23 @@ +from parser import parser +import os +from os.path import join +import sys + +SSJ500K_2_1 = 27829 # number of sentences + +if __name__ == "__main__": + # make sure you sanitize every input into unicode + + print("parsing ssj") + # ssj_file = "/home/kristjan/git/diploma/data/ssj500k-sl.TEI/ssj500k-sl.body.xml" + # ssj_file = "/dipldata/ssj500k-sl.TEI/ssj500k-sl.body.xml" + ssj_file = "/dipldata/ssj500k-sl.TEI/ssj500k-sl.body.sample.xml" # smaller file + ssj_dict = parser.parse_tei(ssj_file) + # assert (len(ssj_dict) == 27829), "Parsed wrong number of sentences." + + print("parsing kres") + # kres_file = "../data/kres_example/F0019343.xml.parsed.xml" + kres_dir = "../data/kres_example/" + for kres_file in os.listdir(kres_dir): + parser.parse_tei(join(kres_dir, kres_file)) + print("end parsing kres") diff --git a/tools/parser/Parser.pyc b/tools/parser/Parser.pyc new file mode 100644 index 0000000000000000000000000000000000000000..0e9ac3284925fb8f4338ea134ecc0467a51dda25 GIT binary patch literal 325 zcmZSn%*(a(a9B(-0~9aWd3=Ay{3{gM^BSWwT6Ho+2oij)sCrBM5gS!V% zF+`0YSZ_vZPENj#LP26tacYqUP!R(fQEUYi)6dAyP1VmX$}BF)O3c$w&n(eT&MGU> zEiTH@ElEsI&&*5LFUil(Db|M=q#pn_Pp_b|ga@b{Y)L#s3gq%)kQ*5o{WL(FAP}3% VmT~}%w8_m+Da}c>16v5P0|44^NQVFb literal 0 HcmV?d00001 diff --git a/tools/parser/__init__.py b/tools/parser/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/tools/parser/__init__.pyc b/tools/parser/__init__.pyc new file mode 100644 index 0000000000000000000000000000000000000000..d333b89c362415aca1b483e4a42d569584230a4a GIT binary patch literal 147 zcmZSn%**xXKv+yN0~9a+XNg?O} literal 0 HcmV?d00001 diff --git a/tools/parser/__pycache__/__init__.cpython-37.pyc b/tools/parser/__pycache__/__init__.cpython-37.pyc new file mode 100644 index 0000000000000000000000000000000000000000..8b49a1ebc35c5467384b158794c4ef7f27d2ab9d GIT binary patch literal 129 zcmZ?b<>g`kf>K!yVl7qb9~6oz01O-8?!3`HPe1o2BtKRK(cM7Ovo zN4F#~Jv}ooUB4thKc`r~AhD=8wMaicJ~J<~BtBlRpz;=nO>TZlX-=vg$c$njW&i*J CV;vs= literal 0 HcmV?d00001 diff --git a/tools/parser/__pycache__/parser.cpython-37.pyc b/tools/parser/__pycache__/parser.cpython-37.pyc new file mode 100644 index 0000000000000000000000000000000000000000..f99921e0e06f1c4b5a144c2b9d9c75d23e9f3723 GIT binary patch literal 1427 zcmZWp&2Jnv6t_K|+1bhN>_^H++9E{}AXbp12P9Bb5YSdg4It4)N@W_-W$j%inVp%~ z-jHlP5~9EjkrU#8svU0pABaD~S0tofbIhf@nWPX@d*Ac>@_WDMvG;>!Ge9slx3}&+ z@)7#0RGtn3!h1mKO<){x9HR+l7?(1M2_xJYW9D!QoN||YW5QfsvP>1uk&zj#2LMx^d^V0_}Nnmj^jE?ilA@ z67q#op8LE3k=`*r{LOeb(J|isC9j#P=$vl;9V%tO|<=Y-ZHJ-hHmZw3TaI1nO$wu=A9+P zyL$v8u7O55`kp}DwrRrN@GX=#O~@BbyVR--mkMk!52_vaE6Z~QPP&!!e1QS`3uo9z za2~Kf+&`&Mv?>&TJVkMgv!D5kP^W#N+y7TxI;%Q@8*w2|8_xU0GTnV#>YbI(zgXFReZ`{3<@GAB-Y}gedgd;Jf8!J{&)lgVt-ftZ z#2MYZbF0sse*gB&TU%e}(ZP(unpwQ|_z8?kQ?krtv6c?3%#FolGGNX`apnw?!71Kg zG)WU7DY$3;gM6*}%{qLV#nEt}1z(Q%#`1%_=P*Y_+)^zLHS=`3Clb|Lv^32`VyP4Z zZiy5SH2qSRua7x*KmN#iA{nM!Sh^>U6swJ*gb(7_I%A=&tDuIqG}1y^M-Rr<+uHf; z{SW)HU4r&*REqWb=j$kylYzGWrZ~DKWh#4J>yM&XWCQ&bYmBEtshwdev#Daf6l#Y@ zLv6_@V;&rz(h^XnsfOns9VmGE(E+2Q9cT$>jAoUf7*!y!rV@z;SjC>gKoG>z4IHjqll^JOhNvh~UWRiY3! 
zs{6nYrZ7X|<1X+vp|}C$0|g|+0lsuTt0>$6Ed(B5QcZ}&*GL1m!CJ-!IG~Wt3m{h+ zu|njlP@`94G=@{DzLZ=9i@aR2DKY}UU%N%=Gu@0K@9mH)sEK<8l( G`~E-go_R9> literal 0 HcmV?d00001 diff --git a/tools/parser/bench_parser.py b/tools/parser/bench_parser.py new file mode 100644 index 0000000..3251ea3 --- /dev/null +++ b/tools/parser/bench_parser.py @@ -0,0 +1,70 @@ +import xml.etree.ElementTree as ET +import random +random.seed(42) +tree=ET.parse('../../data/kres_example/F0006347.xml.parsed.xmll') +print(ET.tostring(tree)) +root=tree.getroot() +train=[] +dev=[] +test=[] +train_text=open('train.txt','w') +dev_text=open('dev.txt','w') +test_text=open('test.txt','w') +for doc in root.iter('{http://www.tei-c.org/ns/1.0}div'): + rand=random.random() + if rand<0.8: + pointer=train + pointer_text=train_text + elif rand<0.9: + pointer=dev + pointer_text=dev_text + else: + pointer=test + pointer_text=test_text + for p in doc.iter('{http://www.tei-c.org/ns/1.0}p'): + for element in p: + if element.tag.endswith('s'): + sentence=element + text='' + tokens=[] + for element in sentence: + if element.tag[-3:]=='seg': + for subelement in element: + text+=subelement.text + if not subelement.tag.endswith('}c'): + if subelement.tag.endswith('w'): + lemma=subelement.attrib['lemma'] + else: + lemma=subelement.text + tokens.append((subelement.text,lemma,subelement.attrib['ana'].split(':')[1])) + if element.tag[-2:] not in ('pc','}w','}c'): + continue + text+=element.text + if not element.tag.endswith('}c'): + if element.tag.endswith('w'): + lemma=element.attrib['lemma'] + else: + lemma=element.text + tokens.append((element.text,lemma,element.attrib['ana'].split(':')[1])) + pointer.append((text,tokens)) + pointer_text.write(text.encode('utf8')) + else: + pointer_text.write(element.text.encode('utf8')) + pointer_text.write('\n') + #pointer_text.write('\n') + +def write_list(lst,fname): + f=open(fname,'w') + for text,tokens in lst: + f.write('# text = '+text.encode('utf8')+'\n') + for idx,token in enumerate(tokens): + f.write(str(idx+1)+'\t'+token[0].encode('utf8')+'\t'+token[1].encode('utf8')+'\t_\t'+token[2]+'\t_\t_\t_\t_\t_\n') + f.write('\n') + f.close() +write_list(train,'train.conllu') +write_list(dev,'dev.conllu') +write_list(test,'test.conllu') +train_text.close() +dev_text.close() +test_text.close() + \ No newline at end of file diff --git a/tools/parser/ozbolt.py b/tools/parser/ozbolt.py new file mode 100755 index 0000000..9617c73 --- /dev/null +++ b/tools/parser/ozbolt.py @@ -0,0 +1,144 @@ +#!/usr/bin/python3 + +from __future__ import print_function, unicode_literals, division +import sys +import os +import re +import pickle +from pathlib import Path + +try: + from lxml import etree as ElementTree +except ImportError: + import xml.etree.ElementTree as ElementTree + + +# attributes +ID_ATTR = "id" +LEMMA_ATTR = "lemma" +ANA_ATTR = "ana" + + +# tags +SENTENCE_TAG = 's' +BIBL_TAG = 'bibl' +PARAGRAPH_TAG = 'p' +PC_TAG = 'pc' +WORD_TAG = 'w' +C_TAG = 'c' +S_TAG = 'S' +SEG_TAG = 'seg' + + +class Sentence: + def __init__(self, sentence, s_id): + self.id = s_id + self.words = [] + self.text = "" + + for word in sentence: + self.handle_word(word) + + def handle_word(self, word): + # handle space after + if word.tag == S_TAG: + assert(word.text is None) + self.text += ' ' + return + + # ASK am I handling this correctly? + elif word.tag == SEG_TAG: + for segword in word: + self.handle_word(segword) + return + + # ASK handle unknown tags (are there others?) 
+ elif word.tag not in (WORD_TAG, C_TAG): + return + + # ID + idx = str(len(self.words) + 1) + + # TOKEN + token = word.text + + # LEMMA + if word.tag == WORD_TAG: + lemma = word.get(LEMMA_ATTR) + assert(lemma is not None) + else: + lemma = token + + # XPOS + xpos = word.get('msd') + if word.tag == C_TAG: + xpos = "Z" + elif xpos in ("Gp-ppdzn", "Gp-spmzd"): + xpos = "N" + elif xpos is None: + print(self.id) + + # save word entry + self.words.append(['F{}.{}'.format(self.id, idx), token, lemma, xpos]) + + # save for text + self.text += word.text + + + def to_conllu(self): + lines = [] + # lines.append('# sent_id = ' + self.id) + # CONLLu does not like spaces at the end of # text + # lines.append('# text = ' + self.text.strip()) + for word in self.words: + lines.append('\t'.join('_' if data is None else data for data in word)) + + return lines + +def convert_file(in_file, out_file): + print("Nalaganje xml: {}".format(in_file)) + with open(str(in_file), 'r') as fp: + xmlstring = re.sub(' xmlns="[^"]+"', '', fp.read(), count=1) + xmlstring = xmlstring.replace(' xml:', ' ') + xml_tree = ElementTree.XML(xmlstring) + + print("Pretvarjanje TEI -> TSV-U ...") + lines = [] + + for pidx, paragraph in enumerate(xml_tree.iterfind('.//body/p')): + sidx = 1 + for sentence in paragraph: + if sentence.tag != SENTENCE_TAG: + continue + + sentence = Sentence(sentence, "{}.{}".format(pidx + 1, sidx)) + lines.extend(sentence.to_conllu()) + lines.append('') # ASK newline between sentences + sidx += 1 + + if len(lines) == 0: + raise RuntimeError("Nobenih stavkov najdenih") + + print("Zapisovanje izhodne datoteke: {}".format(out_file)) + with open(out_file, 'w') as fp: + for line in lines: + if sys.version_info < (3, 0): + line = line.encode('utf-8') + print(line, file=fp) + + +if __name__ == "__main__": + """ + Input: folder of TEI files, msds are encoded as msd="Z" + Ouput: just a folder + """ + + in_folder = sys.argv[1] + out_folder = sys.argv[2] + num_processes = int(sys.argv[3]) + + files = Path(in_folder).rglob("*.xml") + in_out = [] + for filename in files: + out_file = out_folder + "/" + filename.name[:-4] + ".txt" + convert_file(filename, out_file) diff --git a/tools/parser/parser.py b/tools/parser/parser.py new file mode 100644 index 0000000..b016030 --- /dev/null +++ b/tools/parser/parser.py @@ -0,0 +1,86 @@ +from lxml import etree +import re + +W_TAGS = ['w'] +C_TAGS = ['c'] +S_TAGS = ['S', 'pc'] + +# reads a TEI xml file and returns a dictionary: +# { : { +# sid: , # serves as index in MongoDB +# text: , +# tokens: , +# }} +def parse_tei(filepath): + guess_corpus = None # SSJ | KRES + res_dict = {} + with open(filepath, "r") as fp: + # remove namespaces + xmlstr = fp.read() + xmlstr = re.sub('\\sxmlns="[^"]+"', '', xmlstr, count=1) + xmlstr = re.sub(' xml:', ' ', xmlstr) + + root = etree.XML(xmlstr.encode("utf-8")) + + divs = [] # in ssj, there are divs, in Kres, there are separate files + if "id" in root.keys(): + # Kres files start with + guess_corpus = "KRES" + divs = [root] + else: + guess_corpus = "SSJ" + divs = root.findall(".//div") + + # parse divs + for div in divs: + f_id = div.get("id") + + # parse paragraphs + for p in div.findall(".//p"): + p_id = p.get("id").split(".")[-1] + + # parse sentences + for s in p.findall(".//s"): + s_id = s.get("id").split(".")[-1] + sentence_text = "" + sentence_tokens = [] + + # parse tokens + for el in s.iter(): + if el.tag in W_TAGS: + el_id = el.get("id").split(".")[-1] + if el_id[0] == 't': + el_id = el_id[1:] # ssj W_TAG ids start with t + 
sentence_text += el.text + sentence_tokens += [( + "w", + el_id, + el.text, + el.get("lemma"), + (el.get("msd") if guess_corpus == "KRES" else el.get("ana").split(":")[-1]), + )] + elif el.tag in C_TAGS: + el_id = el.get("id") or "none" # only Kres' C_TAGS have ids + el_id = el_id.split(".")[-1] + sentence_text += el.text + sentence_tokens += [("c", el_id, el.text,)] + elif el.tag in S_TAGS: + sentence_text += " " # Kres' doesn't contain .text + else: + # pass links and linkGroups + # print(el.tag) + pass + sentence_id = "{}.{}.{}".format(f_id, p_id, s_id) + """ + print(sentence_id) + print(sentence_text) + print(sentence_tokens) + """ + if sentence_id in res_dict: + raise KeyError("duplicated id: {}".format(sentence_id)) + res_dict[sentence_id] = { + "sid": sentence_id, + "text": sentence_text, + "tokens": sentence_tokens + } + return res_dict diff --git a/tools/parser/parser.pyc b/tools/parser/parser.pyc new file mode 100644 index 0000000000000000000000000000000000000000..5ddae18ab3193dcd82bc96287f8c80e21cb84f52 GIT binary patch literal 325 zcmZSn%*(a(a9B(-0~9aWd3=Ay{3{gM^BSWwT6Ho+2oij)sCrBM5gS!V% zF+`0YSZ_vZPENj#LP26tacYqUP!R(fQEUYi)6dAyP1VmX$}BF)O3c$w&n(eT&MGU> zEiTH@ElEsI&&*5LFUil(Db|M=1f}!}Doc2P+QF8@L!>}1F9x}hfzeL`#0dhiscb0+ S&`6uy{FKt1R6DSRAUgo?&`8Gs literal 0 HcmV?d00001 diff --git a/tools/srl-20131216/README.md b/tools/srl-20131216/README.md new file mode 100644 index 0000000..ffde26f --- /dev/null +++ b/tools/srl-20131216/README.md @@ -0,0 +1,91 @@ +# mate-tools +Using **Full srl pipeline (including anna-3.3)** from the Downloads section. +Benchmarking the tool for slo and hr: [2] (submodule of this repo). + +Mate-tool for srl tagging can be found in `./tools/srl-20131216/`. + +## train +Create the `model-file`: + +`--help` output: +```bash +java -cp srl.jar se.lth.cs.srl.Learn --help +Not enough arguments, aborting. +Usage: + java -cp se.lth.cs.srl.Learn [options] + +Example: + java -cp srl.jar:lib/liblinear-1.51-with-deps.jar se.lth.cs.srl.Learn eng ~/corpora/eng/CoNLL2009-ST-English-train.txt eng-srl.mdl -reranker -fdir ~/features/eng -llbinary ~/liblinear-1.6/train + + trains a complete pipeline and reranker based on the corpus and saves it to eng-srl.mdl + + corresponds to the language and is one of + chi, eng, ger + +Options: + -aibeam the size of the ai-beam for the reranker + -acbeam the size of the ac-beam for the reranker + -help prints this message + +Learning-specific options: + -fdir the directory with feature files (see below) + -reranker trains a reranker also (not done by default) + -llbinary a reference to a precompiled version of liblinear, + makes training much faster than the java version. + -partitions number of partitions used for the reranker + -dontInsertGold don't insert the gold standard proposition during + training of the reranker. + -skipUnknownPredicates skips predicates not matching any POS-tags from + the feature files. + -dontDeleteTrainData doesn't delete the temporary files from training + on exit. (For debug purposes) + -ndPipeline Causes the training data and feature mappings to be + derived in a non-deterministic way. I.e. training the pipeline + on the same corpus twice does not yield the exact same models. + This is however slightly faster. + +The feature file dir needs to contain four files with feature sets. See +the website for further documentation. The files are called +pi.feats, pd.feats, ai.feats, and ac.feats +All need to be in the feature file dir, otherwise you will get an error. +``` +Input: `lang`, `input-corpus`. 
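To make the argument order concrete, here is a hypothetical training invocation for the Slovene data, mirroring the `eng` example from the help text above. This is a sketch only: the `ger` language code (the README's own open question about German feature functions), the `-fdir` path, and the output model name `sl-srl.mdl` are assumptions, not files or settings confirmed by this repo.

```bash
# Sketch of a Learn run on the mate-formatted ssj500k training split.
# The feature directory, the model name and the `ger` language code are assumptions.
java -cp srl.jar:lib/liblinear-1.51-with-deps.jar se.lth.cs.srl.Learn \
  ger \
  ./bilateral-srl/tools/mate-tools/sl.train.mate \
  sl-srl.mdl \
  -reranker \
  -fdir ./features/ger
```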
+
+## parse
+`--help` output:
+```bash
+$ java -cp srl.jar se.lth.cs.srl.Parse --help
+Not enough arguments, aborting.
+Usage:
+ java -cp se.lth.cs.srl.Parse [options]
+
+Example:
+ java -cp srl.jar:lib/liblinear-1.51-with-deps.jar se.lth.cs.srl.Parse eng ~/corpora/eng/CoNLL2009-ST-English-evaluation-SRLonly.txt eng-srl.mdl -reranker -nopi -alfa 1.0 eng-eval.out
+
+ parses in the input corpus using the model eng-srl.mdl and saves it to eng-eval.out, using a reranker and skipping the predicate identification step
+
+<lang> corresponds to the language and is one of
+ chi, eng, ger
+
+Options:
+ -aibeam the size of the ai-beam for the reranker
+ -acbeam the size of the ac-beam for the reranker
+ -help prints this message
+
+Parsing-specific options:
+ -nopi skips the predicate identification. This is equivalent to the
+ setting in the CoNLL 2009 ST.
+ -reranker uses a reranker (assumed to be included in the model)
+ -alfa the alfa used by the reranker. (default 1.0)
+```
+We need to provide `lang` (`ger` for German feature functions?), `input-corpus`, and `model` (see train).
+
+## Input data
+* `ssj500k` data found in `./bilateral-srl/data/sl/sl.{test,train}`;
+formatted for mate-tools usage in `./bilateral-srl/tools/mate-tools/sl.{test,train}.mate` (line counts match);
+
+## Sources
+* [1] (mate-tools) https://code.google.com/archive/p/mate-tools/
+* [2] (benchmarking) https://github.com/clarinsi/bilateral-srl
+* [3] (conll 2008 paper) http://www.aclweb.org/anthology/W08-2121.pdf
+* [4] (format CoNLL 2009) https://wiki.ufal.ms.mff.cuni.cz/format-conll
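As a companion to the train sketch above, a hypothetical parsing run over the held-out split might look like this. Again only a sketch: `sl-srl.mdl` and `sl.test.out` are placeholder names, not files produced by this patch, and the argument order simply follows the `eng` example from the Parse help output.

```bash
# Sketch of a Parse run with a previously trained model (see the train section).
# Model and output file names are placeholders; the `ger` language code is an assumption.
java -cp srl.jar:lib/liblinear-1.51-with-deps.jar se.lth.cs.srl.Parse \
  ger \
  ./bilateral-srl/tools/mate-tools/sl.test.mate \
  sl-srl.mdl \
  -reranker \
  sl.test.out
```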