forked from kristjan/cjvt-srl-tagging
parser.py can read kres and/or ssj500k
parent 648f4e53d2
commit d1ba56be37
89 README.md
@@ -6,93 +6,18 @@ The tools require Java.
See `./dockerfiles/mate-tool-env/README.md` for environment preparation.

## mate-tools

Using **Full srl pipeline (including anna-3.3)** from the Downloads section.
Benchmarking the tool for slo and hr: [2] (submodule of this repo).
Check out `./tools/srl-20131216/README.md`.

Mate-tool for srl tagging can be found in `./tools/srl-20131216/`.

## Scripts

Check all possible xml tags that occur after the `<body>` tag:

`cat F0006347.xml.parsed.xml | grep -A 999999999999 -e '<body>' | grep -o -e '<[^" "]*' | sort | uniq`
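The same tag inventory can also be collected with lxml; a rough sketch (not part of the repo, the input path is just an example):

```python
# List the distinct element tags that occur inside <body> of a parsed Kres file.
from lxml import etree

NS = "{http://www.tei-c.org/ns/1.0}"
tree = etree.parse("F0006347.xml.parsed.xml")  # example path
body = tree.find(".//" + NS + "body")
if body is not None:
    # comments/processing instructions have non-string tags, skip them
    for tag in sorted({el.tag for el in body.iter() if isinstance(el.tag, str)}):
        print(tag)
```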

### train

Create the `model-file`:

## Tools

* Parser for reading both `SSJ500k 2.1 TEI xml` and `Kres F....xml.parsed.xml` files, found in `./tools/parser/parser.py`.
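A minimal usage sketch of that parser (mirroring `tools/main.py`, run from the `tools/` directory; the paths are the examples used there):

```python
# Parse one SSJ500k TEI file and one Kres parsed file with parse_tei().
from parser import parser

ssj_dict = parser.parse_tei("/dipldata/ssj500k-sl.TEI/ssj500k-sl.body.sample.xml")
kres_dict = parser.parse_tei("../data/kres_example/F0019343.xml.parsed.xml")
print(len(ssj_dict), "ssj sentences,", len(kres_dict), "kres sentences")
```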

`--help` output:

```bash
java -cp srl.jar se.lth.cs.srl.Learn --help
Not enough arguments, aborting.
Usage:
 java -cp <classpath> se.lth.cs.srl.Learn <lang> <input-corpus> <model-file> [options]

Example:
 java -cp srl.jar:lib/liblinear-1.51-with-deps.jar se.lth.cs.srl.Learn eng ~/corpora/eng/CoNLL2009-ST-English-train.txt eng-srl.mdl -reranker -fdir ~/features/eng -llbinary ~/liblinear-1.6/train

trains a complete pipeline and reranker based on the corpus and saves it to eng-srl.mdl

<lang> corresponds to the language and is one of
 chi, eng, ger

Options:
 -aibeam <int>           the size of the ai-beam for the reranker
 -acbeam <int>           the size of the ac-beam for the reranker
 -help                   prints this message

Learning-specific options:
 -fdir <dir>             the directory with feature files (see below)
 -reranker               trains a reranker also (not done by default)
 -llbinary <file>        a reference to a precompiled version of liblinear,
                         makes training much faster than the java version.
 -partitions <int>       number of partitions used for the reranker
 -dontInsertGold         don't insert the gold standard proposition during
                         training of the reranker.
 -skipUnknownPredicates  skips predicates not matching any POS-tags from
                         the feature files.
 -dontDeleteTrainData    doesn't delete the temporary files from training
                         on exit. (For debug purposes)
 -ndPipeline             Causes the training data and feature mappings to be
                         derived in a non-deterministic way. I.e. training the pipeline
                         on the same corpus twice does not yield the exact same models.
                         This is however slightly faster.

The feature file dir needs to contain four files with feature sets. See
the website for further documentation. The files are called
pi.feats, pd.feats, ai.feats, and ac.feats
All need to be in the feature file dir, otherwise you will get an error.
```

Input: `lang`, `input-corpus`.

### parse

`--help` output:

```bash
$ java -cp srl.jar se.lth.cs.srl.Parse --help
Not enough arguments, aborting.
Usage:
 java -cp <classpath> se.lth.cs.srl.Parse <lang> <input-corpus> <model-file> [options] <output>

Example:
 java -cp srl.jar:lib/liblinear-1.51-with-deps.jar se.lth.cs.srl.Parse eng ~/corpora/eng/CoNLL2009-ST-English-evaluation-SRLonly.txt eng-srl.mdl -reranker -nopi -alfa 1.0 eng-eval.out

parses in the input corpus using the model eng-srl.mdl and saves it to eng-eval.out, using a reranker and skipping the predicate identification step

<lang> corresponds to the language and is one of
 chi, eng, ger

Options:
 -aibeam <int>    the size of the ai-beam for the reranker
 -acbeam <int>    the size of the ac-beam for the reranker
 -help            prints this message

Parsing-specific options:
 -nopi            skips the predicate identification. This is equivalent to the
                  setting in the CoNLL 2009 ST.
 -reranker        uses a reranker (assumed to be included in the model)
 -alfa <double>   the alfa used by the reranker. (default 1.0)
```

We need to provide `lang` (`ger` for German feature functions?), `input-corpus`, and `model` (see train).

### Input data

* `ssj500k` data found in `./bilateral-srl/data/sl/sl.{test,train}`;
  formatted for mate-tools usage in `./bilateral-srl/tools/mate-tools/sl.{test,train}.mate` (line counts match).

## Sources

* [1] (mate-tools) https://code.google.com/archive/p/mate-tools/
* [2] (benchmarking) https://github.com/clarinsi/bilateral-srl
* [3] (CoNLL 2008 paper) http://www.aclweb.org/anthology/W08-2121.pdf
* [4] (CoNLL 2009 format) https://wiki.ufal.ms.mff.cuni.cz/format-conll
3 dockerfiles/parser-env/Dockerfile Normal file
@@ -0,0 +1,3 @@
FROM python

RUN pip install lxml
13 dockerfiles/parser-env/README.md Normal file
@@ -0,0 +1,13 @@
You might want to mount this whole repo into the Docker container.
Also mount the data locations.

Example container:
```bash
$ docker build . -t my_python
$ docker run \
    -it \
    -v /home/kristjan/git/cjvt-srl-tagging:/cjvt-srl-tagging \
    -v /home/kristjan/some_corpus_data:/some_corpus_data \
    my_python \
    /bin/bash
```
23 tools/main.py Normal file
@@ -0,0 +1,23 @@
from parser import parser
import os
from os.path import join
import sys

SSJ500K_2_1 = 27829  # number of sentences

if __name__ == "__main__":
    # make sure you sanitize every input into unicode

    print("parsing ssj")
    # ssj_file = "/home/kristjan/git/diploma/data/ssj500k-sl.TEI/ssj500k-sl.body.xml"
    # ssj_file = "/dipldata/ssj500k-sl.TEI/ssj500k-sl.body.xml"
    ssj_file = "/dipldata/ssj500k-sl.TEI/ssj500k-sl.body.sample.xml"  # smaller file
    ssj_dict = parser.parse_tei(ssj_file)
    # assert (len(ssj_dict) == 27829), "Parsed wrong number of sentences."

    print("parsing kres")
    # kres_file = "../data/kres_example/F0019343.xml.parsed.xml"
    kres_dir = "../data/kres_example/"
    for kres_file in os.listdir(kres_dir):
        parser.parse_tei(join(kres_dir, kres_file))
    print("end parsing kres")
BIN tools/parser/Parser.pyc Normal file (Binary file not shown.)
0 tools/parser/__init__.py Normal file
BIN tools/parser/__init__.pyc Normal file (Binary file not shown.)
BIN tools/parser/__pycache__/__init__.cpython-37.pyc Normal file (Binary file not shown.)
BIN tools/parser/__pycache__/parser.cpython-37.pyc Normal file (Binary file not shown.)
70 tools/parser/bench_parser.py Normal file
@@ -0,0 +1,70 @@
# NOTE: written for Python 2 (byte strings are written to text-mode files)
import xml.etree.ElementTree as ET
import random
random.seed(42)
tree = ET.parse('../../data/kres_example/F0006347.xml.parsed.xml')
print(ET.tostring(tree.getroot()))
root = tree.getroot()
# 80/10/10 train/dev/test split, drawn per document
train = []
dev = []
test = []
train_text = open('train.txt', 'w')
dev_text = open('dev.txt', 'w')
test_text = open('test.txt', 'w')
for doc in root.iter('{http://www.tei-c.org/ns/1.0}div'):
    rand = random.random()
    if rand < 0.8:
        pointer = train
        pointer_text = train_text
    elif rand < 0.9:
        pointer = dev
        pointer_text = dev_text
    else:
        pointer = test
        pointer_text = test_text
    for p in doc.iter('{http://www.tei-c.org/ns/1.0}p'):
        for element in p:
            if element.tag.endswith('s'):
                sentence = element
                text = ''
                tokens = []
                for element in sentence:
                    if element.tag[-3:] == 'seg':
                        for subelement in element:
                            text += subelement.text
                            if not subelement.tag.endswith('}c'):
                                if subelement.tag.endswith('w'):
                                    lemma = subelement.attrib['lemma']
                                else:
                                    lemma = subelement.text
                                tokens.append((subelement.text, lemma, subelement.attrib['ana'].split(':')[1]))
                    if element.tag[-2:] not in ('pc', '}w', '}c'):
                        continue
                    text += element.text
                    if not element.tag.endswith('}c'):
                        if element.tag.endswith('w'):
                            lemma = element.attrib['lemma']
                        else:
                            lemma = element.text
                        tokens.append((element.text, lemma, element.attrib['ana'].split(':')[1]))
                pointer.append((text, tokens))
                pointer_text.write(text.encode('utf8'))
            else:
                pointer_text.write(element.text.encode('utf8'))
            pointer_text.write('\n')
            # pointer_text.write('\n')

def write_list(lst, fname):
    f = open(fname, 'w')
    for text, tokens in lst:
        f.write('# text = ' + text.encode('utf8') + '\n')
        for idx, token in enumerate(tokens):
            f.write(str(idx + 1) + '\t' + token[0].encode('utf8') + '\t' + token[1].encode('utf8') + '\t_\t' + token[2] + '\t_\t_\t_\t_\t_\n')
        f.write('\n')
    f.close()

write_list(train, 'train.conllu')
write_list(dev, 'dev.conllu')
write_list(test, 'test.conllu')
train_text.close()
dev_text.close()
test_text.close()
144 tools/parser/ozbolt.py Executable file
@@ -0,0 +1,144 @@
#!/usr/bin/python3

from __future__ import print_function, unicode_literals, division
import sys
import os
import re
import pickle
from pathlib import Path

try:
    from lxml import etree as ElementTree
except ImportError:
    import xml.etree.ElementTree as ElementTree


# attributes
ID_ATTR = "id"
LEMMA_ATTR = "lemma"
ANA_ATTR = "ana"


# tags
SENTENCE_TAG = 's'
BIBL_TAG = 'bibl'
PARAGRAPH_TAG = 'p'
PC_TAG = 'pc'
WORD_TAG = 'w'
C_TAG = 'c'
S_TAG = 'S'
SEG_TAG = 'seg'


class Sentence:
    def __init__(self, sentence, s_id):
        self.id = s_id
        self.words = []
        self.text = ""

        for word in sentence:
            self.handle_word(word)

    def handle_word(self, word):
        # handle space after
        if word.tag == S_TAG:
            assert(word.text is None)
            self.text += ' '
            return

        # ASK am I handling this correctly?
        elif word.tag == SEG_TAG:
            for segword in word:
                self.handle_word(segword)
            return

        # ASK handle unknown tags (are there others?)
        elif word.tag not in (WORD_TAG, C_TAG):
            return

        # ID
        idx = str(len(self.words) + 1)

        # TOKEN
        token = word.text

        # LEMMA
        if word.tag == WORD_TAG:
            lemma = word.get(LEMMA_ATTR)
            assert(lemma is not None)
        else:
            lemma = token

        # XPOS
        xpos = word.get('msd')
        if word.tag == C_TAG:
            xpos = "Z"
        elif xpos in ("Gp-ppdzn", "Gp-spmzd"):
            xpos = "N"
        elif xpos is None:
            print(self.id)

        # save word entry
        self.words.append(['F{}.{}'.format(self.id, idx), token, lemma, xpos])

        # save for text
        self.text += word.text

    def to_conllu(self):
        lines = []
        # lines.append('# sent_id = ' + self.id)
        # CONLLu does not like spaces at the end of # text
        # lines.append('# text = ' + self.text.strip())
        for word in self.words:
            lines.append('\t'.join('_' if data is None else data for data in word))

        return lines


def convert_file(in_file, out_file):
    print("Loading xml: {}".format(in_file))
    with open(str(in_file), 'r') as fp:
        xmlstring = re.sub(' xmlns="[^"]+"', '', fp.read(), count=1)
        xmlstring = xmlstring.replace(' xml:', ' ')
        xml_tree = ElementTree.XML(xmlstring)

    print("Converting TEI -> TSV-U ...")
    lines = []

    for pidx, paragraph in enumerate(xml_tree.iterfind('.//body/p')):
        sidx = 1
        for sentence in paragraph:
            if sentence.tag != SENTENCE_TAG:
                continue

            sentence = Sentence(sentence, "{}.{}".format(pidx + 1, sidx))
            lines.extend(sentence.to_conllu())
            lines.append('')  # ASK newline between sentences
            sidx += 1

    if len(lines) == 0:
        raise RuntimeError("No sentences found")

    print("Writing output file: {}".format(out_file))
    with open(out_file, 'w') as fp:
        for line in lines:
            if sys.version_info < (3, 0):
                line = line.encode('utf-8')
            print(line, file=fp)


if __name__ == "__main__":
    """
    Input: folder of TEI files, msds are encoded as msd="Z"
    Output: just a folder
    """

    in_folder = sys.argv[1]
    out_folder = sys.argv[2]
    num_processes = int(sys.argv[3])

    files = Path(in_folder).rglob("*.xml")
    in_out = []
    for filename in files:
        out_file = out_folder + "/" + filename.name[:-4] + ".txt"
        convert_file(filename, out_file)
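Note that `num_processes` is read from `argv` and `in_out` is initialized, but the final loop converts files serially. A possible parallel variant of that loop (an assumption, not part of this commit; it reuses `in_folder`, `out_folder`, `num_processes`, and `convert_file` from the script above):

```python
# Hypothetical replacement for the serial loop above: convert files in a
# worker pool sized by the otherwise unused num_processes argument.
from multiprocessing import Pool

in_out = [
    (str(f), out_folder + "/" + f.name[:-4] + ".txt")
    for f in Path(in_folder).rglob("*.xml")
]
with Pool(num_processes) as pool:
    pool.starmap(convert_file, in_out)
```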
86 tools/parser/parser.py Normal file
@@ -0,0 +1,86 @@
from lxml import etree
import re

W_TAGS = ['w']
C_TAGS = ['c']
S_TAGS = ['S', 'pc']

# reads a TEI xml file and returns a dictionary:
# { <sentence_id>: {
#     sid: <sentence_id>,   # serves as index in MongoDB
#     text: <sentence text>,
#     tokens: <list of token tuples>,
# }}
def parse_tei(filepath):
    guess_corpus = None  # SSJ | KRES
    res_dict = {}
    with open(filepath, "r") as fp:
        # remove namespaces
        xmlstr = fp.read()
        xmlstr = re.sub('\\sxmlns="[^"]+"', '', xmlstr, count=1)
        xmlstr = re.sub(' xml:', ' ', xmlstr)

        root = etree.XML(xmlstr.encode("utf-8"))

        divs = []  # in ssj, there are divs, in Kres, there are separate files
        if "id" in root.keys():
            # Kres files start with <TEI id=...>
            guess_corpus = "KRES"
            divs = [root]
        else:
            guess_corpus = "SSJ"
            divs = root.findall(".//div")

        # parse divs
        for div in divs:
            f_id = div.get("id")

            # parse paragraphs
            for p in div.findall(".//p"):
                p_id = p.get("id").split(".")[-1]

                # parse sentences
                for s in p.findall(".//s"):
                    s_id = s.get("id").split(".")[-1]
                    sentence_text = ""
                    sentence_tokens = []

                    # parse tokens
                    for el in s.iter():
                        if el.tag in W_TAGS:
                            el_id = el.get("id").split(".")[-1]
                            if el_id[0] == 't':
                                el_id = el_id[1:]  # ssj W_TAG ids start with t
                            sentence_text += el.text
                            sentence_tokens += [(
                                "w",
                                el_id,
                                el.text,
                                el.get("lemma"),
                                (el.get("msd") if guess_corpus == "KRES" else el.get("ana").split(":")[-1]),
                            )]
                        elif el.tag in C_TAGS:
                            el_id = el.get("id") or "none"  # only Kres' C_TAGS have ids
                            el_id = el_id.split(".")[-1]
                            sentence_text += el.text
                            sentence_tokens += [("c", el_id, el.text,)]
                        elif el.tag in S_TAGS:
                            sentence_text += " "  # Kres' <S /> doesn't contain .text
                        else:
                            # pass links and linkGroups
                            # print(el.tag)
                            pass

                    sentence_id = "{}.{}.{}".format(f_id, p_id, s_id)
                    """
                    print(sentence_id)
                    print(sentence_text)
                    print(sentence_tokens)
                    """
                    if sentence_id in res_dict:
                        raise KeyError("duplicated id: {}".format(sentence_id))
                    res_dict[sentence_id] = {
                        "sid": sentence_id,
                        "text": sentence_text,
                        "tokens": sentence_tokens
                    }
    return res_dict
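For reference, a sketch (not part of this commit) of consuming the dictionary returned by `parse_tei`; as built above, the token tuples are `("w", id, text, lemma, msd)` for words and `("c", id, text)` for punctuation, and the input path is only an example:

```python
# Dump parse_tei() output as simple tab-separated lines.
# Run from the tools/ directory, like main.py.
from parser import parser

sentences = parser.parse_tei("../data/kres_example/F0019343.xml.parsed.xml")
for sid, sentence in sentences.items():
    print("# sid =", sid)
    print("# text =", sentence["text"])
    for token in sentence["tokens"]:
        # token[0] is "w" or "c"; the rest are id, text, and (lemma, msd) for words
        print("\t".join(str(field) for field in token[1:]))
    print()
```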
BIN tools/parser/parser.pyc Normal file (Binary file not shown.)
91 tools/srl-20131216/README.md Normal file
@@ -0,0 +1,91 @@
# mate-tools

Using **Full srl pipeline (including anna-3.3)** from the Downloads section.
Benchmarking the tool for slo and hr: [2] (submodule of this repo).

Mate-tool for srl tagging can be found in `./tools/srl-20131216/`.

## train

Create the `model-file`:

`--help` output:

```bash
java -cp srl.jar se.lth.cs.srl.Learn --help
Not enough arguments, aborting.
Usage:
 java -cp <classpath> se.lth.cs.srl.Learn <lang> <input-corpus> <model-file> [options]

Example:
 java -cp srl.jar:lib/liblinear-1.51-with-deps.jar se.lth.cs.srl.Learn eng ~/corpora/eng/CoNLL2009-ST-English-train.txt eng-srl.mdl -reranker -fdir ~/features/eng -llbinary ~/liblinear-1.6/train

trains a complete pipeline and reranker based on the corpus and saves it to eng-srl.mdl

<lang> corresponds to the language and is one of
 chi, eng, ger

Options:
 -aibeam <int>           the size of the ai-beam for the reranker
 -acbeam <int>           the size of the ac-beam for the reranker
 -help                   prints this message

Learning-specific options:
 -fdir <dir>             the directory with feature files (see below)
 -reranker               trains a reranker also (not done by default)
 -llbinary <file>        a reference to a precompiled version of liblinear,
                         makes training much faster than the java version.
 -partitions <int>       number of partitions used for the reranker
 -dontInsertGold         don't insert the gold standard proposition during
                         training of the reranker.
 -skipUnknownPredicates  skips predicates not matching any POS-tags from
                         the feature files.
 -dontDeleteTrainData    doesn't delete the temporary files from training
                         on exit. (For debug purposes)
 -ndPipeline             Causes the training data and feature mappings to be
                         derived in a non-deterministic way. I.e. training the pipeline
                         on the same corpus twice does not yield the exact same models.
                         This is however slightly faster.

The feature file dir needs to contain four files with feature sets. See
the website for further documentation. The files are called
pi.feats, pd.feats, ai.feats, and ac.feats
All need to be in the feature file dir, otherwise you will get an error.
```

Input: `lang`, `input-corpus`.

## parse

`--help` output:

```bash
$ java -cp srl.jar se.lth.cs.srl.Parse --help
Not enough arguments, aborting.
Usage:
 java -cp <classpath> se.lth.cs.srl.Parse <lang> <input-corpus> <model-file> [options] <output>

Example:
 java -cp srl.jar:lib/liblinear-1.51-with-deps.jar se.lth.cs.srl.Parse eng ~/corpora/eng/CoNLL2009-ST-English-evaluation-SRLonly.txt eng-srl.mdl -reranker -nopi -alfa 1.0 eng-eval.out

parses in the input corpus using the model eng-srl.mdl and saves it to eng-eval.out, using a reranker and skipping the predicate identification step

<lang> corresponds to the language and is one of
 chi, eng, ger

Options:
 -aibeam <int>    the size of the ai-beam for the reranker
 -acbeam <int>    the size of the ac-beam for the reranker
 -help            prints this message

Parsing-specific options:
 -nopi            skips the predicate identification. This is equivalent to the
                  setting in the CoNLL 2009 ST.
 -reranker        uses a reranker (assumed to be included in the model)
 -alfa <double>   the alfa used by the reranker. (default 1.0)
```

We need to provide `lang` (`ger` for German feature functions?), `input-corpus`, and `model` (see train).
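A sketch of driving that parse step from Python with `subprocess` (an assumption, not part of this commit; the jar, model, and corpus paths are placeholders, and the flags are taken from the `--help` output above):

```python
# Hypothetical invocation of the mate-tools SRL parse step.
import subprocess

subprocess.run(
    [
        "java", "-cp", "srl.jar:lib/liblinear-1.51-with-deps.jar",
        "se.lth.cs.srl.Parse",
        "ger",             # <lang>: which feature functions to use
        "sl.test.mate",    # <input-corpus> in CoNLL 2009 format
        "slo-srl.mdl",     # <model-file> produced by se.lth.cs.srl.Learn
        "-reranker",
        "-alfa", "1.0",
        "sl.test.out",     # <output>
    ],
    check=True,
)
```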

## Input data

* `ssj500k` data found in `./bilateral-srl/data/sl/sl.{test,train}`;
  formatted for mate-tools usage in `./bilateral-srl/tools/mate-tools/sl.{test,train}.mate` (line counts match).

## Sources

* [1] (mate-tools) https://code.google.com/archive/p/mate-tools/
* [2] (benchmarking) https://github.com/clarinsi/bilateral-srl
* [3] (CoNLL 2008 paper) http://www.aclweb.org/anthology/W08-2121.pdf
* [4] (CoNLL 2009 format) https://wiki.ufal.ms.mff.cuni.cz/format-conll