parser.py can read kres and/or ssj500k

per-file
voje 5 years ago
parent 648f4e53d2
commit d1ba56be37

@ -6,93 +6,18 @@ The tools require Java.
See `./dockerfiles/mate-tool-env/README.md` for environment preparation.
## mate-tools
Using **Full srl pipeline (including anna-3.3)** from the Downloads section.
Benchmarking the tool for slo and hr: [2] (submodule of this repo).
Check out `./tools/srl-20131216/README.md`.
Mate-tool for srl tagging can be found in `./tools/srl-20131216/`.
## Scripts
Check all possible xml tags (those that occur after the `<body>` tag):
```bash
cat F0006347.xml.parsed.xml | grep -A 999999999999 -e '<body>' | grep -o -e '<[^" "]*' | sort | uniq
```
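The same inventory can be collected in Python. A rough sketch (the file name follows the example above; stripping the default namespace first, as the repo's other scripts do, is an assumption about the Kres files):

```python
import re
import xml.etree.ElementTree as ET

def list_body_tags(path):
    """Return the sorted set of element tags occurring inside <body>."""
    with open(path, encoding="utf-8") as fp:
        # drop the default TEI namespace so tags read as plain names
        xmlstr = re.sub(r'\sxmlns="[^"]+"', '', fp.read(), count=1)
    root = ET.fromstring(xmlstr)
    body = root if root.tag == "body" else root.find(".//body")
    return sorted({el.tag for el in body.iter()})

# e.g. list_body_tags("F0006347.xml.parsed.xml")
```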
### train
Create the `model-file`:
## Tools
* Parser for reading both `SSJ500k 2.1 TEI xml` and Kres `F....xml.parsed.xml` files, found in `./tools/parser/parser.py`.
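An illustration of the dictionary `parse_tei()` returns (the shape follows the comment block at the top of `./tools/parser/parser.py`; the sentence content below is made up for the example):

```python
# Each sentence id maps to sid/text/tokens; "w" tuples carry
# (type, id, token, lemma, msd), "c" tuples carry (type, id, token).
res_dict = {
    "F0006347.1.1": {
        "sid": "F0006347.1.1",          # also serves as the MongoDB index
        "text": "Primer stavka.",
        "tokens": [
            ("w", "1", "Primer", "primer", "Somei"),
            ("w", "2", "stavka", "stavek", "Somer"),
            ("c", "3", "."),
        ],
    },
}

# typical consumption: iterate sentences and pick out the word tokens
for sid, sentence in res_dict.items():
    words = [t[2] for t in sentence["tokens"] if t[0] == "w"]
```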
`--help` output:
```bash
java -cp srl.jar se.lth.cs.srl.Learn --help
Not enough arguments, aborting.
Usage:
java -cp <classpath> se.lth.cs.srl.Learn <lang> <input-corpus> <model-file> [options]
Example:
java -cp srl.jar:lib/liblinear-1.51-with-deps.jar se.lth.cs.srl.Learn eng ~/corpora/eng/CoNLL2009-ST-English-train.txt eng-srl.mdl -reranker -fdir ~/features/eng -llbinary ~/liblinear-1.6/train
trains a complete pipeline and reranker based on the corpus and saves it to eng-srl.mdl
<lang> corresponds to the language and is one of
chi, eng, ger
Options:
-aibeam <int> the size of the ai-beam for the reranker
-acbeam <int> the size of the ac-beam for the reranker
-help prints this message
Learning-specific options:
-fdir <dir> the directory with feature files (see below)
-reranker trains a reranker also (not done by default)
-llbinary <file> a reference to a precompiled version of liblinear,
makes training much faster than the java version.
-partitions <int> number of partitions used for the reranker
-dontInsertGold don't insert the gold standard proposition during
training of the reranker.
-skipUnknownPredicates skips predicates not matching any POS-tags from
the feature files.
-dontDeleteTrainData doesn't delete the temporary files from training
on exit. (For debug purposes)
-ndPipeline Causes the training data and feature mappings to be
derived in a non-deterministic way. I.e. training the pipeline
on the same corpus twice does not yield the exact same models.
This is however slightly faster.
The feature file dir needs to contain four files with feature sets. See
the website for further documentation. The files are called
pi.feats, pd.feats, ai.feats, and ac.feats
All need to be in the feature file dir, otherwise you will get an error.
```
Input: `lang`, `input-corpus`.
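To script training from Python, the invocation above can be assembled for `subprocess`. A sketch (the jar, corpus, and feature-dir paths are placeholders, not files checked into this repo):

```python
import subprocess

def learn_cmd(lang, input_corpus, model_file, fdir=None, reranker=False):
    """Assemble the se.lth.cs.srl.Learn command line shown above."""
    cmd = ["java", "-cp", "srl.jar:lib/liblinear-1.51-with-deps.jar",
           "se.lth.cs.srl.Learn", lang, input_corpus, model_file]
    if reranker:
        cmd.append("-reranker")
    if fdir is not None:
        cmd += ["-fdir", fdir]  # dir holding pi/pd/ai/ac.feats
    return cmd

# subprocess.run(learn_cmd("ger", "sl.train.mate", "sl-srl.mdl",
#                          fdir="features/ger", reranker=True), check=True)
```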
### parse
`--help` output:
```bash
$ java -cp srl.jar se.lth.cs.srl.Parse --help
Not enough arguments, aborting.
Usage:
java -cp <classpath> se.lth.cs.srl.Parse <lang> <input-corpus> <model-file> [options] <output>
Example:
java -cp srl.jar:lib/liblinear-1.51-with-deps.jar se.lth.cs.srl.Parse eng ~/corpora/eng/CoNLL2009-ST-English-evaluation-SRLonly.txt eng-srl.mdl -reranker -nopi -alfa 1.0 eng-eval.out
parses in the input corpus using the model eng-srl.mdl and saves it to eng-eval.out, using a reranker and skipping the predicate identification step
<lang> corresponds to the language and is one of
chi, eng, ger
Options:
-aibeam <int> the size of the ai-beam for the reranker
-acbeam <int> the size of the ac-beam for the reranker
-help prints this message
Parsing-specific options:
-nopi skips the predicate identification. This is equivalent to the
setting in the CoNLL 2009 ST.
-reranker uses a reranker (assumed to be included in the model)
-alfa <double> the alfa used by the reranker. (default 1.0)
```
We need to provide `lang` (`ger` for German feature functions?), `input-corpus` and `model` (see train).
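The parse step can be wrapped the same way. A sketch with placeholder paths (note the help text puts `<output>` after the options; whether `ger` feature functions transfer to Slovene is the open question above):

```python
import subprocess

def parse_cmd(lang, input_corpus, model_file, output, nopi=False, reranker=False):
    """Assemble the se.lth.cs.srl.Parse command line from the help text above."""
    cmd = ["java", "-cp", "srl.jar:lib/liblinear-1.51-with-deps.jar",
           "se.lth.cs.srl.Parse", lang, input_corpus, model_file]
    if reranker:
        cmd.append("-reranker")
    if nopi:
        cmd.append("-nopi")  # skip predicate identification (CoNLL 2009 ST setting)
    cmd.append(output)  # <output> comes last
    return cmd

# subprocess.run(parse_cmd("ger", "sl.test.mate", "sl-srl.mdl", "sl-eval.out",
#                          nopi=True, reranker=True), check=True)
```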
### input data
* `ssj500k` data found in `./bilateral-srl/data/sl/sl.{test,train}`;
formatted for mate-tools usage in `./bilateral-srl/tools/mate-tools/sl.{test,train}.mate` (line counts match);
## Sources
* [1] (mate-tools) https://code.google.com/archive/p/mate-tools/
* [2] (benchmarking) https://github.com/clarinsi/bilateral-srl
* [3] (conll 2008 paper) http://www.aclweb.org/anthology/W08-2121.pdf
* [4] (format CoNLL 2009) https://wiki.ufal.ms.mff.cuni.cz/format-conll

@ -0,0 +1,3 @@
FROM python
RUN pip install lxml

@ -0,0 +1,13 @@
You might want to mount this whole repo into the docker container.
Also mount data locations.
Example container:
```bash
$ docker build . -t my_python
$ docker run \
-it \
-v /home/kristjan/git/cjvt-srl-tagging:/cjvt-srl-tagging \
-v /home/kristjan/some_corpus_data:/some_corpus_data \
my_python \
/bin/bash
```

@ -0,0 +1,23 @@
from parser import parser
import os
from os.path import join
import sys

SSJ500K_2_1 = 27829  # number of sentences

if __name__ == "__main__":
    # make sure you sanitize every input into unicode
    print("parsing ssj")
    # ssj_file = "/home/kristjan/git/diploma/data/ssj500k-sl.TEI/ssj500k-sl.body.xml"
    # ssj_file = "/dipldata/ssj500k-sl.TEI/ssj500k-sl.body.xml"
    ssj_file = "/dipldata/ssj500k-sl.TEI/ssj500k-sl.body.sample.xml"  # smaller file
    ssj_dict = parser.parse_tei(ssj_file)
    # assert (len(ssj_dict) == 27829), "Parsed wrong number of sentences."
    print("parsing kres")
    # kres_file = "../data/kres_example/F0019343.xml.parsed.xml"
    kres_dir = "../data/kres_example/"
    for kres_file in os.listdir(kres_dir):
        parser.parse_tei(join(kres_dir, kres_file))
    print("end parsing kres")

Binary file not shown.

Binary file not shown.

@ -0,0 +1,70 @@
import xml.etree.ElementTree as ET
import random

random.seed(42)

tree = ET.parse('../../data/kres_example/F0006347.xml.parsed.xml')
root = tree.getroot()
print(ET.tostring(root))

train = []
dev = []
test = []
train_text = open('train.txt', 'w')
dev_text = open('dev.txt', 'w')
test_text = open('test.txt', 'w')

for doc in root.iter('{http://www.tei-c.org/ns/1.0}div'):
    # randomly assign each document to train/dev/test (80/10/10)
    rand = random.random()
    if rand < 0.8:
        pointer = train
        pointer_text = train_text
    elif rand < 0.9:
        pointer = dev
        pointer_text = dev_text
    else:
        pointer = test
        pointer_text = test_text
    for p in doc.iter('{http://www.tei-c.org/ns/1.0}p'):
        for element in p:
            if element.tag.endswith('s'):
                sentence = element
                text = ''
                tokens = []
                for element in sentence:
                    if element.tag[-3:] == 'seg':
                        for subelement in element:
                            text += subelement.text
                            if not subelement.tag.endswith('}c'):
                                if subelement.tag.endswith('w'):
                                    lemma = subelement.attrib['lemma']
                                else:
                                    lemma = subelement.text
                                tokens.append((subelement.text, lemma, subelement.attrib['ana'].split(':')[1]))
                    if element.tag[-2:] not in ('pc', '}w', '}c'):
                        continue
                    text += element.text
                    if not element.tag.endswith('}c'):
                        if element.tag.endswith('w'):
                            lemma = element.attrib['lemma']
                        else:
                            lemma = element.text
                        tokens.append((element.text, lemma, element.attrib['ana'].split(':')[1]))
                pointer.append((text, tokens))
                pointer_text.write(text.encode('utf8'))
            else:
                pointer_text.write(element.text.encode('utf8'))
            pointer_text.write('\n')
            # pointer_text.write('\n')

def write_list(lst, fname):
    f = open(fname, 'w')
    for text, tokens in lst:
        f.write('# text = ' + text.encode('utf8') + '\n')
        for idx, token in enumerate(tokens):
            f.write(str(idx + 1) + '\t' + token[0].encode('utf8') + '\t' + token[1].encode('utf8') + '\t_\t' + token[2] + '\t_\t_\t_\t_\t_\n')
        f.write('\n')
    f.close()

write_list(train, 'train.conllu')
write_list(dev, 'dev.conllu')
write_list(test, 'test.conllu')

train_text.close()
dev_text.close()
test_text.close()

@ -0,0 +1,144 @@
#!/usr/bin/python3
from __future__ import print_function, unicode_literals, division

import sys
import os
import re
import pickle
from pathlib import Path

try:
    from lxml import etree as ElementTree
except ImportError:
    import xml.etree.ElementTree as ElementTree

# attributes
ID_ATTR = "id"
LEMMA_ATTR = "lemma"
ANA_ATTR = "ana"

# tags
SENTENCE_TAG = 's'
BIBL_TAG = 'bibl'
PARAGRAPH_TAG = 'p'
PC_TAG = 'pc'
WORD_TAG = 'w'
C_TAG = 'c'
S_TAG = 'S'
SEG_TAG = 'seg'


class Sentence:
    def __init__(self, sentence, s_id):
        self.id = s_id
        self.words = []
        self.text = ""
        for word in sentence:
            self.handle_word(word)

    def handle_word(self, word):
        # handle space after
        if word.tag == S_TAG:
            assert word.text is None
            self.text += ' '
            return
        # ASK am I handling this correctly?
        elif word.tag == SEG_TAG:
            for segword in word:
                self.handle_word(segword)
            return
        # ASK handle unknown tags (are there others?)
        elif word.tag not in (WORD_TAG, C_TAG):
            return

        # ID
        idx = str(len(self.words) + 1)
        # TOKEN
        token = word.text
        # LEMMA
        if word.tag == WORD_TAG:
            lemma = word.get(LEMMA_ATTR)
            assert lemma is not None
        else:
            lemma = token
        # XPOS
        xpos = word.get('msd')
        if word.tag == C_TAG:
            xpos = "Z"
        elif xpos in ("Gp-ppdzn", "Gp-spmzd"):
            xpos = "N"
        elif xpos is None:
            print(self.id)

        # save word entry
        self.words.append(['F{}.{}'.format(self.id, idx), token, lemma, xpos])
        # save for text
        self.text += word.text

    def to_conllu(self):
        lines = []
        # lines.append('# sent_id = ' + self.id)
        # CoNLL-U does not like spaces at the end of # text
        # lines.append('# text = ' + self.text.strip())
        for word in self.words:
            lines.append('\t'.join('_' if data is None else data for data in word))
        return lines


def convert_file(in_file, out_file):
    print("Loading xml: {}".format(in_file))
    with open(str(in_file), 'r') as fp:
        xmlstring = re.sub(' xmlns="[^"]+"', '', fp.read(), count=1)
        xmlstring = xmlstring.replace(' xml:', ' ')
    xml_tree = ElementTree.XML(xmlstring)

    print("Converting TEI -> TSV-U ...")
    lines = []
    for pidx, paragraph in enumerate(xml_tree.iterfind('.//body/p')):
        sidx = 1
        for sentence in paragraph:
            if sentence.tag != SENTENCE_TAG:
                continue
            sentence = Sentence(sentence, "{}.{}".format(pidx + 1, sidx))
            lines.extend(sentence.to_conllu())
            lines.append('')  # ASK newline between sentences
            sidx += 1
    if len(lines) == 0:
        raise RuntimeError("No sentences found")

    print("Writing output file: {}".format(out_file))
    with open(out_file, 'w') as fp:
        for line in lines:
            if sys.version_info < (3, 0):
                line = line.encode('utf-8')
            print(line, file=fp)


if __name__ == "__main__":
    """
    Input: folder of TEI files, msds are encoded as msd="Z"
    Output: just a folder
    """
    in_folder = sys.argv[1]
    out_folder = sys.argv[2]
    num_processes = int(sys.argv[3])

    files = Path(in_folder).rglob("*.xml")
    for filename in files:
        out_file = out_folder + "/" + filename.name[:-4] + ".txt"
        convert_file(filename, out_file)

@ -0,0 +1,86 @@
from lxml import etree
import re

W_TAGS = ['w']
C_TAGS = ['c']
S_TAGS = ['S', 'pc']


# reads a TEI xml file and returns a dictionary:
# { <sentence_id>: {
#   sid: <sentence_id>,  # serves as index in MongoDB
#   text: ,
#   tokens: ,
# }}
def parse_tei(filepath):
    guess_corpus = None  # SSJ | KRES
    res_dict = {}
    with open(filepath, "r") as fp:
        # remove namespaces
        xmlstr = fp.read()
        xmlstr = re.sub('\\sxmlns="[^"]+"', '', xmlstr, count=1)
        xmlstr = re.sub(' xml:', ' ', xmlstr)

        root = etree.XML(xmlstr.encode("utf-8"))

        divs = []  # in ssj, there are divs, in Kres, there are separate files
        if "id" in root.keys():
            # Kres files start with <TEI id=...>
            guess_corpus = "KRES"
            divs = [root]
        else:
            guess_corpus = "SSJ"
            divs = root.findall(".//div")

        # parse divs
        for div in divs:
            f_id = div.get("id")

            # parse paragraphs
            for p in div.findall(".//p"):
                p_id = p.get("id").split(".")[-1]

                # parse sentences
                for s in p.findall(".//s"):
                    s_id = s.get("id").split(".")[-1]
                    sentence_text = ""
                    sentence_tokens = []

                    # parse tokens
                    for el in s.iter():
                        if el.tag in W_TAGS:
                            el_id = el.get("id").split(".")[-1]
                            if el_id[0] == 't':
                                el_id = el_id[1:]  # ssj W_TAG ids start with t
                            sentence_text += el.text
                            sentence_tokens += [(
                                "w",
                                el_id,
                                el.text,
                                el.get("lemma"),
                                (el.get("msd") if guess_corpus == "KRES"
                                 else el.get("ana").split(":")[-1]),
                            )]
                        elif el.tag in C_TAGS:
                            el_id = el.get("id") or "none"  # only Kres' C_TAGS have ids
                            el_id = el_id.split(".")[-1]
                            sentence_text += el.text
                            sentence_tokens += [("c", el_id, el.text,)]
                        elif el.tag in S_TAGS:
                            sentence_text += " "  # Kres' <S /> doesn't contain .text
                        else:
                            # pass links and linkGroups
                            pass

                    sentence_id = "{}.{}.{}".format(f_id, p_id, s_id)
                    # print(sentence_id)
                    # print(sentence_text)
                    # print(sentence_tokens)
                    if sentence_id in res_dict:
                        raise KeyError("duplicated id: {}".format(sentence_id))
                    res_dict[sentence_id] = {
                        "sid": sentence_id,
                        "text": sentence_text,
                        "tokens": sentence_tokens,
                    }
    return res_dict

Binary file not shown.

@ -0,0 +1,91 @@
# mate-tools
Using **Full srl pipeline (including anna-3.3)** from the Downloads section.
Benchmarking the tool for slo and hr: [2] (submodule of this repo).
Mate-tool for srl tagging can be found in `./tools/srl-20131216/`.
## train
Create the `model-file`:
`--help` output:
```bash
java -cp srl.jar se.lth.cs.srl.Learn --help
Not enough arguments, aborting.
Usage:
java -cp <classpath> se.lth.cs.srl.Learn <lang> <input-corpus> <model-file> [options]
Example:
java -cp srl.jar:lib/liblinear-1.51-with-deps.jar se.lth.cs.srl.Learn eng ~/corpora/eng/CoNLL2009-ST-English-train.txt eng-srl.mdl -reranker -fdir ~/features/eng -llbinary ~/liblinear-1.6/train
trains a complete pipeline and reranker based on the corpus and saves it to eng-srl.mdl
<lang> corresponds to the language and is one of
chi, eng, ger
Options:
-aibeam <int> the size of the ai-beam for the reranker
-acbeam <int> the size of the ac-beam for the reranker
-help prints this message
Learning-specific options:
-fdir <dir> the directory with feature files (see below)
-reranker trains a reranker also (not done by default)
-llbinary <file> a reference to a precompiled version of liblinear,
makes training much faster than the java version.
-partitions <int> number of partitions used for the reranker
-dontInsertGold don't insert the gold standard proposition during
training of the reranker.
-skipUnknownPredicates skips predicates not matching any POS-tags from
the feature files.
-dontDeleteTrainData doesn't delete the temporary files from training
on exit. (For debug purposes)
-ndPipeline Causes the training data and feature mappings to be
derived in a non-deterministic way. I.e. training the pipeline
on the same corpus twice does not yield the exact same models.
This is however slightly faster.
The feature file dir needs to contain four files with feature sets. See
the website for further documentation. The files are called
pi.feats, pd.feats, ai.feats, and ac.feats
All need to be in the feature file dir, otherwise you will get an error.
```
Input: `lang`, `input-corpus`.
## parse
`--help` output:
```bash
$ java -cp srl.jar se.lth.cs.srl.Parse --help
Not enough arguments, aborting.
Usage:
java -cp <classpath> se.lth.cs.srl.Parse <lang> <input-corpus> <model-file> [options] <output>
Example:
java -cp srl.jar:lib/liblinear-1.51-with-deps.jar se.lth.cs.srl.Parse eng ~/corpora/eng/CoNLL2009-ST-English-evaluation-SRLonly.txt eng-srl.mdl -reranker -nopi -alfa 1.0 eng-eval.out
parses in the input corpus using the model eng-srl.mdl and saves it to eng-eval.out, using a reranker and skipping the predicate identification step
<lang> corresponds to the language and is one of
chi, eng, ger
Options:
-aibeam <int> the size of the ai-beam for the reranker
-acbeam <int> the size of the ac-beam for the reranker
-help prints this message
Parsing-specific options:
-nopi skips the predicate identification. This is equivalent to the
setting in the CoNLL 2009 ST.
-reranker uses a reranker (assumed to be included in the model)
-alfa <double> the alfa used by the reranker. (default 1.0)
```
We need to provide `lang` (`ger` for German feature functions?), `input-corpus` and `model` (see train).
## input data
* `ssj500k` data found in `./bilateral-srl/data/sl/sl.{test,train}`;
formatted for mate-tools usage in `./bilateral-srl/tools/mate-tools/sl.{test,train}.mate` (line counts match);
## Sources
* [1] (mate-tools) https://code.google.com/archive/p/mate-tools/
* [2] (benchmarking) https://github.com/clarinsi/bilateral-srl
* [3] (conll 2008 paper) http://www.aclweb.org/anthology/W08-2121.pdf
* [4] (format CoNLL 2009) https://wiki.ufal.ms.mff.cuni.cz/format-conll