parser.py can read kres and/or ssj500k

per-file
voje 5 years ago
parent 648f4e53d2
commit d1ba56be37

@ -6,93 +6,18 @@ The tools require Java.
See `./dockerfiles/mate-tool-env/README.md` for environment preparation.
## mate-tools
Using **Full srl pipeline (including anna-3.3)** from the Downloads section.
Benchmarking the tool for slo and hr: [2] (submodule of this repo).
Check out `./tools/srl-20131216/README.md`.
Mate-tool for srl tagging can be found in `./tools/srl-20131216/`.
## Scripts
Check all possible xml tags (those that occur after the `<body>` tag):
```bash
cat F0006347.xml.parsed.xml | grep -A 999999999999 -e '<body>' | grep -o -e '<[^" "]*' | sort | uniq
```
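The same inventory can be collected in Python. A rough sketch (the file name follows the example above; stripping the default namespace first, as the repo's other scripts do, is an assumption about the Kres files):

```python
import re
import xml.etree.ElementTree as ET

def list_body_tags(path):
    """Return the sorted set of element tags occurring inside <body>."""
    with open(path, encoding="utf-8") as fp:
        # drop the default TEI namespace so tags read as plain names
        xmlstr = re.sub(r'\sxmlns="[^"]+"', '', fp.read(), count=1)
    root = ET.fromstring(xmlstr)
    body = root if root.tag == "body" else root.find(".//body")
    return sorted({el.tag for el in body.iter()})

# e.g. list_body_tags("F0006347.xml.parsed.xml")
```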
### train
Create the `model-file`:
## Tools
* Parser for reading both `SSJ500k 2.1 TEI xml` and Kres `F....xml.parsed.xml` files, found in `./tools/parser/parser.py`.
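An illustration of the dictionary `parse_tei()` returns (the shape follows the comment block at the top of `./tools/parser/parser.py`; the sentence content below is made up for the example):

```python
# Each sentence id maps to sid/text/tokens; "w" tuples carry
# (type, id, token, lemma, msd), "c" tuples carry (type, id, token).
res_dict = {
    "F0006347.1.1": {
        "sid": "F0006347.1.1",          # also serves as the MongoDB index
        "text": "Primer stavka.",
        "tokens": [
            ("w", "1", "Primer", "primer", "Somei"),
            ("w", "2", "stavka", "stavek", "Somer"),
            ("c", "3", "."),
        ],
    },
}

# typical consumption: iterate sentences and pick out the word tokens
for sid, sentence in res_dict.items():
    words = [t[2] for t in sentence["tokens"] if t[0] == "w"]
```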
`--help` output:
```bash
java -cp srl.jar se.lth.cs.srl.Learn --help
Not enough arguments, aborting.
Usage:
java -cp <classpath> se.lth.cs.srl.Learn <lang> <input-corpus> <model-file> [options]
Example:
java -cp srl.jar:lib/liblinear-1.51-with-deps.jar se.lth.cs.srl.Learn eng ~/corpora/eng/CoNLL2009-ST-English-train.txt eng-srl.mdl -reranker -fdir ~/features/eng -llbinary ~/liblinear-1.6/train
trains a complete pipeline and reranker based on the corpus and saves it to eng-srl.mdl
<lang> corresponds to the language and is one of
chi, eng, ger
Options:
-aibeam <int> the size of the ai-beam for the reranker
-acbeam <int> the size of the ac-beam for the reranker
-help prints this message
Learning-specific options:
-fdir <dir> the directory with feature files (see below)
-reranker trains a reranker also (not done by default)
-llbinary <file> a reference to a precompiled version of liblinear,
makes training much faster than the java version.
-partitions <int> number of partitions used for the reranker
-dontInsertGold don't insert the gold standard proposition during
training of the reranker.
-skipUnknownPredicates skips predicates not matching any POS-tags from
the feature files.
-dontDeleteTrainData doesn't delete the temporary files from training
on exit. (For debug purposes)
-ndPipeline Causes the training data and feature mappings to be
derived in a non-deterministic way. I.e. training the pipeline
on the same corpus twice does not yield the exact same models.
This is however slightly faster.
The feature file dir needs to contain four files with feature sets. See
the website for further documentation. The files are called
pi.feats, pd.feats, ai.feats, and ac.feats
All need to be in the feature file dir, otherwise you will get an error.
```
Input: `lang`, `input-corpus`.
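To script training from Python, the invocation above can be assembled for `subprocess`. A sketch (the jar, corpus, and feature-dir paths are placeholders, not files checked into this repo):

```python
import subprocess

def learn_cmd(lang, input_corpus, model_file, fdir=None, reranker=False):
    """Assemble the se.lth.cs.srl.Learn command line shown above."""
    cmd = ["java", "-cp", "srl.jar:lib/liblinear-1.51-with-deps.jar",
           "se.lth.cs.srl.Learn", lang, input_corpus, model_file]
    if reranker:
        cmd.append("-reranker")
    if fdir is not None:
        cmd += ["-fdir", fdir]  # dir holding pi/pd/ai/ac.feats
    return cmd

# subprocess.run(learn_cmd("ger", "sl.train.mate", "sl-srl.mdl",
#                          fdir="features/ger", reranker=True), check=True)
```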
### parse
`--help` output:
```bash
$ java -cp srl.jar se.lth.cs.srl.Parse --help
Not enough arguments, aborting.
Usage:
java -cp <classpath> se.lth.cs.srl.Parse <lang> <input-corpus> <model-file> [options] <output>
Example:
java -cp srl.jar:lib/liblinear-1.51-with-deps.jar se.lth.cs.srl.Parse eng ~/corpora/eng/CoNLL2009-ST-English-evaluation-SRLonly.txt eng-srl.mdl -reranker -nopi -alfa 1.0 eng-eval.out
parses in the input corpus using the model eng-srl.mdl and saves it to eng-eval.out, using a reranker and skipping the predicate identification step
<lang> corresponds to the language and is one of
chi, eng, ger
Options:
-aibeam <int> the size of the ai-beam for the reranker
-acbeam <int> the size of the ac-beam for the reranker
-help prints this message
Parsing-specific options:
-nopi skips the predicate identification. This is equivalent to the
setting in the CoNLL 2009 ST.
-reranker uses a reranker (assumed to be included in the model)
-alfa <double> the alfa used by the reranker. (default 1.0)
```
We need to provide `lang` (`ger` for German feature functions?), `input-corpus` and `model` (see train).
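The parse step can be wrapped the same way. A sketch with placeholder paths (note the help text puts `<output>` after the options; whether `ger` feature functions transfer to Slovene is the open question above):

```python
import subprocess

def parse_cmd(lang, input_corpus, model_file, output, nopi=False, reranker=False):
    """Assemble the se.lth.cs.srl.Parse command line from the help text above."""
    cmd = ["java", "-cp", "srl.jar:lib/liblinear-1.51-with-deps.jar",
           "se.lth.cs.srl.Parse", lang, input_corpus, model_file]
    if reranker:
        cmd.append("-reranker")
    if nopi:
        cmd.append("-nopi")  # skip predicate identification (CoNLL 2009 ST setting)
    cmd.append(output)  # <output> comes last
    return cmd

# subprocess.run(parse_cmd("ger", "sl.test.mate", "sl-srl.mdl", "sl-eval.out",
#                          nopi=True, reranker=True), check=True)
```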
### input data
* `ssj500k` data found in `./bilateral-srl/data/sl/sl.{test,train}`;
formatted for mate-tools usage in `./bilateral-srl/tools/mate-tools/sl.{test,train}.mate` (line counts match);
## Sources
* [1] (mate-tools) https://code.google.com/archive/p/mate-tools/
* [2] (benchmarking) https://github.com/clarinsi/bilateral-srl
* [3] (conll 2008 paper) http://www.aclweb.org/anthology/W08-2121.pdf
* [4] (format CoNLL 2009) https://wiki.ufal.ms.mff.cuni.cz/format-conll

@ -0,0 +1,3 @@
FROM python
RUN pip install lxml

@ -0,0 +1,13 @@
You might want to mount this whole repo into the docker container.
Also mount data locations.
Example container:
```bash
$ docker build . -t my_python
$ docker run \
-it \
-v /home/kristjan/git/cjvt-srl-tagging:/cjvt-srl-tagging \
-v /home/kristjan/some_corpus_data:/some_corpus_data \
my_python \
/bin/bash
```

@ -0,0 +1,23 @@
from parser import parser
import os
from os.path import join
import sys

SSJ500K_2_1 = 27829  # number of sentences

if __name__ == "__main__":
    # make sure you sanitize every input into unicode
    print("parsing ssj")
    # ssj_file = "/home/kristjan/git/diploma/data/ssj500k-sl.TEI/ssj500k-sl.body.xml"
    # ssj_file = "/dipldata/ssj500k-sl.TEI/ssj500k-sl.body.xml"
    ssj_file = "/dipldata/ssj500k-sl.TEI/ssj500k-sl.body.sample.xml"  # smaller file
    ssj_dict = parser.parse_tei(ssj_file)
    # assert (len(ssj_dict) == 27829), "Parsed wrong number of sentences."
    print("parsing kres")
    # kres_file = "../data/kres_example/F0019343.xml.parsed.xml"
    kres_dir = "../data/kres_example/"
    for kres_file in os.listdir(kres_dir):
        parser.parse_tei(join(kres_dir, kres_file))
    print("end parsing kres")

Binary file not shown.

Binary file not shown.

@ -0,0 +1,70 @@
import xml.etree.ElementTree as ET
import random

random.seed(42)

tree = ET.parse('../../data/kres_example/F0006347.xml.parsed.xml')
root = tree.getroot()
print(ET.tostring(root))

train = []
dev = []
test = []
train_text = open('train.txt', 'w')
dev_text = open('dev.txt', 'w')
test_text = open('test.txt', 'w')

for doc in root.iter('{http://www.tei-c.org/ns/1.0}div'):
    # randomly assign each document to train/dev/test (80/10/10)
    rand = random.random()
    if rand < 0.8:
        pointer = train
        pointer_text = train_text
    elif rand < 0.9:
        pointer = dev
        pointer_text = dev_text
    else:
        pointer = test
        pointer_text = test_text
    for p in doc.iter('{http://www.tei-c.org/ns/1.0}p'):
        for element in p:
            if element.tag.endswith('s'):
                sentence = element
                text = ''
                tokens = []
                for element in sentence:
                    if element.tag[-3:] == 'seg':
                        for subelement in element:
                            text += subelement.text
                            if not subelement.tag.endswith('}c'):
                                if subelement.tag.endswith('w'):
                                    lemma = subelement.attrib['lemma']
                                else:
                                    lemma = subelement.text
                                tokens.append((subelement.text, lemma, subelement.attrib['ana'].split(':')[1]))
                    if element.tag[-2:] not in ('pc', '}w', '}c'):
                        continue
                    text += element.text
                    if not element.tag.endswith('}c'):
                        if element.tag.endswith('w'):
                            lemma = element.attrib['lemma']
                        else:
                            lemma = element.text
                        tokens.append((element.text, lemma, element.attrib['ana'].split(':')[1]))
                pointer.append((text, tokens))
                pointer_text.write(text.encode('utf8'))
            else:
                pointer_text.write(element.text.encode('utf8'))
            pointer_text.write('\n')
            # pointer_text.write('\n')

def write_list(lst, fname):
    f = open(fname, 'w')
    for text, tokens in lst:
        f.write('# text = ' + text.encode('utf8') + '\n')
        for idx, token in enumerate(tokens):
            f.write(str(idx + 1) + '\t' + token[0].encode('utf8') + '\t' + token[1].encode('utf8') + '\t_\t' + token[2] + '\t_\t_\t_\t_\t_\n')
        f.write('\n')
    f.close()

write_list(train, 'train.conllu')
write_list(dev, 'dev.conllu')
write_list(test, 'test.conllu')

train_text.close()
dev_text.close()
test_text.close()

@ -0,0 +1,144 @@
#!/usr/bin/python3
from __future__ import print_function, unicode_literals, division

import sys
import os
import re
import pickle
from pathlib import Path

try:
    from lxml import etree as ElementTree
except ImportError:
    import xml.etree.ElementTree as ElementTree

# attributes
ID_ATTR = "id"
LEMMA_ATTR = "lemma"
ANA_ATTR = "ana"

# tags
SENTENCE_TAG = 's'
BIBL_TAG = 'bibl'
PARAGRAPH_TAG = 'p'
PC_TAG = 'pc'
WORD_TAG = 'w'
C_TAG = 'c'
S_TAG = 'S'
SEG_TAG = 'seg'


class Sentence:
    def __init__(self, sentence, s_id):
        self.id = s_id
        self.words = []
        self.text = ""
        for word in sentence:
            self.handle_word(word)

    def handle_word(self, word):
        # handle space after
        if word.tag == S_TAG:
            assert word.text is None
            self.text += ' '
            return
        # ASK am I handling this correctly?
        elif word.tag == SEG_TAG:
            for segword in word:
                self.handle_word(segword)
            return
        # ASK handle unknown tags (are there others?)
        elif word.tag not in (WORD_TAG, C_TAG):
            return

        # ID
        idx = str(len(self.words) + 1)
        # TOKEN
        token = word.text
        # LEMMA
        if word.tag == WORD_TAG:
            lemma = word.get(LEMMA_ATTR)
            assert lemma is not None
        else:
            lemma = token
        # XPOS
        xpos = word.get('msd')
        if word.tag == C_TAG:
            xpos = "Z"
        elif xpos in ("Gp-ppdzn", "Gp-spmzd"):
            xpos = "N"
        elif xpos is None:
            print(self.id)

        # save word entry
        self.words.append(['F{}.{}'.format(self.id, idx), token, lemma, xpos])
        # save for text
        self.text += word.text

    def to_conllu(self):
        lines = []
        # lines.append('# sent_id = ' + self.id)
        # CoNLL-U does not like spaces at the end of # text
        # lines.append('# text = ' + self.text.strip())
        for word in self.words:
            lines.append('\t'.join('_' if data is None else data for data in word))
        return lines


def convert_file(in_file, out_file):
    print("Loading xml: {}".format(in_file))
    with open(str(in_file), 'r') as fp:
        xmlstring = re.sub(' xmlns="[^"]+"', '', fp.read(), count=1)
        xmlstring = xmlstring.replace(' xml:', ' ')
    xml_tree = ElementTree.XML(xmlstring)

    print("Converting TEI -> TSV-U ...")
    lines = []
    for pidx, paragraph in enumerate(xml_tree.iterfind('.//body/p')):
        sidx = 1
        for sentence in paragraph:
            if sentence.tag != SENTENCE_TAG:
                continue
            sentence = Sentence(sentence, "{}.{}".format(pidx + 1, sidx))
            lines.extend(sentence.to_conllu())
            lines.append('')  # ASK newline between sentences
            sidx += 1
    if len(lines) == 0:
        raise RuntimeError("No sentences found")

    print("Writing output file: {}".format(out_file))
    with open(out_file, 'w') as fp:
        for line in lines:
            if sys.version_info < (3, 0):
                line = line.encode('utf-8')
            print(line, file=fp)


if __name__ == "__main__":
    """
    Input: folder of TEI files, msds are encoded as msd="Z"
    Output: just a folder
    """
    in_folder = sys.argv[1]
    out_folder = sys.argv[2]
    num_processes = int(sys.argv[3])

    files = Path(in_folder).rglob("*.xml")
    for filename in files:
        out_file = out_folder + "/" + filename.name[:-4] + ".txt"
        convert_file(filename, out_file)

@ -0,0 +1,86 @@
from lxml import etree
import re

W_TAGS = ['w']
C_TAGS = ['c']
S_TAGS = ['S', 'pc']


# reads a TEI xml file and returns a dictionary:
# { <sentence_id>: {
#   sid: <sentence_id>,  # serves as index in MongoDB
#   text: ,
#   tokens: ,
# }}
def parse_tei(filepath):
    guess_corpus = None  # SSJ | KRES
    res_dict = {}
    with open(filepath, "r") as fp:
        # remove namespaces
        xmlstr = fp.read()
        xmlstr = re.sub('\\sxmlns="[^"]+"', '', xmlstr, count=1)
        xmlstr = re.sub(' xml:', ' ', xmlstr)

        root = etree.XML(xmlstr.encode("utf-8"))

        divs = []  # in ssj, there are divs, in Kres, there are separate files
        if "id" in root.keys():
            # Kres files start with <TEI id=...>
            guess_corpus = "KRES"
            divs = [root]
        else:
            guess_corpus = "SSJ"
            divs = root.findall(".//div")

        # parse divs
        for div in divs:
            f_id = div.get("id")

            # parse paragraphs
            for p in div.findall(".//p"):
                p_id = p.get("id").split(".")[-1]

                # parse sentences
                for s in p.findall(".//s"):
                    s_id = s.get("id").split(".")[-1]
                    sentence_text = ""
                    sentence_tokens = []

                    # parse tokens
                    for el in s.iter():
                        if el.tag in W_TAGS:
                            el_id = el.get("id").split(".")[-1]
                            if el_id[0] == 't':
                                el_id = el_id[1:]  # ssj W_TAG ids start with t
                            sentence_text += el.text
                            sentence_tokens += [(
                                "w",
                                el_id,
                                el.text,
                                el.get("lemma"),
                                (el.get("msd") if guess_corpus == "KRES"
                                 else el.get("ana").split(":")[-1]),
                            )]
                        elif el.tag in C_TAGS:
                            el_id = el.get("id") or "none"  # only Kres' C_TAGS have ids
                            el_id = el_id.split(".")[-1]
                            sentence_text += el.text
                            sentence_tokens += [("c", el_id, el.text,)]
                        elif el.tag in S_TAGS:
                            sentence_text += " "  # Kres' <S /> doesn't contain .text
                        else:
                            # pass links and linkGroups
                            pass

                    sentence_id = "{}.{}.{}".format(f_id, p_id, s_id)
                    # print(sentence_id)
                    # print(sentence_text)
                    # print(sentence_tokens)
                    if sentence_id in res_dict:
                        raise KeyError("duplicated id: {}".format(sentence_id))
                    res_dict[sentence_id] = {
                        "sid": sentence_id,
                        "text": sentence_text,
                        "tokens": sentence_tokens,
                    }
    return res_dict

Binary file not shown.

@ -0,0 +1,91 @@
# mate-tools
Using **Full srl pipeline (including anna-3.3)** from the Downloads section.
Benchmarking the tool for slo and hr: [2] (submodule of this repo).
Mate-tool for srl tagging can be found in `./tools/srl-20131216/`.
## train
Create the `model-file`:
`--help` output:
```bash
java -cp srl.jar se.lth.cs.srl.Learn --help
Not enough arguments, aborting.
Usage:
java -cp <classpath> se.lth.cs.srl.Learn <lang> <input-corpus> <model-file> [options]
Example:
java -cp srl.jar:lib/liblinear-1.51-with-deps.jar se.lth.cs.srl.Learn eng ~/corpora/eng/CoNLL2009-ST-English-train.txt eng-srl.mdl -reranker -fdir ~/features/eng -llbinary ~/liblinear-1.6/train
trains a complete pipeline and reranker based on the corpus and saves it to eng-srl.mdl
<lang> corresponds to the language and is one of
chi, eng, ger
Options:
-aibeam <int> the size of the ai-beam for the reranker
-acbeam <int> the size of the ac-beam for the reranker
-help prints this message
Learning-specific options:
-fdir <dir> the directory with feature files (see below)
-reranker trains a reranker also (not done by default)
-llbinary <file> a reference to a precompiled version of liblinear,
makes training much faster than the java version.
-partitions <int> number of partitions used for the reranker
-dontInsertGold don't insert the gold standard proposition during
training of the reranker.
-skipUnknownPredicates skips predicates not matching any POS-tags from
the feature files.
-dontDeleteTrainData doesn't delete the temporary files from training
on exit. (For debug purposes)
-ndPipeline Causes the training data and feature mappings to be
derived in a non-deterministic way. I.e. training the pipeline
on the same corpus twice does not yield the exact same models.
This is however slightly faster.
The feature file dir needs to contain four files with feature sets. See
the website for further documentation. The files are called
pi.feats, pd.feats, ai.feats, and ac.feats
All need to be in the feature file dir, otherwise you will get an error.
```
Input: `lang`, `input-corpus`.
## parse
`--help` output:
```bash
$ java -cp srl.jar se.lth.cs.srl.Parse --help
Not enough arguments, aborting.
Usage:
java -cp <classpath> se.lth.cs.srl.Parse <lang> <input-corpus> <model-file> [options] <output>
Example:
java -cp srl.jar:lib/liblinear-1.51-with-deps.jar se.lth.cs.srl.Parse eng ~/corpora/eng/CoNLL2009-ST-English-evaluation-SRLonly.txt eng-srl.mdl -reranker -nopi -alfa 1.0 eng-eval.out
parses in the input corpus using the model eng-srl.mdl and saves it to eng-eval.out, using a reranker and skipping the predicate identification step
<lang> corresponds to the language and is one of
chi, eng, ger
Options:
-aibeam <int> the size of the ai-beam for the reranker
-acbeam <int> the size of the ac-beam for the reranker
-help prints this message
Parsing-specific options:
-nopi skips the predicate identification. This is equivalent to the
setting in the CoNLL 2009 ST.
-reranker uses a reranker (assumed to be included in the model)
-alfa <double> the alfa used by the reranker. (default 1.0)
```
We need to provide `lang` (`ger` for German feature functions?), `input-corpus` and `model` (see train).
## input data
* `ssj500k` data found in `./bilateral-srl/data/sl/sl.{test,train}`;
formatted for mate-tools usage in `./bilateral-srl/tools/mate-tools/sl.{test,train}.mate` (line counts match);
## Sources
* [1] (mate-tools) https://code.google.com/archive/p/mate-tools/
* [2] (benchmarking) https://github.com/clarinsi/bilateral-srl
* [3] (conll 2008 paper) http://www.aclweb.org/anthology/W08-2121.pdf
* [4] (format CoNLL 2009) https://wiki.ufal.ms.mff.cuni.cz/format-conll