Index: branches/weighted-transfer/apertium-weights-learner/README.md
===================================================================
--- branches/weighted-transfer/apertium-weights-learner/README.md	(revision 72158)
+++ branches/weighted-transfer/apertium-weights-learner/README.md	(revision 72159)
@@ -5,7 +5,8 @@
 ## Prerequisites
 To run this version of transfer weights training for a given language pair, you need:
 * source and target language corpora (they don't have to be parallel to each other)
-* apertium with the language pair of interest
+* apertium with apertium-transfer modified to use transfer weights (may be checked out from https://svn.code.sf.net/p/apertium/svn/branches/weighted-transfer/)
+* the language pair of interest with ambiguous rules marked with ids (for an example, see the version of the en-es pair at https://svn.code.sf.net/p/apertium/svn/branches/weighted-transfer/)
 * kenlm (https://kheafield.com/code/kenlm/)
 
 ## Get the corpora
@@ -12,7 +13,7 @@
 The corpora can be obtained from http://www.statmt.org/wmt12/translation-task.html
 
 ## Prepare language model
-You need to make a language model for your target language.
+In order to run the training, you need to make a language model for your target language.
 * First, take a big corpus, tokenize and normalize it using tools/simpletok.py script:
 ```
@@ -28,16 +29,83 @@
 ```
 bin/lmplz -o 5 -T FOLDER_FOR_TMP_FILE MODEL_NAME.arpa
 ```
-Be advised that you might need disk space for your language model rougly 15 times of the corpus volume.
+Be advised that you might need disk space roughly 15 times the corpus volume for your language model. You can also use gz to compress the model file during its creation, reducing its size:
+```
+bin/lmplz -o 5 -T FOLDER_FOR_TMP_FILE MODEL_NAME.arpa.gz
+```
 * It is highly recommended that you compile a binary after that as it works significantly faster:
 ```
 bin/build_binary -T FOLDER_FOR_TMP_FILE MODEL_NAME.arpa MODEL_NAME.mmap
 ```
-Be advised that you might need disk space for your language model rougly half of the arpa file volume.
+or, if you used gz in the previous step:
+```
+bin/build_binary -T FOLDER_FOR_TMP_FILE MODEL_NAME.arpa.gz MODEL_NAME.mmap
+```
+Be advised that you might need additional disk space roughly half the arpa file volume for your binary.
 
 ## Run training
-* Edit configuration file twlconfig.py, which is self-explanatory.
+* Edit the configuration file twlconfig.py, which is (hopefully) self-explanatory.
 * Run training script:
 ```
 ./twlearner.py
 ```
+
+## Sample run
+In order to ensure that everything works fine, you may perform a sample run using a prepared corpus:
+
+* Download all parts (years 2007 to 2011) of the Spanish news crawl corpora from http://www.statmt.org/wmt12/translation-task.html
+* Concatenate them using the glue.py script from the tools folder.
+* Train a language model on the resulting corpus.
+* Check out the en-es pair from https://svn.code.sf.net/p/apertium/svn/branches/weighted-transfer/
+* Run weights training on the new-software-sample.txt file located in the data folder, using the en-es pair.
+
+The sample file new-software-sample.txt contains three selected lines with 'new software' and 'this new software' patterns, each of which triggers a pair of ambiguous rules from the apertium-en-es.en-es.t1x file, namely ['adj-nom', 'adj-nom-ns'] and ['det-adj-nom', 'det-adj-nom-ns']. Speaking informally, these rules are used to transfer sequences of (adjective, noun) and (determiner, adjective, noun). The first rule in each ambiguous pair specifies that the translations of the adjective and the noun are to be swapped, which is usual for Spanish, hence these rules are specified before their '-ns' counterparts, indicating that they are the default rules. The second rule in each ambiguous pair specifies that the translations of the adjective and the noun are not to be swapped, which sometimes happens and depends on the lexical units involved.
+
+The contents of the unpruned w1x file should look like the following:
+```
+<!-- XML not shown: two rule-groups, one for adj-nom / adj-nom-ns and one
+     for det-adj-nom / det-adj-nom-ns, each listing the sample patterns,
+     with the '-ns' rule of each pair carrying the larger weight -->
+```
+
+This would mean that the '-ns' versions of both rules are preferred for each pattern, telling the transfer module that the translations of 'new' and 'software' should not be swapped, since in Spanish the adjective 'nuevo' usually precedes the noun, unlike most adjectives, which follow it.
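As a sanity check of the language model step above, the sketch below shows the kind of comparison the learned weights encode: the two outputs of an ambiguous rule pair are scored against the kenlm model, and the scores are normalized into weights. This is a minimal sketch assuming the kenlm Python module is installed; the model path and the two Spanish variants are illustrative placeholders, not output of an actual run.

```python
# Minimal sketch: score both translations of 'the new software' with a
# kenlm model and turn the log10 scores into normalized rule weights.
# MODEL_NAME.mmap and both variant strings are placeholders.
import kenlm

model = kenlm.Model('MODEL_NAME.mmap')  # binary built by build_binary

variants = {
    'adj-nom': 'el software nuevo',     # adjective and noun swapped
    'adj-nom-ns': 'el nuevo software',  # not swapped
}

# kenlm returns log10 probabilities, so exponentiate before normalizing
scores = {rule: 10.0 ** model.score(text, bos=True, eos=True)
          for rule, text in variants.items()}
total = sum(scores.values())

for rule, score in scores.items():
    print(rule, score / total)
```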
Index: branches/weighted-transfer/apertium-weights-learner/coverage.py
===================================================================
--- branches/weighted-transfer/apertium-weights-learner/coverage.py	(revision 72158)
+++ branches/weighted-transfer/apertium-weights-learner/coverage.py	(revision 72159)
@@ -217,11 +217,13 @@
 
         # try to continue each coverage obtained on the previous step
         for coverage, state in zip(coverage_list, state_list):
+            # first, check if we can go further along the current pattern
             if (state, cat) in self.transitions:
                 # current pattern can be made longer: add one more token
                 new_coverage_list.append(coverage + [('w', token)])
                 new_state_list.append(self.transitions[(state, cat)])
+            # if not, check if we can finalize the current pattern
             elif state in self.final_states:
                 # current state is one of the final states: close previous pattern
                 new_coverage = coverage + [('r', self.final_states[state])]
@@ -232,12 +234,13 @@
                     new_state_list.append(self.transitions[(self.start_state, cat)])
                 elif '*' in token:
                     # can not start new pattern because of an unknown word
-                    new_coverage_list.append(new_coverage + [('w', token), ('r', -1)])
+                    new_coverage_list.append(new_coverage + [('w', token), ('r', 'unknown')])
                     new_state_list.append(self.start_state)
+            # if not, check if it is just an unknown word
             elif state == self.start_state and '*' in token:
                 # unknown word at start state: add it to pattern, start new
-                new_coverage_list.append(coverage + [('w', token), ('r', -1)])
+                new_coverage_list.append(coverage + [('w', token), ('r', 'unknown')])
                 new_state_list.append(self.start_state)
             # if nothing worked, just discard this coverage
@@ -250,7 +253,7 @@
             if state in self.final_states:
                 # current state is one of the final states: close the last pattern
                 new_coverage_list.append(coverage + [('r', self.final_states[state])])
-            elif coverage[-1][0] == 'r':
+            elif coverage != [] and coverage[-1][0] == 'r':
                 # the last pattern is already closed
                 new_coverage_list.append(coverage)
             # if nothing worked, just discard this coverage as incomplete
@@ -272,13 +275,16 @@
                     pattern = []
             formatted_coverage_list.append(formatted_coverage)
 
+        # now we filter out some non-LRLM coverages
+        # that still got into the list
+
         # sort coverages by signature, which is a tuple
         # of coverage part lengths
         formatted_coverage_list.sort(key=signature, reverse=True)
         signature_max = signature(formatted_coverage_list[0])
 
-        # keep only those with top signature in terms of signature
-        # they would be LRLM ones
+        # keep only those with the top signature
+        # they would be the LRLM ones
         LRLM_list = []
         for coverage in formatted_coverage_list:
             if signature(coverage) == signature_max:
@@ -300,7 +306,7 @@
     cat_dict, rules, ambiguous_rules, rule_id_map = prepare('../apertium-en-es/apertium-en-es.en-es.t1x')
     pattern_FST = FST(rules)
 
-    coverages = pattern_FST.get_lrlm('^publish$ ^in$ ^the$ ^journal$ ^of$ ^the$ ^american$ ^medical$ ^association$ ^the$ ^study$ ^track$ ^the$ ^mental$ ^health$ ^of$ ^88,000$ ^army$ ^combat$ ^veteran$ ^by$ ^compare$ ^their$ ^response$ ^in$ ^a$ ^mental$ ^health$ ^questionnaire$ ^fill$ ^out$ ^upon$ ^their$ ^return$ ^home$ ^with$ ^a$ ^second$ ^mental$ ^health$ ^screening$ ^three$ ^to$ ^six$ ^month$ ^later$^.$', cat_dict)
+    coverages = pattern_FST.get_lrlm('^prpers$ ^want# to$ ^wait$ ^until$ ^prpers$ ^can$ ^offer$ ^what$ ^would$ ^be$ ^totally$ ^satisfy$ ^for$ ^consumer$^.$', cat_dict)
 
     print('Coverages detected:')
     for coverage in coverages:
         print(coverage)
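The LRLM filtering in coverage.py above is easier to see on toy data. The following is a simplified re-creation (made-up coverages, not the module's actual data structures) of how sorting by signature, the tuple of pattern lengths from left to right, keeps only the left-to-right longest-match coverages:

```python
# Toy LRLM filtering: a coverage is a list of (tokens, rule) patterns;
# its signature is the tuple of pattern lengths, left to right.
def signature(coverage):
    return tuple(len(tokens) for tokens, rule in coverage)

coverages = [
    [(['this'], 'det'), (['new', 'software'], 'adj-nom')],
    [(['this', 'new', 'software'], 'det-adj-nom')],
]

# tuples compare element-wise, so the coverage whose first pattern is
# longest sorts first; ties are broken by the next pattern, and so on
coverages.sort(key=signature, reverse=True)
signature_max = signature(coverages[0])
lrlm = [c for c in coverages if signature(c) == signature_max]
print(lrlm)  # only the single three-token pattern survives
```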
Index: branches/weighted-transfer/apertium-weights-learner/data/new-software-sample.txt
===================================================================
--- branches/weighted-transfer/apertium-weights-learner/data/new-software-sample.txt	(nonexistent)
+++ branches/weighted-transfer/apertium-weights-learner/data/new-software-sample.txt	(revision 72159)
@@ -0,0 +1,3 @@
+Mr Stephen said the council had agreed to consider new software which would make the test more difficult.
+What's Next: Simonyi's new software writes its own code
+This new software makes it easier to get a movie done quickly, though harder to get it done well.
Index: branches/weighted-transfer/apertium-weights-learner/tools/simpletok.py
===================================================================
--- branches/weighted-transfer/apertium-weights-learner/tools/simpletok.py	(revision 72158)
+++ branches/weighted-transfer/apertium-weights-learner/tools/simpletok.py	(revision 72159)
@@ -4,7 +4,7 @@
 
 beforepunc_re = re.compile(r'([¿("/])(\w)')
 afterpunc_re = re.compile(r'(\w)([;:,.!?)"/—])')
-quot_re = re.compile("[«»']")
+quot_re = re.compile("[«»`'“”„‘’‛]")
 numfix_re = re.compile('([0-9]) ([,.:][0-9])')
 beforedash_re = re.compile(r'(\W)-(\w)')
 afterdash_re = re.compile(r'(\w)-(\W)')
Index: branches/weighted-transfer/apertium-weights-learner/twlconfig.py
===================================================================
--- branches/weighted-transfer/apertium-weights-learner/twlconfig.py	(revision 72158)
+++ branches/weighted-transfer/apertium-weights-learner/twlconfig.py	(revision 72159)
@@ -1,7 +1,8 @@
 # full path to source corpus from which to learn the rules
-source_corpus = "/home/nm/source/apertium/weighted-transfer/apertium-weights-learner/data/2007-100-special.txt"
+#source_corpus = "/home/nm/source/apertium/weighted-transfer/apertium-weights-learner/data/2007-100-special.txt"
+source_corpus = "/home/nm/source/apertium/weighted-transfer/apertium-weights-learner/data/new-software-sample.txt"
 
-# name of apertium pair (not direction)
+# name of apertium language pair (not translation direction)
 apertium_pair_name = "en-es"
 
 # full path to apertium language pair data folder
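The widened quot_re appears both in simpletok.py above and in twlearner.py below; its effect is easiest to check interactively. A self-contained comparison with a made-up sample line:

```python
# The widened class also strips backticks, curly quotes and low-9 quotes,
# which otherwise end up as stray tokens in text scored against the LM.
import re

old_quot_re = re.compile("[«»']")
new_quot_re = re.compile("[«»`'“”„‘’‛]")

line = '„This“ new software, it’s `quick'
print(old_quot_re.sub('', line))  # curly quotes and backtick survive
print(new_quot_re.sub('', line))  # -> This new software, its quick
```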
Index: branches/weighted-transfer/apertium-weights-learner/twlearner.py
===================================================================
--- branches/weighted-transfer/apertium-weights-learner/twlearner.py	(revision 72158)
+++ branches/weighted-transfer/apertium-weights-learner/twlearner.py	(revision 72159)
@@ -15,7 +15,7 @@
 tmpweights_fname = 'tmpweights.w1x'
 
 # regular expression to cut out a sentence
-sent_re = re.compile('.*?\$')
+sent_re = re.compile('.*?\$|.+?$')
 
 # anything between $ and ^
 inter_re = re.compile(r'\$.*?\^')
@@ -31,7 +31,7 @@
 # for scoring against language model
 beforepunc_re = re.compile(r'([¿("/])(\w)')
 afterpunc_re = re.compile(r'(\w)([;:,.!?)"/—])')
-quot_re = re.compile("[«»']")
+quot_re = re.compile("[«»`'“”„‘’‛]")
 numfix_re = re.compile('([0-9]) ([,.:][0-9])')
 beforedash_re = re.compile(r'(\W)-(\w)')
 afterdash_re = re.compile(r'(\w)-(\W)')
@@ -144,6 +144,7 @@
 
     # look at each sentence in line
     for sent_match in sent_re.finditer(line.strip()):
+        if sent_match.group(0) != '':
             total_sents_count += 1
 
             # get coverages
@@ -150,16 +151,16 @@
             coverage_list = pattern_FST.get_lrlm(sent_match.group(0), cat_dict)
             if coverage_list == []:
                 botched_coverages += 1
-                print('Botched coverage:', sent_match.group(0))
-                print()
+                #print('Botched coverage:', sent_match.group(0))
+                #print()
             else:
                 # look for ambiguous chunks
                 coverage_item = coverage_list[0]
                 pattern_list = search_ambiguous(ambiguous_rules, coverage_item)
                 if pattern_list != []:
-                    print('Coverage:', coverage_item)
-                    print('Pattern list:', pattern_list)
-                    print()
+                    #print('Coverage:', coverage_item)
+                    #print('Pattern list:', pattern_list)
+                    #print()
                     ambig_sents_count += 1
                     # segment the sentence into parts each containing one ambiguous chunk
                     sentence_segments, prev = [], 0
@@ -314,16 +315,20 @@
     # read and process other lines
     for line in ifile:
         group_number, rule_number, pattern, weight = line.rstrip('\n').split('\t')
-        if pattern != prev_pattern:
-            # pattern changed, flush previous
+        if group_number != prev_group_number:
+            # rule group changed, flush pattern, close previous group, open new
             ofile.write(pattern_to_xml(apertium_token_re.findall(prev_pattern), total_pattern_weight))
             total_pattern_weight = 0.
-        if group_number != prev_group_number:
-            # rule group changed, close previuos, open new
             ofile.write('    </rule>\n  </rule-group>\n  <rule-group>\n    <rule id="{}">\n'.format(rule_map[rule_number]))
         elif rule_number != prev_rule_number:
-            # rule changed, close previuos, open new
+            # rule changed, flush pattern, close previous rule, open new
+            ofile.write(pattern_to_xml(apertium_token_re.findall(prev_pattern), total_pattern_weight))
+            total_pattern_weight = 0.
             ofile.write('    </rule>\n    <rule id="{}">\n'.format(rule_map[rule_number]))
+        elif pattern != prev_pattern:
+            # pattern changed, flush previous
+            ofile.write(pattern_to_xml(apertium_token_re.findall(prev_pattern), total_pattern_weight))
+            total_pattern_weight = 0.
         # add up rule-pattern weights
         total_pattern_weight += float(weight)
         prev_group_number, prev_rule_number, prev_pattern = group_number, rule_number, pattern
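The reordered branches in the last hunk fix a flushing bug: the accumulated pattern weight must be written out whenever the group, the rule, or the pattern changes, whereas previously a rule change with an identical pattern left the weight unflushed. A toy re-creation of the fixed control flow, with made-up rows and plain printing instead of the XML output:

```python
# Rows of (group, rule, pattern, weight), as in the sorted weights file.
rows = [
    ('1', 'adj-nom',    'new software', 0.3),
    ('1', 'adj-nom',    'new software', 0.1),
    ('1', 'adj-nom-ns', 'new software', 0.6),  # rule changes, same pattern
]

prev_key = None
total_pattern_weight = 0.0
for group, rule, pattern, weight in rows:
    key = (group, rule, pattern)
    if prev_key is not None and key != prev_key:
        # group, rule or pattern changed: flush the accumulated weight
        print(prev_key, total_pattern_weight)
        total_pattern_weight = 0.0
    total_pattern_weight += weight
    prev_key = key
print(prev_key, total_pattern_weight)  # flush the last pattern
```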