commit 26b35771d203e467713e98698bb2de964af3b4cf
Author: Amr Keleg
Date:   Wed Aug 21 16:30:54 2019 +0200

    Add a README file

diff --git a/scripts/README.md b/scripts/README.md
new file mode 100644
index 0000000..0bc184f
--- /dev/null
+++ b/scripts/README.md
@@ -0,0 +1,276 @@

# Weighting of automata
The project aims to implement a set of algorithms for weighting transducers.

- [Weighting of automata](#weighting-of-automata)
  * [Dependencies](#dependencies)
  * [Models Description](#models-description)
      - [lt-weight](#lt-weight)
        * [Methodology](#methodology)
        * [Usage](#usage)
        * [Example](#example)
    + [Weightlist generation algorithms](#weightlist-generation-algorithms)
      - [annotated-corpus-to-weightlist](#annotated-corpus-to-weightlist)
        * [Methodology](#methodology-1)
        * [Usage](#usage-1)
        * [Example](#example-1)
        * [Limitations](#limitations)
        * [Future work](#future-work)
      - [unannotated-corpus-to-weightlist](#unannotated-corpus-to-weightlist)
        * [Methodology](#methodology-2)
        * [Usage](#usage-2)
        * [Example](#example-2)
        * [Future work](#future-work-1)
      - [equal-weightlist](#equal-weightlist)
        * [Methodology](#methodology-3)
        * [Usage](#usage-3)
        * [Example](#example-3)
      - [analysis-length-reweight](#analysis-length-reweight)
        * [Methodology](#methodology-4)
        * [Usage](#usage-4)
        * [Example](#example-4)
        * [Future work](#future-work-2)
      - [w2v-weightlist](#w2v-weightlist)
        * [Methodology](#methodology-5)
        * [Usage](#usage-5)
        * [Example](#example-5)
        * [Limitations](#limitations-1)
        * [Future work](#future-work-3)
  * [Evaluation](#evaluation)
    + [eval/corpus_split.py](#eval-corpus-splitpy)
      - [Usage](#usage-6)
    + [eval/\*\_fit.py](#eval-----fitpy)
      - [Usage](#usage-7)
    + [Example](#example-6)
    + [Results](#results)
  * [Appendix](#appendix)
    + [Xerox regexp](#xerox-regexp)

## Dependencies
**lt-weight**
- `lttoolbox`
- `hfst`

**Most of the weighting scripts**
- `python3`

**Word2vec scripts**
- `tqdm` (a Python package for showing progress bars)
- `gensim` (a Python package for training word2vec models)
`pip install tqdm gensim`

**Evaluation scripts**
- `tabulate` (a Python package for generating the evaluation results in Markdown table format)
`pip install tabulate`

## Models Description
#### lt-weight

##### Methodology
Add weights to a compiled dictionary using a set of weightlists.
- The weightlists are written as Xerox regular expressions, which are more powerful than weighted string pairs since a single entry can match all analyses sharing a given prefix or suffix.
  However, take care with complex regexps: composing a transducer with a complex regexp transducer is time- and memory-consuming (TODO: WHAT IS THE EXACT COMPLEXITY?).
- The script can take a sequence of weightlists; each weightlist acts as a fallback for paths that weren't weighted by the preceding weightlists.
- It's also advisable to use a default weightlist - in the form `[?*]::LARGE_WEIGHT` - as the last weightlist, so that paths not covered by any of the weightlists aren't dropped from the final composed transducer but are instead given a large default weight.
- Note: the script composes weighted FSTs using `hfst-compose` and assumes that the weights belong to a tropical semiring.
  So, if the input FST has a path mapping `cat` to `dog` with a weight of `1`, and the weightlist contains `[d o g]::2`, then the weighted FST will map `cat` to `dog` with a weight of `3` (1 + 2) - see the sketch below.
  TODO: Is this additive behaviour an advantage in any way?
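
The following minimal Python sketch only illustrates the semantics described above - a chain of weightlists applied with fallback, and weights combined additively as in a tropical semiring. The weightlist representation and the example patterns are made up for the illustration; this is not how `lt-weight` is implemented.

```python
import math
import re

# Each weightlist is modelled as a list of (pattern, weight) pairs, where the
# pattern stands in for a Xerox regexp entry; these entries are hypothetical.
weightlists = [
    [(re.compile(r"see<vblex><past>$"), 1.0)],  # wl1
    [(re.compile(r"saw<n><sg>$"), 2.0)],        # wl2
    [(re.compile(r"<vblex><imp>$"), 3.0)],      # wl3: imperative verbs
    [(re.compile(r".*"), 4.0)],                 # wl4: default, like [?*]::4
]

def weight_analysis(analysis: str, base_weight: float) -> float:
    """Weight one analysis path: the first matching weightlist wins, later
    weightlists only act as fallbacks, and - as in the tropical semiring -
    the weightlist weight is added to the path's existing weight."""
    for weightlist in weightlists:
        for pattern, weight in weightlist:
            if pattern.search(analysis):
                return base_weight + weight
    return math.inf  # matched by nothing: the path would be dropped

print(weight_analysis("saw<vblex><imp>", 0.0))   # 3.0 (matched by wl3)
print(weight_analysis("saw<vblex><pres>", 0.0))  # 4.0 (falls back to wl4)
```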

##### Usage
TODO: PRINT USAGE

##### Example
For the ambiguous English word `saw`, `apertium-eng` gives five distinct morphological analyses:
`^saw/saw<n><sg>/saw<vblex><inf>/saw<vblex><pres>/saw<vblex><imp>/see<vblex><past>$`
Let's assume we want to prioritize the analyses as follows:
1) `^saw/see<vblex><past>$`, as the verb *see* is fairly common in the language.
2) `^saw/saw<n><sg>$`, as it's both a popular movie and a useful tool.
3) `^saw/saw<vblex><imp>$`, as imperative verbs seem to occur a lot, in the linguist's opinion.
4) Any other possible analysis should be given a high default weight.

```
$ cat saw.att
0 1 s s
1 2 a a
2 3 w w
3 9 ε <n>
9 8 ε <sg>
3 6 ε <vblex>
6 8 ε <inf>
6 8 ε <pres>
6 8 ε <imp>
1 4 a e
4 5 w e
5 7 ε <vblex>
7 8 ε <past>
8

$ lt-comp lr saw.att saw.bin
main@standard 10 13

$ echo '[s e e %<vblex%> %<past%>]::1' > wl1
$ echo '[s a w %<n%> %<sg%>]::2' > wl2
$ echo '[?* %<vblex%> %<imp%>]::3' > wl3
$ echo '[?*]::4' > wl4

$ ./lt-weight saw.bin weighted_saw.bin wl1 wl2 wl3 wl4
Reading from /tmp/tmp.mOItp9wzUO/transducer.hfst and /tmp/tmp.mOItp9wzUO/weighted-regexp.hfst, writing to /tmp/tmp.mOItp9wzUO/weighted-transducer.hfst
Composing text(/tmp/tmp.mOItp9wzUO/transducer.att) and xre(?)...
Reading from /tmp/tmp.mOItp9wzUO/transducer.hfst and /tmp/tmp.mOItp9wzUO/weighted-regexp.hfst, writing to /tmp/tmp.mOItp9wzUO/weighted-transducer.hfst
Composing text(/tmp/tmp.mOItp9wzUO/transducer.att) and xre(?)...
Reading from /tmp/tmp.mOItp9wzUO/transducer.hfst and /tmp/tmp.mOItp9wzUO/weighted-regexp.hfst, writing to /tmp/tmp.mOItp9wzUO/weighted-transducer.hfst
Composing text(/tmp/tmp.mOItp9wzUO/transducer.att) and xre(?)...
Reading from /tmp/tmp.mOItp9wzUO/transducer.hfst and /tmp/tmp.mOItp9wzUO/weighted-regexp.hfst, writing to /tmp/tmp.mOItp9wzUO/weighted-transducer.hfst
Composing text(/tmp/tmp.mOItp9wzUO/transducer.att) and xre(?)...
main@standard 10 13

$ echo 'saw' | lt-proc weighted_saw.bin -W
^saw/see<vblex><past>/saw<n><sg>/saw<vblex><imp>/saw<vblex><inf>/saw<vblex><pres>$
```

**Note**:
- wl1 and wl2 can be merged into a single weightlist, since they don't actually act as fallbacks for each other.
- On the other hand, if wl3 and wl4 were merged into a single weightlist, then the analysis `saw<vblex><imp>` would be added twice to the weighted transducer, with two different weights (3 and 4)!

### Weightlist generation algorithms
#### annotated-corpus-to-weightlist
##### Methodology
Generate a weightlist given an annotated corpus.
The annotated corpus is in the form `^surface_form/analyzed_form$`.
The script estimates the weight of each analyzed form from the probability of the analysis in the corpus:
`P(analysis) = Count(analysis) / size of corpus`
This model acts as a benchmark for the other, unsupervised techniques.

To account for OOV analyses, Laplace smoothing is used such that:
- the weight for analyses that aren't part of the corpus is
`1 / (size of corpus + number of unique analyses in corpus + 1)`
- the weight for an analysis that is part of the corpus is
`(1 + Count(analysis)) / (size of corpus + number of unique analyses in corpus + 1)`

(A toy sketch of this estimate is shown after the Limitations below.)

A tweak to the weightlist generation (`--tag_weightlist`) was also added to give priority to analyses with tags that are common in the corpus,
i.e. if the noun `<n>` tag is highly probable in the corpus, then we might want to give OOV analyses carrying a `<n>` tag a lower weight (higher probability).
However, using the tag weightlist means the weights are no longer probabilistic.

##### Usage
TODO: PRINT USAGE

##### Example
TODO

##### Limitations
- The `--tag_weightlist` option seems to be slow, since it involves composing the original FST with a regexp FST of the form `[?* <n>]`.
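
For reference, the core estimate described under Methodology (without `--tag_weightlist`) boils down to something like the following sketch. It is not the actual script: the corpus parsing is simplified, and turning the smoothed probability into a tropical weight via `-log` is an assumption here - the real script may store the values differently.

```python
import math
import re
from collections import Counter

def unigram_weightlist(corpus_lines):
    """Laplace-smoothed analysis weights from `^surface_form/analyzed_form$` lines."""
    counts = Counter()
    for line in corpus_lines:
        match = re.match(r"\^[^/]+/(.+)\$", line.strip())
        if match:
            counts[match.group(1)] += 1

    corpus_size = sum(counts.values())
    denominator = corpus_size + len(counts) + 1  # corpus size + unique analyses + 1

    # Smoothed weight for analyses seen in the corpus ...
    weights = {analysis: -math.log((count + 1) / denominator)
               for analysis, count in counts.items()}
    # ... and the default weight reserved for analyses never seen in it (OOV).
    default_weight = -math.log(1 / denominator)
    return weights, default_weight

lines = ["^saw/see<vblex><past>$", "^saw/saw<n><sg>$", "^saw/see<vblex><past>$"]
print(unigram_weightlist(lines))
```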

##### Future work
- Make use of more advanced techniques than unigram counts (for example, n-grams).

#### unannotated-corpus-to-weightlist
##### Methodology
To generate a tagged corpus similar to the one used by the `annotated-corpus-to-weightlist` method, the script will:
- analyze a raw corpus using a compiled unweighted dictionary.
- use a constraint grammar to discard impossible analyses or select certain ones.
- write all the possible analyses to a file, one per line, in the format `^surface_form/analyzed_form$`.
- use `annotated-corpus-to-weightlist` to estimate a weightlist.

##### Usage
TODO: PRINT USAGE

##### Example
TODO

##### Future work
- Investigate the effect of using a constraint grammar and perform a thorough error analysis.

#### equal-weightlist
##### Methodology
Generate a simple weightlist that assigns the same weight to all analyses.
This model acts as the baseline for all the other techniques.

##### Usage
TODO: PRINT USAGE

##### Example
TODO

#### analysis-length-reweight
##### Methodology
Use `hfst-reweight` to directly give a weight of one to every edge of the finite-state transducer, so that longer analyses receive higher weights.

##### Usage
TODO: PRINT USAGE

##### Example
TODO

##### Future work
- Compare the results of using `equal-weightlist` and `analysis-length-reweight`.
  Weighting edges with epsilon input might be the reason for the drift between the results of the two methods.

#### w2v-weightlist
##### Methodology
Generate a weightlist for words based on a word2vec CBOW (continuous bag of words) model.
- First, use the whole raw corpus to train a word2vec model.
- Then, slide a window over the corpus and, for each center word, predict the most probable words given the current context.
- Finally, if the center word isn't ambiguous, just increment the count of its single analysis.
  Otherwise, for each analysis of the ambiguous center word, count the number of predicted similar words that aren't ambiguous AND whose tags match the tags of that analysis.

##### Usage
TODO: PRINT USAGE

##### Example
TODO

##### Limitations
- Training a word2vec model on a raw corpus is both time- and memory-intensive (that's why the script currently avoids loading the whole corpus into memory at once).

##### Future work
- Refactor the word2vec training function.

## Evaluation
To evaluate and compare the performance of the weighting methods, cross-validation is used.
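
Conceptually, the evaluation loop amounts to the sketch below: train on all folds but one, score on the held-out fold, and average. The helpers `train_weightlist` and `tagging_accuracy` are hypothetical placeholders for the work actually done by the `eval/*_fit.py` and `eval/metrics_report.py` scripts.

```python
from statistics import mean

def cross_validate(folds, train_weightlist, tagging_accuracy):
    """n-fold cross-validation: train on n-1 folds, evaluate on the held-out one."""
    scores = []
    for i, held_out in enumerate(folds):
        training = [sentence
                    for j, fold in enumerate(folds) if j != i
                    for sentence in fold]
        weightlist = train_weightlist(training)
        scores.append(tagging_accuracy(weightlist, held_out))
    return mean(scores)

# Dummy stand-ins, just to show the call shape.
folds = [["^dog/dog<n><sg>$"], ["^cat/cat<n><sg>$"], ["^saw/see<vblex><past>$"]]
print(cross_validate(folds,
                     train_weightlist=lambda sentences: {},
                     tagging_accuracy=lambda weightlist, fold: 1.0))
```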

### eval/corpus_split.py
A script to divide a tagged corpus into n folds (where n is a parameter).

#### Usage
TODO: PRINT USAGE

### eval/\*\_fit.py
Each weighting method has a script that trains n models and reports the metrics for each one in tabular form.

#### Usage
TODO: PRINT USAGE

### Example
```
REPO="../../apertium-kaz"
BIN="${REPO}/kaz.automorf.bin"
TAGGED_CORPUS="${REPO}/corpus/kaz.tagged"
UNTAGGED_CORPUS="${REPO}/kaz-clean"
CONSTRAINT_GRAMMAR="${REPO}/kaz.rlx.bin"
BASE_DIR=$(mktemp -d)
FOLDS_DIR="folds"
CLEANED_CORPUS="${BASE_DIR}/kaz.cleaned"

apertium-cleanstream -n < "$TAGGED_CORPUS" > "$CLEANED_CORPUS"
python eval/corpus_split.py -o "$FOLDS_DIR" "$CLEANED_CORPUS"

python eval/unigram_fit.py -i "$FOLDS_DIR" -b "$BIN" -o temp_uni_bin
python eval/equalweight_fit.py -i "$FOLDS_DIR" -b "$BIN" -o temp_eq_bin
python eval/constraintgrammar_fit.py -i "$FOLDS_DIR" -cg "$CONSTRAINT_GRAMMAR" -corpus "$UNTAGGED_CORPUS" -b "$BIN" -o temp_cg_bin
python eval/analysis_length_fit.py -i "$FOLDS_DIR" -b "$BIN" -o temp_ana_bin
python eval/w2v_fit.py -i "$FOLDS_DIR" -b "$BIN" -o temp_w2v_bin -corpus "$UNTAGGED_CORPUS"

echo 'Uni'
python eval/metrics_report.py -i "$FOLDS_DIR" -b temp_uni_bin
echo 'Eq'
python eval/metrics_report.py -i "$FOLDS_DIR" -b temp_eq_bin
echo 'Cg'
python eval/metrics_report.py -i "$FOLDS_DIR" -b temp_cg_bin
echo 'Length'
python eval/metrics_report.py -i "$FOLDS_DIR" -b temp_ana_bin
echo 'W2V'
python eval/metrics_report.py -i "$FOLDS_DIR" -b temp_w2v_bin
```
### Results
TODO: Add the results on the Apertium repos.

## Appendix
### Xerox regexp
After a fair amount of shotgun debugging, I found this file (ftp://ftp.cis.upenn.edu/pub/cis639/public_html/docs/xfst.html), and it was a great help in understanding how to use Xerox regexps - especially the `Commands <-> RE operators` section.
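
When experimenting with the notation, individual weightlist entries can also be compiled and inspected from Python. The sketch below assumes the `hfst` Python bindings are installed (`pip install hfst`); it is only a convenience for trying out entries before handing them to `lt-weight`.

```python
# Compile a weighted Xerox regexp and inspect the resulting transducer.
import hfst

entry = hfst.regex('[?*]::4')  # the default weightlist entry from the example above
entry.minimize()
print(entry)  # prints the compiled transducer (states, arcs and weights)
```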