TODO:

- PET cheats and has an explicit mapping from POS type to generic lexeme... which it may or may not actually use? ACE actually unifies every token into every generic lexeme.
- release.sh:
  - should look at version.h *in the checked-out copy* and make sure it matches the commandline
- auto regression tests:
  - should include NorSource, GG and JaCY
    - ideally would like confirmation from respective grammarians that ace does the right thing (some version of ACE, some version of the grammar), so that knowing that answers haven't changed is useful
  - should include some transfer grammar as well
    - need a tool to test results... do tsdb profiles even store transfer results? probably...
- transfer: what is the correct interpretation of the logical identity of variables attached to a +copy+'d EP?
- transfer: should variables that are QEQ be unifiable? (trigger rules: we believe 'yes'. berthold wants 'yes' for fixup rules too.)

fully automated regression testing of both parsing and generation
notes:
  + this works and is invaluable
  - we limit unpacking to some number of MB and 1000 results
    - comparison tool should be aware of that somehow
    - e.g. diffs in # of results when we run out of RAM are acceptable
    - diffs after first 1000 results aren't detected... unfortunate.
grammars:
  ERG: mrs, csli, hike
  HaG: hausa.items
  GG: mrs, babel
    - tried adding 'mrs' to the mix.
      - parsing: ok.
      - generating: the "gold" mrs profile included with GG is sufficiently out of date that generation is impossible.
        - no "decisions" data, so an update may be tricky.
        - but maybe a tree match update could succeed?
          - problem: logon deutsch 'mrs' profile has items numbered as multiples of 10, but gg gold 'mrs' profile has them as multiples of 1... sigh.
  JaCY: ?
  NorSource: massifcentral-pos ... LKB and ACE agree (jul-26-2012) modulo LKB bugs

berthold wants
  - *updating* an ace tsdb profile -- doesn't work, apparently?
  - LUI support
    -generator trees
    +full chart
    -partial chart
    -simple MRS

ideas:
- properly support lettersets with nonascii characters
- fix recursive labeling to check for cycles or empty SLASH difflists
- [mode to?] make output more script-friendly
- automatic time profile of a grammar
  parsing: pretty good, but add
    - orthographemic analysis
    - each rule for lexical parsing
    - MRS extraction
    - idiom checking
  generation:
    + fixup rule application
      + each rule
    - semantic index lookup
    - trigger rule application
      - each rule
    - main generation
    - unpacking
    - MRS extraction
    - subsumption checking
    - idiom checking?
- unifying and copying:
  - unifying: Glenn shows a way to completely avoid the ->carcs list, which sounds nice.
    - unfortunately, it requires us to choose a particular order in unification, so we don't get to try not to forward into frozen dags necessarily -- that means less structure sharing is possible and memory usage goes up.
    - but if we can ditch ->carcs, memory usage would go *down* too, and throughput might go up?
    - currently glenn's scheme doesn't work for ACE, because:
      unifying a passive edge (some subtype of `sign') (with no ARGS feature present due to deleted-daughters) with a daughter position in a rule (type `sign' typically, with an ARGS feature intact): the passive edge's root dag node is technically not wellformed. it has the more specific type, so it should be the forwarding target, but there's nowhere to store the ARGS feature
      - actually, comparing numbers of arcs shows identical arities so we try forwarding to the rule position, which has ARGS, but doesn't have C-CONT, a feature introduced on the passive edge.
      - possible solution: when trimming ARGS, leave a *top* node placeholder. that would preserve the appropriateness of features to a type, but would violate the appropriateness of types to a feature.
      - other possible solution: implement Glenn's fallback search
  - copying: to determine that a node X is shareable, we need to determine that X has not been changed, and that everything reachable from X has not been changed.
    - in principle this could involve traversing a large section of the graph just to decide not to copy it
    - idea: what if each node X had a flag indicating whether all access routes to all nodes under X go through X? maintaining such a flag would be potentially difficult.
      - but if we had it, then when we come across an X with that flag set, and we see that X has not been recursed into by the unifier, we know X (and its associated subgraph) is shareable, without looking at the whole subgraph under X.
      - one thing to worry about: if X is shared between two AVMs, it's possible that there is a reentrancy to below X in one of the AVMs and not in the other. I guess we need to be conservative and not flag X as safe in that case.
      - more complicated: if something below X is shared between two AVMs, then down the road someone else could get a reference into something below X... but I guess the flag is only relevant w.r.t. the top level AVM being copied.
      - naive algorithm to create the flags: whenever you add an arc A -> B, recursively walk up (other) parents of B until you reach the root or A, and mark them as not self-contained
        - snag: we don't have parent link lists in dags (there could be lots of parents of any given node... and in this usage we really only want parents from the same top-level AVM... ugh).
        - also sounds very slow -- but maybe only applicable for manual arc additions that create reentrancies?
      - for the unifier, there's a quicker algorithm: when unifying two nodes, the result is self-contained if both inputs are.
        - note that this is not a *necessary* condition:
            [ X #1, Y #1 ] & [ X [ A #2 ], Y [ A #2 ] ] = [ X #1 [ A top ], Y #1 ]
          in the right-hand input dag, the path X is not self-contained (since X.A = Y.A), but in the result dag the reentrancy from Y.A evaporates because X=Y.
        - but it is a sufficient condition.
      - how about a naive algorithm to build the flags during a (non-structure-sharing) copy, e.g. the freeze() operation that saves the grammar to disk?
        - not sure how that would work.
    - let's get an estimate for how many copy() calls end up sharing instead of copying...
      answer: 65.9% of nodes get shared instead of copied
        about half of which are atomic
      what proportion of these would we be able to *detect*
- should we be doing unfilling?

DONE:

unreleased 0.9.10pre4
  - don't overwrite user's config file if they swap -G and -g by mistake
  - keep track of start/end vertices in lattice mapping
  - add a free pass-through when a non-spelling-change-rule becomes a spelling-change-rule by way of the irregs table
  - be a bit more generous about irregs.tab syntax
0.9.10pre3
  - roll back (by default) to original +copy+ variable semantics
  - single quoted 'strings are now warnings and are converted to "STRINGS"
  - TNT callout
  - multiple POS tags/probs per word (from TNT or from YY)
  - lattice mapping is much faster in some circumstances
  - more structure sharing for generation
  - code to produce normalized maxent probabilities
  - itsdb status code output is more uniform
  - dead code removal from old ad-hoc unknown word handling and chart reduction
  - fix another packing bug
  - unifier runs things in a slightly different order, for better subgraph sharing
0.9.10pre2
  - initial ICONS support
0.9.10pre1
  - fix a lexical packing bug resulting in spurious readings
  - merge some code for being launched/controlled by LUI
  - transfer mode: configuration option to prefix a namespace in input EPs
  - transfer mode: configuration options for input/output VPMs
  - set locale numeric type to POSIX, to avoid problems
    reading standard decimals with scanf
  - make mrs HOOK path configurable
  - check for cycles in trimmed portion of unification results
0.9.9
  - no changes??? even neglected to update the version.h number.
0.9.8
  - make FLAGS.SUBSUME work when no SEMI is loaded
  - make ITSDB forest token output work for token-free grammars
0.9.7
  - transfer rules can output new HCONS elements
  - transfer +copy+ can invent new variables
  - ??? what else
0.9.6
  - ??? what else
  - trigger rule DAG specialization mechanism
  - rebuilt HCONS matching in transfer rules
0.9.6pre1
  - "cleanup" transfer rules, which apply right after unpack() calls extract_mrs()
0.9.5
  - one more TDL case sensitivity...
  - Version.lsp reading is more robust
0.9.4
  - TDL operators are now case insensitive in their LHS
  - bug fix: generation derivation trees were inflecting token strings in the wrong order
  - bug fix: copy_mrs() wasn't copying the ->dg field
  - new --generation-server=LANG option (watches ~/tmp/.transfer.USER.LANG and generates from it)
  - LUI support for displaying realizations (click for tree, with active nodes)
0.9.4pre4
  - fflush for generator script consumption
  - don't crash when EPs have no LBL
  - derivation trees from generation have proper token structures and orthographies
  - new option --show-realization-trees, which prints derivation trees for generation results
  - new option --show-realization-mrses, which prints MRSes for generation results
  - support for labeling SLASHes, with configuration options recursive-label-path-in-label and recursive-label-path-in-sign
  - support for special :c command to show last parse chart in LUI
  - slightly better communication with LUI (e.g.
    actually notice when LUI exits, suppress some needless noise)
  - report an error when attempting to parse and no REPP is loaded, unless in YY mode
  - new --report-trees option for sending labelled trees to TSDB (or stdout even)
0.9.4pre3
  - YY mode
  - don't output a properties clause for MRS variables that have no properties
0.9.4pre2
  - fix MRS characterization output format (thanks Berthold)
  - new optional post-generation token mapping stage, for visually fixing up generated strings
0.9.4pre1
  - bug fixes in fixup
  - moving towards configurability for transfer-only grammars
0.9.3
  - basic version of output-enabled transfer rules, used for mrs fixup for generation
  - changed post-generation subsumption test to operate on *external* MRSes
0.9.3pre2
  - bigger TDL buffer: JaCY's QC skeleton needed it
  - allow lexemes to have STEM parts that aren't strings
0.9.3pre1
  - better parsing of MRS string constants
  - config option to specify whether LTOP is extracted or invented
  - fix QC-from-instance loader
  - changes that speed up lattice mapping a lot in some cases
  - profiling mode "-i", for parsing
  - improved filtering of orthographemic rule chain hypotheses
0.9.2
  - fix TDL reader/dagify to not conflate coreferences with the same name in different :+ addenda
  - fix a unicode bug in token mapping
  - fixed an obscure bug wherein GLB types could have incorrect constraints
  - don't load token mapping rules when token mapping is disabled
  - support more than 256 features
  - support ^ and $ in token mapping positional constraints
  - support spaces showing up in more unexpected places in TDL syntax
  - the path to the label within parse-node instances is now configurable (LNAME feature)
  - semantic indexing now pays attention to the lex-rels-path and rule-rels-path configurations
  - quickcheck can be loaded from a PET QC instance
  - new configuration options for limiting the number of orthographemic rules to apply
  - new configuration option to specify how much room to preallocate in the freezer.
  - new configuration option to specify what file[s] to load irregular forms from
  - new configuration option to specify a suffix for rule names given in irregulars tables
0.9.1
  - remove dependence on #include
  - change ~/logon/ to ${LOGONROOT} in Makefile

0.9
-----
- potentially use POS tagger to prune lexical ambiguity
  - but POS tagger makes mistakes...
- preprocessor characterization problem:
    ... hole for 'n'
    ... hole for '''
    ... hole for 't'
    ... lost 'n'=>4
    filling hole 't' with 'n'
    ... lost '''=>5
    ... lost 't'=>6
    `(.) +n't' -> `\1n't' yields ` I don't know. '
+ debug: I believe there are 15 747 724 136 275 002 577 605 653 961 181 555 468 044 717 914 527 116 709 366 231 425 076 185 631 031 296 296 protons in the universe and the same number of electrons.
  + currently, that results in a "too much RAM" error. that seems reasonable.

delete deleted daughters from daughters of rule, since they can't ever be present in unified-in daughters

figure out why unpacking is so much slower than PET (sometimes 10x slower!)
  - mrs extraction is the big bottleneck
    - approach one: memoization; most bits and pieces of MRS are reused many times
      - tried this on mrs_var's with good results, should also do ep's and hcon's
    - type names perhaps should be pointers to types, since much time is spent looking up and comparing types
  - maybe some other aspects of unpacking are slow? not clear yet.

learn something from head-corner strategy

ignore punctuation-chars for jacy

try to prove/analyze interesting things at grammar compile time
  - *intelligently* auto pick quickcheck paths
  - make quickcheck used for packing ignore packing restrictor
    - and the shameless hack(tm) in qc.c with 0-1-lists
  - e.g.
    the INFLECTD is monotonic, '-' becomes '+'
    - this would help rule out applying inflectional rules like non3sg_fin_verb when 3sg_fin_verb is on the orth-agenda
    - might be lots of other such features we could find
  - auto pick packing restrictor
  - for each type of rule, precompute segments of the rule which are non-reentrant with ARGS[k]
    - then on filling in ARGS[k], copy() can know to structure-share without looking at those segments

optimize lexicon storage; maybe use provided lexdb schema

maybe support lexdb

make STEM actually be updated by ortho rules; apparently some grammars depend on it.
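
sketch for the "self-contained" flag / non-reentrant segment idea from the copying notes: a brute-force offline check (Python, not ACE code) that marks a node X as self-contained when every parent of every node below X is X itself or lies below X. The dict-of-dicts dag representation and the names reachable() and self_contained() are assumptions made for illustration only; the real dag structs have no parent links, which is exactly the snag noted above, so this is only plausible as a precomputation and not inside the unifier.

```python
from collections import defaultdict

def reachable(dag, start):
    """Nodes strictly below `start` (dag: node -> {feature: child node})."""
    seen = set()
    stack = list(dag.get(start, {}).values())
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(dag.get(node, {}).values())
    return seen

def self_contained(dag, root):
    """Map each node under `root` to True iff all access routes from `root`
    to the nodes below it pass through it.  Quadratic; a sketch only."""
    nodes = reachable(dag, root) | {root}
    parents = defaultdict(set)  # parent links, restricted to this top-level AVM
    for node in nodes:
        for child in dag.get(node, {}).values():
            parents[child].add(node)
    flags = {}
    for x in nodes:
        below = reachable(dag, x)
        inside = below | {x}
        flags[x] = all(parents[n] <= inside for n in below)
    return flags

# The example from the notes: [ X [ A #2 ], Y [ A #2 ] ]
rhs = {"root": {"X": "x", "Y": "y"}, "x": {"A": "a"}, "y": {"A": "a"}, "a": {}}
# ... and the unification result [ X #1 [ A top ], Y #1 ]
res = {"root": {"X": "n", "Y": "n"}, "n": {"A": "a"}, "a": {}}

rhs_flags = self_contained(rhs, "root")
res_flags = self_contained(res, "root")
```

here rhs_flags["x"] comes out false (X.A is also reachable via Y) while res_flags["n"] comes out true, matching the observation that "both inputs self-contained" is sufficient but not necessary for the result.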