BUG: surprising .mem features output:

(13797136) [1 (0) n_-_pn-unk_le "Settings\Administrator\" +FROM "] -0.997524 {0 0 0 0} [0 0]

first and more serious bug is that the backslashes are there unescaped
- failed to escape FORMs as they went into quoted feature leaf strings

second bug is that the +FROM from the token structures shows up...
- insufficient escaping when the edge record contains [ +FORM "Settings\Administrator\" +FROM "37" ... ], i.e. the backslashes are part of the content

... arguably they should be escaped as soon as they go inside double quotes, i.e. in ace's build_token_dg() (a minimal escaping sketch is at the end of these notes)

plan:

load parse forests from tsdb profiles
convert to feature forests
- unfold so feature contexts are local to 'and' nodes
- ungrandparented, initially

ws01 edge relation on disk takes 230MB
reading the edge relation takes ~8s and 2GB RAM
ws01 parse forests take 1GB(?) of RAM and ~7s (on top of the existing 8s, for a 15s total) to load
ws01 ungrandparented feature forests take 7GB(?) of RAM and ~45s (+15s = 60s total) to load
ws01-ws12 might take 90GB of RAM for feature forests

train maxent using clever math
distribute forest data over multiple processes/nodes

master process:
- use mela to drive the optimizer
- update() -- send current lambda to the nodes, initiate computation of gradient and log-likelihood; collect results; add the regularization term
- gradient() and objective() just return the values

slave processes:
- connect to the master process, load up a tsdb profile, convert it to a feature forest
- loop: wait for lambdas, do the math, send back results

notes:
- could split different tsdb profiles across different processes, coordinated somehow
- need to use the existing wescience.mem to get a *real* top-1 rate on ws01-ws12 and on ws13, and output a new .mem file
  - but the existing wescience.mem is GP[2]! not comparable.

maxent math:

- highest entropy subject to E(f) = E~(f)
- equivalently, highest training likelihood, given an exponential model

L = sum_i log P(Y_i | X_i ; lambda)
dL/dlambda = sum_i [ (dP(Y_i | X_i ; lambda)/dlambda) / P(Y_i | X_i ; lambda) ]

P(Y_i | X_i ; lambda) = exp(lambda dot F_i,gold) Z^{-1}
Z = sum_j exp(lambda dot F_i,j)

dP/dlambda = F_i,gold P(Y_i | X_i ; lambda) - exp(lambda dot F_i,gold) Z^{-2} dZ/dlambda
           = F_i,gold P(Y_i | X_i ; lambda) - P(Y_i | X_i ; lambda) Z^{-1} dZ/dlambda
dZ/dlambda = sum_j F_i,j exp(lambda dot F_i,j)
dP/dlambda = P(Y_i | X_i ; lambda) [ F_i,gold - E(f_i | X_i ; lambda) ]

dL/dlambda = sum_i [ F_i,gold - E(f_i | X_i ; lambda) ]
(F_i,gold: empirical feature values; E(f_i | X_i ; lambda): model expectation)

need to efficiently compute
E(f_i | X_i ; lambda) = (1/Z) dZ/dlambda = (1/Z) sum_j F_i,j exp(lambda dot F_i,j)

i.e. need to compute:
- Z = sum_j 1 * exp(lambda dot F_i,j)
- sum_j F_i,j * exp(lambda dot F_i,j)

dropping the _i's (i.e. which sentence) for convenience, need a way to compute
sum_j g(j) exp(lambda dot F_j)
for certain classes of g (namely g(j) = 1 and g(j) = F_j),
bearing in mind that F_j is a feature *vector* ... but we can consider each dimension independently if necessary.

the easy variant, g(j) = 1:
Z = sum_j exp(lambda dot F_j)
- when combining several branches with an OR node, the outer sum over unpackings splits, so the whole term sums
- when combining parts of a tree with an AND node, the lambda dot F_j's add, so the inner exp(lambda dot F_j)'s multiply
- and since the choices at nested ORs are independent, products like (z1+z2+z3) * (z4+z5+z6) combine nicely, so the whole term multiplies
... that's how the unpacking probability calculation in ACE works.
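to make that concrete, a minimal sketch of the inside/Z computation over a packed AND/OR forest. the struct layout and function names here are hypothetical, not ACE's actual internals, and scores are kept in the linear domain for readability (the log-domain version below is what you'd actually want):

#include <math.h>

struct or_node;

struct and_node {
    double local_score;       /* lambda dot (features local to this AND) */
    int nkids;
    struct or_node **kids;    /* daughter OR nodes */
};

struct or_node {
    int done;                 /* packed forests are DAGs: memoize shared nodes */
    double inside;
    int nalts;
    struct and_node **alts;   /* the packed alternatives */
};

static double and_inside(struct and_node *a);

/* at an OR node the outer sum over unpackings splits,
   so the alternatives' inside scores just add */
static double or_inside(struct or_node *o)
{
    if (o->done) return o->inside;
    double z = 0;
    for (int i = 0; i < o->nalts; i++)
        z += and_inside(o->alts[i]);
    o->done = 1;
    return o->inside = z;
}

/* at an AND node the lambda dot F terms add, so the exps multiply;
   independent choices at the daughter ORs multiply out like
   (z1+z2+z3) * (z4+z5+z6) */
static double and_inside(struct and_node *a)
{
    double z = exp(a->local_score);
    for (int i = 0; i < a->nkids; i++)
        z *= or_inside(a->kids[i]);
    return z;
}

/* Z for the sentence -- the normalizer in P(Y | X ; lambda) -- is
   then just or_inside(root) */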
how about the harder variant, g(j) = F_j?

g(j) is a sum of values contributed by the different AND nodes, so

sum_j g(j) exp(lambda dot F_j)   (a sum over all unpackings)
= sum_{n an AND node contributing v to g(j)} v * sum_{all unpackings using n} exp(lambda dot F_j)

so, need to be able to compute, for any given AND node n:
sum_{all unpackings using n} exp(lambda dot F_j)

... this is where the inside/outside thing comes in. we get to assume the local unpackings of n can all fit into any context above n:

sum_{all unpackings using n} exp(lambda dot F_j)
= (sum over local unpackings of n of exp(lambda dot local_features))
  * (sum over all trees containing n, modulo what happens inside n, of exp(lambda dot features_outside_n))

the former term is the "inside" score for n
the latter term is the "outside" score for n

already saw how to compute inside scores. for outside scores, start at the top:
- the top-level OR has outside score 1
- an AND's outside score is the sum of the outside scores of the ORs it appears in
- an OR's outside score is the sum of the (product of outside and inside scores) of the ANDs it appears in, divided by its own inside score

could also use all-readings scores; call it Z (= inside * outside):
- for the top-level OR, Z = its inside score
- an AND's Z score is the sum of the (Z score divided by inside score) of the ORs it appears in, multiplied by its own inside score
- an OR's Z score is the sum of the Z scores of the ANDs it appears in
... that's how the readings counter in FFTB works.

inside and Z are both sums of exp(sum of weights) terms; likely to get small. store as logs?

abstracting away from the exp(lambda dot F_j)'s...
- some function f(t) = a property of an unpacked reading t
- want to sum f(t) over all readings that use node n
- an OR's score is the sum of the scores of all the ANDs it appears in

after working all this out by myself, I checked the Miyao and Tsujii paper, and I got it right :-) yay.
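a sketch of that outside pass and the expectation it buys, following the recurrences above. assumptions, beyond the hypothetical structs (again not ACE's or FFTB's actual code): the inside pass has already filled in log_inside, scores are stored as logs per the "store as logs?" note, and a combined topological ordering of the nodes (every node after all of its parents) is available:

#include <math.h>

struct or_node;

struct and_node {
    double log_inside, log_outside;        /* log-domain scores */
    int nkids;  struct or_node **kids;
    int nfeats; int *feat; double *fval;   /* local feature ids and counts */
};

struct or_node {
    double log_inside, log_outside;
    int nalts;  struct and_node **alts;
};

/* one entry per node, in topological order: parents before daughters */
struct node {
    int is_and;
    union { struct and_node *a; struct or_node *o; } u;
};

/* numerically stable log(exp(x) + exp(y)) */
static double logaddexp(double x, double y)
{
    if (x == -HUGE_VAL) return y;
    if (y == -HUGE_VAL) return x;
    return x > y ? x + log1p(exp(y - x)) : y + log1p(exp(x - y));
}

/* log_outside starts at -HUGE_VAL everywhere except the root OR,
   where it starts at 0 (outside score 1).  because every node comes
   after all of its parents in the ordering, its outside score is
   complete by the time it pushes contributions downward. */
static void outside_pass(struct node *topo, int nnodes)
{
    for (int i = 0; i < nnodes; i++) {
        if (!topo[i].is_and) {
            struct or_node *o = topo[i].u.o;
            /* an AND sums the outside scores of the ORs it appears in */
            for (int j = 0; j < o->nalts; j++) {
                struct and_node *a = o->alts[j];
                a->log_outside = logaddexp(a->log_outside, o->log_outside);
            }
        } else {
            struct and_node *a = topo[i].u.a;
            /* an OR sums outside*inside of the ANDs it appears in,
               divided by its own inside score */
            for (int k = 0; k < a->nkids; k++) {
                struct or_node *d = a->kids[k];
                d->log_outside = logaddexp(d->log_outside,
                    a->log_outside + a->log_inside - d->log_inside);
            }
        }
    }
}

/* sum over all unpackings using AND node a of exp(lambda dot F) is
   exp(log_inside + log_outside); divided by Z (log_z is the root
   OR's log_inside) it becomes the probability mass of the readings
   using a, which weights a's local feature counts in
   E(f | X ; lambda).  the per-sentence gradient contribution is then
   F_gold[k] - E[k]. */
static void accumulate_expectation(struct and_node *a, double log_z, double *E)
{
    double w = exp(a->log_inside + a->log_outside - log_z);
    for (int i = 0; i < a->nfeats; i++)
        E[a->feat[i]] += a->fval[i] * w;
}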
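and, returning to the escaping bug at the top of these notes: the fix amounts to backslash-escaping backslashes and double quotes the moment a string goes inside double quotes, i.e. in build_token_dg() or wherever the quoted feature leaf strings are built. a minimal sketch of that escaping (hypothetical helper, not ACE's actual code; assumes the caller sized out at 2*strlen(in)+1):

/* escape '\' and '"' so a value survives inside a
   double-quoted feature leaf string */
static void escape_for_quotes(const char *in, char *out)
{
    char *p = out;
    for (; *in; in++) {
        if (*in == '\\' || *in == '"')
            *p++ = '\\';
        *p++ = *in;
    }
    *p = '\0';
}

with this, the FORM above would come out as "Settings\\Administrator\\", so a trailing backslash can no longer swallow the closing quote.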