Characterization Study for REPP

The following (improbable) REPP presents a difficult question for characterization:

:[ \t]
!([A-Za-z]+) ([A-Za-z]+)			\1 ~ \2

This preprocessor rewrites "I arrived." to "I ~ arrived." and then tokenizes it as ["I", "~", "arrived."]. The correct characterization of the first and last tokens is clear, but that of the "~" token is not. There are at least four plausible answers:

1. "~" corresponds to the space between the two input words.
2. "~" corresponds to the zero-length string between "I" and " arrived."
3. "~" corresponds to the zero-length string between "I " and "arrived."
4. "~" does not correspond to any input character positions.
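
To make the four candidates concrete, here is a minimal Python sketch that imitates the rewrite and tokenization with the standard re module and writes each candidate as a (start, end) character span over the input. It does not call REPP itself. The 0-based, end-exclusive offsets and all variable names are assumptions made for illustration, and the character class is widened to [A-Za-z] so that the capital "I" matches.

```python
import re

# Stand-in for the ":[ \t]" tokenizer line.
SEPARATOR = re.compile(r"[ \t]")
# Stand-in for the rewrite rule; [A-Za-z] is an assumption so that "I" matches.
RULE = re.compile(r"([A-Za-z]+) ([A-Za-z]+)")
REPLACEMENT = r"\1 ~ \2"

text = "I arrived."
rewritten = RULE.sub(REPLACEMENT, text)   # apply the rewrite rule
tokens = SEPARATOR.split(rewritten)       # split on the separator

# Locate the gap between the two captured words in the ORIGINAL input.
match = RULE.search(text)
gap_start, gap_end = match.end(1), match.start(2)

# The four candidate characterizations of "~" (None means "no span").
candidates = {
    1: (gap_start, gap_end),    # the space between the two input words
    2: (gap_start, gap_start),  # zero-length, between "I" and " arrived."
    3: (gap_end, gap_end),      # zero-length, between "I " and "arrived."
    4: None,                    # no input character positions at all
}

print(rewritten)
print(tokens)
print(candidates)
```

Under these assumptions the space occupies the span (1, 2), so answers 2 and 3 are the empty spans at either edge of that one character.
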

Unfortunately, each of these interpretations has a shortcoming, given here in the same order:

1. A token should not correspond to a string matching the token separator expression.
2. There is no reason to prefer this to option 3.
3. There is no reason to prefer this to option 2.
4. Every token should correspond to some input string segment.
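
The objections above can be phrased as mechanical checks on a candidate span. The following Python sketch is one hypothetical way to do so, not part of REPP: the function name objections, the span encoding (0-based, end-exclusive offsets, None for "no span"), and the wording of the messages are all assumptions.

```python
import re

SEPARATOR = re.compile(r"[ \t]")   # stand-in for the ":[ \t]" tokenizer line
text = "I arrived."

# Candidate spans for "~" as (start, end) offsets into text; None means no span.
candidates = {1: (1, 2), 2: (1, 1), 3: (2, 2), 4: None}

def objections(candidates, text):
    """Map each option number to the objection it runs into, or None."""
    # Options that pick an empty span compete with one another.
    zero_length = [opt for opt, span in candidates.items()
                   if span is not None and span[0] == span[1]]
    found = {}
    for option, span in candidates.items():
        if span is None:
            # Every token should correspond to some input segment.
            found[option] = "token has no input segment"
        elif SEPARATOR.fullmatch(text[span[0]:span[1]]):
            # A token should not correspond to separator material.
            found[option] = "span matches the token separator"
        elif option in zero_length and len(zero_length) > 1:
            # Nothing distinguishes one empty span from the other.
            rivals = [o for o in zero_length if o != option]
            found[option] = f"no reason to prefer over option {rivals[0]}"
        else:
            found[option] = None
    return found

for option, reason in objections(candidates, text).items():
    print(option, reason)
```

In this encoding no candidate comes out free of objection, which is the difficulty described above.
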