The following (improbable) RPP presents a difficult question for characterization:
:[ \t]
!([a-z]+) ([a-z]+) \1 ~ \2
This preprocessor rewrites "I arrived." to "I ~ arrived.",
and then tokenizes it as ["I", "~", "arrived."].
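The behavior described above can be reproduced with a minimal sketch in Python. This is an emulation under stated assumptions, not the RPP implementation: it assumes the "!" rule is an ordinary regular-expression substitution, that the ":" rule names the token separator, and that matching is case-insensitive (otherwise [a-z]+ would not match the capital "I"). The function name preprocess is hypothetical.

```python
import re

# Assumed reading of the two rules above (not confirmed by the RPP itself):
#   ":" rule -> token separator expression
#   "!" rule -> rewrite pattern followed by its replacement template
SEPARATOR = r"[ \t]"
REWRITE_PATTERN = r"([a-z]+) ([a-z]+)"
REWRITE_TEMPLATE = r"\1 ~ \2"


def preprocess(text):
    """Apply the rewrite rule, then split the result on the separator."""
    # Assumption: case-insensitive matching, so that "I" matches [a-z]+.
    rewritten = re.sub(REWRITE_PATTERN, REWRITE_TEMPLATE, text,
                       flags=re.IGNORECASE)
    # Drop empty strings that re.split yields for adjacent separators.
    tokens = [tok for tok in re.split(SEPARATOR, rewritten) if tok]
    return rewritten, tokens


rewritten, tokens = preprocess("I arrived.")
print(rewritten)  # I ~ arrived.
print(tokens)     # ['I', '~', 'arrived.']
```

Note that the sketch only produces token strings. It records nothing about where each token came from in the input, which is exactly the information a characterization has to supply.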
The correct characterization of the first and last tokens is clear, but the correct characterization of the "~" token is not.
There are at least four plausible answers:
- "~" corresponds to the space between the two input words.
- "~" corresponds to the zero-length string between "I" and " arrived."
- "~" corresponds to the zero-length string between "I " and "arrived."
- "~" does not correspond to any input character positions.
Unfortunately, each of these interpretations has a shortcoming. In the same order as the answers above:
- Option 1: a token should not correspond to a string matching the token separator expression.
- Option 2: there is no reason to prefer this to option 3.
- Option 3: there is no reason to prefer this to option 2.
- Option 4: every token should correspond to some input string segment.
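To make the four candidates concrete, they can be written as half-open character spans (start, end) over the input "I arrived.". The sketch below is illustrative only: the span convention, the use of None for "no input position", and the name candidates are assumptions, not part of the RPP.

```python
source = "I arrived."
first, last = "I", "arrived."

# Offsets of the gap between the two input words.
gap_start = len(first)          # just after "I"
gap_end = source.index(last)    # just before "arrived."

# Hypothetical encoding of the four answers as half-open (start, end) spans.
candidates = {
    1: (gap_start, gap_end),    # the space itself
    2: (gap_start, gap_start),  # empty span between "I" and " arrived."
    3: (gap_end, gap_end),      # empty span between "I " and "arrived."
    4: None,                    # no corresponding input position
}

for option, span in candidates.items():
    text = None if span is None else source[span[0]:span[1]]
    print(option, span, repr(text))
```

Options 2 and 3 differ only in which side of the separator the empty span sits on, which is why neither can be preferred over the other on these grounds alone.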