This was accepting an early EOF when the active child `ParseState` was in an
accepting state, because it was not ensuring that anything on the stack was
also accepting.
Ideally, there should be nothing on the stack, and hopefully in the future
that's what happens. But with how things are today, it's important that, if
anything is on the stack, it is accepting.
Since `is_accepting` on the superstate is only called during finalization,
and because the check terminates early, and because the stack practically
speaking will only have a couple things on it max (unless we're in tail
position in a deeply nested tree, without TCO [yet]), this shouldn't be an
expensive check.
Implementing this did require that we expose `Context` to `is_accepting`,
which I had hoped to avoid having to do, but here we are.
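
As a minimal sketch of the check described above (not the actual
implementation): `ChildState`, `StackContext`, and `SuperState` are
hypothetical stand-ins for the child `ParseState`, the `Context` holding the
stack, and the superstate. The point is only that the stack must be visible
to `is_accepting`, and that the check stops at the first non-accepting
state.

```rust
/// Hypothetical stand-in for a child `ParseState`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ChildState {
    Expecting,
    Done,
}

impl ChildState {
    fn is_accepting(&self) -> bool {
        matches!(self, ChildState::Done)
    }
}

/// Hypothetical stand-in for the `Context` that holds the explicit stack.
#[derive(Debug, Default)]
struct StackContext {
    stack: Vec<ChildState>,
}

/// Hypothetical superstate wrapping the active child.
#[derive(Debug)]
struct SuperState {
    active: ChildState,
}

impl SuperState {
    /// Accepting only if the active child accepts and so does everything
    /// still on the stack.  `all` short-circuits on the first state that is
    /// not accepting, so the common case is cheap.
    fn is_accepting(&self, ctx: &StackContext) -> bool {
        self.active.is_accepting()
            && ctx.stack.iter().all(ChildState::is_accepting)
    }
}
```

With an empty stack this reduces to the old behavior of consulting only the
active child.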
DEV-7145
Along with this change we also had to change how we handle dead states in
the superstate. So there were two problems here:
1. Sum states were not yielding a dead state after recovery, which meant
that parsing was unable to continue (we still have a `todo!`); and
2. The superstate considered it an error when there was nothing left on
the stack, because I assumed that ought not happen.
Regarding #2---it _shouldn't_ happen, _unless_ we have extra input after we
have completed parsing. Which happens to be the case for this test case,
but more importantly, we shouldn't be panicking with errors about TAMER bugs
if somebody puts extra input after a closing root tag in a source file.
DEV-7145
This properly integrates the trampoline into `ele_parse!`. The
implementation leaves some TODOs, most notably broken mixed text handling
since we can no longer intercept those tokens before passing to the
child. That is temporarily marked as incomplete; see a future commit.
The introduced test `ParseState`s were to help me reason about the system
intuitively as I struggled to track down some type errors in the monstrosity
that is `ele_parse!`. It will fail to compile if those invariants are
violated. (In the end, the problems were pretty simple to resolve, and the
struggle was the type system doing its job in telling me that I needed to
step back and try to reason about the problem again until it was intuitive.)
This keeps around the NT states for now, which are quickly used to
transition to the next NT state, like a couple of bounces on a trampoline:
NT -> Dead -> Parent -> Next NT
This could be optimized in the future, if it's worth doing.
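
To illustrate the bounce sequence above, here is a toy model under stated
assumptions: the `Super` enum and its numeric NT indices are hypothetical,
and real states carry far more data. It only shows the order of states
visited when a child NT cannot handle a token.

```rust
/// Hypothetical flattened superstate for a parent with several child NTs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Super {
    /// Child nonterminal `n` is active.
    Nt(usize),
    /// Child `n` had no transition for the token and yielded it back.
    Dead(usize),
    /// The parent is selecting child `n` as the next NT to try.
    Parent(usize),
}

/// One bounce of the trampoline for a token the active child cannot handle.
fn bounce(state: Super) -> Super {
    match state {
        Super::Nt(n) => Super::Dead(n),
        Super::Dead(n) => Super::Parent(n + 1),
        Super::Parent(n) => Super::Nt(n),
    }
}

/// Record every state visited while moving from one NT to the next.
fn bounces_to_next_nt(start: usize) -> Vec<Super> {
    let mut trace = vec![Super::Nt(start)];
    loop {
        let next = bounce(*trace.last().unwrap());
        trace.push(next);
        if matches!(next, Super::Nt(_)) {
            return trace;
        }
    }
}
```

The two intermediate states are the bounces that a future optimization
could skip.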
This also makes no attempt to implement tail calls; that would have to come
after fixing mixed content and really isn't worth the added complexity
now. I (desperately) need to move on, and still have a bunch of cleanup to
do.
I had hoped for a smaller commit, but that was too difficult to do with all
the types involved.
DEV-7145
And here's the thing that I've been dreading, partly because of the
`macro_rules` issues involved. But, it's not too terrible.
This module was already large and complex, and this just adds to it---it's
in need of refactoring, but I want to be sure it's fully working and capable
of handling NIR before I go spending time refactoring only to undo it.
_This does not yet use trampolining in place of the call stack._ That'll
come next; I just wanted to get the macro updated, the superstate generated,
and tests passing. This does convert into the
superstate (`ParseState::Super`), but then converts back to the original
`ParseState` for BC with the existing composition-based delegation. That
will go away and will then use the equivalent of CPS, using the
superstate+`Parser` as a trampoline. This will require an explicit stack
via `Context`, like XIRF. And it will allow for tail calls, with respect to
parser delegation, if I decide it's worth doing.
The root problem is that source XML requires recursive parsing (for
expressions and statements like `<section>`), which results in recursive
data structures (`ParseState` enum variants). Resolving this with boxing is
not appropriate, because that puts heap indirection in an extremely hot code
path, and may also inhibit the aggressive optimizations that I need Rust to
perform to optimize away the majority of the lowering pipeline.
Once this is sorted out, this should be the last big thing for the
parser. This unfortunately has been a nagging and looming issue for months,
that I was hoping to avoid, and in retrospect that was naive.
DEV-7145
I'm disappointed that I keep having to implement features that I had hoped
to avoid implementing.
This introduces a "superstate" feature, which is intended really just to be
a sum type that is able to delegate to stitched `ParseState`s. This then
allows a `ParseState` to transition directly to another `ParseState` and
have the parent `ParseState` handle the delegation---a trampoline.
This issue naturally arises out of the recursive nature of parsing a TAME
XML document, where certain statements can be nested (like `<section>`), and
where expressions can be nested. I had gotten away with composition-based
delegation for now because `xmlo` headers do not have such nesting.
The composition-based approach falls flat for recursive structures. The
typical naive solution is boxing, which I cannot do, because not only is
this on an extremely hot code path, but I require that Rust be able to
deeply introspect and optimize away the lowering pipeline as much as
possible.
Many months ago, I figured that such a solution would require a trampoline,
as it typically does in stack-based languages, but I was hoping to avoid
it. Well, no longer; let's just get on with it.
This intends to implement trampolining in a `ParseState` that serves as that
sum type, rather than introducing it as yet another feature to `Parser`; the
latter would provide a more convenient API, but it would continue to bloat
`Parser` itself. Right now, only the element parser generator will require
use of this, so if it's needed beyond that, then I'll debate whether it's
worth providing a better abstraction. For now, the intent will be to use
the `Context` to store a stack that it can pop off of to restore the
previous `ParseState` before delegation.
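
A rough sketch of that intent, assuming a much-simplified model: `Super`,
`Tok`, and this `Context` are hypothetical, and nested `<section>` elements
stand in for any recursive construct. The superstate is a flat enum, so no
variant contains another state and no boxing is needed; nesting is tracked
by the stack in the context instead.

```rust
/// Hypothetical superstate: a flat sum type over every stitched state.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Super {
    Root,
    Section { depth: usize },
    Done,
}

/// Hypothetical tokens for opening and closing a nested element.
#[derive(Debug, Clone, Copy)]
enum Tok {
    Open,
    Close,
}

/// Hypothetical context holding the explicit stack of suspended states.
#[derive(Debug, Default)]
struct Context {
    stack: Vec<Super>,
}

/// A single transition.  Delegation pushes the current state and moves
/// directly to the child; completion pops to restore the previous state.
fn step(state: Super, tok: Tok, ctx: &mut Context) -> Super {
    match (state, tok) {
        (Super::Root, Tok::Open) => {
            ctx.stack.push(state);
            Super::Section { depth: 1 }
        }
        (Super::Section { depth }, Tok::Open) => {
            ctx.stack.push(state);
            Super::Section { depth: depth + 1 }
        }
        (Super::Section { .. }, Tok::Close) => {
            ctx.stack.pop().unwrap_or(Super::Done)
        }
        (Super::Root, Tok::Close) | (Super::Done, _) => Super::Done,
    }
}

/// The trampoline: one flat loop, no recursion on the call stack.
/// Returns the final state and the deepest the explicit stack got.
fn parse(toks: &[Tok]) -> (Super, usize) {
    let mut ctx = Context::default();
    let mut state = Super::Root;
    let mut max_stack = 0;
    for &tok in toks {
        state = step(state, tok, &mut ctx);
        max_stack = max_stack.max(ctx.stack.len());
    }
    (state, max_stack)
}
```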
DEV-7145
Various DUMMY_SPAN-derived spans are used by many test cases, so this
finally extracts them---something I've been meaning to do for some time.
This also places DUMMY_SPAN behind a `cfg(test)` directive to ensure that it
is _only_ used in tests; UNKNOWN_SPAN should be used when a span is actually
unknown, which may also be the case during development.
DEV-7145
The trace outputs a note in the footer indicating _why_ it's being output,
so that the reader understands both where the potentially-unexpected
behavior originates and (in the case of the feature flag) how to inhibit
it.
That information originally lived in `Parser`, where the `cfg` directive to
enable it lives, but it was moved into the abstraction. This corrects that.
DEV-7145
This has gotten large and was cluttering `feed_tok`. This also provides the
ability to more easily expand into other types of tracing in the future.
DEV-7145
This information is likely redundant in a lowering pipeline, but is more
useful outside of such a pipeline. It's also more clear.
`Object` does not implement `Display`, though, because that's too burdensome
for how it's currently used. Many `Object`s are also `Token`s though and,
if fed to another `Parser` for lowering, it'll get `Display::fmt`'d.
DEV-7145
Rust was warning that `cfg` was unused if both `test` and
`parser-trace-stderr` were set. This both allows that and adjusts the
precedence to make more sense for tests.
DEV-7145
Because of recovery, the trace otherwise paints a really confusing-looking
picture when given unexpected input.
This is large enough now that it really ought to be extracted from
`feed_tok`, but I'll wait to see how this evolves further. I considered
adding color too, but it's not yet clear to me that the visual noise will be
all that helpful.
DEV-7145
This flag allows toggling the parser trace that was previously only
available to tests. Unfortunately, at the time of writing, Cargo cannot
enable flags in profiles, so I have to check for either `test` or this flag
being set to enable relevant features.
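
A small sketch of that either-or check, assuming the feature is named
`parser-trace-stderr` as it appears in the trace footer and is declared in
`Cargo.toml`. The helper names are hypothetical, and this uses the `cfg!`
macro for brevity; the real code may well use `#[cfg(...)]` attributes so
that the tracing code is not compiled at all when disabled.

```rust
/// Whether tracing was compiled in: either a test build or the feature.
fn trace_enabled() -> bool {
    cfg!(any(test, feature = "parser-trace-stderr"))
}

/// Write a trace line to stderr if enabled; reports whether it was written.
fn trace(msg: &str) -> bool {
    if trace_enabled() {
        eprintln!("[Parser::feed_tok] {msg}");
        true
    } else {
        false
    }
}
```

With the feature declared, something like
`cargo run --features parser-trace-stderr` would turn the trace on for a
non-test build.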
This trace is useful as I start to run the parser against existing code
written in TAME so that our existing systems can help to guide my
development. Unlike the current tests, it also allows seeing real-world
data as part of the lowering pipeline, where multiple `Parser`s are in
play.
Having this feature flag also makes this feature more easily discoverable to
those wishing to observe how the lowering pipeline works.
DEV-7145
This allows an element to be repeated by the parent NT. The easiest way I
saw to implement this for now was to abuse the Context to provide a runtime
configuration that would allow the state machine to reset after it has
completed parsing.
This also influences error recovery, in that if we're expecting zero or more
of something, we cannot provide an error for an unexpected name, and instead
must emit a dead state so that the caller can determine what to do.
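
Both effects can be sketched as follows, with the caveat that every name
here (`EleState`, `EleContext`, `Outcome`, `step`) is hypothetical and the
generated parsers are considerably more involved. A `repeat` flag in the
context decides whether a closed element resets the machine, and whether an
unexpected name is an error or a dead state.

```rust
/// Hypothetical states of a single-element parser.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EleState {
    Expecting,
    Open,
    Closed,
}

/// Hypothetical runtime configuration carried by the context.
#[derive(Debug, Clone, Copy)]
struct EleContext {
    repeat: bool,
}

#[derive(Debug, Clone, Copy)]
enum Tok {
    Open(&'static str),
    Close,
}

#[derive(Debug, PartialEq, Eq)]
enum Outcome {
    Next(EleState),
    /// No transition for this token; the caller decides what to do.
    Dead(EleState),
    Error(&'static str),
}

fn step(state: EleState, tok: Tok, ctx: EleContext, expected: &str) -> Outcome {
    match (state, tok) {
        (EleState::Expecting, Tok::Open(name)) if name == expected => {
            Outcome::Next(EleState::Open)
        }
        // Zero or more: an unexpected name may simply mean we are done, so
        // hand the decision to the caller rather than reporting an error.
        (EleState::Expecting, Tok::Open(_)) if ctx.repeat => {
            Outcome::Dead(EleState::Expecting)
        }
        (EleState::Expecting, Tok::Open(_)) => Outcome::Error("unexpected element"),
        // When repeating, reset after completion to accept another element.
        (EleState::Open, Tok::Close) => Outcome::Next(if ctx.repeat {
            EleState::Expecting
        } else {
            EleState::Closed
        }),
        (other, _) => Outcome::Dead(other),
    }
}
```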
DEV-7145
This produces useful parse traces that are output as part of a failing test
case. The parser generator macros can be a bit confusing to deal with when
things go wrong, so this helps to clarify matters.
This is _not_ intended to be machine-readable, but it does show that it
would be possible to generate machine-readable output to visualize the
entire lowering pipeline. Perhaps something for the future.
I left these inline in Parser::feed_tok because they help to elucidate what
is going on, just by reading what the trace would output---that is, it helps
to make the method more self-documenting, albeit a tad bit more
verbose. But with that said, it should probably be extracted at some point;
I don't want this to set a precedent where composition is feasible.
Here's an example from test cases:
[Parser::feed_tok] (input IR: XIRF)
| ==> Parser before tok is parsing attributes for `package`.
| | Attrs_(SutAttrsState_ { ___ctx: (QName(None, LocalPart(NCName(SymbolId(46 "package")))), OpenSpan(Span { len: 0, offset: 0, ctx: Context(SymbolId(1 "#!DUMMY")) }, 10)), ___done: false })
|
| ==> XIRF tok: `<unexpected>`
| | Open(QName(None, LocalPart(NCName(SymbolId(82 "unexpected")))), OpenSpan(Span { len: 0, offset: 1, ctx: Context(SymbolId(1 "#!DUMMY")) }, 10), Depth(1))
|
| ==> Parser after tok is expecting opening tag `<classify>`.
| | ChildA(Expecting_)
| | Lookahead: Some(Lookahead(Open(QName(None, LocalPart(NCName(SymbolId(82 "unexpected")))), OpenSpan(Span { len: 0, offset: 1, ctx: Context(SymbolId(1 "#!DUMMY")) }, 10), Depth(1))))
= note: this trace was output as a debugging aid because `cfg(test)`.
[Parser::feed_tok] (input IR: XIRF)
| ==> Parser before tok is expecting opening tag `<classify>`.
| | ChildA(Expecting_)
|
| ==> XIRF tok: `<unexpected>`
| | Open(QName(None, LocalPart(NCName(SymbolId(82 "unexpected")))), OpenSpan(Span { len: 0, offset: 1, ctx: Context(SymbolId(1 "#!DUMMY")) }, 10), Depth(1))
|
| ==> Parser after tok is attempting to recover by ignoring element with unexpected name `unexpected` (expected `classify`).
| | ChildA(RecoverEleIgnore_(QName(None, LocalPart(NCName(SymbolId(82 "unexpected")))), OpenSpan(Span { len: 0, offset: 1, ctx: Context(SymbolId(1 "#!DUMMY")) }, 10), Depth(1)))
| | Lookahead: None
= note: this trace was output as a debugging aid because `cfg(test)`.
DEV-7145
This adds the ability to bind identifiers to represent `OpenSpan` and
`CloseSpan`, available to the `@` and `/` maps. Since identifiers in TAME
originate from attributes, this may not get a whole lot of use, but it's
important to be available.
There is some awkwardness in that the opening span appears to be scoped to
the entire nonterminal, but it's actually only available in the `@`
mapping. I'll change this if it's actually needed; this keeps things simple
for now.
DEV-7145
Since the parsers produce streaming IRs, we need to be able to emit tokens
representing closing delimiters, where they are important.
This notably doesn't use spans; I'll add those next, since they're also
needed for the previous work.
DEV-7145
This begins generating parsers that are capable of parsing elements. I need
to move on, so this abstraction isn't going to go as far as it could, but
let's see where it takes me.
This was the work that required the recent lookahead changes, which have
been detailed in previous commits.
This initial support is basic, but robust. It supports parsing elements
with attributes and children, but it does not yet support the equivalent of
the Kleene star (`*`). Such support will likely be added by supporting
parsers that are able to recurse on their own definition in tail position,
which will also require supporting parsers that do not add to the stack.
This generates parsers that, like all the other parsers, use enums to
provide a typed stack. Stitched parsers produce a nested stack that is
always bounded in size. Fortunately, expressions---which can nest
deeply---do not need to maintain ancestor context on the stack, and so this
should work fine; we can get away with this because XIRF ensures proper
nesting for us. Statements that _do_ need to maintain such context are not
nested.
This also does not yet support emitting an object on closing tag, which
will be necessary for NIR, which will be a streaming IR that is "near" to
the source XML in structure. This will then be used to lower into AIR for
the ASG, which gives structure needed for further analysis.
More information to come; I just want to get this committed to serve as a
mental synchronization point and clear my head, since I've been sitting on
these changes for so long and have to keep stashing them as I tumble down
rabbit holes covered in yak hair.
DEV-7145
Having the lookahead token generic over the `ParseState` was a pain in the
ass for stitching, since they shared the same token type but not the same
parser. I don't expect there to be any need to be able to infer other
parser-related types for a token of lookahead, so I'd rather just make my
life easier until such a thing is needed.
DEV-7145
Oh what a tortured journey. I had originally tried to avoid formalizing
lookahead for all parsers by pretending that it was only needed for dead
state transitions (that is---states that have no transitions for a given
input token), but then I needed to yield information for aggregation. So I
added the ability to override the token for `Dead` to yield that, in
addition to the token. But then I also needed to yield lookahead for error
conditions. It was a mess that didn't make sense.
This eliminates `ParseStatus::Dead` entirely and fully integrates the
lookahead token in `Parser` that was previously implemented.
Notably, the lookahead token is encapsulated in `TransitionResult` and
unavailable to `ParseState` implementations, forcing them to rely on
`Parser` for recursion. This not only prevents `ParseState` from recursing,
but also simplifies delegation by removing the need to manually handle
tokens of lookahead.
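
A simplified sketch of the idea, not the real API: `Transition` stands in
for `TransitionResult`, `parse` for the loop owned by `Parser`, and the
token and state types are invented for illustration. The state machine can
hand a token back, but only the driver ever feeds it in again, so the state
itself never recurses.

```rust
/// Hypothetical input tokens.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Tok {
    Attr(u32),
    Open,
}

/// Hypothetical parse status: nothing to emit yet, or an emitted object.
#[derive(Debug)]
enum Status {
    Incomplete,
    Object(String),
}

/// Hypothetical states: aggregate attributes, then parse children.
#[derive(Debug)]
enum State {
    Attrs(Vec<u32>),
    Children,
}

/// Stand-in for `TransitionResult`; only the driver reads `lookahead`.
struct Transition {
    state: State,
    status: Status,
    lookahead: Option<Tok>,
}

fn transition(state: State, tok: Tok) -> Transition {
    match (state, tok) {
        (State::Attrs(mut acc), Tok::Attr(v)) => {
            acc.push(v);
            Transition { state: State::Attrs(acc), status: Status::Incomplete, lookahead: None }
        }
        // Aggregation is only known to be complete once a non-attribute
        // token arrives: emit the aggregate and hand the token back.
        (State::Attrs(acc), tok) => Transition {
            state: State::Children,
            status: Status::Object(format!("attrs {:?}", acc)),
            lookahead: Some(tok),
        },
        (State::Children, Tok::Open) => Transition {
            state: State::Children,
            status: Status::Object("child".to_string()),
            lookahead: None,
        },
        // Ignored in this sketch; a real parser would report an error.
        (State::Children, Tok::Attr(_)) => Transition {
            state: State::Children,
            status: Status::Incomplete,
            lookahead: None,
        },
    }
}

/// The driver owns the lookahead and prefers it over fresh input.
fn parse(toks: &[Tok]) -> Vec<String> {
    let mut state = State::Attrs(Vec::new());
    let mut lookahead: Option<Tok> = None;
    let mut input = toks.iter().copied();
    let mut out = Vec::new();
    while let Some(tok) = lookahead.take().or_else(|| input.next()) {
        let t = transition(state, tok);
        state = t.state;
        lookahead = t.lookahead;
        if let Status::Object(obj) = t.status {
            out.push(obj);
        }
    }
    out
}
```

The first `Open` token is seen twice by `transition`: once to end attribute
aggregation, and once more as lookahead to be parsed as a child.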
The awkward case here is XIRT, which does not follow the streaming parsing
convention, because it was conceived before the parsing framework. It needs
to go away, but doing so right now would be a lot of work, so it has to
stick around for a little bit longer until the new parser generators can be
used instead. It is a persistent thorn in my side, going against the grain.
`Parser` will immediately recurse if it sees a token of lookahead with an
incomplete parse. This is because stitched parsers will frequently yield a
dead state indication when they're done parsing, and there's no use in
propagating an `Incomplete` status down the entire lowering pipeline. But,
that does mean that the toplevel is not the only thing recursing. _But_,
the behavior doesn't really change, in the sense that it would infinitely
recurse down the entire lowering stack (though there'd be an opportunity to
detect that). This should never happen with a correct parser, but it's not
worth the effort right now to try to force such a thing with Rust's type
system. Something like TLA+ is better suited here as an aid, but it
shouldn't be necessary with clear implementations and proper test
cases. Parser generators will also ensure such a thing cannot occur.
I had hoped to remove `ParseStatus` entirely in favor of `Parsed`, but
there's a lot of type inference that happens based on the fact that
`ParseStatus` has a `ParseState` type parameter; `Parsed` has only
`Object`. It is desirable
for a public-facing `Parsed` to not be tied to `ParseState`, since consumers
need not be concerned with such a heavy type; however, we _do_ want that
heavy type internally, as it carries a lot of useful information that allows
for significant and powerful type inference, which in turn creates
expressive and convenient APIs.
DEV-7145
*NB: This is the initial change to introduce the token of lookahead, but this
does not fully integrate it. In particular, this is missing from the
stitching/delegation layer.*
This has been a long time coming, I suppose, though I had tried to avoid it
with `Parser::delegate_lookahead`. But the problem with doing that is that
it forced the `ParseState` to recurse, which both violates my requirement
that there be no looping constructs except at the toplevel, and performs
additional stack allocation as it is not in tail position.
The final straw was having to both return an error _and_ an aggregate object
for the attribute parser when an unexpected element is encountered (this
code is not yet committed). One option was to add a recovery object to the
error object, and formalize that, but then we have other concerns; for
example, what if that recovery object triggered an error? We'd have to mask
either the old or the new error. But we wouldn't want to mask either,
because the object causing the error would be the aggregate attributes,
which is _not_ a recovery object, but actual data we want to emit. And so
it's a kluge right off of the bat.
The use of a token of lookahead is a more traditional approach and has uses
outside of just this one scenario. It'll also allow for the removal of
recursion from the existing `ParseState`s, and possibly the elimination of
dead state associated data, though I may end up leaving that; more to come.
Rust will also optimize away lookahead storage and processing in Parsers
that do not utilize it.
DEV-7145
There was only one test outside of the `parse` module using these
fields. The next commit will be introducing lookahead, and I do not want to
have to trust callers to ensure invariants are met.
DEV-7145
This is the first parser generator for the parsing framework. I've been
waiting quite a while to do this because I wanted to be sure that I
understood how I intended to write the attribute parsers manually. Now that
I'm about to start parsing source XML files, it is necessary to have a
parser generator.
Typically one thinks of a parser generator as a separate program that
generates code for some language, but that is not always the case---that
represents a lack of expressiveness in the language itself (e.g. C). Here,
I simply use Rust's macro system, which should be a concept familiar to
someone coming from a language like Lisp.
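
To give a flavor of what generating a parser with `macro_rules!` looks
like, here is a toy example. It is an assumption-laden sketch: the macro
name `attr_parse!`, its syntax, and the `feed` method are all invented here
and bear no relation to the real generator's interface.

```rust
/// Toy generator: expands a struct description into both the aggregate
/// type and the code that fills it from attribute name/value pairs.
macro_rules! attr_parse {
    (struct $name:ident { $($field:ident: $attr:literal),* $(,)? }) => {
        #[derive(Debug, Default, PartialEq, Eq)]
        struct $name {
            $($field: Option<String>),*
        }

        impl $name {
            /// Feed one attribute.  Returns `false` for a name that this
            /// parser has no transition for.
            fn feed(&mut self, name: &str, value: &str) -> bool {
                match name {
                    $($attr => {
                        self.$field = Some(value.to_string());
                        true
                    })*
                    _ => false,
                }
            }
        }
    };
}

// Hypothetical attribute set for a `<package>` element.
attr_parse! {
    struct PackageAttrs {
        name: "name",
        title: "title",
    }
}
```

Everything is expanded at compile time into ordinary Rust, which is why
the result can be optimized just like hand-written code.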
This also resolves where I stand on parser combinators with respect to this
abstraction: they both accomplish the exact same thing (composition of
smaller parsers), but this abstraction doesn't do so in the typical
functional way. But the end result is the same.
The parser generated by this abstraction will be optimized and inlined in the
same manner as the hand-written parsers. Since they'll be tightly coupled
with an element parser (which too will have a parser generator), I expect
that most attribute parsers will simply be inlined; they exist as separate
parsers conceptually, for the same reason that you'd use parser combinators.
It's worth mentioning that this awkward reliance on dead state for a
lookahead token to determine when aggregation is complete rubs me the wrong
way, but resolving it would involve reintroducing the XIR AttrEnd that I had
previously removed. I'll keep fighting with myself on this, but I want to
get a bit further before I determine if it's worth the tradeoff of
reintroducing (more complex IR but simplified parsing).
DEV-7145
`ParseState` originally required `Default` for use with `mem::take` in
`Parser::feed_tok`. This unfortunately cannot last, since more specialized
parsers require context during initialization in order to provide useful
diagnostic information. (The other option is to require the caller to
augment errors with diagnostic information, but that would have to be
duplicated by every caller and complicates parser composition; I'd prefer
those diagnostic details remain encapsulated.)
Replacing `Default` with `Option` is uglier, but it ends up producing the
same assembly as `mem::take` did, at least at the time of writing. Because
Rust is able to elide unnecessary moves using this implementation, there is
no need for `unwrap_unchecked` or other unsafe methods, which is great,
since it shows that this parsing methodology is viable entirely in safe
Rust.
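
The shape of that change can be sketched as follows, assuming hypothetical
`EleState` and `Parser` types far simpler than the real ones. The state
needs a name at construction, so it has no sensible `Default`; wrapping it
in `Option` lets `feed_tok` take ownership with `Option::take` instead.

```rust
/// Hypothetical state that cannot implement `Default`: it needs the element
/// name up front so that it can produce useful diagnostics.
#[derive(Debug, PartialEq, Eq)]
enum EleState {
    Expecting(String),
    Done(String),
}

struct Parser {
    /// `None` only momentarily, while a transition is in progress.
    state: Option<EleState>,
}

impl Parser {
    fn new(name: &str) -> Self {
        Parser { state: Some(EleState::Expecting(name.to_string())) }
    }

    fn feed_tok(&mut self, tok: &str) {
        // Take ownership of the state without requiring `Default`.
        let state = self
            .state
            .take()
            .expect("state must be present between calls");

        let next = match state {
            EleState::Expecting(name) if tok == name.as_str() => EleState::Done(name),
            other => other,
        };

        self.state = Some(next);
    }
}
```

No `unsafe` is involved; the `expect` documents the invariant that the
state is always restored before `feed_tok` returns.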
DEV-7145
Previously, `ParseStatus::Dead` always yielded
`ParseState::Token`. However, I'm working on introducing parsers that
aggregate (parsing XML attributes into structs), and those parsers do not
know that they have completed aggregation until they reach a dead state;
given that, I need to yield additional information at that time.
I played around with a number of alternative ideas, but this ended up being
the cleanest, relative to the effort involved. For example, introducing
another parameter to `ParseStatus::Dead` was too burdensome on APIs that
ought not concern themselves with the possibility of receiving an object in
addition to a lookahead token, since many parsers are not capable of doing
so (given that they map M:(N<=M)).
Another option that I abandoned fairly quickly was having
`is_accepting` (potentially renamed) return an aggregate object, since
that's on the side and didn't feel like it was part of the parsing pipeline.
The intent is to abstract this some in a new `ParseState` method for
delegation + aggregation.
DEV-7145
This allows `XmlXirReader` to be used in a `Lower` operation, just as
everything else, bringing me one step closer to a pipeline that can be
concisely represented; this is finally beginning to unify in a clear way,
though it is still a bit of a mess.
This causes `XmlXirReader` to _act_ like a `parse::Parser` in that it yields
a `ParsedResult`, but it does not use `parse::Parser` itself; that was the
_original_ plan: convert it into a `ParseState` where `XmlXirReader` became
a context, and force `Parser` to yield by feeding it a stream of tokens with
`repeat`, but that ended up performing poorly relative to this change. I
did some investigation, which I might write about in the future, but for
now, this solution works just fine.
DEV-7145
This abstraction has grown quite a bit, and it's time to start formalizing
it a bit. This split doesn't change any behavior, but it does start to make
it easier to reason about by clearly stating the broad components and how
they interact with one another.
This doesn't yet move the tests; those will come next, but they are very
few. The reasons I gave previously for having so few were that (a) they're
tested indirectly via the systems that utilize them and (b) the abstraction
was not yet settled and the process was already very expensive. No test
coverage was lost---it's only that failures were potentially harder to
debug, but in practice not even this was true, because the deeply
expressive types all but ensured that, if it compiles, it will function in a
way that is expected. Unit tests and documentation for this system will be
added once I'm sure that this abstraction is in a proper state.
DEV-7145