Commit Graph

21 Commits (bef68e16340ab5e6abdcf2807e535771d8e98436)

Author SHA1 Message Date
Mike Gerwitz 954b5a2795 Copyright year and name update
Ryan Specialty Group (RSG) rebranded to Ryan Specialty after its IPO.
2023-01-20 23:37:30 -05:00
Mike Gerwitz ed8a2ce28a tamer: xir::parse::ele: Superstate not to accept early EOF
This was accepting an early EOF when the active child `ParseState` was in an
accepting state, because it was not ensuring that anything on the stack was
also accepting.

Ideally, there should be nothing on the stack, and hopefully in the future
that's what happens.  But with how things are today, it's important that, if
anything is on the stack, it is accepting.

Since `is_accepting` on the superstate is only called during finalization,
and because the check terminates early, and because the stack practically
speaking will only have a couple things on it max (unless we're in tail
position in a deeply nested tree, without TCO [yet]), this shouldn't be an
expensive check.

Implementing this did require that we expose `Context` to `is_accepting`,
which I had hoped to avoid having to do, but here we are.
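
A minimal sketch of the check described above.  The names `State`, `Context`, and `superstate_is_accepting` are hypothetical stand-ins for illustration, not TAMER's actual types: the superstate accepts only if the active child is accepting and every state on the stack is accepting as well.

```rust
/// Hypothetical child parser state; only `Done` is accepting.
#[derive(Debug, Clone, Copy, PartialEq)]
enum State {
    Expecting,
    Done,
}

impl State {
    fn is_accepting(&self) -> bool {
        matches!(self, State::Done)
    }
}

/// Hypothetical context holding the stack of suspended parent states.
struct Context {
    stack: Vec<State>,
}

/// Accept EOF only if the active state *and* everything on the stack
/// are accepting.  `all` stops at the first non-accepting state.
fn superstate_is_accepting(active: State, ctx: &Context) -> bool {
    active.is_accepting() && ctx.stack.iter().all(State::is_accepting)
}
```

Because the stack lives in the context in this sketch, the check cannot be written without passing the context in, which is the trade-off the paragraph above mentions.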

DEV-7145
2022-08-12 00:47:15 -04:00
Mike Gerwitz 77efefe680 tamer: xir::attr::parse: Better parser state descriptions
The attribute name was neither quoted nor `@`-prefixed.  (I noticed this in
the traces.)

DEV-7145
2022-08-01 15:01:37 -04:00
Mike Gerwitz 8f3301431c tamer: span::dummy: New module to hold DUMMY_SPAN and derivatives
Various DUMMY_SPAN-derived spans are used by many test cases, so this
finally extracts them---something I've been meaning to do for some time.

This also places DUMMY_SPAN behind a `cfg(test)` directive to ensure that it
is _only_ used in tests; UNKNOWN_SPAN should be used when a span is actually
unknown, which may also be the case during development.
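
A rough sketch of the gating described above.  The `Span` layout here is invented for illustration, and the constants are placed at the top level rather than in a separate module to keep the example short:

```rust
/// Hypothetical span: a byte offset and length within some source.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Span {
    pub offset: u32,
    pub len: u16,
}

/// For spans that are genuinely unknown; available in every build.
pub const UNKNOWN_SPAN: Span = Span { offset: 0, len: 0 };

/// Compiled only for test builds, so any non-test code that refers to
/// `DUMMY_SPAN` fails to compile.
#[cfg(test)]
pub const DUMMY_SPAN: Span = Span { offset: 0, len: 0 };

/// A derived span that is distinct from `DUMMY_SPAN`, for assertions
/// that need to tell two spans apart.
#[cfg(test)]
pub const S1: Span = Span { offset: 1, len: 1 };
```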

DEV-7145
2022-08-01 15:01:37 -04:00
Mike Gerwitz bd783ac08b tamer: Replace ParseStatus::Dead with generic lookahead
Oh what a tortured journey.  I had originally tried to avoid formalizing
lookahead for all parsers by pretending that it was only needed for dead
state transitions (that is---states that have no transitions for a given
input token), but then I needed to yield information for aggregation.  So I
added the ability to override the token for `Dead` to yield that, in
addition to the token.  But then I also needed to yield lookahead for error
conditions.  It was a mess that didn't make sense.

This eliminates `ParseStatus::Dead` entirely and fully integrates the
lookahead token in `Parser` that was previously implemented.

Notably, the lookahead token is encapsulated in `TransitionResult` and
unavailable to `ParseState` implementations, forcing them to rely on
`Parser` for recursion.  This not only prevents `ParseState` from recursing,
but also simplifies delegation by removing the need to manually handle
tokens of lookahead.
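
The arrangement might be sketched roughly as follows.  This is a toy under stated assumptions: `Sum`, the field names, and the `Iterator`-based `Parser` are all hypothetical, and only the division of labor mirrors the description above.  The state hands back a token it cannot consume, and the parser alone decides when to feed it again.

```rust
/// Hypothetical result of one state transition.  In this sketch only
/// `Parser` reads `lookahead`; a real framework could enforce that by
/// keeping the field private to its module.
struct TransitionResult {
    state: Sum,
    output: Option<u32>,
    lookahead: Option<char>,
}

/// Toy state that sums runs of digits; any other character ends a run.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Sum {
    Idle,
    Acc(u32),
}

impl Sum {
    fn parse_token(self, tok: char) -> TransitionResult {
        let (state, output, lookahead) = match (self, tok.to_digit(10)) {
            (Sum::Idle, Some(d)) => (Sum::Acc(d), None, None),
            (Sum::Acc(n), Some(d)) => (Sum::Acc(n + d), None, None),
            // No transition for this token: yield the sum and hand the
            // token back as lookahead instead of consuming it.
            (Sum::Acc(n), None) => (Sum::Idle, Some(n), Some(tok)),
            // Separators between runs are skipped.
            (Sum::Idle, None) => (Sum::Idle, None, None),
        };
        TransitionResult { state, output, lookahead }
    }
}

struct Parser<I: Iterator<Item = char>> {
    state: Sum,
    toks: I,
    lookahead: Option<char>,
}

impl<I: Iterator<Item = char>> Parser<I> {
    fn new(toks: I) -> Self {
        Parser { state: Sum::Idle, toks, lookahead: None }
    }
}

impl<I: Iterator<Item = char>> Iterator for Parser<I> {
    type Item = u32;

    fn next(&mut self) -> Option<u32> {
        loop {
            // A pending lookahead token takes priority over the stream.
            let tok = match self.lookahead.take().or_else(|| self.toks.next()) {
                Some(tok) => tok,
                // End of input: flush a run that is still in progress.
                None => {
                    return match std::mem::replace(&mut self.state, Sum::Idle) {
                        Sum::Acc(n) => Some(n),
                        Sum::Idle => None,
                    }
                }
            };

            let TransitionResult { state, output, lookahead } =
                self.state.parse_token(tok);
            self.state = state;
            self.lookahead = lookahead;

            if let Some(sum) = output {
                return Some(sum);
            }
        }
    }
}
```

Collecting `Parser::new("12,345;6".chars())` gives one sum per run of digits.  The state never sees or stores a pending token itself, so it has no way to recurse on it.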

The awkward case here is XIRT, which does not follow the streaming parsing
convention, because it was conceived before the parsing framework.  It needs
to go away, but doing so right now would be a lot of work, so it has to
stick around for a little bit longer until the new parser generators can be
used instead.  It is a persistent thorn in my side, going against the grain.

`Parser` will immediately recurse if it sees a token of lookahead with an
incomplete parse.  This is because stitched parsers will frequently yield a
dead state indication when they're done parsing, and there's no use in
propagating an `Incomplete` status down the entire lowering pipeline.  But,
that does mean that the toplevel is not the only thing recursing.  _But_,
the behavior doesn't really change, in the sense that it would infinitely
recurse down the entire lowering stack (though there'd be an opportunity to
detect that).  This should never happen with a correct parser, but it's not
worth the effort right now to try to force such a thing with Rust's type
system.  Something like TLA+ is better suited here as an aid, but it
shouldn't be necessary with clear implementations and proper test
cases.  Parser generators will also ensure such a thing cannot occur.

I had hoped to remove ParseStatus entirely in favor of Parsed, but there's a
lot of type inference that happens based on the fact that `ParseStatus` has
a `ParseState` type parameter; `Parsed` has only `Object`.  It is desirable
for a public-facing `Parsed` to not be tied to `ParseState`, since consumers
need not be concerned with such a heavy type; however, we _do_ want that
heavy type internally, as it carries a lot of useful information that allows
for significant and powerful type inference, which in turn creates
expressive and convenient APIs.

DEV-7145
2022-07-12 00:11:45 -04:00
Mike Gerwitz 40c68d3e1e tamer: parse::state::TransitionResult: Make opaque
There was only one test outside of the `parse` module using these
fields.  The next commit will be introducing lookahead, and I do not want to
have to trust callers to ensure invariants are met.

DEV-7145
2022-07-05 14:12:06 -04:00
Mike Gerwitz a16a0d9138 Revert "tamer: xir: Initial re-introduction of AttrEnd"
This reverts commit b973d36862.

Alright, I'm getting sick of fighting with myself on this.  But rather than
just removing the last commit, I'm going to keep it around, so that my
thoughts are clearly documented for my future quarrels with myself.

Firstly: this added more overhead than I wanted it to.  While it wasn't
significant, it did add 100--150ms to one of our largest systems, up from
~2.8s, which seems a bit much for a token that's really just meant to make
life easier for the parser.

Further, it seems that all I've managed to do is push my original problem to
a different layer---this started as a means to resolve having to emit both
an object and an error simultaneously in the case where aggregate attribute
parsing has completed, but we encounter an error on the next token (e.g. an
unexpected element).  But XIRF, if it's missing AttrEnd, should throw an
error, but should also recover.  Recovery is easy---just assume that it was
present---_but then we don't emit a XIRF `AttrEnd` token_, which is
necessary for downstream systems.  So we'd need to either:

  (a) emit both a token and an error; or
  (b) panic.

But if we're doing (a), then the need for `AttrEnd` goes away, because it
solves the original problem (though the other concerns of the previous
commit still stand).  (b) is not ideal at all, even though the missing token
does represent an internal system error; it's not something the user can
correct.  But, given that it's something that the user cannot correct,
doesn't that imply that it's an awkward thing to include in the token
stream?  So back to `AttrEnd` being an awkward PITA to have.

So, given (a), I'll just do that: errors will become more of a "hey, this
error just occurred, but I'm trying to recover---here's an object that you
should use if you choose to continue parsing, but it may or may not be what
you're looking for; proceed with caution".  That flips the original script:
I imagined having external systems feed recovery tokens, but this
encapsulates recovery within the parser, which really is more appropriate,
though less flexible than having an omniscient external recovery system;
such a monolith was always an awkward concept and would be difficult to
implement cleanly.

This can also potentially be implemented as a generalization of the Dead
state change that allowed an object to be emitted alongside the
lookahead/error.

Anyway, back to where I was...I'm sure I'll look back on this in the future
shaking my head, reflecting on how naive I was.

DEV-7145
2022-06-29 11:25:44 -04:00
Mike Gerwitz b973d36862 tamer: xir: Initial re-introduction of AttrEnd
AttrEnd was initially removed in
0cc0bc9d5a (and the commit prior), because
there was not a compelling reason to use it over a lookahead
operation (returning a token via a dead state transition); `AttrEnd`
simply introduced inconsistencies between the XIR reader (which produced
AttrEnd) and internal XIR stream generators (e.g. the lowering operations
into XIR->XML, which do not).

But now that parsers are performing aggregation---in particular the
attribute parser-generator `xir::parse::attr`---this has become quite a
pain, because the dead state is an actionable token.  For example:

  1. Open
  2. Attr
  3. Attr
  4. Open
  5. ...

In the happy case, token #4 results in `Parsed::Incomplete`, and so can just
be transformed into the object representing the aggregated attributes.  But
even in this happy path, it's ugly, and it requires non-tail recursion on
the parser which requires a duplicate stack allocation for the
`ParserState`.  That violates a core principle of the system.

But if there is an error at #4---e.g. an unexpected element---then we no
longer have a `Parsed::Incomplete` to hijack for our own uses, and we'd have
to introduce the ability to return both an error and a token, or we'd have
to introduce the ability to keep a token of lookahead instead of reading
from the underlying token stream, but that's complicated with push parsers,
which are used for parser composition.  Yikes.

And furthermore, the aggregation has caused me to introduce the ability to
override the dead state type to introduce both a token of lookahead and
aggregation information.  This complicates the system and is going to be
confusing to others.

Given all of this, AttrEnd does now seem appropriate to reintroduce, since
it will allow processing of aggregate operations when encountering that
token without having to worry about the above scenario; without having to
duplicate a `ParseState` stack; without having to hijack dead state
transitions for producing our aggregate object; and everything else
mentioned above.

This commit does not modify those abstractions to use AttrEnd yet; it
re-introduces the token to the core system, not the parser-generators, and
it doesn't yet replace lookahead operations in the parsers that use
them.  That'll come next.  Unlike the commit that removed it, though, we are
now generating proper spans, so make note of that here.  This also does not
introduce the concept to XIRF yet, which did not exist at the time that it
was removed, so XIRF is filtering it out until a following commit.

DEV-7145
2022-06-29 11:02:02 -04:00
Mike Gerwitz c671bf6a9c tamer: xir: Introduce {Ele,Open,Close}Span
This isn't conceptually all that significant of a change, but there was a lot
to modify to get it working.  I would generally separate this into a commit
for the implementation and another commit for the integration, but I decided
to keep things together.

This serves a role similar to AttrSpan---this allows deriving a span
representing the element name from a span representing the entire XIR
token.  This will provide more useful context for errors---including the tag
delimiter(s) means that we care about the fact that an element is in that
position (as opposed to some other type of node) within the context of an
error.  However, if we are expecting an element but take issue with the
element name itself, we want to place emphasis on that instead.

This also starts to consider the issue of span contexts---a blob of detached
data that is `Span` is useful for error context, but it's not useful for
manipulation or deriving additional information.  For that, we need to
encode additional context, and this is an attempt at that.

I am interested in the concept of providing Spans that are guaranteed to
actually make sense---that are instantiated and manipulated with APIs that
ensure consistency.  But such a thing buys us very little, practically
speaking, over what I have now for TAMER, and so I don't expect to actually
implement that for this project; I'll leave that for a personal
project.  TAMER's already taken a lot of my personal interests and it can
cause me a lot of grief sometimes (with regards to letting my aspirations
cause me more work).

DEV-7145
2022-06-24 14:16:29 -04:00
Mike Gerwitz eafb3b2a1b tamer: Add Display impl for each ParseState for generic ParseErrors
This is intended to describe, to the user, the state that the parser is
in.  This will be used to convey additional information for general parser
errors, but it should also probably be integrated into parsers' individual
errors as well when appropriate.
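
As an illustration only, a `Display` impl on a state and a generic error message built from it might look like the following.  `AttrState`, its wording, and `unexpected_token` are assumptions made for this sketch rather than TAMER's actual definitions.

```rust
use std::fmt::{self, Display};

/// Hypothetical attribute parser state.
enum AttrState {
    Empty,
    Name(&'static str),
}

impl Display for AttrState {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::Empty => write!(f, "expecting an attribute"),
            Self::Name(name) => {
                write!(f, "expecting a value for attribute `@{}`", name)
            }
        }
    }
}

/// A generic error message that only needs the state to be `Display`,
/// so it works for any parser without knowing its concrete type.
fn unexpected_token(tok: &str, state: &impl Display) -> String {
    format!("unexpected {} while {}", tok, state)
}
```

Since the message is assembled from the state's own description, the reader of a lowering error can tell which parser produced it.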

This is something I expected to add at some point, but I wanted to add them
now because, when dealing with lowering errors, it can be difficult to tell
what parser the error originated from.

DEV-11864
2022-05-25 15:26:02 -04:00
Mike Gerwitz 1ad2fb1dc8 Copyright year update 2022
RSG (Ryan Specialty Group) recently announced a rename to Ryan Specialty (no
"Group"), but I'm not sure if the legal name has been changed yet or not, so
I'll wait on that.
2022-05-03 14:14:29 -04:00
Mike Gerwitz eaa8133d21 tamer: diagnose: Introduction of diagnostic system
This is a working concept that will continue to evolve.  I wanted to start
with some basic output before getting too carried away, since there's a lot
of potential here.

This is heavily influenced by Rust's helpful diagnostic messages, but will
take some time to realize a lot of the things that Rust does.  The next step
will be to resolve line and column numbers, and then possibly include
snippets and underline spans, placing the labels alongside them.  I need to
balance this work with everything else I have going on.

This is a large commit, but it converts the existing Error Display impls
into Diagnostic.  This separation is a bit verbose, so I'll see how this
ends up evolving.

Diagnostics are tied to Error at the moment, but I imagine in the future
that any object would be able to describe itself, error or not, which would
be useful in the future both for the Summary Page and for query
functionality, to help developers understand the systems they are writing
using TAME.

Output is integrated into tameld only in this commit; I'll add tamec
next.  Examples of what this outputs are available in the test cases in this
commit.

DEV-10935
2022-04-13 15:22:46 -04:00
Mike Gerwitz e77bdaf19a tamer: parse: Introduce mutable Context
This resolves the performance issues caused by Rust's failure to elide the
ElementStack (ArrayVec) memcpys on move.

Since XIRF is invoked tens of millions of times in some cases for larger
systems, prior to this change, failure to optimize away moves for XIRF
resulted in tens of millions of memcpys.  This resulted in linking of one
program going from 1s -> ~15s.  This change reduces it to ~2.5s with the
wip-xmlo-xir-reader flag on, with the extra time coming from elsewhere (the
subject of future changes).

In particular, this change introduces a new mutable reference to
`ParseState::parse_token`, which is a reference to a `Context` owned by the
caller (e.g. `Parser`).  In the case of XIRF, this means that
`Parser<flat::State, _>` will own the `ElementStack`/`ArrayVec` instead of
`flat::State`; this allows the latter to remain pure and benefit from Rust's
move optimizations, without sacrificing the otherwise-pure implementation.

ParseStates that do not need a mutable context can use `NoContext` and
remain pure.
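
A simplified sketch of the shape described above.  The trait below is a hypothetical reduction (no output objects or errors), `Depth` and `Count` are invented states, and a plain `Vec` stands in for the `ElementStack`; the point is only that the state is moved by value while the mutable data stays with the caller.

```rust
/// Hypothetical reduction of the trait: the state is passed by value,
/// while mutable data lives in a context owned by the caller.
trait ParseState: Sized {
    type Token;
    type Context: Default;

    fn parse_token(self, tok: Self::Token, ctx: &mut Self::Context) -> Self;
}

/// Context for states that need no mutable data.
#[derive(Default)]
struct NoContext;

enum Tok {
    Open(&'static str),
    Close(&'static str),
}

/// Small state that is cheap to move; the stack is not part of it.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Depth {
    Balanced,
    Open,
    Invalid,
}

impl ParseState for Depth {
    type Token = Tok;
    type Context = Vec<&'static str>; // stands in for an element stack

    fn parse_token(self, tok: Tok, stack: &mut Self::Context) -> Self {
        match (self, tok) {
            (Depth::Invalid, _) => Depth::Invalid,
            (_, Tok::Open(name)) => {
                stack.push(name);
                Depth::Open
            }
            (_, Tok::Close(name)) => match stack.pop() {
                Some(open) if open == name && stack.is_empty() => Depth::Balanced,
                Some(open) if open == name => Depth::Open,
                _ => Depth::Invalid,
            },
        }
    }
}

/// Counts tokens without needing any mutable context.
struct Count(usize);

impl ParseState for Count {
    type Token = Tok;
    type Context = NoContext;

    fn parse_token(self, _tok: Tok, _ctx: &mut NoContext) -> Self {
        Count(self.0 + 1)
    }
}

/// The caller (playing the role of `Parser`) owns the context and lends
/// it to each transition.
fn run<S: ParseState>(
    init: S,
    toks: impl IntoIterator<Item = S::Token>,
) -> (S, S::Context) {
    let mut ctx: S::Context = Default::default();
    let state = toks
        .into_iter()
        .fold(init, |st, tok| st.parse_token(tok, &mut ctx));
    (state, ctx)
}
```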

DEV-12024
2022-04-05 15:50:53 -04:00
Mike Gerwitz f402e51d04 tamer: parse: More flexible Transition API
This does some cleanup and adds `parse::Object` for use in disambiguating
`From` for `ParseStatus`, allowing the `Transition` API to be much more
flexible in the data it accepts and automatically converts.  This allows us
to concisely provide raw output data to be wrapped, or provide `ParseStatus`
directly when more convenient.

There aren't yet examples in the docs; I'll add them once I make sure this API
is actually utilized as intended.

DEV-10863
2022-03-25 16:45:32 -04:00
Mike Gerwitz 279ddc79d7 tamer: parse::TransitionResult: Alias=>newtype
This converts the tuple type alias into a newtype, so that we may provide
our own implementations.

This differs from a previous approach that I took, which involved making
this type `Result<(S, T), (S, E)>` so that the return values composed well
with other functions.  But the reality is that this is used only by other
`ParseState`s and `Parser`, so it's unnecessary.
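
To illustrate the difference, here is a sketch with an assumed tuple shape; the real alias may differ, and `into_result` is a hypothetical helper.  It uses only stable Rust and does not touch the `Try` machinery.

```rust
// Before (assumed shape): a plain alias.  A tuple is a foreign type, so
// no inherent methods can be attached to it.
//
//   type TransitionResult<S, T, E> = (S, Result<T, E>);

/// After: a newtype wrapping the same tuple, which can carry its own
/// methods and trait implementations.
struct TransitionResult<S, T, E>(S, Result<T, E>);

impl<S, T, E> TransitionResult<S, T, E> {
    /// Convert into the `Result<(S, T), (S, E)>` shape when composition
    /// with other `Result`-returning functions is wanted.
    fn into_result(self) -> Result<(S, T), (S, E)> {
        let TransitionResult(state, result) = self;
        match result {
            Ok(obj) => Ok((state, obj)),
            Err(e) => Err((state, e)),
        }
    }
}
```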

However, this is also an attempt to utilize the new Try and FromResidual
traits; note how the Try associated types match precisely what I was trying
to do before, though they're used as intermediate types.  I'll see how this
evolves.

DEV-10863
2022-03-25 12:28:50 -04:00
Mike Gerwitz 2e98a69d15 Revert "tamer: parse::TransitionResult: Move common Transition into Result"
This reverts commit bf5da75096.
2022-03-25 09:17:25 -04:00
Mike Gerwitz bf5da75096 tamer: parse::TransitionResult: Move common Transition into Result
This allows the Results to compose and, importantly, is compatible with
`?` without having to put in any extra effort.

This puts the caller in an awkward spot, so I introduced a utility
function `result_tup0_invert` for now; we'll see if that stays or evolves
differently.

DEV-10863
2022-03-24 23:48:30 -04:00
Mike Gerwitz ceb00c4df5 tamer: xir: Complete parse type migration
A previous commit moved the parser.  This updates the types so that they can
actually be utilized in that context.

DEV-10863
2022-03-21 15:50:43 -04:00
Mike Gerwitz 14638a612f tamer: {xir::=>}parse: Move parser out of XIR
The parsing framework originally created for XIR is now more general and
useful to other things.  We'll see how this evolves.

This needs additional documentation, but I'd like to see how it changes as
I implement XmloReader and then some of the source readers first.

DEV-10863
2022-03-18 16:24:53 -04:00
Mike Gerwitz 0360226caa tamer: xir::parse: Generalize input token type
This adds a `Token` type to `ParseState`.  Everything uses `xir::Token`
currently, but `XmloReader` will use `xir::flat::Object`.

Now that this has been generalized beyond XIR, the parser ought to be
hoisted up a level.

DEV-10863
2022-03-18 15:26:05 -04:00
Mike Gerwitz 7b6d68af85 tamer: xir::parse::Transition: Generalize flat::Transition
XIRF introduced the concept of `Transition` to help document code and
provide mental synchronization points that make it easier to reason about
the system.  I decided to hoist this into XIR's parser itself, and have
`parse_token` accept an owned state and require a new state to be returned,
utilizing `Transition`.
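
A rough sketch of the owned-state style, with hypothetical names throughout (`Pair`, the helper methods on `Transition`, and the free function `parse_token` are illustrative, not the real API).  Each arm must name the state it transitions to and pair it with a result.

```rust
/// Hypothetical wrapper naming the state being transitioned to.
#[derive(Debug, PartialEq)]
struct Transition<S>(S);

impl<S> Transition<S> {
    /// Transition and yield an object.
    fn ok<T, E>(self, obj: T) -> (Self, Result<Option<T>, E>) {
        (self, Ok(Some(obj)))
    }

    /// Transition without yielding anything yet.
    fn incomplete<T, E>(self) -> (Self, Result<Option<T>, E>) {
        (self, Ok(None))
    }

    /// Transition and report an error.
    fn err<T, E>(self, e: E) -> (Self, Result<Option<T>, E>) {
        (self, Err(e))
    }
}

/// Toy state: expecting a key, or expecting the value for a key.
#[derive(Debug, PartialEq)]
enum Pair {
    Key,
    Value(String),
}

type Obj = (String, String);
type TResult = (Transition<Pair>, Result<Option<Obj>, String>);

/// Takes the state by value and must return the next state.
fn parse_token(state: Pair, tok: &str) -> TResult {
    if tok.is_empty() {
        // Stay in the same state so the caller can keep going.
        return Transition(state).err("empty token".to_string());
    }

    match state {
        Pair::Key => Transition(Pair::Value(tok.to_string())).incomplete(),
        // Owning the state lets `key` be moved out without cloning.
        Pair::Value(key) => Transition(Pair::Key).ok((key, tok.to_string())),
    }
}
```

Feeding `"name"` and then `"tamer"` from `Pair::Key` first yields nothing and then yields the pair, returning to `Pair::Key`.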

Together with the convenience methods introduced on `Transition` itself,
this produces much clearer code, as is evidenced by tree::Stack (XIRT's
parser).  Passing an owned state is something that I had wanted to do
originally, but I thought it'd lead to more concise code to use a mutable
reference.  Unfortunately, that concision led to code that was much more
difficult than necessary to understand, and ended up having a net negative
benefit by leading to some more boilerplate for the nested types (granted,
that could have been alleviated in other ways).

This also opens up the possibility to do something that I wasn't able to
before, which was continue to abstract away parser composition by stitching
their state machines together.  I don't know if this'll be done immediately,
but because the actual parsing operations are now able to compose
functionally without mutability getting in the way, the previous state coupling
issues with the parent parser go away.

DEV-10863
2022-03-17 16:02:05 -04:00