This is the first parser generator for the parsing framework. I've been
waiting quite a while to do this because I wanted to be sure that I
understood how I intended to write the attribute parsers manually. Now that
I'm about to start parsing source XML files, it is necessary to have a
parser generator.
Typically one thinks of a parser generator as a separate program that
generates code for some language, but that is not always the case---that
represents a lack of expressiveness in the language itself (e.g. C). Here,
I simply use Rust's macro system, which should be a concept familiar to
someone coming from a language like Lisp.
This also resolves where I stand on parser combinators with respect to this
abstraction: they both accomplish the exact same thing (composition of
smaller parsers), but this abstraction doesn't do so in the typical
functional way. But the end result is the same.
The parser generated by this abstraction will be optimized an inlined in the
same manner as the hand-written parsers. Since they'll be tightly coupled
with an element parser (which too will have a parser generator), I expect
that most attribute parsers will simply be inlined; they exist as separate
parsers conceptually, for the same reason that you'd use parser combinators.
It's worth mentioning that this awkward reliance on dead state for a
lookahead token to determine when aggregation is complete rubs me the wrong
way, but resolving it would involve reintroducing the XIR AttrEnd that I had
previously removed. I'll keep fighting with myself on this, but I want to
get a bit further before I determine if it's worth the tradeoff of
reintroducing (more complex IR but simplified parsing).
DEV-7145
This was missed. It was not possible, using the documentation
alone (without looking at the linked source) to tell what the QName actually
represented, though you could assume by the name.
DEV-7145
This is partly an experiment, but is designed to simplify producing English
sentences in various contexts. It makes use of a not only unstable, but
incomplete, Rust feature---adt_const_params, for a static str const type
parameter. Hopefully that ends up being stabalized.
This uses types, but it's the same as function composition due to Rust's
monomorphization.
DEV-7145
`ParseState` originally required `Default` for use with `mem::take` in
`Parser::feed_tok`. This unfortunately cannot last, since more specialized
parsers require context during initialization in order to provide useful
diagnostic information. (The other option is to require the caller to
augment errors with diagnostic information, but that would have to be
duplicated by every caller and complicates parser composition; I'd prefer
those diagnostic details remain encapsulated.)
Replacing `Default` with `Option` is uglier, but it ends up producing the
same assembly as `mem::take` did, at least at the time of writing. Because
Rust is able to elide unnecessary moves using this implementation, there is
no need for `unwrap_unchecked` or other unsafe methods, which is great,
since it shows that this parsing methodology is viable entirely in safe
Rust.
DEV-7145
Previously, `ParseStatus::Dead` always yielded
`ParseState::Token`. However, I'm working on introducing parsers that
aggregate (parsing XML attributes into structs), and those parsers do not
know that they have completed aggregation until they reach a dead state;
given that, I need to yield additional information at that time.
I played around with a number of alternative ideas, but this ended up being
the cleanest, relative to the effort involved. For example, introducing
another parameter to `ParseStatus::Dead` was too burdensome on APIs that
ought not concern themselves with the possibility of receiving an object in
addition to a lookahead token, since many parsers are not capable of doing
so (given that they map M:(N<=M)).
Another option that I abandoned fairly quickly was having
`is_accepting` (potentially renamed) return an aggregate object, since
that's on the side and didn't feel like it was part of the parsing pipeline.
The intent is to abstract this some in a new `ParseState` method for
delegation + aggregation.
DEV-7145
I'll document it more formally eventually, but this settles on a mix of the
two: square brackets and dashes for intervals, `+` for intersecting lines,
byte offsets below interval endpoints, and names below that.
The docblock for `Span` itself iss still off; I'll probably just take one of
the test cases and paste it there at some point.
DEV-7145
This replaces a tuple with a tuple struct that allows for calculating more
complete span information, such as the span encompassing the entire
attribute and the value span including the surrounding quotes.
This includes logic that ought to be abstracted into `Span` itself, and it's
not as formal as I'd like it to be (e.g. not ensuring context), but this is
a good starting point.
Note that parsers call `Token::span`, which in turn calculates the attribute
span, each time an attribute is encountered during lowering. But Rust does
a good job at optimizing away unnecessary operations, so this didn't have an
observable impact on time.
DEV-7145
This provides much more clarity as to what is going on. Further, it's less
ambiguous, since I'm about to introduce a new type of xmlo lowering into XIR
for writing the actual xmlo files.
DEV-7145
This allows `XmlXirReader` to be used in a `Lower` operation, just as
everything else, bringing me one step closer to a pipeline that can be
concisely represented; this is finally beginning to unify in a clear way,
though it is still a bit of a mess.
This causes `XmlXirReader` to _act_ like a `parse::Parser` in that it yields
a `ParsedResult`, but it does not use `parse::Parser` itself; that was the
_original_ plan: convert it into a `ParseState` where `XmlXirReader` became
a context, and force `Parser` to yield by feeding it a stream of tokens with
`repeat`, but that ended up performing poorly relative to this change. I
did some investigation, which I might write about in the future, but for
now, this solution works just fine.
DEV-7145
This abstraction has grown quite a bit, and it's time to start formalizing
it a bit. This split doesn't change any behavior, but it does start to make
it easier to reason about by clearly stating the broad components and how
they interact with one-another.
This doesn't yet move the tests; those will come next, but they are very
few. The reason I gave previously for this was because (a) they're tested
indirectly via the systems that utilize them and (b) because the abstraction
was not yet settled on the process was already very expensive. No test
coverage was lost---it's only that failures were potentially harder to debug
on test failures, but in practice not even this was true, because the deeply
expressive types all but ensured that, if it compiles, it will function in a
way that is expected. Unit tests and documentation for this system will be
added once I'm sure that this abstraction is in a proper state.
DEV-7145
This also modifies `poc` such that `Lower` is invoked as an associated
function rather than a method to emphasize the pattern that is forming, so
that it can be later abstracted away.
DEV-11864
The `while_ok` can just be implied with a lowering operation, and that
reduces the name complexity so that we can maybe introduce even more
specialized methods without resulting in a huge sentence as a name.
DEV-11864
This finally uses `parse` all the way up to aggregation into the ASG, as can
be seen by the mess in `poc`. This will be further simplified---I just need
to get this committed so that I can mentally get it off my plate. I've been
separating this commit into smaller commits, but there's a point where it's
just not worth the effort anymore. I don't like making large changes such
as this one.
There is still work to do here. First, it's worth re-mentioning that
`poc` means "proof-of-concept", and represents things that still need a
proper home/abstraction.
Secondly, `poc` is retrieving the context of two parsers---`LowerContext`
and `Asg`. The latter is desirable, since it's the final aggregation point,
but the former needs to be eliminated; in particular, packages need to be
worked into the ASG so that `found` can be removed.
Recursively loading `xmlo` files still happens in `poc`, but the compiler
will need this as well. Once packages are on the ASG, along with their
state, that responsibility can be generalized as well.
That will then simplify lowering even further, to the point where hopefully
everything has the same shape (once final aggregation has an abstraction),
after which we can then create a final abstraction to concisely stitch
everything together. Right now, Rust isn't able to infer `S` for
`Lower<S, LS>`, which is unfortunate, but we'll be able to help it along
with a more explicit abstraction.
DEV-11864
This is intended to describe, to the user, the state that the parser is
in. This will be used to convey additional information for general parser
errors, but it should also probably be integrated into parsers' individual
errors as well when appropriate.
This is something I expected to add at some point, but I wanted to add them
because, when dealing with lowering errors, it can be difficult to tell
what parser the error originated from.
DEV-11864
The `*_iter_while_ok` functions now compose like monads, flattening `Result`
at each step and drastically simplifying handling of error types. This also
removes the bunch of `?`s at the end of the expression, and allows me to use
`?` within the callback itself.
I had originally not used `Result` as the return type of the callback
because I was not entirely sure how I was going to use them, but it's now
clear that I _always_ use `Result` as the return type, and so there's no use
in trying to be too accommodating; it can always change in the future.
This is desirable not just for cleanup, but because trying to refactor
`asg_builder` into a pair of `Parser`s is really messy to chain without
flattening, especially given some state that has to leak temporarily to the
caller. More on that in a future commit.
DEV-11864
This was always the intent, but I didn't have a higher-level object
yet. This removes all the awkwardness that existed with working the root
in as an identifier.
DEV-11864
This wraps `Ident` in a new `Object` variant and modifies `Asg` so that its
nodes are of type `Object`.
This unfortunately requires runtime type checking. Whether or not that's
worth alleviating in the future depends on a lot of different things, since
it'll require my own graph implementation, and I have to focus on other
things right now. Maybe it'll be worth it in the future.
Note that this also gets rid of some doc examples that simply aren't worth
maintaining as the API evolves.
DEV-11864
A previous commit mentioned that there's not a place for `Dim`, and
duplicated it between `asg` and `xmlo`. Well, `Dtype` is also needed in
both, and so here's a home for now.
`Dtype` has always been an inappropriate detail for the system and will one
day be removed entirely in favor of higher-level types; the machine
representation is up to the compiler to decide.
DEV-11864
asg_builder is about to be replaced, but in the process of simplifying the
destination IR (the ASG), I'm moving things into the proper place. This
never belonged here---it belongs with the actual lowering operation.
Previously, this was not reasoned about in terms of a lowering operation,
and was written when I was first introducing myself to Rust and trying to
get a proof-of-concept linker working.
DEV-11864
This matches xmlo::Dim, and could be the same thing, if we can find a home
for it in the future; it's not worth creating such a home right now when I'm
not yet sure what else ought to live there; the duplication may be fine.
The conversion from xmlo needs to be moved, and `Dim` is going to be used
for more than just identifiers (expressions will have type inference
performed).
DEV-11864
This allows retrieving and providing a context to a `Parser`. This is
intended for use with an aggregating parser, in particular to construct the
ASG and return it.
This is a component of a change that replaces `asg_builder` with a
`Parser`-based lowering into the ASG, but there are still changes that need
to be made to simplify things and complete its integration.
DEV-11864
Previously, since the graph contained only identifiers, discovered roots
were stored in a separate vector and exposed to the caller. This not only
leaked details, but added complexity; this was left over from the
refactoring of the proof-of-concept linker some time ago.
This moves the root management into the ASG itself, mostly, with one item
being left over for now in the asg_builder (eligibility classifications).
There are two roots that were added automatically:
- __yield
- __worksheet
The former has been removed and is now expected to be explicitly mapped in
the return map, which is now enforced with an extern in `core/base`. This
is still special, in the sense that it is explicitly referenced by the
generated code, but there's nothing inherently special about it and I'll
continue to generalize it into oblivion in the future, such that the final
yield is just a convention.
`__worksheet` is the only symbol of type `IdentKind::Worksheet`, and so that
was generalized just as the meta and map entries were.
The goal in the future will be to have this more under the control of the
source language, and to consolodate individual roots under packages, so that
the _actual_ roots are few.
As far as the actual ASG goes: this introduces a single root node that is
used as the sole reference for reachability analysis and topological
sorting. The edges of that root node replace the vector that was removed.
DEV-11864
In the actual implementation (outside of tests), this is always looking up
before adding the symbol. This will simplify the API, while still retaining
errors, since the identifier will fail the state transition if the
identifier did not exist before attempting to set a fragment. So while this
is slower in microbenchmarks, this has no effect on real-world performance.
Further, I'm refactoring toward a streaming ASG aggregation, which is a lot
easier if we do not need to perform lookups in a separate step from the
ASG's primitives.
DEV-11864
`PartialEq` remains, and is all that is needed. See previous commit
regarding the removal of this same bound from `Context`.
This can be re-added if it ends up actually being necessary. But Tokens are
ephemeral and used only in lowering pipelines, using pattern matching.
DEV-11864
These traits are no longer necessary now that I'm using concrete types; they
just add unnecessary noise and confusion as I attempt to further refactor.
Don't abstract prematurely.
DEV-11864
This removes the generic on the Asg (which was formerly BaseAsg),
hard-coding `IdentObject`, which will further evolve. This makes the IR an
actual concrete IR rather than an abstract data structure.
These tests bring me back a bit, since they were written as I was still
becoming familiar with Rust.
DEV-11864
This is the beginning of an incremental refactoring to remove generics, to
simplify the ASG. When I initially wrote the linker, I wasn't sure what
direction I was going in, but I was also negatively influenced by more
traditional approaches to both design and unit testing.
If we're going to call the ASG an IR, then it needs to be one---if the core
of the IR is generic, then it's more like an abstract data structure than
anything. We can abstract around the IR to slice it up into components that
are a little easier to reason about and understand how responsibilities are
segregated.
DEV-11864
This is unnecessarily restrictive, since we do not require anything further
than `PartialEq` for the situations where we care about equality (tests).
DEV-11864
This is too restrictive, especially for parsers that fold into something,
like the ASG, which may exist prior to invoking the parser.
This moves the trait bound to the functions that actually need it. Those
obviously cannot be used if the Context does not implement `Default`, but
I'll provide alternative conveniences.
DEV-11864
RSG (Ryan Specialty Group) recently announced a rename to Ryan Specialty (no
"Group"), but I'm not sure if the legal name has been changed yet or not, so
I'll wait on that.
These are no longer TODOs---they represent invalid tokens.
I'm going to put effort into providing further context with the diagnostic
system [right now] because these are internal errors caused by either
miscompilation or an incomplete reader.
DEV-10936
This was missed when removing it from other Display impls when the new
diagnostic system was introduced. Raw `Span`s display byte offsets and the
context, which is no longer desirable as part of an error message.
DEV-10936
I had waited to provide more documentation until I was sure that the
abstraction was not going to change significantly; there was a lot of
refactoring in prior commits.
DEV-12151