2022-03-18 16:24:53 -04:00
|
|
|
// Basic streaming parsing framework
|
tamer: xir:tree: Begin work on composable XIRT parser
The XIRT parser was initially written for test cases, so that unit tests
should assert more easily on generated token streams (XIR). While it was
planned, it wasn't clear what the eventual needs would be, which were
expected to differ. Indeed, loading everything into a generic tree
representation in memory is not appropriate---we should prefer streaming and
avoiding heap allocations when they’re not necessary, and we should parse
into an IR rather than a generic format, which ensures that the data follow
a proper grammar and are semantically valid.
When parsing attributes in an isolated context became necessary for the
aforementioned task, the state machine of the XIRT parser was modified to
accommodate. The opposite approach should have been taken---instead of
adding complexity and special cases to the parser, and from a complex parser
extracting a simple one (an attribute parser), we should be composing the
larger (full XIRT) parser from smaller ones (e.g. attribute, child
elements).
A combinator, when used in a functional sense, refers not to combinatory
logic but to the composition of more complex systems from smaller ones. The
changes made as part of this commit begin to work toward combinators, though
it's not necessarily evident yet (to you, the reader) how that'll work,
since the code for it hasn't yet been written; this is commit is simply
getting my work thusfar introduced so I can do some light refactoring before
continuing on it.
TAMER does not aim to introduce a parser combinator framework in its usual
sense---it favors, instead, striking a proper balance with Rust’s type
system that permits the convenience of combinators only in situations where
they are needed, to avoid having to write new parser
boilerplate. Specifically:
1. Rust’s type system should be used as combinators, so that parsers are
automatically constructed from the type definition.
2. Primitive parsers are written as explicit automata, not as primitive
combinators.
3. Parsing should directly produce IRs as a lowering operation below XIRT,
rather than producing XIRT itself. That is, target IRs should consume
XIRT and produce parse themselves immediately, during streaming.
In the future, if more combinators are needed, they will be added; maybe
this will eventually evolve into a more generic parser combinator framework
for TAME, but that is certainly a waste of time right now. And, to be
honest, I’m hoping that won’t be necessary.
2021-12-06 11:26:53 -05:00
|
|
|
//
|
2022-05-03 14:14:29 -04:00
|
|
|
// Copyright (C) 2014-2022 Ryan Specialty Group, LLC.
|
tamer: xir:tree: Begin work on composable XIRT parser
The XIRT parser was initially written for test cases, so that unit tests
should assert more easily on generated token streams (XIR). While it was
planned, it wasn't clear what the eventual needs would be, which were
expected to differ. Indeed, loading everything into a generic tree
representation in memory is not appropriate---we should prefer streaming and
avoiding heap allocations when they’re not necessary, and we should parse
into an IR rather than a generic format, which ensures that the data follow
a proper grammar and are semantically valid.
When parsing attributes in an isolated context became necessary for the
aforementioned task, the state machine of the XIRT parser was modified to
accommodate. The opposite approach should have been taken---instead of
adding complexity and special cases to the parser, and from a complex parser
extracting a simple one (an attribute parser), we should be composing the
larger (full XIRT) parser from smaller ones (e.g. attribute, child
elements).
A combinator, when used in a functional sense, refers not to combinatory
logic but to the composition of more complex systems from smaller ones. The
changes made as part of this commit begin to work toward combinators, though
it's not necessarily evident yet (to you, the reader) how that'll work,
since the code for it hasn't yet been written; this is commit is simply
getting my work thusfar introduced so I can do some light refactoring before
continuing on it.
TAMER does not aim to introduce a parser combinator framework in its usual
sense---it favors, instead, striking a proper balance with Rust’s type
system that permits the convenience of combinators only in situations where
they are needed, to avoid having to write new parser
boilerplate. Specifically:
1. Rust’s type system should be used as combinators, so that parsers are
automatically constructed from the type definition.
2. Primitive parsers are written as explicit automata, not as primitive
combinators.
3. Parsing should directly produce IRs as a lowering operation below XIRT,
rather than producing XIRT itself. That is, target IRs should consume
XIRT and produce parse themselves immediately, during streaming.
In the future, if more combinators are needed, they will be added; maybe
this will eventually evolve into a more generic parser combinator framework
for TAME, but that is certainly a waste of time right now. And, to be
honest, I’m hoping that won’t be necessary.
2021-12-06 11:26:53 -05:00
|
|
|
//
|
|
|
|
// This file is part of TAME.
|
|
|
|
//
|
|
|
|
// This program is free software: you can redistribute it and/or modify
|
|
|
|
// it under the terms of the GNU General Public License as published by
|
|
|
|
// the Free Software Foundation, either version 3 of the License, or
|
|
|
|
// (at your option) any later version.
|
|
|
|
//
|
|
|
|
// This program is distributed in the hope that it will be useful,
|
|
|
|
// but WITHOUT ANY WARRANTY; without even the implied warranty of
|
|
|
|
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
|
|
|
// GNU General Public License for more details.
|
|
|
|
//
|
|
|
|
// You should have received a copy of the GNU General Public License
|
|
|
|
// along with this program. If not, see <http://www.gnu.org/licenses/>.
|
|
|
|
|
2022-03-18 16:24:53 -04:00
|
|
|
//! Basic streaming parser framework for lowering operations.
|
|
|
|
//!
|
|
|
|
//! _TODO: Some proper docs and examples!_
|
tamer: xir:tree: Begin work on composable XIRT parser
The XIRT parser was initially written for test cases, so that unit tests
should assert more easily on generated token streams (XIR). While it was
planned, it wasn't clear what the eventual needs would be, which were
expected to differ. Indeed, loading everything into a generic tree
representation in memory is not appropriate---we should prefer streaming and
avoiding heap allocations when they’re not necessary, and we should parse
into an IR rather than a generic format, which ensures that the data follow
a proper grammar and are semantically valid.
When parsing attributes in an isolated context became necessary for the
aforementioned task, the state machine of the XIRT parser was modified to
accommodate. The opposite approach should have been taken---instead of
adding complexity and special cases to the parser, and from a complex parser
extracting a simple one (an attribute parser), we should be composing the
larger (full XIRT) parser from smaller ones (e.g. attribute, child
elements).
A combinator, when used in a functional sense, refers not to combinatory
logic but to the composition of more complex systems from smaller ones. The
changes made as part of this commit begin to work toward combinators, though
it's not necessarily evident yet (to you, the reader) how that'll work,
since the code for it hasn't yet been written; this is commit is simply
getting my work thusfar introduced so I can do some light refactoring before
continuing on it.
TAMER does not aim to introduce a parser combinator framework in its usual
sense---it favors, instead, striking a proper balance with Rust’s type
system that permits the convenience of combinators only in situations where
they are needed, to avoid having to write new parser
boilerplate. Specifically:
1. Rust’s type system should be used as combinators, so that parsers are
automatically constructed from the type definition.
2. Primitive parsers are written as explicit automata, not as primitive
combinators.
3. Parsing should directly produce IRs as a lowering operation below XIRT,
rather than producing XIRT itself. That is, target IRs should consume
XIRT and produce parse themselves immediately, during streaming.
In the future, if more combinators are needed, they will be added; maybe
this will eventually evolve into a more generic parser combinator framework
for TAME, but that is certainly a waste of time right now. And, to be
honest, I’m hoping that won’t be necessary.
2021-12-06 11:26:53 -05:00
|
|
|
|
2022-06-01 11:32:58 -04:00
|
|
|
mod error;
|
|
|
|
mod lower;
|
|
|
|
mod parser;
|
|
|
|
mod state;
|
|
|
|
|
|
|
|
pub use error::ParseError;
|
2022-06-02 10:30:44 -04:00
|
|
|
pub use lower::{Lower, LowerIter, ParsedObject};
|
2022-06-01 11:32:58 -04:00
|
|
|
pub use parser::{Parsed, ParsedResult, Parser};
|
|
|
|
pub use state::{
|
|
|
|
context::{Context, Empty as EmptyContext, NoContext},
|
tamer: Replace ParseStatus::Dead with generic lookahead
Oh what a tortured journey. I had originally tried to avoid formalizing
lookahead for all parsers by pretending that it was only needed for dead
state transitions (that is---states that have no transitions for a given
input token), but then I needed to yield information for aggregation. So I
added the ability to override the token for `Dead` to yield that, in
addition to the token. But then I also needed to yield lookahead for error
conditions. It was a mess that didn't make sense.
This eliminates `ParseStatus::Dead` entirely and fully integrates the
lookahead token in `Parser` that was previously implemented.
Notably, the lookahead token is encapsulated in `TransitionResult` and
unavailable to `ParseState` implementations, forcing them to rely on
`Parser` for recursion. This not only prevents `ParseState` from recursing,
but also simplifies delegation by removing the need to manually handle
tokens of lookahead.
The awkward case here is XIRT, which does not follow the streaming parsing
convention, because it was conceived before the parsing framework. It needs
to go away, but doing so right now would be a lot of work, so it has to
stick around for a little bit longer until the new parser generators can be
used instead. It is a persistent thorn in my side, going against the grain.
`Parser` will immediately recurse if it sees a token of lookahead with an
incomplete parse. This is because stitched parsers will frequently yield a
dead state indication when they're done parsing, and there's no use in
propagating an `Incomplete` status down the entire lowering pipeline. But,
that does mean that the toplevel is not the only thing recursing. _But_,
the behavior doesn't really change, in the sense that it would infinitely
recurse down the entire lowering stack (though there'd be an opportunity to
detect that). This should never happen with a correct parser, but it's not
worth the effort right now to try to force such a thing with Rust's type
system. Something like TLA+ is better suited here as an aid, but it
shouldn't be necessary with clear implementations and proper test
cases. Parser generators will also ensure such a thing cannot occur.
I had hoped to remove ParseStatus entirely in favor of Parsed, but there's a
lot of type inference that happens based on the fact that `ParseStatus` has
a `ParseState` type parameter; `Parsed` has only `Object`. It is desirable
for a public-facing `Parsed` to not be tied to `ParseState`, since consumers
need not be concerned with such a heavy type; however, we _do_ want that
heavy type internally, as it carries a lot of useful information that allows
for significant and powerful type inference, which in turn creates
expressive and convenient APIs.
DEV-7145
2022-07-11 23:49:57 -04:00
|
|
|
ParseResult, ParseState, ParseStatus, Transition, TransitionResult,
|
|
|
|
Transitionable,
|
2022-06-01 11:32:58 -04:00
|
|
|
};
|
|
|
|
|
2022-06-02 10:30:44 -04:00
|
|
|
use crate::span::{Span, DUMMY_SPAN};
|
|
|
|
use std::{
|
|
|
|
error::Error,
|
|
|
|
fmt::{Debug, Display},
|
|
|
|
};
|
2022-03-18 15:26:05 -04:00
|
|
|
|
|
|
|
/// A single datum from a streaming IR with an associated [`Span`].
|
|
|
|
///
|
|
|
|
/// A token may be a lexeme with associated data,
|
|
|
|
/// or a more structured object having been lowered from other IRs.
|
2022-05-16 10:05:14 -04:00
|
|
|
pub trait Token: Display + Debug + PartialEq {
|
2022-03-18 15:26:05 -04:00
|
|
|
/// Retrieve the [`Span`] representing the source location of the token.
|
|
|
|
fn span(&self) -> Span;
|
tamer: parser::Parser: cfg(test) tracing
This produces useful parse traces that are output as part of a failing test
case. The parser generator macros can be a bit confusing to deal with when
things go wrong, so this helps to clarify matters.
This is _not_ intended to be machine-readable, but it does show that it
would be possible to generate machine-readable output to visualize the
entire lowering pipeline. Perhaps something for the future.
I left these inline in Parser::feed_tok because they help to elucidate what
is going on, just by reading what the trace would output---that is, it helps
to make the method more self-documenting, albeit a tad bit more
verbose. But with that said, it should probably be extracted at some point;
I don't want this to set a precedent where composition is feasible.
Here's an example from test cases:
[Parser::feed_tok] (input IR: XIRF)
| ==> Parser before tok is parsing attributes for `package`.
| | Attrs_(SutAttrsState_ { ___ctx: (QName(None, LocalPart(NCName(SymbolId(46 "package")))), OpenSpan(Span { len: 0, offset: 0, ctx: Context(SymbolId(1 "#!DUMMY")) }, 10)), ___done: false })
|
| ==> XIRF tok: `<unexpected>`
| | Open(QName(None, LocalPart(NCName(SymbolId(82 "unexpected")))), OpenSpan(Span { len: 0, offset: 1, ctx: Context(SymbolId(1 "#!DUMMY")) }, 10), Depth(1))
|
| ==> Parser after tok is expecting opening tag `<classify>`.
| | ChildA(Expecting_)
| | Lookahead: Some(Lookahead(Open(QName(None, LocalPart(NCName(SymbolId(82 "unexpected")))), OpenSpan(Span { len: 0, offset: 1, ctx: Context(SymbolId(1 "#!DUMMY")) }, 10), Depth(1))))
= note: this trace was output as a debugging aid because `cfg(test)`.
[Parser::feed_tok] (input IR: XIRF)
| ==> Parser before tok is expecting opening tag `<classify>`.
| | ChildA(Expecting_)
|
| ==> XIRF tok: `<unexpected>`
| | Open(QName(None, LocalPart(NCName(SymbolId(82 "unexpected")))), OpenSpan(Span { len: 0, offset: 1, ctx: Context(SymbolId(1 "#!DUMMY")) }, 10), Depth(1))
|
| ==> Parser after tok is attempting to recover by ignoring element with unexpected name `unexpected` (expected `classify`).
| | ChildA(RecoverEleIgnore_(QName(None, LocalPart(NCName(SymbolId(82 "unexpected")))), OpenSpan(Span { len: 0, offset: 1, ctx: Context(SymbolId(1 "#!DUMMY")) }, 10), Depth(1)))
| | Lookahead: None
= note: this trace was output as a debugging aid because `cfg(test)`.
DEV-7145
2022-07-18 14:32:34 -04:00
|
|
|
|
|
|
|
/// Name of the intermediate representation (IR) this token represents.
|
|
|
|
///
|
|
|
|
/// This is used for diagnostic information,
|
|
|
|
/// primarily for debugging TAMER itself.
|
|
|
|
fn ir_name() -> &'static str;
|
2022-03-18 15:26:05 -04:00
|
|
|
}
|
|
|
|
|
|
|
|
impl<T: Token> From<T> for Span {
|
|
|
|
fn from(tok: T) -> Self {
|
|
|
|
tok.span()
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2022-06-02 10:30:44 -04:00
|
|
|
/// A type of [`Token`] that is not relevant.
|
|
|
|
///
|
|
|
|
/// This may be used when a [`Token`] type is required but only incidental,
|
|
|
|
/// such as for use with a [`ParseState`] in the context of a source of a
|
|
|
|
/// lowering operation.
|
|
|
|
#[derive(Debug, PartialEq)]
|
|
|
|
pub struct UnknownToken;
|
|
|
|
|
|
|
|
impl Token for UnknownToken {
|
|
|
|
fn span(&self) -> Span {
|
|
|
|
DUMMY_SPAN
|
|
|
|
}
|
tamer: parser::Parser: cfg(test) tracing
This produces useful parse traces that are output as part of a failing test
case. The parser generator macros can be a bit confusing to deal with when
things go wrong, so this helps to clarify matters.
This is _not_ intended to be machine-readable, but it does show that it
would be possible to generate machine-readable output to visualize the
entire lowering pipeline. Perhaps something for the future.
I left these inline in Parser::feed_tok because they help to elucidate what
is going on, just by reading what the trace would output---that is, it helps
to make the method more self-documenting, albeit a tad bit more
verbose. But with that said, it should probably be extracted at some point;
I don't want this to set a precedent where composition is feasible.
Here's an example from test cases:
[Parser::feed_tok] (input IR: XIRF)
| ==> Parser before tok is parsing attributes for `package`.
| | Attrs_(SutAttrsState_ { ___ctx: (QName(None, LocalPart(NCName(SymbolId(46 "package")))), OpenSpan(Span { len: 0, offset: 0, ctx: Context(SymbolId(1 "#!DUMMY")) }, 10)), ___done: false })
|
| ==> XIRF tok: `<unexpected>`
| | Open(QName(None, LocalPart(NCName(SymbolId(82 "unexpected")))), OpenSpan(Span { len: 0, offset: 1, ctx: Context(SymbolId(1 "#!DUMMY")) }, 10), Depth(1))
|
| ==> Parser after tok is expecting opening tag `<classify>`.
| | ChildA(Expecting_)
| | Lookahead: Some(Lookahead(Open(QName(None, LocalPart(NCName(SymbolId(82 "unexpected")))), OpenSpan(Span { len: 0, offset: 1, ctx: Context(SymbolId(1 "#!DUMMY")) }, 10), Depth(1))))
= note: this trace was output as a debugging aid because `cfg(test)`.
[Parser::feed_tok] (input IR: XIRF)
| ==> Parser before tok is expecting opening tag `<classify>`.
| | ChildA(Expecting_)
|
| ==> XIRF tok: `<unexpected>`
| | Open(QName(None, LocalPart(NCName(SymbolId(82 "unexpected")))), OpenSpan(Span { len: 0, offset: 1, ctx: Context(SymbolId(1 "#!DUMMY")) }, 10), Depth(1))
|
| ==> Parser after tok is attempting to recover by ignoring element with unexpected name `unexpected` (expected `classify`).
| | ChildA(RecoverEleIgnore_(QName(None, LocalPart(NCName(SymbolId(82 "unexpected")))), OpenSpan(Span { len: 0, offset: 1, ctx: Context(SymbolId(1 "#!DUMMY")) }, 10), Depth(1)))
| | Lookahead: None
= note: this trace was output as a debugging aid because `cfg(test)`.
DEV-7145
2022-07-18 14:32:34 -04:00
|
|
|
|
|
|
|
fn ir_name() -> &'static str {
|
|
|
|
"<UNKNOWN IR>"
|
|
|
|
}
|
2022-06-02 10:30:44 -04:00
|
|
|
}
|
|
|
|
|
|
|
|
impl Display for UnknownToken {
|
|
|
|
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
|
|
|
|
write!(f, "<unknown token>")
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2022-03-25 16:45:32 -04:00
|
|
|
/// An IR object produced by a lowering operation on one or more [`Token`]s.
|
|
|
|
///
|
|
|
|
/// Note that an [`Object`] may also be a [`Token`] if it will be in turn
|
|
|
|
/// fed to another [`Parser`] for lowering.
|
|
|
|
///
|
|
|
|
/// This trait exists to disambiguate an otherwise unbounded type for
|
|
|
|
/// [`From`] conversions,
|
|
|
|
/// used in the [`Transition`] API to provide greater flexibility.
|
2022-05-16 10:05:14 -04:00
|
|
|
pub trait Object: Debug + PartialEq {}
|
2022-03-25 16:45:32 -04:00
|
|
|
|
tamer: Refactor asg_builder into obj::xmlo::lower and asg::air
This finally uses `parse` all the way up to aggregation into the ASG, as can
be seen by the mess in `poc`. This will be further simplified---I just need
to get this committed so that I can mentally get it off my plate. I've been
separating this commit into smaller commits, but there's a point where it's
just not worth the effort anymore. I don't like making large changes such
as this one.
There is still work to do here. First, it's worth re-mentioning that
`poc` means "proof-of-concept", and represents things that still need a
proper home/abstraction.
Secondly, `poc` is retrieving the context of two parsers---`LowerContext`
and `Asg`. The latter is desirable, since it's the final aggregation point,
but the former needs to be eliminated; in particular, packages need to be
worked into the ASG so that `found` can be removed.
Recursively loading `xmlo` files still happens in `poc`, but the compiler
will need this as well. Once packages are on the ASG, along with their
state, that responsibility can be generalized as well.
That will then simplify lowering even further, to the point where hopefully
everything has the same shape (once final aggregation has an abstraction),
after which we can then create a final abstraction to concisely stitch
everything together. Right now, Rust isn't able to infer `S` for
`Lower<S, LS>`, which is unfortunate, but we'll be able to help it along
with a more explicit abstraction.
DEV-11864
2022-05-27 13:51:29 -04:00
|
|
|
impl Object for () {}
|
|
|
|
|
2022-03-18 15:26:05 -04:00
|
|
|
/// An infallible [`Token`] stream.
|
|
|
|
///
|
|
|
|
/// If the token stream originates from an operation that could potentially
|
|
|
|
/// fail and ought to be propagated,
|
|
|
|
/// use [`TokenResultStream`].
|
|
|
|
///
|
|
|
|
/// The name "stream" in place of "iterator" is intended to convey that this
|
|
|
|
/// type is expected to be processed in real-time as a stream,
|
|
|
|
/// not read into memory.
|
|
|
|
pub trait TokenStream<T: Token> = Iterator<Item = T>;
|
tamer: xir:tree: Begin work on composable XIRT parser
The XIRT parser was initially written for test cases, so that unit tests
should assert more easily on generated token streams (XIR). While it was
planned, it wasn't clear what the eventual needs would be, which were
expected to differ. Indeed, loading everything into a generic tree
representation in memory is not appropriate---we should prefer streaming and
avoiding heap allocations when they’re not necessary, and we should parse
into an IR rather than a generic format, which ensures that the data follow
a proper grammar and are semantically valid.
When parsing attributes in an isolated context became necessary for the
aforementioned task, the state machine of the XIRT parser was modified to
accommodate. The opposite approach should have been taken---instead of
adding complexity and special cases to the parser, and from a complex parser
extracting a simple one (an attribute parser), we should be composing the
larger (full XIRT) parser from smaller ones (e.g. attribute, child
elements).
A combinator, when used in a functional sense, refers not to combinatory
logic but to the composition of more complex systems from smaller ones. The
changes made as part of this commit begin to work toward combinators, though
it's not necessarily evident yet (to you, the reader) how that'll work,
since the code for it hasn't yet been written; this is commit is simply
getting my work thusfar introduced so I can do some light refactoring before
continuing on it.
TAMER does not aim to introduce a parser combinator framework in its usual
sense---it favors, instead, striking a proper balance with Rust’s type
system that permits the convenience of combinators only in situations where
they are needed, to avoid having to write new parser
boilerplate. Specifically:
1. Rust’s type system should be used as combinators, so that parsers are
automatically constructed from the type definition.
2. Primitive parsers are written as explicit automata, not as primitive
combinators.
3. Parsing should directly produce IRs as a lowering operation below XIRT,
rather than producing XIRT itself. That is, target IRs should consume
XIRT and produce parse themselves immediately, during streaming.
In the future, if more combinators are needed, they will be added; maybe
this will eventually evolve into a more generic parser combinator framework
for TAME, but that is certainly a waste of time right now. And, to be
honest, I’m hoping that won’t be necessary.
2021-12-06 11:26:53 -05:00
|
|
|
|
2022-03-18 16:24:53 -04:00
|
|
|
/// A [`Token`] stream that may encounter errors during parsing.
|
|
|
|
///
|
|
|
|
/// If the stream cannot fail,
|
|
|
|
/// consider using [`TokenStream`].
|
|
|
|
pub trait TokenResultStream<T: Token, E: Error> = Iterator<Item = Result<T, E>>;
|
|
|
|
|
tamer: xir:tree: Begin work on composable XIRT parser
The XIRT parser was initially written for test cases, so that unit tests
should assert more easily on generated token streams (XIR). While it was
planned, it wasn't clear what the eventual needs would be, which were
expected to differ. Indeed, loading everything into a generic tree
representation in memory is not appropriate---we should prefer streaming and
avoiding heap allocations when they’re not necessary, and we should parse
into an IR rather than a generic format, which ensures that the data follow
a proper grammar and are semantically valid.
When parsing attributes in an isolated context became necessary for the
aforementioned task, the state machine of the XIRT parser was modified to
accommodate. The opposite approach should have been taken---instead of
adding complexity and special cases to the parser, and from a complex parser
extracting a simple one (an attribute parser), we should be composing the
larger (full XIRT) parser from smaller ones (e.g. attribute, child
elements).
A combinator, when used in a functional sense, refers not to combinatory
logic but to the composition of more complex systems from smaller ones. The
changes made as part of this commit begin to work toward combinators, though
it's not necessarily evident yet (to you, the reader) how that'll work,
since the code for it hasn't yet been written; this is commit is simply
getting my work thusfar introduced so I can do some light refactoring before
continuing on it.
TAMER does not aim to introduce a parser combinator framework in its usual
sense---it favors, instead, striking a proper balance with Rust’s type
system that permits the convenience of combinators only in situations where
they are needed, to avoid having to write new parser
boilerplate. Specifically:
1. Rust’s type system should be used as combinators, so that parsers are
automatically constructed from the type definition.
2. Primitive parsers are written as explicit automata, not as primitive
combinators.
3. Parsing should directly produce IRs as a lowering operation below XIRT,
rather than producing XIRT itself. That is, target IRs should consume
XIRT and produce parse themselves immediately, during streaming.
In the future, if more combinators are needed, they will be added; maybe
this will eventually evolve into a more generic parser combinator framework
for TAME, but that is certainly a waste of time right now. And, to be
honest, I’m hoping that won’t be necessary.
2021-12-06 11:26:53 -05:00
|
|
|
#[cfg(test)]
|
|
|
|
pub mod test {
|
|
|
|
use super::*;
|
2022-06-01 11:32:58 -04:00
|
|
|
use crate::{
|
|
|
|
diagnose::{AnnotatedSpan, Diagnostic},
|
|
|
|
span::{DUMMY_SPAN as DS, UNKNOWN_SPAN},
|
|
|
|
sym::GlobalSymbolIntern,
|
|
|
|
};
|
|
|
|
use std::{assert_matches::assert_matches, iter::once};
|
tamer: xir:tree: Begin work on composable XIRT parser
The XIRT parser was initially written for test cases, so that unit tests
should assert more easily on generated token streams (XIR). While it was
planned, it wasn't clear what the eventual needs would be, which were
expected to differ. Indeed, loading everything into a generic tree
representation in memory is not appropriate---we should prefer streaming and
avoiding heap allocations when they’re not necessary, and we should parse
into an IR rather than a generic format, which ensures that the data follow
a proper grammar and are semantically valid.
When parsing attributes in an isolated context became necessary for the
aforementioned task, the state machine of the XIRT parser was modified to
accommodate. The opposite approach should have been taken---instead of
adding complexity and special cases to the parser, and from a complex parser
extracting a simple one (an attribute parser), we should be composing the
larger (full XIRT) parser from smaller ones (e.g. attribute, child
elements).
A combinator, when used in a functional sense, refers not to combinatory
logic but to the composition of more complex systems from smaller ones. The
changes made as part of this commit begin to work toward combinators, though
it's not necessarily evident yet (to you, the reader) how that'll work,
since the code for it hasn't yet been written; this is commit is simply
getting my work thusfar introduced so I can do some light refactoring before
continuing on it.
TAMER does not aim to introduce a parser combinator framework in its usual
sense---it favors, instead, striking a proper balance with Rust’s type
system that permits the convenience of combinators only in situations where
they are needed, to avoid having to write new parser
boilerplate. Specifically:
1. Rust’s type system should be used as combinators, so that parsers are
automatically constructed from the type definition.
2. Primitive parsers are written as explicit automata, not as primitive
combinators.
3. Parsing should directly produce IRs as a lowering operation below XIRT,
rather than producing XIRT itself. That is, target IRs should consume
XIRT and produce parse themselves immediately, during streaming.
In the future, if more combinators are needed, they will be added; maybe
this will eventually evolve into a more generic parser combinator framework
for TAME, but that is certainly a waste of time right now. And, to be
honest, I’m hoping that won’t be necessary.
2021-12-06 11:26:53 -05:00
|
|
|
|
2022-03-18 16:24:53 -04:00
|
|
|
#[derive(Debug, PartialEq, Eq, Clone)]
|
|
|
|
enum TestToken {
|
|
|
|
Close(Span),
|
2022-05-18 16:06:48 -04:00
|
|
|
MarkDone(Span),
|
2022-03-18 16:24:53 -04:00
|
|
|
Text(Span),
|
2022-05-18 16:06:48 -04:00
|
|
|
SetCtxVal(u8),
|
2022-03-18 16:24:53 -04:00
|
|
|
}
|
|
|
|
|
|
|
|
impl Display for TestToken {
|
tamer: parser::Parser: cfg(test) tracing
This produces useful parse traces that are output as part of a failing test
case. The parser generator macros can be a bit confusing to deal with when
things go wrong, so this helps to clarify matters.
This is _not_ intended to be machine-readable, but it does show that it
would be possible to generate machine-readable output to visualize the
entire lowering pipeline. Perhaps something for the future.
I left these inline in Parser::feed_tok because they help to elucidate what
is going on, just by reading what the trace would output---that is, it helps
to make the method more self-documenting, albeit a tad bit more
verbose. But with that said, it should probably be extracted at some point;
I don't want this to set a precedent where composition is feasible.
Here's an example from test cases:
[Parser::feed_tok] (input IR: XIRF)
| ==> Parser before tok is parsing attributes for `package`.
| | Attrs_(SutAttrsState_ { ___ctx: (QName(None, LocalPart(NCName(SymbolId(46 "package")))), OpenSpan(Span { len: 0, offset: 0, ctx: Context(SymbolId(1 "#!DUMMY")) }, 10)), ___done: false })
|
| ==> XIRF tok: `<unexpected>`
| | Open(QName(None, LocalPart(NCName(SymbolId(82 "unexpected")))), OpenSpan(Span { len: 0, offset: 1, ctx: Context(SymbolId(1 "#!DUMMY")) }, 10), Depth(1))
|
| ==> Parser after tok is expecting opening tag `<classify>`.
| | ChildA(Expecting_)
| | Lookahead: Some(Lookahead(Open(QName(None, LocalPart(NCName(SymbolId(82 "unexpected")))), OpenSpan(Span { len: 0, offset: 1, ctx: Context(SymbolId(1 "#!DUMMY")) }, 10), Depth(1))))
= note: this trace was output as a debugging aid because `cfg(test)`.
[Parser::feed_tok] (input IR: XIRF)
| ==> Parser before tok is expecting opening tag `<classify>`.
| | ChildA(Expecting_)
|
| ==> XIRF tok: `<unexpected>`
| | Open(QName(None, LocalPart(NCName(SymbolId(82 "unexpected")))), OpenSpan(Span { len: 0, offset: 1, ctx: Context(SymbolId(1 "#!DUMMY")) }, 10), Depth(1))
|
| ==> Parser after tok is attempting to recover by ignoring element with unexpected name `unexpected` (expected `classify`).
| | ChildA(RecoverEleIgnore_(QName(None, LocalPart(NCName(SymbolId(82 "unexpected")))), OpenSpan(Span { len: 0, offset: 1, ctx: Context(SymbolId(1 "#!DUMMY")) }, 10), Depth(1)))
| | Lookahead: None
= note: this trace was output as a debugging aid because `cfg(test)`.
DEV-7145
2022-07-18 14:32:34 -04:00
|
|
|
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
|
|
|
|
write!(f, "(test token)")
|
2022-03-18 16:24:53 -04:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
impl Token for TestToken {
|
tamer: parser::Parser: cfg(test) tracing
This produces useful parse traces that are output as part of a failing test
case. The parser generator macros can be a bit confusing to deal with when
things go wrong, so this helps to clarify matters.
This is _not_ intended to be machine-readable, but it does show that it
would be possible to generate machine-readable output to visualize the
entire lowering pipeline. Perhaps something for the future.
I left these inline in Parser::feed_tok because they help to elucidate what
is going on, just by reading what the trace would output---that is, it helps
to make the method more self-documenting, albeit a tad bit more
verbose. But with that said, it should probably be extracted at some point;
I don't want this to set a precedent where composition is feasible.
Here's an example from test cases:
[Parser::feed_tok] (input IR: XIRF)
| ==> Parser before tok is parsing attributes for `package`.
| | Attrs_(SutAttrsState_ { ___ctx: (QName(None, LocalPart(NCName(SymbolId(46 "package")))), OpenSpan(Span { len: 0, offset: 0, ctx: Context(SymbolId(1 "#!DUMMY")) }, 10)), ___done: false })
|
| ==> XIRF tok: `<unexpected>`
| | Open(QName(None, LocalPart(NCName(SymbolId(82 "unexpected")))), OpenSpan(Span { len: 0, offset: 1, ctx: Context(SymbolId(1 "#!DUMMY")) }, 10), Depth(1))
|
| ==> Parser after tok is expecting opening tag `<classify>`.
| | ChildA(Expecting_)
| | Lookahead: Some(Lookahead(Open(QName(None, LocalPart(NCName(SymbolId(82 "unexpected")))), OpenSpan(Span { len: 0, offset: 1, ctx: Context(SymbolId(1 "#!DUMMY")) }, 10), Depth(1))))
= note: this trace was output as a debugging aid because `cfg(test)`.
[Parser::feed_tok] (input IR: XIRF)
| ==> Parser before tok is expecting opening tag `<classify>`.
| | ChildA(Expecting_)
|
| ==> XIRF tok: `<unexpected>`
| | Open(QName(None, LocalPart(NCName(SymbolId(82 "unexpected")))), OpenSpan(Span { len: 0, offset: 1, ctx: Context(SymbolId(1 "#!DUMMY")) }, 10), Depth(1))
|
| ==> Parser after tok is attempting to recover by ignoring element with unexpected name `unexpected` (expected `classify`).
| | ChildA(RecoverEleIgnore_(QName(None, LocalPart(NCName(SymbolId(82 "unexpected")))), OpenSpan(Span { len: 0, offset: 1, ctx: Context(SymbolId(1 "#!DUMMY")) }, 10), Depth(1)))
| | Lookahead: None
= note: this trace was output as a debugging aid because `cfg(test)`.
DEV-7145
2022-07-18 14:32:34 -04:00
|
|
|
fn ir_name() -> &'static str {
|
|
|
|
"<PARSE TEST IR>"
|
|
|
|
}
|
|
|
|
|
2022-03-18 16:24:53 -04:00
|
|
|
fn span(&self) -> Span {
|
|
|
|
use TestToken::*;
|
|
|
|
match self {
|
2022-05-18 16:06:48 -04:00
|
|
|
Close(span) | MarkDone(span) | Text(span) => *span,
|
|
|
|
_ => UNKNOWN_SPAN,
|
2022-03-18 16:24:53 -04:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2022-03-25 16:45:32 -04:00
|
|
|
impl Object for TestToken {}
|
|
|
|
|
tamer: xir:tree: Begin work on composable XIRT parser
The XIRT parser was initially written for test cases, so that unit tests
should assert more easily on generated token streams (XIR). While it was
planned, it wasn't clear what the eventual needs would be, which were
expected to differ. Indeed, loading everything into a generic tree
representation in memory is not appropriate---we should prefer streaming and
avoiding heap allocations when they’re not necessary, and we should parse
into an IR rather than a generic format, which ensures that the data follow
a proper grammar and are semantically valid.
When parsing attributes in an isolated context became necessary for the
aforementioned task, the state machine of the XIRT parser was modified to
accommodate. The opposite approach should have been taken---instead of
adding complexity and special cases to the parser, and from a complex parser
extracting a simple one (an attribute parser), we should be composing the
larger (full XIRT) parser from smaller ones (e.g. attribute, child
elements).
A combinator, when used in a functional sense, refers not to combinatory
logic but to the composition of more complex systems from smaller ones. The
changes made as part of this commit begin to work toward combinators, though
it's not necessarily evident yet (to you, the reader) how that'll work,
since the code for it hasn't yet been written; this is commit is simply
getting my work thusfar introduced so I can do some light refactoring before
continuing on it.
TAMER does not aim to introduce a parser combinator framework in its usual
sense---it favors, instead, striking a proper balance with Rust’s type
system that permits the convenience of combinators only in situations where
they are needed, to avoid having to write new parser
boilerplate. Specifically:
1. Rust’s type system should be used as combinators, so that parsers are
automatically constructed from the type definition.
2. Primitive parsers are written as explicit automata, not as primitive
combinators.
3. Parsing should directly produce IRs as a lowering operation below XIRT,
rather than producing XIRT itself. That is, target IRs should consume
XIRT and produce parse themselves immediately, during streaming.
In the future, if more combinators are needed, they will be added; maybe
this will eventually evolve into a more generic parser combinator framework
for TAME, but that is certainly a waste of time right now. And, to be
honest, I’m hoping that won’t be necessary.
2021-12-06 11:26:53 -05:00
|
|
|
#[derive(Debug, PartialEq, Eq)]
|
|
|
|
enum EchoState {
|
|
|
|
Empty,
|
|
|
|
Done,
|
|
|
|
}
|
|
|
|
|
|
|
|
impl Default for EchoState {
|
|
|
|
fn default() -> Self {
|
|
|
|
Self::Empty
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2022-05-18 16:06:48 -04:00
|
|
|
#[derive(Debug, PartialEq, Default)]
|
|
|
|
struct StubContext {
|
|
|
|
val: u8,
|
|
|
|
}
|
|
|
|
|
2021-12-10 15:39:59 -05:00
|
|
|
impl ParseState for EchoState {
|
2022-03-18 16:24:53 -04:00
|
|
|
type Token = TestToken;
|
|
|
|
type Object = TestToken;
|
tamer: xir:tree: Begin work on composable XIRT parser
The XIRT parser was initially written for test cases, so that unit tests
should assert more easily on generated token streams (XIR). While it was
planned, it wasn't clear what the eventual needs would be, which were
expected to differ. Indeed, loading everything into a generic tree
representation in memory is not appropriate---we should prefer streaming and
avoiding heap allocations when they’re not necessary, and we should parse
into an IR rather than a generic format, which ensures that the data follow
a proper grammar and are semantically valid.
When parsing attributes in an isolated context became necessary for the
aforementioned task, the state machine of the XIRT parser was modified to
accommodate. The opposite approach should have been taken---instead of
adding complexity and special cases to the parser, and from a complex parser
extracting a simple one (an attribute parser), we should be composing the
larger (full XIRT) parser from smaller ones (e.g. attribute, child
elements).
A combinator, when used in a functional sense, refers not to combinatory
logic but to the composition of more complex systems from smaller ones. The
changes made as part of this commit begin to work toward combinators, though
it's not necessarily evident yet (to you, the reader) how that'll work,
since the code for it hasn't yet been written; this is commit is simply
getting my work thusfar introduced so I can do some light refactoring before
continuing on it.
TAMER does not aim to introduce a parser combinator framework in its usual
sense---it favors, instead, striking a proper balance with Rust’s type
system that permits the convenience of combinators only in situations where
they are needed, to avoid having to write new parser
boilerplate. Specifically:
1. Rust’s type system should be used as combinators, so that parsers are
automatically constructed from the type definition.
2. Primitive parsers are written as explicit automata, not as primitive
combinators.
3. Parsing should directly produce IRs as a lowering operation below XIRT,
rather than producing XIRT itself. That is, target IRs should consume
XIRT and produce parse themselves immediately, during streaming.
In the future, if more combinators are needed, they will be added; maybe
this will eventually evolve into a more generic parser combinator framework
for TAME, but that is certainly a waste of time right now. And, to be
honest, I’m hoping that won’t be necessary.
2021-12-06 11:26:53 -05:00
|
|
|
type Error = EchoStateError;
|
|
|
|
|
2022-05-18 16:06:48 -04:00
|
|
|
type Context = StubContext;
|
|
|
|
|
2022-04-04 21:50:47 -04:00
|
|
|
fn parse_token(
|
|
|
|
self,
|
|
|
|
tok: TestToken,
|
2022-05-18 16:06:48 -04:00
|
|
|
ctx: &mut StubContext,
|
2022-04-04 21:50:47 -04:00
|
|
|
) -> TransitionResult<Self> {
|
tamer: xir:tree: Begin work on composable XIRT parser
The XIRT parser was initially written for test cases, so that unit tests
should assert more easily on generated token streams (XIR). While it was
planned, it wasn't clear what the eventual needs would be, which were
expected to differ. Indeed, loading everything into a generic tree
representation in memory is not appropriate---we should prefer streaming and
avoiding heap allocations when they’re not necessary, and we should parse
into an IR rather than a generic format, which ensures that the data follow
a proper grammar and are semantically valid.
When parsing attributes in an isolated context became necessary for the
aforementioned task, the state machine of the XIRT parser was modified to
accommodate. The opposite approach should have been taken---instead of
adding complexity and special cases to the parser, and from a complex parser
extracting a simple one (an attribute parser), we should be composing the
larger (full XIRT) parser from smaller ones (e.g. attribute, child
elements).
A combinator, when used in a functional sense, refers not to combinatory
logic but to the composition of more complex systems from smaller ones. The
changes made as part of this commit begin to work toward combinators, though
it's not necessarily evident yet (to you, the reader) how that'll work,
since the code for it hasn't yet been written; this is commit is simply
getting my work thusfar introduced so I can do some light refactoring before
continuing on it.
TAMER does not aim to introduce a parser combinator framework in its usual
sense---it favors, instead, striking a proper balance with Rust’s type
system that permits the convenience of combinators only in situations where
they are needed, to avoid having to write new parser
boilerplate. Specifically:
1. Rust’s type system should be used as combinators, so that parsers are
automatically constructed from the type definition.
2. Primitive parsers are written as explicit automata, not as primitive
combinators.
3. Parsing should directly produce IRs as a lowering operation below XIRT,
rather than producing XIRT itself. That is, target IRs should consume
XIRT and produce parse themselves immediately, during streaming.
In the future, if more combinators are needed, they will be added; maybe
this will eventually evolve into a more generic parser combinator framework
for TAME, but that is certainly a waste of time right now. And, to be
honest, I’m hoping that won’t be necessary.
2021-12-06 11:26:53 -05:00
|
|
|
match tok {
|
2022-05-18 16:06:48 -04:00
|
|
|
TestToken::MarkDone(..) => Transition(Self::Done).ok(tok),
|
2022-03-18 16:24:53 -04:00
|
|
|
TestToken::Close(..) => {
|
tamer: xir::parse::Transition: Generalize flat::Transition
XIRF introduced the concept of `Transition` to help document code and
provide mental synchronization points that make it easier to reason about
the system. I decided to hoist this into XIR's parser itself, and have
`parse_token` accept an owned state and require a new state to be returned,
utilizing `Transition`.
Together with the convenience methods introduced on `Transition` itself,
this produces much clearer code, as is evidenced by tree::Stack (XIRT's
parser). Passing an owned state is something that I had wanted to do
originally, but I thought it'd lead to more concise code to use a mutable
reference. Unfortunately, that concision lead to code that was much more
difficult than necessary to understand, and ended up having a net negative
benefit by leading to some more boilerplate for the nested types (granted,
that could have been alleviated in other ways).
This also opens up the possibility to do something that I wasn't able to
before, which was continue to abstract away parser composition by stitching
their state machines together. I don't know if this'll be done immediately,
but because the actual parsing operations are now able to compose
functionally without mutability getting the way, the previous state coupling
issues with the parent parser go away.
DEV-10863
2022-03-17 15:50:35 -04:00
|
|
|
Transition(self).err(EchoStateError::InnerError(tok))
|
tamer: xir:tree: Begin work on composable XIRT parser
The XIRT parser was initially written for test cases, so that unit tests
should assert more easily on generated token streams (XIR). While it was
planned, it wasn't clear what the eventual needs would be, which were
expected to differ. Indeed, loading everything into a generic tree
representation in memory is not appropriate---we should prefer streaming and
avoiding heap allocations when they’re not necessary, and we should parse
into an IR rather than a generic format, which ensures that the data follow
a proper grammar and are semantically valid.
When parsing attributes in an isolated context became necessary for the
aforementioned task, the state machine of the XIRT parser was modified to
accommodate. The opposite approach should have been taken---instead of
adding complexity and special cases to the parser, and from a complex parser
extracting a simple one (an attribute parser), we should be composing the
larger (full XIRT) parser from smaller ones (e.g. attribute, child
elements).
A combinator, when used in a functional sense, refers not to combinatory
logic but to the composition of more complex systems from smaller ones. The
changes made as part of this commit begin to work toward combinators, though
it's not necessarily evident yet (to you, the reader) how that'll work,
since the code for it hasn't yet been written; this is commit is simply
getting my work thusfar introduced so I can do some light refactoring before
continuing on it.
TAMER does not aim to introduce a parser combinator framework in its usual
sense---it favors, instead, striking a proper balance with Rust’s type
system that permits the convenience of combinators only in situations where
they are needed, to avoid having to write new parser
boilerplate. Specifically:
1. Rust’s type system should be used as combinators, so that parsers are
automatically constructed from the type definition.
2. Primitive parsers are written as explicit automata, not as primitive
combinators.
3. Parsing should directly produce IRs as a lowering operation below XIRT,
rather than producing XIRT itself. That is, target IRs should consume
XIRT and produce parse themselves immediately, during streaming.
In the future, if more combinators are needed, they will be added; maybe
this will eventually evolve into a more generic parser combinator framework
for TAME, but that is certainly a waste of time right now. And, to be
honest, I’m hoping that won’t be necessary.
2021-12-06 11:26:53 -05:00
|
|
|
}
|
2022-03-18 16:24:53 -04:00
|
|
|
TestToken::Text(..) => Transition(self).dead(tok),
|
2022-05-18 16:06:48 -04:00
|
|
|
TestToken::SetCtxVal(val) => {
|
|
|
|
ctx.val = val;
|
|
|
|
Transition(Self::Done).incomplete()
|
|
|
|
}
|
tamer: xir:tree: Begin work on composable XIRT parser
The XIRT parser was initially written for test cases, so that unit tests
should assert more easily on generated token streams (XIR). While it was
planned, it wasn't clear what the eventual needs would be, which were
expected to differ. Indeed, loading everything into a generic tree
representation in memory is not appropriate---we should prefer streaming and
avoiding heap allocations when they’re not necessary, and we should parse
into an IR rather than a generic format, which ensures that the data follow
a proper grammar and are semantically valid.
When parsing attributes in an isolated context became necessary for the
aforementioned task, the state machine of the XIRT parser was modified to
accommodate. The opposite approach should have been taken---instead of
adding complexity and special cases to the parser, and from a complex parser
extracting a simple one (an attribute parser), we should be composing the
larger (full XIRT) parser from smaller ones (e.g. attribute, child
elements).
A combinator, when used in a functional sense, refers not to combinatory
logic but to the composition of more complex systems from smaller ones. The
changes made as part of this commit begin to work toward combinators, though
it's not necessarily evident yet (to you, the reader) how that'll work,
since the code for it hasn't yet been written; this is commit is simply
getting my work thusfar introduced so I can do some light refactoring before
continuing on it.
TAMER does not aim to introduce a parser combinator framework in its usual
sense---it favors, instead, striking a proper balance with Rust’s type
system that permits the convenience of combinators only in situations where
they are needed, to avoid having to write new parser
boilerplate. Specifically:
1. Rust’s type system should be used as combinators, so that parsers are
automatically constructed from the type definition.
2. Primitive parsers are written as explicit automata, not as primitive
combinators.
3. Parsing should directly produce IRs as a lowering operation below XIRT,
rather than producing XIRT itself. That is, target IRs should consume
XIRT and produce parse themselves immediately, during streaming.
In the future, if more combinators are needed, they will be added; maybe
this will eventually evolve into a more generic parser combinator framework
for TAME, but that is certainly a waste of time right now. And, to be
honest, I’m hoping that won’t be necessary.
2021-12-06 11:26:53 -05:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
fn is_accepting(&self) -> bool {
|
|
|
|
*self == Self::Done
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2022-05-25 14:20:10 -04:00
|
|
|
impl Display for EchoState {
|
|
|
|
fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
|
|
|
|
write!(f, "<EchoState as Display>::fmt")
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
tamer: xir::tree: Integrate AttrParserState into Stack
Note that AttrParse{r=>}State needs renaming, and Stack will get a better
name down the line too. This commit message is accurate, but confusing.
This performs the long-awaited task of trying to observe, concretely, how to
combine two automata. This has the effect of stitching together the state
machines, such that the union of the two is equivalent to the original
monolith.
The next step will be to abstract this away.
There are some important things to note here. First, this introduces a new
"dead" state concept, where here a dead state is defined as an _accepting_
state that has no state transitions for the given input token. This is more
strict than a dead state as defined in, for example, the Dragon Book, where
backtracking may occur.
The reason I chose for a Dead state to be accepting is simple: it represents
a lookahead situation. It says, "I don't know what this token is, but I've
done my job, so it may be useful in a parent context". The "I've done my
job" part is only applicable in an accepting state.
If the parser is _not_ in an accepting state, then an unknown token is
simply an error; we should _not_ try to backtrack or anything of the sort,
because we want only a single token of lookahead.
The reason this was done is because it's otherwise difficult to compose the
two parsers without requiring that AttrEnd exist in every XIR stream; this
has always been an awkward delimiter that was introduced to make the parser
LL(0), but I tried to compromise by saying that it was optional. Of course,
I knew that decision caused awkward inconsistencies, I had just hoped that
those inconsistencies wouldn't manifest in practical issues.
Well, now it did, and the benefits of AttrEnd that we had in the previous
construction do not exist in this one. Consequently, it makes more sense to
simply go from LL(0) to LL(1), which makes AttrEnd unnecessary, and a future
commit will remove it entirely.
All of this information will be documented, but I want to get further in
the implementation first to make sure I don't change course again and
therefore waste my time on docs.
DEV-11268
2021-12-16 09:44:02 -05:00
|
|
|
#[derive(Debug, PartialEq, Eq)]
|
tamer: xir:tree: Begin work on composable XIRT parser
The XIRT parser was initially written for test cases, so that unit tests
should assert more easily on generated token streams (XIR). While it was
planned, it wasn't clear what the eventual needs would be, which were
expected to differ. Indeed, loading everything into a generic tree
representation in memory is not appropriate---we should prefer streaming and
avoiding heap allocations when they’re not necessary, and we should parse
into an IR rather than a generic format, which ensures that the data follow
a proper grammar and are semantically valid.
When parsing attributes in an isolated context became necessary for the
aforementioned task, the state machine of the XIRT parser was modified to
accommodate. The opposite approach should have been taken---instead of
adding complexity and special cases to the parser, and from a complex parser
extracting a simple one (an attribute parser), we should be composing the
larger (full XIRT) parser from smaller ones (e.g. attribute, child
elements).
A combinator, when used in a functional sense, refers not to combinatory
logic but to the composition of more complex systems from smaller ones. The
changes made as part of this commit begin to work toward combinators, though
it's not necessarily evident yet (to you, the reader) how that'll work,
since the code for it hasn't yet been written; this is commit is simply
getting my work thusfar introduced so I can do some light refactoring before
continuing on it.
TAMER does not aim to introduce a parser combinator framework in its usual
sense---it favors, instead, striking a proper balance with Rust’s type
system that permits the convenience of combinators only in situations where
they are needed, to avoid having to write new parser
boilerplate. Specifically:
1. Rust’s type system should be used as combinators, so that parsers are
automatically constructed from the type definition.
2. Primitive parsers are written as explicit automata, not as primitive
combinators.
3. Parsing should directly produce IRs as a lowering operation below XIRT,
rather than producing XIRT itself. That is, target IRs should consume
XIRT and produce parse themselves immediately, during streaming.
In the future, if more combinators are needed, they will be added; maybe
this will eventually evolve into a more generic parser combinator framework
for TAME, but that is certainly a waste of time right now. And, to be
honest, I’m hoping that won’t be necessary.
2021-12-06 11:26:53 -05:00
|
|
|
enum EchoStateError {
|
2022-03-18 16:24:53 -04:00
|
|
|
InnerError(TestToken),
|
tamer: xir:tree: Begin work on composable XIRT parser
The XIRT parser was initially written for test cases, so that unit tests
should assert more easily on generated token streams (XIR). While it was
planned, it wasn't clear what the eventual needs would be, which were
expected to differ. Indeed, loading everything into a generic tree
representation in memory is not appropriate---we should prefer streaming and
avoiding heap allocations when they’re not necessary, and we should parse
into an IR rather than a generic format, which ensures that the data follow
a proper grammar and are semantically valid.
When parsing attributes in an isolated context became necessary for the
aforementioned task, the state machine of the XIRT parser was modified to
accommodate. The opposite approach should have been taken---instead of
adding complexity and special cases to the parser, and from a complex parser
extracting a simple one (an attribute parser), we should be composing the
larger (full XIRT) parser from smaller ones (e.g. attribute, child
elements).
A combinator, when used in a functional sense, refers not to combinatory
logic but to the composition of more complex systems from smaller ones. The
changes made as part of this commit begin to work toward combinators, though
it's not necessarily evident yet (to you, the reader) how that'll work,
since the code for it hasn't yet been written; this is commit is simply
getting my work thusfar introduced so I can do some light refactoring before
continuing on it.
TAMER does not aim to introduce a parser combinator framework in its usual
sense---it favors, instead, striking a proper balance with Rust’s type
system that permits the convenience of combinators only in situations where
they are needed, to avoid having to write new parser
boilerplate. Specifically:
1. Rust’s type system should be used as combinators, so that parsers are
automatically constructed from the type definition.
2. Primitive parsers are written as explicit automata, not as primitive
combinators.
3. Parsing should directly produce IRs as a lowering operation below XIRT,
rather than producing XIRT itself. That is, target IRs should consume
XIRT and produce parse themselves immediately, during streaming.
In the future, if more combinators are needed, they will be added; maybe
this will eventually evolve into a more generic parser combinator framework
for TAME, but that is certainly a waste of time right now. And, to be
honest, I’m hoping that won’t be necessary.
2021-12-06 11:26:53 -05:00
|
|
|
}
|
|
|
|
|
|
|
|
impl Display for EchoStateError {
|
|
|
|
fn fmt(&self, _: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
|
|
|
|
unimplemented!()
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
impl Error for EchoStateError {
|
|
|
|
fn source(&self) -> Option<&(dyn Error + 'static)> {
|
|
|
|
None
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
tamer: diagnose: Introduction of diagnostic system
This is a working concept that will continue to evolve. I wanted to start
with some basic output before getting too carried away, since there's a lot
of potential here.
This is heavily influenced by Rust's helpful diagnostic messages, but will
take some time to realize a lot of the things that Rust does. The next step
will be to resolve line and column numbers, and then possibly include
snippets and underline spans, placing the labels alongside them. I need to
balance this work with everything else I have going on.
This is a large commit, but it converts the existing Error Display impls
into Diagnostic. This separation is a bit verbose, so I'll see how this
ends up evolving.
Diagnostics are tied to Error at the moment, but I imagine in the future
that any object would be able to describe itself, error or not, which would
be useful in the future both for the Summary Page and for query
functionality, to help developers understand the systems they are writing
using TAME.
Output is integrated into tameld only in this commit; I'll add tamec
next. Examples of what this outputs are available in the test cases in this
commit.
DEV-10935
2022-04-13 14:41:54 -04:00
|
|
|
impl Diagnostic for EchoStateError {
|
|
|
|
fn describe(&self) -> Vec<AnnotatedSpan> {
|
|
|
|
unimplemented!()
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2021-12-13 16:51:15 -05:00
|
|
|
type Sut<I> = Parser<EchoState, I>;
|
tamer: xir:tree: Begin work on composable XIRT parser
The XIRT parser was initially written for test cases, so that unit tests
should assert more easily on generated token streams (XIR). While it was
planned, it wasn't clear what the eventual needs would be, which were
expected to differ. Indeed, loading everything into a generic tree
representation in memory is not appropriate---we should prefer streaming and
avoiding heap allocations when they’re not necessary, and we should parse
into an IR rather than a generic format, which ensures that the data follow
a proper grammar and are semantically valid.
When parsing attributes in an isolated context became necessary for the
aforementioned task, the state machine of the XIRT parser was modified to
accommodate. The opposite approach should have been taken---instead of
adding complexity and special cases to the parser, and from a complex parser
extracting a simple one (an attribute parser), we should be composing the
larger (full XIRT) parser from smaller ones (e.g. attribute, child
elements).
A combinator, when used in a functional sense, refers not to combinatory
logic but to the composition of more complex systems from smaller ones. The
changes made as part of this commit begin to work toward combinators, though
it's not necessarily evident yet (to you, the reader) how that'll work,
since the code for it hasn't yet been written; this is commit is simply
getting my work thusfar introduced so I can do some light refactoring before
continuing on it.
TAMER does not aim to introduce a parser combinator framework in its usual
sense---it favors, instead, striking a proper balance with Rust’s type
system that permits the convenience of combinators only in situations where
they are needed, to avoid having to write new parser
boilerplate. Specifically:
1. Rust’s type system should be used as combinators, so that parsers are
automatically constructed from the type definition.
2. Primitive parsers are written as explicit automata, not as primitive
combinators.
3. Parsing should directly produce IRs as a lowering operation below XIRT,
rather than producing XIRT itself. That is, target IRs should consume
XIRT and produce parse themselves immediately, during streaming.
In the future, if more combinators are needed, they will be added; maybe
this will eventually evolve into a more generic parser combinator framework
for TAME, but that is certainly a waste of time right now. And, to be
honest, I’m hoping that won’t be necessary.
2021-12-06 11:26:53 -05:00
|
|
|
|
|
|
|
#[test]
|
2021-12-06 15:34:29 -05:00
|
|
|
fn successful_parse_in_accepting_state_with_spans() {
|
2021-12-17 10:14:31 -05:00
|
|
|
// EchoState is placed into a Done state given Comment.
|
2022-05-18 16:06:48 -04:00
|
|
|
let tok = TestToken::MarkDone(DS);
|
2021-12-17 10:14:31 -05:00
|
|
|
let mut toks = once(tok.clone());
|
tamer: xir:tree: Begin work on composable XIRT parser
The XIRT parser was initially written for test cases, so that unit tests
should assert more easily on generated token streams (XIR). While it was
planned, it wasn't clear what the eventual needs would be, which were
expected to differ. Indeed, loading everything into a generic tree
representation in memory is not appropriate---we should prefer streaming and
avoiding heap allocations when they’re not necessary, and we should parse
into an IR rather than a generic format, which ensures that the data follow
a proper grammar and are semantically valid.
When parsing attributes in an isolated context became necessary for the
aforementioned task, the state machine of the XIRT parser was modified to
accommodate. The opposite approach should have been taken---instead of
adding complexity and special cases to the parser, and from a complex parser
extracting a simple one (an attribute parser), we should be composing the
larger (full XIRT) parser from smaller ones (e.g. attribute, child
elements).
A combinator, when used in a functional sense, refers not to combinatory
logic but to the composition of more complex systems from smaller ones. The
changes made as part of this commit begin to work toward combinators, though
it's not necessarily evident yet (to you, the reader) how that'll work,
since the code for it hasn't yet been written; this is commit is simply
getting my work thusfar introduced so I can do some light refactoring before
continuing on it.
TAMER does not aim to introduce a parser combinator framework in its usual
sense---it favors, instead, striking a proper balance with Rust’s type
system that permits the convenience of combinators only in situations where
they are needed, to avoid having to write new parser
boilerplate. Specifically:
1. Rust’s type system should be used as combinators, so that parsers are
automatically constructed from the type definition.
2. Primitive parsers are written as explicit automata, not as primitive
combinators.
3. Parsing should directly produce IRs as a lowering operation below XIRT,
rather than producing XIRT itself. That is, target IRs should consume
XIRT and produce parse themselves immediately, during streaming.
In the future, if more combinators are needed, they will be added; maybe
this will eventually evolve into a more generic parser combinator framework
for TAME, but that is certainly a waste of time right now. And, to be
honest, I’m hoping that won’t be necessary.
2021-12-06 11:26:53 -05:00
|
|
|
|
|
|
|
let mut sut = Sut::from(&mut toks);
|
|
|
|
|
|
|
|
// The first token should be processed normally.
|
|
|
|
// EchoState proxies the token back.
|
2021-12-17 10:14:31 -05:00
|
|
|
assert_eq!(Some(Ok(Parsed::Object(tok))), sut.next());
|
tamer: xir:tree: Begin work on composable XIRT parser
The XIRT parser was initially written for test cases, so that unit tests
should assert more easily on generated token streams (XIR). While it was
planned, it wasn't clear what the eventual needs would be, which were
expected to differ. Indeed, loading everything into a generic tree
representation in memory is not appropriate---we should prefer streaming and
avoiding heap allocations when they’re not necessary, and we should parse
into an IR rather than a generic format, which ensures that the data follow
a proper grammar and are semantically valid.
When parsing attributes in an isolated context became necessary for the
aforementioned task, the state machine of the XIRT parser was modified to
accommodate. The opposite approach should have been taken---instead of
adding complexity and special cases to the parser, and from a complex parser
extracting a simple one (an attribute parser), we should be composing the
larger (full XIRT) parser from smaller ones (e.g. attribute, child
elements).
A combinator, when used in a functional sense, refers not to combinatory
logic but to the composition of more complex systems from smaller ones. The
changes made as part of this commit begin to work toward combinators, though
it's not necessarily evident yet (to you, the reader) how that'll work,
since the code for it hasn't yet been written; this is commit is simply
getting my work thusfar introduced so I can do some light refactoring before
continuing on it.
TAMER does not aim to introduce a parser combinator framework in its usual
sense---it favors, instead, striking a proper balance with Rust’s type
system that permits the convenience of combinators only in situations where
they are needed, to avoid having to write new parser
boilerplate. Specifically:
1. Rust’s type system should be used as combinators, so that parsers are
automatically constructed from the type definition.
2. Primitive parsers are written as explicit automata, not as primitive
combinators.
3. Parsing should directly produce IRs as a lowering operation below XIRT,
rather than producing XIRT itself. That is, target IRs should consume
XIRT and produce parse themselves immediately, during streaming.
In the future, if more combinators are needed, they will be added; maybe
this will eventually evolve into a more generic parser combinator framework
for TAME, but that is certainly a waste of time right now. And, to be
honest, I’m hoping that won’t be necessary.
2021-12-06 11:26:53 -05:00
|
|
|
|
|
|
|
// This is now the end of the token stream,
|
|
|
|
// which should be okay provided that the first token put us into
|
|
|
|
// a proper accepting state.
|
|
|
|
assert_eq!(None, sut.next());
|
|
|
|
|
|
|
|
// Further, finalizing should work in this state.
|
|
|
|
assert!(sut.finalize().is_ok());
|
|
|
|
}
|
|
|
|
|
|
|
|
#[test]
|
|
|
|
fn fails_on_end_of_stream_when_not_in_accepting_state() {
|
2022-03-17 21:33:05 -04:00
|
|
|
let span = Span::new(10, 20, "ctx".intern());
|
2022-03-18 16:24:53 -04:00
|
|
|
let mut toks = [TestToken::Close(span)].into_iter();
|
tamer: xir:tree: Begin work on composable XIRT parser
The XIRT parser was initially written for test cases, so that unit tests
should assert more easily on generated token streams (XIR). While it was
planned, it wasn't clear what the eventual needs would be, which were
expected to differ. Indeed, loading everything into a generic tree
representation in memory is not appropriate---we should prefer streaming and
avoiding heap allocations when they’re not necessary, and we should parse
into an IR rather than a generic format, which ensures that the data follow
a proper grammar and are semantically valid.
When parsing attributes in an isolated context became necessary for the
aforementioned task, the state machine of the XIRT parser was modified to
accommodate. The opposite approach should have been taken---instead of
adding complexity and special cases to the parser, and from a complex parser
extracting a simple one (an attribute parser), we should be composing the
larger (full XIRT) parser from smaller ones (e.g. attribute, child
elements).
A combinator, when used in a functional sense, refers not to combinatory
logic but to the composition of more complex systems from smaller ones. The
changes made as part of this commit begin to work toward combinators, though
it's not necessarily evident yet (to you, the reader) how that'll work,
since the code for it hasn't yet been written; this is commit is simply
getting my work thusfar introduced so I can do some light refactoring before
continuing on it.
TAMER does not aim to introduce a parser combinator framework in its usual
sense---it favors, instead, striking a proper balance with Rust’s type
system that permits the convenience of combinators only in situations where
they are needed, to avoid having to write new parser
boilerplate. Specifically:
1. Rust’s type system should be used as combinators, so that parsers are
automatically constructed from the type definition.
2. Primitive parsers are written as explicit automata, not as primitive
combinators.
3. Parsing should directly produce IRs as a lowering operation below XIRT,
rather than producing XIRT itself. That is, target IRs should consume
XIRT and produce parse themselves immediately, during streaming.
In the future, if more combinators are needed, they will be added; maybe
this will eventually evolve into a more generic parser combinator framework
for TAME, but that is certainly a waste of time right now. And, to be
honest, I’m hoping that won’t be necessary.
2021-12-06 11:26:53 -05:00
|
|
|
|
|
|
|
let mut sut = Sut::from(&mut toks);
|
|
|
|
|
2021-12-06 15:34:29 -05:00
|
|
|
// The first token is fine,
|
|
|
|
// and allows us to acquire our most recent span.
|
|
|
|
sut.next();
|
|
|
|
|
tamer: xir:tree: Begin work on composable XIRT parser
The XIRT parser was initially written for test cases, so that unit tests
should assert more easily on generated token streams (XIR). While it was
planned, it wasn't clear what the eventual needs would be, which were
expected to differ. Indeed, loading everything into a generic tree
representation in memory is not appropriate---we should prefer streaming and
avoiding heap allocations when they’re not necessary, and we should parse
into an IR rather than a generic format, which ensures that the data follow
a proper grammar and are semantically valid.
When parsing attributes in an isolated context became necessary for the
aforementioned task, the state machine of the XIRT parser was modified to
accommodate. The opposite approach should have been taken---instead of
adding complexity and special cases to the parser, and from a complex parser
extracting a simple one (an attribute parser), we should be composing the
larger (full XIRT) parser from smaller ones (e.g. attribute, child
elements).
A combinator, when used in a functional sense, refers not to combinatory
logic but to the composition of more complex systems from smaller ones. The
changes made as part of this commit begin to work toward combinators, though
it's not necessarily evident yet (to you, the reader) how that'll work,
since the code for it hasn't yet been written; this is commit is simply
getting my work thusfar introduced so I can do some light refactoring before
continuing on it.
TAMER does not aim to introduce a parser combinator framework in its usual
sense---it favors, instead, striking a proper balance with Rust’s type
system that permits the convenience of combinators only in situations where
they are needed, to avoid having to write new parser
boilerplate. Specifically:
1. Rust’s type system should be used as combinators, so that parsers are
automatically constructed from the type definition.
2. Primitive parsers are written as explicit automata, not as primitive
combinators.
3. Parsing should directly produce IRs as a lowering operation below XIRT,
rather than producing XIRT itself. That is, target IRs should consume
XIRT and produce parse themselves immediately, during streaming.
In the future, if more combinators are needed, they will be added; maybe
this will eventually evolve into a more generic parser combinator framework
for TAME, but that is certainly a waste of time right now. And, to be
honest, I’m hoping that won’t be necessary.
2021-12-06 11:26:53 -05:00
|
|
|
// Given that we have no tokens,
|
|
|
|
// and that EchoState::default does not start in an accepting
|
|
|
|
// state,
|
|
|
|
// we must fail when we encounter the end of the stream.
|
2022-03-17 21:33:05 -04:00
|
|
|
assert_eq!(
|
2022-05-25 14:20:10 -04:00
|
|
|
Some(Err(ParseError::UnexpectedEof(
|
|
|
|
span.endpoints().1.unwrap(),
|
|
|
|
// All the states have the same string
|
|
|
|
// (at time of writing).
|
|
|
|
EchoState::default().to_string(),
|
|
|
|
))),
|
2022-03-17 21:33:05 -04:00
|
|
|
sut.next()
|
|
|
|
);
|
tamer: xir:tree: Begin work on composable XIRT parser
The XIRT parser was initially written for test cases, so that unit tests
should assert more easily on generated token streams (XIR). While it was
planned, it wasn't clear what the eventual needs would be, which were
expected to differ. Indeed, loading everything into a generic tree
representation in memory is not appropriate---we should prefer streaming and
avoiding heap allocations when they’re not necessary, and we should parse
into an IR rather than a generic format, which ensures that the data follow
a proper grammar and are semantically valid.
When parsing attributes in an isolated context became necessary for the
aforementioned task, the state machine of the XIRT parser was modified to
accommodate. The opposite approach should have been taken---instead of
adding complexity and special cases to the parser, and from a complex parser
extracting a simple one (an attribute parser), we should be composing the
larger (full XIRT) parser from smaller ones (e.g. attribute, child
elements).
A combinator, when used in a functional sense, refers not to combinatory
logic but to the composition of more complex systems from smaller ones. The
changes made as part of this commit begin to work toward combinators, though
it's not necessarily evident yet (to you, the reader) how that'll work,
since the code for it hasn't yet been written; this is commit is simply
getting my work thusfar introduced so I can do some light refactoring before
continuing on it.
TAMER does not aim to introduce a parser combinator framework in its usual
sense---it favors, instead, striking a proper balance with Rust’s type
system that permits the convenience of combinators only in situations where
they are needed, to avoid having to write new parser
boilerplate. Specifically:
1. Rust’s type system should be used as combinators, so that parsers are
automatically constructed from the type definition.
2. Primitive parsers are written as explicit automata, not as primitive
combinators.
3. Parsing should directly produce IRs as a lowering operation below XIRT,
rather than producing XIRT itself. That is, target IRs should consume
XIRT and produce parse themselves immediately, during streaming.
In the future, if more combinators are needed, they will be added; maybe
this will eventually evolve into a more generic parser combinator framework
for TAME, but that is certainly a waste of time right now. And, to be
honest, I’m hoping that won’t be necessary.
2021-12-06 11:26:53 -05:00
|
|
|
}
|
|
|
|
|
|
|
|
#[test]
|
|
|
|
fn returns_state_specific_error() {
|
2022-03-18 16:24:53 -04:00
|
|
|
// TestToken::Close causes EchoState to produce an error.
|
|
|
|
let errtok = TestToken::Close(DS);
|
tamer: xir:tree: Begin work on composable XIRT parser
The XIRT parser was initially written for test cases, so that unit tests
should assert more easily on generated token streams (XIR). While it was
planned, it wasn't clear what the eventual needs would be, which were
expected to differ. Indeed, loading everything into a generic tree
representation in memory is not appropriate---we should prefer streaming and
avoiding heap allocations when they’re not necessary, and we should parse
into an IR rather than a generic format, which ensures that the data follow
a proper grammar and are semantically valid.
When parsing attributes in an isolated context became necessary for the
aforementioned task, the state machine of the XIRT parser was modified to
accommodate. The opposite approach should have been taken---instead of
adding complexity and special cases to the parser, and from a complex parser
extracting a simple one (an attribute parser), we should be composing the
larger (full XIRT) parser from smaller ones (e.g. attribute, child
elements).
A combinator, when used in a functional sense, refers not to combinatory
logic but to the composition of more complex systems from smaller ones. The
changes made as part of this commit begin to work toward combinators, though
it's not necessarily evident yet (to you, the reader) how that'll work,
since the code for it hasn't yet been written; this is commit is simply
getting my work thusfar introduced so I can do some light refactoring before
continuing on it.
TAMER does not aim to introduce a parser combinator framework in its usual
sense---it favors, instead, striking a proper balance with Rust’s type
system that permits the convenience of combinators only in situations where
they are needed, to avoid having to write new parser
boilerplate. Specifically:
1. Rust’s type system should be used as combinators, so that parsers are
automatically constructed from the type definition.
2. Primitive parsers are written as explicit automata, not as primitive
combinators.
3. Parsing should directly produce IRs as a lowering operation below XIRT,
rather than producing XIRT itself. That is, target IRs should consume
XIRT and produce parse themselves immediately, during streaming.
In the future, if more combinators are needed, they will be added; maybe
this will eventually evolve into a more generic parser combinator framework
for TAME, but that is certainly a waste of time right now. And, to be
honest, I’m hoping that won’t be necessary.
2021-12-06 11:26:53 -05:00
|
|
|
let mut toks = [errtok.clone()].into_iter();
|
|
|
|
|
|
|
|
let mut sut = Sut::from(&mut toks);
|
|
|
|
|
|
|
|
assert_eq!(
|
|
|
|
Some(Err(ParseError::StateError(EchoStateError::InnerError(
|
|
|
|
errtok
|
|
|
|
)))),
|
|
|
|
sut.next()
|
|
|
|
);
|
|
|
|
|
|
|
|
// The token must have been consumed.
|
|
|
|
// It is up to a recovery process to either bail out or provide
|
|
|
|
// recovery tokens;
|
|
|
|
// continuing without recovery is unlikely to make sense.
|
|
|
|
assert_eq!(0, toks.len());
|
|
|
|
}
|
|
|
|
|
|
|
|
#[test]
|
|
|
|
fn fails_when_parser_is_finalized_in_non_accepting_state() {
|
2022-03-17 21:33:05 -04:00
|
|
|
let span = Span::new(10, 10, "ctx".intern());
|
|
|
|
|
tamer: xir:tree: Begin work on composable XIRT parser
The XIRT parser was initially written for test cases, so that unit tests
should assert more easily on generated token streams (XIR). While it was
planned, it wasn't clear what the eventual needs would be, which were
expected to differ. Indeed, loading everything into a generic tree
representation in memory is not appropriate---we should prefer streaming and
avoiding heap allocations when they’re not necessary, and we should parse
into an IR rather than a generic format, which ensures that the data follow
a proper grammar and are semantically valid.
When parsing attributes in an isolated context became necessary for the
aforementioned task, the state machine of the XIRT parser was modified to
accommodate. The opposite approach should have been taken---instead of
adding complexity and special cases to the parser, and from a complex parser
extracting a simple one (an attribute parser), we should be composing the
larger (full XIRT) parser from smaller ones (e.g. attribute, child
elements).
A combinator, when used in a functional sense, refers not to combinatory
logic but to the composition of more complex systems from smaller ones. The
changes made as part of this commit begin to work toward combinators, though
it's not necessarily evident yet (to you, the reader) how that'll work,
since the code for it hasn't yet been written; this is commit is simply
getting my work thusfar introduced so I can do some light refactoring before
continuing on it.
TAMER does not aim to introduce a parser combinator framework in its usual
sense---it favors, instead, striking a proper balance with Rust’s type
system that permits the convenience of combinators only in situations where
they are needed, to avoid having to write new parser
boilerplate. Specifically:
1. Rust’s type system should be used as combinators, so that parsers are
automatically constructed from the type definition.
2. Primitive parsers are written as explicit automata, not as primitive
combinators.
3. Parsing should directly produce IRs as a lowering operation below XIRT,
rather than producing XIRT itself. That is, target IRs should consume
XIRT and produce parse themselves immediately, during streaming.
In the future, if more combinators are needed, they will be added; maybe
this will eventually evolve into a more generic parser combinator framework
for TAME, but that is certainly a waste of time right now. And, to be
honest, I’m hoping that won’t be necessary.
2021-12-06 11:26:53 -05:00
|
|
|
// Set up so that we have a single token that we can use for
|
|
|
|
// recovery as part of the same iterator.
|
2022-05-18 16:06:48 -04:00
|
|
|
let recovery = TestToken::MarkDone(DS);
|
2021-12-06 15:34:29 -05:00
|
|
|
let mut toks = [
|
|
|
|
// Used purely to populate a Span.
|
2022-03-18 16:24:53 -04:00
|
|
|
TestToken::Close(span),
|
2021-12-06 15:34:29 -05:00
|
|
|
// Recovery token here:
|
2021-12-17 10:14:31 -05:00
|
|
|
recovery.clone(),
|
2021-12-06 15:34:29 -05:00
|
|
|
]
|
|
|
|
.into_iter();
|
tamer: xir:tree: Begin work on composable XIRT parser
The XIRT parser was initially written for test cases, so that unit tests
should assert more easily on generated token streams (XIR). While it was
planned, it wasn't clear what the eventual needs would be, which were
expected to differ. Indeed, loading everything into a generic tree
representation in memory is not appropriate---we should prefer streaming and
avoiding heap allocations when they’re not necessary, and we should parse
into an IR rather than a generic format, which ensures that the data follow
a proper grammar and are semantically valid.
When parsing attributes in an isolated context became necessary for the
aforementioned task, the state machine of the XIRT parser was modified to
accommodate. The opposite approach should have been taken---instead of
adding complexity and special cases to the parser, and from a complex parser
extracting a simple one (an attribute parser), we should be composing the
larger (full XIRT) parser from smaller ones (e.g. attribute, child
elements).
A combinator, when used in a functional sense, refers not to combinatory
logic but to the composition of more complex systems from smaller ones. The
changes made as part of this commit begin to work toward combinators, though
it's not necessarily evident yet (to you, the reader) how that'll work,
since the code for it hasn't yet been written; this is commit is simply
getting my work thusfar introduced so I can do some light refactoring before
continuing on it.
TAMER does not aim to introduce a parser combinator framework in its usual
sense---it favors, instead, striking a proper balance with Rust’s type
system that permits the convenience of combinators only in situations where
they are needed, to avoid having to write new parser
boilerplate. Specifically:
1. Rust’s type system should be used as combinators, so that parsers are
automatically constructed from the type definition.
2. Primitive parsers are written as explicit automata, not as primitive
combinators.
3. Parsing should directly produce IRs as a lowering operation below XIRT,
rather than producing XIRT itself. That is, target IRs should consume
XIRT and produce parse themselves immediately, during streaming.
In the future, if more combinators are needed, they will be added; maybe
this will eventually evolve into a more generic parser combinator framework
for TAME, but that is certainly a waste of time right now. And, to be
honest, I’m hoping that won’t be necessary.
2021-12-06 11:26:53 -05:00
|
|
|
|
2021-12-06 15:34:29 -05:00
|
|
|
let mut sut = Sut::from(&mut toks);
|
|
|
|
|
|
|
|
// Populate our most recently seen token's span.
|
|
|
|
sut.next();
|
tamer: xir:tree: Begin work on composable XIRT parser
The XIRT parser was initially written for test cases, so that unit tests
should assert more easily on generated token streams (XIR). While it was
planned, it wasn't clear what the eventual needs would be, which were
expected to differ. Indeed, loading everything into a generic tree
representation in memory is not appropriate---we should prefer streaming and
avoiding heap allocations when they’re not necessary, and we should parse
into an IR rather than a generic format, which ensures that the data follow
a proper grammar and are semantically valid.
When parsing attributes in an isolated context became necessary for the
aforementioned task, the state machine of the XIRT parser was modified to
accommodate. The opposite approach should have been taken---instead of
adding complexity and special cases to the parser, and from a complex parser
extracting a simple one (an attribute parser), we should be composing the
larger (full XIRT) parser from smaller ones (e.g. attribute, child
elements).
A combinator, when used in a functional sense, refers not to combinatory
logic but to the composition of more complex systems from smaller ones. The
changes made as part of this commit begin to work toward combinators, though
it's not necessarily evident yet (to you, the reader) how that'll work,
since the code for it hasn't yet been written; this is commit is simply
getting my work thusfar introduced so I can do some light refactoring before
continuing on it.
TAMER does not aim to introduce a parser combinator framework in its usual
sense---it favors, instead, striking a proper balance with Rust’s type
system that permits the convenience of combinators only in situations where
they are needed, to avoid having to write new parser
boilerplate. Specifically:
1. Rust’s type system should be used as combinators, so that parsers are
automatically constructed from the type definition.
2. Primitive parsers are written as explicit automata, not as primitive
combinators.
3. Parsing should directly produce IRs as a lowering operation below XIRT,
rather than producing XIRT itself. That is, target IRs should consume
XIRT and produce parse themselves immediately, during streaming.
In the future, if more combinators are needed, they will be added; maybe
this will eventually evolve into a more generic parser combinator framework
for TAME, but that is certainly a waste of time right now. And, to be
honest, I’m hoping that won’t be necessary.
2021-12-06 11:26:53 -05:00
|
|
|
|
|
|
|
// Attempting to finalize now in a non-accepting state should fail
|
|
|
|
// in the same way that encountering an end-of-stream does,
|
|
|
|
// since we're effectively saying "we're done with the stream"
|
|
|
|
// and the parser will have no further opportunity to reach an
|
|
|
|
// accepting state.
|
|
|
|
let result = sut.finalize();
|
2021-12-06 15:34:29 -05:00
|
|
|
assert_matches!(
|
|
|
|
result,
|
2022-05-25 14:20:10 -04:00
|
|
|
Err((_, ParseError::UnexpectedEof(s, _))) if s == span.endpoints().1.unwrap()
|
2021-12-06 15:34:29 -05:00
|
|
|
);
|
tamer: xir:tree: Begin work on composable XIRT parser
The XIRT parser was initially written for test cases, so that unit tests
should assert more easily on generated token streams (XIR). While it was
planned, it wasn't clear what the eventual needs would be, which were
expected to differ. Indeed, loading everything into a generic tree
representation in memory is not appropriate---we should prefer streaming and
avoiding heap allocations when they’re not necessary, and we should parse
into an IR rather than a generic format, which ensures that the data follow
a proper grammar and are semantically valid.
When parsing attributes in an isolated context became necessary for the
aforementioned task, the state machine of the XIRT parser was modified to
accommodate. The opposite approach should have been taken---instead of
adding complexity and special cases to the parser, and from a complex parser
extracting a simple one (an attribute parser), we should be composing the
larger (full XIRT) parser from smaller ones (e.g. attribute, child
elements).
A combinator, when used in a functional sense, refers not to combinatory
logic but to the composition of more complex systems from smaller ones. The
changes made as part of this commit begin to work toward combinators, though
it's not necessarily evident yet (to you, the reader) how that'll work,
since the code for it hasn't yet been written; this is commit is simply
getting my work thusfar introduced so I can do some light refactoring before
continuing on it.
TAMER does not aim to introduce a parser combinator framework in its usual
sense---it favors, instead, striking a proper balance with Rust’s type
system that permits the convenience of combinators only in situations where
they are needed, to avoid having to write new parser
boilerplate. Specifically:
1. Rust’s type system should be used as combinators, so that parsers are
automatically constructed from the type definition.
2. Primitive parsers are written as explicit automata, not as primitive
combinators.
3. Parsing should directly produce IRs as a lowering operation below XIRT,
rather than producing XIRT itself. That is, target IRs should consume
XIRT and produce parse themselves immediately, during streaming.
In the future, if more combinators are needed, they will be added; maybe
this will eventually evolve into a more generic parser combinator framework
for TAME, but that is certainly a waste of time right now. And, to be
honest, I’m hoping that won’t be necessary.
2021-12-06 11:26:53 -05:00
|
|
|
|
|
|
|
// The sut should have been re-returned,
|
|
|
|
// allowing for attempted error recovery if the caller can manage
|
|
|
|
// to produce a sequence of tokens that will be considered valid.
|
|
|
|
// `toks` above is set up already for this,
|
|
|
|
// which allows us to assert that we received back the same `sut`.
|
|
|
|
let mut sut = result.unwrap_err().0;
|
2021-12-17 10:14:31 -05:00
|
|
|
assert_eq!(Some(Ok(Parsed::Object(recovery))), sut.next());
|
tamer: xir:tree: Begin work on composable XIRT parser
The XIRT parser was initially written for test cases, so that unit tests
should assert more easily on generated token streams (XIR). While it was
planned, it wasn't clear what the eventual needs would be, which were
expected to differ. Indeed, loading everything into a generic tree
representation in memory is not appropriate---we should prefer streaming and
avoiding heap allocations when they’re not necessary, and we should parse
into an IR rather than a generic format, which ensures that the data follow
a proper grammar and are semantically valid.
When parsing attributes in an isolated context became necessary for the
aforementioned task, the state machine of the XIRT parser was modified to
accommodate. The opposite approach should have been taken---instead of
adding complexity and special cases to the parser, and from a complex parser
extracting a simple one (an attribute parser), we should be composing the
larger (full XIRT) parser from smaller ones (e.g. attribute, child
elements).
A combinator, when used in a functional sense, refers not to combinatory
logic but to the composition of more complex systems from smaller ones. The
changes made as part of this commit begin to work toward combinators, though
it's not necessarily evident yet (to you, the reader) how that'll work,
since the code for it hasn't yet been written; this is commit is simply
getting my work thusfar introduced so I can do some light refactoring before
continuing on it.
TAMER does not aim to introduce a parser combinator framework in its usual
sense---it favors, instead, striking a proper balance with Rust’s type
system that permits the convenience of combinators only in situations where
they are needed, to avoid having to write new parser
boilerplate. Specifically:
1. Rust’s type system should be used as combinators, so that parsers are
automatically constructed from the type definition.
2. Primitive parsers are written as explicit automata, not as primitive
combinators.
3. Parsing should directly produce IRs as a lowering operation below XIRT,
rather than producing XIRT itself. That is, target IRs should consume
XIRT and produce parse themselves immediately, during streaming.
In the future, if more combinators are needed, they will be added; maybe
this will eventually evolve into a more generic parser combinator framework
for TAME, but that is certainly a waste of time right now. And, to be
honest, I’m hoping that won’t be necessary.
2021-12-06 11:26:53 -05:00
|
|
|
|
|
|
|
// And so we should now be in an accepting state,
|
|
|
|
// able to finalize.
|
|
|
|
assert!(sut.finalize().is_ok());
|
|
|
|
}
|
tamer: xir::tree: Integrate AttrParserState into Stack
Note that AttrParse{r=>}State needs renaming, and Stack will get a better
name down the line too. This commit message is accurate, but confusing.
This performs the long-awaited task of trying to observe, concretely, how to
combine two automata. This has the effect of stitching together the state
machines, such that the union of the two is equivalent to the original
monolith.
The next step will be to abstract this away.
There are some important things to note here. First, this introduces a new
"dead" state concept, where here a dead state is defined as an _accepting_
state that has no state transitions for the given input token. This is more
strict than a dead state as defined in, for example, the Dragon Book, where
backtracking may occur.
The reason I chose for a Dead state to be accepting is simple: it represents
a lookahead situation. It says, "I don't know what this token is, but I've
done my job, so it may be useful in a parent context". The "I've done my
job" part is only applicable in an accepting state.
If the parser is _not_ in an accepting state, then an unknown token is
simply an error; we should _not_ try to backtrack or anything of the sort,
because we want only a single token of lookahead.
The reason this was done is because it's otherwise difficult to compose the
two parsers without requiring that AttrEnd exist in every XIR stream; this
has always been an awkward delimiter that was introduced to make the parser
LL(0), but I tried to compromise by saying that it was optional. Of course,
I knew that decision caused awkward inconsistencies, I had just hoped that
those inconsistencies wouldn't manifest in practical issues.
Well, now it did, and the benefits of AttrEnd that we had in the previous
construction do not exist in this one. Consequently, it makes more sense to
simply go from LL(0) to LL(1), which makes AttrEnd unnecessary, and a future
commit will remove it entirely.
All of this information will be documented, but I want to get further in
the implementation first to make sure I don't change course again and
therefore waste my time on docs.
DEV-11268
2021-12-16 09:44:02 -05:00
|
|
|
|
|
|
|
#[test]
|
|
|
|
fn unhandled_dead_state_results_in_error() {
|
2021-12-17 10:14:31 -05:00
|
|
|
// A Text will cause our parser to return Dead.
|
2022-03-18 16:24:53 -04:00
|
|
|
let tok = TestToken::Text(DS);
|
tamer: xir::tree: Integrate AttrParserState into Stack
Note that AttrParse{r=>}State needs renaming, and Stack will get a better
name down the line too. This commit message is accurate, but confusing.
This performs the long-awaited task of trying to observe, concretely, how to
combine two automata. This has the effect of stitching together the state
machines, such that the union of the two is equivalent to the original
monolith.
The next step will be to abstract this away.
There are some important things to note here. First, this introduces a new
"dead" state concept, where here a dead state is defined as an _accepting_
state that has no state transitions for the given input token. This is more
strict than a dead state as defined in, for example, the Dragon Book, where
backtracking may occur.
The reason I chose for a Dead state to be accepting is simple: it represents
a lookahead situation. It says, "I don't know what this token is, but I've
done my job, so it may be useful in a parent context". The "I've done my
job" part is only applicable in an accepting state.
If the parser is _not_ in an accepting state, then an unknown token is
simply an error; we should _not_ try to backtrack or anything of the sort,
because we want only a single token of lookahead.
The reason this was done is because it's otherwise difficult to compose the
two parsers without requiring that AttrEnd exist in every XIR stream; this
has always been an awkward delimiter that was introduced to make the parser
LL(0), but I tried to compromise by saying that it was optional. Of course,
I knew that decision caused awkward inconsistencies, I had just hoped that
those inconsistencies wouldn't manifest in practical issues.
Well, now it did, and the benefits of AttrEnd that we had in the previous
construction do not exist in this one. Consequently, it makes more sense to
simply go from LL(0) to LL(1), which makes AttrEnd unnecessary, and a future
commit will remove it entirely.
All of this information will be documented, but I want to get further in
the implementation first to make sure I don't change course again and
therefore waste my time on docs.
DEV-11268
2021-12-16 09:44:02 -05:00
|
|
|
let mut toks = once(tok.clone());
|
|
|
|
|
|
|
|
let mut sut = Sut::from(&mut toks);
|
|
|
|
|
|
|
|
// Our parser returns a Dead status,
|
|
|
|
// which is unhandled by any parent context
|
|
|
|
// (since we're not composing parsers),
|
|
|
|
// which causes an error due to an unhandled Dead state.
|
2022-05-25 14:20:10 -04:00
|
|
|
assert_eq!(
|
|
|
|
sut.next(),
|
|
|
|
Some(Err(ParseError::UnexpectedToken(
|
|
|
|
tok,
|
|
|
|
EchoState::default().to_string()
|
|
|
|
))),
|
|
|
|
);
|
tamer: xir::tree: Integrate AttrParserState into Stack
Note that AttrParse{r=>}State needs renaming, and Stack will get a better
name down the line too. This commit message is accurate, but confusing.
This performs the long-awaited task of trying to observe, concretely, how to
combine two automata. This has the effect of stitching together the state
machines, such that the union of the two is equivalent to the original
monolith.
The next step will be to abstract this away.
There are some important things to note here. First, this introduces a new
"dead" state concept, where here a dead state is defined as an _accepting_
state that has no state transitions for the given input token. This is more
strict than a dead state as defined in, for example, the Dragon Book, where
backtracking may occur.
The reason I chose for a Dead state to be accepting is simple: it represents
a lookahead situation. It says, "I don't know what this token is, but I've
done my job, so it may be useful in a parent context". The "I've done my
job" part is only applicable in an accepting state.
If the parser is _not_ in an accepting state, then an unknown token is
simply an error; we should _not_ try to backtrack or anything of the sort,
because we want only a single token of lookahead.
The reason this was done is because it's otherwise difficult to compose the
two parsers without requiring that AttrEnd exist in every XIR stream; this
has always been an awkward delimiter that was introduced to make the parser
LL(0), but I tried to compromise by saying that it was optional. Of course,
I knew that decision caused awkward inconsistencies, I had just hoped that
those inconsistencies wouldn't manifest in practical issues.
Well, now it did, and the benefits of AttrEnd that we had in the previous
construction do not exist in this one. Consequently, it makes more sense to
simply go from LL(0) to LL(1), which makes AttrEnd unnecessary, and a future
commit will remove it entirely.
All of this information will be documented, but I want to get further in
the implementation first to make sure I don't change course again and
therefore waste my time on docs.
DEV-11268
2021-12-16 09:44:02 -05:00
|
|
|
}
|
2022-05-18 16:06:48 -04:00
|
|
|
|
|
|
|
// A context can be both retrieved from a finished parser and provided
|
|
|
|
// to a new one.
|
|
|
|
#[test]
|
|
|
|
fn provide_and_retrieve_context() {
|
|
|
|
// First, verify that it's initialized to a default context.
|
|
|
|
let mut toks = vec![TestToken::MarkDone(DS)].into_iter();
|
|
|
|
let mut sut = Sut::from(&mut toks);
|
|
|
|
sut.next().unwrap().unwrap();
|
|
|
|
let ctx = sut.finalize().unwrap();
|
|
|
|
assert_eq!(ctx, Default::default());
|
|
|
|
|
|
|
|
// Next, verify that the context that is manipulated is the context
|
|
|
|
// that is returned to us.
|
|
|
|
let val = 5;
|
|
|
|
let mut toks = vec![TestToken::SetCtxVal(5)].into_iter();
|
|
|
|
let mut sut = Sut::from(&mut toks);
|
|
|
|
sut.next().unwrap().unwrap();
|
|
|
|
let ctx = sut.finalize().unwrap();
|
|
|
|
assert_eq!(ctx, StubContext { val });
|
|
|
|
|
|
|
|
// Finally, verify that the context provided is the context that is
|
|
|
|
// used.
|
|
|
|
let val = 10;
|
|
|
|
let given_ctx = StubContext { val };
|
|
|
|
let mut toks = vec![TestToken::MarkDone(DS)].into_iter();
|
|
|
|
let mut sut = EchoState::parse_with_context(&mut toks, given_ctx);
|
|
|
|
sut.next().unwrap().unwrap();
|
|
|
|
let ctx = sut.finalize().unwrap();
|
|
|
|
assert_eq!(ctx, StubContext { val });
|
|
|
|
}
|
tamer: xir:tree: Begin work on composable XIRT parser
The XIRT parser was initially written for test cases, so that unit tests
should assert more easily on generated token streams (XIR). While it was
planned, it wasn't clear what the eventual needs would be, which were
expected to differ. Indeed, loading everything into a generic tree
representation in memory is not appropriate---we should prefer streaming and
avoiding heap allocations when they’re not necessary, and we should parse
into an IR rather than a generic format, which ensures that the data follow
a proper grammar and are semantically valid.
When parsing attributes in an isolated context became necessary for the
aforementioned task, the state machine of the XIRT parser was modified to
accommodate. The opposite approach should have been taken---instead of
adding complexity and special cases to the parser, and from a complex parser
extracting a simple one (an attribute parser), we should be composing the
larger (full XIRT) parser from smaller ones (e.g. attribute, child
elements).
A combinator, when used in a functional sense, refers not to combinatory
logic but to the composition of more complex systems from smaller ones. The
changes made as part of this commit begin to work toward combinators, though
it's not necessarily evident yet (to you, the reader) how that'll work,
since the code for it hasn't yet been written; this is commit is simply
getting my work thusfar introduced so I can do some light refactoring before
continuing on it.
TAMER does not aim to introduce a parser combinator framework in its usual
sense---it favors, instead, striking a proper balance with Rust’s type
system that permits the convenience of combinators only in situations where
they are needed, to avoid having to write new parser
boilerplate. Specifically:
1. Rust’s type system should be used as combinators, so that parsers are
automatically constructed from the type definition.
2. Primitive parsers are written as explicit automata, not as primitive
combinators.
3. Parsing should directly produce IRs as a lowering operation below XIRT,
rather than producing XIRT itself. That is, target IRs should consume
XIRT and produce parse themselves immediately, during streaming.
In the future, if more combinators are needed, they will be added; maybe
this will eventually evolve into a more generic parser combinator framework
for TAME, but that is certainly a waste of time right now. And, to be
honest, I’m hoping that won’t be necessary.
2021-12-06 11:26:53 -05:00
|
|
|
}
|