2021-09-08 13:53:47 -04:00
|
|
|
// XIR tree representation
|
|
|
|
//
|
|
|
|
// Copyright (C) 2014-2021 Ryan Specialty Group, LLC.
|
|
|
|
//
|
|
|
|
// This file is part of TAME.
|
|
|
|
//
|
|
|
|
// This program is free software: you can redistribute it and/or modify
|
|
|
|
// it under the terms of the GNU General Public License as published by
|
|
|
|
// the Free Software Foundation, either version 3 of the License, or
|
|
|
|
// (at your option) any later version.
|
|
|
|
//
|
|
|
|
// This program is distributed in the hope that it will be useful,
|
|
|
|
// but WITHOUT ANY WARRANTY; without even the implied warranty of
|
|
|
|
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
|
|
|
// GNU General Public License for more details.
|
|
|
|
//
|
|
|
|
// You should have received a copy of the GNU General Public License
|
|
|
|
// along with this program. If not, see <http://www.gnu.org/licenses/>.
|
|
|
|
|
2021-09-21 10:43:23 -04:00
|
|
|
//! XIR token stream parsed into a tree-based IR (XIRT).
|
2021-09-08 13:53:47 -04:00
|
|
|
//!
|
|
|
|
//! **This is a work-in-progress implementation.**
|
|
|
|
//! It will be augmented only as needed.
|
|
|
|
//!
|
2021-12-13 15:00:34 -05:00
|
|
|
//! Parsing is handled by [`Stack::parse_token`].
|
2021-09-08 13:53:47 -04:00
|
|
|
//! An [`Iterator::scan`]-based parser can be constructed using
|
tamer: xir:tree: Begin work on composable XIRT parser
The XIRT parser was initially written for test cases, so that unit tests
should assert more easily on generated token streams (XIR). While it was
planned, it wasn't clear what the eventual needs would be, which were
expected to differ. Indeed, loading everything into a generic tree
representation in memory is not appropriate---we should prefer streaming and
avoiding heap allocations when they’re not necessary, and we should parse
into an IR rather than a generic format, which ensures that the data follow
a proper grammar and are semantically valid.
When parsing attributes in an isolated context became necessary for the
aforementioned task, the state machine of the XIRT parser was modified to
accommodate. The opposite approach should have been taken---instead of
adding complexity and special cases to the parser, and from a complex parser
extracting a simple one (an attribute parser), we should be composing the
larger (full XIRT) parser from smaller ones (e.g. attribute, child
elements).
A combinator, when used in a functional sense, refers not to combinatory
logic but to the composition of more complex systems from smaller ones. The
changes made as part of this commit begin to work toward combinators, though
it's not necessarily evident yet (to you, the reader) how that'll work,
since the code for it hasn't yet been written; this commit is simply
getting my work thus far introduced so I can do some light refactoring before
continuing on it.
TAMER does not aim to introduce a parser combinator framework in its usual
sense---it favors, instead, striking a proper balance with Rust’s type
system that permits the convenience of combinators only in situations where
they are needed, to avoid having to write new parser
boilerplate. Specifically:
1. Rust’s type system should be used as combinators, so that parsers are
automatically constructed from the type definition.
2. Primitive parsers are written as explicit automata, not as primitive
combinators.
3. Parsing should directly produce IRs as a lowering operation below XIRT,
rather than producing XIRT itself. That is, target IRs should consume
XIRT and produce parse themselves immediately, during streaming.
In the future, if more combinators are needed, they will be added; maybe
this will eventually evolve into a more generic parser combinator framework
for TAME, but that is certainly a waste of time right now. And, to be
honest, I’m hoping that won’t be necessary.
2021-12-06 11:26:53 -05:00
|
|
|
//! [`parser_from`] or [`parse`][parse()].
|
2021-09-08 13:53:47 -04:00
|
|
|
//!
|
|
|
|
//! ```
|
2021-12-13 15:00:34 -05:00
|
|
|
//! use tamer::xir::tree::{Stack, parse, parser_from};
|
2021-11-04 16:12:15 -04:00
|
|
|
//!# use tamer::xir::Token;
|
2021-09-08 13:53:47 -04:00
|
|
|
//!
|
2021-09-23 14:52:53 -04:00
|
|
|
//!# let token_stream: std::vec::IntoIter<Token> = vec![].into_iter();
|
2021-09-09 13:05:11 -04:00
|
|
|
//! // Lazily parse a stream of XIR tokens as an iterator, yielding the next
|
|
|
|
//! // fully parsed object. This may consume any number of tokens.
|
2021-09-08 13:53:47 -04:00
|
|
|
//! let parser = parser_from(token_stream);
|
2021-09-09 13:05:11 -04:00
|
|
|
//!
|
2021-09-23 14:52:53 -04:00
|
|
|
//!# let token_stream: std::vec::IntoIter<Token> = vec![].into_iter();
|
2021-09-09 13:05:11 -04:00
|
|
|
//! // Consume a single token at a time, yielding either an incomplete state
|
|
|
|
//! // or the next parsed object.
|
2021-12-14 12:36:35 -05:00
|
|
|
//! let parser = parse(token_stream);
|
2021-09-08 13:53:47 -04:00
|
|
|
//! ```
|
|
|
|
//!
|
2021-09-09 13:05:11 -04:00
|
|
|
//! `parser_from` Or `parse`?
|
|
|
|
//! =========================
|
2021-12-06 11:26:53 -05:00
|
|
|
//! [`parser_from`] is implemented in terms of [`parse`][parse()].
|
2021-09-09 13:05:11 -04:00
|
|
|
//! They have slightly different use cases and tradeoffs:
|
|
|
|
//!
|
2021-12-06 11:26:53 -05:00
|
|
|
//! [`parse`][parse()] yields a [`Result`] containing [`Parsed`],
|
2021-12-13 14:08:16 -05:00
|
|
|
//! which _may_ contain a [`Parsed::Object`],
|
2021-09-09 13:05:11 -04:00
|
|
|
//! but it's more likely to contain [`Parsed::Incomplete`];
|
|
|
|
//! this is because it typically takes multiple [`Token`]s to complete
|
|
|
|
//! parsing within a given context.
|
|
|
|
//!
|
|
|
|
//! In return, though, you get some important guarantees:
|
|
|
|
//!
|
2021-12-06 11:26:53 -05:00
|
|
|
//! 1. [`parse`][parse()] consumes only a _single_ token; and
|
2021-09-09 13:05:11 -04:00
|
|
|
//! 2. It has a constant upper bound for execution time.
|
|
|
|
//!
|
2021-12-06 11:26:53 -05:00
|
|
|
//! This means that [`parse`][parse()] will never cause the system to
|
|
|
|
//! hang---you
|
|
|
|
//! are in complete control over how much progress parsing makes,
|
|
|
|
//! and are free to stop and resume it at any time.
|
2021-09-09 13:05:11 -04:00
|
|
|
//!
|
|
|
|
//! However,
|
|
|
|
//! if you do not care about those things,
|
|
|
|
//! working with [`Parsed`] is verbose and inconvenient;
|
|
|
|
//! sometimes you just want the next [`Tree`] object.
|
|
|
|
//! For this,
|
|
|
|
//! we have [`parser_from`],
|
|
|
|
//! which does two things:
|
|
|
|
//!
|
|
|
|
//! 1. It filters out all [`Parsed::Incomplete`]; and
|
2021-12-13 14:08:16 -05:00
|
|
|
//! 2. On [`Parsed::Object`],
|
2021-12-14 12:44:32 -05:00
|
|
|
//! it yields the inner [`Tree`].
|
2021-09-09 13:05:11 -04:00
|
|
|
//!
|
|
|
|
//! This is a much more convenient API,
|
|
|
|
//! but is not without its downsides:
|
|
|
|
//! if the context is large
|
|
|
|
//! (e.g. the root node of a large XML document),
|
|
|
|
//! parsing can take a considerable amount of time,
|
|
|
|
//! and the [`Iterator`] produced by [`parser_from`] will cause the
|
|
|
|
//! system to process [`Iterator::next`] for that entire duration.
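The relationship between the two entry points can be sketched in isolation. The following is a minimal illustration only: `Token`, `Parsed`, `State`, `parse_each`, and `objects_from` are hypothetical stand-ins invented for this sketch and are not the types or functions defined in this module. It assumes a token-at-a-time parser driven by `Iterator::scan`, with a convenience wrapper that filters out the incomplete states.

```rust
// Hypothetical, simplified stand-ins; these are NOT TAMER's actual types.
#[derive(Debug, PartialEq, Clone)]
enum Token {
    Open(&'static str),
    Text(&'static str),
    Close,
}

#[derive(Debug, PartialEq)]
enum Parsed {
    /// More tokens are needed before an object can be emitted.
    Incomplete,
    /// A complete object: an element name and its text children.
    Object(String, Vec<String>),
}

/// Parser state: the element under construction, if any.
#[derive(Debug, Default)]
struct State(Option<(String, Vec<String>)>);

impl State {
    /// Consume exactly one token; this never loops over the input.
    fn parse_token(&mut self, tok: Token) -> Result<Parsed, String> {
        match (self.0.take(), tok) {
            (None, Token::Open(name)) => {
                self.0 = Some((name.to_string(), Vec::new()));
                Ok(Parsed::Incomplete)
            }
            (Some((name, mut children)), Token::Text(text)) => {
                children.push(text.to_string());
                self.0 = Some((name, children));
                Ok(Parsed::Incomplete)
            }
            (Some((name, children)), Token::Close) => {
                Ok(Parsed::Object(name, children))
            }
            (_, tok) => Err(format!("unexpected token: {:?}", tok)),
        }
    }
}

/// Analogue of `parse`: yields one result per input token.
fn parse_each(
    toks: impl Iterator<Item = Token>,
) -> impl Iterator<Item = Result<Parsed, String>> {
    toks.scan(State::default(), |st, tok| Some(st.parse_token(tok)))
}

/// Analogue of `parser_from`: hides `Incomplete` and yields only
/// completed objects (or errors).
fn objects_from(
    toks: impl Iterator<Item = Token>,
) -> impl Iterator<Item = Result<(String, Vec<String>), String>> {
    parse_each(toks).filter_map(|res| match res {
        Ok(Parsed::Incomplete) => None,
        Ok(Parsed::Object(name, children)) => Some(Ok((name, children))),
        Err(e) => Some(Err(e)),
    })
}
```

In this sketch, a caller of `parse_each` sees one item per token and can stop between any two of them, whereas a single call to `next` on `objects_from` may pull many tokens before it returns.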
|
|
|
|
//!
|
2021-12-13 16:20:50 -05:00
|
|
|
//! See also [`attr_parser_from`] for parsing only attributes partway
|
|
|
|
//! through a token stream.
|
2021-11-04 10:52:16 -04:00
|
|
|
//!
|
2021-12-13 14:29:16 -05:00
|
|
|
//! [`Parsed::Incomplete`]: parse::Parsed::Incomplete
|
|
|
|
//! [`Parsed::Object`]: parse::Parsed::Object
|
|
|
|
//!
|
2021-09-08 13:53:47 -04:00
|
|
|
//! Cost of Parsing
|
|
|
|
//! ===============
|
|
|
|
//! While [`Tree`] is often much easier to work with than a stream of
|
|
|
|
//! [`Token`],
|
|
|
|
//! there are notable downsides:
|
|
|
|
//!
|
|
|
|
//! - The context in which parsing began
|
2021-09-13 09:47:39 -04:00
|
|
|
//! (see _Parser Implementation_ below)
|
|
|
|
//! must complete before _any_ token is emitted.
|
2021-09-08 13:53:47 -04:00
|
|
|
//! If parsing begins at the root element,
|
2021-09-13 09:47:39 -04:00
|
|
|
//! this means that the _entire XML document_ must be loaded into
|
2021-09-08 13:53:47 -04:00
|
|
|
//! memory before it is available for use.
|
|
|
|
//! - While the token stream is capable of operating using constant memory
|
|
|
|
//! (since [`Token`] can be discarded after being consumed),
|
|
|
|
//! a [`Tree`] holds a significant amount of data in memory.
|
|
|
|
//!
|
|
|
|
//! It is recommended to parse into [`Tree`] only for the portions of the
|
|
|
|
//! XML document that will benefit from it.
|
|
|
|
//! For example,
|
|
|
|
//! by avoiding parsing of the root element into a tree,
|
|
|
|
//! you can emit [`Tree`] for child elements without having to wait for
|
|
|
|
//! the entire document to be parsed.
|
|
|
|
//!
|
|
|
|
//!
|
|
|
|
//! Validity Of Token Stream
|
|
|
|
//! ========================
|
|
|
|
//! XIR verifies that each [`Token`] is syntactically valid and follows an
|
|
|
|
//! XML grammar subset;
|
|
|
|
//! as such,
|
|
|
|
//! the tree parser does not concern itself with syntax analysis.
|
|
|
|
//! It does,
|
|
|
|
//! however,
|
|
|
|
//! perform _[semantic analysis]_ on the token stream.
|
|
|
|
//! Given that,
|
2021-12-13 15:00:34 -05:00
|
|
|
//! [`Stack::parse_token`] returns a [`Result`],
|
2021-12-13 15:27:20 -05:00
|
|
|
//! with parsing errors represented by this module's [`StackError`].
|
2021-09-08 13:53:47 -04:00
|
|
|
//!
|
|
|
|
//! As an example,
|
|
|
|
//! an XIR token stream permits unbalanced tags.
|
|
|
|
//! However,
|
|
|
|
//! we cannot represent an invalid tree,
|
|
|
|
//! so that would result in a semantic error.
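As a rough illustration of the kind of semantic check described above, the sketch below rejects a stream whose tags do not balance. The `Token` and `BalanceError` types and the `check_balanced` function are hypothetical and exist only for this example; the module's real tokens and its `StackError` differ.

```rust
// Hypothetical token and error types, for illustration only.
#[derive(Debug, PartialEq)]
enum Token {
    Open(&'static str),
    Close(&'static str),
}

#[derive(Debug, PartialEq)]
enum BalanceError {
    /// A closing tag did not match the innermost open element.
    Unbalanced { open: &'static str, close: &'static str },
    /// A closing tag appeared while no element was open.
    UnexpectedClose(&'static str),
    /// The stream ended while an element was still open.
    Unclosed(&'static str),
}

/// Walk the stream once, tracking the names of the open elements.
fn check_balanced(
    toks: impl IntoIterator<Item = Token>,
) -> Result<(), BalanceError> {
    let mut open_names: Vec<&'static str> = Vec::new();

    for tok in toks {
        match tok {
            Token::Open(name) => open_names.push(name),
            Token::Close(close) => match open_names.pop() {
                Some(open) if open == close => {}
                Some(open) => {
                    return Err(BalanceError::Unbalanced { open, close })
                }
                None => return Err(BalanceError::UnexpectedClose(close)),
            },
        }
    }

    // Anything left on the stack was never closed.
    match open_names.pop() {
        Some(name) => Err(BalanceError::Unclosed(name)),
        None => Ok(()),
    }
}
```

Each token here is individually well formed, so the failure is only detectable by tracking context across tokens, which is why it is treated as a semantic rather than a syntactic error.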
|
|
|
|
//!
|
|
|
|
//! [semantic analysis]: https://en.wikipedia.org/wiki/Semantic_analysis_(compilers)
|
|
|
|
//!
|
|
|
|
//!
|
|
|
|
//! Parser Implementation
|
|
|
|
//! =====================
|
|
|
|
//! The parser that lowers the XIR [`Token`] stream into a [`Tree`]
|
2021-12-13 15:00:34 -05:00
|
|
|
//! is implemented on [`Stack`].
|
2021-09-08 13:53:47 -04:00
|
|
|
//!
|
|
|
|
//! This parser is a [stack machine],
|
|
|
|
//! where the stack represents the [`Tree`] that is under construction.
|
2021-09-13 09:47:39 -04:00
|
|
|
//! Parsing operates on _context_.
|
2021-09-09 14:39:53 -04:00
|
|
|
//! At present, the only parsing context is an element---it
|
2021-09-08 13:53:47 -04:00
|
|
|
//! begins parsing at an opening tag ([`Token::Open`]) and completes
|
2021-09-13 12:58:54 -04:00
|
|
|
//! parsing at a _matching_ [`Token::Close`].
|
2021-09-08 13:53:47 -04:00
|
|
|
//! All attributes and child nodes encountered during parsing of an element
|
|
|
|
//! will automatically be added to the appropriate element,
|
|
|
|
//! recursively.
|
|
|
|
//!
|
|
|
|
//! [stack machine]: https://en.wikipedia.org/wiki/Stack_machine
|
|
|
|
//!
|
|
|
|
//! State Machine With A Typed Stack
|
|
|
|
//! --------------------------------
|
|
|
|
//! The parser is a [finite-state machine (FSM)][state machine] with a
|
|
|
|
//! stack encoded in variants of [`Stack`],
|
|
|
|
//! where each variant represents the current state of the parser.
|
|
|
|
//! The parser cannot be reasoned about as a pushdown automaton because the
|
|
|
|
//! language of the [`Stack`] is completely arbitrary,
|
|
|
|
//! but it otherwise operates in a similar manner.
|
|
|
|
//!
|
|
|
|
//! Each state transition consumes the entire stack and produces a new one,
|
|
|
|
//! which may be identical.
|
|
|
|
//! Intuitively, though, based on the construction of [`Stack`],
|
|
|
|
//! this is equivalent to popping the needed data off of the stack and
|
|
|
|
//! optionally pushing additional information.
|
|
|
|
//!
|
|
|
|
//! By encoding the stack in [`Stack`] variants,
|
|
|
|
//! we are able to verify statically that the stack is always in a valid
|
|
|
|
//! state and contains expected data---that
|
|
|
|
//! is, our stack is fully type-safe.
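The idea of a stack encoded in enum variants can be sketched as follows. This is an assumption-laden miniature: the variant names, their payloads, the `Attrs` alias, and the `step` method are invented for illustration and do not mirror the actual `Stack` definition. It shows only the general pattern of consuming the whole state on each transition and returning a new one.

```rust
// Hypothetical sketch; variant names and payloads are NOT those of
// the real `Stack`.
#[derive(Debug, PartialEq)]
enum Token {
    Open(&'static str),
    AttrName(&'static str),
    AttrValue(&'static str),
    Close,
}

type Attrs = Vec<(&'static str, &'static str)>;

#[derive(Debug, PartialEq)]
enum Stack {
    /// Nothing is being parsed.
    Empty,
    /// An element is open; holds its name and completed attributes.
    Element(&'static str, Attrs),
    /// An attribute name has been read and awaits its value.
    AttrName(&'static str, Attrs, &'static str),
    /// The element was closed and is complete.
    Done(&'static str, Attrs),
}

impl Stack {
    /// Consume the entire stack and one token, producing the next stack.
    ///
    /// Each arm can only access data that its variant is known to hold,
    /// so there is no way to "pop" a value that is not there.
    fn step(self, tok: Token) -> Result<Stack, String> {
        match (self, tok) {
            (Stack::Empty, Token::Open(name)) => {
                Ok(Stack::Element(name, Vec::new()))
            }
            (Stack::Element(el, attrs), Token::AttrName(name)) => {
                Ok(Stack::AttrName(el, attrs, name))
            }
            (Stack::AttrName(el, mut attrs, name), Token::AttrValue(value)) => {
                attrs.push((name, value));
                Ok(Stack::Element(el, attrs))
            }
            (Stack::Element(el, attrs), Token::Close) => {
                Ok(Stack::Done(el, attrs))
            }
            (stack, tok) => {
                Err(format!("unexpected {:?} in state {:?}", tok, stack))
            }
        }
    }
}
```

Because an attribute name only ever exists inside the `AttrName` variant, the compiler rules out a state in which a value arrives with no pending name; that is the sense in which such a stack can be checked statically.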
|
|
|
|
//!
|
|
|
|
//! [state machine]: https://en.wikipedia.org/wiki/Finite-state_machine
|
|
|
|
|
2021-12-06 11:26:53 -05:00
|
|
|
mod attr;
|
tamer: xir::tree: Integrate AttrParserState into Stack
Note that AttrParse{r=>}State needs renaming, and Stack will get a better
name down the line too. This commit message is accurate, but confusing.
This performs the long-awaited task of trying to observe, concretely, how to
combine two automata. This has the effect of stitching together the state
machines, such that the union of the two is equivalent to the original
monolith.
The next step will be to abstract this away.
There are some important things to note here. First, this introduces a new
"dead" state concept, where here a dead state is defined as an _accepting_
state that has no state transitions for the given input token. This is more
strict than a dead state as defined in, for example, the Dragon Book, where
backtracking may occur.
The reason I chose for a Dead state to be accepting is simple: it represents
a lookahead situation. It says, "I don't know what this token is, but I've
done my job, so it may be useful in a parent context". The "I've done my
job" part is only applicable in an accepting state.
If the parser is _not_ in an accepting state, then an unknown token is
simply an error; we should _not_ try to backtrack or anything of the sort,
because we want only a single token of lookahead.
The reason this was done is because it's otherwise difficult to compose the
two parsers without requiring that AttrEnd exist in every XIR stream; this
has always been an awkward delimiter that was introduced to make the parser
LL(0), but I tried to compromise by saying that it was optional. Of course,
I knew that decision caused awkward inconsistencies, I had just hoped that
those inconsistencies wouldn't manifest in practical issues.
Well, now it did, and the benefits of AttrEnd that we had in the previous
construction do not exist in this one. Consequently, it makes more sense to
simply go from LL(0) to LL(1), which makes AttrEnd unnecessary, and a future
commit will remove it entirely.
All of this information will be documented, but I want to get further in
the implementation first to make sure I don't change course again and
therefore waste my time on docs.
DEV-11268
2021-12-16 09:44:02 -05:00
|
|
|
pub mod parse;
|
2021-12-06 11:26:53 -05:00
|
|
|
|
2021-12-13 16:57:04 -05:00
|
|
|
use self::{
|
2021-12-17 10:22:05 -05:00
|
|
|
attr::{AttrParseError, AttrParseState},
|
2021-12-16 09:44:02 -05:00
|
|
|
parse::{
|
|
|
|
ParseError, ParseResult, ParseState, ParseStateResult, ParseStatus,
|
|
|
|
ParsedResult,
|
|
|
|
},
|
2021-12-13 16:57:04 -05:00
|
|
|
};
|
tamer: xir::tree::attr_parser_from: Integrate AttrParser
This begins to integrate the isolated AttrParser. The next step will be
integrating it into the larger XIRT parser.
There's been considerable delay in getting this committed, because I went
through quite the struggle with myself trying to determine what balance I
want to strike between Rust's type system; convenience with parser
combinators; iterators; and various other abstractions. I ended up being
confounded by trying to maintain the current XmloReader abstraction, which
is fundamentally incompatible with the way the new parsing system
works (streaming iterators that do not collect or perform heap
allocations).
There'll be more information on this to come, but there are certain things
that will be changing.
There are a few problems highlighted by this commit (not in code, but
conceptually):
1. Introducing Option here for the TokenParserState doesn't feel right, in
the sense that the abstraction is inappropriate. We should perhaps
introduce a new variant Parsed::Done or something to indicate intent,
rather than leaving the reader to have to read about what None actually
means.
2. This turns Parsed into more of a statement influencing control
flow/logic, and so should be encapsulated, with an external equivalent
of Parsed that omits variants that ought to remain encapsulated.
3. The name TokenStreamState is accurate, but these really are the actual parsers;
TokenStreamParser is more of a coordinator, and helps to abstract away
some of the common logic so lower-level parsers do not have to worry
about it. But calling it TokenStreamState is both a bit
confusing and an understatement---it _does_ hold the state, but it
also holds the current parsing stack in its variants.
Another thing that is not yet entirely clear is whether this AttrParser
ought to care about detection of duplicate attributes, or if that should be
done in a separate parser, perhaps even at the XIR level. The same can be
said for checking for balanced tags. By pushing it to TokenStream in XIR,
we would get a guaranteed check regardless of what parsers are used, which
is attractive because it reduces the (almost certain-to-otherwise-occur)
risk that individual parsers will not sufficiently check for semantically
valid XML. But it does _potentially_ make error recovery more
complicated. At the same time, perhaps more specific parsers ought not
care about recovery at that level.
Anyway, point being, more to come, but I am disappointed by how much time I'm
spending considering parsing, given that there are so many things I need to
move onto. I just want this done right and in a way that feels like it's
working well with Rust while it's all in working memory, otherwise it's
going to be a significant effort to get back into.
DEV-11268
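The first problem above (a bare None standing in for "parsing is finished") could be sketched roughly as follows. This is only an illustration of the idea of an explicit Done variant; the `Parsed` enum here is a simplified stand-in and `drain` is a hypothetical helper, neither of which is taken from TAME's actual code.

```rust
// Hypothetical sketch: a simplified `Parsed` with an explicit `Done`
// variant, so that the reader does not have to infer intent from `None`.
// These definitions are assumptions for illustration only.

#[derive(Debug, PartialEq, Eq)]
enum Parsed<T> {
    /// More tokens are needed before an object can be emitted.
    Incomplete,
    /// An object was produced.
    Object(T),
    /// The parser has finished and expects no further input.
    Done,
}

/// Collect emitted objects until the parser signals that it is done.
/// Anything after `Done` is left unconsumed.
fn drain<T>(steps: impl IntoIterator<Item = Parsed<T>>) -> Vec<T> {
    let mut out = Vec::new();

    for step in steps {
        match step {
            Parsed::Incomplete => continue,
            Parsed::Object(obj) => out.push(obj),
            Parsed::Done => break,
        }
    }

    out
}
```

With a variant like this, the intent is visible at the match site, and a coordinator can decide separately which variants to expose to outside callers.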
2021-12-10 14:13:02 -05:00
|
|
|
|
2021-11-15 23:47:14 -05:00
|
|
|
use super::{QName, Token, TokenResultStream, TokenStream};
|
2021-12-14 12:36:35 -05:00
|
|
|
use crate::{span::Span, sym::SymbolId};
|
tamer: xir::tree: Integrate AttrParserState into Stack
Note that AttrParse{r=>}State needs renaming, and Stack will get a better
name down the line too. This commit message is accurate, but confusing.
This performs the long-awaited task of trying to observe, concretely, how to
combine two automata. This has the effect of stitching together the state
machines, such that the union of the two is equivalent to the original
monolith.
The next step will be to abstract this away.
There are some important things to note here. First, this introduces a new
"dead" state concept, where here a dead state is defined as an _accepting_
state that has no state transitions for the given input token. This is more
strict than a dead state as defined in, for example, the Dragon Book, where
backtracking may occur.
The reason I chose for a Dead state to be accepting is simple: it represents
a lookahead situation. It says, "I don't know what this token is, but I've
done my job, so it may be useful in a parent context". The "I've done my
job" part is only applicable in an accepting state.
If the parser is _not_ in an accepting state, then an unknown token is
simply an error; we should _not_ try to backtrack or anything of the sort,
because we want only a single token of lookahead.
This was done because it is otherwise difficult to compose the
two parsers without requiring that AttrEnd exist in every XIR stream; this
has always been an awkward delimiter that was introduced to make the parser
LL(0), but I tried to compromise by saying that it was optional. Of course,
I knew that decision caused awkward inconsistencies; I had just hoped that
those inconsistencies wouldn't manifest in practical issues.
Well, now they have, and the benefits of AttrEnd that we had in the previous
construction do not exist in this one. Consequently, it makes more sense to
simply go from LL(0) to LL(1), which makes AttrEnd unnecessary, and a future
commit will remove it entirely.
All of this information will be documented, but I want to get further in
the implementation first to make sure I don't change course again and
therefore waste my time on docs.
DEV-11268
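The "dead state" rule described in this message can be illustrated with a toy attribute parser. This is a minimal sketch under stated assumptions: `Tok`, `AttrState`, `Step` and `attr_step` are hypothetical names invented for the example and are not TAME's types or functions.

```rust
// Hypothetical sketch; these names are illustrative and are not TAME's API.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Tok {
    AttrName,
    AttrValue,
    Close,
}

/// States of a toy attribute parser.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AttrState {
    /// Accepting: no attribute is partially parsed.
    Ready,
    /// Not accepting: a name was read, so a value must follow.
    NeedValue,
}

/// Result of feeding one token to the toy parser.
#[derive(Debug, PartialEq, Eq)]
enum Step {
    /// Token consumed; continue in the given state.
    Next(AttrState),
    /// Accepting state with no transition for this token:
    /// hand the token back to the parent as one token of lookahead.
    Dead(Tok),
    /// Non-accepting state with no transition: an error, no backtracking.
    Unexpected(Tok),
}

fn attr_step(state: AttrState, tok: Tok) -> Step {
    match (state, tok) {
        (AttrState::Ready, Tok::AttrName) => Step::Next(AttrState::NeedValue),
        (AttrState::NeedValue, Tok::AttrValue) => Step::Next(AttrState::Ready),
        // "I've done my job, so it may be useful in a parent context."
        (AttrState::Ready, other) => Step::Dead(other),
        // Mid-attribute, an unknown token can only be an error.
        (AttrState::NeedValue, other) => Step::Unexpected(other),
    }
}
```

A parent parser receiving `Step::Dead(tok)` would resume its own transitions with `tok`, which is why no explicit end-of-attributes delimiter is needed in this sketch.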
2021-12-16 09:44:02 -05:00
|
|
|
use std::{error::Error, fmt::Display, mem::take, result};
|
2021-09-08 13:53:47 -04:00
|
|
|
|
2021-12-06 14:26:58 -05:00
|
|
|
pub use attr::{Attr, AttrList};
|
2021-09-21 10:43:23 -04:00
|
|
|
|
2021-12-14 12:44:32 -05:00
|
|
|
type Parsed = parse::Parsed<Tree>;
|
2021-12-13 14:29:16 -05:00
|
|
|
|
2021-09-21 10:43:23 -04:00
|
|
|
/// A XIR tree (XIRT).
|
2021-09-08 13:53:47 -04:00
|
|
|
///
/// This object represents a XIR token stream parsed into a tree
/// representation.
/// This representation is easier to process and manipulate in most contexts,
/// but also requires memory allocation for the entire tree and requires
/// that a potentially significant portion of a token stream be processed
/// (e.g. from start to end tag for a given element).
///
/// _Note that this implementation is incomplete!_
/// It will be augmented as needed.
///
/// For more information,
/// see the [module-level documentation](self).
|
2021-09-13 10:43:33 -04:00
|
|
|
#[derive(Debug, Clone, Eq, PartialEq)]
|
2021-09-23 14:52:53 -04:00
|
|
|
pub enum Tree {
|
2021-09-08 13:53:47 -04:00
|
|
|
/// XML element.
|
2021-09-23 14:52:53 -04:00
|
|
|
Element(Element),
|
2021-09-28 14:52:31 -04:00
|
|
|
|
2021-10-08 16:16:33 -04:00
|
|
|
    /// Text node.
    ///
    /// A text node cannot contain other [`Tree`] elements;
    /// sibling text nodes must exist within an [`Element`].
|
2021-11-15 23:47:14 -05:00
|
|
|
Text(SymbolId, Span),
|
2021-10-08 16:16:33 -04:00
|
|
|
|
2021-09-28 14:52:31 -04:00
|
|
|
    /// This variant exists purely because `#[non_exhaustive]` has no effect
    /// within the crate.
    ///
    /// This ensures that matches must account for other variants that will
    /// be introduced in the future,
    /// easing the maintenance burden
    /// (for both implementation and unit tests).
    _NonExhaustive,
}

impl Into<Option<Element>> for Tree {
    #[inline]
    fn into(self) -> Option<Element> {
        match self {
            Self::Element(ele) => Some(ele),
            _ => None,
        }
    }
|
2021-09-09 14:39:53 -04:00
|
|
|
}
|
2021-09-08 13:53:47 -04:00
|
|
|
|
2021-11-15 23:47:14 -05:00
|
|
|
impl Into<Option<SymbolId>> for Tree {
|
2021-10-08 16:16:33 -04:00
|
|
|
#[inline]
|
2021-11-15 23:47:14 -05:00
|
|
|
fn into(self) -> Option<SymbolId> {
|
2021-10-08 16:16:33 -04:00
|
|
|
        match self {
            Self::Text(text, _) => Some(text),
            _ => None,
        }
    }
}
|
|
|
|
|
2021-09-23 14:52:53 -04:00
|
|
|
impl Tree {
|
2021-09-28 14:52:31 -04:00
|
|
|
    /// Yield a reference to the inner value if it is an [`Element`],
    /// otherwise [`None`].
    #[inline]
    pub fn as_element(&self) -> Option<&Element> {
|
2021-09-13 10:43:33 -04:00
|
|
|
match self {
|
|
|
|
Self::Element(ele) => Some(ele),
|
2021-09-28 14:52:31 -04:00
|
|
|
_ => None,
|
2021-09-13 10:43:33 -04:00
|
|
|
}
|
|
|
|
}
|
2021-09-28 14:52:31 -04:00
|
|
|
|
|
|
|
    /// Yield the inner value if it is an [`Element`],
    /// otherwise [`None`].
    #[inline]
    pub fn into_element(self) -> Option<Element> {
        self.into()
    }

    /// Whether the inner value is an [`Element`].
    #[inline]
    pub fn is_element(&self) -> bool {
        matches!(self, Self::Element(_))
    }
|
2021-10-08 16:16:33 -04:00
|
|
|
|
2021-11-15 23:47:14 -05:00
|
|
|
    /// Yield the symbol ([`SymbolId`]) holding the text of this node
    /// if it is a [`Tree::Text`],
    /// otherwise [`None`].
    ///
    /// This is incomplete.
|
2021-10-08 16:16:33 -04:00
|
|
|
#[inline]
|
2021-11-15 23:47:14 -05:00
|
|
|
pub fn as_sym(&self) -> Option<SymbolId> {
|
2021-10-08 16:16:33 -04:00
|
|
|
match self {
|
2021-11-15 23:47:14 -05:00
|
|
|
Self::Text(sym, ..) => Some(*sym),
|
2021-10-08 16:16:33 -04:00
|
|
|
_ => None,
|
|
|
|
}
|
|
|
|
}
|
2021-09-13 10:43:33 -04:00
|
|
|
}
|
|
|
|
|
2021-09-08 13:53:47 -04:00
|
|
|
/// Element node.
///
/// This represents an [XML element] beginning with an opening tag that is
/// either self-closing or ending with a balanced closing tag.
/// The two spans together represent the span of the entire element with all
/// its constituents.
///
/// [XML element]: https://www.w3.org/TR/REC-xml/#sec-starttags
|
2021-09-13 10:43:33 -04:00
|
|
|
#[derive(Debug, Clone, Eq, PartialEq)]
|
2021-09-23 14:52:53 -04:00
|
|
|
pub struct Element {
|
|
|
|
name: QName,
|
2021-09-08 13:53:47 -04:00
|
|
|
/// Zero or more attributes.
|
2021-12-16 09:44:02 -05:00
|
|
|
attrs: AttrList,
|
2021-09-08 13:53:47 -04:00
|
|
|
/// Zero or more child nodes.
|
2021-09-23 14:52:53 -04:00
|
|
|
children: Vec<Tree>,
|
2021-09-08 13:53:47 -04:00
|
|
|
    /// Spans for opening and closing tags respectively.
    span: (Span, Span),
}
|
|
|
|
|
2021-09-23 14:52:53 -04:00
|
|
|
impl Element {
|
2021-09-28 14:52:31 -04:00
|
|
|
    /// Element name.
    #[inline]
    pub fn name(&self) -> QName {
        self.name
    }

    /// Child [`Tree`] objects of this element.
    #[inline]
    pub fn children(&self) -> &Vec<Tree> {
        &self.children
    }

    /// Attributes of this element.
    #[inline]
|
2021-12-16 09:44:02 -05:00
|
|
|
pub fn attrs(&self) -> &AttrList {
|
|
|
|
&self.attrs
|
2021-09-28 14:52:31 -04:00
|
|
|
}
|
|
|
|
|
2021-09-16 10:50:55 -04:00
|
|
|
    /// Opens an element for incremental construction.
    ///
    /// This is intended for use by the parser to begin building an element.
    /// It does not represent a completed element and should not be yielded
    /// to any outside caller until it is complete.
|
2021-09-21 00:16:03 -04:00
|
|
|
/// This incomplete state is encoded in [`Stack::BuddingElement`].
|
2021-09-16 10:50:55 -04:00
|
|
|
#[inline]
|
2021-09-23 14:52:53 -04:00
|
|
|
fn open(name: QName, span: Span) -> Self {
|
2021-09-16 10:50:55 -04:00
|
|
|
Self {
|
|
|
|
name,
|
2021-12-16 09:44:02 -05:00
|
|
|
attrs: AttrList::new(),
|
2021-09-16 10:50:55 -04:00
|
|
|
children: vec![],
|
|
|
|
span: (span, span), // We do not yet know where the span will end
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2021-09-15 16:32:32 -04:00
|
|
|
    /// Complete an element's span by setting its ending span.
    ///
    /// When elements are still budding (see [`Stack::BuddingElement`]),
    /// the ending span is set to the starting span,
    /// since the end is not yet known.
|
2021-09-16 10:50:55 -04:00
|
|
|
#[inline]
|
2021-09-15 16:32:32 -04:00
|
|
|
    fn close_span(self, close_span: Span) -> Self {
        Element {
            span: (self.span.0, close_span),
            ..self
        }
    }
}
|
|
|
|
|
2021-09-15 11:18:17 -04:00
|
|
|
/// A [`Stack`] representing an element and its (optional) parent's stack.
///
/// Storing the parent of an [`Element`] allows it to be manipulated on the
/// [`Stack`] using the usual operations,
/// while maintaining the context needed to later add it as a child to
/// its parent once the element is completed.
///
/// This is used to represent a [`Stack::BuddingElement`].
/// This type exists because enum variants are not their own types,
/// but we want to nest _only_ element stacks,
/// not any type of stack.
#[derive(Debug, Eq, PartialEq)]
|
2021-09-23 14:52:53 -04:00
|
|
|
pub struct ElementStack {
|
|
|
|
element: Element,
|
2021-09-15 16:32:32 -04:00
|
|
|
|
|
|
|
/// Parent element stack to be restored once element has finished
|
|
|
|
/// processing.
|
2021-09-23 14:52:53 -04:00
|
|
|
pstack: Option<Box<ElementStack>>,
|
2021-09-15 16:32:32 -04:00
|
|
|
}
|
|
|
|
|
2021-09-23 14:52:53 -04:00
|
|
|
impl ElementStack {
|
2021-09-15 16:32:32 -04:00
|
|
|
    /// Attempt to close an element,
    /// verifying that the closing tag is either self-closing or
    /// balanced.
    ///
    /// This does not verify that a request to self-close only happens if
    /// there are no child elements;
    /// it is the responsibility of the parser producing the XIR
    /// stream to ensure that self-closing can only happen during
    /// attribute parsing.
    fn try_close(
        self,
|
2021-09-23 14:52:53 -04:00
|
|
|
close_name: Option<QName>,
|
2021-09-15 16:32:32 -04:00
|
|
|
close_span: Span,
|
2021-09-23 14:52:53 -04:00
|
|
|
) -> Result<Self> {
|
2021-09-15 16:32:32 -04:00
|
|
|
        let Element {
            name: ele_name,
            span: (open_span, _),
            ..
        } = self.element;

        // Note that self-closing with children is syntactically
        // invalid and is expected to never make it into a XIR
        // stream to begin with, so we don't check for it.
        if let Some(name) = close_name {
            if name != ele_name {
|
2021-12-13 15:27:20 -05:00
|
|
|
return Err(StackError::UnbalancedTag {
|
2021-09-15 16:32:32 -04:00
|
|
|
                    open: (ele_name, open_span),
                    close: (name, close_span),
                });
            }
        }

        Ok(Self {
            element: self.element.close_span(close_span),
            pstack: self.pstack,
        })
    }
|
|
|
|
|
2021-09-16 10:02:01 -04:00
|
|
|
    /// Transfer stack element into the parent as a child and return the
    /// previous [`Stack`] state,
    /// or yield a [`Stack::ClosedElement`] if there is no parent.
    ///
    /// If there is a parent element,
    /// then the returned [`Stack`] will represent the state of the stack
    /// prior to the child element being opened,
    /// as stored with [`ElementStack::store`].
|
2021-12-16 09:44:02 -05:00
|
|
|
fn consume_child_or_complete<SA: StackAttrParseState>(self) -> Stack<SA> {
|
2021-09-16 10:02:01 -04:00
|
|
|
        match self.pstack {
            Some(parent_stack) => Stack::BuddingElement(
                parent_stack.consume_element(self.element),
            ),

            None => Stack::ClosedElement(self.element),
        }
    }

    /// Push the provided [`Element`] onto the child list of the inner
    /// [`Element`].
|
2021-09-23 14:52:53 -04:00
|
|
|
fn consume_element(mut self, child: Element) -> Self {
|
2021-09-16 10:02:01 -04:00
|
|
|
        self.element.children.push(Tree::Element(child));
        self
    }
|
|
|
|
|
|
|
|
    /// Replace the attribute list of the inner [`Element`] with the
    /// provided [`AttrList`].
|
2021-11-02 14:07:20 -04:00
|
|
|
fn consume_attrs(mut self, attr_list: AttrList) -> Self {
|
2021-12-16 09:44:02 -05:00
|
|
|
self.element.attrs = attr_list;
|
2021-09-16 10:02:01 -04:00
|
|
|
self
|
|
|
|
}
|
|
|
|
|
|
|
|
    /// Transfer self to the heap to be later restored.
    ///
    /// This method simply exists for self-documentation.
|
2021-09-15 16:32:32 -04:00
|
|
|
    fn store(self) -> Box<Self> {
        Box::new(self)
    }
}
|
2021-09-15 11:18:17 -04:00
|
|
|
|
2021-09-08 13:53:47 -04:00
|
|
|
/// The state and typed stack of the XIR parser stack machine.
///
/// Since all possible states of the stack are known statically,
/// we encode the stack into variants,
/// where each variant represents the state of the parser's state
/// machine.
/// This way,
/// we know that the stack is always well-formed,
/// and benefit from strong type checking.
/// This also allows Rust to optimize its use.
///
/// Rust will compile this into a value that exists on the stack,
/// so we wind up with an actual stack machine in the end anyway.
///
/// For more information,
/// see the [module-level documentation](self).
|
2021-09-13 10:43:33 -04:00
|
|
|
#[derive(Debug, Eq, PartialEq)]
|
2021-12-17 10:22:05 -05:00
|
|
|
pub enum Stack<SA = AttrParseState>
|
2021-12-16 09:44:02 -05:00
|
|
|
where
|
|
|
|
SA: StackAttrParseState,
|
|
|
|
{
|
2021-09-08 13:53:47 -04:00
|
|
|
    /// Empty stack.
    Empty,

    /// An [`Element`] that is still under construction.
    ///
    /// (This is a tree IR,
    /// so here's a plant pun).
|
2021-09-23 14:52:53 -04:00
|
|
|
BuddingElement(ElementStack),
|
2021-09-08 13:53:47 -04:00
|
|
|
|
2021-09-15 16:32:32 -04:00
|
|
|
/// A completed [`Element`].
|
|
|
|
///
|
|
|
|
/// This should be consumed and emitted.
|
2021-09-23 14:52:53 -04:00
|
|
|
ClosedElement(Element),
|
2021-09-15 16:32:32 -04:00
|
|
|
|
2021-11-02 14:07:20 -04:00
|
|
|
/// An [`AttrList`] that is still under construction.
|
2021-12-14 12:49:06 -05:00
|
|
|
BuddingAttrList(ElementStack, AttrList),
|
2021-11-02 14:07:20 -04:00
|
|
|
|
tamer: xir::tree: Integrate AttrParserState into Stack
Note that AttrParse{r=>}State needs renaming, and Stack will get a better
name down the line too. This commit message is accurate, but confusing.
This performs the long-awaited task of trying to observe, concretely, how to
combine two automata. This has the effect of stitching together the state
machines, such that the union of the two is equivalent to the original
monolith.
The next step will be to abstract this away.
There are some important things to note here. First, this introduces a new
"dead" state concept, where here a dead state is defined as an _accepting_
state that has no state transitions for the given input token. This is more
strict than a dead state as defined in, for example, the Dragon Book, where
backtracking may occur.
The reason I chose for a Dead state to be accepting is simple: it represents
a lookahead situation. It says, "I don't know what this token is, but I've
done my job, so it may be useful in a parent context". The "I've done my
job" part is only applicable in an accepting state.
If the parser is _not_ in an accepting state, then an unknown token is
simply an error; we should _not_ try to backtrack or anything of the sort,
because we want only a single token of lookahead.
The reason this was done is because it's otherwise difficult to compose the
two parsers without requiring that AttrEnd exist in every XIR stream; this
has always been an awkward delimiter that was introduced to make the parser
LL(0), but I tried to compromise by saying that it was optional. Of course,
I knew that decision caused awkward inconsistencies, I had just hoped that
those inconsistencies wouldn't manifest in practical issues.
Well, now it did, and the benefits of AttrEnd that we had in the previous
construction do not exist in this one. Consequently, it makes more sense to
simply go from LL(0) to LL(1), which makes AttrEnd unnecessary, and a future
commit will remove it entirely.
All of this information will be documented, but I want to get further in
the implementation first to make sure I don't change course again and
therefore waste my time on docs.
DEV-11268
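The composition described in this message can be sketched in isolation. The following is a minimal illustration under stated assumptions, not the TAME implementation: `Tok`, `Status`, `AttrParser`, `ElementParser`, and `Element` are hypothetical, simplified stand-ins. It shows a child parser returning `Dead` with the unrecognized token while in an accepting state, and the parent resuming with that single token of lookahead.

```rust
// Hypothetical, simplified types; none of these definitions are from TAME.
#[derive(Debug, PartialEq)]
enum Tok {
    Open(&'static str),
    AttrName(&'static str),
    AttrValue(&'static str),
    Close,
}

#[derive(Debug, PartialEq)]
enum Status<T> {
    /// More tokens are needed.
    Incomplete,
    /// A complete object was produced.
    Object(T),
    /// Accepting state with no transition for this token;
    /// the token is handed back as a single token of lookahead.
    Dead(Tok),
}

type Attr = (&'static str, &'static str);

/// Child parser: recognizes `AttrName AttrValue` pairs.
#[derive(Debug, PartialEq)]
enum AttrParser {
    Empty,
    Name(&'static str),
}

impl AttrParser {
    fn parse_token(&mut self, tok: Tok) -> Result<Status<Attr>, String> {
        match (std::mem::replace(self, AttrParser::Empty), tok) {
            (AttrParser::Empty, Tok::AttrName(name)) => {
                *self = AttrParser::Name(name);
                Ok(Status::Incomplete)
            }
            (AttrParser::Name(name), Tok::AttrValue(value)) => {
                Ok(Status::Object((name, value)))
            }
            // Accepting state: the job is done, so yield the token.
            (AttrParser::Empty, tok) => Ok(Status::Dead(tok)),
            // Not accepting: an unknown token is simply an error.
            (AttrParser::Name(name), tok) => {
                Err(format!("expected value for `{}`, found {:?}", name, tok))
            }
        }
    }
}

#[derive(Debug, PartialEq)]
struct Element {
    name: &'static str,
    attrs: Vec<Attr>,
}

/// Parent parser: delegates to `AttrParser` after an opening tag.
#[derive(Debug)]
enum ElementParser {
    Empty,
    Attrs(Element, AttrParser),
    Budding(Element),
}

impl ElementParser {
    fn parse_token(&mut self, tok: Tok) -> Result<Status<Element>, String> {
        match (std::mem::replace(self, ElementParser::Empty), tok) {
            (ElementParser::Empty, Tok::Open(name)) => {
                let ele = Element { name, attrs: Vec::new() };
                *self = ElementParser::Attrs(ele, AttrParser::Empty);
                Ok(Status::Incomplete)
            }
            (ElementParser::Attrs(mut ele, mut child), tok) => {
                match child.parse_token(tok)? {
                    Status::Incomplete => {}
                    Status::Object(attr) => ele.attrs.push(attr),
                    // The child is finished; resume with its lookahead.
                    Status::Dead(lookahead) => {
                        *self = ElementParser::Budding(ele);
                        return self.parse_token(lookahead);
                    }
                }
                *self = ElementParser::Attrs(ele, child);
                Ok(Status::Incomplete)
            }
            (ElementParser::Budding(ele), Tok::Close) => Ok(Status::Object(ele)),
            (ElementParser::Empty, tok) => Ok(Status::Dead(tok)),
            (ElementParser::Budding(ele), tok) => {
                Err(format!("unexpected {:?} in `{}`", tok, ele.name))
            }
        }
    }
}
```

Because the child only yields `Dead` from its accepting state, no explicit end-of-attributes delimiter is needed in this sketch: the first non-attribute token ends attribute parsing and is then processed by the parent.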
2021-12-16 09:44:02 -05:00
|
|
|
/// Parsing has been ceded to `SA` for attribute parsing.
|
|
|
|
AttrState(ElementStack, AttrList, SA),
|
2021-09-21 15:30:44 -04:00
|
|
|
|
2021-11-04 10:52:16 -04:00
|
|
|
/// Parsing has completed relative to the initial context.
|
|
|
|
///
|
|
|
|
/// This is the final accepting state of the state machine.
|
|
|
|
/// The parser will not operate while in this state,
|
|
|
|
/// which must be explicitly acknowledged and cleared in order to
|
|
|
|
/// indicate that additional tokens are expected and are not in
|
|
|
|
/// error.
|
|
|
|
Done,
|
2021-09-08 13:53:47 -04:00
|
|
|
}
|
|
|
|
|
2021-12-16 09:44:02 -05:00
|
|
|
pub trait StackAttrParseState = ParseState<Object = Attr>
|
|
|
|
where
|
|
|
|
<Self as ParseState>::Error: Into<StackError>;
|
|
|
|
|
|
|
|
impl<SA: StackAttrParseState> Default for Stack<SA> {
|
2021-09-08 13:53:47 -04:00
|
|
|
fn default() -> Self {
|
|
|
|
Self::Empty
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2021-12-16 09:44:02 -05:00
|
|
|
impl<SA: StackAttrParseState> ParseState for Stack<SA> {
|
2021-12-14 12:44:32 -05:00
|
|
|
type Object = Tree;
|
2021-12-13 15:27:20 -05:00
|
|
|
type Error = StackError;
|
2021-12-13 15:00:34 -05:00
|
|
|
|
|
|
|
fn parse_token(&mut self, tok: Token) -> ParseStateResult<Self> {
|
|
|
|
let stack = take(self);
|
|
|
|
|
2021-12-16 09:44:02 -05:00
|
|
|
// This demonstrates how parsers can be combined.
|
|
|
|
// The next step will be to abstract this away.
|
|
|
|
if let Stack::AttrState(estack, attrs, mut sa) = stack {
|
|
|
|
use ParseStatus::*;
|
|
|
|
return match sa.parse_token(tok) {
|
|
|
|
Ok(Incomplete) => {
|
|
|
|
*self = Self::AttrState(estack, attrs, sa);
|
|
|
|
Ok(Incomplete)
|
|
|
|
}
|
|
|
|
Ok(Object(attr)) => {
|
|
|
|
let attrs = attrs.push(attr);
|
|
|
|
*self = Self::AttrState(estack, attrs, sa);
|
|
|
|
Ok(Incomplete)
|
|
|
|
}
|
|
|
|
Ok(Dead(lookahead)) => {
|
|
|
|
*self = Self::BuddingElement(estack.consume_attrs(attrs));
|
|
|
|
self.parse_token(lookahead)
|
|
|
|
}
|
|
|
|
Err(x) => Err(x.into()),
|
|
|
|
};
|
|
|
|
}
|
|
|
|
|
2021-12-13 15:00:34 -05:00
|
|
|
match tok {
|
|
|
|
Token::Open(name, span) => stack.open_element(name, span),
|
|
|
|
Token::Close(name, span) => stack.close_element(name, span),
|
|
|
|
Token::Text(value, span) => stack.text(value, span),
|
|
|
|
|
2021-12-16 09:44:02 -05:00
|
|
|
// `take` above left `*self` as the default (`Empty`), so test `stack` instead.
_ if stack.is_accepting() => return Ok(ParseStatus::Dead(tok)),
|
|
|
|
_ => {
|
|
|
|
todo!(
|
|
|
|
"TODO: `{:?}` unrecognized. The parser is not yet \
|
|
|
|
complete, so this could represent either a missing \
|
|
|
|
feature or a semantic error. Stack: `{:?}`.",
|
|
|
|
tok,
|
|
|
|
stack
|
|
|
|
)
|
2021-12-13 15:00:34 -05:00
|
|
|
}
|
|
|
|
}
|
|
|
|
.map(|new_stack| self.store_or_emit(new_stack))
|
|
|
|
}
|
|
|
|
|
|
|
|
fn is_accepting(&self) -> bool {
|
|
|
|
*self == Self::Empty || *self == Self::Done
|
|
|
|
}
|
|
|
|
}
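The ownership pattern used by `parse_token` above (move the state out with `take`, compute a new state by value, then either emit a finished object or store the new state back) can be sketched on its own. This is an illustrative example under stated assumptions, not TAME code: `Acc`, `step`, and `feed` are hypothetical names.

```rust
use std::mem::take;

/// Hypothetical accumulator state machine.
#[derive(Debug, PartialEq)]
enum Acc {
    Empty,
    Building(Vec<u32>),
    Closed(Vec<u32>),
}

impl Default for Acc {
    fn default() -> Self {
        Acc::Empty
    }
}

impl Acc {
    /// Pure transition: consumes the old state and returns the new one.
    /// `None` as input closes the value under construction.
    fn step(self, input: Option<u32>) -> Acc {
        match (self, input) {
            (Acc::Empty, Some(n)) => Acc::Building(vec![n]),
            (Acc::Building(mut v), Some(n)) => {
                v.push(n);
                Acc::Building(v)
            }
            (Acc::Building(v), None) => Acc::Closed(v),
            (other, _) => other,
        }
    }

    /// Emit a completed object or store the new state for further input.
    fn feed(&mut self, input: Option<u32>) -> Option<Vec<u32>> {
        // `take` leaves `Acc::Empty` in `*self` while the old state is moved.
        match take(self).step(input) {
            // Emitting: `*self` stays `Empty`, ready for the next object.
            Acc::Closed(v) => Some(v),
            next => {
                *self = next;
                None
            }
        }
    }
}
```

One consequence worth noting: between `take` and the store, `*self` holds the default state, so any check made against `self` in that window observes `Empty` rather than the state being processed.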
|
|
|
|
|
2021-12-16 09:44:02 -05:00
|
|
|
impl<SA: StackAttrParseState> Stack<SA> {
|
2021-09-15 16:32:32 -04:00
|
|
|
/// Attempt to open a new element.
|
|
|
|
///
|
|
|
|
/// If the stack is [`Self::Empty`],
|
|
|
|
/// then the element will be considered to be a root element,
|
|
|
|
/// meaning that it will be completed once it is closed.
|
|
|
|
/// If the stack contains [`Self::BuddingElement`],
|
|
|
|
/// then a child element will be started,
|
|
|
|
/// which will be consumed by the parent once closed rather than
|
|
|
|
/// being considered a completed [`Element`].
|
|
|
|
///
|
|
|
|
/// Attempting to open an element in any other context is an error.
|
2021-09-23 14:52:53 -04:00
|
|
|
fn open_element(self, name: QName, span: Span) -> Result<Self> {
|
2021-09-16 10:50:55 -04:00
|
|
|
let element = Element::open(name, span);
|
2021-09-15 16:32:32 -04:00
|
|
|
|
2021-12-16 09:44:02 -05:00
|
|
|
Ok(Self::AttrState(
|
|
|
|
ElementStack {
|
|
|
|
element,
|
|
|
|
pstack: match self {
|
|
|
|
// Opening a root element (or lack of context).
|
|
|
|
Self::Empty => None,
|
|
|
|
|
|
|
|
// Open a child element.
|
|
|
|
Self::BuddingElement(pstack) => Some(pstack.store()),
|
|
|
|
|
|
|
|
// Opening a child element in attribute parsing context.
|
|
|
|
Self::BuddingAttrList(pstack, attr_list) => {
|
|
|
|
Some(pstack.consume_attrs(attr_list).store())
|
|
|
|
}
|
|
|
|
|
|
|
|
_ => todo! {},
|
|
|
|
},
|
2021-12-14 12:49:06 -05:00
|
|
|
},
|
2021-12-16 09:44:02 -05:00
|
|
|
Default::default(),
|
|
|
|
SA::default(),
|
|
|
|
))
|
2021-09-15 16:32:32 -04:00
|
|
|
}
|
|
|
|
|
|
|
|
/// Attempt to close an element.
|
|
|
|
///
|
|
|
|
/// Elements can be either self-closing
|
|
|
|
/// (in which case `name` is [`None`]),
|
|
|
|
/// or have their own independent closing tags.
|
|
|
|
/// If a name is provided,
|
|
|
|
/// then it _must_ match the name of the element currently being
|
|
|
|
/// processed---that is,
|
|
|
|
/// the tree must be _balanced_.
|
2021-12-13 15:27:20 -05:00
|
|
|
/// An unbalanced tree results in a [`StackError::UnbalancedTag`].
|
2021-09-23 14:52:53 -04:00
|
|
|
fn close_element(self, name: Option<QName>, span: Span) -> Result<Self> {
|
2021-09-15 16:32:32 -04:00
|
|
|
match self {
|
|
|
|
Self::BuddingElement(stack) => stack
|
|
|
|
.try_close(name, span)
|
2021-09-16 10:02:01 -04:00
|
|
|
.map(ElementStack::consume_child_or_complete),
|
2021-09-15 16:32:32 -04:00
|
|
|
|
2021-12-14 12:49:06 -05:00
|
|
|
Self::BuddingAttrList(stack, attr_list) => stack
|
2021-11-02 14:07:20 -04:00
|
|
|
.consume_attrs(attr_list)
|
|
|
|
.try_close(name, span)
|
|
|
|
.map(ElementStack::consume_child_or_complete),
|
|
|
|
|
2021-09-15 16:32:32 -04:00
|
|
|
_ => todo! {},
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2021-10-08 16:16:33 -04:00
|
|
|
/// Appends a text node as a child of an element.
|
|
|
|
///
|
|
|
|
/// This is valid only for a [`Stack::BuddingElement`].
|
2021-11-15 23:47:14 -05:00
|
|
|
fn text(self, value: SymbolId, span: Span) -> Result<Self> {
|
2021-10-08 16:16:33 -04:00
|
|
|
Ok(match self {
|
|
|
|
Self::BuddingElement(mut ele) => {
|
|
|
|
ele.element.children.push(Tree::Text(value, span));
|
|
|
|
|
|
|
|
Self::BuddingElement(ele)
|
|
|
|
}
|
|
|
|
_ => todo! {},
|
|
|
|
})
|
|
|
|
}
|
2021-09-08 13:53:47 -04:00
|
|
|
|
2021-09-15 16:32:32 -04:00
|
|
|
/// Emit a completed object or store the current stack for further processing.
|
2021-12-16 09:44:02 -05:00
|
|
|
fn store_or_emit(&mut self, new_stack: Self) -> ParseStatus<Tree> {
|
2021-09-15 16:32:32 -04:00
|
|
|
match new_stack {
|
2021-12-13 14:08:16 -05:00
|
|
|
Stack::ClosedElement(ele) => {
|
2021-12-14 12:44:32 -05:00
|
|
|
ParseStatus::Object(Tree::Element(ele))
|
2021-12-13 14:08:16 -05:00
|
|
|
}
|
2021-09-08 13:53:47 -04:00
|
|
|
|
2021-09-15 16:32:32 -04:00
|
|
|
_ => {
|
2021-12-13 15:00:34 -05:00
|
|
|
*self = new_stack;
|
2021-12-13 14:29:16 -05:00
|
|
|
ParseStatus::Incomplete
|
2021-09-08 13:53:47 -04:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
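The balancing rule documented for `close_element` above can be illustrated with a small standalone sketch. It is a simplification under stated assumptions: element names are plain `String`s kept on a `Vec`, spans are omitted, and `CloseError` and `try_close` are hypothetical names rather than the `ElementStack::try_close` used here.

```rust
/// Hypothetical error type, loosely modeled on `StackError::UnbalancedTag`.
#[derive(Debug, PartialEq)]
enum CloseError {
    /// The closing name does not match the innermost open element.
    Unbalanced { open: String, close: String },
    /// There is no open element to close.
    NothingOpen,
}

/// Close the innermost open element.
///
/// `name` is `None` for a self-closing tag;
/// otherwise it must match the innermost open element's name.
/// On error the innermost element has already been popped,
/// which is acceptable here because parsing stops at the first error.
fn try_close(
    open: &mut Vec<String>,
    name: Option<&str>,
) -> Result<String, CloseError> {
    let top = open.pop().ok_or(CloseError::NothingOpen)?;

    match name {
        Some(close) if close != top.as_str() => Err(CloseError::Unbalanced {
            open: top,
            close: close.to_string(),
        }),
        _ => Ok(top),
    }
}
```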
|
|
|
|
|
2021-09-13 10:43:33 -04:00
|
|
|
/// Result of a XIR tree parsing operation.
|
2021-12-13 15:27:20 -05:00
|
|
|
pub type Result<T> = std::result::Result<T, StackError>;
|
2021-09-13 10:43:33 -04:00
|
|
|
|
2021-12-13 15:00:34 -05:00
|
|
|
/// Parsing error from [`Stack`].
|
2021-09-08 13:53:47 -04:00
|
|
|
#[derive(Debug, Eq, PartialEq)]
|
2021-12-13 15:27:20 -05:00
|
|
|
pub enum StackError {
|
2021-09-13 10:43:33 -04:00
|
|
|
/// The closing tag does not match the opening tag at the same level of
|
|
|
|
/// nesting.
|
2021-09-15 16:32:32 -04:00
|
|
|
UnbalancedTag {
|
2021-09-23 14:52:53 -04:00
|
|
|
open: (QName, Span),
|
|
|
|
close: (QName, Span),
|
2021-09-13 10:43:33 -04:00
|
|
|
},
|
|
|
|
|
2021-12-16 09:44:02 -05:00
|
|
|
/// An error produced while parsing an attribute.
AttrError(AttrParseError),
|
|
|
|
|
2021-11-03 14:37:05 -04:00
|
|
|
/// An attribute was expected as the next [`Token`].
|
|
|
|
AttrNameExpected(Token),
|
|
|
|
|
|
|
|
/// Token stream ended before attribute parsing was complete.
|
|
|
|
UnexpectedAttrEof,
|
2021-09-13 10:43:33 -04:00
|
|
|
}
|
|
|
|
|
2021-12-13 15:27:20 -05:00
|
|
|
impl Display for StackError {
|
2021-09-13 10:43:33 -04:00
|
|
|
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
|
|
|
|
match self {
|
|
|
|
// TODO: not a useful error because of symbols and missing span information
|
2021-09-15 16:32:32 -04:00
|
|
|
Self::UnbalancedTag {
|
2021-11-03 14:54:37 -04:00
|
|
|
open: (open_name, open_span),
|
|
|
|
close: (close_name, close_span),
|
2021-09-13 10:43:33 -04:00
|
|
|
} => {
|
|
|
|
write!(
|
|
|
|
f,
|
2021-11-03 14:54:37 -04:00
|
|
|
"expected closing tag `{}`, but found `{}` at {} \
|
|
|
|
(opening tag at {})",
|
|
|
|
open_name, close_name, close_span, open_span
|
2021-09-13 10:43:33 -04:00
|
|
|
)
|
|
|
|
}
|
|
|
|
|
2021-12-16 09:44:02 -05:00
|
|
|
Self::AttrError(e) => Display::fmt(e, f),
|
|
|
|
|
2021-11-03 14:37:05 -04:00
|
|
|
Self::AttrNameExpected(tok) => {
|
2021-11-03 15:07:52 -04:00
|
|
|
write!(f, "attribute name expected, found {}", tok)
|
2021-11-03 14:37:05 -04:00
|
|
|
}
|
|
|
|
|
|
|
|
// TODO: Perhaps we should include the last-encountered Span.
|
|
|
|
Self::UnexpectedAttrEof => {
|
|
|
|
write!(
|
|
|
|
f,
|
|
|
|
"unexpected end of input during isolated attribute parsing",
|
|
|
|
)
|
|
|
|
}
|
2021-09-13 10:43:33 -04:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
2021-09-08 13:53:47 -04:00
|
|
|
|
2021-12-13 15:27:20 -05:00
|
|
|
impl Error for StackError {
|
2021-11-18 00:59:10 -05:00
|
|
|
fn source(&self) -> Option<&(dyn Error + 'static)> {
|
tamer: xir::tree: Integrate AttrParserState into Stack
Note that AttrParse{r=>}State needs renaming, and Stack will get a better
name down the line too. This commit message is accurate, but confusing.
This performs the long-awaited task of trying to observe, concretely, how to
combine two automata. This has the effect of stitching together the state
machines, such that the union of the two is equivalent to the original
monolith.
The next step will be to abstract this away.
There are some important things to note here. First, this introduces a new
"dead" state concept, where here a dead state is defined as an _accepting_
state that has no state transitions for the given input token. This is more
strict than a dead state as defined in, for example, the Dragon Book, where
backtracking may occur.
The reason I chose for a Dead state to be accepting is simple: it represents
a lookahead situation. It says, "I don't know what this token is, but I've
done my job, so it may be useful in a parent context". The "I've done my
job" part is only applicable in an accepting state.
If the parser is _not_ in an accepting state, then an unknown token is
simply an error; we should _not_ try to backtrack or anything of the sort,
because we want only a single token of lookahead.
The reason this was done is because it's otherwise difficult to compose the
two parsers without requiring that AttrEnd exist in every XIR stream; this
has always been an awkward delimiter that was introduced to make the parser
LL(0), but I tried to compromise by saying that it was optional. Of course,
I knew that decision caused awkward inconsistencies; I had just hoped that
those inconsistencies wouldn't manifest in practical issues.
Well, now they have, and the benefits of AttrEnd that we had in the previous
construction do not exist in this one. Consequently, it makes more sense to
simply go from LL(0) to LL(1), which makes AttrEnd unnecessary, and a future
commit will remove it entirely.
All of this information will be documented, but I want to get further in
the implementation first to make sure I don't change course again and
therefore waste my time on docs.
DEV-11268
2021-12-16 09:44:02 -05:00
|
|
|
match self {
|
|
|
|
Self::AttrError(e) => Some(e),
|
|
|
|
_ => None,
|
|
|
|
}
|
2021-11-18 00:59:10 -05:00
|
|
|
}
|
|
|
|
}
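The "dead" state described in the commit message for the AttrParserState integration can be illustrated with a small standalone automaton. This is only a sketch under stated assumptions: `Tok`, `AttrState`, and `Step` are hypothetical stand-ins, not TAMER's actual types, and the two-state attribute grammar is simplified for illustration. In an accepting state, an unrecognized token is handed back as a single token of lookahead; in a non-accepting state it is an error, with no backtracking.

```rust
// Hypothetical token type for this sketch (not TAMER's `Token`).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Tok {
    AttrName,
    AttrValue,
    Close,
}

// Hypothetical attribute parser states:
// `Empty` is accepting, `Name` is waiting for a value.
#[derive(Debug, Clone, Copy, PartialEq)]
enum AttrState {
    Empty,
    Name,
}

// Result of a single parsing step.
#[derive(Debug, PartialEq)]
enum Step {
    // More tokens are needed to finish the current object.
    Incomplete,
    // An attribute was completed.
    Attr,
    // Accepting state with no transition for this token;
    // the token is returned as lookahead for a parent parser.
    Dead(Tok),
}

impl AttrState {
    fn is_accepting(self) -> bool {
        matches!(self, AttrState::Empty)
    }

    fn step(self, tok: Tok) -> Result<(AttrState, Step), String> {
        match (self, tok) {
            (AttrState::Empty, Tok::AttrName) => {
                Ok((AttrState::Name, Step::Incomplete))
            }
            (AttrState::Name, Tok::AttrValue) => {
                Ok((AttrState::Empty, Step::Attr))
            }
            // "I've done my job": yield the token to the parent context.
            (st, tok) if st.is_accepting() => Ok((st, Step::Dead(tok))),
            // Not accepting: an unknown token is simply an error.
            (st, tok) => {
                Err(format!("unexpected {:?} in state {:?}", tok, st))
            }
        }
    }
}
```

A parent parser composed with this one would, on `Step::Dead(tok)`, feed `tok` to its own transition function rather than reading a new token, which is what makes an explicit end-of-attributes delimiter unnecessary in this sketch.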
|
|
|
|
|
2021-12-14 12:36:35 -05:00
|
|
|
/// Produce a streaming parser for the given [`TokenStream`].
|
2021-09-08 13:53:47 -04:00
|
|
|
///
|
2021-09-09 13:05:11 -04:00
|
|
|
/// If you do not require a single-step [`Iterator::next`] and simply want
|
|
|
|
/// the next parsed object,
|
|
|
|
/// use [`parser_from`] instead.
|
2021-09-08 13:53:47 -04:00
|
|
|
///
|
|
|
|
/// Note that parsing errors are represented by the wrapped [`Result`],
|
|
|
|
/// _not_ by [`None`].
|
|
|
|
///
|
|
|
|
/// This will produce an iterator that can only return [`None`] if the
|
|
|
|
/// iterator it scans returns [`None`].
|
|
|
|
///
|
|
|
|
/// ```
|
2021-12-13 15:00:34 -05:00
|
|
|
/// use tamer::xir::tree::{Stack, parse};
|
2021-11-04 16:12:15 -04:00
|
|
|
///# use tamer::xir::Token;
|
2021-09-08 13:53:47 -04:00
|
|
|
///
|
2021-09-23 14:52:53 -04:00
|
|
|
///# let token_stream: std::vec::IntoIter<Token> = vec![].into_iter();
|
2021-09-08 13:53:47 -04:00
|
|
|
/// // Produce a streaming (single-step) parser over a stream of XIR tokens.
|
2021-12-14 12:36:35 -05:00
|
|
|
/// let parser = parse(token_stream);
|
2021-09-08 13:53:47 -04:00
|
|
|
/// ```
|
2021-12-14 12:36:35 -05:00
|
|
|
pub fn parse(
|
|
|
|
toks: impl TokenStream,
|
|
|
|
) -> impl Iterator<Item = ParsedResult<Stack>> {
|
2021-12-17 10:22:05 -05:00
|
|
|
Stack::<AttrParseState>::parse(toks)
|
2021-09-08 13:53:47 -04:00
|
|
|
}
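The single-step behavior documented for `parse` can be sketched with [`Iterator::scan`] on a toy grammar. This is an illustration under assumptions, not TAMER's implementation: the local `Parsed` enum and `step_sum_pairs` are hypothetical, and the "grammar" simply pairs up integers. The point is that the resulting iterator yields exactly one item per input token and returns `None` only when the underlying iterator does.

```rust
// Hypothetical stand-in for a parser's per-token result.
#[derive(Debug, PartialEq)]
enum Parsed {
    // The token was consumed, but no object is complete yet.
    Incomplete,
    // An object (here, the sum of a pair of tokens) was completed.
    Object(u32),
}

// Toy single-step parser: every two tokens form one object.
// The scan state holds the first half of a pending pair, if any.
fn step_sum_pairs(
    toks: impl Iterator<Item = u32>,
) -> impl Iterator<Item = Parsed> {
    toks.scan(None, |pending: &mut Option<u32>, tok| {
        let parsed = match pending.take() {
            None => {
                *pending = Some(tok);
                Parsed::Incomplete
            }
            Some(first) => Parsed::Object(first + tok),
        };

        // Always `Some`: the iterator ends only when `toks` ends.
        Some(parsed)
    })
}
```

Because every token produces an item, a caller can bound the work done per `next` call, at the cost of having to handle `Parsed::Incomplete` itself.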
|
|
|
|
|
2021-11-04 10:52:16 -04:00
|
|
|
/// Produce a lazy parser from a given [`TokenStream`],
|
2021-09-09 13:05:11 -04:00
|
|
|
/// yielding only when an object has been fully parsed.
|
2021-09-08 13:53:47 -04:00
|
|
|
///
|
2021-12-06 11:26:53 -05:00
|
|
|
/// Unlike [`parse`][parse()],
|
2021-09-09 13:05:11 -04:00
|
|
|
/// which is intended for use with [`Iterator::scan`],
|
|
|
|
/// this will yield _only_ when the underlying parser yields
|
2021-12-14 12:44:32 -05:00
|
|
|
/// [`Tree`].
|
2021-09-09 13:05:11 -04:00
|
|
|
/// This interface is far more convenient,
|
|
|
|
/// but comes at the cost of not knowing how many parsing steps a single
|
|
|
|
/// [`Iterator::next`] call will take.
|
2021-09-08 13:53:47 -04:00
|
|
|
///
|
|
|
|
/// For more information on contexts,
|
|
|
|
/// and the parser in general,
|
|
|
|
/// see the [module-level documentation](self).
|
|
|
|
///
|
|
|
|
/// ```
|
2021-11-04 16:12:15 -04:00
|
|
|
/// use tamer::xir::tree::parser_from;
|
|
|
|
///# use tamer::xir::Token;
|
2021-09-08 13:53:47 -04:00
|
|
|
///
|
2021-09-23 14:52:53 -04:00
|
|
|
///# let token_stream: std::vec::IntoIter<Token> = vec![].into_iter();
|
2021-09-08 13:53:47 -04:00
|
|
|
/// // Lazily parse a stream of XIR tokens as an iterator.
|
|
|
|
/// let parser = parser_from(token_stream);
|
|
|
|
/// ```
|
2021-09-23 14:52:53 -04:00
|
|
|
pub fn parser_from(
|
2021-10-29 14:39:40 -04:00
|
|
|
toks: impl TokenStream,
|
2021-12-13 16:57:04 -05:00
|
|
|
) -> impl Iterator<Item = ParseResult<Stack, Tree>> {
|
2021-12-17 10:22:05 -05:00
|
|
|
Stack::<AttrParseState>::parse(toks).filter_map(|parsed| match parsed {
|
2021-12-14 12:44:32 -05:00
|
|
|
Ok(Parsed::Object(tree)) => Some(Ok(tree)),
|
2021-12-13 16:57:04 -05:00
|
|
|
Ok(Parsed::Incomplete) => None,
|
|
|
|
Err(x) => Some(Err(x)),
|
|
|
|
})
|
2021-09-08 13:53:47 -04:00
|
|
|
}
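The `filter_map` in `parser_from` can be read in isolation as a generic adapter that hides incomplete steps while preserving errors. The following is a minimal sketch assuming a local `Parsed<T>` enum and a hypothetical helper named `objects_only`; neither is part of TAMER's API.

```rust
// Local stand-in for the parser's step result.
#[derive(Debug, PartialEq)]
enum Parsed<T> {
    Incomplete,
    Object(T),
}

// Turn a single-step parser into a lazy one:
// incomplete steps are skipped, objects and errors pass through.
fn objects_only<T, E>(
    steps: impl Iterator<Item = Result<Parsed<T>, E>>,
) -> impl Iterator<Item = Result<T, E>> {
    steps.filter_map(|step| match step {
        Ok(Parsed::Object(obj)) => Some(Ok(obj)),
        Ok(Parsed::Incomplete) => None,
        Err(e) => Some(Err(e)),
    })
}
```

Note that errors are yielded as `Some(Err(..))` rather than ending iteration, so the caller decides whether to stop or attempt recovery.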
|
|
|
|
|
2021-11-04 10:52:16 -04:00
|
|
|
/// Produce a lazy attribute parser from a given [`TokenStream`],
|
|
|
|
/// yielding only when an attribute has been fully parsed.
|
|
|
|
///
|
|
|
|
/// This is a specialized parser that begins parsing partway through a XIR
|
|
|
|
/// token stream.
|
|
|
|
/// To parse an entire stream as a tree,
|
|
|
|
/// see [`parser_from`].
|
|
|
|
///
|
2021-11-05 10:54:05 -04:00
|
|
|
/// This parser does not take ownership over the iterator,
|
|
|
|
/// allowing parsing to continue on the underlying token stream after
|
|
|
|
/// attribute parsing has completed.
|
|
|
|
/// Once attribute parsing is finished,
|
|
|
|
/// parsing is able to continue on the underlying token stream as if the
|
|
|
|
/// attributes were never present in XIR at all;
|
|
|
|
/// this also allows this parser to be used as an attribute filter while
|
|
|
|
/// ensuring that the attributes are syntactically valid.
|
|
|
|
///
|
2021-11-04 10:52:16 -04:00
|
|
|
/// For more information on contexts,
|
|
|
|
/// and the parser in general,
|
|
|
|
/// see the [module-level documentation](self).
|
2021-11-05 10:54:05 -04:00
|
|
|
#[inline]
|
|
|
|
pub fn attr_parser_from<'a>(
|
2021-12-16 09:44:02 -05:00
|
|
|
toks: impl TokenStream,
|
|
|
|
) -> impl Iterator<Item = result::Result<Attr, ParseError<StackError>>> {
|
tamer: xir::tree::attr_parser_from: Integrate AttrParser
This begins to integrate the isolated AttrParser. The next step will be
integrating it into the larger XIRT parser.
There's been considerable delay in getting this committed, because I went
through quite the struggle with myself trying to determine what balance I
want to strike between Rust's type system; convenience with parser
combinators; iterators; and various other abstractions. I ended up being
confounded by trying to maintain the current XmloReader abstraction, which
is fundamentally incompatible with the way the new parsing system
works (streaming iterators that do not collect or perform heap
allocations).
There'll be more information on this to come, but there are certain things
that will be changing.
There are a couple problems highlighted by this commit (not in code, but
conceptually):
1. Introducing Option here for the TokenParserState doesn't feel right, in
the sense that the abstraction is inappropriate. We should perhaps
introduce a new variant Parsed::Done or something to indicate intent,
rather than leaving the reader to have to read about what None actually
means.
2. This turns Parsed into more of a statement influencing control
flow/logic, and so should be encapsulated, with an external equivalent
of Parsed that omits variants that ought to remain encapsulated.
3. TokenStreamState is true, but these really are the actual parsers;
TokenStreamParser is more of a coordinator, and helps to abstract away
some of the common logic so lower-level parsers do not have to worry
about it. But calling it TokenStreamState is both a bit
confusing and is an understatement---it _does_ hold the state, but it
also holds the current parsing stack in its variants.
Another thing that is not yet entirely clear is whether this AttrParser
ought to care about detection of duplicate attributes, or if that should be
done in a separate parser, perhaps even at the XIR level. The same can be
said for checking for balanced tags. By pushing it to TokenStream in XIR,
we would get a guaranteed check regardless of what parsers are used, which
is attractive because it reduces the (almost certain-to-otherwise-occur)
risk that individual parsers will not sufficiently check for semantically
valid XML. But it does _potentially_ make error recovery more
complicated. But at the same time, perhaps more specific parsers ought not
care about recovery at that level.
Anyway, point being, more to come, but I am disappointed how much time I'm
spending considering parsing, given that there are so many things I need to
move onto. I just want this done right and in a way that feels like it's
working well with Rust while it's all in working memory, otherwise it's
going to be a significant effort to get back into.
DEV-11268
2021-12-10 14:13:02 -05:00
|
|
|
use parse::Parsed;
|
2021-11-05 10:54:05 -04:00
|
|
|
|
2021-12-17 10:22:05 -05:00
|
|
|
AttrParseState::parse(toks).filter_map(|parsed| match parsed {
|
2021-12-10 14:58:44 -05:00
|
|
|
Ok(Parsed::Object(attr)) => Some(Ok(attr)),
|
2021-12-10 16:51:53 -05:00
|
|
|
Ok(Parsed::Incomplete) => None,
|
2021-12-16 09:44:02 -05:00
|
|
|
Err(ParseError::StateError(e)) => {
|
|
|
|
Some(Err(ParseError::StateError(StackError::AttrError(e))))
|
|
|
|
}
|
|
|
|
Err(e) => Some(Err(e.inner_into())),
|
2021-11-05 10:54:05 -04:00
|
|
|
})
|
2021-11-04 10:52:16 -04:00
|
|
|
}
|
|
|
|
|
2021-12-16 09:44:02 -05:00
|
|
|
impl From<AttrParseError> for StackError {
|
|
|
|
fn from(e: AttrParseError) -> Self {
|
|
|
|
StackError::AttrError(e)
|
2021-12-10 14:13:02 -05:00
|
|
|
}
|
|
|
|
}
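The error-wrapping pattern used by `StackError` (a `From` conversion plus `Error::source`) can be sketched with standalone types. Everything here is hypothetical and chosen for illustration: `InnerError`, `OuterError`, `parse_inner`, and `parse_outer` do not exist in TAMER. The sketch shows how the `From` impl lets `?` lift a lower-level parser's error into the outer error type while `source` keeps the original error reachable.

```rust
use std::error::Error;
use std::fmt;

// Hypothetical lower-level (e.g. attribute) parsing error.
#[derive(Debug, PartialEq)]
struct InnerError(String);

impl fmt::Display for InnerError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.0)
    }
}

impl Error for InnerError {}

// Hypothetical higher-level error that wraps the inner one.
#[derive(Debug, PartialEq)]
enum OuterError {
    Inner(InnerError),
    UnexpectedEof,
}

impl fmt::Display for OuterError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            // Delegate to the wrapped error's message.
            Self::Inner(e) => fmt::Display::fmt(e, f),
            Self::UnexpectedEof => write!(f, "unexpected end of input"),
        }
    }
}

impl Error for OuterError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self {
            Self::Inner(e) => Some(e),
            _ => None,
        }
    }
}

impl From<InnerError> for OuterError {
    fn from(e: InnerError) -> Self {
        OuterError::Inner(e)
    }
}

fn parse_inner(input: &str) -> Result<usize, InnerError> {
    if input.is_empty() {
        Err(InnerError("empty attribute name".to_string()))
    } else {
        Ok(input.len())
    }
}

fn parse_outer(input: Option<&str>) -> Result<usize, OuterError> {
    let s = input.ok_or(OuterError::UnexpectedEof)?;
    // `?` converts `InnerError` into `OuterError` via `From`.
    Ok(parse_inner(s)?)
}
```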
|
|
|
|
|
2021-12-06 14:26:58 -05:00
|
|
|
#[cfg(test)]
|
|
|
|
pub fn merge_attr_fragments<'a>(
|
|
|
|
toks: &'a mut impl TokenStream,
|
|
|
|
) -> impl TokenStream + 'a {
|
2021-12-10 14:13:02 -05:00
|
|
|
use std::iter;
|
|
|
|
|
2021-12-06 14:26:58 -05:00
|
|
|
use crate::sym::{GlobalSymbolIntern, GlobalSymbolResolve};
|
|
|
|
|
|
|
|
let mut stack = Vec::with_capacity(4);
|
|
|
|
|
|
|
|
iter::from_fn(move || {
|
|
|
|
loop {
|
|
|
|
match toks.next() {
|
|
|
|
// Collect fragments and continue iterating until we find
|
|
|
|
// the final `Token::AttrValue`.
|
|
|
|
Some(Token::AttrValueFragment(frag, ..)) => {
|
|
|
|
stack.push(frag);
|
|
|
|
}
|
|
|
|
|
|
|
|
// An AttrValue without any stack is just a normal value.
|
|
|
|
// We are not interested in it.
|
|
|
|
val @ Some(Token::AttrValue(..)) if stack.is_empty() => {
|
|
|
|
return val;
|
|
|
|
}
|
|
|
|
|
|
|
|
// But if we have a stack,
|
|
|
|
// allocate a new string that concatenates each of the
|
|
|
|
// symbols and return a newly allocated symbol.
|
|
|
|
Some(Token::AttrValue(last, span)) if !stack.is_empty() => {
|
|
|
|
stack.push(last);
|
|
|
|
|
|
|
|
let merged = stack
|
|
|
|
.iter()
|
|
|
|
.map(|frag| frag.lookup_str())
|
|
|
|
.collect::<String>()
|
|
|
|
.intern();
|
|
|
|
|
|
|
|
stack.clear();
|
|
|
|
|
|
|
|
return Some(Token::AttrValue(merged, span));
|
|
|
|
}
|
|
|
|
other => return other,
|
|
|
|
}
|
|
|
|
}
|
|
|
|
})
|
|
|
|
}
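The fragment-merging logic of `merge_attr_fragments` can be tried outside of TAMER with a simplified token type. This is a sketch under assumptions: the `Tok` enum below uses owned `String`s instead of interned symbols and omits spans, and `merge_fragments` is a hypothetical name. Fragments are buffered until the final value arrives, at which point a single concatenated value is emitted.

```rust
use std::iter;

// Simplified stand-in for XIR tokens (no spans, no interning).
#[derive(Debug, Clone, PartialEq, Eq)]
enum Tok {
    AttrName(String),
    AttrValueFragment(String),
    AttrValue(String),
}

fn merge_fragments(
    mut toks: impl Iterator<Item = Tok>,
) -> impl Iterator<Item = Tok> {
    // Fragments seen since the last complete attribute value.
    let mut pending: Vec<String> = Vec::new();

    iter::from_fn(move || loop {
        match toks.next() {
            // Buffer the fragment and keep reading.
            Some(Tok::AttrValueFragment(frag)) => pending.push(frag),

            // Final value following fragments: concatenate and emit.
            Some(Tok::AttrValue(last)) if !pending.is_empty() => {
                pending.push(last);
                let merged: String = pending.drain(..).collect();
                return Some(Tok::AttrValue(merged));
            }

            // Anything else (including end of input) passes through.
            other => return other,
        }
    })
}
```

As in the original, fragments that are never followed by a final value are not emitted; in this sketch they are simply left in the buffer when the stream ends.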
|
|
|
|
|
2021-09-08 13:53:47 -04:00
|
|
|
#[cfg(test)]
|
2021-09-16 10:18:02 -04:00
|
|
|
mod test;
|