tamer: xir:tree: Begin work on composable XIRT parser

The XIRT parser was initially written for test cases, so that unit tests
could more easily assert on generated token streams (XIR).  While
broader use was planned, it wasn't clear what the eventual needs would
be, only that they were expected to differ.  Indeed, loading everything
into a generic tree representation in memory is not appropriate---we
should prefer streaming and avoid heap allocations when they're not
necessary, and we should parse into an IR rather than a generic format,
which ensures that the data follow a proper grammar and are semantically
valid.

When parsing attributes in an isolated context became necessary for the
aforementioned task, the state machine of the XIRT parser was modified
to accommodate it.  The opposite approach should have been
taken---rather than adding complexity and special cases to the parser
and then extracting a simple attribute parser from the complex one, we
should be composing the larger (full XIRT) parser from smaller ones
(e.g. attribute and child-element parsers).
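
To illustrate the direction, here is a minimal sketch of that kind of
composition---hypothetical names and types for illustration only, not
TAMER's actual API: a small attribute parser is reused by a larger
parser, rather than the larger parser growing special cases for
attribute states.

```rust
use std::iter::Peekable;

// Hypothetical token and attribute types for illustration only.
#[derive(Debug, PartialEq)]
enum Token {
    AttrName(&'static str),
    AttrValue(&'static str),
    Close,
}

#[derive(Debug, PartialEq)]
struct Attr(&'static str, &'static str);

/// Small, isolated attribute parser:
///   consumes exactly one name/value pair from the stream.
fn parse_attr<I: Iterator<Item = Token>>(toks: &mut Peekable<I>) -> Option<Attr> {
    let name = match toks.next()? {
        Token::AttrName(name) => name,
        _ => return None,
    };

    match toks.next()? {
        Token::AttrValue(value) => Some(Attr(name, value)),
        _ => None,
    }
}

/// Larger parser composed from the smaller one,
///   rather than re-implementing attribute states itself.
fn parse_attr_list<I: Iterator<Item = Token>>(toks: &mut Peekable<I>) -> Vec<Attr> {
    let mut attrs = Vec::new();

    while matches!(toks.peek(), Some(Token::AttrName(_))) {
        match parse_attr(toks) {
            Some(attr) => attrs.push(attr),
            None => break,
        }
    }

    attrs
}

fn main() {
    let mut toks = [
        Token::AttrName("name"),
        Token::AttrValue("value"),
        Token::Close,
    ]
    .into_iter()
    .peekable();

    assert_eq!(parse_attr_list(&mut toks), vec![Attr("name", "value")]);
    assert_eq!(toks.next(), Some(Token::Close)); // Close left for the caller.
}
```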

A combinator, as the term is used in functional programming, refers not
to combinatory logic but to the composition of more complex systems from
smaller ones.  The changes made as part of this commit begin to work
toward combinators, though it's not necessarily evident yet (to you, the
reader) how that will work, since the code for it hasn't yet been
written; this commit simply introduces my work thus far so I can do some
light refactoring before continuing on it.

TAMER does not aim to introduce a parser combinator framework in the
usual sense---it favors, instead, striking a proper balance with Rust's
type system, permitting the convenience of combinators only in
situations where they are needed, to avoid having to write new parser
boilerplate.  Specifically:

  1. Rust's type system should be used as combinators, so that parsers
     are automatically constructed from the type definition.

  2. Primitive parsers are written as explicit automata, not as primitive
     combinators.

  3. Parsing should directly produce IRs as a lowering operation below
     XIRT, rather than producing XIRT itself.  That is, target IRs
     should consume XIRT and perform their own parsing immediately,
     during streaming.
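
As a rough sketch of item 1---greatly simplified from the
`Parser`/`TokenStreamState` design appearing in this commit, with
abbreviated hypothetical names---the generic machinery is derived from
the type definitions themselves, so defining a new state type is all
that is needed to obtain a working parser:

```rust
/// Simplified stand-in for `TokenStreamState`:
///   the automaton's state, which fully determines parser behavior.
trait State: Default {
    type Object;

    fn parse_token(&mut self, tok: char) -> Option<Self::Object>;
}

/// Generic parser wrapping a (state, token stream) pair,
///   akin to this commit's `Parser<'a, S, I>`.
struct Parser<'a, S: State, I: Iterator<Item = char>> {
    toks: &'a mut I,
    state: S,
}

// The parser is constructed purely from the types involved...
impl<'a, S: State, I: Iterator<Item = char>> From<&'a mut I> for Parser<'a, S, I> {
    fn from(toks: &'a mut I) -> Self {
        Self { toks, state: S::default() }
    }
}

impl<'a, S: State, I: Iterator<Item = char>> Iterator for Parser<'a, S, I> {
    type Item = Option<S::Object>;

    fn next(&mut self) -> Option<Self::Item> {
        self.toks.next().map(|tok| self.state.parse_token(tok))
    }
}

// ...so a new parser requires only a new state type.
#[derive(Default)]
struct Digits(u32);

impl State for Digits {
    type Object = u32;

    fn parse_token(&mut self, tok: char) -> Option<u32> {
        let digit = tok.to_digit(10)?;
        self.0 = self.0 * 10 + digit;
        Some(self.0)
    }
}

fn main() {
    let mut toks = "42".chars();
    let parser: Parser<'_, Digits, _> = Parser::from(&mut toks);

    // Running total after each digit.
    let out: Vec<_> = parser.flatten().collect();
    assert_eq!(out, vec![4, 42]);
}
```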

In the future, if more combinators are needed, they will be added; maybe
this will eventually evolve into a more generic parser combinator
framework for TAME, but that would certainly be a waste of time right
now.  And, to be honest, I'm hoping that won't be necessary.
main
Mike Gerwitz 2021-12-06 11:26:53 -05:00
parent fd1b1527d6
commit 42b5007402
4 changed files with 665 additions and 9 deletions


@@ -24,7 +24,7 @@
 //!
 //! Parsing is handled by [`ParserState::parse_token`].
 //! An [`Iterator::scan`]-based parser can be constructed using
-//! [`parser_from`] or [`parse`].
+//! [`parser_from`] or [`parse`][parse()].
 //!
 //! ```
 //! use tamer::xir::tree::{ParserState, parse, parser_from};
@@ -43,10 +43,10 @@
 //!
 //! `parser_from` Or `parse`?
 //! =========================
-//! [`parser_from`] is implemented in terms of [`parse`].
+//! [`parser_from`] is implemented in terms of [`parse`][parse()].
 //! They have slightly different use cases and tradeoffs:
 //!
-//! [`parse`] yields a [`Result`] containing [`Parsed`],
+//! [`parse`][parse()] yields a [`Result`] containing [`Parsed`],
 //! which _may_ contain a [`Parsed::Tree`],
 //! but it's more likely to contain [`Parsed::Incomplete`];
 //! this is because it typically takes multiple [`Token`]s to complete
@@ -54,12 +54,13 @@
 //!
 //! In return, though, you get some important guarantees:
 //!
-//! 1. [`parse`] consumes only a _single_ token; and
+//! 1. [`parse`][parse()] consumes only a _single_ token; and
 //! 2. It has a constant upper bound for execution time.
 //!
-//! This means that [`parse`] will never cause the system to hang---you
-//! are in complete control over how much progress parsing makes,
-//! and are free to stop and resume it at any time.
+//! This means that [`parse`][parse()] will never cause the system to
+//! hang---you
+//! are in complete control over how much progress parsing makes,
+//! and are free to stop and resume it at any time.
 //!
 //! However,
 //! if you do not care about those things,
@@ -194,11 +195,13 @@
//! For more information,
//! see [`AttrParts`].
mod attr;
mod parse;
use super::{QName, Token, TokenResultStream, TokenStream};
use crate::{span::Span, sym::SymbolId};
use std::{error::Error, fmt::Display, iter, mem::take};
mod attr;
pub use attr::{Attr, AttrList, AttrParts, SimpleAttr};
/// A XIR tree (XIRT).
@@ -1018,7 +1021,7 @@ pub fn parse(state: &mut ParserState, tok: Token) -> Option<Result<Parsed>> {
 /// Produce a lazy parser from a given [`TokenStream`],
 /// yielding only when an object has been fully parsed.
 ///
-/// Unlike [`parse`],
+/// Unlike [`parse`][parse()],
 /// which is intended for use with [`Iterator::scan`],
 /// this will yield /only/ when the underlying parser yields
 /// [`Parsed::Tree`],


@@ -27,6 +27,8 @@ use super::QName;
 use crate::{span::Span, sym::SymbolId};
 use std::fmt::Display;
+mod parse;
 /// An attribute.
 ///
 /// Attributes come in two flavors:


@@ -0,0 +1,221 @@
// XIRT attribute parsers
//
// Copyright (C) 2014-2021 Ryan Specialty Group, LLC.
//
// This file is part of TAME.
//
// This program is free software: you can redistribute it and/or modify
// it under the terms of the GNU General Public License as published by
// the Free Software Foundation, either version 3 of the License, or
// (at your option) any later version.
//
// This program is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
// GNU General Public License for more details.
//
// You should have received a copy of the GNU General Public License
// along with this program. If not, see <http://www.gnu.org/licenses/>.
//! Parse XIR attribute [`TokenStream`][super::super::TokenStream]s.
use crate::{
    span::Span,
    xir::{
        tree::parse::{Parsed, TokenStreamState, TokenStreamStateResult},
        QName, Token,
    },
};
use std::{error::Error, fmt::Display, mem::take};

use super::Attr;
/// Attribute parser DFA.
///
/// While this parser does store the most recently encountered [`QName`]
/// and [`Span`],
/// these data are used only for emitting data about the accepted state;
/// they do not influence the automaton's state transitions.
/// The actual parsing operation is therefore an FSM,
/// not a PDA.
#[derive(Debug, Eq, PartialEq)]
pub enum AttrParserState {
    Empty,
    Name(QName, Span),
}

impl TokenStreamState for AttrParserState {
    type Object = Attr;
    type Error = AttrParseError;

    fn parse_token(&mut self, tok: Token) -> TokenStreamStateResult<Self> {
        use AttrParserState::*;

        *self = match (take(self), tok) {
            (Empty, Token::AttrName(name, span)) => Name(name, span),

            (Empty, invalid) => {
                return Err(AttrParseError::AttrNameExpected(invalid))
            }

            (Name(name, nspan), Token::AttrValue(value, vspan)) => {
                return Ok(Parsed::Object(Attr::new(
                    name,
                    value,
                    (nspan, vspan),
                )))
            }

            (Name(name, nspan), invalid) => {
                // Restore state for error recovery.
                *self = Name(name, nspan);
                return Err(AttrParseError::AttrValueExpected(
                    name, nspan, invalid,
                ));
            }
        };

        Ok(Parsed::Incomplete)
    }

    #[inline]
    fn is_accepting(&self) -> bool {
        *self == Self::Empty
    }
}

impl Default for AttrParserState {
    fn default() -> Self {
        Self::Empty
    }
}

/// Attribute parsing error.
#[derive(Debug, PartialEq)]
pub enum AttrParseError {
    /// [`Token::AttrName`] was expected.
    AttrNameExpected(Token),

    /// [`Token::AttrValue`] was expected.
    AttrValueExpected(QName, Span, Token),
}

impl Display for AttrParseError {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        match self {
            Self::AttrNameExpected(tok) => {
                write!(f, "attribute name expected, found {}", tok)
            }

            Self::AttrValueExpected(name, span, tok) => {
                write!(
                    f,
                    "expected value for `@{}` at {}, found {}",
                    name, span, tok
                )
            }
        }
    }
}

impl Error for AttrParseError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        None
    }
}
#[cfg(test)]
mod test {
use super::*;
use crate::{convert::ExpectInto, sym::GlobalSymbolIntern};
// TODO: Just make these const
lazy_static! {
static ref S: Span =
Span::from_byte_interval((0, 0), "test case, 1".intern());
static ref S2: Span =
Span::from_byte_interval((0, 0), "test case, 2".intern());
static ref S3: Span =
Span::from_byte_interval((0, 0), "test case, 3".intern());
}
#[test]
fn fails_if_first_token_is_non_attr() {
let tok = Token::Open("foo".unwrap_into(), *S);
let mut sut = AttrParserState::default();
// Fail immediately.
assert_eq!(
Err(AttrParseError::AttrNameExpected(tok.clone())),
sut.parse_token(tok)
);
// Let's just make sure we're in the same state we started in so
// that we know we can accommodate recovery token(s).
assert_eq!(sut, AttrParserState::default());
}
#[test]
fn parse_single_attr() {
let attr = "attr".unwrap_into();
let val = "val".intern();
let mut sut = AttrParserState::default();
let expected = Attr::new(attr, val, (*S, *S2));
// First token represents the name,
// and so we are awaiting a value.
assert_eq!(
sut.parse_token(Token::AttrName(attr, *S)),
Ok(Parsed::Incomplete)
);
// Once we have a value,
// an Attr can be emitted.
assert_eq!(
sut.parse_token(Token::AttrValue(val, *S2)),
Ok(Parsed::Object(expected))
);
}
#[test]
fn parse_fails_when_attribute_value_missing_but_can_recover() {
let attr = "bad".unwrap_into();
let mut sut = AttrParserState::default();
// This token indicates that we're expecting a value to come next in
// the token stream.
assert_eq!(
sut.parse_token(Token::AttrName(attr, *S)),
Ok(Parsed::Incomplete)
);
// But we provide something else unexpected.
assert_eq!(
sut.parse_token(Token::AttrEnd),
Err(AttrParseError::AttrValueExpected(attr, *S, Token::AttrEnd))
);
// We should not be in an accepting state,
// given that we haven't finished parsing the attribute.
assert!(!sut.is_accepting());
// Despite this error,
// we should remain in a state that permits recovery should a
// proper token be substituted.
// Rather than checking for that state,
// let's actually attempt a recovery.
let recover = "value".intern();
let expected = Attr::new(attr, recover, (*S, *S2));
assert_eq!(
sut.parse_token(Token::AttrValue(recover, *S2)),
Ok(Parsed::Object(expected))
);
// Finally, we should now be in an accepting state.
assert!(sut.is_accepting());
}
}


@@ -0,0 +1,430 @@
// Basic parsing framework for XIR into XIRT
//
// Copyright (C) 2014-2021 Ryan Specialty Group, LLC.
//
// This file is part of TAME.
//
// This program is free software: you can redistribute it and/or modify
// it under the terms of the GNU General Public License as published by
// the Free Software Foundation, either version 3 of the License, or
// (at your option) any later version.
//
// This program is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
// GNU General Public License for more details.
//
// You should have received a copy of the GNU General Public License
// along with this program. If not, see <http://www.gnu.org/licenses/>.
//! Basic streaming parsing framework to lower XIR into XIRT.
use super::super::{Token, TokenStream};
use std::{error::Error, fmt::Display};
/// Lower a [`TokenStream`] into XIRT.
///
/// Parsers are wrappers around a ([`TokenStreamState`], [`TokenStream`])
/// pair,
/// where only one parser may have mutable access to the stream at any
/// given time.
///
/// After you have finished with a parser,
/// you should call [`finalize`](TokenStreamParser::finalize) to ensure
/// that parsing has completed in an accepting state.
pub trait TokenStreamParser<I: TokenStream>:
    Iterator<Item = TokenStreamParsedResult<Self::State>> + Sized
{
    /// Parsing automaton.
    type State: TokenStreamState;

    /// Parse a single [`Token`] according to the current
    /// [`TokenStreamState`],
    /// if available.
    ///
    /// If the underlying [`TokenStream`] yields [`None`],
    /// then the [`TokenStreamState`] must be in an accepting state;
    /// otherwise, [`ParseError::UnexpectedEof`] will occur.
    ///
    /// This is intended to be invoked by [`Iterator::next`].
    /// Accepting a token rather than the [`TokenStream`] allows the caller
    /// to inspect the token first
    /// (e.g. to store a copy of the [`Span`][crate::span::Span]).
    #[inline]
    fn parse_next(
        state: &mut Self::State,
        otok: Option<Token>,
    ) -> Option<Self::Item> {
        match otok {
            None if state.is_accepting() => None,
            None => Some(Err(ParseError::UnexpectedEof)),
            Some(tok) => Some(state.parse_token(tok).map_err(ParseError::from)),
        }
    }

    /// Indicate that no further parsing will take place using this parser,
    /// and [`drop`] it.
    ///
    /// Invoking the method is equivalent to stating that the stream has
    /// ended,
    /// since the parser will have no later opportunity to continue
    /// parsing.
    /// Consequently,
    /// the caller should expect [`ParseError::UnexpectedEof`] if the
    /// parser is not in an accepting state.
    fn finalize(
        self,
    ) -> Result<(), (Self, ParseError<<Self::State as TokenStreamState>::Error>)>;
}

/// Result of applying a [`Token`] to a [`TokenStreamState`],
/// with any error having been wrapped in a [`ParseError`].
pub type TokenStreamParsedResult<S> =
    TokenStreamParserResult<S, Parsed<<S as TokenStreamState>::Object>>;

/// Result of some non-parsing operation on a [`TokenStreamParser`],
/// with any error having been wrapped in a [`ParseError`].
pub type TokenStreamParserResult<S, T> =
    Result<T, ParseError<<S as TokenStreamState>::Error>>;
/// A deterministic parsing automaton.
///
/// These states are utilized by a [`TokenStreamParser`].
///
/// A [`TokenStreamState`] is also responsible for storing data about the
/// accepted input,
/// and handling appropriate type conversions into the final type.
/// That is---an
/// automaton may store metadata that is subsequently emitted once an
/// accepting state has been reached.
/// Whatever the underlying automaton,
/// a `(state, token)` pair must uniquely determine the next parser
/// action via [`TokenStreamParser::parse_next`].
///
/// Intuitively,
/// since only one [`TokenStreamParser`] may hold a mutable reference to
/// an underlying [`TokenStream`] at any given point,
/// this does in fact represent the current state of the entire
/// [`TokenStream`] at the current position for a given parser
/// composition.
pub trait TokenStreamState: Default {
    /// Objects produced by a parser utilizing these states.
    type Object;

    /// Errors specific to this set of states.
    type Error: Error + PartialEq;

    /// Construct a parser.
    ///
    /// Whether this method is helpful or provides any clarity depends on
    /// the context and the types that are able to be inferred.
    /// This is completely generic,
    /// able to construct any compatible type of [`TokenStreamParser`],
    /// and so does not in itself do anything to help with type inference
    /// (compared to `P::from`,
    /// you trade an unknown `P::State` for an unknown `P`).
    fn parser<P, I>(toks: &mut I) -> P
    where
        I: TokenStream,
        P: TokenStreamParser<I> + for<'a> From<&'a mut I>,
    {
        P::from(toks)
    }

    /// Parse a single [`Token`] and optionally perform a state transition.
    ///
    /// The current state is represented by `self`,
    /// which is mutable to allow for a state transition.
    /// The result of a parsing operation is either an object or an
    /// indication that additional tokens of input are needed;
    /// see [`Parsed`] for more information.
    fn parse_token(&mut self, tok: Token) -> TokenStreamStateResult<Self>;

    /// Whether the current state represents an accepting state.
    ///
    /// An accepting state represents a valid state to stop parsing.
    /// If parsing stops at a state that is _not_ accepting,
    /// then the [`TokenStream`] has ended unexpectedly and should produce
    /// a [`ParseError::UnexpectedEof`].
    ///
    /// It makes sense for there to exist multiple accepting states for a
    /// parser.
    /// For example:
    /// A parser that parses a list of attributes may be used to parse one
    /// or more attributes,
    /// or the entire list of attributes.
    /// It is acceptable to attempt to parse just one of those attributes,
    /// or it is acceptable to parse all the way until the end.
    fn is_accepting(&self) -> bool;
}

/// Result of applying a [`Token`] to a [`TokenStreamState`].
///
/// See [`TokenStreamState::parse_token`] and
/// [`TokenStreamParser::parse_next`] for more information.
pub type TokenStreamStateResult<S> = Result<
    Parsed<<S as TokenStreamState>::Object>,
    <S as TokenStreamState>::Error,
>;
/// A streaming parser defined by a [`TokenStreamState`] with exclusive
/// mutable access to an underlying [`TokenStream`].
///
/// This parser handles operations that are common among all types of
/// parsers,
/// such that specialized parsers need only implement logic that is
/// unique to their operation.
/// This also simplifies combinators,
/// since there is more uniformity among distinct parser types.
#[derive(Debug, PartialEq, Eq)]
pub struct Parser<'a, S: TokenStreamState, I: TokenStream> {
    toks: &'a mut I,
    state: S,
}

impl<'a, S: TokenStreamState, I: TokenStream> TokenStreamParser<I>
    for Parser<'a, S, I>
{
    type State = S;

    fn finalize(
        self,
    ) -> Result<(), (Self, ParseError<<Self::State as TokenStreamState>::Error>)>
    {
        if self.state.is_accepting() {
            Ok(())
        } else {
            Err((self, ParseError::UnexpectedEof))
        }
    }
}

impl<'a, S: TokenStreamState, I: TokenStream> Iterator for Parser<'a, S, I> {
    type Item = TokenStreamParsedResult<S>;

    /// Consume a single token from the underlying [`TokenStream`] and parse
    /// it according to the current [`TokenStreamState`].
    ///
    /// See [`TokenStreamParser::parse_next`] for more information.
    #[inline]
    fn next(&mut self) -> Option<Self::Item> {
        <Self as TokenStreamParser<I>>::parse_next(
            &mut self.state,
            self.toks.next(),
        )
    }
}
/// Common parsing errors produced by [`TokenStreamParser`].
///
/// These errors are common enough that they are handled in a common way,
/// such that individual parsers needn't check for these situations
/// themselves.
///
/// Having a common type also allows combinators to handle error types in a
/// consistent way when composing parsers.
///
/// Parsers may return their own unique errors via the
/// [`StateError`][ParseError::StateError] variant.
#[derive(Debug, PartialEq)]
pub enum ParseError<E: Error + PartialEq> {
    // TODO: Last span encountered, maybe?
    UnexpectedEof,

    /// A parser-specific error associated with an inner
    /// [`TokenStreamState`].
    StateError(E),
}

impl<E: Error + PartialEq> From<E> for ParseError<E> {
    fn from(e: E) -> Self {
        Self::StateError(e)
    }
}

impl<E: Error + PartialEq> Display for ParseError<E> {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        match self {
            Self::UnexpectedEof => write!(f, "unexpected end of input"),
            Self::StateError(e) => Display::fmt(e, f),
        }
    }
}

impl<E: Error + PartialEq + 'static> Error for ParseError<E> {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self {
            Self::StateError(e) => Some(e),
            _ => None,
        }
    }
}

impl<'a, S: TokenStreamState, I: TokenStream> From<&'a mut I>
    for Parser<'a, S, I>
{
    fn from(toks: &'a mut I) -> Self {
        Self {
            toks,
            state: Default::default(),
        }
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum Parsed<T> {
    Incomplete,
    Object(T),
}
#[cfg(test)]
pub mod test {
use std::assert_matches::assert_matches;
use super::*;
use crate::span::DUMMY_SPAN;
/// Preferred [`TokenStreamParser`].
///
/// TODO: Move into parent module once used outside of tests.
pub type DefaultParser<'a, S, I> = Parser<'a, S, I>;
#[derive(Debug, PartialEq, Eq)]
enum EchoState {
Empty,
Done,
}
impl Default for EchoState {
fn default() -> Self {
Self::Empty
}
}
impl TokenStreamState for EchoState {
type Object = Token;
type Error = EchoStateError;
fn parse_token(&mut self, tok: Token) -> TokenStreamStateResult<Self> {
match tok {
Token::AttrEnd => {
*self = Self::Done;
}
Token::Close(..) => {
return Err(EchoStateError::InnerError(tok))
}
_ => {}
}
Ok(Parsed::Object(tok))
}
fn is_accepting(&self) -> bool {
*self == Self::Done
}
}
#[derive(Debug, PartialEq)]
enum EchoStateError {
InnerError(Token),
}
impl Display for EchoStateError {
fn fmt(&self, _: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
unimplemented!()
}
}
impl Error for EchoStateError {
fn source(&self) -> Option<&(dyn Error + 'static)> {
None
}
}
type Sut<'a, I> = DefaultParser<'a, EchoState, I>;
#[test]
fn permits_end_of_stream_in_accepting_state() {
// EchoState is placed into a Done state given AttrEnd.
let mut toks = [Token::AttrEnd].into_iter();
let mut sut = Sut::from(&mut toks);
// The first token should be processed normally.
// EchoState proxies the token back.
assert_eq!(Some(Ok(Parsed::Object(Token::AttrEnd))), sut.next());
// This is now the end of the token stream,
// which should be okay provided that the first token put us into
// a proper accepting state.
assert_eq!(None, sut.next());
// Further, finalizing should work in this state.
assert!(sut.finalize().is_ok());
}
#[test]
fn fails_on_end_of_stream_when_not_in_accepting_state() {
// No tokens, so EchoState starts in a non-accepting state.
let mut toks = [].into_iter();
let mut sut = Sut::from(&mut toks);
// Given that we have no tokens,
// and that EchoState::default does not start in an accepting
// state,
// we must fail when we encounter the end of the stream.
assert_eq!(Some(Err(ParseError::UnexpectedEof)), sut.next());
}
#[test]
fn returns_state_specific_error() {
// Token::Close causes EchoState to produce an error.
let errtok = Token::Close(None, DUMMY_SPAN);
let mut toks = [errtok.clone()].into_iter();
let mut sut = Sut::from(&mut toks);
assert_eq!(
Some(Err(ParseError::StateError(EchoStateError::InnerError(
errtok
)))),
sut.next()
);
// The token must have been consumed.
// It is up to a recovery process to either bail out or provide
// recovery tokens;
// continuing without recovery is unlikely to make sense.
assert_eq!(0, toks.len());
}
#[test]
fn fails_when_parser_is_finalized_in_non_accepting_state() {
// Set up so that we have a single token that we can use for
// recovery as part of the same iterator.
let mut toks = [Token::AttrEnd].into_iter();
let sut = Sut::from(&mut toks);
// Attempting to finalize now in a non-accepting state should fail
// in the same way that encountering an end-of-stream does,
// since we're effectively saying "we're done with the stream"
// and the parser will have no further opportunity to reach an
// accepting state.
let result = sut.finalize();
assert_matches!(result, Err((_, ParseError::UnexpectedEof)));
// The sut should have been re-returned,
// allowing for attempted error recovery if the caller can manage
// to produce a sequence of tokens that will be considered valid.
// `toks` above is set up already for this,
// which allows us to assert that we received back the same `sut`.
let mut sut = result.unwrap_err().0;
assert_eq!(Some(Ok(Parsed::Object(Token::AttrEnd))), sut.next());
// And so we should now be in an accepting state,
// able to finalize.
assert!(sut.finalize().is_ok());
}
}