// XIR tree representation
//
// Copyright (C) 2014-2021 Ryan Specialty Group, LLC.
//
// This file is part of TAME.
//
// This program is free software: you can redistribute it and/or modify
// it under the terms of the GNU General Public License as published by
// the Free Software Foundation, either version 3 of the License, or
// (at your option) any later version.
//
// This program is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
// GNU General Public License for more details.
//
// You should have received a copy of the GNU General Public License
// along with this program. If not, see .
//! XIR token stream parsed into a tree-based IR (XIRT).
//!
//! **This is a work-in-progress implementation.**
//! It will be augmented only as needed.
//!
//! Parsing is handled by [`Stack::parse_token`].
//! An [`Iterator::scan`]-based parser can be constructed using
//! [`parser_from`] or [`parse`][parse()].
//!
//! ```
//! use tamer::xir::tree::{Stack, parse, parser_from};
//!# use tamer::xir::Token;
//!
//!# let token_stream: std::vec::IntoIter = vec![].into_iter();
//! // Lazily parse a stream of XIR tokens as an iterator, yielding the next
//! // fully parsed object. This may consume any number of tokens.
//! let parser = parser_from(token_stream);
//!
//!# let token_stream: std::vec::IntoIter = vec![].into_iter();
//! // Consume a single token at a time, yielding either an incomplete state
//! // or the next parsed object.
//! let parser = parse(token_stream);
//! ```
//!
//! `parser_from` Or `parse`?
//! =========================
//! [`parser_from`] is implemented in terms of [`parse`][parse()].
//! They have slightly different use cases and tradeoffs:
//!
//! [`parse`][parse()] yields a [`Result`] containing [`Parsed`],
//! which _may_ contain an [`Parsed::Object`],
//! but it's more likely to contain [`Parsed::Incomplete`];
//! this is because it typically takes multiple [`Token`]s to complete
//! parsing within a given context.
//!
//! In return, though, you get some important guarantees:
//!
//! 1. [`parse`][parse()] consumes only a _single_ token; and
//! 2. It has a constant upper bound for execution time.
//!
//! This means that [`parse`][parse()] will never cause the system to
//! hang---you
//! are in complete control over how much progress parsing makes,
//! and are free to stop and resume it at any time.
//!
//! However,
//! if you do not care about those things,
//! working with [`Parsed`] is verbose and inconvenient;
//! sometimes you just want the next [`Tree`] object.
//! For this,
//! we have [`parser_from`],
//! which does two things:
//!
//! 1. It filters out all [`Parsed::Incomplete`]; and
//! 2. On [`Parsed::Object`],
//! it yields the inner [`Tree`].
//!
//! This is a much more convenient API,
//! but is not without its downsides:
//! if the context is large
//! (e.g. the root node of a large XML document),
//! parsing can take a considerable amount of time,
//! and the [`Iterator`] produced by [`parser_from`] will cause the
//! system to process [`Iterator::next`] for that entire duration.
//!
//! See also [`attr_parser_from`] for parsing only attributes partway
//! through a token stream.
//!
//! [`Parsed::Incomplete`]: parse::Parsed::Incomplete
//! [`Parsed::Object`]: parse::Parsed::Object
//!
//! Cost of Parsing
//! ===============
//! While [`Tree`] is often much easier to work with than a stream of
//! [`Token`],
//! there are notable downsides:
//!
//! - The context in which parsing began
//! (see _Parser Implementation_ below)
//! must complete before _any_ token is emitted.
//! If parsing begins at the root element,
//! this means that the _entire XML document_ must be loaded into
//! memory before it is available for use.
//! - While the token stream is capable of operating using constant memory
//! (since [`Token`] can be discarded after being consumed),
//! a [`Tree`] holds a significant amount of data in memory.
//!
//! It is recommended to parse into [`Tree`] only for the portions of the
//! XML document that will benefit from it.
//! For example,
//! by avoiding parsing of the root element into a tree,
//! you can emit [`Tree`] for child elements without having to wait for
//! the entire document to be parsed.
//!
//!
//! Validity Of Token Stream
//! ========================
//! XIR verifies that each [`Token`] is syntactically valid and follows an
//! XML grammar subset;
//! as such,
//! the tree parser does not concern itself with syntax analysis.
//! It does,
//! however,
//! perform _[semantic analysis]_ on the token stream.
//! Given that,
//! [`Stack::parse_token`] returns a [`Result`],
//! with parsing errors represented by this module's [`StackError`].
//!
//! As an example,
//! a XIR token stream permits unbalanced tags.
//! However,
//! we cannot represent an invalid tree,
//! so that would result in a semantic error.
//!
//! [semantic analysis]: https://en.wikipedia.org/wiki/Semantic_analysis_(compilers)
//!
//!
//! Parser Implementation
//! =====================
//! The parser that lowers the XIR [`Token`] stream into a [`Tree`]
//! is implemented on [`Stack`].
//!
//! This parser is a [stack machine],
//! where the stack represents the [`Tree`] that is under construction.
//! Parsing operates on _context_.
//! At present, the only parsing context is an element---it
//! begins parsing at an opening tag ([`Token::Open`]) and completes
//! parsing at a _matching_ [`Token::Close`].
//! All attributes and child nodes encountered during parsing of an element
//! will automatically be added to the appropriate element,
//! recursively.
//!
//! [stack machine]: https://en.wikipedia.org/wiki/Stack_machine
//!
//! State Machine With A Typed Stack
//! --------------------------------
//! The parser is a [finate-state machine (FSM)] with a stack encoded in
//! variants of [`Stack`],
//! where each variant represents the current state of the parser.
//! The parser cannot be reasoned about as a pushdown automaton because the
//! language of the [`Stack`] is completely arbitrary,
//! but it otherwise operates in a similar manner.
//!
//! Each state transition consumes the entire stack and produces a new one,
//! which may be identical.
//! Intuitively, though, based on the construction of [`Stack`],
//! this is equivalent to popping the needed data off of the stack and
//! optionally pushing additional information.
//!
//! By encoding the stack in [`Stack`] variants,
//! we are able to verify statically that the stack is always in a valid
//! state and contains expected data---that
//! is, our stack is fully type-safe.
//!
//! [state machine]: https://en.wikipedia.org/wiki/Finite-state_machine
mod attr;
mod parse;
use self::{
attr::AttrParserState,
parse::{ParseResult, ParseState, ParseStateResult, ParsedResult},
};
use super::{QName, Token, TokenResultStream, TokenStream};
use crate::{span::Span, sym::SymbolId};
use std::{error::Error, fmt::Display, mem::take};
pub use attr::{Attr, AttrList};
type Parsed = parse::Parsed;
type ParseStatus = parse::ParseStatus;
/// A XIR tree (XIRT).
///
/// This object represents a XIR token stream parsed into a tree
/// representation.
/// This representation is easier to process and manipulate in most contexts,
/// but also requires memory allocation for the entire tree and requires
/// that a potentially significant portion of a token stream be processed
/// (e.g. from start to end tag for a given element).
///
/// _Note that this implementation is incomplete!_
/// It will be augmented as needed.
///
/// For more information,
/// see the [module-level documentation](self).
#[derive(Debug, Clone, Eq, PartialEq)]
pub enum Tree {
/// XML element.
Element(Element),
/// Text node.
///
/// A text node cannot contain other [`Tree`] elements;
/// sibling text nodes must exist within an [`Element`].
Text(SymbolId, Span),
/// This variant exists purely because `#[non_exhaustive]` has no effect
/// within the crate.
///
/// This ensures that matches must account for other variants that will
/// be introduced in the future,
/// easing the maintenance burden
/// (for both implementation and unit tests).
_NonExhaustive,
}
impl Into