// XIR flat (XIRF)
//
// Copyright (C) 2014-2023 Ryan Specialty, LLC.
//
// This file is part of TAME.
//
// This program is free software: you can redistribute it and/or modify
// it under the terms of the GNU General Public License as published by
// the Free Software Foundation, either version 3 of the License, or
// (at your option) any later version.
//
// This program is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
// GNU General Public License for more details.
//
// You should have received a copy of the GNU General Public License
// along with this program. If not, see <http://www.gnu.org/licenses/>.

//! Lightly-parsed XIR as a flat stream (XIRF).
//!
//! XIRF lightly parses a raw XIR [`TokenStream`] into a stream of
//! [`XirfToken`]s that are,
//! like a [`TokenStream`],
//! flat in structure.
//! It provides the following features over raw XIR:
//!
//! 1. All closing tags must correspond to a matching opening tag at the
//!    same depth;
//! 2. [`XirfToken`] exposes the [`Depth`] of each node-related token;
//! 3. Attribute tokens are parsed into [`Attr`] objects;
//! 4. Documents must begin with an element and end with the closing of
//!    that element;
//! 5. Parsing will fail if input ends before all elements have been
//!    closed.
tamer: Xirf::Text refinement
This teaches XIRF to optionally refine Text into RefinedText, which
determines whether the given SymbolId represents entirely whitespace.
This is something I've been putting off for some time, but now that I'm
parsing source language for NIR, it is necessary, in that we can only permit
whitespace Text nodes in certain contexts.
The idea is to capture the most common whitespace as preinterned
symbols. Note that this heuristic ought to be determined from scanning a
codebase, which I haven't done yet; this is just an initial list.
The fallback is to look up the string associated with the SymbolId and
perform a linear scan, aborting on the first non-whitespace character. This
combination of checks should be sufficiently performant for now considering
that this is only being run on source files, which really are not all that
large. (They become large when template-expanded.) I'll optimize further
if I notice it show up during profiling.
This also frees XIR itself from being concerned by Whitespace. Initially I
had used quick-xml's whitespace trimming, but it messed up my span
calculations, and those were a pain in the ass to implement to begin with,
since I had to resort to pointer arithmetic. I'd rather avoid tweaking it.
tameld will not check for whitespace, since it's not important---xmlo files,
if malformed, are the fault of the compiler; we can ignore text nodes except
in the context of code fragments, where they are never whitespace (unless
that's also a compiler bug).
Onward and yonward.
DEV-7145
2022-07-27 15:49:38 -04:00
//! 6. Text nodes may optionally be parsed into [`RefinedText`] to
//!    distinguish whitespace.
//!
//! XIRF lowering does not perform any dynamic memory allocation;
//! maximum element nesting depth is set statically depending on the needs
//! of the caller.

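
The balancing and depth guarantees listed in the module documentation can
be illustrated with a small standalone sketch. This is not TAMER's
implementation: `Tok`, `FlatError`, and `Flattener` are hypothetical names
introduced only for this example, element names are plain `&str` rather
than `QName`, and a fixed-size array stands in for the `ArrayVec` imported
below. It covers features 1, 2, and 5 only, and it shows how a statically
chosen maximum depth avoids dynamic allocation.

```rust
/// Hypothetical, simplified input token (not TAMER's `XirToken`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tok<'a> {
    Open(&'a str),
    /// `None` represents a self-closing tag.
    Close(Option<&'a str>),
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum FlatError<'a> {
    /// More than `MAX` elements would be open at once.
    MaxDepthExceeded,
    /// A closing tag appeared with no element open.
    ExtraClose,
    /// A closing tag did not match the innermost open element.
    Unbalanced { expected: &'a str, given: &'a str },
    /// Input ended while `open` elements were still open.
    UnexpectedEof { open: usize },
}

/// Tracks open elements in a stack whose capacity `MAX` is fixed at
/// compile time, so no heap allocation takes place.
pub struct Flattener<'a, const MAX: usize> {
    stack: [&'a str; MAX],
    len: usize,
}

impl<'a, const MAX: usize> Flattener<'a, MAX> {
    pub fn new() -> Self {
        Self { stack: [""; MAX], len: 0 }
    }

    /// Validate `tok` and return the depth of the element it belongs to
    /// (`0` is the root).
    pub fn feed(&mut self, tok: Tok<'a>) -> Result<usize, FlatError<'a>> {
        match tok {
            Tok::Open(name) => {
                if self.len == MAX {
                    return Err(FlatError::MaxDepthExceeded);
                }
                self.stack[self.len] = name;
                self.len += 1;
                Ok(self.len - 1)
            }
            Tok::Close(given) => {
                if self.len == 0 {
                    return Err(FlatError::ExtraClose);
                }
                let expected = self.stack[self.len - 1];
                match given {
                    // A named close must match the innermost open element.
                    Some(given) if given != expected => {
                        Err(FlatError::Unbalanced { expected, given })
                    }
                    // Self-closing or matching: pop, reporting the same
                    // depth as the corresponding open.
                    _ => {
                        self.len -= 1;
                        Ok(self.len)
                    }
                }
            }
        }
    }

    /// Call at end of input; fails if any element remains open.
    pub fn finish(&self) -> Result<(), FlatError<'a>> {
        match self.len {
            0 => Ok(()),
            open => Err(FlatError::UnexpectedEof { open }),
        }
    }
}
```

A caller would pick the capacity at the type level, for example
`Flattener::<64>::new()`, feed each token in order, and call `finish` once
input is exhausted. A rejected token leaves the stack unchanged, so a
caller is free to attempt recovery.
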
use super::{
    attr::{Attr, AttrParseError, AttrParseState},
    reader::is_xml_whitespace_char,
    CloseSpan, OpenSpan, QName, Token as XirToken, TokenStream,
};
use crate::{
tamer: diagnose: Introduction of diagnostic system
This is a working concept that will continue to evolve. I wanted to start
with some basic output before getting too carried away, since there's a lot
of potential here.
This is heavily influenced by Rust's helpful diagnostic messages, but will
take some time to realize a lot of the things that Rust does. The next step
will be to resolve line and column numbers, and then possibly include
snippets and underline spans, placing the labels alongside them. I need to
balance this work with everything else I have going on.
This is a large commit, but it converts the existing Error Display impls
into Diagnostic. This separation is a bit verbose, so I'll see how this
ends up evolving.
Diagnostics are tied to Error at the moment, but I imagine in the future
that any object would be able to describe itself, error or not, which would
be useful in the future both for the Summary Page and for query
functionality, to help developers understand the systems they are writing
using TAME.
Output is integrated into tameld only in this commit; I'll add tamec
next. Examples of what this outputs are available in the test cases in this
commit.
DEV-10935
2022-04-13 14:41:54 -04:00
    diagnose::{Annotate, AnnotatedSpan, Diagnostic},
    f::Functor,
    parse::prelude::*,
    span::Span,
    sym::{st::is_common_whitespace, GlobalSymbolResolve, SymbolId},
tamer: xir: Introduce {Ele,Open,Close}Span
This isn't conceptually all that significant of a change, but there was a lot
to modify to get it working. I would generally separate this into a commit
for the implementation and another commit for the integration, but I decided
to keep things together.
This serves a role similar to AttrSpan---this allows deriving a span
representing the element name from a span representing the entire XIR
token. This will provide more useful context for errors---including the tag
delimiter(s) means that we care about the fact that an element is in that
position (as opposed to some other type of node) within the context of an
error. However, if we are expecting an element but take issue with the
element name itself, we want to place emphasis on that instead.
This also starts to consider the issue of span contexts---a blob of detached
data that is `Span` is useful for error context, but it's not useful for
manipulation or deriving additional information. For that, we need to
encode additional context, and this is an attempt at that.
I am interested in the concept of providing Spans that are guaranteed to
actually make sense---that are instantiated and manipulated with APIs that
ensure consistency. But such a thing buys us very little, practically
speaking, over what I have now for TAMER, and so I don't expect to actually
implement that for this project; I'll leave that for a personal
project. TAMER's already taken a lot of my personal interests and it can
cause me a lot of grief sometimes (with regards to letting my aspirations
cause me more work).
DEV-7145
2022-06-24 13:51:49 -04:00
    xir::EleSpan,
};
use arrayvec::ArrayVec;
use std::{
    convert::Infallible,
    error::Error,
    fmt::{Debug, Display},
    marker::PhantomData,
};

// Used for organization.
pub use accept::*;

/// Tag nesting depth
/// (`0` represents the root).
///
/// Note: the lack of a [`Default`] implementation is intentional so that
/// this does not see lax initialization;
/// you probably want [`Depth::root`] in that case.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd)]
tamer: obj::xmlo::reader: Begin conversion to ParseState
This begins to transition XmloReader into a ParseState. Unlike previous
changes where ParseStates were composed into a single ParseState, this is
instead a lowering operation that will take the output of one Parser and
provide it to another.
The mess in ld::poc (...which still needs to be refactored and removed)
shows the concept, which will be abstracted away. This won't actually get
to the ASG in order to test that this works with the
wip-xmlo-xir-reader flag on (development hasn't gotten that far yet), but
since it type-checks, it should conceptually work.
Wiring lowering operations together is something that I've been dreading for
months, but my approach of only abstracting after-the-fact has helped to
guide a sane approach for this. For some definition of "sane".
It's also worth noting that AsgBuilder will too become a ParseState
implemented as another lowering operation, so:
XIR -> XIRF -> XMLO -> ASG
These steps will all be streaming, with iteration happening only at the
topmost level. For this reason, it's important that ASG not be responsible
for doing that pull, and further we should propagate Parsed::Incomplete
rather than filtering it out and looping an indeterminate number of times
outside of the toplevel.
One final note: the choice of 64 for the maximum depth is entirely
arbitrary and should be more than generous; it'll be finalized at some point
in the future once I actually evaluate what maximum depth is reasonable
based on how the system is used, with some added growing room.
DEV-10863
2022-03-22 13:56:43 -04:00
pub struct Depth(pub usize);

impl Depth {
    /// Depth representing a root.
    pub fn root() -> Depth {
        Depth(0)
    }

    /// Yield a new [`Depth`] representing the expected depth of children of
    /// an element at the current depth.
    ///
    /// That description is probably more confusing than the method name.
    pub fn child_depth(&self) -> Depth {
        match self {
            Depth(depth) => Depth(depth + 1),
        }
    }
}

impl Display for Depth {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        Display::fmt(&self.0, f)
    }
}

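
One way a downstream consumer might use `Depth` in place of its own stack
is sketched below: an unwanted element is skipped, along with all of its
descendants, by remembering only the depth at which it opened. `Node` and
`without_element` are hypothetical names for this example, `Node` is a
pared-down stand-in for the token type, and the sketch assumes that a
closing tag carries the same depth as its opening tag while text carries
the depth of a child.

```rust
/// Local copy of the `Depth` newtype so that the sketch stands alone.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd)]
pub struct Depth(pub usize);

/// Hypothetical, pared-down stand-in for a flat token carrying a depth.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Node<'a> {
    Open(&'a str, Depth),
    Close(Depth),
    Text(&'a str, Depth),
}

/// Return the tokens that remain after dropping every element named
/// `ignore` together with everything nested inside of it.
///
/// No stack is needed: the close that ends the skipped subtree is the
/// first close seen at the depth where skipping began.
pub fn without_element<'a>(toks: &[Node<'a>], ignore: &str) -> Vec<Node<'a>> {
    let mut skipping: Option<Depth> = None;
    let mut kept = Vec::new();

    for &tok in toks {
        match (skipping, tok) {
            // Matching close of the ignored element: resume afterwards.
            (Some(d), Node::Close(close_d)) if close_d == d => skipping = None,
            // Anything else inside the ignored subtree is dropped.
            (Some(_), _) => (),
            // Start skipping at this element's depth.
            (None, Node::Open(name, d)) if name == ignore => skipping = Some(d),
            (None, tok) => kept.push(tok),
        }
    }

    kept
}
```

Nested elements with the same name as the ignored one are handled without
extra bookkeeping, since their closing tags sit at a greater depth than the
one recorded.
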
/// A lightly-parsed XIRF object.
///
/// Certain XIR [`Token`]s are formed into a single object,
/// such as an [`Attr`].
/// Other objects retain the same format as their underlying token,
/// but are still validated to ensure that they are well-formed and that
/// the XML is well-structured.
///
/// Each token representing a child node contains a numeric [`Depth`]
/// indicating the nesting depth;
/// this can be used by downstream parsers to avoid maintaining their
/// own stack in certain cases.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum XirfToken<T: TextType> {
    /// Opening tag of an element.
    Open(QName, OpenSpan, Depth),

    /// Closing tag of an element.
    ///
    /// If the name is [`None`],
    /// then the tag is self-closing.
    /// If the name is [`Some`],
    /// then the tag is guaranteed to be balanced
    /// (matching the depth of its opening tag).
    Close(Option<QName>, CloseSpan, Depth),

    /// An attribute and its value.
    ///
    /// The associated [`Span`]s can be found on the enclosed [`Attr`]
    /// object.
    Attr(Attr),

    /// Comment node.
    Comment(SymbolId, Span, Depth),

    /// Character data as part of an element.
    ///
    /// See also [`CData`](XirfToken::CData) variant.
    Text(T, Depth),

    /// CData node (`<![CDATA[...]]>`).
    ///
    /// _Warning: It is up to the caller to ensure that the string `]]>` is
    /// not present in the text!_
    /// This is intended for reading existing XML data where CData is
    /// already present,
    /// not for producing new CData safely!
    CData(SymbolId, Span, Depth),
}
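
The optional text refinement mentioned in the module documentation is
described in the commit message as a two-step check: compare against a set
of common whitespace strings first, then fall back to a linear scan that
stops at the first non-whitespace character. The sketch below follows that
description using plain `&str` in place of `SymbolId`. `Refined` and
`refine` are hypothetical names; the two helpers are local stand-ins named
after the imports above, with assumed signatures, and the list of common
strings is purely illustrative.

```rust
/// Hypothetical refined text type for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Refined<'a> {
    /// Text consisting entirely of XML whitespace.
    Whitespace(&'a str),
    /// Text containing at least one non-whitespace character.
    Unrefined(&'a str),
}

/// XML 1.0 whitespace: space, tab, carriage return, line feed.
/// (Local stand-in; the signature of the real function is assumed.)
fn is_xml_whitespace_char(c: char) -> bool {
    matches!(c, ' ' | '\t' | '\r' | '\n')
}

/// Fast path standing in for a comparison against preinterned symbols.
/// (Illustrative list only.)
fn is_common_whitespace(s: &str) -> bool {
    matches!(s, " " | "\n" | "\n  " | "\n    " | "\t")
}

/// Classify `text`, trying the fast path before the linear scan.
pub fn refine(text: &str) -> Refined<'_> {
    // `Iterator::all` short-circuits, so the scan ends at the first
    // non-whitespace character.
    if is_common_whitespace(text)
        || text.chars().all(is_xml_whitespace_char)
    {
        Refined::Whitespace(text)
    } else {
        Refined::Unrefined(text)
    }
}
```

In this sketch an empty string is classified as whitespace, because `all`
is vacuously true for an empty iterator; whether that suits a real caller
is a design decision left open here.
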

impl<T: TextType> XirfToken<T> {
    pub fn open(
        qname: impl Into<QName>,
        span: impl Into<OpenSpan>,
        depth: Depth,
    ) -> Self {
        Self::Open(qname.into(), span.into(), depth)
    }

    pub fn close(
        qname: Option<impl Into<QName>>,
        span: impl Into<CloseSpan>,
        depth: Depth,
    ) -> Self {
        Self::Close(qname.map(Into::into), span.into(), depth)
    }

    pub fn attr(
        qname: impl Into<QName>,
        value: impl Into<SymbolId>,
        span: (impl Into<Span>, impl Into<Span>),
    ) -> Self {
        Self::Attr(Attr::new(
            qname.into(),
            value.into(),
            (span.0.into(), span.1.into()),
        ))
    }

    pub fn comment(
        comment: impl Into<SymbolId>,
        span: impl Into<Span>,
        depth: Depth,
    ) -> Self {
        Self::Comment(comment.into(), span.into(), depth)
    }

    pub fn text(text: impl Into<T>, depth: Depth) -> Self {
        Self::Text(text.into(), depth)
    }
}

impl<T: TextType> Token for XirfToken<T> {
tamer: parser::Parser: cfg(test) tracing
This produces useful parse traces that are output as part of a failing test
case. The parser generator macros can be a bit confusing to deal with when
things go wrong, so this helps to clarify matters.
This is _not_ intended to be machine-readable, but it does show that it
would be possible to generate machine-readable output to visualize the
entire lowering pipeline. Perhaps something for the future.
I left these inline in Parser::feed_tok because they help to elucidate what
is going on, just by reading what the trace would output---that is, it helps
to make the method more self-documenting, albeit a tad bit more
verbose. But with that said, it should probably be extracted at some point;
I don't want this to set a precedent where composition is feasible.
Here's an example from test cases:
[Parser::feed_tok] (input IR: XIRF)
| ==> Parser before tok is parsing attributes for `package`.
| | Attrs_(SutAttrsState_ { ___ctx: (QName(None, LocalPart(NCName(SymbolId(46 "package")))), OpenSpan(Span { len: 0, offset: 0, ctx: Context(SymbolId(1 "#!DUMMY")) }, 10)), ___done: false })
|
| ==> XIRF tok: `<unexpected>`
| | Open(QName(None, LocalPart(NCName(SymbolId(82 "unexpected")))), OpenSpan(Span { len: 0, offset: 1, ctx: Context(SymbolId(1 "#!DUMMY")) }, 10), Depth(1))
|
| ==> Parser after tok is expecting opening tag `<classify>`.
| | ChildA(Expecting_)
| | Lookahead: Some(Lookahead(Open(QName(None, LocalPart(NCName(SymbolId(82 "unexpected")))), OpenSpan(Span { len: 0, offset: 1, ctx: Context(SymbolId(1 "#!DUMMY")) }, 10), Depth(1))))
= note: this trace was output as a debugging aid because `cfg(test)`.
[Parser::feed_tok] (input IR: XIRF)
| ==> Parser before tok is expecting opening tag `<classify>`.
| | ChildA(Expecting_)
|
| ==> XIRF tok: `<unexpected>`
| | Open(QName(None, LocalPart(NCName(SymbolId(82 "unexpected")))), OpenSpan(Span { len: 0, offset: 1, ctx: Context(SymbolId(1 "#!DUMMY")) }, 10), Depth(1))
|
| ==> Parser after tok is attempting to recover by ignoring element with unexpected name `unexpected` (expected `classify`).
| | ChildA(RecoverEleIgnore_(QName(None, LocalPart(NCName(SymbolId(82 "unexpected")))), OpenSpan(Span { len: 0, offset: 1, ctx: Context(SymbolId(1 "#!DUMMY")) }, 10), Depth(1)))
| | Lookahead: None
= note: this trace was output as a debugging aid because `cfg(test)`.
DEV-7145
2022-07-18 14:32:34 -04:00
|
|
|
fn ir_name() -> &'static str {
|
|
|
|
"XIRF"
|
|
|
|
}
|
|
|
|
|
2022-03-21 13:40:54 -04:00
|
|
|
fn span(&self) -> Span {
|
2022-06-02 13:41:24 -04:00
|
|
|
use XirfToken::*;
|
2022-03-21 13:40:54 -04:00
|
|
|
|
|
|
|
match self {
|
tamer: xir: Introduce {Ele,Open,Close}Span
This isn't conceptally all that significant of a change, but there was a lot
of modify to get it working. I would generally separate this into a commit
for the implementation and another commit for the integration, but I decided
to keep things together.
This serves a role similar to AttrSpan---this allows deriving a span
representing the element name from a span representing the entire XIR
token. This will provide more useful context for errors---including the tag
delimiter(s) means that we care about the fact that an element is in that
position (as opposed to some other type of node) within the context of an
error. However, if we are expecting an element but take issue with the
element name itself, we want to place emphasis on that instead.
This also starts to consider the issue of span contexts---a blob of detached
data that is `Span` is useful for error context, but it's not useful for
manipulation or deriving additional information. For that, we need to
encode additional context, and this is an attempt at that.
I am interested in the concept of providing Spans that are guaranteed to
actually make sense---that are instantiated and manipulated with APIs that
ensure consistency. But such a thing buys us very little, practically
speaking, over what I have now for TAMER, and so I don't expect to actually
implement that for this project; I'll leave that for a personal
project. TAMER's already take a lot of my personal interests and it can
cause me a lot of grief sometimes (with regards to letting my aspirations
cause me more work).
DEV-7145
2022-06-24 13:51:49 -04:00
|
|
|
Open(_, OpenSpan(span, _), _)
|
|
|
|
| Close(_, CloseSpan(span, _), _)
|
2022-07-29 15:27:42 -04:00
|
|
|
| Comment(_, span, _)
|
|
|
|
| CData(_, span, _) => *span,
|
2022-03-21 13:40:54 -04:00
|
|
|
|
2022-07-29 15:27:42 -04:00
|
|
|
Text(text, _) => text.span(),
|
2022-03-21 13:40:54 -04:00
|
|
|
Attr(attr) => attr.span(),
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
tamer: Xirf::Text refinement
This teaches XIRF to optionally refine Text into RefinedText, which
determines whether the given SymbolId represents entirely whitespace.
This is something I've been putting off for some time, but now that I'm
parsing source language for NIR, it is necessary, in that we can only permit
whitespace Text nodes in certain contexts.
The idea is to capture the most common whitespace as preinterned
symbols. Note that this heuristic ought to be determined from scanning a
codebase, which I haven't done yet; this is just an initial list.
The fallback is to look up the string associated with the SymbolId and
perform a linear scan, aborting on the first non-whitespace character. This
combination of checks should be sufficiently performant for now considering
that this is only being run on source files, which really are not all that
large. (They become large when template-expanded.) I'll optimize further
if I notice it show up during profiling.
This also frees XIR itself from being concerned with whitespace. Initially I
had used quick-xml's whitespace trimming, but it messed up my span
calculations, and those were a pain in the ass to implement to begin with,
since I had to resort to pointer arithmetic. I'd rather avoid tweaking it.
tameld will not check for whitespace, since it's not important---xmlo files,
if malformed, are the fault of the compiler; we can ignore text nodes except
in the context of code fragments, where they are never whitespace (unless
that's also a compiler bug).
Onward and yonward.
DEV-7145
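
The two-stage check described above (pre-interned fast path, then a linear scan) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not TAMER's interner: `Sym`, `Interner`, `ws_prefill_len`, and the prefill list are all hypothetical stand-ins, and only the names `is_common_whitespace`, `is_whitespace`, and `is_xml_whitespace_char` mirror identifiers that appear in this file.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for `SymbolId`; the real type differs.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct Sym(u32);

/// Toy interner (an assumption made for this sketch only).
pub struct Interner {
    strs: Vec<String>,
    index: HashMap<String, Sym>,
    /// Symbols with an index below this were pre-interned as whitespace.
    ws_prefill_len: u32,
}

impl Interner {
    /// Pre-intern an illustrative (not measured) list of common whitespace.
    pub fn new() -> Self {
        let mut interner = Self {
            strs: Vec::new(),
            index: HashMap::new(),
            ws_prefill_len: 0,
        };
        for ws in [" ", "\n", "\n  ", "\n    ", "\t"] {
            interner.intern(ws);
        }
        interner.ws_prefill_len = interner.strs.len() as u32;
        interner
    }

    /// Return the existing symbol for `s`, or allocate a new one.
    pub fn intern(&mut self, s: &str) -> Sym {
        if let Some(&sym) = self.index.get(s) {
            return sym;
        }
        let sym = Sym(self.strs.len() as u32);
        self.strs.push(s.to_string());
        self.index.insert(s.to_string(), sym);
        sym
    }

    /// Fast path: an integer comparison with no string lookup.
    pub fn is_common_whitespace(&self, sym: Sym) -> bool {
        sym.0 < self.ws_prefill_len
    }

    /// Fast path first; otherwise look up the string and scan it,
    /// stopping at the first non-whitespace character.
    pub fn is_whitespace(&self, sym: Sym) -> bool {
        self.is_common_whitespace(sym)
            || self.strs[sym.0 as usize]
                .chars()
                .all(is_xml_whitespace_char)
    }
}

/// The character class `[ \n\t\r]`.
fn is_xml_whitespace_char(c: char) -> bool {
    matches!(c, ' ' | '\n' | '\t' | '\r')
}
```

Because `all` on an empty iterator yields `true`, the empty string counts as whitespace in this sketch as well.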
2022-07-27 15:49:38 -04:00
|
|
|
impl<T: TextType> Object for XirfToken<T> {}
|
2022-03-25 16:45:32 -04:00
|
|
|
|
2022-07-27 15:49:38 -04:00
|
|
|
impl<T: TextType> Display for XirfToken<T> {
|
2022-03-21 13:40:54 -04:00
|
|
|
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
|
2022-06-02 13:41:24 -04:00
|
|
|
use XirfToken::*;
|
2022-03-21 13:40:54 -04:00
|
|
|
|
|
|
|
match self {
|
|
|
|
Open(qname, span, _) => {
|
|
|
|
Display::fmt(&XirToken::Open(*qname, *span), f)
|
|
|
|
}
|
|
|
|
Close(oqname, span, _) => {
|
|
|
|
Display::fmt(&XirToken::Close(*oqname, *span), f)
|
|
|
|
}
|
|
|
|
Attr(attr) => Display::fmt(&attr, f),
|
2022-07-29 15:27:42 -04:00
|
|
|
Comment(sym, span, _) => {
|
2022-03-21 13:40:54 -04:00
|
|
|
Display::fmt(&XirToken::Comment(*sym, *span), f)
|
|
|
|
}
|
2022-07-29 15:27:42 -04:00
|
|
|
Text(text, _) => Display::fmt(text, f),
|
|
|
|
CData(sym, span, _) => {
|
|
|
|
Display::fmt(&XirToken::CData(*sym, *span), f)
|
|
|
|
}
|
2022-03-21 13:40:54 -04:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2023-02-14 16:04:17 -05:00
|
|
|
impl<T: TextType> XirfToken<T> {
|
|
|
|
pub fn depth(&self) -> Option<Depth> {
|
|
|
|
use XirfToken::*;
|
|
|
|
|
|
|
|
match self {
|
|
|
|
Open(_, _, depth)
|
|
|
|
| Close(_, _, depth)
|
|
|
|
| Comment(_, _, depth)
|
|
|
|
| Text(_, depth)
|
|
|
|
| CData(_, _, depth) => Some(*depth),
|
|
|
|
Attr(_) => None,
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2022-07-27 15:49:38 -04:00
|
|
|
impl<T: TextType> From<Attr> for XirfToken<T> {
|
2022-03-29 14:18:08 -04:00
|
|
|
fn from(attr: Attr) -> Self {
|
|
|
|
Self::Attr(attr)
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2023-02-14 16:04:17 -05:00
|
|
|
impl<T: TextType> Functor<Depth> for XirfToken<T> {
|
|
|
|
fn map(self, f: impl FnOnce(Depth) -> Depth) -> Self::Target {
|
|
|
|
use XirfToken::*;
|
|
|
|
|
|
|
|
match self {
|
|
|
|
Open(qn, span, depth) => Open(qn, span, f(depth)),
|
|
|
|
Close(qn, span, depth) => Close(qn, span, f(depth)),
|
|
|
|
Attr(_) => self,
|
|
|
|
Comment(sym, span, depth) => Comment(sym, span, f(depth)),
|
|
|
|
Text(text, depth) => Text(text, f(depth)),
|
|
|
|
CData(cdata, span, depth) => CData(cdata, span, f(depth)),
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2022-07-27 15:49:38 -04:00
|
|
|
/// Token of an optionally refined [`Text`].
|
|
|
|
///
|
|
|
|
/// XIRF is configurable on the type of processing it performs on [`Text`],
|
|
|
|
/// including the detection of [`Whitespace`].
|
|
|
|
///
|
|
|
|
/// See also [`RefinedText`].
|
2023-02-16 16:54:31 -05:00
|
|
|
pub trait TextType = From<Text> + Into<Text> + Token + Eq;
|
2022-07-27 15:49:38 -04:00
|
|
|
|
|
|
|
#[derive(Debug, PartialEq, Eq, Clone)]
|
|
|
|
pub struct Text(pub SymbolId, pub Span);
|
|
|
|
|
|
|
|
impl Token for Text {
|
|
|
|
fn ir_name() -> &'static str {
|
|
|
|
"XIRF Text"
|
|
|
|
}
|
|
|
|
|
|
|
|
fn span(&self) -> Span {
|
|
|
|
match self {
|
|
|
|
Self(_, span) => *span,
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
impl Display for Text {
|
|
|
|
fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
|
|
|
|
// TODO: We'll need care to output text so that it does not mess up
|
|
|
|
// formatted output.
|
|
|
|
// Further,
|
|
|
|
// text can be any arbitrary length,
|
|
|
|
// and so should probably be elided after a certain length.
|
|
|
|
write!(f, "text")
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/// A sequence of one or more whitespace characters.
|
|
|
|
///
|
|
|
|
/// Whitespace here is expected to consist of `[ \n\t\r]`
|
|
|
|
/// (where the first character in that class is a space).
|
|
|
|
#[derive(Debug, PartialEq, Eq, Clone)]
|
|
|
|
pub struct Whitespace(pub Text);
|
|
|
|
|
|
|
|
impl Display for Whitespace {
|
|
|
|
fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
|
|
|
|
// TODO: Escape output as necessary so that we can render the symbol
|
|
|
|
// string.
|
|
|
|
// See also `<Text as Display>::fmt` TODO.
|
|
|
|
write!(f, "whitespace")
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/// Text that has been refined to a more descriptive form.
|
|
|
|
///
|
|
|
|
/// This type may be used as a [`TextType`] to instruct XIRF to detect
|
|
|
|
/// [`Whitespace`].
|
|
|
|
#[derive(Debug, PartialEq, Eq, Clone)]
|
|
|
|
pub enum RefinedText {
|
|
|
|
/// Provided [`Text`] has been determined to be [`Whitespace`].
|
|
|
|
Whitespace(Whitespace),
|
|
|
|
/// Provided [`Text`] was not able to be refined into a more specific
|
|
|
|
/// type.
|
|
|
|
Unrefined(Text),
|
|
|
|
}
|
|
|
|
|
|
|
|
impl Token for RefinedText {
|
|
|
|
fn ir_name() -> &'static str {
|
|
|
|
"XIRF RefinedText"
|
|
|
|
}
|
|
|
|
|
|
|
|
fn span(&self) -> Span {
|
|
|
|
match self {
|
|
|
|
Self::Whitespace(Whitespace(text)) | Self::Unrefined(text) => {
|
|
|
|
text.span()
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
impl Display for RefinedText {
|
|
|
|
fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
|
|
|
|
match self {
|
|
|
|
Self::Whitespace(ws) => Display::fmt(ws, f),
|
|
|
|
Self::Unrefined(text) => Display::fmt(text, f),
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
impl From<Text> for RefinedText {
|
|
|
|
fn from(text: Text) -> Self {
|
|
|
|
match text {
|
|
|
|
Text(sym, _) if is_whitespace(sym) => {
|
|
|
|
Self::Whitespace(Whitespace(text))
|
|
|
|
}
|
|
|
|
_ => Self::Unrefined(text),
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2023-02-16 16:54:31 -05:00
|
|
|
impl From<RefinedText> for Text {
|
|
|
|
fn from(value: RefinedText) -> Self {
|
|
|
|
match value {
|
|
|
|
RefinedText::Whitespace(Whitespace(text))
|
|
|
|
| RefinedText::Unrefined(text) => text,
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2022-03-17 12:20:20 -04:00
|
|
|
/// XIRF-compatible attribute parser.
|
2022-04-04 21:50:47 -04:00
|
|
|
pub trait FlatAttrParseState<const MAX_DEPTH: usize> =
|
tamer: parse::state::ParseState::Super: Superstate concept
I'm disappointed that I keep having to implement features that I had hoped
to avoid implementing.
This introduces a "superstate" feature, which is intended really just to be
a sum type that is able to delegate to stitched `ParseState`s. This then
allows a `ParseState` to transition directly to another `ParseState` and
have the parent `ParseState` handle the delegation---a trampoline.
This issue naturally arises out of the recursive nature of parsing a TAME
XML document, where certain statements can be nested (like `<section>`), and
where expressions can be nested. I had gotten away with composition-based
delegation for now because `xmlo` headers do not have such nesting.
The composition-based approach falls flat for recursive structures. The
typical naive solution is boxing, which I cannot do, because not only is
this on an extremely hot code path, but I require that Rust be able to
deeply introspect and optimize away the lowering pipeline as much as
possible.
Many months ago, I figured that such a solution would require a trampoline,
as it typically does in stack-based languages, but I was hoping to avoid
it. Well, no longer; let's just get on with it.
This intends to implement trampolining in a `ParseState` that serves as that
sum type, rather than introducing it as yet another feature to `Parser`; the
latter would provide a more convenient API, but it would continue to bloat
`Parser` itself. Right now, only the element parser generator will require
use of this, so if it's needed beyond that, then I'll debate whether it's
worth providing a better abstraction. For now, the intent will be to use
the `Context` to store a stack that it can pop off of to restore the
previous `ParseState` before delegation.
DEV-7145
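
As a rough illustration of the trampoline idea described above: a single sum type holds whichever state is active, and suspended states are kept on a stack in the context instead of being boxed inside one another. Everything here (`Tok`, `Super`, `step`, `run`) is hypothetical and greatly simplified; it is not the `ParseState` API. A `Vec` is used for brevity, where a fixed-capacity stack would avoid allocation.

```rust
/// Hypothetical tokens for a nestable `<section>`-like construct.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tok {
    Open,
    Close,
    Item,
}

/// The "superstate": a sum type over every state that may be active.
#[derive(Debug, PartialEq, Eq)]
pub enum Super {
    Root,
    Section { items: usize },
    Done,
}

/// Suspended states, playing the role of the stack kept in the context.
pub type Stack = Vec<Super>;

/// One trampoline step: consume the owned state, return the next one.
pub fn step(st: Super, tok: Tok, stack: &mut Stack) -> Result<Super, String> {
    match (st, tok) {
        (Super::Done, _) => Err("unexpected token after end".into()),
        // Delegate: suspend the current state and enter a nested one.
        (st, Tok::Open) => {
            stack.push(st);
            Ok(Super::Section { items: 0 })
        }
        (Super::Section { items }, Tok::Item) => {
            Ok(Super::Section { items: items + 1 })
        }
        // Return: restore whichever state was suspended.
        (Super::Section { .. }, Tok::Close) => match stack.pop() {
            Some(Super::Root) => Ok(Super::Done),
            Some(prev) => Ok(prev),
            None => Err("unbalanced close".into()),
        },
        (Super::Root, tok) => Err(format!("unexpected {:?} at root", tok)),
    }
}

/// Drive the trampoline over a token slice.
pub fn run(toks: &[Tok]) -> Result<Super, String> {
    let mut stack = Stack::new();
    toks.iter()
        .try_fold(Super::Root, |st, &tok| step(st, tok, &mut stack))
}
```

No state ever contains another state, so nesting depth is bounded only by the stack rather than by the type's layout.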
2022-08-03 12:53:50 -04:00
|
|
|
ClosedParseState<Token = XirToken, Object = Attr>
|
2022-04-04 21:50:47 -04:00
|
|
|
where
|
2022-06-07 15:02:41 -04:00
|
|
|
Self: Default,
|
2022-06-02 13:41:24 -04:00
|
|
|
<Self as ParseState>::Error: Into<XirToXirfError>,
|
2022-04-04 21:50:47 -04:00
|
|
|
StateContext<MAX_DEPTH>: AsMut<<Self as ParseState>::Context>;
|
2022-03-17 12:20:20 -04:00
|
|
|
|
|
|
|
/// Stack of element [`QName`] and [`Span`] pairs,
|
|
|
|
/// representing the current level of nesting.
|
|
|
|
///
|
|
|
|
/// This storage is statically allocated,
|
|
|
|
/// allowing XIRF's parser to avoid memory allocation entirely.
|
|
|
|
type ElementStack<const MAX_DEPTH: usize> = ArrayVec<(QName, Span), MAX_DEPTH>;
|
|
|
|
|
2023-02-24 11:15:06 -05:00
|
|
|
/// Lower [XIR](XirToken) into [XIRF](XirfToken),
|
|
|
|
/// accepting only fully parsed XML documents.
|
|
|
|
///
|
|
|
|
/// If parsing is expected to stop before reaching the end of the document,
|
|
|
|
/// see [`PartialXirToXirf`].
|
|
|
|
/// For more information on accepting states,
|
|
|
|
/// see [`XirfAcceptor`].
|
|
|
|
pub type FullXirToXirf<const MAX_DEPTH: usize, T> =
|
|
|
|
XirToXirf<MAX_DEPTH, T, AttrParseState, FullXirfAcceptor>;
|
|
|
|
|
|
|
|
/// Lower [XIR](XirToken) into [XIRF](XirfToken),
|
|
|
|
/// accepting partially parsed XML documents at node boundaries.
|
|
|
|
///
|
|
|
|
/// If the entire XML document ought to be parsed,
|
|
|
|
/// see [`FullXirToXirf`] to provide a guarantee of an error in case the
|
|
|
|
/// system stops parsing before completion.
|
|
|
|
/// For more information on accepting states,
|
|
|
|
/// see [`XirfAcceptor`].
|
|
|
|
pub type PartialXirToXirf<const MAX_DEPTH: usize, T> =
|
|
|
|
XirToXirf<MAX_DEPTH, T, AttrParseState, PartialXirfAcceptor>;
|
|
|
|
|
|
|
|
/// Lower [XIR](XirToken) into [XIRF](XirfToken).
|
2022-03-17 12:20:20 -04:00
|
|
|
///
|
2022-03-17 23:22:38 -04:00
|
|
|
/// This parser is a pushdown automaton that parses a single XML document.
|
2022-07-27 15:49:38 -04:00
|
|
|
#[derive(Debug, PartialEq, Eq)]
|
2023-02-24 11:15:06 -05:00
|
|
|
pub enum XirToXirf<
|
|
|
|
const MAX_DEPTH: usize,
|
|
|
|
T,
|
|
|
|
SA = AttrParseState,
|
|
|
|
A: XirfAcceptor = FullXirfAcceptor,
|
|
|
|
> where
|
2022-04-04 21:50:47 -04:00
|
|
|
SA: FlatAttrParseState<MAX_DEPTH>,
|
2022-07-27 15:49:38 -04:00
|
|
|
T: TextType,
|
2022-03-17 12:20:20 -04:00
|
|
|
{
|
2022-03-17 23:22:38 -04:00
|
|
|
/// Document parsing has not yet begun.
|
2023-02-24 11:15:06 -05:00
|
|
|
PreRoot(PhantomData<(T, A)>),
|
2022-03-17 12:20:20 -04:00
|
|
|
/// Parsing nodes.
|
2022-04-04 21:50:47 -04:00
|
|
|
NodeExpected,
|
2022-03-17 12:20:20 -04:00
|
|
|
/// Delegating to attribute parser.
|
2022-04-04 21:50:47 -04:00
|
|
|
AttrExpected(SA),
|
2022-03-17 23:22:38 -04:00
|
|
|
/// End of document has been reached.
|
|
|
|
Done,
|
2022-03-17 12:20:20 -04:00
|
|
|
}
|
|
|
|
|
2023-02-24 11:15:06 -05:00
|
|
|
impl<const MAX_DEPTH: usize, T, SA, A: XirfAcceptor> Default
|
|
|
|
for XirToXirf<MAX_DEPTH, T, SA, A>
|
2022-07-27 15:49:38 -04:00
|
|
|
where
|
|
|
|
SA: FlatAttrParseState<MAX_DEPTH>,
|
|
|
|
T: TextType,
|
|
|
|
{
|
|
|
|
fn default() -> Self {
|
|
|
|
Self::PreRoot(PhantomData::default())
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2022-04-04 21:50:47 -04:00
|
|
|
pub type StateContext<const MAX_DEPTH: usize> =
|
|
|
|
Context<ElementStack<MAX_DEPTH>>;
|
|
|
|
|
2022-07-27 15:49:38 -04:00
|
|
|
/// Whether the given [`SymbolId`] is all whitespace according to
|
|
|
|
/// [`is_xml_whitespace_char`].
|
|
|
|
///
|
|
|
|
/// This will first consult the pre-interned whitespace symbol list using
|
|
|
|
/// [`is_common_whitespace`].
|
|
|
|
/// If that check fails,
|
|
|
|
/// it will resort to looking up the symbol and performing a linear scan
|
|
|
|
/// of the string,
|
|
|
|
/// terminating early if a non-whitespace character is found.
|
|
|
|
///
|
|
|
|
/// Note that the empty string is considered to be whitespace.
|
|
|
|
#[inline]
|
|
|
|
fn is_whitespace(sym: SymbolId) -> bool {
|
|
|
|
// See `sym::prefill`;
|
|
|
|
// this may require maintenance to keep the prefill list up-to-date
|
|
|
|
// with common whitespace symbols to avoid symbol lookups.
|
|
|
|
// This common check is purely a performance optimization.
|
|
|
|
is_common_whitespace(sym) || {
|
|
|
|
// If this is called often and is too expensive,
|
|
|
|
// it may be worth caching metadata about symbols,
|
|
|
|
// either for XIRF or globally.
|
|
|
|
// This requires multiple dereferences
|
|
|
|
// (for looking up the intern for the `SymbolId`,
|
|
|
|
// which may result in multiple (CPU) cache misses,
|
|
|
|
// but that would have to be profiled since the symbol may
|
|
|
|
// have just been interned and may be cached still)
|
|
|
|
// and then a linear scan of the associated `str`,
|
|
|
|
// though it will terminate as soon as it finds a non-whitespace
|
|
|
|
// character.
|
|
|
|
sym.lookup_str().chars().all(is_xml_whitespace_char)
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2023-02-24 11:15:06 -05:00
|
|
|
impl<const MAX_DEPTH: usize, T, SA, A: XirfAcceptor> ParseState
|
|
|
|
for XirToXirf<MAX_DEPTH, T, SA, A>
|
2022-03-17 12:20:20 -04:00
|
|
|
where
|
2022-04-04 21:50:47 -04:00
|
|
|
SA: FlatAttrParseState<MAX_DEPTH>,
|
2022-07-27 15:49:38 -04:00
|
|
|
T: TextType,
|
2022-03-17 12:20:20 -04:00
|
|
|
{
|
2022-03-18 15:26:05 -04:00
|
|
|
type Token = XirToken;
|
2022-07-27 15:49:38 -04:00
|
|
|
type Object = XirfToken<T>;
|
2022-06-02 13:41:24 -04:00
|
|
|
type Error = XirToXirfError;
|
2022-04-04 21:50:47 -04:00
|
|
|
type Context = StateContext<MAX_DEPTH>;
|
2022-03-17 12:20:20 -04:00
|
|
|
|
2022-04-04 21:50:47 -04:00
|
|
|
fn parse_token(
|
|
|
|
self,
|
|
|
|
tok: Self::Token,
|
|
|
|
stack: &mut Self::Context,
|
|
|
|
) -> TransitionResult<Self> {
|
2022-06-02 13:41:24 -04:00
|
|
|
use XirToXirf::{AttrExpected, Done, NodeExpected, PreRoot};
|
2022-03-17 12:20:20 -04:00
|
|
|
|
tamer: xir::parse::Transition: Generalize flat::Transition
XIRF introduced the concept of `Transition` to help document code and
provide mental synchronization points that make it easier to reason about
the system. I decided to hoist this into XIR's parser itself, and have
`parse_token` accept an owned state and require a new state to be returned,
utilizing `Transition`.
Together with the convenience methods introduced on `Transition` itself,
this produces much clearer code, as is evidenced by tree::Stack (XIRT's
parser). Passing an owned state is something that I had wanted to do
originally, but I thought it'd lead to more concise code to use a mutable
reference. Unfortunately, that concision led to code that was much more
difficult than necessary to understand, and ended up being a net negative
by leading to some more boilerplate for the nested types (granted,
that could have been alleviated in other ways).
This also opens up the possibility to do something that I wasn't able to
before, which was continue to abstract away parser composition by stitching
their state machines together. I don't know if this'll be done immediately,
but because the actual parsing operations are now able to compose
functionally without mutability getting in the way, the previous state coupling
issues with the parent parser go away.
DEV-10863
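
A small sketch of the owned-state style described above, where `parse_token` consumes the current state and must hand back the next one via `Transition`. The shapes below are simplified assumptions: `Parsed`, `Pairs`, the tuple-based `TransitionResult`, and the `err` helper are hypothetical, and only the names `Transition`, `TransitionResult`, `ok`, and `incomplete` echo identifiers used in this file.

```rust
/// Result of feeding one token: either more input is needed, or an
/// object was produced. (Hypothetical, simplified shape.)
#[derive(Debug, PartialEq, Eq)]
pub enum Parsed<T> {
    Incomplete,
    Object(T),
}

/// Next state paired with the outcome for this token.
pub type TransitionResult<S, T, E> = (S, Result<Parsed<T>, E>);

/// Wraps the state being transitioned *to*, making each transition
/// explicit at the point where it happens.
pub struct Transition<S>(pub S);

impl<S> Transition<S> {
    pub fn ok<T, E>(self, obj: T) -> TransitionResult<S, T, E> {
        (self.0, Ok(Parsed::Object(obj)))
    }

    pub fn incomplete<T, E>(self) -> TransitionResult<S, T, E> {
        (self.0, Ok(Parsed::Incomplete))
    }

    pub fn err<T, E>(self, e: E) -> TransitionResult<S, T, E> {
        (self.0, Err(e))
    }
}

/// Toy parser state: groups non-decreasing numbers into pairs.
#[derive(Debug, PartialEq, Eq)]
pub enum Pairs {
    Empty,
    Half(u32),
}

/// Takes the state by value; the caller only ever sees the new state.
pub fn parse_token(
    st: Pairs,
    tok: u32,
) -> TransitionResult<Pairs, (u32, u32), String> {
    match st {
        Pairs::Empty => Transition(Pairs::Half(tok)).incomplete(),
        Pairs::Half(a) if tok >= a => Transition(Pairs::Empty).ok((a, tok)),
        // On error, remain in the same state so the caller may recover.
        Pairs::Half(a) => {
            Transition(Pairs::Half(a)).err(format!("{} < {}", tok, a))
        }
    }
}
```

Since the old state is moved into the call, it cannot be accidentally reused or partially mutated; every match arm has to name the state it ends in.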
2022-03-17 15:50:35 -04:00
|
|
|
match (self, tok) {
|
2022-03-17 23:22:38 -04:00
|
|
|
// Comments are permitted before and after the first root element.
|
tamer: Xirf::Text refinement
This teaches XIRF to optionally refine Text into RefinedText, which
determines whether the given SymbolId represents entirely whitespace.
This is something I've been putting off for some time, but now that I'm
parsing source language for NIR, it is necessary, in that we can only permit
whitespace Text nodes in certain contexts.
The idea is to capture the most common whitespace as preinterned
symbols. Note that this heuristic ought to be determined from scanning a
codebase, which I haven't done yet; this is just an initial list.
The fallback is to look up the string associated with the SymbolId and
perform a linear scan, aborting on the first non-whitespace character. This
combination of checks should be sufficiently performant for now considering
that this is only being run on source files, which really are not all that
large. (They become large when template-expanded.) I'll optimize further
if I notice it show up during profiling.
This also frees XIR itself from being concerned with whitespace. Initially I
had used quick-xml's whitespace trimming, but it messed up my span
calculations, and those were a pain in the ass to implement to begin with,
since I had to resort to pointer arithmetic. I'd rather avoid tweaking it.
tameld will not check for whitespace, since it's not important---xmlo files,
if malformed, are the fault of the compiler; we can ignore text nodes except
in the context of code fragments, where they are never whitespace (unless
that's also a compiler bug).
Onward and yonward.
DEV-7145
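The two-step whitespace check described in this commit message (preinterned symbols first, then a linear scan as a fallback) might look something like the sketch below. The `SymbolId`, `Interner`, `SYM_SPACE`, and `SYM_NEWLINE` definitions here are hypothetical and exist only to make the example self-contained; TAMER's real interner and its list of preinterned whitespace symbols are not shown in this file.

```rust
// Hypothetical, minimal interner for illustration; not TAMER's `SymbolId`
// or global interner.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct SymbolId(usize);

/// Assumed preinterned whitespace symbols (an illustrative list only).
const SYM_SPACE: SymbolId = SymbolId(0);
const SYM_NEWLINE: SymbolId = SymbolId(1);

struct Interner {
    strings: Vec<String>,
}

impl Interner {
    /// Creates an interner with the preinterned symbols at fixed indices.
    fn new() -> Self {
        Interner {
            strings: vec![" ".to_string(), "\n".to_string()],
        }
    }

    /// Returns the existing symbol for `s`, or allocates a new one.
    fn intern(&mut self, s: &str) -> SymbolId {
        if let Some(i) = self.strings.iter().position(|x| x == s) {
            return SymbolId(i);
        }
        self.strings.push(s.to_string());
        SymbolId(self.strings.len() - 1)
    }

    fn lookup(&self, sym: SymbolId) -> &str {
        &self.strings[sym.0]
    }
}

/// Whether the string behind `sym` consists entirely of whitespace.
fn is_whitespace(interner: &Interner, sym: SymbolId) -> bool {
    match sym {
        // Fast path: a symbol comparison with no string lookup.
        SYM_SPACE | SYM_NEWLINE => true,
        // Fallback: linear scan; `all` stops at the first character
        // that is not whitespace.
        _ => interner.lookup(sym).chars().all(char::is_whitespace),
    }
}
```

Note that in this sketch an empty string counts as whitespace, since `all` is vacuously true; whether that matches the real behavior is not stated in the source.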
2022-07-27 15:49:38 -04:00
|
|
|
(st @ (PreRoot(_) | Done), XirToken::Comment(sym, span)) => {
|
2022-07-29 15:27:42 -04:00
|
|
|
let depth = Depth(stack.len());
|
|
|
|
Transition(st).ok(XirfToken::Comment(sym, span, depth))
|
2022-03-17 23:22:38 -04:00
|
|
|
}
|
|
|
|
|
2022-07-29 00:30:44 -04:00
|
|
|
// Ignore whitespace before or after root.
|
|
|
|
(st @ (PreRoot(_) | Done), XirToken::Text(sym, _))
|
|
|
|
if is_whitespace(sym) =>
|
|
|
|
{
|
|
|
|
Transition(st).incomplete()
|
2022-07-27 15:49:38 -04:00
|
|
|
}
|
|
|
|
|
2022-07-29 00:30:44 -04:00
|
|
|
(PreRoot(_), tok @ XirToken::Open(..)) => {
|
|
|
|
Self::parse_node(tok, stack)
|
2022-07-27 15:49:38 -04:00
|
|
|
}
|
2022-03-17 23:22:38 -04:00
|
|
|
|
2022-07-27 15:49:38 -04:00
|
|
|
(st @ PreRoot(_), tok) => {
|
|
|
|
Transition(st).err(XirToXirfError::RootOpenExpected(tok))
|
2022-03-17 23:22:38 -04:00
|
|
|
}
|
|
|
|
|
2022-04-04 21:50:47 -04:00
|
|
|
(NodeExpected, tok) => Self::parse_node(tok, stack),
|
2022-03-17 12:20:20 -04:00
|
|
|
|
tamer: Replace ParseStatus::Dead with generic lookahead
Oh what a tortured journey. I had originally tried to avoid formalizing
lookahead for all parsers by pretending that it was only needed for dead
state transitions (that is---states that have no transitions for a given
input token), but then I needed to yield information for aggregation. So I
added the ability to override the token for `Dead` to yield that, in
addition to the token. But then I also needed to yield lookahead for error
conditions. It was a mess that didn't make sense.
This eliminates `ParseStatus::Dead` entirely and fully integrates the
lookahead token in `Parser` that was previously implemented.
Notably, the lookahead token is encapsulated in `TransitionResult` and
unavailable to `ParseState` implementations, forcing them to rely on
`Parser` for recursion. This not only prevents `ParseState` from recursing,
but also simplifies delegation by removing the need to manually handle
tokens of lookahead.
The awkward case here is XIRT, which does not follow the streaming parsing
convention, because it was conceived before the parsing framework. It needs
to go away, but doing so right now would be a lot of work, so it has to
stick around for a little bit longer until the new parser generators can be
used instead. It is a persistent thorn in my side, going against the grain.
`Parser` will immediately recurse if it sees a token of lookahead with an
incomplete parse. This is because stitched parsers will frequently yield a
dead state indication when they're done parsing, and there's no use in
propagating an `Incomplete` status down the entire lowering pipeline. But,
that does mean that the toplevel is not the only thing recursing. _But_,
the behavior doesn't really change, in the sense that it would infinitely
recurse down the entire lowering stack (though there'd be an opportunity to
detect that). This should never happen with a correct parser, but it's not
worth the effort right now to try to force such a thing with Rust's type
system. Something like TLA+ is better suited here as an aid, but it
shouldn't be necessary with clear implementations and proper test
cases. Parser generators will also ensure such a thing cannot occur.
I had hoped to remove ParseStatus entirely in favor of Parsed, but there's a
lot of type inference that happens based on the fact that `ParseStatus` has
a `ParseState` type parameter; `Parsed` has only `Object`. It is desirable
for a public-facing `Parsed` to not be tied to `ParseState`, since consumers
need not be concerned with such a heavy type; however, we _do_ want that
heavy type internally, as it carries a lot of useful information that allows
for significant and powerful type inference, which in turn creates
expressive and convenient APIs.
DEV-7145
2022-07-11 23:49:57 -04:00
|
|
|
(AttrExpected(sa), tok) => sa.delegate(
|
|
|
|
tok,
|
|
|
|
stack,
|
|
|
|
|sa| Transition(AttrExpected(sa)),
|
|
|
|
|| Transition(NodeExpected),
|
|
|
|
),
|
2022-03-17 23:22:38 -04:00
|
|
|
|
|
|
|
(Done, tok) => Transition(Done).dead(tok),
|
2022-03-17 15:50:35 -04:00
|
|
|
}
|
2022-03-17 12:20:20 -04:00
|
|
|
}
|
|
|
|
|
|
|
|
/// Whether all elements have been closed.
|
|
|
|
///
|
|
|
|
/// Parsing will fail if there are any open elements.
|
|
|
|
/// Intuitively,
|
|
|
|
/// this means that the parser must have encountered the closing tag
|
|
|
|
/// for the root element.
|
tamer: xir::parse::ele: Superstate not to accept early EOF
This was accepting an early EOF when the active child `ParseState` was in an
accepting state, because it was not ensuring that anything on the stack was
also accepting.
Ideally, there should be nothing on the stack, and hopefully in the future
that's what happens. But with how things are today, it's important that, if
anything is on the stack, it is accepting.
Since `is_accepting` on the superstate is only called during finalization,
and because the check terminates early, and because the stack practically
speaking will only have a couple things on it max (unless we're in tail
position in a deeply nested tree, without TCO [yet]), this shouldn't be an
expensive check.
Implementing this did require that we expose `Context` to `is_accepting`,
which I had hoped to avoid having to do, but here we are.
DEV-7145
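The check described in this commit message can be illustrated with a small sketch: the superstate accepts only if the active child is accepting and everything still on the stack is accepting too. `ChildState` and `Superstate` are hypothetical simplifications; in particular, the stack is stored directly on the struct here, whereas the message says the real stack is reached through `Context`.

```rust
// Hypothetical stand-ins for illustration; not TAMER's superstate types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ChildState {
    Expecting,
    Accepting,
}

impl ChildState {
    fn is_accepting(self) -> bool {
        self == ChildState::Accepting
    }
}

struct Superstate {
    /// The child parser that is currently active.
    active: ChildState,
    /// Suspended parent parsers awaiting the active child's completion.
    stack: Vec<ChildState>,
}

impl Superstate {
    /// Accept only if the active child *and* every stacked state accept.
    /// Both `&&` and `all` short-circuit, so the scan ends at the first
    /// state that is not accepting.
    fn is_accepting(&self) -> bool {
        self.active.is_accepting()
            && self.stack.iter().all(|st| st.is_accepting())
    }
}
```

Checking only `active` would reproduce the early-EOF bug: an accepting child nested inside a parent that still expects input would be treated as a finished parse.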
2022-08-11 13:49:11 -04:00
|
|
|
fn is_accepting(&self, _: &Self::Context) -> bool {
|
2022-03-17 12:20:20 -04:00
|
|
|
// TODO: It'd be nice if we could also return additional context to
|
|
|
|
// aid the user in diagnosing the problem,
|
|
|
|
// e.g. what element(s) still need closing.
|
2023-02-24 11:15:06 -05:00
|
|
|
A::is_accepting(self)
|
2022-03-17 12:20:20 -04:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2023-02-24 11:15:06 -05:00
|
|
|
/// Configurable accepting states for [`XirToXirf`].
|
|
|
|
///
|
|
|
|
/// See this module's [`XirfAcceptor`] for more information.
|
|
|
|
mod accept {
|
|
|
|
use super::*;
|
|
|
|
|
|
|
|
/// Acceptor for [`XirToXirf`].
|
|
|
|
///
|
|
|
|
/// This is responsible for determining whether [`XirToXirf`] is in an
|
|
|
|
/// accepting state.
|
|
|
|
/// There are two acceptors:
|
|
|
|
///
|
|
|
|
/// 1. [`FullXirfAcceptor`] expects that the _entire_ XML document be
|
|
|
|
/// completely parsed up to and including the closing root node;
|
|
|
|
/// and
|
|
|
|
/// 2. [`PartialXirfAcceptor`] allows parsing to halt part-way through
|
|
|
|
/// an XML document,
|
|
|
|
/// provided that parsing ends at a node boundary.
|
|
|
|
///
|
|
|
|
/// See each respective acceptor for more information.
|
|
|
|
pub trait XirfAcceptor: Debug + PartialEq + Eq + Display + Default {
|
|
|
|
fn is_accepting<const MAX_DEPTH: usize, T, SA, A: XirfAcceptor>(
|
|
|
|
st: &XirToXirf<MAX_DEPTH, T, SA, A>,
|
|
|
|
) -> bool
|
|
|
|
where
|
|
|
|
SA: FlatAttrParseState<MAX_DEPTH>,
|
|
|
|
T: TextType;
|
|
|
|
}
|
|
|
|
|
|
|
|
/// Acceptor for fully parsed XML documents for [`XirToXirf`].
|
|
|
|
///
|
|
|
|
/// This acceptor should be used when the intent of the lowering
|
|
|
|
/// pipeline is to fully parse the [XIR](XirToken) stream.
|
|
|
|
/// In other words:
|
|
|
|
/// this should be used when the XML document being read ought to be
|
|
|
|
/// read _fully_,
|
|
|
|
/// where halting parsing before the root node is closed would indicate a
|
|
|
|
/// defect in the system.
|
|
|
|
///
|
|
|
|
/// For example,
|
|
|
|
/// when reading a file with TAME sources in `tamec`,
|
|
|
|
/// the compiler ought to ensure that the entire file is read to
|
|
|
|
/// completion.
|
|
|
|
/// If the lowering pipeline stops requesting tokens before the XIR
|
|
|
|
/// stream has ended,
|
|
|
|
/// then that means that compilation has halted before the system
|
|
|
|
/// has had a chance to consider the rest of the file.
|
|
|
|
/// Because the lowering pipeline is intended to parse and present
|
|
|
|
/// errors on the entire file each run,
|
|
|
|
/// this would represent a bug in the system,
|
|
|
|
/// and so we ought to fail.
|
|
|
|
///
|
|
|
|
/// Downstream parsers ought to fail for their own reasons as well,
|
|
|
|
/// but this provides an extra layer of protection for _anything_ that
|
|
|
|
/// happens to read XML files.
|
|
|
|
///
|
|
|
|
/// For an example of a situation where we may not wish to fail,
|
|
|
|
/// see [`PartialXirfAcceptor`].
|
|
|
|
#[derive(Debug, PartialEq, Eq, Default)]
|
|
|
|
pub struct FullXirfAcceptor;
|
|
|
|
|
|
|
|
impl XirfAcceptor for FullXirfAcceptor {
|
|
|
|
fn is_accepting<const MAX_DEPTH: usize, T, SA, A: XirfAcceptor>(
|
|
|
|
st: &XirToXirf<MAX_DEPTH, T, SA, A>,
|
|
|
|
) -> bool
|
|
|
|
where
|
|
|
|
SA: FlatAttrParseState<MAX_DEPTH>,
|
|
|
|
T: TextType,
|
|
|
|
{
|
|
|
|
matches!(st, XirToXirf::Done)
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
impl Display for FullXirfAcceptor {
|
|
|
|
fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
|
|
|
|
write!(f, "accepting only full documents")
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/// Acceptor for either full _or_ partially parsed XML documents for
|
|
|
|
/// [`XirToXirf`].
|
|
|
|
///
|
|
|
|
/// This acceptor is intended to be used when parsing the entirety of a
|
|
|
|
/// [XIR](XirToken) stream is not desirable;
|
|
|
|
/// it allows parsing to be completed at a node boundary
|
|
|
|
/// (when a node is expected).
|
|
|
|
/// This acceptor builds on the behavior of [`FullXirfAcceptor`],
|
|
|
|
/// and so will also accept all fully parsed documents.
|
|
|
|
///
|
|
|
|
/// For example,
|
|
|
|
/// when reading object files in `tameld`,
|
|
|
|
/// the linker is concerned only with header information;
|
|
|
|
/// the remainder of the XML document does not contain useful
|
|
|
|
/// information and would be wasteful to parse.
|
|
|
|
/// In that case,
|
|
|
|
/// we rely on downstream parsers to determine whether the document
|
|
|
|
/// has been sufficiently parsed.
|
|
|
|
///
|
|
|
|
/// This acceptor provides one weaker guarantee:
|
|
|
|
/// that the parser has _at least_ finished parsing a node,
|
|
|
|
/// such as an element.
|
|
|
|
/// Parsing must complete at a node boundary,
|
|
|
|
/// and so cannot halt in the middle of attribute parsing for an
|
|
|
|
/// element,
|
|
|
|
/// for example.
|
|
|
|
#[derive(Debug, PartialEq, Eq, Default)]
|
|
|
|
pub struct PartialXirfAcceptor;
|
|
|
|
|
|
|
|
impl XirfAcceptor for PartialXirfAcceptor {
|
|
|
|
fn is_accepting<const MAX_DEPTH: usize, T, SA, A: XirfAcceptor>(
|
|
|
|
st: &XirToXirf<MAX_DEPTH, T, SA, A>,
|
|
|
|
) -> bool
|
|
|
|
where
|
|
|
|
SA: FlatAttrParseState<MAX_DEPTH>,
|
|
|
|
T: TextType,
|
|
|
|
{
|
|
|
|
FullXirfAcceptor::is_accepting(st)
|
|
|
|
|| matches!(st, XirToXirf::NodeExpected)
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
impl Display for PartialXirfAcceptor {
|
|
|
|
fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
|
|
|
|
write!(f, "accepting partial documents at node boundaries")
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
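As a rough, self-contained illustration of the acceptor pattern above, the sketch below selects the accepting states at compile time through a type parameter. `ParserState`, `Acceptor`, `FullAcceptor`, `PartialAcceptor`, and `Parser` are simplified, hypothetical stand-ins for `XirToXirf` and `XirfAcceptor`; the real types carry additional parameters (`MAX_DEPTH`, `T`, `SA`) that are omitted here.

```rust
use std::marker::PhantomData;

// Hypothetical, simplified stand-in for the parser's states.
#[derive(Debug, PartialEq, Eq)]
enum ParserState {
    PreRoot,
    NodeExpected,
    AttrExpected,
    Done,
}

/// Decides which states count as accepting; chosen via a type parameter,
/// so no value of the acceptor needs to be stored or passed around.
trait Acceptor {
    fn is_accepting(st: &ParserState) -> bool;
}

/// Accepts only once the root element has been closed.
struct FullAcceptor;

/// Additionally accepts at any node boundary.
struct PartialAcceptor;

impl Acceptor for FullAcceptor {
    fn is_accepting(st: &ParserState) -> bool {
        matches!(st, ParserState::Done)
    }
}

impl Acceptor for PartialAcceptor {
    fn is_accepting(st: &ParserState) -> bool {
        // Builds on the full acceptor, then widens the accepting set.
        FullAcceptor::is_accepting(st)
            || matches!(st, ParserState::NodeExpected)
    }
}

struct Parser<A: Acceptor> {
    state: ParserState,
    _acceptor: PhantomData<A>,
}

impl<A: Acceptor> Parser<A> {
    fn new(state: ParserState) -> Self {
        Parser {
            state,
            _acceptor: PhantomData,
        }
    }

    /// Delegates the decision to the acceptor chosen at the type level.
    fn is_accepting(&self) -> bool {
        A::is_accepting(&self.state)
    }
}
```

Under these assumptions, a linker-style consumer would instantiate `Parser::<PartialAcceptor>` and may stop at a node boundary, while a compiler-style consumer would use `Parser::<FullAcceptor>` and treat anything short of `Done` as a failure.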
|
|
|
|
|
|
|
|
impl<const MAX_DEPTH: usize, T, SA, A: XirfAcceptor> Display
|
|
|
|
for XirToXirf<MAX_DEPTH, T, SA, A>
|
2022-05-25 14:20:10 -04:00
|
|
|
where
|
|
|
|
SA: FlatAttrParseState<MAX_DEPTH>,
|
2022-07-27 15:49:38 -04:00
|
|
|
T: TextType,
|
2022-05-25 14:20:10 -04:00
|
|
|
{
|
|
|
|
fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
|
2022-06-02 13:41:24 -04:00
|
|
|
use XirToXirf::*;
|
2022-05-25 14:20:10 -04:00
|
|
|
|
|
|
|
match self {
|
2022-07-27 15:49:38 -04:00
|
|
|
PreRoot(_) => write!(f, "expecting document root"),
|
2022-05-25 14:20:10 -04:00
|
|
|
NodeExpected => write!(f, "expecting a node"),
|
|
|
|
AttrExpected(sa) => Display::fmt(sa, f),
|
2022-07-29 00:30:44 -04:00
|
|
|
Done => write!(f, "done parsing document root"),
|
2023-02-24 11:15:06 -05:00
|
|
|
}?;
|
|
|
|
|
|
|
|
// e.g. ", accepting ..."
|
|
|
|
write!(f, ", ")?;
|
|
|
|
Display::fmt(&A::default(), f)
|
2022-05-25 14:20:10 -04:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2023-02-24 11:15:06 -05:00
|
|
|
impl<const MAX_DEPTH: usize, T, SA, A: XirfAcceptor>
|
|
|
|
XirToXirf<MAX_DEPTH, T, SA, A>
|
2022-03-17 12:20:20 -04:00
|
|
|
where
|
2022-04-04 21:50:47 -04:00
|
|
|
SA: FlatAttrParseState<MAX_DEPTH>,
|
2022-07-27 15:49:38 -04:00
|
|
|
T: TextType,
|
2022-03-17 12:20:20 -04:00
|
|
|
{
|
|
|
|
/// Parse a token while in a state expecting a node.
|
|
|
|
fn parse_node(
|
2022-03-21 13:40:54 -04:00
|
|
|
tok: <Self as ParseState>::Token,
|
2022-04-04 21:50:47 -04:00
|
|
|
stack: &mut ElementStack<MAX_DEPTH>,
|
2022-03-17 16:30:35 -04:00
|
|
|
) -> TransitionResult<Self> {
|
2022-06-02 13:41:24 -04:00
|
|
|
use XirToXirf::{AttrExpected, Done, NodeExpected};
|
2022-03-17 12:20:20 -04:00
|
|
|
|
2022-07-29 15:27:42 -04:00
|
|
|
let depth = Depth(stack.len());
|
|
|
|
|
2022-03-17 12:20:20 -04:00
|
|
|
match tok {
|
2022-03-21 13:40:54 -04:00
|
|
|
XirToken::Open(qname, span) if stack.len() == MAX_DEPTH => {
|
2022-06-02 13:41:24 -04:00
|
|
|
Transition(NodeExpected).err(XirToXirfError::MaxDepthExceeded {
|
tamer: xir: Introduce {Ele,Open,Close}Span
This isn't conceptually all that significant of a change, but there was a lot
to modify to get it working. I would generally separate this into a commit
for the implementation and another commit for the integration, but I decided
to keep things together.
This serves a role similar to AttrSpan---this allows deriving a span
representing the element name from a span representing the entire XIR
token. This will provide more useful context for errors---including the tag
delimiter(s) means that we care about the fact that an element is in that
position (as opposed to some other type of node) within the context of an
error. However, if we are expecting an element but take issue with the
element name itself, we want to place emphasis on that instead.
This also starts to consider the issue of span contexts---a blob of detached
data that is `Span` is useful for error context, but it's not useful for
manipulation or deriving additional information. For that, we need to
encode additional context, and this is an attempt at that.
I am interested in the concept of providing Spans that are guaranteed to
actually make sense---that are instantiated and manipulated with APIs that
ensure consistency. But such a thing buys us very little, practically
speaking, over what I have now for TAMER, and so I don't expect to actually
implement that for this project; I'll leave that for a personal
project. TAMER's already taken a lot of my personal interests and it can
cause me a lot of grief sometimes (with regards to letting my aspirations
cause me more work).
DEV-7145
2022-06-24 13:51:49 -04:00
|
|
|
open: (qname, span.tag_span()),
|
2022-04-04 21:50:47 -04:00
|
|
|
max: Depth(MAX_DEPTH),
|
|
|
|
})
|
2022-03-21 13:40:54 -04:00
|
|
|
}
|
|
|
|
|
|
|
|
XirToken::Open(qname, span) => {
|
2022-06-24 13:51:49 -04:00
|
|
|
stack.push((qname, span.tag_span()));
|
2022-03-17 12:20:20 -04:00
|
|
|
|
|
|
|
// Delegate to the attribute parser until it is complete.
|
2022-07-29 15:27:42 -04:00
|
|
|
Transition(AttrExpected(SA::default()))
|
|
|
|
.ok(XirfToken::Open(qname, span, depth))
|
2022-03-17 12:20:20 -04:00
|
|
|
}
|
|
|
|
|
2022-03-21 13:40:54 -04:00
|
|
|
XirToken::Close(close_oqname, close_span) => {
|
2022-03-17 12:20:20 -04:00
|
|
|
match (close_oqname, stack.pop()) {
|
2022-03-17 23:22:38 -04:00
|
|
|
(_, None) => unreachable!("parser should be in Done state"),
|
2022-03-17 12:20:20 -04:00
|
|
|
|
|
|
|
(Some(qname), Some((open_qname, open_span)))
|
|
|
|
if qname != open_qname =>
|
|
|
|
{
|
2022-04-04 21:50:47 -04:00
|
|
|
Transition(NodeExpected).err(
|
2022-06-02 13:41:24 -04:00
|
|
|
XirToXirfError::UnbalancedTag {
|
2022-03-17 12:20:20 -04:00
|
|
|
open: (open_qname, open_span),
|
2022-06-24 13:51:49 -04:00
|
|
|
close: (qname, close_span.tag_span()),
|
2022-03-17 15:50:35 -04:00
|
|
|
},
|
2022-03-17 12:20:20 -04:00
|
|
|
)
|
|
|
|
}
|
|
|
|
|
2022-03-17 23:22:38 -04:00
|
|
|
// Final closing tag (for root node) completes the document.
|
tamer: Integrate clippy
This invokes clippy as part of `make check` now, which I had previously
avoided doing (I'll elaborate on that below).
This commit represents the changes needed to resolve all the warnings
presented by clippy. Many changes have been made where I find the lints to
be useful and agreeable, but there are a number of lints, rationalized in
`src/lib.rs`, where I found the lints to be disagreeable. I have provided
rationale, primarily for those wondering why I desire to deviate from the
default lints, though it does feel backward to rationalize why certain lints
ought to be applied (the reverse should be true).
With that said, this did catch some legitimate issues, and it was also
helpful in getting some older code up-to-date with new language additions
that perhaps I used in new code but hadn't gone back and updated old code
for. My goal was to get clippy working without errors so that, in the
future, when others get into TAMER and are still getting used to Rust,
clippy is able to help guide them in the right direction.
One of the reasons I went without clippy for so long (though I admittedly
forgot I wasn't using it for a period of time) was because there were a
number of suggestions that I found disagreeable, and I didn't take the time
to go through them and determine what I wanted to follow. Furthermore, it
was hard to make that judgment when I was new to the language and lacked
the necessary experience to do so.
One thing I would like to comment further on is the use of `format!` with
`expect`, which is also what the diagnostic system convenience methods
do (which clippy does not cover). Because of all the work I've done trying
to understand Rust and looking at disassemblies and seeing what it
optimizes, I falsely assumed that Rust would convert such things into
conditionals in my otherwise-pure code...but apparently that's not the case,
when `format!` is involved.
I noticed that, after making the suggested fix with `get_ident`, Rust
proceeded to then inline it into each call site and then apply further
optimizations. It was also previously invoking the thread lock (for the
interner) unconditionally and invoking the `Display` implementation. That
is not at all what I intended, despite knowing the eager semantics of
function calls in Rust.
Anyway, possibly more to come on that, I'm just tired of typing and need to
move on. I'll be returning to investigate further diagnostic messages soon.
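
The eager evaluation described above can be observed with a small, self-contained sketch. `Counted`, `eager`, and `lazy` are hypothetical names invented for this illustration (they are not TAMER's `get_ident` or its diagnostic helpers); the counter simply records how often `Display` is invoked.

```rust
use std::cell::Cell;
use std::fmt;

/// Counts how many times its `Display` implementation is invoked.
/// (Hypothetical helper for illustration only.)
pub struct Counted<'a>(pub &'a Cell<u32>);

impl<'a> fmt::Display for Counted<'a> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        self.0.set(self.0.get() + 1);
        write!(f, "ident")
    }
}

/// Eager: the message is formatted before `expect` is called,
/// even when `value` is `Some` and the message is never used.
pub fn eager(value: Option<u32>, name: &Counted<'_>) -> u32 {
    value.expect(&format!("missing {}", name))
}

/// Lazy: the closure, and so the formatting,
/// runs only when `value` is `None`.
pub fn lazy(value: Option<u32>, name: &Counted<'_>) -> u32 {
    value.unwrap_or_else(|| panic!("missing {}", name))
}
```

Calling `eager(Some(1), &name)` increments the counter even though nothing fails, whereas `lazy(Some(1), &name)` leaves it untouched. The closure form is the rewrite that clippy's `expect_fun_call` lint suggests.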
2023-01-12 10:46:48 -05:00
|
|
|
(..) if stack.is_empty() => Transition(Done).ok(
|
tamer: Xirf::Text refinement
This teaches XIRF to optionally refine Text into RefinedText, which
determines whether the given SymbolId represents entirely whitespace.
This is something I've been putting off for some time, but now that I'm
parsing source language for NIR, it is necessary, in that we can only permit
whitespace Text nodes in certain contexts.
The idea is to capture the most common whitespace as preinterned
symbols. Note that this heuristic ought to be determined from scanning a
codebase, which I haven't done yet; this is just an initial list.
The fallback is to look up the string associated with the SymbolId and
perform a linear scan, aborting on the first non-whitespace character. This
combination of checks should be sufficiently performant for now considering
that this is only being run on source files, which really are not all that
large. (They become large when template-expanded.) I'll optimize further
if I notice it show up during profiling.
This also frees XIR itself from being concerned with whitespace. Initially I
had used quick-xml's whitespace trimming, but it messed up my span
calculations, and those were a pain in the ass to implement to begin with,
since I had to resort to pointer arithmetic. I'd rather avoid tweaking it.
tameld will not check for whitespace, since it's not important---xmlo files,
if malformed, are the fault of the compiler; we can ignore text nodes except
in the context of code fragments, where they are never whitespace (unless
that's also a compiler bug).
Onward and yonward.
DEV-7145
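
A rough sketch of the two-stage check described above: a fast path over a short list of common whitespace strings, then a linear scan that stops at the first non-whitespace character. This operates on plain `&str` rather than an interned `SymbolId`, and `Refined`, `COMMON_WS`, `is_whitespace`, and `refine` are hypothetical names; the list contents and the choice of XML whitespace characters (space, tab, LF, CR) are assumptions, not taken from the source.

```rust
/// Hypothetical stand-in for a refined text classification.
#[derive(Debug, PartialEq, Eq)]
pub enum Refined<'a> {
    /// The text consists entirely of whitespace.
    Whitespace(&'a str),
    /// The text contains at least one non-whitespace character.
    Unrefined(&'a str),
}

/// Illustrative list of common whitespace strings;
/// a real list would come from scanning a codebase.
const COMMON_WS: [&str; 4] = [" ", "\n", "\n  ", "\n    "];

/// Whether `text` consists only of XML whitespace (assumed definition).
pub fn is_whitespace(text: &str) -> bool {
    // Fast path: compare against the known-common strings.
    // With an interner this would be a cheap comparison of symbol ids.
    if COMMON_WS.iter().any(|&ws| ws == text) {
        return true;
    }

    // Fallback: linear scan;
    // `all` short-circuits on the first non-whitespace character.
    text.chars().all(|c| matches!(c, ' ' | '\t' | '\n' | '\r'))
}

/// Classify `text` as whitespace or leave it unrefined.
pub fn refine(text: &str) -> Refined<'_> {
    if is_whitespace(text) {
        Refined::Whitespace(text)
    } else {
        Refined::Unrefined(text)
    }
}
```

For example, `refine("\n  ")` takes the fast path, `refine(" \t\r\n")` falls through to the scan, and `refine("  foo ")` stops scanning at `f`.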
2022-07-27 15:49:38 -04:00
|
|
|
XirfToken::Close(close_oqname, close_span, Depth(0)),
|
|
|
|
),
|
2022-03-17 23:22:38 -04:00
|
|
|
|
2022-03-17 12:20:20 -04:00
|
|
|
(..) => {
|
|
|
|
let depth = stack.len();
|
2022-03-17 15:50:35 -04:00
|
|
|
|
2022-07-27 15:49:38 -04:00
|
|
|
Transition(NodeExpected).ok(XirfToken::Close(
|
2022-03-17 15:50:35 -04:00
|
|
|
close_oqname,
|
|
|
|
close_span,
|
|
|
|
Depth(depth),
|
|
|
|
))
|
2022-03-17 12:20:20 -04:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2022-07-29 15:27:42 -04:00
|
|
|
XirToken::Comment(sym, span) => Transition(NodeExpected)
|
|
|
|
.ok(XirfToken::Comment(sym, span, depth)),
|
|
|
|
|
2022-07-27 15:49:38 -04:00
|
|
|
XirToken::Text(sym, span) => Transition(NodeExpected)
|
2022-07-29 15:27:42 -04:00
|
|
|
.ok(XirfToken::Text(T::from(Text(sym, span)), depth)),
|
|
|
|
|
2022-03-21 13:40:54 -04:00
|
|
|
XirToken::CData(sym, span) => {
|
2022-07-29 15:27:42 -04:00
|
|
|
Transition(NodeExpected).ok(XirfToken::CData(sym, span, depth))
|
2022-03-17 15:50:35 -04:00
|
|
|
}
|
2022-03-17 12:20:20 -04:00
|
|
|
|
|
|
|
// We should transition to `State::Attr` before encountering any
|
|
|
|
// of these tokens.
|
2022-03-21 13:40:54 -04:00
|
|
|
XirToken::AttrName(..)
|
|
|
|
| XirToken::AttrValue(..)
|
Revert "tamer: xir: Initial re-introduction of AttrEnd"
This reverts commit b973d36862a4a2aaf53fb0b25fba01b57e5a7463.
Alright, I'm getting sick of fighting with myself on this. But rather than
just removing the last commit, I'm going to keep it around, so that my
thoughts are clearly documented for my future quarrels with myself.
Firstly: this added more overhead than I wanted it to. While it wasn't
significant, it did add 100--150ms to one of our largest systems, up from
~2.8s, which seems a bit much for a token that's really just meant to make
life easier for the parser.
Further, it seems that all I've managed to do is push my original problem to
a different layer---this started as a means to resolve having to emit both
an object and an error simultaneously in the case where aggregate attribute
parsing has completed, but we encounter an error on the next token (e.g. an
unexpected element). But XIRF, if it's missing AttrEnd, should throw an
error, but should also recover. Recovery is easy---just assume that it was
present---_but then we don't emit a XIRF `AttrEnd` token_, which is
necessary for downstream systems. So we'd need to either:
(a) emit both a token and an error; or
(b) panic.
But if we're doing (a), then the need for `AttrEnd` goes away, because it
solves the original problem (though the other concerns of the previous
commit still stand). (b) is not ideal at all, even though the missing token
does represent an internal system error; it's not something the user can
correct. But, given that it's something that the user cannot correct,
doesn't that imply that it's an awkward thing to include in the token
stream? So back to `AttrEnd` being an awkward PITA to have.
So, given (a), I'll just do that: errors will become more of a "hey, this
error just occurred, but I'm trying to recover---here's an object that you
should use if you choose to continue parsing, but it may or may not be what
you're looking for; proceed with caution". That flips the original script:
I imagined having external systems feed recovery tokens, but this
encapsulates recovery within the parser, which really is more appropriate,
though less flexible than having an omniscient external recovery system;
such a monolith was always an awkward concept and would be difficult to
implement cleanly.
This can also potentially be implemented as a generalization of the Dead
state change that allowed an object to be emitted alongside the
lookahead/error.
Anyway, back to where I was...I'm sure I'll look back on this in the future
shaking my head, reflecting on how naive I was.
DEV-7145
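
Option (a) above, emitting an error together with an object the caller may use to continue, could look roughly like the following. This is a toy sketch on digit characters, not XIRF; `NotADigit`, `Parsed`, `parse_digit`, and `sum_digits` are hypothetical names, and using `0` as the recovery object is an arbitrary choice for the example.

```rust
/// Error for an input character that is not a decimal digit.
#[derive(Debug, PartialEq, Eq)]
pub struct NotADigit(pub char);

/// Result of parsing one token (hypothetical).
#[derive(Debug, PartialEq, Eq)]
pub enum Parsed {
    /// Parsing succeeded and produced an object.
    Object(u32),
    /// Parsing failed, but the parser recovered on its own and offers an
    /// object to use if the caller chooses to continue.
    ErrorWithRecovery(NotADigit, u32),
}

/// Parse a single character, recovering from bad input by substituting `0`.
pub fn parse_digit(c: char) -> Parsed {
    match c.to_digit(10) {
        Some(d) => Parsed::Object(d),
        None => Parsed::ErrorWithRecovery(NotADigit(c), 0),
    }
}

/// Sum all digits, collecting errors while continuing past them.
pub fn sum_digits(input: &str) -> (u32, Vec<NotADigit>) {
    let mut sum = 0;
    let mut errors = Vec::new();

    for c in input.chars() {
        match parse_digit(c) {
            Parsed::Object(d) => sum += d,
            Parsed::ErrorWithRecovery(e, d) => {
                errors.push(e);
                sum += d;
            }
        }
    }

    (sum, errors)
}
```

Recovery is decided inside `parse_digit`, so the caller only chooses whether to keep going; it does not need to feed any recovery tokens back in.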
2022-06-29 11:02:18 -04:00
|
|
|
| XirToken::AttrValueFragment(..) => {
|
2022-03-17 12:20:20 -04:00
|
|
|
unreachable!("attribute token in NodeExpected state: {tok:?}")
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/// Produce a streaming parser lowering a XIR [`TokenStream`] into a XIRF
|
|
|
|
/// stream.
|
2022-07-27 15:49:38 -04:00
|
|
|
pub fn parse<const MAX_DEPTH: usize, T: TextType>(
|
2022-03-17 12:20:20 -04:00
|
|
|
toks: impl TokenStream,
|
2022-07-27 15:49:38 -04:00
|
|
|
) -> impl Iterator<Item = ParsedResult<XirToXirf<MAX_DEPTH, T>>> {
|
|
|
|
XirToXirf::<MAX_DEPTH, T>::parse(toks)
|
2022-03-17 12:20:20 -04:00
|
|
|
}
|
|
|
|
|
2022-06-02 13:41:24 -04:00
|
|
|
/// Parsing error from [`XirToXirf`].
|
2022-03-17 12:20:20 -04:00
|
|
|
#[derive(Debug, Eq, PartialEq)]
|
2022-06-02 13:41:24 -04:00
|
|
|
pub enum XirToXirfError {
|
2022-03-17 23:22:38 -04:00
|
|
|
/// Opening root element tag was expected.
|
2022-03-21 13:40:54 -04:00
|
|
|
RootOpenExpected(XirToken),
|
2022-03-17 23:22:38 -04:00
|
|
|
|
2022-03-17 12:20:20 -04:00
|
|
|
/// Opening tag exceeds the maximum nesting depth for this parser.
|
|
|
|
MaxDepthExceeded { open: (QName, Span), max: Depth },
|
|
|
|
|
|
|
|
/// The closing tag does not match the opening tag at the same level of
|
|
|
|
/// nesting.
|
|
|
|
UnbalancedTag {
|
|
|
|
open: (QName, Span),
|
|
|
|
close: (QName, Span),
|
|
|
|
},
|
|
|
|
|
|
|
|
/// Error from the attribute parser.
|
|
|
|
AttrError(AttrParseError),
|
|
|
|
}
|
|
|
|
|
2022-06-02 13:41:24 -04:00
|
|
|
impl Display for XirToXirfError {
|
2022-03-17 12:20:20 -04:00
|
|
|
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
|
2022-06-02 13:41:24 -04:00
|
|
|
use XirToXirfError::*;
|
2022-03-17 12:20:20 -04:00
|
|
|
|
|
|
|
match self {
|
tamer: diagnose: Introduction of diagnostic system
This is a working concept that will continue to evolve. I wanted to start
with some basic output before getting too carried away, since there's a lot
of potential here.
This is heavily influenced by Rust's helpful diagnostic messages, but will
take some time to realize a lot of the things that Rust does. The next step
will be to resolve line and column numbers, and then possibly include
snippets and underline spans, placing the labels alongside them. I need to
balance this work with everything else I have going on.
This is a large commit, but it converts the existing Error Display impls
into Diagnostic. This separation is a bit verbose, so I'll see how this
ends up evolving.
Diagnostics are tied to Error at the moment, but I imagine in the future
that any object would be able to describe itself, error or not, which would
be useful in the future both for the Summary Page and for query
functionality, to help developers understand the systems they are writing
using TAME.
Output is integrated into tameld only in this commit; I'll add tamec
next. Examples of what this outputs are available in the test cases in this
commit.
DEV-10935
2022-04-13 14:41:54 -04:00
|
|
|
RootOpenExpected(_tok) => {
|
|
|
|
write!(f, "missing opening root element",)
|
2022-03-17 23:22:38 -04:00
|
|
|
}
|
|
|
|
|
2022-03-17 12:20:20 -04:00
|
|
|
MaxDepthExceeded {
|
2022-04-13 14:41:54 -04:00
|
|
|
open: (_name, _),
|
2022-03-17 12:20:20 -04:00
|
|
|
max,
|
|
|
|
} => {
|
|
|
|
write!(
|
|
|
|
f,
|
2022-04-13 14:41:54 -04:00
|
|
|
"maximum XML element nesting depth of `{max}` exceeded"
|
2022-03-17 12:20:20 -04:00
|
|
|
)
|
|
|
|
}
|
|
|
|
|
|
|
|
UnbalancedTag {
|
2022-04-13 14:41:54 -04:00
|
|
|
open: (open_name, _),
|
|
|
|
close: (_close_name, _),
|
2022-03-17 12:20:20 -04:00
|
|
|
} => {
|
2022-04-13 14:41:54 -04:00
|
|
|
write!(f, "expected closing tag for `{open_name}`")
|
2022-03-17 12:20:20 -04:00
|
|
|
}
|
|
|
|
|
|
|
|
AttrError(e) => Display::fmt(e, f),
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2022-06-02 13:41:24 -04:00
|
|
|
impl Error for XirToXirfError {
|
2022-03-17 12:20:20 -04:00
|
|
|
fn source(&self) -> Option<&(dyn Error + 'static)> {
|
2022-04-13 14:41:54 -04:00
|
|
|
match self {
|
|
|
|
Self::AttrError(e) => Some(e),
|
|
|
|
_ => None,
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2022-06-02 13:41:24 -04:00
|
|
|
impl Diagnostic for XirToXirfError {
|
2022-04-13 14:41:54 -04:00
|
|
|
fn describe(&self) -> Vec<AnnotatedSpan> {
|
2022-06-02 13:41:24 -04:00
|
|
|
use XirToXirfError::*;
|
2022-04-13 14:41:54 -04:00
|
|
|
|
|
|
|
match self {
|
|
|
|
RootOpenExpected(tok) => {
|
|
|
|
// TODO: Should the span be the first byte,
|
|
|
|
// or should we delegate that question to an e.g. `SpanLike`?
|
|
|
|
tok.span()
|
|
|
|
.error("an opening root node was expected here")
|
|
|
|
.into()
|
|
|
|
}
|
|
|
|
|
|
|
|
MaxDepthExceeded {
|
|
|
|
open: (_, span),
|
|
|
|
max,
|
|
|
|
} => span
|
|
|
|
.error(format!(
|
|
|
|
"this opening tag increases the level of nesting \
|
|
|
|
past the limit of {max}"
|
|
|
|
))
|
|
|
|
.into(),
|
|
|
|
|
|
|
|
UnbalancedTag {
|
|
|
|
open: (open_name, open_span),
|
2022-04-28 15:47:34 -04:00
|
|
|
close: (_close_name, close_span),
|
2022-04-13 14:41:54 -04:00
|
|
|
} => {
|
|
|
|
// TODO: hint saying that the nesting could be wrong, etc;
|
|
|
|
// we can't just suggest a replacement,
|
|
|
|
// since that's not necessarily the problem
|
|
|
|
vec![
|
|
|
|
open_span
|
|
|
|
.note(format!("element `{open_name}` is opened here")),
|
2022-04-28 15:47:34 -04:00
|
|
|
// No need to state the close name since the source line
|
|
|
|
// will be highlighted by the diagnostic message.
|
|
|
|
close_span.error(format!("expected `</{open_name}>`")),
|
2022-04-13 14:41:54 -04:00
|
|
|
]
|
|
|
|
}
|
|
|
|
|
|
|
|
AttrError(e) => e.describe(),
|
|
|
|
}
|
2022-03-17 12:20:20 -04:00
|
|
|
}
|
|
|
|
}
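The `describe` implementation above maps each error variant to one or more labeled spans. A minimal standalone sketch of that pattern follows; `Span`, `Level`, `Annotated`, and `NestingError` here are simplified, hypothetical stand-ins written for illustration and are not TAMER's actual diagnostic types.

```rust
/// Hypothetical byte range within a source file (not TAMER's `Span`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Span {
    pub offset: usize,
    pub len: usize,
}

/// Severity attached to a labeled span.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Level {
    Error,
    Note,
}

/// A span paired with a severity and a human-readable label.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Annotated {
    pub span: Span,
    pub level: Level,
    pub label: String,
}

impl Span {
    /// Label this span as the site of an error.
    pub fn error(self, label: impl Into<String>) -> Annotated {
        Annotated { span: self, level: Level::Error, label: label.into() }
    }

    /// Label this span with supporting context.
    pub fn note(self, label: impl Into<String>) -> Annotated {
        Annotated { span: self, level: Level::Note, label: label.into() }
    }
}

/// Hypothetical error type with two of the shapes seen above.
#[derive(Debug, PartialEq, Eq)]
pub enum NestingError {
    MaxDepthExceeded { open: Span, max: usize },
    UnbalancedTag { open: (String, Span), close: Span },
}

impl NestingError {
    /// Describe the error as a list of labeled spans.
    pub fn describe(&self) -> Vec<Annotated> {
        match self {
            NestingError::MaxDepthExceeded { open, max } => vec![open.error(
                format!("this opening tag increases the level of nesting past the limit of {max}"),
            )],
            NestingError::UnbalancedTag { open: (name, open_span), close } => vec![
                // Context first: where the element was opened.
                open_span.note(format!("element `{name}` is opened here")),
                // Then the actual error site: the mismatched closing tag.
                close.error(format!("expected `</{name}>`")),
            ],
        }
    }
}
```

In this sketch an unbalanced tag yields two entries, a note on the opening tag followed by an error on the closing tag, so a renderer can show both locations together.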
|
|
|
|
|
2022-06-02 13:41:24 -04:00
|
|
|
impl From<AttrParseError> for XirToXirfError {
|
2022-03-17 12:20:20 -04:00
|
|
|
fn from(e: AttrParseError) -> Self {
|
|
|
|
Self::AttrError(e)
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2023-02-16 16:54:31 -05:00
|
|
|
/// Lower a [`XirfToken`] stream into a [`XirToken`] stream.
|
|
|
|
///
|
|
|
|
/// This is the dual of [`XirToXirf`],
|
|
|
|
/// and is intended to be used when the system _generates_ XML.
|
|
|
|
/// If you do not need any features of XIRF,
|
|
|
|
/// and aren't using any operation that produces it,
|
|
|
|
/// then you may also skip a step and just emit XIR to avoid having to
|
|
|
|
/// perform this lowering operation.
|
|
|
|
#[derive(Debug, PartialEq, Eq)]
|
|
|
|
pub enum XirfToXir<T: TextType> {
|
|
|
|
Ready(PhantomData<T>),
|
|
|
|
AttrVal(PhantomData<T>),
|
|
|
|
}
|
|
|
|
|
|
|
|
impl<T: TextType> Default for XirfToXir<T> {
|
|
|
|
fn default() -> Self {
|
|
|
|
Self::Ready(Default::default())
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
impl<T: TextType> Display for XirfToXir<T> {
|
|
|
|
fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
|
|
|
|
write!(f, "translating XIRF to XIR")
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
impl<T: TextType> ParseState for XirfToXir<T> {
|
|
|
|
type Token = XirfToken<T>;
|
|
|
|
type Object = XirToken;
|
|
|
|
type Error = Infallible;
|
|
|
|
|
|
|
|
fn parse_token(
|
|
|
|
self,
|
|
|
|
tok: Self::Token,
|
|
|
|
_: NoContext,
|
|
|
|
) -> TransitionResult<Self::Super> {
|
|
|
|
use XirToken as Xir;
|
|
|
|
use XirfToXir::*;
|
|
|
|
use XirfToken as Xirf;
|
|
|
|
|
|
|
|
macro_rules! to {
|
|
|
|
($tok:expr) => {
|
|
|
|
Transition(self).ok($tok)
|
|
|
|
};
|
|
|
|
}
|
|
|
|
|
|
|
|
match tok {
|
|
|
|
Xirf::Open(qname, ospan, _) => to!(Xir::Open(qname, ospan)),
|
|
|
|
Xirf::Close(qname, cspan, _) => to!(Xir::Close(qname, cspan)),
|
|
|
|
Xirf::Attr(attr) => match self {
|
|
|
|
Self::Ready(p) => Transition(AttrVal(p))
|
|
|
|
.ok(Xir::AttrName(attr.name(), attr.attr_span().key_span()))
|
|
|
|
.with_lookahead(Xirf::Attr(attr)),
|
|
|
|
Self::AttrVal(p) => Transition(Ready(p)).ok(Xir::AttrValue(
|
|
|
|
attr.value(),
|
|
|
|
attr.attr_span().value_span(),
|
|
|
|
)),
|
|
|
|
},
|
|
|
|
Xirf::Comment(sym, span, _) => to!(Xir::Comment(sym, span)),
|
|
|
|
Xirf::Text(x, _) => match x.into() {
|
|
|
|
Text(sym, span) => to!(Xir::Text(sym, span)),
|
|
|
|
},
|
|
|
|
Xirf::CData(sym, span, _) => to!(Xir::CData(sym, span)),
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
fn is_accepting(&self, _: &Self::Context) -> bool {
|
|
|
|
matches!(self, Self::Ready(_))
|
|
|
|
}
|
|
|
|
}
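The attribute case above is the only one that needs two states: a single XIRF attribute token becomes two XIR tokens (name, then value), so the first step emits the name and hands the same token back as lookahead, and the second step emits the value. The following is a rough standalone sketch of that idea. `FlatToken`, `RawToken`, `Lower`, `step`, and `lower_all` are hypothetical names invented for this example; the real implementation relies on TAMER's `ParseState`, `Transition`, and span types, which are not reproduced here.

```rust
/// Hypothetical flat token where an attribute is a single unit.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum FlatToken {
    Open(String),
    Attr(String, String),
    Close(String),
}

/// Hypothetical raw token where an attribute is split in two.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum RawToken {
    Open(String),
    AttrName(String),
    AttrValue(String),
    Close(String),
}

/// Lowering state: either ready for any token, or mid-attribute.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Lower {
    Ready,
    AttrVal,
}

impl Default for Lower {
    fn default() -> Self {
        Lower::Ready
    }
}

impl Lower {
    /// Consume one token, returning the next state, the emitted raw
    /// token, and an optional lookahead token to be fed back in.
    pub fn step(self, tok: FlatToken) -> (Lower, RawToken, Option<FlatToken>) {
        match tok {
            FlatToken::Open(name) => (self, RawToken::Open(name), None),
            FlatToken::Close(name) => (self, RawToken::Close(name), None),
            FlatToken::Attr(name, value) => match self {
                // First visit: emit the name and replay the same token.
                Lower::Ready => (
                    Lower::AttrVal,
                    RawToken::AttrName(name.clone()),
                    Some(FlatToken::Attr(name, value)),
                ),
                // Second visit: emit the value and return to `Ready`.
                Lower::AttrVal => (Lower::Ready, RawToken::AttrValue(value), None),
            },
        }
    }

    /// Only `Ready` is an accepting state; ending mid-attribute is not.
    pub fn is_accepting(self) -> bool {
        matches!(self, Lower::Ready)
    }
}

/// Drive the state machine over a whole input, honoring lookahead.
pub fn lower_all(toks: Vec<FlatToken>) -> Vec<RawToken> {
    let mut state = Lower::default();
    let mut out = Vec::new();
    for tok in toks {
        let mut next = Some(tok);
        while let Some(t) = next.take() {
            let (new_state, emitted, lookahead) = state.step(t);
            state = new_state;
            out.push(emitted);
            next = lookahead;
        }
    }
    out
}
```

Replaying the token as lookahead keeps each step to a single output token, at the cost of visiting every attribute twice.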
|
|
|
|
|
2022-03-17 12:20:20 -04:00
|
|
|
#[cfg(test)]
|
tamer: xir: Introduce {Ele,Open,Close}Span
This isn't conceptually all that significant of a change, but there was a lot
to modify to get it working. I would generally separate this into a commit
for the implementation and another commit for the integration, but I decided
to keep things together.
This serves a role similar to AttrSpan---this allows deriving a span
representing the element name from a span representing the entire XIR
token. This will provide more useful context for errors---including the tag
delimiter(s) means that we care about the fact that an element is in that
position (as opposed to some other type of node) within the context of an
error. However, if we are expecting an element but take issue with the
element name itself, we want to place emphasis on that instead.
This also starts to consider the issue of span contexts---a blob of detached
data that is `Span` is useful for error context, but it's not useful for
manipulation or deriving additional information. For that, we need to
encode additional context, and this is an attempt at that.
I am interested in the concept of providing Spans that are guaranteed to
actually make sense---that are instantiated and manipulated with APIs that
ensure consistency. But such a thing buys us very little, practically
speaking, over what I have now for TAMER, and so I don't expect to actually
implement that for this project; I'll leave that for a personal
project. TAMER's already taken on a lot of my personal interests and it can
cause me a lot of grief sometimes (with regards to letting my aspirations
cause me more work).
DEV-7145
2022-06-24 13:51:49 -04:00
|
|
|
pub mod test;
|