tamer: xir: Initial re-introduction of AttrEnd

AttrEnd was initially removed in
0cc0bc9d5a (and the commit prior), because
there was not a compelling reason to use it over a lookahead
operation (returning a token via a dead state transition); `AttrEnd`
simply introduced inconsistencies between the XIR reader (which produced
`AttrEnd`) and internal XIR stream generators (e.g. the XIR->XML lowering
operations, which do not).

But now that parsers are performing aggregation---in particular the
attribute parser-generator `xir::parse::attr`---this has become quite a
pain, because the token that triggers the dead state transition is itself
actionable.  For example:

  1. Open
  2. Attr
  3. Attr
  4. Open
  5. ...

In the happy case, token #4 results in `Parsed::Incomplete`, and so can just
be transformed into the object representing the aggregated attributes.  But
even in this happy path, it's ugly, and it requires non-tail recursion in
the parser, which in turn requires a duplicate stack allocation for the
`ParserState`.  That violates a core principle of the system.
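
To make the above concrete, here is a rough, self-contained sketch of the
dead-state pattern in play; the `Tok`, `AttrList`, `Step`, and `feed` names
are stand-ins for illustration only, not the actual `ParseState` and
`Transition` API:

  #[derive(Debug)]
  enum Tok {
      Open(&'static str),
      Attr(&'static str, &'static str),
  }

  #[derive(Debug, Default)]
  struct AttrList(Vec<(&'static str, &'static str)>);

  enum Step {
      // Token consumed; keep feeding.
      Incomplete(AttrList),
      // Token _not_ consumed: aggregation is over, and the caller must
      // replay the token itself (the dead state doubles as both lookahead
      // and aggregation result).
      Dead(AttrList, Tok),
  }

  fn feed(mut acc: AttrList, tok: Tok) -> Step {
      match tok {
          Tok::Attr(name, value) => {
              acc.0.push((name, value));
              Step::Incomplete(acc)
          }
          other => Step::Dead(acc, other),
      }
  }

  fn main() {
      // Tokens #2 through #4 from the example above, assuming token #1
      // (`Open`) was already consumed by a parent parser.
      let toks = [
          Tok::Attr("foo", "a"),
          Tok::Attr("bar", "b"),
          Tok::Open("child"), // token #4 ends the list and must be replayed
      ];

      let mut acc = AttrList::default();
      for tok in toks {
          match feed(acc, tok) {
              Step::Incomplete(next) => acc = next,
              Step::Dead(list, lookahead) => {
                  println!("aggregate: {list:?}; must replay: {lookahead:?}");
                  return;
              }
          }
      }
  }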

But if there is an error at #4---e.g. an unexpected element---then we no
longer have a `Parsed::Incomplete` to hijack for our own uses; we'd have
to introduce either the ability to return both an error and a token, or
the ability to keep a token of lookahead instead of reading from the
underlying token stream, but lookahead is complicated with push parsers,
which are used for parser composition.  Yikes.
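
The "error plus token" option, for instance, would mean something shaped
roughly like the following; this type is purely hypothetical and is not
part of the actual parsing framework:

  // Hypothetical shape only: an error that must also carry the unconsumed
  // token (#4 above) so that the caller can resume with it.
  enum AttrAggregateError<T> {
      UnexpectedToken {
          cause: String,  // whatever the underlying parse error was
          lookahead: T,   // the offending token, still unconsumed
      },
  }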

And furthermore, the aggregation has caused me to introduce the ability to
override the dead state type so that it can carry both a token of lookahead
and aggregation information.  This complicates the system and is going to be
confusing to others.

Given all of this, AttrEnd does now seem appropriate to reintroduce, since
it will allow processing of aggregate operations when encountering that
token without having to worry about the above scenario; without having to
duplicate a `ParseState` stack; without having to hijack dead state
transitions to produce our aggregate object; and so on for everything else
mentioned above.
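
Continuing the illustrative (non-API) sketch from above, but with an
explicit end-of-attribute-list token added to the stand-in `Tok` type, the
aggregate becomes an ordinary parse object and no lookahead or dead-state
machinery is needed:

  #[derive(Debug)]
  enum Tok {
      Attr(&'static str, &'static str),
      AttrEnd, // analogous to XirToken::AttrEnd in the real token type
  }

  #[derive(Debug, Default)]
  struct AttrList(Vec<(&'static str, &'static str)>);

  enum Step {
      Incomplete(AttrList), // token consumed; keep feeding
      Object(AttrList),     // aggregate yielded as an ordinary object
  }

  fn feed(mut acc: AttrList, tok: Tok) -> Step {
      match tok {
          Tok::Attr(name, value) => {
              acc.0.push((name, value));
              Step::Incomplete(acc)
          }
          // The delimiter itself ends aggregation: every token is
          // consumed and nothing needs to be replayed.
          Tok::AttrEnd => Step::Object(acc),
      }
  }

  fn main() {
      let toks = [Tok::Attr("foo", "a"), Tok::Attr("bar", "b"), Tok::AttrEnd];

      let mut acc = AttrList::default();
      for tok in toks {
          match feed(acc, tok) {
              Step::Incomplete(next) => acc = next,
              Step::Object(list) => {
                  println!("aggregate: {list:?}");
                  return;
              }
          }
      }
  }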

This commit does not modify those abstractions to use AttrEnd yet; it
re-introduces the token to the core system, not the parser-generators, and
it doesn't yet replace lookahead operations in the parsers that use
them.  That'll come next.  Unlike the commit that removed it, though, we are
now generating proper spans, so make note of that here.  This also does not
yet introduce the concept to XIRF, which did not exist at the time that
AttrEnd was removed, so XIRF is filtering it out until a following commit.

DEV-7145
main
Mike Gerwitz 2022-06-28 16:10:57 -04:00
parent 9276d00456
commit b973d36862
6 changed files with 191 additions and 40 deletions

View File

@@ -598,6 +598,22 @@ pub enum Token {
/// Element attribute value.
AttrValue(SymbolId, Span),
/// Delimiter indicating that attribute parsing has ended and that the
/// next token will be either a [`Token::Close`] for this element or a
/// child [`Token::Open`].
///
/// This token allows for streaming attribute aggregation without
/// lookahead,
/// and corresponds to the `>` terminal for start tags.
/// The token must be present if there are one or more attributes,
/// but is otherwise omitted,
/// with the `>` terminal considered part of [`Token::Open`] and
/// included in its span.
///
/// There is no corresponding attribute opening delimiter;
/// such is implied by [`Token::AttrName`].
AttrEnd(Span),
/// A portion of an element attribute value.
///
/// This allows for concatenating values into an attribute value without
@@ -659,6 +675,7 @@ impl Display for Token {
Self::AttrValue(attr_val, _) => {
write!(f, "attribute value `{}`", attr_val)
}
Self::AttrEnd(_) => write!(f, "end of attribute list"),
Self::AttrValueFragment(attr_val, _) => {
write!(f, "attribute value fragment `{}`", attr_val)
}
@@ -682,6 +699,7 @@ impl crate::parse::Token for Token {
| Close(_, CloseSpan(span, _))
| AttrName(_, span)
| AttrValue(_, span)
| AttrEnd(span)
| AttrValueFragment(_, span)
| Comment(_, span)
| Text(_, span)

View File

@@ -37,10 +37,19 @@ use super::Attr;
/// they do not influence the automaton's state transitions.
/// The actual parsing operation is therefore a FSM,
/// not a PDA.
///
/// This parser may be used to parse a portion of an attribute list,
/// but it cannot continue after having read a [`XirToken::AttrEnd`],
/// indicating the end of the attribute list.
#[derive(Debug, Eq, PartialEq)]
pub enum AttrParseState {
/// Expecting attribute name or end of attribute list.
Empty,
/// Attribute value expected.
Name(QName, Span),
/// End of attribute list;
/// no further attributes expected.
End(Span),
}
impl ParseState for AttrParseState {
@@ -53,14 +62,18 @@ impl ParseState for AttrParseState {
tok: Self::Token,
_: NoContext,
) -> TransitionResult<Self> {
use AttrParseState::{Empty, Name};
use AttrParseState::{Empty, End, Name};
match (self, tok) {
(Empty, XirToken::AttrName(name, span)) => {
Transition(Name(name, span)).incomplete()
}
(Empty, invalid) => Transition(Empty).dead(invalid),
(Empty, XirToken::AttrEnd(span)) => {
Transition(End(span)).incomplete()
}
(Empty | End(_), invalid) => Transition(Empty).dead(invalid),
(Name(name, nspan), XirToken::AttrValue(value, vspan)) => {
Transition(Empty).ok(Attr::new(name, value, (nspan, vspan)))
@@ -77,7 +90,7 @@ impl ParseState for AttrParseState {
#[inline]
fn is_accepting(&self) -> bool {
*self == Self::Empty
matches!(self, Self::Empty | Self::End(..))
}
}
@@ -96,6 +109,9 @@ impl Display for AttrParseState {
Name(name, _) => {
write!(f, "expecting an attribute value for {name}")
}
End(_) => {
write!(f, "finished with attribute parsing")
}
}
}
}
@@ -144,20 +160,24 @@ impl Diagnostic for AttrParseError {
#[cfg(test)]
mod test {
use std::assert_matches::assert_matches;
use super::*;
use crate::{
convert::ExpectInto,
parse::{EmptyContext, ParseStatus, Parsed},
parse::{EmptyContext, ParseError, ParseStatus, Parsed},
sym::GlobalSymbolIntern,
xir::test::{close_empty, open},
};
const S: Span = crate::span::DUMMY_SPAN;
const S2: Span = S.offset_add(1).unwrap();
const S1: Span = crate::span::DUMMY_SPAN;
const S2: Span = S1.offset_add(1).unwrap();
const S3: Span = S2.offset_add(1).unwrap();
const S4: Span = S3.offset_add(1).unwrap();
#[test]
fn dead_if_first_token_is_non_attr() {
let tok = open("foo", S);
let tok = open("foo", S1);
let sut = AttrParseState::default();
@@ -179,7 +199,8 @@ mod test {
let attr = "attr".unwrap_into();
let val = "val".intern();
let toks = [XirToken::AttrName(attr, S), XirToken::AttrValue(val, S2)]
// AttrEnd not required.
let toks = [XirToken::AttrName(attr, S1), XirToken::AttrValue(val, S2)]
.into_iter();
let sut = AttrParseState::parse(toks);
@@ -187,12 +208,18 @@ mod test {
assert_eq!(
Ok(vec![
Parsed::Incomplete,
Parsed::Object(Attr::new(attr, val, (S, S2))),
Parsed::Object(Attr::new(attr, val, (S1, S2))),
]),
sut.collect()
);
}
// TODO: A proper token will not be substituted;
// parsers are now expected to recover themselves rather than expect
// an external system to substitute tokens,
// unless such an explicit recovery mode is provided to be
// configured by the caller.
// The default recovery mode should drop the attribute entirely.
#[test]
fn parse_fails_when_attribute_value_missing_but_can_recover() {
let attr = "bad".unwrap_into();
@@ -202,7 +229,7 @@ mod test {
// This token indicates that we're expecting a value to come next in
// the token stream.
let TransitionResult(Transition(sut), result) =
sut.parse_token(XirToken::AttrName(attr, S), &mut EmptyContext);
sut.parse_token(XirToken::AttrName(attr, S1), &mut EmptyContext);
assert_eq!(result, Ok(ParseStatus::Incomplete));
// But we provide something else unexpected.
@@ -210,7 +237,7 @@ mod test {
sut.parse_token(close_empty(S2), &mut EmptyContext);
assert_eq!(
result,
Err(AttrParseError::AttrValueExpected(attr, S, close_empty(S2)))
Err(AttrParseError::AttrValueExpected(attr, S1, close_empty(S2)))
);
// We should not be in an accepting state,
@@ -227,10 +254,46 @@ mod test {
.parse_token(XirToken::AttrValue(recover, S2), &mut EmptyContext);
assert_eq!(
result,
Ok(ParseStatus::Object(Attr::new(attr, recover, (S, S2)))),
Ok(ParseStatus::Object(Attr::new(attr, recover, (S1, S2)))),
);
// Finally, we should now be in an accepting state.
assert!(sut.is_accepting());
}
// If `AttrEnd` is encountered,
// every otherwise-valid token after that point is rejected.
#[test]
fn parsing_ends_at_attr_end() {
let name = "attr".unwrap_into();
let val = "val".intern();
let toks = [
// This is valid.
XirToken::AttrName(name, S1),
XirToken::AttrValue(val, S2),
// But the list ends here.
XirToken::AttrEnd(S3),
// And so this token is invalid now.
XirToken::AttrName(name, S4),
]
.into_iter();
let mut sut = AttrParseState::parse(toks);
// The valid attribute should yield.
assert_eq!(Some(Ok(Parsed::Incomplete)), sut.next());
assert_eq!(
Some(Ok(Parsed::Object(Attr::new(name, val, (S1, S2))))),
sut.next()
);
// But we must stop parsing after this.
assert_eq!(Some(Ok(Parsed::Incomplete)), sut.next()); // AttrEnd
assert_matches!(
sut.next(),
Some(Err(ParseError::UnexpectedToken(given_tok, _)))
if given_tok == XirToken::AttrName(name, S4),
);
}
}

View File

@@ -356,7 +356,8 @@ where
// of these tokens.
XirToken::AttrName(..)
| XirToken::AttrValue(..)
| XirToken::AttrValueFragment(..) => {
| XirToken::AttrValueFragment(..)
| XirToken::AttrEnd(..) => {
unreachable!("attribute token in NodeExpected state: {tok:?}")
}
}

View File

@@ -371,6 +371,8 @@ impl<'s, B: BufRead, S: Escaper> XmlXirReader<'s, B, S> {
.find(|b| !Self::is_whitespace(**b))
.is_some();
let attr_len = ele.attributes_raw().len();
// The tail is anything following the last byte of the QName
// in a non-empty tag with no attributes.
// For example:
@@ -396,6 +398,17 @@ impl<'s, B: BufRead, S: Escaper> XmlXirReader<'s, B, S> {
));
}
// (empty) (open)
// <foo bar="baz" /> <foo bar="baz"></foo>
// | |
// zero-length '>'
let attr_end_span = ctx.span_or_zz(
pos + 1 + len + attr_len,
if empty_tag { 0 } else { 1 },
);
tokbuf.push_front(Token::AttrEnd(attr_end_span));
// No tail because of attributes.
0
} else {
@@ -405,7 +418,7 @@ impl<'s, B: BufRead, S: Escaper> XmlXirReader<'s, B, S> {
// The "attributes" buffer represents whitespace,
// so the tail is the number of bytes of
// whitespace plus the closing '>' tag delimiter.
false => ele.attributes_raw().len() + 1,
false => attr_len + 1,
}
};

View File

@@ -140,13 +140,15 @@ fn empty_node_without_prefix_or_attributes() {
fn does_not_resolve_xmlns() {
new_sut!(sut = r#"<no-ns xmlns="noresolve" />"#);
// [----] [---] [-------] []
// 0 5 7 11 14 22 25
// A B C D
// 0 5 7 11 14 22 /25
// A B C D E
//
let a = DC.span(0, 6);
let b = DC.span(7, 5);
let c = DC.span(14, 9);
let d = DC.span(25, 2);
let d = DC.span(25, 0);
let e = DC.span(25, 2);
assert_eq!(
Ok(vec![
@@ -154,7 +156,8 @@ fn does_not_resolve_xmlns() {
// Since we didn't parse @xmlns, it's still an attribute.
O(Token::AttrName("xmlns".unwrap_into(), b)),
O(Token::AttrValue("noresolve:UNESC".intern(), c)),
O(Token::Close(None, CloseSpan(d, 0))),
O(Token::AttrEnd(d)),
O(Token::Close(None, CloseSpan(e, 0))),
]),
sut.collect(),
);
@@ -165,13 +168,14 @@ fn does_not_resolve_xmlns() {
fn empty_node_with_prefix_without_attributes_unresolved() {
new_sut!(sut = r#"<x:empty-node xmlns:x="noresolve" />"#);
// [-----------] [-----] [-------] []
// 0 12 14 20 23 31 34
// A B C D
// 0 12 14 20 23 31 /34
// A B C D E
let a = DC.span(0, 13);
let b = DC.span(14, 7);
let c = DC.span(23, 9);
let d = DC.span(34, 2);
let d = DC.span(34, 0);
let e = DC.span(34, 2);
// Should be the QName, _unresolved_.
assert_eq!(
@@ -182,7 +186,8 @@ fn empty_node_with_prefix_without_attributes_unresolved() {
)),
O(Token::AttrName(("xmlns", "x").unwrap_into(), b)),
O(Token::AttrValue("noresolve:UNESC".intern(), c)),
O(Token::Close(None, CloseSpan(d, 0))),
O(Token::AttrEnd(d)),
O(Token::Close(None, CloseSpan(e, 0))),
]),
sut.collect(),
);
@@ -214,8 +219,8 @@ fn prefix_with_empty_local_name_invalid_qname() {
fn multiple_attrs_ordered() {
new_sut!(sut = r#"<ele foo="a" bar="b" b:baz="c" />"#);
// [--] [-] | [-] | [---] | []
// 0 3 5 7 10 13 18 21 25 28 31
// A B C D E F G H
// 0 3 5 7 10 13 18 21 25 28 /31
// A B C D E F GH I
let a = DC.span(0, 4);
let b = DC.span(5, 3);
@@ -224,7 +229,8 @@ fn multiple_attrs_ordered() {
let e = DC.span(18, 1);
let f = DC.span(21, 5);
let g = DC.span(28, 1);
let h = DC.span(31, 2);
let h = DC.span(31, 0);
let i = DC.span(31, 2);
assert_eq!(
Ok(vec![
@@ -235,7 +241,8 @@ fn multiple_attrs_ordered() {
O(Token::AttrValue("b:UNESC".intern(), e)),
O(Token::AttrName(("b", "baz").unwrap_into(), f)),
O(Token::AttrValue("c:UNESC".intern(), g)),
O(Token::Close(None, CloseSpan(h, 0))),
O(Token::AttrEnd(h)),
O(Token::Close(None, CloseSpan(i, 0))),
]),
sut.collect(),
);
@ -245,8 +252,8 @@ fn multiple_attrs_ordered() {
fn empty_attr_value() {
new_sut!(sut = r#"<ele empty="" />"#);
// [--] [---] | []
// 0 3 5 9 12 14
// A B C D
// 0 3 5 9 12'14
// A B CDE
// /
// zero-length span, where
// the value _would_ be
@@ -254,14 +261,43 @@ fn empty_attr_value() {
let a = DC.span(0, 4);
let b = DC.span(5, 5);
let c = DC.span(12, 0);
let d = DC.span(14, 2);
let d = DC.span(14, 0);
let e = DC.span(14, 2);
assert_eq!(
Ok(vec![
O(Token::Open("ele".unwrap_into(), OpenSpan(a, 3))),
O(Token::AttrName("empty".unwrap_into(), b)),
O(Token::AttrValue(":UNESC".intern(), c)),
O(Token::Close(None, CloseSpan(d, 0))),
O(Token::AttrEnd(d)),
O(Token::Close(None, CloseSpan(e, 0))),
]),
sut.collect(),
);
}
#[test]
fn open_node_attr() {
new_sut!(sut = r#"<ele foobar="baz"></ele>"#);
// [--] [----] [-] |[----]
// 0 3 5 10 13 15|18 22
// A B C D E
// /
// AttrEnd here contains '>'
let a = DC.span(0, 4);
let b = DC.span(5, 6);
let c = DC.span(13, 3);
let d = DC.span(17, 1);
let e = DC.span(18, 6);
assert_eq!(
Ok(vec![
O(Token::Open("ele".unwrap_into(), OpenSpan(a, 3))),
O(Token::AttrName("foobar".unwrap_into(), b)),
O(Token::AttrValue("baz:UNESC".intern(), c)),
O(Token::AttrEnd(d)),
O(Token::Close(Some("ele".unwrap_into()), CloseSpan(e, 3))),
]),
sut.collect(),
);
@@ -274,15 +310,16 @@ fn empty_attr_value() {
fn permits_duplicate_attrs() {
new_sut!(sut = r#"<dup attr="a" attr="b" />"#);
// [--] [--] | [--] | []
// 0 3 5 8 11 14 17 20 23
// A B C D E F
// 0 3 5 8 11 14 17 20/23
// A B C D EF G
let a = DC.span(0, 4);
let b = DC.span(5, 4);
let c = DC.span(11, 1);
let d = DC.span(14, 4);
let e = DC.span(20, 1);
let f = DC.span(23, 2);
let f = DC.span(23, 0);
let g = DC.span(23, 2);
assert_eq!(
Ok(vec![
@@ -291,7 +328,8 @@ fn permits_duplicate_attrs() {
O(Token::AttrValue("a:UNESC".intern(), c)),
O(Token::AttrName("attr".unwrap_into(), d)),
O(Token::AttrValue("b:UNESC".intern(), e)),
O(Token::Close(None, CloseSpan(f, 0))),
O(Token::AttrEnd(f)),
O(Token::Close(None, CloseSpan(g, 0))),
]),
sut.collect(),
);
@@ -418,15 +456,16 @@ fn sibling_nodes() {
fn child_node_with_attrs() {
new_sut!(sut = r#"<root><child foo="bar" /></root>"#);
// [----][----] [-] [-] [][-----]
// 0 5`6 11 13 18 20 23`25 31
// A B C D E F
// 0 5`6 11 13 18 20/23`25 31
// A B C D E F G
let a = DC.span(0, 6);
let b = DC.span(6, 6);
let c = DC.span(13, 3);
let d = DC.span(18, 3);
let e = DC.span(23, 2);
let f = DC.span(25, 7);
let e = DC.span(23, 0);
let f = DC.span(23, 2);
let g = DC.span(25, 7);
assert_eq!(
Ok(vec![
@@ -434,8 +473,9 @@ fn child_node_with_attrs() {
O(Token::Open("child".unwrap_into(), OpenSpan(b, 5))),
O(Token::AttrName("foo".unwrap_into(), c)),
O(Token::AttrValue("bar:UNESC".intern(), d)),
O(Token::Close(None, CloseSpan(e, 0))),
O(Token::Close(Some("root".unwrap_into()), CloseSpan(f, 4))),
O(Token::AttrEnd(e)),
O(Token::Close(None, CloseSpan(f, 0))),
O(Token::Close(Some("root".unwrap_into()), CloseSpan(g, 4))),
]),
sut.collect(),
);

View File

@@ -255,6 +255,11 @@ impl<S: Escaper> XmlWriter<S> for Token {
Ok(W::AttrFragmentAdjacent)
}
// AttrEnd is ignored by the writer,
// effectively making it optional for internal XIR generators
// that are used only for output.
(Self::AttrEnd(..), st) => Ok(st),
// TODO: We have no way of knowing if text should be formatted
// as CData,
// which may also be beneficial to avoid escaping if we
@@ -495,6 +500,17 @@ mod test {
Ok(())
}
#[test]
fn ignores_attr_end() -> TestResult {
let result = Token::AttrEnd(S)
.write_new(WriterState::NodeOpen, &MockEscaper::default())?;
assert_eq!(result.0, b"");
assert_eq!(result.1, WriterState::NodeOpen);
Ok(())
}
#[test]
fn writes_text() -> TestResult {
let text = "test unescaped".intern();