2021-09-16 10:18:02 -04:00
// Test XIR tree representation
//
// Copyright (C) 2014-2021 Ryan Specialty Group, LLC.
//
// This file is part of TAME.
//
// This program is free software: you can redistribute it and/or modify
// it under the terms of the GNU General Public License as published by
// the Free Software Foundation, either version 3 of the License, or
// (at your option) any later version.
//
// This program is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
// GNU General Public License for more details.
//
// You should have received a copy of the GNU General Public License
// along with this program. If not, see <http://www.gnu.org/licenses/>.

use super::*;
use crate::convert::ExpectInto;
use crate::sym::GlobalSymbolIntern;
use crate::xir::tree::parse::ParseError;

lazy_static! {
    static ref S: Span =
        Span::from_byte_interval((0, 0), "test case, 1".intern());
    static ref S2: Span =
        Span::from_byte_interval((0, 0), "test case, 2".intern());
    static ref S3: Span =
        Span::from_byte_interval((0, 0), "test case, 3".intern());
}

mod tree {
    use super::*;

    #[test]
    fn element_from_tree() {
        let ele = Element {
            name: "foo".unwrap_into(),
tamer: xir::tree: Integrate AttrParserState into Stack
Note that AttrParse{r=>}State needs renaming, and Stack will get a better
name down the line too. This commit message is accurate, but confusing.
This performs the long-awaited task of trying to observe, concretely, how to
combine two automata. This has the effect of stitching together the state
machines, such that the union of the two is equivalent to the original
monolith.
The next step will be to abstract this away.
There are some important things to note here. First, this introduces a new
"dead" state concept, where here a dead state is defined as an _accepting_
state that has no state transitions for the given input token. This is more
strict than a dead state as defined in, for example, the Dragon Book, where
backtracking may occur.
The reason I chose for a Dead state to be accepting is simple: it represents
a lookahead situation. It says, "I don't know what this token is, but I've
done my job, so it may be useful in a parent context". The "I've done my
job" part is only applicable in an accepting state.
If the parser is _not_ in an accepting state, then an unknown token is
simply an error; we should _not_ try to backtrack or anything of the sort,
because we want only a single token of lookahead.
The reason this was done is because it's otherwise difficult to compose the
two parsers without requiring that AttrEnd exist in every XIR stream; this
has always been an awkward delimiter that was introduced to make the parser
LL(0), but I tried to compromise by saying that it was optional. Of course,
I knew that decision caused awkward inconsistencies, I had just hoped that
those inconsistencies wouldn't manifest in practical issues.
Well, now it did, and the benefits of AttrEnd that we had in the previous
construction do not exist in this one. Consequently, it makes more sense to
simply go from LL(0) to LL(1), which makes AttrEnd unnecessary, and a future
commit will remove it entirely.
All of this information will be documented, but I want to get further in
the implementation first to make sure I don't change course again and
therefore waste my time on docs.
DEV-11268
2021-12-16 09:44:02 -05:00
            attrs: AttrList::new(),
            children: vec![],
            span: (*S, *S2),
        };

        let tree = Tree::Element(ele.clone());

        assert_eq!(Some(&ele), tree.as_element());
        assert_eq!(None, Into::<Option<SymbolId>>::into(tree));
    }

    #[test]
    fn text_from_tree() {
        let text = "foo".intern();
        let tree = Tree::Text(text, *S);

        assert!(!tree.is_element());
        assert_eq!(None, tree.as_element());
        assert_eq!(None, tree.clone().into_element());

        assert_eq!(Some(text), tree.into());
    }
}
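
The two tests in `mod tree` exercise the accessor and conversion methods on `Tree`. The following is a minimal, self-contained sketch of how such accessors could be written. It is not TAMER's actual definition: `SymbolId` is reduced to a hypothetical integer newtype, and `Element` omits attributes and spans.

```rust
/// Hypothetical stand-in for an interned string handle.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SymbolId(pub u32);

/// Reduced element: attributes and spans are omitted in this sketch.
#[derive(Debug, Clone, PartialEq)]
pub struct Element {
    pub name: SymbolId,
    pub children: Vec<Tree>,
}

/// A tree is either an element or a text node.
#[derive(Debug, Clone, PartialEq)]
pub enum Tree {
    Element(Element),
    Text(SymbolId),
}

impl Tree {
    /// Whether this tree is an element.
    pub fn is_element(&self) -> bool {
        matches!(self, Tree::Element(_))
    }

    /// Borrow the inner element, if any.
    pub fn as_element(&self) -> Option<&Element> {
        match self {
            Tree::Element(ele) => Some(ele),
            _ => None,
        }
    }

    /// Consume the tree and yield the inner element, if any.
    pub fn into_element(self) -> Option<Element> {
        match self {
            Tree::Element(ele) => Some(ele),
            _ => None,
        }
    }
}

/// Text nodes convert into `Some(symbol)`; everything else into `None`.
impl From<Tree> for Option<SymbolId> {
    fn from(tree: Tree) -> Self {
        match tree {
            Tree::Text(text) => Some(text),
            _ => None,
        }
    }
}
```

Implementing `From<Tree>` for `Option<SymbolId>` is what allows the `Into::<Option<SymbolId>>::into(tree)` and `tree.into()` calls seen in the tests.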

mod attrs {
    use super::*;

    #[test]
    fn linear_search_for_attr_name_in_list() {
        let a = "a".unwrap_into();
        let b = "b".unwrap_into();

tamer: xir::escape: Remove XirString in favor of Escaper
This rewrites a good portion of the previous commit.
Rather than explicitly storing whether a given string has been escaped, we
can instead assume that all SymbolIds leaving or entering XIR are unescaped,
because there is no reason for any other part of the system to deal with
such details of XML documents.
Given that, we need only unescape on read and escape on write. This is
customary, so why didn't I do that to begin with?
The previous commit outlines the reason, mainly being an optimization for
the echo writer that is upcoming. However, this solution will end up being
better---it's not implemented yet, but we can have a caching layer, such
that the Escaper records a mapping between escaped and unescaped SymbolIds
to avoid work the next time around. If we share the Escaper between _all_
readers and the writer, the result is that
1. Duplicate strings between source files and object files (many of which
are read by both the linker and compiler) avoid re-unescaping; and
2. Writers can use this cache to avoid re-escaping when we've already seen
the escaped variant of the string during read.
The alternative would be a global cache, like the internment system, but I
did not find that to be appropriate here, since this is far less
fundamental and is much easier to compose.
DEV-11081
2021-11-12 13:59:14 -05:00
        let attra = Attr::new(a, "a value".intern(), (*S, *S2));
        let attrb = Attr::new(b, "b value".intern(), (*S, *S2));

        let attrs = AttrList::from([attra.clone(), attrb.clone()]);

        assert_eq!(attrs.find(a), Some(&attra));
        assert_eq!(attrs.find(b), Some(&attrb));

        assert_eq!(attrs.find("unknown".unwrap_into()), None);
    }
}
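
The `xir::escape` commit message quoted in the test above describes a caching layer that it explicitly says is not implemented yet: an Escaper that records the mapping between escaped and unescaped strings so that readers and writers sharing it can skip repeated work. A rough sketch of that idea follows, under stated assumptions: `CachingEscaper`, its fields, and the `misses` counter are hypothetical names, plain `String`s stand in for `SymbolId`s, and only four XML entities are handled.

```rust
use std::collections::HashMap;

/// Hypothetical caching escaper keyed by plain strings rather than
/// interned symbols.
#[derive(Default)]
pub struct CachingEscaper {
    to_escaped: HashMap<String, String>,
    to_unescaped: HashMap<String, String>,
    /// Number of times real (un)escaping work was performed.
    pub misses: usize,
}

impl CachingEscaper {
    /// Escape on write, reusing a previously seen mapping if possible.
    pub fn escape(&mut self, raw: &str) -> String {
        if let Some(hit) = self.to_escaped.get(raw) {
            return hit.clone();
        }
        self.misses += 1;
        // `&` is replaced first so that the `&` introduced by later
        // replacements is not escaped a second time.
        let escaped = raw
            .replace('&', "&amp;")
            .replace('<', "&lt;")
            .replace('>', "&gt;")
            .replace('"', "&quot;");
        self.record(raw, &escaped);
        escaped
    }

    /// Unescape on read, reusing a previously seen mapping if possible.
    pub fn unescape(&mut self, escaped: &str) -> String {
        if let Some(hit) = self.to_unescaped.get(escaped) {
            return hit.clone();
        }
        self.misses += 1;
        // `&amp;` is handled last so that `&amp;lt;` becomes `&lt;`.
        let raw = escaped
            .replace("&lt;", "<")
            .replace("&gt;", ">")
            .replace("&quot;", "\"")
            .replace("&amp;", "&");
        self.record(&raw, escaped);
        raw
    }

    /// Remember the pair in both directions.
    fn record(&mut self, raw: &str, escaped: &str) {
        self.to_escaped.insert(raw.to_string(), escaped.to_string());
        self.to_unescaped.insert(escaped.to_string(), raw.to_string());
    }
}
```

Because both directions are recorded on every miss, a string unescaped by a reader can later be written back out without escaping it again, which is the second benefit the commit message lists.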

#[test]
fn empty_element_self_close_from_toks() {
    let name = ("ns", "elem").unwrap_into();

    let toks = [Token::Open(name, *S), Token::Close(None, *S2)].into_iter();

    let expected = Element {
        name,
        attrs: AttrList::new(),
        children: vec![],
        span: (*S, *S2),
    };

    let mut sut = parse(toks);

    assert_eq!(sut.next(), Some(Ok(Parsed::Incomplete)));
    assert_eq!(
        sut.next(),
        Some(Ok(Parsed::Object(Tree::Element(expected))))
    );
    assert_eq!(sut.next(), None);
}
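
The `xir::tree` commit message quoted in `element_from_tree` defines a "dead" state as an accepting state with no transition for the current token, so that the token can be handed back to a parent parser as a single token of lookahead. The sketch below illustrates that rule with a toy attribute parser. It is not TAMER's implementation: `Tok`, `AttrState`, `Transition`, `step`, and `parse_attrs_then_close` are hypothetical names, and spans are omitted.

```rust
/// Simplified token type standing in for XIR tokens (hypothetical).
#[derive(Debug, Clone, PartialEq)]
pub enum Tok {
    AttrName(&'static str),
    AttrValue(&'static str),
    Close,
}

/// State of a toy attribute parser.
#[derive(Debug, Clone, PartialEq)]
pub enum AttrState {
    /// Accepting: no attribute is partially parsed.
    Empty,
    /// Not accepting: a name was seen and a value must follow.
    Name(&'static str),
}

/// Result of feeding one token to the attribute parser.
#[derive(Debug, PartialEq)]
pub enum Transition {
    /// Token consumed; more input is needed.
    Incomplete(AttrState),
    /// Token consumed; a complete attribute was produced.
    Attr(AttrState, (&'static str, &'static str)),
    /// Dead state: accepting, but there is no transition for this token.
    /// The token is returned as one token of lookahead.
    Dead(AttrState, Tok),
    /// An unknown token in a non-accepting state is an error.
    Error(Tok),
}

pub fn step(state: AttrState, tok: Tok) -> Transition {
    match (state, tok) {
        (AttrState::Empty, Tok::AttrName(name)) => {
            Transition::Incomplete(AttrState::Name(name))
        }
        (AttrState::Name(name), Tok::AttrValue(value)) => {
            Transition::Attr(AttrState::Empty, (name, value))
        }
        (AttrState::Empty, tok) => Transition::Dead(AttrState::Empty, tok),
        (AttrState::Name(_), tok) => Transition::Error(tok),
    }
}

/// Toy parent parser: delegates to the attribute parser until it reports
/// a dead state, then interprets the returned token itself.
pub fn parse_attrs_then_close(
    toks: Vec<Tok>,
) -> Result<Vec<(&'static str, &'static str)>, String> {
    let mut state = AttrState::Empty;
    let mut attrs = Vec::new();

    for tok in toks {
        match step(state, tok) {
            Transition::Incomplete(next) => state = next,
            Transition::Attr(next, attr) => {
                attrs.push(attr);
                state = next;
            }
            // The lookahead token is one the parent understands.
            Transition::Dead(_, Tok::Close) => return Ok(attrs),
            Transition::Dead(_, tok) => {
                return Err(format!("unexpected token: {:?}", tok))
            }
            Transition::Error(tok) => {
                return Err(format!("unexpected token: {:?}", tok))
            }
        }
    }

    Err("unexpected end of input".to_string())
}
```

Note that no `AttrEnd`-style delimiter is needed here: the parent learns that attribute parsing has finished from the dead-state transition, at the cost of one token of lookahead.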

// Same as above test, but with balanced closing instead of self
// closing.
#[test]
fn empty_element_balanced_close_from_toks() {
    let name = ("ns", "openclose").unwrap_into();

    let toks =
        [Token::Open(name, *S), Token::Close(Some(name), *S2)].into_iter();

    let expected = Element {
        name,
        attrs: AttrList::new(),
        children: vec![],
        span: (*S, *S2),
    };

    let mut sut = parse(toks);

    assert_eq!(sut.next(), Some(Ok(Parsed::Incomplete)));
    assert_eq!(
        sut.next(),
        Some(Ok(Parsed::Object(Tree::Element(expected))))
    );
    assert_eq!(sut.next(), None);
}

// Unbalanced should result in error. This does not test what happens
// _after_ the error.
#[test]
fn empty_element_unbalanced_close_from_toks() {
    let open_name = "open".unwrap_into();
    let close_name = "unbalanced_name".unwrap_into();

    let toks = [
        Token::Open(open_name, *S),
        Token::Close(Some(close_name), *S2),
    ]
    .into_iter();

    let mut sut = parse(toks);

    assert_eq!(sut.next(), Some(Ok(Parsed::Incomplete)));
    assert_eq!(
        sut.next(),
        Some(Err(ParseError::StateError(StackError::UnbalancedTag {
            open: (open_name, *S),
            close: (close_name, *S2),
        })))
    );

    // TODO: We need to figure out how to best implement recovery before
    // continuing with this design.
}
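
The three tests above rely on the parser yielding `Parsed::Incomplete` for each token that does not finish a tree, yielding the finished object on the final close, and reporting an error when a closing name does not match the open element. One way this could work is sketched below with a stack of open elements. This is an illustration only, not TAMER's `Stack`: the types are reduced stand-ins without spans or attributes, and `feed` and `ExtraClosingTag` are hypothetical names.

```rust
/// Reduced token type; spans and attributes are omitted.
#[derive(Debug, Clone, PartialEq)]
pub enum Token {
    Open(&'static str),
    /// `None` is a self-closing tag; `Some(name)` is a balanced close.
    Close(Option<&'static str>),
}

#[derive(Debug, Clone, PartialEq)]
pub struct Element {
    pub name: &'static str,
    pub children: Vec<Element>,
}

#[derive(Debug, PartialEq)]
pub enum Parsed {
    /// Token accepted, but no complete tree is available yet.
    Incomplete,
    /// The root element has been closed.
    Object(Element),
}

#[derive(Debug, PartialEq)]
pub enum StackError {
    UnbalancedTag { open: &'static str, close: &'static str },
    /// Hypothetical: a close was seen with no element open.
    ExtraClosingTag,
}

/// Elements that have been opened but not yet closed, innermost last.
#[derive(Default)]
pub struct Stack {
    open: Vec<Element>,
}

impl Stack {
    /// Feed a single token, reporting progress or an error.
    pub fn feed(&mut self, tok: Token) -> Result<Parsed, StackError> {
        match tok {
            Token::Open(name) => {
                self.open.push(Element { name, children: vec![] });
                Ok(Parsed::Incomplete)
            }
            Token::Close(close) => {
                let ele =
                    self.open.pop().ok_or(StackError::ExtraClosingTag)?;

                // A named close must match the innermost open element.
                if let Some(close_name) = close {
                    if close_name != ele.name {
                        return Err(StackError::UnbalancedTag {
                            open: ele.name,
                            close: close_name,
                        });
                    }
                }

                match self.open.last_mut() {
                    // Closed a child: attach it to its parent.
                    Some(parent) => {
                        parent.children.push(ele);
                        Ok(Parsed::Incomplete)
                    }
                    // Closed the root: the tree is complete.
                    None => Ok(Parsed::Object(ele)),
                }
            }
        }
    }
}
```

As in the test above, this sketch says nothing about what happens after an error; recovery is left open.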

#[test]
fn empty_element_with_attrs_from_toks() {
    let name = ("ns", "elem").unwrap_into();
    let attr1 = "a".unwrap_into();
    let attr2 = "b".unwrap_into();
    let val1 = "val1".intern();
    let val2 = "val2".intern();

    let toks = [
        Token::Open(name, *S),
        Token::AttrName(attr1, *S),
        Token::AttrValue(val1, *S2),
        Token::AttrName(attr2, *S),
        Token::AttrValue(val2, *S3),
        Token::Close(None, *S2),
    ]
    .into_iter();

    let expected = Element {
        name,
        attrs: AttrList::from(vec![
            Attr::new(attr1, val1, (*S, *S2)),
            Attr::new(attr2, val2, (*S, *S3)),
        ]),
        children: vec![],
        span: (*S, *S2),
    };

    let mut sut = parse(toks);

    assert_eq!(sut.next(), Some(Ok(Parsed::Incomplete))); // Open
    assert_eq!(sut.next(), Some(Ok(Parsed::Incomplete))); // AttrName
    assert_eq!(sut.next(), Some(Ok(Parsed::Incomplete))); // AttrValue
    assert_eq!(sut.next(), Some(Ok(Parsed::Incomplete))); // AttrName
    assert_eq!(sut.next(), Some(Ok(Parsed::Incomplete))); // AttrValue
    assert_eq!(
        sut.next(),
        Some(Ok(Parsed::Object(Tree::Element(expected))))
    );
    assert_eq!(sut.next(), None);
}
|
|
|
|
|
|
|
|
#[test]
|
|
|
|
fn child_element_after_attrs() {
|
|
|
|
let name = ("ns", "elem").unwrap_into();
|
|
|
|
let child = "child".unwrap_into();
|
|
|
|
let attr = "a".unwrap_into();
|
tamer: xir::escape: Remove XirString in favor of Escaper
This rewrites a good portion of the previous commit.
Rather than explicitly storing whether a given string has been escaped, we
can instead assume that all SymbolIds leaving or entering XIR are unescaped,
because there is no reason for any other part of the system to deal with
such details of XML documents.
Given that, we need only unescape on read and escape on write. This is
customary, so why didn't I do that to begin with?
The previous commit outlines the reason, mainly being an optimization for
the echo writer that is upcoming. However, this solution will end up being
better---it's not implemented yet, but we can have a caching layer, such
that the Escaper records a mapping between escaped and unescaped SymbolIds
to avoid work the next time around. If we share the Escaper between _all_
readers and the writer, the result is that
1. Duplicate strings between source files and object files (many of which
are read by both the linker and compiler) avoid re-unescaping; and
2. Writers can use this cache to avoid re-escaping when we've already seen
the escaped variant of the string during read.
The alternative would be a global cache, like the internment system, but I
did not find that to be appropriate here, since this is far less
fundamental and is much easier to compose.
DEV-11081
2021-11-12 13:59:14 -05:00
        let val = "val".intern();

        let toks = [
            Token::Open(name, *S),
            Token::AttrName(attr, *S),
            Token::AttrValue(val, *S2),
            Token::Open(child, *S),
            Token::Close(None, *S2),
            Token::Close(Some(name), *S3),
        ]
        .into_iter();

        let expected = Element {
            name,
            attrs: AttrList::from(vec![Attr::new(attr, val, (*S, *S2))]),
            children: vec![Tree::Element(Element {
                name: child,
                attrs: AttrList::new(),
                children: vec![],
                span: (*S, *S2),
            })],
            span: (*S, *S3),
        };

        let mut sut = parse(toks);

        assert_eq!(sut.next(), Some(Ok(Parsed::Incomplete))); // Open
        assert_eq!(sut.next(), Some(Ok(Parsed::Incomplete))); // AttrName
        assert_eq!(sut.next(), Some(Ok(Parsed::Incomplete))); // AttrValue
        assert_eq!(sut.next(), Some(Ok(Parsed::Incomplete))); // Open
        assert_eq!(sut.next(), Some(Ok(Parsed::Incomplete))); // Close
        assert_eq!(
            sut.next(),
            Some(Ok(Parsed::Object(Tree::Element(expected))))
        );
        assert_eq!(sut.next(), None);
    }

    #[test]
    fn element_with_empty_sibling_children() {
        let parent = "parent".unwrap_into();
        let childa = "childa".unwrap_into();
        let childb = "childb".unwrap_into();

        let toks = [
            Token::Open(parent, *S),
            Token::Open(childa, *S),
            Token::Close(None, *S2),
            Token::Open(childb, *S),
            Token::Close(None, *S2),
            Token::Close(Some(parent), *S2),
        ]
        .into_iter();

        let expected = Element {
            name: parent,
            attrs: AttrList::new(),
            children: vec![
                Tree::Element(Element {
                    name: childa,
                    attrs: AttrList::new(),
                    children: vec![],
                    span: (*S, *S2),
                }),
                Tree::Element(Element {
                    name: childb,
                    attrs: AttrList::new(),
                    children: vec![],
                    span: (*S, *S2),
                }),
            ],
            span: (*S, *S2),
        };

        let mut sut = parser_from(toks);

        assert_eq!(sut.next(), Some(Ok(Tree::Element(expected))));
        assert_eq!(sut.next(), None);
    }

    // Ensures that attributes do not cause the parent context to be lost.
    #[test]
    fn element_with_child_with_attributes() {
        let parent = "parent".unwrap_into();
        let child = "child".unwrap_into();
        let attr = "attr".unwrap_into();
        let value = "attr value".intern();

        let toks = [
            Token::Open(parent, *S),
            Token::Open(child, *S),
            Token::AttrName(attr, *S),
            Token::AttrValue(value, *S2),
            Token::Close(None, *S3),
            Token::Close(Some(parent), *S3),
        ]
        .into_iter();

        let expected = Element {
            name: parent,
            attrs: AttrList::new(),
            children: vec![Tree::Element(Element {
                name: child,
                attrs: AttrList::from([Attr::new(attr, value, (*S, *S2))]),
                children: vec![],
                span: (*S, *S3),
            })],
            span: (*S, *S3),
        };

        let mut sut = parser_from(toks);

        assert_eq!(sut.next(), Some(Ok(Tree::Element(expected))));
        assert_eq!(sut.next(), None);
    }

    #[test]
    fn element_with_text() {
        let parent = "parent".unwrap_into();
        let text = "inner text".into();

        let toks = [
            Token::Open(parent, *S),
            Token::Text(text, *S2),
            Token::Close(Some(parent), *S3),
        ]
        .into_iter();

        let expected = Element {
            name: parent,
            attrs: AttrList::new(),
            children: vec![Tree::Text(text, *S2)],
            span: (*S, *S3),
        };

        let mut sut = parser_from(toks);

        assert_eq!(sut.next(), Some(Ok(Tree::Element(expected))));
        assert_eq!(sut.next(), None);
    }

    #[test]
    fn parser_from_filters_incomplete() {
        let name = ("ns", "elem").unwrap_into();
        let attr = "a".unwrap_into();
        let val = "val1".intern();

        let toks = [
            Token::Open(name, *S),
            Token::AttrName(attr, *S),
            Token::AttrValue(val, *S2),
            Token::Close(None, *S2),
        ]
        .into_iter();

        let expected = Element {
            name,
            attrs: AttrList::from([Attr::new(attr, val, (*S, *S2))]),
            children: vec![],
            span: (*S, *S2),
        };

        let mut sut = parser_from(toks);

        // Unlike the previous tests, we should filter out all the
        // `Parsed::Incomplete` and yield only when we have a fully parsed
        // object.
        assert_eq!(sut.next(), Some(Ok(Tree::Element(expected))));
        assert_eq!(sut.next(), None);
    }

    #[test]
    fn attr_parser_with_non_attr_token() {
        let name = "unexpected".unwrap_into();
        let mut toks = [Token::Open(name, *S)].into_iter();

        let mut sut = attr_parser_from(&mut toks);

        assert_eq!(
            sut.next(),
            Some(Err(ParseError::UnexpectedToken(Token::Open(name, *S))))
        );
    }
|
|
|
|
|
|
|
|
#[test]
|
|
|
|
fn parser_attr_multiple() {
|
|
|
|
let attr1 = "one".unwrap_into();
|
|
|
|
let attr2 = "two".unwrap_into();
|
tamer: xir::escape: Remove XirString in favor of Escaper
This rewrites a good portion of the previous commit.
Rather than explicitly storing whether a given string has been escaped, we
can instead assume that all SymbolIds leaving or entering XIR are unescaped,
because there is no reason for any other part of the system to deal with
such details of XML documents.
Given that, we need only unescape on read and escape on write. This is
customary, so why didn't I do that to begin with?
The previous commit outlines the reason, mainly being an optimization for
the echo writer that is upcoming. However, this solution will end up being
better---it's not implemented yet, but we can have a caching layer, such
that the Escaper records a mapping between escaped and unescaped SymbolIds
to avoid work the next time around. If we share the Escaper between _all_
readers and the writer, the result is that
1. Duplicate strings between source files and object files (many of which
are read by both the linker and compiler) avoid re-unescaping; and
2. Writers can use this cache to avoid re-escaping when we've already seen
the escaped variant of the string during read.
The alternative would be a global cache, like the internment system, but I
did not find that to be appropriate here, since this is far less
fundamental and is much easier to compose.
DEV-11081
2021-11-12 13:59:14 -05:00
|
|
|
let val1 = "val1".intern();
|
|
|
|
let val2 = "val2".intern();
|
2021-11-04 10:52:16 -04:00
|
|
|
|
2021-11-05 10:54:05 -04:00
|
|
|
let mut toks = [
|
2021-11-04 10:52:16 -04:00
|
|
|
Token::AttrName(attr1, *S),
|
|
|
|
Token::AttrValue(val1, *S2),
|
|
|
|
Token::AttrName(attr2, *S2),
|
|
|
|
Token::AttrValue(val2, *S3),
|
2021-11-17 00:13:07 -05:00
|
|
|
// Token that we should _not_ hit.
|
|
|
|
Token::Text("nohit".into(), *S),
|
2021-11-04 10:52:16 -04:00
|
|
|
]
|
|
|
|
.into_iter();
|
|
|
|
|
2021-11-05 10:54:05 -04:00
|
|
|
let mut sut = attr_parser_from(&mut toks);
|
2021-11-04 10:52:16 -04:00
|
|
|
|
|
|
|
assert_eq!(sut.next(), Some(Ok(Attr::new(attr1, val1, (*S, *S2)))));
|
|
|
|
assert_eq!(sut.next(), Some(Ok(Attr::new(attr2, val2, (*S2, *S3)))));
|
2021-11-17 00:13:07 -05:00
|
|
|
|
2021-12-17 10:14:31 -05:00
|
|
|
// Parsing must stop after the last attribute,
|
2021-11-17 00:13:07 -05:00
|
|
|
// after which some other parser can continue on the same token
|
2021-12-17 10:14:31 -05:00
|
|
|
// stream
|
|
|
|
// (using this token as a lookahead).
|
|
|
|
assert_eq!(
|
|
|
|
sut.next(),
|
|
|
|
Some(Err(ParseError::UnexpectedToken(Token::Text(
|
|
|
|
"nohit".into(),
|
|
|
|
*S
|
|
|
|
))))
|
|
|
|
);
|
2021-11-04 10:52:16 -04:00
|
|
|
}
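The comment in `parser_attr_multiple` says parsing must stop after the last attribute so that another parser can pick up the next token as lookahead. The commit message on integrating `AttrParserState` into `Stack` calls this a "dead" state: an accepting state with no transition for the given token. The following is a minimal sketch of that idea only. `Tok`, `State`, `Step`, and `step` are hypothetical names invented for illustration, not TAME's actual types. The tests above report the leftover token as `ParseError::UnexpectedToken`, whereas this sketch gives the accepting case its own variant to make the distinction visible.

```rust
/// Simplified stand-in for a XIR token (hypothetical; not TAME's `Token`).
#[derive(Debug, Clone, PartialEq)]
pub enum Tok {
    AttrName(&'static str),
    AttrValue(&'static str),
    Text(&'static str),
}

/// State of the illustrative attribute parser.
#[derive(Debug, Clone, PartialEq)]
pub enum State {
    /// No attribute in progress; the only accepting state.
    Empty,
    /// A name has been read and a value must follow.
    Name(&'static str),
}

/// Outcome of feeding one token to the parser.
#[derive(Debug, Clone, PartialEq)]
pub enum Step {
    /// More tokens are needed before an object can be yielded.
    Incomplete,
    /// A complete `(name, value)` attribute.
    Attr(&'static str, &'static str),
    /// Accepting state with no transition for this token: the token is
    /// handed back so that a parent parser can use it as lookahead.
    Dead(Tok),
    /// Non-accepting state with no transition: a plain error, with no
    /// attempt at backtracking.
    Unexpected(Tok),
}

/// Single transition of the state machine, using one token of lookahead.
pub fn step(state: State, tok: Tok) -> (State, Step) {
    match (state, tok) {
        (State::Empty, Tok::AttrName(name)) => {
            (State::Name(name), Step::Incomplete)
        }
        (State::Name(name), Tok::AttrValue(value)) => {
            (State::Empty, Step::Attr(name, value))
        }
        // Accepting: "I've done my job", so return the token unconsumed.
        (State::Empty, tok) => (State::Empty, Step::Dead(tok)),
        // Not accepting: an unknown token is simply an error.
        (state @ State::Name(_), tok) => (state, Step::Unexpected(tok)),
    }
}
```

Feeding `AttrName`, `AttrValue`, then `Text` from `State::Empty` yields `Incomplete`, `Attr`, and finally `Dead(Text(..))`; a `Text` token arriving while in `State::Name` yields `Unexpected` instead.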
|
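The `Escaper` commit message earlier in this test proposes, but does not implement, a caching layer that records the mapping between escaped and unescaped symbols so that shared readers and writers avoid repeated work. A rough sketch of that proposal follows, under loud assumptions: `CachingEscaper` and its fields are hypothetical, plain `String`s stand in for `SymbolId`s, and only four named XML entities are handled.

```rust
use std::collections::HashMap;

/// Hypothetical caching escaper; not TAME's `Escaper` trait.
#[derive(Default)]
pub struct CachingEscaper {
    escaped_to_unescaped: HashMap<String, String>,
    unescaped_to_escaped: HashMap<String, String>,
    /// Number of times real escaping or unescaping work was performed.
    pub misses: usize,
}

impl CachingEscaper {
    /// Record both directions so that a writer can reuse what a reader saw.
    fn record(&mut self, escaped: &str, unescaped: &str) {
        self.escaped_to_unescaped
            .insert(escaped.to_string(), unescaped.to_string());
        self.unescaped_to_escaped
            .insert(unescaped.to_string(), escaped.to_string());
    }

    /// Unescape on read, consulting the cache first.
    pub fn unescape(&mut self, escaped: &str) -> String {
        if let Some(hit) = self.escaped_to_unescaped.get(escaped) {
            return hit.clone();
        }
        self.misses += 1;
        // `&amp;` is replaced last so that `&amp;lt;` becomes `&lt;`, not `<`.
        let unescaped = escaped
            .replace("&lt;", "<")
            .replace("&gt;", ">")
            .replace("&quot;", "\"")
            .replace("&amp;", "&");
        self.record(escaped, &unescaped);
        unescaped
    }

    /// Escape on write, consulting the cache first.
    pub fn escape(&mut self, unescaped: &str) -> String {
        if let Some(hit) = self.unescaped_to_escaped.get(unescaped) {
            return hit.clone();
        }
        self.misses += 1;
        // `&` is replaced first so that later entities are not re-escaped.
        let escaped = unescaped
            .replace('&', "&amp;")
            .replace('<', "&lt;")
            .replace('>', "&gt;")
            .replace('"', "&quot;");
        self.record(&escaped, unescaped);
        escaped
    }
}
```

Sharing one such value between all readers and the writer is what would give the two benefits listed in the commit message: a string seen twice is unescaped once, and a string that was read in escaped form can be written back without escaping it again.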