2021-10-21 16:17:17 -04:00
// XIR reader tests
//
// Copyright (C) 2014-2021 Ryan Specialty Group, LLC.
//
// This file is part of TAME.
//
// This program is free software: you can redistribute it and/or modify
// it under the terms of the GNU General Public License as published by
// the Free Software Foundation, either version 3 of the License, or
// (at your option) any later version.
//
// This program is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
// GNU General Public License for more details.
//
// You should have received a copy of the GNU General Public License
// along with this program. If not, see <http://www.gnu.org/licenses/>.

tamer: xir::escape: Remove XirString in favor of Escaper

This rewrites a good portion of the previous commit.

Rather than explicitly storing whether a given string has been escaped, we
can instead assume that all SymbolIds leaving or entering XIR are unescaped,
because there is no reason for any other part of the system to deal with
such details of XML documents.

Given that, we need only unescape on read and escape on write. This is
customary, so why didn't I do that to begin with?

The previous commit outlines the reason, mainly being an optimization for
the echo writer that is upcoming. However, this solution will end up being
better---it's not implemented yet, but we can have a caching layer, such
that the Escaper records a mapping between escaped and unescaped SymbolIds
to avoid work the next time around. If we share the Escaper between _all_
readers and the writer, the result is that

  1. Duplicate strings between source files and object files (many of which
     are read by both the linker and compiler) avoid re-unescaping; and
  2. Writers can use this cache to avoid re-escaping when we've already seen
     the escaped variant of the string during read.

The alternative would be a global cache, like the internment system, but I
did not find that to be appropriate here, since this is far less
fundamental and is much easier to compose.

DEV-11081
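The caching layer described in this message is explicitly not implemented yet, so the following is only a rough sketch of how a shared escaper might remember escaped/unescaped pairs. Everything in it is an assumption made for illustration: `SymbolId` is a plain `u32`, `Interner` is a toy stand-in for the real internment system, `CachingEscaper` and its methods are hypothetical names, and only four predefined XML entities are handled.

```rust
use std::cell::{Cell, RefCell};
use std::collections::HashMap;

/// Hypothetical stand-in for an interned symbol identifier.
type SymbolId = u32;

/// Toy interner (assumption): maps strings to ids and back.
#[derive(Default)]
struct Interner {
    ids: HashMap<String, SymbolId>,
    strs: Vec<String>,
}

impl Interner {
    fn intern(&mut self, s: &str) -> SymbolId {
        if let Some(&id) = self.ids.get(s) {
            return id;
        }
        let id = self.strs.len() as SymbolId;
        self.strs.push(s.to_owned());
        self.ids.insert(s.to_owned(), id);
        id
    }
}

/// Hypothetical escaper that records escaped <-> unescaped pairs.
#[derive(Default)]
struct CachingEscaper {
    interner: RefCell<Interner>,
    unescaped_of: RefCell<HashMap<SymbolId, SymbolId>>,
    escaped_of: RefCell<HashMap<SymbolId, SymbolId>>,
    /// Counts how often real (un)escaping work was performed.
    work: Cell<usize>,
}

impl CachingEscaper {
    fn intern(&self, s: &str) -> SymbolId {
        self.interner.borrow_mut().intern(s)
    }

    fn resolve(&self, id: SymbolId) -> String {
        self.interner.borrow().strs[id as usize].clone()
    }

    /// Remember the pair in both directions.
    fn record(&self, escaped: SymbolId, unescaped: SymbolId) {
        self.unescaped_of.borrow_mut().insert(escaped, unescaped);
        self.escaped_of.borrow_mut().insert(unescaped, escaped);
    }

    /// Reader side: escaped symbol in, unescaped symbol out.
    fn unescape(&self, escaped: SymbolId) -> SymbolId {
        let cached = self.unescaped_of.borrow().get(&escaped).copied();
        if let Some(hit) = cached {
            return hit;
        }
        self.work.set(self.work.get() + 1);
        // `&amp;` is replaced last so that `&amp;lt;` yields `&lt;`.
        let text = self
            .resolve(escaped)
            .replace("&lt;", "<")
            .replace("&gt;", ">")
            .replace("&quot;", "\"")
            .replace("&amp;", "&");
        let unescaped = self.intern(&text);
        self.record(escaped, unescaped);
        unescaped
    }

    /// Writer side: unescaped symbol in, escaped symbol out.
    fn escape(&self, unescaped: SymbolId) -> SymbolId {
        let cached = self.escaped_of.borrow().get(&unescaped).copied();
        if let Some(hit) = cached {
            return hit;
        }
        self.work.set(self.work.get() + 1);
        // `&` is replaced first so that new entities are not re-escaped.
        let text = self
            .resolve(unescaped)
            .replace('&', "&amp;")
            .replace('<', "&lt;")
            .replace('>', "&gt;")
            .replace('"', "&quot;");
        let escaped = self.intern(&text);
        self.record(escaped, unescaped);
        escaped
    }
}
```

If one such value is shared by a reader and a writer, a string that was unescaped during a read can be escaped again on write by a cache lookup alone, which is the second benefit listed above.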
2021-11-12 13:59:14 -05:00
use std::borrow::Cow;

2021-10-21 16:17:17 -04:00
use super::*;
tamer: xir::XirString: WIP implementation (likely going away)

I'm not fond of this implementation, which is why it's not fully
completed. I wanted to commit this for future reference, and take the
opportunity to explain why I don't like it.

First: this task started as an idea to implement a third variant to
AttrValue and friends that indicates that a value is fixed, in the sense of
a fixed-point function: escaped or unescaped, its value is the same. This
would allow us to skip wasteful escape/unescape operations.

In doing so, it became obvious that there's no need to leak this information
through the API, and indeed, no part of the system should care. When we
read XML, it should be unescaped, and when we write, it should be
escaped. The reason that this didn't quite happen to begin with was an
optimization: I'll be creating an echo writer in place of the current
filesystem-based copy in tamec shortly, and this would allow streaming XIR
directly from the reader to the writer without any unescaping or
re-escaping.

When we unescape, we know the value that it came from, so we could simply
store both symbols---they're 32-bit, so it results in a nicely compressed
64-bit value, so it's essentially cost-free, as long as we accept the
expense of internment. This is `XirString`. Then, when we want to escape
or unescape, we first check to see whether a symbol already exists and, if
so, use it.

While this works well for echoing streams, it won't work all that well in
practice: the unescaped SymbolId will be taken and the XirString discarded,
since nothing after XIR should be coupled with it. Then, when we later
construct a XIR stream for writing, XirString will no longer be available
and our previously known escape is lost, so the writer will have to
re-escape.

Further, if we look at XirString's generic for the XirStringEscaper---it
uses phantom, which hints that maybe it's not in the best place. Indeed,
I've already acknowledged that only a reader unescapes and only a writer
escapes, and that the rest of the system works with normal (unescaped)
values, so only readers and writers should be part of this process. I also
already acknowledged that XirString would be lost and only the unescaped
SymbolId would be used.

So what's the point of XirString, then, if it won't be a useful optimization
beyond the temporary echo writer?

Instead, we can take the XirStringWriter and implement two caches on that:
mapping SymbolId from escaped->unescaped and vice-versa. These can be
simple vectors: since SymbolId is a 32-bit value, we will not have much
wasted space for symbols that never get read or written. We could even
optimize for preinterned symbols using markers, though I'll probably not do
so, and I'll explain why later.

If we do _that_, we get even _better_ optimizations through caching that
_will_ apply in the general case (so, not just for echo), and we're able to
ditch XirString entirely and simply use a SymbolId. This makes for a much
more friendly API that isn't leaking implementation details, though it
_does_ put an onus on the caller to pass the encoder to both the reader and
the writer, _if_ it wants to take advantage of a cache. But that burden is
not significant (and is, again, optional if we don't want it).

So, that'll be the next step.
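To make the `XirString` idea in this message concrete, here is a small sketch of pairing two 32-bit symbols in one value and of the "fixed" case where both forms coincide. It is not the actual `XirString` type: `SymbolId` is modeled as a bare `u32` newtype, and `EscapedPair`, `is_fixed`, and `into_unescaped` are hypothetical names chosen for this example.

```rust
use std::mem::size_of;

/// Hypothetical 32-bit symbol id standing in for an interned string.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct SymbolId(u32);

/// Sketch of the pairing idea: both forms of a string travel together.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct EscapedPair {
    unescaped: SymbolId,
    escaped: SymbolId,
}

impl EscapedPair {
    /// A "fixed" value reads the same whether escaped or unescaped,
    /// so neither operation would have any work to do.
    fn is_fixed(self) -> bool {
        self.unescaped == self.escaped
    }

    /// What the rest of the system would keep: only the unescaped symbol.
    /// The escaped half is dropped here, which is the weakness the
    /// message describes.
    fn into_unescaped(self) -> SymbolId {
        self.unescaped
    }
}

/// Size of the pair in bytes (two 32-bit ids).
fn pair_size() -> usize {
    size_of::<EscapedPair>()
}
```

Once `into_unescaped` has been called, nothing downstream can recover the escaped symbol without redoing the work, which is why the message moves the bookkeeping into a cache owned by the escaper instead.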
2021-11-10 09:42:18 -05:00
use crate::sym::GlobalSymbolIntern;
2021-10-21 16:17:17 -04:00
use crate::{
    convert::ExpectInto,
    span::DUMMY_SPAN,
2021-11-15 23:47:14 -05:00
    xir::{Error, Token},
2021-10-21 16:17:17 -04:00
};

/// These tests use [`quick_xml`] directly,
/// rather than mocking it,
/// because parsing XML isn't a simple matter and we want to be sure that
/// our assumptions of how `quick_xml` performs its parsing are accurate.
/// Consequently,
/// these act more like integration tests than unit tests.
///
/// This means that `quick_xml` breakages will break these tests,
/// and that is (unlike with unit tests) exactly what we want to happen
/// here;
/// we _complement_ the behavior of quick-xml,
/// both by reimplementing certain functionality
/// (like namespace management)
/// and by relying on certain parsing behavior to eliminate
/// redundant checks.

2021-11-12 13:59:14 -05:00
type Sut<'a, B, S> = XmlXirReader<'a, B, S>;

#[derive(Debug, Default)]
struct MockEscaper {}

// Simply adds ":UNESC" as a suffix to the provided byte slice.
impl Escaper for MockEscaper {
    fn escape_bytes(_: &[u8]) -> Cow<[u8]> {
        unreachable!("Reader should not be escaping!")
    }

    fn unescape_bytes(value: &[u8]) -> result::Result<Cow<[u8]>, Error> {
        let mut unesc = value.to_owned();
        unesc.extend_from_slice(b":UNESC");

        Ok(Cow::Owned(unesc))
    }
}
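The mock above is why expected attribute values in the tests below carry a `:UNESC` suffix: the suffix proves the reader routed the value through the escaper. A standalone sketch of the same transformation, using a hypothetical free function `mock_unescape` in place of the trait method so that it compiles without the surrounding crate:

```rust
use std::borrow::Cow;

/// Hypothetical free-function version of the mock's unescape step:
/// copies the input and appends a marker so tests can observe the call.
fn mock_unescape(value: &[u8]) -> Cow<'_, [u8]> {
    let mut unesc = value.to_owned();
    unesc.extend_from_slice(b":UNESC");

    Cow::Owned(unesc)
}
```

Because the marker is not valid XML unescaping of any kind, a test that sees it can be reasonably sure the value came from the mock and not from `quick_xml`'s own unescaping.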
2021-10-21 16:17:17 -04:00

/// A byte that will be invalid provided that there is either no following
/// UTF-8 byte,
/// or if it's followed by another byte that is invalid in that
/// position.
const INVALID_UTF8_BYTE: u8 = 0b11000000u8;

// SAFETY: We want an invalid UTF-8 str for tests.
// (We can use raw bytes and avoid `unsafe`,
// but this is more convenient.)
const INVALID_STR: &str =
    unsafe { std::str::from_utf8_unchecked(&[INVALID_UTF8_BYTE]) };

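As a quick check of why this byte is a useful test fixture, the sketch below feeds it to the standard library's checked conversion. The constant repeats the value from the test file; `first_invalid_offset` is a hypothetical helper introduced only for this example.

```rust
/// Same value as the test constant above.
const INVALID_UTF8_BYTE: u8 = 0b1100_0000;

/// Hypothetical helper: returns how many leading bytes were valid UTF-8
/// before validation failed, or `None` if the whole input is valid.
fn first_invalid_offset(bytes: &[u8]) -> Option<usize> {
    std::str::from_utf8(bytes).err().map(|e| e.valid_up_to())
}
```

Unlike `from_utf8_unchecked`, the checked `std::str::from_utf8` reports the failure instead of producing a `&str`, so a reader that validates its input should surface an error when it meets this byte.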
2021-11-12 13:59:14 -05:00
macro_rules! new_sut {
    ($sut:ident = $data:expr) => {
        new_sut!(b $sut = $data.as_bytes())
    };

    (b $sut:ident = $data:expr) => {
        let escaper = MockEscaper::default();
        let $sut = Sut::new($data, &escaper);
    };
}

2021-10-21 16:17:17 -04:00
#[test]
fn empty_node_without_prefix_or_attributes() {
2021-11-12 13:59:14 -05:00
    new_sut!(sut = "<empty-node />");
2021-10-21 16:17:17 -04:00

    let result = sut.collect::<Result<Vec<_>>>();

    assert_eq!(
        result.expect("parsing failed"),
        vec![
            Token::Open("empty-node".unwrap_into(), DUMMY_SPAN),
2021-12-06 14:48:55 -05:00
            Token::AttrEnd(DUMMY_SPAN),
2021-10-21 16:17:17 -04:00
            Token::Close(None, DUMMY_SPAN),
        ],
    );
}

// Resolving namespaces is not the concern of XIR.
#[test]
fn does_not_resolve_xmlns() {
2021-11-12 13:59:14 -05:00
    new_sut!(sut = r#"<no-ns xmlns="noresolve" />"#);
2021-10-21 16:17:17 -04:00

    let result = sut.collect::<Result<Vec<_>>>();

    assert_eq!(
        result.expect("parsing failed"),
        vec![
            Token::Open("no-ns".unwrap_into(), DUMMY_SPAN),
            // Since we didn't parse @xmlns, it's still an attribute.
            Token::AttrName("xmlns".unwrap_into(), DUMMY_SPAN),
2021-11-12 13:59:14 -05:00
            Token::AttrValue("noresolve:UNESC".intern(), DUMMY_SPAN),
2021-12-06 14:48:55 -05:00
            Token::AttrEnd(DUMMY_SPAN),
2021-10-21 16:17:17 -04:00
            Token::Close(None, DUMMY_SPAN),
        ],
    );
}

// Resolving namespaces is not the concern of XIR.
#[test]
fn empty_node_with_prefix_without_attributes_unresolved() {
2021-11-12 13:59:14 -05:00
    new_sut!(sut = r#"<x:empty-node xmlns:x="noresolve" />"#);
2021-10-21 16:17:17 -04:00

    let result = sut.collect::<Result<Vec<_>>>();

    // Should be the QName, _unresolved_.
    assert_eq!(
        result.expect("parsing failed"),
        vec![
            Token::Open(("x", "empty-node").unwrap_into(), DUMMY_SPAN),
            Token::AttrName(("xmlns", "x").unwrap_into(), DUMMY_SPAN),
2021-11-12 13:59:14 -05:00
            Token::AttrValue("noresolve:UNESC".intern(), DUMMY_SPAN),
2021-12-06 14:48:55 -05:00
            Token::AttrEnd(DUMMY_SPAN),
2021-10-21 16:17:17 -04:00
            Token::Close(None, DUMMY_SPAN),
        ],
    );
}

// TODO: Enough information for error recovery and reporting.
#[test]
fn prefix_with_empty_local_name_invalid_qname() {
    // No local name (trailing colon).
2021-11-12 13:59:14 -05:00
    new_sut!(sut = r#"<x: xmlns:x="testns" />"#);
2021-10-21 16:17:17 -04:00

    let result = sut.collect::<Result<Vec<_>>>();

    match result {
        Ok(_) => panic!("expected failure"),
        Err(given) => {
            assert_eq!(Error::InvalidQName("x:".into()), given);
        }
    }
}

// The order of attributes must be retained.
#[test]
fn multiple_attrs_ordered() {
2021-11-12 13:59:14 -05:00
    new_sut!(sut = r#"<ele foo="a" bar="b" b:baz="c" />"#);
2021-10-21 16:17:17 -04:00

    let result = sut.collect::<Result<Vec<_>>>();

    assert_eq!(
        result.expect("parsing failed"),
        vec![
            Token::Open("ele".unwrap_into(), DUMMY_SPAN),
            Token::AttrName("foo".unwrap_into(), DUMMY_SPAN),
2021-11-12 13:59:14 -05:00
            Token::AttrValue("a:UNESC".intern(), DUMMY_SPAN),
2021-10-21 16:17:17 -04:00
            Token::AttrName("bar".unwrap_into(), DUMMY_SPAN),
2021-11-12 13:59:14 -05:00
            Token::AttrValue("b:UNESC".intern(), DUMMY_SPAN),
2021-10-21 16:17:17 -04:00
            Token::AttrName(("b", "baz").unwrap_into(), DUMMY_SPAN),
2021-11-12 13:59:14 -05:00
            Token::AttrValue("c:UNESC".intern(), DUMMY_SPAN),
2021-12-06 14:48:55 -05:00
            Token::AttrEnd(DUMMY_SPAN),
2021-10-21 16:17:17 -04:00
            Token::Close(None, DUMMY_SPAN),
        ],
    );
}

// Contrary to the specification, but this is the responsibility of XIRT; we
// need to allow it to support e.g. recovery, code formatting, and LSPs.
#[test]
fn permits_duplicate_attrs() {
2021-11-12 13:59:14 -05:00
    new_sut!(sut = r#"<dup attr="a" attr="b" />"#);
2021-10-21 16:17:17 -04:00

    let result = sut.collect::<Result<Vec<_>>>();

    assert_eq!(
        result.expect("parsing failed"),
        vec![
            Token::Open("dup".unwrap_into(), DUMMY_SPAN),
            Token::AttrName("attr".unwrap_into(), DUMMY_SPAN),
2021-11-12 13:59:14 -05:00
            Token::AttrValue("a:UNESC".intern(), DUMMY_SPAN),
2021-10-21 16:17:17 -04:00
            Token::AttrName("attr".unwrap_into(), DUMMY_SPAN),
2021-11-12 13:59:14 -05:00
|
|
|
Token::AttrValue("b:UNESC".intern(), DUMMY_SPAN),
|
2021-12-06 14:48:55 -05:00
|
|
|
Token::AttrEnd(DUMMY_SPAN),
|
2021-10-21 16:17:17 -04:00
|
|
|
Token::Close(None, DUMMY_SPAN),
|
|
|
|
],
|
|
|
|
);
|
|
|
|
}
|
|
|
|
|
#[test]
fn child_node_self_closing() {
    new_sut!(sut = r#"<root><child /></root>"#);

    let result = sut.collect::<Result<Vec<_>>>();

    assert_eq!(
        result.expect("parsing failed"),
        vec![
            Token::Open("root".unwrap_into(), DUMMY_SPAN),
            Token::AttrEnd(DUMMY_SPAN),
            Token::Open("child".unwrap_into(), DUMMY_SPAN),
            Token::AttrEnd(DUMMY_SPAN),
            Token::Close(None, DUMMY_SPAN),
            Token::Close(Some("root".unwrap_into()), DUMMY_SPAN),
        ],
    );
}

#[test]
fn sibling_nodes() {
    new_sut!(sut = r#"<root><child /><child /></root>"#);

    let result = sut.collect::<Result<Vec<_>>>();

    assert_eq!(
        result.expect("parsing failed"),
        vec![
            Token::Open("root".unwrap_into(), DUMMY_SPAN),
            Token::AttrEnd(DUMMY_SPAN),
            Token::Open("child".unwrap_into(), DUMMY_SPAN),
            Token::AttrEnd(DUMMY_SPAN),
            Token::Close(None, DUMMY_SPAN),
            Token::Open("child".unwrap_into(), DUMMY_SPAN),
            Token::AttrEnd(DUMMY_SPAN),
            Token::Close(None, DUMMY_SPAN),
            Token::Close(Some("root".unwrap_into()), DUMMY_SPAN),
        ],
    );
}

#[test]
fn child_node_with_attrs() {
    new_sut!(sut = r#"<root><child foo="bar" /></root>"#);

    let result = sut.collect::<Result<Vec<_>>>();

    assert_eq!(
        result.expect("parsing failed"),
        vec![
            Token::Open("root".unwrap_into(), DUMMY_SPAN),
            Token::AttrEnd(DUMMY_SPAN),
            Token::Open("child".unwrap_into(), DUMMY_SPAN),
            Token::AttrName("foo".unwrap_into(), DUMMY_SPAN),
            Token::AttrValue("bar:UNESC".intern(), DUMMY_SPAN),
            Token::AttrEnd(DUMMY_SPAN),
            Token::Close(None, DUMMY_SPAN),
            Token::Close(Some("root".unwrap_into()), DUMMY_SPAN),
        ],
    );
}

#[test]
fn child_text() {
    new_sut!(sut = r#"<text>foo bar</text>"#);

    let result = sut.collect::<Result<Vec<_>>>();

    assert_eq!(
        result.expect("parsing failed"),
        vec![
            Token::Open("text".unwrap_into(), DUMMY_SPAN),
            Token::AttrEnd(DUMMY_SPAN),
            Token::Text("foo bar:UNESC".into(), DUMMY_SPAN),
            Token::Close(Some("text".unwrap_into()), DUMMY_SPAN),
        ],
    );
}

#[test]
fn mixed_child_content() {
    new_sut!(sut = r#"<text>foo<em>bar</em></text>"#);

    let result = sut.collect::<Result<Vec<_>>>();

    assert_eq!(
        result.expect("parsing failed"),
        vec![
            Token::Open("text".unwrap_into(), DUMMY_SPAN),
            Token::AttrEnd(DUMMY_SPAN),
            Token::Text("foo:UNESC".into(), DUMMY_SPAN),
            Token::Open("em".unwrap_into(), DUMMY_SPAN),
            Token::AttrEnd(DUMMY_SPAN),
            Token::Text("bar:UNESC".into(), DUMMY_SPAN),
            Token::Close(Some("em".unwrap_into()), DUMMY_SPAN),
            Token::Close(Some("text".unwrap_into()), DUMMY_SPAN),
        ],
    );
}

// This is how XML is typically written; people don't perceive it as mixed,
// even though it is. This intentionally adds newlines before and after the
// opening and closing tags of the root node.
#[test]
fn mixed_child_content_with_newlines() {
    new_sut!(
        sut = r#"
<root>
 <child />
</root>
"#
    );

    let result = sut.collect::<Result<Vec<_>>>();

    assert_eq!(
        result.expect("parsing failed"),
        vec![
            Token::Text("\n:UNESC".into(), DUMMY_SPAN),
            Token::Open("root".unwrap_into(), DUMMY_SPAN),
            Token::AttrEnd(DUMMY_SPAN),
            Token::Text("\n :UNESC".into(), DUMMY_SPAN),
            Token::Open("child".unwrap_into(), DUMMY_SPAN),
            Token::AttrEnd(DUMMY_SPAN),
            Token::Close(None, DUMMY_SPAN),
            Token::Text("\n:UNESC".into(), DUMMY_SPAN),
            Token::Close(Some("root".unwrap_into()), DUMMY_SPAN),
            Token::Text("\n:UNESC".into(), DUMMY_SPAN),
        ],
    );
}

#[test]
fn comment() {
    new_sut!(sut = r#"<!--root--><root><!--<child>--></root>"#);

    let result = sut.collect::<Result<Vec<_>>>();

    assert_eq!(
        result.expect("parsing failed"),
        vec![
            Token::Comment("root".into(), DUMMY_SPAN),
            Token::Open("root".unwrap_into(), DUMMY_SPAN),
            Token::AttrEnd(DUMMY_SPAN),
            Token::Comment("<child>".into(), DUMMY_SPAN),
            Token::Close(Some("root".unwrap_into()), DUMMY_SPAN),
        ],
    );
}

#[test]
fn comment_multiline() {
    new_sut!(
        sut = r#"<mult><!--comment
on multiple
lines-->
</mult>"#
    );

    let result = sut.collect::<Result<Vec<_>>>();

    assert_eq!(
        result.expect("parsing failed"),
        vec![
            Token::Open("mult".unwrap_into(), DUMMY_SPAN),
            Token::AttrEnd(DUMMY_SPAN),
            Token::Comment("comment\non multiple\nlines".into(), DUMMY_SPAN),
            Token::Text("\n:UNESC".into(), DUMMY_SPAN),
            Token::Close(Some("mult".unwrap_into()), DUMMY_SPAN),
        ],
    );
}

// XIRT handles mismatch errors; XIR must explicitly support them.
#[test]
fn permits_mismatched_tags() {
    new_sut!(sut = r#"<root><child /></mismatch>"#);

    let result = sut.collect::<Result<Vec<_>>>();

    assert_eq!(
        result.expect("parsing failed"),
        vec![
            Token::Open("root".unwrap_into(), DUMMY_SPAN),
            Token::AttrEnd(DUMMY_SPAN),
            Token::Open("child".unwrap_into(), DUMMY_SPAN),
            Token::AttrEnd(DUMMY_SPAN),
            Token::Close(None, DUMMY_SPAN),
            Token::Close(Some("mismatch".unwrap_into()), DUMMY_SPAN),
        ],
    );
}

// TODO: Enough information for error recovery and reporting.
#[test]
fn node_name_invalid_utf8() {
    let bytes: &[u8] = &[b'<', INVALID_UTF8_BYTE, b'/', b'>'];
    new_sut!(b sut = bytes);

    let result = sut.collect::<Result<Vec<_>>>();

    match result {
        Ok(_) => panic!("expected failure"),
        Err(Error::InvalidUtf8(_, bytes)) => {
            assert_eq!(bytes, &[INVALID_UTF8_BYTE]);
        }
        _ => panic!("unexpected failure"),
    }
}

// TODO: Enough information for error recovery and reporting.
#[test]
fn attr_name_invalid_utf8() {
    let mut s = String::from("<a ");
    s.push_str(INVALID_STR);
    s.push_str(r#"="value"/>"#);

    new_sut!(sut = s);

    let result = sut.collect::<Result<Vec<_>>>();

    match result {
        Ok(_) => panic!("expected failure"),
        Err(Error::InvalidUtf8(_, bytes)) => {
            assert_eq!(bytes, &[INVALID_UTF8_BYTE]);
        }
        _ => panic!("unexpected failure"),
    }
}

// TODO: Enough information for error recovery and reporting.
#[test]
fn attr_value_invalid_utf8() {
    let mut s = String::from(r#"<a attr="bad"#);
    s.push_str(INVALID_STR);
    s.push_str(r#""/>"#);

    new_sut!(sut = s);

    let result = sut.collect::<Result<Vec<_>>>();

    match result {
        Ok(_) => panic!("expected failure"),
        Err(Error::InvalidUtf8(_, bytes)) => {
            // Doesn't make it to the Escaper.
            assert_eq!(bytes, &[b'b', b'a', b'd', INVALID_UTF8_BYTE]);
        }
        _ => panic!("unexpected failure"),
    }
}