tamer: xir::XirString: WIP implementation (likely going away)

I'm not fond of this implementation, which is why it's not fully
completed.  I wanted to commit this for future reference, and take the
opportunity to explain why I don't like it.

First: this task started as an idea to implement a third variant to
AttrValue and friends that indicates that a value is fixed, in the sense of
a fixed-point function: escaped or unescaped, its value is the same.  This
would allow us to skip wasteful escape/unescape operations.

In doing so, it became obvious that there's no need to leak this information
through the API, and indeed, no part of the system should care.  When we
read XML, it should be unescaped, and when we write, it should be
escaped.  The reason that this didn't quite happen to begin with was an
optimization: I'll be creating an echo writer in place of the current
filesystem-based copy in tamec shortly, and this would allow streaming XIR
directly from the reader to the writer without any unescaping or
re-escaping.

When we unescape, we know the value that it came from, so we could simply
store both symbols---they're 32-bit, so it results in a nicely compressed
64-bit value, so it's essentially cost-free, as long as we accept the
expense of internment.  This is `XirString`.  Then, when we want to escape
or unescape, we first check to see whether a symbol already exists and, if
so, use it.

While this works well for echoing streams, it won't work all that well in
practice: the unescaped SymbolId will be taken and the XirString discarded,
since nothing after XIR should be coupled with it.  Then, when we later
construct a XIR stream for writting, XirString will no longer be available
and our previously known escape is lost, so the writer will have to
re-escape.

Further, if we look at XirString's generic for the XirStringEscaper---it
uses phantom, which hints that maybe it's not in the best place.  Indeed,
I've already acknowledged that only a reader unescapes and only a writer
escapes, and that the rest of the system works with normal (unescaped)
values, so only readers and writers should be part of this process.  I also
already acknowledged that XirString would be lost and only the unescaped
SymbolId would be used.

So what's the point of XirString, then, if it won't be a useful optimization
beyond the temporary echo writer?

Instead, we can take the XirStringWriter and implement two caches on that:
mapping SymbolId from escaped->unescaped and vice-versa.  These can be
simple vectors, since SymbolId is a 32-bit value we will not have much
wasted space for symbols that never get read or written.  We could even
optimize for preinterned symbols using markers, though I'll probably not do
so, and I'll explain why later.

If we do _that_, we get even _better_ optimizations through caching that
_will_ apply in the general case (so, not just for echo), and we're able to
ditch XirString entirely and simply use a SymbolId.  This makes for a much
more friendly API that isn't leaking implementation details, though it
_does_ put an onus on the caller to pass the encoder to both the reader and
the writer, _if_ it wants to take advantage of a cache.  But that burden is
not significant (and is, again, optional if we don't want it).

So, that'll be the next step.
main
Mike Gerwitz 2021-11-10 09:42:18 -05:00
parent c57aa7fb53
commit b1c0783c75
13 changed files with 463 additions and 143 deletions

4
tamer/Cargo.lock generated
View File

@ -217,9 +217,9 @@ dependencies = [
[[package]]
name = "quick-xml"
version = "0.22.0"
version = "0.23.0-alpha3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8533f14c8382aaad0d592c812ac3b826162128b65662331e1127b45c3d18536b"
checksum = "e6c002cfafa1de674ef8557e3dbf2ad8c03ac8eb3d162672411b1b7fb4d272ef"
dependencies = [
"memchr",
]

View File

@ -31,7 +31,7 @@ arrayvec = ">= 0.7.1"
bumpalo = ">= 2.6.0"
fxhash = ">= 0.2.1"
petgraph = "0.5.1" # TODO: petgraph-graphml holds this back
quick-xml = ">= 0.17.0"
quick-xml = ">= 0.23.0-alpha3"
getopts = "0.2"
exitcode = "1.1.2"
lazy_static = ">= 1.4.0"

View File

@ -139,7 +139,7 @@ mod writer {
};
use std::borrow::Cow;
use tamer::span::Span;
use tamer::xir::{writer::XmlWriter, AttrValue, Text};
use tamer::xir::{writer::XmlWriter, Text};
const FRAGMENT: &str = r#"<fragment>
This is pretend fragment text. We need a lot of it.
@ -210,9 +210,9 @@ This is pretend fragment text. We need a lot of it.</fragment>
vec![
Token::Open(name, span),
Token::AttrName(attr1, span),
Token::AttrValue(AttrValue::Escaped(val1), span),
Token::AttrValue(val1.into(), span),
Token::AttrName(attr2, span),
Token::AttrValue(AttrValue::Escaped(val2), span),
Token::AttrValue(val2.into(), span),
Token::Close(None, span),
]
.into_iter()

View File

@ -35,7 +35,7 @@ use crate::{
sym::{st::*, SymbolId},
xir::{
iter::{elem_wrap, ElemWrapIter},
AttrValue, QName, Text, Token,
AttrValue, QName, Text, Token, XirString,
},
};
use arrayvec::ArrayVec;
@ -75,7 +75,12 @@ type HeaderIter = array::IntoIter<Token, HEADER_SIZE>;
/// and its immediate child.
#[inline]
fn header(pkg_name: SymbolId, relroot: SymbolId) -> HeaderIter {
let pkg_name_val = AttrValue::Escaped(pkg_name);
// TODO: Introduce newtypes so that we do not have to make unsafe
// assumptions.
let pkg_name_val =
AttrValue::from(unsafe { XirString::assume_fixed(pkg_name) });
let relroot_val =
AttrValue::from(unsafe { XirString::assume_fixed(relroot) });
[
Token::AttrName(QN_XMLNS, LSPAN),
@ -91,7 +96,7 @@ fn header(pkg_name: SymbolId, relroot: SymbolId) -> HeaderIter {
Token::AttrName(QN_NAME, LSPAN),
Token::AttrValue(pkg_name_val, LSPAN),
Token::AttrName(QN_UUROOTPATH, LSPAN),
Token::AttrValue(AttrValue::Escaped(relroot), LSPAN),
Token::AttrValue(relroot_val, LSPAN),
]
.into_iter()
}
@ -130,7 +135,9 @@ impl<'a> DepListIter<'a> {
toks: ArrayVec::new(),
// TODO: we cannot trust that an arbitrary symbol is escaped; this
// needs better typing, along with other things.
relroot: AttrValue::Escaped(relroot),
relroot: AttrValue::from(unsafe {
XirString::assume_fixed(relroot)
}),
}
}
@ -163,7 +170,12 @@ impl<'a> DepListIter<'a> {
self.toks_push_attr(QN_PARENT, src.parent);
if let Some(pkg_name) = src.pkg_name {
self.toks.push(Token::AttrValue(AttrValue::Escaped(pkg_name), LSPAN));
// TODO: Introduce newtypes so that we do not have to make unsafe
// assumptions.
let pkg_name_val =
AttrValue::from(unsafe { XirString::assume_fixed(pkg_name) });
self.toks.push(Token::AttrValue(pkg_name_val, LSPAN));
self.toks.push(Token::AttrValueFragment(self.relroot, LSPAN));
self.toks.push(Token::AttrName(QN_SRC, LSPAN));
}
@ -182,16 +194,14 @@ impl<'a> DepListIter<'a> {
/// Optionally push an attribute if it has a `value`.
///
/// _The provided `value` must be escaped;_
/// it is blindly wrapped in [`AttrValue::Escaped`]!
///
/// Like [`refill_toks`](DepListIter::refill_toks),
/// we push in reverse.
#[inline]
fn toks_push_attr(&mut self, name: QName, value: Option<SymbolId>) {
if let Some(val) = value {
self.toks
.push(Token::AttrValue(AttrValue::Escaped(val), LSPAN));
let attr_val = AttrValue::from(val);
self.toks.push(Token::AttrValue(attr_val, LSPAN));
self.toks.push(Token::AttrName(name, LSPAN));
}
}
@ -274,8 +284,11 @@ impl MapFromsIter {
self.iter.next().and_then(|from| {
self.toks.push(Token::Close(None, LSPAN));
self.toks
.push(Token::AttrValue(AttrValue::Escaped(from), LSPAN));
// TODO
let from_val =
AttrValue::from(unsafe { XirString::assume_fixed(from) });
self.toks.push(Token::AttrValue(from_val, LSPAN));
self.toks.push(Token::AttrName(QN_NAME, LSPAN));
Some(Token::Open(QN_L_FROM, LSPAN))

View File

@ -256,12 +256,12 @@ fn test_writes_deps() -> TestResult {
assert_eq!(
attrs.find(QN_NAME).and_then(|a| a.value_atom()),
Some(AttrValue::Escaped(ident.name())),
Some(AttrValue::from(ident.name())),
);
assert_eq!(
attrs.find(QN_TYPE).and_then(|a| a.value_atom()),
Some(AttrValue::Escaped(ident.kind().unwrap().as_sym()))
Some(AttrValue::from(ident.kind().unwrap().as_sym()))
);
let generated = attrs.find(QN_GENERATED).and_then(|a| a.value_atom());
@ -270,25 +270,17 @@ fn test_writes_deps() -> TestResult {
generated: true, ..
}) = ident.src()
{
assert_eq!(generated, Some(AttrValue::Escaped("true".intern())));
assert_eq!(generated, Some(AttrValue::from("true".intern())));
} else {
assert_eq!(generated, None);
}
if let Some(Source { parent, .. }) = ident.src() {
assert_attr!(
attrs,
QN_PARENT,
parent.map(|x| AttrValue::Escaped(x)),
);
assert_attr!(attrs, QN_PARENT, parent.map(|x| AttrValue::from(x)),);
}
if let Some(Source { yields, .. }) = ident.src() {
assert_attr!(
attrs,
QN_YIELDS,
yields.map(|x| AttrValue::Escaped(x)),
);
assert_attr!(attrs, QN_YIELDS, yields.map(|x| AttrValue::from(x)),);
}
if let Some(Source {
@ -300,8 +292,11 @@ fn test_writes_deps() -> TestResult {
// `SymbolId`. Once the reader takes care of creating the
// symbol, we'll have no such problem.
match attrs.find(QN_DESC).and_then(|a| a.value_atom()) {
Some(AttrValue::Escaped(given)) => {
assert_eq!(desc.lookup_str(), given.lookup_str());
Some(given) => {
assert_eq!(
desc.lookup_str(),
Into::<SymbolId>::into(given).lookup_str()
);
}
invalid => panic!("unexpected desc: {:?}", invalid),
}
@ -317,8 +312,8 @@ fn test_writes_deps() -> TestResult {
assert_eq!(
parts.value_fragments(),
&vec![
(AttrValue::Escaped(relroot), LSPAN),
(AttrValue::Escaped(*pkg_name), LSPAN),
(AttrValue::from(relroot), LSPAN),
(AttrValue::from(*pkg_name), LSPAN),
]
);
}
@ -332,7 +327,7 @@ fn test_writes_deps() -> TestResult {
assert_attr!(
attrs,
QN_DIM,
Some(AttrValue::Escaped((*dim).into())),
Some(AttrValue::from(Into::<SymbolId>::into(*dim))),
"invalid {:?} @dim",
ident.kind()
);
@ -346,7 +341,7 @@ fn test_writes_deps() -> TestResult {
assert_attr!(
attrs,
QN_DIM,
Some(AttrValue::Escaped((*dim).into())),
Some(AttrValue::from(Into::<SymbolId>::into(*dim))),
"invalid {:?} @dim",
ident.kind()
);
@ -354,7 +349,7 @@ fn test_writes_deps() -> TestResult {
assert_attr!(
attrs,
QN_DTYPE,
Some(AttrValue::Escaped((*dtype).into())),
Some(AttrValue::from(Into::<SymbolId>::into(*dtype))),
"invalid {:?} @dtype",
ident.kind()
);
@ -364,7 +359,7 @@ fn test_writes_deps() -> TestResult {
assert_attr!(
attrs,
QN_DTYPE,
Some(AttrValue::Escaped((*dtype).into())),
Some(AttrValue::from(Into::<SymbolId>::into(*dtype))),
"invalid {:?} @dim",
ident.kind()
);
@ -443,8 +438,8 @@ fn test_writes_map_froms() -> TestResult {
);
});
assert!(found.contains(&AttrValue::Escaped("froma".intern())));
assert!(found.contains(&AttrValue::Escaped("fromb".intern())));
assert!(found.contains(&AttrValue::from("froma".intern())));
assert!(found.contains(&AttrValue::from("fromb".intern())));
Ok(())
}

View File

@ -73,11 +73,15 @@ use crate::sym::{
use memchr::memchr;
use std::convert::{TryFrom, TryInto};
use std::fmt::Display;
use std::hash::Hash;
use std::ops::Deref;
mod error;
pub use error::Error;
mod string;
pub use string::XirString;
pub mod iter;
pub mod pred;
pub mod reader;
@ -497,39 +501,37 @@ pub enum Text {
/// This does, however, put the onus on the caller to ensure that they got
/// the escaping status correct.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum AttrValue {
/// Value that requires escaping.
///
/// Unescaped values require further processing before writing.
Unescaped(SymbolId),
/// Value that either has already been escaped or is known not to
/// require escaping.
///
/// Escaped values can be written as-is without any further processing.
Escaped(SymbolId),
}
pub struct AttrValue(XirString);
impl AttrValue {
/// Construct a constant escaped attribute from a static C-style symbol.
pub const fn st_cid(sym: CIdentStaticSymbolId) -> Self {
Self::Escaped(sym.as_sym())
Self(XirString::st_cid(sym))
}
/// Construct a constant escaped attribute from a static URI symbol.
///
/// URIs are expected _not_ to contain quotes.
pub const fn st_uri(sym: UriStaticSymbolId) -> Self {
Self::Escaped(sym.as_sym())
Self(XirString::st_uri(sym))
}
}
impl<T: Into<XirString>> From<T> for AttrValue {
fn from(s: T) -> Self {
Self(s.into())
}
}
impl Into<SymbolId> for AttrValue {
fn into(self) -> SymbolId {
self.0.into()
}
}
impl Display for AttrValue {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
match self {
Self::Escaped(value) => value.fmt(f),
_ => todo!("AttrValue::Unescaped fmt"),
}
self.0.fmt(f)
}
}

View File

@ -19,10 +19,11 @@
//! XIR error information.
use crate::tpwrap::quick_xml;
use std::{fmt::Display, str::Utf8Error};
/// Error attempting to produce a XIR object.
#[derive(Debug, PartialEq, Eq)]
#[derive(Debug, PartialEq)]
pub enum Error {
/// Provided name contains a `':'`.
NCColon(Vec<u8>),
@ -36,6 +37,9 @@ pub enum Error {
// we allow the displayer to determine how to handle invalid UTF-8
// encodings.
InvalidUtf8(Utf8Error, Vec<u8>),
// TODO: Better error translation and spans.
QuickXmlError(quick_xml::Error),
}
impl Display for Error {
@ -62,6 +66,10 @@ impl Display for Error {
String::from_utf8_lossy(bytes)
)
}
// TODO: See Error TODO
Self::QuickXmlError(inner) => {
write!(f, "internal parser error: {:?}", inner)
}
}
}
}
@ -81,9 +89,8 @@ impl From<(Utf8Error, &[u8])> for Error {
}
}
impl From<quick_xml::Error> for Error {
fn from(err: quick_xml::Error) -> Self {
// These need to be translated to provide our own errors.
todo!("quick_xml::Error: {:?}", err)
impl<E: Into<quick_xml::Error>> From<E> for Error {
fn from(err: E) -> Self {
Self::QuickXmlError(err.into().into())
}
}

View File

@ -21,7 +21,7 @@
//!
//! This uses [`quick_xml`] as the parser.
use super::{AttrValue, Error, Token};
use super::{Error, Token, XirString};
use crate::{span::DUMMY_SPAN, sym::GlobalSymbolInternBytes, xir::Text};
use quick_xml::{
self,
@ -188,6 +188,9 @@ impl<B: BufRead> XmlXirReader<B> {
for result in attrs.with_checks(false) {
let attr = result?;
// The name must be parsed as a QName.
let name = attr.key.try_into()?;
// The attribute value,
// having just been read from XML,
// must have been escaped to be parsed properly.
@ -196,10 +199,8 @@ impl<B: BufRead> XmlXirReader<B> {
// that's okay as long as we can read it again,
// but we probably should still throw an error if we
// encounter such a situation.
let value = AttrValue::Escaped(attr.value.as_ref().intern_utf8()?);
// The name must be parsed as a QName.
let name = attr.key.try_into()?;
let value =
XirString::from_escaped_raw(attr.value.as_ref())?.into();
tokbuf.push_front(Token::AttrName(name, DUMMY_SPAN));
tokbuf.push_front(Token::AttrValue(value, DUMMY_SPAN));
@ -224,7 +225,7 @@ impl<B: BufRead> Iterator for XmlXirReader<B> {
fn next(&mut self) -> Option<Self::Item> {
self.tokbuf
.pop_back()
.map(|tok| Ok(tok))
.map(Result::Ok)
.or_else(|| self.refill_buf())
}
}

View File

@ -18,6 +18,7 @@
// along with this program. If not, see <http://www.gnu.org/licenses/>.
use super::*;
use crate::sym::GlobalSymbolIntern;
use crate::{
convert::ExpectInto,
span::DUMMY_SPAN,
@ -83,10 +84,7 @@ fn does_not_resolve_xmlns() {
Token::Open("no-ns".unwrap_into(), DUMMY_SPAN),
// Since we didn't parse @xmlns, it's still an attribute.
Token::AttrName("xmlns".unwrap_into(), DUMMY_SPAN),
Token::AttrValue(
AttrValue::Escaped("noresolve".into()),
DUMMY_SPAN
),
Token::AttrValue(AttrValue::from("noresolve".intern()), DUMMY_SPAN),
Token::AttrEnd,
Token::Close(None, DUMMY_SPAN),
],
@ -106,10 +104,7 @@ fn empty_node_with_prefix_without_attributes_unresolved() {
vec![
Token::Open(("x", "empty-node").unwrap_into(), DUMMY_SPAN),
Token::AttrName(("xmlns", "x").unwrap_into(), DUMMY_SPAN),
Token::AttrValue(
AttrValue::Escaped("noresolve".into()),
DUMMY_SPAN
),
Token::AttrValue(AttrValue::from("noresolve".intern()), DUMMY_SPAN),
Token::AttrEnd,
Token::Close(None, DUMMY_SPAN),
],
@ -144,11 +139,11 @@ fn multiple_attrs_ordered() {
vec![
Token::Open("ele".unwrap_into(), DUMMY_SPAN),
Token::AttrName("foo".unwrap_into(), DUMMY_SPAN),
Token::AttrValue(AttrValue::Escaped("a".into()), DUMMY_SPAN),
Token::AttrValue(AttrValue::from("a".intern()), DUMMY_SPAN),
Token::AttrName("bar".unwrap_into(), DUMMY_SPAN),
Token::AttrValue(AttrValue::Escaped("b".into()), DUMMY_SPAN),
Token::AttrValue(AttrValue::from("b".intern()), DUMMY_SPAN),
Token::AttrName(("b", "baz").unwrap_into(), DUMMY_SPAN),
Token::AttrValue(AttrValue::Escaped("c".into()), DUMMY_SPAN),
Token::AttrValue(AttrValue::from("c".intern()), DUMMY_SPAN),
Token::AttrEnd,
Token::Close(None, DUMMY_SPAN),
],
@ -168,9 +163,9 @@ fn permits_duplicate_attrs() {
vec![
Token::Open("dup".unwrap_into(), DUMMY_SPAN),
Token::AttrName("attr".unwrap_into(), DUMMY_SPAN),
Token::AttrValue(AttrValue::Escaped("a".into()), DUMMY_SPAN),
Token::AttrValue(AttrValue::from("a".intern()), DUMMY_SPAN),
Token::AttrName("attr".unwrap_into(), DUMMY_SPAN),
Token::AttrValue(AttrValue::Escaped("b".into()), DUMMY_SPAN),
Token::AttrValue(AttrValue::from("b".intern()), DUMMY_SPAN),
Token::AttrEnd,
Token::Close(None, DUMMY_SPAN),
],
@ -231,7 +226,7 @@ fn child_node_with_attrs() {
Token::AttrEnd,
Token::Open("child".unwrap_into(), DUMMY_SPAN),
Token::AttrName("foo".unwrap_into(), DUMMY_SPAN),
Token::AttrValue(AttrValue::Escaped("bar".into()), DUMMY_SPAN),
Token::AttrValue(AttrValue::from("bar".intern()), DUMMY_SPAN),
Token::AttrEnd,
Token::Close(None, DUMMY_SPAN),
Token::Close(Some("root".unwrap_into()), DUMMY_SPAN),

View File

@ -0,0 +1,321 @@
// XIR string and escape context
//
// Copyright (C) 2014-2021 Ryan Specialty Group, LLC.
//
// This file is part of TAME.
//
// This program is free software: you can redistribute it and/or modify
// it under the terms of the GNU General Public License as published by
// the Free Software Foundation, either version 3 of the License, or
// (at your option) any later version.
//
// This program is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
// GNU General Public License for more details.
//
// You should have received a copy of the GNU General Public License
// along with this program. If not, see <http://www.gnu.org/licenses/>.
//! [`XirString`] and escape context.
//!
//! Safety
//! ======
//! The purpose of this type is to provide safety against XML injection by
//! encapsulating all responsibility within a single object.
//! The idea is simple:
//! if you have a safely constructed [`XirString`],
//! it can be safely used in any context without worrying about these
//! critical problems:
//!
//! 1. XML injection (via lack of escaping); and
//! 2. Erroneous multiple escape/unescape.
//!
//! Both of these problems are solved by ensuring that the proper context
//! for a given [`SymbolId`] is always maintained---a
//! symbol is either valid to be written or not.
//! Similarly,
//! we must know whether a symbol is escaped or not to know whether it
//! ought to be unescaped while reading.
//!
//! This context also ensures that we will not erroneously unescape or
//! re-escape values at multiple points in a program,
//! leading to incorrect data at best and vulnerabilities at worst.
//!
//! To ensure this safety,
//! it is important that types understand how to convert between
//! one-another in well-defined ways.
//! It should not be possible "just assume" that a value has already been
//! escaped.
//! Given that,
//! the constructors for this type are private to this module;
//! readers know the escape status and can produce the proper type and
//! internal types know how to translate between one-another,
//! but anything else making those assumptions is considered unsafe.
//!
//! Outside of this module,
//! it is assumed that the rest of the system is working with _unescaped_
//! values---afterall,
//! why would other parts of the system not dealing with XML directly
//! take it upon themselves to deal with XML directly?
//! Given that,
//! other modules can only read unescaped values and construct
//! [`XirString`] using unescaped values.
//! If we permitted retrieving raw escaped [`SymbolId`]s,
//! then we could construct from it another [`XirString`] that is
//! considered to be unescaped,
//! which would result in a double-escape if it were read using
//! [`XirString::escaped`].
use crate::sym::{
CIdentStaticSymbolId, GlobalSymbolInternBytes, GlobalSymbolInternUnchecked,
GlobalSymbolResolve, SymbolId, UriStaticSymbolId,
};
use std::{
borrow::Cow,
fmt::Display,
hash::{Hash, Hasher},
marker::PhantomData,
};
use super::Error;
/// An XML string that requires escaping before writing.
///
/// This type must be used in contexts where writing to an XML document is
/// not safe without proper escaping,
/// and where reading may require unescaping.
///
#[derive(Debug, Clone, Copy)]
pub struct XirString<S: XirStringEscaper = DefaultXirStringEscaper> {
unescaped: SymbolId,
escaped: Option<SymbolId>,
_escaper: PhantomData<S>,
}
// Since we implement Copy,
// ensure this size meets our expectations both as a sanity check and to
// ensure that attention is brought to this if it ever changes.
const_assert!(std::mem::size_of::<XirString>() <= std::mem::size_of::<usize>());
// A note about this type:
// Both fields are optional,
// but it is not valid to have both be `None`.
// To ensure that this is not possible,
// (a) the fields must remain private;
// (b) all constructors must initialize at least one of the fields; and
// (c) mutation must reconstruct using those constructors.
// This makes it possible to prove that the invariant always holds.
impl<S: XirStringEscaper> XirString<S> {
pub(super) fn from_escaped_raw(escaped: &[u8]) -> Result<Self, Error> {
let esc_sym = escaped.intern_utf8()?;
Ok(Self {
escaped: Some(esc_sym),
unescaped: match S::unescape_bytes(escaped)? {
// We got back what we sent in,
// so this value is fixed.
Cow::Borrowed(orig) => {
debug_assert!(orig == escaped);
esc_sym
}
// The value was rewritten,
// meaning that the original was escaped.
// We can't assume that it's valid UTF-8.
Cow::Owned(unesc) => unesc.intern_utf8()?,
},
_escaper: PhantomData,
})
}
pub const fn new_unescaped(sym: SymbolId) -> Self {
Self {
escaped: None,
unescaped: sym,
_escaper: PhantomData,
}
}
const fn new_fixed(sym: SymbolId) -> Self {
Self {
escaped: Some(sym),
unescaped: sym,
_escaper: PhantomData,
}
}
pub const unsafe fn assume_fixed(sym: SymbolId) -> Self {
Self::new_fixed(sym)
}
/// Construct a constant escaped attribute from a static C-style symbol.
pub const fn st_cid(sym: CIdentStaticSymbolId) -> Self {
Self::new_fixed(sym.as_sym())
}
/// Construct a constant escaped attribute from a static URI symbol.
///
/// URIs are expected _not_ to contain quotes.
pub const fn st_uri(sym: UriStaticSymbolId) -> Self {
Self::new_fixed(sym.as_sym())
}
// TODO: This unnecessarily allocates a symbol that'll just be written
// and not needed thereafter.
#[inline]
pub(super) fn into_escaped(self) -> SymbolId {
self.escaped.unwrap_or_else(|| S::escape(self.unescaped))
}
#[inline]
pub fn unescaped(&self) -> SymbolId {
self.unescaped
}
}
impl PartialEq for XirString {
fn eq(&self, other: &Self) -> bool {
self.unescaped == other.unescaped
}
}
impl Eq for XirString {}
impl Hash for XirString {
fn hash<H: Hasher>(&self, state: &mut H) {
self.unescaped.hash(state);
}
}
impl From<SymbolId> for XirString {
fn from(sym: SymbolId) -> Self {
Self::new_unescaped(sym)
}
}
impl Into<SymbolId> for XirString {
fn into(self) -> SymbolId {
self.unescaped
}
}
impl Display for XirString {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
self.unescaped.fmt(f)
}
}
pub trait XirStringEscaper {
fn escape_bytes(value: &[u8]) -> Cow<[u8]>;
fn unescape_bytes(value: &[u8]) -> Result<Cow<[u8]>, Error>;
fn escape(sym: SymbolId) -> SymbolId {
match Self::escape_bytes(sym.lookup_str().as_bytes()) {
// We got back what we sent in,
// so this value is fixed.
Cow::Borrowed(_) => sym,
// The value changed,
// so we must allocate a new symbol.
// SAFETY: The unescaped symbol is valid UTF-8 unless it was
// unsafely allocated.
// Given that escaping does not introduce any invalid UTF-8
// sequences
// (as is trivially verified by reading its implementation),
// we can skip the UTF-8 check.
Cow::Owned(esc) => unsafe { esc[..].intern_utf8_unchecked() },
}
}
}
#[derive(Debug, Clone, Copy)]
pub struct QuickXmlXirStringEscaper {}
impl XirStringEscaper for QuickXmlXirStringEscaper {
#[inline]
fn escape_bytes(value: &[u8]) -> Cow<[u8]> {
quick_xml::escape::escape(value)
}
#[inline]
fn unescape_bytes(value: &[u8]) -> Result<Cow<[u8]>, Error> {
// For some reason,
// quick-xml has made EscapeError explicitly private to the crate,
// and so it is opaque to us.
// They have, however,
// implemented `From<EscapeError> for Error`,
// which we will use here.
Ok(quick_xml::escape::unescape(value)
.map_err(quick_xml::Error::from)?)
}
}
pub type DefaultXirStringEscaper = QuickXmlXirStringEscaper;
#[cfg(test)]
mod test {
use super::*;
use crate::sym::GlobalSymbolIntern;
type Sut<S> = XirString<S>;
#[test]
fn create_from_and_retrieve_unescaped() {
let sym = "foo".intern();
let sut = Sut::new_unescaped(sym);
// Converting to a symbol yields the _unescaped_ value.
assert_eq!(sym, sut.into());
// An explicit method is also provided when the clarity is deemed
// necessary.
assert_eq!(sym, sut.unescaped());
}
// The unescaped values are used to identify the SUT.
#[test]
fn eq_on_unescape() {
let sym = "equal".intern();
assert_eq!(Sut::new_unescaped(sym), Sut::new_unescaped(sym));
}
#[test]
fn escapes_using_escaper() {
const GIVEN: &str = "str to escape";
const EXPECTED: &str = "ESCAPED";
struct MockEscaper {}
impl XirStringEscaper for MockEscaper {
fn escape_bytes<'a>(value: &'a [u8]) -> Cow<'a, [u8]> {
assert_eq!(GIVEN.as_bytes(), value);
Cow::Owned(EXPECTED.as_bytes().to_owned())
}
fn unescape_bytes(_: &[u8]) -> Result<Cow<[u8]>, Error> {
unreachable!("not used in this test")
}
}
let sut = Sut::<MockEscaper>::new_unescaped(GIVEN.intern());
// Note that this uses the MockEscaper defined above,
// _not_ quick_xml.
assert_eq!(EXPECTED.intern(), sut.into_escaped());
}
// Simple sanity check to ensure that the default escaper actually does
// some sort of escaping.
#[test]
fn default_escaper_escapes() {
assert_eq!(
"foo&lt;bar".intern(),
Sut::<DefaultXirStringEscaper>::new_unescaped("foo<bar".intern())
.into_escaped()
);
}
}

View File

@ -320,7 +320,7 @@ mod test {
#[test]
fn attr_into_attr_parts() {
let name = "attr".unwrap_into();
let value = AttrValue::Escaped("value".intern());
let value = AttrValue::from("value".intern());
let attr = SimpleAttr {
name,
@ -347,8 +347,8 @@ mod test {
#[test]
fn push_attr_part() {
let name = "pushattr".unwrap_into();
let value1 = AttrValue::Escaped("first".intern());
let value2 = AttrValue::Escaped("second".intern());
let value1 = AttrValue::from("first".intern());
let value2 = AttrValue::from("second".intern());
let mut attr = Attr::new_extensible_with_capacity(name, *S, 2).parts();
@ -361,8 +361,8 @@ mod test {
#[test]
fn attr_from_parts() {
let name = "pushattr".unwrap_into();
let value1 = AttrValue::Escaped("first".intern());
let value2 = AttrValue::Escaped("second".intern());
let value1 = AttrValue::from("first".intern());
let value2 = AttrValue::from("second".intern());
let attr =
Attr::from_fragments(name, *S, vec![(value1, *S), (value2, *S2)])
@ -374,9 +374,9 @@ mod test {
#[test]
fn into_fragments_to_reuse_buffer_for_parts() {
let name = "partbuffer".unwrap_into();
let value1 = AttrValue::Escaped("first".intern());
let value2 = AttrValue::Escaped("second".intern());
let value3 = AttrValue::Escaped("third".intern());
let value1 = AttrValue::from("first".intern());
let value2 = AttrValue::from("second".intern());
let value3 = AttrValue::from("third".intern());
let frags = vec![(value1, *S2), (value2, *S)];

View File

@ -73,9 +73,9 @@ mod attrs {
let b = "b".unwrap_into();
let attra =
Attr::new(a, AttrValue::Escaped("a value".into()), (*S, *S2));
Attr::new(a, AttrValue::from("a value".intern()), (*S, *S2));
let attrb =
Attr::new(b, AttrValue::Escaped("b value".into()), (*S, *S2));
Attr::new(b, AttrValue::from("b value".intern()), (*S, *S2));
let attrs = AttrList::from([attra.clone(), attrb.clone()]);
@ -162,10 +162,10 @@ fn empty_element_with_attrs_from_toks() {
let name = ("ns", "elem").unwrap_into();
let attr1 = "a".unwrap_into();
let attr2 = "b".unwrap_into();
let val1 = AttrValue::Escaped("val1".intern());
let val2a = AttrValue::Escaped("val2a".intern());
let val2b = AttrValue::Escaped("val2b".intern());
let val2c = AttrValue::Escaped("val2b".intern());
let val1 = AttrValue::from("val1".intern());
let val2a = AttrValue::from("val2a".intern());
let val2b = AttrValue::from("val2b".intern());
let val2c = AttrValue::from("val2b".intern());
let toks = [
Token::Open(name, *S),
@ -217,7 +217,7 @@ fn child_element_after_attrs() {
let name = ("ns", "elem").unwrap_into();
let child = "child".unwrap_into();
let attr = "a".unwrap_into();
let val = AttrValue::Escaped("val".intern());
let val = AttrValue::from("val".intern());
let toks = [
Token::Open(name, *S),
@ -301,7 +301,7 @@ fn element_with_child_with_attributes() {
let parent = "parent".unwrap_into();
let child = "child".unwrap_into();
let attr = "attr".unwrap_into();
let value = AttrValue::Escaped("attr value".into());
let value = AttrValue::from("attr value".intern());
let toks = [
Token::Open(parent, *S),
@ -360,7 +360,7 @@ fn element_with_text() {
fn parser_from_filters_incomplete() {
let name = ("ns", "elem").unwrap_into();
let attr = "a".unwrap_into();
let val = AttrValue::Escaped("val1".intern());
let val = AttrValue::from("val1".intern());
let toks = [
Token::Open(name, *S),
@ -406,7 +406,7 @@ fn parse_attrs_fails_if_first_token_is_non_attr() {
fn parse_attrs_fails_if_end_before_attr_end() {
let mut toks = [
Token::AttrName("foo".unwrap_into(), *S),
Token::AttrValue(AttrValue::Escaped("bar".into()), *S),
Token::AttrValue(AttrValue::from("bar".intern()), *S),
// No Token::AttrEnd
]
.into_iter();
@ -423,7 +423,7 @@ fn parse_attrs_fails_if_missing_attr_end() {
// of Token::AttrEnd.
let mut toks = [
Token::AttrName("foo".unwrap_into(), *S),
Token::AttrValue(AttrValue::Escaped("bar".into()), *S2),
Token::AttrValue(AttrValue::from("bar".intern()), *S2),
// No Token::AttrEnd
Token::Close(None, *S3),
]
@ -439,8 +439,8 @@ fn parse_attrs_fails_if_missing_attr_end() {
fn parse_attrs_isolated() {
let attr1 = "one".unwrap_into();
let attr2 = "two".unwrap_into();
let val1 = AttrValue::Escaped("val1".into());
let val2 = AttrValue::Escaped("val2".into());
let val1 = AttrValue::from("val1".intern());
let val2 = AttrValue::from("val2".intern());
let mut toks = [
Token::AttrName(attr1, *S),
@ -476,8 +476,8 @@ fn attr_parser_with_non_attr_token() {
fn parser_attr_multiple() {
let attr1 = "one".unwrap_into();
let attr2 = "two".unwrap_into();
let val1 = AttrValue::Escaped("val1".into());
let val2 = AttrValue::Escaped("val2".into());
let val1 = AttrValue::from("val1".intern());
let val2 = AttrValue::from("val2".intern());
let mut toks = [
Token::AttrName(attr1, *S),

View File

@ -207,42 +207,36 @@ impl XmlWriter for Token {
Ok(S::AttrNameAdjacent)
}
(
Self::AttrValue(AttrValue::Escaped(value), _),
S::AttrNameAdjacent,
) => {
(Self::AttrValue(AttrValue(value), _), S::AttrNameAdjacent) => {
sink.write(b"=\"")?;
sink.write(value.lookup_str().as_bytes())?;
sink.write(value.into_escaped().lookup_str().as_bytes())?;
sink.write(b"\"")?;
Ok(S::NodeOpen)
}
(Self::AttrValue(AttrValue(value), _), S::AttrFragmentAdjacent) => {
sink.write(value.into_escaped().lookup_str().as_bytes())?;
sink.write(b"\"")?;
Ok(S::NodeOpen)
}
(
Self::AttrValue(AttrValue::Escaped(value), _),
S::AttrFragmentAdjacent,
) => {
sink.write(value.lookup_str().as_bytes())?;
sink.write(b"\"")?;
Ok(S::NodeOpen)
}
(
Self::AttrValueFragment(AttrValue::Escaped(value), _),
Self::AttrValueFragment(AttrValue(value), _),
S::AttrNameAdjacent,
) => {
sink.write(b"=\"")?;
sink.write(value.lookup_str().as_bytes())?;
sink.write(value.into_escaped().lookup_str().as_bytes())?;
Ok(S::AttrFragmentAdjacent)
}
(
Self::AttrValueFragment(AttrValue::Escaped(value), _),
Self::AttrValueFragment(AttrValue(value), _),
S::AttrFragmentAdjacent,
) => {
sink.write(value.lookup_str().as_bytes())?;
sink.write(value.into_escaped().lookup_str().as_bytes())?;
Ok(S::AttrFragmentAdjacent)
}
@ -295,13 +289,7 @@ impl XmlWriter for Token {
// As-of-yet unsupported operations that weren't needed at the
// time of writing, but were planned for in the design of Xir.
(
invalid
@
(Self::AttrName(_, _)
| Self::AttrValue(AttrValue::Unescaped(_), _)),
S::AttrNameAdjacent,
)
(invalid @ Self::AttrName(_, _), S::AttrNameAdjacent)
| (invalid @ Self::Text(Text::Unescaped(_), _), S::NodeExpected)
| (invalid @ Self::CData(Text::Escaped(_), _), S::NodeExpected) => {
Err(Error::Todo(format!("{:?}", invalid), prev_state))
@ -452,24 +440,22 @@ mod test {
}
#[test]
fn writes_escaped_attr_value_when_adjacent_to_attr() -> TestResult {
// Just to be sure it's not trying to escape when we say it
// shouldn't, we include a character that must otherwise be escaped.
let value = AttrValue::Escaped("test \" escaped".intern());
fn writes_attr_value_when_adjacent_to_attr() -> TestResult {
let value = AttrValue::from("test str".intern());
let result = Token::AttrValue(value, *S)
.write_new(WriterState::AttrNameAdjacent)?;
assert_eq!(result.0, br#"="test " escaped""#);
assert_eq!(result.0, br#"="test str""#);
assert_eq!(result.1, WriterState::NodeOpen);
Ok(())
}
#[test]
fn writes_escaped_attr_value_consisting_of_fragments() -> TestResult {
let value_left = AttrValue::Escaped("left ".intern());
let value_right = AttrValue::Escaped("right".intern());
fn writes_attr_value_consisting_of_fragments() -> TestResult {
let value_left = AttrValue::from("left ".intern());
let value_right = AttrValue::from("right".intern());
let result = vec![
Token::AttrValueFragment(value_left, *S),
@ -559,7 +545,7 @@ mod test {
#[test]
fn unsupported_transition_results_in_error() -> TestResult {
assert!(matches!(
Token::AttrValue(AttrValue::Escaped("".into()), *S)
Token::AttrValue(AttrValue::from("".intern()), *S)
.write(&mut vec![], WriterState::NodeExpected),
Err(Error::UnexpectedToken(_, WriterState::NodeExpected)),
));
@ -576,7 +562,7 @@ mod test {
let result = vec![
Token::Open(root, *S),
Token::AttrName(("an", "attr").try_into()?, *S),
Token::AttrValue(AttrValue::Escaped("value".intern()), *S),
Token::AttrValue(AttrValue::from("value".intern()), *S),
Token::AttrEnd,
Token::Text(Text::Escaped("text".intern()), *S),
Token::Open(("c", "child").try_into()?, *S),