tamer: XIR: Working concept

This is a working streaming IR for XML.  I want to get this committed before
I go further cleaning it up and integrating it into the xmle writer.

This is lacking detailed documentation, and the names of things may end up
changing.

Initial benchmarks do show that it has a ~2x performance improvement over
quick-xml when dealing with two attributes on a node, and I suspect that
improvement will increase with the number of attributes.  We will see how it
compares in real-world benchmarks once the linker has been modified to use
it.

The goal isn't to _avoid_ quick-xml---it'll be used in the future for things
like escaping that would be a huge waste to implement ourselves.  It just so
happened that quick-xml was not beneficial for these changes; indeed, its
own writer is fairly simple for the portions that were implemented here, so
there's no use in fighting with its API, particularly around attributes and
our need to explicitly control whitespace (with the intent of handling code
formatters in the future).

To put this into perspective: the reason this work is being done isn't to
refactor the linker, or to speed it up, but to generalize XML writing and
provide a suitable IR for use in the compiler.  The first step of the
frontend is essentially to echo the XML token stream back out, so that we
can parse it incrementally and do something useful with it, which in turn
allows the compiler to be rewritten in Rust incrementally.
main
Mike Gerwitz 2021-08-20 10:09:55 -04:00
parent c211ada89b
commit a23bae5e4d
5 changed files with 1295 additions and 0 deletions


@ -0,0 +1,265 @@
// Benchmarks assessing the overhead of XIR relative to baselines.
//
// Copyright (C) 2014-2021 Ryan Specialty Group, LLC.
//
// This file is part of TAME.
//
// This program is free software: you can redistribute it and/or modify
// it under the terms of the GNU General Public License as published by
// the Free Software Foundation, either version 3 of the License, or
// (at your option) any later version.
//
// This program is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
// GNU General Public License for more details.
//
// You should have received a copy of the GNU General Public License
// along with this program. If not, see <http://www.gnu.org/licenses/>.
#![feature(test)]
//! Assessment of overhead of Xir compared to baselines.
//!
//! A lot of time in TAMER is spent parsing and writing XML files, so it's
//! important that these operations be efficient.
//! Xir is intended to be a very lightweight IR,
//! able to provide convenient abstractions and validations only when
//! both necessary and desired.
//!
//! Rust touts "zero-cost abstractions",
//! which is a generally true statement (with some exceptions) that allows
//! us to create dense newtype abstractions that represent validated and
//! structured data,
//! at a compile-time but not runtime cost.
//! These tests serve to demonstrate that such a claim is true for Xir,
//! and help to guard against future regressions.
extern crate quick_xml;
extern crate tamer;
extern crate test;
use std::convert::{TryFrom, TryInto};
use tamer::ir::xir::{NCName, NodeStream, QName};
use tamer::sym::{GlobalSymbolIntern, GlobalSymbolResolve, SymbolId};
use test::Bencher;
type Ix = tamer::global::PkgSymSize;
fn gen_strs(n: usize, suffix: &str) -> Vec<String> {
(0..n).map(|n| n.to_string() + suffix).collect()
}
mod name {
use super::*;
// Essentially duplicates sym::interner::global::with_all_new_1000, but
// provides a local baseline that we can be sure will be available to
// compare against, at a glance.
#[bench]
fn baseline_global_intern_str_1000(bench: &mut Bencher) {
let strs = gen_strs(1000, "foobar");
bench.iter(|| {
strs.iter()
.map(|s| s.as_str().intern() as SymbolId<Ix>)
.for_each(drop);
});
}
// This should be cost-free relative to the previous test.
#[bench]
fn ncname_new_unchecked_str_intern_1000(bench: &mut Bencher) {
let strs = gen_strs(1000, "foobar");
bench.iter(|| {
strs.iter()
.map(|s| unsafe {
NCName::<Ix>::new_unchecked(s.as_str().intern())
})
.for_each(drop);
});
}
// This duplicates a memchr test, but allows us to have a comparable
// baseline at a glance.
#[bench]
fn baseline_str_contains_1000(bench: &mut Bencher) {
let strs = gen_strs(1000, "foobar");
bench.iter(|| {
strs.iter().map(|s| s.as_str().contains(':')).for_each(drop);
});
}
// This should be approximately as expensive as the two baselines added
// together.
#[bench]
fn ncname_try_from_str_1000(bench: &mut Bencher) {
let strs = gen_strs(1000, "foobar");
bench.iter(|| {
strs.iter()
.map(|s| NCName::<Ix>::try_from(s.as_str()))
.for_each(drop);
});
}
// Should be ~2x previous test, since it contains two `NCName`s.
#[bench]
fn qname_try_from_str_pair_1000(bench: &mut Bencher) {
let prefixes = gen_strs(1000, "prefix");
let names = gen_strs(1000, "name");
bench.iter(|| {
prefixes
.iter()
.zip(names.iter())
.map(|(p, s)| QName::<Ix>::try_from((p.as_str(), s.as_str())))
.for_each(drop);
});
}
}
mod ws {
use super::*;
use tamer::ir::xir::Whitespace;
#[bench]
fn whitespace_1000(bench: &mut Bencher) {
bench.iter(|| {
(0..1000)
.map(|_| Whitespace::<Ix>::try_from(" \t "))
.for_each(drop);
});
}
}
mod writer {
use super::*;
use quick_xml::{
events::{BytesStart, BytesText, Event as XmlEvent},
Writer as QuickXmlWriter,
};
use std::borrow::Cow;
use tamer::ir::xir::{writer::XmlWriter, AttrValue, Text};
use tamer::span::Span;
const FRAGMENT: &str = r#"<fragment>
This is pretend fragment text. We need a lot of it.
This is pretend fragment text. We need a lot of it.
This is pretend fragment text. We need a lot of it.
This is pretend fragment text. We need a lot of it.
This is pretend fragment text. We need a lot of it.
This is pretend fragment text. We need a lot of it.
This is pretend fragment text. We need a lot of it.
This is pretend fragment text. We need a lot of it.
This is pretend fragment text. We need a lot of it.
This is pretend fragment text. We need a lot of it.
This is pretend fragment text. We need a lot of it.
This is pretend fragment text. We need a lot of it.
This is pretend fragment text. We need a lot of it.
This is pretend fragment text. We need a lot of it.
This is pretend fragment text. We need a lot of it.
This is pretend fragment text. We need a lot of it.
This is pretend fragment text. We need a lot of it.
This is pretend fragment text. We need a lot of it.
This is pretend fragment text. We need a lot of it.
This is pretend fragment text. We need a lot of it.
This is pretend fragment text. We need a lot of it.
This is pretend fragment text. We need a lot of it.
This is pretend fragment text. We need a lot of it.</fragment>
"#;
// TAME makes heavy use of attributes, which unfortunately requires
// copies in quick-xml. This will serve as our baseline---we want to
// perform _at least_ as well (but we do end up performing much better,
// despite the global symbol lookups).
#[bench]
fn baseline_quick_xml_empty_with_attrs_1000(bench: &mut Bencher) {
let buf = Vec::<u8>::new();
let mut writer = QuickXmlWriter::new(buf);
bench.iter(|| {
(0..1000).for_each(|_| {
writer
.write_event(XmlEvent::Empty(
BytesStart::borrowed_name(b"test:foo").with_attributes(
vec![("first", "value"), ("second", "value2")],
),
))
.unwrap();
});
});
}
// Produces the same output as above.
#[bench]
fn xir_empty_with_attrs_preinterned_1000(bench: &mut Bencher) {
let mut buf = Vec::<u8>::new();
// Perform all interning beforehand, since in practice, values will
// have been interned well before we get to the writer. Further,
// common values such as these (QNames) will be pre-defined and
// reused.
let span = Span::from_byte_interval((0, 0), "path".intern());
let name = QName::<Ix>::try_from(("test", "foo")).unwrap();
let attr1 = QName::new_local("first".try_into().unwrap());
let attr2 = QName::new_local("second".try_into().unwrap());
let val1 = "value".intern();
let val2 = "value2".intern();
bench.iter(|| {
(0..1000).for_each(|_| {
vec![
NodeStream::Open(name, span),
NodeStream::AttrName(attr1, span),
NodeStream::AttrValue(AttrValue::Escaped(val1), span),
NodeStream::AttrName(attr2, span),
NodeStream::AttrValue(AttrValue::Escaped(val2), span),
NodeStream::SelfClose(span),
]
.into_iter()
.write(&mut buf, Default::default())
.unwrap();
});
});
}
// The other major thing we do is output large amounts of text (the
// linked fragments).
#[bench]
fn baseline_quick_xml_text_500(bench: &mut Bencher) {
let buf = Vec::<u8>::new();
let mut writer = QuickXmlWriter::new(buf);
let frag: SymbolId<Ix> = FRAGMENT.intern();
bench.iter(|| {
(0..500).for_each(|_| {
writer
.write_event(XmlEvent::Text(BytesText::from_escaped_str(
Cow::Borrowed(&frag.lookup_str() as &str),
)))
.unwrap();
});
});
}
// This test and the above are expected to perform similarly, and can
// vary wildly run-to-run.
#[bench]
fn xir_text_500(bench: &mut Bencher) {
let mut buf = Vec::<u8>::new();
let frag: SymbolId<Ix> = FRAGMENT.intern();
let span = Span::from_byte_interval((0, 0), "path".intern());
bench.iter(|| {
(0..500).for_each(|_| {
NodeStream::Text(Text::Escaped(frag), span)
.write(&mut buf, Default::default())
.unwrap();
});
});
}
}
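
The `name` benchmarks above compare a validated newtype against a bare intern. As a rough, std-only sketch of why the wrapper itself should add no runtime size or indirection, the example below uses a hypothetical `Sym` id and a toy `Interner` (neither is TAMER's actual symbol API; both are assumptions made for illustration). It validates before interning, as `NCName::try_from` does, and checks that the newtype is no larger than the id it wraps.

```rust
use std::collections::HashMap;
use std::mem::size_of;

/// Hypothetical stand-in for an interned symbol id (not TAMER's `SymbolId`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Sym(u32);

/// Toy interner mapping each distinct string to a small integer.
#[derive(Default)]
struct Interner {
    ids: HashMap<String, u32>,
}

impl Interner {
    fn intern(&mut self, s: &str) -> Sym {
        // The next free id is only used if the string is not yet present.
        let next = self.ids.len() as u32;
        Sym(*self.ids.entry(s.to_string()).or_insert(next))
    }
}

/// Newtype recording that the string was checked for `':'` before interning.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct NcName(Sym);

impl NcName {
    /// Validate first, then intern; an invalid name never reaches the pool.
    fn try_new(interner: &mut Interner, s: &str) -> Result<Self, String> {
        if s.contains(':') {
            return Err(format!("NCName must not contain a colon: `{}`", s));
        }
        Ok(NcName(interner.intern(s)))
    }
}

fn main() {
    let mut interner = Interner::default();

    let a = NcName::try_new(&mut interner, "foo").unwrap();
    let b = NcName::try_new(&mut interner, "foo").unwrap();
    assert_eq!(a, b); // same string, same symbol

    assert!(NcName::try_new(&mut interner, "look:a-colon").is_err());

    // The wrapper occupies exactly as much space as the id it wraps.
    assert_eq!(size_of::<NcName>(), size_of::<u32>());
}
```

The only per-call cost beyond interning in this sketch is the `contains(':')` scan, which is the relationship `ncname_try_from_str_1000` is meant to measure against its two baselines.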


@ -73,3 +73,4 @@
pub mod asg;
pub mod legacyir;
pub mod xir;

tamer/src/ir/xir.rs 100644

@ -0,0 +1,463 @@
// XML IR (XIR)
//
// Copyright (C) 2014-2021 Ryan Specialty Group, LLC.
//
// This file is part of TAME.
//
// This program is free software: you can redistribute it and/or modify
// it under the terms of the GNU General Public License as published by
// the Free Software Foundation, either version 3 of the License, or
// (at your option) any later version.
//
// This program is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
// GNU General Public License for more details.
//
// You should have received a copy of the GNU General Public License
// along with this program. If not, see <http://www.gnu.org/licenses/>.
//! Intermediate representation (IR) of an XML document.
//!
//! XIR serves not only as a TAMER-specific IR,
//! but also as an abstraction layer atop of whatever XML library is
//! used (e.g. `quick_xml`).
//!
//! XIR is _not_ intended to be comprehensive,
//! or even general-purpose---it
//! exists to solve concerns specific to TAMER's construction.
//!
//! _This is a work in progress!_
use crate::global;
use crate::span::Span;
use crate::sym::{GlobalSymbolIntern, SymbolId, SymbolIndexSize};
use std::convert::{TryFrom, TryInto};
use std::fmt::Display;
use std::ops::Deref;
pub mod writer;
// TODO: Move into crate::sym if this is staying around.
macro_rules! newtype_symbol {
{$($(#[$meta:meta])* pub struct $name:ident;)*} => {
$(
$(#[$meta])*
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct $name<Ix: SymbolIndexSize>(SymbolId<Ix>);
impl<Ix: SymbolIndexSize> Deref for $name<Ix> {
type Target = SymbolId<Ix>;
fn deref(&self) -> &Self::Target {
&self.0
}
}
impl<Ix: SymbolIndexSize> PartialEq<SymbolId<Ix>> for $name<Ix> {
fn eq(&self, other: &SymbolId<Ix>) -> bool {
self.0 == *other
}
}
)*
};
}
// TODO: Derive macro instead?
newtype_symbol! {
/// XML Name minus `":"`.
///
/// The intent is to check a string for validity _before_ interning;
/// otherwise,
/// the string would have to be first retrieved from the intern pool
/// for comparison,
/// which is not an operation we want to do implicitly.
/// Those methods will be created as they are needed.
///
/// See <https://www.w3.org/TR/REC-xml-names/#NT-NCName>.
pub struct NCName;
}
impl<Ix: SymbolIndexSize> NCName<Ix> {
/// Create a new NCName from a symbol without validating that the symbol
/// is a valid NCName.
///
/// Safety
/// ======
/// This is not unsafe in the traditional sense;
/// it's unsafe in a sense similar to non-UTF-8 `str` slices,
/// in that it is expected that an `NCName` means that you do not
/// have to worry about whether it's syntactically valid as XML.
pub unsafe fn new_unchecked(value: SymbolId<Ix>) -> Self {
Self(value)
}
}
#[derive(Debug, PartialEq, Eq)]
pub enum Error {
/// Provided name contains a `':'`.
NCColon(String),
/// Provided string contains non-ASCII-whitespace characters.
NotWhitespace(String),
}
impl Display for Error {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
match self {
Self::NCColon(name) => {
write!(f, "NCName must not contain a colon: `{}`", name)
}
Self::NotWhitespace(s) => {
write!(f, "String contains non-ASCII-whitespace: `{}`", s)
}
}
}
}
impl std::error::Error for Error {
fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
None
}
}
impl<Ix: SymbolIndexSize> TryFrom<&str> for NCName<Ix> {
type Error = Error;
fn try_from(value: &str) -> Result<Self, Self::Error> {
if value.contains(':') {
return Err(Error::NCColon(value.into()));
}
Ok(Self(value.intern()))
}
}
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Prefix<Ix: SymbolIndexSize>(NCName<Ix>);
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct LocalPart<Ix: SymbolIndexSize>(NCName<Ix>);
impl<Ix: SymbolIndexSize> Deref for Prefix<Ix> {
type Target = SymbolId<Ix>;
fn deref(&self) -> &Self::Target {
self.0.deref()
}
}
impl<Ix: SymbolIndexSize> Deref for LocalPart<Ix> {
type Target = SymbolId<Ix>;
fn deref(&self) -> &Self::Target {
self.0.deref()
}
}
impl<Ix: SymbolIndexSize> From<NCName<Ix>> for Prefix<Ix> {
fn from(name: NCName<Ix>) -> Self {
Self(name)
}
}
impl<Ix: SymbolIndexSize> From<NCName<Ix>> for LocalPart<Ix> {
fn from(name: NCName<Ix>) -> Self {
Self(name)
}
}
impl<Ix: SymbolIndexSize> TryFrom<&str> for Prefix<Ix> {
type Error = Error;
fn try_from(value: &str) -> Result<Self, Self::Error> {
Ok(Self(value.try_into()?))
}
}
impl<Ix: SymbolIndexSize> TryFrom<&str> for LocalPart<Ix> {
type Error = Error;
fn try_from(value: &str) -> Result<Self, Self::Error> {
Ok(Self(value.try_into()?))
}
}
#[derive(Debug, PartialEq, Eq)]
pub struct Whitespace<Ix: SymbolIndexSize>(SymbolId<Ix>);
impl<Ix: SymbolIndexSize> Deref for Whitespace<Ix> {
type Target = SymbolId<Ix>;
fn deref(&self) -> &Self::Target {
&self.0
}
}
impl<Ix: SymbolIndexSize> TryFrom<&str> for Whitespace<Ix> {
type Error = Error;
fn try_from(value: &str) -> Result<Self, Self::Error> {
// We do not expect this to ever be a large value based on how we
// use it.
// If it is, well, someone's doing something they ought not to be
// and we're not going to optimize for it.
if !value.as_bytes().iter().all(u8::is_ascii_whitespace) {
return Err(Error::NotWhitespace(value.into()));
}
Ok(Self(value.intern()))
}
}
impl<Ix: SymbolIndexSize> From<Whitespace<Ix>> for Text<Ix> {
fn from(ws: Whitespace<Ix>) -> Self {
// Whitespace needs no escaping
Self::Escaped(ws.0)
}
}
/// A qualified name (namespace prefix and local name).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct QName<Ix: SymbolIndexSize>(Option<Prefix<Ix>>, LocalPart<Ix>);
// Since we implement Copy, ensure size matches our expectations:
const_assert!(
std::mem::size_of::<QName<global::ProgSymSize>>()
<= std::mem::size_of::<usize>()
);
impl<Ix: SymbolIndexSize> QName<Ix> {
/// Create a new qualified name from a namespace prefix and a local
/// name.
pub fn new(prefix: Prefix<Ix>, local_name: LocalPart<Ix>) -> Self {
Self(Some(prefix), local_name)
}
/// Create a new name from a local name only.
///
/// This should only be used for attributes in TAMER,
/// since all elements should have an associated namespace.
///
/// _(If this is ever not true (e.g. due to new targets),
/// please update this comment.)_
pub fn new_local(local_name: LocalPart<Ix>) -> Self {
Self(None, local_name)
}
/// Namespace prefix associated with a name, if any.
pub fn prefix(&self) -> Option<Prefix<Ix>> {
self.0
}
/// Local part of a name (name without namespace).
pub fn local_name(&self) -> LocalPart<Ix> {
self.1
}
}
impl<Ix, P, L> TryFrom<(P, L)> for QName<Ix>
where
Ix: SymbolIndexSize,
P: TryInto<Prefix<Ix>>,
L: TryInto<LocalPart<Ix>, Error = P::Error>,
{
type Error = P::Error;
fn try_from(value: (P, L)) -> Result<Self, Self::Error> {
Ok(Self(Some(value.0.try_into()?), value.1.try_into()?))
}
}
/// Represents text and its escaped state.
///
/// Being explicit about the state of escaping allows us to skip checks when
/// we know that the generated text could not possibly require escaping.
/// This does, however, put the onus on the caller to ensure that they got
/// the escaping status correct.
/// (TODO: More information on why this burden isn't all that bad,
/// despite the risk.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Text<Ix: SymbolIndexSize> {
/// Text node that requires escaping.
///
/// Unescaped text requires further processing before writing.
///
/// Note that,
/// since the unescaped text is interned,
/// it may be wasteful to intern a large text node with the intent of
/// escaping and re-interning later.
/// Instead,
/// if escaping is only needed for writing,
/// it is likely better to leave it to the writer to escape,
/// which does _not_ require interning of the resulting string.
Unescaped(SymbolId<Ix>),
/// Text node that either has already been escaped or is known not to
/// require escaping.
///
/// Escaped text can be written as-is without any further processing.
Escaped(SymbolId<Ix>),
}
/// Represents an attribute value and its escaped contents.
///
/// Being explicit about the state of escaping allows us to skip checks when
/// we know that the generated text could not possibly require escaping.
/// This does, however, put the onus on the caller to ensure that they got
/// the escaping status correct.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AttrValue<Ix: SymbolIndexSize> {
/// Value that requires escaping.
///
/// Unescaped values require further processing before writing.
Unescaped(SymbolId<Ix>),
/// Value that either has already been escaped or is known not to
/// require escaping.
///
/// Escaped values can be written as-is without any further processing.
Escaped(SymbolId<Ix>),
}
#[derive(Debug, PartialEq, Eq)]
pub enum NodeStream<Ix: SymbolIndexSize> {
/// Opening tag of an element.
Open(QName<Ix>, Span),
/// Closing tag of an element.
///
/// If no child nodes have been encountered since the last `Open`,
/// then the tag is assumed to be self-closing;
/// if this is not desired,
/// emit an empty `Text` variant.
Close(QName<Ix>, Span),
/// End of self-closing tag.
///
/// This is intended primarily as a safety measure:
/// It allows writers to act as simple state machines without having
/// to ensure balancing by indicating that a node was intended to
/// self-close.
/// Otherwise,
/// we wouldn't know whether to self-close or to close and then
/// create a new closing tag;
/// if we blindly did the former,
/// we risk losing a closing tag when it wasn't intended.
/// Instead of losing tags,
/// writers can error,
/// indicating a bug in the stream.
SelfClose(Span),
/// Element attribute name
AttrName(QName<Ix>, Span),
/// Element attribute value
AttrValue(AttrValue<Ix>, Span),
/// Comment node.
Comment(Text<Ix>, Span),
/// Character data as part of an element.
///
/// See also [`CData`](NodeStream::CData) variant.
Text(Text<Ix>, Span),
/// CData node (`<![CDATA[...]]>`).
///
/// See also [`Text`](NodeStream::Text) variant.
///
/// _Warning: It is up to the caller to ensure that the string `]]>` is
/// not present in the text!_
/// This is intended for reading existing XML data where CData is
/// already present,
/// not for producing new CData safely!
CData(Text<Ix>, Span),
/// Similar to `Text`,
/// but intended for use where only whitespace is allowed,
/// such as alignment of attributes.
Whitespace(Whitespace<Ix>, Span),
}
#[cfg(test)]
mod test {
use super::*;
use crate::{global, sym::GlobalSymbolIntern};
use std::convert::TryInto;
type Ix = global::PkgSymSize;
type TestResult = Result<(), Box<dyn std::error::Error>>;
lazy_static! {
static ref S: Span =
Span::from_byte_interval((0, 0), "test case".intern());
}
mod name {
use super::*;
#[test]
fn ncname_comparable_to_sym() {
let foo = "foo".intern();
assert_eq!(NCName::<Ix>(foo), foo);
}
#[test]
fn ncname_try_into_from_str_no_colon() -> TestResult {
let name: NCName<Ix> = "no-colon".try_into()?;
assert_eq!(name, "no-colon".intern());
Ok(())
}
#[test]
fn ncname_try_into_from_str_fails_with_colon() {
assert_eq!(
NCName::<Ix>::try_from("look:a-colon"),
Err(Error::NCColon("look:a-colon".into()))
);
}
#[test]
fn local_name_from_local_part_only() -> TestResult {
let name = QName::<Ix>::new_local("foo".try_into()?);
assert_eq!(name.local_name(), "foo".try_into()?);
assert_eq!(None, name.prefix());
Ok(())
}
#[test]
fn fully_qualified_name() -> TestResult {
let name: QName<Ix> = ("foons", "foo").try_into()?;
assert_eq!(name.prefix(), Some("foons".try_into()?));
assert_eq!(name.local_name(), "foo".try_into()?);
Ok(())
}
}
#[test]
fn whitespace() -> TestResult {
assert_eq!(Whitespace::<Ix>::try_from(" ")?, " ".try_into()?);
assert_eq!(Whitespace::<Ix>::try_from(" \t ")?, " \t ".try_into()?);
assert_eq!(
Whitespace::<Ix>::try_from("not ws!"),
Err(Error::NotWhitespace("not ws!".into()))
);
Ok(())
}
#[test]
fn whitespace_as_text() -> TestResult {
assert_eq!(
Text::Escaped(" ".intern()),
Whitespace::<Ix>::try_from(" ")?.into(),
);
Ok(())
}
}
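
The `const_assert!` on `QName` above requires that an optional prefix plus a local part fit within a `usize`. One way such a bound can hold is if symbol ids are non-zero integers, so that `Option` can use zero to represent `None`; whether `SymbolId` is actually defined that way is not shown in this commit, so treat it as an assumption. The sketch below uses hypothetical stand-in types built on `NonZeroU32` and contrasts them with a plain `u32` layout.

```rust
use std::mem::size_of;
use std::num::NonZeroU32;

/// Hypothetical symbol id that can never be zero (an assumption; the
/// definition of TAMER's `SymbolId` is not part of this commit).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Sym(NonZeroU32);

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Prefix(Sym);

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct LocalPart(Sym);

/// Same shape as the `QName` in the diff: optional prefix, then local part.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct QName(Option<Prefix>, LocalPart);

/// The same shape built from plain `u32`s, for comparison.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct WideQName(Option<u32>, u32);

fn main() {
    let sym = |n: u32| Sym(NonZeroU32::new(n).expect("ids must be non-zero"));

    let qualified = QName(Some(Prefix(sym(1))), LocalPart(sym(2)));
    let local_only = QName(None, LocalPart(sym(2)));
    assert_ne!(qualified, local_only);

    // With a non-zero id, `None` needs no separate tag, so the pair fits
    // in 64 bits.
    assert!(size_of::<QName>() <= size_of::<u64>());

    // With plain `u32`, `Option` needs a separate discriminant and the
    // pair no longer fits.
    let wide = WideQName(Some(1), 2);
    assert_eq!(wide, WideQName(Some(1), 2));
    assert!(size_of::<WideQName>() > size_of::<u64>());
}
```

The comparison is made against `u64` rather than `usize` so that the sketch behaves the same on 32-bit targets.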


@ -0,0 +1,562 @@
// XIR writer
//
// Copyright (C) 2014-2021 Ryan Specialty Group, LLC.
//
// This file is part of TAME.
//
// This program is free software: you can redistribute it and/or modify
// it under the terms of the GNU General Public License as published by
// the Free Software Foundation, either version 3 of the License, or
// (at your option) any later version.
//
// This program is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
// GNU General Public License for more details.
//
// You should have received a copy of the GNU General Public License
// along with this program. If not, see <http://www.gnu.org/licenses/>.
//! Lower XIR stream into an XML byte stream via [`Write`].
use super::{Error as XirError, NodeStream, QName};
use crate::ir::xir::{AttrValue, Text};
use crate::sym::GlobalSymbolResolve;
use crate::sym::SymbolIndexSize;
use std::io::{Error as IoError, Write};
use std::result;
pub type Result<T = WriterState> = result::Result<T, Error>;
#[derive(Debug)]
pub enum Error {
Io(IoError),
Xir(XirError),
// TODO: No String here, since then we cannot resolve inner symbols for
// display to the user
UnexpectedToken(String, WriterState),
Todo(String, WriterState),
}
impl std::fmt::Display for Error {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
match self {
Self::Io(e) => e.fmt(f),
Self::Xir(e) => e.fmt(f),
Self::UnexpectedToken(tok, state) => write!(
f,
"Invalid token {} at XML writer state {:?}",
tok, state
),
Self::Todo(tok, state) => write!(
f,
"Unexpected {} at XML writer state {:?}; TAMER intends to \
support this operation, but it was not yet thought to be \
needed. More development is needed to support it, but \
hopefully very little. However, if you hit this during \
normal use of TAME, then this represents a bug.",
tok, state
),
}
}
}
impl std::error::Error for Error {
fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
match self {
Self::Io(e) => Some(e),
Self::Xir(e) => Some(e),
_ => None,
}
}
}
impl From<IoError> for Error {
fn from(e: IoError) -> Self {
Self::Io(e)
}
}
impl From<XirError> for Error {
fn from(e: XirError) -> Self {
Self::Xir(e)
}
}
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WriterState {
/// A node is expected to be output next.
NodeExpected,
/// A node is currently being output and has not yet been closed.
NodeOpen,
/// Cursor is positioned adjacent to an attribute name within an element.
AttrNameAdjacent,
}
impl Default for WriterState {
fn default() -> Self {
Self::NodeExpected
}
}
impl WriterState {
#[inline]
fn close_tag_if_open<W: Write>(&self, sink: &mut W) -> Result<()> {
// `write_all` is used because `write` may perform a short write.
if let Self::NodeOpen = *self {
sink.write_all(b">")?;
}
Ok(())
}
}
pub trait XmlWriter {
#[must_use = "Write operation may fail"]
fn write<W: Write>(self, sink: &mut W, prev_state: WriterState) -> Result;
}
impl<Ix: SymbolIndexSize> XmlWriter for QName<Ix> {
#[inline]
fn write<W: Write>(self, sink: &mut W, prev_state: WriterState) -> Result {
// `write_all` is used because `write` may perform a short write.
if let Some(prefix) = self.prefix() {
sink.write_all(prefix.lookup_str().as_bytes())?;
sink.write_all(b":")?;
}
sink.write_all(self.local_name().lookup_str().as_bytes())?;
Ok(prev_state)
}
}
impl<Ix: SymbolIndexSize> XmlWriter for NodeStream<Ix> {
fn write<W: Write>(self, sink: &mut W, prev_state: WriterState) -> Result {
type S = WriterState; // More concise
match (self, prev_state) {
// `write_all` is used throughout because `write` may perform a
// short write.
(Self::Open(name, _), S::NodeExpected | S::NodeOpen) => {
// If a node is still open, then we are a child.
prev_state.close_tag_if_open(sink)?;
sink.write_all(b"<")?;
name.write(sink, prev_state)?;
Ok(S::NodeOpen)
}
// TODO: Remove whitespace from here, add to stream, then it can
// be used with attrs
(Self::SelfClose(_), S::NodeOpen) => {
sink.write_all(b"/>")?;
Ok(S::NodeExpected)
}
(Self::Close(name, _), S::NodeExpected | S::NodeOpen) => {
// If open, we're going to produce an element of the form
// `<foo></foo>`.
prev_state.close_tag_if_open(sink)?;
sink.write_all(b"</")?;
name.write(sink, prev_state)?;
sink.write_all(b">")?;
Ok(S::NodeExpected)
}
(Self::AttrName(name, _), S::NodeOpen) => {
sink.write_all(b" ")?;
name.write(sink, prev_state)?;
Ok(S::AttrNameAdjacent)
}
(
Self::AttrValue(AttrValue::Escaped(value), _),
S::AttrNameAdjacent,
) => {
sink.write_all(b"=\"")?;
sink.write_all(value.lookup_str().as_bytes())?;
sink.write_all(b"\"")?;
Ok(S::NodeOpen)
}
// Unescaped not yet supported, but you could use CData.
(
Self::Text(Text::Escaped(text), _),
S::NodeExpected | S::NodeOpen,
) => {
prev_state.close_tag_if_open(sink)?;
sink.write_all(text.lookup_str().as_bytes())?;
Ok(S::NodeExpected)
}
// Escaped not yet supported, but you could use Text.
(
Self::CData(Text::Unescaped(text), _),
S::NodeExpected | S::NodeOpen,
) => {
prev_state.close_tag_if_open(sink)?;
sink.write_all(b"<![CDATA[")?;
sink.write_all(text.lookup_str().as_bytes())?;
sink.write_all(b"]]>")?;
Ok(S::NodeExpected)
}
// Unescaped not yet supported, since we do not have a use case.
(
Self::Comment(Text::Escaped(comment), _),
S::NodeExpected | S::NodeOpen,
) => {
prev_state.close_tag_if_open(sink)?;
sink.write_all(b"<!--")?;
sink.write_all(comment.lookup_str().as_bytes())?;
sink.write_all(b"-->")?;
Ok(S::NodeExpected)
}
(Self::Whitespace(ws, _), S::NodeOpen) => {
sink.write_all(ws.lookup_str().as_bytes())?;
Ok(S::NodeOpen)
}
// As-of-yet unsupported operations that weren't needed at the
// time of writing, but were planned for in the design of Xir.
(
invalid
@
(Self::AttrName(_, _)
| Self::AttrValue(AttrValue::Unescaped(_), _)),
S::AttrNameAdjacent,
)
| (invalid @ Self::Text(Text::Unescaped(_), _), S::NodeExpected)
| (invalid @ Self::CData(Text::Escaped(_), _), S::NodeExpected) => {
Err(Error::Todo(format!("{:?}", invalid), prev_state))
}
// Everything else represents either an invalid state transition
// that would produce invalid XML, or something that we forgot
// to account for in the above error.
(invalid, _) => Err(Error::UnexpectedToken(
format!("{:?}", invalid),
prev_state,
)),
}
}
}
impl<Ix: SymbolIndexSize, I: Iterator<Item = NodeStream<Ix>>> XmlWriter for I {
fn write<W: Write>(
mut self,
sink: &mut W,
initial_state: WriterState,
) -> Result {
self.try_fold(initial_state, |prev_state, tok| {
tok.write(sink, prev_state)
})
}
}
#[cfg(test)]
mod test {
use std::convert::{TryFrom, TryInto};
use super::*;
use crate::{
ir::xir::{AttrValue, QName, Text, Whitespace},
span::Span,
sym::GlobalSymbolIntern,
};
type TestResult = std::result::Result<(), Error>;
type Ix = u16;
lazy_static! {
static ref S: Span =
Span::from_byte_interval((0, 0), "test case".intern());
}
#[test]
fn writes_beginning_node_tag_without_prefix() -> TestResult {
let mut buf = vec![];
let name = QName::<Ix>::new_local("no-prefix".try_into()?);
assert_eq!(
NodeStream::Open(name, *S).write(&mut buf, Default::default())?,
WriterState::NodeOpen
);
assert_eq!(buf, b"<no-prefix");
Ok(())
}
#[test]
fn writes_beginning_node_tag_with_prefix() -> TestResult {
let mut buf = vec![];
let name = QName::<Ix>::try_from(("prefix", "element-name"))?;
assert_eq!(
NodeStream::Open(name, *S).write(&mut buf, Default::default())?,
WriterState::NodeOpen
);
assert_eq!(buf, b"<prefix:element-name");
Ok(())
}
#[test]
fn closes_open_node_when_opening_another() -> TestResult {
let mut buf = vec![];
let name = QName::<Ix>::try_from(("p", "another-element"))?;
assert_eq!(
NodeStream::Open(name, *S)
.write(&mut buf, WriterState::NodeOpen)?,
WriterState::NodeOpen
);
assert_eq!(buf, b"><p:another-element");
Ok(())
}
#[test]
fn closes_open_node_as_empty_element() -> TestResult {
let mut buf = vec![];
assert_eq!(
NodeStream::<Ix>::SelfClose(*S)
.write(&mut buf, WriterState::NodeOpen)?,
WriterState::NodeExpected
);
assert_eq!(buf, b"/>");
Ok(())
}
#[test]
fn closing_tag_when_node_expected() -> TestResult {
let mut buf = vec![];
let name = QName::<Ix>::try_from(("a", "closed-element"))?;
assert_eq!(
NodeStream::Close(name, *S)
.write(&mut buf, WriterState::NodeExpected)?,
WriterState::NodeExpected
);
assert_eq!(buf, b"</a:closed-element>");
Ok(())
}
// Does _not_ check that it's balanced. Not our job. In fact, we want
// to explicitly support outputting malformed XML.
#[test]
fn closes_open_node_with_closing_tag() -> TestResult {
let mut buf = vec![];
let name = QName::<Ix>::try_from(("b", "closed-element"))?;
assert_eq!(
NodeStream::Close(name, *S)
.write(&mut buf, WriterState::NodeOpen)?,
WriterState::NodeExpected
);
assert_eq!(buf, b"></b:closed-element>");
Ok(())
}
// Intended for alignment of attributes, primarily.
#[test]
fn whitespace_within_open_node() -> TestResult {
let mut buf = vec![];
assert_eq!(
NodeStream::<Ix>::Whitespace(Whitespace::try_from(" \t ")?, *S)
.write(&mut buf, WriterState::NodeOpen)?,
WriterState::NodeOpen
);
assert_eq!(buf, b" \t ");
Ok(())
}
#[test]
fn writes_attr_name_to_open_node() -> TestResult {
let mut buf = vec![];
let name_ns = QName::<Ix>::try_from(("some", "attr"))?;
let name_local = QName::<Ix>::new_local("nons".try_into()?);
// Namespace prefix
assert_eq!(
NodeStream::AttrName(name_ns, *S)
.write(&mut buf, WriterState::NodeOpen)?,
WriterState::AttrNameAdjacent
);
assert_eq!(buf, b" some:attr");
buf.clear();
// No namespace prefix
assert_eq!(
NodeStream::AttrName(name_local, *S)
.write(&mut buf, WriterState::NodeOpen)?,
WriterState::AttrNameAdjacent
);
assert_eq!(buf, b" nons");
Ok(())
}
#[test]
fn writes_escaped_attr_value_when_adjacent_to_attr() -> TestResult {
let mut buf = vec![];
// Just to be sure it's not trying to escape when we say it
// shouldn't, we include a character that must otherwise be escaped.
let value = AttrValue::<Ix>::Escaped("test \" escaped".intern());
assert_eq!(
NodeStream::AttrValue(value, *S)
.write(&mut buf, WriterState::AttrNameAdjacent)?,
WriterState::NodeOpen
);
assert_eq!(buf, br#"="test " escaped""#);
Ok(())
}
#[test]
fn writes_escaped_text() -> TestResult {
let mut buf = vec![];
// Just to be sure it's not trying to escape when we say it
// shouldn't, we include a character that must otherwise be escaped.
let text = Text::<Ix>::Escaped("test > escaped".intern());
// When a node is expected.
assert_eq!(
NodeStream::Text(text, *S)
.write(&mut buf, WriterState::NodeExpected)?,
WriterState::NodeExpected
);
assert_eq!(buf, b"test > escaped");
buf.clear();
// When a node is still open.
assert_eq!(
NodeStream::Text(text, *S)
.write(&mut buf, WriterState::NodeOpen)?,
WriterState::NodeExpected
);
assert_eq!(buf, b">test > escaped");
Ok(())
}
#[test]
fn writes_unescaped_data() -> TestResult {
let mut buf = vec![];
// Just to be sure it's not trying to escape when we say it
// shouldn't, we include a character that must otherwise be escaped.
let text = Text::<Ix>::Unescaped("test > unescaped".intern());
// When a node is expected.
assert_eq!(
NodeStream::CData(text, *S)
.write(&mut buf, WriterState::NodeExpected)?,
WriterState::NodeExpected
);
assert_eq!(buf, b"<![CDATA[test > unescaped]]>");
buf.clear();
// When a node is still open.
assert_eq!(
NodeStream::CData(text, *S)
.write(&mut buf, WriterState::NodeOpen)?,
WriterState::NodeExpected
);
assert_eq!(buf, b"><![CDATA[test > unescaped]]>");
Ok(())
}
#[test]
fn writes_escaped_comment() -> TestResult {
let mut buf = vec![];
// Just to be sure it's not trying to escape when we say it
// shouldn't, we include a character that must otherwise be escaped.
let comment = Text::<Ix>::Escaped("comment > escaped".intern());
// When a node is expected.
assert_eq!(
NodeStream::Comment(comment, *S)
.write(&mut buf, WriterState::NodeExpected)?,
WriterState::NodeExpected
);
assert_eq!(buf, b"<!--comment > escaped-->");
buf.clear();
// When a node is still open.
assert_eq!(
NodeStream::Comment(comment, *S)
.write(&mut buf, WriterState::NodeOpen)?,
WriterState::NodeExpected
);
assert_eq!(buf, b"><!--comment > escaped-->");
Ok(())
}
#[test]
fn unsupported_transition_results_in_error() -> TestResult {
assert!(matches!(
NodeStream::AttrValue(AttrValue::<Ix>::Escaped("".into()), *S)
.write(&mut vec![], WriterState::NodeExpected),
Err(Error::UnexpectedToken(_, WriterState::NodeExpected)),
));
Ok(())
}
// The cherry on top, just to demonstrate how this will be used in
// practice.
#[test]
fn test_valid_sequence_of_tokens() -> TestResult {
let mut buf = vec![];
let root: QName<Ix> = ("r", "root").try_into()?;
vec![
NodeStream::Open(root, *S),
NodeStream::AttrName(("an", "attr").try_into()?, *S),
NodeStream::AttrValue(AttrValue::Escaped("value".intern()), *S),
NodeStream::Text(Text::Escaped("text".intern()), *S),
NodeStream::Open(("c", "child").try_into()?, *S),
NodeStream::Whitespace(" ".try_into()?, *S),
NodeStream::SelfClose(*S),
NodeStream::Close(root, *S),
]
.into_iter()
.write(&mut buf, Default::default())?;
assert_eq!(buf, br#"<r:root an:attr="value">text<c:child /></r:root>"#);
Ok(())
}
}
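
The writer above threads a `WriterState` through each token and folds an iterator of tokens with `try_fold`, stopping at the first invalid transition. The following is a simplified, std-only sketch of that idea. The `Token`, `State`, `WriteError`, `write_token`, and `write_tokens` names are hypothetical, and the sketch uses plain `&str` tokens rather than interned symbols and spans. It is meant to show the shape of the state machine, not to reproduce the implementation.

```rust
use std::io::{self, Write};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum State {
    NodeExpected,
    NodeOpen,
    AttrNameAdjacent,
}

/// Hypothetical, simplified token type; names are already-joined strings.
#[derive(Debug, Clone, Copy)]
enum Token<'a> {
    Open(&'a str),
    AttrName(&'a str),
    AttrValue(&'a str),
    Text(&'a str),
    SelfClose,
    Close(&'a str),
}

#[derive(Debug)]
enum WriteError {
    Io(io::Error),
    Unexpected(String, State),
}

impl From<io::Error> for WriteError {
    fn from(e: io::Error) -> Self {
        Self::Io(e)
    }
}

/// Write one token given the previous state, returning the next state.
fn write_token<W: Write>(
    sink: &mut W,
    prev: State,
    tok: Token,
) -> Result<State, WriteError> {
    use State::*;

    match (tok, prev) {
        (Token::Open(name), NodeExpected | NodeOpen) => {
            // A still-open parent tag must be finished before its child.
            if prev == NodeOpen {
                sink.write_all(b">")?;
            }
            write!(sink, "<{}", name)?;
            Ok(NodeOpen)
        }
        (Token::AttrName(name), NodeOpen) => {
            write!(sink, " {}", name)?;
            Ok(AttrNameAdjacent)
        }
        (Token::AttrValue(value), AttrNameAdjacent) => {
            // The value is assumed to be escaped already.
            write!(sink, "=\"{}\"", value)?;
            Ok(NodeOpen)
        }
        (Token::Text(text), NodeExpected | NodeOpen) => {
            if prev == NodeOpen {
                sink.write_all(b">")?;
            }
            sink.write_all(text.as_bytes())?;
            Ok(NodeExpected)
        }
        (Token::SelfClose, NodeOpen) => {
            sink.write_all(b"/>")?;
            Ok(NodeExpected)
        }
        (Token::Close(name), NodeExpected | NodeOpen) => {
            if prev == NodeOpen {
                sink.write_all(b">")?;
            }
            write!(sink, "</{}>", name)?;
            Ok(NodeExpected)
        }
        // Any other pairing would produce malformed output.
        (invalid, _) => {
            Err(WriteError::Unexpected(format!("{:?}", invalid), prev))
        }
    }
}

/// Fold a token stream into the sink, stopping at the first error.
fn write_tokens<'a, W: Write>(
    sink: &mut W,
    toks: impl IntoIterator<Item = Token<'a>>,
) -> Result<State, WriteError> {
    toks.into_iter()
        .try_fold(State::NodeExpected, |prev, tok| write_token(sink, prev, tok))
}

fn main() {
    let mut buf: Vec<u8> = Vec::new();
    let toks = vec![
        Token::Open("r:root"),
        Token::AttrName("an:attr"),
        Token::AttrValue("value"),
        Token::Text("text"),
        Token::Open("c:child"),
        Token::SelfClose,
        Token::Close("r:root"),
    ];

    write_tokens(&mut buf, toks).unwrap();
    assert_eq!(
        String::from_utf8(buf).unwrap(),
        r#"<r:root an:attr="value">text<c:child/></r:root>"#
    );
}
```

Because the only state carried between tokens is a small `Copy` enum, a writer in this style never needs to buffer an element or its attributes, which is what makes a lazily-produced token stream workable.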


@ -27,6 +27,10 @@ pub mod global;
#[macro_use]
extern crate static_assertions;
#[cfg(test)]
#[macro_use]
extern crate lazy_static;
#[cfg(feature = "wip-frontends")]
pub mod frontend;