tamer: xir::reader: Initial introduction of spans

This is a large change, and was a bit of a tedious one, given the
comprehensive tests.

This introduces proper offsets and lengths for spans, with the exception of
some quick-xml errors that still need proper mapping.  Further, this still
uses `UNKNOWN_CONTEXT`, which will be resolved shortly.

This also introduces `SpanlessError`.  `Error` explicitly _does not_
implement `From<SpanlessError>`---this forces the caller to provide a
span before the error is compatible with the return value, ensuring that
spans will actually be available rather than forgotten for errors.  This is
important, given that errors are generally less tested than the happy path,
and errors are when users need us the most (so, need span information).
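
A minimal sketch of this pattern follows.  It uses simplified stand-in
types rather than the actual TAMER definitions: `String` replaces
`SymbolId`, `Span` is reduced to an offset and a length, and `parse_ncname`
and `read_name` are hypothetical names used only for illustration.

```rust
/// Simplified stand-in for a source span (byte offset and length).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Span {
    pub offset: u32,
    pub len: u16,
}

/// Error produced by code that has no access to span information.
#[derive(Debug, PartialEq, Eq)]
pub enum SpanlessError {
    NCColon(String),
}

/// Error returned to the user; every variant carries a span.
#[derive(Debug, PartialEq, Eq)]
pub enum Error {
    NCColon(String, Span),
}

impl SpanlessError {
    /// The only way to turn a `SpanlessError` into an `Error`.
    pub fn with_span(self, span: Span) -> Error {
        match self {
            Self::NCColon(name) => Error::NCColon(name, span),
        }
    }
}

// Deliberately no `impl From<SpanlessError> for Error`:
// a bare `?` cannot silently convert one into the other.

/// Parse a name that must not contain a colon; knows nothing about spans.
pub fn parse_ncname(s: &str) -> Result<&str, SpanlessError> {
    if s.contains(':') {
        return Err(SpanlessError::NCColon(s.to_string()));
    }
    Ok(s)
}

/// Caller that knows where the name sits in the source.
pub fn read_name(src: &str, offset: usize, len: usize) -> Result<&str, Error> {
    let span = Span { offset: offset as u32, len: len as u16 };

    // `parse_ncname(..)?` alone would not compile here;
    // the span has to be attached explicitly.
    parse_ncname(&src[offset..offset + len]).map_err(|e| e.with_span(span))
}
```

In this sketch, forgetting the span is a compile-time error rather than
something a test would have to catch.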

Further, I had to use pointer arithmetic in order to calculate many of the
spans, because quick-xml does not provide enough information.  There are no
safety considerations here, and the comprehensive unit test will ensure
correct behavior if the implementation changes in the future.
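
The general idea can be sketched as below.  This is not the reader's
actual code; `offset_within` is a hypothetical helper, and it assumes only
that the sub-slice borrows from the same contiguous buffer.  Addresses are
compared as integers and never dereferenced, so no `unsafe` is needed.

```rust
/// Byte offset of `sub` within `buf`,
/// or `None` if `sub` does not lie entirely inside `buf`.
pub fn offset_within(buf: &[u8], sub: &[u8]) -> Option<usize> {
    let start = buf.as_ptr() as usize;
    let addr = sub.as_ptr() as usize;

    // `sub` begins before `buf` => not a sub-slice of it.
    let off = addr.checked_sub(start)?;

    // `sub` must also end within `buf`.
    let end = off.checked_add(sub.len())?;
    (end <= buf.len()).then_some(off)
}
```

For `<ele foo="a" />` held in a `Vec<u8>`, a slice borrowed over `foo`
yields offset 5 and one over `a` yields offset 10.  The address of the
buffer has to be taken after the read, since a `Vec` may reallocate when
it grows.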

I would like to introduce typed spans at some point---I made some
opinionated choices when it comes to what the spans ought to
represent.  Specifically, whether to include the `<` or `>` with the open
span (depends), whether to include quotes with attribute values (no),
and some other details highlighted in the test cases.  If we provide typed
spans, then we could, knowing the type of span, calculate other spans on
request, e.g. to include or omit quotes for attributes.  Different such
spans may be useful in different situations when presenting information to
the user.
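
To illustrate what that might look like (purely hypothetical; none of
these names exist in TAMER, and `Span` is again a simplified stand-in), a
typed span could record whether it covers the quotes and derive the other
form on request:

```rust
/// Simplified stand-in for a source span (byte offset and length).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Span {
    pub offset: u32,
    pub len: u16,
}

/// Hypothetical typed span: the variant records what the span covers.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TypedSpan {
    /// Attribute value without its surrounding quotes.
    AttrValue(Span),
    /// Attribute value including its surrounding quotes.
    AttrValueQuoted(Span),
}

impl TypedSpan {
    /// Span covering the value and both quotes.
    pub fn with_quotes(self) -> Span {
        match self {
            Self::AttrValue(s) => Span {
                offset: s.offset.saturating_sub(1),
                len: s.len.saturating_add(2),
            },
            Self::AttrValueQuoted(s) => s,
        }
    }

    /// Span covering only the value.
    pub fn without_quotes(self) -> Span {
        match self {
            Self::AttrValue(s) => s,
            Self::AttrValueQuoted(s) => Span {
                offset: s.offset.saturating_add(1),
                len: s.len.saturating_sub(2),
            },
        }
    }
}
```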

This also highlights gaps in the tokens emitted by XIR, such as whitespace
between attributes, the `=` between name and value, and so on.  These are
important when it comes to code formatting, so that we can reliably
reconstruct the XML tree, but that is not a priority right now.  I anticipate
future changes would allow the XIR reader to be configured (perhaps via
generics, like a strategy-type pattern) to optionally omit these tokens if
desired.

Anyway, more to come.

DEV-10934
main
Mike Gerwitz 2022-04-08 11:03:46 -04:00
parent 942bf66231
commit ab181670b5
8 changed files with 660 additions and 272 deletions


@ -185,7 +185,7 @@
use crate::{
global,
sym::{ContextStaticSymbolId, SymbolId},
sym::{st16, ContextStaticSymbolId, SymbolId},
};
use std::{convert::TryInto, fmt::Display};
@ -437,7 +437,7 @@ impl Display for Span {
/// A placeholder span indicating that a span is expected but is not yet
/// known.
pub const UNKNOWN_SPAN: Span = Span::st_ctx(crate::sym::st16::CTX_UNKNOWN);
pub const UNKNOWN_SPAN: Span = Span::st_ctx(st16::CTX_UNKNOWN);
/// A dummy span that can be used in contexts where a span is expected but
/// is not important.
@ -447,7 +447,7 @@ pub const UNKNOWN_SPAN: Span = Span::st_ctx(crate::sym::st16::CTX_UNKNOWN);
/// messages and source analysis.
///
/// Additional dummy spans can be derived from this one.
pub const DUMMY_SPAN: Span = Span::st_ctx(crate::sym::st16::CTX_DUMMY);
pub const DUMMY_SPAN: Span = Span::st_ctx(st16::CTX_DUMMY);
/// Context for byte offsets (e.g. a source file).
///
@ -461,6 +461,62 @@ pub const DUMMY_SPAN: Span = Span::st_ctx(crate::sym::st16::CTX_DUMMY);
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub struct Context(PathIndex);
impl Context {
/// Produce a [`Span`] within the given context.
#[inline]
pub fn span(
self,
offset: global::SourceFileSize,
len: global::FrontendTokenLength,
) -> Span {
Span::new(offset, len, self)
}
/// Attempt to produce a [`Span`] of the given length at the given
/// offset,
/// otherwise fall back to a `(0,0)` (ZZ) span.
///
/// If the offset cannot be stored,
/// then the length will always be `0` even if it could otherwise be
/// represented;
/// `(0,0)` indicates no span,
/// whereas `(0,N)` would indicate a span of length `N` at
/// offset `0`,
/// which would not be true.
///
/// If the offset can be represented but not the length,
/// then a zero-length span at that offset will be produced,
/// which still provides useful information.
/// This may be the case for very large objects,
/// like compiled text fragments.
///
/// The rationale here is that spans are intended to be informative.
/// If we are unable to provide that information due to exceptional
/// circumstances
/// (very large file or very large token),
/// then it's better to provide _some_ information than to bail out
/// with an error and interrupt the entire process,
/// potentially masking errors in the process.
#[inline]
pub fn span_or_zz(self, offset: usize, len: usize) -> Span {
self.span(offset.try_into().unwrap_or(0), len.try_into().unwrap_or(0))
}
}
/// A placeholder context indicating that a context is expected but is not
/// yet known.
pub const UNKNOWN_CONTEXT: Context = Context(PathIndex(st16::raw::CTX_UNKNOWN));
/// A dummy context that can be used where a span is expected but is not
/// important.
///
/// This is intended primarily for tests;
/// you should always use an appropriate span to permit sensible error
/// messages and source analysis.
///
/// See also [`UNKNOWN_CONTEXT`].
pub const DUMMY_CONTEXT: Context = Context(PathIndex(st16::raw::CTX_DUMMY));
impl<P: Into<PathIndex>> From<P> for Context {
fn from(path: P) -> Self {
Self(path.into())


@ -532,8 +532,8 @@ pub mod st16 {
<u16>;
// Special contexts.
CTX_DUMMY: ctx "#!UNKNOWN",
CTX_UNKNOWN: ctx "#!DUMMY",
CTX_DUMMY: ctx "#!DUMMY",
CTX_UNKNOWN: ctx "#!UNKNOWN",
CTX_LINKER: ctx "#!LINKER",
// [Symbols will be added here as they are needed.]


@ -26,30 +26,6 @@
//! or even general-purpose---it
//! exists to solve concerns specific to TAMER's construction.
//!
//! Parsing and Safety
//! ==================
//! Many XIR elements know how to safely parse into themselves,
//! exposing [`TryFrom`] traits that will largely do the right thing for
//! you.
//! For example,
//! [`QName`] is able to construct itself from a byte slice and from a
//! string tuple,
//! among other things.
//!
//! ```
//! use tamer::xir::QName;
//! use tamer::sym::GlobalSymbolIntern;
//!
//!# fn main() -> Result<(), tamer::xir::Error> {
//! let src = "foo:bar".as_bytes();
//! let qname = QName::try_from(src)?;
//!
//! assert_eq!(qname, ("foo", "bar").try_into()?);
//!
//!# Ok(())
//!# }
//! ```
//!
//! To parse an entire XML document,
//! see [`reader`].
@ -70,6 +46,8 @@ pub use error::Error;
mod escape;
pub use escape::{DefaultEscaper, Escaper};
use self::error::SpanlessError;
pub mod attr;
pub mod flat;
pub mod iter;
@ -162,7 +140,7 @@ impl NCName {
}
impl TryFrom<&[u8]> for NCName {
type Error = Error;
type Error = SpanlessError;
/// Attempt to parse a byte slice into an [`NCName`].
///
@ -173,7 +151,7 @@ impl TryFrom<&[u8]> for NCName {
/// The string will be interned for you.
fn try_from(value: &[u8]) -> Result<Self, Self::Error> {
match value.contains(&b':') {
true => Err(Error::NCColon(value.to_owned())),
true => Err(SpanlessError::NCColon(value.intern_utf8()?)),
false => Ok(NCName(value.intern_utf8()?)),
}
}
@ -194,11 +172,11 @@ impl PartialEq<SymbolId> for NCName {
}
impl TryFrom<&str> for NCName {
type Error = Error;
type Error = SpanlessError;
fn try_from(value: &str) -> Result<Self, Self::Error> {
if value.contains(':') {
return Err(Error::NCColon(value.into()));
return Err(SpanlessError::NCColon(value.into()));
}
Ok(Self(value.intern()))
@ -242,7 +220,7 @@ impl From<NCName> for LocalPart {
}
impl TryFrom<&str> for Prefix {
type Error = Error;
type Error = SpanlessError;
fn try_from(value: &str) -> Result<Self, Self::Error> {
Ok(Self(value.try_into()?))
@ -250,7 +228,7 @@ impl TryFrom<&str> for Prefix {
}
impl TryFrom<&str> for LocalPart {
type Error = Error;
type Error = SpanlessError;
fn try_from(value: &str) -> Result<Self, Self::Error> {
Ok(Self(value.try_into()?))
@ -398,7 +376,7 @@ where
}
impl TryFrom<&str> for QName {
type Error = Error;
type Error = SpanlessError;
fn try_from(value: &str) -> Result<Self, Self::Error> {
Ok(QName(None, value.try_into()?))
@ -406,7 +384,7 @@ impl TryFrom<&str> for QName {
}
impl TryFrom<&[u8]> for QName {
type Error = Error;
type Error = SpanlessError;
/// Attempt to parse a byte slice into a [`QName`].
///
@ -419,7 +397,7 @@ impl TryFrom<&[u8]> for QName {
// Leading colon means we're missing a prefix, trailing means
// that we have no local part.
Some(pos) if pos == 0 || pos == name.len() - 1 => {
Err(Error::InvalidQName(name.to_owned()))
Err(SpanlessError::InvalidQName(name.intern_utf8()?))
}
// There is _at least_ one colon in the string.
@ -621,7 +599,7 @@ mod test {
fn ncname_try_into_from_str_fails_with_colon() {
assert_eq!(
NCName::try_from("look:a-colon"),
Err(Error::NCColon("look:a-colon".into()))
Err(SpanlessError::NCColon("look:a-colon".into()))
);
}
@ -636,7 +614,7 @@ mod test {
fn ncname_from_byte_slice_fails_with_colon() {
assert_eq!(
NCName::try_from(b"a:colon" as &[u8]),
Err(Error::NCColon("a:colon".into()))
Err(SpanlessError::NCColon("a:colon".into()))
);
}


@ -26,17 +26,19 @@ use std::{fmt::Display, str::Utf8Error};
#[derive(Debug, PartialEq)]
pub enum Error {
/// Provided name contains a `':'`.
NCColon(Vec<u8>),
NCColon(SymbolId, Span),
/// Provided string contains non-ASCII-whitespace characters.
NotWhitespace(String),
/// Provided QName is not valid.
InvalidQName(Vec<u8>),
InvalidQName(SymbolId, Span),
/// A UTF-8 error together with the byte slice that caused it.
///
/// By storing the raw bytes instead of a string,
/// we allow the displayer to determine how to handle invalid UTF-8
/// encodings.
InvalidUtf8(Utf8Error, Vec<u8>),
/// Further,
/// we cannot intern strings that are not valid UTF-8.
InvalidUtf8(Utf8Error, Vec<u8>, Span),
/// XML 1.0 only.
///
/// Other versions are not widely in use
@ -49,31 +51,34 @@ pub enum Error {
/// which should not be an unreasonable expectation.
UnsupportedEncoding(SymbolId, Span),
// TODO: Better error translation and spans.
QuickXmlError(quick_xml::Error),
// TODO: Better error translation.
QuickXmlError(quick_xml::Error, Span),
}
impl Error {
pub fn from_with_span<E: Into<SpanlessError>>(
span: Span,
) -> impl FnOnce(E) -> Self {
move |e: E| e.into().with_span(span)
}
}
impl Display for Error {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
match self {
Self::NCColon(bytes) => {
write!(
f,
"NCName `{}` cannot contain ':'",
String::from_utf8_lossy(bytes)
)
Self::NCColon(sym, span) => {
write!(f, "NCName `{sym}` cannot contain ':' at {span}",)
}
Self::NotWhitespace(s) => {
write!(f, "string contains non-ASCII-whitespace: `{}`", s)
}
Self::InvalidQName(bytes) => {
write!(f, "invalid QName `{}`", String::from_utf8_lossy(bytes))
Self::InvalidQName(qname, span) => {
write!(f, "invalid QName `{qname}` at {span}")
}
Self::InvalidUtf8(inner, bytes) => {
Self::InvalidUtf8(inner, bytes, span) => {
write!(
f,
"{} for string `{}`",
inner,
"{inner} for string `{}` with bytes `{bytes:?}` at {span}",
String::from_utf8_lossy(bytes)
)
}
@ -94,9 +99,9 @@ impl Display for Error {
but found unsupported encoding `{enc}`"
)
}
// TODO: See Error TODO
Self::QuickXmlError(inner) => {
write!(f, "internal parser error: {:?}", inner)
// TODO: Translate error messages
Self::QuickXmlError(inner, span) => {
write!(f, "internal parser error: {inner} at {span}")
}
}
}
@ -105,19 +110,77 @@ impl Display for Error {
impl std::error::Error for Error {
fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
match self {
Self::InvalidUtf8(err, ..) => Some(err),
Self::InvalidUtf8(e, ..) => Some(e),
Self::QuickXmlError(e, ..) => Some(e),
_ => None,
}
}
}
impl From<(Utf8Error, &[u8])> for Error {
/// An [`Error`] that requires its [`Span`] to be filled in by the caller.
///
/// These errors should not be converted automatically,
/// since only the caller can know the correct information to provide for
/// a useful [`Span`].
/// Failure to provide a useful span will betray the user when they need us
/// the most:
/// debugging an error.
///
/// As such,
/// please do not implement `From<SpanlessError> for Error`;
/// use [`SpanlessError::with_span`] instead.
#[derive(Debug, PartialEq)]
pub enum SpanlessError {
NCColon(SymbolId),
InvalidQName(SymbolId),
InvalidUtf8(Utf8Error, Vec<u8>),
QuickXmlError(quick_xml::Error),
}
impl SpanlessError {
pub fn with_span(self, span: Span) -> Error {
match self {
Self::NCColon(sym) => Error::NCColon(sym, span),
Self::InvalidQName(qname) => Error::InvalidQName(qname, span),
Self::InvalidUtf8(inner, bytes) => {
Error::InvalidUtf8(inner, bytes, span)
}
Self::QuickXmlError(inner) => Error::QuickXmlError(inner, span),
}
}
pub fn into_with_span<E>(span: Span) -> impl FnOnce(E) -> Error
where
E: Into<SpanlessError>,
{
move |e: E| e.into().with_span(span)
}
}
impl std::error::Error for SpanlessError {
fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
match self {
Self::InvalidUtf8(inner, ..) => Some(inner),
Self::QuickXmlError(inner) => Some(inner),
_ => None,
}
}
}
impl Display for SpanlessError {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
// This isn't friendly, since it shouldn't occur.
write!(f, "internal error: missing span for error: {self:?}")
}
}
impl From<(Utf8Error, &[u8])> for SpanlessError {
fn from((err, bytes): (Utf8Error, &[u8])) -> Self {
Self::InvalidUtf8(err, bytes.to_owned())
}
}
impl<E: Into<quick_xml::Error>> From<E> for Error {
impl<E: Into<quick_xml::Error>> From<E> for SpanlessError {
fn from(err: E) -> Self {
Self::QuickXmlError(err.into())
}


@ -62,7 +62,7 @@ use crate::sym::{
};
use std::{borrow::Cow, cell::RefCell, collections::hash_map::Entry};
use super::Error;
use super::error::SpanlessError;
/// XIR escaper and unescaper.
///
@ -86,7 +86,7 @@ pub trait Escaper: Default {
/// Unescape raw bytes such that any relevant escape sequences are
/// parsed into their text representation.
fn unescape_bytes(value: &[u8]) -> Result<Cow<[u8]>, Error>;
fn unescape_bytes(value: &[u8]) -> Result<Cow<[u8]>, SpanlessError>;
/// Escape the given symbol and produce a [`SymbolId`] representing
/// the escaped value suitable for writing.
@ -112,7 +112,7 @@ pub trait Escaper: Default {
/// Unescape the provided raw value and return a [`SymbolId`]
/// representing the unescaped value.
#[inline]
fn unescape(&self, escaped: SymbolId) -> Result<SymbolId, Error> {
fn unescape(&self, escaped: SymbolId) -> Result<SymbolId, SpanlessError> {
Ok(
match Self::unescape_bytes(escaped.lookup_str().as_bytes())? {
// We got back what we sent in,
@ -139,7 +139,7 @@ impl Escaper for QuickXmlEscaper {
}
#[inline]
fn unescape_bytes(value: &[u8]) -> Result<Cow<[u8]>, Error> {
fn unescape_bytes(value: &[u8]) -> Result<Cow<[u8]>, SpanlessError> {
// For some reason,
// quick-xml has made EscapeError explicitly private to the crate,
// and so it is opaque to us.
@ -240,7 +240,7 @@ impl<S: Escaper> Escaper for CachingEscaper<S> {
}
#[inline]
fn unescape_bytes(value: &[u8]) -> Result<Cow<[u8]>, Error> {
fn unescape_bytes(value: &[u8]) -> Result<Cow<[u8]>, SpanlessError> {
S::unescape_bytes(value)
}
@ -261,7 +261,7 @@ impl<S: Escaper> Escaper for CachingEscaper<S> {
}
#[inline]
fn unescape(&self, escaped: SymbolId) -> Result<SymbolId, Error> {
fn unescape(&self, escaped: SymbolId) -> Result<SymbolId, SpanlessError> {
Ok(match self.tounesc.borrow_mut().entry(escaped) {
Entry::Occupied(unescaped) => *unescaped.get(),
Entry::Vacant(entry) => {
@ -293,7 +293,7 @@ impl Escaper for NullEscaper {
}
#[inline]
fn unescape_bytes(_value: &[u8]) -> Result<Cow<[u8]>, Error> {
fn unescape_bytes(_value: &[u8]) -> Result<Cow<[u8]>, SpanlessError> {
panic!("NullEscaper should not be used for unescaping")
}
}
@ -341,7 +341,9 @@ mod test {
unreachable!("escape_bytes should not be called")
}
fn unescape_bytes(_: &[u8]) -> result::Result<Cow<[u8]>, Error> {
fn unescape_bytes(
_: &[u8],
) -> result::Result<Cow<[u8]>, SpanlessError> {
unreachable!("unescape_bytes should not be called")
}
@ -350,7 +352,10 @@ mod test {
*self.escape_map.get(&given).expect("unexpected escape")
}
fn unescape(&self, given: SymbolId) -> Result<SymbolId, Error> {
fn unescape(
&self,
given: SymbolId,
) -> Result<SymbolId, SpanlessError> {
*self.unescape_count.borrow_mut().entry(given).or_default() +=
1;
Ok(*self.unescape_map.get(&given).expect("unexpected unescape"))


@ -21,9 +21,9 @@
//!
//! This uses [`quick_xml`] as the parser.
use super::{DefaultEscaper, Error, Escaper, Token};
use super::{error::SpanlessError, DefaultEscaper, Error, Escaper, Token};
use crate::{
span::{DUMMY_SPAN, UNKNOWN_SPAN},
span::{UNKNOWN_CONTEXT as UC, UNKNOWN_SPAN},
sym::GlobalSymbolInternBytes,
};
use quick_xml::{
@ -32,7 +32,7 @@ use quick_xml::{
attributes::Attributes, BytesDecl, BytesStart, Event as QuickXmlEvent,
},
};
use std::{collections::VecDeque, io::BufRead, result};
use std::{borrow::Cow, collections::VecDeque, io::BufRead, result};
pub type Result<T> = result::Result<T, Error>;
@ -106,23 +106,39 @@ impl<'s, B: BufRead, S: Escaper> XmlXirReader<'s, B, S> {
self.tokbuf.clear();
self.readbuf.clear();
match self.reader.read_event(&mut self.readbuf) {
// This is the only time we'll consider the iterator to be done.
Ok(QuickXmlEvent::Eof) => None,
let prev_pos = self.reader.buffer_position();
Err(inner) => Some(Err(inner.into())),
match self.reader.read_event(&mut self.readbuf) {
// TODO: To provide better spans and error messages,
// we need to map specific types of errors.
Err(inner) => {
let span = UC.span_or_zz(prev_pos, 0);
Some(Err(SpanlessError::from(inner).with_span(span)))
}
Ok(ev) => match ev {
// This is the only time we'll consider the iterator to be
// done.
QuickXmlEvent::Eof => None,
QuickXmlEvent::Empty(ele) => Some(
Self::parse_element_open(
&self.escaper,
&mut self.tokbuf,
ele,
prev_pos,
)
.and_then(|open| {
let new_pos = self.reader.buffer_position();
// `<tag ... />`
// ||
let span = UC.span_or_zz(new_pos - 2, 2);
// Tag is self-closing, but this does not yet
// handle whitespace before the `/`.
self.tokbuf.push_front(Token::Close(None, DUMMY_SPAN));
// handle whitespace before the `/`
// (as indicated in the span above).
self.tokbuf.push_front(Token::Close(None, span));
Ok(open)
}),
@ -132,13 +148,19 @@ impl<'s, B: BufRead, S: Escaper> XmlXirReader<'s, B, S> {
&self.escaper,
&mut self.tokbuf,
ele,
prev_pos,
)),
QuickXmlEvent::End(ele) => {
Some(ele.name().try_into().map_err(Error::from).and_then(
|qname| Ok(Token::Close(Some(qname), DUMMY_SPAN)),
))
}
QuickXmlEvent::End(ele) => Some({
// </foo>
// |----| name + '<' + '/' + '>'
let span = UC.span_or_zz(prev_pos, ele.name().len() + 3);
ele.name()
.try_into()
.map_err(Error::from_with_span(span))
.and_then(|qname| Ok(Token::Close(Some(qname), span)))
}),
// quick_xml emits a useless text event if the first byte is
// a '<'.
@ -152,35 +174,46 @@ impl<'s, B: BufRead, S: Escaper> XmlXirReader<'s, B, S> {
// unescape it again.
QuickXmlEvent::CData(bytes) => todo!("CData: {:?}", bytes),
QuickXmlEvent::Text(bytes) => Some(
QuickXmlEvent::Text(bytes) => Some({
// <text>foo bar</text>
// |-----|
let span = UC.span_or_zz(prev_pos, bytes.len());
bytes
.intern_utf8()
.map_err(Error::from)
.map_err(Into::into)
.and_then(|sym| self.escaper.unescape(sym))
.map(|unesc| Token::Text(unesc, DUMMY_SPAN)),
),
.map_err(Error::from_with_span(span))
.and_then(|unesc| Ok(Token::Text(unesc, span)))
}),
// Comments are _not_ returned escaped.
QuickXmlEvent::Comment(bytes) => Some(
QuickXmlEvent::Comment(bytes) => Some({
// <!-- foo -->
// |----------| " foo " + "<!--" + "-->"
let span = UC.span_or_zz(prev_pos, bytes.len() + 7);
bytes
.intern_utf8()
.map_err(Error::from)
.map(|text| Token::Comment(text, DUMMY_SPAN)),
),
.map_err(Error::from_with_span(span))
.and_then(|comment| Ok(Token::Comment(comment, span)))
}),
// TODO: This must appear in the Prolog.
QuickXmlEvent::Decl(decl) => match Self::validate_decl(&decl) {
Err(x) => Some(Err(x)),
Ok(()) => self.refill_buf(),
},
QuickXmlEvent::Decl(decl) => {
match Self::validate_decl(&decl, prev_pos) {
Err(x) => Some(Err(x)),
Ok(()) => self.refill_buf(),
}
}
// We do not support processor instructions.
// We do not support processor instructions or doctypes.
// TODO: Convert this into an error/warning?
// Previously `xml-stylesheet` was present in some older
// source files and may linger for a bit after cleanup.
QuickXmlEvent::PI(..) => self.refill_buf(),
x => todo!("event: {:?}", x),
QuickXmlEvent::PI(..) | QuickXmlEvent::DocType(..) => {
self.refill_buf()
}
},
}
}
@ -197,24 +230,44 @@ impl<'s, B: BufRead, S: Escaper> XmlXirReader<'s, B, S> {
/// people unfamiliar with the system do not have expectations that
/// are going to be unmet,
/// which may result in subtle (or even serious) problems.
fn validate_decl(decl: &BytesDecl) -> Result<()> {
fn validate_decl(decl: &BytesDecl, pos: usize) -> Result<()> {
// Starts after `<?`, which we want to include.
let decl_ptr = decl.as_ptr() as usize - 2 + pos;
// Fallback span that covers the entire declaration.
let decl_span = UC.span_or_zz(pos, decl.len() + 4);
let ver =
&decl.version().map_err(Error::from_with_span(decl_span))?[..];
// NB: `quick-xml` docs state that `version` returns the quotes,
// but it does not.
let ver = &decl.version()?[..];
if ver != b"1.0" {
// <?xml version="X.Y"?>
// |-|
let ver_pos = (ver.as_ptr() as usize) - decl_ptr;
let span = UC.span_or_zz(ver_pos, ver.len());
Err(Error::UnsupportedXmlVersion(
ver.intern_utf8()?,
UNKNOWN_SPAN,
ver.intern_utf8().map_err(Error::from_with_span(span))?,
span,
))?
}
if let Some(enc) = decl.encoding() {
match &enc?[..] {
match &enc.map_err(Error::from_with_span(decl_span))?[..] {
b"utf-8" | b"UTF-8" => (),
invalid => Err(Error::UnsupportedEncoding(
invalid.intern_utf8()?,
UNKNOWN_SPAN,
))?,
invalid => {
let enc_pos = (invalid.as_ptr() as usize) - decl_ptr;
let span = UC.span_or_zz(enc_pos, invalid.len());
Err(Error::UnsupportedEncoding(
invalid
.intern_utf8()
.map_err(Error::from_with_span(span))?,
span,
))?
}
}
}
@ -231,16 +284,37 @@ impl<'s, B: BufRead, S: Escaper> XmlXirReader<'s, B, S> {
escaper: &'s S,
tokbuf: &mut VecDeque<Token>,
ele: BytesStart,
pos: usize,
) -> Result<Token> {
// Starts after the opening tag `<`, so adjust.
let addr = ele.as_ptr() as usize - 1;
let len = ele.name().len();
// `ele` contains every byte up to the [self-]closing tag.
ele.name()
.try_into()
.map_err(Error::from)
.map_err(Error::from_with_span(UC.span_or_zz(pos + 1, len)))
.and_then(|qname| {
Self::parse_attrs(escaper, tokbuf, ele.attributes())?;
let noattr_add: usize =
(ele.attributes_raw().len() == 0).into();
// <tag ... />
// |--| name + '<'
//
// <tag>..</tag>
// |---| name + '<' + '>'
let span = UC.span_or_zz(pos, len + 1 + noattr_add);
Self::parse_attrs(
escaper,
tokbuf,
ele.attributes(),
addr - pos, // offset relative to _beginning_ of buffer
)?;
// The first token will be immediately returned
// via the Iterator.
Ok(Token::Open(qname, DUMMY_SPAN))
Ok(Token::Open(qname, span))
})
}
@ -250,20 +324,61 @@ impl<'s, B: BufRead, S: Escaper> XmlXirReader<'s, B, S> {
///
/// This does not yet handle whitespace between attributes,
/// or around `=`.
///
/// Note About Pointer Arithmetic
/// =============================
/// `ele_ptr` is expected to be a pointer to the buffer containing the
/// bytes read from the source file.
/// Attributes reference this buffer,
/// so we can use pointer arithmetic to determine the offset within
/// the buffer relative to the node.
/// This works because the underlying buffer is a `Vec`,
/// which is contiguous in memory.
///
/// However, since this is a `Vec`,
/// it is important that the address be retrieved _after_ quick-xml
/// read events,
/// otherwise the buffer may be expanded and will be reallocated.
fn parse_attrs<'a>(
escaper: &'s S,
tokbuf: &mut VecDeque<Token>,
mut attrs: Attributes<'a>,
ele_ptr: usize,
) -> Result<()> {
// Disable checks to allow duplicate attributes;
// XIR does not enforce this,
// because it needs to accommodate semantically invalid XML for
// later analysis.
for result in attrs.with_checks(false) {
let attr = result?;
// TODO: We'll need to map this quick-xml error to provide more
// detailed messages and spans.
let attr = result.map_err(Error::from_with_span(UNKNOWN_SPAN))?;
let keyoffset = attr.key.as_ptr() as usize;
let name_offset = keyoffset - ele_ptr;
// Accommodates zero-length values (e.g. `key=""`) with a
// zero-length span at the location the value _would_ be.
let valoffset = match attr.value {
Cow::Borrowed(b) => b.as_ptr() as usize,
// This should never happen since we have a reference to the
// underlying buffer.
Cow::Owned(_) => unreachable!(
"internal error: unexpected owned attribute value"
),
};
let value_offset = valoffset - ele_ptr;
let span_name = UC.span_or_zz(name_offset, attr.key.len());
let span_value = UC.span_or_zz(value_offset, attr.value.len());
// The name must be parsed as a QName.
let name = attr.key.try_into()?;
let name = attr
.key
.try_into()
.map_err(Error::from_with_span(span_name))?;
// The attribute value,
// having just been read from XML,
@ -273,11 +388,18 @@ impl<'s, B: BufRead, S: Escaper> XmlXirReader<'s, B, S> {
// that's okay as long as we can read it again,
// but we probably should still throw an error if we
// encounter such a situation.
let value =
escaper.unescape(attr.value.as_ref().intern_utf8()?)?.into();
let value = escaper
.unescape(
attr.value
.as_ref()
.intern_utf8()
.map_err(Error::from_with_span(span_value))?,
)
.map_err(Error::from_with_span(span_value))?
.into();
tokbuf.push_front(Token::AttrName(name, DUMMY_SPAN));
tokbuf.push_front(Token::AttrValue(value, DUMMY_SPAN));
tokbuf.push_front(Token::AttrName(name, span_name));
tokbuf.push_front(Token::AttrValue(value, span_value));
}
Ok(())


@ -23,7 +23,7 @@ use super::*;
use crate::sym::GlobalSymbolIntern;
use crate::{
convert::ExpectInto,
span::DUMMY_SPAN,
span::UNKNOWN_CONTEXT as UC,
xir::{Error, Token},
};
@ -54,7 +54,9 @@ impl Escaper for MockEscaper {
unreachable!("Reader should not be escaping!")
}
fn unescape_bytes(value: &[u8]) -> result::Result<Cow<[u8]>, Error> {
fn unescape_bytes(
value: &[u8],
) -> result::Result<Cow<[u8]>, SpanlessError> {
let mut unesc = value.to_owned();
unesc.extend_from_slice(b":UNESC");
@ -88,15 +90,19 @@ macro_rules! new_sut {
#[test]
fn empty_node_without_prefix_or_attributes() {
new_sut!(sut = "<empty-node />");
// |---------| ||
// 0 10
// A B
let result = sut.collect::<Result<Vec<_>>>();
let a = UC.span(0, 11);
let b = UC.span(12, 2);
assert_eq!(
result.expect("parsing failed"),
vec![
Token::Open("empty-node".unwrap_into(), DUMMY_SPAN),
Token::Close(None, DUMMY_SPAN),
],
Ok(vec![
Token::Open("empty-node".unwrap_into(), a),
Token::Close(None, b),
]),
sut.collect(),
);
}
@ -104,18 +110,24 @@ fn empty_node_without_prefix_or_attributes() {
#[test]
fn does_not_resolve_xmlns() {
new_sut!(sut = r#"<no-ns xmlns="noresolve" />"#);
// |----| |---| |-------| ||
// 0 5 7 11 14 22 25
// A B C D
let result = sut.collect::<Result<Vec<_>>>();
let a = UC.span(0, 6);
let b = UC.span(7, 5);
let c = UC.span(14, 9);
let d = UC.span(25, 2);
assert_eq!(
result.expect("parsing failed"),
vec![
Token::Open("no-ns".unwrap_into(), DUMMY_SPAN),
Ok(vec![
Token::Open("no-ns".unwrap_into(), a),
// Since we didn't parse @xmlns, it's still an attribute.
Token::AttrName("xmlns".unwrap_into(), DUMMY_SPAN),
Token::AttrValue("noresolve:UNESC".intern(), DUMMY_SPAN),
Token::Close(None, DUMMY_SPAN),
],
Token::AttrName("xmlns".unwrap_into(), b),
Token::AttrValue("noresolve:UNESC".intern(), c),
Token::Close(None, d),
]),
sut.collect(),
);
}
@ -123,18 +135,24 @@ fn does_not_resolve_xmlns() {
#[test]
fn empty_node_with_prefix_without_attributes_unresolved() {
new_sut!(sut = r#"<x:empty-node xmlns:x="noresolve" />"#);
// |-----------| |-----| |-------| ||
// 0 12 14 20 23 31 34
// A B C D
let result = sut.collect::<Result<Vec<_>>>();
let a = UC.span(0, 13);
let b = UC.span(14, 7);
let c = UC.span(23, 9);
let d = UC.span(34, 2);
// Should be the QName, _unresolved_.
assert_eq!(
result.expect("parsing failed"),
vec![
Token::Open(("x", "empty-node").unwrap_into(), DUMMY_SPAN),
Token::AttrName(("xmlns", "x").unwrap_into(), DUMMY_SPAN),
Token::AttrValue("noresolve:UNESC".intern(), DUMMY_SPAN),
Token::Close(None, DUMMY_SPAN),
],
Ok(vec![
Token::Open(("x", "empty-node").unwrap_into(), a),
Token::AttrName(("xmlns", "x").unwrap_into(), b),
Token::AttrValue("noresolve:UNESC".intern(), c),
Token::Close(None, d),
]),
sut.collect(),
);
}
@ -143,13 +161,18 @@ fn empty_node_with_prefix_without_attributes_unresolved() {
fn prefix_with_empty_local_name_invalid_qname() {
// No local name (trailing colon).
new_sut!(sut = r#"<x: xmlns:x="testns" />"#);
// ||
// 1
// A
let a = UC.span(1, 2);
let result = sut.collect::<Result<Vec<_>>>();
match result {
Ok(_) => panic!("expected failure"),
Err(given) => {
assert_eq!(Error::InvalidQName("x:".into()), given);
assert_eq!(Error::InvalidQName("x:".into(), a), given);
}
}
}
@ -158,132 +181,215 @@ fn prefix_with_empty_local_name_invalid_qname() {
#[test]
fn multiple_attrs_ordered() {
new_sut!(sut = r#"<ele foo="a" bar="b" b:baz="c" />"#);
// |--| |-| | |-| | |---| | ||
// 0 3 5 7 10 13 18 21 25 28 31
// A B C D E F G H
let result = sut.collect::<Result<Vec<_>>>();
let a = UC.span(0, 4);
let b = UC.span(5, 3);
let c = UC.span(10, 1);
let d = UC.span(13, 3);
let e = UC.span(18, 1);
let f = UC.span(21, 5);
let g = UC.span(28, 1);
let h = UC.span(31, 2);
assert_eq!(
result.expect("parsing failed"),
vec![
Token::Open("ele".unwrap_into(), DUMMY_SPAN),
Token::AttrName("foo".unwrap_into(), DUMMY_SPAN),
Token::AttrValue("a:UNESC".intern(), DUMMY_SPAN),
Token::AttrName("bar".unwrap_into(), DUMMY_SPAN),
Token::AttrValue("b:UNESC".intern(), DUMMY_SPAN),
Token::AttrName(("b", "baz").unwrap_into(), DUMMY_SPAN),
Token::AttrValue("c:UNESC".intern(), DUMMY_SPAN),
Token::Close(None, DUMMY_SPAN),
],
Ok(vec![
Token::Open("ele".unwrap_into(), a),
Token::AttrName("foo".unwrap_into(), b),
Token::AttrValue("a:UNESC".intern(), c),
Token::AttrName("bar".unwrap_into(), d),
Token::AttrValue("b:UNESC".intern(), e),
Token::AttrName(("b", "baz").unwrap_into(), f),
Token::AttrValue("c:UNESC".intern(), g),
Token::Close(None, h),
]),
sut.collect(),
);
}
// Contrary to the specification, but this is the responsibility of XIRT; we
// need to allow it to support e.g. recovery, code formatting, and LSPs.
#[test]
fn empty_attr_value() {
new_sut!(sut = r#"<ele empty="" />"#);
// |--| |---| | ||
// 0 3 5 9 12 14
// A B C D
// /
// zero-length span, where
// the value _would_ be
let a = UC.span(0, 4);
let b = UC.span(5, 5);
let c = UC.span(12, 0);
let d = UC.span(14, 2);
assert_eq!(
Ok(vec![
Token::Open("ele".unwrap_into(), a),
Token::AttrName("empty".unwrap_into(), b),
Token::AttrValue(":UNESC".intern(), c),
Token::Close(None, d),
]),
sut.collect(),
);
}
// Contrary to the specification, but this is the responsibility of another
// parsing layer; we need to allow it to support e.g. recovery, code
// formatting, and LSPs.
#[test]
fn permits_duplicate_attrs() {
new_sut!(sut = r#"<dup attr="a" attr="b" />"#);
// |--| |--| | |--| | ||
// 0 3 5 8 11 14 17 20 23
// A B C D E F
let result = sut.collect::<Result<Vec<_>>>();
let a = UC.span(0, 4);
let b = UC.span(5, 4);
let c = UC.span(11, 1);
let d = UC.span(14, 4);
let e = UC.span(20, 1);
let f = UC.span(23, 2);
assert_eq!(
result.expect("parsing failed"),
vec![
Token::Open("dup".unwrap_into(), DUMMY_SPAN),
Token::AttrName("attr".unwrap_into(), DUMMY_SPAN),
Token::AttrValue("a:UNESC".intern(), DUMMY_SPAN),
Token::AttrName("attr".unwrap_into(), DUMMY_SPAN),
Token::AttrValue("b:UNESC".intern(), DUMMY_SPAN),
Token::Close(None, DUMMY_SPAN),
],
Ok(vec![
Token::Open("dup".unwrap_into(), a),
Token::AttrName("attr".unwrap_into(), b),
Token::AttrValue("a:UNESC".intern(), c),
Token::AttrName("attr".unwrap_into(), d),
Token::AttrValue("b:UNESC".intern(), e),
Token::Close(None, f),
]),
sut.collect(),
);
}
#[test]
fn child_node_self_closing() {
new_sut!(sut = r#"<root><child /></root>"#);
//                |----||----| |||-----|
//                0    5`6  11 13`15  21
//                  A     B    C    D
//                 /
//   note that this includes '>' when there are no attrs,
//     since that results in a more intuitive span (subject to change)
let result = sut.collect::<Result<Vec<_>>>();
let a = UC.span(0, 6);
let b = UC.span(6, 6);
let c = UC.span(13, 2);
let d = UC.span(15, 7);
assert_eq!(
result.expect("parsing failed"),
vec![
Token::Open("root".unwrap_into(), DUMMY_SPAN),
Token::Open("child".unwrap_into(), DUMMY_SPAN),
Token::Close(None, DUMMY_SPAN),
Token::Close(Some("root".unwrap_into()), DUMMY_SPAN),
],
Ok(vec![
Token::Open("root".unwrap_into(), a),
Token::Open("child".unwrap_into(), b),
Token::Close(None, c),
Token::Close(Some("root".unwrap_into()), d),
]),
sut.collect(),
);
}
#[test]
fn sibling_nodes() {
new_sut!(sut = r#"<root><child /><child /></root>"#);
//                |----||----| |||----| |||-----|
//                0    5`6  11 13`15 20 22`24  30
//                  A     B    C   D    E    F
let result = sut.collect::<Result<Vec<_>>>();
let a = UC.span(0, 6);
let b = UC.span(6, 6);
let c = UC.span(13, 2);
let d = UC.span(15, 6);
let e = UC.span(22, 2);
let f = UC.span(24, 7);
assert_eq!(
result.expect("parsing failed"),
vec![
Token::Open("root".unwrap_into(), DUMMY_SPAN),
Token::Open("child".unwrap_into(), DUMMY_SPAN),
Token::Close(None, DUMMY_SPAN),
Token::Open("child".unwrap_into(), DUMMY_SPAN),
Token::Close(None, DUMMY_SPAN),
Token::Close(Some("root".unwrap_into()), DUMMY_SPAN),
],
Ok(vec![
Token::Open("root".unwrap_into(), a),
Token::Open("child".unwrap_into(), b),
Token::Close(None, c),
Token::Open("child".unwrap_into(), d),
Token::Close(None, e),
Token::Close(Some("root".unwrap_into()), f),
]),
sut.collect(),
);
}
#[test]
fn child_node_with_attrs() {
new_sut!(sut = r#"<root><child foo="bar" /></root>"#);
//                |----||----| |-|  |-|  |||-----|
//                0    5`6  11 13  18 20 23`25  31
//                  A     B     C    D   E    F
let result = sut.collect::<Result<Vec<_>>>();
let a = UC.span(0, 6);
let b = UC.span(6, 6);
let c = UC.span(13, 3);
let d = UC.span(18, 3);
let e = UC.span(23, 2);
let f = UC.span(25, 7);
assert_eq!(
result.expect("parsing failed"),
vec![
Token::Open("root".unwrap_into(), DUMMY_SPAN),
Token::Open("child".unwrap_into(), DUMMY_SPAN),
Token::AttrName("foo".unwrap_into(), DUMMY_SPAN),
Token::AttrValue("bar:UNESC".intern(), DUMMY_SPAN),
Token::Close(None, DUMMY_SPAN),
Token::Close(Some("root".unwrap_into()), DUMMY_SPAN),
],
Ok(vec![
Token::Open("root".unwrap_into(), a),
Token::Open("child".unwrap_into(), b),
Token::AttrName("foo".unwrap_into(), c),
Token::AttrValue("bar:UNESC".intern(), d),
Token::Close(None, e),
Token::Close(Some("root".unwrap_into()), f),
]),
sut.collect(),
);
}
#[test]
fn child_text() {
new_sut!(sut = r#"<text>foo bar</text>"#);
//                |----||-----||-----|
//                0    5`6   12`13  19
//                  A      B      C
let result = sut.collect::<Result<Vec<_>>>();
let a = UC.span(0, 6);
let b = UC.span(6, 7);
let c = UC.span(13, 7);
assert_eq!(
result.expect("parsing failed"),
vec![
Token::Open("text".unwrap_into(), DUMMY_SPAN),
Token::Text("foo bar:UNESC".into(), DUMMY_SPAN),
Token::Close(Some("text".unwrap_into()), DUMMY_SPAN),
],
Ok(vec![
Token::Open("text".unwrap_into(), a),
Token::Text("foo bar:UNESC".into(), b),
Token::Close(Some("text".unwrap_into()), c),
]),
sut.collect(),
);
}
#[test]
fn mixed_child_content() {
new_sut!(sut = r#"<text>foo<em>bar</em></text>"#);
//                |----||-||--||-||---||-----|
//                0    5`6 9 12`13`16 21    27
//                  A    B  C   D   E     F
let result = sut.collect::<Result<Vec<_>>>();
let a = UC.span(0, 6);
let b = UC.span(6, 3);
let c = UC.span(9, 4);
let d = UC.span(13, 3);
let e = UC.span(16, 5);
let f = UC.span(21, 7);
assert_eq!(
result.expect("parsing failed"),
vec![
Token::Open("text".unwrap_into(), DUMMY_SPAN),
Token::Text("foo:UNESC".into(), DUMMY_SPAN),
Token::Open("em".unwrap_into(), DUMMY_SPAN),
Token::Text("bar:UNESC".into(), DUMMY_SPAN),
Token::Close(Some("em".unwrap_into()), DUMMY_SPAN),
Token::Close(Some("text".unwrap_into()), DUMMY_SPAN),
],
Ok(vec![
Token::Open("text".unwrap_into(), a),
Token::Text("foo:UNESC".into(), b),
Token::Open("em".unwrap_into(), c),
Token::Text("bar:UNESC".into(), d),
Token::Close(Some("em".unwrap_into()), e),
Token::Close(Some("text".unwrap_into()), f),
]),
sut.collect(),
);
}
@@ -299,38 +405,56 @@ fn mixed_child_content_with_newlines() {
</root>
"#
);
// \n<root>\n  <child />\n</root>\n
// |||----|| -||----| |||||-----|||
// 0 1   6`7 9`10  15 17| `20 26`27
//                      19
// A   B     C   D    E F    G   H
let result = sut.collect::<Result<Vec<_>>>();
let a = UC.span(0, 1);
let b = UC.span(1, 6);
let c = UC.span(7, 3);
let d = UC.span(10, 6);
let e = UC.span(17, 2);
let f = UC.span(19, 1);
let g = UC.span(20, 7);
let h = UC.span(27, 1);
assert_eq!(
result.expect("parsing failed"),
vec![
Token::Text("\n:UNESC".into(), DUMMY_SPAN),
Token::Open("root".unwrap_into(), DUMMY_SPAN),
Token::Text("\n :UNESC".into(), DUMMY_SPAN),
Token::Open("child".unwrap_into(), DUMMY_SPAN),
Token::Close(None, DUMMY_SPAN),
Token::Text("\n:UNESC".into(), DUMMY_SPAN),
Token::Close(Some("root".unwrap_into()), DUMMY_SPAN),
Token::Text("\n:UNESC".into(), DUMMY_SPAN),
],
Ok(vec![