`Ident` is now `Opaque`, but the new `Transparent` state isn't actually used
yet in any transitions; that'll come next.
The original (now "opaque") identifiers were added for the linker, which
does not need (at present) the associated expressions, since they've already
been compiled. In the future I'd like to do LTO (link-time optimization),
and then the graph will need more information.
DEV-13160
This ASG implementation is a refactored form of original code from the
proof-of-concept linker, which was well before the span and diagnostic
implementations, and well before I knew for certain how I was going to solve
that problem.
This was quite the pain in the ass, but introduces spans to the AIR tokens
and graph so that we always have useful diagnostic information. With that
said, there are some important things to note:
1. Linker spans will originate from the `xmlo` files until we persist
spans to those object files during `tamec`'s compilation. But it's
better than nothing.
2. Some additional refactoring is still needed for consistency, e.g. use
of `SPair`.
3. This is just a preliminary introduction. More refactoring will come as
tamec is continued.
DEV-13041
This isn't conceptally all that significant of a change, but there was a lot
of modify to get it working. I would generally separate this into a commit
for the implementation and another commit for the integration, but I decided
to keep things together.
This serves a role similar to AttrSpan---this allows deriving a span
representing the element name from a span representing the entire XIR
token. This will provide more useful context for errors---including the tag
delimiter(s) means that we care about the fact that an element is in that
position (as opposed to some other type of node) within the context of an
error. However, if we are expecting an element but take issue with the
element name itself, we want to place emphasis on that instead.
This also starts to consider the issue of span contexts---a blob of detached
data that is `Span` is useful for error context, but it's not useful for
manipulation or deriving additional information. For that, we need to
encode additional context, and this is an attempt at that.
I am interested in the concept of providing Spans that are guaranteed to
actually make sense---that are instantiated and manipulated with APIs that
ensure consistency. But such a thing buys us very little, practically
speaking, over what I have now for TAMER, and so I don't expect to actually
implement that for this project; I'll leave that for a personal
project. TAMER's already take a lot of my personal interests and it can
cause me a lot of grief sometimes (with regards to letting my aspirations
cause me more work).
DEV-7145
A previous commit mentioned that there's not a place for `Dim`, and
duplicated it between `asg` and `xmlo`. Well, `Dtype` is also needed in
both, and so here's a home for now.
`Dtype` has always been an inappropriate detail for the system and will one
day be removed entirely in favor of higher-level types; the machine
representation is up to the compiler to decide.
DEV-11864
This matches xmlo::Dim, and could be the same thing, if we can find a home
for it in the future; it's not worth creating such a home right now when I'm
not yet sure what else ought to live there; the duplication may be fine.
The conversion from xmlo needs to be moved, and `Dim` is going to be used
for more than just identifiers (expressions will have type inference
performed).
DEV-11864
RSG (Ryan Specialty Group) recently announced a rename to Ryan Specialty (no
"Group"), but I'm not sure if the legal name has been changed yet or not, so
I'll wait on that.
Note that AttrParse{r=>}State needs renaming, and Stack will get a better
name down the line too. This commit message is accurate, but confusing.
This performs the long-awaited task of trying to observe, concretely, how to
combine two automata. This has the effect of stitching together the state
machines, such that the union of the two is equivalent to the original
monolith.
The next step will be to abstract this away.
There are some important things to note here. First, this introduces a new
"dead" state concept, where here a dead state is defined as an _accepting_
state that has no state transitions for the given input token. This is more
strict than a dead state as defined in, for example, the Dragon Book, where
backtracking may occur.
The reason I chose for a Dead state to be accepting is simple: it represents
a lookahead situation. It says, "I don't know what this token is, but I've
done my job, so it may be useful in a parent context". The "I've done my
job" part is only applicable in an accepting state.
If the parser is _not_ in an accepting state, then an unknown token is
simply an error; we should _not_ try to backtrack or anything of the sort,
because we want only a single token of lookahead.
The reason this was done is because it's otherwise difficult to compose the
two parsers without requiring that AttrEnd exist in every XIR stream; this
has always been an awkward delimiter that was introduced to make the parser
LL(0), but I tried to compromise by saying that it was optional. Of course,
I knew that decision caused awkward inconsistencies, I had just hoped that
those inconsistencies wouldn't manifest in practical issues.
Well, now it did, and the benefits of AttrEnd that we had in the previous
construction do not exist in this one. Consequently, it makes more sense to
simply go from LL(0) to LL(1), which makes AttrEnd unnecessary, and a future
commit will remove it entirely.
All of this information will be documented, but I want to get further in
the implementation first to make sure I don't change course again and
therefore waste my time on docs.
DEV-11268
This removes XIRT support for attribute fragments. The reason is that
because this is a write-only operation---fragments are used to concatenate
SymbolIds without reallocation, which can only happen if we are generating
XIR internally.
Given that this cannot happen during read, it was a mistake to complicate
the parsers. But it makes sense why I did originally, given that the XIRT
parser was written for simplifying test cases. But now that we want parsers
for real, and are writing production-quality parsers, this extra complexity
is very undesirable.
As a bonus, we also avoid any potential for heap allocations related to
attributes. Granted, they didn't _really_ exist to begin with, but it was
part of XIRT, and was ugly.
DEV-11268
To maintain a proper abstraction, this cannot be the responsibility of the
caller; most callers should not know that fragments exist, letalone how to
handle them.
Like previous commits, this replaces the explicit escaping context with the
convention that all values retrieved from `xir` are unescaped on read and
escaped on write.
Comments are a notable TODO, since we must escape only `--`.
CData is also an issue. I had _expected_ to use it as a means to avoid
unescaping fragments, but I had forgotten that quick_xml hard-codes escaping
on read, so that it can re-use BytesStart! That is terribly unfortunate,
and may result in us having to re-implement our own read method in the
future to avoid this nonsense. So I'm just leaving it as a TODO for now.
DEV-11081
This rewrites a good portion of the previous commit.
Rather than explicitly storing whether a given string has been escaped, we
can instead assume that all SymbolIds leaving or entering XIR are unescaped,
because there is no reason for any other part of the system to deal with
such details of XML documents.
Given that, we need only unescape on read and escape on write. This is
customary, so why didn't I do that to begin with?
The previous commit outlines the reason, mainly being an optimization for
the echo writer that is upcoming. However, this solution will end up being
better---it's not implemented yet, but we can have a caching layer, such
that the Escaper records a mapping between escaped and unescaped SymbolIds
to avoid work the next time around. If we share the Escaper between _all_
readers and the writer, the result is that
1. Duplicate strings between source files and object files (many of which
are read by both the linker and compiler) avoid re-unescaping; and
2. Writers can use this cache to avoid re-escaping when we've already seen
the escaped variant of the string during read.
The alternative would be a global cache, like the internment system, but I
did not find that to be appropriate here, since this is far less
fundamental and is much easier to compose.
DEV-11081
I'm not fond of this implementation, which is why it's not fully
completed. I wanted to commit this for future reference, and take the
opportunity to explain why I don't like it.
First: this task started as an idea to implement a third variant to
AttrValue and friends that indicates that a value is fixed, in the sense of
a fixed-point function: escaped or unescaped, its value is the same. This
would allow us to skip wasteful escape/unescape operations.
In doing so, it became obvious that there's no need to leak this information
through the API, and indeed, no part of the system should care. When we
read XML, it should be unescaped, and when we write, it should be
escaped. The reason that this didn't quite happen to begin with was an
optimization: I'll be creating an echo writer in place of the current
filesystem-based copy in tamec shortly, and this would allow streaming XIR
directly from the reader to the writer without any unescaping or
re-escaping.
When we unescape, we know the value that it came from, so we could simply
store both symbols---they're 32-bit, so it results in a nicely compressed
64-bit value, so it's essentially cost-free, as long as we accept the
expense of internment. This is `XirString`. Then, when we want to escape
or unescape, we first check to see whether a symbol already exists and, if
so, use it.
While this works well for echoing streams, it won't work all that well in
practice: the unescaped SymbolId will be taken and the XirString discarded,
since nothing after XIR should be coupled with it. Then, when we later
construct a XIR stream for writting, XirString will no longer be available
and our previously known escape is lost, so the writer will have to
re-escape.
Further, if we look at XirString's generic for the XirStringEscaper---it
uses phantom, which hints that maybe it's not in the best place. Indeed,
I've already acknowledged that only a reader unescapes and only a writer
escapes, and that the rest of the system works with normal (unescaped)
values, so only readers and writers should be part of this process. I also
already acknowledged that XirString would be lost and only the unescaped
SymbolId would be used.
So what's the point of XirString, then, if it won't be a useful optimization
beyond the temporary echo writer?
Instead, we can take the XirStringWriter and implement two caches on that:
mapping SymbolId from escaped->unescaped and vice-versa. These can be
simple vectors, since SymbolId is a 32-bit value we will not have much
wasted space for symbols that never get read or written. We could even
optimize for preinterned symbols using markers, though I'll probably not do
so, and I'll explain why later.
If we do _that_, we get even _better_ optimizations through caching that
_will_ apply in the general case (so, not just for echo), and we're able to
ditch XirString entirely and simply use a SymbolId. This makes for a much
more friendly API that isn't leaking implementation details, though it
_does_ put an onus on the caller to pass the encoder to both the reader and
the writer, _if_ it wants to take advantage of a cache. But that burden is
not significant (and is, again, optional if we don't want it).
So, that'll be the next step.
See the previous commit. There is no sense in some common "IR" namespace,
since those IRs should live close to whatever system whose data they
represent.
In the case of these, they are general IRs that can apply to many different
parts of the system. If that proves to be a false statement, they'll be
moved.
DEV-10863
The IRs really ought to live where they are owned, especially given that
"IR" is so generic that it makes no sense for there to be a single location
for them; they're just data structures coupled with different phases of
compilation.
This will be renamed next commit; see that for details.
This also removes some documentation describing the lowering process,
because it's undergone a number of changes and needs to be accurately
re-summarized in another location. That will come at a later time after the
work is further along so that I don't have to keep spending the time
rewriting it.
DEV-10863
This allows AttrList not only to be lazily initialized (which is less of a
problem at the moment with Vec, but may become one in the future), but also
leaves a space open for attributes to be added _after_ having been
parsed. It further leaves room to _take_ attributes from their `Element`.
This is important because the next commit will re-introduce the ability to
parse attributes independently, allowing us to put the parser in a state
where we can parse AttrList without an Element context. To re-use that
parsing under an Element context, we can simply attach an AttrList after it
has been parsed.
Option adds no additional size cost to Vec, so we get this for free (except
for the tiny change that initializes the attribute list when we try to push
to it).
I also think this reads better ("attrs: None"). Though it makes the API
slightly more of a pain to work with.
DEV-10863
In particular, `name` needn't return an `Option`. `fragment` also returns a
copy, since it's just a `SymbolId`. (It really ought to be a newtype rather
than an alias, but we'll worry about that some other time.)
These changes allow us to remove some runtime panics.
DEV-10859
This moves the logic that sorts identifiers into sections into Sections
itself, and introduces XmleSections to allow for mocking for testing.
This then allows us to narrow the types significantly, eliminating some
runtime checks. The types can be narrowed further, but I'll be limiting the
work I'll be doing now; this'll be inevitably addressed as we use the ASG
for the compiler.
This also handles moving Sections tests, which was a TODO from the previous
commit.
DEV-10859
xmle sections will only ever contain an object of one type, so there is no
use in making this generic.
I think the original plan was to have this represent, generically, sections
of some object file (like ELF), but doing so would require a significant
redesign anyway, so it makes no sense. This is easier to reason about.
DEV-10859