This does two things:
1. Removes callback; it didn't add anything of practical value.
The operation will simply be performed as long as no error is provided
by the callee.
2. Consolidates three arguments into `ProposedRel` (sketched below). This
makes blocks in `object_rel!` less verbose and boilerplate-y.
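A rough sketch of the consolidation (the `ProposedRel` name is from this
commit, but the field set and types below are assumptions):
```
use std::marker::PhantomData;

struct Span; // stub
struct ObjectIndex<O>(usize, PhantomData<O>); // stub

// One value in place of three separate arguments threaded through
// every `object_rel!` block:
struct ProposedRel<FromObj, ToObj> {
    from_oi: ObjectIndex<FromObj>,
    to_oi: ObjectIndex<ToObj>,
    ctx_span: Span,
}
```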
I'll probably implement `TplShape::Unknown` via the dynamic `Ident` `Tpl`
edge before continuing with any cleanup. This is getting pretty close to
reasonable for future implementations.
DEV-13163
This helps to remove some boilerplate. Testing this out in
`asg::graph::object::tpl` before applying it to other things; really `Map`
can just go away entirely then since it can be implemented in terms of
`TryMap`, but maybe it should stick around for manual impls (implementing
`TryMap` manually is more work).
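To illustrate the relationship---trait shapes here are assumptions, not
TAMER's actual signatures---`Map` falls out of `TryMap` by pinning the
error type to `Infallible`:
```
use std::convert::Infallible;

// Fallible mapping: the one operation a manual impl must provide.
trait TryMap<T>: Sized {
    fn try_fmap<E, F: FnOnce(T) -> Result<T, E>>(self, f: F) -> Result<Self, E>;
}

// Infallible mapping, derived for free from `TryMap`.
trait Map<T>: TryMap<T> {
    fn fmap<F: FnOnce(T) -> T>(self, f: F) -> Self {
        // Pin the error type to `Infallible` so the `Err` arm is
        // statically uninhabited.
        let result: Result<Self, Infallible> = self.try_fmap(|x| Ok(f(x)));
        match result {
            Ok(v) => v,
            Err(e) => match e {}, // no `Infallible` values exist
        }
    }
}

impl<T> TryMap<T> for Option<T> {
    fn try_fmap<E, F: FnOnce(T) -> Result<T, E>>(self, f: F) -> Result<Self, E> {
        self.map(f).transpose()
    }
}
impl<T> Map<T> for Option<T> {} // `fmap` comes from the default impl

fn main() {
    assert_eq!(Some(2).fmap(|n| n * 2), Some(4));
}
```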
DEV-13163
At least not how most people expect functors to be. I'm really just using
this as a map with powerful inference properties that make writing code more
pleasant.
And I need fallible methods now too.
DEV-13163
Things are starting to get interesting, and this shows how caching
information about template shape (rather than having to query the graph any
time we want to discover it) makes it easy to compose shapes.
This does not yet handle the unknown case. Before I do that, I'll want to
do some refactoring to address duplication in the `tpl` module.
DEV-13163
This enforces the new constraint that templates expanding into an `Expr`
context must only inline a single `Expr`.
Perhaps in the future we'll support explicit splicing, like `,@` in
Lisp. But this new restriction is intended for two purposes:
- To make templates more predictable (if you have a list of expressions
inlined then they will act differently depending on the type of
expression that they are inlined into, which means that more defensive
programming would otherwise be required); and
- To make expansion easier, since we're going to have to set aside an
expansion workspace ahead of time to ensure ordering (Petgraph can't
replace edges in-place). If we support multi-expansion, we'd have to
handle associativity in all expression contexts.
This'll become more clear in future commits.
It's nice to see all this hard work coming together now, though; it's easy
now to perform static analysis on the system, and any part of the graph
construction can throw errors with rich diagnostic information and still
recover properly. And, importantly, the system enforces its own state, and
the compiler helps us with that (the previous commits).
DEV-13163
This formalizes the previous commit a bit more and adds documentation
explaining why it exists and how it works. Look there for more
information.
This has been a lot of setup work. Hopefully things are now easier in the
future. And now we have nice declarative type-level hooks into the graph!
DEV-13163
This change is the first to utilize matching on edges to determine the state
of the template (to begin to derive its shape).
But this is notable for my finally caving on `min_specialization`.
The commit contains a bunch of rationale for why I introduced it. I've been
sitting on trying it for _years_. I had hoped for further progress in
determining a stabilization path, but that doesn't seem to be happening.
The reason I caved is because _not_ using it is a significant barrier to
utilizing robust types in various scenarios. I've been having to work
around that with significant efforts to write boilerplate code to match on
types and branch to various static paths accordingly. It makes it really
expensive to make certain types of changes, and it makes the code really
difficult to understand once you start to peel back abstractions that try to
hide it.
I'll see how this goes and, if it goes well, begin to replace old methods
with specialization.
See the next commit for some cleanup. I purposefully left this a bit of a
mess (at the bottom of `asg::graph::object::tpl`) to emphasize what I'm
doing and why I introduced it.
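For reference, the pattern that `min_specialization` (with `default fn`)
enables, shown with a contrived trait rather than the actual code in
`asg::graph::object::tpl`:
```
#![feature(min_specialization)] // nightly-only

trait OnEdge {
    fn on_edge(&self);
}

// Blanket default for every type...
impl<T> OnEdge for T {
    default fn on_edge(&self) {}
}

struct Tpl;

// ...statically specialized for `Tpl`, with no converting of types to
// runtime values and branching to static paths by hand.
impl OnEdge for Tpl {
    fn on_edge(&self) { /* e.g. template-shape bookkeeping */ }
}

fn main() {
    Tpl.on_edge(); // resolves to the specialized impl at compile time
}
```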
DEV-13163
This allows for a declarative matching on edge targets using the trait
system, rather than having to convert the type to a runtime value to match
on (which doesn't make a whole lot of sense).
See a commit to follow shortly (with Tpl) for an example use case.
DEV-13163
Since we're statically invoking a particular ObjectKind's method, we already
know the source type. Let's pre-narrow it for their (my) convenience.
DEV-13163
There's a lot to say about this; it's been a bit of a struggle figuring out
what I wanted to do here.
First: this allows objects to use `AsgObjectMut` to control whether an edge
is permitted to be added, or to cache information about an edge that is
about to be added. But no object does that yet; it just uses the default
trait implementation, and so this _does not change any current
behavior_. It also is approximately equivalent cycle-count-wise, according
to Valgrind (within ~100 cycles out of hundreds of millions on large package
tests).
Adding edges to the graph is still infallible _after having received
permission_ from an `ObjectIndexRelTo`, but the object is free to reject the
edge with an `AsgError`.
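As a hedged sketch of that shape (the hook's actual signature isn't shown
here; the method name and stub types below are assumptions):
```
struct ProposedRel; // stub: source/target indices, edge context, etc.
struct AsgError;    // stub

trait AsgObjectMut {
    /// Invoked before an edge is added; the object may veto the edge
    /// or cache information about it.
    fn pre_add_edge(&mut self, _rel: &ProposedRel) -> Result<(), AsgError> {
        // Default: permit the edge unconditionally, preserving the
        // current behavior for objects that don't opt in.
        Ok(())
    }
}
```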
As an example of where this will be useful: the template system needs to
keep track of what is in the body of a template as it is defined. But the
`TplAirAggregate` parser is sidelined while expressions in the body are
parsed, and edges are added to a dynamic source using
`ObjectIndexRelTo`. Consequently, we cannot rely on a static API to cache
information; we have to be able to react dynamically. This will allow `Tpl`
objects to know any time edges are added and, therefore, determine their
shape as the graph is being built, rather than having to traverse the tree
after encountering a close.
(I _could_ change this, but `ObjectIndexRelTo` removes a significant amount
of complexity for the caller, so I'd rather not.)
I did explore other options. I rejected the first one, then rejected this
one, then rejected the first one again before returning back to this one
after having previously sidelined the entire thing, because of the above
example. The core point is: I need confidence that the graph isn't being
changed in ways that I forgot about, and because of the complexity of the
system and the heavy refactoring that I do, I need the compiler's help;
otherwise I risk introducing subtle bugs as objects get out of sync with the
actual state of the graph.
(I wish the graph supported these things directly, but that's a project well
outside the scope of my TAMER work. So I have to make do, as I have been
all this time, by layering atop of Petgraph.)
(...I'm beginning to ramble.)
(...beginning?)
Anyway: my other rejected idea was to provide attestation via the
`ObjectIndex` APIs to force callers to go through those APIs to add an edge
to the graph; it would use sealed objects that are inaccessible to any
modules other than the objects, and assert that the caller is able to
provide a zero-sized object of that sealed type.
The problem with this is...exactly what was mentioned above:
`ObjectIndexRelTo` is dynamic. We don't always know the source object type
statically, and so we cannot make those static assertions.
I could have tried the same tricks to store attestation at some other time,
but what a confusing mess it would be.
And so here we are.
Most of this work is cleaning up the callers---adding edges is now fallible,
from the `ObjectIndex` API standpoint, and so AIR needed to be set up to
handle those failures. There _aren't_ any failures yet, but again, since
things are dynamic, they could appear at any moment. Furthermore, since
ref/def is commutative (things can be defined and referenced in any order),
there could be surprise errors on edge additions in places that might not
otherwise expect it in the future. We're now ready for that, and I'll be
able to e.g. traverse incoming edges on a `Missing->Transparent` definition
to notify dependents.
This project is going to be the end of me. As interesting as it is.
I can see why Rust just chose to require macro definitions _before_ use. So
much less work.
DEV-13163
AIR is no longer able to explicitly add edges without going through an
object-specific `ObjectIndex` API. `Asg::add_edge` was already private, but
`ObjectIndex::add_edge_{to,from}` was not.
The problem is that I want to augment the graph with other invariants, such
as caches. I'd normally have this built into the graph system itself, but I
don't have the time for the engineering effort to extend or replace
Petgraph, so I'm going to build atop of it.
To have confidence in any sort of caching, I need assurances that the graph
can't change out from underneath an object. This gets _close_ to
accomplishing that, but I'm still uncomfortable:
- We're one `pub` addition away from breaking these invariants; and
- Other `Object` types can still manipulate one another's edges.
So this is a first step that at least proves encapsulation within
`asg::graph`, but ideally we'd have the system enforce, statically, that
`Objects` own their _outgoing_ edges, and no other `Object` is able to
manipulate them. This would ensure that any accidental future changes, or
bugs, will cause compilation failures rather than e.g. allowing caches to
get out of sync with the graph.
DEV-13163
The fixpoint tests for `meta-interp` are finally working. I could have
broken this up more, but I'm exhausted with this process, so, you get what
you get.
NIR will now recognize basic `<text>` and `<param-value>` nodes (note the
caveat for `<text>` in the comment, for now), and I finally include abstract
binding in the lowering pipeline. `xmli` output is also now able to cope
with metavariables with a single lexical association, and continues to
become more of a mess.
DEV-13163
This is a really obvious problem in retrospect, which makes me feel rather
silly.
The output was useful, but I don't have time to deal with this any further
right now. The comments in the commit explain the problem---that the output
ends up being interpolated as part of the fixpoint test, in an incorrect
context, and so the code that we generate is invalid. Also goes to show why
the fixpoint tests are important.
(Yes, they're still disabled for meta-interp, I'm trying to get them
enabled.)
DEV-13163
The provided documentation provides rationale, and the use case is the
ontree change. I was uncomfortable without the exhaustive match, and I was
further annoyed by the lack of easy `ObjectIndex` narrowing.
DEV-13163
This introduces the ability to specify an edge ordering for the ontological
tree traversal. `tree_reconstruction` will now use a
`SourceCompatibleTreeEdgeOrder`, which will traverse the graph in an order
that will result in a properly ordered source reconstruction. This is
needed for template headers, because interpolation causes
metavariables (exposed as template params) to be mixed into the body.
There's a lot of information here, including some TODOs on possible
improvements. I used the unstable `is_sorted` to output how many templates
were already sorted, based on one of our very large packages internally that
uses templates extensively, and found that none of the desugared shorthand
template expansions were already ordered. If I tweak that a bit, then
nearly all templates will already be ordered, reducing the work that needs
to be done, leaving only template definitions with interpolation to be
concerned about, which is infrequent relative to everything else.
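The measurement itself is trivial with that unstable API; for example,
given some hypothetical per-edge ordering keys:
```
#![feature(is_sorted)] // unstable at the time of writing

fn main() {
    // Hypothetical ordering keys for a template's param edges: if
    // they're already in header order, no reordering work is needed.
    let param_edge_order = [3usize, 1, 2];
    println!("already sorted: {}", param_edge_order.is_sorted());
}
```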
DEV-13163
Well, this is both good news and bad news.
The good news is that this finally produces the expected output and
reconstructs sources from interpolated values on the ASG. Yay!
...the bad news is that it's wrong. Notice how the fixpoint test is
disabled.
So, my plan was originally to commit it like this first and see if I was
comfortable relaxing the convention that `<param>` nodes had to appear in
the header. That's nice to do, that's cleaner to do, but would the
XSLT-based compiler really care? I had to investigate.
Well, turns out that TAMER does care. Because, well over a decade ago, I
re-used `<param>`, which could represent not only a template param, but also
a global param, or a function param.
So, XML->NIR considers all `<param>` nodes at the head of a template to be
template parameters. But after the first non-header element, we transition
to another state that allows it to be pretty much anything.
And so, I can't relax that restriction.
And because of that, I can't just stream the tree to the xmli generator,
I'll have to queue up nodes and order them.
Oh well, I tried.
DEV-13163
I'm not sure how I overlooked this previously, and I didn't notice until
trying to generate xmli output. I think I distracted myself with the
use of dangling status, which was not appropriate, and that has since
changed so that we have a dedicated concept.
This introduces the term "instantiation", or more specifically "lexical
instantiation". This is more specific and meaningful than simply
"expansion", which is what occurs during instantiation. I'll try to adjust
terminology and make things more consistent as I go.
DEV-13163
This logic ought to live alongside other definition logic...which in turn
needs its own extraction, but that's a separate concern.
This makes the definition of abstract identifiers very similar to
concrete. But, treating these as dangling, even if that's technically true,
has to change---we still want an edge drawn to the abstract identifier via
e.g. a template since we want the graph to mirror the structure of what it
will expand into concretely. I didn't notice this problem until trying to
generate the xmli for it.
So, see the commit to follow.
DEV-13163
This handles the common cases for meta, which includes what interpolation
desugars into. Most of this work was in testing and reasoning about the
issue; `asg::graph::visit::ontree::test` has a good summary of the structure
of the graph that results.
The last remaining steps to make this work end-to-end is for NIR->AIR to
lower `Nir::Ref` into `Air::BindIdent`, and then for `asg::graph::xmli` to
reconstruct concatenation lists. I'll then be able to commit the xmli test
case I've been sitting on, whose errors have been guiding my development.
DEV-13163
The term "metasyntactic" made sense literally---it's a variable in a
metalanguage that expands into a context that is able to contribute to the
language's syntax. But, the term has a different conventional use in
programming that is misleading.
The term "metalinguistic" is used in mathematics, to describe a metalanguage
or schema atop of a language. This is more fitting.
DEV-13163
This makes `SPair` construction more concise, getting rid of the `into`
invocations. For now I have only made this change in AIR's tests, since
that's what I'm working on and I want to observe how this convention
evolves. This may also encourage other changes, e.g. placing spans within
the `toks` array, rather than having to jump around the test for them.
The comment for `spair` mentions why this is a test-only function. But it
also shows how dangerous `impl Into<SymbolId> for &str` can be, since it
seems so innocuous---it uses a global interner. I'll be interested to see a
year from now if I decided to forego that impl in favor of explicit
internment, since I'm not sure it's worth the convenience anymore.
DEV-13163
This has been bothering me for quite a long time, and is just more test
cleanup before I introduce more. I suspect this came from habit with the
previous Rust edition where `into_iter()` on arrays was a much more verbose
operation.
To be clear: this change isn't for performance. It's about not doing
something silly when it's unnecessary, which also sets a bad example for
others.
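Concretely, the silliness in question, before and after:
```
fn main() {
    // Old habit: allocate a Vec just to get an owning iterator, which
    // was needed before the 2021 edition changed `IntoIterator` for
    // arrays.
    let sum_old: i32 = vec![1, 2, 3].into_iter().sum();

    // Arrays are now `IntoIterator` by value, so the allocation is
    // pure waste.
    let sum_new: i32 = [1, 2, 3].into_iter().sum();

    assert_eq!(sum_old, sum_new);
}
```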
There are many other tests in other modules that will need updating at some
point.
DEV-13163
This produces a representation of abstract identifiers on the graph, for
`Expr`s at least. The next step will probably be to get this working
end-to-end in the xmli output before extending it to the other remaining
bindable contexts.
DEV-13163
This enforces the naming convention that is utilized to infer whether an
identifier binding must be translated to an abstract binding.
This does not yet place any restrictions on other characters in identifier
names; both the placement of and flexibility of that has yet to be
decided. This change is sufficient enough to make abstract binding
translation reliable.
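As an illustration only---the function name and exact rule are assumptions,
keyed off the `@name@` metavariable style seen elsewhere in this log---the
inference amounts to something like:
```
// Hypothetical: a name following the metavariable naming convention
// must be translated into an abstract binding.
fn is_meta_name(name: &str) -> bool {
    name.len() > 2 && name.starts_with('@') && name.ends_with('@')
}

fn main() {
    assert!(is_meta_name("@foo@")); // translate to abstract binding
    assert!(!is_meta_name("foo"));  // remains a concrete identifier
}
```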
DEV-13163
The previous commit made me uncomfortable; we're already parsing with great
precision (and effort!) the grammar of NIR, and know for certain whether
we're in a metavariable binding context, so it makes no sense to have to try
to guess at another point in the lowering pipeline.
This introduces a new token to retain that information from XIR->NIR
lowering and then re-simplifies the lowering operation that was just
introduced in the previous commit (`AbstractBindTranslate`).
DEV-13163
This builds upon the concepts of the previous commit to translate identifier
binding into an abstract binding if it utilizes a symbol that follows a
metavariable naming convention.
See the provided documentation for more information.
This commit _does not_ integrate this into the lowering pipeline yet, since
the abstract identifiers are still rejected (as TODOs) by AIR.
DEV-13163
This introduces the notion of an abstract identifier, where the previous
identifiers are concrete. This serves as a compromise to either introducing
a new object type (another `Ident`), or having every `Ident` name be defined
by a `Meta` edge, which would bloat the graph significantly.
This change causes interpolation within a bind context to desugar into a new
`BindIdentAbstract` token, but AIR will throw an error if it encounters it
for now; that implementation will come soon.
This does not yet handle non-interpolation cases,
e.g. `<classify as="@foo@">`. This is a well-established shorthand for
`as="{@foo@}"`, but is unfortunately ambiguous in the context of
metavariable definitions (template parameters). This language ambiguity
will have to be handled here, and will have to fall back to today's behavior
of assuming concrete in that `param/@name` context but abstract everywhere
else, unless of course interpolation is triggered using `{}` to
disambiguate (as in `<param name="{@foo@}">`).
I was going to handle the short-hand meta binding case as part of
interpolation, but I decided it may be appropriate for its own lowering
operation, since it is intended to work regardless of whether interpolation
takes place; it's a _translation_ of a binding into an abstract one, and it
can clearly delineate the awkward syntactic rules that we have to inherit,
as mentioned above.
DEV-13163
This prepares to make the name of an `Ident` optional to support abstract
identifiers derived from metavariables.
This is an unfortunate change to have to prepare for, since it complicates
how Idents are interpreted, but the alternative (a new object type) is not
good either. We'll see how this evolves.
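An illustration of the compromise (the shapes below are assumed, not the
actual definition):
```
struct SPair; // stub: interned symbol + span pair
struct Span;  // stub

struct Ident {
    // `None` for an abstract identifier whose concrete name will be
    // derived from a metavariable during template expansion.
    name: Option<SPair>,
    at: Span,
}
```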
DEV-13163
This is intended to support NIR's lexical interpolation, which expands in
place into metavariables.
This commit does not yet contain the NIR portion (or xmli system test)
because Meta needs to be able to handle concatenation first; that's next.
DEV-13163
This introduces template/param and regenerates it in the xmli output. Note
that this does not check that applications reference known params; that's a
later phase.
DEV-13163
The previous commit introduced support for a `.experimental` file to trigger
`xmlo-experimental`. This modifies the error message for unsupported
features to make mention of it to help the user track down the problem.
DEV-13162
If a source file is paired with a `.experimental` file (for example,
`foo.xml` has a sibling `foo.experimental` file), then it will be
precompiled using `--emit xmlo-experimental` instead of `--emit
xmlo`. Further, the contents of the experimental file may contain
`#`-prefixed comments describing why it exists, as well as additional
options to pass to `tamec`.
For example, if this is an experimental file:
```
--foo
--bar=baz
```
Then the tamec invocation will contain:
tamec [...] --emit xmlo-experimental --foo --bar=baz -o foo.xmli
This allows for package-level conditional compilation with new features so
that I am able to focus on packages that will provide the most meaningful
benefits to our team, whether they be performance or features.
DEV-13162
Now that the feature flag for the parser is a command line option, it is
useful to be able to run it on any package and see what errors arise, to use
as a guide for development with the goal of getting a particular package to
compile.
This converts the TODO panic into a recoverable error so that the parser can
spit out as many errors as it can.
DEV-13162
This provides some initial information to help guide a user to discover how
TAMER works, through either the source code or the generated
documentation. This will improve over time, since all of the high-level
abstractions are still under development.
DEV-13162
This introduces `xmlo-experimental` for `--emit`, allowing the new parser to
be toggled selectively for individual packages. This has a few notable
benefits:
1. We'll be able to conditionally compile packages as they are
supported (TAMER will target specific packages in our system to try to
achieve certain results more quickly);
2. This cleans up the code a bit by removing awkward gated logic, allowing
natural abstractions to form; and
3. Removing the compile-time feature flag ensures that the new features
are always built and tested; there are fewer configuration combinations
to test.
DEV-13162
This flag should have never been sprinkled here; it makes the system much
harder to understand.
But, this is working toward a command-line tamec option to toggle NIR
lowering on/off for various packages.
DEV-13162
This was a significant undertaking, with a few thrown-out approaches. The
documentation describes what approach was taken, but I'd also like to
provide some insight into the approaches that I rejected for various
reasons, or because they simply didn't work.
The problem that this commit tries to solve is encapsulation of error
types.
Prior to the introduction of the lowering pipeline macro
`lower_pipeline!`, all pipelines were written by hand using `Lower` and
specifying the applicable types. This included creating sum types to
accommodate each of the errors so that `Lower` could widen automatically.
The introduction of the `lower_pipeline!` macro resolved the boilerplate and
type complexity concerns for the parsers by allowing the pipeline to be
concisely declared. However, it still accepted an error sum type `ER` for
recoverable errors, which meant that we had to break a level of
encapsulation, peering into the pipeline to know both what parsers were in
play and what their error types were.
These error sum types were also the source of a lot of tedious boilerplate
that made adding new parsers to the pipeline unnecessarily unpleasant;
the purpose of the macro is to make composition both easy and clear, and
error types were undermining it.
Another benefit of sum types per pipeline is that callers need only
aggregate those pipeline types, if they care about them, rather than every
error type used as a component of the pipeline.
So, this commit generates the error types. Doing so was non-trivial.
Associated Types and Lifetimes
------------------------------
Error types are associated with their `ParseState` as
`ParseState::Error`. As described in this commit, TAMER's approach to
errors is that they never contain non-static lifetimes; interning and
copying are used to that effect. And, indeed, no errors in TAMER have
lifetimes.
But, some `ParseState`s may. In this case, `AsgTreeToXirf`:
```
impl<'a> ParseState for AsgTreeToXirf<'a> {
// [...]
type Error = AsgTreeToXirfError;
// [...]
}
```
Even though `AsgTreeToXirfError` does not have a lifetime, the `ParseState`
it is associated with _does_. So to reference that type, we must use
`<AsgTreeToXirf<'a> as ParseState>::Error`. So if we have a sum type:
```
enum Sum<'a> {
// ^^ oh no! vv
AsgTreeToXirfError(<AsgTreeToXirf<'a> as ParseState>::Error),
}
```
There's no way, at the time of writing, to elide or make anonymous that
lifetime, even though it's unused. `for<'a>` also cannot be used in this
context.
The solution in this commit is to use a macro (`lower_error_sum`) to rewrite
lifetimes to `'static`:
```
enum Sum {
AsgTreeToXirfError(<AsgTreeToXirf<'static> as ParseState>::Error),
}
```
The `Error` associated type will resolve to `AsgTreeToXirfError` all the
same either way, since it has no lifetimes of its own, let alone any
referencing trait bounds.
That's not to say that we _couldn't_ support lifetimes as long as they're
attached to context, but we have no need to at the moment, and it adds
significant cognitive overhead. Further, the diagnostic system doesn't deal
in lifetimes, and so would need reworking as well. Not worth it.
An alternative solution to this that was rejected is an explicit `Error`
type in the macro application:
```
// in the lowering pipeline
|> AsgTreeToXirf<'a> { // lifetime
type Error = AsgTreeToXirfError; // no lifetime
}
```
But this requires peeling back the `ParseState` to see what its error is and
_duplicate_ it here. Silly, and it breaks encapsulation, since the lowering
pipeline is supposed to return its own error type.
Yet another option considered was to standardize a submodule convention
whereby each `ParseState` would have a module exporting `Error`, among other
types. This would decouple it from the parent type. However, we still have
the duplication between that and an associated type. Further, there's no
way to enforce this convention (effectively a module API)---the macro would
just fail in obscure ways, at least with `macro_rules!`. It would have been
an ugly kluge.
Overlapping Error Types
-----------------------
Another concern with generating the sum type, resolved in a previous commit,
was overlapping error types, which prohibited `impl From<E> for ER`
generation.
The problem was that a number of `ParseState`s used `Infallible` as their
`Error` type. This was resolved in a previous commit by creating
Infallible-like newtypes (variantless enums).
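A sketch of that resolution, with hypothetical names:
```
// Unlike `Infallible` itself, a variantless enum is unique to its
// ParseState, so the generated `From` impls can never overlap.
#[derive(Debug)]
enum XmloReaderError {} // uninhabited, just like Infallible

#[derive(Debug)]
enum PipelineError {
    XmloReader(XmloReaderError),
    // ...one variant per ParseState in the pipeline
}

impl From<XmloReaderError> for PipelineError {
    fn from(e: XmloReaderError) -> Self {
        match e {} // no values exist; purely for type-level widening
    }
}
```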
This was not the only option. `From` fits naturally into how TAMER handles
sum types, and fits naturally into `Lower`'s `WidenedError`. The
alternative is generating explicit `map_err`s in `lower_pipeline!`. This
would have allowed for overlapping error types because the _caller_ knows
what the correct target variant is in the sum type.
The problem with an explicit `map_err` is that it places more power in
`lower_pipeline!`, which is _supposed_ to be a macro that simply removes
boilerplate; it's not supposed to increase expressiveness. It's also not
fun dealing with complexity in macros; they're much more confusing than
normal code.
With the decided-upon approach (newtypes + `From`), hand-written `Lower`
pipelines are just as expressive---just more verbose---as `lower_pipeline!`,
and widening is handled for you. Rust's type system will also handle the
complexity of widening automatically for us without us having to reason
about it in the macro. This is not always desirable, but in this case, I
feel that it is.
This configures the pipeline and returns a closure that can then be provided
with the source and sink.
The next obvious step would be to curry the source and sink.
But I wanted to commit this before I take a different (but equivalent)
approach that makes the pipeline operations more explicit and helps to guide
the user (developer) in developing and composing them. The FP approach is
less boilerplate, but is also more general and provides less
guidance. Composition at the topmost levels of the system, especially with
all the types involved, is one of the most confusing aspects of the
system---and one of the most important to get right and make clear, since
it's intended to elucidate the entire system at a high level and guide the
reader. Well, it does a poor job at that now, but that's the ultimate
goal.
In essence---brutally general abstractions make sense at lower levels, but
the complexity at higher levels benefits from rigid guardrails, even if it
does not strictly necessitate them.
DEV-13162
This cleanup is an interesting one, because I think the present me may
disagree with the past me.
The use of generics here to compose the parser from smaller parsers was due
to how I wrote my object-oriented code in other languages, where a class was
an independently tested unit. I was trying to reproduce the same here,
utilizing generics in the same way that one would use composition via
object constructors in other languages.
But it's been a long time since then, and I've come to settle on different
standards in Rust. The components of `XmloReader` really are just
implementation details. As I find myself about to want to modify its
behavior, I don't _want_ to compose `XmloReader` from _different_ parsers;
that may result in an invalid parse. There's one correct way to parse an
xmlo file.
If I want to parse the file differently, then `XmloReader` ought to expose
a way of doing so. This is more rigid, but that rigidity buys us confidence
that the system has been explicitly designed to support those
operations. And that confidence gives us peace of mind knowing that the
system won't compose in ways that we don't intend for it to.
Of course, I _could_ design the system to compose in generic ways. But
that's an over-generalization that I don't think will be helpful; it's not
only a greater cognitive burden, but it's also a lot more work to ensure
that invariants are properly upheld and to design an API that will ensure
that parsing is always correct. It's simply not worth it.
So, this makes `XmloReader` consistent with other parsers now, like
`AirAggregate` and nir::parse (ele_parse). This prepares for a change to
make `XmloReader` configurable to avoid loading fragments from object files,
since that's very wasteful for `tamec`.
DEV-13162
More information will be presented in the commit that follows to generalize
these, but this sets the stage.
The recently-introduced pipeline macro takes care of most of the job of a
declarative pipeline, but it's still leaky, since it requires that the
_caller_ create error sum types. This not only exposes implementation
details and so undermines the goal of making pipelines easy to declare and
compose, but it's also one of the last major components of boilerplate for
the lowering pipeline.
My previous attempts at generating error sum types automatically for
pipelines ran into a problem because of overlapping `impl`s for the various
`<S as ParseState>::Error` types; this resolves that issue via
newtypes. I had considered other approaches, including explicitly
generating code to `map_err` as part of the lowering pipeline, but in the
end this is the easier way to reason about things that also keeps manual
`Lower` pipelines on the same level of expressiveness as the pipeline macro;
I want to restrict its unique capabilities as much as possible to
elimination of boilerplate and nothing more.
DEV-13162
At or around 00492ace01, I modified packages
to output canonical `@name`s, which contain a leading forward
slash. Previously, names omitted that slash. I did not believe that this
caused any problems.
It seems that the XSLT-based `standalones` system utilizes this package name
to derive a supplier name, which is supposed to be the filename of the
package without any path. Since the package name changed from
`suppliers/foo` to `/suppliers/foo`, for example, this was now producing
"suppliers/name" instead of "name".
Of course, it was never a good idea to strip off only the first path
component. But, this is how it has been since TAME was originally created
well over a decade ago.
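To illustrate the failure mode (in Rust for brevity; the real logic is
XSLT):
```
fn main() {
    // What the standalones logic effectively does: strip only the
    // first path component to derive the supplier name.
    fn strip_first_component(name: &str) -> &str {
        match name.split_once('/') {
            Some((_, rest)) => rest,
            None => name,
        }
    }

    // Pre-canonical name: works, but only by accident.
    assert_eq!(strip_first_component("suppliers/foo"), "foo");

    // Canonical name: the first component is now the empty string
    // before the leading slash, so the directory survives.
    assert_eq!(strip_first_component("/suppliers/foo"), "suppliers/foo");
}
```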
I did not catch this since I was diff'ing the output of the xmle files, not
the final JS files. I had thought that was sufficient, given what I was
changing, but I was wrong.
DEV-14502
I had never intended to avoid pinning nightly. This is an unfortunate thing
to have to do---require a _specific_ version of a compiler to build your
software; it's madness. But the unstable features utilized by TAMER (as
rationalized in `src/lib.rs`) are still worth the effort.
It's not _actually_ the case that we need a specific version of the
compiler, granted; this is outlined in `rust-toolchain.toml`'s
rationale. You should look there for more information; my approach still
utilizes explicit channels via cargo. Unfortunately, I had hard-coded it
previously, putting me in a bit of a bind, unable to override the behavior
without modifying the software.
The reason for this change is that `adt_const_params` has a BC break
involving the introduction of `ConstParamTy`. This is only the second time
I've been bitten by a nightly BC break; the other was the renaming of
`int_log`'s API, as mentioned in
709291b107. This pinning will in fact
mitigate those future issues---TAMER will be able to resolve the issue at
its leisure, and will further be able to continue to build earlier commits
in the future by simply re-bootstrapping with the committed nightly
version.
If you're curious about my rationale for wanting to inhibit toolchain
downloading during build, or use system libraries, have a look at GNU Guix's
approach to building software safely and reproducibly. In particular,
dependencies are also built from source (rather than downloading binaries
from external sources), and builds take place in network-isolated
containers. The `TAMER_RUST_TOOLCHAIN` configure parameter is meant to
facilitate these situations by giving more flexibility to packagers.
DEV-14476
The code utilizing this is flagged, and so the build would output warnings
saying that it was not used. This resolves that (I've been aware of it for
far too long; I'm developing behind the `wip-asg-derived-xmli` flag where I
don't usually see it).
DEV-13162
This generates some documentation helping to describe the lowering pipeline,
since the function type signature can be daunting to those unfamiliar with
it (and I'm sure to the future me too).
DEV-13162
Like the previous commit's removal of the error type, this eliminates the
explicit source token type since we're able to infer it from the pipeline
definition.
DEV-13162
It does not matter what the error of the source is as long as the caller is
able to deal with it, especially given that the particular error is a
property of the source, which is under control of the caller.
DEV-13162
The macro is off-putting and more complicated than the pipeline definitions
themselves (of course), so this tucks it away so that readers are able to
more easily observe the definitions that they're probably looking for
without feeling compelled to try to understand the macro definition.
DEV-13162
All lowering pipelines are now using `lower_pipeline!`. Finally.
The macro does require some refactoring and documentation, but it's working,
and we now have three pipelines whose definitions are smaller than a single
one was previously. I've been hoping to do this for many months, so it's
nice to finally see this come to fruition.
I had been putting it off, but doing so has made it difficult to compose
other parts of the system, not knowing what abstractions I'll have at my
disposal.
DEV-13162
This makes the sink similar to other pipelines without creating a new
ParseState, and so will allow for integrating into the `lower_pipeline!`
abstraction.
DEV-13162
This has been the ultimate goal for the pipeline for some time---the ability
to declaratively define the lowering pipeline in a way that is clear,
concise, and is correct by definition.
The reason that the lowering pipeline required so much boilerplate was
because of the robust types involved, which ensure that everything in the
pipeline is compatible with one another---it's not possible to construct a
pipeline that will not work.
Of course, there is nuance involved in some cases---I didn't want to include
the `until` clause, which makes it fail the "obviously correct" criterion,
but that can be improved over time.
This only abstracts away `load_xmlo` and `parse_package_xml`; next I'll have
to evolve the abstraction to support lifetimes for `lower_xmli`'s
`AsgTreeToXirf`. That pipeline also ends with a custom sink that really
ought to become its own parser, but I don't want to jump down that rabbit
hole right now, so we may just support custom sinks for now with the intent
of removing it in the future.
This has been a long time coming. The ultimate goal is that you should be
able to look at the parser pipelines to have a clear, high-level overview of
how everything fits together. I'm not generating documentation yet, but
that'll help serve as a guide as well.
DEV-13162
The report acts as the sink for `load_xmlo` and `parse_package_xml`. At the
moment, the type is `()`, and so there's nothing to report on but the
error. But the idea is to add logging via `AirAggregate::Object`, which is
currently just `()`.
This change therefore is only a refactoring---it changes no functionality
but sets up for future changes.
This also introduces consistency with `lower_xmli` in use of `terminal` for
the final operation.
DEV-13162
Diagnostic events need not be errors. While that was the original intent,
it'd also be nice to be able to use the diagnostic system for any type of
logging, where the verbosity level would determine the type of report that
is output (whether source information should be provided).
Then we could have e.g. AirAggregate produce events describing what actions
are occurring, which could be much more useful than a trace in many
contexts, and would be able to operate via a runtime toggle/filter without
having an adverse effect on performance (since the diagnostic rendering
itself is the hit; the underlying data are cheap).
Anyway---I'm addressing this now to generalize the reporter in the lowering
pipeline, so that it can report on not just errors but anything.
DEV-13162
This formats the pipeline to mirror the style of
`parse_package_xml`. Based on the previous commits, the end goal (though
not necessarily now) will be to derive a concise abstraction for all the
lowering pipelines, which means first factoring them into a common form.
DEV-13162
This makes the API of `load_xmlo` much closer to `parse_package_xml`, both
accepting a reporter and distinguishing between recoverable and
unrecoverable errors.
The linker still does not use a reporter and still fails on the first
error, as before; I wanted to keep this change small.
DEV-13162
This allows us to drop `AirIdent::IdentRef`, which in turn allows dropping
`AirIdent` entirely from `AirPkgAggregate`.
This is also a more appropriate abstraction; having to track all the ways in
which `IdentRef` was used can be confusing. This means that `AirIdent` is
true to its name---used only for identifiers. The new token type makes it
very clear where package imports are recognized, and it's also easier to
search for.
DEV-13162
This is the same idea as the previous two commits: get all the lowering
pipelines into the same place so that we can observe commonalities and
attempt to derive an appropriate abstraction.
`lower_xmli` could have invoked `tree_reconstruction` itself, since it has
all the information that it needs to do so, but the idea is that these will
accept sources from the caller. This also demonstrates that sinks need to
be flexible. In an ideal abstraction, perhaps this would be able to produce
an iterator that accepts the first token type and yields the last, which can
then be directed to a sink, but that's not compatible with how the lowering
operations currently work, which requires a single value to be
returned. But if it did work that way, then they'd be able to compose just
as any other parser.
Maybe for the future.
DEV-13162
The previous commit extracted xmlo loading, because that will be a common
operation between `tamec` and `tameld`. This extracts parsing, which will
only be used by `tamec` for now, though components of the pipeline are similar
to xmlo loading.
Not only does it need to be removed from `tamec` and better abstracted, but
the intent now is to get all of these things into one place so that the
patterns are made obvious and a better abstraction can be created to remove all
of this boilerplate and type complexity.
Furthermore, xmlo loading needs to use reporting and recovery, so having
`parse_package_xml` here will help show how to make that happen easily. I'm
pleased that it ended up being trivial to extract error reporting from the
lowering pipeline as a simple (mutable) callback. I'm not pleased about
the side-effects, but, this works well for now given how the system works
today.
DEV-13162
I want to clean this up a bit further. The motivation is that we need this
for imports in `tamec`.
Eventually this will be cleaned up to the point where it's declarative and
easy to understand---there's a mess of types involved now and, when
something goes wrong, it can be brutally confusing.
DEV-13162
This extracts and decouples the boundary rules from the stack frames
themselves, which not only clarifies what the rules are (and makes them
match the scope diagrams), but paves the way for future isolation.
DEV-13162
This was used for metavariable declaration before scoping was sorted
out. That was just resolved, and so this is no longer needed (and is indeed
not desirable, since it side-steps the scope index and so will not be found
except by `lookup_local_linear`).
DEV-13162
The ASG had its output reduced previously but I had apparently stashed it; I
found it while trying to clean up after so many failed or partial attempts
and the various scoping changes.
The most fundamental issue is that there's too much information: it's very
difficult to interrogate so I seldom look at it, and it slows down Parser
trace output to the point where it's useless on even one of our smallest
systems, generating 1.5GiB of output for a graph of ~10k
objects (via tameld).
DEV-13162
The scope system works with the AIR stack frames, expecting all parent
environments to be on that stack. Since metavariables were (awkwardly) part
of the template parser, that didn't happen.
This change extracts metavariable parsing (with some remaining TODOs) into
its own parser, so that `AirTplAggregate` will be on the stack; then it's a
simple matter of using the existing `AirAggregateCtx` methods to define a
variable and index its shadow scope, which addresses TODOs in the existing
scope test cases.
This also involved separating the tokens from `AirTpl` into `AirMeta`; they
need to be renamed, which will happen in a following commit, since this is
large enough as it is.
Another change that had to be included here, which I would have done
separately were it not too much work, was to permit overlapping
identifier shadows. Local variables have to cast a shadow so that we can
figure out if they would in turn shadow an identifier (which would be an
error), but they don't conflict with one another if they don't have a
shared (visible) scope.
`AirAggregate` can be simplified even further, e.g. to eliminate the
expression stack and just use the ctx stack (which didn't previously exist),
but I need to continue; I'll return to it.
DEV-13162
That was being done automatically before this change, but the change that
I'm about to introduce for metavariables will require this distinction, at
the very least to emphasize the behavior of the indexing.
See the next commit for more information.
(The next commit has a bit too much going on, so I wanted to at least
attempt to separate things where it wasn't much work to do so.)
DEV-13162
The motivating factor here is some out-of-date or corrupted rustc cache;
however, we really ought to be doing fresh builds for TAME anyway, since
they don't add enough time to be worth sacrificing assurances.
This finally removes the awkward index from the ASG. This will need much
more documentation and a better organized abstraction, but in the meantime,
previous commits dive into some of the rationale.
In essence: it only really makes sense to have indexing on the ASG itself if
it is used to cache queries or other expensive operations. But that is not
what we were using it for---it was used for caching _lexical_ properties,
which are useful only during parsing for the sake of forming relationships
on the graph. Once those relationships have formed, different types of
indexes will be useful in different lowering, optimization, or querying
contexts.
This formalizes that, and in doing so, ensures that the index will always
be accurate relative to the content of the ASG. Once the index becomes
separated from it---through the `AirAggregateCtx::finish` operation---then
it is discarded and the ASG exposed.
This is also important because the index is incomplete---it contains only
the information necessary for the parser to carry out its task.
This change was a long time coming, and has reduced ASG to its essence.
DEV-13162
A new AirAggregate parser is utilized for each package import. This
prevents us from moving the index from `Asg` onto `AirAggregateCtx` because
the index would be dropped between each import.
This allows re-using that context and solves the problems that result from
attempting to do so, as explained in the new
`resume_previous_parsing_context` test case.
But, it's now clear that there's a missing abstraction, and that reasoning
about this problem at the topmost level of the compiler/linker in terms of
internal parsing details like "context" is not appropriate. What we're
doing is suspending parsing and resuming it later on for another package,
aggregating into the same destination (ASG + index). An abstraction ought
to be formed in terms of that.
DEV-13162
This was the remaining of my stashed changes that I had mentioned in a
previous commit, but is accomplished differently than I had prototyped. My
initial approach was a bit too klugey: to accept as an argument in various
scope contexts the active parser, as if it were the top stack frame. This
was prototyped before the `AirPkgAggregate` parser was even created.
So we've since created a Pkg parser and now an opaque parser for opaque
idents. There may be other opaque objects in the future.
Because of this change, the parent `AirPkgAggregate` gets stored on the
stack and just naturally becomes part of the lexical scope determination,
and so everything Just Works!
This commit was _supposed_ to be moving the index from `Asg` onto
`AirAggregateCtx`, but I wasn't able to do that because that context is
re-created for each package import currently.
DEV-13162
As evidenced by this change, the tuple syntax was no longer serving us
well. But the real reason for this change is to prepare for the addition of
a fourth field: the index, taken from `Asg`.
DEV-13162
This change means that `asg::air` is now the only module that directly
invokes index-related methods on `Asg`. This clears the way, finally, to
removing the index from `Asg` entirely.
Not only does this result in a less awkward architecture, it also ensures
that lookups are forced to go through the system that understands and
controls lexical scoping, which will be able to give the correct answer.
Of course, the caveat is that the "correct" answer depends on what's
currently on the stack, depending on what type of lookup is being performed,
but those details are still encapsulated within the `asg::air` module and
its tests.
DEV-13162
This is the culmination of a great deal of work over the past few
weeks. Indeed, this change has been prototyped a number of different ways
and has lived in a stash of mine, in one form or another, for a few weeks.
This is not done just yet---I have to finish moving the index out of Asg,
and then clean up a little bit more---but this is a significant
simplification of the system. It was very difficult to reason about prior
approaches, and this finally moves toward doing something that I wasn't sure
if I'd be able to do successfully: formalize scope using AirAggregate's
stack and encapsulate indexing as something that is _supplemental_ to the
graph, rather than an integral component of it.
This _does not yet_ index the AirIdent operation on the package itself
because the active state is not part of the stack; that is one of the
remaining changes I still have stashed. It will be needed shortly for
package imports.
This rationale will have to appear in docs, which I intend to write soon,
but: this means that `Asg` contains _resolved_ data and itself has no
concept of scope. The state of the ASG immediately after parsing _can_ be
used to derive what the scope _must_ be (and indeed that's what
`asg::air::test::scope::derive_scopes_from_asg` does), but once we start
performing optimizations, that will no longer be true in all cases.
This means that lexical scope is a property of parsing, which, well, seems
kind of obvious from its name. But the awkwardness was that, if we consider
scope to be purely a parse-time thing---used only to construct the
relationships on the graph and then be discarded---then how do we query for
information on the graph? We'd have to walk the graph in search of an
identifier, which is slow.
But when do we need to do such a thing? For tests, it doesn't matter if
it's a little bit slow, and the graphs aren't all that large. And for
operations like template expansion and optimizations, if they need access to
a particular index, then we'll be sure to generate or provide the
appropriate one. If we need a central database of identifiers for tooling
in the future, we'll create one then. No general-purpose identifier lookup
_is_ actually needed.
And with that, `Asg::lookup_or_missing` is removed. It has been around
since the beginning of the ASG, when the linker was just a prototype, so
it's the end of TAMER's early era as I was trying to discover exactly what I
wanted the ASG to represent.
DEV-13162
This is in the same spirit as previous commits modifying (or removing)
tests and benchmarks related to accessing the ASG and its indexes directly.
With this change, only `asg::air` uses the indexing and lookup methods on
`Asg`. This will allow me to extract the index from `Asg` entirely and have
`Air` solely responsible for lookup; the graph will be responsible only for,
well, being a graph. Indexing is an optimization strategy.
More information in the commit to follow. But notice how this moves
environment-related concerns away from `Asg` and into AIR, and how the
remaining environment concerns are index-related.
But there is one remaining barrier: to fully move the indexing away from
`Asg`, we have to use an alternative (and complete)
abstraction---AirAggregateCtx with its ability to resolve and introduce
scope based on the stack. The `AirIdent` token subset doesn't yet do that,
and all the work up to this point was in preparation for doing that. Since
introducing indexing at Root a few commits ago, it's now possible to
proceed.
DEV-13162
These benchmarks were useful as TAMER was in its infancy and I was trying to
gain an intuition for working with Rust. But they are now out of date, and
there are better ways to measure TAMER's performance, including running it
on real-world data (which wasn't possible previously) and through profiling
tools like Valgrind.
With that said, these types of benchmarks _would_ be useful for helping to
dig down into improvements that could be made, at a glance. The problem is,
they aren't testing anything new, and they're also testing something I'm
about to extract from `Asg`. It is not worth the ongoing maintenance cost.
So benchmarks may be reintroduced in the future if they are found to be
valuable.
DEV-13162
The previous commit introduced a duplicate `asg_from_toks`; this just makes
it available publicly for any tests that might utilize AIR to lower the
barrier to writing such tests and provide some guidance in doing so.
DEV-13162
This uses AIR---the ASG's proper public interface now---to construct the
graph for tests, just as all the other modern tests do. This change
works towards encapsulating index operations (both creation and lookups) so
that the index can be moved off of Asg and into AIR, where it belongs. More
information on that and rationale to come.
DEV-13162
This, finally, introduces identifier pooling in the global environment,
represented by `Root`. All package-level identifiers will be scoped as
such, which at the moment means anything that's not within a template.
As mentioned in recent commits, this does require additional cleanup to
finalize, and some more tests will make additional rationale more clear.
It's also worth noting the intent of storing the `ObjectIndex<Root>`---not
only does it mean that the active root can be derived solely from the
current parsing state, but it also means that in the future we can
contribute to any, potentially multiple, roots. I had previously used Neo4J
to effectively diff two dependency graphs between versions in the current
XSLT-based TAME; I'd like to be able to do that with TAMER in the future,
which is an important concept when considering automated data migration, as
well as querying for the effects of changes.
More to come. I'm hoping this is finally nearing a conclusion and I can
finally tie everything together with package imports. `AirIdent` will be
introduced into the mix soon now too, now that this commit is able to root
them.
DEV-13162
Okay, this is finally distilling into something fairly simple and
reasonable, but I'm not quite there yet.
In particular, the responsibility is now simply split between `Asg` (as the owner of
the index) and `AirAggregateCtx` (as the owner of the stack frames from
which environments and scope are derived). This was inevitable and I was
waiting for it, but now I have a good idea of how to clean it up and
proceed.
This also doesn't index in root yet (`active_rooting_oi` is still `None` for
`Root`), and I think I may remove `Pool` and just make it `Visible` at that
point, since it won't be going any further anyway. I don't think the
distinction is meaningful and will just complicate implementations.
The tests also need some more cleanup---the assertions ideally would live in
independent tests, and the assertion failure is in a function call rather
than the test (function) itself, so a Rust backtrace is required to locate
the line number (unless you look at the failure data).
So I suppose this is more of a mental synchronization point than
anything. Nothing's broken, though.
DEV-13162
There's a lot of documentation on this in the commit itself, but this stems
from
a) frustration with trying to understand how the system needs to operate
with all of the objects involved; and
b) recognizing that if I'm having difficulty, then others reading the
system later on (including myself) and possibly looking to improve upon
it are going to have a whole lot of trouble.
Identifier scope is something I've been mulling over for years, and more
formally for the past couple of months. This finally begins to formalize
that, out of frustration with package imports. But it will be a weight
lifted off of me as well, with issues of scope always looming.
This demonstrates a declarative means of testing for scope by scanning the
entire graph in tests to determine where an identifier has been
scoped. Since no such scoping has been implemented yet, the tests
demonstrate how they will look, but otherwise just test for current
behavior. There is more existing behavior to check, and further there will
be _references_ to check, as they'll also leave a trail of scope indexing
behind as part of the resolution process.
See the documentation introduced by this commit for more information on
that part of this commit.
Introducing the graph scanning, with the ASG's static assurances, required
more lowering of dynamic types into the static types required by the
API. This was itself a confusing challenge that, while not all that bad in
retrospect, was something that I initially had some trouble with. The
documentation includes clarifying remarks that hopefully make it all
understandable.
DEV-13162
This begins demonstrating that the root will be utilized for identifier
lookup and indexing, as it was originally for TAME and is currently for the
linker.
This was _not_ the original plan---the plan was to have identifiers indexed
only at the package level, at least until we need a global lookup for
something else---but that plan was upended by how externs are currently
handled. So, for now, we need a global scope.
(Externs are resolved by the linker in such a way that _any_ package that
happens to be imported transitively may resolve the import. This is a
global environment, which I had hoped to get rid of, and which will need to
eventually go away (possibly along with externs) to support loading multiple
programs into the graph simultaneously for cross-program analysis.)
This commit renames the base state for `AirAggregate` to emphasize the fact,
especially when observing it in the `AirStack`, and changes
`AirAggregateCtx::lookup_lexical_or_missing` to resolve from the _bottom_ of
the stack upward, rather than reverse, to prove that the system still
operates correctly with this change in place.
The reason for this direction change is to simplify lookup in the most
general case of non-local identifiers, which are almost all of them in
practice---they'll be immediately resolved at the root once they're
indexed. This can be done because I determined that I will _not_ support
shadowing; rationale for that will come later, but TAME is intended to be a
language suitable for non-programmer audiences as well. Note that
identifiers will be resolved lexically within templates in TAMER, unlike
TAME, which means that the expansion context will _not_ be considered when
checking for shadowing, so templates will still be able to compose without a
problem so long as they do not shadow in their definition context. (I'll
have to consider how that affects template-generating templates later on,
but that's an ambiguous construction in TAME today anyway.)
This _does not_ yet index anything at the root where it wasn't already being
indexed explicitly.
DEV-13162
This requires the name as part of the package definition, which in turn
removes a state (and all the combinations resulting from it) from
AirAggregate, which results in significant complexity reduction for a very
complex part of the system.
Pushing this complexity outward results in a reduction of overall
complexity, and obviates the question of where NIR will receive a generated
name.
DEV-13162
The comment speaks for itself.
My concern is that this will be especially off-putting to people looking at
TAMER and wondering how one could possibly work with this system.
DEV-13162
This is something I've wanted to do for some time, but the system is
becoming hard enough to reason about (with some attempted future changes)
that I require the consistency afforded by this change.
It's not entirely done---as noted by the TODO for `UnnamedPkg`---but it's
close, and then `AirAggregate` will just be a delegating superstate, like
`ele_parse!`.
Importantly, this also puts a package parser on the stack, which will work
better with the stack-based scoping system being developed. It will also
make it easier to fall back to a base case that I had really wanted to
avoid, and will have more information on in the future: root indexing for a
shared global environment for package-level identifiers. (Imports are still
package-scoped, but only in appearance, by contributing to the global
environment of the compilation unit during import. Well, it doesn't do that
yet. The XSLT compiler works in that way.)
DEV-13162
This is one of many changes that have been lingering that I need to start to
break apart in an attempt to commit the confusing and disappointing
conclusion to this package loading madness.
More information to come.
DEV-13162
I had apparently forgotten about this, because I didn't benefit from the
exhaustiveness check; this needs to be eliminated so that this doesn't
happen again, and to provide a proper non-panicking error.
DEV-13162
This reverts commit da7fe96254e425bc7b75f8cf454465b71e27e372.
I'm a fool---this would be pursuant to a future plan that removes AirIdent
opaque tokens. But for now, I need it on IdentDecl and others, which
currently have a `Source` (that I want to go away, as just mentioned), which
contains the same information.
So maybe more to come on this...
DEV-13162
This allows for a canonical package name to be optionally provided to
explicitly resolve a reference against, avoiding a lexical lookup.
This change doesn't actually utilize this new value yet; it just
retains BC. The new argument will be used for the linker, since it already
knows the package that defined an identifier while reading the object file's
symbol table. It will also be used by tamec for the same purposes while
processing package imports.
DEV-13162
-- squashed with --
tamer: asg::air::ir::RefIdent: CanonicalName=SPair
The use of CanonicalName created an asymmetry between RefIdent and
BindIdent. The hope was to move CanonicalName instantiation outside of AIR
and into NIR, but doing so would be confusing and awkward without doing
something with BindIdent.
I don't have the time to deal with that for now, so let's observe how the
system continues to evolve and see whether hoisting it out makes sense in the
end. For now, this works just fine and I need to move on with the actual
goal of finishing package imports so that I can expand templates.
DEV-13162
NOTE: This fixes the aforementioned commit that caused the linker to
temporarily fail (670c5d3a5d at time of
writing). This does introduce an extra forward slash into
`l:dep/preproc:sym/@src`, but that does not appear to cause any
problems. That will eventually go away, so I'm not going to bother with it
any further.
As the `xmlo` file is lowered into AIR, the name will be prefixed with a
leading slash (if necessary, which it is atm) and will emit an
`Air::BindIdent`.
This means that packages will be properly indexed by their canonical name on
load, which will be important when we share this with tamec.
DEV-13162
This change requires every package to have a canonical name, and performs
namespec canonicalization on imports.
Since all package names are canonicalized, this opens the door to being able
to index package names at import, allowing the object to be shared on the
graph and properly reference a package after it has been resolved.
Note that the system tests' canonicalization is relative to the hard-coded
`/TODO` presently; that will change in the near future once `tamec`
generates names from the provided path.
DEV-13162
This introduces, but does not yet integrate, `CanonicalName`, which not only
represents canonicalized package names, but handles namespec resolution.
The term "namespec" is motivated by Git's use of *spec (e.g. refspec)
referring to various ways of specifying a particular object. Names look
like paths, and are derived from them, but they _are not paths_. Their
resolution is a purely lexical operation, and they include a number of
restrictions to improve their clarity and simplify their handling. I expect
them to
evolve more in the future, and I've had ideas to do so for quite some time.
In particular, resolving packages in this way and then loading them from the
filesystem relative to the project root will ensure that
traversing (conceptually) to a parent directory will not operate
unintuitively with symlinks. The path will always resolve unambiguously.
(With that said, if the symlink is to a shared directory with different
directory structures, that doesn't solve the compilation problem---we'll
have to move object files into a project-specific build directory to handle
that.)
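To make the lexical nature of resolution concrete, here's a rough sketch of
the idea (illustrative only---the actual `CanonicalName` implementation
carries spans and enforces more restrictions than this):

  // Resolve a namespec lexically against the importing package's
  // canonical name; the filesystem is never consulted.
  fn resolve_namespec(base: &str, namespec: &str) -> String {
      let mut parts: Vec<&str> =
          base.split('/').filter(|s| !s.is_empty()).collect();
      parts.pop(); // drop the package name, leaving its containing "directory"

      for seg in namespec.split('/').filter(|s| !s.is_empty()) {
          match seg {
              "." => (),               // no-op
              ".." => { parts.pop(); } // purely lexical parent traversal
              name => parts.push(name),
          }
      }

      format!("/{}", parts.join("/"))
  }

  fn main() {
      assert_eq!(resolve_namespec("/pkg/sub/foo", "../bar"), "/pkg/bar");
  }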
Span Slicing
------------
Okay, it's worth commenting on the horridity of the path name slicing that
goes on here. Care has been taken to ensure that spans will be able to be
properly sliced in all relevant contexts, and there are plenty of words
devoted to that in the documentation committed here.
But there is a more fundamental problem here that I regret not having solved
earlier, because I don't have the time for it right now: while we do have
SPair, it makes no guarantees that the span associated with the corresponding
SymbolId is actually the span that matches the original source lexeme. In
fact, it's often not.
This is a problem when we want to slice up a symbol in an SPair and produce
a sensible span. If it _is_ a source lexeme with its original span, that's
no problem. But if it's _not_, then the two are not in sync, and slicing up
the span won't produce something that actually makes sense to the user. Or,
worse (or maybe it's not worse?), it may cause a panic if the slicing is out
of bounds.
The solution in the future might be to store explicitly the state of an
SPair, or call it Lexeme, or something, so that we know the conditions under
which slicing is safe. If I ever have time for that in this project.
But the result of the lack of a proper abstraction really shows here: this
is some of the most confusing code in TAMER, and it's really not doing
anything all that complicated. It is disproportionately confusing.
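To illustrate the hazard with a deliberately simplified sketch (real spans
carry far more context; these tuples are stand-ins):

  // A span here is just (offset, len) into some source buffer. Slicing a
  // symbol's span is only sound if the span actually covers the lexeme.
  fn slice(span: (usize, usize), rel: usize, len: usize) -> (usize, usize) {
      let (start, span_len) = span;
      assert!(rel + len <= span_len, "slice exceeds span bounds");
      (start + rel, len)
  }

  fn main() {
      // Slicing "name" out of "/pkg/name" when the span matches the lexeme:
      assert_eq!(slice((100, 9), 5, 4), (105, 4));

      // But if the SPair carries some other, shorter span, the same slice
      // panics (or, worse, silently points at unrelated source text):
      // slice((100, 3), 5, 4); // would panic: slice exceeds span bounds
  }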
DEV-13162
NOTE: This temporarily breaks `tameld`. It is fixed in a future commit when
names are bound. This was an oversight when breaking apart changes into
separate commits, because the linker does not yet have system tests like
tamec does.
This is preparing for a full transition to requiring a canonical package
name. The previous `Unnamed` variant has been removed and `AirAggregate`
will provide a default `WS_EMPTY` name, as `Pkg` had done before.
The intent of this change is to allow for consulting the index before a
new `Pkg` object is created on the graph, but we're not quite ready for that
yet.
Well, that's not entirely true---the linker can be ready for that. But the
compiler needs to canonicalize import paths relative to the active package
canonical name, which it can't even do yet because tamec isn't generating a
name.
So maybe the linker will be first; it's useful to have that in a separate
commit anyway to emphasize the change.
DEV-13162
...this has apparently been consuming errors for some time. This would
cause the parser to enter an invalid state in some cases and terminate.
This would _not_ permit an invalid link, as the graph would not be correct,
but it was masking the actual error.
This part of the linker is in dire need of tests. This also ought to be
replaced with tamec's approach of reporting all errors.
DEV-13162
The previous commit introduced canonical names, and this uses them to index.
The next step will be to utilize those names to look up packages on
definition rather than creating a new package node, so that references to
yet-to-be-defined (or yet-to-be-imported) packages can be resolved on the
graph.
DEV-13162
This is already a concept in the XSLT-based compiler, where each package has
a `package/@name` generated from its path. The same will happen with tamec.
Before we can load packages into the graph, we need canonical identifiers so
that they can be indexed. The next commit will handle indexing using this
information.
DEV-13162
The documentation explains the intent here---existing LaTeX documentation.
The intent was to simply copy the documentation into a LaTeX document based
on the lvspec package that I had created long ago. Of course, that's not
appropriate---we're a DSL and should provide first-class support for
documentation that will compile properly into the target format, whether it
be LaTeX, HTML, JS, or anything else.
DEV-13162
These have been a pain in the ass since TAMER began.
It seemed like a good idea at the time to have static code generated in this
way, but the lack of explicit dependencies just makes this a mess and works
against the operating theory of the system.
Furthermore, the _same_ static fragments were generated for each and every
map package.
There is still a post-link step (standalones) handled in XSLT; the
previously-static code has been moved there. This will eventually be
integrated into tameld itself, once TAMER has facilities for JS generation.
(This was discovered while trying to parent identifiers to packages.)
DEV-13162
With the previous commit using a visitor implemented within the `asg`
module, we can now finally encapsulate the graph. This is a wonderfully
liberating, long-awaited change, since I have been fighting with the lack of
encapsulation for some time; it has made certain changes challenging and has
made the system more difficult to reason about. It also made it impossible
to assert that invariants were _actually_ properly enforced, if things could
just peer into and modify the graph directly, out from underneath the API
that provides those assurances.
This also removes our dependency on Petgraph outside of the `asg`
module. There are no plans to migrate away from it currently; we'll see how
the graph continues to evolve over time and what redundancies are introduced
with our data structures. It may render petgraph unnecessary.
Interestingly, because my DFS implementation is so similar to Petgraph's,
the emitted ordering is _identical_ between this commit and the previous.
DEV-13162
This integrates the new topological sort, replacing the previous
implementation in the linker.
This will now allow encapsulating the graph, finally, and ensures that
future changes can be fully maintained within the `asg` module.
More cleanup will come over time.
DEV-13162
This commit includes plenty of documentation, so you should look there.
It's desirable to describe the sorting that TAME performs as a topological
sort, since that's the end result we want. This uses the ontology to
determine what to do to the graph when a cycle is encountered. So
technically we're sorting a graph with cycles, but you can equivalently view
this as first transforming the graph to cut all cycles and then sorting it.
For the sake of trivia, the term "cut" is used for two reasons: (1) it's an
intuitive visualization, and (2) the term "cut" has precedent in logic
programming (e.g. Prolog), where it (`!`) is used to prevent
backtracking. We're also preventing backtracking, via a back edge, which
would produce a cycle.
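The idea, roughly, in code (a simplified sketch over a plain adjacency list;
the real implementation consults edge metadata via the ontology rather than
a predicate like this):

  // Simplified sketch: a post-order DFS that topologically sorts while
  // "cutting" permitted back edges instead of reporting them as cycles.
  // Edges point from a dependent toward its dependencies.
  fn topo_sort(
      adj: &[Vec<usize>],
      permit_cut: impl Fn(usize, usize) -> bool, // may this back edge be cut?
  ) -> Result<Vec<usize>, (usize, usize)> {
      #[derive(Clone, Copy, PartialEq)]
      enum Mark { White, Grey, Black }

      fn visit(
          n: usize,
          adj: &[Vec<usize>],
          mark: &mut [Mark],
          out: &mut Vec<usize>,
          permit_cut: &impl Fn(usize, usize) -> bool,
      ) -> Result<(), (usize, usize)> {
          mark[n] = Mark::Grey;
          for &m in &adj[n] {
              match mark[m] {
                  Mark::Grey if permit_cut(n, m) => (), // cut the back edge
                  Mark::Grey => return Err((n, m)),     // forbidden cycle
                  Mark::White => visit(m, adj, mark, out, permit_cut)?,
                  Mark::Black => (),
              }
          }
          mark[n] = Mark::Black;
          out.push(n); // post-order: dependencies emitted first
          Ok(())
      }

      let mut mark = vec![Mark::White; adj.len()];
      let mut out = Vec::new();
      for n in 0..adj.len() {
          if mark[n] == Mark::White {
              visit(n, adj, &mut mark, &mut out, &permit_cut)?;
          }
      }
      Ok(out)
  }

  fn main() {
      // 0 -> 1 -> 2 -> 1, where the 2->1 back edge is permitted:
      let adj = vec![vec![1], vec![2], vec![1]];
      let sorted = topo_sort(&adj, |from, to| (from, to) == (2, 1)).unwrap();
      assert_eq!(sorted, vec![2, 1, 0]); // dependencies first
  }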
DEV-13162
This introduces cycle detection, but it does not yet filter ontologically
permitted cycles, which will be needed prior to utilizing this in `tameld`.
There's a considerable amount of documentation here. While the
implementation is fairly simple, there are important algorithmic decisions,
both in the DFS construction and the derivation of the cycle path from data
that already exists.
This also supports recovery (by ignoring cycles), which can then be utilized
to find more cycles and other errors in the system.
DEV-13162
This is an initial implementation that does not yet produce errors on
cycles. Documentation is not yet complete.
The implementation is fairly basic, and similar to Petgraph's DFS.
A terminology note: the DFS will be ontology-aware (or at least aware of
edge metadata) to avoid traversing edges that would introduce cycles in
situations where they are permitted, which effectively performs a
topological sort on an implicitly _filtered_ graph.
This will end up replacing ld::xmle::lower::sort.
DEV-13162
tameld isn't yet adding edges to Idents from their associated Pkg (see
previous commit), but this formalizes how the ontology will interpret such a
relationship. The idea is that Idents are always owned by Pkgs, but they
may be optionally explicitly rooted, which will be used by a particular type
of DFS walk that is about to be written, which can ignore Root->Pkg and
focus instead on cross edges to Idents.
Though it's not lost on me that now that I'll be introducing a DFS for the
linker, the terms "cross" and "tree" edge now become ambiguous; I used to
call them "ontological X edge", but I had fallen out of that habit; perhaps
I need to reintroduce that rigor.
DEV-13162
This modifies the xmlo reader, xmlo->AIR lowering, and AIR->ASG to introduce
a package for identifiers. It does not yet, however, add edges from the
package to the identifier.
Once edges are added, the DFS will change in undesirable ways, which will
require a new implementation. This is desirable to decouple from Petgraph
anyway, and then will be able to restore the prior single-pass sort+cycle
check.
That will also encapsulate visiting behavior within the `asg::graph` module
and, in turn, allow encapsulating `Asg.graph` finally.
DEV-13162
This doesn't go far enough, but it elaborates a bit---the existing one was far
too much of a catch-all. It's important to take advantage of exhaustiveness
checks to ensure each transition is properly accounted for.
This parser is going to get more work over time, including right now, so I'm
not going to go too deep into this yet, but it'd be useful (as a reader) to
compare it to e.g. asg::air's parsers' explicit enumeration of states and
favoring of explicit errors over dead state transitions.
DEV-13162
This may now index _any_ type of object, in preparation for indexing package
import paths. In practice, this only makes sense (at least currently) for
`Pkg` and `Ident`.
This generalization also applies to `Asg::lookup_or_missing`.
DEV-13162
Historically, the ASG was better described as a "dependency graph",
containing only identifiers (which are simply called "symbols" in the
XSLT-based compiler). Consequently, it was appropriate for the graph to
have operations specific to identifiers. (Indeed, that's the only type of
object the graph supported.)
Much has changed since then. This cleans things up, and makes parenting
identifiers to root an _explicit_ operation. This will make it easier to
move forward with handling of scope, and importing identifiers into
packages, and removing `Source`, and so on.
DEV-13162
I've been torturing myself trying to figure out how I want to generalize
indexing, lookups, and value numbering in a way that is appropriate for this
project (that is, not over-engineered relative to my needs).
Before I can do much of anything, though, I need to stop having indexing
only as a `Root` thing (previously it wasn't even tied to `Root`). This
makes that change for tamec, but temporarily removes scoping concerns until
I can add more specific types of indexing.
Not only does this allow cleaning up some `Ident`-specific stuff from `Asg`,
but the cleanup also helps to show that portions of the system aren't still
using Root-based globals.
The linker (`tameld`) still uses the old `global` methods for now; those
will eventually go away, but this needs to change to unify both tamec and
tameld once we get to imports as part of the compiler.
DEV-13162
This is needed to then support `@desc` for shorthand desugaring; it's
required by the XSLT-based compiler (and will eventually be required by
TAMER too).
DEV-13708
This is needed by TAMER's template desugaring. The XSD is superseded by
`nir::parse`, but can't go away until TAMER fully supplants the XSLT-based
compiler.
...and after all this time, I still never got rid of the duplicate XSD. Nor
do I even recall which one is the duplicate.
DEV-13708
TAMER desugars shorthand template application bodies (`@values@`) into _the
name of a closed template_ whose body should be expanded into place. This
change recognizes that convention, and makes use of it.
Desugaring is part of `nir::tplshort`.
DEV-13708
XIRF->Nir produces `Todo` and `TodoAttr` tokens for many different
things. The previous approach was to ignore those things so that I could
begin adding portions of packages to the graph and observe how that goes.
But now that I'm starting to be able to compile certain packages that
utilize only small subsets of TAME features, I need to have confidence that
I'm fully parsing them. This means rejecting tokens that I haven't yet
gotten to.
DEV-13708
This supports arbitrary documentation as sibling text (mixed content, in XML
terms). The motivation behind this change is to permit existing system
tests to succeed when `Todo | TodoAttr` are both rejected, rather than
having to ignore this.
TAME has always had a philosophy of literate documentation, however it was
never fully realized. This just maintains the status quo; the text is
unstructured, and maybe will be parsed in the future.
Unfortunately, this does _not_ include the output in the `xmli` file or the
system tests. The reason has nothing to do with TAMER---`xmllint` does not
format the output when there is mixed content, it seems, and I need to move
on for now; I'll consider my options in the future. But, it's available on
the graph and ready to go.
DEV-13708
This _only_ re-introduces them for PackageStmt, since that's all I have tests
for
at present. More will be re-added later.
They were previously removed when the attribute parsing was upended in
`ele_parse!`.
This does lose the attribute name, compared to before; that'll ideally be
re-added, and I'll explore options for doing so later, since I also want
them in other contexts. But it needs to be done generically (not
XML-related).
This had to be done before blowing up on TODOs, or system tests would fail.
DEV-13708
This is in preparation for throwing errors (with diagnostic information) on
yet-to-be-supported tokens, so that I can confidently compile individual
packages without worrying that something is just being ignored.
This makes obvious that `ele_parse!` had a different design in mind
previously, and it's now resulting in a lot of boilerplate; I'll address
that in the future once I'm certain requirements have been settled on, since
I've spent far too much time on it to waste more.
DEV-13708
This introduces a new `Doc` object that can be owned by `Expr` (only atm)
and contain what it describes as a concise independent clause. This
construction is not enforced, and is only really obvious today via the
Summary Pages.
There's a lot of latent potential in TAME's documentation philosophy that was
never realized, so this will certainly evolve over
time. But for now, the primary purpose was to get `@desc` working on things
like classifications so that `xmli` output can compile for certain
packages.
DEV-13708
These are used by virtually every `ObjectKind`; I've been meaning to do this
for a while, but now that I'm about to introduce a new one (`Doc`), let's
just get it out of the way.
DEV-13708
This doesn't do the actual hard work yet of resolving and loading a package,
but it does place it on the graph and re-derive it into the xmli output.
DEV-13708
This introduces `<match on="foo" />` and `<match on="foo" value="bar" />`,
which are both equality predicates. Other types of predicates are not yet
supported.
This change is a bit messy and leaves a bit to be desired. `NirToAir` is
quite messy and needs some cleanup. There's also the issue of introducing
XML-specific errors in NIR so that users know what things like "subject"
mean, but not being able to do so yet because NIR is agnostic to the source
document type; another layer of abstraction is needed.
But, my priority is to get derivation of a particularly
expensive (generated) package in our internal systems working first.
DEV-13708
The alternative I was floating was a tagged `Ref` (that is, with an enum
within it), but I settled on this for now, in part for a more concise
notation with the mapping in nir::parse.
We'll see how this evolves. For now, it's not important with the only thing
that uses ref in nir::parse, which is template application.
This was introduced for `match`, which is to come shortly.
DEV-13708
This recognizes template application within expressions. Since expressions
can occur within templates, this can occur arbitrarily deeply.
And with that, we have the core of the template system represented on the
graph. Of course, there are some glaring scoping issues to be resolved, but
those aren't unique to template application.
DEV-13708
I had hoped this would be considerably easier to implement, but there are
some confounding factors.
First of all: this accomplishes the initial task of getting nested template
applications and definitions re-output in the `xmli` file. But to do so
successfully, some assumptions had to be made.
The primary issue is that of scope. The old (XSLT-based) TAME relied on the
output JS to handle lexical scope for it at runtime in most situations. In
the case of the template system, when scoping/shadowing were needed, complex
and buggy XPaths were used to make a best effort. The equivalent here would
be a graph traversal, which is not ideal.
I had begun going down the rabbit hole of formalizing lexical scope for
TAMER with environments, but I want to get this committed and working first;
I've been holding onto this and breaking off changes for some time now.
DEV-13708
All ObjectIndex-like objects hash using only the underlying identifier,
which ultimately boils down to a `NodeIndex` (petgraph), which is just a
u32. And so in that sense, the only purpose we have for hashing it is to
(a) reduce the space required to store mappings, and (b) compose with other
`Hash`es.
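Roughly (a simplified sketch---the real type wraps petgraph's `NodeIndex`
and carries a `Span`):

  use std::hash::{Hash, Hasher};
  use std::marker::PhantomData;

  struct ObjectIndex<O> {
      index: u32,       // stand-in for petgraph's NodeIndex
      span: (u32, u32), // diagnostic context only; not part of identity
      _kind: PhantomData<O>,
  }

  // Hash only the node id; the span and phantom ObjectKind contribute
  // nothing to identity, so two indexes to the same node hash alike.
  impl<O> Hash for ObjectIndex<O> {
      fn hash<H: Hasher>(&self, state: &mut H) {
          self.index.hash(state);
      }
  }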
DEV-13708
This creates another trait and struct `ObjectIndexToTree` that assert a
stronger invariant than `ObjectIndexRelTo`---that not only does it uphold
the invariants of `ObjectIndexRelTo`, but also that it represents a _tree_
edge, which indicates _ownership_ rather than just a reference.
This will be used to statically infer what can serve as a scope boundary for
upcoming changes. Specifically, anything that can own an `Ident` introduces
a new level of scope.
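Sketched as a subtrait relationship (simplified names, not the actual
definitions):

  // "an edge from Self to OB is permitted by the ontology"
  trait ObjectRelTo<OB> {}

  // "...and that edge is a *tree* edge: Self owns OB"
  trait ObjectTreeRelTo<OB>: ObjectRelTo<OB> {}

  struct Pkg;
  struct Ident;
  impl ObjectRelTo<Ident> for Pkg {}
  impl ObjectTreeRelTo<Ident> for Pkg {} // Pkg owns Idents: scope boundary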
DEV-13708
This allows this method to be used on anything that is able to relate to an
identifier, which is needed for the changes being made for the template
system.
This linear lookup is actually going away (as hinted at by preceding
commits); this is extracted as part of a larger change and I wanted to get
it committed to make it easier to follow upcoming changes.
DEV-13708
The prior commit begins to explain the end goal of being able to index
identifiers outside of the global environment.
This change continues to index things as before, but introduces a new key
based on the pair of the symbol id together with a node that is _part of_
its target environment. The only environment utilized at the moment (in this
commit) is that of the root node (which is the global scope), in both
indexing and lookup. Future commits will extend this, and contain more
information about and rationale for the implementation.
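Concretely, the key changes from a bare symbol to a pair, roughly like
this (stand-in types; not the actual definitions):

  use std::collections::HashMap;

  type SymbolId = u32;  // stand-ins for the real types
  type NodeIndex = u32;

  struct Index {
      // was: HashMap<SymbolId, NodeIndex>
      env: HashMap<(SymbolId, NodeIndex), NodeIndex>,
  }

  impl Index {
      fn lookup(&self, sym: SymbolId, env: NodeIndex) -> Option<NodeIndex> {
          // at this commit, `env` is always the root (global scope)
          self.env.get(&(sym, env)).copied()
      }
  }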
The new general index methods are restricted to `pub(super)` until an
abstraction can be put in place that is responsible for environment
indexing; that's a responsibility that is currently handled by
`AirAggregateCtx` for tamec, and the linker has no scoping
requirements since all of that has already been dealt with.
DEV-13708
This reverts commit 1b7eac337cd5909c01ede3a5b3fba577898d5961.
This is a revert of the previous revert, just so that I (and you) have
references to prior rationale.
This was previously reverted because it wasn't worth doing, but now we have
a situation where we need to begin implementing lexical scoping rules for
nested containers (packages and templates). In particular, as you'll see in
the commits that follow, we need to be able to look up an identifier that
may have been created as Missing at one level of scope (certain types of
blocks), but then define it at another level.
Or, even more simply at this point, since I'm not yet doing anything
sophisticated with scope: we're only indexing in the global environment, and
we need to be able to index elsewhere too.
The next commit will go into more information, but suffice it to say for now
that indexing is going to get more complicated than a SymbolId.
Sticking with FxHash for now; we don't need a stable hash yet.
DEV-13708
There are no such invalid expansion contexts yet, but this gets rid of the
final remaining TODO from introducing the stack. With the existing feature
set, at least.
DEV-13708
This eliminates the TODOs that existed when looking for an OI for rooting an
identifier.
The change to `rooting_ci` is ridiculous, but I want to get other things
done before I jump down the rabbit hole of generalizing that (indexing local
identifiers). Though I have an approach in mind.
DEV-13708
Just some continued cleanup.
Unfortunately, we have sacrificed statically knowing that a package OI must
exist, even though one will always be available.
DEV-13708
This has AirAggregate preempt Expr parsing in the same way as templates,
rather than having `AirTplAggregate` concern itself with expression
tokens. This continues to simplify `AirTplAggregate`, which was getting
quite complex not too long ago.
A pattern is now emerging for the call/ret convention for preemption. That
was intentional, but it's nice to see it manifest so obviously before I
abstract it away.
DEV-13708
This was extracted from xir::parse::ele in previous commits. The
conventions help to ensure that pushes and returns are being performed
correctly. The abstraction will continue to evolve.
This ends up using `Ready` as the dead state. I need to determine if this
is ideal, and if so, maybe just use `Default`, otherwise yield an error.
DEV-13708
The use of ArrayVec doesn't buy us anything anymore. There is no difference
in performance through my own benchmarking (at least on our systems), and
the game has changed since this was written: the size of the states is much
smaller since we're no longer aggregating attributes. Further, the use of
ArrayVec during development was also to keep memory allocation away from
various parts of the code, which simplified analysis of the binary that was
produced. Maybe it also reduced memory contention, but clearly that has no
observable impact.
The use of `Vec` removes the arbitrary bound, though I still kept one around
just in case something goes wrong, so TAMER will terminate. Even though the
token stream is bounded in size, lookahead does create recursion, and the
system cannot (as written) prove that it doesn't.
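The retained bound amounts to something like this (a sketch; the limit and
failure mode here are illustrative, not TAMER's actual values):

  // Retain an arbitrary ceiling after moving from ArrayVec to Vec so that
  // unbounded lookahead recursion terminates rather than consuming all
  // memory.
  const MAX_STACK: usize = 1024;

  fn push_frame<T>(stack: &mut Vec<T>, frame: T) {
      assert!(
          stack.len() < MAX_STACK,
          "parser stack exceeded {MAX_STACK} frames; possible lookahead cycle",
      );
      stack.push(frame);
  }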
This is preparing for extracting `StateStack` into `parse` for use with
`AirAggregate`.
DEV-13708
`AirAggregate` now handles all delegation to `AirExprAggregate`. This is
possible because `AirAggregate` is now the superstate for each of these
parsers, so `AirTplAggregate` is able to transition to a state that is not
its own.
This does not go so far as reaching the ultimate objective---having nested
template support---even though it'd be fairly simple to do now; there's
going to be a number of interesting consequences to these changes, and a bit
of cleanup is still needed, and I want tests observing this functionality to
accompany those changes. That is: let's keep this a refactoring, to the
extent that it's possible.
Things are getting much easier to understand now, and much cleaner.
DEV-13708
What hell have I gotten myself into.
In the end, this wasn't too bad, but the initial batch of errors was really
demotivating; the diff does this no justice. `Lookahead::into_super` was
created to help tame those errors.
...now I can move forward. Imagine my disappointment when I ran into this
when expecting from previous work that superstates would now work properly
for the AirAggregate parsers.
(The reason this was needed is because AirAggregate splits tokens into
subtypes for child parsers.)
DEV-13708
Oh, boy, I had forgotten about this, until I started working on some
SuperState stuff and discovered this again due to a compiler error. Don't
want to fix something that isn't used.
But this does not bring back great memories. It's unfortunate that it
didn't work out; I'm pretty sure this was part of ~1mo of wasted effort
going down a path that I ultimately had to abort. Not good times. I'm
still behind from it.
DEV-13708
I love deleting code I just wrote...
This doesn't solve the underlying problems with identifiers, but it does at
least lift it into the `AirAggregateCtx`, allowing `AirExprAggregate` to be
even further simplified. Now the `From` implementation is not specialized
and we can readily convert to a SuperState.
There's still a lot of TODOs here, though. And some of them will
unfortunately require runtime checks where there was previously a
compile-time check. But that's okay in a lot of the cases, because the
empty behavior will replace existing error checks.
DEV-13708
Whether or not dangling expressions are permitted is now based solely off of
the stack context, which is also much more intuitive.
`RootStrategy` now only does one thing, and the existing comments describe
why it exists despite that one thing seeming very similar.
`RootStrategy` further alludes to how `ExprStack` could also be
eliminated, should it be worth doing so. It is a tad redundant now with the
new stack.
DEV-13708
This does the same thing to `AirExprAggregate` that was previously done for
`AirAggregate`, taking all parent context from the stack.
This results in a fairly significant simplification of the code, which is
nice, and it makes the `RootStrategy` obviously obsolete in the dangling
case, which will result in more refactoring to simplify it even more.
I regret not taking this route to begin with, but not only was I hoping I
wouldn't need to, but I was still deriving the graph structure and wasn't
sure how this would eventually turn out. These commits serve as a proof of
necessity. Or, at least, concrete rationale.
It's worth noting that this also introduces `From` implementations for
`AirAggregate` and the child parsers, and then uses _that_ to push context
from the `AirTplAggregate` parser. This means that we're just about ready
for it to serve as a superstate. But there is still a specialization of
`AirExprAggregate` in that `From` impl, which must be removed.
DEV-13708
This is more of the same as the previous commit, but in a more digestible
chunk. We now have child states that are able to be constructed using a
simple `From`, which is important to making `AirAggregate` a `SuperState`.
This also makes `AirStack` act like a prototype chain for `ObjectIndex`es,
creating environments where context shadows. The linear search should only
have to check the last two frames (e.g. an Expr has a parent Pkg or Tpl
context which will have a `rooting_oi` value), and this is only done during
a rooting operation.
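The search amounts to something like this (names assumed for illustration):

  struct Frame {
      rooting_oi: Option<u32>, // stand-in for an ObjectIndex context
  }

  // Walk the stack top-down, like a prototype chain; the first frame
  // providing a rooting context shadows those beneath it.
  fn active_rooting_oi(stack: &[Frame]) -> Option<u32> {
      stack.iter().rev().find_map(|frame| frame.rooting_oi)
  }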
DEV-13708
This begins to introduce `AirStack` and starts to migrate context away from
the individual `ParseState`s onto the stack.
I should have started to commit earlier; this is getting a bit large and
makes it hard to follow what I'm doing so, hopefully stopping a little bit
short will allow the following commit to show that.
This is a work-in-progress change. All tests pass, but the refactoring is
incomplete. The `AirStack` abstraction is _also_ incomplete and will have
better, more domain-specific operations that make it harder to mess up
pairing pushes with pops.
The purpose of doing this is to allow `AirAggregate` to serve exclusively as
a sum state, which can then become a SuperState, much like `ele_parse!`'s
approach.
The _end_ goal of all of this is arbitrary template nesting.
DEV-13708
The graph's ontology is defined in the direction of the edge: from OA
to OB. This is enforced by the type system to ensure that no code path is
able to generate an invalid graph.
But that also makes it very difficult to work with a generic source to a
specific target.
This introduces a `ObjectIndexRelTo` trait that says whether `Self` is able
to be related to some `ObjectKind` `OB`, implements it for `ObjectIndex
where ObjectRelTo<OB>`, and introduces a new semi-opaque type
`ObjectIndexTo` that allows for the source `ObjectIndex` to be generic.
This then redefines some existing graph primitives in terms of
`ObjectIndexRelTo`, in particular creating edges, so that `ObjectIndex` can
be used as today, and the new `ObjectIndexTo` can be used in the same way
with the same API, without violating the graph ontology.
This will be used by `AirAggregate` to create dynamic targets for rooting
and splicing/expansion.
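A simplified sketch of the shape of this (stand-in types and names; the real
definitions are richer):

  use std::marker::PhantomData;

  struct Pkg;
  struct Tpl;
  struct Ident;

  struct ObjectIndex<O>(u32, PhantomData<O>);

  // Ontology: "an edge from O to OB is permitted".
  trait ObjectRelTo<OB> {}
  impl ObjectRelTo<Ident> for Pkg {}
  impl ObjectRelTo<Ident> for Tpl {}

  // Semi-opaque: a source of unknown type that is statically known to be
  // able to reach OB, so edge creation stays ontologically sound.
  struct ObjectIndexTo<OB>(u32, PhantomData<OB>);

  trait ObjectIndexRelTo<OB> {
      fn widen(self) -> ObjectIndexTo<OB>;
  }

  impl<O: ObjectRelTo<OB>, OB> ObjectIndexRelTo<OB> for ObjectIndex<O> {
      fn widen(self) -> ObjectIndexTo<OB> {
          ObjectIndexTo(self.0, PhantomData)
      }
  }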
DEV-13708
To simplify things in support of upcoming changes, we'll just instantiate a
new one as needed. This doesn't have an appreciable performance impact, so
the optimization is premature. It was done just because it was more of the
same that TAMER was already doing, but now it's making things more
difficult.
DEV-13708
Future changes to `AirAggregate` are going to require additional context (a
stack, specifically), but the `Context` is currently utilized
by `Asg`. This introduces a layer of abstraction that will allow us to add
the stack.
Alongside these changes, `ParseState` has been augmented with a `PubContext`
type that is utilized on public APIs, both maintaining BC with existing code
and keeping these implementation details encapsulated.
This does make a bit of a mess of the internal implementation, though, with
`asg_mut()` sprinkled about, so maybe the next commit can clean that up a
bit. EDIT: After adding `AsMut` to a bunch of asg::graph::object::*
methods, I decided against it, because it messes with the inferred
ownership, requiring explicit borrows via `as_mut()` where they were not
required before. I think the existing code is easier to reason about than
what would otherwise result from having `mut asg: impl AsMut<Asg>`
everywhere.
DEV-13708
Previously, `AirTplAggregate` worked only in a `Pkg` context, being able to
root `Tpl` `Ident`s in `Pkg` and expand only into `Pkg`. This still does
the same, but generalizes to allow for different roots and expansion
targets.
This will be utilized to parse nested templates.
DEV-13708
I'm happy with how this ended up turning out---I was able to accomplish this
without having to introduce any additional state to the parser (I _removed_
a state, actually) by tweaking NIR a bit in a previous commit.
We can't update the system test yet, though, because nested templates are
not yet supported by asg::air::tpl; that'll come next. If you try, you'll
be greeted with this error presently (which is worth showing since you'll
never see it unless you're hacking TAMER):
,=====[ ./tests/xmli/template/ logs ]======
|
| thread 'main' panicked at 'not yet implemented: internal error:
| note: nested tpl open
| --> ./tests/xmli/template/src.xml:129:5
| |
| 129 | <t:inner-short />
| | -------------- note: for this template
|
|
| !!! ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ !!!
| !!! THIS IS AN UNFINISHED FEATURE IN TAMER !!!
| !!! ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ !!!
| !!! This message means that TAMER has encountered an !!!
| !!! unrecoverable error that forced it to terminate !!!
| !!! processing. !!!
| !!! !!!
| !!! TAMER has attempted to provide you with contextual !!!
| !!! information above that might allow you to work around !!!
| !!! this problem until it can be fixed. !!!
| !!! !!!
| !!! Please report this error, including the above !!!
| !!! diagnostic output beginning with 'internal error:'. !!!
| !!! ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ !!!
| ', src/asg/air/tpl.rs:207:55
| note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
| Command exited with non-zero status 101
| 0/165fault 0/8io 3528rss 14/2ctx
| /home/[...]/tame/tamer/target/debug/tamec -o ./tests/xmli/template/out.xmli --emit xmlo ./tests/xmli/template/src.xml
|
`====[ end ./tests/xmli/template/ logs ]====
DEV-13708
This is a long-overdue change to make this easier to read, but I'm _still_
holding off on refactoring, since there's still a lot of room for different
patterns to form with all of NIR that is left.
DEV-13708
This adds explicit variants for shorthand template application. This is
less cryptic, and we'll be able to check for the close directly during
desugaring.
DEV-13708
This represents a significant departure from how the XSLT-based TAME handles
the `@values@` param, but it will end up having the same effect. It builds
upon prior work, utilizing the fact that referencing a template in TAMER
will expand it.
The problem is this: allowing trees in `Meta` would add yet another
container; we have `Pkg` and `Tpl` already. This was the same problem with
template application---I didn't want to add support for binding arguments
separately, and so re-used templates themselves, reaching the generalization
I just mentioned above.
`Meta` is intended to be a lexical metasyntactic variable. That keeps its
implementation quite simple. But if we start allowing trees, that gets
rather complicated really quickly, and starts to require much more complex
AIR parser state.
But we can accomplish the same behavior by desugaring into an existing
container---a template---and placing the body within it. Then, in the
future, we'll parse `param-copy` into a simple `Air::RefIdent`, which will
expand the closed template and produce the same result as it does today in
the XSLT-based system.
This leaves open issues of closure (variable binding) in complex scenarios,
such as in templates that introduce metavariables to be utilized by the
body. That's never a practice I liked, but we'll see how things evolve.
Further, this does not yet handle nested template applications.
But this saved me a ton of work. Desugaring is much simpler.
The question is going to be how the XSLT-based compiler responds to this for
large packages with thousands of template applications. I'll have to see
if it's worth the hit at that time, or if we should inline it when
generating the `xmli` file, producing the same `@values@` as
before. But as it stands at this moment, the output is _not_ compatible
with the current compiler, as it expects `@values@` to be a tree, so a
modification would have to be made there.
DEV-13708
This applies to template application only; there's still some work to do for
template parameters in definitions (well, for deriving them in `xmli` at
least). And, as you can see, there's still a lot of TODO items here.
I ended up backtracking on tree edges to Meta, and even on cross edges to
Meta, because it complicated xmli derivation with no benefit right now;
maybe a cross edge will be re-added in the future, but I need to move on and
see where this takes me.
But, it works.
DEV-13708
I'm not happy with this implementation. The linear search is undesirable,
but not too bad (and maybe wouldn't even be worth caching, if this were the
whole story), but we _also_ need to prevent duplicate identifiers. We are
not going to want to perform a linear search of a linked list (effectively)
every time we add an identifier to check for uniqueness, so I think the
caching is going to have to be generalized very shortly anyway.
As it stands now, a duplicate identifier would cause an error at expansion
time. That's not what we want, but it's not terrible, because you can have
that same problem in normal circumstances without local conflicts.
But this'll be used for metavariables as well, where we absolutely _do_ want
to fail at template definition time.
DEV-13708
Identifier lookups, as done using the graph methods today, look up from a
cache representing the global environment.
Templates must not contribute to this environment until expansion. Further,
metavariables will not be present in this environment. To avoid confusion
and help obviate accidental contributions to this environment, the methods
have been renamed. This will also allow for the creation of more general
methods down the line.
DEV-13708
This makes the tests quite a bit easier to understand visually. I've been
doing this with all new tests but had to go back to some old ones, and still
have more to go back to. Baby steps.
DEV-13708
I had intended for this to be a full vertical slice initially, but AIR's
parser is going to need enough work that it'll muddy this patch a bit too
much.
This keeps the desugaring simple, which is what I was hoping for.
The next step is to load it into the graph and emit regenerated longhand
sources.
I also don't like how the namespace prefix is just being ignored for
shorthand param desugaring. This is also the case in the XSLT-based
compiler, but this violates TAMER's principle that it should parse every bit
of information; nothing should be ignored. If something does not contribute
useful information, then it is not a useful construct and ought to be
rejected.
DEV-13708
This moves translation from NirToAir into TplShortDesugar, and changes the
output from AIR to NIR.
This is going to be much easier to reason about as a desugaring
operation (and indeed that's always how TAME has implemented it, in XSLT);
this keeps the complexity isolated.
Ideally, NirToAir wouldn't even accept tokens that it can't handle, but
that's going to take quite a bit more work and I don't have the time right
now. Instead, we'll fail at runtime with some hopefully-useful
information. It shouldn't actually happen in practice.
DEV-13708
This makes it more visually apparent, when looking directly at a node,
whether an edge could represent a tree edge.
Dynamic edges could be tree edges, so I left those solid; that's the more
important visual indicator that I'm interested in, and it's disambiguated by
the dashed line.
DEV-13708
Previous to this commit, ontological cross edges were declared
statically. But this doesn't fare well with the decided implementation for
template application.
The documentation details it, but we have Tpl->Ident which could mean "I
define this Ident once expanded", or it could mean "this is a reference to a
template I will be applying". The former is a tree edge, the latter is a
cross edge, and that determination can only be made by inspecting edge data
at runtime.
It could have been resolved by introducing new Object types, but that is a
lot of work for little benefit, especially given that only (right now) the
visitor uses this information.
DEV-13708
This is a big change that's difficult to break up, and I don't have the
energy after it.
This introduces nullary template application, short- and long-form. Note
that a body of the short form is a `@values@` argument, so that's not
supported yet.
This continues to formalize the idea of what "template application" and
"template expansion" mean in TAMER. It makes a separate `TplApply`
unnecessary, because now application is simply a reference to a
template. Expansion and application are one and the same: when a template
expands, it'll re-bind metavariables to the parent context. So in a
template context, this amounts to application.
But applying a closed template will have nothing to bind, and so is
equivalent to expansion. And since `Meta` objects are not valid outside of
a `Tpl` context, applying a non-closed template outside of another template
will be invalid.
So we get all of this with a single primitive (getting the "value" of a
template).
The expansion is conceptually like `,@` in Lisp, where we're splicing trees.
It's a mess in some spots, but I want to get this committed before I do a
little bit of cleanup.
This was missing `@FEATURES@`, which was causing more compilation than
necessary, but also causing clippy to evaluate different code.
This also adds RUSTFLAGS, for the same reason of not wanting to recompile.
DEV-13708
This chooses Option B, as the previous commit stated would likely be the
case. The reasons are practical---I intend to support partial application
if doing so is worth it, either in implementation of the compiler or the
source language.
Closed templates can be referenced using `IdentRef` to trigger
expansion---their value is what they expand into, and they are spliced into
that point in the tree, like `,@` in Lisp. We are able to overload this
behavior because we have the necessary type information.
However, I don't want to have to generate an Ident for every single template
expansion; there are many tens of thousands of them in our production
system. Since AIR doesn't presently have a way to deal with this situation,
I'll for now add a special token that will close and expand a template in
place; it can be replaced with two separate tokens (`TplEnd` + `Ref`, for
example) in the future if such a need arises.
Are we there yet...?
DEV-13708
Also known as metavariables or template parameters.
This is a bit of a tortured excursion, trying to figure out how I want to
best represent this. I have a number of pages of hand-written notes that
I'd like to distill over time, but the rendered graph ontology (via
`asg-ontviz`) demonstrates the broad idea.
`AirTpl::TplApply` highlights some remaining questions. What I had _wanted_
to do is to separate the concepts of application and expansion, and support
partial application and such. But it's going to be too much work for now,
when it isn't needed---partial application can be worked around by simply
creating new templates and duplicating params, as we do today, although that
sucks and is a maintenance issue. But I'd rather address that head-on in
the future.
So it's looking like Option B is going to be the approach for now, with
templates being closed (as in, no free metavariables) and expanded at the
same time. This simplifies the parser and error conditions significantly
and makes it easier to utilize anonymous templates, since it'll still be the
active context.
My intent is to get at least the graph construction sorted out---not the
actual expansion and binding yet---enough that I can use templates to
represent parts of NIR that do not have proper graph representations or
desugaring yet, so that I can spit them back out again in the `xmli` file
and incrementally handle them. That was an option I had considered some
months ago, but didn't want to entertain it at the time because I wasn't
sure what doing so would look like; while it was an attractive approach
since it pushes existing primitives into the template system (something I've
wanted to do for years), I didn't want to potentially tank performance or
compromise the design for it after I had spent so much effort on all of this
so far.
But my efforts have yielded a system that significantly exceeds my initial
performance expectations, with decent abstractions, and so this seems
viable.
DEV-13708
See the Air docblock for more information. I'm introducing new tokens for
the template system, which uses the terms "free" and "closed". I prefer
open/close for delimiters, as I've expressed elsewhere, but unfortunately it
conflicts too much (and too confusingly) with other standard terminology as
we get more into the formal side of the language.
DEV-13708
This removes special cases, but it does complicate the parent `AirAggregate`
parser. A pattern of delegation is forming, though abstracting it may be an
interesting challenge, given Rust's limitation on macro invocations as match
arms. But, I think I can manage by generating the entire match using a
macro with a match-compatible syntax, augmenting where
needed...maybe. This'll be messy.
...but if I can write the nightmare that is `ele_parse!`, I'm sure I can
manage this. I just prefer to avoid complex macros unless I really need
them.
DEV-13708
Now that these are actually intended to be used as part of the build, this
is a more appropriate location. I originally wrote it as a manual tool.
DEV-13708
This parses the declarative `object_rel!` definitions from the Rust sources
and produces a DOT representation of the ontology of the graph, which can
then be rendered using Graphviz.
This does not yet introduce it into the build; it ought to be run as part of
`make check` (without rendering with Graphviz) to ensure that we catch
breaking changes, and `make html` ought to integrate it into the
documentation, perhaps as part of `asg::graph` or `asg::graph::object`.
DEV-13708
Small break from templates for something easier. I have COVID-19, so I'll
use that as my excuse for wanting to be more lazy.
The real reason is to see some more concrete progress and ensure that
patterns hold for simple expressions before further refactoring.
But, before I proceed with such refactoring, I really ought to approach
something that requires a NIR desugaring step, like case statements.
DEV-13708
Going higher than that doesn't make sense because we're in shell and
invoking commands all around this, so even milliseconds isn't going to be
entirely accurate here. However, what I am more interested in is observing
time relative to other runs; this isn't intended for profiling, but for
eyeballing unexpected behavior.
DEV-13708
There's a lot to look at, especially in the event of failure. Further, I
wanted to add additional statistics that could be eyeballed.
Right now, tamec is too fast (at least on my machine) for the precision of
/usr/bin/time: we need milliseconds, but we only get hundredths of a
second. So it'll all show as 0:00.00s. Which is okay, for now; it just
shouldn't exceed that. ;)
DEV-13708
The intent was to have a very simple implementation of `hold_dangling` and
have everything work. But, I had a nasty surprise when the system tests
caught a bug caused by some interesting depth interactions as they relate to
`xmli` and auto-closing.
I added an extra test/example in `asg::graph::visit::test` to illustrate the
situation; it was difficult to derive from the traces, but trivially obvious
once I wrote it out as an example.
With that, templates can now aggregate tokens for dangling expressions.
DEV-13708
This won't try the fixpoint test if the prior one fails, which will always
cause that one to fail. And it further won't attempt the diff on
compilation failure.
DEV-13708
And finally we have tokens aggregated onto the ASG in the context of a
template. I expected to arrive here much more quickly, but there was a lot
of necessary refactoring. There's a lot more that could be done, but I need
to continue; I had wanted this done a week ago.
It is worth noting, though, that this finally achieves something I had been
wondering about since the inception of this project---how I'd represent
templates on the graph. I think this worked out rather nicely. It wasn't
even until a few months ago that I decided to use AIR instead of NIR for
that purpose (NIR wouldn't have worked).
And note how I didn't have to touch the program derivation at all---the
system test just works with the AIR change, because of the consistent
construction of the graph. Beautiful.
DEV-13708
This hoists the errors back into `AirAggregate`; I need dead states for the
`AirTplAggregate` parser so that it will know when to (and not to) interpret
tokens in the context of the template itself.
In a previous commit message, I had pondered whether it may be possible to
eliminate the dead state transition, and yet here I've used it with both of
the sub-parsers now. So it seems like the better option in the future may
be to narrow the type further---to say precisely _what_ types of tokens may
yield a dead state transition; otherwise you lose the match information from
the parser that yielded it.
A stubbornly persistent problem in Rust, this magical and hidden match
knowledge.
DEV-13708
This sets us up to be able to determine how `Dangling` expressions will be
rooted into templates.
This new strategy isn't yet handling `Dangling`; I wanted to get this
committed first so that the `Dangling` refactoring is more clear.
DEV-13708
Expressions were previously tied to packages. This prepares for using a
`Tpl` as a container for expressions.
This does not yet handle the situation of auto-rooting dangling expressions
within the container.
DEV-13708
This results in less useful debug output, but it'll be needed for using
a (possibly-anonymous) template as evidence.
This evidence is simply for debugging, and to require some sort of value
during development to help make it obvious when something might be done
incorrectly (if no obvious value exists).
DEV-13708
This is more of the same refactoring that has been happening. This
extraction also helps emphasize the relationship between imported objects,
and isolates the growing number of test cases. This parser will only grow.
DEV-13708
Just as was done with the expression parser, which this will utilize. This
initializes it, but doesn't yet make use of it (`AirExprAggregate`).
Refactoring was definitely needed; decomposing this is quite a bit of work,
in no small part because of the complexity. This helps significantly.
DEV-13708
This works around limitations of Rust's borrow checker as of the time of
writing. See the provided documentation for more information.
The branch context is not yet exposed to the `delegate` family of methods;
it will be added only as needed in the future.
DEV-13708
This delegates expression parsing to `AirExprAggregate`, in an effort to
both begin to simplify the understanding and maintenance of `AirAggregate`;
and allow for parser composition for template parsing.
This utilizes the prior changes for token sum types to precisely define the
subset of AIR tokens supported by the expression parser. This differs from
prior approaches which delegated until a dead state, relying on runtime
information to determine if a parser has finished. This allows us to
determine that statically.
I do want to be able to eliminate the dead state from the parser so we can
get rid of the `unreachable!`, but I need to move on; that's something I had
tried to do in the past too, which ended up adding a bit of complexity, and
I'll have to consider my options in the future, including whether the dead
state transition can be entirely eliminated in favor of the combination of
these sum types and recovery; the parsing framework decisions were made
while recovery was still an open question, at least in practice.
DEV-13708
This was a rather frustrating thing to encounter. I was working on
refactoring `AirAggregate`, and found that my tests were hanging despite no
apparent cause in the parser itself.
As it turns out, rather than failing with a `FinalizeError` as I
expected (since I was mid-refactor), `collect()` was allocating space for an
endless stream of errors. This was easily verified by adding a `take(x)`
and observing the assertion failure (in this case, in `close_pkg_mid_expr`).
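(A hedged illustration, not the actual test code: bounding an endless
error-yielding iterator with `take` turns the hang into an observable
failure.)

  fn main() {
      // Hypothetical stand-in for a parser stuck in a cyclic state: it
      // yields errors forever, so collecting into a Vec never terminates.
      let endless = std::iter::repeat(Err::<(), &str>("unexpected token"));

      // Bounding the stream makes the failure observable instead:
      let observed: Vec<_> = endless.take(5).collect();
      assert_eq!(observed.len(), 5);
  }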
This happens to be the first time in a long time that I actually had to
debug---the combination of robust types as proofs and tests to fill in the
gaps means that runtime issues are caught at build time in all but
exceptional cases (like this one).
It's also worth noting that, because of my policy of iterating only at the
higher levels of the program, it was clear that this must somehow be
Parser-related, since that's the only part of the system that has the
potential for unbounded recursion due to its cyclic state machines.
DEV-13708
This introduces a new macro `sum_ir!` to help with a long-standing problem
of not being able to easily narrow types in Rust without a whole lot of
boilerplate. This patch includes a bit of documentation, so see that for
more information.
This was not a welcome change---I jumped down this rabbit hole trying to
decompose `AirAggregate` so that I can share portions of parsing with the
current parser and a template parser. I can now proceed with that.
This is not the only implementation that I had tried. I previously inverted
the approach, as I've been doing manually for some time: manually create
types to hold the sets of variants, and then create a sum type to hold those
types. That works, but it resulted in a mess for systems that have to use
the IR, since now you have two enums to contend with. I didn't find that to
be appropriate, because we shouldn't complicate the external API for
implementation details.
The enum for IRs is supposed to be like a bytecode---a list of operations
that can be performed with the IR. They can be grouped if it makes sense
for a public API, but in my case, I only wanted subsets for the sake of
delegating responsibilities to smaller subsystems, while retaining the
context that `match` provides via its exhaustiveness checking but does not
expose as something concrete (which is deeply frustrating!).
Anyway, here we are; this'll be refined over time, hopefully, and
portions of it can be generalized for removing boilerplate from other IRs.
Another thing to note is that this syntax is really a compromise---I had to
move on, and I was spending too much time trying to get creative with
`macro_rules!`. It isn't the best, and it doesn't seem very Rust-like in
some places and is therefore not necessarily all that intuitive. This can
be refined further in the future. But the end result, all things
considered, isn't too bad.
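To make the idea concrete, here is a hand-written sketch of roughly what
such a macro automates; the token names are illustrative, not the actual
AIR variants:

  enum Air { PkgStart, PkgEnd, ExprStart, ExprEnd }

  // A subset ("sum") of the variants that a sub-parser can be delegated:
  enum AirExpr { ExprStart, ExprEnd }

  impl TryFrom<Air> for AirExpr {
      // Yield back the unmatched token so the caller can keep dispatching,
      // with the compiler's exhaustiveness checks intact on both sides.
      type Error = Air;

      fn try_from(tok: Air) -> Result<Self, Air> {
          match tok {
              Air::ExprStart => Ok(AirExpr::ExprStart),
              Air::ExprEnd => Ok(AirExpr::ExprEnd),
              other => Err(other),
          }
      }
  }

  impl From<AirExpr> for Air {
      fn from(tok: AirExpr) -> Self {
          match tok {
              AirExpr::ExprStart => Air::ExprStart,
              AirExpr::ExprEnd => Air::ExprEnd,
          }
      }
  }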
DEV-13708
This sets the stage for template parsing, and finally decides how we're
going to represent templates on the ASG. This is going to start simple,
since my original plans for improving how templates are
handled (conceptually) are going to have to wait.
This is the last difficult object type to figure out, with respect to graph
representation and derivation, so I wanted to get it out of the way.
DEV-13708
I wasn't initially sure whether I'd want separate tokens for different types
of identifying operations, but now I see from the current state of the
parser that there's no need.
This matches the name of the token in NIR.
DEV-13708
The previous commit demonstrated the amount of boilerplate necessary for
introducing new `ObjectKind`s; this abstracts away a lot of that
boilerplate, and allows for declarative relationship definition for the
ASG's ontology.
DEV-13708
There's quite a bit of boilerplate here that'll eventually need factoring
out. But it's also clear that it is somewhat onerous to add new object
types.
Note that a good chunk of this burden is _intentional_, via exhaustiveness
checks---adding a new type of object is an exceptional occurrence (well, in
principle, but we haven't added them all yet, so it'll be more common
initially), and we'd rather be safe and ensure that everything properly
considers how the new type of object interacts with the rest of the system.
Let's not confuse coupling with safety---the latter causes a burden because
of the former, not because of itself; it provides a service to us.
But, nonetheless, we'll want to reduce this burden somewhat since there are
a number more to add.
DEV-13708
Just as `rate` is a `sum`, `classify` is an `all` by default. The `@any`
attribute will change that interpretation, though I only intend to recognize
that in parsing later on, not emit that in XMLI.
DEV-13708
Let's start to be explicit about what's missing as we continue to add new
tokens; the exhaustiveness checks throughout the system will guide the
changes that need to be made.
DEV-13708
The element only, no attributes yet.
I'll keep forming boilerplate until abstraction points become obvious with
more variety; this is still pretty close to what was already supported.
DEV-13708
We already had `TreeContext`, and I'm passing the same arguments around, so
this uses it to lift arguments out of these functions, like partial
application.
DEV-13708
This tidies this method up into a decent state that I'm fairly content
with. This goes to emphasize my dislike of returns, which muddy control
flow and make the code more difficult to read at a glance, which increases
the likelihood of logic bugs.
`match` statements in tail position, on the other hand, are very clear, and
less cognitively burdensome since you can see each individual code path at a
glance.
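A contrived sketch of the stylistic point:

  // Every code path is visible in one tail-position `match`, with no
  // early returns to trace through the body.
  fn describe(n: i32) -> &'static str {
      match n {
          0 => "zero",
          n if n < 0 => "negative",
          _ => "positive",
      }
  }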
DEV-13708
This begins to develop a pattern for doing these transformations. I had
tried a number of things using iterators, but I wasn't satisfied with how
they were turning out: I either had to fight too much with the type system
or had to resort to heap allocations. Sticking with an explicit
`push`/`push_all` for now works just fine.
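The pattern is roughly this (illustrative names; the real implementation
uses a fixed-capacity stack rather than `Vec`):

  struct TokenStack<T>(Vec<T>);

  impl<T> TokenStack<T> {
      fn push(&mut self, tok: T) {
          self.0.push(tok);
      }

      // Queue several derived tokens while visiting a single node; they
      // are then popped and yielded one at a time on subsequent calls.
      fn push_all(&mut self, toks: impl IntoIterator<Item = T>) {
          self.0.extend(toks);
      }
  }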
Almost done cleaning up `AsgTreeToXirf::parse_token`, and then I can move on
to introducing more objects.
DEV-13708
This is generic over the source, just as the target, defaulting just the
same to `ObjectIndex`.
This allows us to use only the edge information provided rather than having
to perform another lookup on the graph and then assert that we found the
correct edge. In this case, we're dealing with an `Ident->Expr` edge, of
which there is only one, but in other cases, there may be many such edges,
and it wouldn't be possible to know _which_ was referred to without also
keeping context of the previous edge in the walk.
So, in addition to avoiding more indirection and being more immune to logic
bugs, this also allows us to avoid states in `AsgTreeToXirf` for the purpose
of tracking previous edges in the current path. And it means that the tree
walk can seed further traversals in conjunction with it, if that is so
needed for deriving sources.
More cleanup will be needed, but this does well to set us up for moving
forward; I was too uncomfortable with having to do the separate
lookup. This is also a more intuitive API.
But it does have the awkward effect that now I don't need the pair---I just
need the `Object`---but I'm not going to remove it because I suspect I may
need it in the future. We'll see.
The TODO references the fact that I'm using a convenient `resolve_oi_pairs`
instead of resolving only the target first and then the source only in the
code path that needs it. I'll want to verify that Rust will properly
optimize to avoid the source resolution in branches that do not need it.
DEV-13708
This makes the inner `Object` type generic (but defaulting to the same inner
types as before) so that it can be used as a sum type for various types
where `ObjectKind`-based narrowing is required.
In this case, it's used to narrow `ObjectIndex` alongside the inner
`ObjectKind` so that the two are definitely in sync. This not only results
in cleaner code and a more intuitive API that's approachable to people
less familiar with the system, but it also helps to eliminate logic bugs
that might result from manually narrowing (as was done before this change).
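Roughly the shape of the idea, with illustrative definitions rather than
the actual ones:

  struct Pkg;
  struct Expr;
  struct ObjectIndex<O>(usize, std::marker::PhantomData<O>);

  // `Object` made generic over its inner types, defaulting to the owned
  // objects as before:
  enum Object<P = Pkg, E = Expr> {
      Pkg(P),
      Expr(E),
  }

  // The same sum can then carry each typed index alongside its resolved
  // object, so narrowing one necessarily narrows the other:
  type ObjectIndexPair<'a> = Object<
      (ObjectIndex<Pkg>, &'a Pkg),
      (ObjectIndex<Expr>, &'a Expr),
  >;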
DEV-13708
This was a fairly simple addition, since rate blocks already lower into sum
expressions; these are just non-identified.
This does emphasize that the nir::parse `ele_parse!` abstraction I spent so
much time on ended up not being a perfect fit, as it now has some
boilerplate after it was stripped of much of its capabilities some time ago.
Don't worry, `nir::air` and `asg::graph::xmli` will get cleaned up.
DEV-13708
This extends the POC a bit by beginning to reconstruct rate blocks (note
that NIR isn't producing sub-expressions yet).
Importantly, this also adds the first system tests, now that we have an
end-to-end system. This not only gives me confidence that the system is
producing the expected output, but serves as a compromise: writing unit or
integration tests for this program derivation would be a great deal of work,
and wouldn't even catch the bugs I'm worried most about; the lowering
operation can be written in such a way as to give me high confidence in its
correctness without those more granular tests, or in conjunction with unit
or integration tests for a smaller portion.
DEV-13708
This provides a test harness for running shell-based system tests. The
first of such tests will be introduced in the following commit.
This is done in place of integration tests written in Rust because it will
invoke the final binary exactly as the user or build system (using TAMER)
will, providing greater confidence. Besides, a lot of things are simply
more convenient to do in shell. ...though some of you may debate that.
DEV-13708
The intent is to source this in shell scripts, like tests.
This exposes feature flags to shell scripts, but it doesn't do so in quite
the same way that Rust does---it doesn't apply the dependencies. While this
isn't needed now, it does make me a little uncomfortable, and so I may take
a different approach in the future.
DEV-13708
Just some final POC setup for how this'll work; it's nothing
significant. This just emits an `@xmlns` on the `package` element to
demonstrate use of the stack.
With that, it's time to formalize this.
I also need to document at some point why I chose to use `ArrayVec` still
over `Vec`---it's not a micro-optimization. It's intended to simplify the
runtime to keep execution simple with fewer code paths and make it more
amenable to analysis. Memory allocation is a pretty complex thing and
muddies execution. It's also another point of failure, though practically
speaking, I'm not worried about that---this is replacing a system that
consumes many GiB of memory (XSLT-based compiler) with one that consumes 10s
of MiB.
DEV-13708
This holds the state of a stack that I can populate with tokens rather
than introducing a state for every single e.g. attribute and such on
elements (so, more like the `xmle` XIR lowering).
It also hides the obvious awkwardness of the `&mut &'a Asg`, but that's not
the intent of it.
DEV-13708
This is just a special case of lowering with a context, and maintaining two
separate implementations has resulted in divergence. I don't recall why I
didn't do this previously, though it's possible that the lowering pipeline
was in a state that made it more difficult to do (e.g. with error
handling).
DEV-13708
Technically, an "acceptor" in the context of state machines is actually a
state machine; the terminology here is more describing the configuration of
the state machine (`XirToXirf`) as an acceptor.
This change comes with significant documentation of the rationale and why
this is important; see that for more information.
This change is necessary so that we can enforce finalization on all parsers
in the lowering pipeline, which is not currently being done. If we were to
do that now, then `tameld` would fail because it halts parsing of the token
stream at the end of the `xmlo` header.
This is also quite the type soup, but I'm not going to refine this further
right now, since my focus is elsewhere (XMLI lowering).
DEV-13708
This has been a long time coming. The wiring of it all together is a little
rough around the edges right now, but this commit represents a working POC
to begin to fill in the gaps for the entire lowering pipeline.
I had hoped to be at this point a year ago. Yeah.
This marks a significant milestone in the project because this allows me to
begin to observe the implementation end-to-end, testing it on real-life
inputs as part of a production build pipeline.
...and now, with that, we can begin. So much work has gone into this
project so far, but aside from the linker (which has been in production for
years), most of this work has been foundational. It's been a significant
investment that I intend to have pay off in many different ways.
(All this outputs right now is `<package/>`.)
DEV-13708
This replaces the stub `derive_xmli` with the same result (well, minus a
space before the '/' in the output) using what will become the lowering
pipeline. Once again, this is quite verbose, and the lowering pipeline in
general needs to be further abstracted away.
Unlike the rest of the pipeline, an error during the derivation process will
immediately terminate with an unrecoverable error, because we do not want to
write partial files. This does not remove the garbage file, because the
build system ought to do that itself (e.g. `make`)...but that is certainly
open for debate.
DEV-13708
The reader previously yielded a `ParsedResult`, presumably to simplify
lowering operations. But the reader is not a `ParseState`, and does not
otherwise use the parsing API, so this was an inappropriate and confusing
coupling.
This resolves that, introducing a new `lowerable` which will translate an
iterator into something that can be placed in a lowering pipeline.
See the previous commit for more information.
DEV-13708
The token type was previously hard-coded to `UnknownToken`, since the use
case was the beginning of the lowering pipeline at the start of the program,
where there was no token type because the first parser (`XirReader`,
currently) is responsible for producing the first token type.
But when we're lowering from the graph (so, the other side of the lowering
pipeline), we _do_ have token types to deal with.
This also emphasizes the inappropriate coupling of `<XirReader as
Iterator>::Item` with `ParsedResult`; I'd like to follow the same approach
that I'm about to introduce with `tamec`, so see a future commit.
DEV-13708
This was missed (because it was not used) when EOF tokens were originally
introduced via `ParseState::eof_tok`---`LowerIter` also needs to consider
the token.
This separation between the two iterators is a maintenance burden that needs
to be taken care of; I knew that at the time, and then I forgot about it,
and here we are.
This was caught while beginning to wire together a POC graph lowering
pipeline to emit derived sources.
DEV-13708
This parser does exactly what it says it does. Its implementation is
simple, but I added a test anyway just to prove that it works, and the test
seems more complicated than the implementation itself, given the types
involved.
DEV-13708
This introduces a `Token` in place of the original tuple for
`TreePreOrderDfs` so that it can be used as input to a parser that will
lower into XIRF.
This requires that various things be describable (using `Display`), which
this also adds. This is an example of where the parsing framework itself
enforces system observability by ensuring that every part of the system can
describe its state.
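The obligation amounts to something like this, with an illustrative token
type:

  use std::fmt::{self, Display};

  struct WalkToken {
      depth: usize,
  }

  // The parsing framework demands that every token describe itself,
  // which keeps traces and diagnostics meaningful for free.
  impl Display for WalkToken {
      fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
          write!(f, "tree walk token at depth {}", self.depth)
      }
  }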
DEV-13708
This lowering operation is intended to allow me to write a more concise and
clear mapping from the graph to XIRF, without having to worry about
balancing tags, which really complicated the implementation.
This has detailed docs; see those for more information.
I can't help but be reminded of Wisp (the whitespace-based Lisp-like
syntax). Which is unfortunate, because I'm not fond of Wisp; I like my
parentheses.
DEV-13708
The `TreePreOrderDfs` iterator needed to expose additional edge context to
the caller (specifically, the `Span`). This was getting a bit messy, so
this consolidates everything into a new `DynObjectRel`, which also
emphasizes that it is in need of narrowing.
Packing everything up like that also allows us to return more information to
the caller without complicating the API, since the caller does not need to
be concerned with all of those values individually.
Depth is kept separate, since that is a property of the traversal and is not
stored on the graph. (Rather, it _is_ a property of the graph, but it's not
calculated until traversal. But, depth will also vary for a given node
because of cross edges, and so we cannot store any concrete depth on the
graph for a given node. Not even a canonical one, because once we start
doing inlining and common subexpression elimination, there will be shared
edges that are _not_ cross edges (the node is conceptually part of _both_
trees). Okay, enough of this rambling parenthetical.)
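A hedged sketch of the packing; the field names here are guesses for
illustration:

  struct ObjectIndex(usize);
  struct Span(u32);

  struct DynObjectRel {
      source: ObjectIndex,
      target: ObjectIndex,
      // Optional, since not every edge carries its own source location:
      span: Option<Span>,
  }

  // Depth stays outside the struct because it is a property of the
  // traversal, e.g. an iterator yielding `(Depth, DynObjectRel)` pairs.
  type Depth = usize;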
DEV-13708
This information is necessary to be able to reconstruct the tree, since
the `ObjectIndex` alone does not give you enough information. Even if you
inspected the graph, it _still_ wouldn't give you enough information, since
you don't know the current path of the traversal for nodes that may have
multiple incoming edges. (Any assumptions you could make today won't
always be valid in the future.)
DEV-13708
This begins to introduce a graph traversal useful for a source
reconstruction from the current state of the ASG. The idea is to, after
having parsed and ingested the source through the lowering pipeline, to
re-output it to (a) prove that we have parsed correctly and (b) allow
progressively moving things from the XSLT-based compiler into TAMER.
There's quite a bit of documentation here; see that for more
information. Generalizing this in an appropriate way took some time, but I
think this makes sense (that work began with the introduction of cross edges
in terms of the tree described by the graph's ontology). But I do need to
come up with an illustration to include in the documentation.
DEV-13708
The `Pkg` span will now properly reflect the entire definition of the
package including the opening and closing tags.
This was found while I was working on a graph traversal.
DEV-13597
I noticed this while working on a graph traversal. The unit test used the
same span for both the reference _and_ the binding, so I didn't notice. -_-
The problem with this, though, is that we do not have a separate span
representing the source location of the identifier reference. The reason is
that we decided to re-use an existing node rather than creating another one,
which would add another inconvenient layer of indirection (and complexity).
So, I may have to add (optional?) spans to edges.
DEV-13708
This introduces the concept of ontological cross edges.
The term "cross edge" is most often seen in the context of graph traversals,
e.g. the trees formed by a depth-first search. This, however, refers to the
trees that are inherent in the ontology of the graph.
For example, an `ExprRef` will produce a cross edge to the referenced
`Ident`, which is a different tree than the current expression. (Well,
I suppose technically it _could_ be a back edge, but then that'd be a cycle
which would fail the process once we get to preventing it. So let's ignore
that for now.)
DEV-13708
This was done so we can use the t:param template with the generated
enum without having to provide the value in the YML test. Without
a NONE enum variant as 0, the default value of 0 in the YML test would
be a domain violation.
This causes a package definition to be rooted (so that it can be easily
accessed for a graph walk). This keeps consistent with the new
`ObjectIndex`-based API by introducing a unit `Root` `ObjectKind` and the
boilerplate that goes with it.
This boilerplate, now glaringly obvious, will be refactored at some point,
since its repetition is onerous and distracting.
DEV-13159
Included in this diff are the corresponding changes to the graph to support
the change. Adding the edge was easy, but we also need a way to get the
package for an identifier. The easiest way to do that is to modify the edge
weight to include not just the target node type, but also the source.
DEV-13159
This does not yet create edges from identifiers to the package; just getting
this introduced was quite a bit of work, so I want to get this committed.
Note that this also includes a change to NIR so that `Close` contains the
entity so that we can pattern-match for AIR transformations rather than
retaining yet another stack with checks that are already going to be done by
AIR. This makes NIR stand less on its own from a self-validation standpoint, but
that's okay, given that it's the language that the user entered and,
conceptually, they could enter invalid NIR the same as they enter invalid
XML (e.g. from a REPL).
In _practice_, of course, NIR is lowered from XML and the schema is enforced
during that lowering and so the validation does exist as part of that
parsing.
These concessions speak more to the verbosity of the language (Rust) than
anything.
DEV-13159
Rather than panicking at this level, let's panic at the caller, simplifying
impls and keeping them total.
This can't occur now, but an upcoming change introducing a package type will
allow for such a thing.
DEV-13159
This hides information that's taking up a lot of space in the parser traces
and is not useful information. In particular, the `index` contains a lot of
empty space due to pre-interned symbols.
The index was going to be converted into a HashMap, but that was reverted
because the tradeoff did not make sense, and so this problem remains; see
the previous commit for more information.
DEV-13159
This reverts commit 1b7eac337cd5909c01ede3a5b3fba577898d5961.
I don't actually think this ends up being worth it in the end. Sure, the
implementation is simpler at a glance, but it is more complex at runtime,
adding more cycles for little benefit.
There are ~220 pre-interned symbols at the time of writing, so ~880 bytes (4
bytes per symbol) are potentially wasted if _none_ of the pre-interned
symbols end up serving as identifiers in the graph. The reality is that
some of them _will_ be, but using HashMap also introduces overhead, so in
practice, the savings are much less. On a fairly small package, it was <100
bytes of memory savings in `tamec`. For `tameld`, it actually uses _more_
memory, especially on larger packages, because there are 10s of thousands of
symbols involved. And we're incurring a rehashing cost on resize, unlike
this original plain `Vec` implementation.
So, I'm leaving this in the history to reference in the future or return to
it if others ask; maybe it'll be worth it in the future.
This was originally written before there were a bunch of pre-interned
symbols. Now the index vector is very sparse.
This simplifies things a bit. If this ends up manifesting as a bottleneck
in the future, we can revisit the implementation. While this does result in
more cycles, it's negligible relative to the total cycle count.
This commit is what I've been sitting on for testing some of the recent
changes; it is a very basic demonstration of lowering all the way down
from source XML files into the ASG. This can be run on real files to
observe, beyond unit tests, how the system reacts.
Once this outputs data from the graph, we'll finally have tamec end-to-end
and can just keep filling the gaps.
I'm hoping to roll the desugaring process into NirToAir rather than having a
separate process as originally planned a couple of months back.
This also introduces the `wip-nir-to-air` feature flag. Currently,
interpolation will cause a `Nir::BindIdent` to be emitted in blocks that
aren't yet emitting NIR, and so results in an invalid parse.
DEV-13159
This adds support for identifier references, adding `Ident` as a valid edge
type for `Expr`.
There is nothing in the system yet to enforce ontology through levels of
indirection; that will come later on.
I'm testing these changes with a very minimal NIR parse, which I'll commit
shortly.
DEV-13597
This was originally created to populate Neo4J for querying, but it has not
been utilized. It's become a maintenance burden as I try to change the API
of and encapsulate the graph, which is important for upholding its
invariants.
This feature, or one like it, will return in the future. I have other
related plans; we'll see if they materialize.
The graph can't be encapsulated fully just yet because of the linker; those
commits will come in the following days.
DEV-13597
This allows for edges to be multiple types, and gives us two important
benefits:
(a) Compiler-verified correctness to ensure that we don't generate graphs
that do not adhere to the ontology; and
(b) Runtime verification of types, so that bugs are still memory safe.
There is a lot more information in the documentation within the patch.
This took a lot of iterating to get something that was tolerable. There's
quite a bit of boilerplate here, and maybe that'll be abstracted away better
in the future as the graph grows.
In particular, it was challenging to determine how I wanted to actually go
about narrowing and looking up edges. Initially I had hoped to represent
the subsets as `ObjectKind`s as well so that you could use them anywhere
`ObjectKind` was expected, but that proved to be far too difficult because I
cannot return a reference to a subset of `Object` (the value would be owned
on generation). And while in a language like C maybe I'd pad structures and
cast between them safely, since they _do_ overlap, I can't confidently do
that here since Rust's discriminant and layout are not under my control.
I tried playing around with `std::mem::Discriminant` as well, but
`discriminant` (the function) requires a _value_, meaning I couldn't get the
discriminant of a static `Object` variant without some dummy value; wasn't
worth it over `ObjectRelTy`. We further can't assign values to enum
variants unless they hold no data. Rust a decade from now may be different;
it will be interesting to look back on this struggle.
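A minimal sketch of the two layers, with illustrative names:

  struct Ident;
  struct Expr;

  // Compile time: an edge can be constructed only if the ontology
  // declares the relationship via a trait implementation.
  trait ObjectRelTo<Target> {}
  impl ObjectRelTo<Expr> for Ident {} // Ident -> Expr is permitted
  // (no `impl ObjectRelTo<Ident> for Expr`, so that edge cannot be built)

  // Run time: the edge weight records the expected type so that a bug
  // still fails safely rather than misinterpreting an object.
  #[derive(Clone, Copy, Debug, PartialEq)]
  enum ObjectRelTy { Ident, Expr }

  fn check_edge(found: ObjectRelTy, expected: ObjectRelTy) {
      assert_eq!(found, expected, "edge type does not match ontology");
  }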
DEV-13597
We only need a reference to the inner object, for which `AsRef` is the
proper and idiomatic solution.
There is a lot of boilerplate here that I hope to reduce in the future.
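The idiom, sketched with illustrative types:

  struct Ident;
  struct ObjectContainer(Ident);

  impl AsRef<Ident> for ObjectContainer {
      fn as_ref(&self) -> &Ident {
          &self.0
      }
  }

  // Callers state only what they need: a reference to the inner object.
  fn with_ident(container: &impl AsRef<Ident>) {
      let _ident: &Ident = container.as_ref();
  }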
DEV-13597
`ObjectRelTo` is sufficient and, while I originally thought it was useful to
have it read left-to-right, it just ends up being a cognitive burden.
DEV-13597
I'm spending a lot of time considering how the future system will work,
which is complicating the needs of the system now, which is to re-output the
source XML so that we can selectively start to replace things.
So I'm going to punt on this.
I was also planning out how that edge reassignment ought to work, along with
traits to try to enforce it, and that is also complicated, so I may wind up
wanting to leave them in the end, or handling this
differently. Specifically, I'll want to know how `value-of` expressions are
going to work on the graph first, since its target is going to be dynamic
and therefore not knowable at compile-time. (Rather, I know how I want to
make them work, but I want to observe that working in practice first.)
DEV-13597
There is extensive rationale in the documentation for this new macro. I'm
utilizing it to provide a more clear and friendly message for incomplete
ident resolution so that I can move on and return to those situations later.
It's worth noting that:
- Externs _will_ need to be handled in the near-term;
- Opaque and IdentFragment almost certainly won't be bound to a definition
until I introduce LTO, which is quite a ways off; and
- They may use the same mechanism and so may be able to be handled at the
same time anyway.
DEV-13597
The ASG delegates certain operations to Objects so that they may enforce
their own invariants and ontology. It is therefore important that only
objects have access to certain methods on `Asg`, otherwise those invariants
could be circumvented.
It should be noted that the nesting of this module is such that AIR should
_not_ have privileged access to the ASG---it too must utilize objects to
ensure those invariants are enforced in a single place.
DEV-13597
Starting to re-organize things to match my mental model of the new system;
the ASG abstraction has changed quite a bit since the early days.
This isn't quite enough, though; see next commit.
DEV-13597
This provides the initial implementation allowing an identifier to be
defined (bound to an object and made transparent).
I'm not yet entirely sure whether I'll stick with the "transparent" and
"opaque" terminology when there's also "declare" and "define", but a
`Missing` state is a type of declaration and so the distinction does still
seem to be important.
There is still work to be done on `ObjectIndex::<Ident>::bind_definition`,
which will follow. I'm going to be balancing work to provide type-level
guarantees, since I don't have the time to go as far as I'd like.
DEV-13597
This seems to have been an oversight from when I recently introduced SPairs
to ASG; I noticed it while working on another change and receiving back a
`DUMMY_SPAN`.
DEV-13597
`Ident` is now `Opaque`, but the new `Transparent` state isn't actually used
yet in any transitions; that'll come next.
The original (now "opaque") identifiers were added for the linker, which
does not need (at present) the associated expressions, since they've already
been compiled. In the future I'd like to do LTO (link-time optimization),
and then the graph will need more information.
DEV-13160
Some investigation into the disassembly of TAMER's binaries showed that Rust
was not able to conditionalize `expect`-like expressions as I was hoping due
to eager evaluation language semantics in combination with the use of
`format!`.
This solves the problem for the diagnostic system by creating types that
prevent this situation from occurring statically, without the need for a
lint.
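A sketch of the hazard and the fix, using only the standard library rather
than the new diagnostic types:

  // The argument to `expect` is built eagerly, so the `format!`
  // allocation and `Display` calls happen even on the happy path:
  fn eager(opt: Option<u32>, name: &str) -> u32 {
      opt.expect(&format!("missing value for {name}"))
  }

  // Deferring message construction keeps the success path free of it:
  fn lazy(opt: Option<u32>, name: &str) -> u32 {
      opt.unwrap_or_else(|| panic!("missing value for {name}"))
  }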
This invokes clippy as part of `make check` now, which I had previously
avoided doing (I'll elaborate on that below).
This commit represents the changes needed to resolve all the warnings
presented by clippy. Many changes have been made where I find the lints to
be useful and agreeable, but there are a number of lints, rationalized in
`src/lib.rs`, where I found the lints to be disagreeable. I have provided
rationale, primarily for those wondering why I desire to deviate from the
default lints, though it does feel backward to rationalize why certain lints
ought to be applied (the reverse should be true).
With that said, this did catch some legitimate issues, and it was also
helpful in getting some older code up-to-date with new language additions
that perhaps I used in new code but hadn't gone back and updated old code
for. My goal was to get clippy working without errors so that, in the
future, when others get into TAMER and are still getting used to Rust,
clippy is able to help guide them in the right direction.
One of the reasons I went without clippy for so long (though I admittedly
forgot I wasn't using it for a period of time) was because there were a
number of suggestions that I found disagreeable, and I didn't take the time
to go through them and determine what I wanted to follow. Furthermore, it
was hard to make that judgment when I was new to the language and lacked
the necessary experience to do so.
One thing I would like to comment further on is the use of `format!` with
`expect`, which is also what the diagnostic system convenience methods
do (which clippy does not cover). Because of all the work I've done trying
to understand Rust and looking at disassemblies and seeing what it
optimizes, I falsely assumed that Rust would convert such things into
conditionals in my otherwise-pure code...but apparently that's not the case,
when `format!` is involved.
I noticed that, after making the suggested fix with `get_ident`, Rust
proceeded to then inline it into each call site and then apply further
optimizations. It was also previously invoking the thread lock (for the
interner) unconditionally and invoking the `Display` implementation. That
is not at all what I intended for, despite knowing the eager semantics of
function calls in Rust.
Anyway, possibly more to come on that, I'm just tired of typing and need to
move on. I'll be returning to investigate further diagnostic messages soon.
This introduces a number of abstractions, whose concepts are not fully
documented yet since I want to see how it evolves in practice first.
This introduces the concept of edge ontology (similar to a schema) using the
type system. Even though we are not able to determine what the graph will
look like statically---since that's determined by data fed to us at
runtime---we _can_ ensure that the code _producing_ the graph from those
data will produce a graph that adheres to its ontology.
Because of the typed `ObjectIndex`, we're also able to implement operations
that are specific to the type of object that we're operating on. Though,
since the type is not (yet?) stored on the edge itself, it is possible to
walk the graph without looking at node weights (the `ObjectContainer`) and
therefore avoid panics for invalid type assumptions, which is bad, but I
don't think that'll happen in practice, since we'll want to be resolving
nodes at some point. But I'll address that more in the future.
Another thing to note is that walking edges is only done in tests right now,
and so there's no filtering or anything; once there are nodes (if there are
nodes) that allow for different outgoing edge types, we'll almost certainly
want filtering as well, rather than panicking. We'll also want to be able to
query for any object type, but filter only to what's permitted by the
ontology.
DEV-13160
Working with the graph can be confusing with all of the layers
involved. This begins to provide a better layer of abstraction that can
encapsulate the concept and enforce invariants.
Since I'm better able to enforce invariants now, this also removes the span
from the diagnostic message, since the invariant is now always enforced with
certainty. I'm not removing the runtime panic, though; we can revisit that
if future profiling shows that it makes a negative impact.
DEV-13160
This addresses the two outstanding `todo!` match arms representing errors in
lowering expressions into the graph. As noted in the comments, these errors
are unlikely to be hit when using TAME in the traditional way, since
e.g. XIR and NIR are going to catch the equivalent problems within their own
contexts (unbalanced tags and a valid expression grammar respectively).
_But_, the IR does need to stand on its own, and I further hope that some
tooling maybe can interact more directly with AIR in the future.
DEV-13160
This introduces a number of concepts together, again to demonstrate that
they were derived.
This introduces support for nested expressions, extending the previous
work. It also supports error recovery for dangling expressions.
The parser states are a mess; there is a lot of duplicate code here that
needs refactoring, but I wanted to commit this first at a known-good state
so that the diff will demonstrate the need for the change that will
follow; the opportunities for abstraction are plainly visible.
The immutable stack introduced here could be generalized, if needed, in the
future.
Another important note is that Rust optimizes away the `memcpy`s for the
stack that was introduced here. The initial Parser Context was introduced
because of `ArrayVec` inhibiting that elision, but Vec never had that
problem. In the future, I may choose to go back and remove ArrayVec, but I
had wanted to keep memory allocation out of the picture as much as possible
to make the disassembly and call graph easier to reason about and to have
confidence that optimizations were being performed as intended.
With that said---it _should_ be eliding in tamec, since we're not doing
anything meaningful yet with the graph. It does also elide in tameld, but
it's possible that Rust recognizes that those code paths are never taken
because tameld does nothing with expressions. So I'll have to monitor this
as I progress and adjust accordingly; it's possible a future commit will
call BS on everything I just said.
Of course, the counter-point to that is that Rust is optimizing them away
anyway, but Vec _does_ still require allocation; I was hoping to keep such
allocation at the fringes. But another counter-point is that it _still_ is
allocated at the fringe, when the context is initialized for the parser as
part of the lowering pipeline. But I didn't know how that would all come
together back then.
...alright, enough rambling.
DEV-13160
I had wanted to implement expression operations in terms of user-defined
functions (where primitives are just marked as intrinsic), and would still
like to, but I need to get this thing working, so I'll just include a note
for now.
Yes, TAMER's formalisms are inspired by APL, if that hasn't been documented
anywhere yet.
DEV-13160
This commit is purposefully coupled with changes that utilize it to
demonstrate that the need for this abstraction has been _derived_, not
forced; TAMER doesn't aim to be functional for the sake of it, since
idiomatic Rust achieves many of its benefits without the formalisms.
But, the formalisms do occasionally help, and this is one such
example. There is other existing code that can be refactored to take
advantage of this style as well.
I do _not_ wish to pull an existing functional dependency into TAMER; I want
to keep these abstractions light, and eliminate them as necessary, as Rust
continues to integrate new features into its core. I also want to be able
to modify the abstractions to suit our particular needs. (This is _not_ a
general recommendation; it's particular to TAMER and to my experience.)
This implementation of `Functor` is one such example. While it is modeled
after Haskell in that it provides `fmap`, the primitive here is instead
`map`, with `fmap` derived from it, since `map` allows for better use of
Rust idioms. Furthermore, it's polymorphic over _trait_ type parameters,
not method, allowing for separate trait impls for different container types,
which can in turn be inferred by Rust and allow for some very concise
mapping; this is particularly important for TAMER because of the disciplined
use of newtypes.
For example, `foo.overwrite(span)` and `foo.overwrite(name)` are both
self-documenting, and better alternatives than, say, `foo.map_span(|_|
span)` and `foo.map_symbol(|_| name)`; the latter are perfectly clear in
what they do, but lack a layer of abstraction, and are verbose. But the
clarity of the _new_ form does rely on either good naming conventions of
arguments, or explicit type annotations using turbofish notation if
necessary.
This will be implemented on core Rust types as appropriate and as
possible. At the time of writing, we do not yet have trait specialization,
and there's too many soundness issues for me to be comfortable enabling it,
so that limits that we can do with something like, say, a generic `Result`,
while also allowing for specialized implementations based on newtypes.
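A hedged sketch of that design; the actual trait differs in its details:

  // The type parameter lives on the _trait_, so separate impls exist per
  // newtype and inference selects one from the argument alone.
  trait Functor<T>: Sized {
      fn map(self, f: impl FnOnce(T) -> T) -> Self;

      // Derived from the `map` primitive:
      fn overwrite(self, replacement: T) -> Self {
          self.map(|_| replacement)
      }
  }

  struct Span(u32);
  struct Name(&'static str);
  struct Foo(Name, Span);

  impl Functor<Span> for Foo {
      fn map(self, f: impl FnOnce(Span) -> Span) -> Self {
          Foo(self.0, f(self.1))
      }
  }

  impl Functor<Name> for Foo {
      fn map(self, f: impl FnOnce(Name) -> Name) -> Self {
          Foo(f(self.0), self.1)
      }
  }

  fn main() {
      // Both calls resolve unambiguously from the newtype arguments:
      let _foo = Foo(Name("a"), Span(1))
          .overwrite(Span(2))
          .overwrite(Name("b"));
  }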
DEV-13160
Admittedly, these are _my_ debugging conventions. But I'm also the only one
working on this project right now.
I want to keep various things around without cluttering untracked file
output, because finding new files can be annoying in all the output.
Really, with a C background, I should have known that `write` may not write
all bytes, and I'm pretty sure I was aware, so I'm not sure how that slipped
my mind for every call. But it's not a great default, and I do feel like
`write_all` should be the default behavior, despite the syscall and C
library name.
It shouldn't take clippy to warn about something so significant.
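For the record, the distinction (standard library, nothing
project-specific):

  use std::io::Write;

  fn emit(dest: &mut impl Write, buf: &[u8]) -> std::io::Result<()> {
      // `dest.write(buf)?` is permitted to write only a prefix of `buf`;
      // `write_all` retries until the whole buffer is written (or errors).
      dest.write_all(buf)
  }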
This uses `ObjectIndex` to automatically narrow the type to what is
expected.
Given that `ObjectIndex` is supposed to mean that there must be an object
with that index, perhaps the next step is to remove the `Option` from `get`
as well.
DEV-13160
This makes the system a bit more ergonomic and introduces additional type
safety by associating the narrowed object type with the
`ObjectIndex` (previously `ObjectRef`). Not only does this allow us to
explicitly state the type of object wherever those indices are stored, but
it also allows the API to automatically narrow to that type when operating
on it again without the caller having to worry about it.
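Sketched with illustrative types; the real API differs in its details:

  use std::marker::PhantomData;

  struct Expr;
  struct ObjectIndex<O>(usize, PhantomData<O>);

  struct Asg {
      exprs: Vec<Expr>,
  }

  impl Asg {
      // The caller never re-asserts the object type; it travels with
      // the index itself.
      fn resolve(&self, index: ObjectIndex<Expr>) -> Option<&Expr> {
          self.exprs.get(index.0)
      }
  }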
DEV-13160
This begins to place expressions on the graph---something that I've been
thinking about for a couple of years now, so it's interesting to finally be
doing it.
This is going to evolve; I want to get some things committed so that it's
clear how I'm moving forward. The ASG makes things a bit awkward for a
number of reasons:
1. I'm dealing with older code where I had a different model of doing
things;
2. It's mutable, rather than the mostly-functional lowering pipeline;
3. We're dealing with an aggregate ever-evolving blob of data (the graph)
rather than a stream of tokens; and
4. We don't have as many type guarantees.
I've shown with the lowering pipeline that I'm able to take a mutable
reference and convert it into something that's both functional and
performant, where I remove it from its container (an `Option`), create a new
version of it, and place it back. Rust is able to optimize away the memcpys
and such and just directly manipulate the underlying value, which is often a
register with all of the inlining.
_But_ this is a different scenario now. The lowering pipeline has a narrow
context. The graph has to keep hitting memory. So we'll see how this
goes. But it's most important to get this working and measure how it
performs; I'm not trying to prematurely optimize. My attempts right now are
for the way that I wish to develop.
Speaking to #4 above, it also sucks that I'm not able to type the
relationships between nodes on the graph. Rather, it's not that I _can't_,
but a project to create a typed graph library is beyond the scope of this
work and would take far too much time. I'll leave that to a personal,
non-work project. Instead, I'm going to have to narrow the type any time
the graph is accessed. And while that sucks, I'm going to do my best to
encapsulate those details to make it as seamless as possible API-wise. The
performance hit of performing the narrowing I'm hoping will be very small
relative to all the business logic going on (a single cache miss is bound to
be far more expensive than many narrowings which are just integer
comparisons and branching)...but we'll see. Introducing branching sucks,
but branch prediction is pretty damn good in modern CPUs.
DEV-13160
This will be used for expression start and end spans to merge into a span
that represents the entirety of the expression; see future commits for its
use.
Though, this has been generalized further than that to ensure that it makes
sense in any use case, to avoid potential pitfalls.
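A minimal sketch of the merge, assuming offset/length spans and ignoring
source context:

  #[derive(Clone, Copy, Debug, PartialEq)]
  struct Span {
      offset: u32,
      len: u16,
  }

  impl Span {
      // Produce a span covering both inputs, e.g. an expression's opening
      // token merged with its closing token.
      fn merge(self, other: Span) -> Span {
          let start = self.offset.min(other.offset);
          let end = (self.offset + self.len as u32)
              .max(other.offset + other.len as u32);

          Span {
              offset: start,
              len: (end - start) as u16,
          }
      }
  }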
DEV-13160
This adds a line of padding between the last line of a source marking and
the first line of a footer, making it easier to read. This also matches the
behavior of Rust's error messages.
This is something I intended to do previously, but didn't have the
time. Not that I do now, but now that we'll be showing some more robust
diagnostics to users, it ought to look decent.
DEV-13430
This moves the special handling of circular dependencies out of
`poc.rs`---and to be clear, everything needs to be moved out of there---and
into the source of the error. The diagnostic system did not exist at the
time.
This is one example of how easy it will be to create robust diagnostics once
we have the spans on the graph. Once the spans resolve to the proper source
locations rather than the `xmlo` file, it'll Just Work.
It is worth noting, though, that this detection and error will ultimately
need to be moved so that it can occur when performing other operations on the
graph during compilation, such as type inference and unification. I don't
expect to go out of my way to detect cycles, though, since the linker will.
DEV-13430
Previously this just exported the variable into the environment, but I'm not
comfortable with the lack of visibility that provides; I want to be able to
see not only that it's happening, which will help to debug issues, but also
when it's _not_ happening so that I know that it needs to be introduced into
a configuration at a particular installation site.
This ASG implementation is a refactored form of original code from the
proof-of-concept linker, which was well before the span and diagnostic
implementations, and well before I knew for certain how I was going to solve
that problem.
This was quite the pain in the ass, but introduces spans to the AIR tokens
and graph so that we always have useful diagnostic information. With that
said, there are some important things to note:
1. Linker spans will originate from the `xmlo` files until we persist
spans to those object files during `tamec`'s compilation. But it's
better than nothing.
2. Some additional refactoring is still needed for consistency, e.g. use
of `SPair`.
3. This is just a preliminary introduction. More refactoring will come as
tamec is continued.
DEV-13041
The previous commit had the ASG implicitly constructed and then
discarded. This will keep it around, which will be necessary not only for
imports, but for passing the ASG off to the next phases of lowering.
DEV-13429
This does not yet yield the produced ASG, but does set up the lowering
pipeline to prepare to produce it. It's also currently a no-op, with
`NirToAsg` just yielding `Incomplete`.
The goal is to begin to move toward vertical slices for TAMER as I start to
return to the previous approach of a handoff with the old compiler. Now
that I've gained clarity from my previous failed approach (which I
documented in previous commits), I feel that this is the best way forward
that will allow me to incrementally introduce more fine-grained performance
improvements, at the cost of some throwaway work as this progresses. But
the cost of delay with these build times is far greater.
DEV-13429
This finalizes the implementation for interpolation. There is some more
cleanup that can be done, but it is now functioning as intended and
providing errors.
Finally. How deeply exhausting all of this has been.
DEV-13156
This just cleans up these tests a bit before I add to them. What we're left
with follows the structure of most other parser tests and is atm a good
balance between boilerplate and clarity in isolation (a fair level of
abstraction).
Could possibly do better by putting the inner objects in a callback so that
the `Close` can be asserted on commonly as well, but that's a bit awkward
with how the assertion is based on the collection; we'd have to keep the
last item from being collected from the iterator. I'd rather not deal with
such restructuring right now and figuring out a decent pattern. Perhaps in
the future.
DEV-13156
This is the culmination of all the recent work---the third attempt at trying
to integrate this. It ended up much cleaner than what was originally going
to be done, but only after gutting portions of the system and changing my
approach to how NIR is parsed (WRT attributes). See prior commits for more
information.
The final step is to fill the error branches with actual errors rather than
`todo!`s.
What a relief.
DEV-13156
This begins to introduce the new, simplified NIR by creating tokens that
serve as the expansion for interpolation. Admittedly, `Text` may change, as
it doesn't really represent `<text>foo</text>`, and I'd rather that node
change as well, though I'll probably want to maintain some sort of BC.
DEV-13156
This removes quite a bit of work, and work that was difficult to reason
about. While I'm disappointed that that hard work is lost (aside from
digging it up in the commit history), I am happy that it was able to be
removed, because the extra complexity and cognitive burden was significant.
This removes more `memcpy`s than the sum state could have hoped to, since
aggregation is no longer necessary. Given that, there is a slight
performance improvement. The re-introduction of required and duplicate
checks later on should be more efficient than this was, and so this should
be a net win overall in the end.
DEV-13346
This cleans up the old implementation now that it's no longer used (as of
the previous commit) by `ele_parse!`. It also removes the two error
variants that no longer apply: required attributes and duplicate
attributes.
DEV-13346
This handles the bulk of the integration of the new `attr_parse_stream!` as
a replacement for `attr_parse!`, which moves from aggregate attribute
objects to a stream of attribute-derived tokens. Rationale for this change
is in the preceding commit messages.
The first striking change here is how it affects the test cases: nearly all
`Incomplete`s are removed. Note that the parser has an existing
optimization whereby `Incomplete` with lookahead causes immediate recursion
within `Parser`, since those situations are used only for control flow and
to keep recursion out of `ParseState`s.
Next: this removes types from `nir::parse`'s grammar for attributes. The
types will instead be derived from NIR tokens later in the lowering
pipeline. This simplifies NIR considerably, since adding types into the mix
at this point was taking an already really complex lowering phase and making
it ever more difficult to reason about and get everything working together
the way that I needed.
Because of `attr_parse_stream!`, there are no more required attribute
checks. Those will be handled later in the lowering pipeline, if they're
actually needed in context, with possibly one exception: namespace
declarations. Those are really part of the document and they ought to be
handled _earlier_ in the pipeline; I'll do that at some point. It's not
required for compilation; it's just required to maintain compliance with the
XML spec.
We also lose checks for duplicate attributes. This is also something that
ought to be handled at the document level, and so earlier in the pipeline,
since XML cares, not us---if we get a duplicate attribute that results in an
extra NIR token, then the next parser will error out, since it has to check
for those things anyway.
A bunch of cleanup and simplification is still needed; I want to get the
initial integration committed first. It's a shame I'm getting rid of so
much work, but this is the right approach, and results in a much simpler
system.
DEV-13346
This really does need documentation.
With that said, this changes things up a bit: the value is now derived from
an `SPair` rather than an `Attr`, given that the name is redundant. We do
not need the attribute name span, since the philosophy is that we're
stripping the document and it should no longer be important beyond the
current context.
It does call into question errors, but my intent in the future is to be able
to have the lowering pipeline augment errors with its current state---since
we're streaming, then an error that is encountered during lowering of an
element will still have the element parser in the state representing the
parsing of that element; so that information does not need to be propagated
down the pipeline, but can be augmented as it bubbles back up.
More on that at some point in the future; not right now.
DEV-13346
As I talked about in the previous commit, this is going to be the
replacement for the aggregate `attr_parse!`; the next commit will integrate
it into `ele_parse!` so that I can begin to remove the old one.
It is disappointing, since I did put a bit of work into this and I think the
end result was pretty neat, even if it was never fully utilized. But, this
simplifies things significantly; no use in maintaining features that serve
no purpose but to confound people.
DEV-13346
Alright, this has been a rather tortured experience. The previous commit
began to state what is going on.
This is reversing a lot of prior work, with the benefit of
hindsight. Little bit of history, for the people who will probably never
read this, but who knows:
As noted at the top of NIR, I've long wanted a very simple set of general
primitives where all desugaring is done by the template system---TAME is a
metalanguage after all. Therefore, I never intended on having any explicit
desugaring operations.
But I didn't have time to augment the template system to support parsing on
attribute strings (nor am I sure if I want to do such a thing), so it became
clear that interpolation would be a pass in the compiler. Which led me to
the idea of a desugaring pass.
That in turn spiraled into representing the status of whether NIR was
desugared, and separating primitives, etc, which led to a lot of additional
complexity. The idea was to have a Sugared and a Plain NIR, and further within
them have symbols that have latent types---if they require interpolation,
then those types would be deferred until after template expansion.
The obvious problem there is that now:
1. NIR has the complexity of various types; and
2. Types were tightly coupled with NIR and how it was defined in terms of
XML destructuring.
The first attempt at this didn't go well: it was clear that the symbol types
would make mapping from Sugared to Plain NIR very complicated. Further,
since NIR had any number of symbols per Sugared NIR token, interpolation was
a pain in the ass.
So that led to the idea of interpolating at the _attribute_ level. That
seemed to be going well at first, until I realized that the token stream of
the attribute parser does not match that of the element parser, and so that
general solution fell apart. It wouldn't have been great anyway, since then
interpolation was _also_ coupled to the destructuring of the document.
Another goal of mine has been to decouple TAME from XML. Not because I want
to move away from XML (if I did, I'd want S-expressions, not YAML, but I
don't think the team would go for that). This decoupling would allow the
use of a subset of the syntax of TAME in other places, like CSVMs and YAML
test cases, for example, if appropriate.
This approach makes sense: the grammar of TAME isn't XML, it's _embedded
within_ XML. The XML layer has to be stripped to expose it.
And so that's what NIR is now evolving into---the stripped, bare
representation of TAME's language. That also has other benefits too down the
line, like a REPL where you can use any number of syntaxes. I intend for
NIR to be stack-based, which I'd find to be intuitive for manipulating and
querying packages, but it could have any number of grammars, including
Prolog-like for expressing Horn clauses and querying with a
Prolog/Datalog-like syntax. But that's for the future...
The next issue is that of attribute types. If we have a better language for
NIR, then the types can be associated with the NIR tokens, rather than
having to associate each symbol with raw type data, which doesn't make a
whole lot of sense. That also allows for AIR to better infer types and
determine what they ought to be, and further makes checking types after
template application natural, since it's not part of NIR at all. It also
means the template system can naturally apply to any sources.
Now, if we take that final step further, and make attributes streaming
instead of aggregating, we're back to a streaming pipeline where all
aggregation takes place on the ASG (which also resolves the memcpy concerns
worked around previously, also further simplifying `ele_parse` again, though
it sucks that I wasted that time). And, without the symbol types getting
in the way, since now NIR has types more fundamentally associated with
tokens, we're able to interpolate on a token stream using simple SPairs,
like I always hoped (and reverted back to in the previous commit).
Oh, and what about that desugaring pass? There's the issue of how to
represent such a thing in the type system---ideally we'd know statically
that desugaring always lowers into a more primitive NIR that reduces the
mapping that needs to be done to AIR. But that adds complexity, as
mentioned above. The alternative is to just use the template system, as I
originally wanted to, and resolve shortcomings by augmenting the template
system to be able to handle it. That not only keeps NIR and the compiler
much simpler, but exposes more powerful tools to developers via TAME's
metalanguage, if such a thing is appropriate.
Anyway, this creates a system that's far more intuitive, and far
simpler. It does kick the can to AIR, but that's okay, since it's also
better positioned to deal with it.
Everything I wrote above is a thought dump and has not been proof-read, so
good luck! And let's hope this finally works out...it's actually feeling
good this time. The journey was necessary to discover and justify what came
out of it---everything I'm stripping away was like a cocoon, and within it
is a more beautiful and more elegant TAME.
DEV-13346
Also: Revert "tamer: nir::desugar::interp: Token {SPair=>Attr}"
This reverts commit 7fd60d6cdafaedc19642a3f10dfddfa7c7ae8f53.
This reverts commit 12a008c66414c3d628097e503a98c80687e3c088.
This has been quite a tortured experience, trying to figure out how to best
fit desugaring into the existing system. The truth is that it ultimately
failed because I was not sticking with my intuition---I was trying to get
things out quickly by compromising on the design, and in the end, it saved
me nothing.
But I wouldn't say that it was a waste of time---the path was a dead end,
but it was full of experiences.
More to come, but interpolation is back to operating on NIR directly, and I
chose to treat it as a source-to-source mapping and not represent it using
the type system---interpolation can be an optional feature when writing TAME
frontends (the principal one being the XML-based one), and it's up to later
checks to assert that identifiers match a given domain.
I am disappointed by the additional context we lose here, but that can
always be introduced in the future differently, e.g. by maintaining a
dictionary of additional context for spans that can be later referenced for
diagnostic purposes. But let's worry about that in the future; it doesn't
make sense to further complicate IRs for such a thing.
DEV-13346
Converts to use TAME's diagnostic panics, same as previous commits. Also
introduces impl for `Result`, which I apparently hadn't needed yet.
In the future, I hope trait impl specializations will be available to
automatically derive and expose span information in these diagnostic
messages for certain types.
DEV-13156
This changes the input token from a more generic `SPair` to `Attr`, which
reflects the new target integration point in the `attr_parse!`
parser-generator.
This is a compromise---I'd like for it to remain generic and have stitching
deal with all integration concerns, but I have spent far too much time on
this and need to keep moving.
With that said, we do benefit from knowing where this must fit in---it's
easier to reason about in a more concrete way, and we can take advantage of
the extra information rather than being burdened by its presence and
ignoring it. We need to be able to convert back into `XirfToken` (see a
recent commit that discusses that) for `StitchExpansion`, which is why
`Attr` is here. And since it is, we can use it to explain to the user not
just the interpolation specification used to derive params, but also the
attribute it is associated with. This is what TAME (in XSLT) does today,
IIRC (I wrote it, I just forget exactly). It also means that I can name the
parameters after the attribute.
So, that'll be in a following commit; I was disappointed when my prior
approach with `SPair` didn't give me enough information to be able to do
that, since I think it's important that the system be as descriptive as
possible in how it derives information. Of course, traces would reveal how
the parser came about the derivation, but that requires recompilation in a
special tracing mode.
DEV-13156
Of course I would run into integration issues. My foresight is lacking.
The purpose of this is to allow for type narrowing before passing data to a
more specialized ParseState, so that the other ParseState doesn't need to
concern itself with the entire domain of inputs that it doesn't need, and
repeat unnecessary narrowing.
For example, consider XIRF: it has an `Attr` variant, which holds an `Attr`
object. We'll want to desugar that object. It does not make sense to
require that the desugaring process accept `XirfToken` when we've already
narrowed it to an `Attr`---we should accept an `Attr`.
However, we run into a problem immediately: what happens with tokens that
bubble back up due to lookahead or errors? Those tokens need to be
converted _back_ (widened). Fortunately, widening is a much easier process
than narrowing---we can simply use `From`, as we do today so many other
places.
So, this still keeps the onus of narrowing on the caller, but for now that
seems most appropriate. I suspect Rust would optimize away duplicate
checks, but that still leaves the maintenance concern---the two narrowings
could get out of sync, and that's not acceptable.
Unfortunately, this is just one of the problems with integration...
DEV-13156
My initial plan with expansion was to wrap a `PasteState` in another that
unwraps `Expansion` and converts into a `Dead` state, so that existing
`TransitionResult` stitching methods (`delegate`, specifically) could be
used.
But the desire to use that existing method was primarily because stitching
was a complex operation that was abstracted away _as part of the `delegate`
method_, which made writing new ones verbose and difficult. Thus began the
previous commits to begin to move that responsibility elsewhere so that it
could be more composable.
This continues with that, introducing a new trait that will culminate in the
removal of a wrapping `ParseState` in favor of a stitching method. The old
`StitchableExpansionState` is still used for tests, which demonstrates that
the boilerplate problem still exists despite improvements made here. These
will become more generalized in the future as I have time (and the
functional aspects of the code more formalized too, now that they're taking
shape).
The benefit of this is that we avoid having to warp our abstractions in ways
that don't make sense (use of a dead state transition) just to satisfy
existing APIs. It also means that we do not need the boilerplate of a
`ParseState` any time we want to introduce this type of
stitching/delegation. It also means that those methods can eventually be
extracted into more general traits in the future as well.
Ultimately, though, the two would have accomplished the same thing. But the
difference is most emphasized in the _parent_---the actual stitching still
has to take place for desugaring in the attribute parser, and I'd like for
that abstraction to still be in terms of expansion. But if I utilized
`StitchableExpansionState`, which converted into a dead state, I'd have to
either forego the expansion abstraction---which would make the parser even
more confusing---or I'd have to create _another_ abstraction around the dead
state, which would mean that I stripped one abstraction just to introduce
another one that's essentially the same thing. It didn't feel right, but it
would have worked.
The use of `PhantomData` in `StitchableExpansionState` was also a sign that
something wasn't quite right, in terms of how the abstractions were
integrating with one-another.
And so here we are, as I struggle to wade my way through all of the yak
shavings and make any meaningful progress on this project, while others
continue to suffer due to slow build times.
I'm sorry. Even if the system is improving.
DEV-13156
This is just intended to simplify the job of panicking when something is
expected to be `None`. In my case, `Lookahead`; see upcoming commits.
This is intended to be generalized to more than just `Option`, but I have no
use for it elsewhere yet; I primarily just needed to implement a method on
`Option` so that I could have the ergonomics of the dot notation.
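Something like the following sketch, where the method name and message are
illustrative and the real implementation uses TAMER's diagnostic machinery
rather than a plain `panic!`:
```
// Hypothetical extension trait providing dot-notation ergonomics.
trait ExpectNone {
    /// Diagnostic panic if the value is unexpectedly `Some`.
    fn diagnostic_expect_none(self, msg: &str);
}

impl<T: std::fmt::Debug> ExpectNone for Option<T> {
    fn diagnostic_expect_none(self, msg: &str) {
        if let Some(inner) = self {
            panic!("expected None: {msg} (found {inner:?})");
        }
    }
}

// At the call site:
//   lookahead.diagnostic_expect_none("lookahead must have been consumed");
```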
DEV-13156
There's no use in duplicating this in util::expand.
Lookahead tokens are one of the few invariants that I haven't taken the time
of enforcing using the type system, because it'd be quite a bit of work that
I do not have time for, and may not be worth it with changes that may make
the system less ergonomic. Nonetheless, I do hope to address it at some
point in the (possibly-far) future.
If ever you encounter this diagnostic message, ask yourself how stable TAMER
otherwise is and how many other issues like this have been entirely
prevented through compile-time proofs using the type system.
DEV-13156
As in previous commits, this continues to replace panics with
`diagnostic_panic!`, which provides much more useful information both for
debugging and to help the user possibly work around the problem. And lets
the user know that it's not their fault, and it's a TAMER bug that should be
reported.
...am I going to rationalize it in each commit message?
DEV-13156
This moves enough of the handling of complex type conversions into the
various components of `TransitionResult` (and itself), which simplifies
delegation and opens up the possibility of having specialized
delegation/stitching methods implemented atop of `TransitionResult`.
DEV-13156
These delegation methods have been a pain in my ass for quite some time, and
their lack of generalization makes the introduction of new delegation
methods (in the general sense, not necessarily trait methods) very tedious
and prone to inconsistencies.
I'm going to progressively refactor them in separate commits so it's clear
what I'm doing, primarily for future me to reference if need be.
DEV-13156
This begins to introduce more primitive operations to `TransitionResult` and
its components so that I can actually work with them without having to write
a bunch of concrete, boilerplate implementations. This is demonstrated in
part by `EchoState` (which is nearly all boilerplate, but whose correctness
should be verifiable at a glance), which will be used going forward as a
basis for default implementations for parsers (e.g. expansion delegation).
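For a sense of what that looks like, here's a sketch against a drastically
simplified stand-in for `ParseState` (the real trait involves transitions,
context, and dead states):
```
// Simplified stand-in for the real ParseState trait.
trait ParseState: Sized {
    type Token;
    type Object;
    type Error;

    fn parse_token(
        self,
        tok: Self::Token,
    ) -> (Self, Result<Self::Object, Self::Error>);
}

// An echo parser: every token is emitted unchanged. Nearly all
// boilerplate, but its correctness is verifiable at a glance.
#[derive(Default)]
struct Echo;

impl ParseState for Echo {
    type Token = u32;
    type Object = u32;
    type Error = std::convert::Infallible;

    fn parse_token(self, tok: u32) -> (Self, Result<u32, Self::Error>) {
        (self, Ok(tok))
    }
}
```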
DEV-13156
This has evolved into a more robust and independent concept, but it is still
a utility in the sense that it's utilizing existing parsing framework
features and making them more convenient.
DEV-13156
These traits serve to abstract away some of the type-level details and
clearly state what the end result is (something stitchable with a parent).
I'm admittedly battling myself on this concept a bit. The proper layer of
abstraction is the concept of expansion, which is an abstraction that is
likely to be maintained all the way through, but we strip the abstraction
for the sake of delegation. Maybe the better option is to provide a
different method of delegation and avoid the stripping at all, and avoid the
awkward interaction with the dead state.
The awkwardness comes from the fact that delegating right now is so rigid
and defined in terms of a method on state rather than a mapping between
`TransitionResult`s. But I really need to move on... ;_;
The original design was trying to generalize this such that composition at
the attribute parser level (for NIR) would be able to just accept any
stitchable parser with the convention that the dead state is the replacement
token. But that is the wrong layer of abstraction, which not only makes it
confusing, but is asking for trouble when someone inevitably violates that
contract.
With all of that said, `StitchableExpansionState` _is_ a delegation. It
could just as easily be a function (`is_accepting` always delegates too), so
perhaps that should just be generalized as reifying delegation as a
`ParseState`.
DEV-13156
This parser really just allows me to continue developing the NIR
interpolation system using `Expansion` terminology, and avoid having to use
dead states in tests. This allows for the appropriate level of abstraction
to be used in isolation, and then only be stripped when stitching is
necessary.
Future commits will show how this is actually integrated and may introduce
additional abstraction to help.
DEV-13156
This is a shift in approach.
My original idea was to try to keep NIR parsing the way it was, since it's
already hard enough to reason about with the `ele_parse!` parser-generator
macro mess. The idea was to produce an IR that would explicitly be denoted
as "maybe sugared", and have a desugaring operation as part of the lowering
pipeline that would perform interpolation and lower the symbol into a plain
version.
The problem with that is:
1. The use of the type was going to introduce a lot of mapping for all the
NIR token variants there are going to be; and
2. _The types weren't even utilized for interpolation._
Instead, if we interpolated _as attributes are encountered_ while parsing
NIR, then we'd be able to expand directly into that NIR token stream and
handle _all_ symbols in a generic way, without any mapping beyond the
definition of NIR's grammar using `ele_parse!`.
This is a step in that direction---it removes `NirSymbolTy` and introduces a
generic abstraction for the concept of expansion, which will be utilized
soon by the attribute parser to allow replacing `TryFrom` with something
akin to `ParseFrom`, or something like that, which is able to produce a
token stream before finally yielding the value of the attribute (which will
be either the original symbol or the replacement metavariable, in the case
of interpolation).
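As a rough sketch of the expansion concept (names here are hypothetical):
```
// An expanding parser yields any number of derived tokens before
// finally replacing the source token's value.
enum Expansion<T, O> {
    /// A token derived from the source token (e.g. a metavariable
    /// generated during interpolation).
    Expanded(O),
    /// Expansion is complete; `T` replaces the source token (the
    /// original symbol, or the replacement metavariable).
    DoneExpanding(T),
}
```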
(Note that interpolation isn't yet finished---errors still need to be
implemented. But I want a working vertical slice first.)
DEV-13156
This was a substantial change. Design and rationale are documented on
`AttrFieldSum` and related as part of this change, so please review the diff
for more information there.
If you're a Ryan employee, DEV-13209 gives plenty of profiling information,
including raw data and visualizations from kcachegrind. For everyone else:
you're able to easily produce your own from this commit and the previous by
comparing the `__memcpy_avx_unaligned_erms` calls. The reduction is
significant in this commit (~90%), and the number of Parsers invoking it has
been reduced. Rust has been able to optimize more aggressively, and
compound some of those optimizations, with the smaller `NirParseState`
width.
It's also worth noting that `malloc` calls do not change at all between
these two changes, so when we refer to memory, we're referring to
pre-allocated memory on the stack, as TAMER was designed to utilize.
DEV-13209
This is a diagnostic replacement for `unreachable!`.
Eventually TAMER'll have build-time checks to enforce the use of these over
alternatives; I need to survey the old instances on a case-by-case basis to
see what diagnostic information can be reasonably presented in that context.
DEV-13209
The spans were previously not being calculated relative to the offset of the
original symbol span. Tests were passing because all of those spans began
at offset 0.
DEV-13156
This demonstrates how desugaring of interpolated strings will work, testing
one of the happy paths. The remaining work to be done is largely
refactoring; handling some other cases; and errors. Each of those items are
marked with `todo!`s.
I'm pleased with how this is turning out, and I'm excited to see diagnostic
reporting within the specification string using the derived spans once I get
a bit further along; this robust system is going to be much more helpful to
developers than the existing system in XSLT.
This also eliminates the ~50% performance degradation mentioned in a recent
commit by eliminating the SugaredNirSymbol enum and replacing it with a
newtype; this is a much better approach, though it doesn't change that I do
need to eventually address the excessive `memcpy`s on hot code paths.
DEV-13156
Not sure why I didn't add a prelude sooner, considering all the import
boilerplate. This will evolve as needed and I'll go back and replace other
imports when I'm not in the middle of something.
DEV-13156
Add initial descriptions and consolidate some of the types. There'll be
more to come; this is just to get `Display` derives working for types
that'll be using it. I'd like to see where this description manifests
itself before I decide how user-friendly I'd like it to be.
DEV-13156
This mirror is only a `Todo` variant at the moment, but my hope had been to
try to creatively nest or use generics to simplify the conversion between
the two flavors without a lot of boilerplate. But it doesn't seem like I'm
going to be successful, and may have to resort to macros to remove
boilerplate.
But I need to stop fighting with myself and move on. Though I would still
like to keep the types purely compile-time via const generics if possible,
since they're not needed in memory (or disk) until we get to templates;
they're otherwise static relative to a NIR token variant.
DEV-13209
This simply detects whether a value will need to be further parsed for
interpolation; it does not yet perform the parsing itself, which will happen
during desugaring.
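The detection itself can be cheap; a sketch, under the assumption that the
presence of a curly brace is what signals a specification string (the actual
check may differ):
```
// Assumption: a value needs interpolation iff it contains `{`,
// as in the specification string "found {@num@} items".
fn needs_interpolation(value: &str) -> bool {
    value.contains('{')
}

fn main() {
    assert!(needs_interpolation("found {@num@} items"));
    assert!(!needs_interpolation("plain-symbol"));
}
```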
This introduces a performance regression, for an interesting reason. I
found that introducing a single new variant to `SugaredNir` (with a
`(SymbolId, Span)` pair), was causing the width of the `NirParseState` type
to increase just enough to cause Rust to be unable to optimize away a
significant number of memcpys related to `Parser` moves, and consequently
reducing performance by nearly 50% for `tamec`. Yikes.
I suspected this would be a problem, and indeed have tried in all other
cases to avoid aggregation until the ASG---the problem is that I had wanted
to aggregate attributes for NIR so that the IR could actually make some
progress toward simplifying the stream (and therefore working with the
data), and be able to validate against a grammar defined in a single
place. The problem is that the `NirParseState` type contains a sum type for
every attribute parser, and is therefore as wide as the largest one. That
is what Rust is having trouble optimizing memcpy away for.
Indeed, reducing the number of attributes improves the situation
drastically. However, it doesn't make it go away entirely.
If you look at a callgrind profile for `tameld` (or a disassembly), you'll
notice that I put quite a bit of effort into ensuring that the hot code path
for the lowering pipeline contains _no_ memcpys for the parsers. But that
is not the case with `tamec`---I had to move on. But I do still have the
same escape hatch that I introduced for `tameld`, which is the mutable
`Context`.
It seems that may be the solution there too, but I want to get a bit further
along first to see how these data end up propagating before I go through
that somewhat significant effort.
DEV-13156
Various parts of the system have to be converted to use `diagnostic_panic!`,
which makes it very clear that this is a bug in TAMER that should be
reported. I just happened to see this one near code I was about to touch.
DEV-13156
This introduces the concept of sugared NIR and provides the boilerplate for
a desugaring pass. The earlier commits dealing with cleaning up the
lowering pipeline were to support this work, in particular to ensure that
reporting and recovery properly applied to this lowering operation without
adding a ton more boilerplate.
DEV-13158
I'm struggling to go much further yet without sorting out some other things
first with regards to mutable `Context` and, in particular, the ASG.
I'm going to pause on refactoring the lowering pipeline---it's been improved
significantly with the recent work---and I will continue in the next few
weeks.
DEV-13158
Lowering errors in tamec end up utilizing recovery and reporting, so there
is a distinction between recoverable and unrecoverable errors.
tameld aborts on the first error, since recovery is not currently
supported (we'll want to add it, since tameld should output e.g. lists of
unresolved externs).
Note that tamec does not yet handle `FinalizeError` like tameld because it
uses `Lower::lower`, which does not yet finalize (though it does in practice
when it reaches the end of the stream and auto-finalizes, but that is
widened into a `ParseError`).
DEV-13158
This helps to clarify the situations under which these errors can occur, and
the generality also helps to show why the inner types are as they
are (e.g. use of `String`).
But more importantly, this allows for an error type in `finalize` that is
detached from the `ParseState`, which will be able to be utilized in the
lowering pipeline as a more general error distinguishable from other
lowering errors. At the moment I'm maintaining BC, but a following commit
will demonstrate the use case to introduce recoverable vs. non-recoverable
errors.
DEV-13158
This newtype allows a caller to prove (using types) that a parser of a given
type (`ParseState`) has been finalized.
This will be used by the lowering pipeline to ensure that all parsers in the
pipeline end up getting finalized (as you can see from a TODO added in the
code, one of them is missing). The lack of such a type was an oversight
during the (rather stressed) development of the parsing system, and I
shouldn't need to resort to unit tests to verify that parsers have been
finalized.
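The idea, sketched with illustrative names (the real types carry more):
```
use std::marker::PhantomData;

// The only way to obtain a `Finalized<S>` is via finalization,
// so holding one is compile-time proof that a parser of type `S`
// was finalized.
struct Finalized<S> {
    _state: PhantomData<S>,
}

struct Parser<S> {
    state: S,
}

impl<S> Parser<S> {
    fn finalize(self) -> Finalized<S> {
        // ...accepting-state checks elided...
        Finalized { _state: PhantomData }
    }
}

// A lowering pipeline can then demand a `Finalized<S>` for every
// parser `S` it ran, and a missing finalization fails to compile.
```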
DEV-13158
This reverts commit 85ec626fcd804eb2fac3fd6f0339182554f72cfd.
This revert had to be modified to work alongside other changes. Interior
mutability is fortunately no longer needed after the previous commit which
allows reporting to occur in a single place in the lowering pipeline (at the
terminal parser).
DEV-13158
The term "terminal parser" isn't formalized yet in the system, but is meant
to refer to the innermost parser that is responsible for pulling tokens
through the lowering pipeline.
This approach is more of what one would expect when dealing with
`Result`-like monads---we are effectively chaining the inner operation while
propagating errors to short-circuit lowering and let the caller decide
whether recovery ought to be permitted with diagnostic messages. This will
become more clear as it is further refactored.
This also means that the previous changes for introducing interior
mutability for a shared mutable `Reporter` can be reverted, which is great,
since that approach was antithetical to how the streaming pipeline
operates (and introduces awkward mutable state into an
otherwise-mostly-immutable system).
DEV-13158
This extracts error tracking into the Reporter itself, which is already
shared between lowering operations. This can then be used to display the
number of errors.
A new formatter (in tamer::fmt) will be added to handle the singular/plural
conversion in place of "error(s)" in the future; I have more important
things to work on right now.
DEV-13158
Previously these errors would immediately abort.
This results in some duplicate code, but it's beginning to derive a common
implementation. Check out the commits that follow; this is really an
intermediate refactoring state.
DEV-13158
Another baby step. The small commits are intended to allow comprehension of
what changes when looking at the diffs.
This also removes a comment stating that errors do not fail compilation,
since they most certainly do.
DEV-13158
This begins refactoring the lowering pipeline to obviate abstraction
boundaries. The lowering pipeline is the backbone of the
system, and so it needs to become clear and self-documenting, which will
take a little bit of work.
DEV-13158
This always annoys me when I add a dependency and I don't know where I ought
to put it.
Anyway, I was originally going to add the `regex` crate, but with further
planning, I may not end up having use for it. Nonetheless, at least this is
consistent.
Just preparing to actually define NIR itself. The _grammar_ has been
represented (derived from our internal systems, using them as a test case),
but the IR itself has not yet received a definition.
DEV-7145
This is a quick-and-dirty change. The lowering pipeline needs a proper
abstraction, but I'm about to be on vacation at the end of the week and
would like to get NIR->AIR lowering started before I consider that
abstraction further, so this will do for now.
NIR parsing has been tested in production without failing for over a week.
DEV-7145
This was originally the "normalized" IR, but that's not possible to do
without template expansion, which is going to happen at a later point. So,
this is just "NIR", pronounced "near", which is an IR that is "near" to the
source code. You can define it as "Near IR" if you want, but it's just a
homonym with a not-quite-defined acronym to me.
DEV-7145
A type alias was added for BC before errors were hoisted out in a previous
commit, but they are unnecessary because of the associated type on
`ParseState`.
This also corrects the long-existing issue of using generated identifiers in
tests.
DEV-7145
This moves `paste::paste!` up a line and reduces a level of indentation,
since it's so squished. Aside from docblock reformatting, there are no
other changes.
DEV-7145
This slims out the macro even further. It does result in an
awkwardly-placed `PhantomData` because I don't want to add another variant
that isn't actually used (since they represent states).
DEV-7145
This is in preparation for hoisting out the common states, as was done with
the Sum NT in a previous commit.
I also think that organizing states in this way is more clear. The previous
embedding of the variants named after the NTs themselves was because the
parser was storing the child state within it, before the introduction of the
superstate trampoline.
DEV-7145
Everything except for one state was already accounted for. We can now have
confidence that the parser will never panic due to state transitions (beyond
legitimate error conditions).
There are some `unreachable!`s to contend with still.
DEV-7145
This is the same as the previous commits, but for non-sum NTs.
This also extracts errors into a separate module, which I had hoped to do in
a separate commit, but it's not worth separating them. My _original_ reason
for doing so was debugging (I'll get into that below), but I had wanted to
trim down `ele.rs` anyway, since that mess is large and a lot to grok.
My debugging was trying to figure out why Rust was failing to derive
`PartialEq` on `NtError` because of `AttrParseError`. As it turns out,
`AttrParseError::InvalidValue` was failing, thus the introduction of the
`PartialEq` trait bound on `AttrParseState::ValueError`. Figuring this out
required implementing `PartialEq` myself without `derive` (well, using LSP,
which did all the work for me).
I'm not sure why this was not failing previously, which is a bit of a
concern, though perhaps in the context of the macro-expanded code, Rust was
able to properly resolve the types.
DEV-7145
The `ele_parse!` macro is a monstrosity, and expands into many different
identifiers. The hope is that chipping away at things like this will not
only make the template easier to understand by framing portions of the
problem in terms of more traditional Rust code, but will also hopefully
reduce compile times by reducing the amount of code that is expanded by the
macro.
DEV-7145
This introduces an order-only prerequisite `bootstrap-if-necessary` for the
generation of `suppliers.mk`. Projects utilizing TAME as a dependency may
include a `bootstrap.mk` that overrides this target to trigger any
bootstrapping scripts that may be necessary due to toolchain updates.
DEV-7145
This was a relic of the old bootstrap system, where bootstrapping was in the
context of a parent project that utilized TAME (and so TAME needed to be
built). But that doesn't make sense in the context of TAME itself, and what
_part_ of TAME should be built should be controlled by the project utilizing
it.
This is especially important now that TAMER builds are getting much longer
with the introduction of NIR and its parser-generator.
DEV-7145
This introduces NIR, but only as an accepting grammar; it doesn't yet emit
the NIR IR, beyond TODOs.
This modifies `tamec` to, while copying XIR, also attempt to lower NIR to
produce parser errors, if any. It does not yet fail compilation, as I just
want to be cautious and observe that everything's working properly for a
little while as people use it, before I potentially break builds.
This is the culmination of months of supporting effort. The NIR grammar is
derived from our existing TAME sources internally, which I use for now as a
test case until I introduce test cases directly into TAMER later on (I'd do
it now, if I hadn't spent so much time on this; I'll start introducing tests
as I begin emitting NIR tokens). This is capable of fully parsing our
largest system with >900 packages, as well as `core`.
`tamec`'s lowering is a mess; that'll be cleaned up in future commits. The
same can be said about `tameld`.
NIR's grammar has some initial documentation, but this will improve over
time as well.
The generated docs still need some improvement, too, especially with
generated identifiers; I just want to get this out here for testing.
DEV-7145
This includes when on the last state / expecting a close.
Previously, there were a couple major issues:
1. After parsing an NT, we can't allow preemption because we must emit a
dead state so that we can remove the NT from the stack, otherwise
they'll never close (until the parent does) and that results in
unbounded stack growth for a lot of siblings. Therefore, we cannot
preempt on `Text`, which causes the NT to receive it, emit a dead
state, transition away from the NT, and not accept another NT of the
same type after `Text`.
2. When encountering an unknown element, the error message stated that a
closing tag was expected rather than one of the elements accepted by the
final NT.
For #1, this was solved by allowing the parent to transition back to the NT
if it would have been matched by the previous NT. A future change may
therefore allow us to remove repetition handling entirely and allow the
parent to deal with it (maybe).
For #2, the trouble is with the parser generator macro---we don't have a
good way of knowing the last NT, and the last NT may not even exist if none
was provided. This solution is a compromise, after having tried and failed
at many others; I desperately need to move on, and this results in the
correct behavior and doesn't sacrifice performance. But it can be done
better in the future.
It's also worth noting for #2 that the behavior isn't _entirely_ desirable,
but in practice it is mostly correct. Specifically, if we encounter an
unknown token, we're going to blow through all NTs until the last one, which
will be forced to handle it. After that, we cannot return to a previous NT,
and so we've forfeited the ability to parse anything that came before it.
NIR's grammar is such that sequences are rare and, if present, there's
really only ever two NTs, and so this awkward behavior will rarely cause
practical issues. With that said, it ought to be improved in the future,
but let's wait to see if other parts of the lowering pipeline provide more
appropriate places to handle some of these things (even though it really
ought to be handled at the grammar level).
But I'm well out of time to spend on this. I have to move on.
DEV-7145
This removes the deprecated `@const@` argument in favor of shorthand
`@value@` constants, which were introduced long ago precisely to avoid
having to define separate `@const@` parameters for all of these templates.
DEV-7145
"keep" is an old feature that forced the linker to retain symbols that were
unused. This was removed long ago in favor of having all linker roots
defined by the return map.
This also removes an old `@always`, which seems like a typo for
`when="always"` or something...not entirely sure.
DEV-7145
Accumulators were an ancient TAME feature removed long ago during The Great
Refactoring (...okay, that part didn't fit the definition of a "refactor",
but that's technically what that's referring to).
TAMER will not accept it.
DEV-7145
This has not been in use for years and it's time to go away---it is the only
thing in TAME that causes nondeterminism, at least that I'm immediately
aware of. Perhaps I'll find something else while reimplementing TAME in
TAMER.
_This does not remove the compiler code to produce this._ If something
still needs `__DATE_YEAR__` (because it's really old), it can define this
value itself, and still utilize it until TAMER (which will not include it).
DEV-7145
`ele_parse!` was recently converted to accept zero-or-more for every NT to
simplify the parser-generator, since NIR isn't going to be able to
accurately determine whether child requirements are met anyway (because of
the template system).
This ensures that `Close` can be accepted when we're expecting an
element. It also adds a test for a scenario that's causing me some trouble
in stashed code so that I can ensure that it doesn't break.
DEV-7145
This sets the maximum depth to 64, which is still arbitrary, but
unfortunately the sum types introduce multiple levels of nesting, in
particular for template applications, so nested applications can result in a
fairly large stack.
I have various ideas to improve upon that---limited a bit in that repetition
as it is currently implemented inhibits tail calls---but they're not worth
doing just yet relative to other priorities. The impact of this change is
not significant.
DEV-7145
This removes support for configurable repetition.
What? Why?
As it turns out, the complexity that repetition adds is quite significant
and is not worth the effort. The truth is that NIR is going to have to
allow zero-or-more matches on virtually everything _anyway_ because template
application is allowed virtually anywhere---it is not possible to fully
statically analyze TAME's sources because templates can expand into just
about anything. Given that, AIR (or something down the line) is going to
have to supply the necessary invariants instead.
It does suck, though, that this removes a lot of code that I fairly recently
wrote, and spent a decent amount of time on. But it's important to know
when to cut your losses.
Perhaps I could have planned better, but deriving this whole system has been
quite the experiment.
DEV-7145
If attributes fail to parse (e.g. missing required attribute) and parsing
reaches a dead state, this will recover by ignoring the entire element. It
previously panicked with a TODO.
DEV-7145
These were initially used to prevent conflicts with generated variants, but
we are no longer generating such variants since they're being jumped to via
the trampoline.
DEV-7145
I'm starting to clean up some TODOs, and this was a glaring one causing
panics when encountered. The recovery for this is simple, because we have
no choice: just stop parsing; leave it to the next lowering operation(s) to
complain that we didn't provide what was necessary. They'll have to,
anyway, since templates mean that NIR cannot ever have enough information to
guarantee that a document is well-formed, relative to what would expand from
the template.
DEV-7145
This allows for a construction like this:
```
ele_parse! {
[...]
StmtX := QN_X {
[...]
};
StmtY := QN_Y {
[...]
};
ExprA := QN_A {
[...]
};
ExprB := QN_B {
[...]
};
Expr := (ExprA | ExprB);
Stmt := (StmtX | StmtY);
// This previously was not allowed:
StmtOrExpr := (Stmt | Expr);
}
```
There were initially two barriers to doing so:
1. Efficiently matching; and
2. Outputting diagnostic information about the union of all expected
elements.
The first was previously resolved with the introduction of `NT::matches`,
which is macro-expanded in a way that Rust will be able to optimize a
bit. Worst case, it's effectively a linear search, but our Sum NTs are not
that deep in practice, so I don't expect that to be a concern.
The concern that I was trying to avoid was heap-allocated `NodeMatcher`s to
avoid recursive data structures, since that would have put heap access in a
very hot code path, which is not an option.
That left problem #2, which ended up being the harder problem. The solution
was detailed in the previous commit, so you should look there, but it
amounts to being able to format individual entries as if they were a part
of a list by making them a function of not just the matcher itself, but also
the number of items in (recursively) the sum type and the position of the
matcher relative to that list. The list length is easily
computed (recursively) at compile-time using `const`
functions (`NT::matches_n`).
And with that, NIR can be abstracted in sane ways using Sum NTs without a
bunch of duplication that would have been a maintenance burden and an
inevitable source of bugs (from having to duplicate NT references).
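The gist of the position-aware rendering, as an illustrative sketch (the
real delimiters and trait plumbing differ):
```
use std::fmt::{self, Display};

// Render an item as if it were part of a list, given only its
// position `i` and the total count `n`; no collection is ever
// allocated.
fn fmt_list_item(
    f: &mut fmt::Formatter,
    item: &impl Display,
    i: usize,
    n: usize,
) -> fmt::Result {
    match (i, n) {
        (0, _) => write!(f, "{item}"),
        (i, n) if i == n - 1 => write!(f, ", or {item}"),
        _ => write!(f, ", {item}"),
    }
}
```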
DEV-7145
This exposes the internal rendering of `ListDisplayWrapper::fmt` such that
we can output a list without actually creating a list. This is used in an
upcoming change for `ele_parse!` so that Sum NTs can render the union of all
the QNames that their constituent NTs match on, recursively, as a single
list, without having to create an ephemeral collection only for display.
If Rust supports const functions for arrays/Vecs in the future, we could
generate this at compile-time, if we were okay with the (small) cost, but
this solution seems just fine. But output may be even _more_ performant
since they'd all be adjacent in memory.
This is used in these scenarios:
1. Diagnostic messages;
2. Error messages (overlaps with #1); and
3. `Display::fmt` of the `ParseState`s themselves.
The reason that we want this to be reasonably performant is because #3
results in a _lot_ of output---easily GiB of output depending on what is
being traced. Adding heap allocations to this would make it even slower,
since a description is generated for each individual trace.
Anyway, this is a fairly simple solution, albeit a little bit less clear,
and only came after I had tried a number of other different approaches
related to recursively constructing QName lists at compile time; they
weren't worth the effort when this was so easy to do.
DEV-7145
This allows using a `[attr]` special form to stream attributes as they are
encountered rather than aggregating a static attribute list. This is
necessary in particular for short-hand template application and short-hand
function application, since the attribute names are derived from template
and function parameter lists, which are runtime values.
The syntax for this is a bit odd since there's a semi-useless and confusing
`@ {} => obj` still, but this is only going to be used by a couple of NTs
and it's not worth the time to clean this up, given the rather significant
macro complexity already.
DEV-7145
This uses the same mechanism that was introduced for handling `Text` nodes
in mixed content, allowing for arbitrary element `Open` matches for
preemption by the superstate.
This will be used to allow for template expansion virtually
anywhere. Unlike the existing TAME, it'll even allow for it at the root,
though whether that's ultimately permitted really depends on how I
approach template expansion; it may fail during a later lowering operation.
This is interesting because this approach is only possible because of the
CPS-style trampoline implementation. Previously, with the composition-based
approach, each and every parser would have to perform this check, like we
had to previously with `Text` nodes.
As usual, this is still adding to the mess a bit, and it'll need some future
cleanup.
DEV-7145
This introduces the concept of superstate node preemption generally, which I
hope to use for template application as well, since templates can appear in
essentially any (syntactically valid, for XML) position.
This implements mixed content handling by defining the mapping on the
superstate itself, which really simplifies the problem but foregoes
fine-grained text handling. I had hoped to avoid that, but oh well.
This pushes the responsibility of whether text is semantically valid at that
position to NIR->AIR lowering (which we're not transitioning to yet), which is
really the better place for it anyway, since this is just the grammar. The
lowering to AIR will need to validate anyway given that template expansion
happens after NIR.
Moving on!
DEV-7145
This has been optional for many years and is not actually used by the
current compiler. TAMER can infer it, in situations where it actually
matters in the future.
So, rather than adding support for this in the new parser, let's clean up.
DEV-7145
See previous commit for an explanation. This marker is intended to be
useful while looking through commits.
This is because we utilize an unstable `int_log` feature, which is expected
to occasionally cause BC issues.
https://github.com/rust-lang/rust/pull/100332
The above MR replaces `log10` and friends with `ilog10`; this is the first
time an unstable feature bit us in a substantially backwards-incompatible
way that's a pain to deal with.
Fortunately, I'm just not going to deal with it: this is used with the
diagnostic system, which isn't yet used by our projects (outside of me
testing), and so those builds shouldn't fail before people upgrade.
This is now pending stabilization with the new name, so hopefully we're good
now:
https://github.com/rust-lang/rust/issues/70887#issuecomment-1210602692
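For context, the sort of computation this is used for in the diagnostic
renderer is gutter width for line numbers; illustrative only:
```
// Digit width for a line number, e.g. to pad the `-->` gutter.
// (`ilog10` is the stabilized name; `log10` under `int_log` at
// the time of this commit.)
fn line_number_width(max_line: u32) -> u32 {
    max_line.ilog10() + 1
}

fn main() {
    assert_eq!(line_number_width(261), 3);
}
```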
This was accepting an early EOF when the active child `ParseState` was in an
accepting state, because it was not ensuring that anything on the stack was
also accepting.
Ideally, there should be nothing on the stack, and hopefully in the future
that's what happens. But with how things are today, it's important that, if
anything is on the stack, it is accepting.
Since `is_accepting` on the superstate is only called during finalization,
and because the check terminates early, and because the stack practically
speaking will only have a couple things on it max (unless we're in tail
position in a deeply nested tree, without TCO [yet]), this shouldn't be an
expensive check.
Implementing this did require that we expose `Context` to `is_accepting`,
which I had hoped to avoid having to do, but here we are.
DEV-7145
I wonder when this option was introduced, unless I never saw it because it
is called "quiet". But this is what I always wanted (and how I write the
output for my own tools, like progtest in this repo); the output has long
gotten far too large.
DEV-7145
Along with this change we also had to change how we handle dead states in
the superstate. So there were two problems here:
1. Sum states were not yielding a dead state after recovery, which meant
that parsing was unable to continue (we still have a `todo!`); and
2. The superstate considered it an error when there was nothing left on
the stack, because I assumed that ought not happen.
Regarding #2---it _shouldn't_ happen, _unless_ we have extra input after we
have completed parsing. Which happens to be the case for this test case,
but more importantly, we shouldn't be panicking with errors about TAMER bugs
if somebody puts extra input after a closing root tag in a source file.
DEV-7145
This does two things:
1. Places the expected list on a separate help line as a footnote where
it'll be a bit more tolerable when it inevitably overflows the terminal
width in certain contexts (we may wrap in the future); and
2. Removes angled brackets from the element names so that they (a) better
correspond with the span which highlights only the element name and (b)
do not imply that the elements take no attributes.
DEV-7145
When we match a QName against a namespace, we ought to store the matching
QName to use (a) in error messages and (b) to make available as a
binding. The former is necessary for sensible errors (rather than saying
that it's e.g. expecting a closing `t:*`) and the latter is necessary for
e.g. getting the template name out of `t:foo`.
DEV-7145
This allows matching on a namespace prefix by providing a `Prefix` instead
of a `QName`. This works, but is missing a couple notable things (and
possibly more):
1. Tracking the QName that is _actually_ matched so that it can be used in
messages stating what the expected closing tag is; and
2. Making that QName available via a binding.
This will be used to match on `t:*` in NIR. If you're wondering how
attribute parsing is supposed to work with that (of course you're wondering
that, random person reading this)---that'll have to work differently for
those matches, since template shorthand application contains argument names
as attributes.
DEV-7145
This introduces `NodeMatcher`, with the intent of introducing wildcard QName
matches for e.g. `t:*` nodes. It's not yet clear if I'll expand this to
support text nodes, or if I'll convert text nodes into elements to
re-use the existing system (which I had initially planned on doing, but
didn't because of the work and expense (token expansion) involved in the
conversion).
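A sketch of the shape of such a matcher, with strings standing in for the
real `QName`/`Prefix` types:
```
// Hypothetical simplification of NodeMatcher.
enum NodeMatcher {
    /// Match an element name exactly (e.g. `template`).
    Qn(&'static str),
    /// Match any element within a namespace prefix (e.g. `t:*`).
    Prefix(&'static str),
}

impl NodeMatcher {
    fn matches(&self, qname: &str) -> bool {
        match self {
            Self::Qn(name) => qname == *name,
            Self::Prefix(p) => qname
                .split_once(':')
                .map_or(false, |(prefix, _)| prefix == *p),
        }
    }
}
```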
DEV-7145
I need to move on, and there are (a) a couple different ways to proceed that
I want to mull over and (b) upcoming changes that may influence my decision
one way or another.
DEV-7145
This will utilize the superstate's error object in place of nested errors,
which was the result of the previous composition-based delegation.
As you can see, all we had to do was remove the special handling of these
errors; the existing delegation setup continues to handle the types properly
with no change. The composition continues to work for `*Attr_`.
The alternative was to box inner errors, since they're far from the hot code
path, but that's clearly unnecessary.
To be clear: this is necessary to allow for recursive grammars in
`ele_parse` without creating recursive data structures in Rust.
DEV-7145
Comments ought not have any more semantic meaning than whitespace. Other
languages may have conventions that allow for various types of things in
comments, like annotations, but those are symptoms of language
limitations---we control the source language here.
DEV-7145
This properly integrates the trampoline into `ele_parse!`. The
implementation leaves some TODOs, most notably broken mixed text handling
since we can no longer intercept those tokens before passing to the
child. That is temporarily marked as incomplete; see a future commit.
The introduced test `ParseState`s were to help me reason about the system
intuitively as I struggled to track down some type errors in the monstrosity
that is `ele_parse!`. It will fail to compile if those invariants are
violated. (In the end, the problems were pretty simple to resolve, and the
struggle was the type system doing its job in telling me that I needed to
step back and try to reason about the problem again until it was intuitive.)
This keeps around the NT states for now, which are quickly used to
transition to the next NT state, like a couple of bounces on a trampoline:
NT -> Dead -> Parent -> Next NT
This could be optimized in the future, if it's worth doing.
This also makes no attempt to implement tail calls; that would have to come
after fixing mixed content and really isn't worth the added complexity
now. I (desperately) need to move on, and still have a bunch of cleanup to
do.
I had hoped for a smaller commit, but that was too difficult to do with all
the types involved.
DEV-7145
This change introduces diagnostic messages for panics. The intent is to be
able to use panics in situations where it is either not possible to or not
worth the time to recover from errors and ensure a consistent/sensible
system state. In those situations, we still ought to be able to provide the
user with useful information to attempt to get unstuck, since the error is
surely in response to some particular input, and maybe that input can be
tweaked to work around the problem.
Ideally, invalid states are avoided using the type system and statically
verified at compile-time. But this is not always possible, or in some cases
may be way more effort or cause way more code complexity than is worth,
given the unlikelihood of the error occurring.
With that said, it's been interesting, over the past >10y that TAME has
existed, seeing how unlikely errors do sometimes pop up many years after
they were written. It's also interesting to have my intuition of what is
"unlikely" challenged, but hopefully it holds generally.
DEV-7145
I had previously used `Context` to hold the parser configuration for
repetition, since that was the easier option. But I now want to utilize the
`Context` for a stack for the superstate trampoline, and I don't want to
have to deal with the awkwardness of the repetition in doing so, since it
requires that the configuration be created during delegation, rather than
just being passed through to all child parsers.
This adds to a mess that needs cleaning up, but I'll do that after
everything is working.
DEV-7145
And here's the thing that I've been dreading, partly because of the
`macro_rules` issues involved. But, it's not too terrible.
This module was already large and complex, and this just adds to it---it's
in need of refactoring, but I want to be sure it's fully working and capable
of handling NIR before I go spending time refactoring only to undo it.
_This does not yet use trampolining in place of the call stack._ That'll
come next; I just wanted to get the macro updated, the superstate generated,
and tests passing. This does convert into the
superstate (`ParseState::Super`), but then converts back to the original
`ParseState` for BC with the existing composition-based delegation. That
will go away and will then use the equivalent of CPS, using the
superstate+`Parser` as a trampoline. This will require an explicit stack
via `Context`, like XIRF. And it will allow for tail calls, with respect to
parser delegation, if I decide it's worth doing.
The root problem is that source XML requires recursive parsing (for
expressions and statements like `<section>`), which results in recursive
data structures (`ParseState` enum variants). Resolving this with boxing is
not appropriate, because that puts heap indirection in an extremely hot code
path, and may also inhibit the aggressive optimizations that I need Rust to
perform to optimize away the majority of the lowering pipeline.
Once this is sorted out, this should be the last big thing for the
parser. This unfortunately has been a nagging and looming issue for months,
that I was hoping to avoid, and in retrospect that was naive.
DEV-7145
I'm disappointed that I keep having to implement features that I had hoped
to avoid implementing.
This introduces a "superstate" feature, which is intended really just to be
a sum type that is able to delegate to stitched `ParseState`s. This then
allows a `ParseState` to transition directly to another `ParseState` and
have the parent `ParseState` handle the delegation---a trampoline.
This issue naturally arises out of the recursive nature of parsing a TAME
XML document, where certain statements can be nested (like `<section>`), and
where expressions can be nested. I had gotten away with composition-based
delegation for now because `xmlo` headers do not have such nesting.
The composition-based approach falls flat for recursive structures. The
typical naive solution is boxing, which I cannot do, because not only is
this on an extremely hot code path, but I require that Rust be able to
deeply introspect and optimize away the lowering pipeline as much as
possible.
Many months ago, I figured that such a solution would require a trampoline,
as it typically does in stack-based languages, but I was hoping to avoid
it. Well, no longer; let's just get on with it.
This intends to implement trampolining in a `ParseState` that serves as that
sum type, rather than introducing it as yet another feature to `Parser`; the
latter would provide a more convenient API, but it would continue to bloat
`Parser` itself. Right now, only the element parser generator will require
use of this, so if it's needed beyond that, then I'll debate whether it's
worth providing a better abstraction. For now, the intent will be to use
the `Context` to store a stack that it can pop off of to restore the
previous `ParseState` before delegation.
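To make the trampoline idea concrete, here's a toy sketch (TAMER's actual
superstate is generated by `ele_parse!` and looks nothing like this):
```
// Instead of StateA owning a StateB (a recursive type), both
// live behind a flat sum type, and a stack in the context
// restores the parent after delegation.
enum Super {
    A,
    B,
}

struct Context {
    stack: Vec<Super>,
}

fn step(st: Super, ctx: &mut Context, tok: char) -> Super {
    match (st, tok) {
        // `A` transfers control to `B`, remembering itself.
        (Super::A, '(') => {
            ctx.stack.push(Super::A);
            Super::B
        }
        // `B` finishes and bounces back to whomever called it.
        (Super::B, ')') => ctx.stack.pop().unwrap_or(Super::A),
        (st, _) => st,
    }
}
```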
DEV-7145
Since we'll never be reading past the header, this is all that is needed.
If in the future this is violated, XIRF will cause a nice diagnostic error
displaying precisely what opening tag caused the increased level of nesting,
which will aid in debugging and allow us to determine if it ought to be
increased. Here's an example, if I set the max to `3`:
error: maximum XML element nesting depth of `3` exceeded
--> /home/.../foo.xmlo:261:10
|
261 | <preproc:sym-ref name=":_vproduct:vector_a"/>
| ^^^^^^^^^^^^^^^^ error: this opening tag increases the level of nesting past the limit of 3
Of course, the longer-term goal is to do away with `xmlo` entirely.
This had no (perceivable via `/usr/bin/time -v`, at least) impact on memory
or CPU time.
DEV-7145
"Mixed content" is the XML term representing element nodes mixed with text
nodes. For example, `foo <strong>bar</strong> baz` is mixed.
TAME supports text nodes as documentation, intended to be in a literate
style but never fully realized. In any case, we need to permit them, and I
wanted to do more than just ignore the nodes.
This takes a different approach than typical parser delegation---it has the
parent parser _preempt_ the child by intercepting text before delegation
takes place, rather than having the child reject the token (or possibly
interpret it itself!) and have to handle an error or dead state.
And while this makes it more confusing in terms of state machine stitching,
it does make sense, in the sense that the parent parser is really what
"owns" the text node---the parser is delegating _element_ parsing only, take
asserts authority when necessary to take back control where it shouldn't be
delegated.
DEV-7145
Previously a `Depth` was provided only for `Open` and `Close`. This depth
information, for example, will be used by NIR to quickly determine whether a
given parser ought to assert ownership of a text/comment token rather than
delegating it.
This involved modifying a number of test cases, but it's worth repeating in
these commits that this is intentional---I've been bit in the past using
`..` in contexts where I really do want to know if variant fields change so
that I can consider whether and how that change may affect the code
utilizing that variant.
DEV-7145
Recent changes regarding whitespace were all to support this change (though
it was also needed for XIRF, pre- and post-root).
Now I'll have to contend with how I want to handle text nodes in various
circumstances, in terms of `ele_parse!`.
DEV-7145
Various DUMMY_SPAN-derived spans are used by many test cases, so this
finally extracts them---something I've been meaning to do for some time.
This also places DUMMY_SPAN behind a `cfg(test)` directive to ensure that it
is _only_ used in tests; UNKNOWN_SPAN should be used when a span is actually
unknown, which may also be the case during development.
DEV-7145
Whether or not quoting is appropriate depends on context, and that parent
context is already performing the quoting. For example:
error: expected `</rater>`, but found `<import>`
--> /home/[...]/foo.xml:2:1
|
2 | <rater xmlns="http://www.lovullo.com/rater"
| ------ note: element starts here
--> /home/[...]/foo.xml:7:3
|
7 | <import package="/rater/core/base" />
| ^^^^^^^ error: expected `</rater>`
In these cases (obviously I'm still working on the parser, since this is
nonsense), the parser is responsible for quoting the token "<import>".
DEV-7145
There were two problem errors: one showing "element element" and one showing
the value along with the name of the attribute.
The change for `<Attr as Display>::fmt` is debatable. I'm going to do this
for now (only show `@name`) and adjust later if necessary.
I'll need to go use `crate::fmt` consistently in previously-existing format
strings at some point, too.
DEV-7145
This teaches XIRF to optionally refine Text into RefinedText, which
determines whether the given SymbolId represents entirely whitespace.
This is something I've been putting off for some time, but now that I'm
parsing source language for NIR, it is necessary, in that we can only permit
whitespace Text nodes in certain contexts.
The idea is to capture the most common whitespace as preinterned
symbols. Note that this heuristic ought to be determined from scanning a
codebase, which I haven't done yet; this is just an initial list.
The fallback is to look up the string associated with the SymbolId and
perform a linear scan, aborting on the first non-whitespace character. This
combination of checks should be sufficiently performant for now considering
that this is only being run on source files, which really are not all that
large. (They become large when template-expanded.) I'll optimize further
if I notice it show up during profiling.
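A sketch of that two-tier check (the preinterned ids here are stand-ins):
```
// Hypothetical preinterned whitespace symbol ids.
const WS_SYMBOLS: &[u32] = &[1, 2, 3]; // " ", "\n", "  ", ...

fn is_whitespace(sym: u32, lookup: impl Fn(u32) -> &'static str) -> bool {
    // Fast path: the most common whitespace is preinterned.
    if WS_SYMBOLS.contains(&sym) {
        return true;
    }

    // Fallback: linear scan, aborting on the first
    // non-whitespace character.
    lookup(sym).chars().all(char::is_whitespace)
}
```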
This also frees XIR itself from being concerned by Whitespace. Initially I
had used quick-xml's whitespace trimming, but it messed up my span
calculations, and those were a pain in the ass to implement to begin with,
since I had to resort to pointer arithmetic. I'd rather avoid tweaking it.
tameld will not check for whitespace, since it's not important---xmlo files,
if malformed, are the fault of the compiler; we can ignore text nodes except
in the context of code fragments, where they are never whitespace (unless
that's also a compiler bug).
Onward and yonward.
DEV-7145
The trace outputs a note in the footer indicating _why_ it's being output,
so that the reader understands both where the potentially-unexpected
behavior originates from and so they know (in the case of the feature flag)
how to inhibit it.
That information originally lived in `Parser`, where the `cfg` directive to
enable it lives, but it was moved into the abstraction. This corrects that.
DEV-7145
This has gotten large and was cluttering `feed_tok`. This also provides the
ability to more easily expand into other types of tracing in the future.
DEV-7145
This information is likely redundant in a lowering pipeline, but is more
useful outside of such a pipeline. It's also more clear.
`Object` does not implement `Display`, though, because that's too burdensome
for how it's currently used. Many `Object`s are also `Token`s though and,
if fed to another `Parser` for lowering, it'll get `Display::fmt`'d.
DEV-7145
Rust was warning that the `cfg` was unused if both `test` and
`parser-trace-stderr` were enabled. This both allows that combination and
adjusts the precedence to make more sense for tests.
DEV-7145
Because of recovery, the trace otherwise paints a really confusing-looking
picture when given unexpected input.
This is large enough now that it really ought to be extracted from
`feed_tok`, but I'll wait to see how this evolves further. I considered
adding color too, but it's not yet clear to me that the visual noise will be
all that helpful.
DEV-7145
This flag allows toggling the parser trace that was previously only
available to tests. Unfortunately, at the time of writing, Cargo cannot
enable flags in profiles, so I have to check for either `test` or this flag
being set to enable relevant features.
This trace is useful as I start to run the parser against existing code
written in TAME so that our existing systems can help to guide my
development. Unlike the current tests, it also allows seeing real-world
data as part of the lowering pipeline, where multiple `Parser`s are in
play.
Having this feature flag also makes this feature more easily discoverable to
those wishing to observe how the lowering pipeline works.
DEV-7145
impl for `&Token` instead of Token; the writer is just copying data into the
destination stream anyway.
This will allow us to continue writing the token while also using it for
further processing, like `tee`.
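A sketch of the pattern (hypothetical trait and token, not XIR's actual
API):

use std::io::{self, Write};

enum Token {
    Text(String),
}

trait XmlWriter {
    fn write<W: Write>(self, sink: &mut W) -> io::Result<()>;
}

// Implemented on `&Token`: writing merely borrows the token...
impl XmlWriter for &Token {
    fn write<W: Write>(self, sink: &mut W) -> io::Result<()> {
        match self {
            Token::Text(s) => sink.write_all(s.as_bytes()),
        }
    }
}

fn tee<W: Write>(tok: &Token, sink: &mut W) -> io::Result<()> {
    tok.write(sink)?;
    // ...so the token remains available here for further processing.
    Ok(())
}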
DEV-7145
We need to be able to export generated identifiers. Trying to figure out a
syntax for this was a bit tricky considering how much is generated, so I
just settled on something that's reasonably clear and easy to parse with
`macro_rules!`.
I had intended to just make everything public by default and encapsulate
using private modules, but that then required making everything else that it
uses public (e.g. error and token objects), which would have been a bizarre
thing to do in e.g. test cases.
DEV-7145
Values can be parsed using `TryFrom<Attr>`. Previously only `From<Attr>`
was supported, which could not fail.
This is critical for parsing values into types, which will wrap `SymbolId`
to provide data assurances.
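A sketch of the shape this enables (the `Attr` and `Dim` types here are
hypothetical simplifications):

struct Attr {
    value: String,
}

/// Newtype providing a data assurance: a dimensionality of 0--2.
#[derive(Debug)]
struct Dim(u8);

#[derive(Debug)]
struct InvalidDim(String);

impl TryFrom<Attr> for Dim {
    type Error = InvalidDim;

    // Unlike `From<Attr>`, this conversion is permitted to fail when
    // the value does not match the expected pattern.
    fn try_from(attr: Attr) -> Result<Self, Self::Error> {
        match attr.value.as_str() {
            "0" => Ok(Dim(0)),
            "1" => Ok(Dim(1)),
            "2" => Ok(Dim(2)),
            bad => Err(InvalidDim(bad.to_string())),
        }
    }
}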
DEV-7145
The tests had certain things in scope, but now that I'm trying to use it
outside of those modules, some fixes are needed.
This is admittedly a sloppy commit, with a number of miscellaneous fixes. I
didn't bother separating it more because most of them are type fixes, and
the `From<Attr>` stuff is going to have to change into, likely,
`TryFrom<Attr>` so that parse failures can occur when attributes do not
match certain patterns.
DEV-7145
The only additional information needed was opening spans so that we can
provide useful information regarding closing tags.
This uses a generic Span in place of {Open,Close}Span because the latter
wasn't necessary, but more descriptive types would be nice; it may be
beneficial later on to introduce newtypes for each of the span generated by
{Open,Close}Span.
DEV-7145
This was a TODO for the attribute parser generator. The first attribute
will be kept and later ones will be ignored, producing an error. Recovery
permits further attribute parsing having ignored the duplicate.
DEV-7145
This allows an element to be repeated by the parent NT. The easiest way I
saw to implement this for now was to abuse the Context to provide a runtime
configuration that would allow the state machine to reset after it has
completed parsing.
This also influences error recovery, in that if we're expecting zero or more
of something, we cannot provide an error for an unexpected name, and instead
must emit a dead state so that the caller can determine what to do.
DEV-7145
This produces useful parse traces that are output as part of a failing test
case. The parser generator macros can be a bit confusing to deal with when
things go wrong, so this helps to clarify matters.
This is _not_ intended to be machine-readable, but it does show that it
would be possible to generate machine-readable output to visualize the
entire lowering pipeline. Perhaps something for the future.
I left these inline in Parser::feed_tok because they help to elucidate what
is going on, just by reading what the trace would output---that is, it helps
to make the method more self-documenting, albeit a tad bit more
verbose. But with that said, it should probably be extracted at some point;
I don't want this to set a precedent where composition is feasible.
Here's an example from test cases:
[Parser::feed_tok] (input IR: XIRF)
| ==> Parser before tok is parsing attributes for `package`.
| | Attrs_(SutAttrsState_ { ___ctx: (QName(None, LocalPart(NCName(SymbolId(46 "package")))), OpenSpan(Span { len: 0, offset: 0, ctx: Context(SymbolId(1 "#!DUMMY")) }, 10)), ___done: false })
|
| ==> XIRF tok: `<unexpected>`
| | Open(QName(None, LocalPart(NCName(SymbolId(82 "unexpected")))), OpenSpan(Span { len: 0, offset: 1, ctx: Context(SymbolId(1 "#!DUMMY")) }, 10), Depth(1))
|
| ==> Parser after tok is expecting opening tag `<classify>`.
| | ChildA(Expecting_)
| | Lookahead: Some(Lookahead(Open(QName(None, LocalPart(NCName(SymbolId(82 "unexpected")))), OpenSpan(Span { len: 0, offset: 1, ctx: Context(SymbolId(1 "#!DUMMY")) }, 10), Depth(1))))
= note: this trace was output as a debugging aid because `cfg(test)`.
[Parser::feed_tok] (input IR: XIRF)
| ==> Parser before tok is expecting opening tag `<classify>`.
| | ChildA(Expecting_)
|
| ==> XIRF tok: `<unexpected>`
| | Open(QName(None, LocalPart(NCName(SymbolId(82 "unexpected")))), OpenSpan(Span { len: 0, offset: 1, ctx: Context(SymbolId(1 "#!DUMMY")) }, 10), Depth(1))
|
| ==> Parser after tok is attempting to recover by ignoring element with unexpected name `unexpected` (expected `classify`).
| | ChildA(RecoverEleIgnore_(QName(None, LocalPart(NCName(SymbolId(82 "unexpected")))), OpenSpan(Span { len: 0, offset: 1, ctx: Context(SymbolId(1 "#!DUMMY")) }, 10), Depth(1)))
| | Lookahead: None
= note: this trace was output as a debugging aid because `cfg(test)`.
DEV-7145
This resolves a TODO by including the name of the element whose attributes
are currently being parsed.
This also frees a parent from having to provide additional context, allowing
Display to be fully delegated when stitching.
DEV-7145
This introduces `Nt := (A | ... | Z);`, where `Nt` is the name of the
nonterminal and `A ... Z` are the inner nonterminals---it produces a parser
that provides a choice between a set of nonterminals.
This is implemented efficiently by understanding the QName that is accepted
by each of the inner nonterminals and delegating that token immediately to
the appropriate parser. This is a benefit of using a parser generator macro
over parser combinators---we do not need to implement backtracking by
letting inner parsers fail, because we know ahead of time exactly what
parser we need.
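A sketch of the dispatch concept, using `&str` as a stand-in for QName
(names hypothetical):

#[derive(Debug)]
enum Child {
    Classify,
    Rate,
}

fn dispatch(qname: &str) -> Result<Child, String> {
    match qname {
        "classify" => Ok(Child::Classify),
        "rate" => Ok(Child::Rate),
        // The expected names are known at compile time, so the error
        // can enumerate them as a suggestion.
        unexpected => Err(format!(
            "unexpected `<{unexpected}>`; expected `<classify>` or `<rate>`"
        )),
    }
}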
This _does not_ verify that each of the inner parsers accept a unique QName;
maybe at a later time I can figure out something for that. However, because
this compiles into a `match`, there is no ambiguity---like a PEG parser,
there is precedence in the face of an ambiguous token, and the first one
wins. Consequently, tests would surely fail, since the latter could never
be parsed.
This also demonstrates how we can have good error suggestions for this
parsing framework: because the inner nonterminals and their QNames are known
at compile time, error messages simply generate a list of QNames that are
expected.
The error recovery strategy is the same as previously noted, and subject to
the same concerns, though it may be more appropriate here: it is desirable
for the inner parser to fail rather than retrying, so that the sum parser is
able to fail and, once the Kleene operator is introduced, retry on another
potential element. But again, that recovery strategy may happen to work in
some cases, but'll fail miserably in others (e.g. placing an unknown element
at the head of a block that expects a sequence of elements would potentially
fail the entire block rather than just the invalid one). But more to come
on that later; it's not critical at this point. I need to get parsing
completed for TAME's input language.
DEV-7145
This adds the ability to bind identifiers to represent `OpenSpan` and
`CloseSpan`, available to the `@` and `/` maps. Since identifiers in TAME
originate from attributes, this may not get a whole lot of use, but it's
important to be available.
There is some awkwardness in that the opening span appears to be scoped to
the entire nonterminal, but it's actually only available in the `@`
mapping. I'll change this if it's actually needed; this keeps things simple
for now.
DEV-7145
Since the parsers produce streaming IRs, we need to be able to emit tokens
representing closing delimiters, where they are important.
This notably doesn't use spans; I'll add those next, since they're also
needed for the previous work.
DEV-7145
The comment explains the issue. I don't think the strategy is going to be a
desirable one, but I want to move on and observe in retrospect how it ought
to be handled.
The important part right now is that recovery is accounted for and possible,
which was a long-standing concern.
DEV-7145
This begins generating parsers that are capable of parsing elements. I need
to move on, so this abstraction isn't going to go as far as it could, but
let's see where it takes me.
This was the work that required the recent lookahead changes, which has been
detailed in previous commits.
This initial support is basic, but robust. It supports parsing elements
with attributes and children, but it does not yet support the equivalent of
the Kleene star (`*`). Such support will likely be added by supporting
parsers that are able to recurse on their own definition in tail position,
which will also require supporting parsers that do not add to the stack.
This generates parsers that, like all the other parsers, use enums to
provide a typed stack. Stitched parsers produce a nested stack that is
always bounded in size. Fortunately, expressions---which can nest
deeply---do not need to maintain ancestor context on the stack, and so this
should work fine; we can get away with this because XIRF ensures proper
nesting for us. Statements that _do_ need to maintain such context are not
nested.
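A sketch of what such a typed stack looks like (hypothetical states):

enum AttrState {
    Expecting,
    Done,
}

enum ChildState {
    Attrs(AttrState),
    Closed,
}

// Each parent state embeds its delegate's state, so a stitched parser
// is a single value whose maximum "depth" is known at compile time;
// no runtime stack growth is possible. For example:
//   RootState::Child(ChildState::Attrs(AttrState::Expecting))
enum RootState {
    Attrs(AttrState),
    Child(ChildState),
    Closed,
}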
This also does not yet support emitting an object on closing tag, which
will be necessary for NIR, which will be a streaming IR that is "near" to
the source XML in structure. This will then be used to lower into AIR for
the ASG, which gives structure needed for further analysis.
More information to come; I just want to get this committed to serve as a
mental synchronization point and clear my head, since I've been sitting on
these changes for so long and have to keep stashing them as I tumble down
rabbit holes covered in yak hair.
DEV-7145
Having the lookahead token generic over the `ParseState` was a pain in the
ass for stitching, since they shared the same token type but not the same
parser. I don't expect there to be any need to be able to infer other
parser-related types for a token of lookahead, so I'd rather just make my
life easier until such a thing is needed.
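The resulting shape is roughly (hypothetical):

// Generic only over the token type, not the `ParseState` that produced
// it, so stitched parsers sharing a token type can exchange lookahead
// freely without unifying their parser types.
#[derive(Debug)]
pub struct Lookahead<T>(pub T);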
DEV-7145
Oh what a tortured journey. I had originally tried to avoid formalizing
lookahead for all parsers by pretending that it was only needed for dead
state transitions (that is---states that have no transitions for a given
input token), but then I needed to yield information for aggregation. So I
added the ability to override the token for `Dead` to yield that, in
addition to the token. But then I also needed to yield lookahead for error
conditions. It was a mess that didn't make sense.
This eliminates `ParseStatus::Dead` entirely and fully integrates the
lookahead token in `Parser` that was previously implemented.
Notably, the lookahead token is encapsulated in `TransitionResult` and
unavailable to `ParseState` implementations, forcing them to rely on
`Parser` for recursion. This not only prevents `ParseState` from recursing,
but also simplifies delegation by removing the need to manually handle
tokens of lookahead.
The awkward case here is XIRT, which does not follow the streaming parsing
convention, because it was conceived before the parsing framework. It needs
to go away, but doing so right now would be a lot of work, so it has to
stick around for a little bit longer until the new parser generators can be
used instead. It is a persistent thorn in my side, going against the grain.
`Parser` will immediately recurse if it sees a token of lookahead with an
incomplete parse. This is because stitched parsers will frequently yield a
dead state indication when they're done parsing, and there's no use in
propagating an `Incomplete` status down the entire lowering pipeline. But,
that does mean that the toplevel is not the only thing recursing. _But_,
the behavior doesn't really change, in the sense that it would infinitely
recurse down the entire lowering stack (though there'd be an opportunity to
detect that). This should never happen with a correct parser, but it's not
worth the effort right now to try to force such a thing with Rust's type
system. Something like TLA+ is better suited here as an aid, but it
shouldn't be necessary with clear implementations and proper test
cases. Parser generators will also ensure such a thing cannot occur.
I had hoped to remove ParseStatus entirely in favor of Parsed, but there's a
lot of type inference that happens based on the fact that `ParseStatus` has
a `ParseState` type parameter; `Parsed` has only `Object`. It is desirable
for a public-facing `Parsed` to not be tied to `ParseState`, since consumers
need not be concerned with such a heavy type; however, we _do_ want that
heavy type internally, as it carries a lot of useful information that allows
for significant and powerful type inference, which in turn creates
expressive and convenient APIs.
DEV-7145
*NB: This is the initial change to introduce the token of lookahead, but this
does not fully integrate it. In particular, this is missing from the
stitching/delegation layer.*
This has been a long time coming, I suppose, though I had tried to avoid it
with `Parser::delegate_lookahead`. But the problem with doing that is that
it forced the ParserState to recurse, which both violates that I want no
looping constructs except for the toplevel, and performs additional stack
allocation as it is not in tail position.
The final straw was having to both return an error _and_ an aggregate object
for the attribute parser when an unexpected element is encountered (this
code is not yet committed). One option was to add a recovery object to the
error object, and formalize that, but then we have other concerns; for
example, what if that recovery object triggered an error? We'd have to mask
either the old or the new error. But we wouldn't want to mask either,
because the object causing the error would be the aggregate attributes,
which is _not_ a recovery object, but actual data we want to emit. And so
it's a kluge right off of the bat.
The use of a token of lookahead is a more traditional approach and has uses
outside of just this one scenario. It'll also allow for the removal of
recursion from the existing ParserStates, and possibly the elimination of
dead state associated data, though I may end up leaving that; more to come.
Rust will also optimize away lookahead storage and processing in Parsers
that do not utilize it.
DEV-7145
I'd feel rather silly if I used `debug_assert!` for the sake of tests and
they weren't actually being run due to optimization settings.
This is just to catch potential future regressions; all is well today.
DEV-7145
There was only one test outside of the `parse` module using these
fields. The next commit will be introducing lookahead, and I do not want to
have to trust callers to ensure invariants are met.
DEV-7145
This reverts commit b973d36862.
Alright, I'm getting sick of fighting with myself on this. But rather than
just removing the last commit, I'm going to keep it around, so that my
thoughts are clearly documented for my future quarrels with myself.
Firstly: this added more overhead than I wanted it to. While it wasn't
significant, it did add 100--150ms to one of our largest systems, up from
~2.8s, which seems a bit much for a token that's really just meant to make
life easier for the parser.
Further, it seems that all I've managed to do is push my original problem to
a different layer---this started as a means to resolve having to emit both
an object and an error simultaneously in the case where aggregate attribute
parsing has completed, but we encounter an error on the next token (e.g. an
unexpected element). But XIRF, if it's missing AttrEnd, should throw an
error, but should also recover. Recovery is easy---just assume that it was
present---_but then we don't emit a XIRF `AttrEnd` token_, which is
necessary for downstream systems. So we'd need to either:
(a) emit both a token and an error; or
(b) panic.
But if we're doing (a), then the need for `AttrEnd` goes away, because it
solves the original problem (though the other concerns of the previous
commit still stand). (b) is not ideal at all, even though the missing token
does represent an internal system error; it's not something the user can
correct. But, given that it's something that the user cannot correct,
doesn't that imply that it's an awkward thing to include in the token
stream? So back to `AttrEnd` being an awkward PITA to have.
So, given (a), I'll just do that: errors will become more of a "hey, this
error just occurred, but I'm trying to recover---here's an object that you
should use if you choose to continue parsing, but it may or may not be what
you're looking for; proceed with caution". That flips the original script:
I imagined having external systems feed recovery tokens, but this
encapsulates recovery within the parser, which really is more appropriate,
though less flexible than having an omniscient external recovery system;
such a monolith was always an awkward concept and would be difficult to
implement cleanly.
This can also potentially be implemented as a generalization of the Dead
state change that allowed an object to be emitted alongside the
lookahead/error.
Anyway, back to where I was...I'm sure I'll look back on this in the future
shaking my head, reflecting on how naive I was.
DEV-7145
AttrEnd was initially removed in
0cc0bc9d5a (and the commit prior), because
there was not a compelling reason to use it over a lookahead
operation (returning a token via a dead state transition); `AttrEnd`
simply introduced inconsistencies between the XIR reader (which produced
AttrEnd) and internal XIR stream generators (e.g. the lowering operations
into XIR->XML, which do not).
But now that parsers are performing aggregation---in particular the
attribute parser-generator `xir::parse::attr`---this has become quite a
pain, because the dead state is an actionable token. For example:
1. Open
2. Attr
3. Attr
4. Open
5. ...
In the happy case, token #4 results in `Parsed::Incomplete`, and so can just
be transformed into the object representing the aggregated attributes. But
even in this happy path, it's ugly, and it requires non-tail recursion on
the parser which requires a duplicate stack allocation for the
`ParserState`. That violates a core principle of the system.
But if there is an error at #4---e.g. an unexpected element---then we no
longer have a `Parsed::Incomplete` to hijack for our own uses, and we'd have
to introduce the ability to return both an error and a token, or we'd have
to introduce the ability to keep a token of lookahead instead of reading
from the underlying token stream, but that's complicated with push parsers,
which are used for parser composition. Yikes.
And furthermore, the aggregation has caused me to introduce the ability to
override the dead state type to introduce both a token of lookahead and
aggregation information. This complicates the system and is going to be
confusing to others.
Given all of this, AttrEnd does now seem appropriate to reintroduce, since
it will allow processing of aggregate operations when encountering that
token without having to worry about the above scenario; without having to
duplicate a `ParseState` stack; without having to hijack dead state
transitions for producing our aggregate object; and everything else
mentioned above.
This commit does not modify those abstractions to use AttrEnd yet; it
re-introduces the token to the core system, not the parser-generators, and
it doesn't yet replace lookahead operations in the parsers that use
them. That'll come next. Unlike the commit that removed it, though, we are
now generating proper spans, so make note of that here. This also does not
introduce the concept to XIRF yet, which did not exist at the time that it
was removed, so XIRF is filtering it out until a following commit.
DEV-7145
This is no longer needed after the previous commit, with static spans
having been replaced by `const` spans.
This used to be required before Rust acquired better const features, and
before I had preinterned symbols.
DEV-7145
This isn't conceptually all that significant of a change, but there was a lot
to modify to get it working. I would generally separate this into a commit
for the implementation and another commit for the integration, but I decided
to keep things together.
This serves a role similar to AttrSpan---this allows deriving a span
representing the element name from a span representing the entire XIR
token. This will provide more useful context for errors---including the tag
delimiter(s) means that we care about the fact that an element is in that
position (as opposed to some other type of node) within the context of an
error. However, if we are expecting an element but take issue with the
element name itself, we want to place emphasis on that instead.
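A worked sketch of the derivation for an opening tag (hypothetical span
representation; the real system derives this from the QName and the token
span):

#[derive(Debug, PartialEq)]
struct Span {
    offset: usize,
    len: usize,
}

/// Derive the element name's span from the span of the entire token.
fn name_span(token_span: &Span, qname_len: usize) -> Span {
    Span {
        offset: token_span.offset + 1, // skip the `<` delimiter
        len: qname_len,
    }
}

// `<rater ...>` beginning at byte 10: the name `rater` occupies bytes
// 11..16, so `name_span` yields { offset: 11, len: 5 }.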
This also starts to consider the issue of span contexts---a blob of detached
data that is `Span` is useful for error context, but it's not useful for
manipulation or deriving additional information. For that, we need to
encode additional context, and this is an attempt at that.
I am interested in the concept of providing Spans that are guaranteed to
actually make sense---that are instantiated and manipulated with APIs that
ensure consistency. But such a thing buys us very little, practically
speaking, over what I have now for TAMER, and so I don't expect to actually
implement that for this project; I'll leave that for a personal
project. TAMER has already taken a lot of my personal interest, and it can
cause me a lot of grief sometimes (with regards to letting my aspirations
cause me more work).
DEV-7145
Non-attribute and non-empty start/end tags will have their whitespace
as part of the produced span. This sets us up for a following change that
will allow for deriving the name span from this span given a QName, which
gives us a span that both represents the entire XIR token and allows
deriving the element name.
An accurate token span is necessary for parsing errors where an element was
not expected, while an element name span is more appropriate for issues of
grammar and semantic errors that deal not with the fact that an element was
encountered, but _what_ element was encountered.
DEV-7145
This both adds clarifying tests and corrects the case of `<foo/>`, where the
offset was erroneously off by one---it saw that there were no attributes and
added a byte thinking it'd include `>`, as in `<foo>`.
DEV-7145
This is the first parser generator for the parsing framework. I've been
waiting quite a while to do this because I wanted to be sure that I
understood how I intended to write the attribute parsers manually. Now that
I'm about to start parsing source XML files, it is necessary to have a
parser generator.
Typically one thinks of a parser generator as a separate program that
generates code for some language, but that is not always the case---that
represents a lack of expressiveness in the language itself (e.g. C). Here,
I simply use Rust's macro system, which should be a concept familiar to
someone coming from a language like Lisp.
This also resolves where I stand on parser combinators with respect to this
abstraction: they both accomplish the exact same thing (composition of
smaller parsers), but this abstraction doesn't do so in the typical
functional way. But the end result is the same.
The parser generated by this abstraction will be optimized and inlined in the
same manner as the hand-written parsers. Since they'll be tightly coupled
with an element parser (which too will have a parser generator), I expect
that most attribute parsers will simply be inlined; they exist as separate
parsers conceptually, for the same reason that you'd use parser combinators.
It's worth mentioning that this awkward reliance on dead state for a
lookahead token to determine when aggregation is complete rubs me the wrong
way, but resolving it would involve reintroducing the XIR AttrEnd that I had
previously removed. I'll keep fighting with myself on this, but I want to
get a bit further before I determine if it's worth the tradeoff of
reintroducing (more complex IR but simplified parsing).
DEV-7145
This was missed. It was not possible, using the documentation
alone (without looking at the linked source), to tell what the QName actually
represented, though you could assume from the name.
DEV-7145
This is partly an experiment, but is designed to simplify producing English
sentences in various contexts. It makes use of a not only unstable, but
incomplete, Rust feature---adt_const_params, for a static str const type
parameter. Hopefully that ends up being stabilized.
This uses types, but it's the same as function composition due to Rust's
monomorphization.
DEV-7145
`ParseState` originally required `Default` for use with `mem::take` in
`Parser::feed_tok`. This unfortunately cannot last, since more specialized
parsers require context during initialization in order to provide useful
diagnostic information. (The other option is to require the caller to
augment errors with diagnostic information, but that would have to be
duplicated by every caller and complicates parser composition; I'd prefer
those diagnostic details remain encapsulated.)
Replacing `Default` with `Option` is uglier, but it ends up producing the
same assembly as `mem::take` did, at least at the time of writing. Because
Rust is able to elide unnecessary moves using this implementation, there is
no need for `unwrap_unchecked` or other unsafe methods, which is great,
since it shows that this parsing methodology is viable entirely in safe
Rust.
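A sketch of the pattern (heavily simplified relative to the real `Parser`):

struct Parser<S> {
    state: Option<S>, // replaces `S: Default` + `mem::take`
}

impl<S> Parser<S> {
    fn feed_with(&mut self, step: impl FnOnce(S) -> S) {
        // Move the state out by value without requiring `Default`...
        let state = self.state.take().expect("state must be present");
        // ...transition, then restore; Rust elides the moves, so no
        // `unwrap_unchecked` is needed.
        self.state = Some(step(state));
    }
}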
DEV-7145
Previously, `ParseStatus::Dead` always yielded
`ParseState::Token`. However, I'm working on introducing parsers that
aggregate (parsing XML attributes into structs), and those parsers do not
know that they have completed aggregation until they reach a dead state;
given that, I need to yield additional information at that time.
I played around with a number of alternative ideas, but this ended up being
the cleanest, relative to the effort involved. For example, introducing
another parameter to `ParseStatus::Dead` was too burdensome on APIs that
ought not concern themselves with the possibility of receiving an object in
addition to a lookahead token, since many parsers are not capable of doing
so (given that they map M input tokens to N objects, where N <= M).
Another option that I abandoned fairly quickly was having
`is_accepting` (potentially renamed) return an aggregate object, since
that's on the side and didn't feel like it was part of the parsing pipeline.
The intent is to abstract this some in a new `ParseState` method for
delegation + aggregation.
DEV-7145
I'll document it more formally eventually, but this settles on a mix of the
two: square brackets and dashes for intervals, `+` for intersecting lines,
byte offsets below interval endpoints, and names below that.
The docblock for `Span` itself is still off; I'll probably just take one of
the test cases and paste it there at some point.
DEV-7145
This replaces a tuple with a tuple struct that allows for calculating more
complete span information, such as the span encompassing the entire
attribute and the value span including the surrounding quotes.
This includes logic that ought to be abstracted into `Span` itself, and it's
not as formal as I'd like it to be (e.g. not ensuring context), but this is
a good starting point.
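A sketch of the kind of calculation this enables (hypothetical layout and
offsets):

#[derive(Debug, Clone, Copy, PartialEq)]
struct Span {
    offset: usize,
    len: usize,
}

/// (key span, value span excluding quotes)
struct AttrSpan(Span, Span);

impl AttrSpan {
    /// Value span widened by one byte on each side for the quotes.
    fn value_with_quotes(&self) -> Span {
        Span { offset: self.1.offset - 1, len: self.1.len + 2 }
    }

    /// Span encompassing the entire `key="value"` attribute.
    fn attr(&self) -> Span {
        let end = self.1.offset + self.1.len + 1; // past closing quote
        Span { offset: self.0.offset, len: end - self.0.offset }
    }
}

// For `name="foo"` at offset 0 (key 0..4, value 6..9):
//   value_with_quotes() == { offset: 5, len: 5 }   (`"foo"`)
//   attr()              == { offset: 0, len: 10 }  (`name="foo"`)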
Note that parsers call `Token::span`, which in turn calculates the attribute
span, each time an attribute is encountered during lowering. But Rust does
a good job at optimizing away unnecessary operations, so this didn't have an
observable impact on time.
DEV-7145
This provides much more clarity as to what is going on. Further, it's less
ambiguous, since I'm about to introduce a new type of xmlo lowering into XIR
for writing the actual xmlo files.
DEV-7145
This allows `XmlXirReader` to be used in a `Lower` operation, just as
everything else, bringing me one step closer to a pipeline that can be
concisely represented; this is finally beginning to unify in a clear way,
though it is still a bit of a mess.
This causes `XmlXirReader` to _act_ like a `parse::Parser` in that it yields
a `ParsedResult`, but it does not use `parse::Parser` itself; that was the
_original_ plan: convert it into a `ParseState` where `XmlXirReader` became
a context, and force `Parser` to yield by feeding it a stream of tokens with
`repeat`, but that ended up performing poorly relative to this change. I
did some investigation, which I might write about in the future, but for
now, this solution works just fine.
DEV-7145
This abstraction has grown quite a bit, and it's time to start formalizing
it a bit. This split doesn't change any behavior, but it does start to make
it easier to reason about by clearly stating the broad components and how
they interact with one-another.
This doesn't yet move the tests; those will come next, but they are very
few. The reasons I gave previously for this were that (a) they're tested
indirectly via the systems that utilize them and (b) because the abstraction
was not yet settled on, the process was already very expensive. No test
coverage was lost---it's only that failures were potentially harder to
debug, but in practice not even this was true, because the deeply
expressive types all but ensured that, if it compiles, it will function in a
way that is expected. Unit tests and documentation for this system will be
added once I'm sure that this abstraction is in a proper state.
DEV-7145
This also modifies `poc` such that `Lower` is invoked as an associated
function rather than a method to emphasize the pattern that is forming, so
that it can be later abstracted away.
DEV-11864
The `while_ok` can just be implied with a lowering operation, and that
reduces the name complexity so that we can maybe introduce even more
specialized methods without resulting in a huge sentence as a name.
DEV-11864
This finally uses `parse` all the way up to aggregation into the ASG, as can
be seen by the mess in `poc`. This will be further simplified---I just need
to get this committed so that I can mentally get it off my plate. I've been
separating this commit into smaller commits, but there's a point where it's
just not worth the effort anymore. I don't like making large changes such
as this one.
There is still work to do here. First, it's worth re-mentioning that
`poc` means "proof-of-concept", and represents things that still need a
proper home/abstraction.
Secondly, `poc` is retrieving the context of two parsers---`LowerContext`
and `Asg`. The latter is desirable, since it's the final aggregation point,
but the former needs to be eliminated; in particular, packages need to be
worked into the ASG so that `found` can be removed.
Recursively loading `xmlo` files still happens in `poc`, but the compiler
will need this as well. Once packages are on the ASG, along with their
state, that responsibility can be generalized as well.
That will then simplify lowering even further, to the point where hopefully
everything has the same shape (once final aggregation has an abstraction),
after which we can then create a final abstraction to concisely stitch
everything together. Right now, Rust isn't able to infer `S` for
`Lower<S, LS>`, which is unfortunate, but we'll be able to help it along
with a more explicit abstraction.
DEV-11864
This is present on all other packages. Rather than complicating TAMER to
accommodate a missing name, it's trivial to just add it.
This will, unfortunately, invalidate and require rebuilding of all xmlo
files, based on the `.rev-xmlo` bump.
DEV-11864
This is intended to describe, to the user, the state that the parser is
in. This will be used to convey additional information for general parser
errors, but it should also probably be integrated into parsers' individual
errors as well when appropriate.
This is something I expected to add at some point, but I wanted to add them
because, when dealing with lowering errors, it can be difficult to tell
what parser the error originated from.
DEV-11864
The `*_iter_while_ok` functions now compose like monads, flattening `Result`
at each step and drastically simplifying handling of error types. This also
removes the bunch of `?`s at the end of the expression, and allows me to use
`?` within the callback itself.
I had originally not used `Result` as the return type of the callback
because I was not entirely sure how I was going to use them, but it's now
clear that I _always_ use `Result` as the return type, and so there's no use
in trying to be too accommodating; it can always change in the future.
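A minimal sketch of the flattening (eagerly collected here for brevity;
the real combinators stream):

fn with_ok_items<T, E, U>(
    items: Vec<Result<T, E>>,
    f: impl FnOnce(Vec<T>) -> Result<U, E>,
) -> Result<U, E> {
    // Stop at the first `Err` from the source...
    let ok_items: Vec<T> = items.into_iter().collect::<Result<_, _>>()?;
    // ...and flatten the callback's own `Result`, rather than handing
    // back a nested `Result<Result<U, E>, E>` for the caller to `?`.
    f(ok_items)
}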
This is desirable not just for cleanup, but because trying to refactor
`asg_builder` into a pair of `Parser`s is really messy to chain without
flattening, especially given some state that has to leak temporarily to the
caller. More on that in a future commit.
DEV-11864
This was always the intent, but I didn't have a higher-level object
yet. This removes all the awkwardness that existed with working the root
in as an identifier.
DEV-11864
This wraps `Ident` in a new `Object` variant and modifies `Asg` so that its
nodes are of type `Object`.
This unfortunately requires runtime type checking. Whether or not that's
worth alleviating in the future depends on a lot of different things, since
it'll require my own graph implementation, and I have to focus on other
things right now. Maybe it'll be worth it in the future.
Note that this also gets rid of some doc examples that simply aren't worth
maintaining as the API evolves.
DEV-11864
A previous commit mentioned that there's not a place for `Dim`, and
duplicated it between `asg` and `xmlo`. Well, `Dtype` is also needed in
both, and so here's a home for now.
`Dtype` has always been an inappropriate detail for the system and will one
day be removed entirely in favor of higher-level types; the machine
representation is up to the compiler to decide.
DEV-11864
asg_builder is about to be replaced, but in the process of simplifying the
destination IR (the ASG), I'm moving things into the proper place. This
never belonged here---it belongs with the actual lowering operation.
Previously, this was not reasoned about in terms of a lowering operation,
and was written when I was first introducing myself to Rust and trying to
get a proof-of-concept linker working.
DEV-11864
This matches xmlo::Dim, and could be the same thing, if we can find a home
for it in the future; it's not worth creating such a home right now when I'm
not yet sure what else ought to live there; the duplication may be fine.
The conversion from xmlo needs to be moved, and `Dim` is going to be used
for more than just identifiers (expressions will have type inference
performed).
DEV-11864
This allows retrieving and providing a context to a `Parser`. This is
intended for use with an aggregating parser, in particular to construct the
ASG and return it.
This is a component of a change that replaces `asg_builder` with a
`Parser`-based lowering into the ASG, but there are still changes that need
to be made to simplify things and complete its integration.
DEV-11864
Previously, since the graph contained only identifiers, discovered roots
were stored in a separate vector and exposed to the caller. This not only
leaked details, but added complexity; this was left over from the
refactoring of the proof-of-concept linker some time ago.
This moves the root management into the ASG itself, mostly, with one item
being left over for now in the asg_builder (eligibility classifications).
There are two roots that were added automatically:
- __yield
- __worksheet
The former has been removed and is now expected to be explicitly mapped in
the return map, which is now enforced with an extern in `core/base`. This
is still special, in the sense that it is explicitly referenced by the
generated code, but there's nothing inherently special about it and I'll
continue to generalize it into oblivion in the future, such that the final
yield is just a convention.
`__worksheet` is the only symbol of type `IdentKind::Worksheet`, and so that
was generalized just as the meta and map entries were.
The goal in the future will be to have this more under the control of the
source language, and to consolidate individual roots under packages, so that
the _actual_ roots are few.
As far as the actual ASG goes: this introduces a single root node that is
used as the sole reference for reachability analysis and topological
sorting. The edges of that root node replace the vector that was removed.
DEV-11864
Rather than having the linker add this symbol opaquely, let's remove the
special case and generalize it. There's nothing special about yield, except
historical precedent.
Systems can explicitly add it as a root in a common return map.
DEV-11864
In the actual implementation (outside of tests), this is always looking up
before adding the symbol. This will simplify the API, while still retaining
errors, since the identifier will fail the state transition if the
identifier did not exist before attempting to set a fragment. So while this
is slower in microbenchmarks, this has no effect on real-world performance.
Further, I'm refactoring toward a streaming ASG aggregation, which is a lot
easier if we do not need to perform lookups in a separate step from the
ASG's primitives.
DEV-11864
`PartialEq` remains, and is all that is needed. See previous commit
regarding the removal of this same bound from `Context`.
This can be re-added if it ends up actually being necessary. But Tokens are
ephemeral and used only in lowering pipelines, using pattern matching.
DEV-11864
These traits are no longer necessary now that I'm using concrete types; they
just add unnecessary noise and confusion as I attempt to further refactor.
Don't abstract prematurely.
DEV-11864
This removes the generic on the Asg (which was formerly BaseAsg),
hard-coding `IdentObject`, which will further evolve. This makes the IR an
actual concrete IR rather than an abstract data structure.
These tests bring me back a bit, since they were written as I was still
becoming familiar with Rust.
DEV-11864
This is the beginning of an incremental refactoring to remove generics, to
simplify the ASG. When I initially wrote the linker, I wasn't sure what
direction I was going in, but I was also negatively influenced by more
traditional approaches to both design and unit testing.
If we're going to call the ASG an IR, then it needs to be one---if the core
of the IR is generic, then it's more like an abstract data structure than
anything. We can abstract around the IR to slice it up into components that
are a little easier to reason about and understand how responsibilities are
segregated.
DEV-11864
This is unnecessarily restrictive, since we do not require anything further
than `PartialEq` for the situations where we care about equality (tests).
DEV-11864
This is too restrictive, especially for parsers that fold into something,
like the ASG, which may exist prior to invoking the parser.
This moves the trait bound to the functions that actually need it. Those
obviously cannot be used if the Context does not implement `Default`, but
I'll provide alternative conveniences.
DEV-11864
I attempted to resolve an error previously, and I thought I had, but
apparently some symbols acquire a @dtype at some point in the process, or
lose it. Regardless, I have no interest in debugging or resolving this
mess, since it's going away.
The linker ensures that externs match, so while this could potentially allow
conflicting imports within a package (unlikely, given that extern templates
are recommended), it still will not resolve with a conflicting concrete
implementation. I'm not worried.
DEV-10936
Extern resolution has apparently been failing for quite some time, resulting
in `preproc:error` nodes in the _symbol table_ of return maps. This was
caught by the new xmlo parser, which does not ignore nodes it does not care
about.
The failure was caused by missing `@dtype`---the externs did in fact match,
and if they did not, then the linker would have failed.
This doesn't modify the map compiler to properly detect these, because
this compiler is going away in the hopefully-near future, and the problems
will now be caught, though in a very unideal way (as a parse error during
xmlo reading).
DEV-10936
preproc:sym/preproc:from is used for generating `knownFields` using the
_input_ map, so this has no use for return map values; the map still
produces edges to its dependencies.
The issue is that there are return map entries in some of our systems that
are producing multiple `preproc:from`, but I somewhat-recently modified the
system to support only a single map, to remove dynamic allocation. This
resolves that problem.
With that said, `knownFields` was created for Liza to know when the
classifier ought to be invoked, to save time. Back when it was first
introduced ~10y ago, this provided significant savings, however the
structure of our system now is such that nearly every single field invokes
the classifier.
Furthermore, these details should remain encapsulated; if we wanted to make
that determination, we should be provided with a delta, which we could also
use to do incremental classification in the future, if there's an ROI there
after other improvements have been made.
So, eventually, preproc:sym/preproc:from will go away entirely.
DEV-10936
RSG (Ryan Specialty Group) recently announced a rename to Ryan Specialty (no
"Group"), but I'm not sure if the legal name has been changed yet or not, so
I'll wait on that.
These are no longer TODOs---they represent invalid tokens.
I'm going to put effort into providing further context with the diagnostic
system [right now] because these are internal errors caused by either
miscompilation or an incomplete reader.
DEV-10936
The new xmlo parser was failing on a worksheet xmlo file because fragments
were not properly placed within the header.
This was a change made when tameld was introduced so that we could stop
reading xmlo files early.
DEV-10936
This was missed when removing it from other Display impls when the new
diagnostic system was introduced. Raw `Span`s display byte offsets and the
context, which is no longer desirable as part of an error message.
DEV-10936
TAMER rejects this, because we shouldn't be using anything but UTF-8. My
use of this encoding is ancient, from over a decade ago, and was apparently
just copied around.
DEV-10936
I had waited to provide more documentation until I was sure that the
abstraction was not going to change significantly; there was a lot of
refactoring in prior commits.
DEV-12151
This moves construction out of `From` and into separate associated
functions, which can be further simplified in a bit.
We also need unit tests for this, since this still relies on integration
tests due to the cost of the aggressive and tight refactoring iterations.
DEV-12151
Previously, when adjacent duplicate spans were both resolved, if one failed,
the other certainly would, which would result in duplicate labels each
squash. Elided spans do not have syslabels, and so this is no longer a
concern.
DEV-12151
This was removed in a previous commit while working on simplifying the
implementation, with the hope of returning to it once things were in a
better place. They are, so let's bring it back.
DEV-12151
`SpanLabel` was created during a very early refactoring of this system, and
I've just been fighting with it since. This removes it and simplifies
some things in the process.
It also makes clear that `Level` is never optional and removes the awkward
`Level::default` that was there previously; the default is now the lowest
level, which will always be able to be escalated.
DEV-12151
This does what the original proof-of-concept implementation did---skip a
span that was just processed, since it'll be squashed into the previous
anyway. These duplicate spans originate from the diagnostic system when
producing supplemental help information.
DEV-12151
Tests are large and will be getting larger. The source will also grow as
it's better documented and cleaned up. It's getting more difficult to
navigate efficiently and concurrently modify implementation and tests, and
parsing via LSP is getting slower with certain types of changes.
DEV-12151
Alright, starting to settle on an abstraction now, and things are coming
together. This gives us line numbers in the previously-empty gutter, and
widens the gutter to accommodate. Gutters are normalized across
sections. Sections are not yet collapsed for sequential line numbers in the
same context.
Exciting!
Here's an example, on an xmlo file:
error: expected closing tag for `preproc:symtable`
--> /home/.../foo.xmlo:16:4
|
16 | <preproc:symtable xmlns:map="http://www.w3.org/2005/xpath-functions/map">
| ----------------- note: element `preproc:symtable` is opened here
--> /home/.../foo.xmlo:11326:4
|
11326 | </preproc:wrong>
| ^^^^^^^^^^^^^^^^ error: expected `</preproc:symtable>`
DEV-12151
The `Section` itself is now responsible for outputting the gutter, which
puts us in a position to be able to apply consistent formatting without
having to propagate width data to every line variant.
Now `SourceLine` _does_ actually correspond to a line of output, which will
allow for better formatting (e.g. collapsing padding) and, importantly,
proper management of gutters.
Note that the seemingly unnecessary `SectionSourceLine` allows for a subtle
consistent formatting for all variants' gutters in `SectionLine`, which will
allow us to hoist that rendering out in the next commit. The other option
was to include a trailing space for padding and marks, but that is not only
sloppy and undesirable, but asking for confusion, especially in editors (like
mine) that trim trailing whitespace.
DEV-12151
If a column isn't present, it degrades to displaying labels like footnotes
anyway, so this simplifies the system rather than catering to a rare
case. With that said, this does lose functionality, since it does not
render the source line at all, even though we _could_ do so.
I may re-introduce that rendering after some further refactoring,
specifically for gutters.
DEV-12151
Using a byte vector just makes life more difficult with regard to preparing
the diagnostic reports. We're already validating UTF-8 data for column
generation, which is necessary for a robust report, so let's just store it
as a String to begin with.
DEV-12151
Note that, if a span is first encountered with a mark but with _no_ label,
the first label (if collapsed) will be on the next line. This allows a span
to be marked without extra visual noise if it's not necessary, and to be
able to trust that it'll stay that way.
Until coloring is introduced, this may or may not be easier to read
depending on context.
This is also not yet taking into account where on the line it begins, and so
may render poorly if the span is at the end of a line. That will be fixed
later on.
DEV-12151
This is now visible in the diagnostic output. Example at this point in
time, on an xmlo file for one of our smallest systems:
error: expected closing tag for `preproc:symtable`
--> /home/.../foo.xmlo:16:4
|
| <preproc:symtable xmlns:map="http://www.w3.org/2005/xpath-functions/map">
| -----------------
= note: element `preproc:symtable` is opened here
--> /home/.../foo.xmlo:11326:4
|
| </preproc:wrong>
| ^^^^^^^^^^^^^^^^
= error: expected `</preproc:symtable>`
DEV-12151
Looking more and more Rust-like. Shameless copy.
TBH I forget what character it uses for help, but it's easy enough to
change.
Also, to be clear: this is modeled after Rust, but it's not a requirement of
mine that it look exactly like it. I just like the general style; I'll
surely deviate over time, as appropriate (or as I feel like it).
DEV-12151
This has the effect of highlighting the columns of the source lines using
'^' as an underline.
The next step will be to have the underline character depend on the
`Level`.
If this commit message doesn't sound all that exciting, given what it
finally achieved after all this time, it's because I'm exhausted, and my
prototype has already taken my excitement. But this is significant, given
all the work leading up to it.
There is some code cleanup needed and some unit tests that ought to be
written rather than relying on integration, but considering how much this is
being refactored, I don't want to add to that refactoring cost just yet
before gutters are introduced and I know things are settled for now.
DEV-12151
This has been a lot of refactoring for something that I prototyped a week
ago, and the prototype is still further along in its output formatting (it
has line numbering in gutters and span markings).
But, this has come a long way, and I'm happy with it overall, though I'm not
happy with my slow pace and struggle to maintain focus. But those are
personal issues.
This leaves a lot to be desired, but at the same time is still really
helpful. There's a couple notable TODOs regarding pointless allocation and
UTF8 re-checking, but otherwise, the feature-related steps are:
- Gutters with line numbers; and
- Marking columns associated with the span.
DEV-12151
Rather than squashing as a separate operation, and explicitly denoting when
it occurred, we'll just always squash, as was done before these changes. It
doesn't really make sense to make this optional and there's not any value in
keeping the decision around.
This also sets us up favorably for future changes: it creates a vector of
labels, which can be analyzed later to determine how to best lay out marks
and labels.
DEV-12151
Just renames the lifetime to refer to the `Diagnostic`, rather than a
`Label` returned by it, which was all `'l` was previously used for.
Note that many labels have a `'static` lifetime; this doesn't change that or
somehow cause it to reallocate; the label must live _for at least `'d`_.
DEV-12151
Rather than rendering the diagnostic `Display` message to a string only to
copy it to yet another buffer later on, this simply stores a reference to
the `Diagnostic` that was provided. This also adds a type to the `Report`
associating it with the provided `Diagnostic`, which does seem appropriate,
given that the report was produced for it.
I should probably rename `'l` to `'d` now.
DEV-12151
Rather than writing to the provided `Write` object, this produces a `Report`
object. While a lifetime still exists for the diagnostic data (labels,
specifically), I was able to remove the other lifetime resulting from
`ResolvedSpan` by transferring ownership of the data to the `Report`
itself. Once actual source lines are integrated shortly, `Report` will
include those as well.
This has been a tedious process, but it's coming together. Hopefully these
commits documenting the progressive and ugly refactoring are found useful by
some reader in the future.
DEV-12151
The line number was getting special treatment that is simply not worth the
cost (with regards to how burdensome it is on the type definitions). This
simplifies things quite a bit.
If we want header customization in the future, we can worry about that in a
different way, or allow the header as a whole to be swapped out, rather than
its constituents.
DEV-12151
`HeadingColNum` is no longer constructed by `HeadingLineNum`. This both
narrows the types and required data (e.g. removing dummy values in test
cases), and reduces the coupling (by favoring composition, but still coupled
with the concrete type).
DEV-12151
I'm unhappy with the current state of this, which is why I haven't settled
on docs or unit tests for these changes yet (though note that the
integration tests do cover these changes)---this is still a prototype
refactoring.
In particular, this needs to do more lowering---the `ResolvedSpan` and
`MaybeResolvedSpan` need to be eliminated and lowered into exactly what is
needed so that we can stop reasoning about them and propagating them.
Further, having lines and columns lazily evaluate themselves for
display---based on `MaybeResolvedSpan`---adds extra generics that shouldn't
be necessary; they should be pre-computed and store the concrete data they
need in variants. Display shouldn't involve computation beyond formatting
of pre-computed data.
That was always the plan, but this refactoring has been incremental.
Anyway: this is in a working and integration-tested state, but it's going to
change.
DEV-12151
This generalizes the types a bit more and introduces unit tests. Note that
these are still also covered by integration tests.
The next step will be to finish generalizing
`<VisualReporter as Reporter>::render`, after which I'll get back to the
task of outputting the source line along with markings and labels.
DEV-12151
This is just to provide clarity. `ctx` is not so widely used that we
benefit from such a short identifier, and it's not worth the cognitive
burden of people unfamiliar with what it may mean.
DEV-12151
This is redundant with the `Endpoints` variant, although it did read
better. It's just another case to have to handle.
I was originally going to use `std::ops::RangeInclusive` for `Endpoints`,
however that struct also contains an extra bool indicating whether it was
exhausted (as an iterator), which isn't appropriate for this.
DEV-12151
This logic is still covered by the integration tests; I'll be adding unit
tests once it's decoupled to the point where that's possible, which should
be shortly, and after I make sure this is the route I do want to go down.
DEV-12151
This simplifies types and error handling since we will always have at least
one line, provided that the span is within the range of the context. To
ensure that, this patch introduces a new error.
DEV-12151
I did not initially introduce lifetimes because I wasn't sure how the system
was going to evolve, but now lifetimes are going to be needed in a number of
contexts. The core of TAMER is able to avoid lifetimes in most instances
because of its internment system, but its use is not appropriate for the
diagnostic system's buffers (beyond sourcing strings from already-interned
data).
DEV-12151
Determining the column number is not as simple as performing byte
arithmetic, because certain characters have different widths. Even if we
only accepted ASCII, control characters aren't visible to the user.
This uses the unicode-width crate as an alternative to POSIX wcwidth, to
determine (hopefully) the number of fixed-width cells that a unicode
character will take up on a terminal. For example, control characters are
zero-width, while an emoji is likely double-width. See test cases for more
information on that.
There is also the unicode-segmentation crate, which can handle extended
grapheme clusters and such, but (a) we'll be outputting the line to the
terminal and (b) there's no guarantee that the user's editor displays
grapheme clusters as a single column. LSP measures in UTF-16,
apparently. I use both Emacs and Vim from a terminal, so unicode-width
applies to me. There's too much variation to try to solve that right now.
The columns can be considered a visual span---this gives us enough
information to draw line annotations, which will happen soon.
Here are some useful links:
- https://hsivonen.fi/string-length/
- https://unicode.org/reports/tr29/
- https://github.com/rust-analyzer/rowan/issues/17
- https://www.reddit.com/r/rust/comments/gpw2ra/how_is_the_rust_compiler_able_to_tell_the_visible/
DEV-10935
This does not yet resolve columns, and omits the length of the span, but
it's starting to come together.
This is particularly exciting for me to see because I've been wanting line
numbers in TAME error messages for over a decade.
DEV-10935
This adds support for rewinding the underlying buffer when necessary to
read a span that occurs earlier within the same context (which could also
include the same span read twice).
As part of this change, I cleaned up the code a bit. Working with this
system can be confusing with the different meanings of the byte offsets and
the different ways of interpreting lines relative to the span that is
provided. There's not a lot of code here, but it represents a lot of work
to get right.
This works, but it's ugly and requires some cleanup. It shows that there
are some interesting considerations when determining how to best represent
the location of spans to the user in a way that is intuitive.
This is not yet integrated with the reporter, which will require a layer to
load a `Context` from disk.
DEV-10935
This is a POC, minimal-effort integration that also creates the TamecError
sum type analogous to TameldError.
I'll work on reducing the boilerplate in the future.
A note regarding the type and boilerplate vs. dynamic dispatch, for any
future readers: the purpose of this is to be explicit about the error types
so that the system is self-documenting and it forces an understanding of
its error conditions. `Box<dyn Error>` is basically "eh idk anything can
happen!", which is not what I'm interested in having.
DEV-10935
This is a working concept that will continue to evolve. I wanted to start
with some basic output before getting too carried away, since there's a lot
of potential here.
This is heavily influenced by Rust's helpful diagnostic messages, but will
take some time to realize a lot of the things that Rust does. The next step
will be to resolve line and column numbers, and then possibly include
snippets and underline spans, placing the labels alongside them. I need to
balance this work with everything else I have going on.
This is a large commit, but it converts the existing Error Display impls
into Diagnostic. This separation is a bit verbose, so I'll see how this
ends up evolving.
Diagnostics are tied to Error at the moment, but I imagine in the future
that any object would be able to describe itself, error or not, which would
be useful in the future both for the Summary Page and for query
functionality, to help developers understand the systems they are writing
using TAME.
Output is integrated into tameld only in this commit; I'll add tamec
next. Examples of what this outputs are available in the test cases in this
commit.
DEV-10935
We can just use PathSymbolId directly and simplify things. Typing can (and
should) happen on the symbol itself, and if we want a separate symbol type,
it ought to have its own interner.
For now, it doesn't, and having this extra type is just a PITA.
DEV-10935
There's no use in complicating the error handling here when we'd just
default to `UNKNOWN_SPAN` anyway when trying to render it. `UNKNOWN_SPAN`
didn't exist at the time of writing.
DEV-10935
This entirely removes the old XmloReader that has since been replaced with a
XIR-based reader.
I had been holding off on this because the new reader is slower, pending
performance optimizations (which I'll do a little later on), however the
performance loss is of no practical consideration and only affects the
linker, which is still fast.
Therefore, it's better to get this old code out of the way to simplify
refactoring going forward. In particular, I'm working on the diagnostic
system.
This is a little sad, in a way---this is some of my first Rust code that I'm
deleting.
DEV-10935
This does not deal directly with XIRF (that's composed into a pipeline
outside of this parser).
I'd like to clean up further...perhaps I should retire the
wip-xmlo-xir-reader flag now, despite the minor performance regression (see
previous recent commits for explanation).
DEV-10935
This aggregates all non-panic errors that can occur during link time, making
`Box<dyn Error>` unnecessary. I've been wanting to do this for a long time,
so it's nice seeing this come together. This is a powerful tool, in that we
know, at compile time, all errors that can occur, and properly report on
them and compose them. This method of error composition ensures that all
errors have a chance to be handled within their context, though it'll take
time to do so in a decent way.
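The shape is roughly this (a sketch with an illustrative variant set, not
the actual one; each failing subsystem gets a variant and a `From` impl so
that `?` lifts its errors into the aggregate):

  use std::fmt;

  #[derive(Debug)]
  enum TameldError {
      Io(std::io::Error),
      // ...one variant per subsystem that can fail during linking.
  }

  impl From<std::io::Error> for TameldError {
      fn from(e: std::io::Error) -> Self {
          Self::Io(e)
      }
  }

  impl fmt::Display for TameldError {
      fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
          match self {
              Self::Io(e) => write!(f, "I/O error during linking: {}", e),
          }
      }
  }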
This just maintains compatibility with the dynamic dispatch that was
previously occurring. This work is being done to introduce the initial
diagnostic system, which was really difficult/confusing to do without proper
errors types at the top level, considering the toplevel is responsible for
triggering the diagnostic reporting.
The cycle error is in particular going to be interesting once the system is
in place, especially once it provides spans in the future, since it will
guide the user through the code to understand how the cycle formed.
More to come.
DEV-10935
tamec and tameld will now both introduce a `Context` to XIR, which will use
it to create spans.
Here's an example of an error, now that it's all working well together:
  $ target/release/tameld --emit xmle -o /dev/null path/to/package.xmlo
  error: invalid preproc:sym/@dim `9` at [/../path/to/package.xmlo offset 1175451-1175452]
A future task will make this human-readable by producing line and column
numbers, and perhaps even a snippet (if not now, then eventually).
It's exciting to see this coming together finally.
DEV-10934
There's a bit to unpack here. Some of the spans originate from quick-xml's
error handling, but in coming up with test cases to try to trigger errors, I
found that quick-xml is far too permissive in what it accepts, and
outright dangerous in some situations.
I feel like the writing is on the wall for quick-xml, but I'll probably wait
until replacing `xmlo` with a more efficient format before deciding whether
to use a different library or implement parsing ourselves. There's a lot of
factors to consider, and a library would have to not only be correct and
performant, but provide useful information for span generation.
But for now, I have other more important things to work on, like a
functioning compiler. So while quick-xml is around, I'll just have to do
the best I can to provide a correct parser with useful errors.
DEV-10934
This is a large change, and was a bit of a tedious one, given the
comprehensive tests.
This introduces proper offsets and lengths for spans, with the exception of
some quick-xml errors that still need proper mapping. Further, this still
uses `UNKNOWN_CONTEXT`, which will be resolved shortly.
This also introduces `SpanlessError`, which `Error` explicitly _does not_
implement `From<SpanlessError>` for---this forces the caller to provide a
span before the error is compatible with the return value, ensuring that
spans will actually be available rather than forgotten for errors. This is
important, given that errors are generally less tested than the happy path,
and errors are when users need us the most (so, need span information).
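A minimal sketch of the pattern (simplified, hypothetical definitions):

  #[derive(Debug, Clone, Copy)]
  struct Span(u32);

  #[derive(Debug)]
  enum SpanlessError {
      NotWellFormed,
  }

  // Note: no `impl From<SpanlessError> for Error`,
  //   so `?` cannot silently discard the span.
  #[derive(Debug)]
  enum Error {
      NotWellFormed(Span),
  }

  impl SpanlessError {
      /// The only path into `Error`: a span must be provided.
      fn with_span(self, span: Span) -> Error {
          match self {
              Self::NotWellFormed => Error::NotWellFormed(span),
          }
      }
  }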
Further, I had to use pointer arithmetic in order to calculate many of the
spans, because quick-xml does not provide enough information. There are no
safety considerations here, and the comprehensive unit test will ensure
correct behavior if the implementation changes in the future.
I would like to introduce typed spans at some point---I made some
opinionated choices when it comes to what the spans ought to
represent. Specifically, whether to include the `<` or `>` with the open
span (depends), whether to include quotes with attribute values (no),
and some other details highlighted in the test cases. If we provide typed
spans, then we could, knowing the type of span, calculate other spans on
request, e.g. to include or omit quotes for attributes. Different such
spans may be useful in different situations when presenting information to
the user.
This also highlights gaps in the tokens emitted by XIR, such as whitespace
between attributes, the `=` between name and value, and so on. These are
important when it comes to code formatting, so that we can reliably
reconstruct the XML tree, but it's not important right now. I anticipate
future changes would allow the XIR reader to be configured (perhaps via
generics, like a strategy-type pattern) to optionally omit these tokens if
desired.
Anyway, more to come.
DEV-10934
When wip-frontends is on, this will parse the input file using XIR and then
immediately output it again. This makes the necessary changes to be able to
read every source file we have in our largest project, such that the output
is identical after having been formatted with `xmllint --format -` (there
are differences because e.g. whitespace between attributes is not yet
maintained).
This is performant too, with times remaining essentially identical despite
the additional work.
DEV-10413
These declarations are relics from when all XML files could be loaded in the
browser to render the Summary Page. Such a thing has not worked for many
years.
The previous commit will cause files produced by these scripts to be
regenerated.
I noticed this when reading source files using XIR.
DEV-10413
This ensures that, when changes are made to these scripts, the files that
are generated from them are re-generated.
Historically this probably was not noticed because (a) they seldom changed
and (b) we had a small team and I told people to re-run bootstrapping
scripts or clean files. The team is much larger now, and regardless,
there's no reason not to have had this in place.
DEV-10413
This resolves the performance issues caused by Rust's failure to elide the
ElementStack (ArrayVec) memcpys on move.
Since XIRF is invoked tens of millions of times in some cases for larger
systems, prior to this change, failure to optimize away moves for XIRF
resulted in tens of millions of memcpys. This resulted in linking of one
program going from 1s -> ~15s. This change reduces it to ~2.5s with the
wip-xmlo-xir-reader flag on, with the extra time coming from elsewhere (the
subject of future changes).
In particular, this change introduces a new mutable reference to
`ParseState::parse_token`, which is a reference to a `Context` owned by the
caller (e.g. `Parser`). In the case of XIRF, this means that
`Parser<flat::State, _>` will own the `ElementStack`/`ArrayVec` instead of
`flat::State`; this allows the latter to remain pure and benefit from Rust's
move optimizations, without sacrificing the otherwise-pure implementation.
ParseStates that do not need a mutable context can use `NoContext` and
remain pure.
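The resulting signature looks something like this (a sketch with
simplified types; the real trait carries more than shown here):

  /// Sketch only; the real TransitionResult carries more information.
  struct TransitionResult<S>(S);

  trait ParseState: Sized {
      type Token;
      type Context: Default;

      // `self` is still owned and movable (pure), but the mutable
      // context lives with the caller (e.g. Parser), so the state
      // itself stays small enough for moves to be optimized away.
      fn parse_token(self, tok: Self::Token, ctx: &mut Self::Context)
          -> TransitionResult<Self>;
  }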
DEV-12024
This makes the necessary tweaks to have the entire linker work end-to-end
and produce a compatible xmle file (that is, identical except for
nondeterministic topological ordering). That's good, and finally that can
get off of my plate.
What's disappointing, and what I'll have more information on in future
commits, is how slow it is.
The linking of our largest package goes from ~1s -> ~15s with this
change. The reason is because of tens of millions of `memcpy` calls. Why?
The ParseState abstraction is pure and passes an owned `self` around, and
Parser replaces its own reference using this:
  let result;
  TransitionResult(Transition(self.state), result) =
      take(&mut self.state).parse_token(tok);
Naively, this would store a copy of the old state in `result`, allocate a
new ParseState for `self.state`, pass the original or a copy to
`parse_token`, and then overwrite `self.state` with the new ParseState that
is returned once it is all over.
Of course, that'd be devastating. What we want to happen is for Rust to
realize that it can just pass a reference to `self.state` and perform no
copying at all.
For certain parsers, this is exactly what happens. Great!
But for XIRF, we have this:
  /// Stack of element [`QName`] and [`Span`] pairs,
  /// representing the current level of nesting.
  ///
  /// This storage is statically allocated,
  /// allowing XIRF's parser to avoid memory allocation entirely.
  type ElementStack<const MAX_DEPTH: usize> = ArrayVec<(QName, Span), MAX_DEPTH>;

  /// XIRF document parser state.
  ///
  /// This parser is a pushdown automaton that parses a single XML document.
  #[derive(Debug, Default, PartialEq, Eq)]
  pub enum State<const MAX_DEPTH: usize, SA = AttrParseState>
  where
      SA: FlatAttrParseState,
  {
      /// Document parsing has not yet begun.
      #[default]
      PreRoot,

      /// Parsing nodes.
      NodeExpected(ElementStack<MAX_DEPTH>),

      /// Delegating to attribute parser.
      AttrExpected(ElementStack<MAX_DEPTH>, SA),

      /// End of document has been reached.
      Done,
  }
ParseState contains an ArrayVec, and its implementation details cause
LLVM _not_ to elide the `memcpy`. And there are a lot of them.
Considering that ParseState is supposed to use only statically allocated
memory and be zero-copy, this is rather ironic.
Now, this _could_ be potentially fixed by not using ArrayVec; removing
it (and the corresponding checks for balanced tags) gets us down to
2s (which still needs improvement), but we can't have a core abstraction in
our system resting on a house of cards. What if the optimization changes
between releases and suddenly linking / building becomes shit slow? That's
too much of a risk.
Further, having to limit what abstractions we use just to appease the
compiler to optimize away moves is very restrictive.
The better option seems to be to go back to what I used to do: pass around
`&mut self`. I had moved to an owned `self` to force consideration of _all_
state transitions, but I can try to do the same thing in a different type of
way using mutable references, and then we avoid this problem. The
abstraction isn't pure (in the functional sense) anymore, but it's safe and
isn't relying on delicate inlining and optimizer implementation details to
have a performant system.
More information to come.
DEV-10863
This concludes the bulk of the header parsing, though there are surely going
to be other issues when I try to read a real xmlo file, such as
whitespace. That is something I expect that I'd rather handle as part of
XIRF, but maybe I'll initially ignore it here just to get it working. We'll
see.
DEV-10863
This parses the symbol dependency list (adjacency list).
I'm noticing some glaring issues in error handling, particularly that the
token being parsed while an error occurs is not returned and so recovery is
impossible. I'll have to address that later on, after I get this parser
completed.
Another previous question that I had a hard time answering in prior months
was how I was going to compose boilerplate parsers, e.g. handling the
parsing of single-attribute elements and such. A pattern is clearly taking
shape, and with the composition of parsers more formalized, that'll be able
to be abstracted away. But again, that's going to wait until after this
parser is actually functioning. Too many delays so far.
DEV-10863
Ideally this would just be an attribute, but I guess I never got around to
making that change in the compiler and I don't want a detour right now.
DEV-10863
I clearly was not paying attention to what was correct behavior here, since
the tests also verified the wrong behavior: rather than taking the last
processed attribute span, we should be taking the span of the opening
tag for the `preproc:sym` node.
DEV-10863
This simply removes boilerplate.
This will receive concrete examples once I come up with docs for the entire
module; there's boilerplate involved in testing and documenting this in
isolation and the time investment is not worth it yet until I'm certain that
this will not be changed.
DEV-10863
This integrates much of the work done so far to parse into a
`XmloEvent::SymDecl`. The attribute parsing _is_ verbose, and I do intend
to abstract it away later on, but I'm going to wait on that for now.
The new reader should be finishing up soon, which is really exciting, since
I started working on this months ago (before having to take a break on
TAMER); I'm anticipating strong performance gains in the reader, and this is
a test that will tell us how the compiler will perform moving forward with
the abstractions that I've spent so much time on.
DEV-10863
This introduces a new method similar to the previous `delegate`, but with
another closure that allows for handling lookahead tokens from the child
parser.
Admittedly, this isn't exactly what I was going for---a list of arguments
isn't exactly self-documenting, especially with the brevity when the
arguments line up---but this was easy to do and so I'll run with this for
now.
This also modified `delegate` to accept a context, even though it wasn't
necessary, both for consistency with its lookahead counterpart and for brevity
with the `into` argument (allowing, in our case, to just pass the name of
the variant, rather than a closure).
I'm not going to handle the actual starting and accepting state stitching
abstraction for now; I'd like to observe future boilerplate more before I
consider the best way to handle it, though I do have some ideas.
DEV-10863
This is the delegation portion of what I've come to call "state
stitching"---wiring together two state machines that recognize the same
input tokens.
This handles the delegation of tokens once the parser has been entered, but
does not yet handle the actual stitching part of it: wiring the start and
accepting states of the child parser to the parent.
This is indirectly tested by the XmloReader, but it will receive its own
tests once I further finalize this concept. I'm playing around with some
ideas. With that said, a quick visual inspection together with the
guarantees provided by the type system should convince any familiar reader
of its correctness.
DEV-10863
This wasn't the simplest thing to start with, but I wanted to explore
something with a higher level of complexity. There is some boilerplate to
observe here, including:
1. The state stitching (as I guess I'm calling it now) of SymtableState
with XmloReaderState is all boilerplate and requires no lookahead,
presenting an abstraction opportunity that I was holding off on
previously (attr parsing for XIRF requires lookahead).
2. This is simply collecting attributes into a struct. This can be
abstracted away in the future.
3. Creating stub parsers to verify that generics are stitched rather than
being tightly coupled with another state is boilerplate that maybe can
be abstracted away after a pattern is observed in future tests.
DEV-10863
This does some cleanup and adds `parse::Object` for use in disambiguating
`From` for `ParseStatus`, allowing the `Transition` API to be much more
flexible in the data it accepts and automatically converts. This allows us
to concisely provide raw output data to be wrapped, or provide `ParseStatus`
directly when more convenient.
There aren't yet examples in the docs; I'll do so once I make sure this API
is actually utilized as intended.
DEV-10863
This replaces u8 and will be used for the new XmloReader.
Previously I wasn't sure what direction TAMER was going to go in with
regards to dimensionality, but I do not expect that higher dimensions will
be supported, and if they are, they'd very likely compile down to lower ones
and create an illusion of higher-dimensionality.
Whatever the future holds, it's not used today, and I'd rather these types
be correct.
ASG needs changing too, but one step at a time.
DEV-10863
This converts the tuple type alias into a newtype, so that we may provide
our own implementations.
This differs from a previous approach that I took, which involved making
this type `Result<(S, T), (S, E)>` so that the return values composed well
with other functions. But the reality is that this is used only by other
`ParseState`s and `Parser`, so it's unnecessary.
However, this is also an attempt to utilize the new Try and FromResidual
traits; note how the Try associated types match precisely what I was trying
to do before, though they're used as intermediate types. I'll see how this
evolves.
DEV-10863
This allows the Results to compose and, importantly, is compatible with
`?` without having to put in any extra effort.
This puts the caller in an awkward spot, so I introduced a utility
function `result_tup0_invert` for now; we'll see if that stays or evolves
differently.
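For reference, such an inversion amounts to something like this (the
signature is reconstructed from the name, so treat it as a sketch):

  /// Hoist the Result out of a tuple's first slot so that `?` applies.
  fn result_tup0_invert<S, T, E>(
      (r, t): (Result<S, E>, T),
  ) -> Result<(S, T), E> {
      r.map(|s| (s, t))
  }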
DEV-10863
Since this is the object produced by this parser, this is likely the most
useful first thing to present as a summary of what `XmloReader` actually
does.
DEV-10863
This removes the flag from most of the code, which also resolves the
indentation. Not only was it bothering me, but I don't want (a) every line
modified when the module body is hoisted and (b) `rustfmt` to reformat
everything when that happens.
This means that everything will be built, even though it's not used, when
the flag is off, but I see that as a good thing.
DEV-10863
Finally we get to do some actual parsing with all of the preparatory work!
This means that we're finally ready to fully replace the old XmloReader,
provided that I'm okay with some boilerplate / lack of abstractions for
now (and I am, because all I've been doing is working on abstractions to
prepare lowering operations).
DEV-10863
This makes more sense for pattern matching. Encapsulation of these fields
is not necessary, given that it's passed around as an owned value and its
`new` method constructs it verbatim; the individual fields are
self-validating.
DEV-10863
This introduces a WIP lowering operation, abstracting away quite a bit of
the manual wiring work, which is really important to providing an API that
provides the proper level of abstraction for actually understanding what the
system is doing.
This does not yet have tests associated with it---I had started, but it's a
lot of work and boilerplate for something that is going to
evolve. Generally, I wouldn't use that as an excuse, but the robust type
definitions in play, combined with the tiny amount of actual logic, provide
a pretty high level of confidence. It's very difficult to wire these types
together and produce something incorrect without doing something obviously
bad.
Similarly, I'm holding off on proper docs too, though I did write some
information here.
More to come, after I actually get to work on the XmloReader.
On a side note: I'm happy to have made progress on this, since this wiring
is something I've been dreading and wondering about since before the Parser
abstraction even existed.
Note also that this makes parser::feed_toks private again---I don't intend
to support push parsers yet, since they're only needed internally. Maybe
for error recovery, but I'll wait to decide until it's actually needed.
DEV-10863
This begins to transition XmloReader into a ParseState. Unlike previous
changes where ParseStates were composed into a single ParseState, this is
instead a lowering operation that will take the output of one Parser and
provide it to another.
The mess in ld::poc (...which still needs to be refactored and removed)
shows the concept, which will be abstracted away. This won't actually get
to the ASG in order to test that this works with the
wip-xmlo-xir-reader flag on (development hasn't gotten that far yet), but
since it type-checks, it should conceptually work.
Wiring lowering operations together is something that I've been dreading for
months, but my approach of only abstracting after-the-fact has helped to
guide a sane approach for this. For some definition of "sane".
It's also worth noting that AsgBuilder will too become a ParseState
implemented as another lowering operation, so:
XIR -> XIRF -> XMLO -> ASG
These steps will all be streaming, with iteration happening only at the
topmost level. For this reason, it's important that ASG not be responsible
for doing that pull, and further we should propagate Parsed::Incomplete
rather than filtering it out and looping an indeterminate number of times
outside of the toplevel.
One final note: the choice of 64 for the maximum depth is entirely
arbitrary and should be more than generous; it'll be finalized at some point
in the future once I actually evaluate what maximum depth is reasonable
based on how the system is used, with some added growing room.
DEV-10863
This introduces a (still-private) way to _push_ tokens into the parser,
rather than relying purely on a pull-based interface. Not only does this
simplify the iterator, but this is also preparing to make the new `feed_tok`
public so that parsers can be composed in more contexts. I suspect that
this method may also be useful for error recovery, since it can be used to
inject tokens into arbitrary points of a token stream.
I kept the new method private for now so that I can introduce the new API
and docs separate from this refactoring.
DEV-10863
The parsing framework originally created for XIR is now more general and
useful to other things. We'll see how this evolves.
This needs additional documentation, but I'd like to see how it changes as
I implement XmloReader and then some of the source readers first.
DEV-10863
This adds a `Token` type to `ParseState`. Everything uses `xir::Token`
currently, but `XmloReader` will use `xir::flat::Object`.
Now that this has been generalized beyond XIR, the parser ought to be
hoisted up a level.
DEV-10863
This does a couple of things: it ensures that documents have one and only
one root node, and it properly handles dead transitions once parsing is
complete (allowing it to be composed).
This should make XIRF feature-complete for the time being. It does rely on
the assumption that the reader is stripping out any trailing whitespace, so
I guess we'll see if that's true as we proceed.
DEV-10863
I'm not rendering errors yet in practice, so this wouldn't have been
noticed, but we want error messages to reference the final byte in a file on
EOF, not the offset of the last-encountered token, which would be confusing.
This doesn't _directly_ pertain to what I'm working on; I just happened to
notice it.
DEV-10863
XIRF introduced the concept of `Transition` to help document code and
provide mental synchronization points that make it easier to reason about
the system. I decided to hoist this into XIR's parser itself, and have
`parse_token` accept an owned state and require a new state to be returned,
utilizing `Transition`.
Together with the convenience methods introduced on `Transition` itself,
this produces much clearer code, as is evidenced by tree::Stack (XIRT's
parser). Passing an owned state is something that I had wanted to do
originally, but I thought it'd lead to more concise code to use a mutable
reference. Unfortunately, that concision led to code that was much more
difficult than necessary to understand, and ended up being a net negative
by leading to some more boilerplate for the nested types (granted,
that could have been alleviated in other ways).
This also opens up the possibility to do something that I wasn't able to
before, which was continue to abstract away parser composition by stitching
their state machines together. I don't know if this'll be done immediately,
but because the actual parsing operations are now able to compose
functionally without mutability getting in the way, the previous state coupling
issues with the parent parser go away.
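To make the shape concrete, here's a toy version of the idea (hypothetical
simplified types, not XIRT's actual parser):

  enum Token { Open, Close }
  enum Stack { Empty, Busy }

  struct Transition(Stack);

  impl Stack {
      // Owned `self` in, new state out: every transition must be
      // explicitly accounted for, with no mutation mid-parse.
      fn parse_token(self, tok: Token) -> Transition {
          match (self, tok) {
              (Stack::Empty, Token::Open) => Transition(Stack::Busy),
              (Stack::Busy, Token::Close) => Transition(Stack::Empty),
              (st, _) => Transition(st), // error handling elided
          }
      }
  }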
DEV-10863
This introduces XIR Flat (XIRF), which is conceptually between XIR and
XIRT. This provides a more appropriate level of abstraction for further
lowering operations to parse against, and removes the need for other parsers
to perform their own validations (inappropriately) to ensure well-formed
XML.
There is still some cleanup worth doing, including moving some of the
parsing responsibility up a level back into the XIR parser.
DEV-10863
This behavior is unchanged, but it allows us to create more constant spans
for testing. For example:
  const S: Span = DUMMY_SPAN.offset_add(1).unwrap();
This, in turn, will allow for removing lazy_static! for tests that use it
for span generation.
DEV-10863
Petgraph was previously held back due to petgraph-graphml. I'd like to
transition away from that at some point, given that it's tied to petgraph
and also pulls in xmlns, on top of quick-xml and our XIR, but that can come
down the line.
The Options here are awkward and will be able to go away in the new reader
and in AsgBuilder once it has a proper state machine.
This gets rid of some of the initial migratory work for the new reader,
because PackageAttrs is gone. I'm going to wait to update this to the new
way until I get further into this.
DEV-11449
I'm finally back to TAMER development.
The original plan, some time ago, was to gate an entirely new XmloReader
behind a feature flag (wip-xmlo-xir-reader), and go from there, leaving the
existing implementation untouched. Unfortunately, it became too difficult
and confusing to marry the old aggregate API with the new streaming one.
AsgBuilder is the only system interacting with XmloReader, so I decided (see
previous commits) to just go the route of refactoring the existing
one. I'm not yet sure if I'll continue to progressively refactor this one
and eliminate the two separate implementations behind the flag, or if I'll
get this API similar and then keep the flag and reimplement it. But I'll
know soon.
DEV-11449
This was broken by the previous fix, because I had cast to a numeric value
before invoking `set_defaults`, which needs the empty string retained so
that it knows whether a default ought to be applied.
This also ensures that `set_values` will always return a numeric value when
that default is applied.
DEV-10484
This has been a problem for...ever, but the old classification system (and
calculations) had `||0` for every variable reference, whereas the new one
does not; NaNs result in undefined behavior in the new classification
system, since those values are not expected to exist.
This ought to have automated tests, but it will be rewritten in TAMER.
DEV-10484
This was originally my plan with the new classification system, but it was
undone because I had hoped to punt on the somewhat controversial
issue. Unfortunately, I see no other way. Here I attempt to summarize the
reasons why, many of which are specific to the design decisions of TAME.
Keep in mind that TAME is a domain-specific language (DSL) for writing
insurance rating systems. It should act intuitively for our use case, while
still being mathematically sound.
If you still aren't convinced, please see the link at the bottom.
Target Language Semantics (ECMAScript)
--------------------------------------
First: let's establish what happens today. TAME compiles into ECMAScript,
which uses IEEE 754-2008 floating-point arithmetic. Here we have:
x/0 = Infinity, x > 0;
x/0 = -Infinity, x < 0;
0/0 = NaN, x = 0.
This is immediately problematic: TAME's calculations must produce concrete
real numbers, always. NaN is not valid in its domain, and Infinity is of no
practical use in our computational model (TAME is built for insurance rating
systems, and one will never have infinite premium). Put plainly: the
behavior is undefined in TAME when any of these values are yielded by an
expression.
Furthermore, we have _three different possible situations_ depending on
whether the numerator is positive, negative, or zero. This makes it more
difficult to reason about the behavior of the system, for values we do not
want in the first place.
We then have these issues in ECMAScript:
Infinity * 0 = NaN.
-Infinity * 0 = NaN.
NaN * 0 = NaN.
These are of particular concern because of how predicates work in TAME,
which will be discussed further below. But it is also problematic because
of how it propagates: once you have NaN, you'll always have NaN, unless you
break out of the situation with some control structure that avoids using it
in an expression at all.
Let's now consider predicates:
NaN > 0 = false.
NaN < 0 = false.
NaN === 0 = false.
NaN === NaN = false.
These will be discussed in terms of classification predicates (matches).
We also have issues of serialization:
JSON.stringify(Infinity) = "null".
JSON.stringify(NaN) = "null".
This means that these values are difficult to transfer between systems,
even if we wanted them.
TAME's Predicates
-----------------
TAME has a classification system based on first-order logic, where ⊥ is
represented by 0 and ⊤ is represented by 1. These classifications are used
as predicates to calculations via the @class attribute of a rate block. For
example:
<rate-each class="property" generates="propValue" index="k">
<c:quotient>
<c:value-of name="buildingTiv" index="k" />
<c:value-of name="tivPropDivisor" index="k" />
</c:quotient>
</rate>
As can be observed via the Summary Page, this calculation compiles into the
following mathematical expression:
∑ₖ(pₖ(tₖ/dₖ)),
that is—the quotient is then multiplied by the value of the `property`
classification, which is a 0 or 1 respectively for that index.
Let's say that tivPropDivisor were defined in this way:
<rate-each class="property" generates="tivPropDivisor" index="k">
<!--- ... logic here ... -->
</rate>
It does not matter what the logic here is. Observe that the predicate here
is `property` as well, which means that, if this risk is not a property
risk, then `tivPropDivisor` will be `0`.
Looking back at `propValue`, let's say that we do have a property risk, and
that `buildingTiv` is `[100_000, 200_000]` and `tivPropDivisor` is 1000. We
then have:
1(100,000 / 1000) + 1(200,000 / 1000) = 300.
Consider instead what happens if `property` is 0. Since we have no property
locations, we have `[0, 0]` as `buildingTiv` and `tivPropDivisor` is 0.
0(0/0) + 0(0/0) = 0(NaN) + 0(NaN) = NaN.
This is clearly not what was intended. The predicate is expected to be
_strongly_ zero, as if using an Iverson bracket:
((0/0)[0] + (0/0)[0]) = 0.
Of course, one option is to redefine TAME such that we use Iverson's
convention in place of summation, however this is neither necessary nor
desirable given that
(a) NaN is not valid within the domain of any TAME expression, and
(b) Summation is elegantly generalized and efficiently computed using
vector arithmetic and SIMD functions.
That is: there's no use in messing with TAME's computational model for a
value that should be impossible to represent.
Short-Circuiting Computation
----------------------------
There's another way to look at it, though: that we intended to skip the
computation entirely, and so it doesn't matter what the quotient is. If the
compiler were smart enough (and maybe one day it will be), it would know
that the predicate of `tivPropDivisor` and `propValue` are the same and so
there is no circumstance under which we would compute `propValue` and have
`tivPropDivisor` be 0.
The problem is: that short-circuiting is employed as an _optimization_, and
is an implementation detail. Mathematically, the expression is unchanged,
and is still invalid within TAME's domain. It is unrepresentable, and so
this is not an out.
But let's pretend that it was defined that way, which would yield this:
  propValue = { ∑ₖ(pₖ(tₖ/dₖ)),  if ∀x∈p(x = 1);
              { 0,              otherwise.
This is the optimization that is employed, but it's still not mathematically
correct! What happens if p₀ = 1, but p₁ = 0? Then we have:
1(100,000/1000) + 0(0/0) = 100 + NaN = NaN,
but the _intent_ was clearly to have 100 + 0 = 100, and so we return to the
original problem once again.
Classification Predicates and Intent
------------------------------------
Classifications are used as predicates for equations, but classifications
_themselves_ have predicates in the form of _matches_. Consider, for
example, a classification that may be used in an assertion to prevent
negative premium from being generated:
<t:assert failure="premBuilding must not be negative for any index">
<t:match-gte value="premBuilding" value="#0" />
</t:assert>
Simple enough—the system will fail if the premium for a given building is
below $0.
But what happens if premBuilding is calculated like so?
<rate-each class="property" yields="premBuildingTotal"
generates="premBuilding" index="k">
<c:product>
<c:value-of name="propValue" index="k" />
<c:value-of name="propRate" index="k" />
</c:product>
</rate-each>
Alas, if `property` is false for any index, then we know that `propValue` is
NaN, and NaN * x = NaN, and so `premBuilding` is NaN.
The above assertion will compile the match into the first-order sentence
∀x∈b(x ≥ 0).
Unfortunately, NaN is not greater than, less than, equal to, or any other
sort of thing to 0, and so _this assertion will trigger_. This causes
practical problems with the `_premium_` template, which has an
`@allow-zero@` argument to permit zero premium.
Consider this real-world case that I found (variables renamed), to avoid a
strawman:
<t:premium class="loc" round="cent"
yields="locInitialTotal"
generates="locInitial" index="k"
allow-zero="true"
desc="...">
<c:value-of name="premAdditional" />
<c:quotient>
<c:value-of name="premLoc" index="k" />
<c:value-of name="premTotal" />
</c:quotient>
</t:premium>
This appears to be responsible for splitting up `premAdditional` relative to
the total premium contribution of each location. It explicitly states that
it wants to permit a zero value. The intent of this block is clear: a value
of 0 is explicitly permitted and _expected_.
But if `premTotal` is for whatever reason 0—whether it be due to a test
case or some unexpected input—then it'll yield a NaN and make the entire
expression NaN. Or if `premAdditional` or `premLoc` are tainted by a NaN,
the same result will occur. The assertion will trigger. And, indeed, this
is what I'm seeing with test cases against the new classification system.
What about Infinity? Is it intuitive that, should `propValue` in the
previous example be positive and `propRate` be 0, that we would, rather than
producing a very small value, produce an infinitely large one? Does that
match intuition? Remember, this system is a domain-specific language for
_our_ purposes—it is not intended to be used to model infinities.
For example, say we had this submission because the premium exceeds our
authority to write with some carrier:
<t:submit reason="Premium exceeds authority">
<t:match-gt name="premBuilding" value="#100k" />
</t:submit>
If we had
(100,000 / 0) = ∞,
then this submit reason would trigger. Surely that was not intended, since
we have `property` as a predicate and `propRate` with the same predicate,
implying that the answer we _actually_ want is 0! In that case, what we
_probably_ want to trigger is something like
<rate yields="premFinal">
<t:maxreduce>
<c:value-of name="premBuildingTotal" />
<c:value-of name="#500" />
</t:maxreduce>
</rate>,
in order to apply a minimum premium of $500. But if `premBuildingTotal` is
Infinity, then you won't get that—you'll get Infinity, which is of course
nonsense.
And nevermind -Infinity.
Why Wasn't This a Problem Before?
---------------------------------
So why bring this up now? Why have we survived a decade without this?
We haven't, really—these bugs have been hidden. But the old classification
system covered them up; predicates would implicitly treat missing values as
0 by enclosing them in `(x||0)` in the compiled code. Observe this
ECMAScript code:
NaN || 0 = 0.
Consequently, the old classification system absorbed bad values and treated
them implicitly as 0. But that was a bug, and had to be removed; it meant
that missing indexes in classifications would trigger predicates that were
not intended to be triggered, if they matched against 0, or matched against
a value less than some number larger than zero. (See
`core/test/core/class` for examples.)
The new classification system does not perform such defaulting. _But it
also does not expect to receive values outside of its valid domain._
Consequently, _NaN and Infinity lead to undefined behavior_, and the
current implementation causes the predicate to match (NaN < 0) and therefore
fail.
The reason is that this implementation is intended to convey precisely the
computation necessary for the classification system, as
formally defined, so that it can be later optimized even further. Checking
for values outside the domain not only should not be necessary, but it would
prevent such future optimizations.
Furthermore, parameters used to compile into (param||0), to account for
missing values or empty strings. This changed somewhat recently with
5a816a4701, which pre-cast all inputs and
allowed relaxing many of those casts since they were both wasteful and no
longer necessary.
Given that, for all practical purposes, 0/0=0 in the system until less
than a year ago.
Infinity, of course, is a different story, since (Infinity||0)=Infinity;
this one has always been a problem.
Let's Just Fail
---------------
Okay, so we cannot have a valid expression, so let's just fail.
We could mean that in two different ways:
1. Fail at runtime if we divide by 0; or
2. Fail at compile-time if we _could_ divide by 0.
Both of these have their own challenges.
Let's dismiss #2 right off the bat for now, because until we have TAMER,
that's not really feasible. We need something today. We will discuss that
in the future.
For #1—we cannot just throw an error and halt computation, because if the
`canterm` flag passed into the system is `false`, then _computation must
proceed and return all results_. Terminating classifications are checked
after returning rather than throwing errors.
Since we have to proceed with computation, then the computations have to be
valid, and so we're left with the same problem again—we cannot have
undefined behavior.
One could argue that, okay, we have undefined behavior, but we're going to
fail because of the assertion anyway! That's potentially defensible, but it
is at the moment undesirable, because we get so many failures. And,
relative to the section below, it's not clear to me what benefit we get from
that behavior other than making things more difficult for ourselves.
Furthermore, such an assertion would have to be defined for every
calculation that performs a quotient, and would have to set some
intermediate flag in the calculation which would then have to be checked for
after-the-fact. This muddies the generated calculation, which causes
problems for optimizations, because it requires peering into state of the
calculation that may be hidden or optimized away.
If we decide that calculations must be valid because we cannot fail, and we
have to stick with the domain of calculations, then `x/0` must be
_something_ within that domain.
x/0=0 Makes Sense With the Current System
-----------------------------------------
Let's take a step back. Consider a developer who is unaware that
NaN/Infinity are permitted in the system—they just know that division by
zero is a bad thing to do because that's what they learned, and they want to
avoid it in their code.
Consider that they started with this:
<rate-each class="property" generates="propValue" index="k">
<c:quotient>
<c:value-of name="buildingTiv" index="k" />
<c:value-of name="tivPropDivisor" index="k" />
</c:quotient>
</rate>
They have inspected the output of `tivPropDivisor` and see that it is
sometimes 0. They understand that `property` is a predicate for the
calculation, and so reasonably think that they could do something like this:
<classify as="nonzero-tiv-prop-divisor" ...>
<t:match-ne on="tivPropDivisor" value="#0" />
</classify>
and then change the rate-each to
<rate-each class="property nonzero-tiv-prop-divisor" ...>.
Except that, of course, we know that will have no effect, because a NaN is a
NaN. This is not intuitive.
So they'd have to do this:
<rate-each class="property" generates="propValue" index="k">
<c:cases>
<c:case>
<t:when-ne name="tivPropDivisor" value="#0" />
<c:quotient>
<c:value-of name="buildingTiv" index="k" />
<c:value-of name="tivPropDivisor" index="k" />
</c:quotient>
</c:case>
<c:otherwise>
<c:value-of name="#0" />
</c:otherwise>
</c:cases>
</rate>.
But for what purpose? What have we gained over simply having x/0=0, which
does this for you?
The reason why this is so unintuitive is because 0 is the default case in
every other part of the system. If something doesn't match a predicate, the
value becomes 0. If a value at an index is not defined, it is implicitly
zero. A non-matching predicate is 0.
This is exploited for reducing values using summation. So the behavior of
the system with regards to 0 is always on the mind of the developer. If we
add it in another spot, they would think nothing of it.
It would be nice if it acted as an identity in a monoidic operation,
e.g. as 0 for sums but as 1 for products, but that's not how the system
works at all today. And indeed such a thing could be introduced using a
special template in place of `c:value-of` that copies the predicates of the
referenced value and does the right thing.
The _danger_, of course, is that this is _not_ how the system has worked,
and so changing the behavior risks breaking something that has relied on
undefined behavior for so long. This is indeed a risk, but I take some
comfort in the facts that (a) all the test cases for our system pass despite
a significant number of x/0=0 evaluations being triggered due to limited inputs, and (b)
these situations are _not correct today_, resulting in `null` in serialized
result data because `JSON.stringify([NaN, Infinity]) === "[null, null]"`.
Given all of that, predictable incorrect behavior is better than undefined
behavior.
So x/0=0 Isn't Bad?
-------------------
No, and it's mathematically sound. This decision isn't unprecedented—
Coq, Lean, Agda, and other theorem provers define x/0=0. APL originally
defined x/0=1, but later switched to 0. Other languages do their own thing
depending on what is right for their particular situation.
Division is normally derived from
a × a⁻¹ = 1, a ≠ 0.
We're simply not using that definition—when we say "quotient", or use the
`/` symbol, we mean a _different_ function (`div`, in the compiled JS),
where we have an _additional_ axiom that
a / 0 = 0.
And, similarly,
0⁻¹ = 0.
So we've taken a _normally undefined_ case and given it a definition. No
inconsistency arises.
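Expressed in Rust for illustration (this is a sketch of the semantics, not
the actual compiled JS):

  /// Total division: defined for all inputs, with a zero denominator
  /// mapped to zero by definition.
  fn div(n: f64, d: f64) -> f64 {
      if d == 0.0 { 0.0 } else { n / d }
  }

  fn main() {
      assert_eq!(div(100_000.0, 1_000.0), 100.0);
      assert_eq!(div(100_000.0, 0.0), 0.0); // x/0 = 0
      assert_eq!(div(0.0, 0.0), 0.0); // 0/0 = 0, never NaN
  }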
In fact, this makes _sense_ to do, because _this is what we want_. The
alternative, as mentioned above, is a lot of boilerplate—checking for 0 any
time we want to do division. Complicating the compiler to check for those
cases. And so on. It's easier to simply state that, in TAME, quotients
have this extra convenient feature whereby you don't have to worry about
your denominator being zero because it'll act as though you enclosed it in a
case statement, and because of that, all your code continues to operate in
an intuitive way.
I really recommend reading this blog post regarding the Lean theorem prover:
https://xenaproject.wordpress.com/2020/07/05/division-by-zero-in-type-theory-a-faq/
This is intended to be set via the configure script, and is being added
primarily for the upcoming flag to enable the legacy classification
system. This is only used for the XSLT-based compiler.
preproc:symtable-process-symbols is run on each pass (e.g. during initial
processing and after each template expansion) to introduce new symbols into
the symbol table from imports and newly discovered symbols.
This processing was previously optimized a bit using maps to reduce the cost
of symbol table lookups, but the processing was still inefficient, relying
on XSLT1-style processing (as originally written) for deduplication. This
now uses `for-each-group` and `perform-sort` to offload the expensive
computation onto Saxon, which is much more efficient.
Symbol table processing has long been a culprit, but I hadn't attempted to
optimize further in recent months because of TAMER work. Since TAMER has
been on pause for a few months with other things needing my attention, I
needed to provide a short-term performance improvement to keep up with
increasing build times.
DEV-11716
This provides logging that can be used to analyze jobs. See `tamed --help`
for some examples. More to come.
You'll notice that one of the examples represents package build time in
_minutes_. This is why TAMER is necessary; as of the time of writing, the
longest-building package is nearly five and a half minutes, and there are a
number of packages that take a minute or more. But, there are potentially
other optimizations that can be done. And this is _after_ many rounds of
optimizations over the years. (TAME was not originally built for what it is
currently being used for.)
This is something that I've wanted to do for quite some time, but for good
reason, have been avoiding.
`tamed --report` is fairly basic right now, but allows you to see what each
of the runners are doing. This will be expanded to gather data for further
analysis.
The thing that I was avoiding was a status line during the build to
summarize what the runners are doing, since it's nearly impossible to do so
from the build output with multiple runners. This will not only allow me to
debug more easily, but will keep the output plainly visible to developers at
all times in the hope that it can help them improve the build times
themselves in certain cases.
It is currently gated behind TAMED_TUI, since, while it works well overall,
it is imperfect, and will cause artifacts from build output partly
overwriting the status line, and may even occasionally clobber the PS1 by
erasing the line. This will be improved upon in the future; something is
better than nothing.
This is simply not worth it; the size is not going to be the bottleneck (at
least any time soon) and the generic not only pollutes all the things that
will use ASG in the near future, but is also incompatible with the SymbolId
default that is used everywhere; if we have to force it to 32 bits anyway,
then we may as well just default it right off the bat.
I thought that this seemed like a good idea at the time, and saving bits is
certainly tempting, but it was premature.
It's a bit odd that I've done next to nothing with TAMER for the past week
or so, and decided to do this one small thing before I go on break for the
holidays, but I felt compelled to do _something_. Besides, this gets me in
a better spot for the inevitable mental planning and writing I'll be doing
over the holidays.
This move was natural, given what this has evolved into---it has nothing to
do with the concept of a "tree", and the module's imports emphasized that
fact given the level of inappropriate nesting.
Now that the parser has been simplified by removing attributes, we can
further simplify the state transitions to make it more clear what further
refactoring can be done.
DEV-11339
More information can be found in the prior commit message, but I'll
summarize here.
This token was introduced to create a LL(0) parser---no tokens of
lookahead. This allowed the underlying TokenStream to be freely passed to
the next system that needed it.
Since then, Parser and ParseState were introduced, along with
ParseStatus::Dead, which introduces the concept of lookahead for a single
token---an LL(1) grammar.
I had always suspected that this would happen, given the awkwardness of
AttrEnd; it was just a matter of time before the right abstraction
manifested itself to handle lookahead.
DEV-11339
Note that AttrParse{r=>}State needs renaming, and Stack will get a better
name down the line too. This commit message is accurate, but confusing.
This performs the long-awaited task of trying to observe, concretely, how to
combine two automata. This has the effect of stitching together the state
machines, such that the union of the two is equivalent to the original
monolith.
The next step will be to abstract this away.
There are some important things to note here. First, this introduces a new
"dead" state concept, where here a dead state is defined as an _accepting_
state that has no state transitions for the given input token. This is more
strict than a dead state as defined in, for example, the Dragon Book, where
backtracking may occur.
The reason I chose to make a Dead state accepting is simple: it represents
a lookahead situation. It says, "I don't know what this token is, but I've
done my job, so it may be useful in a parent context". The "I've done my
job" part is only applicable in an accepting state.
If the parser is _not_ in an accepting state, then an unknown token is
simply an error; we should _not_ try to backtrack or anything of the sort,
because we want only a single token of lookahead.
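In code, the concept reduces to something like this (a sketch; the actual
definition differs in detail):

  enum ParseStatus<T, O> {
      // More input is needed before an object can be emitted.
      Incomplete,
      // A fully parsed object.
      Object(O),
      // Accepting state with no transition for this token: hand it
      // back to the parent as a single token of lookahead.
      Dead(T),
  }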
The reason this was done is because it's otherwise difficult to compose the
two parsers without requiring that AttrEnd exist in every XIR stream; this
has always been an awkward delimiter that was introduced to make the parser
LL(0), but I tried to compromise by saying that it was optional. Of course,
I knew that decision caused awkward inconsistencies, I had just hoped that
those inconsistencies wouldn't manifest in practical issues.
Well, now it did, and the benefits of AttrEnd that we had in the previous
construction do not exist in this one. Consequently, it makes more sense to
simply go from LL(0) to LL(1), which makes AttrEnd unnecessary, and a future
commit will remove it entirely.
All of this information will be documented, but I want to get further in
the implementation first to make sure I don't change course again and
therefore waste my time on docs.
DEV-11268
These were missed from a couple of commits ago, after I recalled that I
could now simplify the Stack variants; they were made more complicated due
to isolated attribute parsing.
These progressive refactorings do a good job illustrating why composing
parsers is better than a monolith---the complexity of the parsers is
significantly reduced, and the number of combinations of states are also
greatly reduced, which allows us to reason about them in isolation.
DEV-11268
This was added only for isolated attribute parsing. Of course, this does
mean that a new union type will be needed when combining the two parsers,
depending on the desired resolution, but that'll come at a later time and
possibly in a more general way.
DEV-11268
This nearly completely integrates the new Parser with xir::tree, but does
not yet compose AttrParseState. I also need to determine what to do with
`parse()` and, further, make `parser_from` generic as part of mod parse.
If we take a moment to reflect on all of the changes, this struggle has been
a roundabout way of converting tree's parser into parse::Parser; providing
a trait for Stack (as ParseState); beginning parser decomposition; and
moving some common logic into Parser. The composition of parsers is the
final piece to be realized.
This could have been a lot less work if I really understood exactly what I
wanted to do up front, but as was mentioned in previous commits, I was
really confusing myself trying to maintain API BC in ways that I should not
have for XmloReader. More on that will be coming soon as well.
DEV-11268
This will allow Parser to operate on both owned and &mut values, and is the
same approach that Rust's built-in iterators take.
This is at first quite surprising, and I often forget that this is even
possible; as a bonus, it is also an attractive way to avoid lifetimes in
struct definitions when generics are used for a type that may become a
reference.
DEV-11268
This isn't currently used by anything, and this is collecting, which does
not fit well with the streaming model. AttrList was originally written for
Element parsing, and the isolated attr parser was written for test cases,
before it was fully decided how this system ought to work.
Instead, if AttrList is in fact needed, we can either collect (ideally not)
or implement Extend for AttrList. (Or create TryExtend.)
DEV-11268
This removes the layer of encapsulation that was hiding Stack, which is the
actual parser. The new layer of encapsulation is parse::Parser, which will
be introduced here soon. Baby steps, so it's clear how this evolves.
DEV-11268
The old Parsed was renamed to ParseStatus to be used by Parser, and Parser
converts it into Parsed, which has the same variants as before except for
Done, since it's not possible for Parser to yield it.
DEV-11268
This removes Option from ParseState, as mentioned in previous commits.
This is ideal because it not only removes a layer of abstraction, but also
makes the intent very clear; the use of None was too tied to the concept of
an Iterator, which is the concern of Parser, _not_ ParseState.
This is now similar to tree::Parsed, which will help with that refactoring
shortly.
The Done variant is not accessible outside of Parser, since it always
converts it to None (to halt iteration); given that, we should have another
public-facing type, as was also mentioned in a previous commit.
DEV-11268
This also renames related types.
See previous commits for more information. In essence, this trait
represents the reification of all parser state. The omission of "r" in the
name ParseState is intentional, since it indicates the state of a current
parse. We'll see whether that naming ends up being too confusing; it's easy
enough to change.
DEV-11268
This just leaves Parser, which is what I started with, but I wasn't sure how
far I was going to take this. I went against my usual judgment in creating
a trait that I may not need, in an attempt to try to reason about the API
that I wanted, because it wasn't yet clear at the time whether the Parser
ought to be generic.
Since then (as detailed in the last commit), this has become more of a
coordinator/mediator, and the real parser is actually TokenStreamState,
which will be renamed shortly.
DEV-11268
This begins to integrate the isolated AttrParser. The next step will be
integrating it into the larger XIRT parser.
There's been considerable delay in getting this committed, because I went
through quite the struggle with myself trying to determine what balance I
want to strike between Rust's type system; convenience with parser
combinators; iterators; and various other abstractions. I ended up being
confounded by trying to maintain the current XmloReader abstraction, which
is fundamentally incompatible with the way the new parsing system
works (streaming iterators that do not collect or perform heap
allocations).
There'll be more information on this to come, but there are certain things
that will be changing.
There are a couple problems highlighted by this commit (not in code, but
conceptually):
1. Introducing Option here for the TokenParserState doesn't feel right, in
the sense that the abstraction is inappropriate. We should perhaps
introduce a new variant Parsed::Done or something to indicate intent,
rather than leaving the reader to have to read about what None actually
means.
2. This turns Parsed into more of a statement influencing control
flow/logic, and so should be encapsulated, with an external equivalent
of Parsed that omits variants that ought to remain encapsulated.
3. TokenStreamState is true to its name, but these really are the actual
parsers; TokenStreamParser is more of a coordinator, and helps to abstract
away some of the common logic so lower-level parsers do not have to worry
about it. But calling it TokenStreamState is both a bit confusing and an
understatement---it _does_ hold the state, but it also holds the current
parsing stack in its variants.
Another thing that is not yet entirely clear is whether this AttrParser
ought to care about detection of duplicate attributes, or if that should be
done in a separate parser, perhaps even at the XIR level. The same can be
said for checking for balanced tags. By pushing it to TokenStream in XIR,
we would get a guaranteed check regardless of what parsers are used, which
is attractive because it reduces the (almost certain-to-otherwise-occur)
risk that individual parsers will not sufficiently check for semantically
valid XML. But it does _potentially_ make error recovery more
complicated. But at the same time, perhaps more specific parsers ought not
care about recovery at that level.
Anyway, point being, more to come, but I am disappointed how much time I'm
spending considering parsing, given that there are so many things I need to
move onto. I just want this done right and in a way that feels like it's
working well with Rust while it's all in working memory, otherwise it's
going to be a significant effort to get back into.
DEV-11268
This stores the last seen Span and uses that when reporting EOF, so that the
user will be able to be notified of where exactly the problem occurred.
When I get into creating combinators, it'll be the responsibility of those
combinators to ensure that any None return value will be supplemented by its
own last span.
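A minimal sketch of that bookkeeping, assuming illustrative names (the
field and error variant are not necessarily the real ones):

    struct Parser<S: ParseState, I: TokenStream> {
        state: S,
        toks: I,
        // Updated on every token; consulted if the stream ends
        // while the parser is in a non-accepting state.
        last_span: Option<Span>,
    }

    // On each token:    self.last_span = Some(tok.span());
    // On premature EOF: Err(ParseError::UnexpectedEof(self.last_span))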
DEV-11268
This permits retrieving a Span from any Token variant. To support this,
rather than having this return an Option, Token::AttrEnd was augmented with
a Span; this results in a much simpler and friendlier API.
DEV-11268
This removes XIRT support for attribute fragments. The reason is that
this is a write-only operation---fragments are used to concatenate
SymbolIds without reallocation, which can only happen if we are generating
XIR internally.
Given that this cannot happen during read, it was a mistake to complicate
the parsers. But it makes sense why I did it originally, given that the XIRT
parser was written for simplifying test cases. But now that we want parsers
for real, and are writing production-quality parsers, this extra complexity
is very undesirable.
As a bonus, we also avoid any potential for heap allocations related to
attributes. Granted, they didn't _really_ exist to begin with, but it was
part of XIRT, and was ugly.
DEV-11268
The XIRT parser was initially written for test cases, so that unit tests
could more easily assert on generated token streams (XIR). While it was
planned, it wasn't clear what the eventual needs would be, which were
expected to differ. Indeed, loading everything into a generic tree
representation in memory is not appropriate---we should prefer streaming and
avoiding heap allocations when they’re not necessary, and we should parse
into an IR rather than a generic format, which ensures that the data follow
a proper grammar and are semantically valid.
When parsing attributes in an isolated context became necessary for the
aforementioned task, the state machine of the XIRT parser was modified to
accommodate. The opposite approach should have been taken---instead of
adding complexity and special cases to the parser, and from a complex parser
extracting a simple one (an attribute parser), we should be composing the
larger (full XIRT) parser from smaller ones (e.g. attribute, child
elements).
A combinator, when used in a functional sense, refers not to combinatory
logic but to the composition of more complex systems from smaller ones. The
changes made as part of this commit begin to work toward combinators, though
it's not necessarily evident yet (to you, the reader) how that'll work,
since the code for it hasn't yet been written; this commit is simply
getting my work thus far introduced so I can do some light refactoring before
continuing on it.
TAMER does not aim to introduce a parser combinator framework in its usual
sense---it favors, instead, striking a proper balance with Rust’s type
system that permits the convenience of combinators only in situations where
they are needed, to avoid having to write new parser
boilerplate. Specifically:
1. Rust’s type system should be used as combinators, so that parsers are
automatically constructed from the type definition.
2. Primitive parsers are written as explicit automata, not as primitive
combinators.
3. Parsing should directly produce IRs as a lowering operation below XIRT,
rather than producing XIRT itself. That is, target IRs should consume
XIR and perform the parse themselves immediately, during streaming.
In the future, if more combinators are needed, they will be added; maybe
this will eventually evolve into a more generic parser combinator framework
for TAME, but that is certainly a waste of time right now. And, to be
honest, I’m hoping that won’t be necessary.
There are a number of reasons for this, where the benefits do not make up
for the losses.
First: this is actually invoking cargo. Not only is this not necessary, but
it's not desirable: cargo by default hits the network and does all sorts of
other stuff, when all we want to do is invoke the executable. So the tests
aren't really testing the right thing in that sense. See the previous
commit for more information.
The way it invokes cargo is different than the way the Makefile invokes
cargo, so on my system, it's actually invoking a _different cargo_! This is
causing problems, in particular with lock files, which causes my tests to
fail.
Importantly, this also removes a _lot_ of dependencies, which removes a lot
of supply chain risk and a lot of code to audit. This provides
significant security benefits, especially given that what was being tested
was rather small, and could be done in a shell script.
TAMER will receive significant system testing later on. But for now, none
of this was worth it.
Further audits of dependencies will come later on. I've always been fairly
insistent on keeping the dependency graph small and auditable, but recent
supply chain attacks have given me a better way to rationalize the security
risk. Further, I'm the only one on this project right now.
Cargo's default behavior is unfortunately to issue network calls each time
it is invoked in order to check for dependency updates. This is not only
bad for reproducibility and privacy, but it's also a concern for supply
chain attacks, since most developers are unaware that this is occurring.
Instead, we pin to the lockfile. Installing dependencies can be done with
`cargo fetch` and updating dependencies must be explicitly done by the
developer, with the lockfile updated.
Well, parse to the extent that it was being parsed before, anyway.
The core of this change demonstrates how well TAMER's abstractions work
together. (As long as you have an e.g. LSP to help you make sense of all of
the inference, I suppose.)
    Token::Open(QN_LV_PACKAGE | QN_PACKAGE, _) => {
        return Ok(XmloEvent::Package(
            attr_parser_from(&mut self.reader)
                .try_collect_ok()??,
        ));
    }
This finally makes use of `attr_parser_from` and `try_collect_ok`. All of
the types are inferred---from the iterator transformations, to the error
conversions, to the destination PackageAttrs type.
DEV-10863
This was forgotten when the attribute parser was introduced, and led to the
parser continuing to the token following AttrEnd, which properly caused a
failure given that the parser was in the Done state.
There is a future task I have in my backlog to properly address the Done
state, but this is sufficient for now.
To maintain a proper abstraction, this cannot be the responsibility of the
caller; most callers should not know that fragments exist, let alone how to
handle them.
Like previous commits, this replaces the explicit escaping context with the
convention that all values retrieved from `xir` are unescaped on read and
escaped on write.
Comments are a notable TODO, since we must escape only `--`.
CData is also an issue. I had _expected_ to use it as a means to avoid
unescaping fragments, but I had forgotten that quick_xml hard-codes escaping
on read, so that it can re-use BytesStart! That is terribly unfortunate,
and may result in us having to re-implement our own read method in the
future to avoid this nonsense. So I'm just leaving it as a TODO for now.
DEV-11081
This adds a constant `ST_COUNT` representing the number of statically
allocated symbols, and uses that to estimate an initial capacity for the
`CachingEscaper`.
This is just a guess (and is certainly too low), but we can adjust later on
after profiling, if it ever comes up.
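In other words, something along these lines (a sketch; the actual
structure and capacity calculation may differ):

    use std::collections::HashMap;

    // Static symbols are likely to be escaped/unescaped early and
    // often, so seed the cache's capacity from their count.
    let cache: HashMap<SymbolId, SymbolId> =
        HashMap::with_capacity(ST_COUNT);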
This rewrites a good portion of the previous commit.
Rather than explicitly storing whether a given string has been escaped, we
can instead assume that all SymbolIds leaving or entering XIR are unescaped,
because there is no reason for any other part of the system to deal with
such details of XML documents.
Given that, we need only unescape on read and escape on write. This is
customary, so why didn't I do that to begin with?
The previous commit outlines the reason, mainly being an optimization for
the echo writer that is upcoming. However, this solution will end up being
better---it's not implemented yet, but we can have a caching layer, such
that the Escaper records a mapping between escaped and unescaped SymbolIds
to avoid work the next time around. If we share the Escaper between _all_
readers and the writer, the result is that
1. Duplicate strings between source files and object files (many of which
are read by both the linker and compiler) avoid re-unescaping; and
2. Writers can use this cache to avoid re-escaping when we've already seen
the escaped variant of the string during read.
The alternative would be a global cache, like the internment system, but I
did not find that to be appropriate here, since this is far less
fundamental and is much easier to compose.
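The cache lookup itself is trivial; a sketch of the idea (the field and
method names here are hypothetical):

    fn escape(&mut self, unescaped: SymbolId) -> SymbolId {
        match self.cache_esc.get(&unescaped) {
            // Seen during a prior read or write: no work to do.
            Some(&escaped) => escaped,
            None => {
                let escaped = self.escape_str(unescaped);
                self.cache_esc.insert(unescaped, escaped);
                escaped
            }
        }
    }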
DEV-11081
I'm not fond of this implementation, which is why it's not fully
completed. I wanted to commit this for future reference, and take the
opportunity to explain why I don't like it.
First: this task started as an idea to implement a third variant to
AttrValue and friends that indicates that a value is fixed, in the sense of
a fixed-point function: escaped or unescaped, its value is the same. This
would allow us to skip wasteful escape/unescape operations.
In doing so, it became obvious that there's no need to leak this information
through the API, and indeed, no part of the system should care. When we
read XML, it should be unescaped, and when we write, it should be
escaped. The reason that this didn't quite happen to begin with was an
optimization: I'll be creating an echo writer in place of the current
filesystem-based copy in tamec shortly, and this would allow streaming XIR
directly from the reader to the writer without any unescaping or
re-escaping.
When we unescape, we know the value that it came from, so we could simply
store both symbols---they're 32-bit, so it results in a nicely compressed
64-bit value, so it's essentially cost-free, as long as we accept the
expense of internment. This is `XirString`. Then, when we want to escape
or unescape, we first check to see whether a symbol already exists and, if
so, use it.
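A sketch of that packing, with a stand-in for the 32-bit SymbolId:

    use std::num::NonZeroU32;

    struct SymbolId(NonZeroU32); // stand-in for the real type

    // The escaped/unescaped pair compresses into 64 bits total.
    struct XirString {
        unescaped: SymbolId,
        escaped: SymbolId,
    }

    const _: () = assert!(std::mem::size_of::<XirString>() == 8);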
While this works well for echoing streams, it won't work all that well in
practice: the unescaped SymbolId will be taken and the XirString discarded,
since nothing after XIR should be coupled with it. Then, when we later
construct a XIR stream for writing, XirString will no longer be available
and our previously known escape is lost, so the writer will have to
re-escape.
Further, if we look at XirString's generic for the XirStringEscaper---it
uses phantom, which hints that maybe it's not in the best place. Indeed,
I've already acknowledged that only a reader unescapes and only a writer
escapes, and that the rest of the system works with normal (unescaped)
values, so only readers and writers should be part of this process. I also
already acknowledged that XirString would be lost and only the unescaped
SymbolId would be used.
So what's the point of XirString, then, if it won't be a useful optimization
beyond the temporary echo writer?
Instead, we can take the XirStringWriter and implement two caches on that:
mapping SymbolId from escaped->unescaped and vice-versa. These can be
simple vectors; since SymbolId is a 32-bit value, we will not have much
wasted space for symbols that never get read or written. We could even
optimize for preinterned symbols using markers, though I'll probably not do
so, and I'll explain why later.
If we do _that_, we get even _better_ optimizations through caching that
_will_ apply in the general case (so, not just for echo), and we're able to
ditch XirString entirely and simply use a SymbolId. This makes for a much
more friendly API that isn't leaking implementation details, though it
_does_ put an onus on the caller to pass the encoder to both the reader and
the writer, _if_ it wants to take advantage of a cache. But that burden is
not significant (and is, again, optional if we don't want it).
So, that'll be the next step.
This is intended to alleviate what will be some common boilerplate because
of the Rust compiler error described therein.
This will evolve over time, I'm sure.
DEV-10863
This provides convenience methods atop of the already-existing
functions. These are a bit more ergonomic since they (a) remove a variable
and its generics and (b) are conveniently suggested via LSP (with
e.g. rust-analyzer) if the iterator is of the right type, even if the trait
is not yet imported. This should help with discoverability as well.
These traits augment Rust's built-in traits to handle failure scenarios,
which will allow us to encapsulate lowering logic into discrete,
self-parsing units that enforce e.g. schemas (the example alludes to my
intentions).
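As an illustration of the shape of such a trait (the names and
signatures here are hypothetical, not the committed API):

    // A fallible analogue to FromIterator: constructing the target
    // may itself fail (e.g. schema validation), independently of
    // whether the source items are Ok.
    trait TryFromIterator<A>: Sized {
        type Error;
        fn try_from_iter<I: IntoIterator<Item = A>>(
            iter: I,
        ) -> Result<Self, Self::Error>;
    }

    trait TryCollect: Iterator + Sized {
        fn try_collect<B: TryFromIterator<Self::Item>>(
            self,
        ) -> Result<B, B::Error> {
            B::try_from_iter(self)
        }
    }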
The previous implementation took ownership over the provided iterator, which
was an oversight, considering that this is intended to be used in contexts
where doing so is not possible. This is a good example of where isolated
test cases don't necessarily paint the correct picture.
`scan` takes owned values, so this instead uses the same parsing method as
`parse_attrs`, but using a `FromFn` iterator to avoid having to create a
whole new iterator type. This will work well so long as we don't need to
store the type returned by this (while also wanting to avoid boxing).
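Roughly (all names are illustrative; `parse_next` stands in for the
shared parsing step used by `parse_attrs`):

    use std::iter;

    fn attr_parser_from<'a>(
        toks: &'a mut impl Iterator<Item = Token>,
    ) -> impl Iterator<Item = Result<Attr, Error>> + 'a {
        // Borrows the underlying stream rather than consuming it,
        // without requiring a dedicated iterator type.
        iter::from_fn(move || parse_next(toks))
    }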
DEV-11062
See the previous commit. There is no sense in some common "IR" namespace,
since those IRs should live close to whatever system whose data they
represent.
In the case of these, they are general IRs that can apply to many different
parts of the system. If that proves to be a false statement, they'll be
moved.
DEV-10863
Calling it "legacyir" is just confusing. The original hope, when beginning
TAMER, was that I'd be able to use a new object format in the near future to
help speed up the compilation process. But that's far from our list of
priorities now, and so seeing "legacy" all over the place is really
confusing considering that it implies that perhaps it shouldn't be used for
new code.
This helps to clear up that cognitive dissonance by remaining neutral on the
topic. And the reality is that it won't be "legacy" for some time.
DEV-10863
The IRs really ought to live where they are owned, especially given that
"IR" is so generic that it makes no sense for there to be a single location
for them; they're just data structures coupled with different phases of
compilation.
This will be renamed next commit; see that for details.
This also removes some documentation describing the lowering process,
because it's undergone a number of changes and needs to be accurately
re-summarized in another location. That will come at a later time after the
work is further along so that I don't have to keep spending the time
rewriting it.
DEV-10863
This was previously gated behind the negation of the wip-xmlo-xir-reader flag,
which meant that it was not being compiled or picked up by LSP. Both of
those things are inconvenient and unideal.
DEV-10863
This allows for the lazy parsing of attributes, and makes the necessary
changes to the parser to be able to do so safely without getting into a bad
context.
When XIRT was originally conceived, this concept existed somewhat, but it
was done in a way that would allow the parser to accept invalid input. This
avoids that problem.
This also introduces the concept of "Done", primarily because we had to for
the AttrEnd token. This will evolve in following commit(s), which will
allow carrying out the important check of ensuring that the parser has ended
parsing in a valid accepting state (in terms of a state machine).
DEV-11062
This produces an `AttrList` independent from a containing
`Element`. Upcoming changes may further permit the parser to yield smaller
components that are not part of an aggregate.
DEV-10863
This allows Rust to carry out its exhaustiveness check for when we add new
tokens. It further ensures that we understand what we missed, or chose not
to handle.
DEV-10863
This allows AttrList not only to be lazily initialized (which is less of a
problem at the moment with Vec, but may become one in the future), but also
leaves a space open for attributes to be added _after_ having been
parsed. It further leaves room to _take_ attributes from their `Element`.
This is important because the next commit will re-introduce the ability to
parse attributes independently, allowing us to put the parser in a state
where we can parse AttrList without an Element context. To re-use that
parsing under an Element context, we can simply attach an AttrList after it
has been parsed.
Option adds no additional size cost to Vec, so we get this for free (except
for the tiny change that initializes the attribute list when we try to push
to it).
I also think this reads better ("attrs: None"). Though it makes the API
slightly more of a pain to work with.
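The zero-cost claim is easy to verify, since Vec's non-null pointer
gives Option a free niche for its discriminant:

    use std::mem::size_of;

    type Attr = u32; // stand-in for the actual element type

    assert_eq!(
        size_of::<Option<Vec<Attr>>>(),
        size_of::<Vec<Attr>>(),
    );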
DEV-10863
The purpose of this token is to implement a lazy streaming attribute
collection operation without a token of lookahead, which would complicate
parsing or require that a TokenStream provide a `peek` method.
This is only required for readers to produce, since readers will be feeding
data to parsers. I have the writer ignoring it. If you're looking back at
this commit, the question is whether this was a bad idea: it introduces
inconsistencies into the token stream depending on the context, which can be
confusing and error-prone.
The intent is to have the parser throw an explicit error if the new token is
missing in the context in which it is required, which will safely handle the
issue, but does defer it to runtime. But only readers need auditing, and
there's only one XIR reader at the moment.
DEV-10863
There isn't a whole lot here, but there is additional work needed in various
places to support upcoming changes, and so I want to get this committed to
ease the cognitive burden of what I have thus far. And to stop stashing. We
have a feature flag for a reason.
DEV-10863
This macro was previously using the path of wherever the template expanded
into, which I found to be unexpected considering that I thought the macros
were hygienic and the names bound to the environment in which they were
defined.
In any case, this solves the problem in all cases.
DEV-10863
This was forgotten in the previous commit and exists simply to ensure that
the TripIter doesn't add any significant overhead. The tests are
a handful of nanoseconds apart, on my machine.
See the documentation in this commit for more information.
This is pretty significant, in that it's been a long-standing question for
me how I'd like to join together `Result` iterators without having
unnecessarily complex APIs, and also allow for error recovery. This solves
both of those problems.
It should be noted, however, that this does not yet explicitly implement
error recovery, beyond being able to observe the failure as the result of
the provided callback function. Proper recovery will be implemented once
there's a use-case.
DEV-11006
This moves the Iterator impl and From<B> back into `quickxml`. The type of
the new reader is different, taking an iterator instead of a BufRead. This
will allow us to easily mock for unit tests, without the clusterfuckery that
has ensued previously with quick-xml mocking.
DEV-10863
The original plan was to modify the existing reader to use the new
XmlXirReader, but that's going to be a lot of ongoing uncommitted work, with
both tests and implementation. The better option seems to be to reimplement
it, since so many things are changing.
This flag will be short-lived and removed as soon as the implementation is
complete.
DEV-10863
Comments re-use Text, but they are _not_ escaped, so we need to take care
with the type to ensure that, if the value were ever used with a
Token::Text, we don't end up injecting XML.
quick_xml provides us the value escaped, so we can just handle this the same
way as Text for now.
In the future, we may want to distinguish between the two so that we can
reconstruct an identical XML document, but at the moment CData isn't used at
all in TAME sources or outputs, and so I'm not going to worry about it for
now.
DEV-10863
It's nice being able to breeze through changes, since that's been a pretty
rare thing so far, given all the foundational work that has been needed.
This should get us pretty damn close to being able to parse the `xmlo`
files for the linker, if we're not there already.
DEV-10863
This is quick-and-dirty; refactoring can be done later on. This is also
intended to demonstrate the ease with which additional events can be
added---the hard work is done.
This is an initial working concept for the reader which handles, so far,
just a single attribute. But extending it to completion will not be all
that much more work.
This does not have namespace support---that will be added later as part of
XIRT, which is responsible for semantic analysis. This allows XIR to stay
wonderfully simple, and won't have any impact on the writer (which expects
that QNames are unresolved and contain the namespace prefix to be written).
This is the safe version of the existing intern_utf8_unchecked, and exists
as a performance optimization.
We're about to introduce a XIR reader, which is going to intern a _lot_ of
duplicate strings, since it will intern node and attribute names as
well. Given that, we do not want to spend a lot of time performing UTF-8
checks that have already been performed.
We know that, if an intern is in the pool, it's either already UTF-8 or that
check was bypassed when it was initially interned. Therefore, if we find an
existing symbol, that can be returned without having to perform any
check. Otherwise, we intern as we usually would after attempting to convert
the byte slice into a string.
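A sketch of that fast path (the method names are stand-ins for the
interner's internals):

    fn intern_utf8(&mut self, value: &[u8]) -> Result<SymbolId, Utf8Error> {
        // An existing intern was either validated when first interned
        // or deliberately bypassed; no need to re-check it.
        if let Some(sym) = self.lookup_bytes(value) {
            return Ok(sym);
        }

        // First occurrence: pay for the UTF-8 check once.
        Ok(self.intern(std::str::from_utf8(value)?))
    }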
This allows us to continue to have good performance for interning without
sacrificing safety for strings.
The intent of this is to demonstrate how significant of an impact checking
byte arrays for UTF-8 validity will have, since the existing tests do not
make that clear (a static string in Rust is always valid UTF-8).
These benchmarks show that the cost when re-interning an already existing
value is +50%.
This is important, because the new reader will be interning a _lot_ of
duplicate strings, whereas the existing reader operates on byte arrays
without interning unless necessary. And, when it does, it does so
unchecked. But we'd rather not do that, since we cannot guarantee that
those XML files are valid (and not modified in some way).
Upcoming commits will have what I think is a reasonable compromise to this,
based on the fact that we'll be encountering _many_ duplicate strings in
parsing XML files.
DEV-10920
This provides a child `raw` module that exposes a SymbolId representing the
inner value of each of the static newtypes. This is needed in situations
where the type must match and the type of the static symbol is not
important.
In particular, when comparing against runtime-allocated symbols in `match`
expressions.
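For example (the symbol constants are illustrative): all arms of a
match must have the same type, so the static newtypes are compared via
their raw SymbolIds:

    match attr_name {
        // Static (pre-interned) symbols, as raw SymbolIds:
        raw::L_NAME => name = Some(attr_value),
        raw::L_DESC => desc = Some(attr_value),
        // Runtime-allocated symbol:
        unknown => return Err(Error::UnexpectedAttr(unknown)),
    }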
It is also worth noting that this commit managed to hit a bug in Rustc that
was fixed on 10/1/2021. We use nightly, and it doesn't seem that this
occurred in stable, from bug reports.
- https://github.com/rust-lang/rust/issues/89393
- 5ab1245303
- Original issue: https://github.com/rust-lang/rust/issues/72476
The error was:
compiler/rustc_mir_build/src/thir/pattern/deconstruct_pat.rs:1191:22:
Unexpected type for `Single` constructor: <u32 as sym::symbol::SymbolIndexSize>::NonZero
thread 'rustc' panicked at 'Box<dyn Any>', compiler/rustc_errors/src/lib.rs:1146:9
This occurred because we were trying to use `SymbolId` as the type, which
uses a projected type as its inner value: `SymbolId<Ix: SymbolIndexSize>(Ix::NonZero)`.
This was not a problem with the static newtypes because their inner type was
simply `SymbolId<Ix>`, which is not projected.
This is one of the risks of using nightly.
But, the point is: if you receive this error, upgrade your toolchain.
Tbh, I was unaware that this was supported by tuple variants until reading
over the Rustc source code for something. (Which I had previously read, but
I must have missed it.)
This is more proper, in the sense that in a lot of cases we not only care
about how many values a tuple has, but if we explicitly match on them using
`_`, then any time we modify the number of values, it would _break_ any code
doing so. Using this method, we improve maintainability by not causing
breakages under those circumstances.
But, consequently, it's important that we use this only when we _really_
don't care and don't want to be notified by the compiler.
I did not use `..` as a prefix, even where supported, because the intent is
to append additional information to tuples. Consequently, I also used `..`
in places where no additional fields currently exist, since they may in the
future (e.g. introducing `Span` for `IdentObject`).
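For instance (the variant shapes are illustrative):

    enum IdentObject {
        Missing(SymbolId),
        Ident(SymbolId, IdentKind, Source),
    }

    // `..` tolerates appended fields, so later introducing e.g. a
    // Span to Ident will not break this match:
    fn name_of(obj: &IdentObject) -> SymbolId {
        match obj {
            IdentObject::Missing(name) => *name,
            IdentObject::Ident(name, ..) => *name,
        }
    }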
In particular, `name` needn't return an `Option`. `fragment` also returns a
copy, since it's just a `SymbolId`. (It really ought to be a newtype rather
than an alias, but we'll worry about that some other time.)
These changes allow us to remove some runtime panics.
DEV-10859
This moves the logic that sorts identifiers into sections into Sections
itself, and introduces XmleSections to allow for mocking for testing.
This then allows us to narrow the types significantly, eliminating some
runtime checks. The types can be narrowed further, but I'll be limiting the
work I'll be doing now; this'll be inevitably addressed as we use the ASG
for the compiler.
This also handles moving Sections tests, which was a TODO from the previous
commit.
DEV-10859
This is the appropriate place to be, now that we've begun narrowing the
types. We'll be able to do so further; this is just the first step.
This does not yet move the tests, but the code is still tested because it's
tightly coupled with `sort`. Those will move in the next commit(s).
DEV-10859
xmle sections will only ever contain an object of one type, so there is no
use in making this generic.
I think the original plan was to have this represent, generically, sections
of some object file (like ELF), but doing so would require a significant
redesign anyway, so it makes no sense. This is easier to reason about.
DEV-10859
This has always been a lowering operation, but it was not phrased in terms
of it, which made the process a bit more confusing to understand.
The implementation hasn't changed, but this is an incremental refactoring
and so exposes BaseAsg and its `graph` field temporarily.
DEV-10859
Sections, as written, are specific to xmle files.
I think the intent originally was to have this be more generic, but that
doesn't really make sense.
By explicitly coupling it with `xmle` files, that will allow us to turn this
into a proper lowering operation with its own validations that will allow
`xmle::xir` to do its job without having to validate anything itself.
This outputs enough information to be a little bit useful in the event of an
error. In the future, we'll want to provide a (likely non-Display)
implementation that provides line number and source file context with
the problem characters indicated, like Rust.
This is a significant departure from my original plans---this makes it
_easy_ to display symbol values, despite me not wanting that to occur unless
absolutely necessary.
The reality is, based on the design of the system, they will only occur in
these situations:
1. Writing to files;
2. Displaying errors;
3. Tests; or
4. People not following the design of the system.
The fourth one is the most risky as people begin to contribute in the
future, but the reality is that those can be fixed as they are encountered,
since if they're not showing up in a profiler, then they must not be causing
much of a problem.
This removes `SymbolStr` in favor of, simply, `&'static str`.
The abstraction provided no additional safety since the slice was trivially
extracted (and commonly, in practice), and was inconvenient to work with.
This is part of a process of relaxing lookups so that symbols can be
conveniently displayed in errors; rather than trying to prevent the
developer from doing something bad, we'll just rely on conventions, hope
that it doesn't happen, and if it does, address it either at that time or
when it shows up in the profiler.
The docs still need to be improved, but they can be touched as we go.
This concludes the initial development of XIR. That was much more involved
than I had originally intended, but the result is good.
DEV-10561
This generalizes it a bit and provides tests, which was always the intent;
the existing code was POC to determine if this could be done without
performance degradation (see that commit for more information).
The intent is to support the composition and decomposition of spans such
that (A, B) is as documented here. This only performs the trivial case for
the sake of providing a convenient API when the developer would otherwise
just type (S, S).
This is intended to represent the sections written to the final xmle file,
and there was unnecessary complexity in separating everything.
By reducing this IR further, we can begin to constrain its types to
eliminate some of the runtime panics and error checking we have/had in the
writer.
The new writer has reached parity with the old, with the exception of some
edge case explicit error handling that should never occur (which will be
added), and cleanup/docs.
Removing this flag now allows me to perform that cleanup without having to
worry about updating the now-old implementation.
I ran `tameld` with the new writer against our production system with
numerous programs and a significant number of test cases, and diff'd the old
and new xmle files, and everything looks good.
This is a significant milestone, in the sense that it is the culmination of
the past month or so of work to prove that an Iterator-based XIR will be
viable for the system.
This barely had any impact on the performance from the previous commit
reporting the profiling. This performs at least as well as the quick-xml
based writer. In isolated benchmarks, it performs better, but in the real
world, the linker spends most of its time reading xmlo files, and so minor
differences in writing do not have a significant overall impact.
With that said, a lot of cleanup and documentation is still needed. That is
the subject of the upcoming commits, before this writer can be finalized.
The previous iterators had to be used in a certain order because they mixed
concerns, out of concern for performance. This attempts to chain even more
iterators to see how it may perform.
To be clear: this will be cleaned up. This was just an experiment.
Here were profiles on the average of 50 runs of linking our largest program:
Baseline, pre-XIR (with fragments removed from output) 0.8082
XIR writer, pre-ElemWrap, no #[inline] 0.7844s
XIR writer, ElemWrap, no #[inline] 0.7918s
XIR writer, ElemWrap, inlines in obj::xmle::xir 0.7892s
XIR writer, ElemWrap, inlines in obj::xmle::xir and ir::asg::section 0.7858s
XIR writer, ElemWrap, inline in only ir::asg::section 0.781s
Pre-ElemWrap, inlines in ir::asg::section 0.7772s
These profiles are difficult, because they hit the filesystem so much. I
write to /dev/null, but it reads 100s of xmlo files from disk.
It's clear that the impact is fairly modest and within a margin of error; as
such, I will continue down the path of writing code that's easier to grok
and maintain, since not doing so would be a micro-optimization relative to
the concerns of the rest of the system at this point.
But the purpose of all of this work was to determine whether an
iterator-based XIR would be viable. It seems to be competitive. I'll
finish up the writer reimplementation and move on.
Two reasons for this:
1. It's unnecessary, since it's the same ref, so long as we actually build
everything as part of the stage job; and
2. In our environment, the token used doesn't have access to pull from the
registry.
Fixing the latter item can be done at another time.
This contains some awkward coupling for opening and closing tags to reduce
the complexity of the `Iterator` types that must be manually
specified. That may be addressed shortly.
This was creating a heap-allocated `Vec` for each map symbol despite not
actually needing it. We do have multiple `from`s for return map values.
But by the time we may want this type of thing, we'll have a different IR
for it anyway.
See the docs for a much deeper discussion. In summary: traits do not
support static methods, and this is the workaround, which relies on unstable
nightly constant function features.
This implementation is tested using `qname_const!`, and will be utilized
with a new static type in a following commit.
This is to support two things:
1. Early switch to 2021 Edition, which is stable Oct 21; and
2. To make use of unstable const features.
The rationale is that switching to nightly does not really have any
significant downside for us, given that TAMER is used only by us and
the only risk is that unstable features may change a bit, which can be
mitigated with certain precautions.
The rationale for each unstable feature will be documented as they are used,
including documentation on what would be required to remove it and what
functionality would be lost / need to change in doing so.
This is far from fully documented; it's just a start. I'll document fully
once the implementation is done, to ensure I don't waste time documenting
things that may change.
These are getting large and messy.
And I now notice that I never completed the header test after
prototyping. Shame on me.
Also, errata from the previous commit message: the diffs are identical
_except for attribute escaping_, which is unnecessary; we're outputting data
read directly from existing XML files (output by Saxon), so characters are
already escaped as needed.
DEV-10561
The `l:dep` section of the `xmle` file, after formatting (since XIR writes
without newlines and indentation), is now identical to the existing xmle
writer. I can now move on to the other sections.
Note that the attribute movement in this commit is simply to get the diff to
properly align. Once the current xmle writer is removed, I'll organize them
a bit more sensibly.
`obj::xmle::xir` also needs documentation, now that it's shown to be viable.
The new xmle writer was having to intern before write, which did not make
sense.
This continues with consistently using symbols throughout the system, and
is a smaller size than `String` as a bonus.
`IdentKind` needs to be written to `xmle` files and displayed in error
messages. String slices were used when quick-xml was used for writing,
which will be going away with the new writer.
This has been a long time coming, and has been repeatedly stashed as other
parts of the system have evolved to support it. The introduction of the XIR
tree was to write tests for this (which are sloppy atm).
This currently writes out the `xmle` header and _most_ of the `l:dep`
section; it's missing the object-type-specific attributes. There is,
relatively speaking, not much more work to do here.
The feature flag `wip-xir-xmle-writer` was introduced to toggle this system
in place of `XmleWriter`. Initial benchmarks show that it will be
competitive with the quick-xml-based writer, but remember that is not the
goal: the purpose of this is to test XIR in a production system before we
continue to implement it for a frontend, and to refactor so that we do not
have multiple implementations writing XML files (once we echo the source XML
files).
I'm excited to get this done with so that I can move on. This has been
rather exhausting.
The 16-bit interner at present will be used only for span contexts. In the
future, this interner may become specialized specifically for that, but for
now let's just re-use what we already have so that I can move on.
DEV-10733
I want to make it clear in the assertion that the problem could be caused by
duplicate strings. We do not sort by string, because in part we may in the
future want to group certain symbols together in some arbitrary way so we
can compare ranges (using the markers).
If that doesn't end up happening, it may be better to just sort by string
to obviate the problem.
It's really awkward not having them caps, when not only are constants
expected to be, but also that we cannot maintain consistency between the
string and the identifier name in even the simplest of cases.
(We could use `r#`, but that's too cumbersome.)
`StaticSymbolId` was created before the more specific types, which render it
unnecessary. If we need a generic type, it can be re-introduced, but using
`static_symbol_newtypes!`.
This is the interner that is intended to be used with the majority of the
system; the 16-bit interner is left around for the moment, but will likely
later become specialized.
This had the writing on the wall all the same as the `'i` interner lifetime
that came before it. It was too much of a maintenance burden trying to
accommodate both 16-bit and 32-bit symbols generically.
There is a situation where we do still want 16-bit symbols---the
`Span`. Therefore, I have left generic support for symbol sizes, as well as
the different global interners, but `SymbolId` now defaults to 32-bit, as
does `Asg`. Further, the size parameter has been removed from the rest of
the code, with the exception of `Span`.
This cleans things up quite a bit, and is much nicer to work with. If we
want 16-bit symbols in the future for packing to increase CPU cache
performance, we can handle that situation then in that specific case; it's a
premature optimization that's not at all worth the effort here.
We'll see how the syntax evolves over time. It's not ideal to have to
specify the type, rather than having the compiler infer it, but I don't much
feel like getting into my first procedural macro right now, so we'll stick
with this approach for the time being.
This will set the stage to be able to safely e.g. create QNames statically
at compile-time and would allow us to make any attempts to bypass it
unsafe.
Previously, we were allocating only u32 versions of `SymbolId` for the
statically allocated symbols. This introduces a new symbol type with a very
small datatype (8 bits) that is able to cast into any `SymbolId`. This is
explained in the docs.
We'll be taking this typing further in future commits so that static symbols
are better-suited for compile-time guarantees for static newtype
construction.
DEV-10710
This is the beginning of static symbols, which is becoming increasing
necessary as it's quite a pain to have to deal with interning static strings
any place they're used.
It's _more_ of a pain to do that in conjunction with newtypes (e.g. `QName`,
`AttValue`, etc) that make use of `SymbolId`; this will allow us to
construct _those_ statically as well, and additional work to support that
will be coming up.
DEV-10701
These were using GiB of memory, which is ...unnecessary.
I reduced the iteration count significantly, but it was still wasting a lot
of time and memory and needed `with_capacity` to reduce the number of copies
after reallocation.
It is not typical that a buffer would contain this much information.
This broke when I removed `SelfClose`. I used to run
`make all fmt check bench` before every push, but they take a while to run,
in part because it uses nightly and has to recompile too.
But it looks like I need to be more diligent again.
This is exactly what I said I was _not_ going to do in the previous commit,
but apparently hacking late at night had me forget the whole reason that
XIRT is being introduced now---unit tests. I'll be emitting a XIR stream
and I need to parse it for convenience in the tests.
So, here's a good start. Next will be some generalizations that are useful
for the tests as well. This is pretty bare, but accomplishes the task.
See docs for more info.
The `tree` module is getting more difficult to navigate. The tests still
remain where they were, since a bunch of concerns are mixed together. Any
tests specific only to this module will be added here.
This is implemented only for the writer, since its use case is to be able to
concatenate strings without copying during writing.
It doesn't really make sense to support this in XIR Tree, since a reader
should never produce this. But if we ever run into this (e.g. due to some
internal processing pipeline), we'll address it then; XIR Tree might have to
do copying, then, but should probably wait until encountering all fragments
before interning. That'd be a distraction right now.
This commit will make more sense once the broader context is committed, but
it's needed for lowering from `Sections` into a XIR stream.
This will also change once we pre-allocate symbols, like rustc, when the
interner is initialized.
This is my first use of the `paste` crate, which is used to generate
identifiers. So this is partly an experiment, and it seems much better than
having to write a proc macro, at least at this point in time. If this code
stays around, it'll probably be generalized further and used elsewhere, but
I'd prefer not to go this route long-term.
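A sketch of the technique (the macro is contrived, but the `paste!`
identifier-concatenation syntax is the crate's real API):

    use paste::paste;

    macro_rules! static_symbols {
        ($($name:ident : $str:expr),* $(,)?) => {
            paste! {
                $(
                    // Generates e.g. ST_NAME from `name`.
                    pub const [<ST_ $name:upper>]: &str = $str;
                )*
            }
        };
    }

    static_symbols!(name: "name", desc: "desc");
    // Expands to: pub const ST_NAME: &str = "name"; ...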
This moves some logic into `ElementStack` (which would be part of `Stack` if
variants were their own types), rather than peering so deeply into its
data.
This correctly retains and restores the parent stack after processing an
attribute for a child element.
This does increase the size of [`Stack`] a bit, but we can evaluate whether
it's too large at a later time. It's currently 832 bits with `Ix=u32`,
which is large, but the question is whether it matters; we'll see as we
begin to use it.
This moves most of the parsing logic into `Stack`, which rightfully owns the
stack manipulation and state transitions. `ParserState` becomes exactly
what it says it is---a management of the persistent state of the parser, and
is also responsible for digesting tokens and dispatching their data to the
proper event.
This approach has a number of benefits over the old design: it's
self-documenting, making the intent clear; and it is easier to reason about
the subset of states (for both humans and Rusts) than a large match of
transitions.
This contains a number of TODO items that will be addressed shortly. It
also made obvious that the previous commit was incomplete---it doesn't persist
`pstack` for attributes on child elements! That'll be fixed too.
This modifies the tree parser to handle child elements. It's mostly
proof-of-concept code; the next commit will clean it up a bit so that it's
largely self-documenting.
This removes `SelfClose` and merges it with `Close` by making the first
parameter an `Option`. This isn't really ideal, but it really simplifies
pattern matching, especially for the next commit. I'll have more details
there.
The primary motivation was lack of stabilization for binding after `@` in
matches, e.g. `Foo(name, ele) | ele @ Element { name, .. }`. It looks like
it's ready, though; maybe next Rust release?
https://github.com/rust-lang/rust/issues/65490
I don't know if I'll revert this change after then. This seems plenty
clear, albeit more verbose.
This introduces parser errors, but does not yet support error recovery; that
problem will be discussed in a commit in the near future, after the writer
is sorted out a bit more.
DEV-10561
The idea, previously, was that parsing could begin at attributes selectively
and be parsed independently. But that's really awkward with `Tree`, since
it effectively allows orphan attributes as children of an
`Element`. Nonsense.
Instead, if we truly only want an attribute list, we can offer a function to
create a parser with an empty `Stack::BuddingElement` that can accumulate
them.
Previously, `parser_from` was a simple wrapper around `parse`; now, this
provides a more convenient API where `next` will yield the next parsed
object.
See docs for much more information and rationale.
These traits are intended to eliminate boilerplate, primarily in tests, in
situations where from/into is not expected to fail.
Given that TAMER must only panic for internal compiler errors, this should
not often be used outside of test cases. Further, there may be better
options in the future (e.g. QNames could be statically compiled rather than
trying to convert at runtime, in this case).
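The shape is simple (a sketch; the committed trait names may differ):

    use std::fmt::Debug;

    // Panic-on-failure convenience over TryInto, for use where
    // failure represents an internal compiler error (mostly tests).
    trait UnwrapInto<T>: TryInto<T> {
        fn unwrap_into(self) -> T;
    }

    impl<S, T> UnwrapInto<T> for S
    where
        S: TryInto<T>,
        S::Error: Debug,
    {
        fn unwrap_into(self) -> T {
            self.try_into().expect("conversion failed")
        }
    }

    // e.g. in a test: let qn: QName = "pkg".unwrap_into();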
This begins to introduce the XIR tree. I was originally going to wait on
this until after implementing the xmle writer in terms of XIR, but writing
unit tests is too much of a pain on the stream, so now is as good of a time
as any.
This has very limited support so far; it'll be added to as time goes on.
These groups happen to correspond with the sections of the xmle file, which
suggests again that this lives in the wrong place. But I should really have
my focus elsewhere right now, so I don't know if I'll go any further right
now. I guess we'll see as the writer is reimplemented.
`SectionsIter` was introduced to remove that responsibility from xmle
writer, since that's currently being reimplemented using XIR.
The existing iterator has been renamed SectionIter{ator=>} for a more
idiomatic name for iterator structs, and now has a static type rather than
relying on dynamic dispatch. The author of that code wasn't sure how to
handle it otherwise. (Which is understandable, since we were both still
getting acquainted with Rust.) There's no notable change in performance in
my benchmarking.
This abstraction is a bit awkward, in that it's named for object file
sections, but they aren't. Further, it's coupled with the ASG via
`SortableAsg` and perhaps should be generalized into a sorting routine that
takes a function for sorting, so that `Sections` can be moved into xmle's
packages.
This macro is used to consume whitespace so that the following sentence can
start on the next line without producing any whitespace in the output. Its
argument is, therefore, whitespace.
This used to work in earlier versions of Texinfo, but around 6.{6,7} it
began failing because an argument was provided when it wasn't defined with
one.
The return value has no meaningful side-effects at all; the write operation
failing isn't worth pointing out, since it has to be used regardless.
The normal `write` does have useful side-effects, of course.
This change was primarily intended to clean up unit tests. Since it
allocates and returns a new buffer, I do not expect this to have much use
within TAMER itself in the near future. Maybe in later tooling.
If this is abused, person from the future: add `#[cfg(test)]` to its
definition.
I decided not to do this in a previous commit because I had documented
"NodeStream" elsewhere, so I'd like it to be in the Git history to
understand its evolution.
This never was a "Node" stream beyond the initial concept phase, because it
represents tokens that aren't themselves nodes. It is intended to generate
XML nodes, but may need to accommodate non-nodes (e.g. XML declarations) in
the future.
The name originated from `Node`, which was a tree-based IR that was
initially conceived, but removed because it's not yet needed. What we need
is a streaming IR for xmle writing, and then for reading and echoing back
out XML for the new frontend.
This is a working streaming IR for XML. I want to get this committed before
I go further cleaning it up and integrating it into the xmle writer.
This is lacking detailed documentation, and the names of things may end up
changing.
Initial benchmarks do show that it has a ~2x performance improvement over
quick-xml when dealing with two attributes on a node, and I suspect that
improvement will increase with the number of attributes. We will see how it
compares in real-world benchmarks once the linker has been modified to use
it.
The goal isn't to _avoid_ quick-xml---it'll be used in the future for things
like escaping that would be a huge waste to implement ourselves. It just so
happened that quick-xml was not beneficial for these changes; indeed, its
own writer is fairly simple for the portions that were implemented here, so
there's no use in fighting with its API, particularly around attributes and
our need to explicitly control whitespace (with the intent of handling code
formatters in the future).
To put this into perspective: the reason this work is being done isn't to
refactor the linker, or to speed it up, but to generalize XML writing and
provide a suitable IR for use in the compiler. The first step of the
frontend is to essentially echo the XML token stream back out so we can
incrementally parse it and do something useful, to incrementally rewrite the
compiler in Rust.
This adds benchmarking for the memchr crate. It is used primarily by
quick-xml at the moment, but the question is whether to rely on it for
certain operations for XIR.
The benchmarking on an Intel Xeon system shows that memchr and Rust's
contains() perform very similarly on small inputs, matching against a single
character, and so Rust's built-in should be preferred in that case so that
we're using APIs that are familiar to most people.
When larger inputs are compared against, there's a greater benefit (a little
under ~2x).
When comparing against two characters, they are again very close. But look
at when we compare two characters against _multiple_ inputs:
running 24 tests
test large_str::one::memchr_early_match ... bench: 4,938 ns/iter (+/- 124)
test large_str::one::memchr_late_match ... bench: 81,807 ns/iter (+/- 1,153)
test large_str::one::memchr_non_match ... bench: 82,074 ns/iter (+/- 1,062)
test large_str::one::rust_contains_one_byte_early_match ... bench: 9,425 ns/iter (+/- 167)
test large_str::one::rust_contains_one_byte_late_match ... bench: 123,685 ns/iter (+/- 3,728)
test large_str::one::rust_contains_one_byte_non_match ... bench: 123,117 ns/iter (+/- 2,200)
test large_str::one::rust_contains_one_char_early_match ... bench: 9,561 ns/iter (+/- 507)
test large_str::one::rust_contains_one_char_late_match ... bench: 123,929 ns/iter (+/- 2,377)
test large_str::one::rust_contains_one_char_non_match ... bench: 122,989 ns/iter (+/- 2,788)
test large_str::two::memchr2_early_match ... bench: 5,704 ns/iter (+/- 91)
test large_str::two::memchr2_late_match ... bench: 89,194 ns/iter (+/- 8,546)
test large_str::two::memchr2_non_match ... bench: 85,649 ns/iter (+/- 3,879)
test large_str::two::rust_contains_two_char_early_match ... bench: 66,785 ns/iter (+/- 3,385)
test large_str::two::rust_contains_two_char_late_match ... bench: 2,148,064 ns/iter (+/- 21,812)
test large_str::two::rust_contains_two_char_non_match ... bench: 2,322,082 ns/iter (+/- 22,947)
test small_str::one::memchr_mid_match ... bench: 4,737 ns/iter (+/- 842)
test small_str::one::memchr_non_match ... bench: 5,160 ns/iter (+/- 62)
test small_str::one::rust_contains_one_byte_non_match ... bench: 3,930 ns/iter (+/- 35)
test small_str::one::rust_contains_one_char_mid_match ... bench: 3,677 ns/iter (+/- 618)
test small_str::one::rust_contains_one_char_non_match ... bench: 5,415 ns/iter (+/- 221)
test small_str::two::memchr2_mid_match ... bench: 5,488 ns/iter (+/- 888)
test small_str::two::memchr2_non_match ... bench: 6,788 ns/iter (+/- 134)
test small_str::two::rust_contains_two_char_mid_match ... bench: 6,203 ns/iter (+/- 170)
test small_str::two::rust_contains_two_char_non_match ... bench: 7,853 ns/iter (+/- 713)
Yikes.
With that said, we won't be comparing against such large inputs
short-term. The larger strings (fragments) are copied verbatim, and not
compared against---but they _were_ prior to the previous commit that stopped
unescaping and re-escaping.
So: Rust built-ins for inputs that are expected to be small.
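To make the comparison concrete (memchr2 is the memchr crate's actual
two-byte search; the needles here are arbitrary):

    let haystack: &[u8] = br#"<quote attr="...">"#;

    // memchr: one SIMD-accelerated pass over the bytes.
    let found = memchr::memchr2(b'"', b'&', haystack).is_some();

    // Rust built-ins: clearer, but one pass per needle.
    let s = std::str::from_utf8(haystack).unwrap();
    let found_std = s.contains('"') || s.contains('&');

    assert_eq!(found, found_std);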
Fragments' text were unescaped on reading, producing an owned String and
spending time parsing the text to unescape. We were then copying that into
an internment pool (so, copying twice, effectively).
Further, we were then _re-escaping_ on write.
This was all wasteful, since we do not do any manipulation of the fragment
before outputting to the xmle file; we know that Saxon produced properly
escaped XML to begin with, and can trust to propagate it.
This also introduces a new global `clone_uninterned_utf8_unchecked` method.
In profiling this change, I tested (a) before this change, (b) after writing
without escaping, and (c) after both reading escaped and writing without
escaping.
(a) (b) (c)
sec mem (B) sec B sec B
0:00.95 47896 -> 0:00.91 47988 -> 0:00.87 48288
0:00.40 30176 -> 0:00.37 25656 -> 0:00.36 25788
0:00.39 45672 -> 0:00.37 45756 -> 0:00.35 34952
0:00.39 20716 -> 0:00.38 19604 -> 0:00.36 19956
0:00.33 16836 -> 0:00.32 16988 -> 0:00.31 16892
0:00.23 15268 -> 0:00.23 15236 -> 0:00.22 15312
0:00.44 20780 -> 0:00.44 20048 -> 0:00.41 20148
0:00.54 44516 -> 0:00.50 36964 -> 0:00.49 36728
0:00.62 55976 -> 0:00.57 46204 -> 0:00.54 41468
0:00.31 28016 -> 0:00.30 27308 -> 0:00.28 23844
0:00.23 15388 -> 0:00.22 15316 -> 0:00.21 15304
0:00.05 4888 -> 0:00.05 4760 -> 0:00.05 4948
0:00.41 19756 -> 0:00.41 19852 -> 0:00.40 19992
0:00.47 20828 -> 0:00.46 20844 -> 0:00.44 20968
0:00.27 18152 -> 0:00.26 18184 -> 0:00.25 18312
Interestingly, the peak memory usage increases very slightly between the
second and third steps (though decreases from the first), likely because the
raw (encoded) text is larger than the unencoded text (e.g. `&gt;` takes more
space than `>`).
Fragments were previously represented by `String` to avoid the cost of
interning (hashing and copying). This change modifies it to use uninterned
symbols, which does still have a copy overhead but it does not hash.
Initial tests show a small performance decrease of about 15% and a small
memory increase of similar proportion. However, once I realized that I was
not clearing buffers from quick_xml events and implemented that change in a
previous commit, this change ended up being approximately on par with
`String`, despite the copying of some pretty large fragments.
YMMV, though, and perhaps on less powerful systems time may increase
slightly.
The upcoming XIR (XML IR) was originally going to support both owned strings
and symbols, but now we'll just use uninterned symbols; I can't rationalize
complicating the API at this time when it will provide an almost
imperceptible performance benefit. If ever that changes in the future,
that change will be entertained.
The end result is that the fate of a fragment's underlying memory is
determined by whatever is processing the data, _not_ by the API itself---the
API was previously forcing use of a String, whereas now it's up to the
caller to determine whether we want comparable interns. For fragments,
that's not likely ever to be the case, especially considering that the
representation will change so drastically in the future.
This clears the buffers used by quick_xml, which was apparently forgotten
during initial development (I think I expected it to re-use the previously
allocated space automatically).
This has significant effects in some cases. For example, one of our UI
builds drops from ~9KiB to ~5KiB peak memory usage. Other builds for larger
suppliers are only slightly affected because of some of their massive
fragments.
This adds support for uninterned symbols. This came about as I was creating
Xir (not yet committed) where I had to decide if I wanted `SymbolId` for all
values, even though some values (e.g. large text blocks like compiled code
fragments for xmle files) will never be compared, and so would be
wastefully hashed.
Previous IRs used `String`, but that was clumsy; see documentation in this
commit for rationale.
The switch to the `main` branch follows our conventions for other
repositories as we switch to trunk-based development.
Given that main will always be in a deployable state, there's no use in
waiting for tags.
This is an initial implementation optimized for expected use
cases. Hopefully that pans out and doesn't come back to bite me.
Regarding the context: it only allows for interned paths atm, which are
strings (and so must be valid UTF-8, which is fine for us, but sucks for
something more general-purpose). I'll be curious if the context needs
extension later on, or if different contexts will be stored in IRs (e.g. to
store a template application site as well as the location of the expansion
within the template body).
SymbolIds must only be constructed by interners, otherwise we lose
confidence in the type.
This offers an associated function to construct raw SymbolIds from integers
for testing purposes.
This is a major change, and I apologize for it all being in one commit. I
had wanted to break it up, but doing so would have required a significant
amount of temporary work that was not worth doing while I'm the only one
working on this project at the moment.
This accomplishes a number of important things, now that I'm preparing to
write the first compiler frontend for TAMER:
1. `Symbol` has been removed; `SymbolId` is used in its place.
2. Consequently, symbols use 16 or 32 bits, rather than a 64-bit pointer.
3. Using symbols no longer requires dereferencing.
4. **Lifetimes no longer pollute the entire system! (`'i`)**
5. Two global interners are offered to produce `SymbolStr` with `'static`
lifetimes, simplifying lifetime management and borrowing where strings
are still needed.
6. A nice API is provided for interning and lookups (e.g. "foo".intern())
which makes this look like a core feature of Rust.
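A minimal sketch of how such an API can be provided (this simplified
thread-local interner is an illustration only; TAMER's actual types and
trait differ):

  use std::cell::RefCell;
  use std::collections::HashMap;

  #[derive(Debug, Clone, Copy, PartialEq, Eq)]
  struct SymbolId(u32);

  thread_local! {
      // Simplified stand-in for a global thread-local interner.
      static INTERNER: RefCell<HashMap<String, SymbolId>> =
          RefCell::new(HashMap::new());
  }

  trait Intern {
      fn intern(&self) -> SymbolId;
  }

  impl Intern for str {
      fn intern(&self) -> SymbolId {
          INTERNER.with(|i| {
              let mut map = i.borrow_mut();
              let next = SymbolId(map.len() as u32 + 1);
              *map.entry(self.to_string()).or_insert(next)
          })
      }
  }

  fn main() {
      // Reads like a core feature of the language:
      assert_eq!("foo".intern(), "foo".intern());
      assert_ne!("foo".intern(), "bar".intern());
  }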
Unfortunately, making this change required modifications to...virtually
everything. And that serves to emphasize why this change was needed:
_everything_ used symbols, and so there's no use in not providing globals.
I implemented this in a way that still provides for loose coupling through
Rust's trait system. Indeed, Rustc offers a global interner, and I decided
not to go that route initially because it wasn't clear to me that such a
thing was desirable. It didn't become apparent to me, in fact, until the
recent commit where I introduced `SymbolIndexSize` and saw how many things
had to be touched; the linker evolved so rapidly as I was trying to learn
Rust that I lost track of how bad it got.
Further, this shows how the design of the internment system was a bit
naive---I assumed certain requirements that never panned out. In
particular, everything using symbols stored `&'i Symbol<'i>`---that is, a
reference (usize) to an object containing an index (32-bit) and a string
slice (128-bit). So it was a reference to a pretty large value, which was
allocated in the arena alongside the interned string itself.
But, that was assuming that something would need both the symbol index _and_
a readily available string. That's not the case. In fact, it's pretty
clear that interning happens at the beginning of execution, that `SymbolId`
is all that's needed during processing (unless an error occurs; more on that
below); and it's not until _the very end_ that we need to retrieve interned
strings from the pool to write either to a file or to display to the
user. It was horribly wasteful!
So `SymbolId` solves the lifetime issue in itself for most systems, but it
still requires that an interner be available for anything that needs to
create or resolve symbols, which, as it turns out, is still a lot of
things. Therefore, I decided to implement them as thread-local static
variables, which is very similar to what Rustc does itself (Rustc's are
scoped). TAMER does not use threads, so the resulting `'static` lifetime
should be just fine for now. Eventually I'd like to implement `!Send` and
`!Sync`, though, to prevent references from escaping the thread (as noted in
the patch); I can't do that yet, since the feature has not yet been
stabilized.
In the end, this leaves us with a system that's much easier to use and
maintain; hopefully easier for newcomers to get into without having to deal
with so many complex lifetimes; and a nice API that makes it a pleasure to
work with symbols.
Admittedly, the `SymbolIndexSize` adds some complexity, and we'll see if I
end up regretting that down the line, but it exists for an important reason:
the `Span` and other structures that'll be introduced need to pack a lot of
data into 64 bits so they can be freely copied around to keep lifetimes
simple without wreaking havoc in other ways, but a 32-bit symbol size needed
by the linker is too large for that. (Actually, the linker doesn't yet need
32 bits for our systems, but it's going to in the somewhat near future
unless we optimize away a bunch of symbols...but I'd really rather not have
the linker hit a limit that requires a lot of code changes to resolve).
Rustc uses interned spans when they exceed 8 bytes, but I'd prefer to avoid
that for now. Most systems can just use one of the `PkgSymbolId` or
`ProgSymbolId` type aliases and not have to worry about it. Systems that
are actually shared between the compiler and the linker do, though, but it's
not like we don't already have a bunch of trait bounds.
Of course, as we implement link-time optimizations (LTO) in the future, it's
possible most things will need the size and I'll grow frustrated with that
and possibly revisit this. We shall see.
Anyway, this was exhausting...and...onward to the first frontend!
Oh boy. What a mess of a change.
This demonstrates some significant issues we have with Symbol. I had
originally modelled the system a bit after Rustc's, but deviated in certain
regards:
1. This has a configurable base type to enable better packing without bit
twiddling and potentially unsafe tricks I'd rather avoid unless
necessary; and
2. The lifetime is not static, and there is no global, singleton interner;
and
3. I pass around references to a Symbol rather than passing around an
index into an interner.
For #3---this is done because there's no singleton interner and therefore
resolving a symbol requires a direct reference to an available interner. It
also wasn't clear to me (and still isn't, in fact) whether more than one
interner may be used for different contexts.
But, that doesn't preclude removing lifetimes and just passing around
indexes; in fact, I plan to do this in the frontend where the parser and
such will have direct interner access and can therefore just look up based
on a symbol index. We could reserve references for situations where
exposing an interner would be undesirable.
Anyway, more to come...
As mentioned in the previous commit, this flips the types such that the base
type is the primitive and the associated type is the `NonZero*` type; this
is much more natural, concise, and allows Rust to infer the proper type in
most every situation.
The next step will be to stop defaulting the index type for SymbolIndex and
related, since we are about to care very much what size it is (compiler
vs. linker).
This was previously a NonZeroU32, but it was intended to support NonZeroU16
as well for packages, so that we can fit symbols into smaller spaces. In
particular, the upcoming Span wants to fit within 8 bytes, and so requires a
smaller SymbolIndex type.
I'm unhappy with this current implementation, and so comments are unfinished
and there are a couple ignores for dead code warnings. I want to flip the
`SupportedSymbolIndex` trait so that users can specify the primitive rather
than the NonZero* type, which is really awkward-looking and verbose,
especially if you have to do `SymbolIndex::<NonZeroU32>::from_int` or
something. It also prevents (at least in the cases I've observed) Rust from
inferring the proper type for you based on the argument you provide.
So, the goal will be `SymbolIndex::<u32>::from_int(n)`, for example.
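A sketch of that flipped arrangement (the contents of the trait here are
assumptions for illustration):

  use std::num::{NonZeroU16, NonZeroU32};

  // The primitive is the type parameter; the NonZero* type is the
  // associated type, rather than the other way around.
  trait SymbolIndexSize: Sized {
      type NonZero: Copy;
      fn non_zero(self) -> Option<Self::NonZero>;
  }

  impl SymbolIndexSize for u16 {
      type NonZero = NonZeroU16;
      fn non_zero(self) -> Option<NonZeroU16> {
          NonZeroU16::new(self)
      }
  }

  impl SymbolIndexSize for u32 {
      type NonZero = NonZeroU32;
      fn non_zero(self) -> Option<NonZeroU32> {
          NonZeroU32::new(self)
      }
  }

  struct SymbolIndex<Ix: SymbolIndexSize>(Ix::NonZero);

  impl<Ix: SymbolIndexSize> SymbolIndex<Ix> {
      fn from_int(n: Ix) -> Option<Self> {
          n.non_zero().map(SymbolIndex)
      }
  }

  fn main() {
      // The concise form works, and the index type can even be
      // inferred from the argument:
      let _a = SymbolIndex::<u32>::from_int(42);
      let _b = SymbolIndex::from_int(42u16);
  }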
The first step in the process is to emit the raw XML events that can then be
immediately output again to echo the results into another file. This will
then allow us to begin parsing the input incrementally, and begin to morph
the output into a real `xmlo` file.
This introduces the beginnings of frontends for TAMER, gated behind a
`wip-features` flag.
This will be introduced in stages:
1. Replace the existing copy with a parser-based copy (echo back out the
tokens), when the flag is on.
2. Begin to parse portions of the source, augmenting the output xmlo (xmli
at the moment). The XSLT-based compiler will be modified to skip
compilation steps as necessary.
As portions of the compilation are implemented in TAMER, they'll be placed
behind their own feature flags and stabilized, which will incrementally
remove the compilation steps from the XSLT-based system. The result should
be substantial incremental performance improvements.
Short-term, the priorities for loading identifiers into an IR
are (though the order may change):
1. Echo.
2. Imports.
3. Extern declarations.
4. Simple identifiers (e.g. param, const, template, etc).
5. Classifications.
6. Documentation expressions.
7. Calculation expressions.
8. Template applications.
9. Template definitions.
10. Inline templates.
After each of those are done, the resulting xmlo (xmli) will have fully
reconstructed the source document from the IR produced during parsing.
This was incorrect to begin with---it does not make sense that an input
mapping should depend upon the identifier that it maps to, in the sense that
we make use of these dependencies. If we add weak symbol references in the
future, then this can be reintroduced.
By removing this, we free tameld from having to perform the check itself.
.rev-xmlo bumped to force rebuilding of object files since the linker now
expects that no such dependencies will exist within them.
This is something that changed when the TAMER POC was initially created, as
I was learning Rust. I don't recall the original reason why this was moved,
but it could have been moved back long ago.
In our systems, constants can hold tables (as matrices) with tens or
hundreds of thousands of rows, and there are a number of them in certain
projects. As an example, the YAML-based test cases for one of our systems
went from ~2m30s to ~45s after this change was made. Much of the cost
savings comes from reduced GC.
This can occur in generated code (e.g. from proguic if a question-based
predicate inherits a predicate already specified). This commit does not
change anything that's emitted; it merely allows proceeding.
TAMER can be smarter about this; I don't want to invest more time into
generalizing deduplication of predicates.
There was a bug whereby TRUE matches would keep whatever value was being
matched on, even if it was not a boolean. That was an oversight from the
proof-of-concept code, and this fixes it; that's why this is behind a flag!
This also adjusts the class aliasing optimization so that it doesn't check
for a `TRUE` symbol name, which was a bad idea to begin with.
This change also ends up expanding `lv:match[@value="TRUE"]` into the long
form, where it didn't previously; this will result in slightly larger xmlo
files in some cases, but it's nothing significant, and it does not impact
compilation times.
This is a nearly-10-year-old bug that was introduced when the Summary Page
was modified to use the then-new symbol table. The compiler previously
concatenated all packages into a single XML tree and processed that, so no
package resolution was necessary here before.
A long time ago (about a decade), package names were required, but they are
now generated by the compiler relative to the root path. The name here was
incorrect, which was generating an incorrect path for the linked symbols,
which was causing problems with the Summary Page.
See RELEASES.md for a list of changes.
This was a significant effort that began about six months ago, but was
paused at a number of points. Rather than risking further pauses from
interruptions, the new classification system has been gated behind a
package-level feature flag, since it causes BC breaks in certain buggy
situations.
Since this flag was introduced late, there is the potential that it causes
bugs when new optimizations are mixed with the old system.
This largely reintroduces the legacy classification system, but there are a
number of things that are not affected by the flag. For example:
1. Alias classifications are still optimized when the flag is off;
2. Classifications without predicates emit slightly different code than
before, though their functionality has not changed;
3. There's been a lot of refactoring and minor optimizations that are
unaffected by the flag;
4. lv:match/@pattern will now emit a warning; and
5. Cleaning and casting of input data is not gated.
This allows us to incrementally migrate to the new system where behavior may
be different, but this is admittedly a bit dangerous in that the new system
was aggressively tested and reasoned about, so reintroducing the legacy
system may combine in unexpected ways.
This is another significant milestone.
The next logical step with classification optimization is to inline all of
those intermediate classifications generated from any and all blocks, since
there are so many of them. This means having the parent classification
absorb all dependencies; not output dependencies for the classification; not
compile the assignments for those classifications; and to inline them at the
match site. They're used only once, since they're generated for each
individual block.
We need to keep the actual classification generation around (and just inline
them) for now, probably until TAMER, because we depend upon their symbol for
determining their dimensionality, which we need for the optimization work we
just did---we must inline them into the proper group (matrix, vector, or
scalar).
The optimization work done up to this point had inlining in mind---only a
little bit of work was needed to make sure that every classification can
simply be stripped of its assignment and be a valid expression that can be
inlined in place of the original reference.
The result of that was predictably significant for the `ui/package` program
that I've been testing with:
- 4,514 classifications were inlined;
- The file size dropped to 7.5MiB (from 8.2MiB previously---remember that
we started at 16MiB); and
- GC ticks were cut in half, from 67->31.
Unfortunately, this optimization added nearly 1m of time to the compilation
of that program. Speaking from the future: the UI build optimizations in
liza-proguic were introduced to offset this difference (and provide a net
gain in performance).
This converts disjunctive classifications into conjunctive ones and places
an <any> within them.
This ends up handling all the generated qwhen classifications from proguic,
which were probably converted into <any> by a previous optimization pass.
The UI program I've been using to test these compiler optimizations has
decreased in size down from 8.2MiB since the beginning of this branch; we
started at ~16MiB.
See comments. This is meant to help mitigate the damage done by one of our
code generation systems. The benefit is significant, allowing the code
generator to remain simple. By placing this optimization within the
compiler, hand-written and template-generated code also benefit.
Rather than extracting every any/all into their own classifications,
eliminate them (and replace them with their body) if they contain only one
predicate. This is most likely to happen after template expansion, and
there were an alarming number of them in our system.
Stripping them out of one of our programs saved ~0.2MiB of output, and
removed many intermediate classifications. It removed ~1,075 lines, which
should correspond closely to the actual number of classifications.
Discovering this required stripping the template barriers, which was done in
a previous commit.
Unfortunately, the performance improvement from this wasn't significant,
largely because of the nondeterminism of GC, which can easily mask the
gains. But a new line `v8::internal::FixedArray::set(int,
v8::internal::Object)` appeared in the profiler output, making me wonder
whether the JIT is starting to understand more interesting properties of the
system.
`mprotect` and `v8::internal::heap_internals::GenerationalBarrier` also
appeared, which are related to GC.
!!!
(Message from the future: this ends up being reintroduced and the new
classification system being placed behind a feature toggle. But it will be
eliminated eventually.)
This is a major milestone for class optimization---the old anyValue-based
system is no longer in use; the classification system has been wholly
rewritten.
The ticks in the sampling profiler are now where they should be, open to
further optimization with a much more solid foundation.
[JavaScript]:
ticks total nonlib name
5 0.6% 3.0% LazyCompile: *vu [...]/ui/package.strip.js:25191:16
5 0.6% 3.0% LazyCompile: *M [...]/ui/package.strip.js:25267:15
3 0.4% 1.8% LazyCompile: *vmu [...]/ui/package.strip.js:25144:17
3 0.4% 1.8% LazyCompile: *ve [...]/ui/package.strip.js:25204:16
2 0.2% 1.2% LazyCompile: *precision [...]/ui/package.strip.js:25137:23
2 0.2% 1.2% LazyCompile: *me [...]/ui/package.strip.js:25178:16
2 0.2% 1.2% LazyCompile: *cmatch [...]/ui/package.strip.js:25495:20
2 0.2% 1.2% LazyCompile: *ceq [...]/ui/package.strip.js:25273:17
1 0.1% 0.6% LazyCompile: *init_defaults [...]/ui/package.strip.js:25624:27
1 0.1% 0.6% LazyCompile: *MM [...]/ui/package.strip.js:25268:16
1 0.1% 0.6% LazyCompile: *E [...]/ui/package.strip.js:25239:15
1 0.1% 0.6% LazyCompile: *<anonymous> [...]/ui/package.strip.js:25184:13
1 0.1% 0.6% LazyCompile: *<anonymous> [...]/ui/package.strip.js:25171:13
Much better than the 102 ticks that anyValue was taking some time ago!
A lot of time used to be spent compiling functions as well, a lot of which
was removed by previous commits, bringing us to:
[C++]:
ticks total nonlib name
50 5.9% 30.5% node::contextify::ContextifyContext::CompileFunction(v8::FunctionCallbackInfo<v8::Value> const&)
20 2.4% 12.2% write
9 1.1% 5.5% node::native_module::NativeModuleEnv::CompileFunction(v8::FunctionCallbackInfo<v8::Value> const&)
6 0.7% 3.7% __pthread_cond_timedwait
4 0.5% 2.4% mmap
All of this work has simplified the output enough that it's obviated a slew
of other optimizations that can be done in future work, though a lot of that
may wait for TAMER, since performing them in XSLT will be difficult and not
performant; the compiler is slow enough as it is.
This shaves ~1m off of the total build time for our largest system. Output
is impressively slow.
Around this point in time, we have the following profile from V8's sampling
profiler:
[JavaScript]:
ticks total nonlib name
36 2.8% 10.7% LazyCompile: *anyValue [...]/ui/package.strip.new.js:31020:22
3 0.2% 0.9% LazyCompile: *m1v1u [...]/ui/package.strip.new.js:30941:19
2 0.2% 0.6% LazyCompile: *precision [...]/ui/package.strip.new.js:30934:23
1 0.1% 0.3% LazyCompile: *vu [...]/ui/package.strip.new.js:30964:16
1 0.1% 0.3% LazyCompile: *init_defaults [...]/ui/package.strip.new.js:31341:27
This allows us to easily see their shape by looking at the compiled
code. See
the previous commit for more of an explanation and examples. And future
commits.
This allows us to analyze the compiler runlog and determine the frequency of
certain shapes to prioritize optimization efforts.
This is a proof-of-concept. It also contains arrow functions, which do not
exist in ES5.
The notation m#v#s# refers to matrix, vector, and scalar counts of a
classification. This optimization therefore focuses on classifications with
a single vector and a single matrix.
I'd like to note that this commit message was written in retrospect, months
later, after I returned to these proof-of-concept commits to finalize
them. I'll try my best to have things make sense in a historical context
based on my notes.
The choice to focus on m1v1 was based on taking survey of the shape of
classifications in our largest rating system. m1v*, and specifically m1v1,
was the largest by far, followed by v1s1. Here's an example program used
for a UI:
$ grep -h 'internal: [svm][0-9]\+[svm][0-9]\+ ' run*.log > result
$ cut -d' ' -f2 result | sort | uniq -c | sort -rn
10056 m1v1
1788 m1v2
473 v1s1
18 v2s1
13 v1s5
8 v1s3
7 v1s2
4 v2s5
2 v4s4
2 v4s2
2 v2s8
2 v2s6
2 v1s9
2 v1s4
1 v7s7
1 v6s2
1 v5s7
1 v5s5
1 v5s4
1 v5s2
1 v4s9
1 v4s7
1 v4s3
1 v3s9
1 v3s7
1 v3s5
1 v3s2
1 v3s1
1 v33s21
1 v2s60
1 v2s4
1 v2s3
1 v2s2
1 v28s1
1 v23s8
1 v22s9
1 v1s8
1 v1s6
1 v18s24
1 v15s14
1 v14s6
1 v14s5
1 v13s7
1 v13s6
1 v12s6
1 v11s1
1 m76v7
1 m3v1
1 m1v3
1 m1374v1
The excessively large ones (like the last one) are aggregate classifications
that are generated by a template. But note the first count.
Here's another example, one of the raters:
8812 m1v1
311 v1s1
17 v2s1
14 v1s5
4 v2s5
4 v1s6
4 v11s10
3 v3s1
3 v1s8
2 v5s14
2 v4s7
2 v3s9
2 v3s5
2 v2s4
2 v1s9
2 v1s4
2 v1s2
1 v8s7
1 v7s7
1 v7s15
1 v6s4
1 v6s2
1 v6s10
1 v5s8
1 v5s7
1 v5s4
1 v5s2
1 v53s9
1 v4s9
1 v4s4
1 v4s3
1 v4s2
1 v4s11
1 v3s8
1 v3s7
1 v3s20
1 v3s2
1 v3s19
1 v3s15
1 v2s8
1 v2s60
1 v2s6
1 v2s2
1 v2s12
1 v29s20
1 v28s1
1 v23s8
1 v1s3
1 v15s23
1 v13s6
1 v13s20
1 v12s6
1 v12s10
1 v11s1
1 m1v2
1 m1s1
Given these examples, m1v1 is an easy first choice for this commit.
The general pattern for this commit and those that follow is to match on a
specific shape of classification that we're optimizing for, falling back to
the old anyValue-based system for all other cases, with the intent of
eventually removing it.
This has long been a curse, and I don't know why I didn't resolve it sooner.
This makes explicit some of the odd things that this is doing, to maintain
the previous behavior. Changing that behavior would be ideal, but ought to
be done separately and put behind a feature flag.
This reverts commit e2d9467633bb75d79dbc8fe9f8971bfa412ea59f.
BUT: it does cause more data to be returned, perhaps unnecessarily. See if
that may offset the slight increase in GC cost.
Further, we may end up getting rid of some of these generated values; check
after we do some class optimizations.
This was a waste of time; it actually reduces performance slightly and
increases GC, unintuitively enough.
Leaving commit here and reverting to keep it for reference.
When the Summary Page was _first written_ (the first part of TAME), it was
compiled in the browser---development consisted of refreshing the page,
which was similar to how we wrote PHP at the time. No compile process.
In that situation, we couldn't have the XSLT stylesheet failing to
translate. But of course those days are long since gone, and this must be a
compile-time error.
It shouldn't ever get to this point, granted.
Single-predicate classifications matching on TRUE can be optimized into
aliases. These sometimes occur in hand-written code, but can also be
generated by templates.
A previous commit used a rustdoc tool lint, but that support wasn't added
until 1.52.0 (2021-05-06).
Note that this represents the minimum _required_ version to build TAMER; you
can use a later version.
This template prepares for the introduction of the new classification
system, which is a full rewrite that is both more performant and more
correct in its behavior. Unfortunately, the corrections will cause problems
with old code that may be relying on certain cases, particularly where
undefined values are implicitly treated as zero.
Consequently, the legacy and new systems will exist side-by-side, able to be
toggled on as desired so people can verify that behavior is correct before
we switch it on by default. This template allows switching on the system
for an entire package (if it's placed at the toplevel), or portions of a
package, though the latter should only be used in exceptional circumstances.
See the test cases in commits to follow for more information.
This package is not used today. See RELEASES.md for more information; this
is a dangerous package that never should have existed.
This also fixes the test suite.
The classification system rewrite removed the debug value collection that
previously existed. It didn't make a whole lot of sense anyway, given that
that compiler rearranges matches.
This falls back to showing the value of the @on, which should be good
enough, and is honestly better than what we had before.
This provides an element-level rather than row-level focus, which I feel is
more appropriate.
One could draw lines to connect each of the elements, but that'd likely be
too noisy and it'd be a lot of work.
This starts with the Hadamard Product as an example. It also:
- Configures BibLaTeX with biber.
- Renames \undef, since BibLaTeX apparently defines it.
- Redefines the citation and url colors, since they're bright and ugly.
This is just some plain English to go along with and help rationalize the
text. Further rationale will be provided in a dedicated section in the
future; such information is vitally important to understand why the system
evolved as it did.
I find this provides a visualization that is likely to be significantly more
intuitive for others. It even holds when the matrix is not
rectangular (yes, I know, it's not really a matrix then), so long as all
matrices share the same respective K_j.
This uses the same variable subscript on \equiv itself to define the symbol,
rather than the previous symbol which looked like equiv rotated, but also
looked too much like a turnstile used for "infer", a metalanguage construct
that is not appropriate here. It kept bothering me.
This represents the old cmatch system (which is in use today, but the
classification system has since been rewritten, though it has not yet been
merged). It was my attempt over a decade ago to reason about how this
system ought to work.
I think it's fair to say that this is absolute insanity and that the new
formulation is significantly better.
I removed this when I added concmath, thinking that it would include it for
me, and apparently I never re-added it after realizing that it didn't.
I'm a big fan of the typography of Concrete Mathematics.
The subscript of the matrix family adds too much vertical space. This
offsets that to restore it to about what it otherwise would be, since the
second subscript does not get in the way.
This was originally going to be used to define @yields for the classifier,
but I took a very different approach which doesn't require reasoning about
the system in terms of recursion.
This defines @as and @yields, but does not yet define matches formally.
It's also missing index entries, which I'll take the time to add after I'm
sure things are staying as they are.
This was quite a bit of work, and the approach I took is different than I
originally expected, so Section 0 can use some cleanup.
There is more to come from here.
This is going to evolve a great deal, and note that the yield definition is
completely absent.
It may be time to switch to natural deduction (Gentzen-style).
That macro previously expanded into \Classify, but that was undone before
committing to make it clear when one is referring to a variable vs. a
classification as a definition.
This will be used as an IR of sorts to eliminate the XML, which will be far
too verbose to use in proofs. It also allows us to attach behavior to the
operator, which will end up defining two values for @as and @yields.
The previously-existing notation for this has been removed. These will be
updated soon to account for vectors and matrices, but until then, this is
simply nonsense.
This is an unnecessary feature to maintain right now. I will include
symbols at the very beginning of the index, which is common in mathematics
texts, and may add a table of common symbols in the future.
Stacking originally seemed like a good idea, but perhaps this does read a
bit better (and looks more like the composition operation being applied),
and composes a bit better if we needed e.g. \bicomp\bicomp{R}.
It's also less ambiguous when it's over a larger expression. For example,
\bicomp{[A]} places \circ over top of the A, which looks as if it's
[\bicomp{A}]. It's obvious what the intention is in that context, since
\bicomp{A} makes no sense, but there could be other situations where it
doesn't. With this change, it results in {[A]}^\circ.
There's a lot of change that's likely going to take place with this thing,
but it's a start. The abstract summarizes the purpose of this---to formally
define TAME in terms of algebra, first-order logic, and [ZFC] set theory.
This came about while working on compiler changes and optimizations, since
it's difficult to ensure correctness (and discover further optimizations)
without being able to formally define the language. The focus at the moment
is the classification system rewrite, which can be expressed in terms of
first order logic and set theory.
This commit contains essentially a POC with some carefully chosen
mathematical foundations (abstractions of which are subject to change) and a
basic representation of a subset of the classification system for scalars.
%.xml{=>o}: %.csvo rater/core/vector/table.xmlo
That is: we'll only build an object file when we try to build another object
file. This was causing problems with dependency generation, because it was
triggering compilation early.
This should have been done many years ago. This will determine if any of
the dependencies have changed for the included suppliers.mk and regenerate
it as needed, without the developer having to do so manually when imports
change.
* build-aux/Makefile.am (suppliers.mk): Invoke ant with `-q` to eliminate
"processing" messages for each and every file. This also speeds up
operation slightly.
* build-aux/gen-make: Remove information echos for each file.
These changes will allow for suppliers.mk to be regenerated automatically
without being so invasive.
The intent originally was to try to keep developers to a reasonable name
length, but generated identifiers can easily exceed this, and we further do
not support namespacing.
This can be handled at a template level instead for enforcing naming
conventions.
This had gotten quite out of date from the actual rater.xsd, which existed
outside of this repository, that is used during our build process. That was
an unintended artifact from moving files around.
That file has been removed and symlinked to this one.
Note: this really belongs in liza-proguic, and should be moved in the near
future.
liza-proguic is being modified to generate step-level packages, which are
significantly faster to build than larger ones (XSLT TAME scales
terribly). These changes handle those new dependencies.
One important thing to note with this change is that suppliers.mk now
requires proguic to have run before generation so that those generated
dependencies can be properly examined. This is a quick operation, so that
is not problematic.
This also depends on the .version.xml change that was previously made: when
the timestamp changed every time, we got into an infinite build loop.
First thing to note: this belongs in liza-proguic, not here. But it's here
right now, so for now I'm making the change. The relationship between TAME
and proguic is awkward and will hopefully be improved upon in the near
future.
As for this actual change: step-level fragments will be concatenated such
that the imports will appear at the step level rather than the root.
This will be generated automatically by the Makefile. It's not appropriate
to generate in the configure script, and I do not recall why I did
so---possibly to work around the issue of delayed tab completion when it
needs regeneration?
This removes suppmk-gen in favor of more generic Makefile targets---in this
case, having `%.tdat` depend upon `rater/core/tdat.xml`, even though that's
not quite true (the %.xml file generated from it needs it). But these files
are going away soon; a pending TAME optimization branch removes support for
the underlying pattern primitive entirely; CSVMs should be used instead.
The timestamp of the file will now only be updated if the hash (version)
_actually_ changes. This allows this to be used as a target dependency
without forcing a rebuild each and every time.
This solves issues of hitting stack limits, particularly in browsers, when
querying matrices that return a large number of rows for one or more
predicates.
We were still having issues with this function when taking the positive
branch, when predicates cause many matches within tables. This was causing
us to hit stack limits in certain browsers on the Summary Page.
This converts it to an iterator so that all branches are tail-recursive, and
then enables TCO on them.
I was disappointed to find that there's little performance or memory benefit
in running our test suite.
I did say it was _experimental_ guided TCO.
This waits to perform the actual argument reassignment until after
processing the expressions associated with the new arguments, since they
will otherwise be replaced when their original values are still needed.
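A tiny illustration of the hazard (Rust, with made-up arguments):

  fn reassign(mut a: u64, mut b: u64) -> (u64, u64) {
      // Naive in-place reassignment clobbers `a` while its original
      // value is still needed:
      //     a = b;
      //     b = a + b; // wrong: reads the *new* a
      // Processing the new argument expressions first avoids this:
      let (next_a, next_b) = (b, a + b);
      a = next_a;
      b = next_b;
      (a, b)
  }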
This change simply prevents failure in such situations (e.g. on invalidated
fields in Liza). We'll worry about proper errors and correctness, which
ought to be compile-time, in TAMER.
The MathJax CDN stopped working in April 2017. I updated it to the
recommended CDN, pinned to the last version from April 2017, to ensure it
works as it did before the CDN stopped.
I added the checksum to verify the integrity of the script.
This problem manifested when the name of the attempted classification is the
same name as another object. For example, if we have `t:match-class
name="foo"`, and `foo` is a param instead of a class, then `@yields` will
fail, and it'd fall back to matching on the param.
This is absolutely not what we want.
The error message in this context is ugly, but it does work.
Example:
!!! Unknown match @on (/lv:package/lv:classify/match): `error: unable to
determine @yields for class `scheduled_ai' (has the class been imported?)'
is unknown for classification --vis-scheduled-ai-type
This was urgently needed for a project using TAME. Somehow, we've gone
all of these years without a table in which the first predicate is unable
to filter out enough results to keep us within stack limits.
Each recursive step of mrange before inlining and TCO, at the time of
writing, was adding eight stack frames. This is because each let (and many
other things) compile into self-applying functions. Since mrange is invoked
once for every single row for a given value, we quickly run out of stack
space.
For example, consider this table:
1, $a, $b
2, $a, $b
2, $b, $c
2, $c, $d
3, $a, $b
If we were to filter the first column on the value 2, it would first bisect
to find the middle row, backtrack to the first, and then move forward to the
last, producing:
2, $a, $b
2, $b, $c
2, $c, $d
This is at least three mrange calls, for a potential total of 8*3=24 stack
frames, depending on implementation details I don't quite recall at the
moment about how the query system works.
We had over 1000 rows after applying the first predicate; the stack was
exhausted before it could even reach the last row.
Tail call optimization (TCO) is the process of turning recursive calls in
tail position into jumps. So, rather than the stack growing on a recursive
call, it stays constant. A common way to accomplish this in stack-based
languages is using a trampoline.
In our case, we enclose the entirety of the function in a `do` loop, and
clear a flag indicating that a tail call took place. When we reach a
recursive tail call, we set that flag. Then, instead of invoking the
function again, we _overwrite the original arguments_ with their new
values, and simply return 0. When the function hits the end of the loop, it
will see that the flag is set, and jump back to the beginning of the
function, starting all over with the new values.
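Here's that mechanism sketched in Rust for illustration (the real output is
JavaScript generated by the XSLT compiler; the recursive sum is made up):

  fn sum(mut n: u64, mut acc: u64) -> u64 {
      loop {
          // Clear the flag indicating that a tail call took place.
          let mut tail_call = false;

          if n > 0 {
              // The recursive tail call `sum(n - 1, acc + n)` instead
              // overwrites the original arguments and sets the flag.
              acc += n;
              n -= 1;
              tail_call = true;
          }

          if !tail_call {
              // No tail call took place; return normally.
              return acc;
          }
          // Otherwise jump back to the top with the new values,
          // keeping the stack constant.
      }
  }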
Compiling in this functionality is not difficult. Tracking whether a given
call is in tail position, however, is a bit of a pain given how the XSLT
code is currently written. Given that this is all being replaced with
TAMER, it's difficult to stomach making too many changes to the compiler,
when we can do it properly in the future with TAMER. But we need the
feature now.
As a compromise, I call this implementation "guided" TCO---we rely on a
human to indicate that a call is in tail position by setting an experimental
flag manually. That frees us from having to have the compiler do it, but
does create some nasty problems if the human is wrong. Consequently, this
should only be used in core, and people should not use it unless they know
what they're doing.
Using this feature currently outputs a warning---that way, if there are
problems, people have some idea of where they maybe can look. The warning
will be removed in the future after this has been in production for some
time (granted, our test suite passes).
Once again: TAMER will implement proper tail calls automatically, without
the need for a human to intervene.
For more information on tail calls:
- https://en.wikipedia.org/wiki/Tail_call
This implements TCO in the XSLT compiler by requiring a human to manually
indicate when a recursive call is in tail position. This was somewhat
urgently needed to resolve stack exhaustion on large rate tables.
TAMER will do this properly by determining itself whether a call is in tail
position. Until then, this will serve as a test for this type of feature.
This handles moving to another repository structure (our gigarepo) where
this relative path is no longer true. The absolute path generated by this
is okay since it's ephemeral and only used for this build invocation.
This checks explicitly for unresolved objects while sorting and provides an
explicit error for them. For example, this will catch externs that have no
concrete resolution.
This previously fell all the way through to the unreachable! block. The old
POC implementation was catching unresolved objects, albeit with a debug
error.
This will be used for the next commit, but this change has been isolated
both because it distracts from the implementation change in the next commit,
and because it cleans up the code by removing the need for a type parameter
on `AsgError`.
Note that the sort test cases now use `unwrap` instead of having
`{,Sortable}AsgError` support one or the other---this is because that does
not currently happen in practice, and there is not supposed to be a
hierarchy; they are siblings (though perhaps their name may imply otherwise).
The only reason this function was a method of `BaseAsg` was because of
`self.graph`, which is accessible within the scope of this
module. `check_cycles` is logically associated with `SortableAsg`, and so
should exist alongside it (though it can't exist as an associated function
of that trait).
Merge branch 'jira-7504'
* jira-7504:
[DEV-7504] Update RELEASES.md to make it less technical
[DEV-7504] Add cypher script for post-graph import
[DEV-7504] Add make target for "graphml"
[DEV-7504] Add GraphML generation
We want to be able to build a representation of the dependency graph so
we can easily inspect it.
We do not want to make GraphML by default. It is better to use a tool.
We use "petgraph-graphml".
This was never completed and can eventually be deleted entirely, but I
didn't want to lose this history by having it sit out in a branch. Joe is
working on something better.
This begins providing release notes for changes and provides scripts to
facilitate this:
- tools/mkrelease will update RELEASES.md and run some checks.
- build-aux/release-check is intended for use in pipelines (e.g. see
.gitlab-ci.yml) to verify that releases were done properly.
This was originally omitted because there wasn't a use case for it. Now
that we're adding context to errors, however, an owned value is highly
desirable.
This adds almost no measurable overhead to the internment system in
benchmarks (largely within the margin of error).
This is a union (sum type) of three other errors types, plus errors specific
to this builder.
This commit does a good job demonstrating the boilerplate, as well as a need
for additional context (in the case of `IdentKindError`), that we'll want to
work on abstracting away.
The `Debug` bound is inconvenient and requires propagation to any types that
use it. Further, it's really awkward having `Display` depend on `Debug`; if
we want to render a useful display here, we can write one.
To be clear: IndexType implements Debug.
For now, this is pretty-printed by another part of the code, which we don't
want to implement in `Display` because it requires looking things up from
the graph.
This flips the API from using XmloWriter as the context to using Asg and
consuming anything that can produce XmloResults. This not only makes more
sense, but avoids having to create a trait for XmloReader, and simplifies
the trait bounds we have to concern ourselves with.
This just tidies things up a little bit before I get into some further
refactoring. I wrote the original code when I was just learning Rust not
too long ago, so it's interesting to see how my understanding has changed
over that relatively short period of time.
This abstracts away the canonicalizer and solves the problem whereby
canonicalization was not being performed prior to recording whether a path
has been visited. This ensures that multiple relative paths to the same
file will be properly recognized as visited.
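A minimal sketch of the corrected behavior (names here are assumptions, not
TAMER's actual API):

  use std::collections::HashSet;
  use std::fs;
  use std::io;
  use std::path::{Path, PathBuf};

  #[derive(Default)]
  struct VisitedFiles {
      seen: HashSet<PathBuf>,
  }

  impl VisitedFiles {
      // Canonicalize *before* recording, so multiple relative paths
      // to the same file are recognized as already visited.
      // Returns true only the first time a file is seen.
      fn visit(&mut self, path: &Path) -> io::Result<bool> {
          let canon = fs::canonicalize(path)?;
          Ok(self.seen.insert(canon))
      }
  }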
This will be entirely replaced in an upcoming commit. See that for
details. I don't feel like dealing with the conflicts for rearranging and
squashing these commits.
This also includes an implementation to visit paths only once. Note that it
does not yet canonicalize the path before visiting, so relative paths to the
same file can slip through, and relative paths to _different_ files could be
erroneously considered to have been visited.
This will be fixed in an upcoming commit.
This serves as a constructor for the time being, decoupling from POC. We
may do something better once we have a better idea of how the various
abstractions around this will evolve.
Replacing the existing macros with templates will allow us to no longer
have to deal with macros in the new compiler.
The `indexNameType` pattern needed to change to allow for variables. I
also had to remove the prefix for the `gentle-no` option of `rate`.
Create a "yield" and add backwards compatibility for the macro of the
same name. This is one of 2 macros that need to be replaced so we do not
have to worry about them with the new compiler.
Add a stub executable that will eventually become a full-featured TAME
compiler. The first implementation will only copy the source file to an
intermediary file that will be compiled by the XSLT compiler.
Add a new step to the build process that copies the `xml` file to an
`xmli` file. Eventually, the new compiler will create the `xmli` file
and the old compiler will convert it to an `xmle` file during the
transition.
This is an awkward system that I'd like to remove at some point. It adds
complexity. For the meantime, overrides have been arbitrarily restricted to
a single override (no override-override). But it's needed for now until we
rework maps and can handle the illusion of overrides using the template
system.
Benchmark performance for this method is still substantially slower. And
oddly, this nearly doubled the speed of the other two calls (granted, at
that speed, it doesn't matter).
This properly checks identifier types when resolving externs. It also
includes a bit of refactoring. Note that some of that refactoring was
already merged into master.
The old linker was missing some things, so there are template changes in
here as well.
An example of an error currently:
error: extern `__retry` of type `cgen[boolean; 1]` is incompatible with
type `cgen[boolean; 0]`
All of these refactoring commits to arrive at this one final change: the
ability to store the source location for externs so that we can report on
what package is expecting an identifier to be defined.
Phew. Goodnight.
This undoes work I did earlier today...but now we'll be able to support a
Source on an extern.
There is duplicate code between `BaseAsg::declare{,_extern}` that will be
resolved in an upcoming commit. Upcoming commits will also simplify
terminology and clean up methods on ObjectState.
There is some duplication here with `declare` that will be cleared up in a
following commit. Reintroducing this method is necessary so that Source can
be used to represent the source location of the extern itself; it's
currently None to indicate an extern in `declare`.
This is the first step in a more incremental refactoring than the previous
commits, undoing the optional Source in `ObjectState::ident`. This provides
an explicit transition to an extern, with the intent of requiring an initial
missing state. This will simplify logic on the ASG.
Note that the Source provided to this new method is not yet used. That too
will come in a following commit and will represent the source of the defined
extern rather than the concrete identifier.
This properly verifies extern types, and cleans up Asg's API a little so
that externs aren't handled much differently than other declarations.
With that said, after making src optional, I realized that we will indeed
want source information for externs themselves so we can direct the user to
what package is expecting that symbol (as the old linker does). So this
approach will not work, and I'll have to undo some of those changes.
Merge branch 'jira-7133'
* jira-7133:
[DEV-7133] Clearly show the cycles in the output
[DEV-7133] Check for cyclic dependencies
[DEV-7133] Remove dependency from "lv:function/lv:param"
[DEV-7133] Add AsgError::Cycle
This is essential to clarify what exactly the different object types
represent with the new generic abstractions. For example, we will have
expressions as an object type.
There's a lot here to make the object stored on the `Asg` generic. This
introduces `ObjectState` for state transitions and `ObjectData` for pure
data retrieval. This will allow not only for mocking, but will be useful to
enforce compile-time restrictions on the type of objects expected by the
linker vs. the compiler (e.g. the linker will not have expressions).
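Roughly the shape of that split (the method names below are assumptions for
illustration):

  #[derive(Debug, Clone, Copy, PartialEq, Eq)]
  struct SymbolId(u32);

  // Pure data retrieval; easily mocked, and shared by both the
  // compiler and the linker.
  trait ObjectData {
      fn name(&self) -> SymbolId;
  }

  // State transitions, returning the next state or an error, so the
  // graph can enforce valid transitions through the type system.
  trait ObjectState: Sized {
      type Error;
      fn resolve(self) -> Result<Self, Self::Error>;
  }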
This commit intentionally leaves the corresponding tests in their original
location to prove that the functionality has not changed; they'll be moved
in a future commit.
This also leaves the names as "Object" to reduce the cognitive overhead of
this commit. It will be renamed to something like "IdentObject"
in the near future to clarify the intent of the current object type and to
open the way for expressions and a type that marries both of them in the
future.
Once all of this is done, we'll finally be able to make changes to the
compatibility logic in state transitions to implement extern compatibility
checks during resolution.
DEV-7087
The next commit will generalize this further. This moves logic out of
BaseAsg so that we can implement more sophisticated transitions for
compatibility checks.
The logic is still tested as part of BaseAsg; the next commit will change
that as it's generalized further.
* tamer/src/ir/asg/base.rs: Extract object transitions.
* tamer/src/ir/asg/graph.rs (AsgError)[IncompatibleIdent]: New variant.
(From<TransitionError> for AsgError): Basic type translation.
* tamer/src/ir/asg/object.rs (TransitionResult): New type.
(impl Object): Transition methods.
(TransitionError): New enum.
This variant is unnecessary, as it was used only by the indexer to represent
the absence of a node, for which we can simply use `None` in the containing
`Option`.
* tamer/Cargo.toml: Add `lazy_static`.
* tamer/Cargo.lock: Update.
* tamer/src/ir/asg/base.rs (with_capacity): Use `None` in place of
`Some(Object::Empty)`.
* tamer/src/ir/asg/object.rs: Adjust state machine graphic.
(Empty): Remove variant.
(Missing): Remove reference to variant.
* tamer/src/lib.rs: Import `lazy_static` for test builds.
* tamer/obj/xmle/writer/writer.rs (Section::iter): Remove `Object::Empty`
from documentation.
(test::): Remove references to `Object::Missing`. `lazy_static!` used
here.
* tamer/obj/xmle/writer/xmle.rs (test::write_section_catch_missing): Replace
reference to `Object::Missing`.
Merge branch 'jira-7085'
* jira-7085:
TAMER: Tidy up graph_sort test
[DEV-7085] Create `SortableAsg` trait
[DEV-7085] Implement `PartialEq` for `Sections`
[DEV-7085] Move sections to IR module
This still isn't comprehensive. Further, it won't be able to be, because
we'd have to rely on Petgraph implementation details: there are potentially
many acceptable orderings for a given graph.
Create a trait that sorts a graph into `Sections` that can then be used
as an IR. The `BaseAsg` should implement the trait using what was
originally in the POC.
Merge branch 'jira-7134'
* jira-7134:
[DEV-7134] Remove unnecessary node replacement
[DEV-7134] Propagate errors from the writer
[DEV-7134] Propagate sorting errors
[DEV-7134] Propagate errors setting fragments
[DEV-7134] Pass read event errors up the stack
[DEV-7134] Return error for XmloEvent::SymDecl
[DEV-7134] Add alias for LoadResult
[DEV-7134] Remove unwrap so we can bubble up error messages
[DEV-7134] Escalate the error from finding the absolute path
If we cannot set a fragment, we need to display the error to the user.
We are currently ignoring "___head", "___tail", and objects that are
both virtual and overridden. Those will be corrected in with future
changes.
We want to add an option to set the output file to the linker so we do
not need to redirect output to awk any longer.
This also adds integration tests for tameld.
We will continue to finalize this as we go. It is currently used in
production, both for performance and because it fixes a bug in the
XSLT-based linker.
All systems should be using the provided Makefile, so this shouldn't be
invoked anymore. The new linker is still considered a proof-of-concept, but
bugs have been encountered in the old one that are not worth investing the
time into fixing.
The new linker has been used in production for nearly a couple months and is
functioning properly.
This begins to introduce the ASG, backed by Petgraph. The API will continue
to evolve, and Petgraph will likely be encapsulated so that our
implementation can vary independently from it (or even remove it in the
future).
This introduces the reader for xmlo files produced by the XSLT-based
compiler. It is an initial implementation but is not complete; see future
commits.
One of the benefits of storing a reference to the interned string on the
symbol itself is that we get its underlying value essentially for
free.
This ordering will simplify streaming processing of xmlo files in
TAMER. Specifically, we know that symbols will have been declared by the
time dependencies are added to the graph (and so we should only be creating
edges to existing nodes); and we can halt reading as soon as the closing
fragments tag is encountered, avoiding parsing the entirety of these massive
XML files.
On one particularly large program, this cuts time down from ~0.333s to
~0.300s in the POC linker.
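A sketch of the stream processing this enables (the `XmloEvent` variants
here are assumptions loosely based on this log):

  enum XmloEvent {
      SymDecl(String),
      SymDeps(String, Vec<String>),
      Fragment(String, String),
      Eoh, // closing fragments tag
  }

  fn load(events: impl Iterator<Item = XmloEvent>) {
      for event in events {
          match event {
              // Symbols are declared before dependencies reference
              // them, so edges always target existing nodes.
              XmloEvent::SymDecl(sym) => println!("declare {sym}"),
              XmloEvent::SymDeps(sym, deps) => {
                  println!("deps of {sym}: {deps:?}")
              }
              XmloEvent::Fragment(sym, _text) => {
                  println!("fragment for {sym}")
              }
              // Halt as soon as fragments close, skipping the rest
              // of the (massive) file.
              XmloEvent::Eoh => break,
          }
      }
  }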
Contrary to what I said previously, this replaces the previous
implementation with an arena-backed internment system. The motivation for
this change was investigating how Rustc performed its string interning, and
why they chose to associate integer identifiers with symbols.
The intent was originally to use Rustc's arena allocator directly, but that
crate pulled in far too many dependencies and depended on nightly
Rust. Bumpalo provides a very similar implementation to Rustc's
DroplessArena, so I went with that instead.
Rustc also relies on a global, singleton interner. I do not do that
here. Instead, the returned Symbol carries a lifetime of the underlying
arena, as well as a pointer to the interned string.
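A condensed sketch of that arrangement (using bumpalo, as mentioned;
details differ from the real implementation):

  use bumpalo::Bump;
  use std::collections::HashMap;

  // The returned symbol borrows the arena's lifetime and points at
  // the interned string.
  #[derive(Debug, Clone, Copy)]
  struct Symbol<'i>(&'i str);

  struct Interner<'i> {
      arena: &'i Bump,
      map: HashMap<&'i str, Symbol<'i>>,
  }

  impl<'i> Interner<'i> {
      fn new(arena: &'i Bump) -> Self {
          Interner { arena, map: HashMap::new() }
      }

      fn intern(&mut self, s: &str) -> Symbol<'i> {
          if let Some(&sym) = self.map.get(s) {
              return sym;
          }
          // Copy into the arena exactly once.
          let interned: &'i str = self.arena.alloc_str(s);
          let sym = Symbol(interned);
          self.map.insert(interned, sym);
          sym
      }
  }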
Now that this is put to rest, it's time to move on.
For strings of any notable length, Fx Hash outperforms FNV. Rustc also
moved to this hash function and noticed performance
improvements. Fortunately, as was accounted for in the design, this was a
trivial switch.
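The switch amounts to swapping the hasher type parameter; a sketch
(assuming the `fnv` and `fxhash` crates):

  use std::collections::HashSet;

  use fnv::FnvBuildHasher;
  use fxhash::FxBuildHasher;

  // The hasher was designed to be a type parameter, so swapping
  // implementations is a one-line change:
  type FnvSet<T> = HashSet<T, FnvBuildHasher>;
  type FxSet<T> = HashSet<T, FxBuildHasher>;

  fn main() {
      let mut set: FxSet<&str> = FxSet::default();
      set.insert("foo");
      assert!(set.contains("foo"));
  }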
Here are some benchmarks to back up that claim:
test hash_set::fnv::with_all_new_1000 ... bench: 133,096 ns/iter (+/- 1,430)
test hash_set::fnv::with_all_new_1000_with_capacity ... bench: 82,591 ns/iter (+/- 592)
test hash_set::fnv::with_all_new_rc_str_1000_baseline ... bench: 162,073 ns/iter (+/- 1,277)
test hash_set::fnv::with_one_new_1000 ... bench: 37,334 ns/iter (+/- 256)
test hash_set::fnv::with_one_new_rc_str_1000_baseline ... bench: 18,263 ns/iter (+/- 261)
test hash_set::fx::with_all_new_1000 ... bench: 85,217 ns/iter (+/- 1,111)
test hash_set::fx::with_all_new_1000_with_capacity ... bench: 59,383 ns/iter (+/- 752)
test hash_set::fx::with_all_new_rc_str_1000_baseline ... bench: 98,802 ns/iter (+/- 1,117)
test hash_set::fx::with_one_new_1000 ... bench: 42,484 ns/iter (+/- 1,239)
test hash_set::fx::with_one_new_rc_str_1000_baseline ... bench: 15,000 ns/iter (+/- 233)
test hash_set::with_all_new_1000 ... bench: 137,645 ns/iter (+/- 1,186)
test hash_set::with_all_new_rc_str_1000_baseline ... bench: 163,129 ns/iter (+/- 1,725)
test hash_set::with_one_new_1000 ... bench: 59,051 ns/iter (+/- 1,202)
test hash_set::with_one_new_rc_str_1000_baseline ... bench: 37,986 ns/iter (+/- 771)
This will be used for generating the common tests between HashSet and
HashMap implementations.
This is my first macro in Rust. There does not seem to be a way to
concatenate identifiers (!), so I'm placing them within modules
instead. That ended up working out just fine, since then I can use a type
to provide the SUT.
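A sketch of the pattern (hypothetical names; the real macro differs):

  // Since identifiers cannot be concatenated, each instantiation is
  // wrapped in its own module, and a type provides the SUT.
  macro_rules! hash_set_tests {
      ($name:ident, $sut:ty) => {
          mod $name {
              use super::*;

              type Sut = $sut;

              #[test]
              fn stores_and_finds_value() {
                  let mut sut = Sut::default();
                  sut.insert("foo");
                  assert!(sut.contains("foo"));
              }
          }
      };
  }

  // e.g.:
  //   hash_set_tests!(fnv, FnvHashSet<&'static str>);
  //   hash_set_tests!(fx, FxHashSet<&'static str>);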
This is missing two key things that I'll add shortly: a HashMap-based one
for use in the ASG for node mapping, and an entry-based system for
manipulations.
This has been a nice start for exploring various aspects of Rust
development, as well as conventions that I'd like to implement. In
particular:
- Robust documentation intended to guide people through learning the
necessary material about the compiler, as well as related work to
rationalize design decisions;
- Benchmarks;
- TDD;
- And just getting used to Rust in general.
I've beat this one to death, so I'll commit this and make smaller changes
going forward to show how easily it can evolve.
(This module was originally named `intern` but this commit and those that
follow rewrote it to `sym`.)
This is enabled by default in nightly, and is not available at all in
stable. Considering the PITA that it will be to go back and rewrite docs to
use the new format, and how important of a feature this is, we will just
make use of it now.
Given that developers should be doing TDD and therefore running this target
frequently, this has the effect of providing immediate feedback when
formatting is needed and outputting a diff. Developers will then quickly
understand what changes need to be made to avoid future issues (and can run
`cargo fmt` to fix it), at which point they'll rarely ever encounter
formatting errors.
The original purpose was to ensure pipelines fail when the formatter has not
been run.
The UI values need to match AND the question needs to be
visible. We do not have the visibility classifications yet, so we need to
define externs to allow this to build.
This both reduces some of the output and permits it to be run through
Google's Closure compiler. Combined, this has the potential to halve the
size of classification-heavy executables, like the UI's classifier.
This not only reduces file size, but also has a significant performance
benefit for the UI, which is almost entirely classifications. A run for one
of our systems was reduced from 1m30s to 11s from this change.
This was used to provide additional information on the stack for debugging
the compiled code. Since this is very rarely needed, and is only needed by
someone debugging the compiler, it can be manually enabled if desired.
This also wraps it so that it'll be stripped if it is included.
Note that, because of the way this is implemented, the timestamps may become
mangled (multiple per line) for parallel builds.
Output can be prettied up in the future.
The `<t:match-class-code-lookup />` matches were not showing in the
summary pages. I loosened the selector so it is able to find the matches
when it generates the summary pages.
This makes use of Petgraph for representing the dependency graph and uses a
separate data structure for both string interning and indexing by symbol
name.
This is garbage code. Do not use it. It is intentionally throwaway.
While I've researched Rust, I haven't actually _used_ it for a project, so
this is a combination of me exploring various ways of accomplishing the
problem and forcing myself to learn certain aspects of the language.
I'll likely be using petgraph, and this also currently lacks symbol
abstractions. This commit also performs far too much heap allocation,
copying strings around. But it _does_ perform the topological sort.
Since this only stores the symbol name, it lacks enough information about
the symbol to perform a proper linking.
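For what it's worth, Petgraph makes the sort itself trivial; a toy sketch
with hypothetical symbol names:

    use petgraph::algo::toposort;
    use petgraph::graph::DiGraph;

    fn main() {
        let mut deps = DiGraph::<&str, ()>::new();
        let rate = deps.add_node("rate/premium");
        let class = deps.add_node("classify/eligible");

        // Edges point from dependent to dependency.
        deps.add_edge(rate, class, ());

        // With edges in this direction, toposort yields dependents
        // before their dependencies, so reverse for a linking order.
        let order = toposort(&deps, None).expect("cycle in dependency graph");
        for i in order.iter().rev() {
            println!("{}", deps[*i]);
        }
    }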
We moved to an internal container registry so that we do not have to rely on
DockerHub. Since TAME is a public project, this will allow our
configuration internally to vary from a public configuration.
If an `lvm:if` is immediately followed by another `lvm:if`, both should
be used to create the conditional. The existing code would only "select
the nearest condition".
The LOB being passed into the function was being ignored and instead it
was pulling it from the contract object. With Package, this caused all 3
LOBs to be "COMMPKGE" rather than the correct LOB being processed at the
time.
Going forward, one cannot `map` or `pass` to "line_code" as it will be
considered a reserved word.
Co-Authored-By: Jim Grundner <james.grundner@rtspecialty.com>
It doesn't make sense to consider a question to be set if it's not even
applicable. This also helps to remove a bunch of duplicate code where these
templates are being used.
This is left over from f2db9f1268, in which I
should have cleaned all of this up. One of our developers was hitting the
removed warning, which isn't necessary since the concept of a separate
"classifier" is no longer a thing after the aforementioned commit.
* rater/rater.xsd (no-extclass, no-extclass-keeps): Remove.
* src/current/rater.xsd: Likewise. (I really need to deduplicate these.)
* src/current/compiler/js.xsl (compiler:entry-rater): Remove inaccurate
comment (genclasses is used for other things).
* src/current/include/depgen.xsl (preproc:depgen-match): Remove error
checking for pulling in non-external classes (this is the error that the
developer hit that is no longer needed).
* src/current/include/preproc/eligclass.xsl (preproc:sym): Remove
`@extclass' predicate. Remove portion of comment.
* src/current/include/preproc/expand.xsl: Remove ancient footnote that
even references an old internal rater!
* src/current/include/preproc/macros.xsl (preproc:class-groupgen): Remove
external propagation.
* src/current/include/preproc/symtable.xsl (preproc:symimport): Remove
extclass checks and propagation.
(preproc:symtable)[lv:rate]: Remove external propagation.
[lv:classify]: Likewise.
* src/current/include/preproc/template.xsl (preproc:inline-apply): Remove
external sym metadata support.
These exist because TAME is deterministic, so all state must be passed
into it. But it's inconvenient to have users have to manually fill in
dates, so we derive them from the environment unless they are set.
* src/current/scripts/entry-form.js (fillTimeValues): New function.
(rater): Use it.
csvm2csv was not failing when csvm-expand exited with a non-zero
status. Further, the tests had been incorrectly written to accommodate
this broken behavior.
* build-aux/csvm2csv: Set `pipefail' option.
* build-aux/test/test-csvm2csv: Fix tests.
While tabs aren't desirable, users who are not developers will be modifying
these files, and so we need to be permissive in what we accept. That doesn't
mean we need to forgo occasional formatting cleanup, though.
tamed was originally designed with support for parallel builds in mind, but
I hadn't completed that work because we didn't have enough hardware that
we'd benefit strongly from it. That has since changed.
tamed will now spawn additional runners as needed to fulfill requests, which
works around the issue of not knowing how many jobs GNU Make is going to try
to do at once.
There were a couple minor dependency fixes/workarounds for now in the
Makefile, but otherwise everything appears to be working great.
A table with a couple hundred thousand rows was taking minutes to
generate. This gets it down to a few seconds.
* build-aux/csvm-expand (parse_date): New function.
(parseline): use it.
This aims to prevent needlessly wasted time debugging a non-working test
case, and to avoid writing incorrect test cases that happen to succeed even
though their inputs aren't properly defined.
For example, a common error is to use the name of a bucket field rather than
the name of the param that it maps to.
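The check itself is simple; here is the idea sketched in Rust (the real
implementation is the `_verifyKnownParams` JS method named below):

    use std::collections::HashSet;

    /// Fail fast when a test case sets an input that the rater does
    /// not declare as a param.
    fn verify_known_params(
        known: &HashSet<&str>,
        given: &[&str],
    ) -> Result<(), String> {
        let unknown: Vec<&str> = given
            .iter()
            .copied()
            .filter(|name| !known.contains(name))
            .collect();

        if unknown.is_empty() {
            Ok(())
        } else {
            Err(format!("unknown param(s): {}", unknown.join(", ")))
        }
    }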
* progtest/src/TestRunner.js (_verifyKnownParams): New method.
(_tryRun): Use it.
* progtest/test/TestRunnerTest.js: New test case. Modify existing test
cases to define used params.
* progtest/test/_stub/program.js (exports.rater.params): Declare used param.
This reduces overall build times for one of our systems by ~50% by
addressing a lot of the low-hanging fruit for compilation of object
files. There is much more work to be done, and the addition of maps added a
little bit of a mess that will be abstracted in future commits once I'm done
surveying the possible improvements that can be done.
This further improves performance of the symbol table processing. The next
step will be to address how symbols are handled on a more intimate level,
since it's a huge mess atm. But I'll save that for later, after the
low-hanging fruit has been resolved.
* src/current/include/preproc/symtable.xsl (preproc:sym-discover): Use
`for-each-group' in place of `preceding-sibling'. Aggressive use of
maps for generating the `dedup' sequence, which is a mess.
(preproc:symtable-process-symbols): Additional maps to avoid
preceding-sibling and following-sibling selectors (O(n²)=>O(n)).
Same concept as previous commits: rather than iterating over the symbol
table and scanning the tree for the matching node, iterate over the document
and look up from a symbol map: O(n²) => O(n).
This gives a respectable performance boost to compilation of certain
packages (best improving packages with many classifications or rate blocks).
* src/current/compiler/fragments.xsl (@xmlns:xs, @xmlns:map): New namespace
declarations.
(preproc:compile-fragments): Generate `preproc:fragment' nodes and match
on document rather than symbols.
[lv:package]: Generate map and tunnel it.
* src/current/compiler/js.xsl (compile)[lv:classify, lv:match]: Use
symtable-map.
(compile-class-condition)[lv:rate]: Likewise.
(compile-cmatch)[lv:rate]: Likewise.
This uses the same map strategy (and same duplicate code) as previous
commits, but this one generates a map for two separate tables.
There is more room for improvement, but this cuts down on the time a
lot. Also keep in mind that this is performed multiple times (once per
pass), so it's still worth revisiting. Performance is still very poor for
very large (many thousands of symbols) symbol tables.
The next slowest part appears to be the fragment compilation. I'm nearing
the end of the low-low-hanging fruit for maps. The /common/gl package
mentioned in previous commits that previously took over a minute to compile
now compiles in 20s as of this commit on equivalent hardware.
* src/current/include/preproc/symtable.xsl (@xmlns:map): New namespace
declaration.
(preproc:symtable-process-symbols): Create map for `cursym' and
`extresults'. Use it. Remove unused `dup'. Output message when
done (another is output slightly later on in the process).
This is the first step to improving the map. Note that this duplicates the
symbol table generation code that's used in a few other places
now---that'll be cleaned up in future commits once I have a better idea of
all the places this will be used and try to move it to a higher level.
* src/current/compiler/validate.xsl (@xmlns:xs, @xmlns:map): New namespace
definitions.
(lvv:validate)[lv:package]: Generate symbol table map. Tunnel to
templates.
[c:apply[@name], lv:classify[@as]//lv:match, lv:match[@value]]
[c:*[@name or @of], c:apply/c:arg[@name], lv:rate/lv:class]: Use it.
The existing code was not only complex (because of XSLT 1), but mostly
unnecessary. We don't need to consult remote symbol tables at all anymore.
This shaves off an additional few seconds on large packages.
* src/current/include/preproc/package.xsl (preproc:resolv-syms)[preproc:sym]:
Only consult local symbol table. Simplify max dimension calculation.
This is a first step (low-hanging-fruit kinda thing) for improving the
performance of symbol resolution, where the compiler has to figure out the
dimensions of a symbol by first resolving its dependencies,
recursively. This is approximately an O(n³) polynomial-time algorithm _per
recursive step_. Yikes.
This is traditionally where dynamic programming methods would be used, but
that's considerably more difficult in an immutable language like XSLT, so
I'll do my best without. (Saxon does offer some support for mutability, but
I'd prefer to avoid it if possible.)
This first change improves performance 30--40%. For example, on two large
packages we have, build times drop from 55s to 35s and from 1m42s to 1m13s
respectively.
Good start, but much more to be done!
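For contrast, in a language with mutable state the dynamic-programming
approach is plain memoization; a hypothetical sketch of resolving a
symbol's dimensionality from its dependencies (assumes an acyclic graph):

    use std::collections::HashMap;

    /// A symbol's dimensionality is the max of its own and its
    /// dependencies', memoized so each symbol is resolved only once.
    fn resolve_dim(
        sym: &str,
        deps: &HashMap<&str, Vec<&str>>,
        base: &HashMap<&str, u8>,
        cache: &mut HashMap<String, u8>,
    ) -> u8 {
        if let Some(&dim) = cache.get(sym) {
            return dim; // already resolved
        }

        let mut dim = base.get(sym).copied().unwrap_or(0);

        if let Some(ds) = deps.get(sym) {
            for dep in ds {
                dim = dim.max(resolve_dim(dep, deps, base, cache));
            }
        }

        cache.insert(sym.to_string(), dim);
        dim
    }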
* src/current/include/preproc/package.xsl (preproc:resolv-syms)[lv:package]:
Compute maps for preproc:symtable and preproc:sym-deps at each recursive
step. Pass along via tunneling.
(preproc:resolv-syms)[preproc:sym]: Use them.
DEV-4354
This only saves 1--2s on a 30s run, but I want to move into this direction,
so it'll simplify future refactoring if I just add it. Small changes like
these will accumulate, too.
* src/current/compiler/linker.xsl (l:orig-package, l:root-symtable-map): New
variables.
(l:resolv-extern): Use it.
A bunch of failing pipelines apparently wasn't obvious to me. And shame on
me for not running these locally; I forgot that the part of the system that
I touched had tests.
This was broken by b6cfdb4221.
This now uses year ranges, which I'll update annually.
This also renames "R-T Specialty" to "Ryan Specialty Group". The latter is
the parent company of the former. I was originally employed under the
former when LoVullo Associates was purchased, but I now work for the parent
company.
The previous commit made dependency lists optional for certain symbols. The
Summary Page needs to be updated to permit such a thing.
The whole Summary Page needs aggressive refactoring, though, so this doesn't
bother checking for `no-deps' to see if this is a bad thing.
* src/current/summary.xsl (typeset-final)[preproc:sym-ref]: Permit missing
symbol dependencies.
(lv:param|lv:const|lv:item): Likewise.
This is a significant performance improvement for dependency
generation (which is responsible for building the dependency graph for a
package).
The previous algorithm ran in O(n²) time: it would iterate over the given
symbol table, and for _each_ symbol, do a linear scan of the entire document
to search for the corresponding source block. This resulted in explosive
depgen time for larger packages.
This makes the algorithm run in O(n) by:
- Using an XSLT 3 map for the symbol table for O(1) lookups; and
- Iterating over the _document_ a single time rather than the symbol
table, referencing the symbol table as needed (in O(1) time).
There are other parts of the system that can benefit from these same
improvements. This is important, since we need to be able to handle many
thousands of symbols efficiently.
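The transformation is easier to see outside of XSLT; the same idea
sketched in Rust (types hypothetical):

    use std::collections::HashMap;

    struct Sym { name: String }
    struct Node { name: String }

    // Before, O(n²): for each symbol, scan the whole document.
    fn pair_quadratic<'a>(syms: &'a [Sym], doc: &'a [Node]) -> Vec<(&'a Sym, &'a Node)> {
        syms.iter()
            .filter_map(|sym| {
                doc.iter() // O(n) scan per symbol
                    .find(|n| n.name == sym.name)
                    .map(|n| (sym, n))
            })
            .collect()
    }

    // After, O(n): index the symbol table once, then make a single
    // pass over the document with O(1) lookups.
    fn pair_linear<'a>(syms: &'a [Sym], doc: &'a [Node]) -> Vec<(&'a Sym, &'a Node)> {
        let symtable: HashMap<&str, &Sym> =
            syms.iter().map(|s| (s.name.as_str(), s)).collect();

        doc.iter()
            .filter_map(|n| symtable.get(n.name.as_str()).map(|s| (*s, n)))
            .collect()
    }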
* src/current/compiler/linker.xsl (l:depgen-sym): Recognize symbol `no-deps'
property, permitting missing dependencies. This allows us to avoid
creating nonsense nodes just to satisfy the linker, while still allowing
the linker to perform essential checks to defend against compiler bugs.
* src/current/compiler/map.xsl (lvmc:stub-symtable): Set @no-deps on
`___head' and `___tail' symbols.
(lvmc:mapsym): Set `no-deps' as appropriate on map symbols.
(preproc:depgen)[lvm:map[@from]]: Generate `preproc:sym-dep' node, which
is now expected by the depgen process.
(preproc:depgen)[lvm:map[*]]: Likewise.
(preproc:depgen)[*[@lvmc:type='retmap']//lvmm:map[@from]]: Remove
unnecessary template.
(preproc:symtable)[lvm:map[@value]]: Pass `no-deps' to `lvmc:mapsym'.
* src/current/include/depgen.xsl (preproc:depgen)[preproc:symtable]: Create
and use XSLT 3 map in place of `preproc:symtable' tree. This allows for
constant-time lookups. Provide to templates via tunnelling. Use it in
place of existing tree references. Process source tree rather than
iterating over symbol table.
(preproc:depgen)[lv:rate, c:sum[@generates], c:product[@generates],
lv:classify, lv:function/lv:param, lv:function, lv:typedef]: Produce
`preproc:sym-dep' nodes (which was previously done while iterating
over the symbol table).
(preproc:depgen)[preproc:sym]: Remove all such processing, since we no
longer iterate over the symbol table.
(preproc:depgen)[c:value-of]: Use symtable map.
(preproc:depgen-match): Likewise.
(preproc:depgen)[lv:union]: Modify to handle changes to lv:typedef
template.
(preproc:depgen)[text()]: Remove and replace with `node()'.
* src/current/include/preproc/package.xsl (preproc:resolv-syms): Remove
logging of symbol resolution. This has a slight performance impact since
there is a lot of output.
* src/current/include/preproc/symtable.xsl
(lv:function/lv:param, c:let/c:values/c:value): Set `no-deps'.
* src/symtable/symbols.xsl: Add documentation of `no-deps'.
(preproc:symtable)[lv:meta]: Set `no-deps'.
These provide a more pleasant abstraction than having to reference CMP_OP_*
constants.
* core/test/core/vector/interpolate.xml: {t:when=>t:where-eq}.
* core/test/core/vector/table.xml: Likewise, but using the other variants
where appropriate given the value of `@op'.
* core/vector/interpolate.xml: Likewise.
* core/vector/table.xml (_when_, _where_): Rename former to latter and
provide deprecation warning.
(_when-lt_, _when-lte_, _when-gt_, _when-gte_): Add abstractions.
* src/current/rater.xsd: Permit template variable as template name.
This is fairly primitive support and it completely sidesteps the bisect
algorithm for now. The next commit will abstract this a little bit further
to make it less awkward to use.
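A sketch of the restriction in Rust (hypothetical; assumes rows sorted
ascending on the queried column):

    #[derive(Clone, Copy)]
    enum CmpOp { Eq, Lt, Lte, Gt, Gte }

    fn cmp(op: CmpOp, a: i64, b: i64) -> bool {
        match op {
            CmpOp::Eq => a == b,
            CmpOp::Lt => a < b,
            CmpOp::Lte => a <= b,
            CmpOp::Gt => a > b,
            CmpOp::Gte => a >= b,
        }
    }

    /// Only equality bisects; every other operator sidesteps the
    /// bisect algorithm and falls back to a linear scan.
    fn mfilter(rows: &[Vec<i64>], col: usize, val: i64, op: CmpOp) -> Vec<Vec<i64>> {
        match op {
            CmpOp::Eq => {
                // Binary search for the contiguous run of equal keys.
                let lo = rows.partition_point(|r| r[col] < val);
                let hi = rows.partition_point(|r| r[col] <= val);
                rows[lo..hi].to_vec()
            }
            _ => rows.iter().filter(|r| cmp(op, r[col], val)).cloned().collect(),
        }
    }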
* core/test/core/vector/table.xml: New test cases.
* core/vector/filter.xml (CmpOp): New typedef.
(mfilter): Document that bisecting will not happen unless `CMP_OP_EQ'
is used. Implement that restriction.
[op]: New parameter. Provide it to `mrange'.
(_mfilter, _mrange_cmp): Rename from `_mfilter'. Implement new comparison
check based on `op'.
[op]: New argument.
* core/vector/table.xml (_when_)[@op@]: New param. Add it to the produced
vector.
(_mquery): Unpack op (from `_when_') in call to `mfilter'.
Just trying to clean up a little as I go to start to make it easier
to understand.
* core/vector/filter.xml: Use _when-*_ templates and c:recurse.
* core/vector/table.xml: Likewise.
It's going to be like TeX before you know it... ._.
* src/current/include/preproc/package.xsl (preproc:tpl-check)
[lv:template|lv:const|lv:typedef|lv:param-copy]: Add lv:param-copy.
* src/current/include/preproc/template.xsl (preproc:apply-template)
[lv:expand-barrier, lv:skip-child-expansion]: New expansion control
structures.
This is a much more useful description if present.
* src/current/include/preproc/macros.xsl (preproc:macros)[c:value-of...]:
Default generated constant description to @label.
The term "set" is all wrong---it is actally intended to be a vector, and can
absolutely have duplicate elements (and often does).
* src/current/calc.xsd (vector): Add, recommending in place of `set'.
* src/current/compiler/js-calc.xsl (compile-calc)[c:set|c:vector]:
Add `c:vector' and provide deprecation notice for `c:set'.
* src/current/include/calc-display.xsl (c:set|c:vector): Likewise.
* core/test/spec.xml (_describe_): Enclose aggregate classification in a
series of nested expand-sequence to work around bug (described in
comment), which was causing test cases to not be compiled.
A better option is to pre-process all inputs, but I need a quick
fix to my stupidity. 0||""==="".
* src/current/compiler/map.xsl (lvmc:compile)[lvm:map//lvm:from[*]]: Correct oval default.
I wanted to get this section started so that I can easily add to it when I
have small bits of time to do so. Our documentation needs to improve.
* doc/Makefile.am (tame_TEXINFOS): Add `concept.texi'.
* doc/concept.texi: New file.
* doc/preproc.texi: Remove accidentally added input line.
* doc/tame.texi (menu): Add `Core Concepts' node.
I need to revert this for now because it breaks YAML test cases. The proper
fix is a more expressive type system with dependent types that would allow
it to know the proper number of indexes to initialize relative to other
inputs. I wanted to implement this anyway to help catch iteration-related
bugs.
I'm tabling this for now, though, since I have other things that I need to
work on.
This reverts commit 4406cbe553.
This includes, notably, the Developer Notes feature. I did not copy any
SRCUI stuff since this project uses literate documentation, but I'll add it
if it seems like it will be useful. Barely any of the project is written
literately right now.
* .gitignore: `{=>/}config.*'.
* configure.ac (SET_DEVNOTES): New variable.
(AC_CONFIG_FILES): Add `doc/config.texi'.
* doc/.gitignore (config.texi): Ignore (generated).
* doc/Makefile.am (tame_TEXINFOS): Add `macros.texi' and `config.texi'.
* doc/config.texi.in: New file.
* doc/macros.texi: New file containing some macros from `doc/tame.texi' and
some from Liza's `doc/macros.texi'.
* doc/tame.texi: Adjust position of header comment. Include `config.texi'
and `macros.texi'. Add devnotice to header. Strip out macros.
(menu): Add `Concept Index' and conditional `Developer Notes Index'.
(Concept Index, Developer Notes Index): New nodes (latter conditional).
I want this manual to be useful both to developers and users of TAME,
so this distinction needs to be made clear.
* doc/tame.texi (Preprocessor): chapter=>appendix.
* src/graph.texi: Top to appendix and raise subsections.
* src/symtable.texi: Top to appendix.
This is an assumption that's existed since the Summary Page was first
devised---that all vectors have at least one value. This is because the
bucket (originating from Liza) always has at least one value in its vectors.
Of course, we still have a problem in that the Summary Page initializes
everything to have a single value by default, and that's still the
case. But this will at least allow for things _outside_ the Summary Page to
provide an empty array. I'll have to address the Summary Page separately,
and that's going to be difficult, since we don't really want to change the
behavior across the board.
* src/current/compiler/js.xsl (set_defaults): Default max index to 0 if
`length' is unavailable, rather than 1.
The previous length check existed as a really bad array check (before
Array.isArray was a thing). This has been broken since Nov 2012.
The problem manifests itself when you want an empty array. We then have:
[] => [[]] => [DEFAULT_VALUE]
* src/current/compiler/map.xsl (lvmc:compile)[lvm:map//lvm:from[*]]: Use
`Array.isArray' in place of length check.
TODOs shouldn't be stored here, and they will get out of sync.
* Makefile.am (tame_TEXINFOS): Remove todo.texi.
* tame.texi: Remove include and menu entry.
* todo.texi: Remove file.
This is a BC break since this generates assertions by default. To maintain
BC, set `@allow-zero@' and `@allow-negative@' to `true' in existing template
applications.
* core/insurance.xml
(assert_ignore_premium_zero, assert_ignore_premium_negative): New params.
(_premium_): Generate assertions.
[@allow-zero@, @allow-negative@]: New params.
lsimports will make it possible to replace the last remaining Ant script
that generates depfiles.
* build-aux/check-coupling:
* build-aux/lsimports: New files.
This allows customizing from the command-line what suppliers should be
checked. The motivation for this is both to run as part of a distributed
pipeline (where each supplier may be built individually), and for during
development of a single supplier.
BC BREAK: Note that this will now check for `package' in the test path for
UI tests. To keep the old directory around, a symlink of `packages' to `ui'
would suffice.
* build-aux/Makefile (SUPPLIERS, suppliers_strip): New variables.
(check-am): BC-BREAK: Build and check only requested suppliers.
* build-aux/progtest-runner: BC-BREAK: First argument is now test directory
and all remaining arguments specify the supplier XML files to check.
We want to be able to build the UI independently of the
suppliers. Historically, this did not provide much of a benefit, but this
change allows us to build independently as a job in a distributed pipeline,
and allows testing out the UI when rating is unneeded.
* build-aux/Makefile.am (program-ui): Remove `standalones'.
This target has not been used for years.
* build-aux/Makefile.am (program-ui-immediate): Remove target.
(program-ui): Use dependency of old `program-ui-immediate'.
(.PHONY): Remove `program-ui-immediate'.
The difference is described here:
http://www.saxonica.com/html/documentation/using-xsl/embedding/
And s9api here:
http://www.saxonica.com/html/documentation/using-xsl/embedding/s9api-transformation.html
* Makefile.am (DSLC_CLASSPATH): Export for submakes.
* configure.ac (DSLC_CLASSPATH): Prefix with SAXON_CP.
* rater/rater.xsd (classNameType): Increase length 50=>75 (generated
identifiers can now exceed that, it seems).
* src/current/rater.xsd: Likewise. These files need to be combined.
* src/current/src/Makefile (CLASSPATH): Set to DSLC_CLASSPATH.
* src/current/src/com/lovullo/dslc/DslCompiler.java: Update imports.
(DslCompiler)[_DslCompiler]: New members _processor and
_xsltCompiler. Convert to s9api.
Note that such files may not actually exist, which is why `nullglob' is set
and the `for' loop is used.
* build-aux/Makefile.am (SHELL): Set `nullglob'.
(program-data-copy, lvroot): Copy srv/!(rater).js to destination JS paths.
This has been broken for years. I don't object to fixing it, it's just that
I have better things to do right now and we've gotten complaints about it;
no use in keeping around something that's broken if there's no desire to fix
it. Workaround: refresh the page.
This does keep around the reset logic because it is actually used in other
places.
* src/current/include/entry-form.xsl (entry-form)[lv:package]: Remove reset
button.
* src/current/include/entry-form.js (clearTestCases): Remove broken function
call `Prior.setPriorMessage(null)'.
It wasn't until recently that I realized that the default browser font was
being used, since I have mine customized.
* src/current/summary.css (body)[font-family]: Sans-serif font stack.
* src/current/compiler/map.xsl:
(lvmc:gen-input-default): Add argument.
[dim]: New param, defaulting to `$sym/@dim'.
(lvmc:compile)[lvm:map//lvm:from[*]]: Provide appropriate dimension value
to `set_defaults'. Provide compile-time error if nesting of `from'
nodes exceeds what is appropriate for the symbol dimensions.
This fixes a number of obnoxious miscellaneous issues, summarized below.
* src/current/src/com/lovullo/dslc/DslCompiler.java (DslCompiler)[compile]:
Output termination line (DONE) on missing destination path
error. Always output exception message before termination
line (otherwise it won't output to the user). Output termination line
and remove destination file for XSD failure.
This was a bit of a nasty one. Fortunately, this was only used as a
validation, so the code that the compiler produced was still correct.
The problem was that a version of Saxon sometime between 9.5 and 9.8 added
an optimization to eliminate conditionals with no body. Consequently, the
kluge to force the variable to be evaluated was optimized away,
`lvmc:get-symbol' was never called, and no error was ever produced.
This would be best refactored, but that's not something I have time to take
up at the moment priority-wise. This should be future-proof since this
would never be a noop.
* src/current/compiler/map.xsl (lvmc:compile)[lvm:map//lvm:from[*]]: Force
evaluation of `$sym' by ensuring that the condition will not be a noop.
This will ensure that tamed does not stall while e.g. make is still
running. This makes TAMED_STALL_SECONDS almost useless; maybe it'll be
removed in future versions.
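The monitor amounts to a liveness probe on the spawner's PID; sketched in
Rust for illustration (the real implementation is bash in `bin/tamed`, and
the `/proc` probe is Linux-specific):

    use std::{path::Path, process, thread, time::Duration};

    /// Exit as soon as the spawning process (e.g. make) is gone,
    /// rather than waiting out a stall timeout.
    fn stall_monitor(spawner_pid: u32) -> ! {
        loop {
            // `kill -0`-style probe without sending a signal.
            if !Path::new(&format!("/proc/{}", spawner_pid)).exists() {
                process::exit(0);
            }
            thread::sleep(Duration::from_secs(1));
        }
    }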
* bin/tame (TAMED_SPAWNER_PID): Export variable.
* bin/tamed (TAMED_SPAWNER_PID): New variable, default to PPID.
(spawner-dead): New function.
(stall-monitor): Use it.
(usage): Update documentation.
* build-aux/Makefile.am: Set TAMED_SPAWNER_PID to own id and export.
* src/current/c1map.xsl (lvm:c1map): Copy `@namespace' to generated
`lvmp:root'.
* src/current/c1map/render.xsl (lvmp:render)[lvmp:root]: Output
`@namespace' rather than using hardcoded string and dynamic program.
This maintains BC for existing raters that have not yet been migrated to use
the new c1-import service.
* build-aux/Makefile.am (path_c1root): New variable.
(.PHONY): Add c1root target dependency.
(program-data-copy): Copy to `@C1_IMPORT_MAPDEST@'.
(c1root): New target.
* build-aux/m4/calcdsl.m4 (C1_IMPORT_MAPDEST): Configure depending on the
existence of the `c1-import' directory.
This is a long-standing bug, apparently. The location of this code makes it
difficult to test directly (that is in dire need of correcting), but
fortunately we have a number of tests in systems that use TAME that
indirectly test this.
The problem manifested when a matrix was already in the store, but then a
scalar or vector predicate was then considered. Without the branch that was
modified here, the store was modified such that it would always yield a
vector.
* src/current/compiler/js.xsl (anyValue): Consider store dimension when
recursing.
This is the start of a working build for core.
* .gitignore: Ignore generated files from configuration and build.
* build.xml: Copy from rater repo. This is the last remaining ant-based
dependency and can be gotten rid of; see comments.
* configure.ac: New file.
* rater/build-aux, rater/src: New symlinks.
This begins to decouple the rater directory conventions using an incremental
approach, defaulting to the existing structure. Not all things were
modified (for example, cleaning will not work properly with a custom
SRCPATHS if those directories do not exist); WIP.
* build-aux/Makefile.am (path_dsl): Use `CALCROOT'.
(suppliers.mk): Test for existence of program.dep and c1map directory
before acting on them.
* build-aux/m4/calcdsl.m4: Default SRCPATHS. Output it during configure.
Expose CALCROOT and SRCPATHS using AC_SUBST.
Invoke suppmk-gen using SRCPATHS.
* build-aux/suppmk-gen: Use arguments (SRCPATHS) in place of hard-coded paths.
This frees us from requiring a rater/ directory in the working
directory. However, it is important that we continue using it if it
exists, since there are additional things that haven't yet been moved
into the tame repo.
* bin/dslc.in: Provide path to rater/ directory.
* src/current/src/com/lovullo/dslc/DslCompiler.java: Use provided rater/ path.
This was broken by a previous commit, but was not noticed because
the test cases aren't being compiled as part of the build yet!
Now that we have tamed, that is an option.
* test/core/insurance.xml: Add missing @desc@.
* bin/tame (TAME_CMD_WAITTIME): Renamed from `RUNNER_CMD_WAITTIME'.
Inherit from environment, default 3.
(command-runner): Sleep for an additional TAME_CMD_WAITTIME seconds after
requesting runner reload to give more time in case of high load.
(verify-runner-ack): Rename variable.
(usage): Document env var.
* build-aux/Makefile.am: Export TAME_CMD_WAITTIME.
* build-aux/gen-make: Do not add ".xmlo" suffix for deps with a
trailing `$'.
* src/current/pkg-dep.xsl (lvm:program|lvm:return-map): Append ".xml$" to
dep for map/@src (new dep).
This is the one we always want in the UI. Rather than stripping with an
outside build process, just use this.
* build-aux/Makefile.am (program-data-copy, lvroot): Copy ui/program{=>.strip}.js.
This significantly improves speed and reduces memory usage when dealing with
hundreds of test cases.
* build-aux/Makefile.am (dest_standalone_strip): New variable.
(strip, %.strip.js): New targets.
(.PHONY): Add strip target.
(check-am): Depend on strip.
* build-aux/progtest-runner: Use stripped executables.
Try to re-post message, since the previous message will have already been
read (otherwise the previous echo would have hung).
* bin/tame (EX_STALLED): New exit code.
(command-runner): Re-post message after stall. If unrecoverable, provide
a more clear error and exit with EX_STALLED.
This tries to be a bit more resilient in case a runner becomes unresponsive,
rather than waiting for tamed to kill itself.
* bin/tame (RUNNER_CMD_WAITTIME): New variable.
(command-runner): Tell runner to reload if it does not respond in
RUNNER_CMD_WAITTIME seconds.
(verify-runner-ack): New function.
* bin/tamed (mkfifos): Only keep stdin open. stdout isn't necessary, and
may have actually been causing subtle issues.
(spawn-runner): Support restarting dslc on SIGHUP.
This fixes a lot of the problems with the build by using a normal Makefile
as it is intended to be used. To do this, tamed was created. See the
manual and commit messages for more information. bin/tame{,d} also have
more information. More information will follow in the manual in the future.
There is also more cleanup to follow; I just want to get this committed so
that people can take advantage of it and stop some of the suffering.
This does not include a great deal of information, but it is a start.
* README.md: Modernize.
* doc/Makefile.am (tame_TEXINFOS): Add `about.texi'.
* doc/about.texi: New file.
* doc/tame.texi: Include it.
This will keep the intermediate files around but will still delete them on
build failure.
* build-aux/Makefile.am (.SECONDARY): Renamed from `.PRECIOUS'.
Please excuse the mess. This was taken from an existing bootstrap script in
a private repository; it can be cleaned up in the future.
* bootstrap: New file.
* README.md (Getting Started): New section.
This is a major step toward normalcy---removing the kluge of a build process
that was causing so many issues. Rather than echoing all operations to a
queue file before passing it off to dslc, the new build scripts in `bin/'
are used to invoke tame normally, as needed. This solves all of the current
issues with things not rebuilding when they should. And, as a bonus, tab
completion on targets works.
Sorry this took so long. There wasn't much motivation until we hired so
many people that are suffering from this.
This does a few major things, along with some miscellaneous others:
- Invoke bin/tame directly;
- Merge Makefile.2.in into Makefile.am; and
- Fix up some targets.
* build-aux/Makefile.2.in: Delete file. Mostly merged with Makefile.am.
* build-aux/Makefile.am: Add a bunch of new targets and definitions from
Makefile.2.in. Modify all that previously used .cqueue to now invoke
`$(TAME)' directly. Remove miscellaneous targets for trying to proxy
targets to Makefile.2.
(saneout, _go): Remove definitions.
(.NOTPARALLEL): Add to prevent parallel builds.
(ui/program.expanded.xml)[.version.xml]: Remove dependency for now.
(clean): Also clean generated PHP files. Follow symlinks to clean core.
This is still incomplete (does not clean all rate table stuff).
(suppliers.mk)[xmlo_cmd]: Remove. See `gen-make' and `gen-c1make'.
(lvroot)[summary-html]: New dependency.
(kill-tamed, tamed-die): New targets (former alias of latter) to kill
tamed.
* build-aux/gen-c1make: Generate `$(TAME)' invocation.
* build-aux/gen-make: Likewise. Remove `xmlo_cmd' output. Ignore recursive
`tame' symlink (this can be removed once we clean `rater/' up).
* build-aux/m4/calcdsl.m4 (TAME): Update description to reflect that it
should now be the path to `bin/tame'. Adjust `AC_CHECK_FILE' lines
accordingly.
(tame_needed_ver): Remove. We have been in the same repo as TAME itself
for quite some time. Remove associated code.
(AC_CONFIG_FILES): Remove `Makefile.2'.
* src/current/src/com/lovullo/dslc/DslCompiler.java (_DslCompiler)[compile]:
Perform validation before `compile' command rather than a separate
`validate' step. Remove `rm'.
[compileSrc]: Stop echoing command. This was only necessary because of
the previous Makefile klugery; now Make echoes on its own correctly.
These scripts allow the TAME compiler stack to be invoked naturally, rather
than requiring the use of a Makefile, as is needed today. This will not
only allow users
to more easily invoke the compiler, but will also allow us to invoke TAME
naturally from Makefile and remove the klugery that has existed for so
long.
This uses a server/client architecture in order to mitigate the startup
cost of the JVM. More documentation will follow.
Note that there are a bunch of symlinks in rater/---this is a transition
step to allow the build to continue working as it did before, which relies
on a directory structure that exists outside of this repository. This will
be cleaned up in the future.
* .gitignore (bin/dslc): Add ignore for generated file.
* bin/dslc.in: New script to encapsulate Java invocation.
* bin/tame: New script (client).
* bin/tamed: New script (server).
* configure.ac (JAVA_OPTS, DSLC_CLASSPATH, AUTOGENERATED): New variables for
dslc.in. Output bin/dslc.
* rater/README.md: Note that this symlink mess is temporary.
* rater/c1map: New symlink for dslc assumptions.
* rater/c1map.xsl: Likewise.
* rater/calc.xsd: Likewise.
* rater/compile.xsl: Likewise.
* rater/compiler: Likewise.
* rater/dot.xsl: Likewise.
* rater/include: Likewise.
* rater/link.xsl: Likewise.
* rater/standalone.xsl: Likewise.
* rater/summary.xsl: Likewise.
* rater/tame: Likewise (warning: circular symlink).
* src/current/src/com/lovullo/dslc/DslCompiler.java (_DslCompiler)[compile]:
Output `DONE' lines.
This will now automatically build on recursive target `all'.
* Makefile.am (SUBDIRS): Add `src/current/src'.
* src/current/src/Makefile (.PHONY): Add `all'.
(all): New target. Alias to `dslc'.