employer/tame - tame - Mike Gerwitz's Forge

employer

tame

Author	SHA1	Message	Date
Mike Gerwitz	41b41e02c1	tamer: Xirf::Text refinement This teaches XIRF to optionally refine Text into RefinedText, which determines whether the given SymbolId represents entirely whitespace. This is something I've been putting off for some time, but now that I'm parsing source language for NIR, it is necessary, in that we can only permit whitespace Text nodes in certain contexts. The idea is to capture the most common whitespace as preinterned symbols. Note that this heuristic ought to be determined from scanning a codebase, which I haven't done yet; this is just an initial list. The fallback is to look up the string associated with the SymbolId and perform a linear scan, aborting on the first non-whitespace character. This combination of checks should be sufficiently performant for now considering that this is only being run on source files, which really are not all that large. (They become large when template-expanded.) I'll optimize further if I notice it show up during profiling. This also frees XIR itself from being concerned by Whitespace. Initially I had used quick-xml's whitespace trimming, but it messed up my span calculations, and those were a pain in the ass to implement to begin with, since I had to resort to pointer arithmetic. I'd rather avoid tweaking it. tameld will not check for whitespace, since it's not important---xmlo files, if malformed, are the fault of the compiler; we can ignore text nodes except in the context of code fragments, where they are never whitespace (unless that's also a compiler bug). Onward and yonward. DEV-7145	2022-08-01 15:01:37 -04:00
Mike Gerwitz	a16a0d9138	Revert "tamer: xir: Initial re-introduction of AttrEnd" This reverts commit `b973d36862`. Alright, I'm getting sick of fighting with myself on this. But rather than just removing the last commit, I'm going to keep it around, so that my thoughts are clearly documented for my future quarrels with myself. Firstly: this added more overhead than I wanted it to. While it wasn't significant, it did add 100--150ms to one of our largest systems, up from ~2.8s, which seems a bit much for a token that's really just meant to make life easier for the parser. Further, it seems that all I've managed to do is push my original problem to a different layer---this started as a means to resolve having to emit both an object and an error simultaneously in the case where aggregate attribute parsing has completed, but we encounter an error on the next token (e.g. an unexpected element). But XIRF, if it's missing AttrEnd, should throw an error, but should also recover. Recovery is easy---just assume that it was present---_but then we don't emit a XIRF `AttrEnd` token_, which is necessary for downstream systems. So we'd need to either: (a) emit both a token and an error; or (b) panic. But if we're doing (a), then the need for `AttrEnd` goes away, because it solves the original problem (though the other concerns of the previous commit still stand). (b) is not ideal at all, even though the missing token does represent an internal system error; it's not something the user can correct. But, given that it's something that the user cannot correct, doesn't that imply that it's an awkward thing to include in the token stream? So back to `AttrEnd` being an awkward PITA to have. So, given (a), I'll just do that: errors will become more of a "hey, this error just occurred, but I'm trying to recover---here's an object that you should use if you choose to continue parsing, but it may or may not be what you're looking for; proceed with caution". That flips the original script: I imagined having external systems feed recovery tokens, but this encapsulates recovery within the parser, which really is more appropriate, though less flexible than having an omniscient external recovery system; such a monolith was always an awkward concept and would be difficult to implement cleanly. This can also potentially be implemented as a generalization of the Dead state change that allowed an object to be emitted alongside the lookahead/error. Anyway, back to where I was...I'm sure I'll look back on this in the future shaking my head, reflecting on how naive I was. DEV-7145	2022-06-29 11:25:44 -04:00
Mike Gerwitz	b973d36862	tamer: xir: Initial re-introduction of AttrEnd AttrEnd was initially removed in `0cc0bc9d5a` (and the commit prior), because there was not a compelling reason to use it over a lookahead operation (returning a token via the a dead state transition); `AttrEnd` simply introduced inconsistencies between the XIR reader (which produced AttrEnd) and internal XIR stream generators (e.g. the lowering operations into XIR->XML, which do not). But now that parsers are performing aggregation---in particular the attribute parser-generator `xir::parse::attr`---this has become quite a pain, because the dead state is an actionable token. For example: 1. Open 2. Attr 3. Attr 4. Open 5. ... In the happy case, token #4 results in `Parsed::Incomplete`, and so can just be transformed into the object representing the aggregated attributes. But even in this happy path, it's ugly, and it requires non-tail recursion on the parser which requires a duplicate stack allocation for the `ParserState`. That violates a core principle of the system. But if there is an error at #4---e.g. an unexpected element---then we no longer have a `Parsed::Incomplete` to hijack for our own uses, and we'd have to introduce the ability to return both an error and a token, or we'd have to introduce the ability to keep a token of lookahead instead of reading from the underlying token stream, but that's complicated with push parsers, which are used for parser composition. Yikes. And furthermore, the aggregation has caused me to introduce the ability to override the dead state type to introduce both a token of lookahead and aggregation information. This complicates the system and is going to be confusing to others. Given all of this, AttrEnd does now seem appropriate to reintroduce, since it will allow processing of aggregate operations when encountering that token without having to worry about the above scenario; without having to duplicate a `ParseState` stack; without having to hijack dead state transitions for producing our aggregate object; and everything else mentioned above. This commit does not modify those abstractions to use AttrEnd yet; it re-introduces the token to the core system, not the parser-generators, and it doesn't yet replace lookahead operations in the parsers that use them. That'll come next. Unlike the commit that removed it, though, we are now generating proper spans, so make note of that here. This also does not introduce the concept to XIRF yet, which did not exist at the time that it was removed, so XIRF is filtering it out until a following commit. DEV-7145	2022-06-29 11:02:02 -04:00
Mike Gerwitz	c671bf6a9c	tamer: xir: Introduce {Ele,Open,Close}Span This isn't conceptally all that significant of a change, but there was a lot of modify to get it working. I would generally separate this into a commit for the implementation and another commit for the integration, but I decided to keep things together. This serves a role similar to AttrSpan---this allows deriving a span representing the element name from a span representing the entire XIR token. This will provide more useful context for errors---including the tag delimiter(s) means that we care about the fact that an element is in that position (as opposed to some other type of node) within the context of an error. However, if we are expecting an element but take issue with the element name itself, we want to place emphasis on that instead. This also starts to consider the issue of span contexts---a blob of detached data that is `Span` is useful for error context, but it's not useful for manipulation or deriving additional information. For that, we need to encode additional context, and this is an attempt at that. I am interested in the concept of providing Spans that are guaranteed to actually make sense---that are instantiated and manipulated with APIs that ensure consistency. But such a thing buys us very little, practically speaking, over what I have now for TAMER, and so I don't expect to actually implement that for this project; I'll leave that for a personal project. TAMER's already take a lot of my personal interests and it can cause me a lot of grief sometimes (with regards to letting my aspirations cause me more work). DEV-7145	2022-06-24 14:16:29 -04:00
Mike Gerwitz	873e5fc761	tamer: asg::ident: {prolog=>prologue} typo fix Somewhat humorous.	2022-06-23 09:19:12 -04:00
Mike Gerwitz	2fafc331a1	tamer: xir::reader: Opening and closing tag whitespace Non-attribute and non-empty start/end tags will have their whitespace as part of the produced span. This sets us up for a following change that will allow for deriving the name span from this span given a QName, which gives us a span that both represents the entire XIR token and allows deriving the element name. An accurate token span is necessary for parsing errors where an element was not expected, while an element name span is more appropriate for issues of grammar and semantic errors that deal not with the fact that an element was encountered, but _what_ element was encountered. DEV-7145	2022-06-22 15:10:49 -04:00
Mike Gerwitz	e5c8a218c3	tamer: xir::reader: Correct empty element whitespace handling This both adds clarifying tests and corrects the case of `<foo/>`, where the offset was erroneously off by one---it saw that there were no attributes and added a byte thinking it'd include `>`, as in `<foo>`. DEV-7145	2022-06-22 10:28:44 -04:00
Mike Gerwitz	495c1438fd	tamer: Consistent span diagram representation I'll document it more formally eventually, but this settles on a mix of the two: square brackets and dashes for intervals, `+` for intersecting lines, byte offsets below interval endpoints, and names below that. The docblock for `Span` itself iss still off; I'll probably just take one of the test cases and paste it there at some point. DEV-7145	2022-06-06 11:32:35 -04:00
Mike Gerwitz	8d92667388	tamer: Integrate xir::reader as a parser in the lowering pipeline This allows `XmlXirReader` to be used in a `Lower` operation, just as everything else, bringing me one step closer to a pipeline that can be concisely represented; this is finally beginning to unify in a clear way, though it is still a bit of a mess. This causes `XmlXirReader` to _act_ like a `parse::Parser` in that it yields a `ParsedResult`, but it does not use `parse::Parser` itself; that was the _original_ plan: convert it into a `ParseState` where `XmlXirReader` became a context, and force `Parser` to yield by feeding it a stream of tokens with `repeat`, but that ended up performing poorly relative to this change. I did some investigation, which I might write about in the future, but for now, this solution works just fine. DEV-7145	2022-06-02 10:30:44 -04:00
Mike Gerwitz	1ad2fb1dc8	Copyright year update 2022 RSG (Ryan Specialty Group) recently announced a rename to Ryan Specialty (no "Group"), but I'm not sure if the legal name has been changed yet or not, so I'll wait on that.	2022-05-03 14:14:29 -04:00
Mike Gerwitz	eaa8133d21	tamer: diagnose: Introduction of diagnostic system This is a working concept that will continue to evolve. I wanted to start with some basic output before getting too carried away, since there's a lot of potential here. This is heavily influenced by Rust's helpful diagnostic messages, but will take some time to realize a lot of the things that Rust does. The next step will be to resolve line and column numbers, and then possibly include snippets and underline spans, placing the labels alongside them. I need to balance this work with everything else I have going on. This is a large commit, but it converts the existing Error Display impls into Diagnostic. This separation is a bit verbose, so I'll see how this ends up evolving. Diagnostics are tied to Error at the moment, but I imagine in the future that any object would be able to describe itself, error or not, which would be useful in the future both for the Summary Page and for query functionality, to help developers understand the systems they are writing using TAME. Output is integrated into tameld only in this commit; I'll add tamec next. Examples of what this outputs are available in the test cases in this commit. DEV-10935	2022-04-13 15:22:46 -04:00
Mike Gerwitz	a1a4ad3e8e	tamer: Introduce context into XirReader tamec and tameld will now both introduce a `Context` to XIR, which will use it to create spans. Here's an example of an error, now that it's all working well together: $ target/release/tameld --emit xmle -o /dev/null path/to/package.xmlo error: invalid preproc:sym/@dim `9` at [/../path/to/package.xmlo offset 1175451-1175452] A future task will make this human-readable by producing line and column numbers, and perhaps even a snippet (if not now, then eventually). It's exciting to see this coming together finally. DEV-10934	2022-04-08 16:16:23 -04:00
Mike Gerwitz	68223cb7d3	tamer: xir::reader: Additional quick-xml error spans There's a bit to unpack here. Some of the spans originate from quick-xml's error handling, but in coming up with test cases to try to trigger errors, I found that quick-xml is far too permissive in what it accepts, and oughtright dangerous in some situations. I feel like the writing is on the wall for quick-xml, but I'll probably wait until replacing `xmlo` with a more efficient format before deciding whether to use a different library or implement parsing ourselves. There's a lot of factors to consider, and a library would have to not only be correct and performant, but provide useful information for span generation. But for now, I have other more important things to work on, like a functioning compiler. So while quick-xml is around, I'll just have to do the best I can to provide a correct parser with useful errors. DEV-10934	2022-04-08 14:54:49 -04:00
Mike Gerwitz	ab181670b5	tamer: xir::reader: Initial introduction of spans This is a large change, and was a bit of a tedious one, given the comprehensive tests. This introduces proper offsets and lengths for spans, with the exception of some quick-xml errors that still need proper mapping. Further, this still uses `UNKNOWN_CONTEXT`, which will be resolved shortly. This also introduces `SpanlessError`, which `Error` explicitly _does not_ implement `From<SpanlessError>` for---this forces the caller to provide a span before the error is compatable with the return value, ensuring that spans will actually be available rather than forgotten for errors. This is important, given that errors are generally less tested than the happy path, and errors are when users need us the most (so, need span information). Further, I had to use pointer arithmetic in order to calculate many of the spans, because quick-xml does not provide enough information. There's no safety considerations here, and the comprehensive unit test will ensure correct behavior if the implementation changes in the future. I would like to introduce typed spans at some point---I made some opinionated choices when it comes to what the spans ought to represent. Specifically, whether to include the `<` or `>` with the open span (depends), whether to include quotes with attribute values (no), and some other details highlighted in the test cases. If we provide typed spans, then we could, knowing the type of span, calculate other spans on request, e.g. to include or omit quotes for attributes. Different such spans may be useful in different situations when presenting information to the user. This also highlights gaps in the tokens emitted by XIR, such as whitespace between attributes, the `=` between name and value, and so on. These are important when it comes to code formatting, so that we can reliably reconstruct the XML tree, but it's not important right now. I anticipate future changes would allow the XIR reader to be configured (perhaps via generics, like a strategy-type pattern) to optionally omit these tokens if desired. Anyway, more to come. DEV-10934	2022-04-08 13:59:37 -04:00
Mike Gerwitz	99aacaf7ca	tamer: tamec: Replace copy with XIR parsing/writing When wip-frontends is on, this will parse the input file using XIR and then immediately output it again. This makes the necessary changes to be able to read every source file we have in our largest project, such that the output is identical after having been formatted with `xmllint --format -` (there are differences because e.g. whitespace between attributes is not yet maintained). This is performant too, with times remaining essentially identical despite the additional work. DEV-10413	2022-04-07 12:13:49 -04:00
Mike Gerwitz	2e386f1baf	tamer: xir::reader::XmlXirReader::refill_buf: Clear read buffer This was done in the old reader many months ago, but I somehow forgot to do it here (or forgot to). The new reader was using substantially more memory. Here's how this change affects the memory profile for one of our systems (output from `ms_print`): Before: MB 79.75^ # \| # \| # @ \| @@@@ # @ \| @@@ # @@ \| @@@ @@@#@ @@@@@ \| @@@ @@ #@@@@@@@@@@ \| @@@@@@ @@@@ #@@@@@@@@@@ \| @@ @@ @@@ @@ @ @@ #@@@@@@@@@@ \| @@ @@ @@@ @@@@@ @@ #@@@@@@@@@@ \| @@@@@ @@@ @@@@@@ @@ #@@@@@@@@@@ \| @@@@@ @@@ @@@@@@ @@ #@@@@@@@@@@ \| @@ @@@@@ @@@ @@@@@@ @@ #@@@@@@@@@@ \| @ @@ @@ @ @@@@@@ @@@ @@@@@@ @@ #@@@@@@@@@@ \| @ @ @@@ @@ @@@ @@@@@@ @@@ @@@@@@ @@ #@@@@@@@@@@ \| @ @@@@ @@@@@@@@@@@@@@@@@@@@@ @@@@@@@@@@ @@@ @@@@@@ @@ #@@@@@@@@@@ \| @@@ @@@@@@ @@@@@@@@@ @@@@@ @@@@@ @@ @@@@@@@ @@@ @@@@@@ @@ #@@@@@@@@@@ \| @@@ @ @@@@ @@@@@@@@@ @@@@@ @@@@@ @@ @@@@@@@ @@@ @@@@@@ @@ #@@@@@@@@@@ \| @@@ @@@ @@@@ @@@@@@@@@ @@@@@ @@@@@ @@ @@@@@@@ @@@ @@@@@@ @@ #@@@@@@@@@@ \| @@@ @@@ @@@@ @@@@@@@@@ @@@@@ @@@@@ @@ @@@@@@@ @@@ @@@@@@ @@ #@@@@@@@@@@ 0 +----------------------------------------------------------------------->Gi 0 15.20 After: MB 63.25^ # \| # \| @@@@@@@@@#@ \| @@@@@@ @@#@ \| @@@@@@ @@#@ \| @@@@@@ @@#@ \| @@@@@@ @@#@ \| @@@@@@@@@@@@ @@#@ \| @@@@@@@@@ @@ @@@@@@ @@#@ \| @@@@@@@@ @@@ @@@ @@ @@@@@@ @@#@ \| @@@@@ @ @@@ @@@ @@ @@@@@@ @@#@ \| @@@@@ @ @@@ @@@ @@ @@@@@@ @@#@ \| @@@@@@ @ @@@ @@@ @@ @@@@@@ @@#@ \| @@@@@@ @ @@@ @@@ @@ @@@@@@ @@#@ \| @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ @ @@@ @@@ @@ @@@@@@ @@#@ \| @@@@@@@@@@@@@ @@@@@@@@ @@@@@@@@@@@@@@@ @ @@@ @@@ @@ @@@@@@ @@#@ \| @@@@@@@@@@@@@@@ @@@@@@@@ @@@@@@@@@@@@@@@ @ @@@ @@@ @@ @@@@@@ @@#@ \| @@@@@@@@@@@@@@@@@ @@@@@@@@ @@@@@@@@@@@@@@@ @ @@@ @@@ @@ @@@@@@ @@#@ \| @@@@@@@@@@@@@@@@@@@@ @@@@@@@@ @@@@@@@@@@@@@@@ @ @@@ @@@ @@ @@@@@@ @@#@ \| @@@@@@@@@@@@@@@@@@@@ @@@@@@@@ @@@@@@@@@@@@@@@ @ @@@ @@@ @@ @@@@@@ @@#@ 0 +----------------------------------------------------------------------->Gi 0 15.20 The bottom graph is virtually identical to the memory profile of the old reader, just with the exception that it's interning a bit more data than before, because we're reading more comprehensively. That's (potentially) the subject of future changes. DEV-12038	2022-04-06 11:50:07 -04:00
Mike Gerwitz	0cc0bc9d5a	tamer: xir::Token::AttrEnd: Remove More information can be found in the prior commit message, but I'll summarize here. This token was introduced to create a LL(0) parser---no tokens of lookahead. This allowed the underlying TokenStream to be freely passed to the next system that needed it. Since then, Parser and ParseState were introduced, along with ParseStatus::Dead, which introduces the concept of lookahead for a single token---an LL(1) grammar. I had always suspected that this would happen, given the awkwardness of AttrEnd; it was just a matter of time before the right abstraction manifested itself to handle lookahead. DEV-11339	2021-12-17 10:14:31 -05:00
Mike Gerwitz	325c3167ee	tamer: xir::Token::span: New method This permits retrieving a Span from any Token variant. To support this, rather than having this return an Option, Token::AttrEnd was augmented with a Span; this results in a much simpler and friendlier API. DEV-11268	2021-12-06 14:48:55 -05:00
Mike Gerwitz	5233822322	tamer: xir: Remove Text enum Like previous commits, this replaces the explicit escaping context with the convention that all values retrieved from `xir` are unescaped on read and escaped on write. Comments are a notable TODO, since we must escape only `--`. CData is also an issue. I had _expected_ to use it as a means to avoid unescaping fragments, but I had forgotten that quick_xml hard-codes escaping on read, so that it can re-use BytesStart! That is terribly unfortunate, and may result in us having to re-implement our own read method in the future to avoid this nonsense. So I'm just leaving it as a TODO for now. DEV-11081	2021-11-15 23:47:14 -05:00
Mike Gerwitz	d710437ee4	tamer: xir::escape::CachingEscaper: New Escaper As promised, this will cache previously seen escaped/unescaped values by creating a two-way mapping between them. DEV-11081	2021-11-15 16:44:24 -05:00
Mike Gerwitz	27ba03b59b	tamer: xir::escape: Remove XirString in favor of Escaper This rewrites a good portion of the previous commit. Rather than explicitly storing whether a given string has been escaped, we can instead assume that all SymbolIds leaving or entering XIR are unescaped, because there is no reason for any other part of the system to deal with such details of XML documents. Given that, we need only unescape on read and escape on write. This is customary, so why didn't I do that to begin with? The previous commit outlines the reason, mainly being an optimization for the echo writer that is upcoming. However, this solution will end up being better---it's not implemented yet, but we can have a caching layer, such that the Escaper records a mapping between escaped and unescaped SymbolIds to avoid work the next time around. If we share the Escaper between _all_ readers and the writer, the result is that 1. Duplicate strings between source files and object files (many of which are read by both the linker and compiler) avoid re-unescaping; and 2. Writers can use this cache to avoid re-escaping when we've already seen the escaped variant of the string during read. The alternative would be a global cache, like the internment system, but I did not find that to be appropriate here, since this is far less fundamental and is much easier to compose. DEV-11081	2021-11-12 14:03:23 -05:00
Mike Gerwitz	b1c0783c75	tamer: xir::XirString: WIP implementation (likely going away) I'm not fond of this implementation, which is why it's not fully completed. I wanted to commit this for future reference, and take the opportunity to explain why I don't like it. First: this task started as an idea to implement a third variant to AttrValue and friends that indicates that a value is fixed, in the sense of a fixed-point function: escaped or unescaped, its value is the same. This would allow us to skip wasteful escape/unescape operations. In doing so, it became obvious that there's no need to leak this information through the API, and indeed, no part of the system should care. When we read XML, it should be unescaped, and when we write, it should be escaped. The reason that this didn't quite happen to begin with was an optimization: I'll be creating an echo writer in place of the current filesystem-based copy in tamec shortly, and this would allow streaming XIR directly from the reader to the writer without any unescaping or re-escaping. When we unescape, we know the value that it came from, so we could simply store both symbols---they're 32-bit, so it results in a nicely compressed 64-bit value, so it's essentially cost-free, as long as we accept the expense of internment. This is `XirString`. Then, when we want to escape or unescape, we first check to see whether a symbol already exists and, if so, use it. While this works well for echoing streams, it won't work all that well in practice: the unescaped SymbolId will be taken and the XirString discarded, since nothing after XIR should be coupled with it. Then, when we later construct a XIR stream for writting, XirString will no longer be available and our previously known escape is lost, so the writer will have to re-escape. Further, if we look at XirString's generic for the XirStringEscaper---it uses phantom, which hints that maybe it's not in the best place. Indeed, I've already acknowledged that only a reader unescapes and only a writer escapes, and that the rest of the system works with normal (unescaped) values, so only readers and writers should be part of this process. I also already acknowledged that XirString would be lost and only the unescaped SymbolId would be used. So what's the point of XirString, then, if it won't be a useful optimization beyond the temporary echo writer? Instead, we can take the XirStringWriter and implement two caches on that: mapping SymbolId from escaped->unescaped and vice-versa. These can be simple vectors, since SymbolId is a 32-bit value we will not have much wasted space for symbols that never get read or written. We could even optimize for preinterned symbols using markers, though I'll probably not do so, and I'll explain why later. If we do _that_, we get even _better_ optimizations through caching that _will_ apply in the general case (so, not just for echo), and we're able to ditch XirString entirely and simply use a SymbolId. This makes for a much more friendly API that isn't leaking implementation details, though it _does_ put an onus on the caller to pass the encoder to both the reader and the writer, _if_ it wants to take advantage of a cache. But that burden is not significant (and is, again, optional if we don't want it). So, that'll be the next step.	2021-11-10 12:22:10 -05:00
Mike Gerwitz	428d508be4	tamer: {ir::=>}{asg, xir} See the previous commit. There is no sense in some common "IR" namespace, since those IRs should live close to whatever system whose data they represent. In the case of these, they are general IRs that can apply to many different parts of the system. If that proves to be a false statement, they'll be moved. DEV-10863	2021-11-04 16:13:27 -04:00

23 Commits (db3fd3f1775e884aa63f1935f3f80952c8dd1f87)