employer/tame - tame - Mike Gerwitz's Forge

Author	SHA1	Message	Date
Mike Gerwitz	5233822322	tamer: xir: Remove Text enum Like previous commits, this replaces the explicit escaping context with the convention that all values retrieved from `xir` are unescaped on read and escaped on write. Comments are a notable TODO, since we must escape only `--`. CData is also an issue. I had _expected_ to use it as a means to avoid unescaping fragments, but I had forgotten that quick_xml hard-codes escaping on read, so that it can re-use BytesStart! That is terribly unfortunate, and may result in us having to re-implement our own read method in the future to avoid this nonsense. So I'm just leaving it as a TODO for now. DEV-11081	2021-11-15 23:47:14 -05:00
Mike Gerwitz	27ba03b59b	tamer: xir::escape: Remove XirString in favor of Escaper This rewrites a good portion of the previous commit. Rather than explicitly storing whether a given string has been escaped, we can instead assume that all SymbolIds leaving or entering XIR are unescaped, because there is no reason for any other part of the system to deal with such details of XML documents. Given that, we need only unescape on read and escape on write. This is customary, so why didn't I do that to begin with? The previous commit outlines the reason, mainly being an optimization for the echo writer that is upcoming. However, this solution will end up being better---it's not implemented yet, but we can have a caching layer, such that the Escaper records a mapping between escaped and unescaped SymbolIds to avoid work the next time around. If we share the Escaper between _all_ readers and the writer, the result is that 1. Duplicate strings between source files and object files (many of which are read by both the linker and compiler) avoid re-unescaping; and 2. Writers can use this cache to avoid re-escaping when we've already seen the escaped variant of the string during read. The alternative would be a global cache, like the internment system, but I did not find that to be appropriate here, since this is far less fundamental and is much easier to compose. DEV-11081	2021-11-12 14:03:23 -05:00
Mike Gerwitz	b1c0783c75	tamer: xir::XirString: WIP implementation (likely going away) I'm not fond of this implementation, which is why it's not fully completed. I wanted to commit this for future reference, and take the opportunity to explain why I don't like it. First: this task started as an idea to implement a third variant to AttrValue and friends that indicates that a value is fixed, in the sense of a fixed-point function: escaped or unescaped, its value is the same. This would allow us to skip wasteful escape/unescape operations. In doing so, it became obvious that there's no need to leak this information through the API, and indeed, no part of the system should care. When we read XML, it should be unescaped, and when we write, it should be escaped. The reason that this didn't quite happen to begin with was an optimization: I'll be creating an echo writer in place of the current filesystem-based copy in tamec shortly, and this would allow streaming XIR directly from the reader to the writer without any unescaping or re-escaping. When we unescape, we know the value that it came from, so we could simply store both symbols---they're 32-bit, so it results in a nicely compressed 64-bit value, so it's essentially cost-free, as long as we accept the expense of internment. This is `XirString`. Then, when we want to escape or unescape, we first check to see whether a symbol already exists and, if so, use it. While this works well for echoing streams, it won't work all that well in practice: the unescaped SymbolId will be taken and the XirString discarded, since nothing after XIR should be coupled with it. Then, when we later construct a XIR stream for writing, XirString will no longer be available and our previously known escape is lost, so the writer will have to re-escape. Further, if we look at XirString's generic for the XirStringEscaper---it uses phantom, which hints that maybe it's not in the best place. Indeed, I've already acknowledged that only a reader unescapes and only a writer escapes, and that the rest of the system works with normal (unescaped) values, so only readers and writers should be part of this process. I also already acknowledged that XirString would be lost and only the unescaped SymbolId would be used. So what's the point of XirString, then, if it won't be a useful optimization beyond the temporary echo writer? Instead, we can take the XirStringWriter and implement two caches on that: mapping SymbolId from escaped->unescaped and vice-versa. These can be simple vectors, since SymbolId is a 32-bit value we will not have much wasted space for symbols that never get read or written. We could even optimize for preinterned symbols using markers, though I'll probably not do so, and I'll explain why later. If we do _that_, we get even _better_ optimizations through caching that _will_ apply in the general case (so, not just for echo), and we're able to ditch XirString entirely and simply use a SymbolId. This makes for a much more friendly API that isn't leaking implementation details, though it _does_ put an onus on the caller to pass the encoder to both the reader and the writer, _if_ it wants to take advantage of a cache. But that burden is not significant (and is, again, optional if we don't want it). So, that'll be the next step.	2021-11-10 12:22:10 -05:00
Mike Gerwitz	428d508be4	tamer: {ir::=>}{asg, xir} See the previous commit. There is no sense in some common "IR" namespace, since those IRs should live close to whatever system whose data they represent. In the case of these, they are general IRs that can apply to many different parts of the system. If that proves to be a false statement, they'll be moved. DEV-10863	2021-11-04 16:13:27 -04:00
Mike Gerwitz	e91aeef478	tamer: Remove Ix generalization throughout system This had the writing on the wall all the same as the `'i` interner lifetime that came before it. It was too much of a maintenance burden trying to accommodate both 16-bit and 32-bit symbols generically. There is a situation where we do still want 16-bit symbols---the `Span`. Therefore, I have left generic support for symbol sizes, as well as the different global interners, but `SymbolId` now defaults to 32-bit, as does `Asg`. Further, the size parameter has been removed from the rest of the code, with the exception of `Span`. This cleans things up quite a bit, and is much nicer to work with. If we want 16-bit symbols in the future for packing to increase CPU cache performance, we can handle that situation then in that specific case; it's a premature optimization that's not at all worth the effort here.	2021-09-23 14:52:54 -04:00
Mike Gerwitz	e0a209d417	tamer: bench: xir: Reduce writer benchmark memory usage These were using GiB of memory, which is ...unnecessary. I reduced the iteration count significantly, but it was still wasting a lot of time and memory and needed `with_capacity` to reduce the number of copies after reallocation. It is not typical that a buffer would contain this much information.	2021-09-21 16:21:32 -04:00
Mike Gerwitz	aee781a6fb	tamer: bench: xir: Fix broken benchmark This broke when I removed `SelfClose`. I used to run `make all fmt check bench` before every push, but they take a while to run, in part because it uses nightly and has to recompile too. But it looks like I need to be more diligent again.	2021-09-21 16:09:50 -04:00
Mike Gerwitz	cd1eae95ca	tamer: xir: {NodeStream=>Token} I decided not to do this in a previous commit because I had documented "NodeStream" elsewhere, so I'd like it to be in the Git history to understand its evolution. This never was a "Node" stream beyond the initial concept phase, because it represents tokens that aren't themselves nodes. It is intended to generate XML nodes, but may need to accommodate non-nodes (e.g. XML declarations) in the future. The name originated from `Node`, which was a tree-based IR that was initially conceived, but removed because it's not yet needed. What we need is a streaming IR for xmle writing, and then for reading and echoing back out XML for the new frontend.	2021-08-20 10:30:27 -04:00
Mike Gerwitz	a23bae5e4d	tamer: XIR: Working concept This is a working streaming IR for XML. I want to get this committed before I go further cleaning it up and integrating it into the xmle writer. This is lacking detailed documentation, and the names of things may end up changing. Initial benchmarks do show that it has a ~2x performance improvement over quick-xml when dealing with two attributes on a node, and I suspect that improvement will increase with the number of attributes. We will see how it compares in real-world benchmarks once the linker has been modified to use it. The goal isn't to _avoid_ quick-xml---it'll be used in the future for things like escaping that would be a huge waste to implement ourselves. It just so happened that quick-xml was not beneficial for these changes; indeed, its own writer is fairly simple for the portions that were implemented here, so there's no use in fighting with its API, particularly around attributes and our need to explicitly control whitespace (with the intent of handling code formatters in the future). To put this into perspective: the reason this work is being done isn't to refactor the linker, or to speed it up, but to generalize XML writing and provide a suitable IR for use in the compiler. The first step of the frontend is to essentially echo the XML token stream back out so we can incrementally parse it and do something useful, to incrementally rewrite the compiler in Rust.	2021-08-20 10:16:36 -04:00

9 Commits (a1a4ad3e8ebd57666b812422a16d8ed99289dcfa)
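
The caching idea described in commits 27ba03b59b and b1c0783c75 (unescape on read, escape on write, with one escaper shared by readers and the writer so that known escaped/unescaped pairs are reused) can be illustrated with a rough sketch. This is not TAMER's code: `SymbolId`, `Interner`, `CachingEscaper`, and the helper functions below are hypothetical stand-ins written for this example, the caches are hash maps rather than the vectors the commit message suggests, and only the five predefined XML entities are handled.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a 32-bit interned symbol.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct SymbolId(u32);

/// Toy interner for this sketch only; not TAMER's internment system.
#[derive(Default)]
pub struct Interner {
    ids: HashMap<String, SymbolId>,
    strs: Vec<String>,
}

impl Interner {
    pub fn intern(&mut self, s: &str) -> SymbolId {
        if let Some(&id) = self.ids.get(s) {
            return id;
        }
        let id = SymbolId(self.strs.len() as u32);
        self.strs.push(s.to_owned());
        self.ids.insert(s.to_owned(), id);
        id
    }

    pub fn lookup(&self, id: SymbolId) -> &str {
        &self.strs[id.0 as usize]
    }
}

/// Escape the five predefined XML entities.
fn escape_str(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            '\'' => out.push_str("&apos;"),
            _ => out.push(c),
        }
    }
    out
}

/// Inverse of `escape_str`. `&amp;` is replaced last so that
/// `&amp;lt;` yields `&lt;` rather than `<`.
fn unescape_str(s: &str) -> String {
    s.replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&quot;", "\"")
        .replace("&apos;", "'")
        .replace("&amp;", "&")
}

/// Escaper meant to be shared by readers and the writer.
/// Each miss records the pair in both directions.
#[derive(Default)]
pub struct CachingEscaper {
    to_escaped: HashMap<SymbolId, SymbolId>,
    to_unescaped: HashMap<SymbolId, SymbolId>,
    /// Number of times real escaping or unescaping work was done.
    pub misses: usize,
}

impl CachingEscaper {
    /// Reader side: escaped symbol in, unescaped symbol out.
    pub fn unescape(&mut self, interner: &mut Interner, escaped: SymbolId) -> SymbolId {
        if let Some(&hit) = self.to_unescaped.get(&escaped) {
            return hit;
        }
        self.misses += 1;
        let text = unescape_str(interner.lookup(escaped));
        let unescaped = interner.intern(&text);
        self.to_unescaped.insert(escaped, unescaped);
        // Keep the first escaped variant seen for this value.
        self.to_escaped.entry(unescaped).or_insert(escaped);
        unescaped
    }

    /// Writer side: unescaped symbol in, escaped symbol out.
    pub fn escape(&mut self, interner: &mut Interner, unescaped: SymbolId) -> SymbolId {
        if let Some(&hit) = self.to_escaped.get(&unescaped) {
            return hit;
        }
        self.misses += 1;
        let text = escape_str(interner.lookup(unescaped));
        let escaped = interner.intern(&text);
        self.to_escaped.insert(unescaped, escaped);
        self.to_unescaped.insert(escaped, unescaped);
        escaped
    }
}
```

In this sketch, a caller would create one `CachingEscaper` and pass it to every reader and to the writer. A string unescaped during a read can then be written back out without being escaped again, which is the benefit the commit message attributes to sharing the escaper. Callers that do not need the cache could give each component its own instance.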