employer/tame - tame - Mike Gerwitz's Forge

Author	SHA1	Message	Date
Mike Gerwitz	15ff00b3cf	tamer: sym: Only prefill 32-bit global interner This is the interner that is intended to be used with the majority of the system; the 16-bit interner is left around for the moment, but will likely later become specialized.	2021-09-23 16:11:17 -04:00
Mike Gerwitz	e91aeef478	tamer: Remove Ix generalization throughout system This had the writing on the wall all the same as the `'i` interner lifetime that came before it. It was too much of a maintenance burden trying to accommodate both 16-bit and 32-bit symbols generically. There is a situation where we do still want 16-bit symbols---the `Span`. Therefore, I have left generic support for symbol sizes, as well as the different global interners, but `SymbolId` now defaults to 32-bit, as does `Asg`. Further, the size parameter has been removed from the rest of the code, with the exception of `Span`. This cleans things up quite a bit, and is much nicer to work with. If we want 16-bit symbols in the future for packing to increase CPU cache performance, we can handle that situation then in that specific case; it's a premature optimization that's not at all worth the effort here.	2021-09-23 14:52:54 -04:00
Mike Gerwitz	ed245bb099	tamer: sym::prefill: Initial typed static symbol concept We'll see how the syntax evolves over time. It's not ideal to have to specify the type, rather than having the compiler infer it, but I don't much feel like getting into my first procedural macro right now, so we'll stick with this approach for the time being. This will set the stage to be able to safely e.g. create QNames statically at compile-time and would allow us to make any attempts to bypass it unsafe.	2021-09-23 00:37:39 -04:00
Mike Gerwitz	b972b0b202	tamer: sym::StaticSymbolId: Introduce Previously, we were allocating only u32 versions of `SymbolId` for the statically allocated symbols. This introduces a new symbol type with a very small datatype (8 bits) that is able to cast into any `SymbolId`. This is explained in the docs. We'll be taking this typing further in future commits so that static symbols are better-suited for compile-time guarantees for static newtype construction. DEV-10710	2021-09-22 21:37:06 -04:00
Mike Gerwitz	366fef714b	tamer: sym::prefill: Introduce static symbols This is the beginning of static symbols, which are becoming increasingly necessary, as it's quite a pain to have to deal with interning static strings any place they're used. It's _more_ of a pain to do that in conjunction with newtypes (e.g. `QName`, `AttValue`, etc) that make use of `SymbolId`; this will allow us to construct _those_ statically as well, and additional work to support that will be coming up. DEV-10701	2021-09-22 16:08:40 -04:00
Mike Gerwitz	1cdb3fbbc5	tamer: tameld: Skip fragment unescaping only to re-escape on write
Fragment text was unescaped on reading, producing an owned String and spending time parsing the text to unescape. We were then copying that into an internment pool (so, copying twice, effectively). Further, we were then _re-escaping_ on write. This was all wasteful, since we do not do any manipulation of the fragment before outputting to the xmle file; we know that Saxon produced properly escaped XML to begin with, and can trust it and propagate it as-is.
This also introduces a new global `clone_uninterned_utf8_unchecked` method.
In profiling this change, I tested (a) before this change, (b) after writing without escaping, and (c) after both reading escaped and writing without escaping.
        (a)                   (b)                   (c)
    sec    mem (B)        sec      B            sec      B
  0:00.95  47896   ->   0:00.91  47988   ->   0:00.87  48288
  0:00.40  30176   ->   0:00.37  25656   ->   0:00.36  25788
  0:00.39  45672   ->   0:00.37  45756   ->   0:00.35  34952
  0:00.39  20716   ->   0:00.38  19604   ->   0:00.36  19956
  0:00.33  16836   ->   0:00.32  16988   ->   0:00.31  16892
  0:00.23  15268   ->   0:00.23  15236   ->   0:00.22  15312
  0:00.44  20780   ->   0:00.44  20048   ->   0:00.41  20148
  0:00.54  44516   ->   0:00.50  36964   ->   0:00.49  36728
  0:00.62  55976   ->   0:00.57  46204   ->   0:00.54  41468
  0:00.31  28016   ->   0:00.30  27308   ->   0:00.28  23844
  0:00.23  15388   ->   0:00.22  15316   ->   0:00.21  15304
  0:00.05   4888   ->   0:00.05   4760   ->   0:00.05   4948
  0:00.41  19756   ->   0:00.41  19852   ->   0:00.40  19992
  0:00.47  20828   ->   0:00.46  20844   ->   0:00.44  20968
  0:00.27  18152   ->   0:00.26  18184   ->   0:00.25  18312
Interestingly, the peak memory usage increases very slightly between the second and third steps (though decreases from the first), likely because the raw (encoded) text is larger than the unencoded text (e.g. `&gt;` takes more space than `>`).	2021-08-18 11:39:06 -04:00
Mike Gerwitz	f97141f5c5	tamer: tameld: Use uninterned symbols for reader Fragments were previously represented by `String` to avoid the cost of interning (hashing and copying). This change modifies the reader to use uninterned symbols, which still incur a copy overhead but are not hashed. Initial tests showed a small performance decrease of about 15% and a small memory increase of similar proportion. However, once I realized that I was not clearing buffers from quick_xml events and implemented that change in a previous commit, this change ended up being approximately on par with `String`, despite the copying of some pretty large fragments. YMMV, though, and perhaps on less powerful systems time may increase slightly. The upcoming XIR (XML IR) was originally going to support both owned strings and symbols, but now we'll just use uninterned symbols; I can't rationalize complicating the API at this time when it will provide an almost imperceptible performance benefit. If ever that changes in the future, that change will be entertained. The end result is that the fate of a fragment's underlying memory is determined by whatever is processing the data, _not_ by the API itself---the API was previously forcing use of a String, whereas now it's up to the caller to determine whether we want comparable interns. For fragments, that's not likely ever to be the case, especially considering that the representation will change so drastically in the future.	2021-08-16 14:05:32 -04:00
Mike Gerwitz	ce233ac01d	tamer: sym: Uninterned symbols This adds support for uninterned symbols. This came about as I was creating Xir (not yet committed) where I had to decide if I wanted `SymbolId` for all values, even though some values (e.g. large text blocks like compiled code fragments for xmle files) will never be compared, and so would be wastefully hashed. Previous IRs used `String`, but that was clumsy; see documentation in this commit for rationale.	2021-08-13 22:54:04 -04:00
Mike Gerwitz	29ab4b9bfc	tamer: sym: Disallow SymbolId construction outside of module SymbolIds must only be constructed by interners; otherwise, we lose confidence in the type. This offers an associated function to construct raw SymbolIds from integers for testing purposes.	2021-08-13 11:54:11 -04:00
Mike Gerwitz	9deb393bfd	tamer: Global interners
This is a major change, and I apologize for it all being in one commit. I had wanted to break it up, but doing so would have required a significant amount of temporary work that was not worth doing while I'm the only one working on this project at the moment.
This accomplishes a number of important things, now that I'm preparing to write the first compiler frontend for TAMER:
  1. `Symbol` has been removed; `SymbolId` is used in its place.
  2. Consequently, symbols use 16 or 32 bits, rather than a 64-bit pointer.
  3. Using symbols no longer requires dereferencing.
  4. Lifetimes no longer pollute the entire system! (`'i`)
  5. Two global interners are offered to produce `SymbolStr` with `'static` lifetimes, simplifying lifetime management and borrowing where strings are still needed.
  6. A nice API is provided for interning and lookups (e.g. "foo".intern()) which makes this look like a core feature of Rust.
Unfortunately, making this change required modifications to...virtually everything. And that serves to emphasize why this change was needed: _everything_ used symbols, and so there's no use in not providing globals. I implemented this in a way that still provides for loose coupling through Rust's trait system.
Indeed, Rustc offers a global interner, and I decided not to go that route initially because it wasn't clear to me that such a thing was desirable. It didn't become apparent to me, in fact, until the recent commit where I introduced `SymbolIndexSize` and saw how many things had to be touched; the linker evolved so rapidly as I was trying to learn Rust that I lost track of how bad it got.
Further, this shows how the design of the internment system was a bit naive---I assumed certain requirements that never panned out. In particular, everything using symbols stored `&'i Symbol<'i>`---that is, a reference (usize) to an object containing an index (32-bit) and a string slice (128-bit). So it was a reference to a pretty large value, which was allocated in the arena alongside the interned string itself. But, that was assuming that something would need both the symbol index _and_ a readily available string. That's not the case. In fact, it's pretty clear that interning happens at the beginning of execution, that `SymbolId` is all that's needed during processing (unless an error occurs; more on that below); and it's not until _the very end_ that we need to retrieve interned strings from the pool to write either to a file or to display to the user. It was horribly wasteful!
So `SymbolId` solves the lifetime issue in itself for most systems, but it still requires that an interner be available for anything that needs to create or resolve symbols, which, as it turns out, is still a lot of things. Therefore, I decided to implement them as thread-local static variables, which is very similar to what Rustc does itself (Rustc's are scoped). TAMER does not use threads, so the resulting `'static` lifetime should be just fine for now. Eventually I'd like to implement `!Send` and `!Sync`, though, to prevent references from escaping the thread (as noted in the patch); I can't do that yet, since the feature has not yet been stabilized.
In the end, this leaves us with a system that's much easier to use and maintain; hopefully easier for newcomers to get into without having to deal with so many complex lifetimes; and a nice API that makes it a pleasure to work with symbols.
Admittedly, the `SymbolIndexSize` adds some complexity, and we'll see if I end up regretting that down the line, but it exists for an important reason: the `Span` and other structures that'll be introduced need to pack a lot of data into 64 bits so they can be freely copied around to keep lifetimes simple without wreaking havoc in other ways, but a 32-bit symbol size needed by the linker is too large for that. (Actually, the linker doesn't yet need 32 bits for our systems, but it's going to in the somewhat near future unless we optimize away a bunch of symbols...but I'd really rather not have the linker hit a limit that requires a lot of code changes to resolve).
Rustc uses interned spans when they exceed 8 bytes, but I'd prefer to avoid that for now. Most systems can just use one of the `PkgSymbolId` or `ProgSymbolId` type aliases and not have to worry about it. Systems that are actually shared between the compiler and the linker do, though, but it's not like we don't already have a bunch of trait bounds. Of course, as we implement link-time optimizations (LTO) in the future, it's possible most things will need the size and I'll grow frustrated with that and possibly revisit this. We shall see.
Anyway, this was exhausting...and...onward to the first frontend!	2021-08-11 14:24:55 -04:00
Mike Gerwitz	71011f5724	tamer: sym: Split into multiple modules This helps to organize a bit better as I prepare to introduce singleton interners.	2021-08-02 23:54:37 -04:00
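
The "Global interners" commit (9deb393bfd) describes a thread-local interner that hands out small integer `SymbolId`s and a `"foo".intern()` convenience API, and commit 29ab4b9bfc restricts `SymbolId` construction to the interner's module. The following is only a rough sketch of that idea, not TAMER's actual code: the `Interner` struct, the `GlobalIntern` trait, `lookup_str`, and `demo` are names assumed for illustration, strings are leaked instead of arena-allocated, and lookups return a plain `&'static str` where the commit mentions `SymbolStr`.

```rust
use std::cell::RefCell;
use std::collections::HashMap;

/// Hypothetical 32-bit symbol id. The field is private, so code outside
/// this module cannot fabricate an id that no interner produced.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct SymbolId(u32);

/// Minimal interner (assumed design): maps each distinct string to an index.
#[derive(Default)]
struct Interner {
    map: HashMap<&'static str, SymbolId>,
    strings: Vec<&'static str>,
}

impl Interner {
    fn intern(&mut self, s: &str) -> SymbolId {
        if let Some(&id) = self.map.get(s) {
            return id; // already interned: reuse the existing id
        }
        // Leak one copy so it lives for the rest of the program. This keeps
        // the sketch short; an arena would be the more likely real choice.
        let stored: &'static str = Box::leak(s.to_owned().into_boxed_str());
        let id = SymbolId(self.strings.len() as u32);
        self.strings.push(stored);
        self.map.insert(stored, id);
        id
    }

    fn lookup(&self, id: SymbolId) -> &'static str {
        self.strings[id.0 as usize]
    }
}

thread_local! {
    // One interner per thread, reachable without passing references around.
    static INTERNER: RefCell<Interner> = RefCell::new(Interner::default());
}

/// Hypothetical extension trait providing `"foo".intern()`.
pub trait GlobalIntern {
    fn intern(self) -> SymbolId;
}

impl GlobalIntern for &str {
    fn intern(self) -> SymbolId {
        INTERNER.with(|i| i.borrow_mut().intern(self))
    }
}

impl SymbolId {
    /// Resolve the id back to its string, e.g. when writing output.
    pub fn lookup_str(self) -> &'static str {
        INTERNER.with(|i| i.borrow().lookup(self))
    }
}

/// Example usage: equal strings yield equal ids; the string itself is
/// only needed again at the very end.
pub fn demo() -> (SymbolId, SymbolId, &'static str) {
    let a = "foo".intern();
    let b = "foo".intern();
    (a, b, a.lookup_str())
}
```

Because `SymbolId` here is a `Copy` integer with no lifetime parameter, it can be stored and compared freely during processing, which is the property the commit message credits with removing the `'i` lifetime from the rest of the system.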

11 Commits (15ff00b3cffea2893e552bba7d2c01e67f17a908)
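
The static-symbol commits (366fef714b, b972b0b202) describe prefilling the interner with known strings and an 8-bit static symbol type that can be cast into a `SymbolId` of any size. A minimal sketch of how such a scheme could work is below. Everything in it is assumed for illustration: the `PREFILL` table, the `SYM_*` constants, the `Interner::prefilled` constructor, and the `index` accessor are hypothetical, and TAMER's real layout and typing are likely different.

```rust
/// Hypothetical symbol id, generic over its index width (u16 or u32).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SymbolId<Ix>(Ix);

impl<Ix: Copy> SymbolId<Ix> {
    /// Raw index, exposed here only so the sketch can be inspected.
    pub fn index(self) -> Ix {
        self.0
    }
}

/// Hypothetical static symbol: 8 bits suffice while the prefill table
/// has fewer than 256 entries.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct StaticSymbolId(u8);

/// A static symbol widens losslessly into a symbol id of any index size.
impl<Ix: From<u8>> From<StaticSymbolId> for SymbolId<Ix> {
    fn from(sym: StaticSymbolId) -> Self {
        SymbolId(Ix::from(sym.0))
    }
}

/// Assumed prefill table; each constant below is an index into it.
pub const PREFILL: [&str; 3] = ["yes", "no", "type"];
pub const SYM_YES: StaticSymbolId = StaticSymbolId(0);
pub const SYM_NO: StaticSymbolId = StaticSymbolId(1);
pub const SYM_TYPE: StaticSymbolId = StaticSymbolId(2);

/// Toy interner that starts out holding the prefill table, so static
/// symbols resolve without ever being interned at runtime.
pub struct Interner {
    strings: Vec<String>,
}

impl Interner {
    pub fn prefilled() -> Self {
        Interner {
            strings: PREFILL.iter().map(|s| s.to_string()).collect(),
        }
    }

    pub fn lookup(&self, id: SymbolId<u32>) -> &str {
        &self.strings[id.0 as usize]
    }
}
```

With this arrangement, `let id: SymbolId<u32> = SYM_NO.into();` produces an id that a prefilled interner resolves to `"no"`, and the same constant converts to `SymbolId<u16>` for size-constrained uses such as the `Span` mentioned in the commits above.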