employer/tame - tame - Mike Gerwitz's Forge

Author	SHA1	Message	Date
Mike Gerwitz	31a39c79d3	tamer: asg::graph: lookup{=>_global} Identifier lookups, as done using the graph methods today, look up from a cache representing the global environment. Templates must not contribute to this environment until expansion. Further, metavariables will not be present in this environment. To avoid confusion and help obviate accidental contributions to this environment, the methods have been renamed. This will also allow for the creation of more general methods down the line. DEV-13708	2023-03-22 11:28:55 -04:00
Mike Gerwitz	954b5a2795	Copyright year and name update Ryan Specialty Group (RSG) rebranded to Ryan Specialty after its IPO.	2023-01-20 23:37:30 -05:00
Mike Gerwitz	378fe3db66	tamer: asg::Asg::lookup: SymbolId=>SPair This seems to have been an oversight from when I recently introduced SPairs to ASG; I noticed it while working on another change and receiving back a `DUMMY_SPAN`. DEV-13597	2023-01-20 23:37:29 -05:00
Mike Gerwitz	e6640c0019	tamer: Integrate clippy This invokes clippy as part of `make check` now, which I had previously avoided doing (I'll elaborate on that below). This commit represents the changes needed to resolve all the warnings presented by clippy. Many changes have been made where I find the lints to be useful and agreeable, but there are a number of lints, rationalized in `src/lib.rs`, where I found the lints to be disagreeable. I have provided rationale, primarily for those wondering why I desire to deviate from the default lints, though it does feel backward to rationalize why certain lints ought to be applied (the reverse should be true). With that said, this did catch some legitimate issues, and it was also helpful in getting some older code up-to-date with new language additions that perhaps I used in new code but hadn't gone back and updated old code for. My goal was to get clippy working without errors so that, in the future, when others get into TAMER and are still getting used to Rust, clippy is able to help guide them in the right direction. One of the reasons I went without clippy for so long (though I admittedly forgot I wasn't using it for a period of time) was because there were a number of suggestions that I found disagreeable, and I didn't take the time to go through them and determine what I wanted to follow. Furthermore, it was hard to make that judgment when I was new to the language and lacked the necessary experience to do so. One thing I would like to comment further on is the use of `format!` with `expect`, which is also what the diagnostic system convenience methods do (which clippy does not cover). Because of all the work I've done trying to understand Rust and looking at disassemblies and seeing what it optimizes, I falsely assumed that Rust would convert such things into conditionals in my otherwise-pure code...but apparently that's not the case, when `format!` is involved. 
I noticed that, after making the suggested fix with `get_ident`, Rust proceeded to then inline it into each call site and then apply further optimizations. It was also previously invoking the thread lock (for the interner) unconditionally and invoking the `Display` implementation. That is not at all what I intended for, despite knowing the eager semantics of function calls in Rust. Anyway, possibly more to come on that, I'm just tired of typing and need to move on. I'll be returning to investigate further diagnostic messages soon.	2023-01-20 23:37:29 -05:00
Mike Gerwitz	0b2e563cdb	tamer: asg: Associate spans with identifiers and introduce diagnostics This ASG implementation is a refactored form of original code from the proof-of-concept linker, which was well before the span and diagnostic implementations, and well before I knew for certain how I was going to solve that problem. This was quite the pain in the ass, but introduces spans to the AIR tokens and graph so that we always have useful diagnostic information. With that said, there are some important things to note: 1. Linker spans will originate from the `xmlo` files until we persist spans to those object files during `tamec`'s compilation. But it's better than nothing. 2. Some additional refactoring is still needed for consistency, e.g. use of `SPair`. 3. This is just a preliminary introduction. More refactoring will come as tamec is continued. DEV-13041	2022-12-16 14:44:38 -05:00
Mike Gerwitz	41b41e02c1	tamer: Xirf::Text refinement This teaches XIRF to optionally refine Text into RefinedText, which determines whether the given SymbolId represents entirely whitespace. This is something I've been putting off for some time, but now that I'm parsing source language for NIR, it is necessary, in that we can only permit whitespace Text nodes in certain contexts. The idea is to capture the most common whitespace as preinterned symbols. Note that this heuristic ought to be determined from scanning a codebase, which I haven't done yet; this is just an initial list. The fallback is to look up the string associated with the SymbolId and perform a linear scan, aborting on the first non-whitespace character. This combination of checks should be sufficiently performant for now considering that this is only being run on source files, which really are not all that large. (They become large when template-expanded.) I'll optimize further if I notice it show up during profiling. This also frees XIR itself from being concerned by Whitespace. Initially I had used quick-xml's whitespace trimming, but it messed up my span calculations, and those were a pain in the ass to implement to begin with, since I had to resort to pointer arithmetic. I'd rather avoid tweaking it. tameld will not check for whitespace, since it's not important---xmlo files, if malformed, are the fault of the compiler; we can ignore text nodes except in the context of code fragments, where they are never whitespace (unless that's also a compiler bug). Onward and yonward. DEV-7145	2022-08-01 15:01:37 -04:00
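The two-stage whitespace check described in commit 41b41e02c1 can be sketched roughly as follows. This is only an illustration under stated assumptions, not TAMER's code: it works on plain `&str` rather than `SymbolId`, the `COMMON_WHITESPACE` list stands in for the preinterned symbols, and the `RefinedText` variant names and the `refine` function are hypothetical.

```rust
/// Hypothetical stand-in for the preinterned "most common whitespace"
/// symbols; the commit notes that the real list is only an initial guess.
const COMMON_WHITESPACE: &[&str] = &[" ", "\n", "\t", "\n  ", "\n    "];

/// Result of refining a text node (variant names are assumptions).
#[derive(Debug, PartialEq, Eq)]
pub enum RefinedText<'a> {
    /// The text consists entirely of whitespace.
    Whitespace(&'a str),
    /// The text contains at least one non-whitespace character.
    Unrefined(&'a str),
}

/// Returns true if `text` consists only of XML whitespace characters.
pub fn is_all_whitespace(text: &str) -> bool {
    // Fast path: compare against the short list of common strings.
    if COMMON_WHITESPACE.contains(&text) {
        return true;
    }

    // Fallback: linear scan; `all` stops at the first non-whitespace byte.
    text.bytes().all(|b| matches!(b, b' ' | b'\t' | b'\n' | b'\r'))
}

/// Classifies a text node so that later parsers can reject
/// non-whitespace text where only whitespace is permitted.
pub fn refine(text: &str) -> RefinedText<'_> {
    if is_all_whitespace(text) {
        RefinedText::Whitespace(text)
    } else {
        RefinedText::Unrefined(text)
    }
}
```

With this sketch, `refine("\n  ")` takes the fast path, while `refine("\n \t ")` falls through to the scan; both yield `RefinedText::Whitespace`.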
Mike Gerwitz	c671bf6a9c	tamer: xir: Introduce {Ele,Open,Close}Span This isn't conceptually all that significant of a change, but there was a lot to modify to get it working. I would generally separate this into a commit for the implementation and another commit for the integration, but I decided to keep things together. This serves a role similar to AttrSpan---this allows deriving a span representing the element name from a span representing the entire XIR token. This will provide more useful context for errors---including the tag delimiter(s) means that we care about the fact that an element is in that position (as opposed to some other type of node) within the context of an error. However, if we are expecting an element but take issue with the element name itself, we want to place emphasis on that instead. This also starts to consider the issue of span contexts---a blob of detached data that is `Span` is useful for error context, but it's not useful for manipulation or deriving additional information. For that, we need to encode additional context, and this is an attempt at that. I am interested in the concept of providing Spans that are guaranteed to actually make sense---that are instantiated and manipulated with APIs that ensure consistency. But such a thing buys us very little, practically speaking, over what I have now for TAMER, and so I don't expect to actually implement that for this project; I'll leave that for a personal project. TAMER's already taken on a lot of my personal interests and it can cause me a lot of grief sometimes (with regards to letting my aspirations cause me more work). DEV-7145	2022-06-24 14:16:29 -04:00
Mike Gerwitz	f218c452b9	tamer: iter::trip: Flatten Result The `*_iter_while_ok` functions now compose like monads, flattening `Result` at each step and drastically simplifying handling of error types. This also removes the bunch of `?`s at the end of the expression, and allows me to use `?` within the callback itself. I had originally not used `Result` as the return type of the callback because I was not entirely sure how I was going to use them, but it's now clear that I _always_ use `Result` as the return type, and so there's no use in trying to be too accommodating; it can always change in the future. This is desirable not just for cleanup, but because trying to refactor `asg_builder` into a pair of `Parser`s is really messy to chain without flattening, especially given some state that has to leak temporarily to the caller. More on that in a future commit. DEV-11864	2022-05-20 16:08:16 -04:00
Mike Gerwitz	ebf1de5a60	tamer: asg::Ident{Object=>}: Rename I think this may have been renamed _from_ `Ident` some time ago, but I'm too lazy to check. In any case, the name is redundant. DEV-11864	2022-05-19 11:17:04 -04:00
Mike Gerwitz	07d2ec1ffb	tamer: Move Dim and {Sym=>}Dtype into num module A previous commit mentioned that there's not a place for `Dim`, and duplicated it between `asg` and `xmlo`. Well, `Dtype` is also needed in both, and so here's a home for now. `Dtype` has always been an inappropriate detail for the system and will one day be removed entirely in favor of higher-level types; the machine representation is up to the compiler to decide. DEV-11864	2022-05-19 10:39:21 -04:00
Mike Gerwitz	3e277270a7	tamer: asg: Track roots on graph Previously, since the graph contained only identifiers, discovered roots were stored in a separate vector and exposed to the caller. This not only leaked details, but added complexity; this was left over from the refactoring of the proof-of-concept linker some time ago. This moves the root management into the ASG itself, mostly, with one item being left over for now in the asg_builder (eligibility classifications). There are two roots that were added automatically: - __yield - __worksheet The former has been removed and is now expected to be explicitly mapped in the return map, which is now enforced with an extern in `core/base`. This is still special, in the sense that it is explicitly referenced by the generated code, but there's nothing inherently special about it and I'll continue to generalize it into oblivion in the future, such that the final yield is just a convention. `__worksheet` is the only symbol of type `IdentKind::Worksheet`, and so that was generalized just as the meta and map entries were. The goal in the future will be to have this more under the control of the source language, and to consolidate individual roots under packages, so that the _actual_ roots are few. As far as the actual ASG goes: this introduces a single root node that is used as the sole reference for reachability analysis and topological sorting. The edges of that root node replace the vector that was removed. DEV-11864	2022-05-17 10:42:05 -04:00
Mike Gerwitz	34eb994a0d	tamer: asg::Asg::set_fragment: {ObjectRef=>SymbolId} In the actual implementation (outside of tests), this is always looking up before adding the symbol. This will simplify the API, while still retaining errors, since the identifier will fail the state transition if the identifier did not exist before attempting to set a fragment. So while this is slower in microbenchmarks, this has no effect on real-world performance. Further, I'm refactoring toward a streaming ASG aggregation, which is a lot easier if we do not need to perform lookups in a separate step from the ASG's primitives. DEV-11864	2022-05-16 13:14:27 -04:00
Mike Gerwitz	d87006391e	tamer: asg::object: Remove IdentObjectState, IdentObjectData These traits are no longer necessary now that I'm using concrete types; they just add unnecessary noise and confusion as I attempt to further refactor. Don't abstract prematurely. DEV-11864	2022-05-12 16:31:36 -04:00
Mike Gerwitz	3748762d31	tamer: asg::graph::Asg: Remove type parameter O This removes the generic on the Asg (which was formerly BaseAsg), hard-coding `IdentObject`, which will further evolve. This makes the IR an actual concrete IR rather than an abstract data structure. These tests bring me back a bit, since they were written as I was still becoming familiar with Rust. DEV-11864	2022-05-12 15:46:17 -04:00
Mike Gerwitz	f2c5443176	tamer: asg: Remove generic Asg, rename {Base=>}Asg This is the beginning of an incremental refactoring to remove generics, to simplify the ASG. When I initially wrote the linker, I wasn't sure what direction I was going in, but I was also negatively influenced by more traditional approaches to both design and unit testing. If we're going to call the ASG an IR, then it needs to be one---if the core of the IR is generic, then it's more like an abstract data structure than anything. We can abstract around the IR to slice it up into components that are a little easier to reason about and understand how responsibilities are segregated. DEV-11864	2022-05-11 16:47:13 -04:00
Mike Gerwitz	1ad2fb1dc8	Copyright year update 2022 RSG (Ryan Specialty Group) recently announced a rename to Ryan Specialty (no "Group"), but I'm not sure if the legal name has been changed yet or not, so I'll wait on that.	2022-05-03 14:14:29 -04:00
Mike Gerwitz	5233822322	tamer: xir: Remove Text enum Like previous commits, this replaces the explicit escaping context with the convention that all values retrieved from `xir` are unescaped on read and escaped on write. Comments are a notable TODO, since we must escape only `--`. CData is also an issue. I had _expected_ to use it as a means to avoid unescaping fragments, but I had forgotten that quick_xml hard-codes escaping on read, so that it can re-use BytesStart! That is terribly unfortunate, and may result in us having to re-implement our own read method in the future to avoid this nonsense. So I'm just leaving it as a TODO for now. DEV-11081	2021-11-15 23:47:14 -05:00
Mike Gerwitz	27ba03b59b	tamer: xir::escape: Remove XirString in favor of Escaper This rewrites a good portion of the previous commit. Rather than explicitly storing whether a given string has been escaped, we can instead assume that all SymbolIds leaving or entering XIR are unescaped, because there is no reason for any other part of the system to deal with such details of XML documents. Given that, we need only unescape on read and escape on write. This is customary, so why didn't I do that to begin with? The previous commit outlines the reason, mainly being an optimization for the echo writer that is upcoming. However, this solution will end up being better---it's not implemented yet, but we can have a caching layer, such that the Escaper records a mapping between escaped and unescaped SymbolIds to avoid work the next time around. If we share the Escaper between _all_ readers and the writer, the result is that 1. Duplicate strings between source files and object files (many of which are read by both the linker and compiler) avoid re-unescaping; and 2. Writers can use this cache to avoid re-escaping when we've already seen the escaped variant of the string during read. The alternative would be a global cache, like the internment system, but I did not find that to be appropriate here, since this is far less fundamental and is much easier to compose. DEV-11081	2021-11-12 14:03:23 -05:00
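The caching layer that commit 27ba03b59b proposes (but notes is not yet implemented) could look something like the sketch below. It is a simplified illustration, not TAMER's `Escaper`: it keys the cache on `String` rather than `SymbolId`, handles only the five predefined XML entities, and the names `CachingEscaper`, `remember`, and `misses` are hypothetical.

```rust
use std::collections::HashMap;

/// Hypothetical escaper shared between readers and a writer.
/// Strings stand in for the SymbolIds described in the commit.
#[derive(Default)]
pub struct CachingEscaper {
    escaped_to_raw: HashMap<String, String>,
    raw_to_escaped: HashMap<String, String>,
    /// Counts how often real escape/unescape work was performed.
    pub misses: usize,
}

impl CachingEscaper {
    /// Records the pair in both directions so that a later write can
    /// reuse an escape that was first seen during a read.
    fn remember(&mut self, escaped: &str, raw: &str) {
        self.escaped_to_raw.insert(escaped.to_owned(), raw.to_owned());
        self.raw_to_escaped.insert(raw.to_owned(), escaped.to_owned());
    }

    /// Reader side: unescape on read.
    pub fn unescape(&mut self, escaped: &str) -> String {
        if let Some(raw) = self.escaped_to_raw.get(escaped) {
            return raw.clone();
        }
        self.misses += 1;
        // `&amp;` is replaced last so that `&amp;lt;` becomes `&lt;`.
        let raw = escaped
            .replace("&lt;", "<")
            .replace("&gt;", ">")
            .replace("&quot;", "\"")
            .replace("&apos;", "'")
            .replace("&amp;", "&");
        self.remember(escaped, &raw);
        raw
    }

    /// Writer side: escape on write.
    pub fn escape(&mut self, raw: &str) -> String {
        if let Some(escaped) = self.raw_to_escaped.get(raw) {
            return escaped.clone();
        }
        self.misses += 1;
        // `&` is replaced first so that new entities are not re-escaped.
        let escaped = raw
            .replace('&', "&amp;")
            .replace('<', "&lt;")
            .replace('>', "&gt;")
            .replace('"', "&quot;")
            .replace('\'', "&apos;");
        self.remember(&escaped, raw);
        escaped
    }
}
```

Passing one such value to every reader and to the writer gives the two effects listed in the commit: a string that was already unescaped is not unescaped again, and the writer can skip escaping for a string whose escaped form was seen during a read. A caller that does not want the cache can use a fresh value for each stream.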
Mike Gerwitz	b1c0783c75	tamer: xir::XirString: WIP implementation (likely going away) I'm not fond of this implementation, which is why it's not fully completed. I wanted to commit this for future reference, and take the opportunity to explain why I don't like it. First: this task started as an idea to implement a third variant to AttrValue and friends that indicates that a value is fixed, in the sense of a fixed-point function: escaped or unescaped, its value is the same. This would allow us to skip wasteful escape/unescape operations. In doing so, it became obvious that there's no need to leak this information through the API, and indeed, no part of the system should care. When we read XML, it should be unescaped, and when we write, it should be escaped. The reason that this didn't quite happen to begin with was an optimization: I'll be creating an echo writer in place of the current filesystem-based copy in tamec shortly, and this would allow streaming XIR directly from the reader to the writer without any unescaping or re-escaping. When we unescape, we know the value that it came from, so we could simply store both symbols---they're 32-bit, so it results in a nicely compressed 64-bit value, so it's essentially cost-free, as long as we accept the expense of internment. This is `XirString`. Then, when we want to escape or unescape, we first check to see whether a symbol already exists and, if so, use it. While this works well for echoing streams, it won't work all that well in practice: the unescaped SymbolId will be taken and the XirString discarded, since nothing after XIR should be coupled with it. Then, when we later construct a XIR stream for writing, XirString will no longer be available and our previously known escape is lost, so the writer will have to re-escape. Further, if we look at XirString's generic for the XirStringEscaper---it uses phantom, which hints that maybe it's not in the best place. 
Indeed, I've already acknowledged that only a reader unescapes and only a writer escapes, and that the rest of the system works with normal (unescaped) values, so only readers and writers should be part of this process. I also already acknowledged that XirString would be lost and only the unescaped SymbolId would be used. So what's the point of XirString, then, if it won't be a useful optimization beyond the temporary echo writer? Instead, we can take the XirStringWriter and implement two caches on that: mapping SymbolId from escaped->unescaped and vice-versa. These can be simple vectors, since SymbolId is a 32-bit value we will not have much wasted space for symbols that never get read or written. We could even optimize for preinterned symbols using markers, though I'll probably not do so, and I'll explain why later. If we do _that_, we get even _better_ optimizations through caching that _will_ apply in the general case (so, not just for echo), and we're able to ditch XirString entirely and simply use a SymbolId. This makes for a much more friendly API that isn't leaking implementation details, though it _does_ put an onus on the caller to pass the encoder to both the reader and the writer, _if_ it wants to take advantage of a cache. But that burden is not significant (and is, again, optional if we don't want it). So, that'll be the next step.	2021-11-10 12:22:10 -05:00
Mike Gerwitz	428d508be4	tamer: {ir::=>}{asg, xir} See the previous commit. There is no sense in some common "IR" namespace, since those IRs should live close to whatever system whose data they represent. In the case of these, they are general IRs that can apply to many different parts of the system. If that proves to be a false statement, they'll be moved. DEV-10863	2021-11-04 16:13:27 -04:00
Mike Gerwitz	e9871541a8	tamer: benches/iter.rs: Basic benchmark This was forgotten in the previous commit and exists simply to ensure that the TripIter doesn't add any significant overhead. The tests are a handful of nanoseconds apart, on my machine.	2021-10-28 21:17:41 -04:00
Mike Gerwitz	fc3953e90e	tamer: benches/sym.rs: Interner::intern_utf8 benchmarks These were forgotten in the previous commit.	2021-10-19 13:42:26 -04:00
Mike Gerwitz	63e5a0d441	tamer: benches/sym.rs: Add additional UTF-8-related tests The intent of this is to demonstrate how significant of an impact checking byte arrays for UTF-8 validity will have, since the existing tests do not make that clear (a static string in Rust is always valid UTF-8). These benchmarks show that the cost when re-interning an already existing value is +50%. This is important, because the new reader will be interning a _lot_ of duplicate strings, whereas the existing reader operates on byte arrays without interning unless necessary. And, when it does, it does so unchecked. But we'd rather not do that, since we cannot guarantee that those XML files are valid (and not modified in some way). Upcoming commits will have what I think is a reasonable compromise to this, based on the fact that we'll be encountering _many_ duplicate strings in parsing XML files. DEV-10920	2021-10-18 21:35:32 -04:00
Mike Gerwitz	739cf7e6eb	tamer: ir::asg::object::IdentObject: Define methods from IdentObjectData In particular, `name` needn't return an `Option`. `fragment` also returns a copy, since it's just a `SymbolId`. (It really ought to be a newtype rather than an alias, but we'll worry about that some other time.) These changes allow us to remove some runtime panics. DEV-10859	2021-10-14 14:38:02 -04:00
Mike Gerwitz	f055cb77c2	tamer: ld::xmle: Narrow Sections types This moves the logic that sorts identifiers into sections into Sections itself, and introduces XmleSections to allow for mocking for testing. This then allows us to narrow the types significantly, eliminating some runtime checks. The types can be narrowed further, but I'll be limiting the work I'll be doing now; this'll be inevitably addressed as we use the ASG for the compiler. This also handles moving Sections tests, which was a TODO from the previous commit. DEV-10859	2021-10-14 12:40:13 -04:00
Mike Gerwitz	31144d0c9a	tamer: benches/asg_lower.rs: Add missing file from previous commit This was missed in the `lower` module move.	2021-10-12 10:30:35 -04:00
Mike Gerwitz	df328da71f	tamer: ir::asg::SortableAsg: Move into ld::xmle::lower This has always been a lowering operation, but it was not phrased in terms of it, which made the process a bit more confusing to understand. The implementation hasn't changed, but this is an incremental refactoring and so exposes BaseAsg and its `graph` field temporarily. DEV-10859	2021-10-12 09:49:33 -04:00
Mike Gerwitz	e91aeef478	tamer: Remove Ix generalization throughout system This had the writing on the wall all the same as the `'i` interner lifetime that came before it. It was too much of a maintenance burden trying to accommodate both 16-bit and 32-bit symbols generically. There is a situation where we do still want 16-bit symbols---the `Span`. Therefore, I have left generic support for symbol sizes, as well as the different global interners, but `SymbolId` now defaults to 32-bit, as does `Asg`. Further, the size parameter has been removed from the rest of the code, with the exception of `Span`. This cleans things up quite a bit, and is much nicer to work with. If we want 16-bit symbols in the future for packing to increase CPU cache performance, we can handle that situation then in that specific case; it's a premature optimization that's not at all worth the effort here.	2021-09-23 14:52:54 -04:00
Mike Gerwitz	e0a209d417	tamer: bench: xir: Reduce writer benchmark memory usage These were using GiB of memory, which is ...unnecessary. I reduced the iteration count significantly, but it was still wasting a lot of time and memory and needed `with_capacity` to reduce the number of copies after reallocation. It is not typical that a buffer would contain this much information.	2021-09-21 16:21:32 -04:00
Mike Gerwitz	aee781a6fb	tamer: bench: xir: Fix broken benchmark This broke when I removed `SelfClose`. I used to run `make all fmt check bench` before every push, but they take a while to run, in part because it uses nightly and has to recompile too. But it looks like I need to be more diligent again.	2021-09-21 16:09:50 -04:00
Mike Gerwitz	cd1eae95ca	tamer: xir: {NodeStream=>Token} I decided not to do this in a previous commit because I had documented "NodeStream" elsewhere, so I'd like it to be in the Git history to understand its evolution. This never was a "Node" stream beyond the initial concept phase, because it represents tokens that aren't themselves nodes. It is intended to generate XML nodes, but may need to accommodate non-nodes (e.g. XML declarations) in the future. The name originated from `Node`, which was a tree-based IR that was initially conceived, but removed because it's not yet needed. What we need is a streaming IR for xmle writing, and then for reading and echoing back out XML for the new frontend.	2021-08-20 10:30:27 -04:00
Mike Gerwitz	a23bae5e4d	tamer: XIR: Working concept This is a working streaming IR for XML. I want to get this committed before I go further cleaning it up and integrating it into the xmle writer. This is lacking detailed documentation, and the names of things may end up changing. Initial benchmarks do show that it has a ~2x performance improvement over quick-xml when dealing with two attributes on a node, and I suspect that improvement will increase with the number of attributes. We will see how it compares in real-world benchmarks once the linker has been modified to use it. The goal isn't to _avoid_ quick-xml---it'll be used in the future for things like escaping that would be a huge waste to implement ourselves. It just so happened that quick-xml was not beneficial for these changes; indeed, its own writer is fairly simple for the portions that were implemented here, so there's no use in fighting with its API, particularly around attributes and our need to explicitly control whitespace (with the intent of handling code formatters in the future). To put this into perspective: the reason this work is being done isn't to refactor the linker, or to speed it up, but to generalize XML writing and provide a suitable IR for use in the compiler. The first step of the frontend is to essentially echo the XML token stream back out so we can incrementally parse it and do something useful, to incrementally rewrite the compiler in Rust.	2021-08-20 10:16:36 -04:00
Mike Gerwitz	c211ada89b	tamer: benches (memchr): Add missing bench attr This benchmark was not being run.	2021-08-19 23:14:33 -04:00
Mike Gerwitz	fc235b7ecc	tamer: memchr benches This adds benchmarking for the memchr crate. It is used primarily by quick-xml at the moment, but the question is whether to rely on it for certain operations for XIR. The benchmarking on an Intel Xeon system shows that memchr and Rust's contains() perform very similarly on small inputs, matching against a single character, and so Rust's built-in should be preferred in that case so that we're using APIs that are familiar to most people. When larger inputs are compared against, there's a greater benefit (a little under ~2x). When comparing against two characters, they are again very close. But look at when we compare two characters against _multiple_ inputs: running 24 tests test large_str::one::memchr_early_match ... bench: 4,938 ns/iter (+/- 124) test large_str::one::memchr_late_match ... bench: 81,807 ns/iter (+/- 1,153) test large_str::one::memchr_non_match ... bench: 82,074 ns/iter (+/- 1,062) test large_str::one::rust_contains_one_byte_early_match ... bench: 9,425 ns/iter (+/- 167) test large_str::one::rust_contains_one_byte_late_match ... bench: 123,685 ns/iter (+/- 3,728) test large_str::one::rust_contains_one_byte_non_match ... bench: 123,117 ns/iter (+/- 2,200) test large_str::one::rust_contains_one_char_early_match ... bench: 9,561 ns/iter (+/- 507) test large_str::one::rust_contains_one_char_late_match ... bench: 123,929 ns/iter (+/- 2,377) test large_str::one::rust_contains_one_char_non_match ... bench: 122,989 ns/iter (+/- 2,788) test large_str::two::memchr2_early_match ... bench: 5,704 ns/iter (+/- 91) test large_str::two::memchr2_late_match ... bench: 89,194 ns/iter (+/- 8,546) test large_str::two::memchr2_non_match ... bench: 85,649 ns/iter (+/- 3,879) test large_str::two::rust_contains_two_char_early_match ... bench: 66,785 ns/iter (+/- 3,385) test large_str::two::rust_contains_two_char_late_match ... bench: 2,148,064 ns/iter (+/- 21,812) test large_str::two::rust_contains_two_char_non_match ... 
bench: 2,322,082 ns/iter (+/- 22,947) test small_str::one::memchr_mid_match ... bench: 4,737 ns/iter (+/- 842) test small_str::one::memchr_non_match ... bench: 5,160 ns/iter (+/- 62) test small_str::one::rust_contains_one_byte_non_match ... bench: 3,930 ns/iter (+/- 35) test small_str::one::rust_contains_one_char_mid_match ... bench: 3,677 ns/iter (+/- 618) test small_str::one::rust_contains_one_char_non_match ... bench: 5,415 ns/iter (+/- 221) test small_str::two::memchr2_mid_match ... bench: 5,488 ns/iter (+/- 888) test small_str::two::memchr2_non_match ... bench: 6,788 ns/iter (+/- 134) test small_str::two::rust_contains_two_char_mid_match ... bench: 6,203 ns/iter (+/- 170) test small_str::two::rust_contains_two_char_non_match ... bench: 7,853 ns/iter (+/- 713) Yikes. With that said, we won't be comparing against such large inputs short-term. The larger strings (fragments) are copied verbatim, and not compared against---but they _were_ prior to the previous commit that stopped unencoding and re-encoding. So: Rust built-ins for inputs that are expected to be small.	2021-08-18 14:23:03 -04:00
Mike Gerwitz	f97141f5c5	tamer: tameld: Use uninterned symbols for reader Fragments were previously represented by `String` to avoid the cost of interning (hashing and copying). This change modifies it to use uninterned symbols, which does still have a copy overhead but it does not hash. Initial tests show a small performance decrease of about 15% and a small memory increase of similar proportion. However, once I realized that I was not clearing buffers from quick_xml events and implemented that change in a previous commit, this change ended up being approximately on par with `String`, despite the copying of some pretty large fragments. YMMV, though, and perhaps on less powerful systems time may increase slightly. The upcoming XIR (XML IR) was originally going to support both owned strings and symbols, but now we'll just use uninterned symbols; I can't rationalize complicating the API at this time when it will provide an almost imperceivable performance benefit. If ever that changes in the future, that change will be entertained. The end result is that the fate of a fragment's underlying memory is determined by whatever is processing the data, _not_ by the API itself---the API was previously forcing use of a String, whereas now it's up to the caller to determine whether we want comparable interns. For fragments, that's not likely ever to be the case, especially considering that the representation will change so drastically in the future.	2021-08-16 14:05:32 -04:00
Mike Gerwitz	ce233ac01d	tamer: sym: Uninterned symbols This adds support for uninterned symbols. This came about as I was creating Xir (not yet committed) where I had to decide if I wanted `SymbolId` for all values, even though some values (e.g. large text blocks like compiled code fragments for xmle files) will never be compared, and so would be wastefully hashed. Previous IRs used `String`, but that was clumsy; see documentation in this commit for rationale.	2021-08-13 22:54:04 -04:00
Mike Gerwitz	9deb393bfd	tamer: Global interners This is a major change, and I apologize for it all being in one commit. I had wanted to break it up, but doing so would have required a significant amount of temporary work that was not worth doing while I'm the only one working on this project at the moment. This accomplishes a number of important things, now that I'm preparing to write the first compiler frontend for TAMER: 1. `Symbol` has been removed; `SymbolId` is used in its place. 2. Consequently, symbols use 16 or 32 bits, rather than a 64-bit pointer. 3. Using symbols no longer requires dereferencing. 4. Lifetimes no longer pollute the entire system! (`'i`) 5. Two global interners are offered to produce `SymbolStr` with `'static` lifetimes, simplifying lifetime management and borrowing where strings are still needed. 6. A nice API is provided for interning and lookups (e.g. "foo".intern()) which makes this look like a core feature of Rust. Unfortunately, making this change required modifications to...virtually everything. And that serves to emphasize why this change was needed: _everything_ used symbols, and so there's no use in not providing globals. I implemented this in a way that still provides for loose coupling through Rust's trait system. Indeed, Rustc offers a global interner, and I decided not to go that route initially because it wasn't clear to me that such a thing was desirable. It didn't become apparent to me, in fact, until the recent commit where I introduced `SymbolIndexSize` and saw how many things had to be touched; the linker evolved so rapidly as I was trying to learn Rust that I lost track of how bad it got. Further, this shows how the design of the internment system was a bit naive---I assumed certain requirements that never panned out. In particular, everything using symbols stored `&'i Symbol<'i>`---that is, a reference (usize) to an object containing an index (32-bit) and a string slice (128-bit). 
So it was a reference to a pretty large value, which was allocated in the arena alongside the interned string itself. But, that was assuming that something would need both the symbol index _and_ a readily available string. That's not the case. In fact, it's pretty clear that interning happens at the beginning of execution, that `SymbolId` is all that's needed during processing (unless an error occurs; more on that below); and it's not until _the very end_ that we need to retrieve interned strings from the pool to write either to a file or to display to the user. It was horribly wasteful! So `SymbolId` solves the lifetime issue in itself for most systems, but it still requires that an interner be available for anything that needs to create or resolve symbols, which, as it turns out, is still a lot of things. Therefore, I decided to implement them as thread-local static variables, which is very similar to what Rustc does itself (Rustc's are scoped). TAMER does not use threads, so the resulting `'static` lifetime should be just fine for now. Eventually I'd like to implement `!Send` and `!Sync`, though, to prevent references from escaping the thread (as noted in the patch); I can't do that yet, since the feature has not yet been stabilized. In the end, this leaves us with a system that's much easier to use and maintain; hopefully easier for newcomers to get into without having to deal with so many complex lifetimes; and a nice API that makes it a pleasure to work with symbols. Admittedly, the `SymbolIndexSize` adds some complexity, and we'll see if I end up regretting that down the line, but it exists for an important reason: the `Span` and other structures that'll be introduced need to pack a lot of data into 64 bits so they can be freely copied around to keep lifetimes simple without wreaking havoc in other ways, but a 32-bit symbol size needed by the linker is too large for that. 
(Actually, the linker doesn't yet need 32 bits for our systems, but it's going to in the somewhat near future unless we optimize away a bunch of symbols...but I'd really rather not have the linker hit a limit that requires a lot of code changes to resolve). Rustc uses interned spans when they exceed 8 bytes, but I'd prefer to avoid that for now. Most systems can just use one of the `PkgSymbolId` or `ProgSymbolId` type aliases and not have to worry about it. Systems that are actually shared between the compiler and the linker do, though, but it's not like we don't already have a bunch of trait bounds. Of course, as we implement link-time optimizations (LTO) in the future, it's possible most things will need the size and I'll grow frustrated with that and possibly revisit this. We shall see. Anyway, this was exhausting...and...onward to the first frontend!	2021-08-11 14:24:55 -04:00
Mike Gerwitz	71011f5724	tamer: sym: Split into multiple modules This helps to organize a bit better as I prepare to introduce singleton interners.	2021-08-02 23:54:37 -04:00
Mike Gerwitz	2e50af1220	Copyright year update 2021	2021-07-22 15:00:15 -04:00
Mike Gerwitz	0127d4b698	TAMER: sym::Interner::index_lookup This was originally omitted because there wasn't a use case for it. Now that we're adding context to errors, however, an owned value is highly desirable. This adds almost no measurable overhead to the internment system in benchmarks (largely within the margin of error).	2020-04-29 11:33:41 -04:00
Mike Gerwitz	0a9a3214b7	[DEV-7084] TAMER: ir::asg::BaseAsg::new: New associated function Profiling showed that creating an initial capacity of 0 did not have a notable effect on performance.	2020-04-28 09:06:25 -04:00
Mike Gerwitz	0868453dab	[DEV-7086] Proper handling of identifier overrides This is an awkward system that I'd like to remove at some point. It adds complexity. For the meantime, overrides have been arbitrarily restricted to a single override (no override-override). But it's needed until we rework maps and can handle the illusion of overrides using the template system.	2020-04-06 09:55:54 -04:00
Mike Gerwitz	f7ed0dbff3	[DEV-7086] ASG benchmarks	2020-03-31 14:18:26 -04:00
Mike Gerwitz	bfea768f89	Copyright year 2020 update	2020-03-06 11:05:18 -05:00
Mike Gerwitz	6aae741162	TAMER (sym::Interner::intern_utf8_unchecked): New function This removes boilerplate for reading xmlo files. See next commit.	2020-02-25 16:10:55 -05:00
Mike Gerwitz	1f4db84f24	TAMER: Arena-based string interner Contrary to what I said previously, this replaces the previous implementation with an arena-backed internment system. The motivation for this change was investigating how Rustc performed its string interning, and why they chose to associate integer identifiers with symbols. The intent was originally to use Rustc's arena allocator directly, but that crate pulled in far too many dependencies and depended on nightly Rust. Bumpalo provides a very similar implementation to Rustc's DroplessArena, so I went with that instead. Rustc also relies on a global, singleton interner. I do not do that here. Instead, the returned Symbol carries a lifetime of the underlying arena, as well as a pointer to the interned string. Now that this is put to rest, it's time to move on.	2020-02-24 14:56:28 -05:00
Mike Gerwitz	176d099fb6	tamer::sym: FNV => Fx Hash For strings of any notable length, Fx Hash outperforms FNV. Rustc also moved to this hash function and noticed performance improvements. Fortunately, as was accounted for in the design, this was a trivial switch. Here are some benchmarks to back up that claim: test hash_set::fnv::with_all_new_1000 ... bench: 133,096 ns/iter (+/- 1,430) test hash_set::fnv::with_all_new_1000_with_capacity ... bench: 82,591 ns/iter (+/- 592) test hash_set::fnv::with_all_new_rc_str_1000_baseline ... bench: 162,073 ns/iter (+/- 1,277) test hash_set::fnv::with_one_new_1000 ... bench: 37,334 ns/iter (+/- 256) test hash_set::fnv::with_one_new_rc_str_1000_baseline ... bench: 18,263 ns/iter (+/- 261) test hash_set::fx::with_all_new_1000 ... bench: 85,217 ns/iter (+/- 1,111) test hash_set::fx::with_all_new_1000_with_capacity ... bench: 59,383 ns/iter (+/- 752) test hash_set::fx::with_all_new_rc_str_1000_baseline ... bench: 98,802 ns/iter (+/- 1,117) test hash_set::fx::with_one_new_1000 ... bench: 42,484 ns/iter (+/- 1,239) test hash_set::fx::with_one_new_rc_str_1000_baseline ... bench: 15,000 ns/iter (+/- 233) test hash_set::with_all_new_1000 ... bench: 137,645 ns/iter (+/- 1,186) test hash_set::with_all_new_rc_str_1000_baseline ... bench: 163,129 ns/iter (+/- 1,725) test hash_set::with_one_new_1000 ... bench: 59,051 ns/iter (+/- 1,202) test hash_set::with_one_new_rc_str_1000_baseline ... bench: 37,986 ns/iter (+/- 771)	2020-02-24 14:56:28 -05:00
Mike Gerwitz	f2b24e6505	HashMapInterner: New interner, docs, and benchmarks This interner will be suitable for providing an index to look up nodes in the ASG.	2020-02-24 14:56:28 -05:00
Mike Gerwitz	e4e0089815	TAMER: Initial string interning abstraction This is missing two key things that I'll add shortly: a HashMap-based one for use in the ASG for node mapping, and an entry-based system for manipulations. This has been a nice start for exploring various aspects of Rust development, as well as conventions that I'd like to implement. In particular: - Robust documentation intended to guide people through learning the necessary material about the compiler, as well as related work to rationalize design decisions; - Benchmarks; - TDD; - And just getting used to Rust in general. I've beat this one to death, so I'll commit this and make smaller changes going forward to show how easily it can evolve. (This module was originally named `intern` but this commit and those that follow rewrote it to `sym`.)	2020-02-24 14:56:28 -05:00

49 Commits (31a39c79d3c78c502f841a588af26d16200d9785)
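
The "Global interners" commit (9deb393bfd) describes a thread-local interner that hands out small, copyable `SymbolId` values and exposes an `"foo".intern()` style API. The following is only a rough sketch of that idea using the standard library, not TAMER's actual code: the names `Interner`, `GlobalIntern`, and `lookup_str` are hypothetical, the index is fixed at 32 bits rather than being generic over a `SymbolIndexSize`, and strings are leaked to obtain `'static` where the real system is described as arena-backed.

```rust
use std::cell::RefCell;
use std::collections::HashMap;

/// Hypothetical 32-bit symbol handle: cheap to copy and compare,
/// with no lifetime parameter.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct SymbolId(u32);

/// Hypothetical interner mapping strings to dense indices and back.
#[derive(Default)]
struct Interner {
    map: HashMap<&'static str, SymbolId>,
    strs: Vec<&'static str>,
}

impl Interner {
    fn intern(&mut self, s: &str) -> SymbolId {
        if let Some(&id) = self.map.get(s) {
            return id; // already interned
        }
        // Assumption for this sketch: leak the string to get `'static`.
        // An arena allocator would be used in a real implementation.
        let stored: &'static str = Box::leak(s.to_owned().into_boxed_str());
        let id = SymbolId(self.strs.len() as u32);
        self.strs.push(stored);
        self.map.insert(stored, id);
        id
    }

    fn lookup(&self, id: SymbolId) -> &'static str {
        self.strs[id.0 as usize]
    }
}

thread_local! {
    // One interner per thread; ids are only meaningful on the thread
    // that created them.
    static INTERNER: RefCell<Interner> = RefCell::new(Interner::default());
}

/// Hypothetical extension trait providing `"foo".intern()`.
pub trait GlobalIntern {
    fn intern(self) -> SymbolId;
}

impl GlobalIntern for &str {
    fn intern(self) -> SymbolId {
        INTERNER.with(|i| i.borrow_mut().intern(self))
    }
}

impl SymbolId {
    /// Resolve a symbol back to its string, e.g. for error output.
    pub fn lookup_str(self) -> &'static str {
        INTERNER.with(|i| i.borrow().lookup(self))
    }
}
```

With the trait in scope, `"foo".intern()` returns the same `SymbolId` each time it is called on the same thread, and `lookup_str` is only needed at the edges (writing output or reporting errors), which matches the usage pattern the commit message describes.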