employer/tame - tame - Mike Gerwitz's Forge

Author	SHA1	Message	Date
Mike Gerwitz	54531e2284	tamer: xir::tree::attr: Display impls	2021-11-23 13:05:10 -05:00
Mike Gerwitz	ba7ebad930	tamer: obj::xmlo::reader::test: {DUMMY_SPAN=>DS} for brevity There's a lot of boilerplate that can be reduced in general, but I _really_ want to focus on getting this thing done; I can clean up later.	2021-11-22 11:16:43 -05:00
Mike Gerwitz	ba4c32383f	tamer: obj::xmlo::reader: Parse root package node attributes Well, parse to the extent that it was being parsed before, anyway. The core of this change demonstrates how well TAMER's abstractions work together. (As long as you have an e.g. LSP to help you make sense of all of the inference, I suppose.) Token::Open(QN_LV_PACKAGE | QN_PACKAGE, _) => { return Ok(XmloEvent::Package( attr_parser_from(&mut self.reader) .try_collect_ok()??, )); } This finally makes use of `attr_parser_from` and `try_collect_ok`. All of the types are inferred---from the iterator transformations, to the error conversions, to the destination PackageAttrs type. DEV-10863	2021-11-18 00:59:10 -05:00
Mike Gerwitz	d421112f35	tamer: xir::tree::ParserState::store_or_emit: Properly emit Parsed::Done This was forgotten when the attribute parser was introduced, and led to the parser continuing to the token following AttrEnd, which properly caused a failure given that the parser was in the Done state. There is a future task I have in my backlog to properly address the Done state, but this is sufficient for now.	2021-11-17 00:13:07 -05:00
Mike Gerwitz	e0811589fa	tamer: xir::tree::attr::value_atom: Doc typo fix	2021-11-16 15:48:59 -05:00
Mike Gerwitz	7367e20c01	tamer: obj::xmlo: Extract error types into own module	2021-11-16 15:47:52 -05:00
Mike Gerwitz	f519dab2b6	tamer: xir::tree::attr::Attr::value_atom: Option<SymbolId>=>SymbolId To maintain a proper abstraction, this cannot be the responsibility of the caller; most callers should not know that fragments exist, let alone how to handle them.	2021-11-16 12:41:03 -05:00
Mike Gerwitz	c9be1d613d	tamer: iter::collect::TryCollect::try_collect_ok: Doc fix This was copied from another docblock and I messed it up.	2021-11-16 12:26:05 -05:00
Mike Gerwitz	5233822322	tamer: xir: Remove Text enum Like previous commits, this replaces the explicit escaping context with the convention that all values retrieved from `xir` are unescaped on read and escaped on write. Comments are a notable TODO, since we must escape only `--`. CData is also an issue. I had _expected_ to use it as a means to avoid unescaping fragments, but I had forgotten that quick_xml hard-codes escaping on read, so that it can re-use BytesStart! That is terribly unfortunate, and may result in us having to re-implement our own read method in the future to avoid this nonsense. So I'm just leaving it as a TODO for now. DEV-11081	2021-11-15 23:47:14 -05:00
Mike Gerwitz	8723ca154d	tamer: xir::escape::CachingEscaper: Use new sym::st::ST_COUNT This adds a constant `ST_COUNT` representing the number of statically allocated symbols, and uses that to estimate an initial capacity for the `CachingEscaper`. This is just a guess (and is certainly too low), but we can adjust later on after profiling, if it ever comes up.	2021-11-15 21:46:57 -05:00
Mike Gerwitz	d710437ee4	tamer: xir::escape::CachingEscaper: New Escaper As promised, this will cache previously seen escaped/unescaped values by creating a two-way mapping between them. DEV-11081	2021-11-15 16:44:24 -05:00
Mike Gerwitz	27ba03b59b	tamer: xir::escape: Remove XirString in favor of Escaper This rewrites a good portion of the previous commit. Rather than explicitly storing whether a given string has been escaped, we can instead assume that all SymbolIds leaving or entering XIR are unescaped, because there is no reason for any other part of the system to deal with such details of XML documents. Given that, we need only unescape on read and escape on write. This is customary, so why didn't I do that to begin with? The previous commit outlines the reason, mainly being an optimization for the echo writer that is upcoming. However, this solution will end up being better---it's not implemented yet, but we can have a caching layer, such that the Escaper records a mapping between escaped and unescaped SymbolIds to avoid work the next time around. If we share the Escaper between _all_ readers and the writer, the result is that 1. Duplicate strings between source files and object files (many of which are read by both the linker and compiler) avoid re-unescaping; and 2. Writers can use this cache to avoid re-escaping when we've already seen the escaped variant of the string during read. The alternative would be a global cache, like the internment system, but I did not find that to be appropriate here, since this is far less fundamental and is much easier to compose. DEV-11081	2021-11-12 14:03:23 -05:00
Mike Gerwitz	b1c0783c75	tamer: xir::XirString: WIP implementation (likely going away) I'm not fond of this implementation, which is why it's not fully completed. I wanted to commit this for future reference, and take the opportunity to explain why I don't like it. First: this task started as an idea to implement a third variant to AttrValue and friends that indicates that a value is fixed, in the sense of a fixed-point function: escaped or unescaped, its value is the same. This would allow us to skip wasteful escape/unescape operations. In doing so, it became obvious that there's no need to leak this information through the API, and indeed, no part of the system should care. When we read XML, it should be unescaped, and when we write, it should be escaped. The reason that this didn't quite happen to begin with was an optimization: I'll be creating an echo writer in place of the current filesystem-based copy in tamec shortly, and this would allow streaming XIR directly from the reader to the writer without any unescaping or re-escaping. When we unescape, we know the value that it came from, so we could simply store both symbols---they're 32-bit, so it results in a nicely compressed 64-bit value, so it's essentially cost-free, as long as we accept the expense of internment. This is `XirString`. Then, when we want to escape or unescape, we first check to see whether a symbol already exists and, if so, use it. While this works well for echoing streams, it won't work all that well in practice: the unescaped SymbolId will be taken and the XirString discarded, since nothing after XIR should be coupled with it. Then, when we later construct a XIR stream for writing, XirString will no longer be available and our previously known escape is lost, so the writer will have to re-escape. Further, if we look at XirString's generic for the XirStringEscaper---it uses phantom, which hints that maybe it's not in the best place. Indeed, I've already acknowledged that only a reader unescapes and only a writer escapes, and that the rest of the system works with normal (unescaped) values, so only readers and writers should be part of this process. I also already acknowledged that XirString would be lost and only the unescaped SymbolId would be used. So what's the point of XirString, then, if it won't be a useful optimization beyond the temporary echo writer? Instead, we can take the XirStringWriter and implement two caches on that: mapping SymbolId from escaped->unescaped and vice-versa. These can be simple vectors, since SymbolId is a 32-bit value we will not have much wasted space for symbols that never get read or written. We could even optimize for preinterned symbols using markers, though I'll probably not do so, and I'll explain why later. If we do _that_, we get even _better_ optimizations through caching that _will_ apply in the general case (so, not just for echo), and we're able to ditch XirString entirely and simply use a SymbolId. This makes for a much more friendly API that isn't leaking implementation details, though it _does_ put an onus on the caller to pass the encoder to both the reader and the writer, _if_ it wants to take advantage of a cache. But that burden is not significant (and is, again, optional if we don't want it). So, that'll be the next step.	2021-11-10 12:22:10 -05:00
Mike Gerwitz	c57aa7fb53	tamer: iter::TryCollect::try_collect_ok: New method This is intended to alleviate what will be some common boilerplate because of the Rust compiler error described therein. This will evolve over time, I'm sure. DEV-10863	2021-11-10 09:09:07 -05:00
Mike Gerwitz	3140279f04	tamer: iter::trip::TrippableIterator: New trait This provides convenience methods atop the already-existing functions. These are a bit more ergonomic since they (a) remove a variable and its generics and (b) are conveniently suggested via LSP (with e.g. rust-analyzer) if the iterator is of the right type, even if the trait is not yet imported. This should help with discoverability as well.	2021-11-05 16:55:46 -04:00
Mike Gerwitz	90e3e94c0a	tamer: iter::{TryCollect, TryFromIter}: New traits These traits augment Rust's built-in traits to handle failure scenarios, which will allow us to encapsulate lowering logic into discrete, self-parsing units that enforce e.g. schemas (the example alludes to my intentions).	2021-11-05 16:33:16 -04:00
Mike Gerwitz	1f01833d30	tamer: xir::tree::attr_parser_from: Do not take ownership over iter The previous implementation took ownership over the provided iterator, which was an oversight, considering that this is intended to be used in contexts where doing so is not possible. A good example where isolated test cases aren't necessarily painting the correct picture. `scan` takes owned values, so this instead uses the same parsing method as `parse_attrs`, but using a `FromFn` iterator to avoid having to create a whole new iterator type. This will work well so long as we don't need to store the type returned by this (while also wanting to avoid boxing). DEV-11062	2021-11-05 10:54:05 -04:00
Mike Gerwitz	428d508be4	tamer: {ir::=>}{asg, xir} See the previous commit. There is no sense in some common "IR" namespace, since those IRs should live close to whatever system whose data they represent. In the case of these, they are general IRs that can apply to many different parts of the system. If that proves to be a false statement, they'll be moved. DEV-10863	2021-11-04 16:13:27 -04:00
Mike Gerwitz	5a91db6d54	tamer: obj::xmlo::{legacy=>}ir Calling it "legacyir" is just confusing. The original hope, when beginning TAMER, was that I'd be able to use a new object format in the near future to help speed up the compilation process. But that's far from our list of priorities now, and so seeing "legacy" all over the place is really confusing considering that it implies that perhaps it shouldn't be used for new code. This helps to clear up that cognitive dissonance by remaining neutral on the topic. And the reality is that it won't be "legacy" for some time. DEV-10863	2021-11-04 13:23:38 -04:00
Mike Gerwitz	cee6402f8b	tamer: Move {ir::legacyir=>obj::xmlo::legacyir} The IRs really ought to live where they are owned, especially given that "IR" is so generic that it makes no sense for there to be a single location for them; they're just data structures coupled with different phases of compilation. This will be renamed next commit; see that for details. This also removes some documentation describing the lowering process, because it's undergone a number of changes and needs to be accurately re-summarized in another location. That will come at a later time after the work is further along so that I don't have to keep spending the time rewriting it. DEV-10863	2021-11-04 13:20:38 -04:00
Mike Gerwitz	d06f31b4d3	tamer: obj::xmlo: Compile quickxml even with flag off This was previously gated behind the negation of the wip-xmlo-xir-reader flag, which meant that it was not being compiled or picked up by LSP. Both of those things are inconvenient and unideal. DEV-10863	2021-11-04 12:35:08 -04:00
Mike Gerwitz	e494f3fdfd	tamer: ir::xir::tree::attr_parser_from: New parser iterator This allows for the lazy parsing of attributes, and makes the necessary changes to the parser to be able to do so safely without getting into a bad context. When XIRT was originally conceived, this concept existed somewhat, but it was done in a way that would allow the parser to accept invalid input. This avoids that problem. This also introduces the concept of "Done", primarily because we had to for the AttrEnd token. This will evolve in following commit(s), which will allow carrying out the important check of ensuring that the parser has ended parsing in a valid accepting state (in terms of a state machine). DEV-11062	2021-11-04 11:04:42 -04:00
Mike Gerwitz	3ba478b09b	tamer: ir::xir::tree::ParseError::AttrNameExpected: Display typo fix We do not want to put backticks around a token display.	2021-11-03 15:07:52 -04:00
Mike Gerwitz	adc939d779	tamer: ir::xir::Token: Implement Display This also modifies xir::tree errors to use Display instead of Debug when rendering error output. DEV-10863	2021-11-03 14:54:37 -04:00
Mike Gerwitz	c7eb50b636	tamer: xir::xir::tree::parse_attrs: Isolated attribute parsing This produces an `AttrList` independent from a containing `Element`. Upcoming changes may further permit the parser to yield smaller components that are not part of an aggregate. DEV-10863	2021-11-03 14:39:03 -04:00
Mike Gerwitz	54e1877d20	tamer: ir::xir::tree: Isolate AttrList parsing This maintains existing functionality but prepares for an isolated context for AttrList parsing. DEV-10863	2021-11-02 14:07:20 -04:00
Mike Gerwitz	6eed728756	tamer: ir::xir::tree: Explicitly list unhandled tokens for exhaustiveness This allows Rust to carry out its exhaustiveness check for when we add new tokens. It further ensures that we understand what we missed, or chose not to handle. DEV-10863	2021-11-02 14:07:05 -04:00
Mike Gerwitz	edf9a75575	tamer: ir::xir::{QName, Prefix, LocalName}: Implement Display These will be shown in error messages and need user-friendly representations. DEV-10863	2021-11-02 13:55:33 -04:00
Mike Gerwitz	d045786cfb	tamer: ir::xir::tree::Element::attrs: Wrap in Option This allows AttrList not only to be lazily initialized (which is less of a problem at the moment with Vec, but may become one in the future), but also leaves a space open for attributes to be added _after_ having been parsed. It further leaves room to _take_ attributes from their `Element`. This is important because the next commit will re-introduce the ability to parse attributes independently, allowing us to put the parser in a state where we can parse AttrList without an Element context. To re-use that parsing under an Element context, we can simply attach an AttrList after it has been parsed. Option adds no additional size cost to Vec, so we get this for free (except for the tiny change that initializes the attribute list when we try to push to it). I also think this reads better ("attrs: None"). Though it makes the API slightly more of a pain to work with. DEV-10863	2021-10-29 16:34:05 -04:00
Mike Gerwitz	a9fd1c7557	tamer: Use TokenStream trait alias where applicable Simple replacement to improve readability.	2021-10-29 14:39:40 -04:00
Mike Gerwitz	7e6cb2c948	tamer: ir::xir::Token::AttrEnd: New token type The purpose of this token is to implement a lazy streaming attribute collection operation without a token of lookahead, which would complicate parsing or require that a TokenStream provide a `peek` method. This is only required for readers to produce, since readers will be feeding data to parsers. I have the writer ignoring it. If you're looking back at this commit, the question is whether this was a bad idea: it introduces inconsistencies into the token stream depending on the context, which can be confusing and error-prone. The intent is to have the parser throw an explicit error if the new token is missing in the context in which it is required, which will safely handle the issue, but does defer it to runtime. But only readers need auditing, and there's only one XIR reader at the moment. DEV-10863	2021-10-29 13:06:27 -04:00
Mike Gerwitz	18ab032ba0	tamer: Begin XIR-based xmlo reader impl There isn't a whole lot here, but there is additional work needed in various places to support upcoming changes and so I want to get this committed to ease the cognitive burden of what I have thus far. And to stop stashing. We have a feature flag for a reason. DEV-10863	2021-10-28 21:21:30 -04:00
Mike Gerwitz	ba3b576c93	tamer: ir::xir::qname_const_inner: Fully qualified QName paths This macro was previously using the path of wherever the template expanded into, which I found to be unexpected considering that I thought the macros were hygienic and the names bound to the environment in which they were defined. In any case, this solves the problem in all cases. DEV-10863	2021-10-28 21:19:11 -04:00
Mike Gerwitz	f0f58a6e16	tamer: obj::xmlo::asg_builder: Remove example for now Just until the new xmlo reader is ready, since it will be changing slightly and fails to compile with the feature flag on now. DEV-10863	2021-10-28 21:17:53 -04:00
Mike Gerwitz	e9871541a8	tamer: benches/iter.rs: Basic benchmark This was forgotten in the previous commit and exists simply to ensure that the TripIter doesn't add any significant overhead. The tests are a handful of nanoseconds apart, on my machine.	2021-10-28 21:17:41 -04:00
Mike Gerwitz	f6c5a224c8	tamer: iter::trip: Introduce initial TripIter concept See the documentation in this commit for more information. This is pretty significant, in that it's been a long-standing question for me how I'd like to join together `Result` iterators without having unnecessarily complex APIs, and also allow for error recovery. This solves both of those problems. It should be noted, however, that this does not yet explicitly implement error recovery, beyond being able to observe the failure as the result of the provided callback function. Proper recovery will be implemented once there's a use-case. DEV-11006	2021-10-28 14:50:41 -04:00
Mike Gerwitz	18cadb9c7d	tamer: obj::xmlo::reader: Better organize flagged code This moves the Iterator impl and From<B> back into `quickxml`. The type of the new reader is different, taking an iterator instead of a BufRead. This will allow us to easily mock for unit tests, without the clustfuckery that has ensued previously with quick-xml mocking. DEV-10863	2021-10-25 13:47:26 -04:00
Mike Gerwitz	c76fe87acd	tamer: obj::xmlo::reader: Move Xmlo{Result,Error,Event} These will need an API change, but are otherwise shared. This means that only the XmloReader is gated.	2021-10-25 12:26:25 -04:00
Mike Gerwitz	f7d8aa1e4f	tamer: wip-xml-xir-reader flag and setup The original plan was to modify the existing reader to use the new XmlXirReader, but that's going to be a lot of ongoing uncommitted work, with both tests and implementation. The better option seems to be to reimplement it, since so many things are changing. This flag will be short-lived and removed as soon as the implementation is complete. DEV-10863	2021-10-25 12:02:46 -04:00
Mike Gerwitz	e6f53c20fd	tamer: ir::xir::reader: Disable quick-xml check_end_names XIR must support tag mismatches; XIRT will validate them. This is currently disabled in the linker's xmlo reader as well. DEV-10863	2021-10-25 10:58:19 -04:00
Mike Gerwitz	d72ab3675c	tamer: ir::xir::reader: Comment parsing Comments re-use Text, but they are _not_ escaped, so we need to take care with the type to ensure that, if the value were ever used with a Token::Text, that we don't end up injecting XML.	2021-10-21 22:04:45 -04:00
Mike Gerwitz	fdb8e5998c	tamer: ir::xir::reader: CData parsing quick_xml provides us the value escaped, so we can just handle this the same way as Text for now. In the future, we may want to distinguish between the two so that we can reconstruct an identical XML document, but at the moment CData isn't used at all in TAME sources or outputs, and so I'm not going to worry about it for now. DEV-10863	2021-10-21 21:55:15 -04:00
Mike Gerwitz	8b212959c8	tamer: ir::xir::reader: Text and mixed content It's nice being able to breeze through changes, since that's been a pretty rare thing so far, given all the foundational work that has been needed. This should get us pretty damn close to being able to parse the `xmlo` files for the reader linker, if we're not there already. DEV-10863	2021-10-21 21:44:04 -04:00
Mike Gerwitz	13a779ec9c	tamer: ir::xir::reader: Remove namespace TODO This isn't XIR's responsibility, and so there's nothing to do here.	2021-10-21 16:52:58 -04:00
Mike Gerwitz	6d25be0ec7	tamer: ir::xir::reader: Refactor common element open parsing As mentioned in the previous commit, this is just minor cleanup.	2021-10-21 16:51:47 -04:00
Mike Gerwitz	e18aeeffac	tamer: ir::xir::reader: Parsing of child nodes This is quick-and-dirty; refactoring can be done later on. This is also intended to demonstrate the ease with which additional events can be added---the hard work is done.	2021-10-21 16:32:19 -04:00
Mike Gerwitz	4c4d89f84f	tamer: ir::xir::reader: Initial concept This is an initial working concept for the reader which handles, so far, just a single attribute. But extending it to completion will not be all that much more work. This does not have namespace support---that will be added later as part of XIRT, which is responsible for semantic analysis. This allows XIR to stay wonderfully simple, and won't have any impact on the writer (which expects that QNames are unresolved and contain the namespace prefix to be written).	2021-10-21 16:23:11 -04:00
Mike Gerwitz	fc3953e90e	tamer: benches/sym.rs: Interner::intern_utf8 benchmarks These were forgotten in the previous commit.	2021-10-19 13:42:26 -04:00
Mike Gerwitz	b8d0da9095	tamer: sym::Interner::intern_utf8 This is the safe version of the existing intern_utf8_unchecked, and exists as a performance optimization. We're about to introduce a XIR reader, which is going to intern a _lot_ of duplicate strings, since it will intern node and attribute names as well. Given that, we do not want to spend a lot of time performing UTF-8 checks that have already been performed. We know that, if an intern is in the pool, it's either already UTF-8 or that check was bypassed when it was initially interned. Therefore, if we find an existing symbol, that can be returned without having to perform any check. Otherwise, we intern as we usually would after attempting to convert the byte slice into a string. This allows us to continue to have good performance for interning without sacrificing safety for strings.	2021-10-19 12:56:57 -04:00
Mike Gerwitz	63e5a0d441	tamer: benches/sym.rs: Add additional UTF-8-related tests The intent of this is to demonstrate how significant of an impact checking byte arrays for UTF-8 validity will have, since the existing tests do not make that clear (a static string in Rust is always valid UTF-8). These benchmarks show that the cost when re-interning an already existing value is +50%. This is important, because the new reader will be interning a _lot_ of duplicate strings, whereas the existing reader operates on byte arrays without interning unless necessary. And, when it does, it does so unchecked. But we'd rather not do that, since we cannot guarantee that those XML files are valid (and not modified in some way). Upcoming commits will have what I think is a reasonable compromise to this, based on the fact that we'll be encountering _many_ duplicate strings in parsing XML files. DEV-10920	2021-10-18 21:35:32 -04:00
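
The last two entries (63e5a0d441 and b8d0da9095) describe skipping the UTF-8 check when the bytes being interned are already in the pool. The following is a minimal sketch of that idea, not TAMER's actual code: the `Interner`, `SymbolId`, and `lookup` definitions here are simplified stand-ins invented for illustration, and only the method name `intern_utf8` is taken from the commit message. It assumes a pool keyed on raw bytes, so a hit can be answered before any validation runs.

```rust
use std::collections::HashMap;
use std::str::Utf8Error;

/// Hypothetical 32-bit symbol identifier (an index into the pool).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SymbolId(u32);

/// Toy interner for illustration only; keyed on raw bytes so that a
/// lookup never needs to validate UTF-8.
#[derive(Default)]
pub struct Interner {
    map: HashMap<Vec<u8>, SymbolId>,
    strings: Vec<String>,
}

impl Interner {
    pub fn new() -> Self {
        Self::default()
    }

    /// Intern a byte slice, validating UTF-8 only on a pool miss.
    pub fn intern_utf8(&mut self, bytes: &[u8]) -> Result<SymbolId, Utf8Error> {
        // Hit: in this sketch every stored key was validated when it was
        // first interned, so the existing symbol is returned unchecked.
        if let Some(&id) = self.map.get(bytes) {
            return Ok(id);
        }

        // Miss: validate once, then store the new string.
        let s = std::str::from_utf8(bytes)?;
        let id = SymbolId(self.strings.len() as u32);
        self.strings.push(s.to_owned());
        self.map.insert(bytes.to_vec(), id);
        Ok(id)
    }

    /// Resolve a symbol back to its string, if it exists.
    pub fn lookup(&self, id: SymbolId) -> Option<&str> {
        self.strings.get(id.0 as usize).map(String::as_str)
    }
}
```

In this sketch, interning the same bytes twice yields the same `SymbolId` and validates only the first time, while invalid bytes that are not in the pool produce an `Err`. A real interner would likely avoid storing each string twice. The duplication here just keeps the example short.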

1625 Commits (c59b92370c8c78bd19d5a2297da7cb47b09f364a)