employer/tame - tame - Mike Gerwitz's Forge


Author	SHA1	Message	Date
Mike Gerwitz	1ad2fb1dc8	Copyright year update 2022 RSG (Ryan Specialty Group) recently announced a rename to Ryan Specialty (no "Group"), but I'm not sure if the legal name has been changed yet or not, so I'll wait on that.	2022-05-03 14:14:29 -04:00
Mike Gerwitz	7b6d68af85	tamer: xir::parse::Transition: Generalize flat::Transition XIRF introduced the concept of `Transition` to help document code and provide mental synchronization points that make it easier to reason about the system. I decided to hoist this into XIR's parser itself, and have `parse_token` accept an owned state and require a new state to be returned, utilizing `Transition`. Together with the convenience methods introduced on `Transition` itself, this produces much clearer code, as is evidenced by tree::Stack (XIRT's parser). Passing an owned state is something that I had wanted to do originally, but I thought it'd lead to more concise code to use a mutable reference. Unfortunately, that concision led to code that was much more difficult than necessary to understand, and ended up having a net negative benefit by leading to some more boilerplate for the nested types (granted, that could have been alleviated in other ways). This also opens up the possibility to do something that I wasn't able to before, which was continue to abstract away parser composition by stitching their state machines together. I don't know if this'll be done immediately, but because the actual parsing operations are now able to compose functionally without mutability getting in the way, the previous state coupling issues with the parent parser go away. DEV-10863	2022-03-17 16:02:05 -04:00
Mike Gerwitz	5af698d15c	tamer: xir::{tree::=>}parse: Move module It's a bit odd that I've done next to nothing with TAMER for the past week or so, and decided to do this one small thing before I go on break for the holidays, but I felt compelled to do _something_. Besides, this gets me in a better spot for the inevitable mental planning and writing I'll be doing over the holidays. This move was natural, given what this has evolved into---it has nothing to do with the concept of a "tree", and the module imports emphasized that fact given the level of inappropriate nesting.	2021-12-23 13:17:18 -05:00
Mike Gerwitz	d5a2d43526	tamer: xir::tree::attr::parse::AttrParse{r=>}State Simply correcting a naming inconsistency between the trait and the concrete type. DEV-11339 / DEV-11268	2021-12-17 10:22:29 -05:00
Mike Gerwitz	0cc0bc9d5a	tamer: xir::Token::AttrEnd: Remove More information can be found in the prior commit message, but I'll summarize here. This token was introduced to create a LL(0) parser---no tokens of lookahead. This allowed the underlying TokenStream to be freely passed to the next system that needed it. Since then, Parser and ParseState were introduced, along with ParseStatus::Dead, which introduces the concept of lookahead for a single token---an LL(1) grammar. I had always suspected that this would happen, given the awkwardness of AttrEnd; it was just a matter of time before the right abstraction manifested itself to handle lookahead. DEV-11339	2021-12-17 10:14:31 -05:00
Mike Gerwitz	61f7a12975	tamer: xir::tree: Integrate AttrParserState into Stack Note that AttrParse{r=>}State needs renaming, and Stack will get a better name down the line too. This commit message is accurate, but confusing. This performs the long-awaited task of trying to observe, concretely, how to combine two automata. This has the effect of stitching together the state machines, such that the union of the two is equivalent to the original monolith. The next step will be to abstract this away. There are some important things to note here. First, this introduces a new "dead" state concept, where here a dead state is defined as an _accepting_ state that has no state transitions for the given input token. This is more strict than a dead state as defined in, for example, the Dragon Book, where backtracking may occur. The reason I chose for a Dead state to be accepting is simple: it represents a lookahead situation. It says, "I don't know what this token is, but I've done my job, so it may be useful in a parent context". The "I've done my job" part is only applicable in an accepting state. If the parser is _not_ in an accepting state, then an unknown token is simply an error; we should _not_ try to backtrack or anything of the sort, because we want only a single token of lookahead. The reason this was done is because it's otherwise difficult to compose the two parsers without requiring that AttrEnd exist in every XIR stream; this has always been an awkward delimiter that was introduced to make the parser LL(0), but I tried to compromise by saying that it was optional. Of course, I knew that decision caused awkward inconsistencies, I had just hoped that those inconsistencies wouldn't manifest in practical issues. Well, now it did, and the benefits of AttrEnd that we had in the previous construction do not exist in this one. Consequently, it makes more sense to simply go from LL(0) to LL(1), which makes AttrEnd unnecessary, and a future commit will remove it entirely. 
All of this information will be documented, but I want to get further in the implementation first to make sure I don't change course again and therefore waste my time on docs. DEV-11268	2021-12-16 09:44:02 -05:00
Mike Gerwitz	0061a13d63	tree: xir::tree::Object: Remove now-unneeded enum This was added only for isolated attribute parsing. Of course, this does mean that a new union type will be needed when combining the two parsers, depending on the desired resolution, but that'll come at a later time and possibly in a more general way. DEV-11268	2021-12-14 12:44:32 -05:00
Mike Gerwitz	69acba3ec0	tamer: xir::tree: Use parse::Parser for parse All tree module parsing functions now make use of parse::Parser. This module will eventually be hoisted from tree. DEV-11268	2021-12-14 12:36:35 -05:00
Mike Gerwitz	6e9d139373	tamer: xir::tree::parse::Parser: Remove lifetime This will allow Parser to operate on both owned and &mut values, and is the same approach that Rust's built-in iterators take. This is at first quite surprising, and I often forget that this is a feature, and, as a bonus, an attractive way to avoid lifetimes in struct definitions when generics are used for the type that may become a reference. DEV-11268	2021-12-13 16:51:15 -05:00
Mike Gerwitz	f09900b80c	tamer: xir::tree: Remove isolated AttrList parsing This isn't currently used by anything, and this is collecting, which does not fit well with the streaming model. AttrList was originally written for Element parsing, and the isolated attr parser was written for test cases, before it was fully decided how this system ought to work. Instead, if AttrList is in fact needed, we can either collect (ideally not) or implement Extend for AttrList. (Or create TryExtend.) DEV-11268	2021-12-13 16:20:50 -05:00
Mike Gerwitz	29fdf5428c	tamer: xir::tree: {Parse=>Stack}Error Prepare to adopt parse::ParseError, which will contain StackError. DEV-11268	2021-12-13 15:27:20 -05:00
Mike Gerwitz	faed32af7e	tamer: xir::tree::ParserState: Remove and expose Stack directly This removes the layer of encapsulation that was hiding Stack, which is the actual parser. The new layer of encapsulation is parse::Parser, which will be introduced here soon. Baby steps, so it's clear how this evolves. DEV-11268	2021-12-13 15:02:08 -05:00
Mike Gerwitz	48517502d9	tamer: xir::tree::Parsed: Mirror xir::tree::parse::Parsed I think it's obvious where the next commit is going---replace xir::tree::Parsed. DEV-11268	2021-12-13 14:19:12 -05:00
Mike Gerwitz	c6d6f44bcb	tamer: xir::tree::parse: ParseStatus and Parsed The old Parsed was renamed to ParseStatus to be used by Parser, and Parser converts it into Parsed, which has the same variants as before except for Done, since it's not possible for Parser to yield it. DEV-11268	2021-12-10 16:51:53 -05:00
Mike Gerwitz	9facc26b4f	tamer: xir::tree::parse: Use new Parsed::Done variant over None This removes Option from ParseState, as mentioned in previous commits. This is ideal because it not only removes a layer of abstraction, but also makes the intent very clear; the use of None was too tied to the concept of an Iterator, which is the concern of Parser, _not_ ParseState. This is now similar to tree::Parsed, which will help with that refactoring shortly. The Done variant is not accessible outside of Parser, since it always converts it to None (to halt iteration); given that, we should have another public-facing type, as was also mentioned in a previous commit. DEV-11268	2021-12-10 16:22:02 -05:00
Mike Gerwitz	38363da9ff	tamer: xir::tree: {TokenStream=>ParseState} This also renames related types. See previous commits for more information. In essence, this trait represents the reification of all parser state. The omission of "r" in the name ParseState is intentional, since it indicates the state of a current parse. We'll see whether that naming ends up being too confusing; it's easy enough to change. DEV-11268	2021-12-10 15:42:01 -05:00
Mike Gerwitz	8eddf2f5ef	tamer: xir::tree::parse: Remove TokenStreamParser trait This just leaves Parser, which is what I started with, but I wasn't sure how far I was going to take this. I went against my usual judgment in creating a trait that I may not need, in an attempt to try to reason about the API that I wanted, because it wasn't yet clear at the time whether the Parser ought to be generic. Since then (as detailed in the last commit), this has become more of a coordinator/mediator, and the real parser is actually TokenStreamState, which will be renamed shortly. DEV-11268	2021-12-10 14:58:44 -05:00
Mike Gerwitz	bfe46be5bb	tamer: xir::tree::attr_parser_from: Integrate AttrParser This begins to integrate the isolated AttrParser. The next step will be integrating it into the larger XIRT parser. There's been considerable delay in getting this committed, because I went through quite the struggle with myself trying to determine what balance I want to strike between Rust's type system; convenience with parser combinators; iterators; and various other abstractions. I ended up being confounded by trying to maintain the current XmloReader abstraction, which is fundamentally incompatible with the way the new parsing system works (streaming iterators that do not collect or perform heap allocations). There'll be more information on this to come, but there are certain things that will be changing. There are a couple problems highlighted by this commit (not in code, but conceptually): 1. Introducing Option here for the TokenParserState doesn't feel right, in the sense that the abstraction is inappropriate. We should perhaps introduce a new variant Parsed::Done or something to indicate intent, rather than leaving the reader to have to read about what None actually means. 2. This turns Parsed into more of a statement influencing control flow/logic, and so should be encapsulated, with an external equivalent of Parsed that omits variants that ought to remain encapsulated. 3. TokenStreamState is true, but these really are the actual parsers; TokenStreamParser is more of a coordinator, and helps to abstract away some of the common logic so lower-level parsers do not have to worry about it. But calling it TokenStreamState is both a bit confusing and is an understatement---it _does_ hold the state, but it also holds the current parsing stack in its variants. Another thing that is not yet entirely clear is whether this AttrParser ought to care about detection of duplicate attributes, or if that should be done in a separate parser, perhaps even at the XIR level. 
The same can be said for checking for balanced tags. By pushing it to TokenStream in XIR, we would get a guaranteed check regardless of what parsers are used, which is attractive because it reduces the (almost certain-to-otherwise-occur) risk that individual parsers will not sufficiently check for semantically valid XML. But it does _potentially_ make error recovery more complicated. But at the same time, perhaps more specific parsers ought not care about recovery at that level. Anyway, point being, more to come, but I am disappointed how much time I'm spending considering parsing, given that there are so many things I need to move onto. I just want this done right and in a way that feels like it's working well with Rust while it's all in working memory, otherwise it's going to be a significant effort to get back into. DEV-11268	2021-12-10 14:25:08 -05:00
Mike Gerwitz	0e08cf3efe	tamer: xir::tree::parse: EOF span This stores the last seen Span and uses that when reporting EOF, so that the user will be able to be notified of where exactly the problem occurred. When I get into creating combinators, it'll be the responsibility of those combinators to ensure that any None return value will be supplemented by its own last span. DEV-11268	2021-12-06 15:34:29 -05:00
Mike Gerwitz	325c3167ee	tamer: xir::Token::span: New method This permits retrieving a Span from any Token variant. To support this, rather than having this return an Option, Token::AttrEnd was augmented with a Span; this results in a much simpler and friendlier API. DEV-11268	2021-12-06 14:48:55 -05:00
Mike Gerwitz	77c18d0615	tamer: xir: Remove Attr::Extensible This removes XIRT support for attribute fragments. The reason is that this is a write-only operation---fragments are used to concatenate SymbolIds without reallocation, which can only happen if we are generating XIR internally. Given that this cannot happen during read, it was a mistake to complicate the parsers. But it makes sense why I did originally, given that the XIRT parser was written for simplifying test cases. But now that we want parsers for real, and are writing production-quality parsers, this extra complexity is very undesirable. As a bonus, we also avoid any potential for heap allocations related to attributes. Granted, they didn't _really_ exist to begin with, but it was part of XIRT, and was ugly. DEV-11268	2021-12-06 14:26:58 -05:00
Mike Gerwitz	42b5007402	tamer: xir:tree: Begin work on composable XIRT parser The XIRT parser was initially written for test cases, so that unit tests could assert more easily on generated token streams (XIR). While it was planned, it wasn't clear what the eventual needs would be, which were expected to differ. Indeed, loading everything into a generic tree representation in memory is not appropriate---we should prefer streaming and avoiding heap allocations when they’re not necessary, and we should parse into an IR rather than a generic format, which ensures that the data follow a proper grammar and are semantically valid. When parsing attributes in an isolated context became necessary for the aforementioned task, the state machine of the XIRT parser was modified to accommodate. The opposite approach should have been taken---instead of adding complexity and special cases to the parser, and from a complex parser extracting a simple one (an attribute parser), we should be composing the larger (full XIRT) parser from smaller ones (e.g. attribute, child elements). A combinator, when used in a functional sense, refers not to combinatory logic but to the composition of more complex systems from smaller ones. The changes made as part of this commit begin to work toward combinators, though it's not necessarily evident yet (to you, the reader) how that'll work, since the code for it hasn't yet been written; this commit is simply getting my work thus far introduced so I can do some light refactoring before continuing on it. TAMER does not aim to introduce a parser combinator framework in its usual sense---it favors, instead, striking a proper balance with Rust’s type system that permits the convenience of combinators only in situations where they are needed, to avoid having to write new parser boilerplate. Specifically: 1. Rust’s type system should be used as combinators, so that parsers are automatically constructed from the type definition. 2. 
Primitive parsers are written as explicit automata, not as primitive combinators. 3. Parsing should directly produce IRs as a lowering operation below XIRT, rather than producing XIRT itself. That is, target IRs should consume XIRT and parse themselves immediately, during streaming. In the future, if more combinators are needed, they will be added; maybe this will eventually evolve into a more generic parser combinator framework for TAME, but that is certainly a waste of time right now. And, to be honest, I’m hoping that won’t be necessary.	2021-12-06 11:27:39 -05:00
Mike Gerwitz	54531e2284	tamer: xir::tree::attr: Display impls	2021-11-23 13:05:10 -05:00
Mike Gerwitz	d421112f35	tamer: xir::tree::ParserState::store_or_emit: Properly emit Parsed::Done This was forgotten when the attribute parser was introduced, and led to the parser continuing to the token following AttrEnd, which properly caused a failure given that the parser was in the Done state. There is a future task I have in my backlog to properly address the Done state, but this is sufficient for now.	2021-11-17 00:13:07 -05:00
Mike Gerwitz	e0811589fa	tamer: xir::tree::attr::value_atom: Doc typo fix	2021-11-16 15:48:59 -05:00
Mike Gerwitz	f519dab2b6	tamer: xir::tree::attr::Attr::value_atom: Option<SymbolId>=>SymbolId To maintain a proper abstraction, this cannot be the responsibility of the caller; most callers should not know that fragments exist, let alone how to handle them.	2021-11-16 12:41:03 -05:00
Mike Gerwitz	5233822322	tamer: xir: Remove Text enum Like previous commits, this replaces the explicit escaping context with the convention that all values retrieved from `xir` are unescaped on read and escaped on write. Comments are a notable TODO, since we must escape only `--`. CData is also an issue. I had _expected_ to use it as a means to avoid unescaping fragments, but I had forgotten that quick_xml hard-codes escaping on read, so that it can re-use BytesStart! That is terribly unfortunate, and may result in us having to re-implement our own read method in the future to avoid this nonsense. So I'm just leaving it as a TODO for now. DEV-11081	2021-11-15 23:47:14 -05:00
Mike Gerwitz	27ba03b59b	tamer: xir::escape: Remove XirString in favor of Escaper This rewrites a good portion of the previous commit. Rather than explicitly storing whether a given string has been escaped, we can instead assume that all SymbolIds leaving or entering XIR are unescaped, because there is no reason for any other part of the system to deal with such details of XML documents. Given that, we need only unescape on read and escape on write. This is customary, so why didn't I do that to begin with? The previous commit outlines the reason, mainly being an optimization for the echo writer that is upcoming. However, this solution will end up being better---it's not implemented yet, but we can have a caching layer, such that the Escaper records a mapping between escaped and unescaped SymbolIds to avoid work the next time around. If we share the Escaper between _all_ readers and the writer, the result is that 1. Duplicate strings between source files and object files (many of which are read by both the linker and compiler) avoid re-unescaping; and 2. Writers can use this cache to avoid re-escaping when we've already seen the escaped variant of the string during read. The alternative would be a global cache, like the internment system, but I did not find that to be appropriate here, since this is far less fundamental and is much easier to compose. DEV-11081	2021-11-12 14:03:23 -05:00
Mike Gerwitz	b1c0783c75	tamer: xir::XirString: WIP implementation (likely going away) I'm not fond of this implementation, which is why it's not fully completed. I wanted to commit this for future reference, and take the opportunity to explain why I don't like it. First: this task started as an idea to implement a third variant to AttrValue and friends that indicates that a value is fixed, in the sense of a fixed-point function: escaped or unescaped, its value is the same. This would allow us to skip wasteful escape/unescape operations. In doing so, it became obvious that there's no need to leak this information through the API, and indeed, no part of the system should care. When we read XML, it should be unescaped, and when we write, it should be escaped. The reason that this didn't quite happen to begin with was an optimization: I'll be creating an echo writer in place of the current filesystem-based copy in tamec shortly, and this would allow streaming XIR directly from the reader to the writer without any unescaping or re-escaping. When we unescape, we know the value that it came from, so we could simply store both symbols---they're 32-bit, so it results in a nicely compressed 64-bit value, so it's essentially cost-free, as long as we accept the expense of internment. This is `XirString`. Then, when we want to escape or unescape, we first check to see whether a symbol already exists and, if so, use it. While this works well for echoing streams, it won't work all that well in practice: the unescaped SymbolId will be taken and the XirString discarded, since nothing after XIR should be coupled with it. Then, when we later construct a XIR stream for writing, XirString will no longer be available and our previously known escape is lost, so the writer will have to re-escape. Further, if we look at XirString's generic for the XirStringEscaper---it uses phantom, which hints that maybe it's not in the best place. 
Indeed, I've already acknowledged that only a reader unescapes and only a writer escapes, and that the rest of the system works with normal (unescaped) values, so only readers and writers should be part of this process. I also already acknowledged that XirString would be lost and only the unescaped SymbolId would be used. So what's the point of XirString, then, if it won't be a useful optimization beyond the temporary echo writer? Instead, we can take the XirStringWriter and implement two caches on that: mapping SymbolId from escaped->unescaped and vice-versa. These can be simple vectors, since SymbolId is a 32-bit value we will not have much wasted space for symbols that never get read or written. We could even optimize for preinterned symbols using markers, though I'll probably not do so, and I'll explain why later. If we do _that_, we get even _better_ optimizations through caching that _will_ apply in the general case (so, not just for echo), and we're able to ditch XirString entirely and simply use a SymbolId. This makes for a much more friendly API that isn't leaking implementation details, though it _does_ put an onus on the caller to pass the encoder to both the reader and the writer, _if_ it wants to take advantage of a cache. But that burden is not significant (and is, again, optional if we don't want it). So, that'll be the next step.	2021-11-10 12:22:10 -05:00
Mike Gerwitz	1f01833d30	tamer: xir::tree::attr_parser_from: Do not take ownership over iter The previous implementation took ownership over the provided iterator, which was an oversight, considering that this is intended to be used in contexts where doing so is not possible. A good example where isolated test cases aren't necessarily painting the correct picture. `scan` takes owned values, so this instead uses the same parsing method as `parse_attrs`, but using a `FromFn` iterator to avoid having to create a whole new iterator type. This will work well so long as we don't need to store the type returned by this (while also wanting to avoid boxing). DEV-11062	2021-11-05 10:54:05 -04:00
Mike Gerwitz	428d508be4	tamer: {ir::=>}{asg, xir} See the previous commit. There is no sense in some common "IR" namespace, since those IRs should live close to whatever system whose data they represent. In the case of these, they are general IRs that can apply to many different parts of the system. If that proves to be a false statement, they'll be moved. DEV-10863	2021-11-04 16:13:27 -04:00
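
Several of the messages above (7b6d68af85, 61f7a12975, 0cc0bc9d5a) describe the same idea: `parse_token` takes an owned state and returns the next state, and an accepting state with no transition for a token is "dead", handing that token back as a single token of lookahead rather than requiring an `AttrEnd` delimiter. The following is only a minimal sketch of that idea under simplifying assumptions. Every name in it (`Token`, `AttrState`, `ParseStatus`, `ParseError`, `parse_attrs`) is hypothetical and much smaller than TAMER's real types, and the `Vec` is used purely to make the result easy to inspect, whereas the commits explicitly prefer streaming over collecting.

```rust
// Illustrative sketch only: all names here are hypothetical, not TAMER's API.

/// A drastically simplified token type.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Token {
    AttrName(&'static str),
    AttrValue(&'static str),
    Close,
}

type Attr = (&'static str, &'static str);

#[derive(Debug, PartialEq)]
enum ParseStatus {
    /// More tokens are needed before an object can be emitted.
    Incomplete,
    /// A complete attribute was parsed.
    Object(Attr),
    /// Accepting state with no transition for this token; the token is
    /// returned to the caller as one token of lookahead.
    Dead(Token),
}

#[derive(Debug, PartialEq)]
enum ParseError {
    Unexpected(Token),
    UnexpectedEof,
}

#[derive(Debug, PartialEq)]
enum AttrState {
    /// Between attributes; this is the only accepting state.
    Empty,
    /// A name has been seen and a value is required next.
    Name(&'static str),
}

impl AttrState {
    /// Consume the current state and a token, returning the next state
    /// together with the result of the transition.
    fn parse_token(self, tok: Token) -> (Self, Result<ParseStatus, ParseError>) {
        use AttrState::{Empty, Name};
        match (self, tok) {
            (Empty, Token::AttrName(name)) => (Name(name), Ok(ParseStatus::Incomplete)),
            (Name(name), Token::AttrValue(value)) => {
                (Empty, Ok(ParseStatus::Object((name, value))))
            }
            // Accepting state, unknown token: dead state, yield lookahead.
            (Empty, tok) => (Empty, Ok(ParseStatus::Dead(tok))),
            // Non-accepting state, unknown token: an error, no backtracking.
            (st @ Name(_), tok) => (st, Err(ParseError::Unexpected(tok))),
        }
    }
}

/// Drive the automaton, returning the parsed attributes and the lookahead
/// token (if any) that a parent parser would handle next.
fn parse_attrs(
    toks: impl IntoIterator<Item = Token>,
) -> Result<(Vec<Attr>, Option<Token>), ParseError> {
    let mut state = AttrState::Empty;
    let mut attrs = Vec::new();
    for tok in toks {
        let (next, result) = state.parse_token(tok);
        state = next;
        match result? {
            ParseStatus::Incomplete => {}
            ParseStatus::Object(attr) => attrs.push(attr),
            ParseStatus::Dead(lookahead) => return Ok((attrs, Some(lookahead))),
        }
    }
    match state {
        AttrState::Empty => Ok((attrs, None)),
        AttrState::Name(_) => Err(ParseError::UnexpectedEof),
    }
}
```

In this sketch a parent parser would call `parse_attrs`, receive `Some(Token::Close)` as lookahead, and continue with its own transitions from there, which is the stitching of state machines that the messages refer to.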

31 Commits (9edc32dd3bbc2c6e0d1fcd0aaf8f15f0548d4f03)
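
The messages for b1c0783c75 and 27ba03b59b propose, but state is not yet implemented, a cache shared between readers and the writer that maps escaped symbols to unescaped ones and back, so that a string seen in escaped form on read need not be re-escaped on write. A rough sketch of how such a cache might behave follows. It rests on loud assumptions: the tiny `Interner`, the `CachingEscaper` type, the `escape_work` counter, and the helper functions are all invented for illustration, `SymbolId` is modeled as a plain `u32`, and `HashMap` stands in for the simple vectors the message suggests.

```rust
use std::collections::HashMap;

// Illustrative sketch only: all names here are hypothetical, not TAMER's API.
type SymbolId = u32;

/// A toy string interner standing in for a real internment system.
#[derive(Default)]
struct Interner {
    ids: HashMap<String, SymbolId>,
    strs: Vec<String>,
}

impl Interner {
    fn intern(&mut self, s: &str) -> SymbolId {
        if let Some(&id) = self.ids.get(s) {
            return id;
        }
        let id = self.strs.len() as SymbolId;
        self.strs.push(s.to_owned());
        self.ids.insert(s.to_owned(), id);
        id
    }

    fn lookup(&self, id: SymbolId) -> &str {
        &self.strs[id as usize]
    }
}

fn escape_str(s: &str) -> String {
    // `&` must be replaced first so that later entities are not re-escaped.
    s.replace('&', "&amp;").replace('<', "&lt;").replace('>', "&gt;")
}

fn unescape_str(s: &str) -> String {
    // `&amp;` must be replaced last for the same reason.
    s.replace("&lt;", "<").replace("&gt;", ">").replace("&amp;", "&")
}

/// Caches both directions of the escaped/unescaped mapping.
#[derive(Default)]
struct CachingEscaper {
    to_escaped: HashMap<SymbolId, SymbolId>,
    to_unescaped: HashMap<SymbolId, SymbolId>,
    /// Counts how often escaping actually had to be performed.
    escape_work: usize,
}

impl CachingEscaper {
    fn record(&mut self, escaped: SymbolId, unescaped: SymbolId) {
        self.to_unescaped.insert(escaped, unescaped);
        // Keep the first escaped form seen for a given unescaped symbol.
        self.to_escaped.entry(unescaped).or_insert(escaped);
    }

    /// Used by readers: escaped symbol in, unescaped symbol out.
    fn unescape(&mut self, interner: &mut Interner, escaped: SymbolId) -> SymbolId {
        if let Some(&id) = self.to_unescaped.get(&escaped) {
            return id;
        }
        let text = unescape_str(interner.lookup(escaped));
        let unescaped = interner.intern(&text);
        self.record(escaped, unescaped);
        unescaped
    }

    /// Used by writers: unescaped symbol in, escaped symbol out.
    fn escape(&mut self, interner: &mut Interner, unescaped: SymbolId) -> SymbolId {
        if let Some(&id) = self.to_escaped.get(&unescaped) {
            return id;
        }
        self.escape_work += 1;
        let text = escape_str(interner.lookup(unescaped));
        let escaped = interner.intern(&text);
        self.record(escaped, unescaped);
        escaped
    }
}
```

Sharing one `CachingEscaper` between a reader and a writer is what lets the writer reuse the escaped symbol recorded during read. Callers that do not share it simply lose the cache, which matches the "optional" burden described in the message.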