employer/tame - tame - Mike Gerwitz's Forge

Author	SHA1	Message	Date
Mike Gerwitz	8779abe2bb	tamer: xir::flat: Expose depth for all node-related tokens Previously a `Depth` was provided only for `Open` and `Close`. This depth information, for example, will be used by NIR to quickly determine whether a given parser ought to assert ownership of a text/comment token rather than delegating it. This involved modifying a number of test cases, but it's worth repeating in these commits that this is intentional---I've been bit in the past using `..` in contexts where I really do want to know if variant fields change so that I can consider whether and how that change may affect the code utilizing that variant. DEV-7145	2022-08-01 15:01:37 -04:00
Mike Gerwitz	b3c0bdc786	tamer: xir::parse::ele: Ignore whitespace around elements Recent changes regarding whitespace were all to support this change (though it was also needed for XIRF, pre- and post-root). Now I'll have to contend with how I want to handle text nodes in various circumstances, in terms of `ele_parse!`. DEV-7145	2022-08-01 15:01:37 -04:00
Mike Gerwitz	8f3301431c	tamer: span::dummy: New module to hold DUMMY_SPAN and derivatives Various DUMMY_SPAN-derived spans are used by many test cases, so this finally extracts them---something I've been meaning to do for some time. This also places DUMMY_SPAN behind a `cfg(test)` directive to ensure that it is _only_ used in tests; UNKNOWN_SPAN should be used when a span is actually unknown, which may also be the case during development. DEV-7145	2022-08-01 15:01:37 -04:00
Mike Gerwitz	0edb21429d	tamer: parse::error: Describe unexpected token of input When Parser has an unhandled dead state and fails due to an unexpected token of input, we should display what we interpreted that token as. DEV-7145	2022-08-01 15:01:37 -04:00
Mike Gerwitz	18803ea576	tamer: xir: Format tokens without tt quotes Whether or not quoting is appropriate depends on context, and that parent context is already performing the quoting. For example: error: expected `</rater>`, but found `<import>` --> /home/[...]/foo.xml:2:1 \| 2 \| <rater xmlns="http://www.lovullo.com/rater" \| ------ note: element starts here --> /home/[...]/foo.xml:7:3 \| 7 \| <import package="/rater/core/base" /> \| ^^^^^^^ error: expected `</rater>` In these cases (obviously I'm still working on the parser, since this is nonsense), the parser is responsible for quoting the token "<import>". DEV-7145	2022-08-01 15:01:37 -04:00
Mike Gerwitz	8778976018	tamer: xir::flat: Ignore whitespace both before and after root DEV-7145	2022-08-01 15:01:37 -04:00
Mike Gerwitz	4f2b27f944	tamer: xir: Attribute error formatting/typo fixes There were two problem errors: one showing "element element" and one showing the value along with the name of the attribute. The change for `<Attr as Display>::fmt` is debatable. I'm going to do this for now (only show `@name`) and adjust later if necessary. I'll need to go use `crate::fmt` consistently in previously-existing format strings at some point, too. DEV-7145	2022-08-01 15:01:37 -04:00
Mike Gerwitz	41b41e02c1	tamer: Xirf::Text refinement This teaches XIRF to optionally refine Text into RefinedText, which determines whether the given SymbolId represents entirely whitespace. This is something I've been putting off for some time, but now that I'm parsing source language for NIR, it is necessary, in that we can only permit whitespace Text nodes in certain contexts. The idea is to capture the most common whitespace as preinterned symbols. Note that this heuristic ought to be determined from scanning a codebase, which I haven't done yet; this is just an initial list. The fallback is to look up the string associated with the SymbolId and perform a linear scan, aborting on the first non-whitespace character. This combination of checks should be sufficiently performant for now considering that this is only being run on source files, which really are not all that large. (They become large when template-expanded.) I'll optimize further if I notice it show up during profiling. This also frees XIR itself from being concerned by Whitespace. Initially I had used quick-xml's whitespace trimming, but it messed up my span calculations, and those were a pain in the ass to implement to begin with, since I had to resort to pointer arithmetic. I'd rather avoid tweaking it. tameld will not check for whitespace, since it's not important---xmlo files, if malformed, are the fault of the compiler; we can ignore text nodes except in the context of code fragments, where they are never whitespace (unless that's also a compiler bug). Onward and yonward. DEV-7145	2022-08-01 15:01:37 -04:00
Mike Gerwitz	b38c16fd08	tamer: parse::trace: Generalize reason for trace output The trace outputs a note in the footer indicating _why_ it's being output, so that the reader understands both where the potentially-unexpected behavior originates from and so they know (in the case of the feature flag) how to inhibit it. That information originally lived in `Parser`, where the `cfg` directive to enable it lives, but it was moved into the abstraction. This corrects that. DEV-7145	2022-08-01 15:01:12 -04:00
Corey Vollmer	864f50c025	[DEV-9619] Support all UTF-8 characters	2022-07-27 12:58:00 -04:00
Corey Vollmer	2901f06318	[DEV-9619] Return sha256 This fixes the implementation of sha256 to be compatible with our system.	2022-07-27 12:55:17 -04:00
Mike Gerwitz	17327f1b64	tamer: parse::trace: Extract tracing into new module This has gotten large and was cluttering `feed_tok`. This also provides the ability to more easily expand into other types of tracing in the future. DEV-7145	2022-07-26 09:29:17 -04:00
Mike Gerwitz	8f25c9ae0a	tamer: parse::parser: Include object in parser trace This information is likely redundant in a lowering pipeline, but is more useful outside of such a pipeline. It's also more clear. `Object` does not implement `Display`, though, because that's too burdensome for how it's currently used. Many `Object`s are also `Token`s though and, if fed to another `Parser` for lowering, it'll get `Display::fmt`'d. DEV-7145	2022-07-26 09:28:39 -04:00
Mike Gerwitz	4b5e51b0f0	tamer: parse::parser::Parser::feed_tok: cfg note precedence Rust was warning that `cfg` was unused if both `test` and `parser-trace-stderr`. This both allows that and adjusts the precedence to make more sense for tests. DEV-7145	2022-07-26 09:28:39 -04:00
Mike Gerwitz	c3dfcc565c	tamer: parse::parser::Parser: Include errors in parse trace Because of recovery, the trace otherwise paints a really confusing-looking picture when given unexpected input. This is large enough now that it really ought to be extracted from `feed_tok`, but I'll wait to see how this evolves further. I considered adding color too, but it's not yet clear to me that the visual noise will be all that helpful. DEV-7145	2022-07-26 09:28:37 -04:00
Corey Vollmer	f667a1a58e	[DEV-9619] Update sha256 script to handle UTF8 This commit replaces the sha256 script with a newer implementation which supports all UTF8 characters. https://github.com/emn178/js-sha256/blob/master/src/sha256.js Note that this commit breaks the system; the following commit fixes this.	2022-07-22 08:46:35 -04:00
Mike Gerwitz	422f3d9c0c	tamer: New parser-trace-stderr feature flag This flag allows toggling the parser trace that was previously only available to tests. Unfortunately, at the time of writing, Cargo cannot enable flags in profiles, so I have to check for either `test` or this flag being set to enable relevant features. This trace is useful as I start to run the parser against existing code written in TAME so that our existing systems can help to guide my development. Unlike the current tests, it also allows seeing real-world data as part of the lowering pipeline, where multiple `Parser`s are in play. Having this feature flag also makes this feature more easily discoverable to those wishing to observe how the lowering pipeline works. DEV-7145	2022-07-21 22:10:08 -04:00
Mike Gerwitz	de35cc37fd	tamer: xir::writer::XmlWriter: Do not take Token ownership impl for `&Token` instead of Token; the writer is just copying data into the destination stream anyway. This will allow us to continue writing the token while also using it for further processing, like `tee`. DEV-7145	2022-07-21 15:29:55 -04:00
Mike Gerwitz	0504788a16	tamer: xir::parse::ele: Visibility specifier We need to be able to export generated identifiers. Trying to figure out a syntax for this was a bit tricky considering how much is generated, so I just settled on something that's reasonably clear and easy to parse with `macro_rules!`. I had intended to just make everything public by default and encapsulate using private modules, but that then required making everything else that it uses public (e.g. error and token objects), which would have been a bizarre thing to do in e.g. test cases. DEV-7145	2022-07-21 14:56:43 -04:00
Mike Gerwitz	acced76788	tamer: xir::parse::ele: Expand types for external expansion for sum NT Like a previous commit, this corrects the types for sum NTs so that they properly resolve in contexts external to xir::parse. DEV-7145	2022-07-21 13:44:30 -04:00
Mike Gerwitz	992c000b68	tamer: xir::parse::ele: AttrValueError for attr_parse!'s ValueError This integrates the previous ValueError for `attr_parse!` into `ele_parse!`. DEV-7145	2022-07-21 09:23:34 -04:00
Mike Gerwitz	3a764d111e	tamer: xir::parse::attr: Fallible value parsing Values can be parsed using `TryFrom<Attr>`. Previously only `From<Attr>` was supported, which could not fail. This is critical for parsing values into types, which will wrap `SymbolId` to provide data assurances. DEV-7145	2022-07-21 09:23:11 -04:00
Mike Gerwitz	184ff6bdcc	tamer: xir::parse: Fixes for {ele,attr}_parse! outside of module The tests had certain things in scope, but now that I'm trying to use it outside of those modules, some fixes are needed. This is admittedly a sloppy commit, with a number of miscellaneous fixes. I didn't bother separating it more because most of them are type fixes, and the `From<Attr>` stuff is going to have to change into, likely, `TryFrom<Attr>` so that parse failures can occur when attributes do not match certain patterns. DEV-7145	2022-07-20 15:40:28 -04:00
Mike Gerwitz	e517e15a29	tamer: parse::Token: Swap trait method order This just places `ir_name` first in the trait definition so that it'll be inserted in that same order when using LSP. DEV-7145	2022-07-20 13:58:44 -04:00
Mike Gerwitz	c856fd72d9	tamer: xir::parse::ele: Diagnostic output The only additional information needed was opening spans so that we can provide useful information regarding closing tags. This uses a generic Span in place of {Open,Close}Span because the latter wasn't necessary, but more descriptive types would be nice; it may be beneficial later on to introduce newtypes for each of the span generated by {Open,Close}Span. DEV-7145	2022-07-20 12:17:15 -04:00
Mike Gerwitz	ce765d3b56	tamer: xir::parse::attr: Error and recovery on duplicate attr This was a TODO for the attribute parser generator. The first attribute will be kept and later ones will be ignored, producing an error. Recovery permits further attribute parsing having ignored the duplicate. DEV-7145	2022-07-20 12:16:13 -04:00
Mike Gerwitz	21dfff0110	tamer: xir::parse::attr::test: Extract into own file It's not going to be getting any smaller. DEV-7145	2022-07-20 10:02:41 -04:00
Mike Gerwitz	1ec9c963fd	tamer: xir::parse::ele: Nonterminal repetition (Kleene star) This allows an element to be repeated by the parent NT. The easiest way I saw to implement this for now was to abuse the Context to provide a runtime configuration that would allow the state machine to reset after it has completed parsing. This also influences error recovery, in that if we're expecting zero or more of something, we cannot provide an error for an unexpected name, and instead must emit a dead state so that the caller can determine what to do. DEV-7145	2022-07-19 16:14:12 -04:00
Mike Gerwitz	e73c223a55	tamer: parser::Parser: cfg(test) tracing This produces useful parse traces that are output as part of a failing test case. The parser generator macros can be a bit confusing to deal with when things go wrong, so this helps to clarify matters. This is _not_ intended to be machine-readable, but it does show that it would be possible to generate machine-readable output to visualize the entire lowering pipeline. Perhaps something for the future. I left these inline in Parser::feed_tok because they help to elucidate what is going on, just by reading what the trace would output---that is, it helps to make the method more self-documenting, albeit a tad bit more verbose. But with that said, it should probably be extracted at some point; I don't want this to set a precedent where composition is feasible. Here's an example from test cases: [Parser::feed_tok] (input IR: XIRF) \| ==> Parser before tok is parsing attributes for `package`. \| \| Attrs_(SutAttrsState_ { ___ctx: (QName(None, LocalPart(NCName(SymbolId(46 "package")))), OpenSpan(Span { len: 0, offset: 0, ctx: Context(SymbolId(1 "#!DUMMY")) }, 10)), ___done: false }) \| \| ==> XIRF tok: `<unexpected>` \| \| Open(QName(None, LocalPart(NCName(SymbolId(82 "unexpected")))), OpenSpan(Span { len: 0, offset: 1, ctx: Context(SymbolId(1 "#!DUMMY")) }, 10), Depth(1)) \| \| ==> Parser after tok is expecting opening tag `<classify>`. \| \| ChildA(Expecting_) \| \| Lookahead: Some(Lookahead(Open(QName(None, LocalPart(NCName(SymbolId(82 "unexpected")))), OpenSpan(Span { len: 0, offset: 1, ctx: Context(SymbolId(1 "#!DUMMY")) }, 10), Depth(1)))) = note: this trace was output as a debugging aid because `cfg(test)`. [Parser::feed_tok] (input IR: XIRF) \| ==> Parser before tok is expecting opening tag `<classify>`. 
\| \| ChildA(Expecting_) \| \| ==> XIRF tok: `<unexpected>` \| \| Open(QName(None, LocalPart(NCName(SymbolId(82 "unexpected")))), OpenSpan(Span { len: 0, offset: 1, ctx: Context(SymbolId(1 "#!DUMMY")) }, 10), Depth(1)) \| \| ==> Parser after tok is attempting to recover by ignoring element with unexpected name `unexpected` (expected `classify`). \| \| ChildA(RecoverEleIgnore_(QName(None, LocalPart(NCName(SymbolId(82 "unexpected")))), OpenSpan(Span { len: 0, offset: 1, ctx: Context(SymbolId(1 "#!DUMMY")) }, 10), Depth(1))) \| \| Lookahead: None = note: this trace was output as a debugging aid because `cfg(test)`. DEV-7145	2022-07-19 14:44:18 -04:00
Mike Gerwitz	f462c7daec	tamer: xir::parse::attr: Display: element name This resolves a TODO by including the name of the element whose attributes are currently being parsed. This also frees a parent from having to provide additional context, allowing Display to be fully delegated when stitching. DEV-7145	2022-07-18 14:43:29 -04:00
Mike Gerwitz	2f4c20dac8	tamer: xir::parse::ele: Remaining Display::fmt for nonterminals The following commit (test tracing) requires non-panicking `Display` and `Debug` values. DEV-7145	2022-07-18 14:31:42 -04:00
Mike Gerwitz	cf2cd882ca	tamer: xir::parse::ele: Introduce sum nonterminals This introduces `Nt := (A \| ... \| Z);`, where `Nt` is the name of the nonterminal and `A ... Z` are the inner nonterminals---it produces a parser that provides a choice between a set of nonterminals. This is implemented efficiently by understanding the QName that is accepted by each of the inner nonterminals and delegating that token immediately to the appropriate parser. This is a benefit of using a parser generator macro over parser combinators---we do not need to implement backtracking by letting inner parsers fail, because we know ahead of time exactly what parser we need. This _does not_ verify that each of the inner parsers accept a unique QName; maybe at a later time I can figure out something for that. However, because this compiles into a `match`, there is no ambiguity---like a PEG parser, there is precedence in the face of an ambiguous token, and the first one wins. Consequently, tests would surely fail, since the latter wouldn't be able to be parsed. This also demonstrates how we can have good error suggestions for this parsing framework: because the inner nonterminals and their QNames are known at compile time, error messages simply generate a list of QNames that are expected. The error recovery strategy is the same as previously noted, and subject to the same concerns, though it may be more appropriate here: it is desirable for the inner parser to fail rather than retrying, so that the sum parser is able to fail and, once the Kleene operator is introduced, retry on another potential element. But again, that recovery strategy may happen to work in some cases, but'll fail miserably in others (e.g. placing an unknown element at the head of a block that expects a sequence of elements would potentially fail the entire block rather than just the invalid one). But more to come on that later; it's not critical at this point. I need to get parsing completed for TAME's input language. 
DEV-7145	2022-07-14 15:12:57 -04:00
Mike Gerwitz	1fdfc0aa4d	tamer: xir::parse::ele: Introduce open/close span bindings This adds the ability to bind identifiers to represent `OpenSpan` and `CloseSpan`, available to the `@` and `/` maps. Since identifiers in TAME originate from attributes, this may not get a whole lot of use, but it's important to be available. There is some awkwardness in that the opening span appears to be scoped to the entire nonterminal, but it's actually only available in the `@` mapping. I'll change this if it's actually needed; this keeps things simple for now. DEV-7145	2022-07-13 23:42:51 -04:00
Mike Gerwitz	cceb8c7fb9	tamer: xir::parse::ele: Initial Close mapping support Since the parsers produce streaming IRs, we need to be able to emit tokens representing closing delimiters, where they are important. This notably doesn't use spans; I'll add those next, since they're also needed for the previous work. DEV-7145	2022-07-13 15:02:46 -04:00
Mike Gerwitz	c30c0e268d	tamer: xir::parse::ele::test: TODO regarding recovery strategy The comment explains the issue. I don't think the strategy is going to be a desirable one, but I want to move on and observe in retrospect how it ought to be handled. The important part right now is that recovery is accounted for and possible, which was a long-standing concern. DEV-7145	2022-07-13 14:25:25 -04:00
Mike Gerwitz	73efc59582	tamer: xir::parse::ele: Initial element parser generator concept This begins generating parsers that are capable of parsing elements. I need to move on, so this abstraction isn't going to go as far as it could, but let's see where it takes me. This was the work that required the recent lookahead changes, which has been detailed in previous commits. This initial support is basic, but robust. It supports parsing elements with attributes and children, but it does not yet support the equivalent of the Kleene star (`*`). Such support will likely be added by supporting parsers that are able to recurse on their own definition in tail position, which will also require supporting parsers that do not add to the stack. This generates parsers that, like all the other parsers, use enums to provide a typed stack. Stitched parsers produce a nested stack that is always bounded in size. Fortunately, expressions---which can nest deeply---do not need to maintain ancestor context on the stack, and so this should work fine; we can get away with this because XIRF ensures proper nesting for us. Statements that _do_ need to maintain such context are not nested. This also does not yet support emitting an object on closing tag, which will be necessary for NIR, which will be a streaming IR that is "near" to the source XML in structure. This will then be used to lower into AIR for the ASG, which gives structure needed for further analysis. More information to come; I just want to get this committed to serve as a mental synchronization point and clear my head, since I've been sitting on these changes for so long and have to keep stashing them as I tumble down rabbit holes covered in yak hair. DEV-7145	2022-07-13 14:08:47 -04:00
Mike Gerwitz	c9b3b84f90	tamer: parse::transition::Lookahead: ParseState=>Token type param Having the lookahead token generic over the `ParseState` was a pain in the ass for stitching, since they shared the same token type but not the same parser. I don't expect there to be any need to be able to infer other parser-related types for a token of lookahead, so I'd rather just make my life easier until such a thing is needed. DEV-7145	2022-07-13 10:13:35 -04:00
Mike Gerwitz	bd783ac08b	tamer: Replace ParseStatus::Dead with generic lookahead Oh what a tortured journey. I had originally tried to avoid formalizing lookahead for all parsers by pretending that it was only needed for dead state transitions (that is---states that have no transitions for a given input token), but then I needed to yield information for aggregation. So I added the ability to override the token for `Dead` to yield that, in addition to the token. But then I also needed to yield lookahead for error conditions. It was a mess that didn't make sense. This eliminates `ParseStatus::Dead` entirely and fully integrates the lookahead token in `Parser` that was previously implemented. Notably, the lookahead token is encapsulated in `TransitionResult` and unavailable to `ParseState` implementations, forcing them to rely on `Parser` for recursion. This not only prevents `ParseState` from recursing, but also simplifies delegation by removing the need to manually handle tokens of lookahead. The awkward case here is XIRT, which does not follow the streaming parsing convention, because it was conceived before the parsing framework. It needs to go away, but doing so right now would be a lot of work, so it has to stick around for a little bit longer until the new parser generators can be used instead. It is a persistent thorn in my side, going against the grain. `Parser` will immediately recurse if it sees a token of lookahead with an incomplete parse. This is because stitched parsers will frequently yield a dead state indication when they're done parsing, and there's no use in propagating an `Incomplete` status down the entire lowering pipeline. But, that does mean that the toplevel is not the only thing recursing. _But_, the behavior doesn't really change, in the sense that it would infinitely recurse down the entire lowering stack (though there'd be an opportunity to detect that). 
This should never happen with a correct parser, but it's not worth the effort right now to try to force such a thing with Rust's type system. Something like TLA+ is better suited here as an aid, but it shouldn't be necessary with clear implementations and proper test cases. Parser generators will also ensure such a thing cannot occur. I had hoped to remove ParseStatus entirely in favor of Parsed, but there's a lot of type inference that happens based on the fact that `ParseStatus` has a `ParseState` type parameter; `Parsed` has only `Object`. It is desirable for a public-facing `Parsed` to not be tied to `ParseState`, since consumers need not be concerned with such a heavy type; however, we _do_ want that heavy type internally, as it carries a lot of useful information that allows for significant and powerful type inference, which in turn creates expressive and convenient APIs. DEV-7145	2022-07-12 00:11:45 -04:00
Mike Gerwitz	61ce7d3fc7	tamer: parse::state::transition: Extract module into own file That's it. Just preparing for changes that will change how lookaheads and dead state transitions will work. DEV-7145	2022-07-07 12:47:31 -04:00
Mike Gerwitz	e54f93b30f	tamer: parse: Introduce lookahead token in Parser NB: This is the initial change to introduce the token of lookahead, but this does not fully integrate it. In particular, this is missing from the stitching/delegation layer. This has been a long time coming, I suppose, though I had tried to avoid it with `Parser::delegate_lookahead`. But the problem with doing that is that it forced the ParserState to recurse, which both violates that I want no looping constructs except for the toplevel, and performs additional stack allocation as it is not in tail position. The final straw was having to both return an error _and_ an aggregate object for the attribute parser when an unexpected element is encountered (this code is not yet committed). One option was to add a recovery object to the error object, and formalize that, but then we have other concerns; for example, what if that recovery object triggered an error? We'd have to mask either the old or the new error. But we wouldn't want to mask either, because the object causing the error would be the aggregate attributes, which is _not_ a recovery object, but actual data we want to emit. And so it's a kluge right off of the bat. The use of a token of lookahead is a more traditional approach and has uses outside of just this one scenario. It'll also allow for the removal of recursion from the existing ParserStates, and possibly the elimination of dead state associated data, though I may end up leaving that; more to come. Rust will also optimize away lookahead storage and processing in Parsers that do not utilize it. DEV-7145	2022-07-07 11:19:55 -04:00
Mike Gerwitz	6385270fe6	tamer: Ensure debug_assert! takes effect in test profile I'd feel rather silly if I used `debug_assert!` for the sake of tests and they weren't actually being run due to optimization settings. This is just to catch potential future regressions; all is well today. DEV-7145	2022-07-05 14:59:35 -04:00
Mike Gerwitz	40c68d3e1e	tamer: parse::state::TransitionResult: Make opaque There was only one test outside of the `parse` module using these fields. The next commit will be introducing lookahead, and I do not want to have to trust callers to ensure invariants are met. DEV-7145	2022-07-05 14:12:06 -04:00
Mike Gerwitz	a16a0d9138	Revert "tamer: xir: Initial re-introduction of AttrEnd" This reverts commit `b973d36862`. Alright, I'm getting sick of fighting with myself on this. But rather than just removing the last commit, I'm going to keep it around, so that my thoughts are clearly documented for my future quarrels with myself. Firstly: this added more overhead than I wanted it to. While it wasn't significant, it did add 100--150ms to one of our largest systems, up from ~2.8s, which seems a bit much for a token that's really just meant to make life easier for the parser. Further, it seems that all I've managed to do is push my original problem to a different layer---this started as a means to resolve having to emit both an object and an error simultaneously in the case where aggregate attribute parsing has completed, but we encounter an error on the next token (e.g. an unexpected element). But XIRF, if it's missing AttrEnd, should throw an error, but should also recover. Recovery is easy---just assume that it was present---_but then we don't emit a XIRF `AttrEnd` token_, which is necessary for downstream systems. So we'd need to either: (a) emit both a token and an error; or (b) panic. But if we're doing (a), then the need for `AttrEnd` goes away, because it solves the original problem (though the other concerns of the previous commit still stand). (b) is not ideal at all, even though the missing token does represent an internal system error; it's not something the user can correct. But, given that it's something that the user cannot correct, doesn't that imply that it's an awkward thing to include in the token stream? So back to `AttrEnd` being an awkward PITA to have. So, given (a), I'll just do that: errors will become more of a "hey, this error just occurred, but I'm trying to recover---here's an object that you should use if you choose to continue parsing, but it may or may not be what you're looking for; proceed with caution". 
That flips the original script: I imagined having external systems feed recovery tokens, but this encapsulates recovery within the parser, which really is more appropriate, though less flexible than having an omniscient external recovery system; such a monolith was always an awkward concept and would be difficult to implement cleanly. This can also potentially be implemented as a generalization of the Dead state change that allowed an object to be emitted alongside the lookahead/error. Anyway, back to where I was...I'm sure I'll look back on this in the future shaking my head, reflecting on how naive I was. DEV-7145	2022-06-29 11:25:44 -04:00
Mike Gerwitz	b973d36862	tamer: xir: Initial re-introduction of AttrEnd AttrEnd was initially removed in `0cc0bc9d5a` (and the commit prior), because there was not a compelling reason to use it over a lookahead operation (returning a token via a dead state transition); `AttrEnd` simply introduced inconsistencies between the XIR reader (which produced AttrEnd) and internal XIR stream generators (e.g. the lowering operations into XIR->XML, which do not). But now that parsers are performing aggregation---in particular the attribute parser-generator `xir::parse::attr`---this has become quite a pain, because the dead state is an actionable token. For example: 1. Open 2. Attr 3. Attr 4. Open 5. ... In the happy case, token #4 results in `Parsed::Incomplete`, and so can just be transformed into the object representing the aggregated attributes. But even in this happy path, it's ugly, and it requires non-tail recursion on the parser which requires a duplicate stack allocation for the `ParserState`. That violates a core principle of the system. But if there is an error at #4---e.g. an unexpected element---then we no longer have a `Parsed::Incomplete` to hijack for our own uses, and we'd have to introduce the ability to return both an error and a token, or we'd have to introduce the ability to keep a token of lookahead instead of reading from the underlying token stream, but that's complicated with push parsers, which are used for parser composition. Yikes. And furthermore, the aggregation has caused me to introduce the ability to override the dead state type to introduce both a token of lookahead and aggregation information. This complicates the system and is going to be confusing to others. 
Given all of this, AttrEnd does now seem appropriate to reintroduce, since it will allow processing of aggregate operations when encountering that token without having to worry about the above scenario; without having to duplicate a `ParseState` stack; without having to hijack dead state transitions for producing our aggregate object; and everything else mentioned above. This commit does not modify those abstractions to use AttrEnd yet; it re-introduces the token to the core system, not the parser-generators, and it doesn't yet replace lookahead operations in the parsers that use them. That'll come next. Unlike the commit that removed it, though, we are now generating proper spans, so make note of that here. This also does not introduce the concept to XIRF yet, which did not exist at the time that it was removed, so XIRF is filtering it out until a following commit. DEV-7145	2022-06-29 11:02:02 -04:00
Mike Gerwitz	9276d00456	tamer: Cargo.toml: Remove lazy_static This is no longer needed after the previous commit, with static spans having been replaced by `const` spans. This used to be required before Rust acquired better const features, and before I had preinterned symbols. DEV-7145	2022-06-24 14:18:04 -04:00
Mike Gerwitz	c671bf6a9c	tamer: xir: Introduce {Ele,Open,Close}Span This isn't conceptually all that significant of a change, but there was a lot to modify to get it working. I would generally separate this into a commit for the implementation and another commit for the integration, but I decided to keep things together. This serves a role similar to AttrSpan---this allows deriving a span representing the element name from a span representing the entire XIR token. This will provide more useful context for errors---including the tag delimiter(s) means that we care about the fact that an element is in that position (as opposed to some other type of node) within the context of an error. However, if we are expecting an element but take issue with the element name itself, we want to place emphasis on that instead. This also starts to consider the issue of span contexts---a blob of detached data that is `Span` is useful for error context, but it's not useful for manipulation or deriving additional information. For that, we need to encode additional context, and this is an attempt at that. I am interested in the concept of providing Spans that are guaranteed to actually make sense---that are instantiated and manipulated with APIs that ensure consistency. But such a thing buys us very little, practically speaking, over what I have now for TAMER, and so I don't expect to actually implement that for this project; I'll leave that for a personal project. TAMER's already taken a lot of my personal interests and it can cause me a lot of grief sometimes (with regards to letting my aspirations cause me more work). DEV-7145	2022-06-24 14:16:29 -04:00
Mike Gerwitz	873e5fc761	tamer: asg::ident: {prolog=>prologue} typo fix Somewhat humorous.	2022-06-23 09:19:12 -04:00
Mike Gerwitz	2fafc331a1	tamer: xir::reader: Opening and closing tag whitespace Non-attribute and non-empty start/end tags will have their whitespace as part of the produced span. This sets us up for a following change that will allow for deriving the name span from this span given a QName, which gives us a span that both represents the entire XIR token and allows deriving the element name. An accurate token span is necessary for parsing errors where an element was not expected, while an element name span is more appropriate for issues of grammar and semantic errors that deal not with the fact that an element was encountered, but _what_ element was encountered. DEV-7145	2022-06-22 15:10:49 -04:00
Mike Gerwitz	e5c8a218c3	tamer: xir::reader: Correct empty element whitespace handling This both adds clarifying tests and corrects the case of `<foo/>`, where the offset was erroneously off by one---it saw that there were no attributes and added a byte thinking it'd include `>`, as in `<foo>`. DEV-7145	2022-06-22 10:28:44 -04:00
Mike Gerwitz	adc45d90df	tamer: xir::parse: Attribute parser generator This is the first parser generator for the parsing framework. I've been waiting quite a while to do this because I wanted to be sure that I understood how I intended to write the attribute parsers manually. Now that I'm about to start parsing source XML files, it is necessary to have a parser generator. Typically one thinks of a parser generator as a separate program that generates code for some language, but that is not always the case---that represents a lack of expressiveness in the language itself (e.g. C). Here, I simply use Rust's macro system, which should be a concept familiar to someone coming from a language like Lisp. This also resolves where I stand on parser combinators with respect to this abstraction: they both accomplish the exact same thing (composition of smaller parsers), but this abstraction doesn't do so in the typical functional way. But the end result is the same. The parser generated by this abstraction will be optimized and inlined in the same manner as the hand-written parsers. Since they'll be tightly coupled with an element parser (which too will have a parser generator), I expect that most attribute parsers will simply be inlined; they exist as separate parsers conceptually, for the same reason that you'd use parser combinators. It's worth mentioning that this awkward reliance on dead state for a lookahead token to determine when aggregation is complete rubs me the wrong way, but resolving it would involve reintroducing the XIR AttrEnd that I had previously removed. I'll keep fighting with myself on this, but I want to get a bit further before I determine if it's worth the tradeoff of reintroducing (more complex IR but simplified parsing). DEV-7145	2022-06-21 13:23:02 -04:00
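
The idea in the last entry above, using Rust's macro system as a parser generator for attributes, can be illustrated with a minimal sketch. This is not TAMER's implementation or API: `Token`, `AttrError`, the `attr_struct!` macro, and the `ImportAttrs` fields are all hypothetical names invented for this example. It also assumes an explicit `AttrEnd`-style terminator token (the alternative that entry weighs) rather than lookahead via a dead state, since that keeps the generated parser short.

```rust
/// Simplified stand-in for a token stream (hypothetical; not TAMER's XIR).
#[derive(Debug, Clone, PartialEq)]
pub enum Token {
    /// Attribute name and value.
    Attr(&'static str, &'static str),
    /// Marks the end of an element's attribute list.
    AttrEnd,
    /// Opening tag of an element.
    Open(&'static str),
}

#[derive(Debug, PartialEq)]
pub enum AttrError {
    /// A token that the generated parser has no arm for.
    Unexpected(Token),
    /// A declared attribute was never seen before `AttrEnd`.
    Missing(&'static str),
    /// Input ended before `AttrEnd`.
    Eof,
}

/// Hypothetical parser generator: expands to a struct plus a `parse`
/// function that aggregates the named attributes into that struct.
macro_rules! attr_struct {
    ($name:ident { $($field:ident : $attr:literal),+ $(,)? }) => {
        #[derive(Debug, PartialEq)]
        pub struct $name { $(pub $field: &'static str),+ }

        impl $name {
            pub fn parse<I>(toks: &mut I) -> Result<Self, AttrError>
            where
                I: Iterator<Item = Token>,
            {
                // One optional slot per declared attribute.
                $(let mut $field: Option<&'static str> = None;)+

                loop {
                    match toks.next() {
                        // One generated arm per declared attribute;
                        // a repeated attribute overwrites the earlier value.
                        $(Some(Token::Attr($attr, v)) => $field = Some(v),)+
                        // Explicit terminator: aggregation is complete.
                        Some(Token::AttrEnd) => break,
                        // Anything else, including undeclared attributes.
                        Some(other) => return Err(AttrError::Unexpected(other)),
                        None => return Err(AttrError::Eof),
                    }
                }

                Ok($name {
                    $($field: $field.ok_or(AttrError::Missing($attr))?),+
                })
            }
        }
    };
}

// Example expansion for a hypothetical pair of attributes.
attr_struct! {
    ImportAttrs { package: "package", export: "export" }
}
```

Feeding `ImportAttrs::parse` the tokens `Attr("package", ..)`, `Attr("export", ..)`, `AttrEnd` yields a populated `ImportAttrs`; leaving out `export` yields `AttrError::Missing("export")`. Because the macro expands to an ordinary `match` in an ordinary function, the compiler can optimize and inline it as it would hand-written code.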


1487 Commits (b8a7a78f43b9ad91044663b4053ab2c7d08e8195)
