136 Commits (fc569f7551c949c822549f92df07279d54fdcea1)

Author SHA1 Message Date
Mike Gerwitz 2e3d94c3d6 tamer: obj::xmlo::reader: Simplify wip-xmlo-xir-reader flagging
This removes the flag from most of the code, which also resolves the
indentation.  Not only was it bothering me, but I don't want (a) every line
modified when the module body is hoisted and (b) `rustfmt` to reformat
everything when that happens.

This means that everything will be built, even though it's not used, when
the flag is off, but I see that as a good thing.

DEV-10863
2022-03-24 09:45:59 -04:00
Mike Gerwitz fbf786086a tamer: parse::Parser (lower_while_ok): New method
This introduces a WIP lowering operation, abstracting away quite a bit of
the manual wiring work, which is really important to providing an API that
provides the proper level of abstraction for actually understanding what the
system is doing.

This does not yet have tests associated with it---I had started, but it's a
lot of work and boilerplate for something that is going to
evolve.  Generally, I wouldn't use that as an excuse, but the robust type
definitions in play, combined with the tiny amount of actual logic, provide
a pretty high level of confidence.  It's very difficult to wire these types
together and produce something incorrect without doing something obviously
bad.

Similarly, I'm holding off on proper docs too, though I did write some
information here.

More to come, after I actually get to work on the XmloReader.

On a side note: I'm happy to have made progress on this, since this wiring
is something I've been dreading and wondering about since before the Parser
abstraction even existed.

Note also that this makes parser::feed_toks private again---I don't intend
to support push parsers yet, since they're only needed internally.  Maybe
for error recovery, but I'll wait to decide until it's actually needed.

DEV-10863
2022-03-23 14:31:16 -04:00
Mike Gerwitz b4a7591357 tamer: obj::xmlo::reader: Begin conversion to ParseState
This begins to transition XmloReader into a ParseState.  Unlike previous
changes where ParseStates were composed into a single ParseState, this is
instead a lowering operation that will take the output of one Parser and
provide it to another.

The mess in ld::poc (...which still needs to be refactored and removed)
shows the concept, which will be abstracted away.  This won't actually get
to the ASG in order to test that this works with the
wip-xmlo-xir-reader flag on (development hasn't gotten that far yet), but
since it type-checks, it should conceptually work.

Wiring lowering operations together is something that I've been dreading for
months, but my approach of only abstracting after-the-fact has helped to
guide a sane approach for this.  For some definition of "sane".

It's also worth noting that AsgBuilder will too become a ParseState
implemented as another lowering operation, so:

  XIR -> XIRF -> XMLO -> ASG

These steps will all be streaming, with iteration happening only at the
topmost level.  For this reason, it's important that ASG not be responsible
for doing that pull, and further we should propagate Parsed::Incomplete
rather than filtering it out and looping an indeterminate number of times
outside of the toplevel.
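
As a rough illustration of the streaming idea above, the following is a minimal sketch and not the TAMER API: `Parsed` is modeled on the name used in this message, while `PairUp` and `lower` are hypothetical stand-ins for a lowering stage.  The point it tries to show is that `Incomplete` passes through a stage untouched, so only the topmost caller drives iteration.

```rust
/// Result of feeding one token to a parser (name modeled on the text above).
#[derive(Debug, PartialEq)]
pub enum Parsed<T> {
    /// More input is needed before an object can be emitted.
    Incomplete,
    /// A complete object for the next stage.
    Object(T),
}

/// Hypothetical lowering stage: pairs up tokens from the previous stage.
#[derive(Default)]
pub struct PairUp {
    held: Option<u32>,
}

impl PairUp {
    fn feed(&mut self, tok: u32) -> Parsed<(u32, u32)> {
        match self.held.take() {
            // First half of a pair: remember it and ask for more input.
            None => {
                self.held = Some(tok);
                Parsed::Incomplete
            }
            // Second half: emit the completed pair.
            Some(first) => Parsed::Object((first, tok)),
        }
    }
}

/// Lower a stream of `Parsed<u32>` into a stream of `Parsed<(u32, u32)>`.
/// Upstream `Incomplete` is propagated rather than filtered out, so this
/// stage never loops on its own; the caller pulls one item at a time.
pub fn lower<I>(upstream: I) -> impl Iterator<Item = Parsed<(u32, u32)>>
where
    I: Iterator<Item = Parsed<u32>>,
{
    let mut state = PairUp::default();
    upstream.map(move |parsed| match parsed {
        Parsed::Incomplete => Parsed::Incomplete,
        Parsed::Object(tok) => state.feed(tok),
    })
}
```

Because each stage is an iterator adapter, several of them can be stacked and still yield exactly one output per pull at the top level.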

One final note: the choice of 64 for the maximum depth is entirely
arbitrary and should be more than generous; it'll be finalized at some point
in the future once I actually evaluate what maximum depth is reasonable
based on how the system is used, with some added growing room.

DEV-10863
2022-03-22 14:06:52 -04:00
Mike Gerwitz 4c5b860195 tamer: Remove Ix generic from ASG
This is simply not worth it; the size is not going to be the bottleneck (at
least any time soon) and the generic not only pollutes all the things that
will use ASG in the near future, but is also incompatible with the SymbolId
default that is used everywhere; if we have to force it to 32 bits anyway,
then we may as well just default it right off the bat.

I thought that this seemed like a good idea at the time, and saving bits is
certainly tempting, but it was premature.
2022-01-14 10:21:49 -05:00
Mike Gerwitz 61f7a12975 tamer: xir::tree: Integrate AttrParserState into Stack
Note that AttrParse{r=>}State needs renaming, and Stack will get a better
name down the line too.  This commit message is accurate, but confusing.

This performs the long-awaited task of trying to observe, concretely, how to
combine two automata.  This has the effect of stitching together the state
machines, such that the union of the two is equivalent to the original
monolith.

The next step will be to abstract this away.

There are some important things to note here.  First, this introduces a new
"dead" state concept, where here a dead state is defined as an _accepting_
state that has no state transitions for the given input token.  This is more
strict than a dead state as defined in, for example, the Dragon Book, where
backtracking may occur.

The reason I chose for a Dead state to be accepting is simple: it represents
a lookahead situation.  It says, "I don't know what this token is, but I've
done my job, so it may be useful in a parent context".  The "I've done my
job" part is only applicable in an accepting state.

If the parser is _not_ in an accepting state, then an unknown token is
simply an error; we should _not_ try to backtrack or anything of the sort,
because we want only a single token of lookahead.
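
To make the distinction concrete, here is a small sketch under assumed names (`Token`, `AttrState`, `Step`, and `step` are all hypothetical and much simpler than the real XIR types).  An unknown token in the accepting state is handed back as lookahead; the same token in a non-accepting state is an error.

```rust
/// Tokens for a toy attribute stream (hypothetical, not TAMER's XIR tokens).
#[derive(Debug, Clone, PartialEq)]
pub enum Token {
    AttrName(&'static str),
    AttrValue(&'static str),
    Close,
}

#[derive(Debug, PartialEq)]
pub enum AttrState {
    /// Accepting: no attribute is partially parsed.
    Empty,
    /// Not accepting: a name was read and a value must follow.
    Name(&'static str),
}

#[derive(Debug, PartialEq)]
pub enum Step {
    /// Token consumed; optionally emits a completed (name, value) pair.
    Next(AttrState, Option<(&'static str, &'static str)>),
    /// Accepting state with no transition: return the token as lookahead.
    Dead(Token),
    /// Non-accepting state with no transition: a plain error, no backtracking.
    Error(Token),
}

pub fn step(state: AttrState, tok: Token) -> Step {
    match (state, tok) {
        (AttrState::Empty, Token::AttrName(n)) => Step::Next(AttrState::Name(n), None),
        (AttrState::Name(n), Token::AttrValue(v)) => {
            Step::Next(AttrState::Empty, Some((n, v)))
        }
        // "I've done my job": let a parent parser decide what to do.
        (AttrState::Empty, tok) => Step::Dead(tok),
        // Mid-attribute, so the unexpected token cannot be deferred.
        (AttrState::Name(_), tok) => Step::Error(tok),
    }
}
```

A parent parser receiving `Step::Dead(tok)` can feed `tok` to its own transition function, which is the single token of lookahead described above.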

The reason this was done is because it's otherwise difficult to compose the
two parsers without requiring that AttrEnd exist in every XIR stream; this
has always been an awkward delimiter that was introduced to make the parser
LL(0), but I tried to compromise by saying that it was optional.  Of course,
I knew that decision caused awkward inconsistencies, I had just hoped that
those inconsistencies wouldn't manifest in practical issues.

Well, now it did, and the benefits of AttrEnd that we had in the previous
construction do not exist in this one.  Consequently, it makes more sense to
simply go from LL(0) to LL(1), which makes AttrEnd unnecessary, and a future
commit will remove it entirely.

All of this information will be documented, but I want to get further in
the implementation first to make sure I don't change course again and
therefore waste my time on docs.

DEV-11268
2021-12-16 09:44:02 -05:00
Mike Gerwitz 77c18d0615 tamer: xir: Remove Attr::Extensible
This removes XIRT support for attribute fragments.  The reason is that this
is a write-only operation---fragments are used to concatenate SymbolIds
without reallocation, which can only happen if we are generating XIR
internally.

Given that this cannot happen during read, it was a mistake to complicate
the parsers.  But it makes sense why I did originally, given that the XIRT
parser was written for simplifying test cases.  But now that we want parsers
for real, and are writing production-quality parsers, this extra complexity
is very undesirable.

As a bonus, we also avoid any potential for heap allocations related to
attributes.  Granted, they didn't _really_ exist to begin with, but it was
part of XIRT, and was ugly.

DEV-11268
2021-12-06 14:26:58 -05:00
Mike Gerwitz f519dab2b6 tamer: xir::tree::attr::Attr::value_atom: Option<SymbolId>=>SymbolId
To maintain a proper abstraction, this cannot be the responsibility of the
caller; most callers should not know that fragments exist, let alone how to
handle them.
2021-11-16 12:41:03 -05:00
Mike Gerwitz 5233822322 tamer: xir: Remove Text enum
Like previous commits, this replaces the explicit escaping context with the
convention that all values retrieved from `xir` are unescaped on read and
escaped on write.

Comments are a notable TODO, since we must escape only `--`.

CData is also an issue.  I had _expected_ to use it as a means to avoid
unescaping fragments, but I had forgotten that quick_xml hard-codes escaping
on read, so that it can re-use BytesStart!  That is terribly unfortunate,
and may result in us having to re-implement our own read method in the
future to avoid this nonsense.  So I'm just leaving it as a TODO for now.

DEV-11081
2021-11-15 23:47:14 -05:00
Mike Gerwitz d710437ee4 tamer: xir::escape::CachingEscaper: New Escaper
As promised, this will cache previously seen escaped/unescaped values by
creating a two-way mapping between them.
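
A minimal sketch of what a two-way cache could look like follows.  It is an assumption-laden illustration: it keys on owned strings rather than `SymbolId`s, handles only `&`, `<`, and `>`, and the `misses` counter exists only to make cache hits observable; none of this is taken from the actual `CachingEscaper`.

```rust
use std::collections::HashMap;

/// Hypothetical caching escaper keyed on owned strings for brevity.
#[derive(Default)]
pub struct CachingEscaper {
    to_escaped: HashMap<String, String>,
    to_unescaped: HashMap<String, String>,
    /// Counts real escape/unescape work performed.
    pub misses: usize,
}

impl CachingEscaper {
    pub fn escape(&mut self, raw: &str) -> String {
        if let Some(hit) = self.to_escaped.get(raw) {
            return hit.clone();
        }
        self.misses += 1;
        // `&` is replaced first so the entities added next are not re-escaped.
        let esc = raw
            .replace('&', "&amp;")
            .replace('<', "&lt;")
            .replace('>', "&gt;");
        self.record(raw, &esc);
        esc
    }

    pub fn unescape(&mut self, esc: &str) -> String {
        if let Some(hit) = self.to_unescaped.get(esc) {
            return hit.clone();
        }
        self.misses += 1;
        // `&amp;` is replaced last so that `&amp;lt;` yields `&lt;`, not `<`.
        let raw = esc
            .replace("&lt;", "<")
            .replace("&gt;", ">")
            .replace("&amp;", "&");
        self.record(&raw, esc);
        raw
    }

    /// Record both directions so a read primes the cache for a later write.
    fn record(&mut self, raw: &str, esc: &str) {
        self.to_escaped.insert(raw.to_string(), esc.to_string());
        self.to_unescaped.insert(esc.to_string(), raw.to_string());
    }
}
```

Sharing one such value between a reader and a writer means that a string unescaped on read costs nothing to escape again on write.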

DEV-11081
2021-11-15 16:44:24 -05:00
Mike Gerwitz 27ba03b59b tamer: xir::escape: Remove XirString in favor of Escaper
This rewrites a good portion of the previous commit.

Rather than explicitly storing whether a given string has been escaped, we
can instead assume that all SymbolIds leaving or entering XIR are unescaped,
because there is no reason for any other part of the system to deal with
such details of XML documents.

Given that, we need only unescape on read and escape on write.  This is
customary, so why didn't I do that to begin with?

The previous commit outlines the reason, mainly being an optimization for
the echo writer that is upcoming.  However, this solution will end up being
better---it's not implemented yet, but we can have a caching layer, such
that the Escaper records a mapping between escaped and unescaped SymbolIds
to avoid work the next time around.  If we share the Escaper between _all_
readers and the writer, the result is that

  1. Duplicate strings between source files and object files (many of which
     are read by both the linker and compiler) avoid re-unescaping; and
  2. Writers can use this cache to avoid re-escaping when we've already seen
     the escaped variant of the string during read.

The alternative would be a global cache, like the internment system, but I
did not find that to be appropriate here, since this is far less
fundamental and is much easier to compose.

DEV-11081
2021-11-12 14:03:23 -05:00
Mike Gerwitz b1c0783c75 tamer: xir::XirString: WIP implementation (likely going away)
I'm not fond of this implementation, which is why it's not fully
completed.  I wanted to commit this for future reference, and take the
opportunity to explain why I don't like it.

First: this task started as an idea to implement a third variant to
AttrValue and friends that indicates that a value is fixed, in the sense of
a fixed-point function: escaped or unescaped, its value is the same.  This
would allow us to skip wasteful escape/unescape operations.

In doing so, it became obvious that there's no need to leak this information
through the API, and indeed, no part of the system should care.  When we
read XML, it should be unescaped, and when we write, it should be
escaped.  The reason that this didn't quite happen to begin with was an
optimization: I'll be creating an echo writer in place of the current
filesystem-based copy in tamec shortly, and this would allow streaming XIR
directly from the reader to the writer without any unescaping or
re-escaping.

When we unescape, we know the value that it came from, so we could simply
store both symbols---they're 32-bit, so it results in a nicely compressed
64-bit value, so it's essentially cost-free, as long as we accept the
expense of internment.  This is `XirString`.  Then, when we want to escape
or unescape, we first check to see whether a symbol already exists and, if
so, use it.

While this works well for echoing streams, it won't work all that well in
practice: the unescaped SymbolId will be taken and the XirString discarded,
since nothing after XIR should be coupled with it.  Then, when we later
construct a XIR stream for writing, XirString will no longer be available
and our previously known escape is lost, so the writer will have to
re-escape.

Further, if we look at XirString's generic for the XirStringEscaper---it
uses phantom, which hints that maybe it's not in the best place.  Indeed,
I've already acknowledged that only a reader unescapes and only a writer
escapes, and that the rest of the system works with normal (unescaped)
values, so only readers and writers should be part of this process.  I also
already acknowledged that XirString would be lost and only the unescaped
SymbolId would be used.

So what's the point of XirString, then, if it won't be a useful optimization
beyond the temporary echo writer?

Instead, we can take the XirStringWriter and implement two caches on that:
mapping SymbolId from escaped->unescaped and vice-versa.  These can be
simple vectors: since SymbolId is a 32-bit value, we will not have much
wasted space for symbols that never get read or written.  We could even
optimize for preinterned symbols using markers, though I'll probably not do
so, and I'll explain why later.

If we do _that_, we get even _better_ optimizations through caching that
_will_ apply in the general case (so, not just for echo), and we're able to
ditch XirString entirely and simply use a SymbolId.  This makes for a much
more friendly API that isn't leaking implementation details, though it
_does_ put an onus on the caller to pass the encoder to both the reader and
the writer, _if_ it wants to take advantage of a cache.  But that burden is
not significant (and is, again, optional if we don't want it).

So, that'll be the next step.
2021-11-10 12:22:10 -05:00
Mike Gerwitz 428d508be4 tamer: {ir::=>}{asg, xir}
See the previous commit.  There is no sense in some common "IR" namespace,
since those IRs should live close to whatever system whose data they
represent.

In the case of these, they are general IRs that can apply to many different
parts of the system.  If that proves to be a false statement, they'll be
moved.

DEV-10863
2021-11-04 16:13:27 -04:00
Mike Gerwitz cee6402f8b tamer: Move {ir::legacyir=>obj::xmlo::legacyir}
The IRs really ought to live where they are owned, especially given that
"IR" is so generic that it makes no sense for there to be a single location
for them; they're just data structures coupled with different phases of
compilation.

This will be renamed next commit; see that for details.

This also removes some documentation describing the lowering process,
because it's undergone a number of changes and needs to be accurately
re-summarized in another location.  That will come at a later time after the
work is further along so that I don't have to keep spending the time
rewriting it.

DEV-10863
2021-11-04 13:20:38 -04:00
Mike Gerwitz d045786cfb tamer: ir::xir::tree::Element::attrs: Wrap in Option
This allows AttrList not only to be lazily initialized (which is less of a
problem at the moment with Vec, but may become one in the future), but also
leaves a space open for attributes to be added _after_ having been
parsed.  It further leaves room to _take_ attributes from their `Element`.

This is important because the next commit will re-introduce the ability to
parse attributes independently, allowing us to put the parser in a state
where we can parse AttrList without an Element context.  To re-use that
parsing under an Element context, we can simply attach an AttrList after it
has been parsed.

Option adds no additional size cost to Vec, so we get this for free (except
for the tiny change that initializes the attribute list when we try to push
to it).

I also think this reads better ("attrs: None").  Though it makes the API
slightly more of a pain to work with.
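
The following sketch shows the three properties mentioned above: lazy initialization, taking attributes out of an element, and the size claim.  `Attr` here is a stand-in tuple and `Element` is reduced to a single field; both are assumptions for illustration.  The size equality holds on current rustc but is a layout optimization rather than a documented guarantee.

```rust
use std::mem::size_of;

/// Stand-in for an attribute; the real `Attr` type is richer.
pub type Attr = (&'static str, &'static str);

#[derive(Default)]
pub struct Element {
    /// `None` until the first attribute is pushed or a list is attached.
    pub attrs: Option<Vec<Attr>>,
}

impl Element {
    /// Lazily initialize the list on first push.
    pub fn push_attr(&mut self, attr: Attr) {
        self.attrs.get_or_insert_with(Vec::new).push(attr);
    }

    /// Move the attributes out, leaving `None` behind.
    pub fn take_attrs(&mut self) -> Option<Vec<Attr>> {
        self.attrs.take()
    }
}

/// Compares the two sizes; `Option` can reuse `Vec`'s non-null pointer
/// as the `None` representation, so no extra tag is needed.
pub fn option_vec_is_free() -> bool {
    size_of::<Option<Vec<Attr>>>() == size_of::<Vec<Attr>>()
}
```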

DEV-10863
2021-10-29 16:34:05 -04:00
Mike Gerwitz 18ab032ba0 tamer: Begin XIR-based xmlo reader impl
There isn't a whole lot here, but there is additional work needed in various
places to support upcoming changes and so I want to get this committed to
ease the cognitive burden of what I have thus far.  And to stop stashing.  We
have a feature flag for a reason.

DEV-10863
2021-10-28 21:21:30 -04:00
Mike Gerwitz 581b9d4e65 tamer: Use `..` for tuple unimportant variant matches
Tbh, I was unaware that this was supported by tuple variants until reading
over the Rustc source code for something.  (Which I had previously read, but
I must have missed it.)

This is more proper, in the sense that in a lot of cases we not only don't
care about how many values a tuple has, but if we explicitly match on them using
`_`, then any time we modify the number of values, it would _break_ any code
doing so.  Using this method, we improve maintainability by not causing
breakages under those circumstances.

But, consequently, it's important that we use this only when we _really_
don't care and don't want to be notified by the compiler.

I did not use `..` as a prefix, even where supported, because the intent is
to append additional information to tuples.  Consequently, I also used `..`
in places where no additional fields currently exist, since they may in the
future (e.g. introducing `Span` for `IdentObject`).
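
For reference, a small sketch of the pattern being described.  The `IdentObject` below is a simplified, hypothetical two-variant enum and not the real type.

```rust
/// Simplified stand-in for the real `IdentObject`.
#[derive(Debug)]
pub enum IdentObject {
    Missing(&'static str),
    Ident(&'static str, u32),
}

pub fn name(obj: &IdentObject) -> &'static str {
    match obj {
        // `..` matches zero or more remaining fields, so appending a field
        // (for example a span) to either variant would not break this match.
        // Writing `Ident(name, _)` instead would fail to compile after such
        // a change.
        IdentObject::Missing(name, ..) | IdentObject::Ident(name, ..) => *name,
    }
}
```

Note that `Missing(name, ..)` is accepted even though no further fields exist today, which is what allows the pattern to be used pre-emptively.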
2021-10-15 12:28:59 -04:00
Mike Gerwitz 739cf7e6eb tamer: ir::asg::object::IdentObject: Define methods from IdentObjectData
In particular, `name` needn't return an `Option`.  `fragment` also returns a
copy, since it's just a `SymbolId`.  (It really ought to be a newtype rather
than an alias, but we'll worry about that some other time.)

These changes allow us to remove some runtime panics.

DEV-10859
2021-10-14 14:38:02 -04:00
Mike Gerwitz f055cb77c2 tamer: ld::xmle: Narrow Sections types
This moves the logic that sorts identifiers into sections into Sections
itself, and introduces XmleSections to allow for mocking for testing.

This then allows us to narrow the types significantly, eliminating some
runtime checks.  The types can be narrowed further, but I'll be limiting the
work I'll be doing now; this'll be inevitably addressed as we use the ASG
for the compiler.

This also handles moving Sections tests, which was a TODO from the previous
commit.

DEV-10859
2021-10-14 12:40:13 -04:00
Mike Gerwitz ea11cf1416 tamer: ld::xmle::lower: Extract sectioning into Sections
This is the appropriate place to be, now that we've begun narrowing the
types.  We'll be able to do so further; this is just the first step.

This does not yet move the tests, but the code is still tested because it's
tightly coupled with `sort`.  Those will move in the next commit(s).

DEV-10859
2021-10-12 12:15:11 -04:00
Mike Gerwitz 08d92ca663 tamer: ld::xmle::sections: Remove generic object type
xmle sections will only ever contain an object of one type, so there is no
use in making this generic.

I think the original plan was to have this represent, generically, sections
of some object file (like ELF), but doing so would require a significant
redesign anyway, so it makes no sense.  This is easier to reason about.

DEV-10859
2021-10-12 10:35:14 -04:00
Mike Gerwitz df328da71f tamer: ir::asg::SortableAsg: Move into ld::xmle::lower
This has always been a lowering operation, but it was not phrased in terms
of it, which made the process a bit more confusing to understand.

The implementation hasn't changed, but this is an incremental refactoring
and so exposes BaseAsg and its `graph` field temporarily.

DEV-10859
2021-10-12 09:49:33 -04:00
Mike Gerwitz 81ec65742a tamer: {ir::asg=>ld::xmle}::section
Sections, as written, are specific to xmle files.

I think the intent originally was to have this be more generic, but that
doesn't really make sense.

By explicitly coupling it with `xmle` files, that will allow us to turn this
into a proper lowering operation with its own validations that will allow
`xmle::xir` to do its job without having to validate anything itself.
2021-10-12 00:05:44 -04:00
Mike Gerwitz 1c181b568d tamer: ld::poc: Update comment reflecting current state
The linker is feature-complete, but this file has lived on because the
project was on pause for quite some time.
2021-10-11 23:54:24 -04:00
Mike Gerwitz f899ac898e tamer: {obj=>ld}::xmle
This is a linker-specific module.
2021-10-11 23:52:59 -04:00
Mike Gerwitz 5ea5cffd09 tamer: relroot String->SymbolId
This was [one of] the last remaining Strings; SymbolId should be used across
the board.
2021-10-11 16:00:19 -04:00
Mike Gerwitz 85909f1590 tamer: sym::SymbolStr: Remove
This removes `SymbolStr` in favor of, simply, `&'static str`.

The abstraction provided no additional safety since the slice was trivially
extracted (and commonly, in practice), and was inconvenient to work with.

This is part of a process of relaxing lookups so that symbols can be
conveniently displayed in errors; rather than trying to prevent the
developer from doing something bad, we'll just rely on conventions, hope
that it doesn't happen, and if it does, address it either at that time or
when it shows up in the profiler.
2021-10-11 12:58:48 -04:00
Mike Gerwitz 3e385d1a1b tamer: obj::xmle::xir: Finalize docs
This could be improved upon, but there will be more work coming up for this
to finalize Sections.

DEV-10561
2021-10-11 11:43:49 -04:00
Mike Gerwitz f70f5653b2 tamer: ir::asg::section: Head and tail can have only one object
This is the beginning of a refactoring to simplify this implementation a
little bit.
2021-10-09 00:27:03 -04:00
Mike Gerwitz 0626629cb3 tamer: Remove old xmle writer and wip-xir-xmle-writer flag
The new writer has reached parity with the old, with the exception of some
edge case explicit error handling that should never occur (which will be
added), and cleanup/docs.

Removing this flag now allows me to perform that cleanup without having to
worry about updating the now-old implementation.

I ran `tameld` with the new writer against our production system with
numerous programs and a significant number of test cases, and diff'd the old
and new xmle files, and everything looks good.
2021-10-08 22:04:42 -04:00
Austin Schaffer d54ef62a0d Fix import ordering 2021-10-04 17:15:02 -04:00
Mike Gerwitz 1a44e04333 tamer: ld: Write is unused outside of flag 2021-10-04 16:34:25 -04:00
Mike Gerwitz 5250571f15 tamer: ir::asg::ident: Use symbols in place of string slice mapping
`IdentKind` needs to be written to `xmle` files and displayed in error
messages.  String slices were used when quick-xml was used for writing,
which will be going away with the new writer.
2021-09-29 23:18:23 -04:00
Mike Gerwitz 6864fbc1cd tamer: Start of XIR-based xmle writer
This has been a long time coming, and has been repeatedly stashed as other
parts of the system have evolved to support it.  The introduction of the XIR
tree was to write tests for this (which are sloppy atm).

This currently writes out the `xmle` header and _most_ of the `l:dep`
section; it's missing the object-type-specific attributes.  There is,
relatively speaking, not much more work to do here.

The feature flag `wip-xir-xmle-writer` was introduced to toggle this system
in place of `XmleWriter`.  Initial benchmarks show that it will be
competitive with the quick-xml-based writer, but remember that is not the
goal: the purpose of this is to test XIR in a production system before we
continue to implement it for a frontend, and to refactor so that we do not
have multiple implementations writing XML files (once we echo the source XML
files).

I'm excited to get this done with so that I can move on.  This has been
rather exhausting.
2021-09-28 14:52:53 -04:00
Mike Gerwitz e91aeef478 tamer: Remove Ix generalization throughout system
This had the writing on the wall all the same as the `'i` interner lifetime
that came before it.  It was too much of a maintenance burden trying to
accommodate both 16-bit and 32-bit symbols generically.

There is a situation where we do still want 16-bit symbols---the
`Span`.  Therefore, I have left generic support for symbol sizes, as well as
the different global interners, but `SymbolId` now defaults to 32-bit, as
does `Asg`.  Further, the size parameter has been removed from the rest of
the code, with the exception of `Span`.

This cleans things up quite a bit, and is much nicer to work with.  If we
want 16-bit symbols in the future for packing to increase CPU cache
performance, we can handle that situation then in that specific case; it's a
premature optimization that's not at all worth the effort here.
2021-09-23 14:52:54 -04:00
Mike Gerwitz 0a8fb71c1b tamer: tameld: Use buffered writes
This was an oversight.  The difference is significant.  I had my suspicions
about this when I noticed the huge difference in time between writing to
/dev/null vs. an actual file during profiling.

On one of our systems, here's the number of syscalls _before_ this change:

  $ strace -c target/release/tameld --emit xmle -o foo foo.xmlo
  % time     seconds  usecs/call     calls    errors syscall
  ------ ----------- ----------- --------- --------- ----------------
   85.05    4.966192          16    318473           write
    7.23    0.421977          13     32298           lstat
    6.53    0.381424          15     25113           read
    0.75    0.043691          13      3350           readlink
    0.25    0.014713          61       241           close
    0.12    0.007167          30       241           openat
    0.05    0.003175         151        21           munmap
    0.01    0.000488          14        35           brk
    0.01    0.000292           9        33           mmap
    0.00    0.000266          38         7           mremap
    0.00    0.000004           1         3           sigaltstack
    0.00    0.000000           0         6           fstat
    0.00    0.000000           0         1           poll
    0.00    0.000000           0        11           mprotect
    0.00    0.000000           0         7           rt_sigaction
    0.00    0.000000           0         1           rt_sigprocmask
    0.00    0.000000           0         6         6 access
    0.00    0.000000           0         1           execve
    0.00    0.000000           0         1           arch_prctl
    0.00    0.000000           0         1           sched_getaffinity
    0.00    0.000000           0         1           set_tid_address
    0.00    0.000000           0         1           set_robust_list
    0.00    0.000000           0         2           prlimit64
  ------ ----------- ----------- --------- --------- ----------------
  100.00    5.839389                379854         6 total

And _after_:

  $ strace -c target/release/tameld --emit xmle -o foo foo.xmlo
  % time     seconds  usecs/call     calls    errors syscall
  ------ ----------- ----------- --------- --------- ----------------
   45.21    0.435010          13     32298           lstat
   40.09    0.385752          15     25113           read
    6.14    0.059113          21      2809           write
    4.75    0.045687          14      3350           readlink
    2.51    0.024115         100       241           close
    0.84    0.008045          33       241           openat
    0.26    0.002468         118        21           munmap
    0.06    0.000580          17        35           brk
    0.06    0.000566          17        33           mmap
    0.03    0.000279          40         7           mremap
    0.02    0.000181          16        11           mprotect
    0.01    0.000087          15         6         6 access
    0.01    0.000082          12         7           rt_sigaction
    0.01    0.000075          13         6           fstat
    0.00    0.000027           9         3           sigaltstack
    0.00    0.000024          12         2           prlimit64
    0.00    0.000018          18         1           execve
    0.00    0.000016          16         1           poll
    0.00    0.000013          13         1           sched_getaffinity
    0.00    0.000012          12         1           rt_sigprocmask
    0.00    0.000012          12         1           arch_prctl
    0.00    0.000012          12         1           set_robust_list
    0.00    0.000011          11         1           set_tid_address
  ------ ----------- ----------- --------- --------- ----------------
  100.00    0.962185                 64190         6 total

What a difference!

There's still a lot of other red flags in there; those can be addressed
separately.

This was originally written as I was learning Rust, and I suspect that I
didn't realize that File wasn't buffered at the time.

For the above link: times go from 1.23s pre-change to 0.85s after:

  0.77user 0.44system 0:01.23elapsed 99%CPU (0avgtext+0avgdata 48520maxresident)k
  0inputs+43952outputs (0major+12825minor)pagefaults 0swaps

  0.69user 0.15system 0:00.85elapsed 98%CPU (0avgtext+0avgdata 48396maxresident)k
  0inputs+43952outputs (0major+12823minor)pagefaults 0swaps
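
The effect can be reproduced in miniature with the sketch below.  `write_records` and `CountingSink` are hypothetical helpers, not code from `tameld`; the sink counts calls to `write` in the way that `strace` counts the syscall for an unbuffered `File`.

```rust
use std::io::{self, BufWriter, Write};

/// Test sink that records how many times `write` is called on it.
#[derive(Default)]
pub struct CountingSink {
    pub calls: usize,
    pub bytes: usize,
}

impl Write for CountingSink {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.calls += 1;
        self.bytes += buf.len();
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

/// Write `n` small records through a `BufWriter`.  In a linker, `W` would
/// typically be a `std::fs::File`; it is generic here so it can be measured.
pub fn write_records<W: Write>(sink: W, n: usize) -> io::Result<W> {
    let mut out = BufWriter::new(sink);
    for i in 0..n {
        // Each small write lands in the in-memory buffer...
        writeln!(out, "<sym id=\"{}\" />", i)?;
    }
    // ...and reaches the sink only when the buffer fills or on flush.
    out.flush()?;
    out.into_inner().map_err(|e| e.into_error())
}
```

Passing the sink directly to `writeln!` instead of wrapping it would produce at least one `write` call per record, which is the pattern visible in the first table above.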
2021-08-20 12:14:42 -04:00
Mike Gerwitz 9deb393bfd tamer: Global interners
This is a major change, and I apologize for it all being in one commit.  I
had wanted to break it up, but doing so would have required a significant
amount of temporary work that was not worth doing while I'm the only one
working on this project at the moment.

This accomplishes a number of important things, now that I'm preparing to
write the first compiler frontend for TAMER:

  1. `Symbol` has been removed; `SymbolId` is used in its place.
  2. Consequently, symbols use 16 or 32 bits, rather than a 64-bit pointer.
  3. Using symbols no longer requires dereferencing.
  4. **Lifetimes no longer pollute the entire system! (`'i`)**
  5. Two global interners are offered to produce `SymbolStr` with `'static`
     lifetimes, simplifying lifetime management and borrowing where strings
     are still needed.
  6. A nice API is provided for interning and lookups (e.g. "foo".intern())
     which makes this look like a core feature of Rust.

Unfortunately, making this change required modifications to...virtually
everything.  And that serves to emphasize why this change was needed:
_everything_ used symbols, and so there's no use in not providing globals.

I implemented this in a way that still provides for loose coupling through
Rust's trait system.  Indeed, Rustc offers a global interner, and I decided
not to go that route initially because it wasn't clear to me that such a
thing was desirable.  It didn't become apparent to me, in fact, until the
recent commit where I introduced `SymbolIndexSize` and saw how many things
had to be touched; the linker evolved so rapidly as I was trying to learn
Rust that I lost track of how bad it got.

Further, this shows how the design of the internment system was a bit
naive---I assumed certain requirements that never panned out.  In
particular, everything using symbols stored `&'i Symbol<'i>`---that is, a
reference (usize) to an object containing an index (32-bit) and a string
slice (128-bit).  So it was a reference to a pretty large value, which was
allocated in the arena alongside the interned string itself.

But, that was assuming that something would need both the symbol index _and_
a readily available string.  That's not the case.  In fact, it's pretty
clear that interning happens at the beginning of execution, that `SymbolId`
is all that's needed during processing (unless an error occurs; more on that
below); and it's not until _the very end_ that we need to retrieve interned
strings from the pool to write either to a file or to display to the
user.  It was horribly wasteful!

So `SymbolId` solves the lifetime issue in itself for most systems, but it
still requires that an interner be available for anything that needs to
create or resolve symbols, which, as it turns out, is still a lot of
things.  Therefore, I decided to implement them as thread-local static
variables, which is very similar to what Rustc does itself (Rustc's are
scoped).  TAMER does not use threads, so the resulting `'static` lifetime
should be just fine for now.  Eventually I'd like to implement `!Send` and
`!Sync`, though, to prevent references from escaping the thread (as noted in
the patch); I can't do that yet, since the feature has not yet been
stabilized.
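
A minimal sketch of a thread-local interner with a `"foo".intern()` style API is shown below.  It is an illustration under assumptions and not the TAMER implementation: the `Interner` layout, the `GlobalIntern` trait name, `lookup_str`, and the use of `Box::leak` to obtain `'static` slices are all choices made for the example.

```rust
use std::cell::RefCell;
use std::collections::HashMap;

/// A 32-bit handle; copying it never touches the interner.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct SymbolId(u32);

#[derive(Default)]
struct Interner {
    map: HashMap<&'static str, SymbolId>,
    strs: Vec<&'static str>,
}

thread_local! {
    // One interner per thread; callers never need to hold a reference to it.
    static INTERNER: RefCell<Interner> = RefCell::new(Interner::default());
}

/// Hypothetical extension trait providing `"foo".intern()`.
pub trait GlobalIntern {
    fn intern(self) -> SymbolId;
}

impl GlobalIntern for &str {
    fn intern(self) -> SymbolId {
        INTERNER.with(|cell| {
            let mut interner = cell.borrow_mut();
            if let Some(&id) = interner.map.get(self) {
                return id;
            }
            // Leaking yields a `'static` slice; this sketch never frees symbols.
            let stored: &'static str = Box::leak(self.to_owned().into_boxed_str());
            let id = SymbolId(interner.strs.len() as u32);
            interner.strs.push(stored);
            interner.map.insert(stored, id);
            id
        })
    }
}

impl SymbolId {
    /// Resolve back to the string, e.g. for output or error messages.
    pub fn lookup_str(self) -> &'static str {
        INTERNER.with(|cell| cell.borrow().strs[self.0 as usize])
    }
}
```

Since `SymbolId` is `Copy` and carries no lifetime, structures that store it need no lifetime parameter of their own, which is the simplification described above.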

In the end, this leaves us with a system that's much easier to use and
maintain; hopefully easier for newcomers to get into without having to deal
with so many complex lifetimes; and a nice API that makes it a pleasure to
work with symbols.

Admittedly, the `SymbolIndexSize` adds some complexity, and we'll see if I
end up regretting that down the line, but it exists for an important reason:
the `Span` and other structures that'll be introduced need to pack a lot of
data into 64 bits so they can be freely copied around to keep lifetimes
simple without wreaking havoc in other ways, but a 32-bit symbol size needed
by the linker is too large for that.  (Actually, the linker doesn't yet need
32 bits for our systems, but it's going to in the somewhat near future
unless we optimize away a bunch of symbols...but I'd really rather not have
the linker hit a limit that requires a lot of code changes to resolve).

Rustc uses interned spans when they exceed 8 bytes, but I'd prefer to avoid
that for now.  Most systems can just use one of the `PkgSymbolId` or
`ProgSymbolId` type aliases and not have to worry about it.  Systems that
are actually shared between the compiler and the linker do, though, but it's
not like we don't already have a bunch of trait bounds.
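
A rough sketch of the size trade-off discussed above, assuming a 16-bit
index for packages and a 32-bit index for programs.  The trait methods, the
`from_index` constructor, and the field layout of `Span` are hypothetical
and chosen only to show how a 16-bit symbol lets the span fit in 8 bytes.

```rust
use std::convert::TryFrom;
use std::mem::size_of;

/// Hypothetical stand-in for the index-size trait: a small unsigned
/// integer that can be converted to and from `usize`.
pub trait SymbolIndexSize: Copy + Eq {
    fn from_usize(n: usize) -> Option<Self>;
    fn as_usize(self) -> usize;
}

impl SymbolIndexSize for u16 {
    fn from_usize(n: usize) -> Option<Self> {
        u16::try_from(n).ok()
    }
    fn as_usize(self) -> usize {
        self as usize
    }
}

impl SymbolIndexSize for u32 {
    fn from_usize(n: usize) -> Option<Self> {
        u32::try_from(n).ok()
    }
    fn as_usize(self) -> usize {
        self as usize
    }
}

/// Symbol identifier parameterized over its index size.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SymbolId<Ix: SymbolIndexSize>(Ix);

impl<Ix: SymbolIndexSize> SymbolId<Ix> {
    /// Returns `None` if `n` does not fit in the chosen index size.
    pub fn from_index(n: usize) -> Option<Self> {
        Ix::from_usize(n).map(SymbolId)
    }

    pub fn index(self) -> usize {
        self.0.as_usize()
    }
}

/// Aliases so that most code never names the index type directly.
pub type PkgSymbolId = SymbolId<u16>;
pub type ProgSymbolId = SymbolId<u32>;

/// Hypothetical span layout: 4 + 2 + 2 bytes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Span {
    pub offset: u32,
    pub len: u16,
    pub ctx: PkgSymbolId,
}

/// Size of `Span` in bytes on the current target.
pub fn span_size() -> usize {
    size_of::<Span>()
}
```

Swapping `ctx` for a `ProgSymbolId` in this layout would push the struct past
8 bytes, which is the pressure the message describes.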

Of course, as we implement link-time optimizations (LTO) in the future, it's
possible most things will need the size and I'll grow frustrated with that
and possibly revisit this.  We shall see.

Anyway, this was exhausting...and...onward to the first frontend!
2021-08-11 14:24:55 -04:00
Mike Gerwitz 0fc8a1a4df tamer: Remove default SymbolIndex (et al) index type
Oh boy.  What a mess of a change.

This demonstrates some significant issues we have with Symbol.  I had
originally modelled the system a bit after Rustc's, but deviated in certain
regards:

  1. This has a configurable base type to enable better packing without bit
     twiddling and potentially unsafe tricks I'd rather avoid unless
     necessary; and
  2. The lifetime is not static, and there is no global, singleton interner;
     and
  3. I pass around references to a Symbol rather than passing around an
     index into an interner.

For #3---this is done because there's no singleton interner and therefore
resolving a symbol requires a direct reference to an available interner.  It
also wasn't clear to me (and still isn't, in fact) whether more than one
interner may be used for different contexts.

But, that doesn't preclude removing lifetimes and just passing around
indexes; in fact, I plan to do this in the frontend where the parser and
such will have direct interner access and can therefore just look up based
on a symbol index.  We could reserve references for situations where
exposing an interner would be undesirable.

Anyway, more to come...
2021-07-29 14:26:40 -04:00
Mike Gerwitz 2e50af1220 Copyright year update 2021 2021-07-22 15:00:15 -04:00
Mike Gerwitz 0d4bbe5e4e [DEV-8000] ir::asg: Introduce SortableAsgError
This will be used for the next commit, but this change has been isolated
both because it distracts from the implementation change in the next commit,
and because it cleans up the code by removing the need for a type parameter
on `AsgError`.

Note that the sort test cases now use `unwrap` instead of having
`{,Sortable}AsgError` support one or the other---this is because that does
not currently happen in practice, and there is not supposed to be a
hierarchy; they are siblings (though perhaps their name may imply otherwise).
2020-07-01 13:42:14 -04:00
Joseph Frazer 43d00a8268 [DEV-7504] Add GraphML generation
We want to be able to build a representation of the dependency graph so
we can easily inspect it.

We do not want to make GraphML by default. It is better to use a tool.
We use "petgraph-graphml".
2020-05-13 08:04:48 -04:00
Mike Gerwitz 0f4b2d75f8 [DEV-7084] TAMER: obj::xmlo: Private inner modules 2020-04-28 11:08:05 -04:00
Mike Gerwitz 549e9ca23b [DEV-7084] TAMER: AsgBuilderState::new: New constructor 2020-04-28 09:06:25 -04:00
Mike Gerwitz 21a0bdcce1 [DEV-7084] TAMER: AsgBuilderError: Introduce proper error variants
This is a union (sum type) of three other error types, plus errors specific
to this builder.

This commit does a good job demonstrating the boilerplate, as well as a need
for additional context (in the case of `IdentKindError`), that we'll want to
work on abstracting away.
2020-04-28 09:06:25 -04:00
Mike Gerwitz ecc2e33ba7 [DEV-7084] TAMER: xmlo::AsgBuilder: Accept XmloResult iterator
This flips the API from using XmloWriter as the context to using Asg and
consuming anything that can produce XmloResults.  This not only makes more
sense, but avoids having to create a trait for XmloReader, and simplifies
the trait bounds we have to concern ourselves with.
2020-04-28 09:06:25 -04:00
Mike Gerwitz 0f423f3b24 [DEV-7084] TAMER: Simplify path canonicalization
This abstracts away the canonicalizer and solves the problem whereby
canonicalization was not being performed prior to recording whether a path
has been visited.  This ensures that multiple relative paths to the same
file will be properly recognized as visited.
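
A minimal sketch of the idea, using only the standard library: canonicalize
first, then record the canonical path.  `VisitOnce` is a hypothetical name
and is not the abstraction introduced by this commit.

```rust
use std::collections::HashSet;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical visitor that remembers canonical paths.
#[derive(Default)]
pub struct VisitOnce {
    seen: HashSet<PathBuf>,
}

impl VisitOnce {
    /// Returns `Ok(true)` the first time a file is visited and `Ok(false)`
    /// on later visits, regardless of how the path was spelled.  Paths that
    /// cannot be canonicalized (for example, missing files) yield an error.
    pub fn visit<P: AsRef<Path>>(&mut self, path: P) -> io::Result<bool> {
        // Canonicalize *before* consulting the set, so that `a/../b` and
        // `b` map to the same entry.
        let canonical = fs::canonicalize(path)?;
        Ok(self.seen.insert(canonical))
    }
}
```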
2020-04-28 09:06:25 -04:00
Mike Gerwitz 4a7e00c404 [DEV-7084] TAMER: ld::poc: Remove unused fragments arg 2020-04-28 09:06:25 -04:00
Mike Gerwitz c94120335f [DEV-7084] TAMER: ld::poc: Remove unnecessary initial path canonicalization
Less to refactor and test.
2020-04-28 09:06:25 -04:00
Mike Gerwitz da69118592 [DEV-7084] TAMER: AsgBuilderState
This completes the POC extraction for AsgBuilder, but is still POC
code.  The commits that follow will clean it up and provide tests.
2020-04-28 09:06:25 -04:00
Mike Gerwitz 3f46917da9 [DEV-7084] TAMER: AsgBuilder extracted from POC
This extracts the changes nearly verbatim before doing refactoring so that
it's easier to observe what changes have been made.
2020-04-28 09:06:25 -04:00
Mike Gerwitz 7ed0691c45 [DEV-7084] TAMER: fs: impl File for BufReader
This further simplifies the POC linker.
2020-04-28 09:06:25 -04:00
Mike Gerwitz fbfb3c4ba2 [DEV-7084] TAMER: CanonicalFile
This will be entirely replaced in an upcoming commit.  See that for
details.  I don't feel like dealing with the conflicts for rearranging and
squashing these commits.
2020-04-28 09:06:25 -04:00
Mike Gerwitz d97e53a835 [DEV-7084] TAMER: fs: Basic filesystem abstraction
This also includes an implementation to visit paths only once.  Note that it
does not yet canonicalize the path before visiting, so relative paths to the
same file can slip through, and relative paths to _different_ files could be
erroneously considered to have been visited.

This will be fixed in an upcoming commit.
2020-04-28 09:06:19 -04:00
Mike Gerwitz 90ed4e9bd6 [DEV-7084] TAMER: From<B, &I> for XmloReader
This serves as a constructor for the time being, decoupling from POC.  We
may do something better once we have a better idea of how the various
abstractions around this will evolve.
2020-04-20 10:53:51 -04:00
Mike Gerwitz 8385b64e1d [DEV-7086] TAMER: Remove WIP linker warning
While it is true that this is still being finalized, the warnings originally
existed because tameld was not feature complete.  It is now.
2020-04-06 10:04:19 -04:00
Mike Gerwitz 40eaeb3dc8 [DEV-7087] TAMER: Remove optional Source from ASG and Object
This undoes work I did earlier today...but now we'll be able to support a
Source on an extern.

There is duplicate code between `BaseAsg::declare{,_extern}` that will be
resolved in an upcoming commit.  Upcoming commits will also simplify
terminology and clean up methods on ObjectState.
2020-03-26 09:18:08 -04:00
Mike Gerwitz 7dd8717f2f [DEV-7087] TAMER: Asg: Reintroduce declare_extern
There is some duplication here with `declare` that will be cleared up in a
following commit.  Reintroducing this method is necessary so that Source can
be used to represent the source location of the extern itself; it's
currently None to indicate an extern in `declare`.
2020-03-26 09:15:59 -04:00
Mike Gerwitz d6762ab547 [DEV-7087] TAMER: Type compatibility check during extern resolution
This properly verifies extern types, and cleans up Asg's API a little so
that externs aren't handled much differently than other declarations.

With that said, after making src optional, I realized that we will indeed
want source information for externs themselves so we can direct the user to
what package is expecting that symbol (as the old linker does).  So this
approach will not work, and I'll have to undo some of those changes.
2020-03-26 09:14:26 -04:00
Joseph Frazer 6386e096b4 [DEV-7133] Clearly show the cycles in the output 2020-03-26 08:48:43 -04:00
Mike Gerwitz f969877324 [DEV-7087] TAMER: {=>Ident}Object{,State,Data}
This is essential to clarify what exactly the different object types
represent with the new generic abstractions.  For example, we will have
expressions as an object type.
2020-03-24 09:56:25 -04:00
Mike Gerwitz 5fb68f9b67 TAMER: Make Asg generic over object
There's a lot here to make the object stored on the `Asg` generic.  This
introduces `ObjectState` for state transitions and `ObjectData` for pure
data retrieval.  This will allow not only for mocking, but will be useful to
enforce compile-time restrictions on the type of objects expected by the
linker vs. the compiler (e.g. the linker will not have expressions).
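
As a rough illustration of that split, the sketch below separates state
transitions from data retrieval and makes the graph generic over both.  The
trait method signatures, the `Ident` enum, and the `Vec<Option<O>>` storage
are assumptions for the example and do not reflect the real `Asg` API.

```rust
/// Pure data retrieval (signature is hypothetical).
pub trait ObjectData {
    fn name(&self) -> &str;
}

/// State transitions.  A failed transition hands the original object back
/// so that the graph can keep it.
pub trait ObjectState: Sized {
    fn declare(name: &str) -> Self;
    fn set_fragment(self, text: &str) -> Result<Self, (Self, String)>;
}

/// Graph generic over the stored object type.
pub struct Asg<O> {
    nodes: Vec<Option<O>>,
}

impl<O: ObjectState + ObjectData> Asg<O> {
    pub fn new() -> Self {
        Asg { nodes: Vec::new() }
    }

    /// Declare a new object and return its index.
    pub fn declare(&mut self, name: &str) -> usize {
        self.nodes.push(Some(O::declare(name)));
        self.nodes.len() - 1
    }

    /// Apply a state transition in place.
    pub fn set_fragment(&mut self, idx: usize, text: &str) -> Result<(), String> {
        let slot = self.nodes.get_mut(idx).ok_or("no such node")?;
        let obj = slot.take().ok_or("node is mid-transition")?;
        match obj.set_fragment(text) {
            Ok(next) => {
                *slot = Some(next);
                Ok(())
            }
            Err((orig, msg)) => {
                *slot = Some(orig);
                Err(msg)
            }
        }
    }

    pub fn name_of(&self, idx: usize) -> Option<&str> {
        self.nodes
            .get(idx)
            .and_then(|slot| slot.as_ref())
            .map(|o| o.name())
    }
}

/// Hypothetical identifier object with two states.
#[derive(Debug, PartialEq)]
pub enum Ident {
    Declared(String),
    Attached(String, String),
}

impl ObjectData for Ident {
    fn name(&self) -> &str {
        match self {
            Ident::Declared(n) | Ident::Attached(n, _) => n.as_str(),
        }
    }
}

impl ObjectState for Ident {
    fn declare(name: &str) -> Self {
        Ident::Declared(name.to_owned())
    }

    fn set_fragment(self, text: &str) -> Result<Self, (Self, String)> {
        match self {
            Ident::Declared(n) => Ok(Ident::Attached(n, text.to_owned())),
            Ident::Attached(n, t) => {
                let msg = format!("`{}` already has a fragment", n);
                Err((Ident::Attached(n, t), msg))
            }
        }
    }
}
```

A test could substitute a mock type implementing the same two traits, and a
linker-only object type could simply omit states that only the compiler
needs.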

This commit intentionally leaves the corresponding tests in their original
location to prove that the functionality has not changed; they'll be moved
in a future commit.

This also leaves the names as "Object" to reduce the cognitive
overhead of this commit.  It will be renamed to something like "IdentObject"
in the near future to clarify the intent of the current object type and to
open the way for expressions and a type that marries both of them in the
future.

Once all of this is done, we'll finally be able to make changes to the
compatibility logic in state transitions to implement extern compatibility
checks during resolution.

DEV-7087
2020-03-24 09:56:20 -04:00
Mike Gerwitz 3fe3fc4b84 TAMER: ld/poc: Simplify {get_interner_value=>get_ident} 2020-03-19 15:42:06 -04:00
Joseph Frazer 7e95394076 [DEV-7085] Create `SortableAsg` trait
Create a trait that sorts a graph into `Sections` that can then be used
as an IR. The `BaseAsg` should implement the trait using what was
originally in the POC.
2020-03-13 11:51:59 -04:00
Joseph Frazer 59a0c382af [DEV-7085] Move sections to IR module
We need to use `Sections` in both the writer and the ASG so it needs to
be in a place that makes sense.
2020-03-13 11:51:59 -04:00
Joseph Frazer 01e7d3e560 [DEV-7134] Propagate errors from the writer
When errors occur during XML writing, they should be shown to the
user.
2020-03-09 08:23:13 -04:00
Joseph Frazer f373a00a80 [DEV-7134] Propagate sorting errors
If a node is found while sorting that is not expected, we should show
the error to the user.
2020-03-09 08:23:13 -04:00
Joseph Frazer 2a5551a04a [DEV-7134] Propagate errors setting fragments
If we cannot set a fragment, we need to display the error to the user.

We are currently ignoring "___head", "___tail", and objects that are
both virtual and overridden. Those will be corrected in future
changes.
2020-03-09 08:23:13 -04:00
Joseph Frazer 06bc89a9ce [DEV-7134] Pass read event errors up the stack 2020-03-06 14:08:55 -05:00
Joseph Frazer 246a40a047 [DEV-7134] Return error for XmloEvent::SymDecl
We want more than warnings when a XmloEvent::SymDecl symbol has an
unknown "kind".
2020-03-06 13:41:32 -05:00
Joseph Frazer 2228a6158a [DEV-7134] Add alias for LoadResult
It looks better and was recommended by Rust's linter.
2020-03-06 12:44:22 -05:00
Joseph Frazer 4810e7a099 [DEV-7134] Remove unwrap so we can bubble up error messages 2020-03-06 12:32:42 -05:00
Joseph Frazer 590245e191 [DEV-7134] Escalate the error from finding the absolute path
We do not want to have a panic here. The error should be displayed
properly.
2020-03-06 12:24:45 -05:00
Mike Gerwitz bfea768f89 Copyright year 2020 update 2020-03-06 11:05:18 -05:00
Joseph Frazer e613bd8a8c [DEV-7081] Add options to tameld
We want to add an option to set the output file to the linker so we do
not need to redirect output to awk any longer.

This also adds integration tests for tameld.
2020-03-06 09:41:55 -05:00
Joseph Frazer 6ac7641087 [DEV-7083] TAMER: xmle writer
This introduces the writer for xmle files.
2020-03-03 11:21:18 -05:00
Mike Gerwitz 19a6d67dc4 TAMER: Separate static xmle section 2020-02-26 10:49:01 -05:00
Mike Gerwitz ab3aec980d TAMER: POC: Use FxHash to remove nondeterminism
The default hasher (SipHash) is randomly seeded, which causes ordering to
change between runs.
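
The effect can be sketched with the standard library alone by giving
`HashMap` a hasher with fixed keys.  Note the substitution: this uses
`BuildHasherDefault<DefaultHasher>` rather than FxHash, purely to keep the
example free of external crates; `DeterministicMap` and `iteration_order`
are hypothetical names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::BuildHasherDefault;

/// A map whose hasher is constructed via `Default`, so every instance
/// starts from the same keys, unlike the randomly seeded `RandomState`.
pub type DeterministicMap<K, V> =
    HashMap<K, V, BuildHasherDefault<DefaultHasher>>;

/// Insert `names` in order and report the order in which the map yields
/// them back.
pub fn iteration_order(names: &[&str]) -> Vec<String> {
    let mut map: DeterministicMap<String, usize> = HashMap::default();
    for (i, name) in names.iter().enumerate() {
        map.insert(name.to_string(), i);
    }
    map.keys().cloned().collect()
}
```

The order is still arbitrary, but the same insertions give the same order
each time, which is what keeps the linker's output stable across runs.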
2020-02-26 10:49:00 -05:00
Mike Gerwitz 645908e258 TAMER: xmle output changes to support Summary Page
Co-Authored-By: Joseph Frazer <joseph.frazer@ryansg.com>
2020-02-26 10:49:00 -05:00
Mike Gerwitz 6939753ca0 TAMER: POC: Output xmle
This is a working proof-of-concept that will be finalized in future commits.
2020-02-26 10:49:00 -05:00
Mike Gerwitz 85a4934db5 TAMER: Symbol source data and metadata 2020-02-26 10:49:00 -05:00
Mike Gerwitz bcc2ab1221 TAMER: Initial abstract semantic graph (ASG)
This begins to introduce the ASG, backed by Petgraph.  The API will continue
to evolve, and Petgraph will likely be encapsulated so that our
implementation can vary independently from it (or even remove it in the
future).
2020-02-26 10:48:59 -05:00
Mike Gerwitz a0893da577 TAMER: xmlo: Add Package event 2020-02-25 16:46:27 -05:00
Mike Gerwitz a8726918f7 TAMER: poc: Use xmlo reader
TODO: More information
2020-02-25 16:46:27 -05:00
Mike Gerwitz ff0c8bb34f Order symtable, sym-dep, fragments
This ordering will simplify streaming processing of xmlo files in
TAMER.  Specifically, we know that symbols will have been declared by the
time dependencies are added to the graph (and so we should only be creating
edges to existing nodes); and we can halt reading as soon as the closing
fragments tag is encountered, avoiding parsing the entirety of these massive
XML files.

On one particularly large program, this cuts time down from ~0.333s to
~0.300s in the POC linker.
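
The two properties this ordering buys can be sketched with a toy event
stream.  The `Event` enum and `load` function are hypothetical and much
simpler than the real xmlo reader; they only show that dependencies can be
checked against already-declared symbols and that reading can stop at the
end of the fragments.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified event stream for an object file.
#[derive(Debug, Clone, PartialEq)]
pub enum Event {
    SymDecl(&'static str),
    SymDep(&'static str, &'static str),
    Fragment(&'static str),
    FragmentsEnd,
    Other(&'static str),
}

#[derive(Debug, Default, PartialEq)]
pub struct Loaded {
    pub symbols: HashSet<&'static str>,
    pub edges: Vec<(&'static str, &'static str)>,
    pub fragments: usize,
    pub events_read: usize,
}

/// Consume events in file order, stopping at the end of the fragments.
pub fn load<I: IntoIterator<Item = Event>>(events: I) -> Result<Loaded, String> {
    let mut out = Loaded::default();
    for ev in events {
        out.events_read += 1;
        match ev {
            Event::SymDecl(name) => {
                out.symbols.insert(name);
            }
            // Because declarations come first, both ends of an edge must
            // already be known.
            Event::SymDep(from, to) => {
                if !out.symbols.contains(from) || !out.symbols.contains(to) {
                    return Err(format!(
                        "dependency on undeclared symbol: {} -> {}",
                        from, to
                    ));
                }
                out.edges.push((from, to));
            }
            Event::Fragment(_) => out.fragments += 1,
            // Nothing after this point is needed; stop reading.
            Event::FragmentsEnd => break,
            Event::Other(_) => {}
        }
    }
    Ok(out)
}
```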
2020-02-24 14:56:28 -05:00
Mike Gerwitz e4e0089815 TAMER: Initial string interning abstraction
This is missing two key things that I'll add shortly: a HashMap-based one
for use in the ASG for node mapping, and an entry-based system for
manipulations.

This has been a nice start for exploring various aspects of Rust
development, as well as conventions that I'd like to implement.  In
particular:

  - Robust documentation intended to guide people through learning the
    necessary material about the compiler, as well as related work to
    rationalize design decisions;
  - Benchmarks;
  - TDD;
  - And just getting used to Rust in general.

I've beat this one to death, so I'll commit this and make smaller changes
going forward to show how easily it can evolve.

(This module was originally named `intern` but this commit and those that
follow rewrote it to `sym`.)
2020-02-24 14:56:28 -05:00
Mike Gerwitz 8455a38a1d Graph-based POC
This makes use of Petgraph for representing the dependency graph and uses a
separate data structure for both string interning and indexing by symbol
name.
2019-12-02 10:05:48 -05:00
Mike Gerwitz 8374541965 tamer: Initial basic POC with no XML output
This is garbage code.  Do not use it.  It is intentionally throwaway.

While I've researched Rust, I haven't actually _used_ it for a project, so
this is a combination of me exploring various ways of accomplishing the
problem and forcing myself to learn certain aspects of the language.

I'll likely be using petgraph, and this also currently lacks symbol
abstractions.  This commit also performs far too much heap allocation
copying strings around.  But it _does_ perform the topological sort.
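
For readers unfamiliar with the technique, a depth-first topological sort
over symbol names can be sketched as follows.  This is not the POC's code:
the `topo_sort` signature, the `Mark` enum, and the use of string slices as
node identifiers are assumptions for the example.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq)]
enum Mark {
    Visiting,
    Done,
}

fn visit<'a>(
    sym: &'a str,
    deps: &HashMap<&'a str, Vec<&'a str>>,
    marks: &mut HashMap<&'a str, Mark>,
    out: &mut Vec<&'a str>,
) -> Result<(), String> {
    match marks.get(sym) {
        Some(Mark::Done) => return Ok(()),
        // Reaching a node that is still being visited means a cycle.
        Some(Mark::Visiting) => {
            return Err(format!("cycle detected at `{}`", sym))
        }
        None => {}
    }
    marks.insert(sym, Mark::Visiting);
    if let Some(children) = deps.get(sym) {
        for &child in children {
            visit(child, deps, marks, out)?;
        }
    }
    marks.insert(sym, Mark::Done);
    // Post-order: a symbol is emitted only after all of its dependencies.
    out.push(sym);
    Ok(())
}

/// Return the symbols reachable from `root`, dependencies first.
pub fn topo_sort<'a>(
    deps: &HashMap<&'a str, Vec<&'a str>>,
    root: &'a str,
) -> Result<Vec<&'a str>, String> {
    let mut marks = HashMap::new();
    let mut out = Vec::new();
    visit(root, deps, &mut marks, &mut out)?;
    Ok(out)
}
```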

Since this only stores the symbol name, it lacks enough information about
the symbol to perform a proper linking.
2019-12-02 10:00:53 -05:00