2021-08-02 23:54:37 -04:00
|
|
|
|
// String interner
|
|
|
|
|
//
|
|
|
|
|
// Copyright (C) 2014-2021 Ryan Specialty Group, LLC.
|
|
|
|
|
//
|
|
|
|
|
// This file is part of TAME.
|
|
|
|
|
//
|
|
|
|
|
// This program is free software: you can redistribute it and/or modify
|
|
|
|
|
// it under the terms of the GNU General Public License as published by
|
|
|
|
|
// the Free Software Foundation, either version 3 of the License, or
|
|
|
|
|
// (at your option) any later version.
|
|
|
|
|
//
|
|
|
|
|
// This program is distributed in the hope that it will be useful,
|
|
|
|
|
// but WITHOUT ANY WARRANTY; without even the implied warranty of
|
|
|
|
|
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
|
|
|
|
// GNU General Public License for more details.
|
|
|
|
|
//
|
|
|
|
|
// You should have received a copy of the GNU General Public License
|
|
|
|
|
// along with this program. If not, see <http://www.gnu.org/licenses/>.
|
|
|
|
|
|
|
|
|
|
//! Interners used to intern values as symbols.
|
|
|
|
|
//!
|
|
|
|
|
//! See the [parent module](super) for more information.
|
tamer: Global interners
This is a major change, and I apologize for it all being in one commit. I
had wanted to break it up, but doing so would have required a significant
amount of temporary work that was not worth doing while I'm the only one
working on this project at the moment.
This accomplishes a number of important things, now that I'm preparing to
write the first compiler frontend for TAMER:
1. `Symbol` has been removed; `SymbolId` is used in its place.
2. Consequently, symbols use 16 or 32 bits, rather than a 64-bit pointer.
3. Using symbols no longer requires dereferencing.
4. **Lifetimes no longer pollute the entire system! (`'i`)**
5. Two global interners are offered to produce `SymbolStr` with `'static`
lifetimes, simplfiying lifetime management and borrowing where strings
are still needed.
6. A nice API is provided for interning and lookups (e.g. "foo".intern())
which makes this look like a core feature of Rust.
Unfortunately, making this change required modifications to...virtually
everything. And that serves to emphasize why this change was needed:
_everything_ used symbols, and so there's no use in not providing globals.
I implemented this in a way that still provides for loose coupling through
Rust's trait system. Indeed, Rustc offers a global interner, and I decided
not to go that route initially because it wasn't clear to me that such a
thing was desirable. It didn't become apparent to me, in fact, until the
recent commit where I introduced `SymbolIndexSize` and saw how many things
had to be touched; the linker evolved so rapidly as I was trying to learn
Rust that I lost track of how bad it got.
Further, this shows how the design of the internment system was a bit
naive---I assumed certain requirements that never panned out. In
particular, everything using symbols stored `&'i Symbol<'i>`---that is, a
reference (usize) to an object containing an index (32-bit) and a string
slice (128-bit). So it was a reference to a pretty large value, which was
allocated in the arena alongside the interned string itself.
But, that was assuming that something would need both the symbol index _and_
a readily available string. That's not the case. In fact, it's pretty
clear that interning happens at the beginning of execution, that `SymbolId`
is all that's needed during processing (unless an error occurs; more on that
below); and it's not until _the very end_ that we need to retrieve interned
strings from the pool to write either to a file or to display to the
user. It was horribly wasteful!
So `SymbolId` solves the lifetime issue in itself for most systems, but it
still requires that an interner be available for anything that needs to
create or resolve symbols, which, as it turns out, is still a lot of
things. Therefore, I decided to implement them as thread-local static
variables, which is very similar to what Rustc does itself (Rustc's are
scoped). TAMER does not use threads, so the resulting `'static` lifetime
should be just fine for now. Eventually I'd like to implement `!Send` and
`!Sync`, though, to prevent references from escaping the thread (as noted in
the patch); I can't do that yet, since the feature has not yet been
stabalized.
In the end, this leaves us with a system that's much easier to use and
maintain; hopefully easier for newcomers to get into without having to deal
with so many complex lifetimes; and a nice API that makes it a pleasure to
work with symbols.
Admittedly, the `SymbolIndexSize` adds some complexity, and we'll see if I
end up regretting that down the line, but it exists for an important reason:
the `Span` and other structures that'll be introduced need to pack a lot of
data into 64 bits so they can be freely copied around to keep lifetimes
simple without wreaking havoc in other ways, but a 32-bit symbol size needed
by the linker is too large for that. (Actually, the linker doesn't yet need
32 bits for our systems, but it's going to in the somewhat near future
unless we optimize away a bunch of symbols...but I'd really rather not have
the linker hit a limit that requires a lot of code changes to resolve).
Rustc uses interned spans when they exceed 8 bytes, but I'd prefer to avoid
that for now. Most systems can just use on of the `PkgSymbolId` or
`ProgSymbolId` type aliases and not have to worry about it. Systems that
are actually shared between the compiler and the linker do, though, but it's
not like we don't already have a bunch of trait bounds.
Of course, as we implement link-time optimizations (LTO) in the future, it's
possible most things will need the size and I'll grow frustrated with that
and possibly revisit this. We shall see.
Anyway, this was exhausting...and...onward to the first frontend!
2021-08-02 23:54:37 -04:00
|
|
|
|
//!
|
|
|
|
|
//!
|
|
|
|
|
//! Using Interners Directly (Without Global State)
|
|
|
|
|
//! ===============================================
|
|
|
|
|
//! Please do not do this unless you have a compelling use case and know
|
|
|
|
|
//! what you are doing,
|
|
|
|
|
//! including understanding how to mitigate mixing of [`SymbolId`]s,
|
|
|
|
|
//! such as with newtypes or encapsulation.
|
|
|
|
|
//! Otherwise,
|
|
|
|
|
//! use the global interners instead,
|
|
|
|
|
//! as documented in the [parent module](super).
|
|
|
|
|
//!
|
|
|
|
|
//! ```
|
|
|
|
|
//! use tamer::sym::{Interner, DefaultPkgInterner, SymbolId};
|
|
|
|
|
//!
|
|
|
|
|
//! // Inputs to be interned
|
|
|
|
|
//! let a = "foo";
|
|
|
|
|
//! let b = &"foo".to_string();
|
|
|
|
|
//! let c = "foobar";
|
|
|
|
|
//! let d = &c[0..3];
|
|
|
|
|
//!
|
|
|
|
|
//! // Interners employ interior mutability and so do not need to be
|
|
|
|
|
//! // declared `mut`
|
|
|
|
|
//! let interner = DefaultPkgInterner::new();
|
|
|
|
|
//!
|
|
|
|
|
//! let (ia, ib, ic, id) = (
|
|
|
|
|
//! interner.intern(a),
|
|
|
|
|
//! interner.intern(b),
|
|
|
|
|
//! interner.intern(c),
|
|
|
|
|
//! interner.intern(d),
|
|
|
|
|
//! );
|
|
|
|
|
//!
|
|
|
|
|
//! assert_eq!(ia, ib);
|
|
|
|
|
//! assert_eq!(ia, id);
|
|
|
|
|
//! assert_eq!(ib, id);
|
|
|
|
|
//! assert_ne!(ia, ic);
|
|
|
|
|
//!
|
|
|
|
|
//! // Only "foo" and "foobar" are interned
|
|
|
|
|
//! assert_eq!(2, interner.len());
|
|
|
|
|
//! assert!(interner.contains("foo"));
|
|
|
|
|
//! assert!(interner.contains("foobar"));
|
|
|
|
|
//! assert!(!interner.contains("something else"));
|
|
|
|
|
//!
|
|
|
|
|
//! // Symbols can also be looked up by index.
|
|
|
|
|
//! assert_eq!("foo", interner.index_lookup(ia).unwrap());
|
|
|
|
|
//! ```
|
2021-08-02 23:54:37 -04:00
|
|
|
|
|
tamer: Global interners
This is a major change, and I apologize for it all being in one commit. I
had wanted to break it up, but doing so would have required a significant
amount of temporary work that was not worth doing while I'm the only one
working on this project at the moment.
This accomplishes a number of important things, now that I'm preparing to
write the first compiler frontend for TAMER:
1. `Symbol` has been removed; `SymbolId` is used in its place.
2. Consequently, symbols use 16 or 32 bits, rather than a 64-bit pointer.
3. Using symbols no longer requires dereferencing.
4. **Lifetimes no longer pollute the entire system! (`'i`)**
5. Two global interners are offered to produce `SymbolStr` with `'static`
lifetimes, simplfiying lifetime management and borrowing where strings
are still needed.
6. A nice API is provided for interning and lookups (e.g. "foo".intern())
which makes this look like a core feature of Rust.
Unfortunately, making this change required modifications to...virtually
everything. And that serves to emphasize why this change was needed:
_everything_ used symbols, and so there's no use in not providing globals.
I implemented this in a way that still provides for loose coupling through
Rust's trait system. Indeed, Rustc offers a global interner, and I decided
not to go that route initially because it wasn't clear to me that such a
thing was desirable. It didn't become apparent to me, in fact, until the
recent commit where I introduced `SymbolIndexSize` and saw how many things
had to be touched; the linker evolved so rapidly as I was trying to learn
Rust that I lost track of how bad it got.
Further, this shows how the design of the internment system was a bit
naive---I assumed certain requirements that never panned out. In
particular, everything using symbols stored `&'i Symbol<'i>`---that is, a
reference (usize) to an object containing an index (32-bit) and a string
slice (128-bit). So it was a reference to a pretty large value, which was
allocated in the arena alongside the interned string itself.
But, that was assuming that something would need both the symbol index _and_
a readily available string. That's not the case. In fact, it's pretty
clear that interning happens at the beginning of execution, that `SymbolId`
is all that's needed during processing (unless an error occurs; more on that
below); and it's not until _the very end_ that we need to retrieve interned
strings from the pool to write either to a file or to display to the
user. It was horribly wasteful!
So `SymbolId` solves the lifetime issue in itself for most systems, but it
still requires that an interner be available for anything that needs to
create or resolve symbols, which, as it turns out, is still a lot of
things. Therefore, I decided to implement them as thread-local static
variables, which is very similar to what Rustc does itself (Rustc's are
scoped). TAMER does not use threads, so the resulting `'static` lifetime
should be just fine for now. Eventually I'd like to implement `!Send` and
`!Sync`, though, to prevent references from escaping the thread (as noted in
the patch); I can't do that yet, since the feature has not yet been
stabalized.
In the end, this leaves us with a system that's much easier to use and
maintain; hopefully easier for newcomers to get into without having to deal
with so many complex lifetimes; and a nice API that makes it a pleasure to
work with symbols.
Admittedly, the `SymbolIndexSize` adds some complexity, and we'll see if I
end up regretting that down the line, but it exists for an important reason:
the `Span` and other structures that'll be introduced need to pack a lot of
data into 64 bits so they can be freely copied around to keep lifetimes
simple without wreaking havoc in other ways, but a 32-bit symbol size needed
by the linker is too large for that. (Actually, the linker doesn't yet need
32 bits for our systems, but it's going to in the somewhat near future
unless we optimize away a bunch of symbols...but I'd really rather not have
the linker hit a limit that requires a lot of code changes to resolve).
Rustc uses interned spans when they exceed 8 bytes, but I'd prefer to avoid
that for now. Most systems can just use on of the `PkgSymbolId` or
`ProgSymbolId` type aliases and not have to worry about it. Systems that
are actually shared between the compiler and the linker do, though, but it's
not like we don't already have a bunch of trait bounds.
Of course, as we implement link-time optimizations (LTO) in the future, it's
possible most things will need the size and I'll grow frustrated with that
and possibly revisit this. We shall see.
Anyway, this was exhausting...and...onward to the first frontend!
2021-08-02 23:54:37 -04:00
|
|
|
|
use super::symbol::SymbolStr;
|
|
|
|
|
use super::{SymbolId, SymbolIndexSize};
|
2021-08-02 23:54:37 -04:00
|
|
|
|
use crate::global;
|
|
|
|
|
use bumpalo::Bump;
|
|
|
|
|
use fxhash::FxBuildHasher;
|
|
|
|
|
use std::cell::RefCell;
|
|
|
|
|
use std::collections::HashMap;
|
|
|
|
|
use std::convert::{TryFrom, TryInto};
|
|
|
|
|
use std::fmt::Debug;
|
|
|
|
|
use std::hash::BuildHasher;
|
|
|
|
|
|
tamer: Global interners
This is a major change, and I apologize for it all being in one commit. I
had wanted to break it up, but doing so would have required a significant
amount of temporary work that was not worth doing while I'm the only one
working on this project at the moment.
This accomplishes a number of important things, now that I'm preparing to
write the first compiler frontend for TAMER:
1. `Symbol` has been removed; `SymbolId` is used in its place.
2. Consequently, symbols use 16 or 32 bits, rather than a 64-bit pointer.
3. Using symbols no longer requires dereferencing.
4. **Lifetimes no longer pollute the entire system! (`'i`)**
5. Two global interners are offered to produce `SymbolStr` with `'static`
lifetimes, simplfiying lifetime management and borrowing where strings
are still needed.
6. A nice API is provided for interning and lookups (e.g. "foo".intern())
which makes this look like a core feature of Rust.
Unfortunately, making this change required modifications to...virtually
everything. And that serves to emphasize why this change was needed:
_everything_ used symbols, and so there's no use in not providing globals.
I implemented this in a way that still provides for loose coupling through
Rust's trait system. Indeed, Rustc offers a global interner, and I decided
not to go that route initially because it wasn't clear to me that such a
thing was desirable. It didn't become apparent to me, in fact, until the
recent commit where I introduced `SymbolIndexSize` and saw how many things
had to be touched; the linker evolved so rapidly as I was trying to learn
Rust that I lost track of how bad it got.
Further, this shows how the design of the internment system was a bit
naive---I assumed certain requirements that never panned out. In
particular, everything using symbols stored `&'i Symbol<'i>`---that is, a
reference (usize) to an object containing an index (32-bit) and a string
slice (128-bit). So it was a reference to a pretty large value, which was
allocated in the arena alongside the interned string itself.
But, that was assuming that something would need both the symbol index _and_
a readily available string. That's not the case. In fact, it's pretty
clear that interning happens at the beginning of execution, that `SymbolId`
is all that's needed during processing (unless an error occurs; more on that
below); and it's not until _the very end_ that we need to retrieve interned
strings from the pool to write either to a file or to display to the
user. It was horribly wasteful!
So `SymbolId` solves the lifetime issue in itself for most systems, but it
still requires that an interner be available for anything that needs to
create or resolve symbols, which, as it turns out, is still a lot of
things. Therefore, I decided to implement them as thread-local static
variables, which is very similar to what Rustc does itself (Rustc's are
scoped). TAMER does not use threads, so the resulting `'static` lifetime
should be just fine for now. Eventually I'd like to implement `!Send` and
`!Sync`, though, to prevent references from escaping the thread (as noted in
the patch); I can't do that yet, since the feature has not yet been
stabalized.
In the end, this leaves us with a system that's much easier to use and
maintain; hopefully easier for newcomers to get into without having to deal
with so many complex lifetimes; and a nice API that makes it a pleasure to
work with symbols.
Admittedly, the `SymbolIndexSize` adds some complexity, and we'll see if I
end up regretting that down the line, but it exists for an important reason:
the `Span` and other structures that'll be introduced need to pack a lot of
data into 64 bits so they can be freely copied around to keep lifetimes
simple without wreaking havoc in other ways, but a 32-bit symbol size needed
by the linker is too large for that. (Actually, the linker doesn't yet need
32 bits for our systems, but it's going to in the somewhat near future
unless we optimize away a bunch of symbols...but I'd really rather not have
the linker hit a limit that requires a lot of code changes to resolve).
Rustc uses interned spans when they exceed 8 bytes, but I'd prefer to avoid
that for now. Most systems can just use on of the `PkgSymbolId` or
`ProgSymbolId` type aliases and not have to worry about it. Systems that
are actually shared between the compiler and the linker do, though, but it's
not like we don't already have a bunch of trait bounds.
Of course, as we implement link-time optimizations (LTO) in the future, it's
possible most things will need the size and I'll grow frustrated with that
and possibly revisit this. We shall see.
Anyway, this was exhausting...and...onward to the first frontend!
2021-08-02 23:54:37 -04:00
|
|
|
|
/// Create, store, compare, and retrieve interned values.
|
2021-08-02 23:54:37 -04:00
|
|
|
|
///
|
tamer: Global interners
This is a major change, and I apologize for it all being in one commit. I
had wanted to break it up, but doing so would have required a significant
amount of temporary work that was not worth doing while I'm the only one
working on this project at the moment.
This accomplishes a number of important things, now that I'm preparing to
write the first compiler frontend for TAMER:
1. `Symbol` has been removed; `SymbolId` is used in its place.
2. Consequently, symbols use 16 or 32 bits, rather than a 64-bit pointer.
3. Using symbols no longer requires dereferencing.
4. **Lifetimes no longer pollute the entire system! (`'i`)**
5. Two global interners are offered to produce `SymbolStr` with `'static`
lifetimes, simplfiying lifetime management and borrowing where strings
are still needed.
6. A nice API is provided for interning and lookups (e.g. "foo".intern())
which makes this look like a core feature of Rust.
Unfortunately, making this change required modifications to...virtually
everything. And that serves to emphasize why this change was needed:
_everything_ used symbols, and so there's no use in not providing globals.
I implemented this in a way that still provides for loose coupling through
Rust's trait system. Indeed, Rustc offers a global interner, and I decided
not to go that route initially because it wasn't clear to me that such a
thing was desirable. It didn't become apparent to me, in fact, until the
recent commit where I introduced `SymbolIndexSize` and saw how many things
had to be touched; the linker evolved so rapidly as I was trying to learn
Rust that I lost track of how bad it got.
Further, this shows how the design of the internment system was a bit
naive---I assumed certain requirements that never panned out. In
particular, everything using symbols stored `&'i Symbol<'i>`---that is, a
reference (usize) to an object containing an index (32-bit) and a string
slice (128-bit). So it was a reference to a pretty large value, which was
allocated in the arena alongside the interned string itself.
But, that was assuming that something would need both the symbol index _and_
a readily available string. That's not the case. In fact, it's pretty
clear that interning happens at the beginning of execution, that `SymbolId`
is all that's needed during processing (unless an error occurs; more on that
below); and it's not until _the very end_ that we need to retrieve interned
strings from the pool to write either to a file or to display to the
user. It was horribly wasteful!
So `SymbolId` solves the lifetime issue in itself for most systems, but it
still requires that an interner be available for anything that needs to
create or resolve symbols, which, as it turns out, is still a lot of
things. Therefore, I decided to implement them as thread-local static
variables, which is very similar to what Rustc does itself (Rustc's are
scoped). TAMER does not use threads, so the resulting `'static` lifetime
should be just fine for now. Eventually I'd like to implement `!Send` and
`!Sync`, though, to prevent references from escaping the thread (as noted in
the patch); I can't do that yet, since the feature has not yet been
stabalized.
In the end, this leaves us with a system that's much easier to use and
maintain; hopefully easier for newcomers to get into without having to deal
with so many complex lifetimes; and a nice API that makes it a pleasure to
work with symbols.
Admittedly, the `SymbolIndexSize` adds some complexity, and we'll see if I
end up regretting that down the line, but it exists for an important reason:
the `Span` and other structures that'll be introduced need to pack a lot of
data into 64 bits so they can be freely copied around to keep lifetimes
simple without wreaking havoc in other ways, but a 32-bit symbol size needed
by the linker is too large for that. (Actually, the linker doesn't yet need
32 bits for our systems, but it's going to in the somewhat near future
unless we optimize away a bunch of symbols...but I'd really rather not have
the linker hit a limit that requires a lot of code changes to resolve).
Rustc uses interned spans when they exceed 8 bytes, but I'd prefer to avoid
that for now. Most systems can just use on of the `PkgSymbolId` or
`ProgSymbolId` type aliases and not have to worry about it. Systems that
are actually shared between the compiler and the linker do, though, but it's
not like we don't already have a bunch of trait bounds.
Of course, as we implement link-time optimizations (LTO) in the future, it's
possible most things will need the size and I'll grow frustrated with that
and possibly revisit this. We shall see.
Anyway, this was exhausting...and...onward to the first frontend!
2021-08-02 23:54:37 -04:00
|
|
|
|
/// Interners accept string slices and produce values of type [`SymbolId`].
|
|
|
|
|
/// The same [`SymbolId`] will always be returned for a given string,
|
|
|
|
|
/// allowing symbols to be compared for equality cheaply by comparing
|
|
|
|
|
/// integers.
|
|
|
|
|
/// Symbol locations in memory are fixed for the lifetime of the interner,
|
|
|
|
|
/// and can be retrieved as [`SymbolStr`] using
|
|
|
|
|
/// [`index_lookup`](Interner::index_lookup).
|
2021-08-02 23:54:37 -04:00
|
|
|
|
///
|
|
|
|
|
/// If you care whether a value has been interned yet or not,
|
|
|
|
|
/// see [`intern_soft`][Interner::intern_soft`] and
|
|
|
|
|
/// [`contains`](Interner::contains).
|
|
|
|
|
///
|
|
|
|
|
/// See the [module-level documentation](self) for an example.
|
tamer: Global interners
This is a major change, and I apologize for it all being in one commit. I
had wanted to break it up, but doing so would have required a significant
amount of temporary work that was not worth doing while I'm the only one
working on this project at the moment.
This accomplishes a number of important things, now that I'm preparing to
write the first compiler frontend for TAMER:
1. `Symbol` has been removed; `SymbolId` is used in its place.
2. Consequently, symbols use 16 or 32 bits, rather than a 64-bit pointer.
3. Using symbols no longer requires dereferencing.
4. **Lifetimes no longer pollute the entire system! (`'i`)**
5. Two global interners are offered to produce `SymbolStr` with `'static`
lifetimes, simplfiying lifetime management and borrowing where strings
are still needed.
6. A nice API is provided for interning and lookups (e.g. "foo".intern())
which makes this look like a core feature of Rust.
Unfortunately, making this change required modifications to...virtually
everything. And that serves to emphasize why this change was needed:
_everything_ used symbols, and so there's no use in not providing globals.
I implemented this in a way that still provides for loose coupling through
Rust's trait system. Indeed, Rustc offers a global interner, and I decided
not to go that route initially because it wasn't clear to me that such a
thing was desirable. It didn't become apparent to me, in fact, until the
recent commit where I introduced `SymbolIndexSize` and saw how many things
had to be touched; the linker evolved so rapidly as I was trying to learn
Rust that I lost track of how bad it got.
Further, this shows how the design of the internment system was a bit
naive---I assumed certain requirements that never panned out. In
particular, everything using symbols stored `&'i Symbol<'i>`---that is, a
reference (usize) to an object containing an index (32-bit) and a string
slice (128-bit). So it was a reference to a pretty large value, which was
allocated in the arena alongside the interned string itself.
But, that was assuming that something would need both the symbol index _and_
a readily available string. That's not the case. In fact, it's pretty
clear that interning happens at the beginning of execution, that `SymbolId`
is all that's needed during processing (unless an error occurs; more on that
below); and it's not until _the very end_ that we need to retrieve interned
strings from the pool to write either to a file or to display to the
user. It was horribly wasteful!
So `SymbolId` solves the lifetime issue in itself for most systems, but it
still requires that an interner be available for anything that needs to
create or resolve symbols, which, as it turns out, is still a lot of
things. Therefore, I decided to implement them as thread-local static
variables, which is very similar to what Rustc does itself (Rustc's are
scoped). TAMER does not use threads, so the resulting `'static` lifetime
should be just fine for now. Eventually I'd like to implement `!Send` and
`!Sync`, though, to prevent references from escaping the thread (as noted in
the patch); I can't do that yet, since the feature has not yet been
stabalized.
In the end, this leaves us with a system that's much easier to use and
maintain; hopefully easier for newcomers to get into without having to deal
with so many complex lifetimes; and a nice API that makes it a pleasure to
work with symbols.
Admittedly, the `SymbolIndexSize` adds some complexity, and we'll see if I
end up regretting that down the line, but it exists for an important reason:
the `Span` and other structures that'll be introduced need to pack a lot of
data into 64 bits so they can be freely copied around to keep lifetimes
simple without wreaking havoc in other ways, but a 32-bit symbol size needed
by the linker is too large for that. (Actually, the linker doesn't yet need
32 bits for our systems, but it's going to in the somewhat near future
unless we optimize away a bunch of symbols...but I'd really rather not have
the linker hit a limit that requires a lot of code changes to resolve).
Rustc uses interned spans when they exceed 8 bytes, but I'd prefer to avoid
that for now. Most systems can just use on of the `PkgSymbolId` or
`ProgSymbolId` type aliases and not have to worry about it. Systems that
are actually shared between the compiler and the linker do, though, but it's
not like we don't already have a bunch of trait bounds.
Of course, as we implement link-time optimizations (LTO) in the future, it's
possible most things will need the size and I'll grow frustrated with that
and possibly revisit this. We shall see.
Anyway, this was exhausting...and...onward to the first frontend!
2021-08-02 23:54:37 -04:00
|
|
|
|
/// For interfaces to the global interners that indirectly use these
|
|
|
|
|
/// methods,
|
|
|
|
|
/// see the [parent module](super).
|
2021-08-02 23:54:37 -04:00
|
|
|
|
pub trait Interner<'i, Ix: SymbolIndexSize> {
|
tamer: Global interners
This is a major change, and I apologize for it all being in one commit. I
had wanted to break it up, but doing so would have required a significant
amount of temporary work that was not worth doing while I'm the only one
working on this project at the moment.
This accomplishes a number of important things, now that I'm preparing to
write the first compiler frontend for TAMER:
1. `Symbol` has been removed; `SymbolId` is used in its place.
2. Consequently, symbols use 16 or 32 bits, rather than a 64-bit pointer.
3. Using symbols no longer requires dereferencing.
4. **Lifetimes no longer pollute the entire system! (`'i`)**
5. Two global interners are offered to produce `SymbolStr` with `'static`
lifetimes, simplfiying lifetime management and borrowing where strings
are still needed.
6. A nice API is provided for interning and lookups (e.g. "foo".intern())
which makes this look like a core feature of Rust.
Unfortunately, making this change required modifications to...virtually
everything. And that serves to emphasize why this change was needed:
_everything_ used symbols, and so there's no use in not providing globals.
I implemented this in a way that still provides for loose coupling through
Rust's trait system. Indeed, Rustc offers a global interner, and I decided
not to go that route initially because it wasn't clear to me that such a
thing was desirable. It didn't become apparent to me, in fact, until the
recent commit where I introduced `SymbolIndexSize` and saw how many things
had to be touched; the linker evolved so rapidly as I was trying to learn
Rust that I lost track of how bad it got.
Further, this shows how the design of the internment system was a bit
naive---I assumed certain requirements that never panned out. In
particular, everything using symbols stored `&'i Symbol<'i>`---that is, a
reference (usize) to an object containing an index (32-bit) and a string
slice (128-bit). So it was a reference to a pretty large value, which was
allocated in the arena alongside the interned string itself.
But, that was assuming that something would need both the symbol index _and_
a readily available string. That's not the case. In fact, it's pretty
clear that interning happens at the beginning of execution, that `SymbolId`
is all that's needed during processing (unless an error occurs; more on that
below); and it's not until _the very end_ that we need to retrieve interned
strings from the pool to write either to a file or to display to the
user. It was horribly wasteful!
So `SymbolId` solves the lifetime issue in itself for most systems, but it
still requires that an interner be available for anything that needs to
create or resolve symbols, which, as it turns out, is still a lot of
things. Therefore, I decided to implement them as thread-local static
variables, which is very similar to what Rustc does itself (Rustc's are
scoped). TAMER does not use threads, so the resulting `'static` lifetime
should be just fine for now. Eventually I'd like to implement `!Send` and
`!Sync`, though, to prevent references from escaping the thread (as noted in
the patch); I can't do that yet, since the feature has not yet been
stabalized.
In the end, this leaves us with a system that's much easier to use and
maintain; hopefully easier for newcomers to get into without having to deal
with so many complex lifetimes; and a nice API that makes it a pleasure to
work with symbols.
Admittedly, the `SymbolIndexSize` adds some complexity, and we'll see if I
end up regretting that down the line, but it exists for an important reason:
the `Span` and other structures that'll be introduced need to pack a lot of
data into 64 bits so they can be freely copied around to keep lifetimes
simple without wreaking havoc in other ways, but a 32-bit symbol size needed
by the linker is too large for that. (Actually, the linker doesn't yet need
32 bits for our systems, but it's going to in the somewhat near future
unless we optimize away a bunch of symbols...but I'd really rather not have
the linker hit a limit that requires a lot of code changes to resolve).
Rustc uses interned spans when they exceed 8 bytes, but I'd prefer to avoid
that for now. Most systems can just use on of the `PkgSymbolId` or
`ProgSymbolId` type aliases and not have to worry about it. Systems that
are actually shared between the compiler and the linker do, though, but it's
not like we don't already have a bunch of trait bounds.
Of course, as we implement link-time optimizations (LTO) in the future, it's
possible most things will need the size and I'll grow frustrated with that
and possibly revisit this. We shall see.
Anyway, this was exhausting...and...onward to the first frontend!
2021-08-02 23:54:37 -04:00
|
|
|
|
/// Intern a string slice or return an existing [`SymbolId`].
|
2021-08-02 23:54:37 -04:00
|
|
|
|
///
|
|
|
|
|
/// If the provided string has already been interned,
|
tamer: Global interners
This is a major change, and I apologize for it all being in one commit. I
had wanted to break it up, but doing so would have required a significant
amount of temporary work that was not worth doing while I'm the only one
working on this project at the moment.
This accomplishes a number of important things, now that I'm preparing to
write the first compiler frontend for TAMER:
1. `Symbol` has been removed; `SymbolId` is used in its place.
2. Consequently, symbols use 16 or 32 bits, rather than a 64-bit pointer.
3. Using symbols no longer requires dereferencing.
4. **Lifetimes no longer pollute the entire system! (`'i`)**
5. Two global interners are offered to produce `SymbolStr` with `'static`
lifetimes, simplfiying lifetime management and borrowing where strings
are still needed.
6. A nice API is provided for interning and lookups (e.g. "foo".intern())
which makes this look like a core feature of Rust.
Unfortunately, making this change required modifications to...virtually
everything. And that serves to emphasize why this change was needed:
_everything_ used symbols, and so there's no use in not providing globals.
I implemented this in a way that still provides for loose coupling through
Rust's trait system. Indeed, Rustc offers a global interner, and I decided
not to go that route initially because it wasn't clear to me that such a
thing was desirable. It didn't become apparent to me, in fact, until the
recent commit where I introduced `SymbolIndexSize` and saw how many things
had to be touched; the linker evolved so rapidly as I was trying to learn
Rust that I lost track of how bad it got.
Further, this shows how the design of the internment system was a bit
naive---I assumed certain requirements that never panned out. In
particular, everything using symbols stored `&'i Symbol<'i>`---that is, a
reference (usize) to an object containing an index (32-bit) and a string
slice (128-bit). So it was a reference to a pretty large value, which was
allocated in the arena alongside the interned string itself.
But, that was assuming that something would need both the symbol index _and_
a readily available string. That's not the case. In fact, it's pretty
clear that interning happens at the beginning of execution, that `SymbolId`
is all that's needed during processing (unless an error occurs; more on that
below); and it's not until _the very end_ that we need to retrieve interned
strings from the pool to write either to a file or to display to the
user. It was horribly wasteful!
So `SymbolId` solves the lifetime issue in itself for most systems, but it
still requires that an interner be available for anything that needs to
create or resolve symbols, which, as it turns out, is still a lot of
things. Therefore, I decided to implement them as thread-local static
variables, which is very similar to what Rustc does itself (Rustc's are
scoped). TAMER does not use threads, so the resulting `'static` lifetime
should be just fine for now. Eventually I'd like to implement `!Send` and
`!Sync`, though, to prevent references from escaping the thread (as noted in
the patch); I can't do that yet, since the feature has not yet been
stabalized.
In the end, this leaves us with a system that's much easier to use and
maintain; hopefully easier for newcomers to get into without having to deal
with so many complex lifetimes; and a nice API that makes it a pleasure to
work with symbols.
Admittedly, the `SymbolIndexSize` adds some complexity, and we'll see if I
end up regretting that down the line, but it exists for an important reason:
the `Span` and other structures that'll be introduced need to pack a lot of
data into 64 bits so they can be freely copied around to keep lifetimes
simple without wreaking havoc in other ways, but a 32-bit symbol size needed
by the linker is too large for that. (Actually, the linker doesn't yet need
32 bits for our systems, but it's going to in the somewhat near future
unless we optimize away a bunch of symbols...but I'd really rather not have
the linker hit a limit that requires a lot of code changes to resolve).
Rustc uses interned spans when they exceed 8 bytes, but I'd prefer to avoid
that for now. Most systems can just use on of the `PkgSymbolId` or
`ProgSymbolId` type aliases and not have to worry about it. Systems that
are actually shared between the compiler and the linker do, though, but it's
not like we don't already have a bunch of trait bounds.
Of course, as we implement link-time optimizations (LTO) in the future, it's
possible most things will need the size and I'll grow frustrated with that
and possibly revisit this. We shall see.
Anyway, this was exhausting...and...onward to the first frontend!
2021-08-02 23:54:37 -04:00
|
|
|
|
/// then an existing [`SymbolId`] will be returned.
|
2021-08-02 23:54:37 -04:00
|
|
|
|
/// Otherwise,
|
tamer: Global interners
This is a major change, and I apologize for it all being in one commit. I
had wanted to break it up, but doing so would have required a significant
amount of temporary work that was not worth doing while I'm the only one
working on this project at the moment.
This accomplishes a number of important things, now that I'm preparing to
write the first compiler frontend for TAMER:
1. `Symbol` has been removed; `SymbolId` is used in its place.
2. Consequently, symbols use 16 or 32 bits, rather than a 64-bit pointer.
3. Using symbols no longer requires dereferencing.
4. **Lifetimes no longer pollute the entire system! (`'i`)**
5. Two global interners are offered to produce `SymbolStr` with `'static`
lifetimes, simplfiying lifetime management and borrowing where strings
are still needed.
6. A nice API is provided for interning and lookups (e.g. "foo".intern())
which makes this look like a core feature of Rust.
Unfortunately, making this change required modifications to...virtually
everything. And that serves to emphasize why this change was needed:
_everything_ used symbols, and so there's no use in not providing globals.
I implemented this in a way that still provides for loose coupling through
Rust's trait system. Indeed, Rustc offers a global interner, and I decided
not to go that route initially because it wasn't clear to me that such a
thing was desirable. It didn't become apparent to me, in fact, until the
recent commit where I introduced `SymbolIndexSize` and saw how many things
had to be touched; the linker evolved so rapidly as I was trying to learn
Rust that I lost track of how bad it got.
Further, this shows how the design of the internment system was a bit
naive---I assumed certain requirements that never panned out. In
particular, everything using symbols stored `&'i Symbol<'i>`---that is, a
reference (usize) to an object containing an index (32-bit) and a string
slice (128-bit). So it was a reference to a pretty large value, which was
allocated in the arena alongside the interned string itself.
But, that was assuming that something would need both the symbol index _and_
a readily available string. That's not the case. In fact, it's pretty
clear that interning happens at the beginning of execution, that `SymbolId`
is all that's needed during processing (unless an error occurs; more on that
below); and it's not until _the very end_ that we need to retrieve interned
strings from the pool to write either to a file or to display to the
user. It was horribly wasteful!
So `SymbolId` solves the lifetime issue in itself for most systems, but it
still requires that an interner be available for anything that needs to
create or resolve symbols, which, as it turns out, is still a lot of
things. Therefore, I decided to implement them as thread-local static
variables, which is very similar to what Rustc does itself (Rustc's are
scoped). TAMER does not use threads, so the resulting `'static` lifetime
should be just fine for now. Eventually I'd like to implement `!Send` and
`!Sync`, though, to prevent references from escaping the thread (as noted in
the patch); I can't do that yet, since the feature has not yet been
stabalized.
In the end, this leaves us with a system that's much easier to use and
maintain; hopefully easier for newcomers to get into without having to deal
with so many complex lifetimes; and a nice API that makes it a pleasure to
work with symbols.
Admittedly, the `SymbolIndexSize` adds some complexity, and we'll see if I
end up regretting that down the line, but it exists for an important reason:
the `Span` and other structures that'll be introduced need to pack a lot of
data into 64 bits so they can be freely copied around to keep lifetimes
simple without wreaking havoc in other ways, but a 32-bit symbol size needed
by the linker is too large for that. (Actually, the linker doesn't yet need
32 bits for our systems, but it's going to in the somewhat near future
unless we optimize away a bunch of symbols...but I'd really rather not have
the linker hit a limit that requires a lot of code changes to resolve).
Rustc uses interned spans when they exceed 8 bytes, but I'd prefer to avoid
that for now. Most systems can just use on of the `PkgSymbolId` or
`ProgSymbolId` type aliases and not have to worry about it. Systems that
are actually shared between the compiler and the linker do, though, but it's
not like we don't already have a bunch of trait bounds.
Of course, as we implement link-time optimizations (LTO) in the future, it's
possible most things will need the size and I'll grow frustrated with that
and possibly revisit this. We shall see.
Anyway, this was exhausting...and...onward to the first frontend!
2021-08-02 23:54:37 -04:00
|
|
|
|
/// the string will be interned and a new [`SymbolId`] allocated.
|
2021-08-02 23:54:37 -04:00
|
|
|
|
///
|
|
|
|
|
/// To retrieve an existing symbol _without_ interning,
|
|
|
|
|
/// see [`intern_soft`](Interner::intern_soft).
|
tamer: Global interners
This is a major change, and I apologize for it all being in one commit. I
had wanted to break it up, but doing so would have required a significant
amount of temporary work that was not worth doing while I'm the only one
working on this project at the moment.
This accomplishes a number of important things, now that I'm preparing to
write the first compiler frontend for TAMER:
1. `Symbol` has been removed; `SymbolId` is used in its place.
2. Consequently, symbols use 16 or 32 bits, rather than a 64-bit pointer.
3. Using symbols no longer requires dereferencing.
4. **Lifetimes no longer pollute the entire system! (`'i`)**
5. Two global interners are offered to produce `SymbolStr` with `'static`
lifetimes, simplfiying lifetime management and borrowing where strings
are still needed.
6. A nice API is provided for interning and lookups (e.g. "foo".intern())
which makes this look like a core feature of Rust.
Unfortunately, making this change required modifications to...virtually
everything. And that serves to emphasize why this change was needed:
_everything_ used symbols, and so there's no use in not providing globals.
I implemented this in a way that still provides for loose coupling through
Rust's trait system. Indeed, Rustc offers a global interner, and I decided
not to go that route initially because it wasn't clear to me that such a
thing was desirable. It didn't become apparent to me, in fact, until the
recent commit where I introduced `SymbolIndexSize` and saw how many things
had to be touched; the linker evolved so rapidly as I was trying to learn
Rust that I lost track of how bad it got.
Further, this shows how the design of the internment system was a bit
naive---I assumed certain requirements that never panned out. In
particular, everything using symbols stored `&'i Symbol<'i>`---that is, a
reference (usize) to an object containing an index (32-bit) and a string
slice (128-bit). So it was a reference to a pretty large value, which was
allocated in the arena alongside the interned string itself.
But, that was assuming that something would need both the symbol index _and_
a readily available string. That's not the case. In fact, it's pretty
clear that interning happens at the beginning of execution, that `SymbolId`
is all that's needed during processing (unless an error occurs; more on that
below); and it's not until _the very end_ that we need to retrieve interned
strings from the pool to write either to a file or to display to the
user. It was horribly wasteful!
So `SymbolId` solves the lifetime issue in itself for most systems, but it
still requires that an interner be available for anything that needs to
create or resolve symbols, which, as it turns out, is still a lot of
things. Therefore, I decided to implement them as thread-local static
variables, which is very similar to what Rustc does itself (Rustc's are
scoped). TAMER does not use threads, so the resulting `'static` lifetime
should be just fine for now. Eventually I'd like to implement `!Send` and
`!Sync`, though, to prevent references from escaping the thread (as noted in
the patch); I can't do that yet, since the feature has not yet been
stabalized.
In the end, this leaves us with a system that's much easier to use and
maintain; hopefully easier for newcomers to get into without having to deal
with so many complex lifetimes; and a nice API that makes it a pleasure to
work with symbols.
Admittedly, the `SymbolIndexSize` adds some complexity, and we'll see if I
end up regretting that down the line, but it exists for an important reason:
the `Span` and other structures that'll be introduced need to pack a lot of
data into 64 bits so they can be freely copied around to keep lifetimes
simple without wreaking havoc in other ways, but a 32-bit symbol size needed
by the linker is too large for that. (Actually, the linker doesn't yet need
32 bits for our systems, but it's going to in the somewhat near future
unless we optimize away a bunch of symbols...but I'd really rather not have
the linker hit a limit that requires a lot of code changes to resolve).
Rustc uses interned spans when they exceed 8 bytes, but I'd prefer to avoid
that for now. Most systems can just use on of the `PkgSymbolId` or
`ProgSymbolId` type aliases and not have to worry about it. Systems that
are actually shared between the compiler and the linker do, though, but it's
not like we don't already have a bunch of trait bounds.
Of course, as we implement link-time optimizations (LTO) in the future, it's
possible most things will need the size and I'll grow frustrated with that
and possibly revisit this. We shall see.
Anyway, this was exhausting...and...onward to the first frontend!
2021-08-02 23:54:37 -04:00
|
|
|
|
fn intern(&self, value: &str) -> SymbolId<Ix>;
|
2021-08-02 23:54:37 -04:00
|
|
|
|
|
tamer: Global interners
This is a major change, and I apologize for it all being in one commit. I
had wanted to break it up, but doing so would have required a significant
amount of temporary work that was not worth doing while I'm the only one
working on this project at the moment.
This accomplishes a number of important things, now that I'm preparing to
write the first compiler frontend for TAMER:
1. `Symbol` has been removed; `SymbolId` is used in its place.
2. Consequently, symbols use 16 or 32 bits, rather than a 64-bit pointer.
3. Using symbols no longer requires dereferencing.
4. **Lifetimes no longer pollute the entire system! (`'i`)**
5. Two global interners are offered to produce `SymbolStr` with `'static`
lifetimes, simplfiying lifetime management and borrowing where strings
are still needed.
6. A nice API is provided for interning and lookups (e.g. "foo".intern())
which makes this look like a core feature of Rust.
Unfortunately, making this change required modifications to...virtually
everything. And that serves to emphasize why this change was needed:
_everything_ used symbols, and so there's no use in not providing globals.
I implemented this in a way that still provides for loose coupling through
Rust's trait system. Indeed, Rustc offers a global interner, and I decided
not to go that route initially because it wasn't clear to me that such a
thing was desirable. It didn't become apparent to me, in fact, until the
recent commit where I introduced `SymbolIndexSize` and saw how many things
had to be touched; the linker evolved so rapidly as I was trying to learn
Rust that I lost track of how bad it got.
Further, this shows how the design of the internment system was a bit
naive---I assumed certain requirements that never panned out. In
particular, everything using symbols stored `&'i Symbol<'i>`---that is, a
reference (usize) to an object containing an index (32-bit) and a string
slice (128-bit). So it was a reference to a pretty large value, which was
allocated in the arena alongside the interned string itself.
But, that was assuming that something would need both the symbol index _and_
a readily available string. That's not the case. In fact, it's pretty
clear that interning happens at the beginning of execution, that `SymbolId`
is all that's needed during processing (unless an error occurs; more on that
below); and it's not until _the very end_ that we need to retrieve interned
strings from the pool to write either to a file or to display to the
user. It was horribly wasteful!
So `SymbolId` solves the lifetime issue in itself for most systems, but it
still requires that an interner be available for anything that needs to
create or resolve symbols, which, as it turns out, is still a lot of
things. Therefore, I decided to implement them as thread-local static
variables, which is very similar to what Rustc does itself (Rustc's are
scoped). TAMER does not use threads, so the resulting `'static` lifetime
should be just fine for now. Eventually I'd like to implement `!Send` and
`!Sync`, though, to prevent references from escaping the thread (as noted in
the patch); I can't do that yet, since the feature has not yet been
stabalized.
In the end, this leaves us with a system that's much easier to use and
maintain; hopefully easier for newcomers to get into without having to deal
with so many complex lifetimes; and a nice API that makes it a pleasure to
work with symbols.
Admittedly, the `SymbolIndexSize` adds some complexity, and we'll see if I
end up regretting that down the line, but it exists for an important reason:
the `Span` and other structures that'll be introduced need to pack a lot of
data into 64 bits so they can be freely copied around to keep lifetimes
simple without wreaking havoc in other ways, but a 32-bit symbol size needed
by the linker is too large for that. (Actually, the linker doesn't yet need
32 bits for our systems, but it's going to in the somewhat near future
unless we optimize away a bunch of symbols...but I'd really rather not have
the linker hit a limit that requires a lot of code changes to resolve).
Rustc uses interned spans when they exceed 8 bytes, but I'd prefer to avoid
that for now. Most systems can just use on of the `PkgSymbolId` or
`ProgSymbolId` type aliases and not have to worry about it. Systems that
are actually shared between the compiler and the linker do, though, but it's
not like we don't already have a bunch of trait bounds.
Of course, as we implement link-time optimizations (LTO) in the future, it's
possible most things will need the size and I'll grow frustrated with that
and possibly revisit this. We shall see.
Anyway, this was exhausting...and...onward to the first frontend!
2021-08-02 23:54:37 -04:00
|
|
|
|
/// Retrieve an existing intern for the provided string slice.
|
2021-08-02 23:54:37 -04:00
|
|
|
|
///
|
|
|
|
|
/// Unlike [`intern`](Interner::intern),
|
|
|
|
|
/// this will _not_ intern the string if it has not already been
|
|
|
|
|
/// interned.
|
tamer: Global interners
This is a major change, and I apologize for it all being in one commit. I
had wanted to break it up, but doing so would have required a significant
amount of temporary work that was not worth doing while I'm the only one
working on this project at the moment.
This accomplishes a number of important things, now that I'm preparing to
write the first compiler frontend for TAMER:
1. `Symbol` has been removed; `SymbolId` is used in its place.
2. Consequently, symbols use 16 or 32 bits, rather than a 64-bit pointer.
3. Using symbols no longer requires dereferencing.
4. **Lifetimes no longer pollute the entire system! (`'i`)**
5. Two global interners are offered to produce `SymbolStr` with `'static`
lifetimes, simplfiying lifetime management and borrowing where strings
are still needed.
6. A nice API is provided for interning and lookups (e.g. "foo".intern())
which makes this look like a core feature of Rust.
Unfortunately, making this change required modifications to...virtually
everything. And that serves to emphasize why this change was needed:
_everything_ used symbols, and so there's no use in not providing globals.
I implemented this in a way that still provides for loose coupling through
Rust's trait system. Indeed, Rustc offers a global interner, and I decided
not to go that route initially because it wasn't clear to me that such a
thing was desirable. It didn't become apparent to me, in fact, until the
recent commit where I introduced `SymbolIndexSize` and saw how many things
had to be touched; the linker evolved so rapidly as I was trying to learn
Rust that I lost track of how bad it got.
Further, this shows how the design of the internment system was a bit
naive---I assumed certain requirements that never panned out. In
particular, everything using symbols stored `&'i Symbol<'i>`---that is, a
reference (usize) to an object containing an index (32-bit) and a string
slice (128-bit). So it was a reference to a pretty large value, which was
allocated in the arena alongside the interned string itself.
But, that was assuming that something would need both the symbol index _and_
a readily available string. That's not the case. In fact, it's pretty
clear that interning happens at the beginning of execution, that `SymbolId`
is all that's needed during processing (unless an error occurs; more on that
below); and it's not until _the very end_ that we need to retrieve interned
strings from the pool to write either to a file or to display to the
user. It was horribly wasteful!
So `SymbolId` solves the lifetime issue in itself for most systems, but it
still requires that an interner be available for anything that needs to
create or resolve symbols, which, as it turns out, is still a lot of
things. Therefore, I decided to implement them as thread-local static
variables, which is very similar to what Rustc does itself (Rustc's are
scoped). TAMER does not use threads, so the resulting `'static` lifetime
should be just fine for now. Eventually I'd like to implement `!Send` and
`!Sync`, though, to prevent references from escaping the thread (as noted in
the patch); I can't do that yet, since the feature has not yet been
stabalized.
In the end, this leaves us with a system that's much easier to use and
maintain; hopefully easier for newcomers to get into without having to deal
with so many complex lifetimes; and a nice API that makes it a pleasure to
work with symbols.
Admittedly, the `SymbolIndexSize` adds some complexity, and we'll see if I
end up regretting that down the line, but it exists for an important reason:
the `Span` and other structures that'll be introduced need to pack a lot of
data into 64 bits so they can be freely copied around to keep lifetimes
simple without wreaking havoc in other ways, but a 32-bit symbol size needed
by the linker is too large for that. (Actually, the linker doesn't yet need
32 bits for our systems, but it's going to in the somewhat near future
unless we optimize away a bunch of symbols...but I'd really rather not have
the linker hit a limit that requires a lot of code changes to resolve).
Rustc uses interned spans when they exceed 8 bytes, but I'd prefer to avoid
that for now. Most systems can just use on of the `PkgSymbolId` or
`ProgSymbolId` type aliases and not have to worry about it. Systems that
are actually shared between the compiler and the linker do, though, but it's
not like we don't already have a bunch of trait bounds.
Of course, as we implement link-time optimizations (LTO) in the future, it's
possible most things will need the size and I'll grow frustrated with that
and possibly revisit this. We shall see.
Anyway, this was exhausting...and...onward to the first frontend!
2021-08-02 23:54:37 -04:00
|
|
|
|
fn intern_soft(&self, value: &str) -> Option<SymbolId<Ix>>;
|
2021-08-02 23:54:37 -04:00
|
|
|
|
|
|
|
|
|
/// Determine whether the given value has already been interned.
|
tamer: Global interners
This is a major change, and I apologize for it all being in one commit. I
had wanted to break it up, but doing so would have required a significant
amount of temporary work that was not worth doing while I'm the only one
working on this project at the moment.
This accomplishes a number of important things, now that I'm preparing to
write the first compiler frontend for TAMER:
1. `Symbol` has been removed; `SymbolId` is used in its place.
2. Consequently, symbols use 16 or 32 bits, rather than a 64-bit pointer.
3. Using symbols no longer requires dereferencing.
4. **Lifetimes no longer pollute the entire system! (`'i`)**
5. Two global interners are offered to produce `SymbolStr` with `'static`
lifetimes, simplfiying lifetime management and borrowing where strings
are still needed.
6. A nice API is provided for interning and lookups (e.g. "foo".intern())
which makes this look like a core feature of Rust.
Unfortunately, making this change required modifications to...virtually
everything. And that serves to emphasize why this change was needed:
_everything_ used symbols, and so there's no use in not providing globals.
I implemented this in a way that still provides for loose coupling through
Rust's trait system. Indeed, Rustc offers a global interner, and I decided
not to go that route initially because it wasn't clear to me that such a
thing was desirable. It didn't become apparent to me, in fact, until the
recent commit where I introduced `SymbolIndexSize` and saw how many things
had to be touched; the linker evolved so rapidly as I was trying to learn
Rust that I lost track of how bad it got.
Further, this shows how the design of the internment system was a bit
naive---I assumed certain requirements that never panned out. In
particular, everything using symbols stored `&'i Symbol<'i>`---that is, a
reference (usize) to an object containing an index (32-bit) and a string
slice (128-bit). So it was a reference to a pretty large value, which was
allocated in the arena alongside the interned string itself.
But, that was assuming that something would need both the symbol index _and_
a readily available string. That's not the case. In fact, it's pretty
clear that interning happens at the beginning of execution, that `SymbolId`
is all that's needed during processing (unless an error occurs; more on that
below); and it's not until _the very end_ that we need to retrieve interned
strings from the pool to write either to a file or to display to the
user. It was horribly wasteful!
So `SymbolId` solves the lifetime issue in itself for most systems, but it
still requires that an interner be available for anything that needs to
create or resolve symbols, which, as it turns out, is still a lot of
things. Therefore, I decided to implement them as thread-local static
variables, which is very similar to what Rustc does itself (Rustc's are
scoped). TAMER does not use threads, so the resulting `'static` lifetime
should be just fine for now. Eventually I'd like to implement `!Send` and
`!Sync`, though, to prevent references from escaping the thread (as noted in
the patch); I can't do that yet, since the feature has not yet been
stabalized.
In the end, this leaves us with a system that's much easier to use and
maintain; hopefully easier for newcomers to get into without having to deal
with so many complex lifetimes; and a nice API that makes it a pleasure to
work with symbols.
Admittedly, the `SymbolIndexSize` adds some complexity, and we'll see if I
end up regretting that down the line, but it exists for an important reason:
the `Span` and other structures that'll be introduced need to pack a lot of
data into 64 bits so they can be freely copied around to keep lifetimes
simple without wreaking havoc in other ways, but a 32-bit symbol size needed
by the linker is too large for that. (Actually, the linker doesn't yet need
32 bits for our systems, but it's going to in the somewhat near future
unless we optimize away a bunch of symbols...but I'd really rather not have
the linker hit a limit that requires a lot of code changes to resolve).
Rustc uses interned spans when they exceed 8 bytes, but I'd prefer to avoid
that for now. Most systems can just use on of the `PkgSymbolId` or
`ProgSymbolId` type aliases and not have to worry about it. Systems that
are actually shared between the compiler and the linker do, though, but it's
not like we don't already have a bunch of trait bounds.
Of course, as we implement link-time optimizations (LTO) in the future, it's
possible most things will need the size and I'll grow frustrated with that
and possibly revisit this. We shall see.
Anyway, this was exhausting...and...onward to the first frontend!
2021-08-02 23:54:37 -04:00
|
|
|
|
///
|
|
|
|
|
/// This is equivalent to `intern_soft(value).is_some()`.
|
2021-08-02 23:54:37 -04:00
|
|
|
|
fn contains(&self, value: &str) -> bool;
|
|
|
|
|
|
tamer: Global interners
This is a major change, and I apologize for it all being in one commit. I
had wanted to break it up, but doing so would have required a significant
amount of temporary work that was not worth doing while I'm the only one
working on this project at the moment.
This accomplishes a number of important things, now that I'm preparing to
write the first compiler frontend for TAMER:
1. `Symbol` has been removed; `SymbolId` is used in its place.
2. Consequently, symbols use 16 or 32 bits, rather than a 64-bit pointer.
3. Using symbols no longer requires dereferencing.
4. **Lifetimes no longer pollute the entire system! (`'i`)**
5. Two global interners are offered to produce `SymbolStr` with `'static`
lifetimes, simplfiying lifetime management and borrowing where strings
are still needed.
6. A nice API is provided for interning and lookups (e.g. "foo".intern())
which makes this look like a core feature of Rust.
Unfortunately, making this change required modifications to...virtually
everything. And that serves to emphasize why this change was needed:
_everything_ used symbols, and so there's no use in not providing globals.
I implemented this in a way that still provides for loose coupling through
Rust's trait system. Indeed, Rustc offers a global interner, and I decided
not to go that route initially because it wasn't clear to me that such a
thing was desirable. It didn't become apparent to me, in fact, until the
recent commit where I introduced `SymbolIndexSize` and saw how many things
had to be touched; the linker evolved so rapidly as I was trying to learn
Rust that I lost track of how bad it got.
Further, this shows how the design of the internment system was a bit
naive---I assumed certain requirements that never panned out. In
particular, everything using symbols stored `&'i Symbol<'i>`---that is, a
reference (usize) to an object containing an index (32-bit) and a string
slice (128-bit). So it was a reference to a pretty large value, which was
allocated in the arena alongside the interned string itself.
But, that was assuming that something would need both the symbol index _and_
a readily available string. That's not the case. In fact, it's pretty
clear that interning happens at the beginning of execution, that `SymbolId`
is all that's needed during processing (unless an error occurs; more on that
below); and it's not until _the very end_ that we need to retrieve interned
strings from the pool to write either to a file or to display to the
user. It was horribly wasteful!
So `SymbolId` solves the lifetime issue in itself for most systems, but it
still requires that an interner be available for anything that needs to
create or resolve symbols, which, as it turns out, is still a lot of
things. Therefore, I decided to implement them as thread-local static
variables, which is very similar to what Rustc does itself (Rustc's are
scoped). TAMER does not use threads, so the resulting `'static` lifetime
should be just fine for now. Eventually I'd like to implement `!Send` and
`!Sync`, though, to prevent references from escaping the thread (as noted in
the patch); I can't do that yet, since the feature has not yet been
stabalized.
In the end, this leaves us with a system that's much easier to use and
maintain; hopefully easier for newcomers to get into without having to deal
with so many complex lifetimes; and a nice API that makes it a pleasure to
work with symbols.
Admittedly, the `SymbolIndexSize` adds some complexity, and we'll see if I
end up regretting that down the line, but it exists for an important reason:
the `Span` and other structures that'll be introduced need to pack a lot of
data into 64 bits so they can be freely copied around to keep lifetimes
simple without wreaking havoc in other ways, but a 32-bit symbol size needed
by the linker is too large for that. (Actually, the linker doesn't yet need
32 bits for our systems, but it's going to in the somewhat near future
unless we optimize away a bunch of symbols...but I'd really rather not have
the linker hit a limit that requires a lot of code changes to resolve).
Rustc uses interned spans when they exceed 8 bytes, but I'd prefer to avoid
that for now. Most systems can just use on of the `PkgSymbolId` or
`ProgSymbolId` type aliases and not have to worry about it. Systems that
are actually shared between the compiler and the linker do, though, but it's
not like we don't already have a bunch of trait bounds.
Of course, as we implement link-time optimizations (LTO) in the future, it's
possible most things will need the size and I'll grow frustrated with that
and possibly revisit this. We shall see.
Anyway, this was exhausting...and...onward to the first frontend!
2021-08-02 23:54:37 -04:00
|
|
|
|
/// Number of interned strings in this interner's pool.
|
2021-08-02 23:54:37 -04:00
|
|
|
|
///
|
|
|
|
|
/// This count will increase each time a unique string is interned.
|
|
|
|
|
/// It does not increase when a string is already interned.
|
|
|
|
|
fn len(&self) -> usize;
|
|
|
|
|
|
tamer: Global interners
This is a major change, and I apologize for it all being in one commit. I
had wanted to break it up, but doing so would have required a significant
amount of temporary work that was not worth doing while I'm the only one
working on this project at the moment.
This accomplishes a number of important things, now that I'm preparing to
write the first compiler frontend for TAMER:
1. `Symbol` has been removed; `SymbolId` is used in its place.
2. Consequently, symbols use 16 or 32 bits, rather than a 64-bit pointer.
3. Using symbols no longer requires dereferencing.
4. **Lifetimes no longer pollute the entire system! (`'i`)**
5. Two global interners are offered to produce `SymbolStr` with `'static`
lifetimes, simplfiying lifetime management and borrowing where strings
are still needed.
6. A nice API is provided for interning and lookups (e.g. "foo".intern())
which makes this look like a core feature of Rust.
Unfortunately, making this change required modifications to...virtually
everything. And that serves to emphasize why this change was needed:
_everything_ used symbols, and so there's no use in not providing globals.
I implemented this in a way that still provides for loose coupling through
Rust's trait system. Indeed, Rustc offers a global interner, and I decided
not to go that route initially because it wasn't clear to me that such a
thing was desirable. It didn't become apparent to me, in fact, until the
recent commit where I introduced `SymbolIndexSize` and saw how many things
had to be touched; the linker evolved so rapidly as I was trying to learn
Rust that I lost track of how bad it got.
Further, this shows how the design of the internment system was a bit
naive---I assumed certain requirements that never panned out. In
particular, everything using symbols stored `&'i Symbol<'i>`---that is, a
reference (usize) to an object containing an index (32-bit) and a string
slice (128-bit). So it was a reference to a pretty large value, which was
allocated in the arena alongside the interned string itself.
But, that was assuming that something would need both the symbol index _and_
a readily available string. That's not the case. In fact, it's pretty
clear that interning happens at the beginning of execution, that `SymbolId`
is all that's needed during processing (unless an error occurs; more on that
below); and it's not until _the very end_ that we need to retrieve interned
strings from the pool to write either to a file or to display to the
user. It was horribly wasteful!
So `SymbolId` solves the lifetime issue in itself for most systems, but it
still requires that an interner be available for anything that needs to
create or resolve symbols, which, as it turns out, is still a lot of
things. Therefore, I decided to implement them as thread-local static
variables, which is very similar to what Rustc does itself (Rustc's are
scoped). TAMER does not use threads, so the resulting `'static` lifetime
should be just fine for now. Eventually I'd like to implement `!Send` and
`!Sync`, though, to prevent references from escaping the thread (as noted in
the patch); I can't do that yet, since the feature has not yet been
stabalized.
In the end, this leaves us with a system that's much easier to use and
maintain; hopefully easier for newcomers to get into without having to deal
with so many complex lifetimes; and a nice API that makes it a pleasure to
work with symbols.
Admittedly, the `SymbolIndexSize` adds some complexity, and we'll see if I
end up regretting that down the line, but it exists for an important reason:
the `Span` and other structures that'll be introduced need to pack a lot of
data into 64 bits so they can be freely copied around to keep lifetimes
simple without wreaking havoc in other ways, but a 32-bit symbol size needed
by the linker is too large for that. (Actually, the linker doesn't yet need
32 bits for our systems, but it's going to in the somewhat near future
unless we optimize away a bunch of symbols...but I'd really rather not have
the linker hit a limit that requires a lot of code changes to resolve).
Rustc uses interned spans when they exceed 8 bytes, but I'd prefer to avoid
that for now. Most systems can just use on of the `PkgSymbolId` or
`ProgSymbolId` type aliases and not have to worry about it. Systems that
are actually shared between the compiler and the linker do, though, but it's
not like we don't already have a bunch of trait bounds.
Of course, as we implement link-time optimizations (LTO) in the future, it's
possible most things will need the size and I'll grow frustrated with that
and possibly revisit this. We shall see.
Anyway, this was exhausting...and...onward to the first frontend!
2021-08-02 23:54:37 -04:00
|
|
|
|
/// Look up a symbol's string value by its [`SymbolId`].
|
2021-08-02 23:54:37 -04:00
|
|
|
|
///
|
tamer: Global interners
This is a major change, and I apologize for it all being in one commit. I
had wanted to break it up, but doing so would have required a significant
amount of temporary work that was not worth doing while I'm the only one
working on this project at the moment.
This accomplishes a number of important things, now that I'm preparing to
write the first compiler frontend for TAMER:
1. `Symbol` has been removed; `SymbolId` is used in its place.
2. Consequently, symbols use 16 or 32 bits, rather than a 64-bit pointer.
3. Using symbols no longer requires dereferencing.
4. **Lifetimes no longer pollute the entire system! (`'i`)**
5. Two global interners are offered to produce `SymbolStr` with `'static`
lifetimes, simplfiying lifetime management and borrowing where strings
are still needed.
6. A nice API is provided for interning and lookups (e.g. "foo".intern())
which makes this look like a core feature of Rust.
Unfortunately, making this change required modifications to...virtually
everything. And that serves to emphasize why this change was needed:
_everything_ used symbols, and so there's no use in not providing globals.
I implemented this in a way that still provides for loose coupling through
Rust's trait system. Indeed, Rustc offers a global interner, and I decided
not to go that route initially because it wasn't clear to me that such a
thing was desirable. It didn't become apparent to me, in fact, until the
recent commit where I introduced `SymbolIndexSize` and saw how many things
had to be touched; the linker evolved so rapidly as I was trying to learn
Rust that I lost track of how bad it got.
Further, this shows how the design of the internment system was a bit
naive---I assumed certain requirements that never panned out. In
particular, everything using symbols stored `&'i Symbol<'i>`---that is, a
reference (usize) to an object containing an index (32-bit) and a string
slice (128-bit). So it was a reference to a pretty large value, which was
allocated in the arena alongside the interned string itself.
But, that was assuming that something would need both the symbol index _and_
a readily available string. That's not the case. In fact, it's pretty
clear that interning happens at the beginning of execution, that `SymbolId`
is all that's needed during processing (unless an error occurs; more on that
below); and it's not until _the very end_ that we need to retrieve interned
strings from the pool to write either to a file or to display to the
user. It was horribly wasteful!
So `SymbolId` solves the lifetime issue in itself for most systems, but it
still requires that an interner be available for anything that needs to
create or resolve symbols, which, as it turns out, is still a lot of
things. Therefore, I decided to implement them as thread-local static
variables, which is very similar to what Rustc does itself (Rustc's are
scoped). TAMER does not use threads, so the resulting `'static` lifetime
should be just fine for now. Eventually I'd like to implement `!Send` and
`!Sync`, though, to prevent references from escaping the thread (as noted in
the patch); I can't do that yet, since the feature has not yet been
stabalized.
In the end, this leaves us with a system that's much easier to use and
maintain; hopefully easier for newcomers to get into without having to deal
with so many complex lifetimes; and a nice API that makes it a pleasure to
work with symbols.
Admittedly, the `SymbolIndexSize` adds some complexity, and we'll see if I
end up regretting that down the line, but it exists for an important reason:
the `Span` and other structures that'll be introduced need to pack a lot of
data into 64 bits so they can be freely copied around to keep lifetimes
simple without wreaking havoc in other ways, but a 32-bit symbol size needed
by the linker is too large for that. (Actually, the linker doesn't yet need
32 bits for our systems, but it's going to in the somewhat near future
unless we optimize away a bunch of symbols...but I'd really rather not have
the linker hit a limit that requires a lot of code changes to resolve).
Rustc uses interned spans when they exceed 8 bytes, but I'd prefer to avoid
that for now. Most systems can just use on of the `PkgSymbolId` or
`ProgSymbolId` type aliases and not have to worry about it. Systems that
are actually shared between the compiler and the linker do, though, but it's
not like we don't already have a bunch of trait bounds.
Of course, as we implement link-time optimizations (LTO) in the future, it's
possible most things will need the size and I'll grow frustrated with that
and possibly revisit this. We shall see.
Anyway, this was exhausting...and...onward to the first frontend!
2021-08-02 23:54:37 -04:00
|
|
|
|
/// This will always return a [`SymbolStr`] as long as the provided
|
|
|
|
|
/// `index` represents a symbol interned with this interner.
|
2021-08-02 23:54:37 -04:00
|
|
|
|
/// If the index is not found,
|
|
|
|
|
/// the result is [`None`].
|
tamer: Global interners
This is a major change, and I apologize for it all being in one commit. I
had wanted to break it up, but doing so would have required a significant
amount of temporary work that was not worth doing while I'm the only one
working on this project at the moment.
This accomplishes a number of important things, now that I'm preparing to
write the first compiler frontend for TAMER:
1. `Symbol` has been removed; `SymbolId` is used in its place.
2. Consequently, symbols use 16 or 32 bits, rather than a 64-bit pointer.
3. Using symbols no longer requires dereferencing.
4. **Lifetimes no longer pollute the entire system! (`'i`)**
5. Two global interners are offered to produce `SymbolStr` with `'static`
lifetimes, simplfiying lifetime management and borrowing where strings
are still needed.
6. A nice API is provided for interning and lookups (e.g. "foo".intern())
which makes this look like a core feature of Rust.
Unfortunately, making this change required modifications to...virtually
everything. And that serves to emphasize why this change was needed:
_everything_ used symbols, and so there's no use in not providing globals.
I implemented this in a way that still provides for loose coupling through
Rust's trait system. Indeed, Rustc offers a global interner, and I decided
not to go that route initially because it wasn't clear to me that such a
thing was desirable. It didn't become apparent to me, in fact, until the
recent commit where I introduced `SymbolIndexSize` and saw how many things
had to be touched; the linker evolved so rapidly as I was trying to learn
Rust that I lost track of how bad it got.
Further, this shows how the design of the internment system was a bit
naive---I assumed certain requirements that never panned out. In
particular, everything using symbols stored `&'i Symbol<'i>`---that is, a
reference (usize) to an object containing an index (32-bit) and a string
slice (128-bit). So it was a reference to a pretty large value, which was
allocated in the arena alongside the interned string itself.
But, that was assuming that something would need both the symbol index _and_
a readily available string. That's not the case. In fact, it's pretty
clear that interning happens at the beginning of execution, that `SymbolId`
is all that's needed during processing (unless an error occurs; more on that
below); and it's not until _the very end_ that we need to retrieve interned
strings from the pool to write either to a file or to display to the
user. It was horribly wasteful!
So `SymbolId` solves the lifetime issue in itself for most systems, but it
still requires that an interner be available for anything that needs to
create or resolve symbols, which, as it turns out, is still a lot of
things. Therefore, I decided to implement them as thread-local static
variables, which is very similar to what Rustc does itself (Rustc's are
scoped). TAMER does not use threads, so the resulting `'static` lifetime
should be just fine for now. Eventually I'd like to implement `!Send` and
`!Sync`, though, to prevent references from escaping the thread (as noted in
the patch); I can't do that yet, since the feature has not yet been
stabalized.
In the end, this leaves us with a system that's much easier to use and
maintain; hopefully easier for newcomers to get into without having to deal
with so many complex lifetimes; and a nice API that makes it a pleasure to
work with symbols.
Admittedly, the `SymbolIndexSize` adds some complexity, and we'll see if I
end up regretting that down the line, but it exists for an important reason:
the `Span` and other structures that'll be introduced need to pack a lot of
data into 64 bits so they can be freely copied around to keep lifetimes
simple without wreaking havoc in other ways, but a 32-bit symbol size needed
by the linker is too large for that. (Actually, the linker doesn't yet need
32 bits for our systems, but it's going to in the somewhat near future
unless we optimize away a bunch of symbols...but I'd really rather not have
the linker hit a limit that requires a lot of code changes to resolve).
Rustc uses interned spans when they exceed 8 bytes, but I'd prefer to avoid
that for now. Most systems can just use on of the `PkgSymbolId` or
`ProgSymbolId` type aliases and not have to worry about it. Systems that
are actually shared between the compiler and the linker do, though, but it's
not like we don't already have a bunch of trait bounds.
Of course, as we implement link-time optimizations (LTO) in the future, it's
possible most things will need the size and I'll grow frustrated with that
and possibly revisit this. We shall see.
Anyway, this was exhausting...and...onward to the first frontend!
2021-08-02 23:54:37 -04:00
|
|
|
|
fn index_lookup(&'i self, index: SymbolId<Ix>) -> Option<SymbolStr<'i>>;
|
2021-08-02 23:54:37 -04:00
|
|
|
|
|
|
|
|
|
/// Intern an assumed-UTF8 slice of bytes or return an existing
|
tamer: Global interners
This is a major change, and I apologize for it all being in one commit. I
had wanted to break it up, but doing so would have required a significant
amount of temporary work that was not worth doing while I'm the only one
working on this project at the moment.
This accomplishes a number of important things, now that I'm preparing to
write the first compiler frontend for TAMER:
1. `Symbol` has been removed; `SymbolId` is used in its place.
2. Consequently, symbols use 16 or 32 bits, rather than a 64-bit pointer.
3. Using symbols no longer requires dereferencing.
4. **Lifetimes no longer pollute the entire system! (`'i`)**
5. Two global interners are offered to produce `SymbolStr` with `'static`
lifetimes, simplfiying lifetime management and borrowing where strings
are still needed.
6. A nice API is provided for interning and lookups (e.g. "foo".intern())
which makes this look like a core feature of Rust.
Unfortunately, making this change required modifications to...virtually
everything. And that serves to emphasize why this change was needed:
_everything_ used symbols, and so there's no use in not providing globals.
I implemented this in a way that still provides for loose coupling through
Rust's trait system. Indeed, Rustc offers a global interner, and I decided
not to go that route initially because it wasn't clear to me that such a
thing was desirable. It didn't become apparent to me, in fact, until the
recent commit where I introduced `SymbolIndexSize` and saw how many things
had to be touched; the linker evolved so rapidly as I was trying to learn
Rust that I lost track of how bad it got.
Further, this shows how the design of the internment system was a bit
naive---I assumed certain requirements that never panned out. In
particular, everything using symbols stored `&'i Symbol<'i>`---that is, a
reference (usize) to an object containing an index (32-bit) and a string
slice (128-bit). So it was a reference to a pretty large value, which was
allocated in the arena alongside the interned string itself.
But, that was assuming that something would need both the symbol index _and_
a readily available string. That's not the case. In fact, it's pretty
clear that interning happens at the beginning of execution, that `SymbolId`
is all that's needed during processing (unless an error occurs; more on that
below); and it's not until _the very end_ that we need to retrieve interned
strings from the pool to write either to a file or to display to the
user. It was horribly wasteful!
So `SymbolId` solves the lifetime issue in itself for most systems, but it
still requires that an interner be available for anything that needs to
create or resolve symbols, which, as it turns out, is still a lot of
things. Therefore, I decided to implement them as thread-local static
variables, which is very similar to what Rustc does itself (Rustc's are
scoped). TAMER does not use threads, so the resulting `'static` lifetime
should be just fine for now. Eventually I'd like to implement `!Send` and
`!Sync`, though, to prevent references from escaping the thread (as noted in
the patch); I can't do that yet, since the feature has not yet been
stabalized.
In the end, this leaves us with a system that's much easier to use and
maintain; hopefully easier for newcomers to get into without having to deal
with so many complex lifetimes; and a nice API that makes it a pleasure to
work with symbols.
Admittedly, the `SymbolIndexSize` adds some complexity, and we'll see if I
end up regretting that down the line, but it exists for an important reason:
the `Span` and other structures that'll be introduced need to pack a lot of
data into 64 bits so they can be freely copied around to keep lifetimes
simple without wreaking havoc in other ways, but a 32-bit symbol size needed
by the linker is too large for that. (Actually, the linker doesn't yet need
32 bits for our systems, but it's going to in the somewhat near future
unless we optimize away a bunch of symbols...but I'd really rather not have
the linker hit a limit that requires a lot of code changes to resolve).
Rustc uses interned spans when they exceed 8 bytes, but I'd prefer to avoid
that for now. Most systems can just use on of the `PkgSymbolId` or
`ProgSymbolId` type aliases and not have to worry about it. Systems that
are actually shared between the compiler and the linker do, though, but it's
not like we don't already have a bunch of trait bounds.
Of course, as we implement link-time optimizations (LTO) in the future, it's
possible most things will need the size and I'll grow frustrated with that
and possibly revisit this. We shall see.
Anyway, this was exhausting...and...onward to the first frontend!
2021-08-02 23:54:37 -04:00
|
|
|
|
/// [`SymbolId`].
|
2021-08-02 23:54:37 -04:00
|
|
|
|
///
|
|
|
|
|
/// Safety
|
|
|
|
|
/// ======
|
|
|
|
|
/// This function is unsafe because it uses
|
|
|
|
|
/// [`std::str::from_utf8_unchecked`].
|
|
|
|
|
/// It is provided for convenience when interning from trusted binary
|
|
|
|
|
/// data
|
|
|
|
|
/// (such as [object files][]).
|
|
|
|
|
///
|
|
|
|
|
/// [object files]: crate::obj
|
tamer: Global interners
This is a major change, and I apologize for it all being in one commit. I
had wanted to break it up, but doing so would have required a significant
amount of temporary work that was not worth doing while I'm the only one
working on this project at the moment.
This accomplishes a number of important things, now that I'm preparing to
write the first compiler frontend for TAMER:
1. `Symbol` has been removed; `SymbolId` is used in its place.
2. Consequently, symbols use 16 or 32 bits, rather than a 64-bit pointer.
3. Using symbols no longer requires dereferencing.
4. **Lifetimes no longer pollute the entire system! (`'i`)**
5. Two global interners are offered to produce `SymbolStr` with `'static`
lifetimes, simplfiying lifetime management and borrowing where strings
are still needed.
6. A nice API is provided for interning and lookups (e.g. "foo".intern())
which makes this look like a core feature of Rust.
Unfortunately, making this change required modifications to...virtually
everything. And that serves to emphasize why this change was needed:
_everything_ used symbols, and so there's no use in not providing globals.
I implemented this in a way that still provides for loose coupling through
Rust's trait system. Indeed, Rustc offers a global interner, and I decided
not to go that route initially because it wasn't clear to me that such a
thing was desirable. It didn't become apparent to me, in fact, until the
recent commit where I introduced `SymbolIndexSize` and saw how many things
had to be touched; the linker evolved so rapidly as I was trying to learn
Rust that I lost track of how bad it got.
Further, this shows how the design of the internment system was a bit
naive---I assumed certain requirements that never panned out. In
particular, everything using symbols stored `&'i Symbol<'i>`---that is, a
reference (usize) to an object containing an index (32-bit) and a string
slice (128-bit). So it was a reference to a pretty large value, which was
allocated in the arena alongside the interned string itself.
But, that was assuming that something would need both the symbol index _and_
a readily available string. That's not the case. In fact, it's pretty
clear that interning happens at the beginning of execution, that `SymbolId`
is all that's needed during processing (unless an error occurs; more on that
below); and it's not until _the very end_ that we need to retrieve interned
strings from the pool to write either to a file or to display to the
user. It was horribly wasteful!
So `SymbolId` solves the lifetime issue in itself for most systems, but it
still requires that an interner be available for anything that needs to
create or resolve symbols, which, as it turns out, is still a lot of
things. Therefore, I decided to implement them as thread-local static
variables, which is very similar to what Rustc does itself (Rustc's are
scoped). TAMER does not use threads, so the resulting `'static` lifetime
should be just fine for now. Eventually I'd like to implement `!Send` and
`!Sync`, though, to prevent references from escaping the thread (as noted in
the patch); I can't do that yet, since the feature has not yet been
stabalized.
In the end, this leaves us with a system that's much easier to use and
maintain; hopefully easier for newcomers to get into without having to deal
with so many complex lifetimes; and a nice API that makes it a pleasure to
work with symbols.
Admittedly, the `SymbolIndexSize` adds some complexity, and we'll see if I
end up regretting that down the line, but it exists for an important reason:
the `Span` and other structures that'll be introduced need to pack a lot of
data into 64 bits so they can be freely copied around to keep lifetimes
simple without wreaking havoc in other ways, but a 32-bit symbol size needed
by the linker is too large for that. (Actually, the linker doesn't yet need
32 bits for our systems, but it's going to in the somewhat near future
unless we optimize away a bunch of symbols...but I'd really rather not have
the linker hit a limit that requires a lot of code changes to resolve).
Rustc uses interned spans when they exceed 8 bytes, but I'd prefer to avoid
that for now. Most systems can just use on of the `PkgSymbolId` or
`ProgSymbolId` type aliases and not have to worry about it. Systems that
are actually shared between the compiler and the linker do, though, but it's
not like we don't already have a bunch of trait bounds.
Of course, as we implement link-time optimizations (LTO) in the future, it's
possible most things will need the size and I'll grow frustrated with that
and possibly revisit this. We shall see.
Anyway, this was exhausting...and...onward to the first frontend!
2021-08-02 23:54:37 -04:00
|
|
|
|
unsafe fn intern_utf8_unchecked(&self, value: &[u8]) -> SymbolId<Ix> {
|
2021-08-02 23:54:37 -04:00
|
|
|
|
self.intern(std::str::from_utf8_unchecked(value))
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
/// An interner backed by an [arena](bumpalo).
|
|
|
|
|
///
|
tamer: Global interners
This is a major change, and I apologize for it all being in one commit. I
had wanted to break it up, but doing so would have required a significant
amount of temporary work that was not worth doing while I'm the only one
working on this project at the moment.
This accomplishes a number of important things, now that I'm preparing to
write the first compiler frontend for TAMER:
1. `Symbol` has been removed; `SymbolId` is used in its place.
2. Consequently, symbols use 16 or 32 bits, rather than a 64-bit pointer.
3. Using symbols no longer requires dereferencing.
4. **Lifetimes no longer pollute the entire system! (`'i`)**
5. Two global interners are offered to produce `SymbolStr` with `'static`
lifetimes, simplfiying lifetime management and borrowing where strings
are still needed.
6. A nice API is provided for interning and lookups (e.g. "foo".intern())
which makes this look like a core feature of Rust.
Unfortunately, making this change required modifications to...virtually
everything. And that serves to emphasize why this change was needed:
_everything_ used symbols, and so there's no use in not providing globals.
I implemented this in a way that still provides for loose coupling through
Rust's trait system. Indeed, Rustc offers a global interner, and I decided
not to go that route initially because it wasn't clear to me that such a
thing was desirable. It didn't become apparent to me, in fact, until the
recent commit where I introduced `SymbolIndexSize` and saw how many things
had to be touched; the linker evolved so rapidly as I was trying to learn
Rust that I lost track of how bad it got.
Further, this shows how the design of the internment system was a bit
naive---I assumed certain requirements that never panned out. In
particular, everything using symbols stored `&'i Symbol<'i>`---that is, a
reference (usize) to an object containing an index (32-bit) and a string
slice (128-bit). So it was a reference to a pretty large value, which was
allocated in the arena alongside the interned string itself.
But, that was assuming that something would need both the symbol index _and_
a readily available string. That's not the case. In fact, it's pretty
clear that interning happens at the beginning of execution, that `SymbolId`
is all that's needed during processing (unless an error occurs; more on that
below); and it's not until _the very end_ that we need to retrieve interned
strings from the pool to write either to a file or to display to the
user. It was horribly wasteful!
So `SymbolId` solves the lifetime issue in itself for most systems, but it
still requires that an interner be available for anything that needs to
create or resolve symbols, which, as it turns out, is still a lot of
things. Therefore, I decided to implement them as thread-local static
variables, which is very similar to what Rustc does itself (Rustc's are
scoped). TAMER does not use threads, so the resulting `'static` lifetime
should be just fine for now. Eventually I'd like to implement `!Send` and
`!Sync`, though, to prevent references from escaping the thread (as noted in
the patch); I can't do that yet, since the feature has not yet been
stabalized.
In the end, this leaves us with a system that's much easier to use and
maintain; hopefully easier for newcomers to get into without having to deal
with so many complex lifetimes; and a nice API that makes it a pleasure to
work with symbols.
Admittedly, the `SymbolIndexSize` adds some complexity, and we'll see if I
end up regretting that down the line, but it exists for an important reason:
the `Span` and other structures that'll be introduced need to pack a lot of
data into 64 bits so they can be freely copied around to keep lifetimes
simple without wreaking havoc in other ways, but a 32-bit symbol size needed
by the linker is too large for that. (Actually, the linker doesn't yet need
32 bits for our systems, but it's going to in the somewhat near future
unless we optimize away a bunch of symbols...but I'd really rather not have
the linker hit a limit that requires a lot of code changes to resolve).
Rustc uses interned spans when they exceed 8 bytes, but I'd prefer to avoid
that for now. Most systems can just use on of the `PkgSymbolId` or
`ProgSymbolId` type aliases and not have to worry about it. Systems that
are actually shared between the compiler and the linker do, though, but it's
not like we don't already have a bunch of trait bounds.
Of course, as we implement link-time optimizations (LTO) in the future, it's
possible most things will need the size and I'll grow frustrated with that
and possibly revisit this. We shall see.
Anyway, this was exhausting...and...onward to the first frontend!
2021-08-02 23:54:37 -04:00
|
|
|
|
/// Since all symbols exist until the interner itself is freed,
|
2021-08-02 23:54:37 -04:00
|
|
|
|
/// an arena is a much more efficient and appropriate memory allocation
|
|
|
|
|
/// strategy.
|
tamer: Global interners
This is a major change, and I apologize for it all being in one commit. I
had wanted to break it up, but doing so would have required a significant
amount of temporary work that was not worth doing while I'm the only one
working on this project at the moment.
This accomplishes a number of important things, now that I'm preparing to
write the first compiler frontend for TAMER:
1. `Symbol` has been removed; `SymbolId` is used in its place.
2. Consequently, symbols use 16 or 32 bits, rather than a 64-bit pointer.
3. Using symbols no longer requires dereferencing.
4. **Lifetimes no longer pollute the entire system! (`'i`)**
5. Two global interners are offered to produce `SymbolStr` with `'static`
lifetimes, simplfiying lifetime management and borrowing where strings
are still needed.
6. A nice API is provided for interning and lookups (e.g. "foo".intern())
which makes this look like a core feature of Rust.
Unfortunately, making this change required modifications to...virtually
everything. And that serves to emphasize why this change was needed:
_everything_ used symbols, and so there's no use in not providing globals.
I implemented this in a way that still provides for loose coupling through
Rust's trait system. Indeed, Rustc offers a global interner, and I decided
not to go that route initially because it wasn't clear to me that such a
thing was desirable. It didn't become apparent to me, in fact, until the
recent commit where I introduced `SymbolIndexSize` and saw how many things
had to be touched; the linker evolved so rapidly as I was trying to learn
Rust that I lost track of how bad it got.
Further, this shows how the design of the internment system was a bit
naive---I assumed certain requirements that never panned out. In
particular, everything using symbols stored `&'i Symbol<'i>`---that is, a
reference (usize) to an object containing an index (32-bit) and a string
slice (128-bit). So it was a reference to a pretty large value, which was
allocated in the arena alongside the interned string itself.
But, that was assuming that something would need both the symbol index _and_
a readily available string. That's not the case. In fact, it's pretty
clear that interning happens at the beginning of execution, that `SymbolId`
is all that's needed during processing (unless an error occurs; more on that
below); and it's not until _the very end_ that we need to retrieve interned
strings from the pool to write either to a file or to display to the
user. It was horribly wasteful!
So `SymbolId` solves the lifetime issue in itself for most systems, but it
still requires that an interner be available for anything that needs to
create or resolve symbols, which, as it turns out, is still a lot of
things. Therefore, I decided to implement them as thread-local static
variables, which is very similar to what Rustc does itself (Rustc's are
scoped). TAMER does not use threads, so the resulting `'static` lifetime
should be just fine for now. Eventually I'd like to implement `!Send` and
`!Sync`, though, to prevent references from escaping the thread (as noted in
the patch); I can't do that yet, since the feature has not yet been
stabalized.
In the end, this leaves us with a system that's much easier to use and
maintain; hopefully easier for newcomers to get into without having to deal
with so many complex lifetimes; and a nice API that makes it a pleasure to
work with symbols.
Admittedly, the `SymbolIndexSize` adds some complexity, and we'll see if I
end up regretting that down the line, but it exists for an important reason:
the `Span` and other structures that'll be introduced need to pack a lot of
data into 64 bits so they can be freely copied around to keep lifetimes
simple without wreaking havoc in other ways, but a 32-bit symbol size needed
by the linker is too large for that. (Actually, the linker doesn't yet need
32 bits for our systems, but it's going to in the somewhat near future
unless we optimize away a bunch of symbols...but I'd really rather not have
the linker hit a limit that requires a lot of code changes to resolve).
Rustc uses interned spans when they exceed 8 bytes, but I'd prefer to avoid
that for now. Most systems can just use on of the `PkgSymbolId` or
`ProgSymbolId` type aliases and not have to worry about it. Systems that
are actually shared between the compiler and the linker do, though, but it's
not like we don't already have a bunch of trait bounds.
Of course, as we implement link-time optimizations (LTO) in the future, it's
possible most things will need the size and I'll grow frustrated with that
and possibly revisit this. We shall see.
Anyway, this was exhausting...and...onward to the first frontend!
2021-08-02 23:54:37 -04:00
|
|
|
|
/// This also provides a stable location in memory for symbol data.
|
2021-08-02 23:54:37 -04:00
|
|
|
|
///
|
|
|
|
|
/// For the recommended configuration,
|
|
|
|
|
/// see [`DefaultInterner`].
|
|
|
|
|
///
|
|
|
|
|
/// See the [module-level documentation](self) for examples and more
|
|
|
|
|
/// information on how to use this interner.
|
|
|
|
|
pub struct ArenaInterner<'i, S, Ix>
|
|
|
|
|
where
|
|
|
|
|
S: BuildHasher + Default,
|
|
|
|
|
Ix: SymbolIndexSize,
|
|
|
|
|
{
|
tamer: Global interners
This is a major change, and I apologize for it all being in one commit. I
had wanted to break it up, but doing so would have required a significant
amount of temporary work that was not worth doing while I'm the only one
working on this project at the moment.
This accomplishes a number of important things, now that I'm preparing to
write the first compiler frontend for TAMER:
1. `Symbol` has been removed; `SymbolId` is used in its place.
2. Consequently, symbols use 16 or 32 bits, rather than a 64-bit pointer.
3. Using symbols no longer requires dereferencing.
4. **Lifetimes no longer pollute the entire system! (`'i`)**
5. Two global interners are offered to produce `SymbolStr` with `'static`
lifetimes, simplfiying lifetime management and borrowing where strings
are still needed.
6. A nice API is provided for interning and lookups (e.g. "foo".intern())
which makes this look like a core feature of Rust.
Unfortunately, making this change required modifications to...virtually
everything. And that serves to emphasize why this change was needed:
_everything_ used symbols, and so there's no use in not providing globals.
I implemented this in a way that still provides for loose coupling through
Rust's trait system. Indeed, Rustc offers a global interner, and I decided
not to go that route initially because it wasn't clear to me that such a
thing was desirable. It didn't become apparent to me, in fact, until the
recent commit where I introduced `SymbolIndexSize` and saw how many things
had to be touched; the linker evolved so rapidly as I was trying to learn
Rust that I lost track of how bad it got.
Further, this shows how the design of the internment system was a bit
naive---I assumed certain requirements that never panned out. In
particular, everything using symbols stored `&'i Symbol<'i>`---that is, a
reference (usize) to an object containing an index (32-bit) and a string
slice (128-bit). So it was a reference to a pretty large value, which was
allocated in the arena alongside the interned string itself.
But, that was assuming that something would need both the symbol index _and_
a readily available string. That's not the case. In fact, it's pretty
clear that interning happens at the beginning of execution, that `SymbolId`
is all that's needed during processing (unless an error occurs; more on that
below); and it's not until _the very end_ that we need to retrieve interned
strings from the pool to write either to a file or to display to the
user. It was horribly wasteful!
So `SymbolId` solves the lifetime issue in itself for most systems, but it
still requires that an interner be available for anything that needs to
create or resolve symbols, which, as it turns out, is still a lot of
things. Therefore, I decided to implement them as thread-local static
variables, which is very similar to what Rustc does itself (Rustc's are
scoped). TAMER does not use threads, so the resulting `'static` lifetime
should be just fine for now. Eventually I'd like to implement `!Send` and
`!Sync`, though, to prevent references from escaping the thread (as noted in
the patch); I can't do that yet, since the feature has not yet been
stabalized.
In the end, this leaves us with a system that's much easier to use and
maintain; hopefully easier for newcomers to get into without having to deal
with so many complex lifetimes; and a nice API that makes it a pleasure to
work with symbols.
Admittedly, the `SymbolIndexSize` adds some complexity, and we'll see if I
end up regretting that down the line, but it exists for an important reason:
the `Span` and other structures that'll be introduced need to pack a lot of
data into 64 bits so they can be freely copied around to keep lifetimes
simple without wreaking havoc in other ways, but a 32-bit symbol size needed
by the linker is too large for that. (Actually, the linker doesn't yet need
32 bits for our systems, but it's going to in the somewhat near future
unless we optimize away a bunch of symbols...but I'd really rather not have
the linker hit a limit that requires a lot of code changes to resolve).
Rustc uses interned spans when they exceed 8 bytes, but I'd prefer to avoid
that for now. Most systems can just use on of the `PkgSymbolId` or
`ProgSymbolId` type aliases and not have to worry about it. Systems that
are actually shared between the compiler and the linker do, though, but it's
not like we don't already have a bunch of trait bounds.
Of course, as we implement link-time optimizations (LTO) in the future, it's
possible most things will need the size and I'll grow frustrated with that
and possibly revisit this. We shall see.
Anyway, this was exhausting...and...onward to the first frontend!
2021-08-02 23:54:37 -04:00
|
|
|
|
/// Storage for interned strings.
|
2021-08-02 23:54:37 -04:00
|
|
|
|
arena: Bump,
|
|
|
|
|
|
tamer: Global interners
This is a major change, and I apologize for it all being in one commit. I
had wanted to break it up, but doing so would have required a significant
amount of temporary work that was not worth doing while I'm the only one
working on this project at the moment.
This accomplishes a number of important things, now that I'm preparing to
write the first compiler frontend for TAMER:
1. `Symbol` has been removed; `SymbolId` is used in its place.
2. Consequently, symbols use 16 or 32 bits, rather than a 64-bit pointer.
3. Using symbols no longer requires dereferencing.
4. **Lifetimes no longer pollute the entire system! (`'i`)**
5. Two global interners are offered to produce `SymbolStr` with `'static`
lifetimes, simplfiying lifetime management and borrowing where strings
are still needed.
6. A nice API is provided for interning and lookups (e.g. "foo".intern())
which makes this look like a core feature of Rust.
Unfortunately, making this change required modifications to...virtually
everything. And that serves to emphasize why this change was needed:
_everything_ used symbols, and so there's no use in not providing globals.
I implemented this in a way that still provides for loose coupling through
Rust's trait system. Indeed, Rustc offers a global interner, and I decided
not to go that route initially because it wasn't clear to me that such a
thing was desirable. It didn't become apparent to me, in fact, until the
recent commit where I introduced `SymbolIndexSize` and saw how many things
had to be touched; the linker evolved so rapidly as I was trying to learn
Rust that I lost track of how bad it got.
Further, this shows how the design of the internment system was a bit
naive---I assumed certain requirements that never panned out. In
particular, everything using symbols stored `&'i Symbol<'i>`---that is, a
reference (usize) to an object containing an index (32-bit) and a string
slice (128-bit). So it was a reference to a pretty large value, which was
allocated in the arena alongside the interned string itself.
But, that was assuming that something would need both the symbol index _and_
a readily available string. That's not the case. In fact, it's pretty
clear that interning happens at the beginning of execution, that `SymbolId`
is all that's needed during processing (unless an error occurs; more on that
below); and it's not until _the very end_ that we need to retrieve interned
strings from the pool to write either to a file or to display to the
user. It was horribly wasteful!
So `SymbolId` solves the lifetime issue in itself for most systems, but it
still requires that an interner be available for anything that needs to
create or resolve symbols, which, as it turns out, is still a lot of
things. Therefore, I decided to implement them as thread-local static
variables, which is very similar to what Rustc does itself (Rustc's are
scoped). TAMER does not use threads, so the resulting `'static` lifetime
should be just fine for now. Eventually I'd like to implement `!Send` and
`!Sync`, though, to prevent references from escaping the thread (as noted in
the patch); I can't do that yet, since the feature has not yet been
stabalized.
In the end, this leaves us with a system that's much easier to use and
maintain; hopefully easier for newcomers to get into without having to deal
with so many complex lifetimes; and a nice API that makes it a pleasure to
work with symbols.
Admittedly, the `SymbolIndexSize` adds some complexity, and we'll see if I
end up regretting that down the line, but it exists for an important reason:
the `Span` and other structures that'll be introduced need to pack a lot of
data into 64 bits so they can be freely copied around to keep lifetimes
simple without wreaking havoc in other ways, but a 32-bit symbol size needed
by the linker is too large for that. (Actually, the linker doesn't yet need
32 bits for our systems, but it's going to in the somewhat near future
unless we optimize away a bunch of symbols...but I'd really rather not have
the linker hit a limit that requires a lot of code changes to resolve).
Rustc uses interned spans when they exceed 8 bytes, but I'd prefer to avoid
that for now. Most systems can just use on of the `PkgSymbolId` or
`ProgSymbolId` type aliases and not have to worry about it. Systems that
are actually shared between the compiler and the linker do, though, but it's
not like we don't already have a bunch of trait bounds.
Of course, as we implement link-time optimizations (LTO) in the future, it's
possible most things will need the size and I'll grow frustrated with that
and possibly revisit this. We shall see.
Anyway, this was exhausting...and...onward to the first frontend!
2021-08-02 23:54:37 -04:00
|
|
|
|
/// Interned strings by [`SymbolId`].
|
2021-08-02 23:54:37 -04:00
|
|
|
|
///
|
|
|
|
|
/// The first index must always be populated during initialization to
|
|
|
|
|
/// ensure that [`SymbolId`] will never be `0`.
|
tamer: Global interners
This is a major change, and I apologize for it all being in one commit. I
had wanted to break it up, but doing so would have required a significant
amount of temporary work that was not worth doing while I'm the only one
working on this project at the moment.
This accomplishes a number of important things, now that I'm preparing to
write the first compiler frontend for TAMER:
1. `Symbol` has been removed; `SymbolId` is used in its place.
2. Consequently, symbols use 16 or 32 bits, rather than a 64-bit pointer.
3. Using symbols no longer requires dereferencing.
4. **Lifetimes no longer pollute the entire system! (`'i`)**
5. Two global interners are offered to produce `SymbolStr` with `'static`
lifetimes, simplfiying lifetime management and borrowing where strings
are still needed.
6. A nice API is provided for interning and lookups (e.g. "foo".intern())
which makes this look like a core feature of Rust.
Unfortunately, making this change required modifications to...virtually
everything. And that serves to emphasize why this change was needed:
_everything_ used symbols, and so there's no use in not providing globals.
I implemented this in a way that still provides for loose coupling through
Rust's trait system. Indeed, Rustc offers a global interner, and I decided
not to go that route initially because it wasn't clear to me that such a
thing was desirable. It didn't become apparent to me, in fact, until the
recent commit where I introduced `SymbolIndexSize` and saw how many things
had to be touched; the linker evolved so rapidly as I was trying to learn
Rust that I lost track of how bad it got.
Further, this shows how the design of the internment system was a bit
naive---I assumed certain requirements that never panned out. In
particular, everything using symbols stored `&'i Symbol<'i>`---that is, a
reference (usize) to an object containing an index (32-bit) and a string
slice (128-bit). So it was a reference to a pretty large value, which was
allocated in the arena alongside the interned string itself.
But, that was assuming that something would need both the symbol index _and_
a readily available string. That's not the case. In fact, it's pretty
clear that interning happens at the beginning of execution, that `SymbolId`
is all that's needed during processing (unless an error occurs; more on that
below); and it's not until _the very end_ that we need to retrieve interned
strings from the pool to write either to a file or to display to the
user. It was horribly wasteful!
So `SymbolId` solves the lifetime issue in itself for most systems, but it
still requires that an interner be available for anything that needs to
create or resolve symbols, which, as it turns out, is still a lot of
things. Therefore, I decided to implement them as thread-local static
variables, which is very similar to what Rustc does itself (Rustc's are
scoped). TAMER does not use threads, so the resulting `'static` lifetime
should be just fine for now. Eventually I'd like to implement `!Send` and
`!Sync`, though, to prevent references from escaping the thread (as noted in
the patch); I can't do that yet, since the feature has not yet been
stabalized.
In the end, this leaves us with a system that's much easier to use and
maintain; hopefully easier for newcomers to get into without having to deal
with so many complex lifetimes; and a nice API that makes it a pleasure to
work with symbols.
Admittedly, the `SymbolIndexSize` adds some complexity, and we'll see if I
end up regretting that down the line, but it exists for an important reason:
the `Span` and other structures that'll be introduced need to pack a lot of
data into 64 bits so they can be freely copied around to keep lifetimes
simple without wreaking havoc in other ways, but a 32-bit symbol size needed
by the linker is too large for that. (Actually, the linker doesn't yet need
32 bits for our systems, but it's going to in the somewhat near future
unless we optimize away a bunch of symbols...but I'd really rather not have
the linker hit a limit that requires a lot of code changes to resolve).
Rustc uses interned spans when they exceed 8 bytes, but I'd prefer to avoid
that for now. Most systems can just use on of the `PkgSymbolId` or
`ProgSymbolId` type aliases and not have to worry about it. Systems that
are actually shared between the compiler and the linker do, though, but it's
not like we don't already have a bunch of trait bounds.
Of course, as we implement link-time optimizations (LTO) in the future, it's
possible most things will need the size and I'll grow frustrated with that
and possibly revisit this. We shall see.
Anyway, this was exhausting...and...onward to the first frontend!
2021-08-02 23:54:37 -04:00
|
|
|
|
///
|
|
|
|
|
/// These string slices are stored in `arena`.
|
|
|
|
|
strings: RefCell<Vec<&'i str>>,
|
2021-08-02 23:54:37 -04:00
|
|
|
|
|
tamer: Global interners
This is a major change, and I apologize for it all being in one commit. I
had wanted to break it up, but doing so would have required a significant
amount of temporary work that was not worth doing while I'm the only one
working on this project at the moment.
This accomplishes a number of important things, now that I'm preparing to
write the first compiler frontend for TAMER:
1. `Symbol` has been removed; `SymbolId` is used in its place.
2. Consequently, symbols use 16 or 32 bits, rather than a 64-bit pointer.
3. Using symbols no longer requires dereferencing.
4. **Lifetimes no longer pollute the entire system! (`'i`)**
5. Two global interners are offered to produce `SymbolStr` with `'static`
lifetimes, simplfiying lifetime management and borrowing where strings
are still needed.
6. A nice API is provided for interning and lookups (e.g. "foo".intern())
which makes this look like a core feature of Rust.
Unfortunately, making this change required modifications to...virtually
everything. And that serves to emphasize why this change was needed:
_everything_ used symbols, and so there's no use in not providing globals.
I implemented this in a way that still provides for loose coupling through
Rust's trait system. Indeed, Rustc offers a global interner, and I decided
not to go that route initially because it wasn't clear to me that such a
thing was desirable. It didn't become apparent to me, in fact, until the
recent commit where I introduced `SymbolIndexSize` and saw how many things
had to be touched; the linker evolved so rapidly as I was trying to learn
Rust that I lost track of how bad it got.
Further, this shows how the design of the internment system was a bit
naive---I assumed certain requirements that never panned out. In
particular, everything using symbols stored `&'i Symbol<'i>`---that is, a
reference (usize) to an object containing an index (32-bit) and a string
slice (128-bit). So it was a reference to a pretty large value, which was
allocated in the arena alongside the interned string itself.
But, that was assuming that something would need both the symbol index _and_
a readily available string. That's not the case. In fact, it's pretty
clear that interning happens at the beginning of execution, that `SymbolId`
is all that's needed during processing (unless an error occurs; more on that
below); and it's not until _the very end_ that we need to retrieve interned
strings from the pool to write either to a file or to display to the
user. It was horribly wasteful!
So `SymbolId` solves the lifetime issue in itself for most systems, but it
still requires that an interner be available for anything that needs to
create or resolve symbols, which, as it turns out, is still a lot of
things. Therefore, I decided to implement them as thread-local static
variables, which is very similar to what Rustc does itself (Rustc's are
scoped). TAMER does not use threads, so the resulting `'static` lifetime
should be just fine for now. Eventually I'd like to implement `!Send` and
`!Sync`, though, to prevent references from escaping the thread (as noted in
the patch); I can't do that yet, since the feature has not yet been
stabalized.
In the end, this leaves us with a system that's much easier to use and
maintain; hopefully easier for newcomers to get into without having to deal
with so many complex lifetimes; and a nice API that makes it a pleasure to
work with symbols.
Admittedly, the `SymbolIndexSize` adds some complexity, and we'll see if I
end up regretting that down the line, but it exists for an important reason:
the `Span` and other structures that'll be introduced need to pack a lot of
data into 64 bits so they can be freely copied around to keep lifetimes
simple without wreaking havoc in other ways, but a 32-bit symbol size needed
by the linker is too large for that. (Actually, the linker doesn't yet need
32 bits for our systems, but it's going to in the somewhat near future
unless we optimize away a bunch of symbols...but I'd really rather not have
the linker hit a limit that requires a lot of code changes to resolve).
Rustc uses interned spans when they exceed 8 bytes, but I'd prefer to avoid
that for now. Most systems can just use on of the `PkgSymbolId` or
`ProgSymbolId` type aliases and not have to worry about it. Systems that
are actually shared between the compiler and the linker do, though, but it's
not like we don't already have a bunch of trait bounds.
Of course, as we implement link-time optimizations (LTO) in the future, it's
possible most things will need the size and I'll grow frustrated with that
and possibly revisit this. We shall see.
Anyway, this was exhausting...and...onward to the first frontend!
2021-08-02 23:54:37 -04:00
|
|
|
|
/// Map of interned strings to their respective [`SymbolId`].
|
2021-08-02 23:54:37 -04:00
|
|
|
|
///
|
tamer: Global interners
This is a major change, and I apologize for it all being in one commit. I
had wanted to break it up, but doing so would have required a significant
amount of temporary work that was not worth doing while I'm the only one
working on this project at the moment.
This accomplishes a number of important things, now that I'm preparing to
write the first compiler frontend for TAMER:
1. `Symbol` has been removed; `SymbolId` is used in its place.
2. Consequently, symbols use 16 or 32 bits, rather than a 64-bit pointer.
3. Using symbols no longer requires dereferencing.
4. **Lifetimes no longer pollute the entire system! (`'i`)**
5. Two global interners are offered to produce `SymbolStr` with `'static`
lifetimes, simplfiying lifetime management and borrowing where strings
are still needed.
6. A nice API is provided for interning and lookups (e.g. "foo".intern())
which makes this look like a core feature of Rust.
Unfortunately, making this change required modifications to...virtually
everything. And that serves to emphasize why this change was needed:
_everything_ used symbols, and so there's no use in not providing globals.
I implemented this in a way that still provides for loose coupling through
Rust's trait system. Indeed, Rustc offers a global interner, and I decided
not to go that route initially because it wasn't clear to me that such a
thing was desirable. It didn't become apparent to me, in fact, until the
recent commit where I introduced `SymbolIndexSize` and saw how many things
had to be touched; the linker evolved so rapidly as I was trying to learn
Rust that I lost track of how bad it got.
Further, this shows how the design of the internment system was a bit
naive---I assumed certain requirements that never panned out. In
particular, everything using symbols stored `&'i Symbol<'i>`---that is, a
reference (usize) to an object containing an index (32-bit) and a string
slice (128-bit). So it was a reference to a pretty large value, which was
allocated in the arena alongside the interned string itself.
But, that was assuming that something would need both the symbol index _and_
a readily available string. That's not the case. In fact, it's pretty
clear that interning happens at the beginning of execution, that `SymbolId`
is all that's needed during processing (unless an error occurs; more on that
below); and it's not until _the very end_ that we need to retrieve interned
strings from the pool to write either to a file or to display to the
user. It was horribly wasteful!
So `SymbolId` solves the lifetime issue in itself for most systems, but it
still requires that an interner be available for anything that needs to
create or resolve symbols, which, as it turns out, is still a lot of
things. Therefore, I decided to implement them as thread-local static
variables, which is very similar to what Rustc does itself (Rustc's are
scoped). TAMER does not use threads, so the resulting `'static` lifetime
should be just fine for now. Eventually I'd like to implement `!Send` and
`!Sync`, though, to prevent references from escaping the thread (as noted in
the patch); I can't do that yet, since the feature has not yet been
stabalized.
In the end, this leaves us with a system that's much easier to use and
maintain; hopefully easier for newcomers to get into without having to deal
with so many complex lifetimes; and a nice API that makes it a pleasure to
work with symbols.
Admittedly, the `SymbolIndexSize` adds some complexity, and we'll see if I
end up regretting that down the line, but it exists for an important reason:
the `Span` and other structures that'll be introduced need to pack a lot of
data into 64 bits so they can be freely copied around to keep lifetimes
simple without wreaking havoc in other ways, but a 32-bit symbol size needed
by the linker is too large for that. (Actually, the linker doesn't yet need
32 bits for our systems, but it's going to in the somewhat near future
unless we optimize away a bunch of symbols...but I'd really rather not have
the linker hit a limit that requires a lot of code changes to resolve).
Rustc uses interned spans when they exceed 8 bytes, but I'd prefer to avoid
that for now. Most systems can just use on of the `PkgSymbolId` or
`ProgSymbolId` type aliases and not have to worry about it. Systems that
are actually shared between the compiler and the linker do, though, but it's
not like we don't already have a bunch of trait bounds.
Of course, as we implement link-time optimizations (LTO) in the future, it's
possible most things will need the size and I'll grow frustrated with that
and possibly revisit this. We shall see.
Anyway, this was exhausting...and...onward to the first frontend!
2021-08-02 23:54:37 -04:00
|
|
|
|
/// This allows us to determine whether a string has already been
|
|
|
|
|
/// interned and, if so, to return its corresponding symbol.
|
|
|
|
|
map: RefCell<HashMap<&'i str, SymbolId<Ix>, S>>,
|
2021-08-02 23:54:37 -04:00
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
impl<'i, S, Ix> ArenaInterner<'i, S, Ix>
|
|
|
|
|
where
|
|
|
|
|
S: BuildHasher + Default,
|
|
|
|
|
Ix: SymbolIndexSize,
|
|
|
|
|
{
|
|
|
|
|
/// Initialize a new interner with no initial capacity.
|
|
|
|
|
///
|
|
|
|
|
/// Prefer [`with_capacity`](ArenaInterner::with_capacity) when possible.
|
|
|
|
|
#[inline]
|
|
|
|
|
pub fn new() -> Self {
|
|
|
|
|
Self::with_capacity(0)
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
/// Initialize a new interner with an initial capacity for the
|
|
|
|
|
/// underlying [`HashMap`].
|
|
|
|
|
///
|
|
|
|
|
/// The given `capacity` has no affect on arena allocation.
|
|
|
|
|
/// Specifying initial capacity is important only for the map of strings
|
|
|
|
|
/// to symbols because it will reallocate and re-hash its contents
|
|
|
|
|
/// once capacity is exceeded.
|
|
|
|
|
/// See benchmarks.
|
|
|
|
|
///
|
|
|
|
|
/// If reallocation is a major concern,
|
|
|
|
|
/// a [consistent hashing algorithm][consistent] could be considered,
|
|
|
|
|
/// but the implementation will still incur the cost of copying
|
|
|
|
|
/// the [`HashMap`]'s contents to a new location in memory.
|
|
|
|
|
///
|
|
|
|
|
/// [consistent]: https://en.wikipedia.org/wiki/Consistent_hashing
|
|
|
|
|
#[inline]
|
|
|
|
|
pub fn with_capacity(capacity: usize) -> Self {
|
tamer: Global interners
This is a major change, and I apologize for it all being in one commit. I
had wanted to break it up, but doing so would have required a significant
amount of temporary work that was not worth doing while I'm the only one
working on this project at the moment.
This accomplishes a number of important things, now that I'm preparing to
write the first compiler frontend for TAMER:
1. `Symbol` has been removed; `SymbolId` is used in its place.
2. Consequently, symbols use 16 or 32 bits, rather than a 64-bit pointer.
3. Using symbols no longer requires dereferencing.
4. **Lifetimes no longer pollute the entire system! (`'i`)**
5. Two global interners are offered to produce `SymbolStr` with `'static`
lifetimes, simplfiying lifetime management and borrowing where strings
are still needed.
6. A nice API is provided for interning and lookups (e.g. "foo".intern())
which makes this look like a core feature of Rust.
Unfortunately, making this change required modifications to...virtually
everything. And that serves to emphasize why this change was needed:
_everything_ used symbols, and so there's no use in not providing globals.
I implemented this in a way that still provides for loose coupling through
Rust's trait system. Indeed, Rustc offers a global interner, and I decided
not to go that route initially because it wasn't clear to me that such a
thing was desirable. It didn't become apparent to me, in fact, until the
recent commit where I introduced `SymbolIndexSize` and saw how many things
had to be touched; the linker evolved so rapidly as I was trying to learn
Rust that I lost track of how bad it got.
Further, this shows how the design of the internment system was a bit
naive---I assumed certain requirements that never panned out. In
particular, everything using symbols stored `&'i Symbol<'i>`---that is, a
reference (usize) to an object containing an index (32-bit) and a string
slice (128-bit). So it was a reference to a pretty large value, which was
allocated in the arena alongside the interned string itself.
But, that was assuming that something would need both the symbol index _and_
a readily available string. That's not the case. In fact, it's pretty
clear that interning happens at the beginning of execution, that `SymbolId`
is all that's needed during processing (unless an error occurs; more on that
below); and it's not until _the very end_ that we need to retrieve interned
strings from the pool to write either to a file or to display to the
user. It was horribly wasteful!
So `SymbolId` solves the lifetime issue in itself for most systems, but it
still requires that an interner be available for anything that needs to
create or resolve symbols, which, as it turns out, is still a lot of
things. Therefore, I decided to implement them as thread-local static
variables, which is very similar to what Rustc does itself (Rustc's are
scoped). TAMER does not use threads, so the resulting `'static` lifetime
should be just fine for now. Eventually I'd like to implement `!Send` and
`!Sync`, though, to prevent references from escaping the thread (as noted in
the patch); I can't do that yet, since the feature has not yet been
stabalized.
In the end, this leaves us with a system that's much easier to use and
maintain; hopefully easier for newcomers to get into without having to deal
with so many complex lifetimes; and a nice API that makes it a pleasure to
work with symbols.
Admittedly, the `SymbolIndexSize` adds some complexity, and we'll see if I
end up regretting that down the line, but it exists for an important reason:
the `Span` and other structures that'll be introduced need to pack a lot of
data into 64 bits so they can be freely copied around to keep lifetimes
simple without wreaking havoc in other ways, but a 32-bit symbol size needed
by the linker is too large for that. (Actually, the linker doesn't yet need
32 bits for our systems, but it's going to in the somewhat near future
unless we optimize away a bunch of symbols...but I'd really rather not have
the linker hit a limit that requires a lot of code changes to resolve).
Rustc uses interned spans when they exceed 8 bytes, but I'd prefer to avoid
that for now. Most systems can just use on of the `PkgSymbolId` or
`ProgSymbolId` type aliases and not have to worry about it. Systems that
are actually shared between the compiler and the linker do, though, but it's
not like we don't already have a bunch of trait bounds.
Of course, as we implement link-time optimizations (LTO) in the future, it's
possible most things will need the size and I'll grow frustrated with that
and possibly revisit this. We shall see.
Anyway, this was exhausting...and...onward to the first frontend!
2021-08-02 23:54:37 -04:00
|
|
|
|
let mut strings = Vec::<_>::with_capacity(capacity);
|
2021-08-02 23:54:37 -04:00
|
|
|
|
|
|
|
|
|
// The first index is not used since SymbolId cannot be 0.
|
tamer: Global interners
This is a major change, and I apologize for it all being in one commit. I
had wanted to break it up, but doing so would have required a significant
amount of temporary work that was not worth doing while I'm the only one
working on this project at the moment.
This accomplishes a number of important things, now that I'm preparing to
write the first compiler frontend for TAMER:
1. `Symbol` has been removed; `SymbolId` is used in its place.
2. Consequently, symbols use 16 or 32 bits, rather than a 64-bit pointer.
3. Using symbols no longer requires dereferencing.
4. **Lifetimes no longer pollute the entire system! (`'i`)**
5. Two global interners are offered to produce `SymbolStr` with `'static`
lifetimes, simplfiying lifetime management and borrowing where strings
are still needed.
6. A nice API is provided for interning and lookups (e.g. "foo".intern())
which makes this look like a core feature of Rust.
Unfortunately, making this change required modifications to...virtually
everything. And that serves to emphasize why this change was needed:
_everything_ used symbols, and so there's no use in not providing globals.
I implemented this in a way that still provides for loose coupling through
Rust's trait system. Indeed, Rustc offers a global interner, and I decided
not to go that route initially because it wasn't clear to me that such a
thing was desirable. It didn't become apparent to me, in fact, until the
recent commit where I introduced `SymbolIndexSize` and saw how many things
had to be touched; the linker evolved so rapidly as I was trying to learn
Rust that I lost track of how bad it got.
Further, this shows how the design of the internment system was a bit
naive---I assumed certain requirements that never panned out. In
particular, everything using symbols stored `&'i Symbol<'i>`---that is, a
reference (usize) to an object containing an index (32-bit) and a string
slice (128-bit). So it was a reference to a pretty large value, which was
allocated in the arena alongside the interned string itself.
But, that was assuming that something would need both the symbol index _and_
a readily available string. That's not the case. In fact, it's pretty
clear that interning happens at the beginning of execution, that `SymbolId`
is all that's needed during processing (unless an error occurs; more on that
below); and it's not until _the very end_ that we need to retrieve interned
strings from the pool to write either to a file or to display to the
user. It was horribly wasteful!
So `SymbolId` solves the lifetime issue in itself for most systems, but it
still requires that an interner be available for anything that needs to
create or resolve symbols, which, as it turns out, is still a lot of
things. Therefore, I decided to implement them as thread-local static
variables, which is very similar to what Rustc does itself (Rustc's are
scoped). TAMER does not use threads, so the resulting `'static` lifetime
should be just fine for now. Eventually I'd like to implement `!Send` and
`!Sync`, though, to prevent references from escaping the thread (as noted in
the patch); I can't do that yet, since the feature has not yet been
stabalized.
In the end, this leaves us with a system that's much easier to use and
maintain; hopefully easier for newcomers to get into without having to deal
with so many complex lifetimes; and a nice API that makes it a pleasure to
work with symbols.
Admittedly, the `SymbolIndexSize` adds some complexity, and we'll see if I
end up regretting that down the line, but it exists for an important reason:
the `Span` and other structures that'll be introduced need to pack a lot of
data into 64 bits so they can be freely copied around to keep lifetimes
simple without wreaking havoc in other ways, but a 32-bit symbol size needed
by the linker is too large for that. (Actually, the linker doesn't yet need
32 bits for our systems, but it's going to in the somewhat near future
unless we optimize away a bunch of symbols...but I'd really rather not have
the linker hit a limit that requires a lot of code changes to resolve).
Rustc uses interned spans when they exceed 8 bytes, but I'd prefer to avoid
that for now. Most systems can just use on of the `PkgSymbolId` or
`ProgSymbolId` type aliases and not have to worry about it. Systems that
are actually shared between the compiler and the linker do, though, but it's
not like we don't already have a bunch of trait bounds.
Of course, as we implement link-time optimizations (LTO) in the future, it's
possible most things will need the size and I'll grow frustrated with that
and possibly revisit this. We shall see.
Anyway, this was exhausting...and...onward to the first frontend!
2021-08-02 23:54:37 -04:00
|
|
|
|
strings.push("");
|
2021-08-02 23:54:37 -04:00
|
|
|
|
|
|
|
|
|
Self {
|
|
|
|
|
arena: Bump::new(),
|
tamer: Global interners
This is a major change, and I apologize for it all being in one commit. I
had wanted to break it up, but doing so would have required a significant
amount of temporary work that was not worth doing while I'm the only one
working on this project at the moment.
This accomplishes a number of important things, now that I'm preparing to
write the first compiler frontend for TAMER:
1. `Symbol` has been removed; `SymbolId` is used in its place.
2. Consequently, symbols use 16 or 32 bits, rather than a 64-bit pointer.
3. Using symbols no longer requires dereferencing.
4. **Lifetimes no longer pollute the entire system! (`'i`)**
5. Two global interners are offered to produce `SymbolStr` with `'static`
lifetimes, simplfiying lifetime management and borrowing where strings
are still needed.
6. A nice API is provided for interning and lookups (e.g. "foo".intern())
which makes this look like a core feature of Rust.
Unfortunately, making this change required modifications to...virtually
everything. And that serves to emphasize why this change was needed:
_everything_ used symbols, and so there's no use in not providing globals.
I implemented this in a way that still provides for loose coupling through
Rust's trait system. Indeed, Rustc offers a global interner, and I decided
not to go that route initially because it wasn't clear to me that such a
thing was desirable. It didn't become apparent to me, in fact, until the
recent commit where I introduced `SymbolIndexSize` and saw how many things
had to be touched; the linker evolved so rapidly as I was trying to learn
Rust that I lost track of how bad it got.
Further, this shows how the design of the internment system was a bit
naive---I assumed certain requirements that never panned out. In
particular, everything using symbols stored `&'i Symbol<'i>`---that is, a
reference (usize) to an object containing an index (32-bit) and a string
slice (128-bit). So it was a reference to a pretty large value, which was
allocated in the arena alongside the interned string itself.
But, that was assuming that something would need both the symbol index _and_
a readily available string. That's not the case. In fact, it's pretty
clear that interning happens at the beginning of execution, that `SymbolId`
is all that's needed during processing (unless an error occurs; more on that
below); and it's not until _the very end_ that we need to retrieve interned
strings from the pool to write either to a file or to display to the
user. It was horribly wasteful!
So `SymbolId` solves the lifetime issue in itself for most systems, but it
still requires that an interner be available for anything that needs to
create or resolve symbols, which, as it turns out, is still a lot of
things. Therefore, I decided to implement them as thread-local static
variables, which is very similar to what Rustc does itself (Rustc's are
scoped). TAMER does not use threads, so the resulting `'static` lifetime
should be just fine for now. Eventually I'd like to implement `!Send` and
`!Sync`, though, to prevent references from escaping the thread (as noted in
the patch); I can't do that yet, since the feature has not yet been
stabalized.
In the end, this leaves us with a system that's much easier to use and
maintain; hopefully easier for newcomers to get into without having to deal
with so many complex lifetimes; and a nice API that makes it a pleasure to
work with symbols.
Admittedly, the `SymbolIndexSize` adds some complexity, and we'll see if I
end up regretting that down the line, but it exists for an important reason:
the `Span` and other structures that'll be introduced need to pack a lot of
data into 64 bits so they can be freely copied around to keep lifetimes
simple without wreaking havoc in other ways, but a 32-bit symbol size needed
by the linker is too large for that. (Actually, the linker doesn't yet need
32 bits for our systems, but it's going to in the somewhat near future
unless we optimize away a bunch of symbols...but I'd really rather not have
the linker hit a limit that requires a lot of code changes to resolve).
Rustc uses interned spans when they exceed 8 bytes, but I'd prefer to avoid
that for now. Most systems can just use on of the `PkgSymbolId` or
`ProgSymbolId` type aliases and not have to worry about it. Systems that
are actually shared between the compiler and the linker do, though, but it's
not like we don't already have a bunch of trait bounds.
Of course, as we implement link-time optimizations (LTO) in the future, it's
possible most things will need the size and I'll grow frustrated with that
and possibly revisit this. We shall see.
Anyway, this was exhausting...and...onward to the first frontend!
2021-08-02 23:54:37 -04:00
|
|
|
|
strings: RefCell::new(strings),
|
2021-08-02 23:54:37 -04:00
|
|
|
|
map: RefCell::new(HashMap::with_capacity_and_hasher(
|
|
|
|
|
capacity,
|
|
|
|
|
Default::default(),
|
|
|
|
|
)),
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
impl<'i, S, Ix> Interner<'i, Ix> for ArenaInterner<'i, S, Ix>
|
|
|
|
|
where
|
|
|
|
|
S: BuildHasher + Default,
|
|
|
|
|
Ix: SymbolIndexSize,
|
|
|
|
|
<Ix as TryFrom<usize>>::Error: Debug,
|
|
|
|
|
{
|
tamer: Global interners
This is a major change, and I apologize for it all being in one commit. I
had wanted to break it up, but doing so would have required a significant
amount of temporary work that was not worth doing while I'm the only one
working on this project at the moment.
This accomplishes a number of important things, now that I'm preparing to
write the first compiler frontend for TAMER:
1. `Symbol` has been removed; `SymbolId` is used in its place.
2. Consequently, symbols use 16 or 32 bits, rather than a 64-bit pointer.
3. Using symbols no longer requires dereferencing.
4. **Lifetimes no longer pollute the entire system! (`'i`)**
5. Two global interners are offered to produce `SymbolStr` with `'static`
lifetimes, simplfiying lifetime management and borrowing where strings
are still needed.
6. A nice API is provided for interning and lookups (e.g. "foo".intern())
which makes this look like a core feature of Rust.
Unfortunately, making this change required modifications to...virtually
everything. And that serves to emphasize why this change was needed:
_everything_ used symbols, and so there's no use in not providing globals.
I implemented this in a way that still provides for loose coupling through
Rust's trait system. Indeed, Rustc offers a global interner, and I decided
not to go that route initially because it wasn't clear to me that such a
thing was desirable. It didn't become apparent to me, in fact, until the
recent commit where I introduced `SymbolIndexSize` and saw how many things
had to be touched; the linker evolved so rapidly as I was trying to learn
Rust that I lost track of how bad it got.
Further, this shows how the design of the internment system was a bit
naive---I assumed certain requirements that never panned out. In
particular, everything using symbols stored `&'i Symbol<'i>`---that is, a
reference (usize) to an object containing an index (32-bit) and a string
slice (128-bit). So it was a reference to a pretty large value, which was
allocated in the arena alongside the interned string itself.
But, that was assuming that something would need both the symbol index _and_
a readily available string. That's not the case. In fact, it's pretty
clear that interning happens at the beginning of execution, that `SymbolId`
is all that's needed during processing (unless an error occurs; more on that
below); and it's not until _the very end_ that we need to retrieve interned
strings from the pool to write either to a file or to display to the
user. It was horribly wasteful!
So `SymbolId` solves the lifetime issue in itself for most systems, but it
still requires that an interner be available for anything that needs to
create or resolve symbols, which, as it turns out, is still a lot of
things. Therefore, I decided to implement them as thread-local static
variables, which is very similar to what Rustc does itself (Rustc's are
scoped). TAMER does not use threads, so the resulting `'static` lifetime
should be just fine for now. Eventually I'd like to implement `!Send` and
`!Sync`, though, to prevent references from escaping the thread (as noted in
the patch); I can't do that yet, since the feature has not yet been
stabalized.
In the end, this leaves us with a system that's much easier to use and
maintain; hopefully easier for newcomers to get into without having to deal
with so many complex lifetimes; and a nice API that makes it a pleasure to
work with symbols.
Admittedly, the `SymbolIndexSize` adds some complexity, and we'll see if I
end up regretting that down the line, but it exists for an important reason:
the `Span` and other structures that'll be introduced need to pack a lot of
data into 64 bits so they can be freely copied around to keep lifetimes
simple without wreaking havoc in other ways, but a 32-bit symbol size needed
by the linker is too large for that. (Actually, the linker doesn't yet need
32 bits for our systems, but it's going to in the somewhat near future
unless we optimize away a bunch of symbols...but I'd really rather not have
the linker hit a limit that requires a lot of code changes to resolve).
Rustc uses interned spans when they exceed 8 bytes, but I'd prefer to avoid
that for now. Most systems can just use on of the `PkgSymbolId` or
`ProgSymbolId` type aliases and not have to worry about it. Systems that
are actually shared between the compiler and the linker do, though, but it's
not like we don't already have a bunch of trait bounds.
Of course, as we implement link-time optimizations (LTO) in the future, it's
possible most things will need the size and I'll grow frustrated with that
and possibly revisit this. We shall see.
Anyway, this was exhausting...and...onward to the first frontend!
2021-08-02 23:54:37 -04:00
|
|
|
|
fn intern(&self, value: &str) -> SymbolId<Ix> {
|
2021-08-02 23:54:37 -04:00
|
|
|
|
let mut map = self.map.borrow_mut();
|
|
|
|
|
|
|
|
|
|
if let Some(sym) = map.get(value) {
|
tamer: Global interners
This is a major change, and I apologize for it all being in one commit. I
had wanted to break it up, but doing so would have required a significant
amount of temporary work that was not worth doing while I'm the only one
working on this project at the moment.
This accomplishes a number of important things, now that I'm preparing to
write the first compiler frontend for TAMER:
1. `Symbol` has been removed; `SymbolId` is used in its place.
2. Consequently, symbols use 16 or 32 bits, rather than a 64-bit pointer.
3. Using symbols no longer requires dereferencing.
4. **Lifetimes no longer pollute the entire system! (`'i`)**
5. Two global interners are offered to produce `SymbolStr` with `'static`
lifetimes, simplfiying lifetime management and borrowing where strings
are still needed.
6. A nice API is provided for interning and lookups (e.g. "foo".intern())
which makes this look like a core feature of Rust.
Unfortunately, making this change required modifications to...virtually
everything. And that serves to emphasize why this change was needed:
_everything_ used symbols, and so there's no use in not providing globals.
I implemented this in a way that still provides for loose coupling through
Rust's trait system. Indeed, Rustc offers a global interner, and I decided
not to go that route initially because it wasn't clear to me that such a
thing was desirable. It didn't become apparent to me, in fact, until the
recent commit where I introduced `SymbolIndexSize` and saw how many things
had to be touched; the linker evolved so rapidly as I was trying to learn
Rust that I lost track of how bad it got.
Further, this shows how the design of the internment system was a bit
naive---I assumed certain requirements that never panned out. In
particular, everything using symbols stored `&'i Symbol<'i>`---that is, a
reference (usize) to an object containing an index (32-bit) and a string
slice (128-bit). So it was a reference to a pretty large value, which was
allocated in the arena alongside the interned string itself.
But, that was assuming that something would need both the symbol index _and_
a readily available string. That's not the case. In fact, it's pretty
clear that interning happens at the beginning of execution, that `SymbolId`
is all that's needed during processing (unless an error occurs; more on that
below); and it's not until _the very end_ that we need to retrieve interned
strings from the pool to write either to a file or to display to the
user. It was horribly wasteful!
So `SymbolId` solves the lifetime issue in itself for most systems, but it
still requires that an interner be available for anything that needs to
create or resolve symbols, which, as it turns out, is still a lot of
things. Therefore, I decided to implement them as thread-local static
variables, which is very similar to what Rustc does itself (Rustc's are
scoped). TAMER does not use threads, so the resulting `'static` lifetime
should be just fine for now. Eventually I'd like to implement `!Send` and
`!Sync`, though, to prevent references from escaping the thread (as noted in
the patch); I can't do that yet, since the feature has not yet been
stabalized.
In the end, this leaves us with a system that's much easier to use and
maintain; hopefully easier for newcomers to get into without having to deal
with so many complex lifetimes; and a nice API that makes it a pleasure to
work with symbols.
Admittedly, the `SymbolIndexSize` adds some complexity, and we'll see if I
end up regretting that down the line, but it exists for an important reason:
the `Span` and other structures that'll be introduced need to pack a lot of
data into 64 bits so they can be freely copied around to keep lifetimes
simple without wreaking havoc in other ways, but a 32-bit symbol size needed
by the linker is too large for that. (Actually, the linker doesn't yet need
32 bits for our systems, but it's going to in the somewhat near future
unless we optimize away a bunch of symbols...but I'd really rather not have
the linker hit a limit that requires a lot of code changes to resolve).
Rustc uses interned spans when they exceed 8 bytes, but I'd prefer to avoid
that for now. Most systems can just use on of the `PkgSymbolId` or
`ProgSymbolId` type aliases and not have to worry about it. Systems that
are actually shared between the compiler and the linker do, though, but it's
not like we don't already have a bunch of trait bounds.
Of course, as we implement link-time optimizations (LTO) in the future, it's
possible most things will need the size and I'll grow frustrated with that
and possibly revisit this. We shall see.
Anyway, this was exhausting...and...onward to the first frontend!
2021-08-02 23:54:37 -04:00
|
|
|
|
return *sym;
|
2021-08-02 23:54:37 -04:00
|
|
|
|
}
|
|
|
|
|
|
tamer: Global interners
This is a major change, and I apologize for it all being in one commit. I
had wanted to break it up, but doing so would have required a significant
amount of temporary work that was not worth doing while I'm the only one
working on this project at the moment.
This accomplishes a number of important things, now that I'm preparing to
write the first compiler frontend for TAMER:
1. `Symbol` has been removed; `SymbolId` is used in its place.
2. Consequently, symbols use 16 or 32 bits, rather than a 64-bit pointer.
3. Using symbols no longer requires dereferencing.
4. **Lifetimes no longer pollute the entire system! (`'i`)**
5. Two global interners are offered to produce `SymbolStr` with `'static`
lifetimes, simplfiying lifetime management and borrowing where strings
are still needed.
6. A nice API is provided for interning and lookups (e.g. "foo".intern())
which makes this look like a core feature of Rust.
Unfortunately, making this change required modifications to...virtually
everything. And that serves to emphasize why this change was needed:
_everything_ used symbols, and so there's no use in not providing globals.
I implemented this in a way that still provides for loose coupling through
Rust's trait system. Indeed, Rustc offers a global interner, and I decided
not to go that route initially because it wasn't clear to me that such a
thing was desirable. It didn't become apparent to me, in fact, until the
recent commit where I introduced `SymbolIndexSize` and saw how many things
had to be touched; the linker evolved so rapidly as I was trying to learn
Rust that I lost track of how bad it got.
Further, this shows how the design of the internment system was a bit
naive---I assumed certain requirements that never panned out. In
particular, everything using symbols stored `&'i Symbol<'i>`---that is, a
reference (usize) to an object containing an index (32-bit) and a string
slice (128-bit). So it was a reference to a pretty large value, which was
allocated in the arena alongside the interned string itself.
But, that was assuming that something would need both the symbol index _and_
a readily available string. That's not the case. In fact, it's pretty
clear that interning happens at the beginning of execution, that `SymbolId`
is all that's needed during processing (unless an error occurs; more on that
below); and it's not until _the very end_ that we need to retrieve interned
strings from the pool to write either to a file or to display to the
user. It was horribly wasteful!
So `SymbolId` solves the lifetime issue in itself for most systems, but it
still requires that an interner be available for anything that needs to
create or resolve symbols, which, as it turns out, is still a lot of
things. Therefore, I decided to implement them as thread-local static
variables, which is very similar to what Rustc does itself (Rustc's are
scoped). TAMER does not use threads, so the resulting `'static` lifetime
should be just fine for now. Eventually I'd like to implement `!Send` and
`!Sync`, though, to prevent references from escaping the thread (as noted in
the patch); I can't do that yet, since the feature has not yet been
stabalized.
In the end, this leaves us with a system that's much easier to use and
maintain; hopefully easier for newcomers to get into without having to deal
with so many complex lifetimes; and a nice API that makes it a pleasure to
work with symbols.
Admittedly, the `SymbolIndexSize` adds some complexity, and we'll see if I
end up regretting that down the line, but it exists for an important reason:
the `Span` and other structures that'll be introduced need to pack a lot of
data into 64 bits so they can be freely copied around to keep lifetimes
simple without wreaking havoc in other ways, but a 32-bit symbol size needed
by the linker is too large for that. (Actually, the linker doesn't yet need
32 bits for our systems, but it's going to in the somewhat near future
unless we optimize away a bunch of symbols...but I'd really rather not have
the linker hit a limit that requires a lot of code changes to resolve).
Rustc uses interned spans when they exceed 8 bytes, but I'd prefer to avoid
that for now. Most systems can just use on of the `PkgSymbolId` or
`ProgSymbolId` type aliases and not have to worry about it. Systems that
are actually shared between the compiler and the linker do, though, but it's
not like we don't already have a bunch of trait bounds.
Of course, as we implement link-time optimizations (LTO) in the future, it's
possible most things will need the size and I'll grow frustrated with that
and possibly revisit this. We shall see.
Anyway, this was exhausting...and...onward to the first frontend!
2021-08-02 23:54:37 -04:00
|
|
|
|
let mut syms = self.strings.borrow_mut();
|
2021-08-02 23:54:37 -04:00
|
|
|
|
|
|
|
|
|
let next_index: Ix = syms
|
|
|
|
|
.len()
|
|
|
|
|
.try_into()
|
|
|
|
|
.expect("internal error: SymbolId range exhausted");
|
|
|
|
|
|
|
|
|
|
// This is not actually unsafe because next_index is always >0
|
|
|
|
|
// from initialization.
|
|
|
|
|
debug_assert!(Ix::new(next_index).is_some()); // != 0 check
|
|
|
|
|
let id = unsafe { SymbolId::from_int_unchecked(next_index) };
|
|
|
|
|
|
|
|
|
|
// Copy string slice into the arena.
|
|
|
|
|
let clone: &'i str = unsafe {
|
|
|
|
|
&*(std::str::from_utf8_unchecked(
|
|
|
|
|
self.arena.alloc_slice_clone(value.as_bytes()),
|
|
|
|
|
) as *const str)
|
|
|
|
|
};
|
|
|
|
|
|
tamer: Global interners
This is a major change, and I apologize for it all being in one commit. I
had wanted to break it up, but doing so would have required a significant
amount of temporary work that was not worth doing while I'm the only one
working on this project at the moment.
This accomplishes a number of important things, now that I'm preparing to
write the first compiler frontend for TAMER:
1. `Symbol` has been removed; `SymbolId` is used in its place.
2. Consequently, symbols use 16 or 32 bits, rather than a 64-bit pointer.
3. Using symbols no longer requires dereferencing.
4. **Lifetimes no longer pollute the entire system! (`'i`)**
5. Two global interners are offered to produce `SymbolStr` with `'static`
lifetimes, simplfiying lifetime management and borrowing where strings
are still needed.
6. A nice API is provided for interning and lookups (e.g. "foo".intern())
which makes this look like a core feature of Rust.
Unfortunately, making this change required modifications to...virtually
everything. And that serves to emphasize why this change was needed:
_everything_ used symbols, and so there's no use in not providing globals.
I implemented this in a way that still provides for loose coupling through
Rust's trait system. Indeed, Rustc offers a global interner, and I decided
not to go that route initially because it wasn't clear to me that such a
thing was desirable. It didn't become apparent to me, in fact, until the
recent commit where I introduced `SymbolIndexSize` and saw how many things
had to be touched; the linker evolved so rapidly as I was trying to learn
Rust that I lost track of how bad it got.
Further, this shows how the design of the internment system was a bit
naive---I assumed certain requirements that never panned out. In
particular, everything using symbols stored `&'i Symbol<'i>`---that is, a
reference (usize) to an object containing an index (32-bit) and a string
slice (128-bit). So it was a reference to a pretty large value, which was
allocated in the arena alongside the interned string itself.
But, that was assuming that something would need both the symbol index _and_
a readily available string. That's not the case. In fact, it's pretty
clear that interning happens at the beginning of execution, that `SymbolId`
is all that's needed during processing (unless an error occurs; more on that
below); and it's not until _the very end_ that we need to retrieve interned
strings from the pool to write either to a file or to display to the
user. It was horribly wasteful!
So `SymbolId` solves the lifetime issue in itself for most systems, but it
still requires that an interner be available for anything that needs to
create or resolve symbols, which, as it turns out, is still a lot of
things. Therefore, I decided to implement them as thread-local static
variables, which is very similar to what Rustc does itself (Rustc's are
scoped). TAMER does not use threads, so the resulting `'static` lifetime
should be just fine for now. Eventually I'd like to implement `!Send` and
`!Sync`, though, to prevent references from escaping the thread (as noted in
the patch); I can't do that yet, since the feature has not yet been
stabalized.
In the end, this leaves us with a system that's much easier to use and
maintain; hopefully easier for newcomers to get into without having to deal
with so many complex lifetimes; and a nice API that makes it a pleasure to
work with symbols.
Admittedly, the `SymbolIndexSize` adds some complexity, and we'll see if I
end up regretting that down the line, but it exists for an important reason:
the `Span` and other structures that'll be introduced need to pack a lot of
data into 64 bits so they can be freely copied around to keep lifetimes
simple without wreaking havoc in other ways, but a 32-bit symbol size needed
by the linker is too large for that. (Actually, the linker doesn't yet need
32 bits for our systems, but it's going to in the somewhat near future
unless we optimize away a bunch of symbols...but I'd really rather not have
the linker hit a limit that requires a lot of code changes to resolve).
Rustc uses interned spans when they exceed 8 bytes, but I'd prefer to avoid
that for now. Most systems can just use on of the `PkgSymbolId` or
`ProgSymbolId` type aliases and not have to worry about it. Systems that
are actually shared between the compiler and the linker do, though, but it's
not like we don't already have a bunch of trait bounds.
Of course, as we implement link-time optimizations (LTO) in the future, it's
possible most things will need the size and I'll grow frustrated with that
and possibly revisit this. We shall see.
Anyway, this was exhausting...and...onward to the first frontend!
2021-08-02 23:54:37 -04:00
|
|
|
|
map.insert(clone, id);
|
|
|
|
|
syms.push(clone);
|
2021-08-02 23:54:37 -04:00
|
|
|
|
|
tamer: Global interners
This is a major change, and I apologize for it all being in one commit. I
had wanted to break it up, but doing so would have required a significant
amount of temporary work that was not worth doing while I'm the only one
working on this project at the moment.
This accomplishes a number of important things, now that I'm preparing to
write the first compiler frontend for TAMER:
1. `Symbol` has been removed; `SymbolId` is used in its place.
2. Consequently, symbols use 16 or 32 bits, rather than a 64-bit pointer.
3. Using symbols no longer requires dereferencing.
4. **Lifetimes no longer pollute the entire system! (`'i`)**
5. Two global interners are offered to produce `SymbolStr` with `'static`
lifetimes, simplfiying lifetime management and borrowing where strings
are still needed.
6. A nice API is provided for interning and lookups (e.g. "foo".intern())
which makes this look like a core feature of Rust.
Unfortunately, making this change required modifications to...virtually
everything. And that serves to emphasize why this change was needed:
_everything_ used symbols, and so there's no use in not providing globals.
I implemented this in a way that still provides for loose coupling through
Rust's trait system. Indeed, Rustc offers a global interner, and I decided
not to go that route initially because it wasn't clear to me that such a
thing was desirable. It didn't become apparent to me, in fact, until the
recent commit where I introduced `SymbolIndexSize` and saw how many things
had to be touched; the linker evolved so rapidly as I was trying to learn
Rust that I lost track of how bad it got.
Further, this shows how the design of the internment system was a bit
naive---I assumed certain requirements that never panned out. In
particular, everything using symbols stored `&'i Symbol<'i>`---that is, a
reference (usize) to an object containing an index (32-bit) and a string
slice (128-bit). So it was a reference to a pretty large value, which was
allocated in the arena alongside the interned string itself.
But, that was assuming that something would need both the symbol index _and_
a readily available string. That's not the case. In fact, it's pretty
clear that interning happens at the beginning of execution, that `SymbolId`
is all that's needed during processing (unless an error occurs; more on that
below); and it's not until _the very end_ that we need to retrieve interned
strings from the pool to write either to a file or to display to the
user. It was horribly wasteful!
So `SymbolId` solves the lifetime issue in itself for most systems, but it
still requires that an interner be available for anything that needs to
create or resolve symbols, which, as it turns out, is still a lot of
things. Therefore, I decided to implement them as thread-local static
variables, which is very similar to what Rustc does itself (Rustc's are
scoped). TAMER does not use threads, so the resulting `'static` lifetime
should be just fine for now. Eventually I'd like to implement `!Send` and
`!Sync`, though, to prevent references from escaping the thread (as noted in
the patch); I can't do that yet, since the feature has not yet been
stabalized.
In the end, this leaves us with a system that's much easier to use and
maintain; hopefully easier for newcomers to get into without having to deal
with so many complex lifetimes; and a nice API that makes it a pleasure to
work with symbols.
Admittedly, the `SymbolIndexSize` adds some complexity, and we'll see if I
end up regretting that down the line, but it exists for an important reason:
the `Span` and other structures that'll be introduced need to pack a lot of
data into 64 bits so they can be freely copied around to keep lifetimes
simple without wreaking havoc in other ways, but a 32-bit symbol size needed
by the linker is too large for that. (Actually, the linker doesn't yet need
32 bits for our systems, but it's going to in the somewhat near future
unless we optimize away a bunch of symbols...but I'd really rather not have
the linker hit a limit that requires a lot of code changes to resolve).
Rustc uses interned spans when they exceed 8 bytes, but I'd prefer to avoid
that for now. Most systems can just use on of the `PkgSymbolId` or
`ProgSymbolId` type aliases and not have to worry about it. Systems that
are actually shared between the compiler and the linker do, though, but it's
not like we don't already have a bunch of trait bounds.
Of course, as we implement link-time optimizations (LTO) in the future, it's
possible most things will need the size and I'll grow frustrated with that
and possibly revisit this. We shall see.
Anyway, this was exhausting...and...onward to the first frontend!
2021-08-02 23:54:37 -04:00
|
|
|
|
id
|
2021-08-02 23:54:37 -04:00
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
#[inline]
|
tamer: Global interners
This is a major change, and I apologize for it all being in one commit. I
had wanted to break it up, but doing so would have required a significant
amount of temporary work that was not worth doing while I'm the only one
working on this project at the moment.
This accomplishes a number of important things, now that I'm preparing to
write the first compiler frontend for TAMER:
1. `Symbol` has been removed; `SymbolId` is used in its place.
2. Consequently, symbols use 16 or 32 bits, rather than a 64-bit pointer.
3. Using symbols no longer requires dereferencing.
4. **Lifetimes no longer pollute the entire system! (`'i`)**
5. Two global interners are offered to produce `SymbolStr` with `'static`
lifetimes, simplfiying lifetime management and borrowing where strings
are still needed.
6. A nice API is provided for interning and lookups (e.g. "foo".intern())
which makes this look like a core feature of Rust.
Unfortunately, making this change required modifications to...virtually
everything. And that serves to emphasize why this change was needed:
_everything_ used symbols, and so there's no use in not providing globals.
I implemented this in a way that still provides for loose coupling through
Rust's trait system. Indeed, Rustc offers a global interner, and I decided
not to go that route initially because it wasn't clear to me that such a
thing was desirable. It didn't become apparent to me, in fact, until the
recent commit where I introduced `SymbolIndexSize` and saw how many things
had to be touched; the linker evolved so rapidly as I was trying to learn
Rust that I lost track of how bad it got.
Further, this shows how the design of the internment system was a bit
naive---I assumed certain requirements that never panned out. In
particular, everything using symbols stored `&'i Symbol<'i>`---that is, a
reference (usize) to an object containing an index (32-bit) and a string
slice (128-bit). So it was a reference to a pretty large value, which was
allocated in the arena alongside the interned string itself.
But, that was assuming that something would need both the symbol index _and_
a readily available string. That's not the case. In fact, it's pretty
clear that interning happens at the beginning of execution, that `SymbolId`
is all that's needed during processing (unless an error occurs; more on that
below); and it's not until _the very end_ that we need to retrieve interned
strings from the pool to write either to a file or to display to the
user. It was horribly wasteful!
So `SymbolId` solves the lifetime issue in itself for most systems, but it
still requires that an interner be available for anything that needs to
create or resolve symbols, which, as it turns out, is still a lot of
things. Therefore, I decided to implement them as thread-local static
variables, which is very similar to what Rustc does itself (Rustc's are
scoped). TAMER does not use threads, so the resulting `'static` lifetime
should be just fine for now. Eventually I'd like to implement `!Send` and
`!Sync`, though, to prevent references from escaping the thread (as noted in
the patch); I can't do that yet, since the feature has not yet been
stabalized.
In the end, this leaves us with a system that's much easier to use and
maintain; hopefully easier for newcomers to get into without having to deal
with so many complex lifetimes; and a nice API that makes it a pleasure to
work with symbols.
Admittedly, the `SymbolIndexSize` adds some complexity, and we'll see if I
end up regretting that down the line, but it exists for an important reason:
the `Span` and other structures that'll be introduced need to pack a lot of
data into 64 bits so they can be freely copied around to keep lifetimes
simple without wreaking havoc in other ways, but a 32-bit symbol size needed
by the linker is too large for that. (Actually, the linker doesn't yet need
32 bits for our systems, but it's going to in the somewhat near future
unless we optimize away a bunch of symbols...but I'd really rather not have
the linker hit a limit that requires a lot of code changes to resolve).
Rustc uses interned spans when they exceed 8 bytes, but I'd prefer to avoid
that for now. Most systems can just use on of the `PkgSymbolId` or
`ProgSymbolId` type aliases and not have to worry about it. Systems that
are actually shared between the compiler and the linker do, though, but it's
not like we don't already have a bunch of trait bounds.
Of course, as we implement link-time optimizations (LTO) in the future, it's
possible most things will need the size and I'll grow frustrated with that
and possibly revisit this. We shall see.
Anyway, this was exhausting...and...onward to the first frontend!
2021-08-02 23:54:37 -04:00
|
|
|
|
fn intern_soft(&self, value: &str) -> Option<SymbolId<Ix>> {
|
2021-08-02 23:54:37 -04:00
|
|
|
|
self.map.borrow().get(value).map(|sym| *sym)
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
#[inline]
|
|
|
|
|
fn contains(&self, value: &str) -> bool {
|
|
|
|
|
self.map.borrow().contains_key(value)
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
#[inline]
|
|
|
|
|
fn len(&self) -> usize {
|
|
|
|
|
self.map.borrow().len()
|
|
|
|
|
}
|
|
|
|
|
|
tamer: Global interners
This is a major change, and I apologize for it all being in one commit. I
had wanted to break it up, but doing so would have required a significant
amount of temporary work that was not worth doing while I'm the only one
working on this project at the moment.
This accomplishes a number of important things, now that I'm preparing to
write the first compiler frontend for TAMER:
1. `Symbol` has been removed; `SymbolId` is used in its place.
2. Consequently, symbols use 16 or 32 bits, rather than a 64-bit pointer.
3. Using symbols no longer requires dereferencing.
4. **Lifetimes no longer pollute the entire system! (`'i`)**
5. Two global interners are offered to produce `SymbolStr` with `'static`
lifetimes, simplfiying lifetime management and borrowing where strings
are still needed.
6. A nice API is provided for interning and lookups (e.g. "foo".intern())
which makes this look like a core feature of Rust.
Unfortunately, making this change required modifications to...virtually
everything. And that serves to emphasize why this change was needed:
_everything_ used symbols, and so there's no use in not providing globals.
I implemented this in a way that still provides for loose coupling through
Rust's trait system. Indeed, Rustc offers a global interner, and I decided
not to go that route initially because it wasn't clear to me that such a
thing was desirable. It didn't become apparent to me, in fact, until the
recent commit where I introduced `SymbolIndexSize` and saw how many things
had to be touched; the linker evolved so rapidly as I was trying to learn
Rust that I lost track of how bad it got.
Further, this shows how the design of the internment system was a bit
naive---I assumed certain requirements that never panned out. In
particular, everything using symbols stored `&'i Symbol<'i>`---that is, a
reference (usize) to an object containing an index (32-bit) and a string
slice (128-bit). So it was a reference to a pretty large value, which was
allocated in the arena alongside the interned string itself.
But, that was assuming that something would need both the symbol index _and_
a readily available string. That's not the case. In fact, it's pretty
clear that interning happens at the beginning of execution, that `SymbolId`
is all that's needed during processing (unless an error occurs; more on that
below); and it's not until _the very end_ that we need to retrieve interned
strings from the pool to write either to a file or to display to the
user. It was horribly wasteful!
So `SymbolId` solves the lifetime issue in itself for most systems, but it
still requires that an interner be available for anything that needs to
create or resolve symbols, which, as it turns out, is still a lot of
things. Therefore, I decided to implement them as thread-local static
variables, which is very similar to what Rustc does itself (Rustc's are
scoped). TAMER does not use threads, so the resulting `'static` lifetime
should be just fine for now. Eventually I'd like to implement `!Send` and
`!Sync`, though, to prevent references from escaping the thread (as noted in
the patch); I can't do that yet, since the feature has not yet been
stabalized.
In the end, this leaves us with a system that's much easier to use and
maintain; hopefully easier for newcomers to get into without having to deal
with so many complex lifetimes; and a nice API that makes it a pleasure to
work with symbols.
Admittedly, the `SymbolIndexSize` adds some complexity, and we'll see if I
end up regretting that down the line, but it exists for an important reason:
the `Span` and other structures that'll be introduced need to pack a lot of
data into 64 bits so they can be freely copied around to keep lifetimes
simple without wreaking havoc in other ways, but a 32-bit symbol size needed
by the linker is too large for that. (Actually, the linker doesn't yet need
32 bits for our systems, but it's going to in the somewhat near future
unless we optimize away a bunch of symbols...but I'd really rather not have
the linker hit a limit that requires a lot of code changes to resolve).
Rustc uses interned spans when they exceed 8 bytes, but I'd prefer to avoid
that for now. Most systems can just use on of the `PkgSymbolId` or
`ProgSymbolId` type aliases and not have to worry about it. Systems that
are actually shared between the compiler and the linker do, though, but it's
not like we don't already have a bunch of trait bounds.
Of course, as we implement link-time optimizations (LTO) in the future, it's
possible most things will need the size and I'll grow frustrated with that
and possibly revisit this. We shall see.
Anyway, this was exhausting...and...onward to the first frontend!
2021-08-02 23:54:37 -04:00
|
|
|
|
fn index_lookup(&'i self, index: SymbolId<Ix>) -> Option<SymbolStr<'i>> {
|
|
|
|
|
self.strings
|
|
|
|
|
.borrow()
|
|
|
|
|
.get(index.as_usize())
|
|
|
|
|
.map(|str| SymbolStr::from_interned_slice(*str))
|
2021-08-02 23:54:37 -04:00
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
/// Interner using the [Fx Hash][fxhash] hashing function.
|
|
|
|
|
///
|
|
|
|
|
/// _This is currently the hash function used by [`DefaultInterner`]._
|
|
|
|
|
///
|
|
|
|
|
/// If denial of service is not a concern,
|
|
|
|
|
/// then this will outperform the default
|
|
|
|
|
/// [`DefaultHasher`](std::collections::hash_map::DefaultHasher)
|
|
|
|
|
/// (which uses SipHash at the time of writing).
|
|
|
|
|
///
|
|
|
|
|
/// See intern benchmarks for a comparison.
|
|
|
|
|
pub type FxArenaInterner<'i, Ix> = ArenaInterner<'i, FxBuildHasher, Ix>;
|
|
|
|
|
|
|
|
|
|
/// Recommended [`Interner`] and configuration.
|
|
|
|
|
///
|
|
|
|
|
/// The choice of this default relies on the assumption that
|
|
|
|
|
/// denial-of-service attacks against the hash function are not a
|
|
|
|
|
/// concern.
|
|
|
|
|
///
|
|
|
|
|
/// For more information on the hashing algorithm,
|
|
|
|
|
/// see [`FxArenaInterner`].
|
|
|
|
|
pub type DefaultInterner<'i, Ix> = FxArenaInterner<'i, Ix>;
|
|
|
|
|
|
|
|
|
|
/// Interner for individual packages and their dependencies.
|
|
|
|
|
///
|
|
|
|
|
/// This type should be preferred to [`DefaultPkgInterner`] when only a
|
|
|
|
|
/// single package's symbols are being processed,
|
|
|
|
|
/// since it can be better packed into structs.
|
|
|
|
|
pub type DefaultPkgInterner<'i> = DefaultInterner<'i, global::PkgSymSize>;
|
|
|
|
|
|
|
|
|
|
/// Interner for entire programs.
|
|
|
|
|
///
|
|
|
|
|
/// This interner holds symbols with a larger underyling datatype than
|
|
|
|
|
/// [`DefaultPkgInterner`].
|
|
|
|
|
/// It is intended for use by linkers or anything else that needs to process
|
|
|
|
|
/// a large number of packages in a program simultaneously.
|
|
|
|
|
pub type DefaultProgInterner<'i> = DefaultInterner<'i, global::ProgSymSize>;
|
|
|
|
|
|
tamer: Global interners
This is a major change, and I apologize for it all being in one commit. I
had wanted to break it up, but doing so would have required a significant
amount of temporary work that was not worth doing while I'm the only one
working on this project at the moment.
This accomplishes a number of important things, now that I'm preparing to
write the first compiler frontend for TAMER:
1. `Symbol` has been removed; `SymbolId` is used in its place.
2. Consequently, symbols use 16 or 32 bits, rather than a 64-bit pointer.
3. Using symbols no longer requires dereferencing.
4. **Lifetimes no longer pollute the entire system! (`'i`)**
5. Two global interners are offered to produce `SymbolStr` with `'static`
lifetimes, simplfiying lifetime management and borrowing where strings
are still needed.
6. A nice API is provided for interning and lookups (e.g. "foo".intern())
which makes this look like a core feature of Rust.
Unfortunately, making this change required modifications to...virtually
everything. And that serves to emphasize why this change was needed:
_everything_ used symbols, and so there's no use in not providing globals.
I implemented this in a way that still provides for loose coupling through
Rust's trait system. Indeed, Rustc offers a global interner, and I decided
not to go that route initially because it wasn't clear to me that such a
thing was desirable. It didn't become apparent to me, in fact, until the
recent commit where I introduced `SymbolIndexSize` and saw how many things
had to be touched; the linker evolved so rapidly as I was trying to learn
Rust that I lost track of how bad it got.
Further, this shows how the design of the internment system was a bit
naive---I assumed certain requirements that never panned out. In
particular, everything using symbols stored `&'i Symbol<'i>`---that is, a
reference (usize) to an object containing an index (32-bit) and a string
slice (128-bit). So it was a reference to a pretty large value, which was
allocated in the arena alongside the interned string itself.
But, that was assuming that something would need both the symbol index _and_
a readily available string. That's not the case. In fact, it's pretty
clear that interning happens at the beginning of execution, that `SymbolId`
is all that's needed during processing (unless an error occurs; more on that
below); and it's not until _the very end_ that we need to retrieve interned
strings from the pool to write either to a file or to display to the
user. It was horribly wasteful!
So `SymbolId` solves the lifetime issue in itself for most systems, but it
still requires that an interner be available for anything that needs to
create or resolve symbols, which, as it turns out, is still a lot of
things. Therefore, I decided to implement them as thread-local static
variables, which is very similar to what Rustc does itself (Rustc's are
scoped). TAMER does not use threads, so the resulting `'static` lifetime
should be just fine for now. Eventually I'd like to implement `!Send` and
`!Sync`, though, to prevent references from escaping the thread (as noted in
the patch); I can't do that yet, since the feature has not yet been
stabalized.
In the end, this leaves us with a system that's much easier to use and
maintain; hopefully easier for newcomers to get into without having to deal
with so many complex lifetimes; and a nice API that makes it a pleasure to
work with symbols.
Admittedly, the `SymbolIndexSize` adds some complexity, and we'll see if I
end up regretting that down the line, but it exists for an important reason:
the `Span` and other structures that'll be introduced need to pack a lot of
data into 64 bits so they can be freely copied around to keep lifetimes
simple without wreaking havoc in other ways, but a 32-bit symbol size needed
by the linker is too large for that. (Actually, the linker doesn't yet need
32 bits for our systems, but it's going to in the somewhat near future
unless we optimize away a bunch of symbols...but I'd really rather not have
the linker hit a limit that requires a lot of code changes to resolve).
Rustc uses interned spans when they exceed 8 bytes, but I'd prefer to avoid
that for now. Most systems can just use on of the `PkgSymbolId` or
`ProgSymbolId` type aliases and not have to worry about it. Systems that
are actually shared between the compiler and the linker do, though, but it's
not like we don't already have a bunch of trait bounds.
Of course, as we implement link-time optimizations (LTO) in the future, it's
possible most things will need the size and I'll grow frustrated with that
and possibly revisit this. We shall see.
Anyway, this was exhausting...and...onward to the first frontend!
2021-08-02 23:54:37 -04:00
|
|
|
|
// Note that these tests assert on standalone interners, not on the globals;
|
|
|
|
|
// see the `global` sibling package for those tests.
|
2021-08-02 23:54:37 -04:00
|
|
|
|
#[cfg(test)]
|
|
|
|
|
mod test {
|
|
|
|
|
use super::*;
|
|
|
|
|
|
|
|
|
|
type Sut<'i> = DefaultInterner<'i, global::ProgSymSize>;
|
|
|
|
|
|
|
|
|
|
#[test]
|
|
|
|
|
fn recognizes_equal_strings() {
|
|
|
|
|
let a = "foo";
|
|
|
|
|
let b = a.to_string();
|
|
|
|
|
let c = "bar";
|
|
|
|
|
let d = c.to_string();
|
|
|
|
|
|
|
|
|
|
let sut = Sut::new();
|
|
|
|
|
|
|
|
|
|
let (ia, ib, ic, id) =
|
|
|
|
|
(sut.intern(a), sut.intern(&b), sut.intern(c), sut.intern(&d));
|
|
|
|
|
|
|
|
|
|
assert_eq!(ia, ib);
|
|
|
|
|
assert_eq!(ic, id);
|
|
|
|
|
assert_ne!(ia, ic);
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
#[test]
|
|
|
|
|
fn symbol_id_increases_with_each_new_intern() {
|
|
|
|
|
let sut = Sut::new();
|
|
|
|
|
|
|
|
|
|
// Remember that identifiers begin at 1
|
|
|
|
|
assert_eq!(
|
2021-08-13 00:12:09 -04:00
|
|
|
|
SymbolId::test_from_int(1),
|
tamer: Global interners
This is a major change, and I apologize for it all being in one commit. I
had wanted to break it up, but doing so would have required a significant
amount of temporary work that was not worth doing while I'm the only one
working on this project at the moment.
This accomplishes a number of important things, now that I'm preparing to
write the first compiler frontend for TAMER:
1. `Symbol` has been removed; `SymbolId` is used in its place.
2. Consequently, symbols use 16 or 32 bits, rather than a 64-bit pointer.
3. Using symbols no longer requires dereferencing.
4. **Lifetimes no longer pollute the entire system! (`'i`)**
5. Two global interners are offered to produce `SymbolStr` with `'static`
lifetimes, simplfiying lifetime management and borrowing where strings
are still needed.
6. A nice API is provided for interning and lookups (e.g. "foo".intern())
which makes this look like a core feature of Rust.
Unfortunately, making this change required modifications to...virtually
everything. And that serves to emphasize why this change was needed:
_everything_ used symbols, and so there's no use in not providing globals.
I implemented this in a way that still provides for loose coupling through
Rust's trait system. Indeed, Rustc offers a global interner, and I decided
not to go that route initially because it wasn't clear to me that such a
thing was desirable. It didn't become apparent to me, in fact, until the
recent commit where I introduced `SymbolIndexSize` and saw how many things
had to be touched; the linker evolved so rapidly as I was trying to learn
Rust that I lost track of how bad it got.
Further, this shows how the design of the internment system was a bit
naive---I assumed certain requirements that never panned out. In
particular, everything using symbols stored `&'i Symbol<'i>`---that is, a
reference (usize) to an object containing an index (32-bit) and a string
slice (128-bit). So it was a reference to a pretty large value, which was
allocated in the arena alongside the interned string itself.
But, that was assuming that something would need both the symbol index _and_
a readily available string. That's not the case. In fact, it's pretty
clear that interning happens at the beginning of execution, that `SymbolId`
is all that's needed during processing (unless an error occurs; more on that
below); and it's not until _the very end_ that we need to retrieve interned
strings from the pool to write either to a file or to display to the
user. It was horribly wasteful!
So `SymbolId` solves the lifetime issue in itself for most systems, but it
still requires that an interner be available for anything that needs to
create or resolve symbols, which, as it turns out, is still a lot of
things. Therefore, I decided to implement them as thread-local static
variables, which is very similar to what Rustc does itself (Rustc's are
scoped). TAMER does not use threads, so the resulting `'static` lifetime
should be just fine for now. Eventually I'd like to implement `!Send` and
`!Sync`, though, to prevent references from escaping the thread (as noted in
the patch); I can't do that yet, since the feature has not yet been
stabalized.
In the end, this leaves us with a system that's much easier to use and
maintain; hopefully easier for newcomers to get into without having to deal
with so many complex lifetimes; and a nice API that makes it a pleasure to
work with symbols.
Admittedly, the `SymbolIndexSize` adds some complexity, and we'll see if I
end up regretting that down the line, but it exists for an important reason:
the `Span` and other structures that'll be introduced need to pack a lot of
data into 64 bits so they can be freely copied around to keep lifetimes
simple without wreaking havoc in other ways, but a 32-bit symbol size needed
by the linker is too large for that. (Actually, the linker doesn't yet need
32 bits for our systems, but it's going to in the somewhat near future
unless we optimize away a bunch of symbols...but I'd really rather not have
the linker hit a limit that requires a lot of code changes to resolve).
Rustc uses interned spans when they exceed 8 bytes, but I'd prefer to avoid
that for now. Most systems can just use on of the `PkgSymbolId` or
`ProgSymbolId` type aliases and not have to worry about it. Systems that
are actually shared between the compiler and the linker do, though, but it's
not like we don't already have a bunch of trait bounds.
Of course, as we implement link-time optimizations (LTO) in the future, it's
possible most things will need the size and I'll grow frustrated with that
and possibly revisit this. We shall see.
Anyway, this was exhausting...and...onward to the first frontend!
2021-08-02 23:54:37 -04:00
|
|
|
|
sut.intern("foo"),
|
2021-08-02 23:54:37 -04:00
|
|
|
|
"First index should be 1"
|
|
|
|
|
);
|
|
|
|
|
|
|
|
|
|
assert_eq!(
|
2021-08-13 00:12:09 -04:00
|
|
|
|
SymbolId::test_from_int(1),
|
tamer: Global interners
This is a major change, and I apologize for it all being in one commit. I
had wanted to break it up, but doing so would have required a significant
amount of temporary work that was not worth doing while I'm the only one
working on this project at the moment.
This accomplishes a number of important things, now that I'm preparing to
write the first compiler frontend for TAMER:
1. `Symbol` has been removed; `SymbolId` is used in its place.
2. Consequently, symbols use 16 or 32 bits, rather than a 64-bit pointer.
3. Using symbols no longer requires dereferencing.
4. **Lifetimes no longer pollute the entire system! (`'i`)**
5. Two global interners are offered to produce `SymbolStr` with `'static`
lifetimes, simplfiying lifetime management and borrowing where strings
are still needed.
6. A nice API is provided for interning and lookups (e.g. "foo".intern())
which makes this look like a core feature of Rust.
Unfortunately, making this change required modifications to...virtually
everything. And that serves to emphasize why this change was needed:
_everything_ used symbols, and so there's no use in not providing globals.
I implemented this in a way that still provides for loose coupling through
Rust's trait system. Indeed, Rustc offers a global interner, and I decided
not to go that route initially because it wasn't clear to me that such a
thing was desirable. It didn't become apparent to me, in fact, until the
recent commit where I introduced `SymbolIndexSize` and saw how many things
had to be touched; the linker evolved so rapidly as I was trying to learn
Rust that I lost track of how bad it got.
Further, this shows how the design of the internment system was a bit
naive---I assumed certain requirements that never panned out. In
particular, everything using symbols stored `&'i Symbol<'i>`---that is, a
reference (usize) to an object containing an index (32-bit) and a string
slice (128-bit). So it was a reference to a pretty large value, which was
allocated in the arena alongside the interned string itself.
But, that was assuming that something would need both the symbol index _and_
a readily available string. That's not the case. In fact, it's pretty
clear that interning happens at the beginning of execution, that `SymbolId`
is all that's needed during processing (unless an error occurs; more on that
below); and it's not until _the very end_ that we need to retrieve interned
strings from the pool to write either to a file or to display to the
user. It was horribly wasteful!
So `SymbolId` solves the lifetime issue in itself for most systems, but it
still requires that an interner be available for anything that needs to
create or resolve symbols, which, as it turns out, is still a lot of
things. Therefore, I decided to implement them as thread-local static
variables, which is very similar to what Rustc does itself (Rustc's are
scoped). TAMER does not use threads, so the resulting `'static` lifetime
should be just fine for now. Eventually I'd like to implement `!Send` and
`!Sync`, though, to prevent references from escaping the thread (as noted in
the patch); I can't do that yet, since the feature has not yet been
stabalized.
In the end, this leaves us with a system that's much easier to use and
maintain; hopefully easier for newcomers to get into without having to deal
with so many complex lifetimes; and a nice API that makes it a pleasure to
work with symbols.
Admittedly, the `SymbolIndexSize` adds some complexity, and we'll see if I
end up regretting that down the line, but it exists for an important reason:
the `Span` and other structures that'll be introduced need to pack a lot of
data into 64 bits so they can be freely copied around to keep lifetimes
simple without wreaking havoc in other ways, but a 32-bit symbol size needed
by the linker is too large for that. (Actually, the linker doesn't yet need
32 bits for our systems, but it's going to in the somewhat near future
unless we optimize away a bunch of symbols...but I'd really rather not have
the linker hit a limit that requires a lot of code changes to resolve).
Rustc uses interned spans when they exceed 8 bytes, but I'd prefer to avoid
that for now. Most systems can just use on of the `PkgSymbolId` or
`ProgSymbolId` type aliases and not have to worry about it. Systems that
are actually shared between the compiler and the linker do, though, but it's
not like we don't already have a bunch of trait bounds.
Of course, as we implement link-time optimizations (LTO) in the future, it's
possible most things will need the size and I'll grow frustrated with that
and possibly revisit this. We shall see.
Anyway, this was exhausting...and...onward to the first frontend!
2021-08-02 23:54:37 -04:00
|
|
|
|
sut.intern("foo"),
|
2021-08-02 23:54:37 -04:00
|
|
|
|
"Index should not increment for already-interned symbols"
|
|
|
|
|
);
|
|
|
|
|
|
|
|
|
|
assert_eq!(
|
2021-08-13 00:12:09 -04:00
|
|
|
|
SymbolId::test_from_int(2),
|
tamer: Global interners
This is a major change, and I apologize for it all being in one commit. I
had wanted to break it up, but doing so would have required a significant
amount of temporary work that was not worth doing while I'm the only one
working on this project at the moment.
This accomplishes a number of important things, now that I'm preparing to
write the first compiler frontend for TAMER:
1. `Symbol` has been removed; `SymbolId` is used in its place.
2. Consequently, symbols use 16 or 32 bits, rather than a 64-bit pointer.
3. Using symbols no longer requires dereferencing.
4. **Lifetimes no longer pollute the entire system! (`'i`)**
5. Two global interners are offered to produce `SymbolStr` with `'static`
lifetimes, simplfiying lifetime management and borrowing where strings
are still needed.
6. A nice API is provided for interning and lookups (e.g. "foo".intern())
which makes this look like a core feature of Rust.
Unfortunately, making this change required modifications to...virtually
everything. And that serves to emphasize why this change was needed:
_everything_ used symbols, and so there's no use in not providing globals.
I implemented this in a way that still provides for loose coupling through
Rust's trait system. Indeed, Rustc offers a global interner, and I decided
not to go that route initially because it wasn't clear to me that such a
thing was desirable. It didn't become apparent to me, in fact, until the
recent commit where I introduced `SymbolIndexSize` and saw how many things
had to be touched; the linker evolved so rapidly as I was trying to learn
Rust that I lost track of how bad it got.
Further, this shows how the design of the internment system was a bit
naive---I assumed certain requirements that never panned out. In
particular, everything using symbols stored `&'i Symbol<'i>`---that is, a
reference (usize) to an object containing an index (32-bit) and a string
slice (128-bit). So it was a reference to a pretty large value, which was
allocated in the arena alongside the interned string itself.
But, that was assuming that something would need both the symbol index _and_
a readily available string. That's not the case. In fact, it's pretty
clear that interning happens at the beginning of execution, that `SymbolId`
is all that's needed during processing (unless an error occurs; more on that
below); and it's not until _the very end_ that we need to retrieve interned
strings from the pool to write either to a file or to display to the
user. It was horribly wasteful!
So `SymbolId` solves the lifetime issue in itself for most systems, but it
still requires that an interner be available for anything that needs to
create or resolve symbols, which, as it turns out, is still a lot of
things. Therefore, I decided to implement them as thread-local static
variables, which is very similar to what Rustc does itself (Rustc's are
scoped). TAMER does not use threads, so the resulting `'static` lifetime
should be just fine for now. Eventually I'd like to implement `!Send` and
`!Sync`, though, to prevent references from escaping the thread (as noted in
the patch); I can't do that yet, since the feature has not yet been
stabalized.
In the end, this leaves us with a system that's much easier to use and
maintain; hopefully easier for newcomers to get into without having to deal
with so many complex lifetimes; and a nice API that makes it a pleasure to
work with symbols.
Admittedly, the `SymbolIndexSize` adds some complexity, and we'll see if I
end up regretting that down the line, but it exists for an important reason:
the `Span` and other structures that'll be introduced need to pack a lot of
data into 64 bits so they can be freely copied around to keep lifetimes
simple without wreaking havoc in other ways, but a 32-bit symbol size needed
by the linker is too large for that. (Actually, the linker doesn't yet need
32 bits for our systems, but it's going to in the somewhat near future
unless we optimize away a bunch of symbols...but I'd really rather not have
the linker hit a limit that requires a lot of code changes to resolve).
Rustc uses interned spans when they exceed 8 bytes, but I'd prefer to avoid
that for now. Most systems can just use on of the `PkgSymbolId` or
`ProgSymbolId` type aliases and not have to worry about it. Systems that
are actually shared between the compiler and the linker do, though, but it's
not like we don't already have a bunch of trait bounds.
Of course, as we implement link-time optimizations (LTO) in the future, it's
possible most things will need the size and I'll grow frustrated with that
and possibly revisit this. We shall see.
Anyway, this was exhausting...and...onward to the first frontend!
2021-08-02 23:54:37 -04:00
|
|
|
|
sut.intern("bar"),
|
2021-08-02 23:54:37 -04:00
|
|
|
|
"Index should increment for new symbols"
|
|
|
|
|
);
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
#[test]
|
|
|
|
|
fn length_increases_with_each_new_intern() {
|
|
|
|
|
let sut = Sut::new();
|
|
|
|
|
|
|
|
|
|
assert_eq!(0, sut.len(), "invalid empty len");
|
|
|
|
|
|
|
|
|
|
sut.intern("foo");
|
|
|
|
|
assert_eq!(1, sut.len(), "increment len");
|
|
|
|
|
|
|
|
|
|
// duplicate
|
|
|
|
|
sut.intern("foo");
|
|
|
|
|
assert_eq!(1, sut.len(), "do not increment len on duplicates");
|
|
|
|
|
|
|
|
|
|
sut.intern("bar");
|
|
|
|
|
assert_eq!(2, sut.len(), "increment len (2)");
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
#[test]
|
|
|
|
|
fn can_check_wither_string_is_interned() {
|
|
|
|
|
let sut = Sut::new();
|
|
|
|
|
|
|
|
|
|
assert!(!sut.contains("foo"), "recognize missing value");
|
|
|
|
|
sut.intern("foo");
|
|
|
|
|
assert!(sut.contains("foo"), "recognize interned value");
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
#[test]
|
|
|
|
|
fn intern_soft() {
|
|
|
|
|
let sut = Sut::new();
|
|
|
|
|
|
|
|
|
|
assert_eq!(None, sut.intern_soft("foo"));
|
|
|
|
|
|
|
|
|
|
let foo = sut.intern("foo");
|
|
|
|
|
assert_eq!(Some(foo), sut.intern_soft("foo"));
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
#[test]
|
|
|
|
|
fn new_with_capacity() {
|
|
|
|
|
let n = 512;
|
|
|
|
|
let sut = Sut::with_capacity(n);
|
|
|
|
|
|
|
|
|
|
// note that this is not publicly available
|
|
|
|
|
assert!(sut.map.borrow().capacity() >= n);
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
#[test]
|
|
|
|
|
fn intern_utf8_unchecked() {
|
|
|
|
|
let sut = Sut::new();
|
|
|
|
|
|
|
|
|
|
let a = sut.intern("foo");
|
|
|
|
|
let b = unsafe { sut.intern_utf8_unchecked(b"foo") };
|
|
|
|
|
|
|
|
|
|
assert_eq!(a, b);
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
#[test]
|
|
|
|
|
fn lookup_symbol_by_index() {
|
|
|
|
|
let sut = Sut::new();
|
|
|
|
|
|
|
|
|
|
// Symbol does not yet exist.
|
2021-08-13 00:12:09 -04:00
|
|
|
|
assert!(sut.index_lookup(SymbolId::test_from_int(1)).is_none());
|
2021-08-02 23:54:37 -04:00
|
|
|
|
|
|
|
|
|
let sym = sut.intern("foo");
|
tamer: Global interners
This is a major change, and I apologize for it all being in one commit. I
had wanted to break it up, but doing so would have required a significant
amount of temporary work that was not worth doing while I'm the only one
working on this project at the moment.
This accomplishes a number of important things, now that I'm preparing to
write the first compiler frontend for TAMER:
1. `Symbol` has been removed; `SymbolId` is used in its place.
2. Consequently, symbols use 16 or 32 bits, rather than a 64-bit pointer.
3. Using symbols no longer requires dereferencing.
4. **Lifetimes no longer pollute the entire system! (`'i`)**
5. Two global interners are offered to produce `SymbolStr` with `'static`
lifetimes, simplfiying lifetime management and borrowing where strings
are still needed.
6. A nice API is provided for interning and lookups (e.g. "foo".intern())
which makes this look like a core feature of Rust.
Unfortunately, making this change required modifications to...virtually
everything. And that serves to emphasize why this change was needed:
_everything_ used symbols, and so there's no use in not providing globals.
I implemented this in a way that still provides for loose coupling through
Rust's trait system. Indeed, Rustc offers a global interner, and I decided
not to go that route initially because it wasn't clear to me that such a
thing was desirable. It didn't become apparent to me, in fact, until the
recent commit where I introduced `SymbolIndexSize` and saw how many things
had to be touched; the linker evolved so rapidly as I was trying to learn
Rust that I lost track of how bad it got.
Further, this shows how the design of the internment system was a bit
naive---I assumed certain requirements that never panned out. In
particular, everything using symbols stored `&'i Symbol<'i>`---that is, a
reference (usize) to an object containing an index (32-bit) and a string
slice (128-bit). So it was a reference to a pretty large value, which was
allocated in the arena alongside the interned string itself.
But, that was assuming that something would need both the symbol index _and_
a readily available string. That's not the case. In fact, it's pretty
clear that interning happens at the beginning of execution, that `SymbolId`
is all that's needed during processing (unless an error occurs; more on that
below); and it's not until _the very end_ that we need to retrieve interned
strings from the pool to write either to a file or to display to the
user. It was horribly wasteful!
So `SymbolId` solves the lifetime issue in itself for most systems, but it
still requires that an interner be available for anything that needs to
create or resolve symbols, which, as it turns out, is still a lot of
things. Therefore, I decided to implement them as thread-local static
variables, which is very similar to what Rustc does itself (Rustc's are
scoped). TAMER does not use threads, so the resulting `'static` lifetime
should be just fine for now. Eventually I'd like to implement `!Send` and
`!Sync`, though, to prevent references from escaping the thread (as noted in
the patch); I can't do that yet, since the feature has not yet been
stabalized.
In the end, this leaves us with a system that's much easier to use and
maintain; hopefully easier for newcomers to get into without having to deal
with so many complex lifetimes; and a nice API that makes it a pleasure to
work with symbols.
Admittedly, the `SymbolIndexSize` adds some complexity, and we'll see if I
end up regretting that down the line, but it exists for an important reason:
the `Span` and other structures that'll be introduced need to pack a lot of
data into 64 bits so they can be freely copied around to keep lifetimes
simple without wreaking havoc in other ways, but a 32-bit symbol size needed
by the linker is too large for that. (Actually, the linker doesn't yet need
32 bits for our systems, but it's going to in the somewhat near future
unless we optimize away a bunch of symbols...but I'd really rather not have
the linker hit a limit that requires a lot of code changes to resolve).
Rustc uses interned spans when they exceed 8 bytes, but I'd prefer to avoid
that for now. Most systems can just use on of the `PkgSymbolId` or
`ProgSymbolId` type aliases and not have to worry about it. Systems that
are actually shared between the compiler and the linker do, though, but it's
not like we don't already have a bunch of trait bounds.
Of course, as we implement link-time optimizations (LTO) in the future, it's
possible most things will need the size and I'll grow frustrated with that
and possibly revisit this. We shall see.
Anyway, this was exhausting...and...onward to the first frontend!
2021-08-02 23:54:37 -04:00
|
|
|
|
assert_eq!("foo", sut.index_lookup(sym).unwrap());
|
2021-08-02 23:54:37 -04:00
|
|
|
|
}
|
|
|
|
|
}
|