tame/tamer/src/sym/mod.rs

// String internment system
//
// Copyright (C) 2014-2023 Ryan Specialty, LLC.
//
// This file is part of TAME.
//
// This program is free software: you can redistribute it and/or modify
// it under the terms of the GNU General Public License as published by
// the Free Software Foundation, either version 3 of the License, or
// (at your option) any later version.
//
// This program is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
// GNU General Public License for more details.
//
// You should have received a copy of the GNU General Public License
// along with this program. If not, see <http://www.gnu.org/licenses/>.
//! String internment system.
//!
//! Interned strings are represented by an integer [`SymbolId`],
//! created by an [`Interner`].
//!
//! - [`ArenaInterner`] - Intern pool backed by an [arena][] for fast
//! and stable allocation.
//! - [`FxArenaInterner`] - Intern pool backed by an [arena][] using the
//! [Fx Hash][fxhash] hashing algorithm.
//! - [`DefaultInterner`] - The currently recommended intern pool
//! configuration for symbol interning (size-agnostic).
//! - [`DefaultProgInterner`] - The currently recommended
//! general-purpose intern pool configuration for compilers and
//! linkers processing symbols from one or more packages.
//!
//! Interners represent symbols as integer values, which allows for `O(1)`
//! comparison of any arbitrary interned value,
//! regardless of length.
//!
//! The most common way to intern strings is using the global static
//! interners,
//! which offer several conveniences that are discussed below.
//! However,
//! interners may also be used standalone without requiring global state.
//!
//! [arena]: bumpalo
//!
//! ```
//! use tamer::sym::{GlobalSymbolIntern, GlobalSymbolResolve, SymbolId};
//!
//! // Interns are represented by `SymbolId`.
//! let foo: SymbolId = "foo".intern();
//! assert_eq!(foo, foo);
//!
//! // Interning the same string twice returns the same intern
//! assert_eq!(foo, "foo".intern());
//!
//! // All interns can be freely copied.
//! let foo2 = foo;
//! assert_eq!(foo, foo2);
//!
//! // Different strings intern to different values
//! assert_ne!(foo, "bar".intern());
//!
//! // Interned slices can be looked up by their symbol id.
//! assert_eq!("foo", foo.lookup_str());
//! ```
//!
//! What Is String Interning?
//! =========================
//! _[String interning][]_ is a process by which a single copy of a string
//! is stored immutably in memory as part of a _pool_.
//! Once a string has been interned,
//! attempting to intern it again will always return the same [`SymbolId`].
//! Interned strings are represented by integer values known as "symbols" or
//! "atoms".
//!
//! String comparison using symbols amounts to comparing integer
//! values (`O(1)`) rather than having to scan the string (`O(n)`).
//! However,
//! both internment and symbol lookup
//! (mapping a symbol to its string)
//! incur a minor hashing cost.
//!
//! It is expected that strings are interned as soon as they are encountered,
//! which is likely to be from source inputs or previously compiled object
//! files.
//! Processing stages will then hold the interned [`SymbolId`] and use those
//! for any needed comparisons,
//! without any need to look up the string from the pool.
//! Symbols should only be looked up
//! (using [`GlobalSymbolResolve::lookup_str`] or
//! [`Interner::index_lookup`])
//! when the string representation is necessary,
//! such as to write to a file or display to the user.
//!
//! [`SymbolId`] is monotonically increasing from 1,
//! making it a useful densely-packed index as an alternative to [`HashMap`]
//! when most of the symbols will be represented as part of the map.
//! This also means that strings can be interned in bulk and have a
//! predictable relationship to one-another---for
//! example,
//! if strings are interned in lexicographic order,
//! their [`SymbolId`]s will reflect that same ordering,
//! so long as those strings have not previously been interned.
//! Bulk insertion should therefore be done before processing user input.
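//!
//! The dense-index use described above could look roughly like the
//! following sketch,
//! which keeps per-symbol data in a `Vec` rather than a [`HashMap`].
//! `DenseMap` is a hypothetical type that is not part of this module,
//! and a bare `NonZeroU32` stands in for [`SymbolId`].
//!
//! ```rust
//! use std::num::NonZeroU32;
//!
//! /// Hypothetical dense map keyed by a 1-based symbol index.
//! struct DenseMap<T> {
//!     slots: Vec<Option<T>>,
//! }
//!
//! impl<T> DenseMap<T> {
//!     fn new() -> Self {
//!         Self { slots: Vec::new() }
//!     }
//!
//!     /// Store `value` in the slot for `id`, growing the vector as needed.
//!     fn insert(&mut self, id: NonZeroU32, value: T) {
//!         let i = id.get() as usize - 1;
//!         if i >= self.slots.len() {
//!             self.slots.resize_with(i + 1, || None);
//!         }
//!         self.slots[i] = Some(value);
//!     }
//!
//!     /// Fetch the value stored for `id`, if any.
//!     fn get(&self, id: NonZeroU32) -> Option<&T> {
//!         self.slots.get(id.get() as usize - 1)?.as_ref()
//!     }
//! }
//! ```
//!
//! Such a layout only pays off when most indices are occupied;
//! a sparse set of symbols would leave most slots empty.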
//!
//! With the exception of pre-interned static symbols
//! (see Static Symbols below),
//! [`SymbolId`]s are _not_ stable between runs.
//!
//! [string interning]: https://en.wikipedia.org/wiki/String_interning
//!
//!
//! Internment Mechanism
//! ====================
//! The current [`DefaultInterner`] is [`FxArenaInterner`],
//! which is an [arena][]-allocated intern pool mapped by the
//! [Fx Hash][fxhash] hash function:
//!
//! 1. Strings are compared against the existing intern pool using a
//!    [`HashMap`].
//! 2. If a string has not yet been interned:
//!    - A new integer [`SymbolId`] index is allocated;
//!    - The string is copied into the arena-backed pool at that new index;
//!      and
//!    - The string is hashed and will resolve to the new [`SymbolId`] for
//!      future lookups and internment attempts.
//! 3. Otherwise, the existing [`SymbolId`] associated with the provided
//!    string is returned.
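//!
//! The steps above can be sketched using only the standard library.
//! This is a simplified illustration and not this module's implementation:
//! `SketchInterner` is a hypothetical name,
//! it uses the default `HashMap` hasher rather than Fx Hash,
//! and it stores owned `String`s in a `Vec` in place of an arena.
//!
//! ```rust
//! use std::collections::HashMap;
//! use std::num::NonZeroU32;
//!
//! /// Hypothetical interner: a map for interning, a vector for lookups.
//! #[derive(Default)]
//! struct SketchInterner {
//!     map: HashMap<String, NonZeroU32>,
//!     strings: Vec<String>,
//! }
//!
//! impl SketchInterner {
//!     /// Return the existing id for `s`, or allocate the next one.
//!     fn intern(&mut self, s: &str) -> NonZeroU32 {
//!         // Steps 1 and 3: an already-interned string yields its id.
//!         if let Some(&id) = self.map.get(s) {
//!             return id;
//!         }
//!
//!         // Step 2: allocate the next index (starting at 1), copy the
//!         // string into the pool, and record the mapping.
//!         let id = NonZeroU32::new(self.strings.len() as u32 + 1)
//!             .expect("symbol index overflow");
//!         self.strings.push(s.to_string());
//!         self.map.insert(s.to_string(), id);
//!         id
//!     }
//!
//!     /// Map an id back to its string, if this pool allocated it.
//!     fn lookup(&self, id: NonZeroU32) -> Option<&str> {
//!         self.strings.get(id.get() as usize - 1).map(String::as_str)
//!     }
//! }
//! ```
//!
//! Interning `"foo"` and then `"bar"` into an empty `SketchInterner`
//! yields the indices 1 and 2,
//! and interning `"foo"` again returns 1.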
//!
//! The string associated with a [`SymbolId`] can be looked up from the pool
//! using [`GlobalSymbolResolve::lookup_str`] for global interners,
//! or [`Interner::index_lookup`] otherwise.
//! Symbols allocated using a global interner will have a `'static`
//! lifetime.
//!
//! Since [`SymbolId`] is an integer value,
//! it implements [`Copy`] and will still compare equal to other symbols
//! referencing the same interned value.
//!
//! This implementation was heavily motivated by [Rustc's own internment
//! system][rustc-intern].
//!
//! [`HashMap`]: std::collections::HashMap
//! [`NonZeroU32`]: std::num::NonZeroU32
//!
//!
//! Symbol Index Sizes
//! ------------------
//! [`SymbolId`] is generic over [`SymbolIndexSize`],
//! defaulting to [`global::ProgSymSize`](crate::global::ProgSymSize).
//! The generic size allows for specialized interners in situations where
//! a larger index size is undesirable,
//! such as [`Span`](crate::span::Span),
//! which tries to pack a lot of information into 64-bit structures.
//!
//! Note that _it is not permissible to cast between different index sizes_
//! because they use _different_ interners with their own distinct index
//! sets.
//!
//! Global Interners
//! ----------------
//! TAMER offers two thread-local global interners that intern strings with
//! a `'static` lifetime,
//! simplifying the handling of lifetimes;
//! they produce 16-bit and 32-bit symbols.
//! These interners are lazily initialized on first use.
//! Symbols from the two interners are independently allocated and cannot be
//! mixed.
//!
//! Global interners were introduced because symbols are used by virtually
//! every part of the system,
//! which polluted everything with interner lifetimes.
//! This suggested that the interner should be treated instead as if it were
//! a part of Rust itself,
//! and treated no differently than other core memory allocation.
//!
//! All [`str`] objects returned from global interners hold a
//! `'static` lifetime to simplify lifetime management and borrowing.
//! However,
//! these should not be used in place of [`SymbolId`] if the string value
//! is not actually needed.
//!
//! Global interners are exposed via friendly APIs using two traits:
//!
//! - [`GlobalSymbolIntern`] provides an `intern` method that can be used
//! on any [`&str`] (e.g. `"foo".intern()`); and
//! - [`GlobalSymbolResolve`] provides a `lookup_str` method on
//! [`SymbolId`] which resolves the symbol using the appropriate
//! global interner,
//! producing a [`str`] holding a reference to the `'static`
//! string slice within the pool.
//!
//! These traits are intentionally separate so that it is clear how a
//! particular package or object makes use of symbols.
//! If this distinction proves too cumbersome,
//! then they may be combined in the future.
//!
//! TAMER does not currently utilize threads,
//! and global interners are never dropped,
//! and so [`str`] will always refer to a valid string.
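//!
//! As a rough sketch of how a lazily-initialized thread-local pool can
//! hand out `'static` string slices,
//! consider the following.
//! Everything here is an assumption made for illustration:
//! `POOL`, `SketchIntern`, and `SketchResolve` are hypothetical names,
//! a bare `u32` stands in for [`SymbolId`],
//! and leaking a boxed string stands in for an arena that is never
//! dropped.
//!
//! ```rust
//! use std::cell::RefCell;
//! use std::collections::HashMap;
//!
//! type Pool = (HashMap<&'static str, u32>, Vec<&'static str>);
//!
//! thread_local! {
//!     // Initialized on first use within each thread.
//!     static POOL: RefCell<Pool> =
//!         RefCell::new((HashMap::new(), Vec::new()));
//! }
//!
//! /// Hypothetical analogue of an `intern` extension trait.
//! trait SketchIntern {
//!     fn intern(&self) -> u32;
//! }
//!
//! impl SketchIntern for str {
//!     fn intern(&self) -> u32 {
//!         POOL.with(|pool| {
//!             let mut guard = pool.borrow_mut();
//!             let (map, strings) = &mut *guard;
//!             if let Some(&id) = map.get(self) {
//!                 return id;
//!             }
//!             // Leaked so that the slice is valid for `'static`.
//!             let stored: &'static str =
//!                 Box::leak(self.to_string().into_boxed_str());
//!             strings.push(stored);
//!             let id = strings.len() as u32; // indices start at 1
//!             map.insert(stored, id);
//!             id
//!         })
//!     }
//! }
//!
//! /// Hypothetical analogue of a `lookup_str` extension trait.
//! trait SketchResolve {
//!     fn lookup_str(&self) -> &'static str;
//! }
//!
//! impl SketchResolve for u32 {
//!     /// Panics if the index was not allocated on this thread.
//!     fn lookup_str(&self) -> &'static str {
//!         POOL.with(|pool| pool.borrow().1[*self as usize - 1])
//!     }
//! }
//! ```
//!
//! With both traits in scope,
//! `"foo".intern()` returns the same integer on every call within a
//! thread,
//! and calling `lookup_str` on that integer yields `"foo"`.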
//!
//! There is no mechanism preventing [`SymbolId`] from one interner from
//! being used with another beyond [`SymbolIndexSize`] bounds;
//! if you utilize interners for any other purpose,
//! it is advised that you create newtypes for their [`SymbolId`]s.
//!
//! Static Symbols
//! --------------
//! Since nearly every string in the system is represented by a symbol,
//! comparing against static string slices would require awkward interning
//! of a static string at each relevant point in the program.
//! Instead,
//! common static strings are pre-interned when the global interner is
//! first initialized.
//!
//! These symbols are allocated statically,
//! so they can be used in `const` expressions and include additional
//! metadata allowing for safe type conversions in circumstances that
//! aren't typically permitted.
//! Since static symbols are constants,
//! symbol newtypes and objects composed of symbols are able to be
//! statically constructed as well.
//!
//! These generated symbol constants can be found in the [`st`] and [`st16`]
//! modules.
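//!
//! The general idea of pre-interning can be illustrated as follows,
//! with constants whose values match the order in which strings are
//! interned at initialization.
//! This is only a sketch under stated assumptions:
//! `Sym`, `PREFILL`, `SYM_TRUE`, `SYM_FALSE`, and `PrefilledInterner` are
//! hypothetical and do not reflect the generated contents of [`st`],
//! and the additional type metadata mentioned above is not modeled.
//!
//! ```rust
//! use std::collections::HashMap;
//!
//! /// Hypothetical symbol newtype usable in `const` context.
//! #[derive(Debug, Clone, Copy, PartialEq, Eq)]
//! struct Sym(u32);
//!
//! /// Strings to pre-intern, in index order (index 1 first).
//! const PREFILL: [&str; 2] = ["true", "false"];
//!
//! /// Constants whose values match the positions in `PREFILL`.
//! const SYM_TRUE: Sym = Sym(1);
//! const SYM_FALSE: Sym = Sym(2);
//!
//! struct PrefilledInterner {
//!     map: HashMap<String, Sym>,
//!     strings: Vec<String>,
//! }
//!
//! impl PrefilledInterner {
//!     /// Create a pool with every `PREFILL` string already interned.
//!     fn new() -> Self {
//!         let mut this = Self {
//!             map: HashMap::new(),
//!             strings: Vec::new(),
//!         };
//!         for &s in PREFILL.iter() {
//!             this.intern(s);
//!         }
//!         this
//!     }
//!
//!     fn intern(&mut self, s: &str) -> Sym {
//!         if let Some(&sym) = self.map.get(s) {
//!             return sym;
//!         }
//!         self.strings.push(s.to_string());
//!         let sym = Sym(self.strings.len() as u32);
//!         self.map.insert(s.to_string(), sym);
//!         sym
//!     }
//! }
//! ```
//!
//! Because the constants and the prefill list agree on ordering,
//! interning `"true"` at runtime compares equal to `SYM_TRUE` without
//! the comparison site having to intern anything itself.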
//!
//! Uninterned Symbols
//! ------------------
//! Interners are able to allocate a [`SymbolId`] without interning,
//! which will produce a symbol that cannot compare equal to any other
//! symbol and avoids the hashing cost required to perform interning.
//! This is useful for a couple of reasons:
//!
//! 1. To create a symbol that is guaranteed to be unique,
//! even if the same string value was previously interned; and
//! 2. To store a string without a hashing cost,
//! making [`SymbolId`] a suitable substitute for [`String`] when the
//! string will never benefit from internment.
//!
//! The second option allows all data structures to consistently carry
//! [`SymbolId`] and let the owner of those data decide whether it is
//! appropriate to incur a hashing cost;
//! using [`String`] forces that decision upon users of the data
//! structure,
//! and also makes for an awkward and confusing API.
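//!
//! The difference between interned and uninterned allocation can be
//! sketched as below.
//! `SketchPool` and its `alloc_uninterned` method are hypothetical names
//! chosen for this illustration and are not this module's API;
//! a bare `u32` stands in for [`SymbolId`].
//!
//! ```rust
//! use std::collections::HashMap;
//!
//! #[derive(Default)]
//! struct SketchPool {
//!     map: HashMap<String, u32>,
//!     strings: Vec<String>,
//! }
//!
//! impl SketchPool {
//!     /// Intern `s`: hash it and reuse any existing index.
//!     fn intern(&mut self, s: &str) -> u32 {
//!         if let Some(&id) = self.map.get(s) {
//!             return id;
//!         }
//!         let id = self.push(s);
//!         self.map.insert(s.to_string(), id);
//!         id
//!     }
//!
//!     /// Store `s` without hashing; the returned index is always new.
//!     fn alloc_uninterned(&mut self, s: &str) -> u32 {
//!         self.push(s)
//!     }
//!
//!     /// Append `s` to the pool and return its 1-based index.
//!     fn push(&mut self, s: &str) -> u32 {
//!         self.strings.push(s.to_string());
//!         self.strings.len() as u32
//!     }
//!
//!     fn lookup(&self, id: u32) -> Option<&str> {
//!         let i = (id as usize).checked_sub(1)?;
//!         self.strings.get(i).map(String::as_str)
//!     }
//! }
//! ```
//!
//! In this sketch,
//! two uninterned allocations of `"foo"` receive distinct indices that
//! also differ from the interned `"foo"`,
//! yet all three resolve to the same string.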
//!
//! Related Work and Further Reading
//! ================================
//! String interning is used in a variety of systems and languages.
//! Symbols can typically be either interned,
//! and therefore compared for equivalency,
//! or _uninterned_,
//! which makes them unique even to symbols of the same name.
//! Interning may also be done automatically by a language as a performance
//! optimization,
//! or by a compiler for storage in an object file such as ELF.
//! Languages listed below that allow for explicit interning may
//! perform automatic interning as well
//! (for example, `'symbol` in Lisp and `lowercase_vars` as atoms in
//! Erlang).
//!
//! | Language | Interned | Uninterned |
//! | -------- | -------- | ---------- |
//! | [Erlang][] | [`list_to_atom`][edt] | _(None)_ |
//! | [GNU Emacs Lisp][] | [`intern`][es], [`intern-soft`][es] | [`make-symbol`][es], [`gensym`][es] |
//! | [GNU Guile][] | [`string->symbol`][gs], [`gensym`][gs] | [`make-symbol`][gu] |
//! | [JavaScript][] | [`Symbol.for`][js] | [`Symbol`][js] |
//! | [Java][] | [`intern`][jvs] | _(None)_ |
//! | [Lua][] | _(Automatic for string performance)_ | _(None)_ |
//! | [MIT/GNU Scheme][] | [`intern`][ms], [`intern-soft`][ms], [`string->symbol`][ms] | [`string->uninterned-symbol`][ms], [`generate-uninterned-symbol`][ms] |
//! | [PHP][] | _(Automatic for string [performance][pp])_ | _(None)_ |
//! | [Python][] | [`sys.intern`][pys] | _(None)_ |
//! | [R6RS Scheme][] | [`string->symbol`][r6s] | _(None)_ |
//! | [Racket][] | [`string->symbol`][rs], [`string->unreadable-symbol`][rs] | [`string->uninterned-symbol`][rs], [`gensym`][rs] |
//!
//! [gnu guile]: https://www.gnu.org/software/guile/
//! [gs]: https://www.gnu.org/software/guile/manual/html_node/Symbol-Primitives.html#Symbol-Primitives
//! [gu]: https://www.gnu.org/software/guile/manual/html_node/Symbol-Uninterned.html#Symbol-Uninterned
//! [gnu emacs lisp]: https://www.gnu.org/software/emacs/
//! [es]: https://www.gnu.org/software/emacs/manual/html_node/elisp/Creating-Symbols.html
//! [racket]: https://racket-lang.org/
//! [rs]: https://docs.racket-lang.org/reference/symbols.html
//! [r6rs scheme]: http://www.r6rs.org/
//! [r6s]: http://www.r6rs.org/final/html/r6rs/r6rs-Z-H-14.html
//! [mit/gnu scheme]: https://www.gnu.org/software/mit-scheme/
//! [ms]: https://www.gnu.org/software/mit-scheme/documentation/mit-scheme-ref/Symbols.html
//! [javascript]: https://developer.mozilla.org/en-US/docs/Web/JavaScript
//! [js]: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Symbol
//! [java]: http://openjdk.java.net/
//! [jvs]: https://cr.openjdk.java.net/~iris/se/12/latestSpec/api/java.base/java/lang/String.html#intern()
//! [php]: https://www.php.net/
//! [pp]: https://wiki.php.net/rfc/performanceimprovements
//! [erlang]: https://erlang.org/
//! [edt]: http://erlang.org/doc/reference_manual/data_types.html
//! [lua]: https://www.lua.org/
//! [python]: https://www.python.org/
//! [pys]: https://docs.python.org/3/library/sys.html
//!
//! More information:
//! - Wikipedia entry on [string interning][].
//! - The [flyweight pattern][] in object-oriented programming is a type
//! of interning.
//! - [RFC 1845][rfc-1845] gives an example of string interning using
//! `Rc<str>`.
//! - Emacs directly exposes the intern pool at runtime as
//! [`obarray`][es].
//! - [`string-cache`][rust-string-cache] is a string interning system
//! for Rust developed by Mozilla for Servo.
//! - [`string-interner`][rust-string-interner] is another string
//! interning library for Rust.
//! - [Rustc interns strings as `Symbol`s][rustc-intern] using a
//! global [arena allocator][rustc-arena] and unsafe Rust to cast to
//! a `'static` slice.
//! - Rustc's [`newtype_index!` macro][rustc-nt] uses
//! [`NonZeroU32`] so that [`Option`] uses no
//! additional space (see [pull request `53315`][rustc-nt-pr]).
//! - Rustc also [prefills interners][rustc-intern] with common symbols.
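//!
//! The space saving from [`NonZeroU32`] mentioned in the list above can
//! be checked directly.
//! `option_sizes` is a hypothetical helper written for this illustration.
//!
//! ```rust
//! use std::mem::size_of;
//! use std::num::NonZeroU32;
//!
//! /// Sizes in bytes of `Option<NonZeroU32>` and `Option<u32>`.
//! fn option_sizes() -> (usize, usize) {
//!     (size_of::<Option<NonZeroU32>>(), size_of::<Option<u32>>())
//! }
//! ```
//!
//! The first value is 4,
//! since `None` is represented by the otherwise-unused zero;
//! the second is larger because `Option<u32>` needs room for a separate
//! discriminant.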
//!
//! [flyweight pattern]: https://en.wikipedia.org/wiki/Flyweight_pattern
//! [rust-string-cache]: https://github.com/servo/string-cache
//! [rust-string-interner]: https://github.com/robbepop/string-interner
//! [rfc-1845]: https://rust-lang.github.io/rfcs/1845-shared-from-slice.html
//! [rustc-intern]: https://doc.rust-lang.org/nightly/nightly-rustc/src/rustc_span/symbol.rs.html
//! [rustc-arena]: https://doc.rust-lang.org/nightly/nightly-rustc/arena/index.html
//! [rustc-nt]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_index/macro.newtype_index.html
//! [rustc-nt-pr]: https://github.com/rust-lang/rust/pull/53315
//!
//! The hash function chosen for this module is [Fx Hash][fxhash].
//!
//! - Rustc previously used the [Fowler-Noll-Vo (FNV)][fnv] hash
//! function,
//! but [now uses Fx Hash][rustc-fx].
//! This was extracted into the [`fxhash`][fxhash] crate,
//! which is used by TAMER.
//! - TAMER originally used FNV,
//! but benchmarks showed that Fx Hash was more performant.
//! - Benchmarks for other hash functions,
//! including FNV,
//! can be found at the [`hash-rs`][hash-rs] project.
//!
//! [fnv]: https://doc.servo.org/fnv/
//! [rustc-fx]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_data_structures/fx/index.html
//! [hash-rs]: https://github.com/Gankra/hash-rs
mod interner;
mod prefill;
mod symbol;

pub use prefill::*;

pub use interner::{
    ArenaInterner, DefaultInterner, DefaultProgInterner, FxArenaInterner,
    Interner,
};
pub use symbol::{
    GlobalSymbolIntern, GlobalSymbolInternBytes, GlobalSymbolInternUnchecked,
    GlobalSymbolResolve, SymbolId, SymbolIndexSize,
};