// String internment system
//
// Copyright (C) 2014-2023 Ryan Specialty, LLC.
//
// This file is part of TAME.
//
// This program is free software: you can redistribute it and/or modify
// it under the terms of the GNU General Public License as published by
// the Free Software Foundation, either version 3 of the License, or
// (at your option) any later version.
//
// This program is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
// GNU General Public License for more details.
//
// You should have received a copy of the GNU General Public License
// along with this program. If not, see .

//! String internment system.
//!
//! Interned strings are represented by an integer [`SymbolId`],
//! created by an [`Interner`].
//!
//! - [`ArenaInterner`] - Intern pool backed by an [arena][] for fast
//!   and stable allocation.
//! - [`FxArenaInterner`] - Intern pool backed by an [arena][] using the
//!   [Fx Hash][fxhash] hashing algorithm.
//! - [`DefaultInterner`] - The currently recommended intern pool
//!   configuration for symbol interning (size-agnostic).
//! - [`DefaultProgInterner`] - The currently recommended
//!   general-purpose intern pool configuration for compilers and
//!   linkers processing symbols from one or more packages.
//!
//! Interners represent symbols as integer values which allows for `O(1)`
//! comparison of any arbitrary interned value,
//! regardless of length.
//!
//! The most common way to intern strings is using the global static
//! interners,
//! which offer several conveniences that are discussed below.
//! However,
//! interners may also be used standalone without requiring global state.
//!
//! [arena]: bumpalo
//!
//! ```
//! use tamer::sym::{GlobalSymbolIntern, GlobalSymbolResolve, SymbolId};
//!
//! // Interns are represented by `SymbolId`.
//! let foo: SymbolId = "foo".intern();
//! assert_eq!(foo, foo);
//!
//! // Interning the same string twice returns the same intern
//! assert_eq!(foo, "foo".intern());
//!
//! // All interns can be freely copied.
//! let foo2 = foo;
//! assert_eq!(foo, foo2);
//!
//! // Different strings intern to different values
//! assert_ne!(foo, "bar".intern());
//!
//! // Interned slices can be looked up by their symbol id.
//! assert_eq!("foo", foo.lookup_str());
//! ```
//!
//! What Is String Interning?
//! =========================
//! _[String interning][]_ is a process by which a single copy of a string
//! is stored immutably in memory as part of a _pool_.
//! Once a string has been interned,
//! attempting to intern it again will always return the same [`SymbolId`].
//! Interned strings are represented by integer values known as "symbols" or
//! "atoms".
//!
//! String comparison using symbols amounts to comparing integer
//! values (`O(1)`) rather than having to scan the string (`O(n)`).
//! However,
//! both internment and symbol lookup
//! (mapping a symbol to its string)
//! incur a minor hashing cost.
//!
//! It is expected that strings are interned as soon as they are encountered,
//! which is likely to be from source inputs or previously compiled object
//! files.
//! Processing stages will then hold the interned [`SymbolId`] and use those
//! for any needed comparisons,
//! without any need to look up the string from the pool.
//! Symbols should only be looked up
//! (using [`GlobalSymbolResolve::lookup_str`] or
//! [`Interner::index_lookup`])
//! when the string representation is necessary,
//! such as to write to a file or display to the user.
//!
//! [`SymbolId`] is monotonically increasing from 1,
//! making it a useful densely-packed index as an alternative to [`HashMap`]
//! when most of the symbols will be represented as part of the map.
//! This also means that strings can be interned in bulk and have a
//! predictable relationship to one-another---for
//! example,
//! if strings are interned in lexicographic order,
//! their [`SymbolId`]s will reflect that same ordering,
//! so long as those strings have not previously been interned.
//! Bulk insertion should therefore be done before processing user input.
//!
//! With the exception of pre-interned static symbols
//! (see Static Symbols below),
//! [`SymbolId`]s are _not_ stable between runs.
//!
//! [string interning]: https://en.wikipedia.org/wiki/String_interning
//!
//!
//! Internment Mechanism
//! ====================
//! The current [`DefaultInterner`] is [`FxArenaInterner`],
//! which is an [arena][]-allocated intern pool mapped by the
//! [Fx Hash][fxhash] hash function:
//!
//! 1. Strings are compared against the existing intern pool using a
//!    [`HashMap`].
//! 2. If a string has not yet been interned:
//!    - A new integer [`SymbolId`] index is allocated;
//!    - The string is copied into the arena-backed pool at that new index;
//!      and
//!    - The string is hashed and will resolve to the new [`SymbolId`] for
//!      future lookups and internment attempts.
//! 3. Otherwise, the existing [`SymbolId`] associated with the provided
//!    string is returned.
//!
//! The string associated with a [`SymbolId`] can be looked up from the pool
//! using [`GlobalSymbolResolve::lookup_str`] for global interners,
//! or [`Interner::index_lookup`] otherwise.
//! Symbols allocated using a global interner will have a `'static`
//! lifetime.
//!
//! Since [`SymbolId`] is an integer value,
//! it implements [`Copy`] and will still compare equal to other symbols
//! referencing the same interned value.
//!
//! This implementation was heavily motivated by [Rustc's own internment
//! system][rustc-intern].
//!
//! [`HashMap`]: std::collections::HashMap
//! [`NonZeroU32`]: std::num::NonZeroU32
//!
//!
//! Symbol Index Sizes
//! ------------------
//! [`SymbolId`] is generic over [`SymbolIndexSize`],
//! defaulting to [`global::ProgSymSize`](crate::global::ProgSymSize).
//! The generic size allows for specialized interners in situations where a
//! larger index size is undesirable,
//! such as [`Span`](crate::span::Span),
//! which tries to pack a lot of information into 64-bit structures.
//!
//! Note that _it is not permissible to cast between different index sizes_
//! because they use _different_ interners with their own distinct index
//! sets.
//!
//! Global Interners
//! ----------------
//! TAMER offers two thread-local global interners that intern strings with
//! a `'static` lifetime,
//! simplifying the handling of lifetimes;
//! they produce 16-bit and 32-bit symbols.
//! These interners are lazily initialized on first use.
//! Symbols from the two interners are independently allocated and cannot be
//! mixed.
//!
//! Global interners were introduced because symbols are used by virtually
//! every part of the system,
//! which polluted everything with interner lifetimes.
//! This suggested that the interner should be treated instead as if it were
//! a part of Rust itself,
//! and treated no differently than other core memory allocation.
//!
//! All [`str`] objects returned from global interners hold a
//! `'static` lifetime to simplify lifetime management and borrowing.
//! However,
//! these should not be used in place of [`SymbolId`] if the string value
//! is not actually needed.
//!
//! Global interners are exposed via friendly APIs using two traits:
//!
//! - [`GlobalSymbolIntern`] provides an `intern` method that can be used
//!   on any [`&str`] (e.g. `"foo".intern()`); and
//! - [`GlobalSymbolResolve`] provides a `lookup_str` method on
//!   [`SymbolId`] which resolves the symbol using the appropriate
//!   global interner,
//!   producing a [`str`] holding a reference to the `'static`
//!   string slice within the pool.
//!
//! These traits are intentionally separate so that it is clear how a
//! particular package or object makes use of symbols.
//! If this distinction proves too cumbersome,
//! then they may be combined in the future.
//!
//! TAMER does not currently utilize threads,
//! and global interners are never dropped,
//! and so [`str`] will always refer to a valid string.
//!
//! There is no mechanism preventing [`SymbolId`] from one interner from
//! being used with another beyond [`SymbolIndexSize`] bounds;
//! if you utilize interners for any other purpose,
//! it is advised that you create newtypes for their [`SymbolId`]s.
//!
//! Static Symbols
//! --------------
//! Since nearly every string in the system is represented by a symbol,
//! comparing against static string slices would require awkward interning
//! of a static string at each relevant point in the program.
//! Instead,
//! common static strings are pre-interned when the global interner is
//! first initialized.
//!
//! These symbols are allocated statically,
//! so they can be used in `const` expressions and include additional
//! metadata allowing for safe type conversions in circumstances that
//! aren't typically permitted.
//! Since static symbols are constants,
//! symbol newtypes and objects composed of symbols are able to be
//! statically constructed as well.
//!
//! These generated symbol constants can be found in the [`st`] and [`st16`]
//! modules.
//!
//! Uninterned Symbols
//! ------------------
//! Interners are able to allocate a [`SymbolId`] without interning,
//! which will produce a symbol that cannot compare equal to any other
//! symbol and avoids the hashing cost required to perform interning.
//! This is useful for a couple of reasons:
//!
//! 1. To create a symbol that is guaranteed to be unique,
//!    even if the same string value was previously interned; and
//! 2. To store a string without a hashing cost,
//!    making [`SymbolId`] a suitable substitute for [`String`] when the
//!    string will never benefit from internment.
//!
//! The second option allows all data structures to consistently carry
//! [`SymbolId`] and let the owner of those data decide whether it is
//! appropriate to incur a hashing cost;
//! using [`String`] forces that decision upon users of the data
//! structure,
//! and also makes for an awkward and confusing API.
//!
//! Related Work and Further Reading
//! ================================
//! String interning is used in a variety of systems and languages.
//! Symbols can typically be either interned,
//! and therefore compared for equivalency,
//! or _uninterned_,
//! which makes them unique even to symbols of the same name.
//! Interning may also be done automatically by a language as a performance
//! optimization,
//! or by a compiler for storage in an object file such as ELF.
//! Languages listed below that allow for explicit interning may also
//! perform automatic interning as well
//! (for example, `'symbol` in Lisp and `lowercase_vars` as atoms in
//! Erlang).
//!
//! | Language | Interned | Uninterned |
//! | -------- | -------- | ---------- |
//! | [Erlang][] | [`list_to_atom`][edt] | _(None)_ |
//! | [GNU Emacs Lisp][] | [`intern`][es], [`intern-soft`][es] | [`make-symbol`][es], [`gensym`][es] |
//! | [GNU Guile][] | [`string->symbol`][gs], [`gensym`][gs] | [`make-symbol`][gu] |
//! | [JavaScript][] | [`Symbol.for`][js] | [`Symbol`][js] |
//! | [Java][] | [`intern`][jvs] | _(None)_ |
//! | [Lua][] | _(Automatic for string performance)_ | _(None)_ |
//! | [MIT/GNU Scheme][] | [`intern`][ms], [`intern-soft`][ms], [`string->symbol`][ms] | [`string->uninterned-symbol`][ms], [`generate-uninterned-symbol`][ms] |
//! | [PHP][] | _(Automatic for string [performance][pp])_ | _(None)_ |
//! | [Python][] | [`sys.intern`][pys] | _(None)_ |
//! | [R6RS Scheme][] | [`string->symbol`][r6s] | _(None)_ |
//! | [Racket][] | [`string->symbol`][rs], [`string->unreadable-symbol`][rs] | [`string->uninterned-symbol`][rs], [`gensym`][rs] |
//!
//! [gnu guile]: https://www.gnu.org/software/guile/
//! [gs]: https://www.gnu.org/software/guile/manual/html_node/Symbol-Primitives.html#Symbol-Primitives
//! [gu]: https://www.gnu.org/software/guile/manual/html_node/Symbol-Uninterned.html#Symbol-Uninterned
//! [gnu emacs lisp]: https://www.gnu.org/software/emacs/
//! [es]: https://www.gnu.org/software/emacs/manual/html_node/elisp/Creating-Symbols.html
//! [racket]: https://racket-lang.org/
//! [rs]: https://docs.racket-lang.org/reference/symbols.html
//! [r6rs scheme]: http://www.r6rs.org/
//! [r6s]: http://www.r6rs.org/final/html/r6rs/r6rs-Z-H-14.html
//! [mit/gnu scheme]: https://www.gnu.org/software/mit-scheme/
//! [ms]: https://www.gnu.org/software/mit-scheme/documentation/mit-scheme-ref/Symbols.html
//! [javascript]: https://developer.mozilla.org/en-US/docs/Web/JavaScript
//! [js]: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Symbol
//! [java]: http://openjdk.java.net/
//! [jvs]: https://cr.openjdk.java.net/~iris/se/12/latestSpec/api/java.base/java/lang/String.html#intern()
//! [php]: https://www.php.net/
//! [pp]: https://wiki.php.net/rfc/performanceimprovements
//! [erlang]: https://erlang.org/
//! [edt]: http://erlang.org/doc/reference_manual/data_types.html
//! [lua]: https://www.lua.org/
//! [python]: https://www.python.org/
//! [pys]: https://docs.python.org/3/library/sys.html
//!
//! More information:
//! - Wikipedia entry on [string interning][].
//! - The [flyweight pattern][] in object-oriented programming is a type
//!   of interning.
//! - [RFC 1845][rfc-1845] gives an example of string interning using
//!   `Rc`.
//! - Emacs directly exposes the intern pool at runtime as
//!   [`obarray`][es].
//! - [`string-cache`][rust-string-cache] is a string interning system
//!   for Rust developed by Mozilla for Servo.
//! - [`string-interner`][rust-string-interner] is another string
//!   interning library for Rust.
//! - [Rustc interns strings as `Symbol`s][rustc-intern] using a
//!   global [arena allocator][rustc-arena] and unsafe Rust to cast to
//!   a `static` slice.
//! - Rustc's [`newtype_index!` macro][rustc-nt] uses
//!   [`NonZeroU32`] so that [`Option`] uses no
//!   additional space (see [pull request `53315`][rustc-nt-pr]).
//! - Rustc also [prefills interners][rustc-intern] with common symbols.
//!
//! [flyweight pattern]: https://en.wikipedia.org/wiki/Flyweight_pattern
//! [rust-string-cache]: https://github.com/servo/string-cache
//! [rust-string-interner]: https://github.com/robbepop/string-interner
//! [rfc-1845]: https://rust-lang.github.io/rfcs/1845-shared-from-slice.html
//! [rustc-intern]: https://doc.rust-lang.org/nightly/nightly-rustc/src/rustc_span/symbol.rs.html
//! [rustc-arena]: https://doc.rust-lang.org/nightly/nightly-rustc/arena/index.html
//! [rustc-nt]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_index/macro.newtype_index.html
//! [rustc-nt-pr]: https://github.com/rust-lang/rust/pull/53315
//!
//! The hash function chosen for this module is [Fx Hash][fxhash].
//!
//! - Rustc previously used the [Fowler-Noll-Vo (FNV)][fnv] hash
//!   function,
//!   but [now uses Fx Hash][rustc-fx].
//!   This was extracted into the [`fxhash`][fxhash] crate,
//!   which is used by TAMER.
//! - TAMER originally used FNV,
//!   but benchmarks showed that Fx Hash was more performant.
//! - Benchmarks for other hash functions,
//!   including FNV,
//!   can be found at the [`hash-rs`][hash-rs] project.
//!
//! [fnv]: https://doc.servo.org/fnv/
//! [rustc-fx]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_data_structures/fx/index.html
//! [hash-rs]: https://github.com/Gankra/hash-rs

mod interner;
mod prefill;
mod symbol;

pub use prefill::*;

pub use interner::{
    ArenaInterner, DefaultInterner, DefaultProgInterner, FxArenaInterner,
    Interner,
};
pub use symbol::{
    GlobalSymbolIntern, GlobalSymbolInternBytes, GlobalSymbolInternUnchecked,
    GlobalSymbolResolve, SymbolId, SymbolIndexSize,
};