night/regex/env-dyn.sed

325 lines
12 KiB
Sed
Raw Permalink Normal View History

# Environment access and mutation using regular expressions
#
# Copyright (C) 2018 Mike Gerwitz
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
#
# This script, when applied recursively, implements a system for
# accessing and manipulating an environment using only a series of regular
# expressions (in the formal sense). This poses some interesting challenges
# and builds off of previous example scripts (which are noted below). In
# particular, since regular expressions are context-free, we cannot use
# for backreferences, and so much of the logic here is spent performing
# comparisons (see `cmp.sed'). (Note that sed does support backreferences;
# we do not use them here.)
#
# Consider this example:
#
# foo=quick
# bar=fox
# baz=lazy
#
# foo <- the *foo
# bar <- brown *bar
# verb <- jumped
# subj <- the *baz dog
# full <- *foo *bar *verb over *subj
#
# Recursively evaluating this until completion will yield the output
#
# @0
# foo=the quick
# bar=brown fox
# baz=lazy
# verb=jumped
# subj=the lazy dog
# full=the quick brown fox jumped over the lazy dog
#
# along with a couple trailing newlines.
#
# To accomplish this task, we define a series of states that drive each
# regular expression. The state is represented at the beginning of the
# string, prefixed by a `@'. Some states take arguments so that they can
# act like functions with continuations, allowing them to be re-used in
# multiple contexts. To transition to a new state, a regular expression
# replaces the state at the beginning of the string. Iteration is
# accomplished by creating cycles in this state machine. We can visualize
# the system roughly as this NFA, where the accepting states are enclosed in
# parenthesis:
#
# ,-----,--------------------------------------------------,
# v v \
# @A -> @E -> @PRECMP -> @CMP -> { @*, @_, @=, @! } -> @CLEAN -`
# \
# `-> { (@0), (@1) }
#
# This use of state prefixes in each regular expression makes everything
# quite verbose. We could have accomplished the same thing with
# considerably less verbosity by using sed's block support, but that
# distracts from the fact that all of this can be done using only
# context-free regular expressions. So we don't.
#
# Another consequence of using state prefixes is that the match prohibits
# multiple replacements within the same state on the same invocation of
# sed. For example, @CLEAN's expressions do not use the `/g' flag because
# doing so would still only match the beginning of the pattern space once;
# so rather than being able to clean each env line in one shot, we have to
# clean one line at a time. This could also be mitigated by using sed's
# block support (but again, we won't). Consequently, each pass does very
# little---this script must be run many times to produce the final
# result. However, it does have the nice benefit that you can observe the
# program processing step-by-step.
#
# This is intentionally esoteric; please do not ever use this methodology in
# a production system---it is simply the wrong thing to do (unless you're
# working under such heavy constraints that you have no choice, in which
# case it may be better to choose a different system). I want to say that I
# had fun making this, but the inability to abstract away some fundamental
# verbosity gets quite frustrating. (Attempts at further abstraction would
# have distracted too much from the implementation in this somewhat small
# example.) I attempted to mitigate some of this frustration for you (the
# reader) by adding comments below the expressions to loosely document the
# groups, but as you may know (unless your implementation supports e.g. /x):
# writing them is often considerably easier than reading them. Especially
# the BRE flavor.
#
# To run:
#
# $ ./animate -c env-dyn.sed input-file
#
# See also `env-fixed.sed', which presents a much simpler implementation if
# the environment has a fixed set of variables.
##
# Read all lines into the pattern space.
:a; N; $!ba
# Exit code 1 means we finished successfully, whereas 2 means we finished in
# error.
/^@0/q1; /^@1/q2
# If no state is found at the beginning of the string, then this is our
# first pass: initialize, beginning in state @A (assignment). Always end in
# a newline to simplify regular expressions for assignments.
s/^[^@]/@A\n&/
s/[^\n]$/&\n/
##
# State @A - Validate and Normalize Assignment
##
# If there are no assignments remaining, transition @A->@0 to quit
# successfully the next iteration.
s/^@A\(.*\n\n\)[ \n]*$/@0\1/
# Check that the next assignment is well-formed. If so, transition @A->@E
# and normalize the form by removing whitespace around the assignment
# operator.
s/^@A\(.*\n\n[a-z]\+\) *<- */@E\1<-/
# ( 1 )
# If we're still in State @A, then the assignment must have been
# syntatically invalid. Transition to state @1 (to exit in error next
# iteration).
s/^@A/@1/
##
# State @E - Expand Assignment
#
# Expands variable references in the next assignment. Note how this uses
# @PRECMP for both the expansions and for transitioning to the actual
# assignment to the environment.
##
# If a reference is found, transition @E->@PRECMP with two arguments
# representing the reference replacement (@*) and unknown reference (@_)
# states.
s/^@E\(.*\n\n[^\n]\+\*\([a-z]\+\)\)/@PRECMP \2 @* @_\1/
# ( 1 ^ ( 2 ) )
# ref name
# If there are no more references (that is, no more `*'s in the assignment
# value), then transition @E->@PRECMP with two arguments representing the
# assignment (@=) and append (@!) states.
s/^@E\(.*\n\n\([a-z]\+\)<-[^\n*]\+\n\)/@PRECMP \2 @= @!\1/
# ( 1 ( 2 ) ^ )
# var name
##
# State @PRECMP(S) - Prepare for Comparison
#
# This state takes a single argument S that is the name of the variable to
# compare against. This allows it to set up comparisons for multiple uses.
##
# Set up env var line for comparison by prefix it with ``~S'' (where S is
# the argument to @PRECMP), followed by a duplicate of the env var
# name. This duplicate is necessary since the name will be modified during
# @CMP as part of the comparison process.
s/^\(@PRECMP \([a-z]\+\).*\n\)\([a-z]\+\)=/\1~\2 \3 \3=/
# ( 1 ( 2 ) ) ( 3 )
# cmp name env var
# Transition @PRECMP->@CMP if all environment variables have a comparison
# string before them.
s/^@PRECMP [a-z]\+\([^\n]\+\n\(~[^\n]*\n\)*\n\)/@CMP\1/
# ( 1 ( 2 ) )
# env vars
##
# State @CMP(C A)
#
# The comparison state is responsible for performing the actual comparison
# between the values set up by @PRECMP. It takes two arguments C and A
# that represent the states to transition to for the consequent (if a match
# is successful) or alternative (if no match is found), respectfully. This
# allows @CMP to be used in multiple contexts, like a function call.
##
# At this point, all values to be compared will be of the form ``~x y''.
# Compare the first character of x and y, replacing with an exclamation
# point at the beginning of the line in the case of a non-match.
#
# For more information on this strategy, see `cmp.sed'. We limit ourselves
# here to [a-z].
s/^\(@CMP.*\n\)~a[^ ]* [^a][^ ]* /\1!/
s/^\(@CMP.*\n\)~b[^ ]* [^b][^ ]* /\1!/
s/^\(@CMP.*\n\)~c[^ ]* [^c][^ ]* /\1!/
s/^\(@CMP.*\n\)~d[^ ]* [^d][^ ]* /\1!/
s/^\(@CMP.*\n\)~e[^ ]* [^e][^ ]* /\1!/
s/^\(@CMP.*\n\)~f[^ ]* [^f][^ ]* /\1!/
s/^\(@CMP.*\n\)~g[^ ]* [^g][^ ]* /\1!/
s/^\(@CMP.*\n\)~h[^ ]* [^h][^ ]* /\1!/
s/^\(@CMP.*\n\)~i[^ ]* [^i][^ ]* /\1!/
s/^\(@CMP.*\n\)~j[^ ]* [^j][^ ]* /\1!/
s/^\(@CMP.*\n\)~k[^ ]* [^k][^ ]* /\1!/
s/^\(@CMP.*\n\)~l[^ ]* [^l][^ ]* /\1!/
s/^\(@CMP.*\n\)~m[^ ]* [^m][^ ]* /\1!/
s/^\(@CMP.*\n\)~n[^ ]* [^n][^ ]* /\1!/
s/^\(@CMP.*\n\)~o[^ ]* [^o][^ ]* /\1!/
s/^\(@CMP.*\n\)~p[^ ]* [^p][^ ]* /\1!/
s/^\(@CMP.*\n\)~q[^ ]* [^q][^ ]* /\1!/
s/^\(@CMP.*\n\)~r[^ ]* [^r][^ ]* /\1!/
s/^\(@CMP.*\n\)~s[^ ]* [^s][^ ]* /\1!/
s/^\(@CMP.*\n\)~t[^ ]* [^t][^ ]* /\1!/
s/^\(@CMP.*\n\)~u[^ ]* [^u][^ ]* /\1!/
s/^\(@CMP.*\n\)~v[^ ]* [^v][^ ]* /\1!/
s/^\(@CMP.*\n\)~w[^ ]* [^w][^ ]* /\1!/
s/^\(@CMP.*\n\)~x[^ ]* [^x][^ ]* /\1!/
s/^\(@CMP.*\n\)~y[^ ]* [^y][^ ]* /\1!/
s/^\(@CMP.*\n\)~z[^ ]* [^z][^ ]* /\1!/
# If we have not made a negative comparison replacement above, then the
# characters match; discard them to prepare for the next iteration.
s/^\(@CMP.*\n~\).\([^ ]*\) ./\1\2 /
# If we have ``~ '', then that means all characters have been compared and
# a match has been found. Transition @CMP->C and remove the two trailing
# spaces. If all are non-matches (all env lines prefixed with `!'),
# transition @CMP->A. Otherwise, we're not done yet, so do nothing.
s/^@CMP \(@[^ ]\+\)[^\n]*\(.*\n~\) /\1\2/
# ( 1 = C ) A ( 2 )
s/^@CMP [^ ]\+ \(@[^ ]\+\)\(\(\n![^\n]\+\)*\n\n\)/\1\2/
# C ( 1 = A ) ( ( 3 ) 2 )
##
# State @CLEAN(S)
#
# This state cleans up after @CMP to restore environments to their original
# state. It takes a single argument S to transition to after cleanup.
##
# Strip any non-letter prefix and extra comparison words off of env
# vars. Note that we do not check for the empty line separator---a prefix
# check is enough since the assignment form must begin with a letter.
s/^\(@CLEAN.*\n\)[!~]\([a-z]\+ \)*\([a-z]\+=\)/\1\3/
# ( 1 ) ( 2 ) ( 3 )
# state + args extra words var name=
# If all such prefixes have been stripped, then we're done with
# cleanup. Transition @CLEAN->S.
s/^@CLEAN \(@[^ \n]\+\)\(\(\n[a-z][^\n]*\)*\n\n\)/\1\2/
# ( 1 ) ( ( 3 ) 2 )
# next state entire environment
##
# State @* - Replace Variable Reference
#
# Replace a variable reference with the value of that variable from the
# environment.
##
# Replace the last reference on the assignment line (which happens to be the
# reference that we matched on previously) with the value of the matched
# variable from the environment (prefixed by `~'). Then transition
# @*->@CLEAN with a single argument to return to @E for further expansion.
s/^@\*\(.*\n~[^=]\+=\([^\n]\+\).*\n\n[^\n]\+\)\*[a-z]\+/@CLEAN @E\1\2/
# ( 1 ( 2 ) ) ^
# var value
##
# State @_ - Unknown Reference
#
# Replace unknown references with a string of the form "[UNKNOWN REF: name]".
##
# If we're in this state, then the last reference on the assignment line is
# unknown. Replace it, then transition @_->@E for further expansion.
s/^@_\(.*\n\n[^\n]\+\)\*\([a-z]\+\)/@CLEAN @E\1[UNKNOWN REF: \2]/
# ( 1 ) ^ ( 2 )
# ref name
##
# State @= (Assignment)
#
# Replace env var value with the text of the assignment and then transition
# @=->@CLEAN, providing a single state argument to transition to state @A
# after cleanup is complete.
##
s/^@=\(.*\n\)~\([^=]*\)=[^\n]*\(.*\n\n\)[^<]*<-\([^\n]*\)\n/@CLEAN @A\1\2=\4\3/
# ( 1 ) ( 2 ) ( 3 ) ( 4 )
# var new val
##
# State @! (Non-Match Append)
#
# This state indicates that a match for an environment variable was not
# found and that it should be appended to the environment.
##
# Reformat the ``var<-val'' form as ``var=val'' and add it to the bottom of
# the environment. The assignment line is removed. Then, transition
# @!->@CLEAN with a single argument to transition back to @A after cleanup
# is complete.
s/^@!\(.*\)\n\n\([a-z]\+\)<-\([^\n]\+\)\n/@CLEAN @A\1\n\2=\3\n\n/
# ( 1 ) ( 2 = var) ( 3 = val)