night/regex/env-dyn.sed

325 lines
12 KiB
Sed
Raw Permalink Blame History

This file contains invisible Unicode characters!

This file contains invisible Unicode characters that may be processed differently from what appears below. If your use case is intentional and legitimate, you can safely ignore this warning. Use the Escape button to reveal hidden characters.

# Environment access and mutation using regular expressions
#
# Copyright (C) 2018 Mike Gerwitz
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
#
# This script, when applied recursively, implements a system for
# accessing and manipulating an environment using only a series of regular
# expressions (in the formal sense). This poses some interesting challenges
# and builds off of previous example scripts (which are noted below). In
# particular, since regular expressions are context-free, we cannot use
# for backreferences, and so much of the logic here is spent performing
# comparisons (see `cmp.sed'). (Note that sed does support backreferences;
# we do not use them here.)
#
# Consider this example:
#
# foo=quick
# bar=fox
# baz=lazy
#
# foo <- the *foo
# bar <- brown *bar
# verb <- jumped
# subj <- the *baz dog
# full <- *foo *bar *verb over *subj
#
# Recursively evaluating this until completion will yield the output
#
# @0
# foo=the quick
# bar=brown fox
# baz=lazy
# verb=jumped
# subj=the lazy dog
# full=the quick brown fox jumped over the lazy dog
#
# along with a couple trailing newlines.
#
# To accomplish this task, we define a series of states that drive each
# regular expression. The state is represented at the beginning of the
# string, prefixed by a `@'. Some states take arguments so that they can
# act like functions with continuations, allowing them to be re-used in
# multiple contexts. To transition to a new state, a regular expression
# replaces the state at the beginning of the string. Iteration is
# accomplished by creating cycles in this state machine. We can visualize
# the system roughly as this NFA, where the accepting states are enclosed in
# parenthesis:
#
# ,-----,--------------------------------------------------,
# v v \
# @A -> @E -> @PRECMP -> @CMP -> { @*, @_, @=, @! } -> @CLEAN -`
# \
# `-> { (@0), (@1) }
#
# This use of state prefixes in each regular expression makes everything
# quite verbose. We could have accomplished the same thing with
# considerably less verbosity by using sed's block support, but that
# distracts from the fact that all of this can be done using only
# context-free regular expressions. So we don't.
#
# Another consequence of using state prefixes is that the match prohibits
# multiple replacements within the same state on the same invocation of
# sed. For example, @CLEAN's expressions do not use the `/g' flag because
# doing so would still only match the beginning of the pattern space once;
# so rather than being able to clean each env line in one shot, we have to
# clean one line at a time. This could also be mitigated by using sed's
# block support (but again, we won't). Consequently, each pass does very
# little---this script must be run many times to produce the final
# result. However, it does have the nice benefit that you can observe the
# program processing step-by-step.
#
# This is intentionally esoteric; please do not ever use this methodology in
# a production system---it is simply the wrong thing to do (unless you're
# working under such heavy constraints that you have no choice, in which
# case it may be better to choose a different system). I want to say that I
# had fun making this, but the inability to abstract away some fundamental
# verbosity gets quite frustrating. (Attempts at further abstraction would
# have distracted too much from the implementation in this somewhat small
# example.) I attempted to mitigate some of this frustration for you (the
# reader) by adding comments below the expressions to loosely document the
# groups, but as you may know (unless your implementation supports e.g. /x):
# writing them is often considerably easier than reading them. Especially
# the BRE flavor.
#
# To run:
#
# $ ./animate -c env-dyn.sed input-file
#
# See also `env-fixed.sed', which presents a much simpler implementation if
# the environment has a fixed set of variables.
##
# Read all lines into the pattern space.
:a; N; $!ba
# Exit code 1 means we finished successfully, whereas 2 means we finished in
# error.
/^@0/q1; /^@1/q2
# If no state is found at the beginning of the string, then this is our
# first pass: initialize, beginning in state @A (assignment). Always end in
# a newline to simplify regular expressions for assignments.
s/^[^@]/@A\n&/
s/[^\n]$/&\n/
##
# State @A - Validate and Normalize Assignment
##
# If there are no assignments remaining, transition @A->@0 to quit
# successfully the next iteration.
s/^@A\(.*\n\n\)[ \n]*$/@0\1/
# Check that the next assignment is well-formed. If so, transition @A->@E
# and normalize the form by removing whitespace around the assignment
# operator.
s/^@A\(.*\n\n[a-z]\+\) *<- */@E\1<-/
# ( 1 )
# If we're still in State @A, then the assignment must have been
# syntatically invalid. Transition to state @1 (to exit in error next
# iteration).
s/^@A/@1/
##
# State @E - Expand Assignment
#
# Expands variable references in the next assignment. Note how this uses
# @PRECMP for both the expansions and for transitioning to the actual
# assignment to the environment.
##
# If a reference is found, transition @E->@PRECMP with two arguments
# representing the reference replacement (@*) and unknown reference (@_)
# states.
s/^@E\(.*\n\n[^\n]\+\*\([a-z]\+\)\)/@PRECMP \2 @* @_\1/
# ( 1 ^ ( 2 ) )
# ref name
# If there are no more references (that is, no more `*'s in the assignment
# value), then transition @E->@PRECMP with two arguments representing the
# assignment (@=) and append (@!) states.
s/^@E\(.*\n\n\([a-z]\+\)<-[^\n*]\+\n\)/@PRECMP \2 @= @!\1/
# ( 1 ( 2 ) ^ )
# var name
##
# State @PRECMP(S) - Prepare for Comparison
#
# This state takes a single argument S that is the name of the variable to
# compare against. This allows it to set up comparisons for multiple uses.
##
# Set up env var line for comparison by prefix it with ``~S'' (where S is
# the argument to @PRECMP), followed by a duplicate of the env var
# name. This duplicate is necessary since the name will be modified during
# @CMP as part of the comparison process.
s/^\(@PRECMP \([a-z]\+\).*\n\)\([a-z]\+\)=/\1~\2 \3 \3=/
# ( 1 ( 2 ) ) ( 3 )
# cmp name env var
# Transition @PRECMP->@CMP if all environment variables have a comparison
# string before them.
s/^@PRECMP [a-z]\+\([^\n]\+\n\(~[^\n]*\n\)*\n\)/@CMP\1/
# ( 1 ( 2 ) )
# env vars
##
# State @CMP(C A)
#
# The comparison state is responsible for performing the actual comparison
# between the values set up by @PRECMP. It takes two arguments C and A
# that represent the states to transition to for the consequent (if a match
# is successful) or alternative (if no match is found), respectfully. This
# allows @CMP to be used in multiple contexts, like a function call.
##
# At this point, all values to be compared will be of the form ``~x y''.
# Compare the first character of x and y, replacing with an exclamation
# point at the beginning of the line in the case of a non-match.
#
# For more information on this strategy, see `cmp.sed'. We limit ourselves
# here to [a-z].
s/^\(@CMP.*\n\)~a[^ ]* [^a][^ ]* /\1!/
s/^\(@CMP.*\n\)~b[^ ]* [^b][^ ]* /\1!/
s/^\(@CMP.*\n\)~c[^ ]* [^c][^ ]* /\1!/
s/^\(@CMP.*\n\)~d[^ ]* [^d][^ ]* /\1!/
s/^\(@CMP.*\n\)~e[^ ]* [^e][^ ]* /\1!/
s/^\(@CMP.*\n\)~f[^ ]* [^f][^ ]* /\1!/
s/^\(@CMP.*\n\)~g[^ ]* [^g][^ ]* /\1!/
s/^\(@CMP.*\n\)~h[^ ]* [^h][^ ]* /\1!/
s/^\(@CMP.*\n\)~i[^ ]* [^i][^ ]* /\1!/
s/^\(@CMP.*\n\)~j[^ ]* [^j][^ ]* /\1!/
s/^\(@CMP.*\n\)~k[^ ]* [^k][^ ]* /\1!/
s/^\(@CMP.*\n\)~l[^ ]* [^l][^ ]* /\1!/
s/^\(@CMP.*\n\)~m[^ ]* [^m][^ ]* /\1!/
s/^\(@CMP.*\n\)~n[^ ]* [^n][^ ]* /\1!/
s/^\(@CMP.*\n\)~o[^ ]* [^o][^ ]* /\1!/
s/^\(@CMP.*\n\)~p[^ ]* [^p][^ ]* /\1!/
s/^\(@CMP.*\n\)~q[^ ]* [^q][^ ]* /\1!/
s/^\(@CMP.*\n\)~r[^ ]* [^r][^ ]* /\1!/
s/^\(@CMP.*\n\)~s[^ ]* [^s][^ ]* /\1!/
s/^\(@CMP.*\n\)~t[^ ]* [^t][^ ]* /\1!/
s/^\(@CMP.*\n\)~u[^ ]* [^u][^ ]* /\1!/
s/^\(@CMP.*\n\)~v[^ ]* [^v][^ ]* /\1!/
s/^\(@CMP.*\n\)~w[^ ]* [^w][^ ]* /\1!/
s/^\(@CMP.*\n\)~x[^ ]* [^x][^ ]* /\1!/
s/^\(@CMP.*\n\)~y[^ ]* [^y][^ ]* /\1!/
s/^\(@CMP.*\n\)~z[^ ]* [^z][^ ]* /\1!/
# If we have not made a negative comparison replacement above, then the
# characters match; discard them to prepare for the next iteration.
s/^\(@CMP.*\n~\).\([^ ]*\) ./\1\2 /
# If we have ``~ '', then that means all characters have been compared and
# a match has been found. Transition @CMP->C and remove the two trailing
# spaces. If all are non-matches (all env lines prefixed with `!'),
# transition @CMP->A. Otherwise, we're not done yet, so do nothing.
s/^@CMP \(@[^ ]\+\)[^\n]*\(.*\n~\) /\1\2/
# ( 1 = C ) A ( 2 )
s/^@CMP [^ ]\+ \(@[^ ]\+\)\(\(\n![^\n]\+\)*\n\n\)/\1\2/
# C ( 1 = A ) ( ( 3 ) 2 )
##
# State @CLEAN(S)
#
# This state cleans up after @CMP to restore environments to their original
# state. It takes a single argument S to transition to after cleanup.
##
# Strip any non-letter prefix and extra comparison words off of env
# vars. Note that we do not check for the empty line separator---a prefix
# check is enough since the assignment form must begin with a letter.
s/^\(@CLEAN.*\n\)[!~]\([a-z]\+ \)*\([a-z]\+=\)/\1\3/
# ( 1 ) ( 2 ) ( 3 )
# state + args extra words var name=
# If all such prefixes have been stripped, then we're done with
# cleanup. Transition @CLEAN->S.
s/^@CLEAN \(@[^ \n]\+\)\(\(\n[a-z][^\n]*\)*\n\n\)/\1\2/
# ( 1 ) ( ( 3 ) 2 )
# next state entire environment
##
# State @* - Replace Variable Reference
#
# Replace a variable reference with the value of that variable from the
# environment.
##
# Replace the last reference on the assignment line (which happens to be the
# reference that we matched on previously) with the value of the matched
# variable from the environment (prefixed by `~'). Then transition
# @*->@CLEAN with a single argument to return to @E for further expansion.
s/^@\*\(.*\n~[^=]\+=\([^\n]\+\).*\n\n[^\n]\+\)\*[a-z]\+/@CLEAN @E\1\2/
# ( 1 ( 2 ) ) ^
# var value
##
# State @_ - Unknown Reference
#
# Replace unknown references with a string of the form "[UNKNOWN REF: name]".
##
# If we're in this state, then the last reference on the assignment line is
# unknown. Replace it, then transition @_->@E for further expansion.
s/^@_\(.*\n\n[^\n]\+\)\*\([a-z]\+\)/@CLEAN @E\1[UNKNOWN REF: \2]/
# ( 1 ) ^ ( 2 )
# ref name
##
# State @= (Assignment)
#
# Replace env var value with the text of the assignment and then transition
# @=->@CLEAN, providing a single state argument to transition to state @A
# after cleanup is complete.
##
s/^@=\(.*\n\)~\([^=]*\)=[^\n]*\(.*\n\n\)[^<]*<-\([^\n]*\)\n/@CLEAN @A\1\2=\4\3/
# ( 1 ) ( 2 ) ( 3 ) ( 4 )
# var new val
##
# State @! (Non-Match Append)
#
# This state indicates that a match for an environment variable was not
# found and that it should be appended to the environment.
##
# Reformat the ``var<-val'' form as ``var=val'' and add it to the bottom of
# the environment. The assignment line is removed. Then, transition
# @!->@CLEAN with a single argument to transition back to @A after cleanup
# is complete.
s/^@!\(.*\)\n\n\([a-z]\+\)<-\([^\n]\+\)\n/@CLEAN @A\1\n\2=\3\n\n/
# ( 1 ) ( 2 = var) ( 3 = val)