night/regex/env-dyn.sed

# Environment access and mutation using regular expressions
#
# Copyright (C) 2018 Mike Gerwitz
#
#  This program is free software: you can redistribute it and/or modify
#  it under the terms of the GNU General Public License as published by
#  the Free Software Foundation, either version 3 of the License, or
#  (at your option) any later version.
#
#  This program is distributed in the hope that it will be useful,
#  but WITHOUT ANY WARRANTY; without even the implied warranty of
#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#  GNU General Public License for more details.
#
#  You should have received a copy of the GNU General Public License
#  along with this program.  If not, see <http://www.gnu.org/licenses/>.
#
# This script, when applied recursively, implements a system for
# accessing and manipulating an environment using only a series of regular
# expressions (in the formal sense).  This poses some interesting challenges
# and builds off of previous example scripts (which are noted below).  In
# particular, since regular expressions are context-free, we cannot use
# for backreferences, and so much of the logic here is spent performing
# comparisons (see `cmp.sed').  (Note that sed does support backreferences;
# we do not use them here.)
#
# Consider this example:
#
#   foo=quick
#   bar=fox
#   baz=lazy
#
#   foo  <- the *foo
#   bar  <- brown *bar
#   verb <- jumped
#   subj <- the *baz dog
#   full <- *foo *bar *verb over *subj
#
# Recursively evaluating this until completion will yield the output
#
#   @0
#   foo=the quick
#   bar=brown fox
#   baz=lazy
#   verb=jumped
#   subj=the lazy dog
#   full=the quick brown fox jumped over the lazy dog
#
# along with a couple trailing newlines.
#
# To accomplish this task, we define a series of states that drive each
# regular expression.  The state is represented at the beginning of the
# string, prefixed by a `@'.  Some states take arguments so that they can
# act like functions with continuations, allowing them to be re-used in
# multiple contexts.  To transition to a new state, a regular expression
# replaces the state at the beginning of the string.  Iteration is
# accomplished by creating cycles in this state machine.  We can visualize
# the system roughly as this NFA, where the accepting states are enclosed in
# parenthesis:
#
#     ,-----,--------------------------------------------------,
#    v     v                                                    \
#   @A -> @E -> @PRECMP -> @CMP -> { @*, @_, @=, @! } -> @CLEAN -`
#    \
#     `-> { (@0), (@1) }
#
# This use of state prefixes in each regular expression makes everything
# quite verbose.  We could have accomplished the same thing with
# considerably less verbosity by using sed's block support, but that
# distracts from the fact that all of this can be done using only
# context-free regular expressions.  So we don't.
#
# Another consequence of using state prefixes is that the match prohibits
# multiple replacements within the same state on the same invocation of
# sed.  For example, @CLEAN's expressions do not use the `/g' flag because
# doing so would still only match the beginning of the pattern space once;
# so rather than being able to clean each env line in one shot, we have to
# clean one line at a time.  This could also be mitigated by using sed's
# block support (but again, we won't).  Consequently, each pass does very
# little---this script must be run many times to produce the final
# result.  However, it does have the nice benefit that you can observe the
# program processing step-by-step.
#
# This is intentionally esoteric; please do not ever use this methodology in
# a production system---it is simply the wrong thing to do (unless you're
# working under such heavy constraints that you have no choice, in which
# case it may be better to choose a different system).  I want to say that I
# had fun making this, but the inability to abstract away some fundamental
# verbosity gets quite frustrating.  (Attempts at further abstraction would
# have distracted too much from the implementation in this somewhat small
# example.)  I attempted to mitigate some of this frustration for you (the
# reader) by adding comments below the expressions to loosely document the
# groups, but as you may know (unless your implementation supports e.g. /x):
# writing them is often considerably easier than reading them.  Especially
# the BRE flavor.
#
# To run:
#
#   $ ./animate -c env-dyn.sed input-file
#
# See also `env-fixed.sed', which presents a much simpler implementation if
# the environment has a fixed set of variables.
##


# Read all lines into the pattern space.
:a; N; $!ba

# Exit code 1 means we finished successfully, whereas 2 means we finished in
# error.
/^@0/q1;  /^@1/q2

# If no state is found at the beginning of the string, then this is our
# first pass: initialize, beginning in state @A (assignment).  Always end in
# a newline to simplify regular expressions for assignments.
s/^[^@]/@A\n&/
s/[^\n]$/&\n/


##
# State @A - Validate and Normalize Assignment
##

# If there are no assignments remaining, transition @A->@0 to quit
# successfully the next iteration.
s/^@A\(.*\n\n\)[ \n]*$/@0\1/

# Check that the next assignment is well-formed.  If so, transition @A->@E
# and normalize the form by removing whitespace around the assignment
# operator.
s/^@A\(.*\n\n[a-z]\+\) *<- */@E\1<-/
#     (      1       )

# If we're still in State @A, then the assignment must have been
# syntatically invalid.  Transition to state @1 (to exit in error next
# iteration).
s/^@A/@1/


##
# State @E - Expand Assignment
#
# Expands variable references in the next assignment.  Note how this uses
# @PRECMP for both the expansions and for transitioning to the actual
# assignment to the environment.
##

# If a reference is found, transition @E->@PRECMP with two arguments
# representing the reference replacement (@*) and unknown reference (@_)
# states.
s/^@E\(.*\n\n[^\n]\+\*\([a-z]\+\)\)/@PRECMP \2 @* @_\1/
#     (        1     ^ (    2   ) )
#                       ref name

# If there are no more references (that is, no more `*'s in the assignment
# value), then transition @E->@PRECMP with two arguments representing the
# assignment (@=) and append (@!) states.
s/^@E\(.*\n\n\([a-z]\+\)<-[^\n*]\+\n\)/@PRECMP \2 @= @!\1/
#     (   1   (    2   )      ^      )
#              var name


##
# State @PRECMP(S) - Prepare for Comparison
#
# This state takes a single argument S that is the name of the variable to
# compare against.  This allows it to set up comparisons for multiple uses.
##

# Set up env var line for comparison by prefix it with ``~S'' (where S is
# the argument to @PRECMP), followed by a duplicate of the env var
# name.  This duplicate is necessary since the name will be modified during
# @CMP as part of the comparison process.
s/^\(@PRECMP \([a-z]\+\).*\n\)\([a-z]\+\)=/\1~\2 \3 \3=/
#   (    1    (   2    )     ) (    3   )
#              cmp name         env var

# Transition @PRECMP->@CMP if all environment variables have a comparison
# string before them.
s/^@PRECMP [a-z]\+\([^\n]\+\n\(~[^\n]*\n\)*\n\)/@CMP\1/
#                  (     1    (     2    )    )
#                               env vars


##
# State @CMP(C A)
#
# The comparison state is responsible for performing the actual comparison
# between the values set up by @PRECMP.  It takes two arguments C and A
# that represent the states to transition to for the consequent (if a match
# is successful) or alternative (if no match is found), respectfully.  This
# allows @CMP to be used in multiple contexts, like a function call.
##

# At this point, all values to be compared will be of the form ``~x y''.
# Compare the first character of x and y, replacing with an exclamation
# point at the beginning of the line in the case of a non-match.
#
# For more information on this strategy, see `cmp.sed'.  We limit ourselves
# here to [a-z].
s/^\(@CMP.*\n\)~a[^ ]* [^a][^ ]* /\1!/
s/^\(@CMP.*\n\)~b[^ ]* [^b][^ ]* /\1!/
s/^\(@CMP.*\n\)~c[^ ]* [^c][^ ]* /\1!/
s/^\(@CMP.*\n\)~d[^ ]* [^d][^ ]* /\1!/
s/^\(@CMP.*\n\)~e[^ ]* [^e][^ ]* /\1!/
s/^\(@CMP.*\n\)~f[^ ]* [^f][^ ]* /\1!/
s/^\(@CMP.*\n\)~g[^ ]* [^g][^ ]* /\1!/
s/^\(@CMP.*\n\)~h[^ ]* [^h][^ ]* /\1!/
s/^\(@CMP.*\n\)~i[^ ]* [^i][^ ]* /\1!/
s/^\(@CMP.*\n\)~j[^ ]* [^j][^ ]* /\1!/
s/^\(@CMP.*\n\)~k[^ ]* [^k][^ ]* /\1!/
s/^\(@CMP.*\n\)~l[^ ]* [^l][^ ]* /\1!/
s/^\(@CMP.*\n\)~m[^ ]* [^m][^ ]* /\1!/
s/^\(@CMP.*\n\)~n[^ ]* [^n][^ ]* /\1!/
s/^\(@CMP.*\n\)~o[^ ]* [^o][^ ]* /\1!/
s/^\(@CMP.*\n\)~p[^ ]* [^p][^ ]* /\1!/
s/^\(@CMP.*\n\)~q[^ ]* [^q][^ ]* /\1!/
s/^\(@CMP.*\n\)~r[^ ]* [^r][^ ]* /\1!/
s/^\(@CMP.*\n\)~s[^ ]* [^s][^ ]* /\1!/
s/^\(@CMP.*\n\)~t[^ ]* [^t][^ ]* /\1!/
s/^\(@CMP.*\n\)~u[^ ]* [^u][^ ]* /\1!/
s/^\(@CMP.*\n\)~v[^ ]* [^v][^ ]* /\1!/
s/^\(@CMP.*\n\)~w[^ ]* [^w][^ ]* /\1!/
s/^\(@CMP.*\n\)~x[^ ]* [^x][^ ]* /\1!/
s/^\(@CMP.*\n\)~y[^ ]* [^y][^ ]* /\1!/
s/^\(@CMP.*\n\)~z[^ ]* [^z][^ ]* /\1!/

# If we have not made a negative comparison replacement above, then the
# characters match; discard them to prepare for the next iteration.
s/^\(@CMP.*\n~\).\([^ ]*\) ./\1\2 /

# If we have ``~  '', then that means all characters have been compared and
# a match has been found.  Transition @CMP->C and remove the two trailing
# spaces.  If all are non-matches (all env lines prefixed with `!'),
# transition @CMP->A.  Otherwise, we're not done yet, so do nothing.
s/^@CMP \(@[^ ]\+\)[^\n]*\(.*\n~\)  /\1\2/
#        ( 1 = C  )  A    (   2  )
s/^@CMP [^ ]\+ \(@[^ ]\+\)\(\(\n![^\n]\+\)*\n\n\)/\1\2/
#         C     ( 1 = A  ) ( (     3     )   2  )


##
# State @CLEAN(S)
#
# This state cleans up after @CMP to restore environments to their original
# state.  It takes a single argument S to transition to after cleanup.
##

# Strip any non-letter prefix and extra comparison words off of env
# vars.  Note that we do not check for the empty line separator---a prefix
# check is enough since the assignment form must begin with a letter.
s/^\(@CLEAN.*\n\)[!~]\([a-z]\+ \)*\([a-z]\+=\)/\1\3/
#   (     1     )     (    2    )  (    3    )
#    state + args     extra words   var name=

# If all such prefixes have been stripped, then we're done with
# cleanup.  Transition @CLEAN->S.
s/^@CLEAN \(@[^ \n]\+\)\(\(\n[a-z][^\n]*\)*\n\n\)/\1\2/
#          (   1    ) ( (      3       )   2  )
#          next state     entire environment


##
# State @* - Replace Variable Reference
#
# Replace a variable reference with the value of that variable from the
# environment.
##

# Replace the last reference on the assignment line (which happens to be the
# reference that we matched on previously) with the value of the matched
# variable from the environment (prefixed by `~').  Then transition
# @*->@CLEAN with a single argument to return to @E for further expansion.
s/^@\*\(.*\n~[^=]\+=\([^\n]\+\).*\n\n[^\n]\+\)\*[a-z]\+/@CLEAN @E\1\2/
#      (      1      (   2    )              ) ^
#                     var value


##
# State @_ - Unknown Reference
#
# Replace unknown references with a string of the form "[UNKNOWN REF: name]".
##

# If we're in this state, then the last reference on the assignment line is
# unknown.  Replace it, then transition @_->@E for further expansion.
s/^@_\(.*\n\n[^\n]\+\)\*\([a-z]\+\)/@CLEAN @E\1[UNKNOWN REF: \2]/
#     (       1      ) ^ (    2   )
#                         ref name


##
# State @= (Assignment)
#
# Replace env var value with the text of the assignment and then transition
# @=->@CLEAN, providing a single state argument to transition to state @A
# after cleanup is complete.
##

s/^@=\(.*\n\)~\([^=]*\)=[^\n]*\(.*\n\n\)[^<]*<-\([^\n]*\)\n/@CLEAN @A\1\2=\4\3/
#     (  1  )  (   2  )        (   3   )        (   4   )
#                 var                            new val


##
# State @! (Non-Match Append)
#
# This state indicates that a match for an environment variable was not
# found and that it should be appended to the environment.
##

# Reformat the ``var<-val'' form as ``var=val'' and add it to the bottom of
# the environment.  The assignment line is removed.  Then, transition
# @!->@CLEAN with a single argument to transition back to @A after cleanup
# is complete.
s/^@!\(.*\)\n\n\([a-z]\+\)<-\([^\n]\+\)\n/@CLEAN @A\1\n\2=\3\n\n/
#     ( 1 )     ( 2 = var)   ( 3 = val)