diff --git a/regex/cmp.sed b/regex/cmp.sed index bd86f5c..41f43f3 100644 --- a/regex/cmp.sed +++ b/regex/cmp.sed @@ -50,6 +50,9 @@ # # If all possible strings are known ahead of time, we could also make such # comparisons directly. This would require O(1) steps rather than O(n). +# +# To observe this comparison used as part of a larger program, see +# `env-dyn.sed'. ## # Quit if we do not have two space-delimited values to compare. Note that diff --git a/regex/env-dyn.sed b/regex/env-dyn.sed new file mode 100644 index 0000000..d56bdad --- /dev/null +++ b/regex/env-dyn.sed @@ -0,0 +1,324 @@ +# Environment access and mutation using regular expressions +# +# Copyright (C) 2018 Mike Gerwitz +# +# This program is free software: you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation, either version 3 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program. If not, see . +# +# This script, when applied recursively, implements a system for +# accessing and manipulating an environment using only a series of regular +# expressions (in the formal sense). This poses some interesting challenges +# and builds off of previous example scripts (which are noted below). In +# particular, since regular expressions are context-free, we cannot use +# for backreferences, and so much of the logic here is spent performing +# comparisons (see `cmp.sed'). (Note that sed does support backreferences; +# we do not use them here.) +# +# Consider this example: +# +# foo=quick +# bar=fox +# baz=lazy +# +# foo <- the *foo +# bar <- brown *bar +# verb <- jumped +# subj <- the *baz dog +# full <- *foo *bar *verb over *subj +# +# Recursively evaluating this until completion will yield the output +# +# @0 +# foo=the quick +# bar=brown fox +# baz=lazy +# verb=jumped +# subj=the lazy dog +# full=the quick brown fox jumped over the lazy dog +# +# along with a couple trailing newlines. +# +# To accomplish this task, we define a series of states that drive each +# regular expression. The state is represented at the beginning of the +# string, prefixed by a `@'. Some states take arguments so that they can +# act like functions with continuations, allowing them to be re-used in +# multiple contexts. To transition to a new state, a regular expression +# replaces the state at the beginning of the string. Iteration is +# accomplished by creating cycles in this state machine. We can visualize +# the system roughly as this NFA, where the accepting states are enclosed in +# parenthesis: +# +# ,-----,--------------------------------------------------, +# v v \ +# @A -> @E -> @PRECMP -> @CMP -> { @*, @_, @=, @! } -> @CLEAN -` +# \ +# `-> { (@0), (@1) } +# +# This use of state prefixes in each regular expression makes everything +# quite verbose. We could have accomplished the same thing with +# considerably less verbosity by using sed's block support, but that +# distracts from the fact that all of this can be done using only +# context-free regular expressions. So we don't. +# +# Another consequence of using state prefixes is that the match prohibits +# multiple replacements within the same state on the same invocation of +# sed. For example, @CLEAN's expressions do not use the `/g' flag because +# doing so would still only match the beginning of the pattern space once; +# so rather than being able to clean each env line in one shot, we have to +# clean one line at a time. This could also be mitigated by using sed's +# block support (but again, we won't). Consequently, each pass does very +# little---this script must be run many times to produce the final +# result. However, it does have the nice benefit that you can observe the +# program processing step-by-step. +# +# This is intentionally esoteric; please do not ever use this methodology in +# a production system---it is simply the wrong thing to do (unless you're +# working under such heavy constraints that you have no choice, in which +# case it may be better to choose a different system). I want to say that I +# had fun making this, but the inability to abstract away some fundamental +# verbosity gets quite frustrating. (Attempts at further abstraction would +# have distracted too much from the implementation in this somewhat small +# example.) I attempted to mitigate some of this frustration for you (the +# reader) by adding comments below the expressions to loosely document the +# groups, but as you may know (unless your implementation supports e.g. /x): +# writing them is often considerably easier than reading them. Especially +# the BRE flavor. +# +# To run: +# +# $ ./animate -c env-dyn.sed input-file +# +# See also `env-fixed.sed', which presents a much simpler implementation if +# the environment has a fixed set of variables. +## + + +# Read all lines into the pattern space. +:a; N; $!ba + +# Exit code 1 means we finished successfully, whereas 2 means we finished in +# error. +/^@0/q1; /^@1/q2 + +# If no state is found at the beginning of the string, then this is our +# first pass: initialize, beginning in state @A (assignment). Always end in +# a newline to simplify regular expressions for assignments. +s/^[^@]/@A\n&/ +s/[^\n]$/&\n/ + + + +## +# State @A - Validate and Normalize Assignment +## + +# If there are no assignments remaining, transition @A->@0 to quit +# successfully the next iteration. +s/^@A\(.*\n\n\)[ \n]*$/@0\1/ + +# Check that the next assignment is well-formed. If so, transition @A->@E +# and normalize the form by removing whitespace around the assignment +# operator. +s/^@A\(.*\n\n[a-z]\+\) *<- */@E\1<-/ +# ( 1 ) + +# If we're still in State @A, then the assignment must have been +# syntatically invalid. Transition to state @1 (to exit in error next +# iteration). +s/^@A/@1/ + + + +## +# State @E - Expand Assignment +# +# Expands variable references in the next assignment. Note how this uses +# @PRECMP for both the expansions and for transitioning to the actual +# assignment to the environment. +## + +# If a reference is found, transition @E->@PRECMP with two arguments +# representing the reference replacement (@*) and unknown reference (@_) +# states. +s/^@E\(.*\n\n[^\n]\+\*\([a-z]\+\)\)/@PRECMP \2 @* @_\1/ +# ( 1 ^ ( 2 ) ) +# ref name + +# If there are no more references (that is, no more `*'s in the assignment +# value), then transition @E->@PRECMP with two arguments representing the +# assignment (@=) and append (@!) states. +s/^@E\(.*\n\n\([a-z]\+\)<-[^\n*]\+\n\)/@PRECMP \2 @= @!\1/ +# ( 1 ( 2 ) ^ ) +# var name + + + +## +# State @PRECMP(S) - Prepare for Comparison +# +# This state takes a single argument S that is the name of the variable to +# compare against. This allows it to set up comparisons for multiple uses. +## + +# Set up env var line for comparison by prefix it with ``~S'' (where S is +# the argument to @PRECMP), followed by a duplicate of the env var +# name. This duplicate is necessary since the name will be modified during +# @CMP as part of the comparison process. +s/^\(@PRECMP \([a-z]\+\).*\n\)\([a-z]\+\)=/\1~\2 \3 \3=/ +# ( 1 ( 2 ) ) ( 3 ) +# cmp name env var + +# Transition @PRECMP->@CMP if all environment variables have a comparison +# string before them. +s/^@PRECMP [a-z]\+\([^\n]\+\n\(~[^\n]*\n\)*\n\)/@CMP\1/ +# ( 1 ( 2 ) ) +# env vars + + + +## +# State @CMP(C A) +# +# The comparison state is responsible for performing the actual comparison +# between the values set up by @PRECMP. It takes two arguments C and A +# that represent the states to transition to for the consequent (if a match +# is successful) or alternative (if no match is found), respectfully. This +# allows @CMP to be used in multiple contexts, like a function call. +## + +# At this point, all values to be compared will be of the form ``~x y''. +# Compare the first character of x and y, replacing with an exclamation +# point at the beginning of the line in the case of a non-match. +# +# For more information on this strategy, see `cmp.sed'. We limit ourselves +# here to [a-z]. +s/^\(@CMP.*\n\)~a[^ ]* [^a][^ ]* /\1!/ +s/^\(@CMP.*\n\)~b[^ ]* [^b][^ ]* /\1!/ +s/^\(@CMP.*\n\)~c[^ ]* [^c][^ ]* /\1!/ +s/^\(@CMP.*\n\)~d[^ ]* [^d][^ ]* /\1!/ +s/^\(@CMP.*\n\)~e[^ ]* [^e][^ ]* /\1!/ +s/^\(@CMP.*\n\)~f[^ ]* [^f][^ ]* /\1!/ +s/^\(@CMP.*\n\)~g[^ ]* [^g][^ ]* /\1!/ +s/^\(@CMP.*\n\)~h[^ ]* [^h][^ ]* /\1!/ +s/^\(@CMP.*\n\)~i[^ ]* [^i][^ ]* /\1!/ +s/^\(@CMP.*\n\)~j[^ ]* [^j][^ ]* /\1!/ +s/^\(@CMP.*\n\)~k[^ ]* [^k][^ ]* /\1!/ +s/^\(@CMP.*\n\)~l[^ ]* [^l][^ ]* /\1!/ +s/^\(@CMP.*\n\)~m[^ ]* [^m][^ ]* /\1!/ +s/^\(@CMP.*\n\)~n[^ ]* [^n][^ ]* /\1!/ +s/^\(@CMP.*\n\)~o[^ ]* [^o][^ ]* /\1!/ +s/^\(@CMP.*\n\)~p[^ ]* [^p][^ ]* /\1!/ +s/^\(@CMP.*\n\)~q[^ ]* [^q][^ ]* /\1!/ +s/^\(@CMP.*\n\)~r[^ ]* [^r][^ ]* /\1!/ +s/^\(@CMP.*\n\)~s[^ ]* [^s][^ ]* /\1!/ +s/^\(@CMP.*\n\)~t[^ ]* [^t][^ ]* /\1!/ +s/^\(@CMP.*\n\)~u[^ ]* [^u][^ ]* /\1!/ +s/^\(@CMP.*\n\)~v[^ ]* [^v][^ ]* /\1!/ +s/^\(@CMP.*\n\)~w[^ ]* [^w][^ ]* /\1!/ +s/^\(@CMP.*\n\)~x[^ ]* [^x][^ ]* /\1!/ +s/^\(@CMP.*\n\)~y[^ ]* [^y][^ ]* /\1!/ +s/^\(@CMP.*\n\)~z[^ ]* [^z][^ ]* /\1!/ + +# If we have not made a negative comparison replacement above, then the +# characters match; discard them to prepare for the next iteration. +s/^\(@CMP.*\n~\).\([^ ]*\) ./\1\2 / + +# If we have ``~ '', then that means all characters have been compared and +# a match has been found. Transition @CMP->C and remove the two trailing +# spaces. If all are non-matches (all env lines prefixed with `!'), +# transition @CMP->A. Otherwise, we're not done yet, so do nothing. +s/^@CMP \(@[^ ]\+\)[^\n]*\(.*\n~\) /\1\2/ +# ( 1 = C ) A ( 2 ) +s/^@CMP [^ ]\+ \(@[^ ]\+\)\(\(\n![^\n]\+\)*\n\n\)/\1\2/ +# C ( 1 = A ) ( ( 3 ) 2 ) + + + +## +# State @CLEAN(S) +# +# This state cleans up after @CMP to restore environments to their original +# state. It takes a single argument S to transition to after cleanup. +## + +# Strip any non-letter prefix and extra comparison words off of env +# vars. Note that we do not check for the empty line separator---a prefix +# check is enough since the assignment form must begin with a letter. +s/^\(@CLEAN.*\n\)[!~]\([a-z]\+ \)*\([a-z]\+=\)/\1\3/ +# ( 1 ) ( 2 ) ( 3 ) +# state + args extra words var name= + +# If all such prefixes have been stripped, then we're done with +# cleanup. Transition @CLEAN->S. +s/^@CLEAN \(@[^ \n]\+\)\(\(\n[a-z][^\n]*\)*\n\n\)/\1\2/ +# ( 1 ) ( ( 3 ) 2 ) +# next state entire environment + + + +## +# State @* - Replace Variable Reference +# +# Replace a variable reference with the value of that variable from the +# environment. +## + +# Replace the last reference on the assignment line (which happens to be the +# reference that we matched on previously) with the value of the matched +# variable from the environment (prefixed by `~'). Then transition +# @*->@CLEAN with a single argument to return to @E for further expansion. +s/^@\*\(.*\n~[^=]\+=\([^\n]\+\).*\n\n[^\n]\+\)\*[a-z]\+/@CLEAN @E\1\2/ +# ( 1 ( 2 ) ) ^ +# var value + + + +## +# State @_ - Unknown Reference +# +# Replace unknown references with a string of the form "[UNKNOWN REF: name]". +## + +# If we're in this state, then the last reference on the assignment line is +# unknown. Replace it, then transition @_->@E for further expansion. +s/^@_\(.*\n\n[^\n]\+\)\*\([a-z]\+\)/@CLEAN @E\1[UNKNOWN REF: \2]/ +# ( 1 ) ^ ( 2 ) +# ref name + + +## +# State @= (Assignment) +# +# Replace env var value with the text of the assignment and then transition +# @=->@CLEAN, providing a single state argument to transition to state @A +# after cleanup is complete. +## + +s/^@=\(.*\n\)~\([^=]*\)=[^\n]*\(.*\n\n\)[^<]*<-\([^\n]*\)\n/@CLEAN @A\1\2=\4\3/ +# ( 1 ) ( 2 ) ( 3 ) ( 4 ) +# var new val + + +## +# State @! (Non-Match Append) +# +# This state indicates that a match for an environment variable was not +# found and that it should be appended to the environment. +## + +# Reformat the ``var<-val'' form as ``var=val'' and add it to the bottom of +# the environment. The assignment line is removed. Then, transition +# @!->@CLEAN with a single argument to transition back to @A after cleanup +# is complete. +s/^@!\(.*\)\n\n\([a-z]\+\)<-\([^\n]\+\)\n/@CLEAN @A\1\n\2=\3\n\n/ +# ( 1 ) ( 2 = var) ( 3 = val) diff --git a/regex/env-fixed.sed b/regex/env-fixed.sed new file mode 100644 index 0000000..c64523c --- /dev/null +++ b/regex/env-fixed.sed @@ -0,0 +1,51 @@ +# Fixed environment access and mutation using regular expressions +# +# Copyright (C) 2018 Mike Gerwitz +# +# This program is free software: you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation, either version 3 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program. If not, see . +# +# This is a fairly simply example demonstrating how an environment +# consisting of pre-determined variables can be accessed and manipulated +# using only regular expressions (in the formal sense). +# +# See also `env-dyn.sed' for a a much more sophisticated example of a +# dynamic environment where the variables are _not_ pre-determined. +## + + +# Read all lines into the pattern space. +:a; N; $!ba + +# Variable reference in an assignment form. This copies the value from the +# environment line into the assignment line. Note that this will only +# perform a single replacement per variable. Because regexes are greedy, +# this will replace the _last_ occurrence in the assignment form with the +# value of the current environment. +s/\(x=\)\([^ ]*\)\(.*\n.<-.*\)x/\1\2\3\2/ +s/\(y=\)\([^ ]*\)\(.*\n.<-.*\)y/\1\2\3\2/ +s/\(z=\)\([^ ]*\)\(.*\n.<-.*\)z/\1\2\3\2/ + +# Environment mutation. This assigns a new value to the environment by +# copying the value from the assignment line into the environment line. +s/\(x=\)[^ ]*\(.*\)\nx<-\([^ ]*\)/\1\3\2/ +s/\(y=\)[^ ]*\(.*\)\ny<-\([^ ]*\)/\1\3\2/ +s/\(z=\)[^ ]*\(.*\)\nz<-\([^ ]*\)/\1\3\2/ + +# The above expressions should have eliminated the assignment line if there +# was a match; if it still exists, then there is a syntax error. +s/.<-.*$/ERROR: &/ + +# This script only handles a single pass. Let the animate script know that +# we are done. +q1