cptt/cptt.tex

% Discussion of Selected Exercises From Compilers: Principles, Techniques, and Tools
%
%   Copyright (C) 2013, 2018 Mike Gerwitz
%
%   Licensed under a Creative Commons Attribution-ShareAlike 4.0
%   International License.
%
% Discussion of section 4.2.8 (exercises for section 4.2) in CPTT (the
% "dragon book")
%%

\documentclass[draft]{article}
\usepackage{amsmath,amssymb,tikz}
\usetikzlibrary{automata,positioning}

\begin{document}
\title{Discussion of Selected Exercises: \\
Section 4.2.8 of Compilers: Principles, Techniques and Tools \\
\vspace{1em}
\large{Topic: Context-Free Grammars}}

\author{2013-05-15}
\date{\today}

\maketitle

\def\exercise#1 #2\par{
  \goodbreak
  \vspace{0.5em plus 0.5em}
  \noindent
  \llap{\bf Exercise #1 }%
  {\sl#2}\par
  \vspace{0.5em plus 0.5em}
  \goodbreak
}
\def\exend{$\blacksquare$}

\def\set#1{\left\{#1\right\}}

\def\nt#1{{\ifmmode#1\else$#1$\fi}}
\def\nts#1{\;\nt#1\;}
\def\prod{\rightarrow}
\def\punion{\;|\;}
\def\emptystr{\ifmmode\epsilon\else$\emptystr$\fi}

\def\mspace#1{\ifmmode\;#1\;\else$#1$\fi}

\def\derivop{\displaystyle\mathop{\Rightarrow}}
\def\deriv{{\mspace\derivop}} % extra grouping to solve issue in mmode w/ align
\def\lmderiv{\mspace{\deriv\limits_{lm}}}
\def\derivz{\mspace{\derivop^{\kern -0.25em*}}}
\def\derivp{\mspace{\derivop^{\kern -0.25em+}}}
\def\derivlm{\mspace{\derivop_{lm}}}
\def\derivrm{\mspace{\derivop_{rm}}}
\def\derivlmz{\mspace{\derivop^{\kern -0.25em*}_{lm}}}

\let\eqrefold\eqref
\def\eqref#1{\eqrefold{e:#1}}
\def\gref#1{grammar~\eqref{#1}}
\def\Gref#1{Grammar~\eqref{#1}}
\def\fref#1{Figure~\ref{f:#1}}

\def\prooftext#1 #2\par{
  \goodbreak
  \vspace{1ex plus 0.5ex}
  \noindent
  \llap{#1 }%
  #2\par
}
\def\proof{\prooftext {\bf\small\uppercase{Proof}} }
\def\basis{\prooftext {\sc Basis} }
\def\ind{\prooftext {\sc Induction} }
\def\contra{\prooftext {\sc Contradiction} }
\def\foorp{$\square$\vspace{1ex plus 1ex}}


\begin{abstract}
\input{abstract}
\end{abstract}


\section{Context-Free Grammars}
The focus of this discussion (and of Section 4.2 in CPTT) is on context-free
grammars (or simply ``grammars'').

\section{Convention and Notation}
The following notational conventions are used throughout this paper. In most
cases, they have been borrowed from the text.

For grammars, capital symbols are used to represent non-terminals. The $\nt{S}$
symbol is used to denote the starting non-terminal. The symbol $\prod$~is used
to separate the non-terminal from its production body, whereas
$\deriv$~indicates a single step in a derivation. Leftmost and rightmost
derivations are denoted $\derivlm$ and~$\derivrm$ respectively. $\derivz$ means
``derives in zero or more steps'', whereas $\derivp$ means ``derives in one or
more steps''.  The symbol $\punion$~separates multiple productions for a single
non-terminal. Any time punctuation is placed at the end of a grammar or
derivation, it should be read as part of the surrounding paragraph, \emph{not}
as part of the production or derivation. For example, in the grammar
$$
  \nt{S} \prod 0\nts{S}1 \punion \emptystr,
$$
\noindent
the trailing comma is not part of the construction. Furthermore, whitespace is
not significant and may be discarded. \emptystr~is the empty string.

``The text'' refers to CPTT, whereas ``this paper'' refers to the paper you are
currently reading.


\section{Exercise 4.2.3---Grammar Design}
This exercises requests that the reader design grammars for a series of language
descriptions a--f; we will discuss each of them. Although the text does not
request it, proofs will be provided for each, as they are useful to demonstrate
correctness and an excellent practice in discipline.

\exercise 4.2.3a The set of all strings of 0's and 1's such that every 0 is
immediately followed by at least one 1.

The grammar for this exercise is fairly trivial, but will serve as a useful
introduction to the formalities of this paper. First, let us consider a grammar
that demonstrates such a property. Our alphabet is $\Sigma = \set{0,1}$. The
only restriction on the sentences of our grammar is that each 0 must be followed
by a 1---this therefore means that we can have any number of adjacent 1's, but
it is not possible to have adjacent 0's. Considering that our alphabet~$\Sigma$
has only two characters, this grammar is fairly simple:

\begin{equation}\label{e:z1}
  \nt{S} \prod 1\nt{S} \punion 01\nt{S} \punion \emptystr.
\end{equation}

As an example, let us consider some of the sentences that we may wish to be
derived by this grammar. In particular, consider derivation of the string
$01011$:

\begin{equation}
  \nt{S} \deriv 01\;\nt{S}
    \deriv 01\;01\;\nt{S}
    \deriv 01\;01\;1\nt{S}
    \deriv 01\;01\;1\;\emptystr
    \derivz 01\;01\;1.
\end{equation}

Notice also that a string of 1's---such as $1111$---is also derivable given our
grammar:

\begin{equation}\label{e:z1-1s}
  \nt{S} \deriv 1\;\nt{S}
    \deriv 1\;1\;\nt{S}
    \deriv 1\;1\;1\;\nt{S}
    \deriv 1\;1\;1\;1\;\nt{S}
    \deriv 1\;1\;1\;1\;\emptystr
    \derivz 1\;1\;1\;1,
\end{equation}

\noindent
as is the empty string $\emptystr$ in one step:

\begin{equation}
  \nt{S} \deriv \emptystr.
\end{equation}

To prove that grammar \eqref{z1} is correct, we must prove two independent
statements:

\begin{enumerate}
  \item The \emph{only} strings derivable from \gref{z1} are those of 0's and
  1's such that every 0 is immediately followed by at least one 1;

  \item The grammar accepts all such strings.
\end{enumerate}

We will prove these statements in order. For the first statement, we must
show that, at any given step $n$ of \gref{z1}, the only derivable strings
contain a 1 after each and every 0 (or that the string contains no 0's). For the
second statement, we must show that any string containing 0's and 1's such that
every 0 is followed by at least one 1 is derivable from our grammar. Grammar
proofs are discussed in Section 4.2.6 of the text.

\proof The only strings derivable from~$\nt{S}$ are those of 0's and~1's such
that every 0~is immediately followed by at least one~1. We shall perform this
proof inductively on the number of steps~$n$ in a given derivation.

\basis The basis is $n=1$. In one step, our grammar may produce one of three
strings: A string beginning with a~1 (the first production of~$\nt{S}$), a
string beginning with a~0 followed by a~1 (the second production of~$\nt{S}$)
and the empty string~\emptystr\ (the final production of~$\nt{S}$).

The empty string~\emptystr\ has no~0's and so follows the rules of the language.
The same is true for any string beginning with a~1. The third and final string
that can be generated when~$n=1$ is~01. This string does contain a~0 and
therefore also satisfies our requirement.

\ind We shall now assume that all derivations of fewer than $n$~steps result in
either a sentence containing no~0's or a sentence that contains 0's~followed by
one or more~1's. Such a derivation must have the form

\begin{equation}\label{e:z1-ind}
  \nt{S} \deriv xS \derivz xy.
\end{equation}

\noindent
Since $x$~is derived in fewer than $n$~steps then, by our inductive hypothesis,
$x$~must contain~0's only if followed a~1; the same is true of~$y$.

Additionally, according to \gref{z1}, $y$~must be of one of the productions

\begin{align*}
  \nt{S} &\prod 1\nt{S} \\
  \nt{S} &\prod 01\nt{S} \\
  \nt{S} &\prod \emptystr.
\end{align*}
\noindent
Each of these productions have already been discussed in our basis; therefore,
$y$~cannot contain a~0 followed by another~0. Additionally, it is required that
adjacent~1's be permitted after a~0, which is possible by the first production
(as demonstrated in \eqref{z1-1s}). As such, $xy$~must contain only~0's
followed by one or more~1's and our hypothesis has been proved. \foorp

To ensure a thorough understanding of the above proof, it is worth mentioning
why \eqref{z1-ind}~used both the \deriv\ and~\derivz\ derivation symbols. Our
basis applies when $n=1$; the inductive hypothesis applies otherwise (when
$n>1$). As such, we must have \emph{at least} one production in~\eqref{z1-ind}.

Now that we have proved that we may only derive sentences from \gref{z1} that
contain~0's followed by one or more~1's, we must now show that the grammar may
be used to derive all such possible strings.

\proof Any string~$s$ of length~$l$ consisting of~1's and~0's such that any~0 is
followed by at least one~1 is derivable from~$\nt{S}$.

\basis A string of length~$0$ ($l=0$) must be~\emptystr, which is derivable
from~$\nt{S}$ in one step.

\ind Assume that any string $s$ of a length less than $l$ is derivable
from~\nt{S}. Such a string must have the form~$xy,
y\in\set{1,01,\emptystr}$---that is, we can consider $s$ to be the concatenation
of $y$~with a previously derived string. Since the length of $x$~is clearly less
than~$l$, it must by derivable from~\nt{S} by our inductive hypothesis.
Furthermore, $xy$~must have a derivation of the form

\begin{equation}\label{e:z1-deriv-1}
  \nt{S} \derivp x\;\nt{S} \deriv x\;y,
\end{equation}
\noindent
thereby proving that $s$~is derivable from~\nt{S}. \foorp

The derivation~\eqref{z1-deriv-1} may seem to be too abstract to be useful;
since this is our first proof, it is worth clarifying why it does in fact
complete the proof. We first showed that any string of the language of 0's and
1's that we have been studying can be described as the concatenation of a
smaller such string with 0, 01 or~\emptystr\ (which completes the string). This
string, as we stated, has the form~$xy$. Therefore, we must show that
\nt{S}~supports concatenation---\eqref{z1-deriv-1} demonstrates this with~$x$
fairly abstractly, since it does not matter what exactly $x$~is. From the
productions of~\nt{S} in \gref{z1}, it is understood that $x$ can be any string
of terminals (that is---any derivation) leading up to that point in the
derivation~\eqref{z1-deriv-1}.

We must now show that the remaining part of~$xy$---that is, $y$---is derivable.
The only non-terminal remaining after~$x$ is~\nt{S}. We have defined $y$~to be
any string of terminals in the set $\set{0,01,\emptystr}$. Clearly, each of
these strings are derivable from~\nt{S}. Therefore, we can replace~\nt{S}
in~\eqref{z1-deriv-1} with~$y$, indicating that this is a valid derivation given
our definition of~$y$; it is up to the reader of the proof to make this
connection. Note that, while the domain of $y$~happens to be every production
of~\nt{S}, this is not necessary for the proof---that is the subject of the
first proof.

Before we put this exercise to rest (indeed, we completed the exercise
requirement in the first paragraph following the exercise definition), it is
also worth noting that this grammar may also be accepted by a finite automata
(and consequently, a regular expression); this is demonstrated by
\fref{z1-regex}. It should be noted that this is not the case with all of the
exercises that follow.
\exend

\begin{figure}
  \center
  \begin{tikzpicture}
    \node[state,initial] (a) {$a$};
    \node[state] (b) [right=of a] {$b$};
    \node[state,accepting] (c) [right=of b] {$c$};

    \path[->]
      (a) edge [loop below] node {1} ()
          edge [bend right, below] node {\emptystr} (c)
          edge [above] node {$0$} (b)
      (b) edge [above] node {$1$} (c)
      (c) edge [bend right, above] node {\emptystr} (a)
    ;
  \end{tikzpicture}

  \caption{An NFA corresponding to the extended regular expression
  $\left(0^?1^+\right)^*$ describing \gref{z1}.}
  \label{f:z1-regex}
\end{figure}

The above example was fairly simple, yet resulted in a realitively lengthy
discourse far past what was required by the text; the reader can expect such a
discussion to continue for all examples that follow.


\exercise 4.2.3b The set of all strings of 0's and 1's that are
palindromes; that is, the string reads the same backward as forward.

As the exercise stated, a {\sl palindrome} is a string that reads the same in
both directions; let us consider some examples before attempting to construct a
grammar. The following list of strings are all palindromes, one per
line:\footnote{An example of an English palindrome is ``Mr.~Owl ate my metal
worm'' (discarding punctuation and capitalization.)}

\begin{equation}\label{e:palex}
  \begin{tabular}{rcl}
    1 &00 &1 \\
    1100 &11 &0011 \\
    010 &1 &010 \\
    & 0 &
  \end{tabular}
\end{equation}

The above palindromes have been laid out so that their symmetry is apparent. At
first glance, one can imagine constructing a palindrome out of pairs of
characters, like the second row of~\eqref{palex}:

\begin{equation}\label{e:palex-2}
  \begin{tabular}{crcl}
    & 11 & \\
    1 & 11 & 1 \\
    11 & 00 & 11 \\
    110 & 00 & 011 \\
    1100 & 11 & 0011
  \end{tabular}
\end{equation}

\noindent
In this case, each palindrome would always have an even number of characters.
However, it is important to note the bottom two palindromes of \eqref{palex},
which have an \emph{odd} number of characters:

\begin{equation}\label{e:palex-3}
  \begin{tabular}{rcl}
    & 00 & \\
    0 & 11 & 0 \\
    01 & 00 & 10 \\
    010 & 1 & 010
  \end{tabular}
\end{equation}

Given this evaluation and the understanding that $2n$~is always even for some
positive integer~$n$, it would be accurate to recursively construct a palindrome
from the edges inward in pairs. Once we reach the center, we may end
with~\emptystr\ if we wish to have an even ($2n$) number of characters, or
otherwise may add a single character to create a palindrome containing an odd
($2n+1$) number of characters.

\begin{equation}\label{e:palindrome}
  \begin{aligned}
    \nt{S} &\prod 0\nts{S}0 \punion 1\nts{S}1 \punion M \\
    \nt{M} &\prod 0 \punion 1 \punion \emptystr
  \end{aligned}
\end{equation}

In \gref{palindrome} above, we define out start non-terminal~\nt{S} with
productions for the outer pairs. The non-terminal~\nt{M} represents the
acceptable inner (``middle'') characters, which determines if the length of the
palindrome is even (if \emptystr~is used) or odd (0 or~1). We will leave
demonstrations of such derivations to the proof.

To prove that grammar~\nt{S} is the proper grammar for all palindromes, we must
again prove two things: That language $L(\nt{S})$ can produce only palindromes
of~0's and~1's and that all such palindromes can be derived from~\nt{S}. The
difference between these two descriptions may be subtle for such a simple
grammar, but the distinction is important to ensure that $L(\nt{S})$ represents
\emph{nothing more and nothing less} than a language that may be used for such
palindromes.

As before, the proofs will be inductive---the first proof on the number of
steps~$n$ of a derivation of~\nt{S} and the second on the length~$l$ of the
palindrome~$s$. Our alphabet~$\Sigma$ is once again~$\set{0,1}$.

\proof The only strings derivable from grammar~\nt{S} are palindromes consisting
of 0's and~1's.

\basis The basis is $n=2$, which is the fewest number of steps from which a
string may be derived from~\nt{S}.\footnote{$n=1$ steps cannot result in a
string consisting only of nonterminals, as it would result in $0S0$,~$1S1$
or~$M$.} Such a derivation must be of the form
$$
  \nt{S} \deriv M \deriv x,
$$
\noindent
where $x$~is 0,~1, or~\emptystr. In the latter case, the derived string is
clearly a palindrome of length zero. In the case of 0 or~1, the length of the
string is one, which must be a palindrome.

\ind Now assume that every string derived in less than $n$~steps is a
palindrome. Such a derivation must be of the form
$$
  \nt{S} \deriv x\nts{S}x \derivz x\;y\;x.
$$
\noindent
That is, the string~$x$ appears on both the left and right of~$y$. Since the
derivation of~$y$ from~\nt{S} takes fewer than $n$~steps---specifically, $n-1$
steps---$y$~must be a palindrome by our inductive hypothesis. Because $x$~is
added to both the beginning and end of~$y$, then any string derived in $n$~steps
must be a palindrome. \foorp

Let us further demonstrate the above proof by deriving~\eqref{palex-2}
from~\nt{S}:\footnote{The dots were added so as not to confuse the reader as to
what was going on; the symbol~\derivp\ is sufficient and therefore the dots will
be omitted in the future.}

\begin{equation}
  \nt{S}
  \deriv 1\nts{S}1
  \deriv 1\;1\nts{S}1\;1
  \deriv \cdots
  \derivp 1\;1\;0\;0\;1\;\emptystr\;1\;0\;0\;1\;1
\end{equation}

\noindent
and additionally \eqref{palex-3}:

\begin{equation}
  \nt{S}
  \deriv 0\nts{S}0
  \deriv 0\;1\nts{S}1\;0
  \deriv 0\;1\;0\nts{S}0\;1\;0
  \deriv 0\;1\;0\;1\;0\;1\;0.
\end{equation}

\noindent
The induction step works by recognizing the basis as the middle of the string
(nonterminal~\nt{M} in \gref{palindrome})---\emptystr~for palindromes of an
even length and the $\left\lceil n/2 \right\rceil^{th}$ character for those of
an odd length (1 in the case of the latter derivation). Call this string~$b$. We
know that $b$~is a palindrome, as explained in the proof above. For our
inductive step, we recognize that, for each step~$n$, we add two
characters---one to the beginning and one to the end---to the result of
step~$n-1$. As such, since the derivation of~$n-1$ steps must be a palindrome,
the derivation in~$n$ steps must also be---it is not possible to derive anything
but a palindrome from~\nt{M} and \nt{S}~maintains this designation.

For completeness, we must now show that all possible palindromes of the
alphabet~$\Sigma$ can be derived from~\nt{S}.

\proof Every palindrome consisting of~0's and~1's is derivable from~\nt{S}.

\basis If the string~$s$ is of length~$l\leq1$, then it must be \emptystr,~0 or~1,
all of which are palindromes derivable by~\nt{M}.

\ind Observe that any palindrome of length~$l>1$ must contain the same
character at positions~$1$ and~$l$.\footnote{1-indexed for notational
convenience.} Assume that each string with a length less than~$l$ is derivable
from~\nt{S}. Since $s$~is a palindrome, then it must have the form $xyx,
x\in\Sigma$, where $y$~is also a palindrome. Since $y$~has a length $l-2<l$,
then it must be derivable from~\nt{S} by the inductive hypothesis. The
palindrome~$s$ must therefore have a derivation of the form
$$
    \nt{S} \deriv x\nts{S}x \derivz x\;y\;x,
$$
\noindent
which thus proves that~$s$ is derivable from~\nt{S}. \foorp

It is also worth noting that, unlike the first exercise, we cannot represent a
palindrome as a finite automaton (and therefore cannot represent it as a regular
expression). Let us prove this assertion.

\proof \nt{S}~cannot be represented by any finite automata. Specifically, a
finite automaton representing~\nt{S} may accept all strings that are
palindromes of the alphabet~$\Sigma$, but such an automaton must also accept
strings that are not palindromes. We shall prove this statement by
contradiction.

\contra Given the alphabet~$\Sigma$, a palindrome may contain any character
from~$\Sigma$ at any arbitrary position~$n$ and may be of length~$l\geq0$. As
such, we must be able to represent this automaton by the regular expression
$\left(0|1\right)^*$, whose corresponding minimum-state DFA is shown in
\fref{pal-a}. However, it is also necessary that characters $c_n$
and~$c_{l-n+1}$ be the same symbol in~$\Sigma$---a requirement that
minimum-state DFA of \fref{pal-a} cannot guarantee.

\begin{figure}
  \center
  \begin{tikzpicture}
    \node[state,initial,accepting] (a) {$a$};

    \path[->]
      (a) edge [loop below] node {1} ()
          edge [loop above] node {0} ()
    ;
  \end{tikzpicture}

  \caption{The minimum-state DFA for the regular expression
  $\left(0|1\right)^*$.}
  \label{f:pal-a}
\end{figure}

Consider that the only way for a finite automata to maintain a history of states
is to have a state to represent each unique history. However, to accept a string
of any length, we would need an automaton containing a potentially infinite
number of states, which is not finite (and therefore not a finite automaton).
Therefore, it is not possible to represent the history of every possible
palindrome using a finite set of states.

Given this, it must stand that a finite automaton must at some point contain a
state that transitions to a previous or current state, such as the NFA in
\fref{pal-a2}. Since the history of the string is ``stored'' purely in the
possible states leading up to the current state, this transition~$t$ equates to a
loss of ``memory'', without which the right-hand portion of the palindrome cannot
be properly matched. Furthermore, since each position~$n$ may contain any
character in~$\Sigma$, and since the transition~$t$ can only yield a set of
future states with a limited (finite) precision, each of these future states
must be redundant. Since each NFA can be represented by an equivalent DFA and
each DFA for some grammar has a single common minimum-state DFA, any portion of
a finite automaton that can accept a palindrome of any length must be equivalent
to \fref{pal-a} (such as state~$x$ in \fref{pal-a2}). We are therefore left to
conclude that no finite automata can accept a palindrome of arbitrary length
without accepting every string that is a combination of each character in
$\Sigma$. \foorp

\begin{figure}
  \center
  \begin{tikzpicture}
    \node[state,initial] (a) {$1$};
    \node[state] (b) [right=of a] {$2$};
    \node[state] (x) [right=of b] {$x$};
    \node[state] (y) [right=of x] {$n-1$};
    \node[state,accepting] (z) [right=of y] {$n$};

    \path[->]
      (a) edge [above] node {$\alpha$} (b)
          edge [below, bend right=45] node {$\kern-0.7em\emptystr$} (z)
      (b) edge [above] node {$\beta$} (x)
          edge [below, bend right=65] node {$\emptystr$} (y)
      (x) edge [loop above] node {$\beta$} ()
          edge [loop below] node {$\alpha$} ()
          edge [above] node {$\beta$} (y)
      (y) edge [above] node {$\alpha$} (z)
    ;
  \end{tikzpicture}

  \caption{An NFA with a finite set of states must at some point transition to a
  previous or identical state in order to accept input of any length.
  $\Sigma=\set{\alpha,\beta}$.}
  \label{f:pal-a2}
\end{figure}

To provide further clarification---any finite automata that transitions to a
\emph{previous} state, since it looses a portion of its history, can no longer
accurately determine the states leading up to the final state. That is, consider
the string 10101 and consider that the first three characters of this string can
be represented by the states $\set{a,b,a}$. At this point, we can no longer be
certain of what the string may end with, because we have lost any sense of
nesting/recursion. Therefore, the states leading to the final state are forced
to accept any character in $\Sigma$ and therefore must be equivalent to the
minimum-state DFA of \fref{pal-a}. As was mentioned by the text, ``finite
automata cannot count''.

\fref{pal-a2} gets around such an issue by transitioning only to current or
future states, which permits a \emph{finite} amount of nesting (placing the
aforementioned minimum-state DFA~$x$ in the middle). However, note a glaring
issue---this automaton does not accept~$\beta$ in the first character position.
If it did, then we would need a second set of states in order to maintain such a
history and know that we should also \emph{end} with $\beta$~instead
of~$\alpha$. The number of states would therefore grow very quickly with the
level of nesting and the size of~$\Sigma$ (such a consideration is left to the
reader).

We have exhaustively proved that \gref{palindrome} is the correct answer for
this exercise. \exend


\exercise 4.2.3c The set of all strings of 0's and 1's with an equal number of
0's and 1's.

To understand how to approach this problem, we shall consider a number of
strings that are derivable from this language. An obvious case is~\emptystr,
which contains zero~0's and zero~1's. Some additional examples are shown in
\fref{eq-ex} along with their lengths (denoted by~$l$).

\begin{figure}[h]
  \center
  \begin{tabular}{r|cccccc}
    $s$ & \emptystr & 10 & 01 & 1010 & 1001 & 011100 \\
    \hline
    $l$ & 0 & 2 & 2 & 4 & 4 & 6
  \end{tabular}

  \caption{Examples of strings with an equal number of 0's and 1's.}
  \label{f:eq-ex}
\end{figure}

These examples demonstrate a number of important properties. In particular, the
length~$l$ of the string~$s$ is always even, with the number of 0's and~1's
$n=l/2$. Additionally, the characters of the alphabet~$\Sigma$ may appear in any
order in the string. Therefore, we do not have the luxury of a simple, nested,
recursive implementation as we did with the palindrome exercise (at least not
exclusively).

Let us construct the grammar iteratively, beginning with the simplest case
of~\emptystr.

\begin{equation}\label{e:eq-1}
  \nt{S} \prod \emptystr
\end{equation}

\noindent
The second case---10---is also fairly easy to fit into~$\nt{S}$:

\begin{equation}\label{e:eq-2}
  \nt{S} \prod 10 \punion \emptystr
\end{equation}

The third case demonstrates an important case regarding our strings: They may
begin with either a~0 or a~1 and they may also \emph{end} with either character
(more generally, they may begin or end with any character in~$\Sigma$). However,
we cannot simply adjust our grammar to accept either character in both
positions---$\nt{S}$ must assure that, any time we include a~0 in a production,
we also include a~1 (and vice versa). So far, this is guaranteed by~$\nt{S}$ in
\gref{eq-2}; to keep on this path, we must add 01 as yet another special case.

\begin{equation}\label{e:eq-3}
  \nt{S} \prod 01 \punion 10 \punion \emptystr
\end{equation}

\goodbreak
The fourth case---1010---introduces the need to handle strings of an arbitrary
length. To do this, we must determine at what point we should recurse
on~$\nt{S}$. Looking at the example, we could derive 1010 as two nested
applications of~$\nt{S}$ if we recurse between the two terminals.

\begin{equation}\label{e:eq-a}
  \nt{S}
    \deriv 1\nts{S}0
    \deriv 1\;0\nts{S}1\;0
    \deriv 1\;0\;\emptystr\;1\;0
    \derivz 1\;01\;0
\end{equation}

\noindent
Of course, one could also adopt an alternate perspective by considering the
string to be the production of two adjacent non-terminals.

\begin{equation}\label{e:eq-b}
  \nt{S} \deriv \nt{S}\;\nt{S}
    \derivlm 10\;\nt{S}
    \derivlm 10\;10
\end{equation}

\noindent
Unfortunately, with this information alone, we cannot be certain which of these
productions---if such a choice even matters---should be used in our grammar.
Perhaps we can gain further insight from the remaining examples.

The next example---1001---can be derived in a manner similar to \eqref{eq-b},
but not \eqref{eq-a}; in particular, \gref{eq-3} has no production for the
string 00, and so we cannot construct the string from the outside in. Given
that, we can be certain that an adjacent non-terminal production is needed and
so we will add the production used in \eqref{eq-b} to our grammar.

\begin{equation}\label{e:eq-4}
  \nt{S} \prod 01 \punion 10 \punion \nt{S}\;\nt{S} \punion \emptystr
\end{equation}

However, the aforementioned predicament---the absense of a production that can
yield only 00---raises the question of whether or not we can truly derive any
string of equal 1's and 0's from the above grammar. Our final example challenges
this. 011100 cannot possibly be represented by~$\nt{S}$ in \gref{eq-4} because
this grammar constructs the string from left-to-right (or right-to-left) in
pairs of~0's and~1's. Therefore, the only way to have adjacent~1's or adjacent~
0's is to alternate the productions, which makes it impossible to have more than
two adjacent identical characters.

Given this, it seems that both \eqref{eq-b} \emph{and} \eqref{eq-a} are
necessary; the following derivation demonstrates this fact (neither can
individually be used to derive the string 011100).

\begin{equation}
  \nt{S} \deriv \nt{S}\;\nt{S}
    \derivlm 01\;\nt{S}
    \derivlm 01\;1\nts{S}0
    \derivlm 01\;1\;1\nts{S}0\;0
    \derivlm 01\;1\;1\;\emptystr\;0\;0
    \derivlmz 01\;1\;10\;0
\end{equation}

\noindent
We thus arrive at \gref{eq-5} below.

\begin{equation}\label{e:eq-5}
  \nt{S} \prod 0\nts{S}1
    \punion 1\nts{S}0
    \punion \nt{S}\;\nt{S}
    \punion \emptystr
\end{equation}

An astute reader may at this point notice that we have created an ambiguity in
our grammar: Recall~\eqref{eq-a} and~\eqref{eq-b}, which had two possible
derivations for the same string; both of these derivations are now possible in
our grammar. The text defines an ambiguous grammar to be a grammar that contains
more than one leftmost or more than one rightmost derivation for the same
sentence. This is a particularly interesting example of ambiguity, in particular
because we cannot resolve it. Let us consider why.

\proof Grammar~$\nt{S}$ cannot be disambiguated. We will prove this fact by
contradiction.

\contra Firstly, recognize that~$\nt{S}$ is ambiguous because there exists some
sentence~$s$ that has both of the following derivations in $n>1$ steps, where
$a\ne b$:

\begin{align*}
  \nt{S} &\deriv a\nts{S}b \derivp a\;x\;b;
  \\
  \nt{S} &\deriv \nt{S}\;\nt{S}
    \deriv a\nts{S}b\;\nt{S}
    \derivp a\;b\;\nt{S}
    \derivp a\;b\;x.
\end{align*}

Suppose to the contrary that there is some way to disambiguate~$x$. There must
then be some terminal $c\in\Sigma$ in~$x$ that may be used to perform the
disambiguation and such a disambiguation would imply a difference in the
semantics of~$x$ between the two derivations. However, $x=x$ and so both
derivations hold exactly the same meaning---balanced strings. Furthermore, the
productions for producing balanced strings requires each character in~$\Sigma$;
$c$ therefore must not exist. \foorp

Fortunately, this ambiguity is not an issue for our grammar because the multiple
derivations are semantically equivalent---we are not arriving at any different
result within the context of this exercise. The sentence 1010 of \fref{eq-ex}
demonstrates this concept: It does not matter whether we consider the sentence
to be a single balanced string or the concatenation of two balanced strings; we
arrive at the same result regardless with no harm done.\footnote{Of course, one
valid argument is that a more concise and unambiguous grammar will reduce
problems during parsing. However, the parser (like Lex, as described by the
text) can give precedence to the productions that appear earlier in the grammar
to resolve this issue.}

While the discussion thus far is likely to convince the reader that \gref{eq-5}
is correct, we shall conclude with a formal proof of this fact.  A proof that
the grammar cannot be represented by any finite automata shall be omitted, in
particular because the productions of $\nt{S}$ have a structure very similar to
the palindrome \gref{palindrome}.

\proof Only sentences composed of balanced~1's and!0's may be derived
from~$\nt{S}$.

\basis The basis is $n=1$. The only sentence that may be derived in 1 step
is~\emptystr, which is clearly balanced (containing zero~0's and zero~1's).

\ind Assume that any sentence derived in fewer than~$n$ steps is balanced. Now
recognize that any sentence derived in $n>1$ steps must make use of one of the
following productions of $\nt{S}$:

\begin{align*}
  \nt{S} &\prod 0\nts{S}1; \\
  \nt{S} &\prod 1\nts{S}0; \\
  \nt{S} &\prod \nt{S}\;\nt{S}.
\end{align*}

\noindent
Therefore, the smallest sentence that is not~\emptystr\ is either $0\nt{x}1$ or
$1\nt{x}0$, both of which are balanced (each contains one~0 and one~1). Since
$x$~is derivable from~$\nt{S}$ in fewer than~$n$ steps, then by our inductive
hypothesis, all sentences derivable from~$\nt{S}$ must be balanced. The last
remaining production has the form~$xy$, both of which are derivable from~\nt{S}
in fewer than~$n$ steps and thus must be balanced. Furthermore, since the
productions of~$\nt{S}$ produce only 0,~1, or~\emptystr, $\nt{S}$~has the
alphabet $\Sigma=\set{0,1}$ and, consequently, may derive no sentence except for
those containing balanced~0's and~1's. \foorp

Having proved that only sentences of balanced~0's and~1's are derivable
from~$\nt{S}$, we must now prove that $\nt{S}$~can derive \emph{all} such
strings (that is, all such strings are sentences of $\nt{S}$). Such a proof is
interesting because our grammar is more sophisticated than the previous
examples.

\proof All strings of balanced~0's and~1's are sentences of~$\nt{S}$.

\basis The basis is a string of length $l=0$, which contains zero~0's and
zero~1's. This string must be~\emptystr, which is derivable from~$\nt{S}$.

\ind First, recognize that all balanced strings must have a length $l=2k$---that
is, $l$~is always even (as emphasized in \fref{eq-ex}) and contains $k$~0's and
$k$~1's. Assume that all strings less than length~$2k$ are derivable
from~$\nt{S}$.

Consider any balanced string~$s$ of length~$2k$. We can consider $s$~to have the
form~$yz$---that is, the concatenation of two balanced strings~$y$ and~$z$, both
of which in turn have the form $axb, a\neq b$ where $x$ itself must be balanced
(since $a\neq b$); alternatively, either $y$ or~$z$ may be~\emptystr, which
therefore implies that the form~$yz$ accepts any balanced string where the first
and last characters are not the same.

We must now show that all such strings can be represented by the form~$yz$.
First, recognize that $y=axb$ may have either the form $0x1$ or $1x0$; the form
$yz$ then permits up to two adjacent identical characters in $\Sigma$; any
additional adjacent identical characters may be derived by $x$. Consider
$x=\emptystr$; then, clearly $axb$ is balanced and can be concatenated to form a
larger balanced string. If $x\neq\emptystr$ but $x_1=b$,\footnote{$x_n$ denotes
the $n^{\text{th}}$ character of~$x$.} then we can instead consider an
alternative interpretation $y'=ax_1$ and $x'=x_2\cdots x_nb$, and then let
$y=y'x'$ (instead of $axb$).

We are then left with the case where $x_1=a$. Such a case allows for an
arbitrarily deep nesting of adjacent identical characters and therefore $axb$
can be represented by the regular expression $a^+b^+$. It is therefore clear
that the form $yz$ is able to describe any string of balanced characters in the
alphabet $\Sigma=\set{0,1}$. Such a form must have the derivation

$$
  \nt{S} \deriv \nt{S}\;\nt{S} \derivlmz xy.
$$

\noindent
Since this is a leftmost derivation, $y$~is either a balanced string or
\emptystr. In the former case, it is obvious that both $x$ and~$y$ are of a
length less than~$2k$ and are therefore derivable from~\nt{S} by our inductive
hypothesis. Otherwise, $y=\emptystr$ and the length of $x$~is precisely~$2k$ and
we must consider the form $axb$; $x$~is clearly of a length of less than~$2k$
and is therefore balanced by our inductive hypothesis. Furthermore, it must have
a derivation of the form
$$
  \nt{S} \deriv a\nts{S}b \derivz a\;x\;b,
$$
\noindent
thereby proving that $axb$ is derivable from~\nt{S}. \foorp

This proof was considerably more involved than our previous ones and is an
excellent segue into proving more sophisticated grammars. Of course, the reader
can surely see the challenges that might arise from attempting to prove much
more complicated grammars. \exend


\section{License}
This work is licensed under the Creative Commons Attribution-ShareAlike 4.0
International License---you are free to use, share, and modify it to suit
your needs, provided that you give proper attribution and license derivative
works under similar terms.  For more information, see:

\tt{https://creativecommons.org/licenses/by-sa/4.0/}.

\end{document}