commit 363452fe85ffe53921604c400e31fee65ad42054 Author: Mike Gerwitz Date: Wed May 15 21:59:38 2013 -0400 Initial commit of section 4.2.8 discussion of CPTT This is already becoming much longer than I had originally anticipated and is taking much of my little reading time that I have. That said, this is an excellent way to practice these concepts (rather than hacking, as I would for practicing a programming language). Therefore, I'm not sure if I will go into as many examples as I had originally thought (that is, past 4.2.3f); we shall see. I hope that readers will find this information useful; I have certainly enjoyed writing it. diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..70915d3 --- /dev/null +++ b/.gitignore @@ -0,0 +1,3 @@ +*.pdf +*.log +*.aux diff --git a/s428.tex b/s428.tex new file mode 100644 index 0000000..0cb5f04 --- /dev/null +++ b/s428.tex @@ -0,0 +1,841 @@ +% discussion of section 4.2.8 (exercises for section 4.2) in CPTT (the "dragon +% book") +% + +\documentclass[draft]{article} +\usepackage{amsmath,amssymb,tikz} +\usetikzlibrary{automata,positioning} + +\begin{document} +\title{Discussion of Selected Exercises: \\ +Section 4.2.8 of Compilers: Principles, Techniques and Tools \\ +\vspace{1em} +\large{Topic: Context-Free Grammars}} + +\author{Mike Gerwitz} +\date{\today} + +\maketitle + +\def\exercise#1 #2\par{ + \goodbreak + \vspace{0.5em plus 0.5em} + \noindent + \llap{\bf Exercise #1 }% + {\sl#2}\par + \vspace{0.5em plus 0.5em} + \goodbreak +} +\def\exend{$\blacksquare$} + +\def\set#1{\left\{#1\right\}} + +\def\nt#1{{\ifmmode#1\else$#1$\fi}} +\def\nts#1{\;\nt#1\;} +\def\prod{\rightarrow} +\def\punion{\;|\;} +\def\emptystr{\ifmmode\epsilon\else$\emptystr$\fi} + +\def\mspace#1{\ifmmode\;#1\;\else$#1$\fi} + +\def\derivop{\displaystyle\mathop{\Rightarrow}} +\def\deriv{{\mspace\derivop}} % extra grouping to solve issue in mmode w/ align +\def\lmderiv{\mspace{\deriv\limits_{lm}}} +\def\derivz{\mspace{\derivop^{\kern -0.25em*}}} +\def\derivp{\mspace{\derivop^{\kern -0.25em+}}} +\def\derivlm{\mspace{\derivop_{lm}}} +\def\derivrm{\mspace{\derivop_{rm}}} +\def\derivlmz{\mspace{\derivop^{\kern -0.25em*}_{lm}}} + +\let\eqrefold\eqref +\def\eqref#1{\eqrefold{e:#1}} +\def\gref#1{grammar~\eqref{#1}} +\def\Gref#1{Grammar~\eqref{#1}} +\def\fref#1{Figure~\ref{f:#1}} + +\def\prooftext#1 #2\par{ + \goodbreak + \vspace{1ex plus 0.5ex} + \noindent + \llap{#1 }% + #2\par +} +\def\proof{\prooftext {\bf\small\uppercase{Proof}} } +\def\basis{\prooftext {\sc Basis} } +\def\ind{\prooftext {\sc Induction} } +\def\contra{\prooftext {\sc Contradiction} } +\def\foorp{$\square$\vspace{1ex plus 1ex}} + + +\begin{abstract} +This paper contains the author's answers and proofs for selected exercises from +Section~4.2.8 of the ``dragon book''---Compiler: Principles, Techniques and +Tools (hereinafter ``CPTT''). This book, while an excellent resource, can be +challenging for self-study as it does not provide a means of verifying one's +answers outside of a classroom setting (unless the reader has confidence in +his/her proofs). This paper is intended for two audiences: (a) those reading the +book and looking for clarification and discussion on the exercises and (b) those +who are curious on the topic of context-free grammars that do not possess the +text. The selected exercises are those that the author felt would be most useful +for discussion and, as such, are expected to be challenging to the reader. 
Less +challenging portions of exercises may be discussed to segue into the more +challenging portions. +\end{abstract} + + +\section{Context-Free Grammars} +The focus of this discussion (and of Section 4.2 in CPTT) is on context-free +grammars (or simply ``grammars''). + +\section{Convention and Notation} +The following notational conventions are used throughout this paper. In most +cases, they have been borrowed from the text. + +For grammars, capital symbols are used to represent non-terminals. The $\nt{S}$ +symbol is used to denote the starting non-terminal. The symbol $\prod$~is used +to separate the non-terminal from its production body, whereas +$\deriv$~indicates a single step in a derivation. Leftmost and rightmost +derivations are denoted $\derivlm$ and~$\derivrm$ respectively. $\derivz$ means +``derives in zero or more steps'', whereas $\derivp$ means ``derives in one or +more steps''. The symbol $\punion$~separates multiple productions for a single +non-terminal. Any time punctuation is placed at the end of a grammar or +derivation, it should be read as part of the surrounding paragraph, \emph{not} +as part of the production or derivation. For example, in the grammar +$$ + \nt{S} \prod 0\nts{S}1 \punion \emptystr, +$$ +\noindent +the trailing comma is not part of the construction. Furthermore, whitespace is +not significant and may be discarded. \emptystr~is the empty string. + +``The text'' refers to CPTT, whereas ``this paper'' refers to the paper you are +currently reading. + + +\section{Exercise 4.2.3---Grammar Design} +This exercises requests that the reader design grammars for a series of language +descriptions a--f; we will discuss each of them. Although the text does not +request it, proofs will be provided for each, as they are useful to demonstrate +correctness and an excellent practice in discipline. + +\exercise 4.2.3a The set of all strings of 0's and 1's such that every 0 is +immediately followed by at least one 1. + +The grammar for this exercise is fairly trivial, but will serve as a useful +introduction to the formalities of this paper. First, let us consider a grammar +that demonstrates such a property. Our alphabet is $\Sigma = \set{0,1}$. The +only restriction on the sentences of our grammar is that each 0 must be followed +by a 1---this therefore means that we can have any number of adjacent 1's, but +it is not possible to have adjacent 0's. Considering that our alphabet~$\Sigma$ +has only two characters, this grammar is fairly simple: + +\begin{equation}\label{e:z1} + \nt{S} \prod 1\nt{S} \punion 01\nt{S} \punion \emptystr. +\end{equation} + +As an example, let us consider some of the sentences that we may wish to be +derived by this grammar. In particular, consider derivation of the string +$01011$: + +\begin{equation} + \nt{S} \deriv 01\;\nt{S} + \deriv 01\;01\;\nt{S} + \deriv 01\;01\;1\nt{S} + \deriv 01\;01\;1\;\emptystr + \derivz 01\;01\;1. +\end{equation} + +Notice also that a string of 1's---such as $1111$---is also derivable given our +grammar: + +\begin{equation}\label{e:z1-1s} + \nt{S} \deriv 1\;\nt{S} + \deriv 1\;1\;\nt{S} + \deriv 1\;1\;1\;\nt{S} + \deriv 1\;1\;1\;1\;\nt{S} + \deriv 1\;1\;1\;1\;\emptystr + \derivz 1\;1\;1\;1, +\end{equation} + +\noindent +as is the empty string $\emptystr$ in one step: + +\begin{equation} + \nt{S} \deriv \emptystr. 
+\end{equation}
+
+To prove that grammar \eqref{z1} is correct, we must prove two independent
+statements:
+
+\begin{enumerate}
+    \item The \emph{only} strings derivable from \gref{z1} are those of 0's and
+        1's such that every 0 is immediately followed by at least one 1;
+
+    \item The grammar accepts all such strings.
+\end{enumerate}
+
+We will prove these statements in order. For the first statement, we must
+show that, at any given step~$n$ of a derivation from \gref{z1}, the only
+derivable strings contain a 1 after each and every 0 (or that the string
+contains no 0's). For the second statement, we must show that any string
+containing 0's and 1's such that every 0 is followed by at least one 1 is
+derivable from our grammar. Grammar proofs are discussed in Section 4.2.6 of
+the text.
+
+\proof The only strings derivable from~$\nt{S}$ are those of 0's and~1's such
+that every 0~is immediately followed by at least one~1. We shall perform this
+proof inductively on the number of steps~$n$ in a given derivation.
+
+\basis The basis is $n=1$. In one step, our grammar may produce one of three
+strings: a string beginning with a~1 (the first production of~$\nt{S}$), a
+string beginning with a~0 followed by a~1 (the second production of~$\nt{S}$)
+and the empty string~\emptystr\ (the final production of~$\nt{S}$).
+
+The empty string~\emptystr\ has no~0's and so follows the rules of the language.
+The same is true for any string beginning with a~1. The remaining string that
+can be generated when~$n=1$ begins with~01. This string does contain a~0 and
+therefore also satisfies our requirement.
+
+\ind We shall now assume that all derivations of fewer than $n$~steps result in
+either a sentence containing no~0's or a sentence that contains 0's~followed by
+one or more~1's. Such a derivation must have the form
+
+\begin{equation}\label{e:z1-ind}
+    \nt{S} \deriv x\nt{S} \derivz xy.
+\end{equation}
+
+\noindent
+Since $x$~is derived in fewer than $n$~steps then, by our inductive hypothesis,
+$x$~must contain~0's only if followed by a~1; the same is true of~$y$.
+
+Additionally, according to \gref{z1}, $y$~must be produced by one of the
+productions
+
+\begin{align*}
+    \nt{S} &\prod 1\nt{S} \\
+    \nt{S} &\prod 01\nt{S} \\
+    \nt{S} &\prod \emptystr.
+\end{align*}
+\noindent
+Each of these productions has already been discussed in our basis; therefore,
+$y$~cannot contain a~0 followed by another~0. Additionally, it is required that
+adjacent~1's be permitted after a~0, which is possible by the first production
+(as demonstrated in \eqref{z1-1s}). As such, $xy$~must contain only~0's
+followed by one or more~1's and our hypothesis has been proved. \foorp
+
+To ensure a thorough understanding of the above proof, it is worth mentioning
+why \eqref{z1-ind}~used both the \deriv\ and~\derivz\ derivation symbols. Our
+basis applies when $n=1$; the inductive hypothesis applies otherwise (when
+$n>1$). As such, we must have \emph{at least} one production in~\eqref{z1-ind}.
+
+Now that we have proved that we may only derive sentences from \gref{z1} that
+contain~0's followed by one or more~1's, we must now show that the grammar may
+be used to derive all such possible strings.
+
+\proof Any string~$s$ of length~$l$ consisting of~1's and~0's such that any~0 is
+followed by at least one~1 is derivable from~$\nt{S}$.
+
+\basis A string of length~$0$ ($l=0$) must be~\emptystr, which is derivable
+from~$\nt{S}$ in one step.
+
+\ind Assume that any string of length less than~$l$ is derivable
+from~\nt{S}.
+The string~$s$ must then have the form~$xy,
+y\in\set{1,01,\emptystr}$---that is, we can consider $s$ to be the concatenation
+of a previously derived string~$x$ with~$y$. Since the length of $x$~is clearly
+less than~$l$, it must be derivable from~\nt{S} by our inductive hypothesis.
+Furthermore, $xy$~must have a derivation of the form
+
+\begin{equation}\label{e:z1-deriv-1}
+    \nt{S} \derivp x\;\nt{S} \deriv x\;y,
+\end{equation}
+\noindent
+thereby proving that $s$~is derivable from~\nt{S}. \foorp
+
+The derivation~\eqref{z1-deriv-1} may seem to be too abstract to be useful;
+since this is our first proof, it is worth clarifying why it does in fact
+complete the proof. We first showed that any string of the language of 0's and
+1's that we have been studying can be described as the concatenation of a
+smaller such string with 1, 01 or~\emptystr\ (which completes the string). This
+string, as we stated, has the form~$xy$. Therefore, we must show that
+\nt{S}~supports concatenation---\eqref{z1-deriv-1} demonstrates this with~$x$
+fairly abstractly, since it does not matter what exactly $x$~is. From the
+productions of~\nt{S} in \gref{z1}, it is understood that $x$ can be any string
+of terminals (that is---any derivation) leading up to that point in the
+derivation~\eqref{z1-deriv-1}.
+
+We must now show that the remaining part of~$xy$---that is, $y$---is derivable.
+The only non-terminal remaining after~$x$ is~\nt{S}. We have defined $y$~to be
+any string of terminals in the set $\set{1,01,\emptystr}$. Clearly, each of
+these strings is derivable from~\nt{S}. Therefore, we can replace~\nt{S}
+in~\eqref{z1-deriv-1} with~$y$, indicating that this is a valid derivation given
+our definition of~$y$; it is up to the reader of the proof to make this
+connection. Note that, while the domain of $y$~happens to be every production
+of~\nt{S}, this is not necessary for the proof---that is the subject of the
+first proof.
+
+Before we put this exercise to rest (indeed, we completed the exercise
+requirement in the first paragraph following the exercise definition), it is
+also worth noting that the language of this grammar may be accepted by a finite
+automaton (and consequently described by a regular expression); this is
+demonstrated by \fref{z1-regex}. It should be noted that this is not the case
+with all of the exercises that follow.
+\exend
+
+\begin{figure}
+    \center
+    \begin{tikzpicture}
+        \node[state,initial] (a) {$a$};
+        \node[state] (b) [right=of a] {$b$};
+        \node[state,accepting] (c) [right=of b] {$c$};
+
+        \path[->]
+            (a) edge [loop below] node {1} ()
+                edge [bend right, below] node {\emptystr} (c)
+                edge [above] node {$0$} (b)
+            (b) edge [above] node {$1$} (c)
+            (c) edge [bend right, above] node {\emptystr} (a)
+            ;
+    \end{tikzpicture}
+
+    \caption{An NFA corresponding to the extended regular expression
+        $\left(0^?1^+\right)^*$ describing \gref{z1}.}
+    \label{f:z1-regex}
+\end{figure}
+
+The above example was fairly simple, yet resulted in a relatively lengthy
+discourse far past what was required by the text; the reader can expect such a
+discussion to continue for all examples that follow.
+
+
+\exercise 4.2.3b The set of all strings of 0's and 1's that are
+palindromes; that is, the string reads the same backward as forward.
+
+As the exercise states, a {\sl palindrome} is a string that reads the same in
+both directions; let us consider some examples before attempting to construct a
+grammar.
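+
+Before listing them, it is worth noting that the defining property is easy to
+state in code as well. The following Python sketch is an informal aid for the
+reader only (it is not part of CPTT or its exercises, and the function names,
+the recursive variant and the sample strings are choices made purely for this
+illustration); it may be used to spot-check the examples given below, and its
+recursive form anticipates the outside-in construction we will arrive at
+shortly.
+
+\begin{verbatim}
+# Informal sketch: check whether a string over {0,1} reads the same
+# backward as forward.  The recursive variant peels matching characters
+# from both ends, mirroring an outside-in construction.
+def is_palindrome(s):
+    return s == s[::-1]
+
+def is_palindrome_rec(s):
+    """Equivalent recursive check: strip matching outer characters."""
+    if len(s) <= 1:              # the empty string, 0 or 1
+        return True
+    return s[0] == s[-1] and is_palindrome_rec(s[1:-1])
+
+if __name__ == "__main__":
+    for s in ("1001", "1100110011", "0101010", "0", "01"):
+        assert is_palindrome(s) == is_palindrome_rec(s)
+        print(s, is_palindrome(s))
+\end{verbatim}
+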
+The following strings are all palindromes, one per
+line:\footnote{An example of an English palindrome is ``Mr.~Owl ate my metal
+worm'' (discarding punctuation and capitalization).}
+
+\begin{equation}\label{e:palex}
+    \begin{tabular}{rcl}
+        1    &00 &1    \\
+        1100 &11 &0011 \\
+        010  &1  &010  \\
+             & 0 &
+    \end{tabular}
+\end{equation}
+
+The above palindromes have been laid out so that their symmetry is apparent. At
+first glance, one can imagine constructing a palindrome out of pairs of
+characters, like the second row of~\eqref{palex}:
+
+\begin{equation}\label{e:palex-2}
+    \begin{tabular}{crcl}
+             & 11 &      \\
+        1    & 11 & 1    \\
+        11   & 00 & 11   \\
+        110  & 00 & 011  \\
+        1100 & 11 & 0011
+    \end{tabular}
+\end{equation}
+
+\noindent
+In this case, each palindrome would always have an even number of characters.
+However, it is important to note the bottom two palindromes of \eqref{palex},
+which have an \emph{odd} number of characters:
+
+\begin{equation}\label{e:palex-3}
+    \begin{tabular}{rcl}
+            & 00 &     \\
+        0   & 11 & 0   \\
+        01  & 00 & 10  \\
+        010 & 1  & 010
+    \end{tabular}
+\end{equation}
+
+Given this evaluation and the understanding that $2n$~is always even for some
+positive integer~$n$, we can recursively construct a palindrome from the edges
+inward in pairs. Once we reach the center, we may end with~\emptystr\ if we wish
+to have an even ($2n$) number of characters, or otherwise may add a single
+character to create a palindrome containing an odd ($2n+1$) number of
+characters.
+
+\begin{equation}\label{e:palindrome}
+    \begin{aligned}
+        \nt{S} &\prod 0\nts{S}0 \punion 1\nts{S}1 \punion M \\
+        \nt{M} &\prod 0 \punion 1 \punion \emptystr
+    \end{aligned}
+\end{equation}
+
+In \gref{palindrome} above, we define our start non-terminal~\nt{S} with
+productions for the outer pairs. The non-terminal~\nt{M} represents the
+acceptable inner (``middle'') characters, which determine whether the length of
+the palindrome is even (if \emptystr~is used) or odd (0 or~1). We will leave
+demonstrations of such derivations to the proof.
+
+To prove that grammar~\nt{S} is the proper grammar for all palindromes, we must
+again prove two things: that the language $L(\nt{S})$ contains only palindromes
+of~0's and~1's and that all such palindromes can be derived from~\nt{S}. The
+difference between these two descriptions may be subtle for such a simple
+grammar, but the distinction is important to ensure that $L(\nt{S})$ represents
+\emph{nothing more and nothing less} than the language of such palindromes.
+
+As before, the proofs will be inductive---the first proof on the number of
+steps~$n$ of a derivation of~\nt{S} and the second on the length~$l$ of the
+palindrome~$s$. Our alphabet~$\Sigma$ is once again~$\set{0,1}$.
+
+\proof The only strings derivable from grammar~\nt{S} are palindromes consisting
+of 0's and~1's.
+
+\basis The basis is $n=2$, which is the fewest number of steps in which a
+string may be derived from~\nt{S}.\footnote{A derivation of $n=1$ steps cannot
+result in a string consisting only of terminals, as the result would be
+$0S0$,~$1S1$ or~$M$, each of which still contains a nonterminal.} Such a
+derivation must be of the form
+$$
+    \nt{S} \deriv M \deriv x,
+$$
+\noindent
+where $x$~is 0,~1, or~\emptystr. In the case of~\emptystr, the derived string is
+clearly a palindrome of length zero. In the case of 0 or~1, the length of the
+string is one, which must be a palindrome.
+
+\ind Now assume that every string derived in fewer than $n$~steps is a
+palindrome.
+Such a derivation must be of the form
+$$
+    \nt{S} \deriv x\nts{S}x \derivz x\;y\;x.
+$$
+\noindent
+That is, the string~$x$ appears on both the left and right of~$y$. Since the
+derivation of~$y$ from~\nt{S} takes fewer than $n$~steps---specifically, $n-1$
+steps---$y$~must be a palindrome by our inductive hypothesis. Because $x$~is
+added to both the beginning and end of~$y$, then any string derived in $n$~steps
+must be a palindrome. \foorp
+
+Let us further demonstrate the above proof by deriving~\eqref{palex-2}
+from~\nt{S}:\footnote{The dots were added so as not to confuse the reader as to
+what was going on; the symbol~\derivp\ is sufficient and therefore the dots will
+be omitted in the future.}
+
+\begin{equation}
+    \nt{S}
+    \deriv 1\nts{S}1
+    \deriv 1\;1\nts{S}1\;1
+    \deriv \cdots
+    \derivp 1\;1\;0\;0\;1\;\emptystr\;1\;0\;0\;1\;1
+\end{equation}
+
+\noindent
+and additionally \eqref{palex-3}:
+
+\begin{equation}
+    \nt{S}
+    \deriv 0\nts{S}0
+    \deriv 0\;1\nts{S}1\;0
+    \deriv 0\;1\;0\nts{S}0\;1\;0
+    \deriv 0\;1\;0\;1\;0\;1\;0.
+\end{equation}
+
+\noindent
+The induction step works by recognizing the basis as the middle of the string
+(nonterminal~\nt{M} in \gref{palindrome})---\emptystr~for palindromes of an
+even length and the $\left\lceil l/2 \right\rceil^{th}$ character (where $l$~is
+the length of the string) for those of an odd length (1 in the case of the
+latter derivation). Call this string~$b$. We know that $b$~is a palindrome, as
+explained in the proof above. For our inductive step, we recognize that, for
+each step~$n$, we add two characters---one to the beginning and one to the
+end---to the result of step~$n-1$. As such, since the derivation of~$n-1$ steps
+must be a palindrome, the derivation in~$n$ steps must also be---it is not
+possible to derive anything but a palindrome from~\nt{M}, and \nt{S}~maintains
+this designation.
+
+For completeness, we must now show that all possible palindromes of the
+alphabet~$\Sigma$ can be derived from~\nt{S}.
+
+\proof Every palindrome consisting of~0's and~1's is derivable from~\nt{S}.
+
+\basis If the string~$s$ is of length~$l\leq1$, then it must be \emptystr,~0 or~1,
+all of which are palindromes derivable by~\nt{M}.
+
+\ind Observe that any palindrome of length~$l>1$ must contain the same
+character at positions~$1$ and~$l$.\footnote{1-indexed for notational
+convenience.} Assume that each palindrome with a length less than~$l$ is
+derivable from~\nt{S}. Since $s$~is a palindrome, it must have the form $xyx,
+x\in\Sigma$, where $y$~is also a palindrome. Since $y$~is a palindrome of
+length $l-2<l$, it is derivable from~\nt{S} by our inductive hypothesis.
+Furthermore, $s$~must then have a derivation of the form
+$$
+    \nt{S} \deriv x\nts{S}x \derivz x\;y\;x,
+$$
+\noindent
+thereby proving that $s$~is derivable from~\nt{S}. \foorp
+
+Unlike the previous exercise, this language cannot be accepted by any finite
+automaton (nor, consequently, described by a regular expression); it is worth
+exploring why.
+
+\proof No finite automaton can accept palindromes of arbitrary length without
+also accepting strings that are not palindromes.
+
+\begin{figure}
+    \center
+    \begin{tikzpicture}
+        \node[state,initial,accepting] (a) {$a$};
+
+        \path[->]
+            (a) edge [loop below] node {1} ()
+                edge [loop above] node {0} ()
+            ;
+    \end{tikzpicture}
+
+    \caption{The minimum-state DFA for the regular expression
+        $\left(0|1\right)^*$.}
+    \label{f:pal-a}
+\end{figure}
+
+Consider that the only way for a finite automaton to maintain a history of states
+is to have a state to represent each unique history. However, to accept a string
+of any length, we would need an automaton containing a potentially infinite
+number of states, which is not finite (and therefore not a finite automaton).
+Therefore, it is not possible to represent the history of every possible
+palindrome using a finite set of states.
+
+Given this, it must stand that a finite automaton must at some point contain a
+state that transitions to a previous or current state, such as the NFA in
+\fref{pal-a2}.
Since the history of the string is ``stored'' purely in the +possible states leading up to the current state, this transition~$t$ equates to a +loss of ``memory'', without which the right-hand portion of the palindrome cannot +be properly matched. Furthermore, since each position~$n$ may contain any +character in~$\Sigma$, and since the transition~$t$ can only yield a set of +future states with a limited (finite) precision, each of these future states +must be redundant. Since each NFA can be represented by an equivalent DFA and +each DFA for some grammar has a single common minimum-state DFA, any portion of +a finite automaton that can accept a palindrome of any length must be equivalent +to \fref{pal-a} (such as state~$x$ in \fref{pal-a2}). We are therefore left to +conclude that no finite automata can accept a palindrome of arbitrary length +without accepting every string that is a combination of each character in +$\Sigma$. \foorp + +\begin{figure} + \center + \begin{tikzpicture} + \node[state,initial] (a) {$1$}; + \node[state] (b) [right=of a] {$2$}; + \node[state] (x) [right=of b] {$x$}; + \node[state] (y) [right=of x] {$n-1$}; + \node[state,accepting] (z) [right=of y] {$n$}; + + \path[->] + (a) edge [above] node {$\alpha$} (b) + edge [below, bend right=45] node {$\kern-0.7em\emptystr$} (z) + (b) edge [above] node {$\beta$} (x) + edge [below, bend right=65] node {$\emptystr$} (y) + (x) edge [loop above] node {$\beta$} () + edge [loop below] node {$\alpha$} () + edge [above] node {$\beta$} (y) + (y) edge [above] node {$\alpha$} (z) + ; + \end{tikzpicture} + + \caption{An NFA with a finite set of states must at some point transition to a + previous or identical state in order to accept input of any length. + $\Sigma=\set{\alpha,\beta}$.} + \label{f:pal-a2} +\end{figure} + +To provide further clarification---any finite automata that transitions to a +\emph{previous} state, since it looses a portion of its history, can no longer +accurately determine the states leading up to the final state. That is, consider +the string 10101 and consider that the first three characters of this string can +be represented by the states $\set{a,b,a}$. At this point, we can no longer be +certain of what the string may end with, because we have lost any sense of +nesting/recursion. Therefore, the states leading to the final state are forced +to accept any character in $\Sigma$ and therefore must be equivalent to the +minimum-state DFA of \fref{pal-a}. As was mentioned by the text, ``finite +automata cannot count''. + +\fref{pal-a2} gets around such an issue by transitioning only to current or +future states, which permits a \emph{finite} amount of nesting (placing the +aforementioned minimum-state DFA~$x$ in the middle). However, note a glaring +issue---this automaton does not accept~$\beta$ in the first character position. +If it did, then we would need a second set of states in order to maintain such a +history and know that we should also \emph{end} with $\beta$~instead +of~$\alpha$. The number of states would therefore grow very quickly with the +level of nesting and the size of~$\Sigma$ (such a consideration is left to the +reader). + +We have exhaustively proved that \gref{palindrome} is the correct answer for +this exercise. \exend + + +\exercise 4.2.3c The set of all strings of 0's and 1's with an equal number of +0's and 1's. + +To understand how to approach this problem, we shall consider a number of +strings that are derivable from this language. 
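+
+Such strings are also easy to enumerate mechanically. The short Python sketch
+below is an informal aid for the reader only (it is not part of CPTT or its
+exercises, and the function name and length bound are choices made purely for
+this illustration); it lists every string over $\Sigma=\set{0,1}$ up to a given
+length that contains equally many 0's and 1's, and can be used to generate
+examples beyond those discussed here.
+
+\begin{verbatim}
+# Informal helper: enumerate all strings over {0,1} up to a given length
+# that contain equally many 0's and 1's.
+from itertools import product
+
+def equal_zeros_ones(max_len):
+    """Yield strings over {0,1} of length <= max_len with #0's == #1's."""
+    for n in range(max_len + 1):
+        for bits in product("01", repeat=n):
+            s = "".join(bits)
+            if s.count("0") == s.count("1"):
+                yield s
+
+if __name__ == "__main__":
+    for s in equal_zeros_ones(4):
+        print(repr(s), "length", len(s))
+\end{verbatim}
+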
An obvious case is~\emptystr, +which contains zero~0's and zero~1's. Some additional examples are shown in +\fref{eq-ex} along with their lengths (denoted by~$l$). + +\begin{figure}[h] + \center + \begin{tabular}{r|cccccc} + $s$ & \emptystr & 10 & 01 & 1010 & 1001 & 011100 \\ + \hline + $l$ & 0 & 2 & 2 & 4 & 4 & 6 + \end{tabular} + + \caption{Examples of strings with an equal number of 0's and 1's.} + \label{f:eq-ex} +\end{figure} + +These examples demonstrate a number of important properties. In particular, the +length~$l$ of the string~$s$ is always even, with the number of 0's and~1's +$n=l/2$. Additionally, the characters of the alphabet~$\Sigma$ may appear in any +order in the string. Therefore, we do not have the luxury of a simple, nested, +recursive implementation as we did with the palindrome exercise (at least not +exclusively). + +Let us construct the grammar iteratively, beginning with the simplest case +of~\emptystr. + +\begin{equation}\label{e:eq-1} + \nt{S} \prod \emptystr +\end{equation} + +\noindent +The second case---10---is also fairly easy to fit into~$\nt{S}$: + +\begin{equation}\label{e:eq-2} + \nt{S} \prod 10 \punion \emptystr +\end{equation} + +The third case demonstrates an important case regarding our strings: They may +begin with either a~0 or a~1 and they may also \emph{end} with either character +(more generally, they may begin or end with any character in~$\Sigma$). However, +we cannot simply adjust our grammar to accept either character in both +positions---$\nt{S}$ must assure that, any time we include a~0 in a production, +we also include a~1 (and vice versa). So far, this is guaranteed by~$\nt{S}$ in +\gref{eq-2}; to keep on this path, we must add 01 as yet another special case. + +\begin{equation}\label{e:eq-3} + \nt{S} \prod 01 \punion 10 \punion \emptystr +\end{equation} + +\goodbreak +The fourth case---1010---introduces the need to handle strings of an arbitrary +length. To do this, we must determine at what point we should recurse +on~$\nt{S}$. Looking at the example, we could derive 1010 as two nested +applications of~$\nt{S}$ if we recurse between the two terminals. + +\begin{equation}\label{e:eq-a} + \nt{S} + \deriv 1\nts{S}0 + \deriv 1\;0\nts{S}1\;0 + \deriv 1\;0\;\emptystr\;1\;0 + \derivz 1\;01\;0 +\end{equation} + +\noindent +Of course, one could also adopt an alternate perspective by considering the +string to be the production of two adjacent non-terminals. + +\begin{equation}\label{e:eq-b} + \nt{S} \deriv \nt{S}\;\nt{S} + \derivlm 10\;\nt{S} + \derivlm 10\;10 +\end{equation} + +\noindent +Unfortunately, with this information alone, we cannot be certain which of these +productions---if such a choice even matters---should be used in our grammar. +Perhaps we can gain further insight from the remaining examples. + +The next example---1001---can be derived in a manner similar to \eqref{eq-b}, +but not \eqref{eq-a}; in particular, \gref{eq-3} has no production for the +string 00, and so we cannot construct the string from the outside in. Given +that, we can be certain that an adjacent non-terminal production is needed and +so we will add the production used in \eqref{eq-b} to our grammar. + +\begin{equation}\label{e:eq-4} + \nt{S} \prod 01 \punion 10 \punion \nt{S}\;\nt{S} \punion \emptystr +\end{equation} + +However, the aforementioned predicament---the absense of a production that can +yield only 00---raises the question of whether or not we can truly derive any +string of equal 1's and 0's from the above grammar. 
Our final example challenges +this. 011100 cannot possibly be represented by~$\nt{S}$ in \gref{eq-4} because +this grammar constructs the string from left-to-right (or right-to-left) in +pairs of~0's and~1's. Therefore, the only way to have adjacent~1's or adjacent~ +0's is to alternate the productions, which makes it impossible to have more than +two adjacent identical characters. + +Given this, it seems that both \eqref{eq-b} \emph{and} \eqref{eq-a} are +necessary; the following derivation demonstrates this fact (neither can +individually be used to derive the string 011100). + +\begin{equation} + \nt{S} \deriv \nt{S}\;\nt{S} + \derivlm 01\;\nt{S} + \derivlm 01\;1\nts{S}0 + \derivlm 01\;1\;1\nts{S}0\;0 + \derivlm 01\;1\;1\;\emptystr\;0\;0 + \derivlmz 01\;1\;10\;0 +\end{equation} + +\noindent +We thus arrive at \gref{eq-5} below. + +\begin{equation}\label{e:eq-5} + \nt{S} \prod 0\nts{S}1 + \punion 1\nts{S}0 + \punion \nt{S}\;\nt{S} + \punion \emptystr +\end{equation} + +An astute reader may at this point notice that we have created an ambiguity in +our grammar: Recall~\eqref{eq-a} and~\eqref{eq-b}, which had two possible +derivations for the same string; both of these derivations are now possible in +our grammar. The text defines an ambiguous grammar to be a grammar that contains +more than one leftmost or more than one rightmost derivation for the same +sentence. This is a particularly interesting example of ambiguity, in particular +because we cannot resolve it. Let us consider why. + +\proof Grammar~$\nt{S}$ cannot be disambiguated. We will prove this fact by +contradiction. + +\contra Firstly, recognize that~$\nt{S}$ is ambiguous because there exists some +sentence~$s$ that has both of the following derivations in $n>1$ steps, where +$a\ne b$: + +\begin{align*} + \nt{S} &\deriv a\nts{S}b \derivp a\;x\;b; + \\ + \nt{S} &\deriv \nt{S}\;\nt{S} + \deriv a\nts{S}b\;\nt{S} + \derivp a\;b\;\nt{S} + \derivp a\;b\;x. +\end{align*} + +Suppose to the contrary that there is some way to disambiguate~$x$. There must +then be some terminal $c\in\Sigma$ in~$x$ that may be used to perform the +disambiguation and such a disambiguation would imply a difference in the +semantics of~$x$ between the two derivations. However, $x=x$ and so both +derivations hold exactly the same meaning---balanced strings. Furthermore, the +productions for producing balanced strings requires each character in~$\Sigma$; +$c$ therefore must not exist. \foorp + +Fortunately, this ambiguity is not an issue for our grammar because the multiple +derivations are semantically equivalent---we are not arriving at any different +result within the context of this exercise. The sentence 1010 of \fref{eq-ex} +demonstrates this concept: It does not matter whether we consider the sentence +to be a single balanced string or the concatenation of two balanced strings; we +arrive at the same result regardless with no harm done.\footnote{Of course, one +valid argument is that a more concise and unambiguous grammar will reduce +problems during parsing. However, the parser (like Lex, as described by the +text) can give precedence to the productions that appear earlier in the grammar +to resolve this issue.} + +While the discussion thus far is likely to convince the reader that \gref{eq-5} +is correct, we shall conclude with a formal proof of this fact. 
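+
+Before doing so, the ambiguity observed above can also be demonstrated
+mechanically, at least up to a bounded number of derivation steps. The Python
+sketch below is an informal illustration only (it is not part of CPTT or its
+exercises; the grammar encoding, the step bound and the choice of the sentence
+1010 are assumptions made for this sketch): it performs a brute-force search
+for leftmost derivations of 1010 under \gref{eq-5} and finds more than one,
+which is the condition for ambiguity given by the text.
+
+\begin{verbatim}
+# Informal sketch: brute-force search for leftmost derivations of a
+# sentence under S -> 0S1 | 1S0 | SS | e (grammar eq-5).
+PRODUCTIONS = ["0S1", "1S0", "SS", ""]   # "" represents the empty string
+
+def leftmost(form, target, steps):
+    """Yield leftmost derivations of target, as lists of sentential forms."""
+    if "S" not in form:
+        if form == target:
+            yield [form]
+        return
+    if steps == 0 or len(form.replace("S", "")) > len(target):
+        return
+    i = form.index("S")                  # always expand the leftmost S
+    for body in PRODUCTIONS:
+        nxt = form[:i] + body + form[i + 1:]
+        for rest in leftmost(nxt, target, steps - 1):
+            yield [form] + rest
+
+if __name__ == "__main__":
+    found = list(leftmost("S", "1010", 6))
+    for n, seq in enumerate(found, 1):
+        print(n, " => ".join(s if s else "e" for s in seq))
+    # More than one leftmost derivation of the same sentence means the
+    # grammar is ambiguous.
+    assert len(found) > 1
+\end{verbatim}
+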
+A proof that
+the grammar cannot be represented by any finite automaton shall be omitted, in
+particular because the productions of $\nt{S}$ have a structure very similar to
+the palindrome \gref{palindrome}.
+
+\proof Only sentences composed of balanced~1's and~0's may be derived
+from~$\nt{S}$.
+
+\basis The basis is $n=1$. The only sentence that may be derived in one step
+is~\emptystr, which is clearly balanced (containing zero~0's and zero~1's).
+
+\ind Assume that any sentence derived in fewer than~$n$ steps is balanced. Now
+recognize that any sentence derived in $n>1$ steps must make use of one of the
+following productions of $\nt{S}$:
+
+\begin{align*}
+    \nt{S} &\prod 0\nts{S}1; \\
+    \nt{S} &\prod 1\nts{S}0; \\
+    \nt{S} &\prod \nt{S}\;\nt{S}.
+\end{align*}
+
+\noindent
+Therefore, any sentence other than~\emptystr\ derived using one of the first two
+productions has the form $0x1$ or~$1x0$, each of which adds one~0 and one~1.
+Since $x$~is derivable from~$\nt{S}$ in fewer than~$n$ steps, it must be
+balanced by our inductive hypothesis, and so $0x1$ and~$1x0$ are balanced as
+well. A sentence derived using the last remaining production has the form~$xy$,
+where both $x$ and~$y$ are derivable from~\nt{S} in fewer than~$n$ steps and
+thus must be balanced; their concatenation is therefore also balanced.
+Furthermore, since the productions of~$\nt{S}$ produce only 0,~1,
+or~\emptystr, $\nt{S}$~has the alphabet $\Sigma=\set{0,1}$ and, consequently,
+may derive no sentence except for those containing balanced~0's and~1's. \foorp
+
+Having proved that only sentences of balanced~0's and~1's are derivable
+from~$\nt{S}$, we must now prove that $\nt{S}$~can derive \emph{all} such
+strings (that is, all such strings are sentences of $\nt{S}$). Such a proof is
+interesting because our grammar is more sophisticated than the previous
+examples.
+
+\proof All strings of balanced~0's and~1's are sentences of~$\nt{S}$.
+
+\basis The basis is a string of length $l=0$, which contains zero~0's and
+zero~1's. This string must be~\emptystr, which is derivable from~$\nt{S}$.
+
+\ind First, recognize that all balanced strings must have an even length $l=2k$
+(as emphasized in \fref{eq-ex}) and contain $k$~0's and $k$~1's. Assume that all
+balanced strings of length less than~$2k$ are derivable from~$\nt{S}$.
+
+Consider any balanced string~$s$ of length~$2k$. We can consider $s$~to have the
+form~$yz$---that is, the concatenation of two balanced strings~$y$ and~$z$, both
+of which in turn have the form $axb,\;a\neq b$, where $x$ itself must be
+balanced (since $a\neq b$); alternatively, either $y$ or~$z$ may be~\emptystr,
+which therefore implies that the form~$yz$ accepts any balanced string where the
+first and last characters are not the same.
+
+We must now show that all such strings can be represented by the form~$yz$.
+First, recognize that $y=axb$ may have either the form $0x1$ or $1x0$; the form
+$yz$ then permits up to two adjacent identical characters in $\Sigma$; any
+additional adjacent identical characters may be derived by $x$. Consider
+$x=\emptystr$; then, clearly $axb$ is balanced and can be concatenated to form a
+larger balanced string. If $x\neq\emptystr$ but $x_1=b$,\footnote{$x_n$ denotes
+the $n^{\text{th}}$ character of~$x$.} then we can instead consider an
+alternative interpretation $y'=ax_1$ and $x'=x_2\cdots x_nb$, and then let
+$y=y'x'$ (instead of $axb$).
+
+We are then left with the case where $x_1=a$.
+Such a case allows for an
+arbitrarily deep nesting of adjacent identical characters and therefore $axb$
+can be represented by the regular expression $a^+b^+$. It is therefore clear
+that the form $yz$ is able to describe any string of balanced characters in the
+alphabet $\Sigma=\set{0,1}$. Such a form must have the derivation
+
+$$
+    \nt{S} \deriv \nt{S}\;\nt{S} \derivlmz y\;z.
+$$
+
+\noindent
+Since this is a leftmost derivation, $z$~is either a balanced string
+or~\emptystr. In the former case, it is obvious that both $y$ and~$z$ are of a
+length less than~$2k$ and are therefore derivable from~\nt{S} by our inductive
+hypothesis. Otherwise, $z=\emptystr$, the length of $y$~is precisely~$2k$, and
+we must consider the form $y=axb$; $x$~is clearly of a length less than~$2k$ and
+is balanced (since $a\neq b$), so it is derivable from~\nt{S} by our inductive
+hypothesis. Furthermore, $y$~must then have a derivation of the form
+$$
+    \nt{S} \deriv a\nts{S}b \derivz a\;x\;b,
+$$
+\noindent
+thereby proving that $axb$ (and hence~$s$) is derivable from~\nt{S}. \foorp
+
+This proof was considerably more involved than our previous ones and is an
+excellent segue into proving more sophisticated grammars. Of course, the reader
+can surely see the challenges that might arise from attempting to prove much
+more complicated grammars. \exend
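+
+Although it is no substitute for the proofs above, such a grammar can also be
+cross-checked mechanically for small inputs. The Python sketch below is an
+informal aid only (it is not part of CPTT or its exercises; the helper names,
+the use of memoization and the length bound are choices made purely for this
+illustration): it tests, for every string over $\Sigma$ up to length~8, whether
+derivability from \gref{eq-5} agrees with a direct count of 0's and~1's.
+
+\begin{verbatim}
+# Informal cross-check: the strings derivable from S -> 0S1 | 1S0 | SS | e
+# should be exactly those containing equally many 0's and 1's.
+from functools import lru_cache
+from itertools import product
+
+@lru_cache(maxsize=None)
+def derivable(w):
+    """True iff w is derivable from S in grammar (eq-5)."""
+    if w == "":
+        return True                      # S -> e
+    if len(w) >= 2:
+        if w[0] == "0" and w[-1] == "1" and derivable(w[1:-1]):
+            return True                  # S -> 0S1
+        if w[0] == "1" and w[-1] == "0" and derivable(w[1:-1]):
+            return True                  # S -> 1S0
+    # S -> SS: a proper split suffices, since an empty half adds nothing
+    return any(derivable(w[:i]) and derivable(w[i:])
+               for i in range(1, len(w)))
+
+def balanced(w):
+    return w.count("0") == w.count("1")
+
+if __name__ == "__main__":
+    for n in range(9):
+        for w in ("".join(t) for t in product("01", repeat=n)):
+            assert derivable(w) == balanced(w), w
+    print("grammar and balance property agree up to length 8")
+\end{verbatim}
+
+\end{document}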