Program Composition: Talk notes

master
Mike Gerwitz 2019-03-15 00:46:44 -04:00
parent b40e289f35
commit f372be6d29
Signed by: mikegerwitz
GPG Key ID: 8C917B7F5DC51BA2
1 changed file with 358 additions and 29 deletions

@@ -64,7 +64,9 @@
- [ ] Users need to be able to convey their thoughts to the computer
without being programmers
- [ ] Concise primitives / building blocks
- [ ] Readline (history and editing)
- [ ] Readline (history and editing) [0/2]
- [ ] Develop as you go, perhaps just referencing history early on
- [ ] Transfer commands from history into scripts and aliases for re-use
- [X] Regular expressions ([[*Perspective Topics][Perspective Topics]])
- [ ] Remote commands via SSH
- [X] Text as a universal interface
@@ -167,14 +169,14 @@
| \_ Program Composition | | LACKING | |
| \_ Composition Topics | | | |
| \_ Clarifying Pipelines | | RAW | fullframe |
| \_ Tor | | LACKING | fullframe |
| \_ LP Sessions | | LACKING | fullframe |
| \_ Interactive, Incremental, Iterative Development | | LACKING | fullframe |
| \_ Discovering URLs | | LACKING | fullframe |
| \_ Go Grab a Coffee | | LACKING | fullframe |
| \_ Async Processes | | | fullframe |
| \_ Executable Shell Script and Concurrency | | LACKING | fullframe |
| \_ Again: A Research Task | | LACKING | againframe |
| \_ Tor | | RAW | fullframe |
| \_ LP Sessions | | RAW | fullframe |
| \_ Interactive, Incremental, Iterative Development | | RAW | fullframe |
| \_ Discovering URLs | | RAW | fullframe |
| \_ Go Grab a Coffee | | RAW | fullframe |
| \_ Async Processes | | RAW | fullframe |
| \_ Executable Shell Script and Concurrency | | RAW | fullframe |
| \_ Again: A Research Task | | RAW | againframe |
| \_ A Quick-n-Dirty Solution | | LACKING | frame |
|-------------------------------------------------------+----------+---------+-------------|
| \_ Thank You | 00:00:01 | | fullframe |
@@ -1696,16 +1698,14 @@ We start to think of how to decompose problems into small operations that
We think of how to chain small, specialized programs together,
transforming text at each step to make it more suitable for the next.
** LACKING Program Composition [0/9]
*** Composition Topics [5/6] :noexport:
** RAW Program Composition [0/10]
*** Composition Topics [6/6] :noexport:
- [X] Clarify how pipelines work with existing =wget | grep=.
- [X] More involved pipeline with more than two programs.
- [ ] Emphasize iterative development and how the shell is a REPL. [0/3]
- [X] Emphasize iterative development and how the shell is a REPL.
- Useful for programmers for prototyping and debugging, but also essential
to average users for discovery.
- [ ] Develop as you go, perhaps just referencing history early on
- [ ] Evolve by making portions of command dynamic (variables, subshells)
- [ ] Transfer commands from history into scripts and aliases for re-use
- Evolve by making portions of command dynamic (variables, subshells)
- [X] Now script discovering what pages contain a certain word [3/3]
- [X] Mention previous example of being emailed a list of URLs. Rather
than pasting them into a file, let's discover them using the same
@@ -1782,8 +1782,9 @@ We can pipe it to =wc= instead,
What about the number of lines that contain the string ``free software''?
Or how about the last such line?
It's all a simple matter of composing existing programs.
*** LACKING Tor :B_fullframe:
*** RAW Tor :B_fullframe:
:PROPERTIES:
:BEAMER_env: fullframe
:END:
@@ -1792,8 +1793,21 @@ Or how about the last such line?
$ alias fetch-url='torify wget -qO-'
#+END_SRC
**** Notes :B_noteNH:
:PROPERTIES:
:BEAMER_env: noteNH
:END:
*** LACKING LP Sessions :B_fullframe:
By the way,
retrieving a bunch of URLs in an automated manner may be a privacy concern
for you.
You can easily send all these requests through Tor,
assuming it is installed and the daemon running,
by prefixing =wget= with =torify=.
Since we abstracted our fetching away into the =fetch-url= alias,
our previous examples continue to work as-is.
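For instance,
something like the earlier line-counting pipeline keeps working unchanged
(just a sketch, reusing the speakers URL from the next frame):
#+BEGIN_SRC sh
$ # same usage as before, but now every request goes through Tor
$ fetch-url https://libreplanet.org/2019/speakers/ \
    | grep -c 'free software'
#+END_SRC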
*** RAW LP Sessions :B_fullframe:
:PROPERTIES:
:BEAMER_env: fullframe
:END:
@@ -1926,7 +1940,71 @@ $ fetch-url https://libreplanet.org/2019/speakers/ \
#+END_SRC
#+BEAMER: \end{onlyenv}
*** LACKING Interactive, Incremental, Iterative Development :B_fullframe:
**** Notes :B_noteNH:
:PROPERTIES:
:BEAMER_env: noteNH
:END:
How about something more involved?
I noticed that some talks had multiple speakers,
and I wanted to know which ones had the /most/ speakers.
The HTML of the speakers page includes a header for each speaker.
Here are the first two.
These are keynote speakers,
but there are also non-keynote ones that are just =speaker-header=.
Let's get just the talk titles that those speakers are associated with.
Looking at this output,
we see that the talk titles have an =em= tag,
so let's just go with that.
Uh oh.
It looks like at least one of those results has /multiple/ talks.
But note that each is enclosed in its own set of =em= tags.
If we add =-o= to =grep=,
which stands for /only/,
then it'll only return the portion of the line that matches,
rather than the entire line.
Further,
if there are multiple matches on a line,
it'll output each match independently on its own line.
That's exactly what we want!
But we have to modify our regex a little bit to prevent it from grabbing
everything between the first and /last/ =em= tag,
by prohibiting it from matching a less-than character within the title.
Now assuming that the talk titles are consistent,
we can get a count.
=uniq= has the ability to collapse consecutive identical lines
and, with =-c=, output a count for each.
We also use =-d= to tell it to only output duplicates.
But =uniq= requires sorted input,
so we first pipe it to =sort=.
That gives us a count of each talk!
But I want to know the talks with the most speakers,
so let's sort it /again/,
this time numerically and in reverse order.
And we have our answer!
But just for the hell of it,
let's go a step further.
Using =sed=,
which stands for /stream editor/,
we can match on portions of the input and reference those matches in a
replacement.
So we can reformat the =uniq= output into an English sentence,
like so.
And then we're going to pipe it to the program =espeak=,
which is a text-to-speech synthesizer.
Your computer will speak the top five talks by presenter count to you.
Listening to computers speak is all the rage right now,
right?
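Pulling those steps together,
the whole pipeline looks roughly like this
(a sketch for reference;
the =speaker-header= filter, the =em= regex, and the =sed= expression are
reconstructed from the description above rather than copied from the slide):
#+BEGIN_SRC sh
$ # count how many times each talk title repeats, then speak the top five
$ fetch-url https://libreplanet.org/2019/speakers/ \
    | grep 'speaker-header' \
    | grep -o '<em>[^<]*</em>' \
    | sort | uniq -cd | sort -rn | head -n5 \
    | sed 's#^ *\([0-9]\+\) *<em>\([^<]*\)</em>#\2 has \1 speakers#' \
    | espeak
#+END_SRC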
*** RAW Interactive, Incremental, Iterative Development :B_fullframe:
:PROPERTIES:
:BEAMER_env: fullframe
:END:
@@ -1937,7 +2015,42 @@ Incremental Development
#+BEAMER: \fullsubtext
Interactive REPL, Iterative Decomposition
*** LACKING Discovering URLs :B_fullframe:
**** Notes :B_noteNH:
:PROPERTIES:
:BEAMER_env: noteNH
:END:
Notice how we approached that problem.
I presented it here just as I developed it.
I didn't open my web browser and inspect the HTML;
I just looked at the =wget= output and then started to manipulate it in
useful ways working toward my final goal.
And this is part of what makes working in a shell so powerful.
In software development,
we call environments like this REPLs,
which stands for ``read-eval-print loop''.
The shell reads a command line,
evaluates it,
prints a result,
and then does that all over again.
As a hacker,
this allows me to easily inspect and iterate on my script in real time,
which can be a very efficient process.
I can quickly prototype something and then clean it up later.
Or maybe create a proof-of-concept in shell before writing the actual
implementation in another language.
But most users aren't programmers.
They aren't experts in these commands;
they have to play around and discover as they go.
And the shell is perfect for this discovery.
If something doesn't work,
just keep trying different things and get immediate feedback!
This is also really helpful when you're trying to craft a suitable regular
expression.
*** RAW Discovering URLs :B_fullframe:
:PROPERTIES:
:BEAMER_env: fullframe
:END:
@@ -2077,7 +2190,78 @@ $ xclip -i -selection clipboard < results.txt
#+END_SRC
#+BEAMER: \end{onlyenv}
*** LACKING Go Grab a Coffee :B_fullframe:
**** Notes :B_noteNH:
:PROPERTIES:
:BEAMER_env: noteNH
:END:
Okay, back to searching webpages at URLs.
Now that we have a means of creating the list,
how do we feed the URLs to our pipeline?
Why not pull them right out of the email with =grep=?
Let's say you saved the email in =email-of-links.msg=.
This simple regex should grab most URLs for both HTTP and HTTPS protocols.
Here's some example output with a few URLs.
For each of these,
we need to run our pipeline.
It's time to introduce =while= and =read=.
=while= will continue to execute its body in a loop until its command fails.
=read= will read line-by-line into one or more variables,
and will fail when there are no more lines to read.
So if we insert our =fetch-url= pipeline into the body,
we get this.
But if we just redirect output into =results.txt=,
we can't see the output unless we inspect the file.
For convenience,
let's use =tee=,
which is named for a pipe tee;
it'll send output through the pipeline while also writing the same
output to a given file.
The =-a= flag tells it to /append/ rather than overwrite.
So now we can both observe the results and have them written to a file!
But we were just going to reply to an email with those results.
Let's assume we're still using a GUI email client.
Wouldn't it be convenient if those results were on the clipboard for us
already so we can just paste them into the message?
We can accomplish that by piping to =xclip= as shown here.
There's also the program =xsel=,
which I typically use because its arguments are far more concise,
but I don't show it here.
Ah, crap, but now we can't see the output again.
So let's use =tee= again.
But rather than outputting to a file on disk,
we're going to use a special notation that tells bash to invoke a command
in a subshell and replace that portion of the command line with a path
to a virtual file representing the standard input of that subshell.
Now we can see the output again!
Well,
if we're /writing/ to the clipboard,
why don't we just /read/ from it too?
Instead of saving our mail to a file,
we can just copy the relevant portion and have that piped directly to
=grep=!
Because we're writing to =results.txt=,
another option is to just let it run and copy to the clipboard at a later
time.
We can do that by reading =results.txt= in place of standard input to
=xclip=,
as shown here.
And while we're at it,
here's a special notation to get rid of =echo= for the =tee= in the body
of =while=:
three less-than symbols provide the given string on standard input.
Phew!
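Putting those pieces together,
the loop ends up looking something like this
(a sketch, not necessarily the slide verbatim;
=email-of-links.msg= is the example filename from these notes):
#+BEGIN_SRC sh
$ # URLs whose pages do not mention the phrase are appended to results.txt,
$ # echoed to the terminal, and copied to the clipboard
$ grep -o 'https\?://[^ ]\+' email-of-links.msg \
    | while read URL; do
        fetch-url "$URL" | grep -q 'free software' \
          || tee -a results.txt <<< "$URL"
      done \
    | tee >(xclip -i -selection clipboard)
#+END_SRC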
*** RAW Go Grab a Coffee :B_fullframe:
:PROPERTIES:
:BEAMER_env: fullframe
:END:
@@ -2090,14 +2274,16 @@ Go Grab a Coffee
:BEAMER_env: noteNH
:END:
...
Remember when I said I could go grab a coffee and play with the kids while
the script did its thing?
Well now's that time.
But grabbing a coffee means that this system is a bottleneck.
Ideally,
we wouldn't have to wait long.
The Internet is fast nowadays;
ideally, we wouldn't have to wait long.
Can we do better?
*** Async Processes :B_fullframe:
*** RAW Async Processes :B_fullframe:
:PROPERTIES:
:BEAMER_env: fullframe
:END:
@@ -2119,7 +2305,41 @@ $ while read URL; do
#+END_SRC
#+BEAMER: \end{uncoverenv}
*** LACKING Executable Shell Script and Concurrency :B_fullframe:
**** Notes :B_noteNH:
:PROPERTIES:
:BEAMER_env: noteNH
:END:
Indeed we can.
This process is executing serially---one
URL at a time,
waiting for one to complete before checking another.
What if we could query multiple URLs in parallel?
Shells have built-in support for backgrounding tasks so that they can run
while you do other things;
all you have to do is place a single ampersand at the end of a command.
So in this example,
we sleep for one second and then echo ``done''.
But that sleep and subsequent echo is put into the background,
and the shell proceeds to first execute =echo start=.
One second later,
it outputs ``done''.
So here's the loop we were just writing.
If we add an ampersand at the end of that pipeline,
it'll run in the background and immediately proceed to the next URL,
executing the loop again.
But there's a problem with this approach.
Sure,
it's fine if we only have a few URLs.
But what if we have 1000?
Do we really want to spawn 1000s of processes and make 1000 network requests
at once?
That isn't efficient.
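For reference,
a minimal sketch of the two backgrounding examples described above,
assuming a =url-list= file like the one used on the later slides:
#+BEGIN_SRC sh
$ # "start" prints immediately; "done" appears about a second later,
$ # since the sleep-and-echo list runs in the background
$ sleep 1 && echo done &
$ echo start
$ # the same loop as before, but each iteration's pipeline is backgrounded
$ # by the trailing ampersand
$ while read URL; do
    fetch-url "$URL" | grep -q 'free software' || echo "$URL" &
  done < url-list
#+END_SRC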
*** RAW Executable Shell Script and Concurrency :B_fullframe:
:PROPERTIES:
:BEAMER_env: fullframe
:END:
@@ -2185,25 +2405,134 @@ $ xargs -n1 -P5 ./url-grep 'free software' > results.txt
#+END_SRC
#+BEAMER: \end{onlyenv}
*** LACKING Again: A Research Task :B_againframe:
**** Notes :B_noteNH:
:PROPERTIES:
:BEAMER_env: noteNH
:END:
Before we continue,
we're going to have to write our pipeline in a way that other programs can
run it.
Up to this point,
the program has just been embedded within the shell.
But one of the nice things about shell is that you can take what you entered
into the command line and paste it directly in a file and,
with some minor exceptions,
it'll work all the same.
Let's take our pipeline and name it =url-grep=.
Aliases only work in interactive sessions,
so we're going to just type =wget= directly here.
Alternatively,
you can define a function.
We use the positional parameters =1= and =2= here to represent the
respective arguments to the =url-grep= command.
The comment at the top of the file is called a ``she-bang'' and contains the
path to the executable that will be used to interpret this script.
This is used by the kernel so that it knows how to run our program.
To make it executable,
we use =chmod= to set the executable bit on the file.
We can then invoke it directly as an executable.
If it were in our =PATH=,
which isn't something I'm going to get into here,
you'd be able to run it like any other command without having to prefix it
with =./=.
We can also do a primitive form of error handling by modifying our
positional parameters like so,
which will show an error message if we don't specify one of them.
Now we replace the =while= loop with =xargs=.
It takes values from standard in and appends them as arguments to the
provided command.
We specify =-n1= to say that only one argument should be read from stdin
for any invocation of the command;
that makes it run a new command for every line of input.
Otherwise it'd just append N URLs as N arguments.
And now we can simply use =-P= to tell it how many processes to use at once.
Here we specify =5=,
meaning =xargs= will run five processes at a time.
You can change that to whatever number makes sense for you.
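For reference,
here's one way the =url-grep= script described above might look;
the error messages and the up-front =:= check are just one possible form
of the ``primitive error handling'' mentioned in these notes:
#+BEGIN_SRC sh
#!/bin/bash
# url-grep: print the URL ($2) if the page does not mention the
# search string ($1).
# The ${N:?msg} expansions abort with an error when an argument is missing.
: "${1:?missing search string}" "${2:?missing URL}"

wget -qO- "$2" | grep -q "$1" || echo "$2"
#+END_SRC
Then, after =chmod=, it runs like any other command:
#+BEGIN_SRC sh
$ chmod +x url-grep
$ ./url-grep 'free software' https://libreplanet.org/2019/speakers/
#+END_SRC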
*** RAW Again: A Research Task :B_againframe:
:PROPERTIES:
:BEAMER_env: againframe
:BEAMER_ref: *A Research Task
:BEAMER_act:
:END:
*** LACKING A Quick-n-Dirty Solution :B_frame:
**** Notes :B_noteNH:
:PROPERTIES:
:BEAMER_env: noteNH
:END:
So this was the research task that we started with.
If I were to approach this problem myself,
I'd write a quick-and-dirty, throwaway script,
because it's such a simple problem.
So,
let's combine everything we've seen so far:
*** RAW A Quick-n-Dirty Solution :B_frame:
:PROPERTIES:
:BEAMER_env: frame
:END:
#+BEGIN_SRC sh
echo 'wget -qO- "$2" | grep -q "$1" || echo "$2"' > url-grep
$ grep -o 'https\?://[^ ]\+' mail \
| xargs -n1 -P10 bash url-grep 'free software' \
$ xargs -n1 -P10 bash url-grep 'free software' < url-list
| mail -s 'URLs not mentioning free software' mtg@gnu.org
#+END_SRC
#+BEAMER: \begin{onlyenv}<2>\subskip
#+BEGIN_SRC sh
$ wc -l url-list
1000
$ time xargs -n1 -P10 bash url-grep 'free software' < url-list
real 0m17.548s
user 0m8.283s
sys 0m4.877s
#+END_SRC
#+BEAMER: \end{onlyenv}
**** Notes :B_noteNH:
:PROPERTIES:
:BEAMER_env: noteNH
:END:
I'd first echo the pipeline into =url-grep=.
Instead of making it executable,
I'll just pass it as an argument to =bash= instead,
which saves me a step;
it's a temporary file anyway.
I used 10 processes instead of 5.
And then to top it all off,
if you have an MTA configured on your system,
we can just pipe the output to the =mail= command to send that URL list
directly to me.
It only takes a minute or so to come up with this script.
But how long does it take to run?
I took a few URLs for the FSF, Wikipedia, and Google and just repeated them
in a file so that I had 1000 of them.
Running the =xargs= command,
it finishes in under 18 seconds on my system at home.
So in well under two minutes,
the task has been automated away and completed,
all by gluing together existing programs.
You don't need to be a programmer to know how to do this;
you just need to be familiar with the tools and know what's possible,
which comes with a little bit of practice.
This is certainly an efficient means of communicating with the machine.
** Thank You :B_fullframe:
:PROPERTIES: