Program Composition: Talk notes
@ -64,7 +64,9 @@
- [ ] Users need to be able to convey their thoughts to the computer
  without being programmers
- [ ] Concise primitives / building blocks
- [ ] Readline (history and editing)
- [ ] Readline (history and editing) [0/2]
  - [ ] Develop as you go, perhaps just referencing history early on
  - [ ] Transfer commands from history into scripts and aliases for re-use
- [X] Regular expressions ([[*Perspective Topics][Perspective Topics]])
- [ ] Remote commands via SSH
- [X] Text as a universal interface
@ -167,14 +169,14 @@
| \_ Program Composition | | LACKING | |
| \_ Composition Topics | | | |
| \_ Clarifying Pipelines | | RAW | fullframe |
| \_ Tor | | LACKING | fullframe |
| \_ LP Sessions | | LACKING | fullframe |
| \_ Interactive, Incremental, Iterative Development | | LACKING | fullframe |
| \_ Discovering URLs | | LACKING | fullframe |
| \_ Go Grab a Coffee | | LACKING | fullframe |
| \_ Async Processes | | | fullframe |
| \_ Executable Shell Script and Concurrency | | LACKING | fullframe |
| \_ Again: A Research Task | | LACKING | againframe |
| \_ Tor | | RAW | fullframe |
| \_ LP Sessions | | RAW | fullframe |
| \_ Interactive, Incremental, Iterative Development | | RAW | fullframe |
| \_ Discovering URLs | | RAW | fullframe |
| \_ Go Grab a Coffee | | RAW | fullframe |
| \_ Async Processes | | RAW | fullframe |
| \_ Executable Shell Script and Concurrency | | RAW | fullframe |
| \_ Again: A Research Task | | RAW | againframe |
| \_ A Quick-n-Dirty Solution | | LACKING | frame |
|-------------------------------------------------------+----------+---------+-------------|
| \_ Thank You | 00:00:01 | | fullframe |
@ -1696,16 +1698,14 @@ We start to think of how to decompose problems into small operations that
We think of how to chain small, specialized programs together,
transforming text at each step to make it more suitable for the next.

** LACKING Program Composition [0/9]
*** Composition Topics [5/6] :noexport:
** RAW Program Composition [0/10]
*** Composition Topics [6/6] :noexport:
- [X] Clarify how pipelines work with existing =wget | grep=.
- [X] More involved pipeline with more than two programs.
- [ ] Emphasize iterative development and how the shell is a REPL. [0/3]
- [X] Emphasize iterative development and how the shell is a REPL.
  - Useful for programmers for prototyping and debugging, but also essential
    to average users for discovery.
  - [ ] Develop as you go, perhaps just referencing history early on
  - [ ] Evolve by making portions of command dynamic (variables, subshells)
  - [ ] Transfer commands from history into scripts and aliases for re-use
  - Evolve by making portions of command dynamic (variables, subshells)
- [X] Now script discovering what pages contain a certain word [3/3]
  - [X] Mention previous example of being emailed a list of URLs. Rather
    than pasting them into a file, let's discover them using the same
@ -1782,8 +1782,9 @@ We can pipe it to =wc= instead,
What about the number of lines that contain the string ``free software''?

Or how about the last such line?
It's all a simple matter of composing existing programs.
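
For instance (a sketch; the URL is just a stand-in, and =wget -qO-=
writes the page to standard output as on the earlier slides):

#+BEGIN_SRC sh
$ wget -qO- https://www.gnu.org | grep 'free software' | wc -l
$ wget -qO- https://www.gnu.org | grep 'free software' | tail -n1
#+END_SRC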

*** LACKING Tor :B_fullframe:
*** RAW Tor :B_fullframe:
:PROPERTIES:
:BEAMER_env: fullframe
:END:
@ -1792,8 +1793,21 @@
$ alias fetch-url='torify wget -qO-'
#+END_SRC

**** Notes :B_noteNH:
:PROPERTIES:
:BEAMER_env: noteNH
:END:

*** LACKING LP Sessions :B_fullframe:
By the way,
retrieving a bunch of URLs in an automated manner may be a privacy concern
for you.
You can easily send all these requests through Tor,
assuming it is installed and the daemon running,
by prefixing =wget= with =torify=.
Since we abstracted our fetching away into the =fetch-url= alias,
our previous examples continue to work as-is.
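
As a sketch of that point (the page and search string are stand-ins),
an earlier count now runs through Tor unchanged:

#+BEGIN_SRC sh
$ alias fetch-url='torify wget -qO-'
$ fetch-url https://www.gnu.org | grep 'free software' | wc -l
#+END_SRC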

*** RAW LP Sessions :B_fullframe:
:PROPERTIES:
:BEAMER_env: fullframe
:END:
@ -1926,7 +1940,71 @@ $ fetch-url https://libreplanet.org/2019/speakers/ \
#+END_SRC
#+BEAMER: \end{onlyenv}

*** LACKING Interactive, Incremental, Iterative Development :B_fullframe:
**** Notes :B_noteNH:
:PROPERTIES:
:BEAMER_env: noteNH
:END:

How about something more involved?
I noticed that some talks had multiple speakers,
and I wanted to know which ones had the /most/ speakers.

The HTML of the speakers page includes a header for each speaker.
Here are the first two.
These are keynote speakers,
but there are also non-keynote ones that are just =speaker-header=.

Let's get just the talk titles that those speakers are associated with.
Looking at this output,
we see that the talk titles have an =em= tag,
so let's just go with that.

Uh oh.
It looks like at least one of those results has /multiple/ talks.
But note that each is enclosed in its own set of =em= tags.
If we add =-o= to =grep=,
which stands for /only/,
then it'll only return the portion of the line that matches,
rather than the entire line.
Further,
if there are multiple matches on a line,
it'll output each match independently on its own line.
That's exactly what we want!
But we have to modify our regex a little bit to prevent it from grabbing
everything between the first and /last/ =em= tag,
by prohibiting it from matching on a less-than character in the title.

Now assuming that the talk titles are consistent,
we can get a count.
=uniq= has the ability to collapse consecutive lines that are identical
and output a count of each.
We also use =-d= to tell it to only output duplicates.
But =uniq= requires sorted input,
so we first pipe the list through =sort=.
That gives us a count of each talk!

But I want to know the talks with the most speakers,
so let's sort it /again/,
this time numerically and in reverse order.

And we have our answer!

But just for the hell of it,
let's go a step further.
Using =sed=,
which stands for /stream editor/,
we can match on portions of the input and reference those matches in a
replacement.
So we can reformat the =uniq= output into an English sentence,
like so.

And then we're going to pipe it to the program =espeak=,
which is a text-to-speech synthesizer.
Your computer will speak the top five talks by presenter count to you.
Listening to computers speak is all the rage right now,
right?
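
Reconstructed as a single pipeline, this might look roughly like the
sketch below (the regexes and the =head= cutoff are assumptions, not
the slide's verbatim commands):

#+BEGIN_SRC sh
$ fetch-url https://libreplanet.org/2019/speakers/ \
    | grep -o '<em>[^<]*</em>' \
    | sort \
    | uniq -cd \
    | sort -rn \
    | head -n5 \
    | sed 's#^ *\([0-9]\+\) <em>\(.*\)</em>#\2 has \1 speakers#' \
    | espeak
#+END_SRC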

*** RAW Interactive, Incremental, Iterative Development :B_fullframe:
:PROPERTIES:
:BEAMER_env: fullframe
:END:
@ -1937,7 +2015,42 @@ Incremental Development
#+BEAMER: \fullsubtext
Interactive REPL, Iterative Decomposition

*** LACKING Discovering URLs :B_fullframe:
**** Notes :B_noteNH:
:PROPERTIES:
:BEAMER_env: noteNH
:END:

Notice how we approached that problem.
I presented it here just as I developed it.
I didn't open my web browser and inspect the HTML;
I just looked at the =wget= output and then started to manipulate it in
useful ways, working toward my final goal.
And this is part of what makes working in a shell so powerful.

In software development,
we call environments like this REPLs,
which stands for ``read-eval-print loop''.
The shell reads a command line,
evaluates it,
prints a result,
and then does that all over again.
As a hacker,
this allows me to easily inspect and iterate on my script in real time,
which can be a very efficient process.
I can quickly prototype something and then clean it up later.
Or maybe create a proof-of-concept in shell before writing the actual
implementation in another language.

But most users aren't programmers.
They aren't experts in these commands;
they have to play around and discover as they go.
And the shell is perfect for this discovery.
If something doesn't work,
just keep trying different things and get immediate feedback!
This is also really helpful when you're trying to craft a suitable regular
expression.

*** RAW Discovering URLs :B_fullframe:
:PROPERTIES:
:BEAMER_env: fullframe
:END:
@ -2077,7 +2190,78 @@ $ xclip -i -selection clipboard < results.txt
#+END_SRC
#+BEAMER: \end{onlyenv}

*** LACKING Go Grab a Coffee :B_fullframe:
**** Notes :B_noteNH:
:PROPERTIES:
:BEAMER_env: noteNH
:END:

Okay, back to searching webpages at URLs.
Now that we have a means of creating the list,
how do we feed the URLs to our pipeline?
Why not pull them right out of the email with =grep=?

Let's say you saved the email in =email-of-links.msg=.
This simple regex should grab most URLs for both HTTP and HTTPS protocols.
Here's some example output with a few URLs.

For each of these,
we need to run our pipeline.
It's time to introduce =while= and =read=.
=while= will continue to execute its body in a loop until its command fails.
=read= will read line-by-line into one or more variables,
and will fail when there are no more lines to read.

So if we insert our =fetch-url= pipeline into the body,
we get this.
But if we just redirect output into =results.txt=,
we can't see the output unless we inspect the file.
For convenience,
let's use =tee=,
which is named for a pipe tee;
it'll send output through the pipeline while also writing the same
output to a given file.
The =-a= flag tells it to /append/ rather than overwrite.
So now we can both observe the results and have them written to a file!

But we were just going to reply to an email with those results.
Let's assume we're still using a GUI email client.
Wouldn't it be convenient if those results were on the clipboard for us
already so we can just paste them into the message?
We can accomplish that by piping to =xclip= as shown here.
There's also the program =xsel=,
which I typically use because its arguments are far more concise,
but I don't show it here.

Ah, crap, but now we can't see the output again.
So let's use =tee= again.
But rather than outputting to a file on disk,
we're going to use a special notation that tells bash to invoke a command
in a subshell and replace that portion of the command line with a path
to a virtual file representing the standard input of that subshell.
Now we can see the output again!

Well,
if we're /writing/ to the clipboard,
why don't we just /read/ from it too?
Instead of saving our mail to a file,
we can just copy the relevant portion and have that piped directly to
=grep=!

Because we're writing to =results.txt=,
another option is to just let it run and copy to the clipboard at a later
time.
We can do that by reading =results.txt= in place of standard input to
=xclip=,
as shown here.

And while we're at it,
here's a special notation to get rid of =echo= for the =tee= in the body
of =while=:
three less-than symbols provide the given string on standard in.

Phew!
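
Putting those steps together might look something like this sketch
(the slide's exact commands may differ; the search string and
=email-of-links.msg= are from the notes above):

#+BEGIN_SRC sh
# Pull URLs from the email, report pages that do NOT mention the
# string, append them to results.txt while still printing them, and
# mirror the output onto the clipboard via process substitution.
$ grep -o 'https\?://[^ ]\+' email-of-links.msg \
    | while read URL; do
        fetch-url "$URL" | grep -q 'free software' \
          || tee -a results.txt <<< "$URL"
      done \
    | tee >(xclip -i -selection clipboard)
#+END_SRC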

*** RAW Go Grab a Coffee :B_fullframe:
:PROPERTIES:
:BEAMER_env: fullframe
:END:
@ -2090,14 +2274,16 @@ Go Grab a Coffee
:BEAMER_env: noteNH
:END:

...
Remember when I said I could go grab a coffee and play with the kids while
the script did its thing?
Well, now's that time.

But grabbing a coffee means that this system is a bottleneck.
Ideally,
we wouldn't have to wait long.
The Internet is fast nowadays;
ideally, we wouldn't have to wait long.
Can we do better?

*** Async Processes :B_fullframe:
*** RAW Async Processes :B_fullframe:
:PROPERTIES:
:BEAMER_env: fullframe
:END:
@ -2119,7 +2305,41 @@ $ while read URL; do
#+END_SRC
#+BEAMER: \end{uncoverenv}

*** LACKING Executable Shell Script and Concurrency :B_fullframe:
**** Notes :B_noteNH:
:PROPERTIES:
:BEAMER_env: noteNH
:END:

Indeed we can.
This process is executing serially---one
URL at a time,
waiting for one to complete before checking another.
What if we could query multiple URLs in parallel?

Shells have built-in support for backgrounding tasks so that they can run
while you do other things;
all you have to do is place a single ampersand at the end of a command.
So in this example,
we sleep for one second and then echo ``done''.
But that sleep and subsequent echo are put into the background,
and the shell proceeds to first execute =echo start=.
One second later,
it outputs ``done''.

So here's the loop we were just writing.
If we add an ampersand at the end of that pipeline,
it'll run in the background and immediately proceed to the next URL,
executing the loop again.

But there's a problem with this approach.
Sure,
it's fine if we only have a few URLs.
But what if we have 1000?
Do we really want to spawn 1000s of processes and make 1000 network requests
at once?
That isn't efficient.
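
The backgrounding example described in these notes, as a sketch (in
bash, the trailing ampersand backgrounds the whole =sleep ... && echo=
list):

#+BEGIN_SRC sh
$ sleep 1 && echo done &
$ echo start
start
done
#+END_SRC

And the backgrounded loop would look roughly like this (reading from a
hypothetical =url-list= file):

#+BEGIN_SRC sh
$ while read URL; do
    fetch-url "$URL" | grep -q 'free software' || echo "$URL" &
  done < url-list > results.txt
#+END_SRC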

*** RAW Executable Shell Script and Concurrency :B_fullframe:
:PROPERTIES:
:BEAMER_env: fullframe
:END:
@ -2185,25 +2405,134 @@ $ xargs -n1 -P5 ./url-grep 'free software' > results.txt
#+END_SRC
#+BEAMER: \end{onlyenv}

*** LACKING Again: A Research Task :B_againframe:
**** Notes :B_noteNH:
:PROPERTIES:
:BEAMER_env: noteNH
:END:

Before we continue,
we're going to have to write our pipeline in a way that other programs can
run it.
Up to this point,
the program has just been embedded within the shell.
But one of the nice things about shell is that you can take what you entered
into the command line and paste it directly into a file and,
with some minor exceptions,
it'll work all the same.

Let's take our pipeline and name it =url-grep=.
Aliases only work in interactive sessions,
so we're going to just type =wget= directly here.
Alternatively,
you can define a function.
We use the positional parameters =1= and =2= here to represent the
respective arguments to the =url-grep= command.

The comment at the top of the file is called a ``she-bang'' and contains the
path to the executable that will be used to interpret this script.
This is used by the kernel so that it knows how to run our program.

To make it executable,
we use =chmod= to set the executable bit on the file.
We can then invoke it as we would any other program.
If it were in our =PATH=,
which isn't something I'm going to get into here,
you'd be able to run it like any other command without having to prefix it
with =./=.

We can also do a primitive form of error handling by modifying our
positional parameters like so,
which will show an error message if we don't specify one of them.

Now we replace the =while= loop with =xargs=.
It takes values from standard in and appends them as arguments to the
provided command.
We specify =-n1= to say that only one argument should be read from stdin
for any invocation of the command;
that makes it run a new command for every line of input.
Otherwise it'd just append N URLs as N arguments.

And now we can simply use =-P= to tell it how many processes to use at once.
Here we specify =5=,
meaning =xargs= will run five processes at a time.
You can change that to whatever number makes sense for you.
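
A sketch of what =url-grep= might look like from these notes (the
she-bang path and the error messages are assumptions):

#+BEGIN_SRC sh
#!/bin/bash
# url-grep: print URL ($2) if the page does not mention PATTERN ($1).
# ${N?msg} aborts with an error when positional parameter N is unset.
wget -qO- "${2?missing URL}" \
  | grep -q "${1?missing search pattern}" \
  || echo "$2"
#+END_SRC

Made executable and run in parallel (reading from a hypothetical
=url-list= file):

#+BEGIN_SRC sh
$ chmod +x url-grep
$ xargs -n1 -P5 ./url-grep 'free software' < url-list > results.txt
#+END_SRC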

*** RAW Again: A Research Task :B_againframe:
:PROPERTIES:
:BEAMER_env: againframe
:BEAMER_ref: *A Research Task
:BEAMER_act:
:END:

*** LACKING A Quick-n-Dirty Solution :B_frame:
**** Notes :B_noteNH:
:PROPERTIES:
:BEAMER_env: noteNH
:END:

So this was the research task that we started with.

If I were to approach this problem myself,
I'd write a quick-and-dirty script that is just throwaway,
because it's such a simple problem.
So,
let's combine everything we've seen so far:

*** RAW A Quick-n-Dirty Solution :B_frame:
:PROPERTIES:
:BEAMER_env: frame
:END:

#+BEGIN_SRC sh
$ echo 'wget -qO- "$2" | grep -q "$1" || echo "$2"' > url-grep
$ grep -o 'https\?://[^ ]\+' mail \
    | xargs -n1 -P10 bash url-grep 'free software' \
    | mail -s 'URLs not mentioning free software' mtg@gnu.org
#+END_SRC

#+BEAMER: \begin{onlyenv}<2>\subskip
#+BEGIN_SRC sh
$ wc -l url-list
1000

$ time xargs -n1 -P10 bash url-grep 'free software' < url-list
real    0m17.548s
user    0m8.283s
sys     0m4.877s
#+END_SRC
#+BEAMER: \end{onlyenv}

**** Notes :B_noteNH:
:PROPERTIES:
:BEAMER_env: noteNH
:END:

I'd first echo the pipeline into =url-grep=.
Instead of making it executable,
I'll just pass it as an argument to =bash=,
which saves me a step;
it's a temporary file anyway.
I used 10 processes instead of 5.
And then to top it all off,
if you have an MTA configured on your system,
we can just pipe the output to the =mail= command to send that URL list
directly to me.

It only takes a minute or so to come up with this script.
But how long does it take to run?

I took a few URLs from the FSF, Wikipedia, and Google and just repeated them
in a file so that I had 1000 of them.
Running the =xargs= command,
it finishes in under 18 seconds on my system at home.

So in well under two minutes,
the task has been automated away and completed,
all by gluing together existing programs.
You don't need to be a programmer to know how to do this;
you just need to be familiar with the tools and know what's possible,
which comes with a little bit of practice.

This is certainly an efficient means of communicating with the machine.


** Thank You :B_fullframe:
:PROPERTIES: