When the client requests a reload (e.g. on ACK wait failure), we should not
terminate, since we are expecting to attempt the request again.
This was broken by the previous commit.
DEV-10806
It's a tad embarrassing that this has been eluding me for quite some
time. I happened to run into it while testing the previous commit, which in
turn only existed because I was trying to optimize runner performance.
We'd have situations where, following a runner reload (exit code 129 =
SIGHUP), the build would simply hang indefinitely. Apparently, `tame`, in
`command-runner`, blocks on a `read` without a timeout, expecting that the
FIFO attached to stdin will close if the runner crashes. Maybe that used to
be the case, but that is no longer true today.
Because of that, the FIFO stays open, and read continues to block, waiting
for `DONE`.
Now, `tamed`, when seeing that a runner has crashed (which could have been
due to a reload), will check to see if that runner is marked busy. If so,
that means that the client `tame` did not see `DONE`, because it did not
clear the flag via `command-runner`'s `mark-available`. To notify the
client that something went wrong, `tamed` will inject a `DONE` into the output
FIFO, which will allow the client to fail properly.
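To make the failure mode concrete, here is a minimal, self-contained sketch of
the idea, not the actual `tamed` code. The file names, the `busy` flag being a
file, and the `DONE <status>` payload are all assumptions for illustration. The
FIFO is held open read/write so the reader never sees EOF when the stand-in
runner is killed, which is the condition that caused the hang.

```bash
# Hypothetical sketch; paths and the "DONE <status>" payload are assumptions.
dir=$(mktemp -d)
mkfifo "$dir/out"
: > "$dir/busy"        # set when a request starts; cleared once DONE is seen

# Hold the FIFO open read/write: a reader will not get EOF when the runner
# dies, so a bare `read` would block forever.
exec 3<>"$dir/out"

sleep 60 &             # stand-in for a runner that never gets to write DONE
runner=$!
kill -KILL "$runner"   # e.g. the OOM killer
status=0
wait "$runner" 2>/dev/null || status=$?   # 137 = 128 + SIGKILL (9)

# Server side: the busy flag is still set, so the client never saw DONE.
# Inject one so that the client can fail instead of hanging.
if [ -e "$dir/busy" ]; then
    echo "DONE $status" >&3
fi

# Client side: the blocking read now returns.
read -r reply <&3
rm -f "$dir/busy"
echo "client saw: $reply"

exec 3>&-
rm -rf "$dir"
```

Run as-is, this prints `client saw: DONE 137`; without the injected line the
`read` would never return.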
`dslc` catches exceptions and should output `DONE` under normal operating
conditions. However, since some of our systems require so much memory to
build, we may encounter the OOM killer. In that case, the process has no
time to recover (it is killed with SIGKILL), and therefore cannot output
`DONE`. I suspect this is what has been happening to cause occasional build
hangs.
One final thing to clean this up: since we're properly handling reloads now,
based on this commit and the immediately preceding one, we can suppress the
warning when the code is 129 (see comments).
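A rough sketch of that suppression, assuming a hypothetical helper named
`report_runner_exit` (the real logic lives wherever `tamed` reports runner
exits, and may look different). The message text mirrors the warning that
`tamed` prints today.

```bash
# Hypothetical helper; not the actual tamed function.
report_runner_exit() {
    local id=$1 code=$2

    # 129 = 128 + SIGHUP (1): a reload that the client asked for, which is
    # now handled properly, so it is not worth a warning.
    if [ "$code" -ne 129 ]; then
        echo "warning: runner $id exited with code $code; restarting" >&2
    fi
}
```

For example, `report_runner_exit 0 129` is silent, while
`report_runner_exit 0 137` still warns.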
DEV-10806
The `tame` client has the ability to request a runner reload by issuing
SIGHUP to the runner PID. This will cause `tamed` to kill the runner and
respawn it.
There were situations where this process hung or did not operate as
expected; the process was not reliable. This does a number of things to
make sure the process is in a good state before proceeding:
1. The HUP trap is set a single time, rather than being reset each time
the signal is received for the runner.
2. A `reloading` flag is set by the `tame` client when reloading is
requested, and the flag is cleared by `tamed` when spawning a new
runner. This allows the client to wait until the reload is complete,
and bail out otherwise. Without this, there is a race, where the
client may proceed to issue additional requests before the original
runner terminates.
3. Kill the runner with SIGKILL (signal 9). This gives no opportunity for
the runner to ignore the signal, and gives us confidence that the
runner is actually going to be killed.
This may have caused errors that look like this (where 129 is the HUP
reload, which is expected):
warning: failed runner 0 ack; requesting reload
warning: runner 0 exited with code 129; restarting
error: runner 0 still unresponsive; giving up
The last line may also be omitted, with instead an empty `xmlo` being
generated.
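The client half of the `reloading` handshake from item 2 could be sketched
roughly as follows. This is an illustration under assumptions, not the code in
`bin/tame`: the flag is modeled as a file, and the function name, retry count
and polling interval are made up.

```bash
# Hypothetical sketch of the client side of the handshake.
request_reload() {
    local flag=$1 pid=$2 tries=${3:-50}

    # Set the flag *before* signalling, so there is no window in which the
    # server has already respawned but the flag has not been set yet.
    : > "$flag"
    kill -HUP "$pid" 2>/dev/null || return 1

    # Wait for the server to clear the flag (done when the new runner is
    # spawned); bail out rather than waiting forever.
    while [ -e "$flag" ]; do
        tries=$((tries - 1))
        [ "$tries" -gt 0 ] || return 1
        sleep 0.1
    done
}
```

A caller would proceed with further requests only if `request_reload` returns
zero, which avoids talking to the original runner before it has terminated.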
DEV-10806
RSG (Ryan Specialty Group) recently announced a rename to Ryan Specialty (no
"Group"), but I'm not sure if the legal name has been changed yet or not, so
I'll wait on that.
This provides logging that can be used to analyze jobs. See `tamed --help`
for some examples. More to come.
You'll notice that one of the examples represents package build time in
_minutes_. This is why TAMER is necessary; as of the time of writing, the
longest-building package is nearly five and a half minutes, and there are a
number of packages that take a minute or more. But, there are potentially
other optimizations that can be done. And this is _after_ many rounds of
optimizations over the years. (TAME was not originally built for what it is
currently being used for.)
This is something that I've wanted to do for quite some time, but for good
reason, have been avoiding.
`tamed --report` is fairly basic right now, but allows you to see what each
of the runners is doing. This will be expanded further to gather data for
further analysis.
The thing that I was avoiding was a status line during the build to
summarize what the runners are doing, since it's nearly impossible to do so
from the build output with multiple runners. This will not only allow me to
debug more easily, but will keep the output plainly visible to developers at
all times in the hope that it can help them improve the build times
themselves in certain cases.
It is currently gated behind TAMED_TUI, since, while it works well overall,
it is imperfect, and will cause artifacts from build output partly
overwriting the status line, and may even occasionally clobber the PS1 by
erasing the line. This will be improved upon in the future; something is
better than nothing.
tamed was originally designed with support for parallel builds in mind, but
I hadn't completed that work because we didn't have enough hardware that
we'd benefit strongly from it. That has since changed.
tamed will now spawn additional runners as needed to fulfill requests, which
works around the issue of not knowing how many jobs GNU Make is going to try
to do at once.
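The on-demand behavior can be pictured with a sketch like the one below. It is
only an illustration: runners are modeled as numbered directories with a `busy`
marker file, which is an assumption and not necessarily how `tamed` tracks
them, and "spawning" is reduced to a `mkdir`.

```bash
# Hypothetical sketch: pick an idle runner, or create the next one.
acquire_runner() {
    local root=$1 id=0

    while [ -d "$root/$id" ]; do
        if [ ! -e "$root/$id/busy" ]; then
            : > "$root/$id/busy"     # idle runner found; claim it
            echo "$id"
            return 0
        fi
        id=$((id + 1))
    done

    # Every existing runner is busy: "spawn" the next one. A real
    # implementation would start the runner process here.
    mkdir "$root/$id"
    : > "$root/$id/busy"
    echo "$id"
}
```

The point is that the number of runners follows demand, so nothing needs to
know in advance how many jobs Make will run. Note that the check-then-claim
above is not atomic; a real implementation would need to guard against two
clients claiming the same runner.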
There were a couple minor dependency fixes/workarounds for now in the
Makefile, but otherwise everything appears to be working great.
This now uses year ranges, which I'll update annually.
This also renames "R-T Specialty" to "Ryan Specialty Group". The latter is
the parent company of the former. I was originally employed under the
former when LoVullo Associates was purchased, but I now work for the parent
company.
This will ensure that tamed does not stall while e.g. make is still
running. This makes TAMED_STALL_SECONDS almost useless; maybe it'll be
removed in future versions.
* bin/tame (TAMED_SPAWNER_PID): Export variable.
* bin/tamed (TAMED_SPAWNER_PID): New variable, default to PPID.
(spawner-dead): New function.
(stall-monitor): Use it.
(usage): Update documentation.
* build-aux/Makefile.am: Set TAMED_SPAWNER_PID to own id and export.
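For illustration, a liveness check of this kind is commonly written with
`kill -0`. The following is a sketch of that technique, not necessarily the
body of `spawner-dead` (it is spelled with an underscore here to mark it as a
stand-in).

```bash
# Sketch only; the real spawner-dead may be implemented differently.
spawner_dead() {
    # `kill -0` delivers no signal; it only reports whether the PID exists
    # and may be signalled by us. Default to the parent, as tamed does.
    # Caveat: a live process owned by another user also fails this check.
    ! kill -0 "${TAMED_SPAWNER_PID:-$PPID}" 2>/dev/null
}
```

The stall monitor can then keep the server alive for as long as
`spawner_dead` returns non-zero, regardless of how long it has been idle.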
This tries to be a bit more resilient in case a runner becomes unresponsive,
rather than waiting for tamed to kill itself.
* bin/tame (RUNNER_CMD_WAITTIME): New variable.
(command-runner): Tell runner to reload if it does not respond in
RUNNER_CMD_WAITTIME seconds.
(verify-runner-ack): New function.
* bin/tamed (mkfifos): Only keep stdin open. stdout isn't necessary, and
may have actually been causing subtle issues.
(spawn-runner): Support restarting dslc on SIGHUP.
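The ack wait amounts to a `read` with a timeout. Below is a bash sketch under
assumptions: the function name and arguments are made up, and any line
received within the wait time is treated as the ack, which may not match the
actual protocol. The warning text is the one `tame` prints.

```bash
#!/bin/bash
# Hypothetical sketch; not the verify-runner-ack in bin/tame.
RUNNER_CMD_WAITTIME=${RUNNER_CMD_WAITTIME:-1}

# verify_ack ID FD: wait up to RUNNER_CMD_WAITTIME seconds for a line on FD.
verify_ack() {
    local id=$1 fd=$2 line

    if read -r -t "$RUNNER_CMD_WAITTIME" -u "$fd" line; then
        echo "$line"
        return 0
    fi

    # No response in time; the caller would now ask the runner to reload.
    echo "warning: failed runner $id ack; requesting reload" >&2
    return 1
}
```

`read -t` and `read -u` are bash features, which is fine here since `tame` is
a bash script.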
This is a major step toward normalcy---removing the kluge of a build process
that was causing so many issues. Rather than echoing all operations to a
queue file before passing it off to dslc, the new build scripts in `bin/'
are used to invoke tame normally, as needed. This solves all of the current
issues with things not rebuilding when they should. And, as a bonus, tab
completion on targets works.
Sorry this took so long. There wasn't much motivation until we hired so
many people that are suffering from this.
This does a few major things, along with some miscellaneous others:
- Invoke bin/tame directly;
- Merge Makefile.2.in into Makefile.am; and
- Fix up some targets.
* build-aux/Makefile.2.in: Delete file. Mostly merged with Makefile.am.
* build-aux/Makefile.am: Add a bunch of new targets and definitions from
Makefile.2.in. Modify all that previously used .cqueue to now invoke
`$(TAME)' directly. Remove miscellaneous targets for trying to proxy
targets to Makefile.2.
(saneout, _go): Remove definitions.
(.NOTPARALLEL): Add to prevent parallel builds.
(ui/program.expanded.xml)[.version.xml]: Remove dependency for now.
(clean): Also clean generated PHP files. Follow symlinks to clean core.
This is still incomplete (does not clean all rate table stuff).
(suppliers.mk)[xmlo_cmd]: Remove. See `gen-make' and `gen-c1make'.
(lvroot)[summary-html]: New dependency.
(kill-tamed, tamed-die): New targets (former alias of latter) to kill
tamed.
* build-aux/gen-c1make: Generate `$(TAME)' invocation.
* build-aux/gen-make: Likewise. Remove `xmlo_cmd' output. Ignore recursive
`tame' symlink (this can be removed once we clean `rater/' up).
* build-aux/m4/calcdsl.m4 (TAME): Update description to reflect that it
should now be the path to `bin/tame'. Adjust `AC_CHECK_FILE' lines
accordingly.
(tame_needed_ver): Remove. We have been in the same repo as TAME itself
for quite some time. Remove associated code.
(AC_CONFIG_FILES): Remove `Makefile.2'.
* src/current/src/com/lovullo/dslc/DslCompiler.java (_DslCompiler)[compile]:
Perform validation before `compile' command rather than a separate
`validate' step. Remove `rm'.
[compileSrc]: Stop echoing command. This was only necessary because of
the previous Makefile klugery; now Make echoes on its own correctly.
These scripts allow the TAME compiler stack to be invoked naturally, rather
than requiring the use of a Makefile, as is the case today. This will not
only allow users to more easily invoke the compiler, but will also allow us
to invoke TAME naturally from the Makefile and remove the klugery that has
existed for so long.
This uses a server/client architecture in order to mitigate the startup
cost of the JVM. More documentation will follow.
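Until that documentation exists, the general shape can be sketched with a pair
of FIFOs. Everything here is illustrative: the FIFO names, the `QUIT` request
and the reply format are assumptions, and `serve` is a shell stand-in for the
long-lived JVM process. Only the `DONE` terminator corresponds to what the
compiler outputs.

```bash
# Hypothetical sketch of a long-lived server with short-lived requests.
dir=$(mktemp -d)
mkfifo "$dir/in" "$dir/out"

# Server: any expensive startup happens once, before this loop, and is then
# shared by every request that follows.
serve() {
    while read -r req; do
        [ "$req" = "QUIT" ] && break
        echo "compiled $req"
        echo "DONE"               # marks the end of one reply
    done
}
serve <"$dir/in" >"$dir/out" &
server=$!

# Client: keep both FIFOs open for the life of the session.
exec 3>"$dir/in" 4<"$dir/out"

# Send one request and relay the reply up to (not including) DONE.
request() {
    echo "$1" >&3
    while read -r line <&4 && [ "$line" != "DONE" ]; do
        echo "$line"
    done
}

first=$(request a.xml)
second=$(request b.xml)
echo "$first"
echo "$second"

echo "QUIT" >&3
wait "$server"
exec 3>&- 4<&-
rm -rf "$dir"
```

Both requests are answered by the same server process, which is what lets the
startup cost be paid only once.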
Note that there are a bunch of symlinks in rater/---this is a transition
step to allow the build to continue working as it did before, which relies
on a directory structure that exists outside of this repository. This will
be cleaned up in the future.
* .gitignore (bin/dslc): Add ignore for generated file.
* bin/dslc.in: New script to encapsulate Java invocation.
* bin/tame: New script (client).
* bin/tamed: New script (server).
* configure.ac (JAVA_OPTS, DSLC_CLASSPATH, AUTOGENERATED): New variables for
dslc.in. Output bin/dslc.
* rater/README.md: Note that this symlink mess is temporary.
* rater/c1map: New symlink for dslc assumptions.
* rater/c1map.xsl: Likewise.
* rater/calc.xsd: Likewise.
* rater/compile.xsl: Likewise.
* rater/compiler: Likewise.
* rater/dot.xsl: Likewise.
* rater/include: Likewise.
* rater/link.xsl: Likewise.
* rater/standalone.xsl: Likewise.
* rater/summary.xsl: Likewise.
* rater/tame: Likewise (warning: circular symlink).
* src/current/src/com/lovullo/dslc/DslCompiler.java (_DslCompiler)[compile]:
Output `DONE' lines.