
22 Commits (7692d0d848888f93307f05911da5be7d63f820a5)

Author SHA1 Message Date
Mike Gerwitz 3fad6c6375 tamed: Do not inject unexpected exit on explicit reload request
When the client requests a reload (e.g. on ACK wait failure), we should not
terminate, since we are expecting to attempt the request again.

This was broken by the previous commit.

DEV-10806
2023-10-03 15:39:08 -04:00
Mike Gerwitz fb5947d59e Prevent hanging tame client when tamed runner is killed mid-process
It's a tad embarrassing that this has been eluding me for quite some
time.  I happened to run into it while testing the previous commit, which in
turn only existed because I was trying to optimize runner performance.

We'd have situations where, following a runner reload (exit code 129 =
SIGHUP), the build would simply hang indefinitely.  Apparently, `tame`, in
`command-runner`, blocks on a `read` without a timeout, expecting that the
FIFO attached to stdin will close if the runner crashes.  Maybe that used to
be the case, but that is no longer true today.

Because of that, the FIFO stays open, and read continues to block, waiting
for `DONE`.
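
The hang can be reproduced in isolation. This is a minimal sketch, not code from `bin/tame`: the FIFO path and variable names are hypothetical, and the FIFO is opened read-write (which does not block on Linux) to stand in for a descriptor that stays open after the writer has died. The `-t 1` timeout is there only to keep the demonstration finite; without it the `read` would wait forever.

```shell
#!/bin/bash
# Hypothetical names throughout; nothing here is taken from bin/tame.
dir=$(mktemp -d)
fifo="$dir/out"
mkfifo "$fifo"

# Holding the FIFO open read-write means a reader never sees EOF, even
# when no other process will ever write to it (Linux behaviour).
exec 3<>"$fifo"

# Nobody writes DONE, mimicking a runner that was killed mid-request.
if read -r -t 1 line <&3; then
  result="got: $line"
else
  result="timeout"
fi

exec 3<&-
rm -rf "$dir"
echo "$result"
```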

Now, `tamed`, when seeing that a runner has crashed (which could have been
due to a reload), will check to see if that runner is marked busy.  If so,
that means that the client `tame` did not see `DONE`, because it did not
clear the flag via `command-runner`'s `mark-available`.  To notify the
client that something went wrong, `tamed` will inject a `DONE` into the output
FIFO, which will allow the client to fail properly.
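
The check described above might look roughly like the following. This is a sketch under stated assumptions: the per-runner directory, the `busy` flag file, the `notify_if_busy` function, and the bare `DONE` line are all hypothetical stand-ins, since the actual file layout and line format in `bin/tamed` are not shown here.

```shell
#!/bin/bash
# Hypothetical layout: a per-runner directory holding a `busy` flag file and
# an `out` FIFO. None of these names come from bin/tamed.
rundir=$(mktemp -d)
mkfifo "$rundir/out"
exec 4<>"$rundir/out"      # keep the FIFO open, as a long-lived server would
: > "$rundir/busy"         # the client never saw DONE, so the flag is still set

# Called once the server notices that the runner process has gone away.
notify_if_busy() {
  if [ -e "$rundir/busy" ]; then
    echo "DONE" >&4        # assumed line format; the real one may carry more
    rm -f "$rundir/busy"
  fi
}

notify_if_busy

# Client side: the read that would otherwise hang now returns.
read -r -t 1 line <&4 || line="(timed out)"
exec 4<&-
rm -rf "$rundir"
echo "$line"
```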

`dslc` catches exceptions and should output `DONE` under normal operating
conditions.  However, since some of our systems require so much memory to
build, we may encounter the OOM killer.  In that case, the process has no
time to recover (it is killed with SIGKILL), and therefore cannot output
`DONE`.  I suspect this is what has been happening to cause occasional build
hangs.

One final thing to clean this up: since we're properly handling reloads now,
based on this commit and the immediately preceding one, we can suppress the
warning when the code is 129 (see comments).

DEV-10806
2023-10-03 14:14:47 -04:00
Mike Gerwitz 7c9d6837fe Improve runner reload stability
The `tame` client has the ability to request a runner reload by issuing
SIGHUP to the runner PID.  This will cause `tamed` to kill the runner and
respawn it.

There were situations where this process hung or did not operate as
expected; the process was not reliable.  This does a number of things to
make sure the process is in a good state before proceeding:

  1. The HUP trap is set a single time, rather than being reset each time
     the signal is received for the runner.
  2. A `reloading` flag is set by the `tame` client when reloading is
     requested, and the flag is cleared by `tamed` when spawning a new
     runner.  This allows the client to wait until the reload is complete,
     and bail out otherwise.  Without this, there is a race, where the
     client may proceed to issue additional requests before the original
     runner terminates.
  3. Kill the runner with SIGKILL (signal 9).  This gives no opportunity for
     the runner to ignore the signal, and gives us confidence that the
     runner is actually going to be killed.
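
The three points above can be sketched together as follows. This is an illustration only: `sleep 60` stands in for the runner, the `reloading` flag is modelled as a file, and every function name is hypothetical. The handler is called directly rather than by sending a real SIGHUP, so that the example is safe to run in any shell. A process killed by signal N is reported with status 128 + N, which is why SIGKILL shows up as 137 and SIGHUP as 129.

```shell
#!/bin/bash
# Hypothetical names; `sleep 60` stands in for the real runner process.
state=$(mktemp -d)

spawn_runner() {
  sleep 60 &
  runner_pid=$!
  rm -f "$state/reloading"      # server clears the flag once a runner exists
}

reload_runner() {
  kill -KILL "$runner_pid"      # SIGKILL cannot be caught or ignored
  wait "$runner_pid" 2>/dev/null
  last_status=$?                # 128 + 9
  spawn_runner
}

# Point 1: install the trap a single time, at startup.
trap reload_runner HUP

# Point 2, client side: wait for the flag to clear, bail out otherwise.
wait_for_reload() {
  local tries=0
  while [ -e "$state/reloading" ]; do
    tries=$((tries + 1))
    [ "$tries" -gt 50 ] && return 1
    sleep 0.1
  done
}

spawn_runner
: > "$state/reloading"          # client marks the reload as pending
reload_runner                   # invoked directly here instead of via SIGHUP
wait_for_reload && reloaded=yes || reloaded=no

# Clean up the stand-in runner.
kill -KILL "$runner_pid"
wait "$runner_pid" 2>/dev/null
trap - HUP
rm -rf "$state"
echo "$last_status $reloaded"
```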

This may have caused errors that look like this (where 129 is the HUP
reload, which is expected):

  warning: failed runner 0 ack; requesting reload
  warning: runner 0 exited with code 129; restarting
  error: runner 0 still unresponsive; giving up

The last line may also be omitted, with instead an empty `xmlo` being
generated.

DEV-10806
2023-10-03 14:14:47 -04:00
Mike Gerwitz 954b5a2795 Copyright year and name update
Ryan Specialty Group (RSG) rebranded to Ryan Specialty after its IPO.
2023-01-20 23:37:30 -05:00
Mike Gerwitz 1ad2fb1dc8 Copyright year update 2022
RSG (Ryan Specialty Group) recently announced a rename to Ryan Specialty (no
"Group"), but I'm not sure if the legal name has been changed yet or not, so
I'll wait on that.
2022-05-03 14:14:29 -04:00
Mike Gerwitz 8b255c2251 tame: tamed --help: Add missing closing quote to awk example 2022-01-26 13:51:34 -05:00
Mike Gerwitz 8fbddfb3b3 tamed: Fix --help and add another reporting example
$2 was not escaped and would fail expansion.  I apparently did not run
--help before committing.  Shame on me.
2022-01-20 23:32:28 -05:00
Mike Gerwitz 6fd570477a tamed: Add runtab and TAMED_RUNTAB_OUT
This provides logging that can be used to analyze jobs.  See `tamed --help`
for some examples.  More to come.

You'll notice that one of the examples represents package build time in
_minutes_.  This is why TAMER is necessary; as of the time of writing, the
longest-building package is nearly five and a half minutes, and there are a
number of packages that take a minute or more.  But, there are potentially
other optimizations that can be done.  And this is _after_ many rounds of
optimizations over the years.  (TAME was not originally built for what it is
currently being used for.)
2022-01-19 16:47:12 -05:00
Mike Gerwitz 4a3b86f480 tamed: Ignore SIGUSR2
This was originally going to tell tamed to redraw the runner status line,
but a different approach was taken.
2022-01-19 15:41:28 -05:00
Mike Gerwitz c72d908a3f tamed: Add missing --report to help
Missing from previous commit.
2022-01-19 13:29:23 -05:00
Mike Gerwitz 756dcd7894 tamed --report and runner status line (TAMED_TUI)
This is something that I've wanted to do for quite some time, but for good
reason, have been avoiding.

`tamed --report` is fairly basic right now, but allows you to see what each
of the runners are doing.  This will be expanded further to gather data for
further analysis.

The thing that I was avoiding was a status line during the build to
summarize what the runners are doing, since it's nearly impossible to do so
from the build output with multiple runners.  This will not only allow me to
debug more easily, but will keep the output plainly visible to developers at
all times in the hope that it can help them improve the build times
themselves in certain cases.

It is currently gated behind TAMED_TUI, since, while it works well overall,
it is imperfect, and will cause artifacts from build output partly
overwriting the status line, and may even occasionally clobber the PS1 by
erasing the line.  This will be improved upon in the future; something is
better than nothing.
2022-01-19 11:51:48 -05:00
Mike Gerwitz 2e50af1220 Copyright year update 2021 2021-07-22 15:00:15 -04:00
Mike Gerwitz 7325578624 bin/tame{,d}: Fix assignments that lose exit code
Ensure that we fail if the command in the assignment fails.
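
One common way an assignment loses an exit code in Bash is combining `local` with a command substitution: the status reported is that of `local` itself, not of the command. Whether this is the exact pattern that was fixed in `bin/tame{,d}` is an assumption; the function names below are hypothetical.

```shell
#!/bin/bash
# Declaring and assigning in one step hides the failure of `false`.
lossy() {
  local out=$(false)
  echo "$?"                 # status of `local`, not of `false`
}

# Declaring first and assigning separately preserves the failure.
safe() {
  local out status=0
  out=$(false) || status=$?
  echo "$status"
}

lossy_status=$(lossy)
safe_status=$(safe)
echo "lossy=$lossy_status safe=$safe_status"
```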
2021-03-12 14:14:40 -05:00
Mike Gerwitz da7a2c71c7 tamed: TAMED_JAVA_OPTS: New environment variable
This will be passed to dslc and then to the JVM.  The intent is to permit
fine-grained heap ratio tuning.
2020-08-19 10:19:04 -04:00
Mike Gerwitz bfea768f89 Copyright year 2020 update 2020-03-06 11:05:18 -05:00
Mike Gerwitz 1a35232bd8 Parallel build support
tamed was originally designed with support for parallel builds in mind, but
I hadn't completed that work because we didn't have enough hardware that
we'd benefit strongly from it.  That has since changed.

tamed will now spawn additional runners as needed to fulfill requests, which
works around the issue of not knowing how many jobs GNU Make is going to try
to do at once.

There were a couple minor dependency fixes/workarounds for now in the
Makefile, but otherwise everything appears to be working great.
2019-04-04 14:41:07 -04:00
Mike Gerwitz e022a3133d Copyright year simplification and update to Ryan Specialty Group
This now uses year ranges, which I'll update annually.

This also renames "R-T Specialty" to "Ryan Specialty Group".  The latter is
the parent company of the former.  I was originally employed under the
former when LoVullo Associates was purchased, but I now work for the parent
company.
2019-02-07 13:23:09 -05:00
Mike Gerwitz cd5440b8da tamed: Clarify usage output shell example
* bin/tamed (usage): Clarify killing when run on a shell.
2018-12-03 16:46:06 -05:00
Mike Gerwitz 079d1dcfaf tamed: Do not stall if TAMED_SPAWNER_PID is running
This will ensure that tamed does not stall while e.g. make is still
running.  This makes TAMED_STALL_SECONDS almost useless; maybe it'll be
removed in future versions.

* bin/tame (TAMED_SPAWNER_PID): Export variable.
* bin/tamed (TAMED_SPAWNER_PID): New variable, default to PPID.
  (spawner-dead): New function.
  (stall-monitor): Use it.
  (usage): Update documentation.
* build-aux/Makefile.am: Set TAMED_SPAWNER_PID to own id and export.
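
A liveness check of this kind is commonly written with `kill -0`, which sends no signal and only reports whether the PID can be signalled. The sketch below assumes that approach; the body of the real `spawner-dead` is not shown here, and the underscore spelling and the `sleep` stand-in are illustrative. Note that `kill -0` also fails for a live process owned by another user, so this test is only reliable for processes the caller may signal.

```shell
#!/bin/bash
# Default to the parent PID when the variable is unset, as described above.
spawner_pid=${TAMED_SPAWNER_PID:-$PPID}

# Succeeds (returns 0) when the given PID can no longer be signalled.
spawner_dead() {
  ! kill -0 "$1" 2>/dev/null
}

# Demonstration with a stand-in process rather than the real spawner.
sleep 60 &
child=$!
spawner_dead "$child" && before=dead || before=alive

kill -KILL "$child"
wait "$child" 2>/dev/null
spawner_dead "$child" && after=dead || after=alive

echo "$before $after"
```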
2018-12-03 16:25:25 -05:00
Mike Gerwitz db1c03dfd9 tame{,d}: Reload runner when unresponsive
This tries to be a bit more resilient in case a runner becomes unresponsive,
rather than waiting for tamed to kill itself.

* bin/tame (RUNNER_CMD_WAITTIME): New variable.
  (command-runner): Tell runner to reload if it does not respond in
    RUNNER_CMD_WAITTIME seconds.
  (verify-runner-ack): New function.
* bin/tamed (mkfifos): Only keep stdin open.  stdout isn't necessary, and
    may have actually been causing subtle issues.
  (spawn-runner): Support restarting dslc on SIGHUP.
2018-10-16 08:53:04 -04:00
Mike Gerwitz 6027769633 Integrate new compilation scripts, remove cqueue and Makefile.2
This is a major step toward normalcy---removing the kluge of a build process
that was causing so many issues.  Rather than echoing all operations to a
queue file before passing it off to dslc, the new build scripts in `bin/'
are used to invoke tame normally, as needed.  This solves all of the current
issues with things not rebuilding when they should.  And, as a bonus, tab
completion on targets works.

Sorry this took so long.  There wasn't much motivation until we hired so
many people that are suffering from this.

This does a few major things, along with some miscellaneous others:
  - Invoke bin/tame directly;
  - Merge Makefile.2.in into Makefile.am; and
  - Fix up some targets.

* build-aux/Makefile.2.in: Delete file.  Mostly merged with Makefile.am.
* build-aux/Makefile.am: Add a bunch of new targets and definitions from
    Makefile.2.in.  Modify all that previously used .cqueue to now invoke
    `$(TAME)' directly.  Remove miscellaneous targets for trying to proxy
    targets to Makefile.2.
  (saneout, _go): Remove definitions.
  (.NOTPARALLEL): Add to prevent parallel builds.
  (ui/program.expanded.xml)[.version.xml]: Remove dependency for now.
  (clean): Also clean generated PHP files.  Follow symlinks to clean core.
    This is still incomplete (does not clean all rate table stuff).
  (suppliers.mk)[xmlo_cmd]: Remove.  See `gen-make' and `gen-c1make'.
  (lvroot)[summary-html]: New dependency.
  (kill-tamed, tamed-die): New targets (former alias of latter) to kill
    tamed.
* build-aux/gen-c1make: Generate `$(TAME)' invocation.
* build-aux/gen-make: Likewise.  Remove `xmlo_cmd' output.  Ignore recursive
    `tame' symlink (this can be removed once we clean `rater/' up).
* build-aux/m4/calcdsl.m4 (TAME): Update description to reflect that it
    should now be the path to `bin/tame'.  Adjust `AC_CHECK_FILE' lines
    accordingly.
  (tame_needed_ver): Remove.  We have been in the same repo as TAME itself
    for quite some time.  Remove associated code.
  (AC_CONFIG_FILES): Remove `Makefile.2'.
* src/current/src/com/lovullo/dslc/DslCompiler.java (_DslCompiler)[compile]:
    Perform validation before `compile' command rather than a separate
    `validate' step.  Remove `rm'.
  [compileSrc]: Stop echoing command.  This was only necessary because of
    the previous Makefile klugery; now Make echoes on its own correctly.
2018-10-11 22:25:18 -04:00
Mike Gerwitz cf57857ce5 bin/: Server/client build scripts
These scripts allow the TAME compiler stack to be invoked naturally, rather
than requiring the use of a Makefile today.  This will not only allow users
to more easily invoke the compiler, but will also allow us to invoke TAME
naturally from Makefile and remove the klugery that has existed for so
long.

This uses a server/client architecture in order to mitigate the startup
cost of the JVM.  More documentation will follow.

Note that there are a bunch of symlinks in rater/---this is a transition
step to allow the build to continue working as it did before, which relies
on a directory structure that exists outside of this repository.  This will
be cleaned up in the future.

* .gitignore (bin/dslc): Add ignore for generated file.
* bin/dslc.in: New script to encapsulate Java invocation.
* bin/tame: New script (client).
* bin/tamed: New script (server).
* configure.ac (JAVA_OPTS, DSLC_CLASSPATH, AUTOGENERATED): New variables for
  dslc.in.  Output bin/dslc.
* rater/README.md: Note that this symlink mess is temporary.
* rater/c1map: New symlink for dslc assumptions.
* rater/c1map.xsl: Likewise.
* rater/calc.xsd: Likewise.
* rater/compile.xsl: Likewise.
* rater/compiler: Likewise.
* rater/dot.xsl: Likewise.
* rater/include: Likewise.
* rater/link.xsl: Likewise.
* rater/standalone.xsl: Likewise.
* rater/summary.xsl: Likewise.
* rater/tame: Likewise (warning: circular symlink).
* src/current/src/com/lovullo/dslc/DslCompiler.java (_DslCompiler)[compile]:
  Output `DONE' lines.
2018-10-08 23:25:02 -04:00