_table-row_: Performance fix: place table in const/text() instead of const/@values

This is an interesting one.

For some context: TAME uses `csvm` files to provide syntactic sugar for
large tables of values ("rate tables", as they're often called, since they
contain insurance rates and other data).  This gets desugared into a `csv`
which in turn is compiled via `csv2xml` into a package.  That package uses
the `_table-*_` templates to define a table, which is represented as a
matrix using `const/@values`.

Here's an example of a generated table in a package:

```
  <t:create-table name="foo">
        <t:table-rows data="
          1,2,3;
          4,5,6;" />
  </t:create-table>
```

Some of the tables are quite large, generating tens of MiB of data in
`@data`.  This in itself isn't a problem.  But when Saxon parses the `@data`
attribute, it normalizes the whitespace, as mandated by the XML spec, and
the newlines are lost.  Therefore, when the template is expanded and the
`xmlo` file is produced, the template produces a `const/@values` with a huge
amount of data on one line.
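
To illustrate the normalization, here is a minimal sketch using Python's
standard-library XML parser rather than Saxon.  The element names are a
simplified, un-namespaced stand-in for the generated package, so treat them
as assumptions; the point is only that attribute values lose their newlines
while text nodes keep them.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of a generated table (no `t:` namespace prefix).
as_attribute = '<table-rows data="\n  1,2,3;\n  4,5,6;" />'
as_text = '<const values="-">\n  1,2,3;\n  4,5,6;</const>'

# Attribute-value normalization replaces each newline with a space,
# so the whole table ends up on a single line.
attr_value = ET.fromstring(as_attribute).get("data")

# Text nodes are not subject to that normalization; newlines survive.
text_value = ET.fromstring(as_text).text

print(repr(attr_value))
print(repr(text_value))
```

This says nothing about Saxon's parsing speed; it only shows why the
attribute form collapses into one very long line.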

Then, when another package imports that `xmlo` file via `<import
package="..." />`, which is done via `document()` in XSLT, Saxon takes a
long time to parse it: about 60s on my machine for a ~20MiB line.

This problem does not exist for JS fragments; Saxon doesn't mind large text
nodes.  So the table is placed in a text node here as well.

The template system doesn't have a way to output text yet, so this takes an
approach that minimizes changes as much as possible:

  - `param-copy` will expand `with-param/@value` as a text node.
  - `const/@values="-"` will cause TAME to use the child text node as the
    value of `@values`.
  - `_table-rows_` is modified to use the above two features.

The reason for using `@values="-"` is so that other parts of the compiler do
not have to be modified to recognize the new text convention, which is
otherwise awkward because newlines are text nodes.  The `-` convention comes
from command-line programs, where it generally means "read from stdin"; it
is safe to use here since `-` is never a valid matrix specification.
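
As a rough Python analogue of the `-` convention (not TAME's actual
implementation, which is XSLT), the lookup can be sketched as below.  The
helper names `const_values` and `parse_matrix` are hypothetical, as is the
assumption that a matrix is written as comma-separated values with `;`
terminating each row.

```python
import xml.etree.ElementTree as ET

def const_values(const):
    """Return the value specification for a `const` element."""
    values = const.get("values")
    # `-` means "take the value from the child text node instead"
    return const.text if values == "-" else values

def parse_matrix(spec):
    """Split a spec like `1,2,3; 4,5,6;` into rows of floats."""
    rows = [row.strip() for row in spec.split(";")]
    return [[float(v) for v in row.split(",")] for row in rows if row]

# The same table written inline and via the `-` convention.
inline = ET.fromstring('<const name="A" values="1,2,3;4,5,6;" />')
dashed = ET.fromstring(
    '<const name="B" values="-">\n 1,2,3;\n 4,5,6;\n</const>')

assert parse_matrix(const_values(inline)) == parse_matrix(const_values(dashed))
```

Callers only ever see the resolved string, which is why the rest of the
compiler does not need to know where the value was stored.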

This must have been a problem for a very long time, but wasn't all that
noticeable until recent performance optimizations, since so many other things
around it were also slow.

DEV-15131
main
Mike Gerwitz 2023-10-18 10:20:49 -04:00
parent b82294b1bd
commit e20076235e
3 changed files with 35 additions and 7 deletions


@@ -160,7 +160,11 @@
 <const name="{@__tid@}_RATE_TABLE"
 type="float"
 desc="{@__tname@} table; {@__desc@}"
-values="@data@" />
+values="-">
+<!-- `@values="-"` above tells TAME to read the value from the
+child text node -->
+<param-copy name="@data@" />
+</const>
 </if>
 <unless name="@data@">
 <const name="{@__tid@}_RATE_TABLE"


@@ -472,7 +472,7 @@
 <param name="const" as="element( lv:const )" />
 <variable name="values-def" as="xs:string?"
-select="$const/@values" />
+select="compiler:const-values( $const )" />
 <choose>
 <when test="$values-def and contains( $values-def, ';' )">
@@ -487,6 +487,21 @@
 </function>
+<function name="compiler:const-values" as="xs:string?">
+<param name="const" as="element( lv:const )" />
+<!-- @values="-", a convention from command-line programs where '-' means
+"read from stdin", will take the value from the child text of the
+constant; this is done because Saxon performs very, very poorly on
+huge single-line attributes (e.g. 60s for ~20MiB single-line
+attribute) -->
+<sequence select="if ( $const/@values = '-' ) then
+$const/text()
+else
+$const/@values" />
+</function>
 <!--
 Produce a sequence of items
@@ -505,7 +520,9 @@
 <when test="$set/@values and $allow-values">
 <sequence select="tokenize(
-normalize-space( $set/@values ), ',' )" />
+normalize-space(
+compiler:const-values( $set ) ),
+',' )" />
 </when>
 <otherwise>


@@ -653,11 +653,13 @@
 <variable name="varname" select="@name" />
 <variable name="param" select="$params[ @name=$varname ]" />
-<variable name="copy" as="node()*">
+<variable name="copy">
 <choose>
-<!-- TAMER desugared @values@ application convention (see tplshort.rs) -->
-<when test="$param/@value">
-<!-- the value is the name of a template to copy the body from -->
+<!-- TAMER desugared @values@ application convention (see
+tplshort.rs); this will go away once the template system is fully
+implemented in TAMER -->
+<when test="$varname = '@values@' and $param/@value">
+<!-- the value may be the name of a template to copy the body from -->
 <variable name="dsgr" select="$param/@value" />
 <!-- the template is always positioned as the immediately-following
@@ -667,6 +669,11 @@
 /*" />
 </when>
+<!-- non-node value is copied as text -->
+<when test="$param/@value">
+<sequence select="string( $param/@value )" />
+</when>
 <!-- old application convention has child nodes within
 `with-param` -->
 <otherwise>