Template::Nest::Fast

Developing a High Performance templating engine in Raku

Template::Nest::Fast is a high-performance templating engine written in pure Raku. It all began with a simple request from Tom, the author of Template::Nest (Perl), who approached me with a benchmarking task: compare several operations in Perl and Raku. The first job was to benchmark Template::Nest (Perl) against Template::Nest::XS (Raku).

The direct port of Template::Nest from Perl to Raku was excluded from the benchmarks due to its poor performance. It proved unsuitable for production use, leading to the development of Template::Nest::XS.

At the time I couldn't help but assume that Template::Nest::XS was going to be the obvious winner, given its C++ optimizations, and that the comparison would be unfair to Perl. The results proved my assumptions wrong: surprisingly, the Perl version outperformed Raku even when Raku was leveraging C++. I had created two templates for testing, "Simple" and "Complex". Here are the results:

Module                       Simple Page   Complex Page
Template::Nest::XS           0.34ms        1.74ms
Template::Nest:from<Perl5>   0.76ms        4.40ms
Template::Nest (Perl)        0.21ms        1.30ms

The second row shows the timings for Perl's Template::Nest run from Raku through Inline::Perl5. As expected, it is slower, but the size of the gap, roughly 3.5x slower than native Perl on both pages, was somewhat surprising. This motivated me to develop a Raku version of Template::Nest, determined to improve Raku's performance and make a more efficient template engine. Thus, Template::Nest::Fast was born.

Back when the early Raku version was slow, I delved into the ::Nest source code. It appeared complex, and I had no idea how one could make a templating engine. The benchmarking job re-ignited that spark. I figured it was just string substitution and hacked something up in a day. I recall going up to my friends and showing them the result: the initial version beat the fastest version (in Perl) by ~1.5x and the XS version by ~2x. Later that day I shared it with Tom, and he asked me if I could feature match this with the Raku version.

I wrote to Tom:

I was wondering about this problem and decided to write a ::Nest myself in Raku, I haven't clocked in for this. Here is a POC of the idea:

The idea is that we "compile" the templates when the ::Nest object is created. What compilation does is simply make note of all variables present in the template and the locations we have to substitute -- if the template is modified then we can "compile" it again on the fly.

I made this for a simple dumb string replacement -- haven't added the ability to inject another TEMPLATE hash, just strings, here are the results.

The idea was to cache the computations for each template file for as long as we can. To understand how it all works, let's look at an example. Given a template file (templates/00-simple-page.html):

<!DOCTYPE html>
<html lang="en">
  <head>
    <title>Simple Page</title>
  </head>
  <body>
    <p><!--% variable %--></p>
  </body>
</html>

One could generate a web page with this template hash:

# Declare template structure.
my %simple-page = %(
    TEMPLATE => '00-simple-page',
    variable => 'Simple Variable',
);

That's it! The variable does not have to be a string: it can be another template hash (referring to another template file) or a list of template hashes. The Perl version parsed each template file every time it was referenced in a template hash. The pure Raku port was a line-by-line rewrite of the Perl 5 module, and I believe Template::Nest::XS took the same approach, written in C++.
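As an illustration of the nesting described above, here is a sketch of a template hash whose values are a string, another template hash, and a list of template hashes. The template names ('01-page', '02-header', '03-list-item') and the keys are hypothetical and exist only for this example.

```raku
# Hypothetical template names and keys, for illustration only.
my %page = %(
    TEMPLATE => '01-page',
    title    => 'Shopping list',                  # a plain string
    header   => %(                                # another template hash
        TEMPLATE => '02-header',
        heading  => 'Things to buy',
    ),
    items    => [                                 # a list of template hashes
        %( TEMPLATE => '03-list-item', label => 'Apples' ),
        %( TEMPLATE => '03-list-item', label => 'Pears'  ),
    ],
);

say %page<items>.elems;
```

Each nested hash names its own template file through TEMPLATE, so a renderer can walk the structure recursively and fill every variable with the rendered output of its value.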

The idea was to cache the variable positions and perform a simple string substitution at run time. We calculate the position of all "keys" (<!--% a_key %-->) & store them. As we go through the template hash, those positions are filled by the given values. This made the newer Raku version (Template::Nest::Fast) about 120x faster compared to the line by line port. It could now be compared in benchmarks.
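The caching idea can be sketched roughly as follows. This is a minimal illustration that handles plain string values only, not the module's actual implementation: the names CompiledTemplate, compile-template and render are hypothetical, and the token regex assumes keys made of word characters.

```raku
# Hypothetical names; a sketch of position caching, not the module's internals.
class CompiledTemplate {
    has Str $.text;     # raw template source
    has     @.tokens;   # one hash per <!--% key %--> occurrence
}

# "Compile" once: record the key and the start/end offsets of every token.
sub compile-template(Str $template --> CompiledTemplate) {
    my @tokens;
    for $template.match(/ '<!--%' \s* (\w+) \s* '%-->' /, :g) -> $m {
        @tokens.push: %( key => ~$m[0], from => $m.from, to => $m.to );
    }
    return CompiledTemplate.new(text => $template, tokens => @tokens);
}

# Render many times: no parsing, only slicing the text at cached offsets.
sub render(CompiledTemplate $compiled, %params --> Str) {
    my @parts;
    my $pos = 0;
    for $compiled.tokens -> $token {
        # Literal text between the previous token and this one.
        @parts.push: $compiled.text.substr($pos, $token<from> - $pos);
        # The value for this key; missing keys become an empty string here.
        @parts.push: %params{$token<key>} // '';
        $pos = $token<to>;
    }
    # Literal text after the last token.
    @parts.push: $compiled.text.substr($pos);
    return @parts.join;
}

my $compiled = compile-template('<p><!--% variable %--></p>');
say render($compiled, %( variable => 'Simple Variable' ));
# prints "<p>Simple Variable</p>"
```

The regex runs once per template in compile-template; every later call to render only copies substrings, which is where the saving over re-parsing on each reference would come from.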

Module                   Simple    Complex
Template::Nest::XS       0.40ms    1.96ms
Template::Nest::Fast     0.14ms    1.10ms
Template::Nest (Perl)    0.24ms    1.40ms

I feature matched this with the Perl 5 version, and it now supports all the options available in the Perl 5 module with the exception of escape_char. However, as I added new features to Template::Nest::Fast, there came a trade-off between functionality and performance: the enhancements added overhead that affected the module's speed. For instance, enabling the die-on-bad-params option made it about 2x slower. Template::Nest::Fast also expanded the original test suite to cover additional edge cases. Moreover, working on it greatly improved my understanding of the module.

Later Template::Nest::XS was re-written to use the same algorithm and it is currently the fastest version available. This competition between the modules led to XS performance improvements. In Tom's words:

The bottleneck slowing down XS version was not related to indexing, at least I don't think so. Indexing made little difference, C++ version was already using a fast algorithm for replacing tokens but the fact pure Raku version was faster than C++ version made us go back to the drawing board to try to work out the reason for this, and ultimately led to a much faster C++ version. The main issue turned out to be getting the data into memory in a format that could easily be read in C++ code.

Ideally, if we fully understood how Raku stores data (i.e. at memory address level) then we could perhaps access the memory directly. However it's not trivial to do that. I believe the XS version was using some native Raku serialisation method to stringify the data, we changed this to JSON::Fast & I think this resulted in a significant performance improvement.