Developing a High Performance templating engine in Raku
Template::Nest::Fast is a high-performance templating engine written in pure Raku. It all began with a simple request from Tom, the author of Template::Nest (Perl), who approached me with a benchmarking task. The task was to benchmark several operations in Perl versus Raku, and the first job was to benchmark Template::Nest (Perl) against Template::Nest::XS (Raku).
The direct port of Template::Nest from Perl to Raku was excluded from the benchmarks due to its poor performance. It proved unsuitable for production use, leading to the development of Template::Nest::XS.
At the time, I couldn't help but assume that ::Nest::XS would be the obvious winner given its C++ optimizations, and that the comparison would be unfair to Perl. However, the results proved me wrong: surprisingly, the Perl version outperformed Raku even when Raku was leveraging C++. I had created two templates for testing, "Simple" and "Complex". Here are the results:
The second row gives the timings for running Perl's Template::Nest from Raku using Inline::Perl5. As expected, it was slower, but the size of the gap (4x slower) was somewhat surprising. This motivated me to develop a Raku version of Template::Nest, with the goal of improving Raku's performance and building a more efficient template engine. Thus, Template::Nest::Fast was born.
Back when the early Raku version was slow, I had delved into the ::Nest source code. It looked complex, and I had no idea how one would build a templating engine. The benchmarking job re-ignited that spark. I figured it was mostly string substitution and hacked something up in a day. I remember going up to my friends and showing them the result: the initial version beat the fastest version (in Perl) by ~1.5x and the XS version by ~2x. Later that day I shared it with Tom, and he asked me if I could feature-match it with the Raku version.
I wrote to Tom:
I was wondering about this problem and decided to write a ::Nest myself in Raku; I haven't clocked in for this. Here is a POC of the idea:
The idea is that we "compile" the templates when the ::Nest object is created. Compilation simply records every variable present in the template and the location where we have to substitute it; if the template is modified, we can "compile" it again on the fly.
I made this for simple, dumb string replacement; I haven't added the ability to inject another TEMPLATE hash, just strings. Here are the results.
The idea was to cache the computations for each template file for as long as we can. To understand how it all works, let's look at an example.
Given a template file (
<!DOCTYPE html>
<html lang="en">
  <head>
    <title>Simple Page</title>
  </head>
  <body>
    <p><!--% variable %--></p>
  </body>
</html>
One could generate a web page with this template hash:
# Declare template structure.
my %simple-page = %(
    TEMPLATE => '00-simple-page',
    variable => 'Simple Variable',
);
That's it! The variable does not have to be a string: it can be another template hash (referring to another template file), or a list of template hashes. The Perl version parsed each template file every time it was referenced in a template hash. The pure Raku port was a line-by-line rewrite of the Perl 5 module, and I believe Template::Nest::XS, written in C++, took the same approach.
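To make the nesting concrete, here is a minimal sketch of recursive rendering; it is an illustration, not the code of any of the modules discussed. It assumes the `<!--% name %-->` token syntax shown above, and it uses a hypothetical in-memory `%TEMPLATES` map and a hypothetical `render` sub in place of template files. It re-scans the template text on every call, much like the per-reference parsing described in the paragraph above.

```raku
# Sketch only: %TEMPLATES and render() are hypothetical names.
# Templates live in memory here instead of in template files.
my %TEMPLATES =
    'page' => '<ul><!--% items %--></ul>',
    'item' => '<li><!--% name %--></li>';

sub render($value --> Str) {
    # Template hash: look up its template and fill each <!--% name %--> token,
    # rendering the supplied value recursively (missing keys become '').
    if $value ~~ Associative {
        my $template = %TEMPLATES{ $value<TEMPLATE> };
        return $template.subst(
            / '<!--%' \s* (<[\w\-]>+) \s* '%-->' /,
            { render($value{ ~$0 } // '') },
            :g,
        );
    }
    # List of template hashes: render each one and concatenate the results.
    return $value.map(&render).join if $value ~~ Positional;
    # Anything else is treated as a plain string.
    ~$value
}

my %page =
    TEMPLATE => 'page',
    items    => [
        %( TEMPLATE => 'item', name => 'first' ),
        %( TEMPLATE => 'item', name => 'second' ),
    ];

say render(%page);   # <ul><li>first</li><li>second</li></ul>
```

Note that the regex runs again for every hash that is rendered, which is exactly the repeated work that caching avoids.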
The idea was to cache the variable positions and perform a simple string substitution at run time. We calculate the positions of all "keys" (<!--% a_key %-->) and store them. As we walk through the template hash, those positions are filled with the given values. This made the newer Raku version (Template::Nest::Fast) about 120x faster than the line-by-line port, fast enough that it could now be included in the benchmarks.
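Here is a minimal sketch of the position-caching idea. It assumes the `<!--% name %-->` token syntax, and `compile-template` and `render-compiled` are hypothetical names. It is an illustration, not Template::Nest::Fast's actual implementation. The regex runs once, at compile time. Each render only copies the literal text between the cached offsets and drops in the values.

```raku
# Sketch only: compile-template/render-compiled are hypothetical names.

# Compile: record each token's variable name and its [from, to) offsets.
sub compile-template(Str $template --> Hash) {
    my @slots = $template.match(/ '<!--%' \s* (<[\w\-]>+) \s* '%-->' /, :g).map(-> $m {
        %( name => ~$m[0], from => $m.from, to => $m.to )
    });
    %( :$template, :@slots )
}

# Render: reuse the cached slots; no regex runs at this point.
sub render-compiled(%compiled, %values --> Str) {
    my $t   = %compiled<template>;
    my $out = '';
    my $pos = 0;
    for %compiled<slots>.list -> %slot {
        $out ~= $t.substr($pos, %slot<from> - $pos);   # literal text before the token
        $out ~= %values{ %slot<name> } // '';          # substituted value (or empty)
        $pos = %slot<to>;
    }
    $out ~ $t.substr($pos)                               # literal text after the last token
}

my %compiled = compile-template('<p><!--% variable %--></p>');
say render-compiled(%compiled, %( variable => 'Simple Variable' ));   # <p>Simple Variable</p>
say render-compiled(%compiled, %( variable => 'Another Value' ));     # <p>Another Value</p>
```

Because the slots are computed once per template, rendering the same template with many different hashes skips the parsing step entirely.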
I feature-matched it with the Perl 5 version, and it now supports the options available in the Perl 5 module. However, as I added new features to Template::Nest::Fast, a trade-off emerged between functionality and performance: each enhancement added overhead that affected the module's speed. For instance, enabling the die-on-bad-params option made rendering about 2x slower. All of the options were migrated except escape_char, and Template::Nest::Fast also expanded the original test suite to cover additional edge cases.
Moreover, working on it greatly improved my understanding of the
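The post does not show how die-on-bad-params works internally. Here is a hypothetical sketch, which assumes the option rejects any key in a template hash that is not a variable in the referenced template. The helper names `template-vars` and `check-params` are made up. The sketch illustrates the kind of extra work per render that such an option adds. The ~2x figure above comes from the module's benchmarks, not from this code.

```raku
# Sketch only: template-vars/check-params are hypothetical helpers.

# Collect the set of variable names used in a template's tokens.
sub template-vars(Str $template --> Set) {
    $template.match(/ '<!--%' \s* (<[\w\-]>+) \s* '%-->' /, :g).map(-> $m { ~$m[0] }).Set
}

# Die if the hash supplies a parameter the template does not use.
# This is an extra pass over the keys on every render.
sub check-params(%values, Set $vars) {
    for %values.keys -> $key {
        next if $key eq 'TEMPLATE';
        die "Bad param '$key' for template '{ %values<TEMPLATE> }'"
            unless $key (elem) $vars;
    }
}

my $vars = template-vars('<p><!--% variable %--></p>');
check-params(%( TEMPLATE => '00-simple-page', variable => 'ok' ), $vars);   # passes
try check-params(%( TEMPLATE => '00-simple-page', varaible => 'typo' ), $vars);
say $!.message if $!;   # Bad param 'varaible' for template '00-simple-page'
```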
Later, Template::Nest::XS was rewritten to use the same algorithm, and it is currently the fastest version available. This competition between the modules led to performance improvements in XS. In Tom's words:
The bottleneck slowing down the XS version was not related to indexing, at least I don't think so. Indexing made little difference; the C++ version was already using a fast algorithm for replacing tokens. But the fact that the pure Raku version was faster than the C++ version made us go back to the drawing board to try to work out why, and that ultimately led to a much faster C++ version. The main issue turned out to be getting the data into memory in a format that could easily be read in C++ code.
Ideally, if we fully understood how Raku stores data (i.e. at the memory-address level), we could perhaps access the memory directly. However, that's not trivial. I believe the XS version was using some native Raku serialisation method to stringify the data; we changed this to JSON::Fast, and I think this resulted in a significant performance improvement.