Performance of Parsing Short Strings
G. Wade Johnson has a 3D Printing hobby. For the beginning of the talk he introduced the main stages of going from a design to a 3D printed object. The result of one stage (slicing) is a Gcode file, which is basically a set of commands for controlling the movement of the print head, the temperatures, and the extrusion of the plastic.
He explained that the format is pretty simple. People have written code to modify the Gcode file before printing to generate interesting effects or fix problems with the code. He has written special purpose tools like this a few times. But, he wanted to create a more general tool. Modifying a Gcode file pretty much boils down to recognizing lines in the file, and then changing them, adding new lines, or removing lines.
After building some initial code using regular expressions to do the matching, he began to wonder about other approaches. You will sometimes hear people argue that a purpose-built piece of code can beat a regular expression in some cases. There are also different ways that you can optimize a regular expression to get potentially quicker results. He decided to test some of those ideas.
When doing benchmarking on code, it is important to understand what you are actually measuring and to eliminate variables that might confound your findings. So, he started with the constants in the tests.
- The full Gcode file was loaded into memory and accessed as an array of lines. This removes variability in time spent reading the file from disk from the actual tests.
- These are real Gcode files, so some of the lines will match and some will not. This tests both modes of the matching.
- Different test scripts matched a different number of times, and some of them extracted data, while some did not.
- Different test files executed different types of matching code. The regexes work pretty much the same no matter what, but different optimizations apply if you are looking for an exact match vs. a partial match vs. attempting to extract values from the string.
The source code for the test code is available on github.
The four tests are described in the README under the heading SCRIPTS. The short description of the tests is:
- bm-zero_extruder: Simple match, large number of positives
- bm-zmove: More complicated match, large number of positives
- bm-double_zmove: match in two places in data, two stage match
- bm-G1: match 3-5 times with somewhat simple match
The results are based on a particular run from his laptop. The absolute numbers are less important than the insights gained from them.
Not surprisingly, regular expressions beat out every other solution. More interestingly, carefully hand-crafted solutions tended to not only perform worse, but were also harder to read. The result should not come as much of a surprise. The regular expressions were able to make one transition from Perl down to highly-optimized internal code to make the match. Most of the other solutions required several Perl-level operations to complete and the overhead of those ops shows.
Interestingly, on this particular mix of strings, the normal advice to use non-greedy matches for speed did not really have much of an effect.
We had 8 people attending this month. As always, we'd like to thank cPanel, Inc. for providing the meeting space and food for the group.