I wrote Fast XML Parsing in Ruby last summer. It includes a number of optimizations, among them combining a bunch of string compares into one regular expression (regex) compare. It has bothered me that it still runs a series of string and regex compares, one after another, until one matches. They could be combined into one (unreadable) regex if there were a simple (and fast) way to determine which alternative matched. Right now each regex or set of strings has a different action. A single regex could cover all the valid matches, but how would I determine which action to run?
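One possible answer, sketched below, is to give each alternative in the combined regex its own named group and look up the action by whichever group matched. This is only an illustration: the patterns, the `TOKEN` and `ACTIONS` constants, and the `tokenize` method are made up for this example and are not the ones in the parser.

```ruby
# Hypothetical combined pattern: one named group per kind of token.
# These patterns are simplified stand-ins, not the parser's real ones.
TOKEN = %r{
  (?<close_tag> </[A-Za-z][\w:-]*\s*> )
| (?<comment>   <!--.*?--> )
| (?<open_tag>  <[A-Za-z][\w:-]*[^<>]*> )
| (?<text>      [^<]+ )
}mx

# One action per group name (also hypothetical).
ACTIONS = {
  "close_tag" => ->(s) { [:close, s[/[A-Za-z][\w:-]*/]] },
  "comment"   => ->(_s) { [:comment, nil] },
  "open_tag"  => ->(s) { [:open, s[/[A-Za-z][\w:-]*/]] },
  "text"      => ->(s) { [:text, s] },
}

def tokenize(input)
  tokens = []
  pos = 0
  while pos < input.length
    m = TOKEN.match(input, pos)
    # Stop if nothing matches exactly at the current position.
    break unless m && m.begin(0) == pos
    # The only group with a non-nil capture is the alternative that matched.
    name = m.names.find { |n| m[n] }
    tokens << ACTIONS.fetch(name).call(m[0])
    pos = m.end(0)
  end
  tokens
end

p tokenize("<title>Hi</title><!-- x -->")
```

The lookup of the matching group is a linear scan over the group names, so whether this beats the current chain of compares is something only a benchmark would settle.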
I mentioned creating a Domain Specific Language (DSL) for this situation. But something like this already exists: YACC (Yet Another Compiler Compiler), which has been around for decades in the Unix world. YACC handles LALR(1) grammars. Regular languages, the ones regular expressions describe, are a subset of the languages LALR(1) grammars can handle.
Bison is an open source version of YACC with some additional features. Rbison claims to merge Ruby and Bison to produce a Ruby-callable YACC parser (in C) with actions written in Ruby. It would be a great solution, except it has been abandoned by its creator and the repository has been taken down. Rbison 0.0.7 is in a number of FreeBSD repositories. I looked at it, and it shows some work was done, but it is a long way from being usable. The goal may be too ambitious or even impossible.
Racc is YACC written entirely in Ruby. It is usable, and I am working on using it to speed up the RSS and OPML parsers described in the article. The OPML parser is working: it passes the test suite and doesn’t blow up running the examples. I haven’t pushed it to GitHub yet. I haven’t run any benchmarks yet, but my current thinking is that it will not be faster. Racc is pure Ruby, while Ruby’s regex library routines are in C. I expect the regex engine can match a bunch of regexes faster than a single Racc parser can.
I expect to have the RSS parser converted to use Racc in the next few days. I’ll post the benchmark results when complete, whichever way they turn out.