date 26.May.2013

■ Which is the best minimum dependency C++ regexp class?


Regular expressions (in theory equivalent to finite state automata) are basically super-complicated wildcards that can match or search for a string using special syntax. For example this one, applied case-insensitively, matches only email addresses of the type someone@website.com:

  \b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b
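To make this concrete, here is a minimal sketch of how the pattern above could be applied with std::regex. It assumes a C++11 compiler (so not VS6), and the helper name contains_email is made up for illustration. Since the pattern only lists A-Z, the sketch turns on case-insensitive matching so that lowercase addresses are found too.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns true if `text` contains something that looks
// like an email address according to the pattern quoted above.
bool contains_email(const std::string& text)
{
    // The pattern only lists A-Z, so it is compiled with std::regex::icase
    // to accept lowercase addresses as well.
    static const std::regex pattern(
        R"(\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b)",
        std::regex::ECMAScript | std::regex::icase);

    // regex_search looks for the pattern anywhere inside the text,
    // whereas regex_match would require the whole string to be an address.
    return std::regex_search(text, pattern);
}
```

Calling contains_email("write to someone@website.com today") should report a match, while a string without an @ sign should not.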

If you are still with us, there are cases where you may want to use such regular expressions in your programs, e.g. for mass renaming or grep (search in files) tasks. Visual Studio 6 (!) doesn't have a built-in regexp class. If you search for a C++ regexp class you may come across recommendations for Boost and the like, but these are heavyweight libraries with many cross dependencies you'd rather do without. We don't want any of these.

In xplorer² I am using the CAtlRegExp class, which was introduced in Visual Studio 2005 and immediately discontinued (still available through ATL Server), but it isn't fully Perl-compatible (Perl being the most popular syntax). Later versions of Visual Studio have introduced basic_regex (TR1 regular expressions), which looks comprehensive but a bit over the top, and needless to say VS6 needn't apply <g>

I was very happy to discover the DEELX regular expression engine, which is Perl-compatible, comes in a single C++ header file and has no funny dependencies. But how good is it in terms of string matching? There is only one way to find out, the experimental way!

Regexp performance tests


The objective is to compare the three regular expression classes mentioned above: CRegexpT (DEELX), CAtlRegExp (ATL) and basic_regex (STL). These can be mixed in a single VS2010 project. The input was a 43 KB Unicode buffer with generic text where the match was at the very end (so that the classes would be stress tested). I used 3 different regular expressions:
  1. Number detection: \d+\.?\d*|\.\d+
  2. Email address: ([0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*@(([0-9a-zA-Z])+([-\w]*[0-9a-zA-Z])*\.)+[a-zA-Z]{2,9})
  3. Plain text matching: 123
regexp      DEELX  ATL    STL  Notes
Number      3936310259261000 repetitions
Email       530674330 repetitions
Plain text  19579447091000 repetitions

Table 1. String matching speed comparison (ms) for regexp classes

The times in this table are in milliseconds. To get significant results each search was repeated a number of times. As you can see there is no clear-cut winner. Each class is optimized for different objectives. The ATL class is fastest for matching the number expression, and the STL class for the emails. The DEELX class is decent for numbers but it trails a lot in the complex email matching expression. The ATL class cannot match emails at all with the given regexp (the Perl compatibility issue we mentioned).
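For readers who want to try something similar, a timing loop of this kind could be sketched as follows. This is not the actual test code behind the table. It covers only the basic_regex case through std::wregex, assumes a compiler with std::chrono (newer than VS2010), and the names TimingResult, time_search and make_buffer are invented for the example. It also assumes the expression is compiled once, outside the timed loop.

```cpp
#include <chrono>
#include <cstddef>
#include <regex>
#include <string>

// Result of one timing run (hypothetical type, for illustration only).
struct TimingResult {
    bool matched;      // did the last search find a match?
    long long millis;  // total wall-clock time for all repetitions
};

// Searches the whole buffer `repetitions` times and reports the total
// elapsed time in milliseconds.
TimingResult time_search(const std::wstring& pattern,
                         const std::wstring& buffer,
                         int repetitions)
{
    const std::wregex re(pattern);  // compiled once, not part of the timing
    bool matched = false;

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repetitions; ++i)
        matched = std::regex_search(buffer, re);
    const auto stop = std::chrono::steady_clock::now();

    const auto ms =
        std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);
    return { matched, static_cast<long long>(ms.count()) };
}

// Builds a test buffer of digit-free filler text with the only possible
// match ("123") at the very end, so every search has to scan everything.
std::wstring make_buffer(std::size_t fillerWords)
{
    std::wstring buffer;
    for (std::size_t i = 0; i < fillerWords; ++i)
        buffer += L"lorem ipsum ";
    buffer += L"123";
    return buffer;
}
```

A call such as time_search(L"\\d+\\.?\\d*|\\.\\d+", make_buffer(3600), 1000) would roughly mimic the number test on a buffer of about 43,000 characters. The absolute timings depend heavily on the compiler and standard library, so they are not comparable with Table 1.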

The verdict on the DEELX class is: sort of all right. It cannot match the performance of basic_regex but it offers what it does with minimum fuss and no dependencies!


©2002-2013 ZABKAT, all rights reserved