Regular expressions in Haskell

I read an article by Brian Kernighan about an code example that Rob Pike wrote as a programming example in their mutual book, The practice of programming.

A Regular Expression Matcher by Brian Kernighan

As Brian puts it "Rob's implementation itself is a superb example of beautiful code: compact, elegant, efficient, and useful" and I can only agree.

To really understand the code I thought I would try to re-implement it in Haskell as an exercise.

The following operators is supported by this simple regular expression matcher.

c    matches any literal character c
.    matches any single character
^    matches the beginning of the input string
$    matches the end of the input string
*    matches zero or more occurrences of the previous character

The original code is written as follows.

/* match: search for regexp anywhere in text */
int match(char *regexp, char *text)
{
    if (regexp[0] == '^')
        return matchhere(regexp+1, text);
    do {    /* must look even if string is empty */
        if (matchhere(regexp, text))
            return 1;
    } while (*text++ != '\0');
    return 0;
}
/* matchhere: search for regexp at beginning of text */
int matchhere(char *regexp, char *text)
{
    if (regexp[0] == '\0')
        return 1;
    if (regexp[1] == '*')
        return matchstar(regexp[0], regexp+2, text);
    if (regexp[0] == '$' && regexp[1] == '\0')
        return *text == '\0';
    if (*text!='\0' && (regexp[0]=='.' || regexp[0]==*text))
        return matchhere(regexp+1, text+1);
    return 0;
}
/* matchstar: search for c*regexp at beginning of text */
int matchstar(int c, char *regexp, char *text)
{
    do {    /* a * matches zero or more instances */
        if (matchhere(regexp, text))
            return 1;
    } while (*text != '\0' && (*text++ == c || c == '.'));
    return 0;
}

The code uses pointer arithmetic in the regular expression string and the input string quite heavily as well as recursion to iterate through the expression. The top level match is separated in two sub functions. The first, match, iterate through the input string to see if there is a local match in the string which is evaluated by the sub-function, matchhere. Star expression is also handled in a separate function, matchstar. See original article for a more thorough explanation.

I have tried to port the code as close as possible to Haskell. I am not totally certain if it is a perfect match, but the code seems to work for a set of common patterns.

module Main (main) where

type RegExp = String

match :: RegExp -> String -> Bool
match []       _      = True
match ('^':re) cs     = matchhere re cs
match re       []     = matchhere re []
match re       (c:cs) = matchhere re (c:cs) || match re cs

matchhere :: RegExp -> String -> Bool
matchhere []         _      = True
matchhere ('$':[])   []     = True
matchhere re         []     = False
matchhere (c:'*':re) cs     = matchstar c re cs
matchhere ('.':re)   (r:cs) = matchhere re cs
matchhere (r:re)     (c:cs) | r == c = matchhere re cs
matchhere re         cs     = False

matchstar :: Char -> RegExp -> String -> Bool
matchstar _   _        []     = False
matchstar '.' re (c:cs) = matchhere re cs || matchstar '.' re cs
matchstar d   re (c:cs) | d == c = matchhere re cs || matchstar d re cs
matchstar d   re cs     = matchhere re cs

I really think that the haskell code is at least as easy to understand as the original C code. But considering that C is a much lower level language it is impressive that the C code is as simple as it is.

Haskell is really built to handle recursion well and that is useful in this implementation. Also Haskell's pattern matching capability is useful to simplify the structure of this implementation.

But as I am no Haskell guru there are probable even better implementations, even if I am quite happy with this one.

The whole solution includes some test cases and a main program as well.

module Main (main) where

{- Port a beautiful regular expression matcher code by Rob Pike in Haskell
   See: http://www.cs.princeton.edu/courses/archive/spr09/cos333/beautiful.html
   Compile: ghc -O2 RegExp.hs
   Optimize: strip RegExp -o RegExp
   Run: ./RegExp 

    The following constructs is supported

    c    matches any literal character c
    .    matches any single character
    ^    matches the beginning of the input string
    $    matches the end of the input string
    *    matches zero or more occurrences of the previous character -}

type RegExp = String

match :: RegExp -> String -> Bool
match []       _      = True
match ('^':re) cs     = matchhere re cs
match re       []     = matchhere re []
match re       (c:cs) = matchhere re (c:cs) || match re cs

matchhere :: RegExp -> String -> Bool
matchhere []         _      = True
matchhere ('$':[])   []     = True
matchhere re         []     = False
matchhere (c:'*':re) cs     = matchstar c re cs
matchhere ('.':re)   (r:cs) = matchhere re cs
matchhere (r:re)     (c:cs) | r == c = matchhere re cs
matchhere re         cs     = False

-- Greedy match star
matchstar :: Char -> RegExp -> String -> Bool
matchstar _   _        []     = False
matchstar '.' re (c:cs) = matchhere re cs || matchstar '.' re cs
matchstar d   re (c:cs) | d == c = matchhere re cs || matchstar d re cs
matchstar d   re cs     = matchhere re cs

tests :: [(String,String, Bool)]
tests = [ ("Hello","Hello",True),
          ("abc*d","abd",True),
          ("abc*d","abcccd",True),
          ("^$","",True),
          ("^abc$","abc",True),
          (".lo$","Hello",True),
          ("Hello","Say Hello",True),
          (".ello","Say Gello",True),
          ("^.*$","What ever",True),
          ("a","",False),
          ("ab","ba",False),
          ("in$","going",False),
          ("^in","going",False) ]

doTests :: [(String, String, Bool)] -> [Bool]
doTests ts = map test ts where
             test (re, s, ans) = (match re s) == ans

main = print (doTests tests)

The supplied tests works fine.

$ ghc -O2 RegExp.hs; ./RegExp
[1 of 1] Compiling Main             ( RegExp.hs, RegExp.o )
Linking RegExp ...
[True,True,True,True,True,True,True,True,True,True,True,True,True]