Monday, September 13, 2010

Regex Substitution in Haskell

I'm shocked and appalled that there is no generic regex substitution function in the GHC libraries. All I'm looking for is a simple function equivalent to Perl's s/.../.../ expression.

After digging around a bit, I found subRegex in regex-compat. While this works well, it does not use PCRE, and as far as I can tell, there's no support for ByteStrings.
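For reference, here is a minimal sketch of what subRegex from regex-compat looks like in use. It assumes the regex-compat package is installed; the name result is just for illustration.

```haskell
-- Requires the regex-compat package (POSIX regexes, String only).
import Text.Regex (mkRegex, subRegex)

-- Replace "(me) boo" with "he " followed by the first capture group.
result :: String
result = subRegex (mkRegex "(me) boo") "me boo" "he \\1"

main :: IO ()
main = putStrLn result  -- he me
```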

Grrr.

Anyhow, I took the subRegex implementation from regex-compat and mangled it slightly to work with Text.Regex.PCRE. I also added the (=~$) operator, which feels a bit more familiar to Perl users. For example:
Prelude PCRESub> "me boo" =~$ ("(me) boo", "he \\1")
"he me"
The above is equivalent to Perl's:
$text = "me boo";
$text =~ s/(me) boo/he $1/;
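(=~$) is implemented with reSub (which is also exported by PCRESub). reSub allows you to provide your own CompOption and ExecOption options. Note that, like subRegex, it replaces every match in the input, so strictly speaking it behaves like Perl's s/.../.../g.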
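As a rough sketch of passing your own options to reSub, the snippet below does a case-insensitive substitution. It assumes PCRESub.hs from this post is on the import path and that compCaseless is available from Text.Regex.PCRE (it is defined in Text.Regex.PCRE.Wrap); the name caseless is only for illustration.

```haskell
-- Assumes PCRESub.hs from this post is on the import path.
import PCRESub (reSub)
import Text.Regex.PCRE (compCaseless, defaultExecOpt)

-- Passing compCaseless on its own replaces defaultCompOpt entirely,
-- so any default compile flags are dropped. That does not matter for
-- this single-line input.
caseless :: String
caseless = reSub "Me Boo" "(me) boo" "he \\1" compCaseless defaultExecOpt

main :: IO ()
main = putStrLn caseless  -- he Me
```

The capture group keeps the case of the matched text, which is why the result is "he Me" rather than "he me".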

Here's the PCRESub module:
-- PCRE-based Regex Substitution
-- Mohit Muthanna Cheppudira
--
-- Based on code by Chris Kuklewicz from the regex-compat library.
--
-- Requires Text.Regex.PCRE from regex-pcre.

module PCRESub(
  (=~$),
  reSub
) where

import Data.Array((!))
import Text.Regex.PCRE

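-- Replace every match of the pattern in the input string. In the
-- replacement text, \0 is the whole match, \N is capture group N, and
-- \\ is a literal backslash.
subRegex :: Regex                          -- ^ Search pattern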
         -> String                         -- ^ Input string
         -> String                         -- ^ Replacement text
         -> String                         -- ^ Output string
subRegex _ "" _ = ""
subRegex regexp inp repl =
  let compile _i str [] = \ _m ->  (str++)
      compile i str (("\\",(off,len)):rest) =
        let i' = off+len
            pre = take (off-i) str
            str' = drop (i'-i) str
        in if null str' then \ _m -> (pre ++) . ('\\':)
             else \  m -> (pre ++) . ('\\' :) . compile i' str' rest m
      compile i str ((xstr,(off,len)):rest) =
        let i' = off+len
            pre = take (off-i) str
            str' = drop (i'-i) str
            x = read xstr
        in if null str' then \ m -> (pre++) . ((fst (m!x))++)
             else \ m -> (pre++) . ((fst (m!x))++) . compile i' str' rest m
      compiled :: MatchText String -> String -> String
      compiled = compile 0 repl findrefs where
        bre = makeRegexOpts defaultCompOpt execBlank "\\\\(\\\\|[0-9]+)"
        findrefs = map (\m -> (fst (m!1),snd (m!0))) (matchAllText bre repl)
      go _i str [] = str
      go i str (m:ms) =
        let (_,(off,len)) = m!0
            i' = off+len
            pre = take (off-i) str
            str' = drop (i'-i) str
        in if null str' then pre ++ (compiled m "")
             else pre ++ (compiled m (go i' str' ms))
  in go 0 inp (matchAllText regexp inp)

-- Substitute re with sub in str using options copts and eopts.
reSub :: String -> String -> String -> CompOption -> ExecOption -> String
reSub str re sub copts eopts = subRegex (makeRegexOpts copts eopts re) str sub

-- Substitute re with sub in str, e.g.,
--
-- The Perl expression:
--
--   $text = "me boo";
--   $text =~ s/(me) boo/he $1/;
--
-- can be written as:
--
--   text = "me boo" =~$ ("(me) boo", "he \\1")
--
(=~$) :: String -> (String, String) -> String
(=~$) str (re, sub) = reSub str re sub defaultCompOpt defaultExecOpt
Example usage:

import PCRESub

main :: IO ()
main = do
  let text = "me boo" =~$ ("(me) boo", "he \\1")
  print text  -- prints "he me" (with the quotes, since print uses show)
Paste this code in, or browse the source at my GitHub repo: PCRESub.hs
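A few more cases, sketched from reading the subRegex code above rather than from any documented behavior: \0 refers to the whole match, every match in the input is replaced, and a doubled backslash in the replacement produces a literal backslash. The names wholeMatch, everyMatch and backslash are only for illustration, and the snippet assumes PCRESub.hs is on the import path.

```haskell
-- Assumes PCRESub.hs from this post is on the import path.
import PCRESub ((=~$))

wholeMatch, everyMatch, backslash :: String
-- \0 stands for the entire matched text.
wholeMatch = "me boo" =~$ ("boo", "[\\0]")     -- me [boo]
-- All matches are replaced, not just the first one.
everyMatch = "a1b22" =~$ ("[0-9]+", "<\\0>")   -- a<1>b<22>
-- "\\\\" is the two-character string \\, which yields one backslash.
backslash  = "a-b" =~$ ("-", "\\\\")           -- a\b

main :: IO ()
main = mapM_ putStrLn [wholeMatch, everyMatch, backslash]
```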

Someone please make this work across all the regex backends (and add support for ByteStrings)!

3 comments:

  1. This is awesome. I'll be using VexFlow for displaying music Notes on my blog http://readmusic.org

    Any manual/guide/tutorial available to make a posting in blog?

    Thanks so much.

  2. In Haskell, you write your own substitution routine to fit your particular needs.
    See http://stackoverflow.com/questions/3847475/haskell-regex-substitution/3928438#3928438
