ode to mb_ereg functions

PHP has several families of functions that are not widely known. One of them is the mb_ereg_* family.

There is a common misconception that mb_ereg_* functions are just Unicode counterparts of the ereg_* functions: slow and underpowered. That is as far from the truth as it can be.

mb_ereg_* functions are built on the Oniguruma regular expression library, and Oniguruma is one of the fastest and most capable regular expression libraries out there. A couple of years ago I made a little speed test.

Anyway, this time I want to talk about its usage. The PHP documentation doesn't say much.

Let’s start with the basic fact: you don’t need to put additional delimiters around your regular expressions when you use mb_ereg_* functions. For example:

// find first substring consisting of letters from 'a' to 'c' in 'abcdabc' string.
mb_ereg('[a-c]+', 'abcdabc', $res);

To execute the same search in a case-insensitive fashion, use mb_eregi().
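Here is a minimal sketch of both calls and of what ends up in the third argument. It assumes mbstring was built with regex (Oniguruma) support; the sample strings and variable names are made up for illustration.

```php
<?php
// Assumption: mbstring with mbregex support is available.
mb_regex_encoding('UTF-8');

// Case-sensitive: the first run of letters from 'a' to 'c'.
$found = mb_ereg('[a-c]+', 'abcdabc', $res);
$first = $res[0];      // the whole first match

// Case-insensitive variant of the same search.
$foundCi = mb_eregi('[a-c]+', 'xxABCxx', $resCi);
$firstCi = $resCi[0];  // the whole first match, original case preserved
```

Index 0 of the result array holds the whole match; capture groups, if the pattern has any, follow at indexes 1, 2 and so on.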

The mb_ereg(), mb_eregi() and mb_split() functions work with pre-set options. You can check the current options and set new ones with the mb_regex_set_options() function. It takes a string in which each letter stands for one option.

These are the option flags (you can specify several of these at the same time):

  • ‘i’: ONIG_OPTION_IGNORECASE;
  • ‘x’: ONIG_OPTION_EXTEND;
  • ‘m’: ONIG_OPTION_MULTILINE;
  • ‘s’: ONIG_OPTION_SINGLELINE;
  • ‘p’: ONIG_OPTION_MULTILINE | ONIG_OPTION_SINGLELINE;
  • ‘l’: ONIG_OPTION_FIND_LONGEST;
  • ‘n’: ONIG_OPTION_FIND_NOT_EMPTY;
  • ‘e’: eval() the resulting code (only meaningful for mb_ereg_replace(); deprecated in PHP 7.1 and removed in PHP 8.0)

And there are “modes” (if you specify several of these, the LAST one will be used):

  • ‘j’: ONIG_SYNTAX_JAVA;
  • ‘u’: ONIG_SYNTAX_GNU_REGEX;
  • ‘g’: ONIG_SYNTAX_GREP;
  • ‘c’: ONIG_SYNTAX_EMACS;
  • ‘r’: ONIG_SYNTAX_RUBY;
  • ‘z’: ONIG_SYNTAX_PERL;
  • ‘b’: ONIG_SYNTAX_POSIX_BASIC;
  • ‘d’: ONIG_SYNTAX_POSIX_EXTENDED;

Descriptions of these constants are available in this document: API.txt

So, for example, mb_regex_set_options('pr') is equivalent to mb_regex_set_options('msr') and means:

  • . also matches \n (aka “multiline-match”)
  • ^ is equivalent to \A, $ is equivalent to \Z (aka “strings are single-lined”)
  • Ruby mode is used

By the way, that is the default setting for mb_ereg_* functions. Also, the mb_ereg_match and mb_ereg_search families of functions accept the options as an explicit parameter.
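To see what individual option letters do, here is a small sketch that passes options per call instead of changing the global setting. The input strings are made up, and I pass 'r' every time so that Ruby mode is selected explicitly.

```php
<?php
// Assumption: mbstring with mbregex support is available.
mb_regex_encoding('UTF-8');

$text = "a\nb";

// Ruby mode without 'm': '.' does not match a newline, nothing is replaced.
$plain = mb_ereg_replace('a.b', 'X', $text, 'r');

// Ruby mode with 'm' (ONIG_OPTION_MULTILINE): '.' matches the newline too.
$dotAll = mb_ereg_replace('a.b', 'X', $text, 'mr');

// 'i' (ONIG_OPTION_IGNORECASE) combined with Ruby mode.
$ignoreCase = mb_ereg_replace('abc', 'X', 'ABC', 'ir');

// Called without arguments, mb_regex_set_options() reports the current options.
$current = mb_regex_set_options();
```

Passing options per call keeps the global state untouched, which is handy when other code relies on the defaults.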

So, back to functions:

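// make sure that the string matches the regexp
// (mb_ereg_match() anchors the match at the start only; the trailing \z also requires it to reach the end):
mb_ereg_match('[a-c]+\z', $user_string, 'pz'); // 'pz' specifies options for this operation
                                               // (multiline perl-mode in this case)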

// replace any of letters from 'a' to 'c' range with 'Z'
$output = mb_ereg_replace('[a-c]', 'Z', $user_string, 'b'); // use basic POSIX mode
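One detail worth checking on your own PHP version: as far as I can tell, mb_ereg_match() anchors the match at the beginning of the string only, so a pattern that must cover the whole string needs an explicit \z at the end. A small sketch with made-up inputs:

```php
<?php
// Assumption: mbstring with mbregex support is available.
mb_regex_encoding('UTF-8');

// Matches at the start of the string; the trailing 'd' is simply ignored.
$prefixOk = mb_ereg_match('[a-c]+', 'abcd');

// No match: the string does not start with a letter from 'a' to 'c'.
$prefixFail = mb_ereg_match('[a-c]+', 'dabc');

// \z requires the match to run to the very end of the string.
$wholeFail = mb_ereg_match('[a-c]+\z', 'abcd');
$wholeOk   = mb_ereg_match('[a-c]+\z', 'abcabc');

// Replacement works on every match, not only the first one.
$replaced = mb_ereg_replace('[a-c]', 'Z', 'abcd');
```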

OK, these were easy and similar to what you’ve seen in preg_* functions. Now to something more powerful. The real strength lies in the mb_ereg_search_* functions. The idea is that you hand the text and/or the regexp to Oniguruma once and let it keep them in its internal buffers. After that, repeated matching works a lot faster.

mb_ereg_search_init($some_long_text, '[a-c]'); // preparse text and set the regexp
while ($r = mb_ereg_search_regs()) { // find the next match and return it
    // work with matched result ($r[0] is the whole match)
}

mb_ereg_search_setpos(0); // rewind: the loop above left the position at the end of the text
mb_ereg_search('[d-e]'); // execute different search on the same text

mb_ereg_search_init($some_other_text); // preparse another text
mb_ereg_search(); // execute search using previous (already preparsed) regexp

This is the fastest way of parsing large documents in PHP, as far as I know.
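Here is a self-contained sketch of that scanning pattern. It uses mb_ereg_search_regs(), which runs the search and returns the match in one step, and mb_ereg_search_setpos() to rewind before a second pass. The sample text is made up.

```php
<?php
// Assumption: mbstring with mbregex support is available.
mb_regex_encoding('UTF-8');

$text = 'abcdabc';

// Register the text and the first pattern once.
mb_ereg_search_init($text, '[a-c]+');

// Each call continues from where the previous match ended.
$runs = [];
while (($regs = mb_ereg_search_regs()) !== false) {
    $runs[] = $regs[0];
}

// Rewind and scan the same text with a different pattern.
mb_ereg_search_setpos(0);
$other = mb_ereg_search_regs('d');
$otherMatch = ($other !== false) ? $other[0] : null;
```

The search position is shared global state, so it is safer to finish one scan before starting another.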

Notes on charsets. Though it is often said that mb_ereg_* functions are “unicode”, it would be more practical to say that they are encoding-aware. It is a good idea to specify which encoding you use before calling Oniguruma.

Some options:

mb_regex_encoding('UTF-8');
mb_regex_encoding('CP1251'); // windows cyrillic encoding
mb_regex_encoding('Shift_JIS'); // japanese

Check the full list of supported encodings.
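A quick way to see the encoding awareness in action, assuming the source file itself is saved as UTF-8 (the Cyrillic sample word is only an illustration):

```php
<?php
// Assumptions: mbstring with mbregex support, source file saved as UTF-8.
mb_regex_encoding('UTF-8');

// '.' matches one character, not one byte: six letters give six stars.
$masked = mb_ereg_replace('.', '*', 'привет');

// Character ranges work on code points of the selected encoding.
mb_ereg('[а-я]+', 'hello привет', $m);
$word = $m[0];

// Called without arguments, mb_regex_encoding() returns the current setting.
$encoding = mb_regex_encoding();
```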
