Ode to mb_ereg functions

PHP has some sets of functions that are not widely known. One of them is the mb_ereg_* family of functions.

There is a common misunderstanding that mb_ereg_* functions are just unicode counterparts of ereg_* functions: slow and not very powerful. That’s as far from the truth as it can be.

mb_ereg_* functions are based on the Oniguruma regular expression library, and Oniguruma is one of the fastest and most capable regular expression libraries out there. A couple of years ago I ran a small speed test.

Anyway, this time I want to talk about its usage. The PHP documentation doesn’t say much.

Let’s start with the basic fact: you don’t need to put additional delimiters around your regular expressions when you use mb_ereg_* functions. For example:

    // find first substring consisting of letters from 'a' to 'c' in 'abcdabc' string
    mb_ereg('[a-c]+', 'abcdabc', $res);

To execute the same search in a case-insensitive fashion, use mb_eregi().

The mb_ereg(), mb_eregi() and mb_split() functions work with pre-set options. You can check the current options and set new ones using the mb_regex_set_options() function. It takes a string in which each letter stands for one option.

These are the options (you can specify several of them at the same time):
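A minimal sketch of both calls side by side. It assumes the mbstring extension is built with regex support; the second subject string is made up for illustration.

```php
<?php
// Assumption: ext/mbstring with Oniguruma-based regex support is available.
mb_regex_encoding('UTF-8');

// No delimiters around the pattern; the matched parts are written to $res.
mb_ereg('[a-c]+', 'abcdabc', $res);
$first = $res[0];   // first (leftmost) run of letters from 'a' to 'c'

// Same pattern, case-insensitive.
mb_eregi('[a-c]+', 'ABCdabc', $resCi);
$firstCi = $resCi[0];
```

Here $first ends up as 'abc' and $firstCi as 'ABC'. The return value of mb_ereg() differs between PHP versions, so the sketch reads the match array rather than relying on it.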

  • ‘i’: ONIG_OPTION_IGNORECASE;
  • ‘x’: ONIG_OPTION_EXTEND;
  • ‘m’: ONIG_OPTION_MULTILINE;
  • ‘s’: ONIG_OPTION_SINGLELINE;
  • ‘p’: ONIG_OPTION_MULTILINE | ONIG_OPTION_SINGLELINE;
  • ‘l’: ONIG_OPTION_FIND_LONGEST;
  • ‘n’: ONIG_OPTION_FIND_NOT_EMPTY;
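  • ‘e’: eval() the resulting code (applies to mb_ereg_replace() only; deprecated in PHP 7.1 and removed in PHP 8.0, where mb_ereg_replace_callback() takes its place)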

And there are “modes” (if you specify several of these, the LAST one will be used):

  • ‘j’: ONIG_SYNTAX_JAVA;
  • ‘u’: ONIG_SYNTAX_GNU_REGEX;
  • ‘g’: ONIG_SYNTAX_GREP;
  • ‘c’: ONIG_SYNTAX_EMACS;
  • ‘r’: ONIG_SYNTAX_RUBY;
  • ‘z’: ONIG_SYNTAX_PERL;
  • ‘b’: ONIG_SYNTAX_POSIX_BASIC;
  • ‘d’: ONIG_SYNTAX_POSIX_EXTENDED;

Descriptions of these constants are available in Oniguruma’s API document: API.txt

So, for example, mb_regex_set_options('pr') is equivalent to mb_regex_set_options('msr') and means:

  • . should include \n (aka “multiline-match”)
  • ^ is equivalent to \A, $ is equivalent to \Z (aka “strings are single-lined”)
  • using RUBY-mode

By the way, that is the default setting for mb_ereg_* functions. Also, the mb_ereg_match and mb_ereg_search families of functions take an options parameter explicitly.
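Since these options are global state, it can help to read the current value before changing it and to put it back afterwards. A small sketch of that pattern follows; the option string 'iz' and the sample pattern are my own choices for illustration.

```php
<?php
// Called without an argument, mb_regex_set_options() only reports the current options.
$previous = mb_regex_set_options();

// Switch to case-insensitive matching with Perl syntax.
mb_regex_set_options('iz');
$matched = (bool) mb_ereg('^php$', 'PHP');

// Restore whatever was set before, so other code is not affected.
mb_regex_set_options($previous);
$matchedAfterRestore = (bool) mb_ereg('^php$', 'PHP');
```

With 'i' in effect the first call should match. After the restore, the same call should fail, assuming the previous options did not include 'i'.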

So, back to the functions:

    // check that the string matches the regexp, starting from its first character
    // (mb_ereg_match() anchors at the beginning only; end the pattern with \z
    // if the whole string has to match)
    mb_ereg_match('[a-c]+', $user_string, 'pz'); // 'pz' specifies options for this operation
                                                 // (multiline perl-mode in this case)

    // replace any of letters from 'a' to 'c' range with 'Z'
    $output = mb_ereg_replace('[a-c]', 'Z', $user_string, 'b'); // use basic POSIX mode
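One detail worth spelling out: as far as I can tell, mb_ereg_match() tries the pattern at the start of the string only and does not require it to reach the end. The sketch below shows the difference; the sample strings are made up.

```php
<?php
mb_regex_encoding('UTF-8');

// The pattern matches at the beginning of the string; the trailing 'd' is ignored.
$prefixOnly = mb_ereg_match('[a-c]+', 'abcd');

// Adding \z (end of string) makes the whole string part of the check.
$wholeString = mb_ereg_match('[a-c]+\z', 'abcd');

// The match is attempted at position 0 only, so a later occurrence does not count.
$notAtStart = mb_ereg_match('[a-c]+', 'dabc');
```

Only the first call is expected to return true.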

OK, these were easy and similar to what you’ve seen in preg_* functions. Now, on to something more powerful. The real strength lies in the mb_ereg_search_* functions. The idea is that you can let Oniguruma preparse and cache the text and/or the regexp in its internal buffers. If you do, matching will work a lot faster.

    mb_ereg_search_init($some_long_text); // preparse text
    while (mb_ereg_search('[a-c]')) {     // execute search, moving on to the next match
        $r = mb_ereg_search_getregs();    // get the current result
        // work with matched result
    }

    mb_ereg_search_setpos(0); // rewind: the loop left the position at the end of the text
    mb_ereg_search('[d-e]');  // execute different search on the same text

    mb_ereg_search_init($some_other_text); // preparse another text
    mb_ereg_search(); // execute search using previous (already preparsed) regexp

This is the fastest way of parsing large documents in PHP, as far as I know.

Notes on charsets. Though it is often mentioned that mb_ereg_* functions are “unicode”, it would be more practical to say that they are encoding-aware. It is a good idea to specify which encoding you use before calling Oniguruma.
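Here is a variant of the same loop as a self-contained sketch. It uses mb_ereg_search_regs(), which returns the next match (or false) and advances the position in one call, whereas mb_ereg_search_getregs() only reports the last match. The sample text is hypothetical.

```php
<?php
mb_regex_encoding('UTF-8');

$text = 'abc xyz cab 123 bca'; // hypothetical sample text

// Hand over the text and the pattern once.
mb_ereg_search_init($text, '[a-c]+');

$matches = [];
// Each call continues from where the previous match ended.
while (($regs = mb_ereg_search_regs()) !== false) {
    $matches[] = $regs[0];
}

// Rewind and run a different pattern over the same text.
mb_ereg_search_setpos(0);
$digits = mb_ereg_search_regs('[0-9]+');
```

After the loop, $matches should hold 'abc', 'cab' and 'bca', and $digits[0] should be '123'. The explicit mb_ereg_search_setpos(0) is there because a failed search leaves the position at the end of the text.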

Some options:

    mb_regex_encoding('UTF-8');
    mb_regex_encoding('CP1251');    // windows cyrillic encoding
    mb_regex_encoding('Shift_JIS'); // japanese

Check the full list of supported encodings.
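To show what “encoding-aware” means in practice, here is a short sketch with Cyrillic text. It assumes the script file itself is saved as UTF-8; the sample strings are mine.

```php
<?php
// Tell the regex engine that patterns and subjects are UTF-8.
mb_regex_encoding('UTF-8');

$text = 'Привет, мир'; // assumes this source file is saved as UTF-8

// Split on a comma followed by optional whitespace.
$parts = mb_split(',\s*', $text);

// Character classes work per character, not per byte.
$replaced = mb_ereg_replace('[а-я]', '*', 'мир');
```

$parts should come out as the two words without the comma, and $replaced as three asterisks, one per letter rather than one per byte.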

Liked this post? Follow me on Twitter: @jimi_dini.

  • Stas

    How about fixing the manual for them? ;)

  • http://www.workingweb.nl/ Maarten Stolte

    Pretty cool information. Someone should append it to the PHP Manual?

  • http://www.surroundsounddj.com dj

    You can replace all of your ereg calls with mb_ereg if you want a quick solution that saves time. mb_ereg is not marked as deprecated and it is a direct replacement for ereg.

  • http://blog.milkfarmsoft.com/ Alexey Zakhlestin

    It is not a “direct” replacement. mb_ereg supports a wider set of regex patterns and it works with various codepages (which means you have to set which codepage you work with).
