PHP has some sets of functions which are not known to a wide audience. One of those is the mb_ereg_* family of functions.
There is a common misunderstanding that mb_ereg_* functions are just Unicode counterparts of the ereg_* functions: slow and not very powerful. That’s as far from the truth as it can be.
mb_ereg_* functions are based on the oniguruma regular expressions library, and oniguruma is one of the fastest and most capable regular expression libraries out there. A couple of years ago I made a little speed test.
Anyway, this time I want to talk about its usage. The PHP documentation doesn’t say much.
Let’s start with the basic fact: you don’t need to put additional delimiters around your regular expressions when you use mb_ereg_* functions. For example:
    // find first substring consisting of letters from 'a' to 'c' in 'abcdabc' string.
    mb_ereg('[a-c]+', 'abcdabc', $res);
To execute the same search in a case-insensitive fashion, use mb_eregi().
The mb_ereg(), mb_eregi() and mb_split() functions use pre-set options in their work. You can check the current options and set new ones using the mb_regex_set_options() function. It takes a string in which each letter stands for one option.
These are the options (you can specify several of them at the same time):
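As a small illustration of the difference, here is a sketch that runs the same pattern through both functions. The subject string and variable names are made up for the example, and it assumes the default regex options are in effect.

```php
<?php
// Same pattern, matched case-sensitively and case-insensitively.
$subject = 'xxABCdabc'; // made-up sample text

mb_ereg('[a-c]+', $subject, $sensitive);    // skips the upper-case run
mb_eregi('[a-c]+', $subject, $insensitive); // accepts either case

echo $sensitive[0], "\n";   // prints "abc"
echo $insensitive[0], "\n"; // prints "ABC"
```

In both calls the third argument receives the match, with index 0 holding the whole matched substring.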
- ‘i’:ONIG_OPTION_IGNORECASE;
- ‘x’:ONIG_OPTION_EXTEND;
- ‘m’:ONIG_OPTION_MULTILINE;
- ‘s’:ONIG_OPTION_SINGLELINE;
- ‘p’:ONIG_OPTION_MULTILINE | ONIG_OPTION_SINGLELINE;
- ‘l’:ONIG_OPTION_FIND_LONGEST;
- ‘n’:ONIG_OPTION_FIND_NOT_EMPTY;
- ‘e’:eval() resulting code
And there are “modes” (if you specify several of these, the LAST one will be used):
- ‘j’:ONIG_SYNTAX_JAVA;
- ‘u’:ONIG_SYNTAX_GNU_REGEX;
- ‘g’:ONIG_SYNTAX_GREP;
- ‘c’:ONIG_SYNTAX_EMACS;
- ‘r’:ONIG_SYNTAX_RUBY;
- ‘z’:ONIG_SYNTAX_PERL;
- ‘b’:ONIG_SYNTAX_POSIX_BASIC;
- ‘d’:ONIG_SYNTAX_POSIX_EXTENDED;
Descriptions of these constants are available in this document: API.txt
So, for example, mb_regex_set_options('pr') is equivalent to mb_regex_set_options('msr') and means:
- . should include \n (aka “multiline-match”)
- ^ is equivalent to \A,$ is equivalent to \Z (aka “strings are single-lined”)
- using RUBY-mode
By the way, that is the default setting for mb_ereg_* functions. Also, the mb_ereg_match and mb_ereg_search families of functions take an options parameter explicitly.
So, back to the functions:
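Since these options are global state, it can be worth saving and restoring them around a block of code. A minimal sketch, relying on the fact that mb_regex_set_options() called without arguments only reports the current option string (the pattern and variable names are illustrative):

```php
<?php
$saved = mb_regex_set_options();   // no argument: just report the current options

mb_regex_set_options('ir');        // case-insensitive, Ruby syntax
$hit = mb_ereg('^php$', 'PHP');    // mb_ereg() now ignores case, so this matches

mb_regex_set_options($saved);      // restore whatever was active before
$miss = mb_ereg('^php$', 'PHP');   // no match if the saved options were case-sensitive

var_dump((bool) $hit, (bool) $miss);
```

Reading the options first, rather than relying on the return value of the setting call, keeps the sketch independent of how a given PHP version reports the result of a change.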
    // make sure that the whole string matches the regexp; mb_ereg_match() only anchors
    // the match at the start of the string, so \A and \z are spelled out
    mb_ereg_match('\A[a-c]+\z', $user_string, 'pz'); // 'pz' specifies options for this operation
                                                     // (multiline perl-mode in this case)
    // replace any of letters from 'a' to 'c' range with 'Z'
    $output = mb_ereg_replace('[a-c]', 'Z', $user_string, 'b'); // use basic POSIX mode
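One detail worth checking for yourself: as far as I can tell, mb_ereg_match() only requires the pattern to match at the beginning of the string, not all the way to the end. The sketch below shows the difference on a made-up input, using the default options:

```php
<?php
$input = 'abcd'; // made-up sample: three allowed letters, then a 'd'

$prefixOnly  = mb_ereg_match('[a-c]+', $input);       // true: 'abc' matches at the start
$wholeString = mb_ereg_match('\A[a-c]+\z', $input);   // false: 'd' is left over
$replaced    = mb_ereg_replace('[a-c]', 'Z', $input); // 'ZZZd'

var_dump($prefixOnly, $wholeString, $replaced);
```

Adding \z (end of string) to the pattern is what turns the prefix check into a whole-string check.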
OK, these were easy and similar to what you’ve seen in the preg_* functions. Now, on to something more powerful. The real strength lies in the mb_ereg_search_* functions. The idea is that you can let oniguruma preparse and cache the text and/or the regexp in its internal buffers. If you do, matching will work a lot faster.
    mb_ereg_search_init($some_long_text, '[a-c]'); // preparse text and regexp
    while ($r = mb_ereg_search_regs()) {           // find next match and get its result
        // work with matched result
    }
    mb_ereg_search_setpos(0);                      // rewind to the start of the text
    mb_ereg_search('[d-e]');                       // execute different search on the same text
    mb_ereg_search_init($some_other_text);         // preparse another text
    mb_ereg_search();                              // execute search using previous (already preparsed) regexp
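To make the search loop concrete, here is a sketch of a small helper that collects every match in a text. The find_all() function is hypothetical (it is not part of mbstring), and it assumes the pattern never matches an empty string, since an empty match would not move the search position forward.

```php
<?php
// Hypothetical helper, not part of mbstring: return all matches of $pattern in $text.
function find_all(string $pattern, string $text): array
{
    $matches = [];
    mb_ereg_search_init($text, $pattern);   // register text and pattern once
    // mb_ereg_search_regs() finds the next match and advances the position;
    // it returns false once nothing more is found.
    while (($regs = mb_ereg_search_regs()) !== false) {
        $matches[] = $regs[0];              // index 0 is the whole match
    }
    return $matches;
}

echo implode(', ', find_all('[a-c]+', 'abcdabcxb')), "\n"; // prints "abc, abc, b"
```

Note that mb_ereg_search_getregs() only returns the result of the last search without advancing, so it cannot drive such a loop on its own.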
This is the fastest way of parsing large documents in PHP, as far as I know.
Notes on charsets. Though it is often mentioned that mb_ereg_* functions are “Unicode”, it would be more practical to say that they are encoding-aware. It is a good idea to specify which encoding you use before calling oniguruma.
Some options:
    mb_regex_encoding('UTF-8');
    mb_regex_encoding('CP1251');    // windows cyrillic encoding
    mb_regex_encoding('Shift_JIS'); // japanese
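A quick way to see what "encoding-aware" means in practice is to let `.` match characters in a multibyte string. This sketch assumes the source file itself is saved as UTF-8; the sample word is arbitrary.

```php
<?php
$saved = mb_regex_encoding();     // remember the current regex encoding
mb_regex_encoding('UTF-8');       // tell oniguruma how to read the bytes

$word   = 'привет';               // 6 Cyrillic letters, 12 bytes in UTF-8
$masked = mb_ereg_replace('.', '*', $word); // '.' consumes one character, not one byte

echo strlen($word), "\n";         // prints 12 (bytes)
echo $masked, "\n";               // prints "******" (one star per character)

mb_regex_encoding($saved);        // restore the previous encoding
```

With the regex encoding set correctly, each two-byte letter is treated as a single character by the pattern.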
Check the full list of supported encodings.

