PHP has some sets of functions that are not widely known. One of them is the mb_ereg_* family of functions.
There is a common misunderstanding that mb_ereg_* functions are just Unicode counterparts of the ereg_* functions: slow and not very powerful. That’s as far from the truth as it can be.
mb_ereg_* functions are based on the Oniguruma regular expression library, and Oniguruma is one of the fastest and most capable regular expression libraries out there. A couple of years ago I made a little speed test.
Anyway, this time I am going to talk about its usage. The PHP documentation doesn’t say much.
Let’s start with the basic fact: you don’t need to put additional delimiters around your regular expressions when you use mb_ereg_* functions. For example:
// find first substring consisting of letters from 'a' to 'c' in 'abcdabc' string.
mb_ereg('[a-c]+', 'abcdabc', $res);
To execute the same search in a case-insensitive fashion, use mb_eregi().
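As a small sketch of the two calls side by side (the variable names are only for illustration, and the default regex encoding and options are assumed):

```php
<?php
// Case-sensitive search: $res[0] receives the matched substring.
$found = mb_ereg('[a-c]+', 'abcdabc', $res);
echo $res[0], "\n"; // abc

// Same pattern against upper-case input.
$sensitive   = mb_ereg('[a-c]+', 'ABCDABC', $res_cs);  // no match
$insensitive = mb_eregi('[a-c]+', 'ABCDABC', $res_ci); // matches "ABC"
```

Note that the return value differs between PHP versions (a boolean in PHP 8, an integer or false before that), so it is safest to treat it as truthy or falsy.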
The mb_ereg(), mb_eregi() and mb_split() functions use pre-set options. You can check the current options and set new ones with the mb_regex_set_options() function. It takes a string in which each letter has a meaning.
These are the parameters (you can specify several of them at the same time):
- ‘i’: ONIG_OPTION_IGNORECASE;
- ‘x’: ONIG_OPTION_EXTEND;
- ‘m’: ONIG_OPTION_MULTILINE;
- ‘s’: ONIG_OPTION_SINGLELINE;
- ‘p’: ONIG_OPTION_MULTILINE | ONIG_OPTION_SINGLELINE;
- ‘l’: ONIG_OPTION_FIND_LONGEST;
- ‘n’: ONIG_OPTION_FIND_NOT_EMPTY;
- ‘e’: eval() the resulting code (mb_ereg_replace() only; deprecated in PHP 7.1 and removed in PHP 8.0)
And there are “modes” (if you specify several of these, the LAST one will be used):
- ‘j’: ONIG_SYNTAX_JAVA;
- ‘u’: ONIG_SYNTAX_GNU_REGEX;
- ‘g’: ONIG_SYNTAX_GREP;
- ‘c’: ONIG_SYNTAX_EMACS;
- ‘r’: ONIG_SYNTAX_RUBY;
- ‘z’: ONIG_SYNTAX_PERL;
- ‘b’: ONIG_SYNTAX_POSIX_BASIC;
- ‘d’: ONIG_SYNTAX_POSIX_EXTENDED;
Descriptions of these constants are available in this document: API.txt
So, for example, mb_regex_set_options('pr') is equivalent to mb_regex_set_options('msr') and means:
- . matches \n as well (aka “multiline-match”)
- ^ is equivalent to \A, $ is equivalent to \Z (aka “strings are single-lined”)
- Ruby syntax mode is used
By the way, that is the default setting for mb_ereg_* functions. Also, the mb_ereg_match and mb_ereg_search families of functions take an options parameter explicitly.
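To see what these letters do in practice, here is a minimal sketch that switches the global options and watches the dot and case behaviour change. The variable names are made up for the example; the options are restored at the end so the rest of a script is not affected.

```php
<?php
// Remember whatever is configured now ("pr" unless something changed it).
$previous = mb_regex_set_options();

mb_regex_set_options('pr');               // 'm' + 's' + Ruby syntax
$dot_with_m = mb_ereg('a.b', "a\nb");     // '.' may match "\n"

mb_regex_set_options('sr');               // drop 'm', keep 's'
$dot_without_m = mb_ereg('a.b', "a\nb");  // '.' stops at "\n"

mb_regex_set_options('ipr');              // add case-insensitivity globally
$icase = mb_ereg('[a-c]+', 'ABC', $res);

mb_regex_set_options($previous);          // restore the earlier options
```

Because these options are global state, passing them per call (where the function allows it) is usually the safer choice.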
So, back to functions:
// check that the string matches the regexp, starting at its first character
// (mb_ereg_match() anchors at the start only; end the pattern with \z to require the whole string):
mb_ereg_match('[a-c]+', $user_string, 'pz'); // 'pz' specifies options for this operation
                                             // (multiline perl-mode in this case)
// replace any of letters from 'a' to 'c' range with 'Z'
$output = mb_ereg_replace('[a-c]', 'Z', $user_string, 'b'); // use basic POSIX mode
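One detail worth checking for yourself is how mb_ereg_match() anchors: it only tests for a match beginning at the first character. The sketch below uses literal sample strings and illustrative variable names, with the default options.

```php
<?php
// mb_ereg_match() looks for a match that begins at offset 0.
$prefix_only  = mb_ereg_match('[a-c]+', 'abcd');     // true: "abc" matches at the start
$not_at_start = mb_ereg_match('[a-c]+', 'dabc');     // false: nothing matches at offset 0
$whole_fail   = mb_ereg_match('[a-c]+\z', 'abcd');   // false: \z demands the end of string
$whole_ok     = mb_ereg_match('[a-c]+\z', 'abcabc'); // true: the whole string matches

// 'b' selects POSIX basic syntax for this one replace call.
$output = mb_ereg_replace('[a-c]', 'Z', 'abcdabc', 'b');
echo $output, "\n"; // ZZZdZZZ
```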
OK, these were easy and similar to what you’ve seen in the preg_* functions. Now, on to something more powerful. The real strength lies in the mb_ereg_search_* functions. The idea is that you can let Oniguruma preparse and cache the text and/or the regexp in its internal buffers. If you do, matching will work a lot faster.
mb_ereg_search_init($some_long_text, '[a-c]'); // preparse text and regexp
while ($r = mb_ereg_search_regs()) { // run the search and get next result
    // work with matched result ($r[0] is the whole match)
}
mb_ereg_search_setpos(0); // rewind: a failed search leaves the position at the end
mb_ereg_search('[d-e]'); // execute different search on the same text
mb_ereg_search_init($some_other_text); // preparse another text
mb_ereg_search(); // execute search using previous (already preparsed) regexp
This is the fastest way of parsing large documents in PHP, as far as I know.
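The search loop can be wrapped into a small helper. This is only a sketch: find_all() is a hypothetical name, not part of mbstring, and it assumes the pattern cannot match an empty string.

```php
<?php
// Hypothetical helper: collect every match of $pattern in $text.
function find_all(string $pattern, string $text): array
{
    $matches = [];
    if (!mb_ereg_search_init($text, $pattern)) {
        return $matches;
    }
    // mb_ereg_search_regs() runs the next search and advances the position;
    // it returns false once nothing more is found.
    while (($regs = mb_ereg_search_regs()) !== false) {
        $matches[] = $regs[0];
    }
    return $matches;
}

$words = find_all('[a-c]+', 'abcdabc xyz cab');
print_r($words); // abc, abc, cab
```

Note that mb_ereg_search_getregs() only returns the result of the last search without advancing, so the loop relies on mb_ereg_search_regs() to move forward.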
Notes on charsets. Though it is often mentioned that mb_ereg_* functions are “Unicode”, it would be more practical to say that they are encoding-aware. It is a good idea to specify which encoding you use before calling Oniguruma.
Some options:
mb_regex_encoding('UTF-8');
mb_regex_encoding('CP1251'); // windows cyrillic encoding
mb_regex_encoding('Shift_JIS'); // japanese
Check the full list of supported encodings.
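To illustrate what “encoding-aware” means, here is a short sketch that assumes the source file itself is saved as UTF-8. The sample word and variable names are just for the example.

```php
<?php
mb_regex_encoding('UTF-8');

// "привет" is 6 characters but 12 bytes in UTF-8.
$text = 'привет';

// '.' consumes whole characters, not bytes.
mb_ereg('^.{3}', $text, $res);
echo $res[0], "\n"; // при

// mb_split() splits on a multibyte pattern without cutting characters in half.
$parts = mb_split('и', $text);
print_r($parts); // пр, вет
```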
