PHP has some sets of functions that are not widely known. One of them is the mb_ereg_* family of functions.
There is a common misunderstanding that the mb_ereg_* functions are just Unicode counterparts of the ereg_* functions: slow and not very powerful. That’s as far from the truth as it can be.
The mb_ereg_* functions are based on the oniguruma regular expression library, and oniguruma is one of the fastest and most capable regular expression libraries out there. A couple of years ago I made a little speed test.
Anyway, this time I want to talk about how to use it. The PHP documentation doesn’t say much.
Let’s start with the basic fact: you don’t need to put additional delimiters around your regular expressions when you use the mb_ereg_* functions. The pattern string is passed to the function exactly as written.
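A minimal sketch of what that looks like in practice. The sample text and variable names are mine, not from the original post, and it assumes mbstring was built with multibyte regex support:

```php
<?php
// Assumption: mbstring is compiled with oniguruma (multibyte regex) support.
mb_regex_encoding('UTF-8');

$text = 'Hello, World';

// No delimiters: compare with preg_match('/W(or)ld/', ...).
$found = mb_ereg('W(or)ld', $text, $regs);
// $regs[0] holds the whole match, $regs[1] the first group.

// The same search, case-insensitive.
$foundInsensitive = mb_eregi('w(or)ld', $text, $regsInsensitive);
```

The return value of mb_ereg() differs between PHP versions (a length in PHP 7, a boolean in PHP 8), so comparing against false is the safest check.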
To execute the same search in a case-insensitive fashion, you should use mb_eregi().
The mb_ereg(), mb_eregi() and mb_split() functions use pre-set options in their work. You can check the current options and set new ones using the mb_regex_set_options() function. This function takes a string as its parameter, and each letter of that string means something.
These are the parameters (you can specify several of them at the same time):
- ‘i’: ONIG_OPTION_IGNORECASE
- ‘x’: ONIG_OPTION_EXTEND
- ‘m’: ONIG_OPTION_MULTILINE
- ‘s’: ONIG_OPTION_SINGLELINE
- ‘p’: ONIG_OPTION_MULTILINE | ONIG_OPTION_SINGLELINE
- ‘l’: ONIG_OPTION_FIND_LONGEST
- ‘n’: ONIG_OPTION_FIND_NOT_EMPTY
- ‘e’: eval() the resulting code (applies to mb_ereg_replace() only; deprecated in PHP 7.1 and removed in PHP 8.0)
And there are “modes” (if you specify several of these, the LAST one will be used):
- ‘j’: ONIG_SYNTAX_JAVA
- ‘u’: ONIG_SYNTAX_GNU_REGEX
- ‘g’: ONIG_SYNTAX_GREP
- ‘c’: ONIG_SYNTAX_EMACS
- ‘r’: ONIG_SYNTAX_RUBY
- ‘z’: ONIG_SYNTAX_PERL
- ‘b’: ONIG_SYNTAX_POSIX_BASIC
- ‘d’: ONIG_SYNTAX_POSIX_EXTENDED
Descriptions of these constants are available in this document: API.txt
So, for example, mb_regex_set_options('pr') is equivalent to mb_regex_set_options('msr') and means:
- . should include \n (aka “multiline-match”)
- ^ is equivalent to \A, $ is equivalent to \Z (aka “strings are single-lined”)
- using RUBY-mode
By the way, that is the default setting for the mb_ereg_* functions. Also, the mb_ereg_match and mb_ereg_search families of functions take an options parameter explicitly.
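Here is a small sketch of both ways of passing options: globally through mb_regex_set_options(), and per call through the options parameter of mb_ereg_match(). The sample strings and variable names are my own, and restoring the saved options at the end is just a habit I would suggest, not something the library requires:

```php
<?php
mb_regex_encoding('UTF-8');

// With no argument, mb_regex_set_options() reports the current option string.
$saved = mb_regex_set_options();

// 'r' alone: Ruby syntax, no flags, so the match is case-sensitive.
mb_regex_set_options('r');
$caseSensitive = mb_ereg('hello', 'HELLO world');

// Adding 'i' changes how mb_ereg() and mb_split() behave from here on.
mb_regex_set_options('ir');
$caseInsensitive = mb_ereg('hello', 'HELLO world');

// Put back whatever was active before, so other code is not affected.
mb_regex_set_options($saved);

// mb_ereg_match() takes the option string per call instead.
// With 'm' the "." also matches "\n"; without it, it does not.
$dotMatchesNewline = mb_ereg_match('a.b', "a\nb", 'mr');
$dotStopsAtNewline = mb_ereg_match('a.b', "a\nb", 'r');
```

Note that mb_ereg_match() only matches at the beginning of the string, which is why the pattern here covers the whole input.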
So, back to functions:
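The original listing for this spot is missing, so what follows is my own short sketch of the three basic operations (match, replace, split) rather than the author's code. The inputs are made up for illustration:

```php
<?php
mb_regex_encoding('UTF-8');

// Match with capture groups: $regs[0] is the whole match, $regs[1..] the groups.
mb_ereg('(\d+)-(\d+)', 'range 10-20', $regs);

// Replace: groups are referenced in the replacement as \1, \2, ...
$swapped = mb_ereg_replace('(\w+) (\w+)', '\2 \1', 'hello world');

// Split on a pattern: a comma with optional whitespace around it.
$parts = mb_split('\s*,\s*', 'a , b,c');
```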
Ok, these were easy and similar to what you’ve seen in the preg_* functions. Now, to something more powerful. The real strength lies in the mb_ereg_search_* functions. The idea is that you can let oniguruma preparse and cache the text and/or the regexp in its internal buffers. If you do, matching will work a lot faster.
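A sketch of that pattern, assuming a toy input string of my own. The subject and the pattern are registered once with mb_ereg_search_init(), and each mb_ereg_search_regs() call then continues from where the previous match ended:

```php
<?php
mb_regex_encoding('UTF-8');

$text = 'id=1; id=22; id=333';

// Hand the subject string and the pattern over once.
mb_ereg_search_init($text, 'id=(\d+)');

$ids = [];
// mb_ereg_search_regs() returns the groups of the next match, or false when done.
while (($regs = mb_ereg_search_regs()) !== false) {
    $ids[] = $regs[1];
}
```

If you need offsets rather than captured text, mb_ereg_search_pos() and mb_ereg_search_getpos() work against the same preset state.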
This is the fastest way of parsing large documents in PHP, as far as I know.
Notes on charsets. Though it is often mentioned that the mb_ereg_* functions are “unicode”, it would be more practical to say that they are encoding-aware. It is a good idea to specify which encoding you use before calling oniguruma.
Some options:
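The original listing is lost here too, so this is only a guess at the kind of calls it showed. It assumes the source file itself is saved as UTF-8, and the encodings named are ones I expect to be available in a typical build:

```php
<?php
// Select another encoding; the call returns true when the name is accepted.
$eucAccepted = mb_regex_encoding('EUC-JP');

// Switch to UTF-8 for the rest of the script.
mb_regex_encoding('UTF-8');

// With no argument, the function reports the current encoding.
$current = mb_regex_encoding();

// "." now consumes one whole character, not one byte.
mb_ereg('^(.)(.)', 'жук', $regs);
```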
Check the full list of supported encodings.