24788, "mppf", "Regex Flag Confusion", "2024-04-05T22:47:32Z"
The Regex module currently includes several optional arguments for the initializer. The problem is, some of these don't work.
These ones are OK, as I understand things:
(from the documentation for regex.init)
- literal: set to true to treat the regular expression as a literal (i.e., create a regex matching `pattern` as a string rather than as a regular expression).
- noCapture: set to true in order to disable all capture groups in the regular expression.
- ignoreCase: set to true in order to ignore case when matching. Note that this can be set inside the regular expression with `(?i)`.
- dotAll: set to true in order to allow `.` to match a newline. Note that this can be set inside the regular expression with `(?s)`.
There are three that are problematic (these bullets are from the documentation for regex.init):
- multiLine: set to true in order to activate multiline mode, meaning that `^` and `$` match the beginning and end of a line instead of just the beginning and end of the text. Note that this can be set inside a regular expression with `(?m)`.
- nonGreedy: set to true in order to prefer shorter matches for repetitions; for example, normally `x*` will match as many x characters as possible and `x*?` will match as few as possible. This flag swaps the two, so that `x*` will match as few as possible and `x*?` will match as many as possible. Note that this flag can be set inside the regular expression with `(?U)`.
- posix: set to true to disable non-POSIX regular expression syntax.
We are using these to set RE2 Options like this:
```cpp
opts->set_longest_match(!options->nongreedy);
opts->set_one_line(!options->multiline);
```
But what the RE2 docs describe these Options as doing is very different from `nongreedy` and `multiline`. This leads to the three problems below.
longest match vs nongreedy
RE2 Options docs say:
// longest_match (false) search for longest match, not first match
But the nongreedy flag is meant to correspond to this totally different feature (per the RE2 syntax docs):
U ungreedy: swap meaning of «x*» and «x*?», «x+» and «x+?», etc (default false)
I don't think we should be setting `longest_match = !nongreedy`. Instead, I think we need to prefix the regex with `(?U)` when `nonGreedy=true` is provided, and `throw` an error if this is provided in combination with `posix`.
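To make the distinction concrete, here is a small sketch in Python's `re` module, used only as a stand-in because it follows the same leftmost-first rule as RE2's default mode (it is not RE2 and does not support `(?U)`). Greediness controls how much a repetition consumes; longest-match controls which overall match is chosen among alternatives.

```python
import re

# Greedy vs. lazy repetition: this pair is what RE2's (?U) flag swaps.
greedy = re.match(r"x*", "xxx").group()   # takes as many x as possible
lazy = re.match(r"x*?", "xxx").group()    # takes as few x as possible
print(repr(greedy), repr(lazy))

# Leftmost-first alternation: the first alternative that matches wins,
# even though a longer match is available. A leftmost-longest engine
# (RE2 with longest_match=true, or POSIX matching) would choose "ab".
first = re.match(r"a|ab", "ab").group()
print(repr(first))
```

Neither setting can stand in for the other, which is why mapping `nonGreedy` onto `longest_match` changes results in ways the flag's documentation does not describe.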
The difference is observable with this program:
```chapel
use Regex;
writeln("replacing bb |b->a"); // compare with Python re.sub(r'|b', 'a', "bb") -> aaaaa
writeln("bb".replace(new regex("|b"), "a"));
```
Right now, it produces `aa`, but if I turn off setting longest_match, it produces `ababa` (which makes more sense, according to the RE2 docs).
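For reference, the Python comparison mentioned in the program's comment can be run directly. This assumes Python 3.7 or later, where `re.sub` also replaces empty matches that are adjacent to a previous non-empty match; older versions behave differently.

```python
import re

# The pattern "|b" tries the empty alternative first, then "b".
# Python 3.7+ replaces each empty match and each "b" in turn.
result = re.sub(r"|b", "a", "bb")
print(result)  # the issue reports "aaaaa" for this call
```

Python's result differs from both Chapel outputs above, so it is only a point of comparison; the RE2 documentation remains the reference for what Chapel should produce.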
one_line vs multiline
The RE2 Options docs say:
// The following options are only consulted when posix_syntax == true.
// When posix_syntax == false, these features are always enabled and
// cannot be turned off; to perform multi-line matching in that case,
// begin the regexp with (?m).
// perl_classes (false) allow Perl's \d \s \w \D \S \W
// word_boundary (false) allow Perl's \b \B (word boundary and not)
// one_line (false) ^ and $ only match beginning and end of text
But the multiLine flag is meant to correspond to this related feature (per the RE2 syntax docs):
m multi-line mode: «^» and «$» match begin/end line in addition to begin/end text (default false)
That's close to being reasonable, but normally `posix` mode is off, so `one_line=!multiLine` will be ignored. I think we need to prefix the regex with `(?m)` when `multiLine=true` is provided and `posix` mode is not requested. How this works when `posix` mode is requested is the subject of the next section.
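The two prefixing proposals in this issue can be sketched as a small helper. This is an illustration in Python, not the Chapel implementation: `with_inline_flags` and its parameter names are hypothetical, and the choice to leave the pattern untouched for `multi_line` under `posix` is an assumption, since that case is still an open question here. Python's `re` accepts `(?m)` but not `(?U)`, so only the multiline case is exercised.

```python
import re

def with_inline_flags(pattern, multi_line=False, non_greedy=False, posix=False):
    """Hypothetical helper: prepend RE2-style inline flags to a pattern."""
    if non_greedy and posix:
        # Proposed behavior: reject this combination instead of ignoring it.
        raise ValueError("nonGreedy cannot be combined with posix")
    flags = ""
    if multi_line and not posix:
        flags += "m"   # ^ and $ also match at line boundaries
    if non_greedy:
        flags += "U"   # swap greedy and lazy repetition (RE2 syntax)
    return f"(?{flags}){pattern}" if flags else pattern

text = "one\ntwo"
# Without (?m), ^ and $ only anchor at the start and end of the text.
plain = re.findall(with_inline_flags(r"^\w+$"), text)
# With (?m), each line is matched separately.
multi = re.findall(with_inline_flags(r"^\w+$", multi_line=True), text)
print(plain, multi)
```

Building the prefix in one place keeps the flag handling independent of the RE2 Options struct, which only consults `one_line` when `posix_syntax` is true.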
posix mode
RE2 provides a "canned" option for POSIX mode. That activates `longest_match=true`. Additionally, per the above, it's my understanding that POSIX regexes have a different default w.r.t. multiLine mode (multiline mode is enabled by default).
Presumably, `posix=true` should activate `longest_match=true`. It would also normally imply `multiLine=true`. Is it possible or reasonable to change the default behavior for `new Regex("something", posix=true)` to activate `multiLine=true`? This would better match RE2 (and presumably POSIX regular expressions).