New Issue: Regex Flag Confusion

24788, "mppf", "Regex Flag Confusion", "2024-04-05T22:47:32Z"

The Regex module currently includes several optional arguments for the initializer. The problem is, some of these don't work.

These ones are OK, as I understand things:

(from the documentation for regex.init)

  • literal: set to true to treat the regular expression as a literal (ie, create a regex matching pattern as a string rather than as a regular expression).
  • noCapture: set to true in order to disable all capture groups in the regular expression
  • ignoreCase: set to true in order to ignore case when matching. Note that this can be set inside the regular expression with (?i).
  • dotAll: set to true in order to allow . to match a newline. Note that this can be set inside the regular expression with (?s).

There are three that are problematic (these bullets are from the documentation for regex.init):

  • multiLine: set to true in order to activate multiline mode meaning that ^ and $ match the beginning and end of a line instead of just the beginning and end of the text. Note that this can be set inside a regular expression with (?m).
  • nonGreedy: set to true in order to prefer shorter matches for repetitions; for example, normally x* will match as many x characters as possible and x*? will match as few as possible. This flag swaps the two, so that x* will match as few as possible and x*? will match as many as possible. Note that this flag can be set inside the regular expression with (?U).
  • posix: set to true to disable non-POSIX regular expression syntax

We are using these to set RE2 Options like this:

  opts->set_longest_match(!options->nongreedy);
  opts->set_one_line(!options->multiline);

But what the RE2 docs describe these Options as doing is very different from nongreedy and multiline. This leads to the three problems below:

longest match vs nongreedy

RE2 Options docs say:

    //   longest_match    (false) search for longest match, not first match

But the nongreedy flag is meant to correspond to this totally different feature (per the RE2 syntax docs):

U       ungreedy: swap meaning of «x*» and «x*?», «x+» and «x+?», etc (default false)

I don't think we should be setting longest_match = !nongreedy. Instead, I think we need to prefix the regex with (?U) when nonGreedy=true is provided, and throw an error if this is provided in combination with posix.

The difference is observable with this program:

use Regex;
writeln("replacing bb |b->a"); // compare with Python re.sub(r'|b', 'a', "bb") -> aaaaa
writeln("bb".replace(new regex("|b"), "a"));

Right now, it produces aa, but if I turn off setting longest_match, it produces ababa (which makes more sense, according to the RE2 docs).
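
To make the two results above easier to follow, here is a toy model of first-match vs. longest-match replacement. It is only a sketch and does not use RE2: the pattern "|b" is stood in for by an ordered list of literal alternatives {"", "b"}, and the names Match, findMatch and replaceAll are made up for illustration. It also assumes the replace loop skips an empty match that sits directly at the end of the previous match; that rule is an assumption about how the replacement loop behaves, not something taken from the Chapel sources.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy model only: the "regex" is an ordered list of literal alternatives,
// so {"", "b"} stands in for the pattern "|b". This is not RE2.
struct Match { std::size_t start; std::size_t len; };

// Find the leftmost match at or after `from`.
// longest == false: the first alternative that matches wins (first match).
// longest == true:  the longest matching alternative wins (longest match).
static bool findMatch(const std::vector<std::string>& alts,
                      const std::string& text, std::size_t from,
                      bool longest, Match& out) {
  for (std::size_t s = from; s <= text.size(); ++s) {
    bool found = false;
    for (const std::string& alt : alts) {
      if (text.compare(s, alt.size(), alt) != 0) continue;
      if (!found || alt.size() > out.len) out = Match{s, alt.size()};
      found = true;
      if (!longest) break;  // first-match semantics: stop at the first hit
    }
    if (found) return true;
  }
  return false;
}

// Replace every match of the alternatives in `text` with `repl`.
static std::string replaceAll(const std::vector<std::string>& alts,
                              const std::string& text,
                              const std::string& repl, bool longest) {
  std::string out;
  std::size_t pos = 0;
  std::size_t lastEnd = 0;
  bool haveLast = false;
  while (pos <= text.size()) {
    Match m;
    if (!findMatch(alts, text, pos, longest, m)) break;
    out.append(text, pos, m.start - pos);  // copy text before the match
    if (m.len == 0 && haveLast && m.start == lastEnd) {
      // Assumed rule: skip an empty match right after the previous match,
      // copying one character of input so the loop makes progress.
      if (m.start < text.size()) out.push_back(text[m.start]);
      pos = m.start + 1;
      continue;
    }
    out += repl;
    pos = m.start + m.len;
    lastEnd = pos;
    haveLast = true;
  }
  if (pos < text.size()) out.append(text, pos, std::string::npos);
  return out;
}
```

Under these assumptions, replaceAll({"", "b"}, "bb", "a", true) gives aa (longest match picks "b" over the empty alternative at each position), while replaceAll({"", "b"}, "bb", "a", false) gives ababa (the empty alternative is listed first, so it wins wherever it is allowed).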

one_line vs multiline

The RE2 Options docs say:

    // The following options are only consulted when posix_syntax == true.
    // When posix_syntax == false, these features are always enabled and
    // cannot be turned off; to perform multi-line matching in that case,
    // begin the regexp with (?m).
    //   perl_classes     (false) allow Perl's \d \s \w \D \S \W
    //   word_boundary    (false) allow Perl's \b \B (word boundary and not)
    //   one_line         (false) ^ and $ only match beginning and end of text

But the multiLine flag is meant to correspond to this related feature (per the RE2 syntax docs):

m       multi-line mode: «^» and «$» match begin/end line in addition to begin/end text (default false)

That's close to being reasonable, but normally posix mode is off, so one_line=!multiLine will be ignored. I think we need to prefix the regex with (?m) when multiLine=true is provided and posix mode is not requested. How this works when posix mode is requested is the subject of the next section.
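
The two prefixing proposals so far (for nonGreedy and for multiLine) could be sketched together roughly as follows. withInlineFlags is a hypothetical helper, not an existing Chapel or RE2 function, and the choice to leave the pattern untouched when posix is requested is just a placeholder, since that case is still an open question.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of Chapel or RE2): translate the nonGreedy
// and multiLine initializer flags into RE2 inline flags prefixed to the
// pattern, instead of mapping them onto longest_match / one_line.
static std::string withInlineFlags(const std::string& pattern, bool nonGreedy,
                                   bool multiLine, bool posix) {
  // (?U) is non-POSIX syntax, so the combination is rejected.
  if (nonGreedy && posix)
    throw std::invalid_argument("nonGreedy cannot be combined with posix");
  // Placeholder: what multiLine should mean under posix is left open here.
  if (posix) return pattern;
  std::string flags;
  if (multiLine) flags += 'm';
  if (nonGreedy) flags += 'U';
  return flags.empty() ? pattern : "(?" + flags + ")" + pattern;
}
```

For example, withInlineFlags("^x*$", true, true, false) would produce (?mU)^x*$, and the pattern is passed through unchanged when neither flag is set.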

posix mode

RE2 provides a "canned" option for POSIX mode. That activates longest_match=true. Additionally, per the above, it's my understanding that POSIX regexes have a different default w.r.t. multiLine mode (multiline mode is enabled by default).

Presumably, posix=true should activate longest_match=true. It would also normally imply multiLine=true. Is it possible or reasonable to change the default behavior for new regex("something", posix=true) to activate multiLine=true? This would better match RE2 (and presumably POSIX regular expressions).