[Israel.pm] Regex: Groups inside a ?:-cluster

Gaal Yahas gaal at forum2.org
Thu Jan 1 04:10:00 PST 2009


Optimizations in the regexp engine have been a long-standing TODO for
perl5. 5.10 introduced some improvements natively (but I don't think
those affect your case). There's also Regexp::Optimizer on CPAN, which
unfortunately doesn't address your case either.
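
Until the engine does this folding itself, one way to keep the named
pieces without paying for the alternation is to compose the character
class from a plain string. This is only a rough sketch: the variable
$chars_body, the test string and the repeat count are made up for
illustration, and the timings will vary by perl version and machine.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Keep the class *body* as a plain string (hypothetical name), so that
# pieces can be merged into a single bracketed class.
my $chars_body = '\-_+%a-z0-9';         # same characters as $chars in the quoted mail
my $chars      = qr/[$chars_body]/;     # some chars we allow
my $charsdot   = qr/[.$chars_body]/;    # dot folded into the class, no "|"

# The alternation form from the quoted mail, kept for comparison.
my $charsdot_alt = qr/\.|$chars/;

# Made-up input consisting only of allowed characters.
my $x = 'a.b-c_d' x 1000;

for my $case ([class => $charsdot], [alternation => $charsdot_alt]) {
    my ($name, $re) = @$case;
    my $t0      = time;
    my $matched = ($x =~ /(?:$re){5000}/) ? 'yes' : 'no';
    printf "%-12s matched=%s  %.4fs\n", $name, $matched, time - $t0;
}
```

Both forms accept the same strings; only the first compiles to a single
character class. The (?:...) around $re just keeps the {5000} quantifier
from being read as anything else.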

See also this writeup: http://swtch.com/~rsc/regexp/regexp1.html which
claims a fundamental algorithmic overhaul of the engine is due. The
good news is that, at least in principle, perl 5.10 and later let you
plug in an alternate engine. Now it's just a simple matter of writing it...

On Thu, Jan 1, 2009 at 1:57 PM, Eli Billauer <eli at billauer.co.il> wrote:
> Gaal Yahas wrote:
>
>> I couldn't find other mention of this, so I'd say this behavior is a
>> bit underspecced, but unlikely to change in Perl 5 -- too many things
>> would break otherwise.
>>
>>
> Thanks. I suppose that's the best answer one can get...
>
> In the meantime, I found out that it may not always be such a good idea
> to be a mathematician about regular expressions. Consider, for example,
> this:
>
> $chars = qr/[\-_+%a-z0-9]/; # Some chars we allow
> $charsdot = qr/\.|$chars/; # Dot allowed as well
>
> Cute, isn't it? $charsdot is everything $chars is, only with the dot
> allowed as well. Now we can use it in regular expressions, such as
>
> print "Matched\n" if ($x =~ /$charsdot{20000}/);
>
> Well, not such a good idea. On Perl 5.8.8 the match above runs 10 times
> slower (5 whole seconds for a 10 MB random string) than it does when the
> dot is simply added inside the square brackets.
>
> Which shouldn't come as a surprise, if we run "print $charsdot;" just to
> find out that it gives:
> (?-xism:\.|(?-xism:[\-_+%a-z0-9]))
>
> Lesson learned: This regular expression is not optimized. Not the
> slightest bit. This isn't a qr// issue, since the same thing happens
> when the contents of $chars are written out explicitly.
>
> -----------------------------
>
> As for inline comments with /x, I don't think they make the code more
> readable, but that's a matter of taste. I kind of lose the continuity,
> and it's pretty difficult to get really useful comments in there. On the
> other hand, if a parenthesis gets misplaced, the comments convince
> readers that the code says what they expect to read, which makes it
> even more difficult to maintain.
>
> What I liked about the qr// is that the regular expression can be broken
> down to its pieces with meaningful names. But as the example above
> shows, that could have a cost.
>
>    Eli
>
> --
> Web: http://www.billauer.co.il



-- 
Gaal Yahas <gaal at forum2.org>
http://gaal.livejournal.com/


