[Israel.pm] A simpler regex required

Amir E. Aharoni amir.aharoni at gmail.com
Tue Aug 14 09:54:26 PDT 2007


On 14/08/07, Peter Gordon <peter at pg-consultants.com> wrote:
> Hi.
>
> Let's suppose that I have the following lines in an HTML file.
> I want to substitute the spaces in the date part with non-breaking spaces (&nbsp;)
>
> <td  style="text-align: left" bgcolor="#92c1bb">Aug 12 23:59:59 2007 GMT</td>
> <td  style="text-align: left" bgcolor="#92c1bb">Aug 12 23:59:59 2007 GMT</td>
>
> I came up with this line - but somehow it isn't aesthetic.
>
> s!(<td.*?>)(.*?)(</td>)!my $t1 = $1 ;my $t2 = $2 ; my $t3 = $3 ; $t2 =~ s/\s/&nbsp;/g ; "$t1$t2$t3" ;!egs ;
>
> Is there a nicer/cleaner way to write it?

It's a clever way, but I am very much into "Perl Best Practices"
lately, and it says "Don't be clever" :)

It's very TMTOWTDI, of course.

I first thought of a different regex for this, one without the /e,
using a lookbehind assertion, something like (?<= .*>), but apparently
variable-length lookbehind assertions are not implemented.
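
A lookahead can be any length, so one possible /e-free variant looks
forward to the closing tag instead of backward to the opening one. This
is only a sketch, and it assumes the cell contents contain no nested
tags and no literal "<" or ">" characters; the $line and $fixed names
are just for illustration.

```perl
use strict;
use warnings;

# Sample input taken from the original question.
my $line = '<td  style="text-align: left" bgcolor="#92c1bb">Aug 12 23:59:59 2007 GMT</td>';

# Replace a whitespace character only when nothing but non-tag characters
# separates it from the closing </td>. Whitespace inside the opening tag
# fails the lookahead, because a ">" is reached before any "</td>".
# ASSUMPTION: the cell text has no nested tags and no "<" or ">".
(my $fixed = $line) =~ s{\s(?=[^<>]*</td>)}{&nbsp;}g;

print "$fixed\n";
```

If a cell can contain markup such as <b>...</b>, this pattern silently
skips the whitespace before that markup, so the capture-based approach
is the safer one.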

I could also recommend HTML::Parser, but if all you need is replacing
some spaces, then it would be overkill.

So your algorithm is OK. If $str holds just one cell (say, because you
read the file line by line), you don't need the outer s/// at all: a
plain match is enough to capture the three parts of the HTML. If you
do keep the outer s///, you need its /g only when the string can
contain more than one cell.

I would write the same algorithm more readably and simply like this:

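if ($str =~ m{
    (<td.*?>)   # opening tag, with its attributes
    (.*?)       # cell contents
    (</td>)     # closing tag
}xms)
{
    my ($t1, $t2, $t3) = ($1, $2, $3);
    my ($start, $end)  = ($-[0], $+[0]);   # save offsets before the inner s///
    $t2 =~ s/\s/&nbsp;/g;
    # replace only the matched cell, keeping the text around it
    substr($str, $start, $end - $start) = "$t1$t2$t3";
}
else {
    print "expected HTML not found\n";
}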
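
When the input holds several cells, as in Peter's two-line example, the
outer s/// with /g is still the shortest way to visit each of them. One
way to keep it readable is to move the inner substitution into a small
named sub. The sketch below assumes a helper called nbsp_spaces(), which
is a made-up name, and a second sample date invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of the text with every whitespace
# character replaced by &nbsp;. It works on a copy, and because it is a
# separate sub, its own s/// does not disturb the caller's $1, $2, $3.
sub nbsp_spaces {
    my ($text) = @_;
    $text =~ s/\s/&nbsp;/g;
    return $text;
}

# Two cells; the second date is made up for illustration.
my $html = <<'END_HTML';
<td  style="text-align: left" bgcolor="#92c1bb">Aug 12 23:59:59 2007 GMT</td>
<td  style="text-align: left" bgcolor="#92c1bb">Aug 13 00:00:01 2007 GMT</td>
END_HTML

# /g visits every cell, /e evaluates the replacement as Perl code.
$html =~ s{(<td.*?>)(.*?)(</td>)}{ $1 . nbsp_spaces($2) . $3 }xmsge;

print $html;
```

The opening tags and the newlines between the cells are left alone,
because only the second capture is passed to the helper.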


