[Israel.pm] finding text

Ephraim Dan E.Dan at F5.com
Wed Oct 10 05:52:00 PDT 2007


Actually, what you want is to parse HTML, not extract text from a string.
You should do it with a parser, not a regexp.

A brief example that does what you want:

use HTML::Parser ();
my @links;
HTML::Parser->new( api_version => 3,
                   start_h => [sub { push @links, $_[0]->{href} if exists $_[0]->{href} }, "attr"],
                   report_tags => [qw(a)],
)->parse( $page )->eof;   # eof tells the parser the document is complete
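
For anyone who wants to try this without a real page at hand, here is a minimal self-contained sketch of the same approach. The sample HTML in $page and the variable names are made up for illustration; it assumes HTML::Parser is installed from CPAN.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical sample input standing in for the real $page.
my $page = <<'HTML';
<p><a href="http://perl.org.il/">Israel.pm</a></p>
<p><a name="top">an anchor with no href</a></p>
<p><a class="ext" href="http://www.shlomifish.org/">Shlomi's homepage</a></p>
HTML

my @links;
my $parser = HTML::Parser->new(
    api_version => 3,
    # The "attr" argspec passes a hash reference of the tag's attributes.
    start_h     => [
        sub {
            my ($attr) = @_;
            push @links, $attr->{href} if exists $attr->{href};
        },
        'attr',
    ],
    # Only report <a> tags to the handler.
    report_tags => ['a'],
);
$parser->parse($page);
$parser->eof;    # signal end of document

print "$_\n" for @links;
```

The anchor without an href is skipped, so only the two URLs should end up in @links.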

If you *insist* on using a regexp, and just for your edification, I think you are looking for a simple alternation:

(@link) = $page =~ m# href= (.*?)(?:>|\s)#g;
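
A small sketch to check the one-line form against a made-up string (the sample $page is hypothetical). Note that the group around the delimiter is written as non-capturing, (?:...), because in list context a /g match returns every capturing group, so a plain (>|\s) would put the delimiters into @link as well.

```perl
use strict;
use warnings;

# Hypothetical input: one href followed by a space, one followed by ">".
my $page = '<a href= one.html >first</a> <a href= two.html>second</a>';

# Capture lazily up to the first ">" or whitespace character.
# (?:...) groups the alternation without adding a second capture.
my @link = $page =~ m# href= (.*?)(?:>|\s)#g;

print join(",", @link), "\n";    # prints "one.html,two.html"
```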

Does that do what you wanted?

--edan

-----Original Message-----
From: perl-bounces at perl.org.il [mailto:perl-bounces at perl.org.il] On Behalf Of Ernst, Yehuda
Sent: Wednesday, October 10, 2007 14:17
To: Shlomi Fish
Cc: perl at perl.org.il
Subject: Re: [Israel.pm] finding text

I want to extract text from a string.

Instead of writing two lines:


(@link) = $page =~ m| href= (.*?) |g;
(@link) = $page =~ m| href= (.*?)>|g;


I want to know if I can do it in one line.




-----Original Message-----
From: Shlomi Fish [mailto:shlomif at iglu.org.il]
Sent: Wednesday, October 10, 2007 2:13 PM
To: Ernst, Yehuda
Cc: perl at perl.org.il
Subject: Re: [Israel.pm] finding text


On Tuesday 09 October 2007, Ernst, Yehuda wrote:
> I will check HTML::Parser.
>
> I want to match until the ">" sign and a space together.
>

I still don't understand. Can you give some examples to what you want to do?

Regards,

	Shlomi Fish

> -----Original Message-----
> From: Shlomi Fish [mailto:shlomif at iglu.org.il]
> Sent: Tuesday, October 09, 2007 1:04 PM
> To: perl at perl.org.il
> Cc: Ernst, Yehuda
> Subject: Re: [Israel.pm] finding text
>
>
> Hi!
>
> On Tuesday 09 October 2007, Ernst, Yehuda wrote:
> > Hello!
> >
> > I want to extract sub text from string
>
> Is "sub text" substring? Is it substrings?
>
> > I used (@link) = $page =~ m| href= (.*?) |g;
> >
> > This gives me every href that has a space before and after it.
>
> Well, it is generally a better idea to use an HTML parser to process HTML
> instead of regular expressions:
>
> {{{{{{{{{{{{
> <perlbot>       don't parse html with regular expressions! See
> HTML::Parser, and its subclasses: HTML::TokeParser,
> HTML::TokeParser::Simple, HTML::TreeBuilder, HTML::TableExtract, etc.
> See also http://htmlparsing.icenine.ca/.
> }}}}}}}}}}}}
>
> > what is i want the href with space before but i want all of them the have
> > space after but also > after?
>
> I don't understand this sentence. You can put a ">" by writing ">" inside
> the regexp.
>
> Regards,
>
> 	Shlomi Fish
>
> > Thanks
> >
> > Yehuda Ernst                        יהודה ארנסט
> > NDS Technologies Israel Ltd. mailto:yernst at nds.com
> > Jerusalem          Tel:  +972 (2) 589-4427
> > PO Box 23012    Fax: +972 (2) 589-4825
> > Israel.    91235    Cell  +972 54 5664427
> >
> > *************************************************************************
> > This e-mail is confidential, the
> > property of NDS Ltd and intended for the addressee only.  Any
> > dissemination, copying or distribution of this message or any attachments
> > by anyone other than the intended recipient is strictly prohibited.  If
> > you have received this message in error, please immediately notify the
> > postmaster at nds.com and destroy the original message.  Messages sent to
> > and from NDS may be monitored.  NDS cannot guarantee any message delivery
> > method is secure or error-free.  Information could be intercepted,
> > corrupted, lost, destroyed, arrive late or incomplete, or contain
> > viruses.  We do not accept responsibility for any errors or omissions in
> > this message and/or attachment that arise as a result of transmission. 
> > You should carry out your own virus checks before opening any attachment.
> >  Any views or opinions presented are solely those of the author and do
> > not necessarily represent those of NDS.
> >
> > NDS Limited Registered office: One Heathrow Boulevard, 286 Bath Road,
> > West Drayton, Middlesex, UB7 0DQ, United Kingdom. A company registered in
> > England and Wales  Registered no. 3080780   VAT no. GB 603 8808 40-00
> >
> > To protect the environment please do not print this e-mail unless
> > necessary.
> > *************************************************************************
> >
> > _______________________________________________
> > Perl mailing list
> > Perl at perl.org.il
> > http://perl.org.il/mailman/listinfo/perl



-- 

---------------------------------------------------------------------
Shlomi Fish      shlomif at iglu.org.il
Homepage:        http://www.shlomifish.org/

If it's not in my E-mail it doesn't happen. And if my E-mail is saying
one thing, and everything else says something else - E-mail will conquer.
    -- An Israeli Linuxer