[Perl] URL
Shlomo Yona
shlomo at cs.haifa.ac.il
Tue Aug 27 06:01:31 PDT 2002
Hello.
If you want to identify URLs correctly, the first thing you
will need to do is check the relevant RFCs:
RFC1738 - Uniform Resource Locators (URL)
ftp://ftp.isi.edu/in-notes/rfc1738.txt
RFC1808 - Relative Uniform Resource Locators
ftp://ftp.isi.edu/in-notes/rfc1808.txt
and perhaps
RFC2396 - Uniform Resource Identifiers (URI): Generic Syntax
ftp://ftp.isi.edu/in-notes/rfc2396.txt
Most probably, RFC1738 alone will do.
Section 5 of that RFC gives a BNF (Backus-Naur Form, also called Backus
Normal Form) definition of a URL string. BNF describes a grammar (a set
of rules, if you will).
The trick is to convert the BNF description into a regular expression.
There is a catch here: BNF is equivalent to context-free grammars,
which are more expressive and more powerful than regular expressions
(in other words, every language that can be recognized by a regular
expression can also be described by a context-free grammar, but the
opposite is not always true).
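To make that gap concrete: balanced parentheses are the textbook example of a language that a context-free grammar describes easily but that no regular expression (in the formal sense) can recognize, because matching requires counting arbitrarily deep nesting. The sub below is only an illustration; the name is_balanced is made up for this sketch and has nothing to do with the RFCs.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if every "(" has a matching ")", else 0.
# The counter $depth plays the role of the "memory" that a formal regular
# expression lacks.
sub is_balanced {
    my ($text) = @_;
    my $depth = 0;
    for my $char (split //, $text) {
        $depth++ if $char eq '(';
        $depth-- if $char eq ')';
        return 0 if $depth < 0;    # a ")" appeared before its "("
    }
    return $depth == 0 ? 1 : 0;
}
```

A grammar needs this kind of power only when a rule refers back to itself with material on both sides, so that is the thing to look for when reading the BNF.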
Now, assuming that the BNF description of a URL does not exceed the power
of regular languages, you should be able to convert the description into
a regular expression. Otherwise, you might only be able to get a regular
expression that accepts approximately the same language as the grammar
describing the URL in that RFC.
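As a rough sketch of what such a conversion can look like, here is the "httpurl" rule of RFC1738 translated rule by rule into Perl qr// fragments. The rules are transcribed as I read them in section 5 of the RFC, so check them against the text. The variable names simply copy the BNF rule names, and is_http_url is a name invented for this example. None of the rules used here refers back to itself, so for this scheme the translation is mechanical.

```perl
use strict;
use warnings;

# One variable per BNF rule, built bottom-up (assumed transcription
# of the RFC1738 "httpurl" rules).
my $alpha       = qr/[A-Za-z]/;
my $digit       = qr/[0-9]/;
my $alphadigit  = qr/[A-Za-z0-9]/;
my $hex         = qr/[0-9A-Fa-f]/;
my $safe        = qr/[\$\-_.+]/;
my $extra       = qr/[!*'(),]/;
my $escape      = qr/%$hex$hex/;
my $unreserved  = qr/(?:$alpha|$digit|$safe|$extra)/;
my $uchar       = qr/(?:$unreserved|$escape)/;
my $domainlabel = qr/(?:$alphadigit(?:(?:$alphadigit|-)*$alphadigit)?)/;
my $toplabel    = qr/(?:$alpha(?:(?:$alphadigit|-)*$alphadigit)?)/;
my $hostname    = qr/(?:(?:$domainlabel\.)*$toplabel)/;
my $hostnumber  = qr/(?:$digit+\.$digit+\.$digit+\.$digit+)/;
my $host        = qr/(?:$hostname|$hostnumber)/;
my $port        = qr/$digit+/;
my $hostport    = qr/(?:$host(?:[:]$port)?)/;
my $hsegment    = qr/(?:$uchar|[;:\@&=])*/;
my $search      = qr/(?:$uchar|[;:\@&=])*/;
my $hpath       = qr{(?:$hsegment(?:/$hsegment)*)};
my $httpurl     = qr{http://$hostport(?:/$hpath(?:\?$search)?)?};

# Hypothetical helper: 1 if the whole string is an httpurl, else 0.
sub is_http_url {
    my ($candidate) = @_;
    return $candidate =~ /\A$httpurl\z/ ? 1 : 0;
}

print is_http_url('http://www.example.com/path/to?x=1'), "\n";  # prints 1
print is_http_url('http://cs.haifa.ac.il/%7Eshlomo/'), "\n";    # prints 1
print is_http_url('http://cs.haifa.ac.il/~shlomo/'), "\n";      # prints 0
```

Note the last line: RFC1738 lists "~" among the unsafe characters, so a strict translation rejects URLs that people use every day unless the tilde is written as %7E. Whether to follow the RFC to the letter or to loosen the pattern is a decision you will have to make for your own application.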
Good luck.
--
Shlomo Yona
shlomo at cs.haifa.ac.il
http://cs.haifa.ac.il/~shlomo/