[Perl] URL

Shlomo Yona shlomo at cs.haifa.ac.il
Tue Aug 27 06:01:31 PDT 2002


If you want to identify URLs correctly, start by checking
the relevant RFCs:

RFC1738	-	Uniform Resource Locators (URL)

RFC1808	-	Relative Uniform Resource Locators

and perhaps
RFC2396	-	Uniform Resource Identifiers (URI): Generic Syntax

Most probably, you can do fine with RFC1738 alone.
That RFC contains a BNF (Backus-Naur Form) definition
of a URL string. A BNF definition describes a grammar (a set of rules, if you will).

The trick is to convert the BNF description into a regular expression.
There is a catch here: BNF is equivalent to context-free grammars,
which are more expressive and more powerful than regular expressions
(in other words, every language that can be recognized by a regular
expression can also be described by a context-free grammar, but the
opposite is not always true).
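
To make that gap concrete, here is an illustration that has nothing to do
with URLs: the language of balanced parentheses. It has a two-rule grammar,
S ::= "(" S ")" S | empty, but no classical regular expression matches it,
because a match has to keep track of arbitrarily deep nesting. A minimal
sketch of a recognizer in Python follows; the function name is made up for
this example.

```python
def is_balanced(text):
    """Recognize the language S ::= "(" S ")" S | empty.

    A counter stands in for the stack a context-free recognizer needs;
    a finite automaton (and so a classical regex) has no such counter.
    """
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                # A ")" appeared before its matching "(".
                return False
        else:
            # Only parentheses belong to this toy language.
            return False
    # Every "(" must have been closed.
    return depth == 0
```

If the URL grammar needed this kind of unbounded nesting, no exact regular
expression would exist for it.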

Now, assuming that the BNF description of a URL does not exceed the power
of regular languages, you should be able to convert the description into
a regular expression. Otherwise, you might only be able to get a regular
expression which accepts approximately the same language as the grammar
describing the URL in that RFC.
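
As a rough sketch of what such a conversion can look like, the Python code
below builds a regular expression for HTTP URLs only, one BNF rule at a time.
The rules in the comments are my reading of the httpurl production in
RFC1738 and should be checked against the RFC text before you rely on them;
the constant names and the is_http_url helper are made up for this example.
None of the rules used here is recursive, which is why plain substitution
works.

```python
import re

# Terminals: alpha, digit, alphadigit, hex
ALPHA = r"[A-Za-z]"
DIGIT = r"[0-9]"
ALPHADIGIT = r"[A-Za-z0-9]"
HEX = r"[0-9A-Fa-f]"

# escape = "%" hex hex
ESCAPE = rf"%{HEX}{HEX}"
# unreserved = alpha | digit | safe | extra
UNRESERVED = r"[A-Za-z0-9$\-_.+!*'(),]"
# uchar = unreserved | escape
UCHAR = rf"(?:{UNRESERVED}|{ESCAPE})"

# domainlabel = alphadigit | alphadigit *[ alphadigit | "-" ] alphadigit
DOMAINLABEL = rf"{ALPHADIGIT}(?:(?:{ALPHADIGIT}|-)*{ALPHADIGIT})?"
# toplabel = alpha | alpha *[ alphadigit | "-" ] alphadigit
TOPLABEL = rf"{ALPHA}(?:(?:{ALPHADIGIT}|-)*{ALPHADIGIT})?"
# hostname = *[ domainlabel "." ] toplabel
HOSTNAME = rf"(?:{DOMAINLABEL}\.)*{TOPLABEL}"
# hostnumber = digits "." digits "." digits "." digits
HOSTNUMBER = rf"{DIGIT}+\.{DIGIT}+\.{DIGIT}+\.{DIGIT}+"
# host = hostname | hostnumber ; hostport = host [ ":" port ]
HOST = rf"(?:{HOSTNAME}|{HOSTNUMBER})"
HOSTPORT = rf"{HOST}(?::{DIGIT}+)?"

# hsegment = *[ uchar | ";" | ":" | "@" | "&" | "=" ] ; search is the same
HSEGMENT = rf"(?:{UCHAR}|[;:@&=])*"
SEARCH = HSEGMENT
# hpath = hsegment *[ "/" hsegment ]
HPATH = rf"{HSEGMENT}(?:/{HSEGMENT})*"

# httpurl = "http://" hostport [ "/" hpath [ "?" search ]]
HTTPURL = re.compile(rf"http://{HOSTPORT}(?:/{HPATH}(?:\?{SEARCH})?)?")


def is_http_url(candidate):
    """Return True if the whole string matches the httpurl pattern."""
    return HTTPURL.fullmatch(candidate) is not None
```

For example, is_http_url("http://www.example.com:8080/a/b%20c?x=1") is true,
while is_http_url("http://example.com/a b") is false because a raw space is
not allowed. Keeping one constant per grammar rule makes it easy to compare
the pattern with the RFC line by line.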

Good luck.

Shlomo Yona
shlomo at cs.haifa.ac.il
