Code Answer: How do I write a regular expression for a URL without the scheme?

How can I write a regular expression that validates URLs only when they have no scheme?

Pass:

www.example.com
example.com

Fail:

http://www.example.com

From stackoverflow

My guess is

/^[\p{Alnum}-]+(\.[\p{Alnum}-]+)+$/

In more primitive RE syntax that would be

/^[0-9A-Za-z-]+(\.[0-9A-Za-z-]+)+$/

Or even more primitive still:

/^[0-9A-Za-z-][0-9A-Za-z-]*\.[0-9A-Za-z-][0-9A-Za-z-]*(\.[0-9A-Za-z-][0-9A-Za-z-]*)*$/
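
As a quick check, the second form can be tried against the example inputs from the question. This is only a minimal Python sketch: the names `HOST_RE` and `looks_like_host` are made up for illustration, and the explicit character class is used because Python's `re` module does not support `\p{Alnum}`.

```python
import re

# Explicit-character-class form of the pattern above: labels made of
# ASCII letters, digits, and hyphens, separated by at least one dot.
HOST_RE = re.compile(r'^[0-9A-Za-z-]+(\.[0-9A-Za-z-]+)+$')

def looks_like_host(text):
    """Return True if text is dot-separated labels with no scheme."""
    # fullmatch() also rejects a trailing newline, which "$" alone would allow.
    return HOST_RE.fullmatch(text) is not None

for candidate in ('www.example.com', 'example.com', 'http://www.example.com'):
    print(candidate, looks_like_host(candidate))
```

The two "Pass" inputs should be accepted and the "Fail" input rejected, because `:` and `/` are not in the character class.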

URL syntax is quite complex, so you need to narrow it down a bit. If matching anything.ext is enough, you can use:
```
^[a-zA-Z0-9.]+\.[a-zA-Z]{2,4}$
```
brian d foy : This doesn't work for the example input, which has more than one dot.
```
^[A-Za-z0-9][A-Za-z0-9.-]+(:\d+)?(/.*)?$
```
- string must start with an ASCII letter or number
- ASCII letters, numbers, dots and dashes follow (no slashes or colons allowed)
- optional: a port is allowed (":8080")
- optional: anything after a slash may follow (since you said "URL")
- then the end of the string
Thoughts:
- no line breaks allowed
- no validity or sanity checking
- no support for "internationalized domain names" (IDNs)
- leave off the "optional:" parts if you like, but be sure to include the final "$"
If your regex flavor supports it, you can shorten the above to:
```
^[A-Za-z\d][\w.-]+(:\d+)?(/.*)?$
```
Be aware that \w may include Unicode characters in some regex flavors. Also, \w includes the underscore, which is invalid in host names. An explicit approach like the first one would be safer.
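
To see how the explicit pattern and the shortened one differ in practice, here is a small Python sketch. The constant names and the sample strings are made up for illustration. In Python 3, `\w` in a `str` pattern matches Unicode word characters unless `re.ASCII` is given, so Python is one of the flavors the warning above applies to.

```python
import re

# Explicit pattern from above: ASCII host, optional port, optional path.
EXPLICIT_RE = re.compile(r'^[A-Za-z0-9][A-Za-z0-9.-]+(:\d+)?(/.*)?$')
# Shortened pattern: \w also admits "_" and, in Python 3, non-ASCII letters.
SHORT_RE = re.compile(r'^[A-Za-z\d][\w.-]+(:\d+)?(/.*)?$')

samples = [
    'www.example.com',
    'example.com:8080/index.html',
    'http://www.example.com',   # has a scheme, should be rejected
    'my_host.example.com',      # underscore is not valid in a host name
    'ex\u00e4mple.com',         # contains a non-ASCII letter (a-umlaut)
]

for s in samples:
    # Columns: input, explicit pattern result, shortened pattern result
    print(s, bool(EXPLICIT_RE.match(s)), bool(SHORT_RE.match(s)))
```

Both patterns reject the input with a scheme, but only the explicit one rejects the underscore and the non-ASCII letter.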

Axeman : Actually the RFC has a domainlabel as the possible first token, defined as alphanum. [ domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum ]. So it looks like your first bullet point is wrong.

Tomalak : To me that reads as "(alphanum) or (alphanum plus (any number of alphanum or '-') plus alphanum)". What am I missing here?

brian d foy : Your regex starts with alpha, excluding a legal num in the first position. That's what Axeman is talking about.

Tomalak : Oh, I see. My bad, I was too fixated on the hyphen to notice. :-) Corrected, thanks for pointing that out.

If you're trying to do this for some real code, find the URL parsing library for your language and use that. If you don't want to use it, look inside to see what it does.

The thing that you are calling "resource" is known as a "scheme". It's documented in RFC 1738 which says:

[2.1] ... In general, URLs are written as follows:
```
   <scheme>:<scheme-specific-part>
```
A URL contains the name of the scheme being used (<scheme>) followed by a colon and then a string (the <scheme-specific-part>) whose interpretation depends on the scheme.

And, later in the BNF,

scheme = 1*[ lowalpha | digit | "+" | "-" | "." ]

So, if a scheme is there, you can match it with:
```
/^[a-z0-9+.-]+:/i
```
If that matches, you have what the URL syntax considers a scheme and your validation fails. If you have strings with port numbers, like www.example.com:80, then things get messy. In practice, I haven't dealt with schemes with - or ., so you might add a real world fudge to get around that until you decide to use a proper library.
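
A rough Python sketch of that check follows. `SCHEME_RE` is the pattern above. `FUDGED_SCHEME_RE` is one hypothetical version of the "real world fudge": it simply drops `-` and `.` from the allowed scheme characters, so a dotted host name with a port is no longer mistaken for a scheme. The names are made up for illustration.

```python
import re

# Scheme pattern from the RFC 1738 BNF, matched case-insensitively.
SCHEME_RE = re.compile(r'^[a-z0-9+.-]+:', re.IGNORECASE)
# Hypothetical fudge: do not treat anything containing "-" or "." as a scheme.
FUDGED_SCHEME_RE = re.compile(r'^[a-z0-9+]+:', re.IGNORECASE)

def has_scheme(text, pattern=SCHEME_RE):
    """Return True if text starts with something that looks like a scheme."""
    return pattern.match(text) is not None

for s in ('http://www.example.com', 'www.example.com', 'www.example.com:80'):
    # Columns: input, strict RFC pattern result, fudged pattern result
    print(s, has_scheme(s), has_scheme(s, FUDGED_SCHEME_RE))
```

With the strict pattern, `www.example.com:80` is reported as having a scheme, which is the messy case described above. The fudged pattern avoids that, but it would still misread a dotless host with a port, such as `localhost:80`.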

Anything beyond that, like checking for existing and reachable domains and so on, is better left to a library that's already figured it all out.
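
As one example of the library route, Python's standard `urllib.parse.urlsplit` reports the scheme it finds, if any. This is a minimal sketch, not a full validator, and the helper name `library_has_scheme` is made up for illustration. Strings with a port but no scheme, such as `www.example.com:80`, may be parsed differently depending on the Python version, so they are left out here and are worth testing on the version you actually run.

```python
from urllib.parse import urlsplit

def library_has_scheme(url):
    """Return True if the standard library parser finds a scheme in url."""
    return bool(urlsplit(url).scheme)

for s in ('http://www.example.com', 'www.example.com', 'example.com'):
    print(s, library_has_scheme(s))
```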

Thanks guys, I think I have a Python and a PHP solution. Here they are:

Python Solution:

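import re

url = 'http://www.foo.com'
# Reject anything that starts with http:// or https://, then require a host,
# an optional port, and an optional path.
p = re.compile(r'^(?!https?://)[A-Za-z0-9][A-Za-z0-9.-]+(:\d+)?(/.*)?$')
m = p.search(url)
print(m)    # m is a match object if url is valid, otherwise None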

PHP Solution:

$url = 'http://www.foo.com';
// preg_match() returns 1 if $url is valid, 0 otherwise
$is_valid = preg_match('/^(?!https?:\/\/)[A-Za-z0-9][A-Za-z0-9.\-]+(:\d+)?(\/.*)?$/', $url);

brian d foy : Now what happens when you have https://?

Thierry Lam : The url will still be invalid, but if you insist, I can still handle it.

Sunday, April 3, 2011
