Checking a website for broken links is a simple way to make it look more professional and keep it manageable. This website is checked fairly frequently, and I have written some custom tools that catch problems a normal link checker can't (such as files that aren't linked at all and redirects that point at nothing). Hopefully that means you won't find anything broken on this website.

Under Windows there is a great link checker called Xenu, which is free but not open source. It gives a detailed and fairly easy to understand report of anything that is broken or missing. Under Linux the tools are nowhere near as good. The best I have found is LinkChecker, which is essentially a command line tool.

To run LinkChecker, pop open a command prompt and create a directory called linkchecker. Change into this directory and enter the command:

linkchecker http://www.example.com/

This will cause LinkChecker to check your website and every single outgoing link from it. This can be a huge check, so use it carefully. If you want to stop LinkChecker going through the whole site, use the -r option to stop it at a certain depth. If you are setting up a new site, a depth of 1 or 2 is useful.

By default LinkChecker dumps its output to the command line, which IMHO isn't that useful, but you can also dump it to a file with the -F flag. For instance, to have it drop the information into an HTML file use:

linkchecker -F html http://www.example.com/

LinkChecker, at least on my machine, reports an error just after it starts. It doesn't seem to impair its performance though, so I just ignore it. Note: this has been fixed in the latest version.

You probably want to let LinkChecker send and receive cookies (the -C option) for your site as well, especially if you have any dynamic content. If you don't, each request starts a new session which, if you are not careful, can bring a server down. Even a fairly small site such as this one has many hundreds if not thousands of links, and most servers won't be able to cope with the thousands of simultaneous sessions LinkChecker would create without the cookie option. Under normal load most websites will only have a few hundred sessions on the go at any one time. If you do have a site that has thousands of sessions going at the same time, there are ways of dealing with the increased memory load, the quickest of which is to just give the server and the VM (if you are using Java, which I am) more memory.

This gives you a nice command like this to run against your site:

linkchecker -r2 -C -F html http://www.example.com/index.html
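If you run this check regularly, it can be handy to wrap the command in a small shell function so that the depth is easy to change. The following is only a sketch: check_site is a made-up helper name, and the flags are simply the ones from the command above, so adjust them to whatever your version of LinkChecker accepts.

```shell
# check_site URL [DEPTH]
# Hypothetical helper (not part of LinkChecker) that wraps the command above.
# DEPTH is the recursion depth passed to -r and defaults to 2.
check_site() {
    url=$1
    depth=${2:-2}
    # -C enables cookies and -F html writes an HTML report, as described above.
    linkchecker "-r${depth}" -C -F html "$url"
}
```

For a quick look at a new site you might then run something like check_site http://www.example.com/index.html 1, and leave the depth off for the usual two levels.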

Other useful options include "--ignore-url=REGEX", which can be used to ignore certain URLs. This is particularly useful when you know that an external website will block access from LinkChecker (which the Google ads site does) and you don't want to see hundreds of errors. If you have more than one set of pages you want to ignore, you can use this argument multiple times. An example of its usage would be:

linkchecker -C -F html '--ignore-url=http://pagead2\.googlesyndication\.com/.*' \
http://www.example.com/index.html

It can also be useful to ignore URLs that use anchors or relative jumps in a page (e.g. URLs containing a #). Building on the previous example gives us this:

linkchecker -C -F html '--ignore-url=http://pagead2\.googlesyndication\.com/.*' \
'--ignore-url=.*#.*' http://www.example.com/index.html
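Before starting a long check it can be worth trying the patterns against a few URLs to see what they would catch. The sketch below uses grep -E as a stand-in for LinkChecker's own matching, which is an assumption: the two regex dialects are close enough for simple patterns like these but are not identical. The URL list and the would_ignore name are invented for illustration.

```shell
# Hypothetical sample of URLs a check might come across.
urls='http://www.example.com/index.html
http://www.example.com/about.html#contact
http://pagead2.googlesyndication.com/pagead/show_ads.js'

# Print the sample URLs that at least one of the ignore patterns matches.
would_ignore() {
    printf '%s\n' "$urls" |
        grep -E -e 'http://pagead2\.googlesyndication\.com/.*' -e '.*#.*'
}

would_ignore
```

With this sample the anchor URL and the ads URL are printed and the plain index.html URL is not, so it is the only one of the three that would still be checked.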

Or, more simply, you could use the --no-anchor-caching option, which treats links that differ only in their anchor as the same link. There is some risk in doing this, but it can substantially increase the speed of a large check.

linkchecker -C -F html '--ignore-url=http://pagead2\.googlesyndication\.com/.*' \
--no-anchor-caching http://www.example.com/index.html

To not check any external links use:

linkchecker -C -F html --ignore-url=\!http://www.example.com/.* \
--no-anchor-caching http://www.example.com/index.html

Note the \ before the !. It stops bash treating the ! as the start of a history expansion, so the ! reaches LinkChecker untouched.
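If you are unsure what the shell actually hands to LinkChecker, you can print the argument instead of running the check. This is a small sketch with a made-up show_args helper. It also shows single quotes as an alternative to the backslash, since bash does not perform history expansion inside single quotes.

```shell
# Print each argument on its own line, as a program would receive it.
show_args() {
    printf '%s\n' "$@"
}

# Backslash form: the shell removes the \ and passes a bare !
show_args --ignore-url=\!http://www.example.com/.*

# Single-quoted form: nothing inside the quotes is expanded.
show_args '--ignore-url=!http://www.example.com/.*'
```

Both calls should print the same line, --ignore-url=!http://www.example.com/.*, which is the argument LinkChecker would see.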
