linkcrawl.pl

Content is copyright © 2002 Matthew Astley

$Id: linkcrawl-pl.html,v 1.3 2002/09/01 20:08:11 mca1001 Exp $

This page exists only to document linkcrawl.pl. With luck you will never see the crawler itself, since it was written for internal use. If it does ever turn up in your logs, at least there will be one hit in the search engines to give you some idea of why it is there and where it came from.

User agent
Of the form linkcrawl.pl/1.11 [rude mode] libwww-perl/5.64
Crawling speed
Fast and rude, unless I've changed the inheritance to use LWP::RobotUA since I wrote this. It's not multithreaded; it tends to be CPU-bound on parsing the HTML anyway.
Depth/breadth first?
Configurable
robots.txt
It doesn't honour robots.txt. Several of the things I need it to crawl forbid robots, at least in production use, and I still need to crawl all over them.
Purpose
Multipurpose link checker and enumeration tool.
Neat features (IMHO)
Licence
Not distributed; proprietary by default. If it turns out to be useful I can ask for a copyright waiver and then GPL it.
How do I make it stop?
If it's thrashing your server, my suggestion would be to run whois or a reverse DNS query on the originating IP. If it's from a domain I administer, there should be a phone number in the TXT record attached to the relevant domain.
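If you are trying to pick the crawler out of your server logs, the user agent format given above can be matched with a short pattern. What follows is only a minimal sketch, written in Python purely for illustration (linkcrawl.pl itself is Perl). The pattern is inferred from the single example string on this page. The names `UA_PATTERN`, `LinkcrawlAgent` and `parse_linkcrawl_agent` are hypothetical, and treating the bracketed mode and the libwww-perl token as optional is an assumption.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the one example on this page:
#   linkcrawl.pl/1.11 [rude mode] libwww-perl/5.64
# Assumption: the "[...]" mode and the libwww-perl token may be absent.
UA_PATTERN = re.compile(
    r"^linkcrawl\.pl/(?P<version>\d+(?:\.\d+)*)"
    r"(?:\s+\[(?P<mode>[^\]]+)\])?"
    r"(?:\s+libwww-perl/(?P<lwp>\S+))?\s*$"
)


class LinkcrawlAgent(NamedTuple):
    """Fields pulled out of a linkcrawl.pl User-Agent string (hypothetical type)."""
    version: str
    mode: Optional[str]
    lwp_version: Optional[str]


def parse_linkcrawl_agent(user_agent: str) -> Optional[LinkcrawlAgent]:
    """Return the parsed fields, or None if the string is not a linkcrawl.pl agent."""
    match = UA_PATTERN.match(user_agent.strip())
    if match is None:
        return None
    return LinkcrawlAgent(
        version=match.group("version"),
        mode=match.group("mode"),
        lwp_version=match.group("lwp"),
    )


if __name__ == "__main__":
    print(parse_linkcrawl_agent("linkcrawl.pl/1.11 [rude mode] libwww-perl/5.64"))
```

Once a log line matches, the originating IP on that same line is what you would feed to whois or a reverse DNS query, as described above.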

So you get the picture. It's a very rude web crawler, and it shouldn't be allowed Out.