This page is just here to document the existence of
linkcrawl.pl. Hopefully you'll never see it; it was written
for internal use.
If you do ever see the thing, at least there will
be one hit in the search engines to give you some idea why, and
how to make it stop.
- User agent
  - Of the form linkcrawl.pl/1.11 [rude mode]
- Crawling speed
  - Fast and rude, unless I've changed the inheritance to use
    LWP::RobotUA since I wrote this. It's not multithreaded; it
    tends to be CPU-bound on parsing the HTML anyway. (Both modes
    are sketched after this list.)
- Depth/breadth first?
- Does it obey robots.txt?
  - It doesn't. Several of the things I need it to crawl forbid
    robots, at least in production use, and I still need to crawl
    all over them.
- Multipurpose link checker and enumeration tool.
- Neat features (IMHO); rough Perl sketches of several of these
  follow the list below.
  - Follows links from framesets (the original reason I needed to
    write it), indeed anything HTML::LinkExtor spits out.
  - Fetches and lists pages once each.
    One use is preparing a list of URLs to hammer with an HTTP
    performance tester.
  - Regexps define what is to be fetched.
  - Regexps are applied to links before they're followed, so I can
    undo the link munging I had to do earlier.
  - Link checking down to the bookmark/anchor level for HTML.
  - Separate compartments for images, Java classes, hrefs etc.
    I can hang various things off the different MIME types that come
    back. I plan to extend it to check for TODO lists, pages not
    covered by robot rules, GET locations that have unwanted side
    effects ... whatever is needed really.
- Not distributed; proprietary by default. If it turns out to be
  useful, I can ask for a copyright waiver and then GPL it.
- How do I make it stop?
  - If it's thrashing your server, my suggestion would be to run
    whois or a reverse DNS query on the originating IP. If it's
    from a domain I administer, there should be a phone number in
    the TXT record attached to the relevant domain.
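
By way of illustration, that lookup might be scripted with Net::DNS
roughly as follows; the IP address is a placeholder for whatever
shows up in your logs, and the way the host name is trimmed down to
a domain is a crude guess, not anything linkcrawl.pl does itself.

    #!/usr/bin/perl
    # Rough sketch only: reverse-resolve the crawler's IP, then read the
    # TXT record of the resulting domain, where the contact phone number
    # is said to live.
    use strict;
    use warnings;
    use Net::DNS;

    my $ip  = '192.0.2.1';            # placeholder: the address in your logs
    my $res = Net::DNS::Resolver->new;

    # Net::DNS turns an IP address into the matching in-addr.arpa PTR query.
    my $ptr = $res->query($ip);
    die "no PTR record for $ip\n" unless $ptr;
    my ($host) = map { $_->ptrdname } grep { $_->type eq 'PTR' } $ptr->answer;
    print "crawler host: $host\n";

    # Crudely strip the host part to get a domain to ask about.
    (my $domain = $host) =~ s/^[^.]+\.//;

    my $txt = $res->query($domain, 'TXT');
    if ($txt) {
        print "$domain TXT: ", $_->txtdata, "\n"
            for grep { $_->type eq 'TXT' } $txt->answer;
    } else {
        print "no TXT record for $domain\n";
    }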
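
For concreteness, the "User agent" and "Crawling speed" points above
might translate into something like the sketch below: a plain
LWP::UserAgent carrying the linkcrawl.pl identifier, next to the
politer LWP::RobotUA alternative mentioned there. The contact address
and the "[polite mode]" label are invented for the example.

    #!/usr/bin/perl
    # Sketch of the two modes: fast-and-rude versus robots.txt-respecting.
    use strict;
    use warnings;
    use LWP::UserAgent;
    use LWP::RobotUA;

    # Fast and rude: a plain user agent with the linkcrawl-style identifier.
    my $rude = LWP::UserAgent->new(agent => 'linkcrawl.pl/1.11 [rude mode]');

    # Polite variant: LWP::RobotUA honours robots.txt and rate-limits itself.
    my $polite = LWP::RobotUA->new(
        agent => 'linkcrawl.pl/1.11 [polite mode]',   # label invented here
        from  => 'webmaster@example.org',             # contact address, assumed
    );
    $polite->delay(1/60);    # delay is in minutes: at least a second per request

    my $resp = $rude->get('http://www.example.org/');
    print $resp->status_line, "\n";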
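
The frameset-following and fetch-each-page-once behaviour could look
roughly like this; it is a guess at the shape of the crawl loop, not
the real internals, and the starting URL is a placeholder. Printing
each URL as it is fetched also gives the once-each list mentioned
above.

    #!/usr/bin/perl
    # Sketch: follow everything HTML::LinkExtor reports (frame src included),
    # fetching and listing each page exactly once.
    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTML::LinkExtor;
    use URI;

    my $ua = LWP::UserAgent->new(agent => 'linkcrawl.pl/1.11 [rude mode]');
    my %seen;                                   # fetch and list each URL once
    my @queue = ('http://www.example.org/');    # placeholder starting point

    while (my $url = shift @queue) {
        next if $seen{$url}++;
        my $resp = $ua->get($url);
        print "$url ", $resp->status_line, "\n";
        next unless $resp->is_success
            and $resp->content_type eq 'text/html';

        # HTML::LinkExtor hands back [tag, attr => url, ...] tuples for
        # <a href>, <img src>, <frame src>, <applet code> and friends.
        my $extor = HTML::LinkExtor->new;
        $extor->parse($resp->decoded_content);
        $extor->eof;
        for my $link ($extor->links) {
            my ($tag, %attr) = @$link;
            # In real use a regexp would restrict what gets queued here.
            push @queue,
                map { URI->new_abs($_, $resp->base)->canonical->as_string }
                values %attr;
        }
    }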
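
The regexp compartments might amount to no more than this: one list
of patterns deciding whether a link is fetched at all, and a list of
substitutions run over each link before it is followed, which is
where the un-munging happens. The patterns and the staging-to-
production rewrite are invented for illustration.

    #!/usr/bin/perl
    # Sketch of regexp-driven fetching and link rewriting.
    use strict;
    use warnings;

    my @fetch_if = (
        qr{^http://www\.example\.org/},       # stay on the site under test
        qr{\.(?:html?|cgi)(?:[#?].*)?$}i,     # things worth parsing
    );
    my @rewrite = (
        # undo hypothetical munging: point staging links back at production
        sub { $_[0] =~ s{^http://staging\.example\.org/}{http://www.example.org/} },
    );

    sub normalise_link {
        my ($url) = @_;
        $_->($url) for @rewrite;       # each sub edits $url in place via @_
        return $url;
    }

    sub want_to_fetch {
        my ($url) = @_;
        for my $re (@fetch_if) {
            return 1 if $url =~ $re;
        }
        return 0;
    }

    my $link = normalise_link('http://staging.example.org/index.html');
    print "$link: ", want_to_fetch($link) ? "fetch" : "skip", "\n";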
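
Bookmark-level checking presumably reduces to collecting the anchors
each page defines and comparing them against the #fragments of
inbound links; a minimal version of the collection step, using
HTML::Parser on a made-up scrap of HTML, might look like this.

    #!/usr/bin/perl
    # Sketch: gather <a name=...> and id=... anchors so that link fragments
    # can be checked against them.
    use strict;
    use warnings;
    use HTML::Parser;

    sub collect_anchors {
        my ($html) = @_;
        my %anchor;
        my $p = HTML::Parser->new(
            api_version => 3,
            start_h     => [ sub {
                my ($tagname, $attr) = @_;
                $anchor{ $attr->{name} } = 1
                    if $tagname eq 'a' and defined $attr->{name};
                $anchor{ $attr->{id} } = 1 if defined $attr->{id};
            }, 'tagname, attr' ],
        );
        $p->parse($html);
        $p->eof;
        return %anchor;
    }

    my %anchor = collect_anchors(q{<h1 id="top">Demo</h1><a name="contact">mail</a>});
    for my $fragment (qw(top contact missing)) {
        printf "#%s %s\n", $fragment, $anchor{$fragment} ? "ok" : "BROKEN";
    }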
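
"Separate compartments" and hanging things off the MIME types suggest
a dispatch table keyed on Content-Type, along these lines; the
handler names and the MIME type used for Java classes are
assumptions, not linkcrawl.pl's own.

    #!/usr/bin/perl
    # Sketch: route each fetched response to a handler by its Content-Type.
    use strict;
    use warnings;

    my %handler = (
        'text/html'           => \&check_html,
        'image/gif'           => \&record_only,
        'image/jpeg'          => \&record_only,
        'application/java-vm' => \&record_only,   # .class files, type assumed
    );

    sub dispatch {
        my ($url, $type) = @_;
        my $h = $handler{$type} || \&complain;
        $h->($url);
    }

    sub check_html  { print "parse $_[0] for further links and anchors\n" }
    sub record_only { print "fetched $_[0], nothing further to do\n" }
    sub complain    { print "no handler for $_[0]\n" }

    dispatch('http://www.example.org/',         'text/html');
    dispatch('http://www.example.org/logo.gif', 'image/gif');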
So you get the picture. It's a very rude web crawler, and it
shouldn't be allowed Out.