NAME

talktome.pl - a web robot to fetch pages just for the sake of it


DESCRIPTION

The aim of this web robot is to imitate some of the behaviour of a human browser, so that the people who want to index everything I look at will have plenty of data to store about me.

It doesn't have to be very good at imitating me, because they aren't allowed to find out which page on a site I request... but at this point, things get more complicated.


QUICK QUESTIONS

These are just the questions that popped into my head while I was writing about the program. There will be more info somewhere on http://www.t8o.org/~mca1001/ .

How do I know who the program came from?

I've signed it with my GPG key.

You can run the signed copy; there's no need to extract it first. You just need to check that nobody stuck anything on the top (before the start of the signature) or put extra PGP data blocks in here somewhere.

The file should start

  #! /usr/bin/perl -Tw
  '
  -----BEGIN PGP SIGNED MESSAGE-----

It is quite possible that differences in line endings will cause the signature check to fail under Windows or MacOS, if you've downloaded the file as text. I haven't tested this.
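For what it's worth, gpg will verify a clearsigned script in place (``gpg --verify talktome.pl''). If you also want to check mechanically that nothing was bolted on around the signed block, a little throwaway script along these lines would do it - this is only a sketch, not part of talktome.pl:

  #!/usr/bin/perl
  use strict;
  use warnings;

  my $file = shift || 'talktome.pl';
  open my $fh, '<', $file or die "can't read $file: $!";
  my @lines = <$fh>;
  close $fh;

  # indices of the PGP markers; a clean clearsigned file has one of each
  my @begin = grep { $lines[$_] =~ /^-----BEGIN PGP SIGNED MESSAGE-----/ } 0 .. $#lines;
  my @sig   = grep { $lines[$_] =~ /^-----BEGIN PGP SIGNATURE-----/      } 0 .. $#lines;

  die "expected exactly one signed block\n" unless @begin == 1 && @sig == 1;
  die "unexpected text above the signature header\n" unless $begin[0] <= 2;
  print "structure looks sane; now run: gpg --verify $file\n";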

How can I trust this program you give me?

The original plan was to compile most of the code inside a Safe module compartment, which promises you that certain operations cannot be performed at runtime.

At the moment, this has to take a back seat to getting some sort of ``proof of concept'' program running.
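For the curious, the idea would look something like the sketch below, using the stock Safe module. This is illustrative only - it is not code from talktome.pl (which doesn't exist yet), and the opcode tags chosen here are just an example of the sort of restriction Safe can impose:

  use strict;
  use warnings;
  use Safe;                     # ships with Perl

  my $compartment = Safe->new;

  # Allow only basic computation; code inside the compartment cannot
  # open files, talk to the network or run external programs.
  $compartment->permit_only(qw(:base_core :base_mem :base_loop));

  my $untrusted = 'my $x = 2 + 2; $x * 10';
  my $result    = $compartment->reval($untrusted);
  print defined $result ? "got $result\n" : "blocked: $@";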

Anyway, despite the fact that there is no warranty (and there never will be, either), I will do my best to make sure this program doesn't do anything stupid to or with your computer. That's the best you can get at the moment - along with access to the source code, of course.

Isn't it wasteful?

Well, it depends on whether you think there's any value in what it does. Perhaps some comparisons will help clear things up?

The emacs command ``M-x spook'' adds a couple of lines of subversive junk text to a file, usually to the bottom of each email a person writes. This is also wasteful, e.g.

  AFSPC argus threat cryptographic Compsec global military AGT. AMME
  explosion CDMA Kosovo ASDIC anarchy ARPA CipherTAC-2000

It's just there to attract unwelcome attention.

Compare also streaming audio. The internet isn't (currently) designed as a broadcast medium.

Won't it upset the hit counters used by online advertisers?

They could exclude the hits on the grounds that they come from a self-confessed robot. They should be careful not to exclude real humans pretending to be robots, though.

How can you tell the difference between this robot and a human?

It's probably very easy, even for a computer, if you have access to the full captured data of a session.

However, as I understand it, the 2002 budget in the UK only allows for serious snooping on 1 in 10,000 users. For the rest, the data will just say how much data came and went, and when.

How can you tell the difference between this robot and a human who wants to look like a robot?

That depends on how careful the human is. The robot will be essentially random, within some set of boundaries.

If the human doesn't push the total behaviour of the system outside what is statistically plausible, it could be very hard to tell the difference, unless you have something that can infer purpose from traffic that is carefully staying mostly within one standard deviation of normal.
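To give a flavour of ``random within boundaries'': the robot's pauses between fetches might be drawn from a rough bell curve and then clipped to sensible limits. A sketch only, with all the numbers invented:

  use strict;
  use warnings;

  # approximate a normal deviate as the sum of twelve uniforms minus six
  sub next_delay {
      my ($mean, $sd, $min, $max) = @_;
      my $z = 0;
      $z += rand() for 1 .. 12;
      $z -= 6;
      my $delay = $mean + $z * $sd;
      $delay = $min if $delay < $min;
      $delay = $max if $delay > $max;
      return $delay;
  }

  # e.g. pause a few minutes between page fetches
  sleep int next_delay(300, 90, 30, 900);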

Don't you feel bad about helping the terrorists?

I'm not helping them. The terrorists are quite capable of doing whatever they like - remember, they're not bound by the law the way innocent citizens are. It makes them much harder to track.


DETAILS OF OPERATION

The current plan is to make a web crawler that obeys all the current rules on robot exclusion, and have it fetch a small quantity of data from a selection of web servers. It will identify itself, it will fetch pictures and other things embedded in HTML files, it will follow links and it will probably need to make a few queries to search engines too.

These are all the operations that a person does when surfing the net - except people don't obey robots.txt files and they don't usually surf 24/7!

These aren't details, I know. I haven't written the program yet.
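Still, to make the plan a little more concrete, a fetch loop built on libwww-perl might look roughly like the sketch below. LWP::RobotUA fetches and obeys robots.txt and paces its own requests; the robot name, contact address, seed URL and limits are all made up for the example:

  use strict;
  use warnings;
  use LWP::RobotUA;
  use HTML::LinkExtor;

  # identify ourselves honestly and let RobotUA handle robots.txt
  my $ua = LWP::RobotUA->new('talktome/0.0', 'mca1001@example.invalid');
  $ua->delay(10);                       # minutes between hits on one server

  my @queue = ('http://www.example.com/');      # seed page, invented

  while (my $url = shift @queue) {
      my $response = $ua->get($url);
      next unless $response->is_success
          && $response->content_type eq 'text/html';

      # collect links and embedded images, as a browser would request them
      my $extor = HTML::LinkExtor->new(undef, $url);
      $extor->parse($response->decoded_content);
      for my $link ($extor->links) {
          my ($tag, %attr) = @$link;
          push @queue, $attr{href} if $tag eq 'a'   && $attr{href};
          push @queue, $attr{src}  if $tag eq 'img' && $attr{src};
      }
      last if @queue > 100;             # keep the example finite
  }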


RELATED PROJECT

It is my aim to swap a small quantity of random and pointless data with as many computers as possible, on a regular basis.

In theory my 128k cable modem can shift about one gigabyte of data per day. Most home broadband ISPs will start getting upset when one uses more than about 2% (50:1) of the available bandwidth all the time, so perhaps I shall aim for one tenth of this limit. That still leaves plenty of bandwidth for my personal use - I'm not a heavy data muncher.

That's about 2.6 megabytes per day flowing to or from my machine, to and from some other places.

Divided between 1000 computers, this is under 3 kB each. It's not enough to give anyone cause to be grumpy, except the nosey folk who wish to record its passing and sit on the data for seven years.
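For anyone who wants to check those figures, the sums are spelled out below. The 128 kbit/s line rate and the 50:1 contention figure are the assumptions stated above; presumably the 2.6 MB comes from calling the daily total 1.3 GB before dividing, so the exact answer here lands a shade higher:

  use strict;
  use warnings;

  # rough daily traffic budget, in round numbers
  my $bytes_per_sec = 128_000 / 8;               # 128 kbit/s line
  my $per_day       = $bytes_per_sec * 86_400;   # ~1.4e9 bytes, call it a gigabyte
  my $isp_ceiling   = $per_day / 50;             # the 2% (50:1) politeness limit
  my $my_target     = $isp_ceiling / 10;         # one tenth of that: roughly 2.6-2.8 MB
  my $per_machine   = $my_target / 1000;         # shared over 1000 hosts: under 3 kB each

  printf "%.1f MB/day, %.1f kB per machine\n",
      $my_target / 1e6, $per_machine / 1e3;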

It is, however, enough to send a fairly substantial email. The data doesn't have to be ignored at the other end, if someone is expecting it.


AUTHOR

There is no code yet.

(c) 2002 Matthew Astley

 $Id: talktome.pl,v 1.12 2003/01/19 18:56:30 mca1001 Exp $