mca1001: bin/apache-logproc.pl

Revision: 1.14, Mon Jul 18 22:22:30 2005 UTC (5 years, 1 month ago) by mca1001
Branch: MAIN
CVS Tags: HEAD
Changes since 1.13: +3 -3 lines
fix double-agent complaints .. not sure what was going on

#! /usr/bin/perl -w

use strict;
use vars qw($VSN %dowarn);

$VSN = q$Id: apache-logproc.pl,v 1.14 2005/07/18 22:22:30 mca1001 Exp $;

=head1 NAME

  apache-logproc.pl

=head1 DESCRIPTION

It's a log file processor. This is a little dull, and very much an
"already baked" idea. The advantages are,

=over 4

=item - It's mine; I know how it works and I can tweak it easily

I concede this may not be an advantage for you.

=item - It groks my funky logfiles.

I've fiddled with the LogFormat directive. Several times. This should
pick up any valid log line I've used and maybe some others.

=item - It allows me to grow my worm detection

The algorithm is fairly simple. Since web server attacks tend to come
in indiscriminate clusters from one IP address at a time, when I
see a known attack (eg. C</scripts/..%255c../winnt/...blah>), it seems
fair to assume that anything else from that machine is also an attack.
If it wasn't recognised as an attack, it should be logged as a new
flavour of attack .. and so the regexps grow.
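
A minimal sketch of that bookkeeping, separate from the real code
further down. The hosts, the single pattern in C<$known_attack> and
the sub name C<classify> are made up for illustration; the script's
own patterns are rather longer.

```perl
use strict;
use warnings;

# Hypothetical, much-reduced stand-in for the script's attack regexps.
my $known_attack = qr{/winnt/system32/cmd\.exe}i;

my %attacker;   # host => number of recognised attacks seen so far

# Returns 'attack', 'new attack?' or 'ok' for one (host, location) pair.
sub classify {
    my ($host, $location) = @_;
    if ($location =~ $known_attack) {
        $attacker{$host}++;
        return 'attack';
    }
    # Anything else from a host already seen attacking is suspect,
    # and is a candidate for a new pattern.
    return 'new attack?' if $attacker{$host};
    return 'ok';
}

my @results = map { classify(@$_) } (
    [ 'worm.example.net',    '/scripts/..%255c../winnt/system32/cmd.exe?/c+dir' ],
    [ 'worm.example.net',    '/some/unrecognised/probe' ],
    [ 'browser.example.org', '/index.html' ],
);
print "$_\n" for @results;
```

The second request matches no pattern, but it comes from a host that
has already attacked, so it gets flagged as a candidate for a new
regexp.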

=item - It reads compressed formats with no messing

Provided the file is named, not piped to STDIN, it is extensible to
decompress pretty much anything. This is mostly thanks to Perl's
magic.
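
Roughly what the magic amounts to, as a sketch. The sub name
C<as_source> is invented here, and it assumes C<zcat> and C<bzcat> are
on the PATH and that the file names have already been checked for
shell-unfriendly characters.

```perl
use strict;
use warnings;

# Map a file name to something two-argument open (and so <>) understands:
# a trailing "|" means "run this command and read its output".
sub as_source {
    my ($name) = @_;
    return "zcat $name |"  if $name =~ /\.gz$/;
    return "bzcat $name |" if $name =~ /\.bz2$/;
    return $name;
}

my @sources = map { as_source($_) } qw(access.log access.log.1.gz access.log.2.bz2);
print "$_\n" for @sources;

# Assigning the result to @ARGV would let "while (<>)" read each source
# in turn, decompressing on the fly.
```

Adding another format should be a matter of one more line naming the
suffix and its decompressor.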

=item - Takes logfiles in numerical order

If you simply process access.log* then 10 will come out just after 1.
This isn't what you need, so there's a sort routine that gets it right
in cases where logrotate or similar named the files.
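
To see the difference, here is a cut-down sketch. The helper
C<lognum> and the file names are hypothetical; the real C<by_lognum>
does more checking than this.

```perl
use strict;
use warnings;

# Pull the numeric rotation suffix out of "access.log.10.gz" (10);
# names with no number (the live log) count as 0.
sub lognum {
    my ($name) = @_;
    return $name =~ /\.(\d+)(?:\.[^.]+)?$/ ? $1 : 0;
}

my @files  = qw(access.log.10.gz access.log access.log.2.gz access.log.1);
my @plain  = sort @files;                               # string order
my @sorted = sort { lognum($a) <=> lognum($b) } @files; # rotation order

print "plain:  @plain\n";
print "sorted: @sorted\n";
```

In string order access.log.10.gz lands before access.log.2.gz; the
numeric comparison puts it last, where it belongs.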

=item - Filters out my pageviews

There's no point telling me when I visited my own web server, and
truth be told I'm the most frequent visitor to the site. When I'm
debugging some CGI thing it can get a bit silly really.

The cron-scheduled fetches of various status URLs can also be
filtered out.

=item - Different classes of Strange Things Happening can be filtered
out separately

It's just a filter on warn().

=back
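
The filter on warn() mentioned in the last item can be sketched like
this. The C<%show> table and the messages are invented for the
example, and the warnings are collected in C<@shown> instead of going
to STDERR so the effect is easy to see.

```perl
use strict;
use warnings;

# Hypothetical on/off switches, keyed by the text before the first colon.
my %show = ( 'Attack' => 1, 'skip' => 0 );
my @shown;      # stands in for STDERR in this sketch

$SIG{__WARN__} = sub {
    my $msg = "@_";
    my $type = $msg =~ /^([^:]+?)\s*:/ ? $1 : '[unknown]';
    push @shown, $msg if $show{$type};
};

warn "Attack: worm.example.net does a rainbow\n";
warn "skip  : some line we do not care about\n";

print @shown;
```

Only the "Attack" message survives; switching a class on or off is one
edit to the table.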


=head1 LICENCE

As always, I GPL it:

    apache-logproc.pl extracts some useful data from web server logs
    Copyright (C) 2002  Matthew Astley

    This program is free software; you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation; either version 2 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program; if not, write to the Free Software
    Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA


    As an exception to the GPL for this program, you may extract the
    regular expressions which describe classes of HTTP attack and use
    them as you see fit.

As far as I'm concerned the regexps I've used to detect Code Red, Code Blue
and other HTTP attacks are public domain - I constructed them from
effectively public domain data (ie. what I've seen knocking on the
door over a period).

It is possible these may be more use to you than the rest of the
program, but they are not of sufficient value to force you to GPL your
code against your will .. so I'm not going to try.

The same goes for the algorithm to find new patterns. It's pretty
obvious. Go and improvise a new symphony on this theme.

(Yes, I'm allowed to do that with the GPL because I am the author of
the program!)


=head1 WISHLIST

=over 4

=item - detect concurrent bulk downloads

and list against known robots.

Can get concurrency from start times and request duration, throughput
from size and duration.

Should also detect when concurrent connections max out, ie. all
server threads busy. Probably requires knowledge of the httpd.conf, or
guesswork.
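
One possible way to get the concurrency figure, as a sketch only. It
assumes each request can be reduced to a start time in epoch seconds
plus a duration in seconds; the sample numbers are invented.

```perl
use strict;
use warnings;

# Hypothetical input: [start_epoch_seconds, duration_seconds] per request.
my @requests = ( [100, 30], [110, 5], [112, 20], [140, 10] );

# Turn each request into a +1 event at its start and a -1 at its end.
my @events;
for my $r (@requests) {
    my ($start, $duration) = @$r;
    push @events, [ $start, +1 ], [ $start + $duration, -1 ];
}

# Sweep through the events in time order; ends sort before starts at
# the same instant, so back-to-back requests do not count as overlapping.
@events = sort { $a->[0] <=> $b->[0] || $a->[1] <=> $b->[1] } @events;

my ($now, $peak) = (0, 0);
for my $e (@events) {
    $now += $e->[1];
    $peak = $now if $now > $peak;
}
print "peak concurrency: $peak\n";
```

Comparing the peak against the server's configured limit would then
show when every thread was busy.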

=item - list aborted downloads

=item - undo url encoding where safe

=item - try to avoid putting proxies on the "attacker" list

Fits nicely with "try to avoid putting local browsers on attacker
list"!

=item - detect fictitious or non-sequitur referrers (why would they do that?) and proxy probes

=item - list referers?

=item - list 404s (or errors in general), and their referers

=item - slurp error logs also, tie the two together by date? (eww, breaks on busy servers)

=item - slashdotting detector! Notice same referer(not search eng)/different host..

I'm not so vain that I imagine I'll ever get slashdotted, but it may
be of use to others to have a realtime alert of the onset.

=item - detect probes for open proxies?

They generally ask for some page not on your site. I presume they're
proxy probes, anyway. Shouldn't be too hard to check against the
vhosts list.

=item - tidy the regexps

They should be grouped in one big happy hash, with keys like
'useragent:netscapefoo' => qr{netscapepattern}.

The meanings of the patterns or capture brackets will vary depending
on the prefix. It will be slightly inefficient to trundle the hash for
each matching key. Hmm, maybe nest to depth one? This makes some
operations (eg. "find all the 'newfoo_\d+' entries that have been
added during processing") slightly harder.
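
A sketch of the nested-to-depth-one variant, to see how it might feel.
The class names, the patterns and the sub C<matches> are all
hypothetical, not the layout the script uses today.

```perl
use strict;
use warnings;

# Hypothetical layout: class => { name => pattern }, nested to depth one.
my %pattern = (
    useragent => {
        wget => qr{^Wget/([\d.]+)$},
        lynx => qr{^Lynx/},
    },
    attack => {
        passwd => qr{\.\./etc/passwd},
    },
);

# Names of every pattern in one class that matches the given string.
sub matches {
    my ($class, $string) = @_;
    my $set  = $pattern{$class} || {};
    my @hits = grep { $string =~ $set->{$_} } keys %$set;
    return sort @hits;
}

# Entries added during processing are found by a grep over one class only.
$pattern{useragent}{newua_0} = qr{^\QSomeBot/0.1\E$};
my @new = grep { /^newua_\d+$/ } keys %{ $pattern{useragent} };

print join(",", matches(useragent => 'Wget/1.8.1')), "\n";
print "@new\n";
```

With the nesting, only the patterns of one class are walked per lookup,
and finding the new entries is still a single grep.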

=item - Robot coverage checks

Take the output of a link crawler (damn I knew I'd find a use for that
prog), list all the links and their reported last-mod times, and then
per robot find out which ones are up-to-date.

Oh, could check that they bother to read and comply with robots.txt,
but that sounds like a lot of wasted effort. I mean, why do I care?

=item - Interactive UI

Dumping text is OK, but it's easier to read if the columns line up.
This requires most of the columns to be stupidly huge to avoid
truncating data, and still some doesn't fit.

CGI may be the way to go, but where would I keep the state? (Not
interested in modperl at the moment)

=item - robot/host mismatches

(miscellaneous nosiness) who's this chap claiming to be running
Googlebot/2.1? I mean, he could at least bother to fake the PTR
record!

Rules like these are hard to generalise though. Will probably just
have a (what's the collective noun for kludges?) fudge of kludges in a
sub somewhere.

=item - another lookup pass?

Sometimes the rev lookup fails at log-time. Could have another bite
but that's a bit wasteful on the ones that don't resolve, if you keep
frobbing the program.

A cached whois lookup might be interesting though.

=item - ptr/a mismatches

Spot the PTRs to things that don't exist/are wrong. There's a robot
that hangs off one of these.

=back

=cut


$| = 1;

@ARGV = reverse sort by_lognum @ARGV;

{
    # Should be a filter, not a bunch of warnings?
    my @bogus = grep /~$/, @ARGV;
    warn "Bogus names (@bogus)" if @bogus;
}

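foreach (@ARGV) {
    die "Aiee, weirdos in '$_'\n" if m![^-a-z0-9./]!i;
    # Turn compressed names into "command |" so that <> pipes them
    # through the decompressor (two-argument open magic).
    s/^(.*\.gz)$/zcat $1 |/;
    s/^(.*\.bz2)$/bzcat $1 |/;
}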

%dowarn = (#----------------------------------------' max size to detect
	   'Bogus line'	=> 1,
	   'skip'	=> 0,
	   'New attack?'=> 1,
	   'Next file'	=> 1,
	   'Odd'	=> 1,
	   'Attack'	=> 1,
	   'Double agent' => 1,
	   'New useragent' => 1,
	   '[unknown]'	=> 1,	# things not labelled
	   );

my $apache_date =
    qr{\[(\d{2})/(\w{3})/(\d{4}):(\d{2}):(\d{2}):(\d{2}) ([-+]\d{4})\]};
my @apache_date = qw(DD Mon YYYY hh mm ss tzone); # 7

# ODD: Yes I've seen (no useragent) ask "HEAD / HTTP\1.0". Caught as attack.
# ODD: Also seen "-" for request, presumably when there is none; status=408
# ODD: OPTIONS was attached to an MS WebDAV tool
# ODD: One attack script sent "\x05\x01", "\x05\x01\x02", "\x01", "\x1A", "GET GET HTTP/1.0" plus a raft, to go below.
my $apache_req =
    qr{"(?:([-\x00- ]*)|(GET|POST|HEAD|OPTIONS) ([^ \"]*)(| +HTTP[\\/][\d.]{3}))"};
my @apache_req = qw(invalidreq method location proto); # 4

# ODD: I've seen "foo " used as a username. It didn't work.
my $apache_user = qr{(\S+ {0,3})};

# ODD: Yes, I've seen "http://url1/, http://url2/, http://url3/" in referrer! (IE4.01/98)
my $apache_referer = qr{"(-|[^\"]+)"}; # 1
my $apache_agent   = qr{"([^\"]+)"}; # 1
my $apache_curlied = qr{\{([^{}]+)\}}; # 1

# Hostnames of computers seen to be worm infected, plus location regexp
my %attacker; # key = host, value = { type => count }

my $coderainbow = # regexp for location
    qr{ ^/default\.ida\?N{200,}%u9090 |
	^/(?:scripts|_vti_bin|_mem_bin|msadc)/(?:\.|%2e|%252e)+[%0-9a-zA-Z/.]+(?:\.|%2e|%252e)+/.*?/cmd\.exe\?/c\+dir(?:\+c:\\|) |
	^/(?:scripts|MSADC|[c-f]/win\w+/system\d*)/(?:root|cmd)\.exe\?/c\+dir(?:\+c:\\|) 
	}xi;	# beink careful not to leavink null case matchink	 ^

my $otherattacks = # regexp for location
    qr{	^GET$ (?# as in "GET GET HTTP/1.0") |
	^/invalidfilename.htm (?# stupid? ) |
	^/(?:\.\./|)invalidfilename.(?:html?|cgi) |
	^/(?:iisadmpwd|_vti_bin|scripts|msadc|cgi-dos)/?$ |
	^/cgi-bin/(?:perl\.exe|webgais|infosrch\.cgi|rguest\.exe|ezshopper2/loadpage\.cgi|dcboard\.cgi|nph-maillist\.pl|talkback\.cgi|ustorekeeper\.pl)$ |
	^/(?:mall_log_files|Admin_files)/$ |
	^///quote\.html?$ |
	\.\./etc/passwd |
	^NULL\.printer |
	%00
	}xi;

# TODO: These attack regexps should be organised more carefully,
# perhaps with some examples of each for reference.

# TODO: A ( class => qr/ foo /x ) structure might be more useful.
# TODO: Define something more specific than "_ok".

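# NB: under /x a bare '#' outside a character class starts a comment,
# so the literal '#' in the temp file check has to be escaped.
my $ok_attacks = # regexp for location
    qr{	/cgi-bin/$	(?# cgi-bin/ listing .. could be harmless) |
	/[^/]+[#~]$	(?# backup file check) |
	/\.?\#		(?# temp file check; .#foo #foo# ) |
	/\.ht
	}xi;	# less serious probes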



# Vhosts we know we provide. Regexp is constructed from these, but
# they could in principle update as we roll.

my %vhosts = (# log.name => nickname; nicknames mostly used for display

	      # TOY
	      'toy.ddts.net'		=> 'toy',
	      'toy'			=> 'toy?',
	      'www.t8o.org'		=> 't8o',
	      'www.t80.org'		=> 't80',
	      'www.leete.org.uk'	=> 'leete',
	      'www.house.internal'	=> 'www',
	      'toy.house.internal'	=> 'www',

	      # TROOP
	      'troop.granta.internal'	=> 'troop',
	      'www-mirror.granta.internal'	=> 'www-mirror',
	      'www-test.granta.internal'	=> 'www-test',
	      'www-develop.granta.internal'	=> 'www-develop',
	      );

# TODO: Do something similar to attack lists with useragents; current
# regexp set is a mess. It would help if I had some idea what I'm
# trying to do.
# TODO: Keep match string list & maybe count per UA
my %useragents;
{
    my $V	= qr{([\d.]{3,10})};
    my $URL	= qr{http://[a-z/.]+};
    my $WV	= qr{Windows ([A-Z \d.]{2,10}(?:; (?:[TQ]\d+|DigExt|Win 9x 4\.90))?)}; # NT 5.1 (; Q312461|; T312461|; DigExt)
    %useragents =
    (#shortname => regexp
     # Capture brackets' content is appended to key
     # Unless it's just ("1").
     "" => qr{^-$},
     "NS,Lin" => qr{^Mozilla/$V \[en\] \(X11; U; Linux ([\d.]{5,10}) \w+(?:; Nav)?\)$}, # 4.77, 2.4.14, i586
     # Mozilla/4.77 [en] (X11; U; Linux 2.2.20 i586; Nav)
     # Mozilla/3.0

     "NS,Mac"	=> qr{^Mozilla/$V \(Macintosh; U; PPC, Nav\)$}, # 4.08
     "Opera,Lin"	=> qr{^Opera/$V \(Linux ([\d.]{5,10}) \w+; U\)  \[en\]$}, # 5.0, 2.4.14, i586
     "Opera,Win"	=> qr{^Mozilla/4\.0 \(compatible; MSIE 5\.0; $WV\) Opera $V  \[en\]$}, # NT 4.0, 6.01

     # Mozilla/4.0 (compatible; MSIE 5.0; Windows 2000) Opera 6.01  [ja]
     Oregano	=> qr{^Mozilla/$V \[en\] \(Compatible; RISC OS $V; Oregano $V\)$}, # 4.72, 4.02, 1.10

     "Lynx"	=> qr{^Lynx/([.\w]{3,12}) libwww-FM/$V(?:FM)?$}, # 2.8.4rel.1 | 2.8.4pre.5, 2.14
     "Links"	=> qr{^Links \($V; Linux 2.2.20 i586\)$}, # 0.1
     "wget"	=> qr{^Wget/$V$}, # 1.8.1

     "IE"	=> qr{^Mozilla/4\.0 \(compatible; MSIE $V; $WV\)$},
# Mozilla/4.0 \(compatible; MSIE 5\.5\;\ Windows\ 98\;\ Win\ 9x\ 4\.90\)

     "IE.net"	=> qr{^Mozilla/4\.0 \(compatible; MSIE $V; $WV; \.NET\ CLR $V\)$}, # 6.0, NT 5.[01], 1.0.3705
     # Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705)

     "IE,mac"	=> qr{^Mozilla/4\.0 \(compatible; MSIE $V; Mac_PowerPC\)$},

     "Moz,Lin" => qr{^Mozilla/5\.0 \(X11; U; Linux i\d86; en-US; rv:([.\da-z]{3,10})\) Gecko/\d+$}, # i686, 0.9.6, (20011213|20011229)

     # Mozilla/5.0 Galeon/1.0.2 (X11; Linux i686; U;) Gecko/20011224
     # Mozilla/5.0 Galeon/1.0.2 (X11; Linux i686; U;) Gecko/20020102
     # Mozilla/5.0 Galeon/1.2.3 (X11; Linux i686; U;) Gecko/20020529 Debian/1.2.3-1
     # Mozilla/5.0 Galeon/1.2.1 (X11; NetBSD i386; U;) Gecko/0
     # Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0rc3) Gecko/20020531 Debian/1.0rc3-2
     Konqueror	=> qr{^Mozilla/5\.0\ \(compatible; Konqueror/$V; Linux\)$},

     ### Random non-browser
     Hashcash	=> qr{(?:^|, )HashCashCalcApplet/$V Java/(?:Sun Microsystems Inc\.|Netscape Communications Corporation)/([\d.]{3,7})$}, # NS prepends browser
     W3Cval	=> qr{^W3C_Validator/$V libwww-perl/$V$}, # 1.183, 5.64

     ### ROBOTS
     Googlebot	=> qr{^Googlebot/$V \(\+$URL\)$}, # 2.1
     'Fast.no'	=> qr{^FAST-WebCrawler/$V \(atw-crawler at fast dot no; $URL\)}, # 3.6
     Mirago	=> qr{^HenryTheMiragoRobot$},
     BaiDu	=> qr{^BaiDuSpider$},
     AskJeeves	=> qr{^Mozilla/$V \(compatible; Ask Jeeves\)$},
     cosmos	=> qr{^cosmos/$V\_\(robot\@xyleme\.com\)$},
     MSURL	=> qr{^Microsoft URL Control - $V$}, # 6.00.8169
     AVscoot	=> qr{^Scooter-(\S+)$}, # 3.2.EX
     bumblebee	=> qr{^bumblebee/$V \(bumblebee\@relevare\.com; $URL\)$},
     slysearch	=> qr{^SlySearch/$V \(?$URL\)?$},
     Mercator	=> qr{^Mercator-$V$}, # 2.0
     Inktomi	=> qr{^(?:Mozilla/\S+ \()?Slurp/(cat|si-emb);? \(?slurp\@inktomi\.com\; $URL\)$},
     Archiver	=> qr{^ia_archiver$},

     ### seen/later
     # Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; T312461; MSIECrawler)
     # Mozilla/4.0 compatible ZyBorg/1.0 (ZyBorg@WISEnutbot.com; http://www.WISEnutbot.com)
     # libwww-perl/5.53
     );
}

#
# End "config"
#
##############################################################################

$SIG{__WARN__} = sub {
    my $msg = "@_";
    my $type = $msg =~ /^([^:]{2,40}?)\s*:/ ? $1 : "[unknown]";
    if (!defined $dowarn{$type}) {
	warn "New warning type '$type', enabled\n";
	sleep 1; # icky, but you may never see it otherwise
	$dowarn{$type} = 1;
    }
    warn $msg if $dowarn{$type};
};

my $vhost = join "|", keys %vhosts;
$vhost =~ s/\./\\./g;
$vhost = qr{($vhost)};

my $lastfile = "";
while (<>) {
    if ($lastfile ne $ARGV) {
	$lastfile = $ARGV;
	warn "Next file: $ARGV\n";
    }

    my %l;

    if (m{^([.0-9a-zA-Z-_]+) (-) $apache_user $apache_date $apache_req (\d{3}) (-|\d+) $apache_referer $apache_agent(?: (\d+) $vhost)?$}) {
# cache-loh-ac08.proxy.aol.com - - [16/Jun/2002:12:54:55 +0100] "GET / HTTP/1.0" 200 3456 "http://www.google.com/search?hl=en&q=matthew+astley+kat" "Mozilla/4.0 (compatible; MSIE 5.5; AOL 7.0; Windows 98)" 0 toy.ddts.net
# lsanca1-ar5-114-036.lsanca1.dsl-verizon.net - - [16/Jun/2002:10:16:37 +0100] "GET /scripts/..%255c%255c../winnt/system32/cmd.exe?/c+dir" 404 - "-" "-" 0 toy.ddts.net
	@l{(
	    qw(host ident user),
	    @apache_date,
	    @apache_req,
	    qw(status size referer agent duration vhost),
	    )} = ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10,
		  $11, $12, $13, $14, $15, $16, $17, $18, $19, $20);
	$l{type} = 'full';

	# TODO: These should more likely be undef, but the later code
	# isn't great about checking these things
	@l{qw(type duration vhost)} = ("combined", "-", "-")
	    if !defined $l{duration} && !defined $l{vhost};

    } elsif (m{^\[([\d\.]{7,15})\] ([.0-9a-zA-Z-_]+) $apache_curlied  (-) $apache_user $apache_date $apache_req (\d{3})([-+X]) (-|\d+) $apache_referer $apache_agent (\d+) $vhost $apache_curlied$}) {
# [213.104.12.1] pc4-camb4-0-cust1.cam.cable.ntl.com {10.0.36.1}  - - [19/Jun/2002:12:34:38 +0100] "GET /cvshistory/c2h_websites_www.fruitcake.demon.co.uk_cgi-bin__rf.html HTTP/1.0" 200+ 6842 "http://www.t8o.org/cvspublish/" "Mozilla/4.77 [en] (X11; U; Linux 2.2.20 i586; Nav)" 0 www.t8o.org {1.0 toy.house.internal:3128 (Squid/2.4.STABLE6)}
	@l{(
	    qw(ip host proxiedfor  ident user),
	    @apache_date,
	    @apache_req,
	    qw(status statusx size referer agent duration vhost proxy),
	    )} = ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12,
		  $13, $14, $15, $16, $17, $18, $19, $20, $21, $22, $23, $24);
	$l{type} = 'funky';

## debug case for when I can't figure out where the irregularity is in
## my expression
#    } elsif (m{^([.0-9a-zA-Z-_]+) (-) $apache_user $apache_date $apache_req (\d{3}) (-|\d+) $apache_referer $apache_agent (\d+)}) {
#	warn "Partial $_";

    } elsif (m{\[17/Jun/2002:(00:5[456789]|01:0.|01:1[0123456])}) { # }} emacs) is confused;
	# While I was messing with funky format
	warn "skip      : $_";

    } elsif (m{^([.0-9a-zA-Z-_]+) (-) $apache_user $apache_date $apache_req (\d{3}) (-|\d+)$}) {
	@l{(
	    qw(host ident user),
	    @apache_date,
	    @apache_req,
	    qw(status size),
	    )} = ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10,
		  $11, $12, $13, $14, $15, $16);
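	# Common format has none of these; fill them in so later code
	# (including the printf) doesn't trip over undef
	@l{qw(vhost agent referer)} = ('-') x 3;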
	$l{type} = 'common';

    } else {
	#TODO: Consider sending this to attacker()
	warn "Bogus line: $_"; # with LF
    }
    next unless scalar keys %l;

    # TODO: Mess with the host info here

    # Check various attack cases
    my $attack = undef;
    if (!defined $l{location}) {
	# No location --> probably no request at all,
	# may be an attack script
	warn "Odd: no invalidreq on $_" unless defined $l{invalidreq};

	$attack = "unknown";		# Not seen before, but invalid
	if ($l{status} eq '408' && $l{invalidreq} eq '-') {
	    $attack = "timeout";	# Connect and wait, benign

	} elsif ($l{invalidreq} =~ /[\x00-\x1F]/) {
	    $attack = "binreq";		# Serious weird stuff
	}
    } elsif ($l{location} =~ $coderainbow) {
	$attack = "rainbow";		# Code Red, Code Blue, etc.
    } elsif ($l{location} =~ $otherattacks) {
	$attack = "nonrainbow";		# Nessus-alike?
    } elsif ($l{location} =~ $ok_attacks) {
	$attack = "nonrainbow_ok";	# Might just be finger trouble
    } elsif ($l{proto} =~ /\\/) {
	$attack = "backslashproto";	# Weird, maybe just broken
    } elsif ($l{method} !~ /^(GET|POST|HEAD)$/) {
	$attack = "options_ok";		# I don't take OPTIONS, UPLOADS etc.
	# TODO: Someone else might. Marking this as an attack is .. crusty
    }
    # Deal with attacks
    if (defined $attack) {
	attacker($attack, \%l, $_);
	next unless $attack =~ /_ok$/; # TODO: unkludge, with flavours or wotnot
	# Some attacks are just URL checks, might as well continue
    }

# TODO: Attack checks need running again after applying URL-decode
#	and again after that until nothing else is decoded.

    # Apparently a valid query. Quick check for previous record so we
    # can build extra rules for attacks.
    if ($attacker{$l{host}} && !defined $attack) {
	# Ooh, a new location (ie. it's hardly going to be a valid
	# browser, and the location regexp didn't pick it up)
	my $last_types = join ":", keys %{ $attacker{$l{host}} }, "";
	warn "New attack?: was $last_types\n >> $_  > $l{location}\n";
	# Hide the non-serious ("_ok") probe types, then skip the line
	# if any serious attack type is left
	$last_types =~ s/_ok:/;/g;
	next if $last_types =~ /:/;
	# This doesn't get ones we've already seen. Run the file
	# backward for that, or pre-init %attacker. 8-)
    }


    # Massage useragent field
    if (!defined $l{agent}) {
	warn "Odd: no useragent $_";
    } elsif (my @agents =
	     grep { $l{agent} =~ $useragents{$_} }
	     keys %useragents)
    {
	my $agent = join ":", @agents;
	if (@agents > 1) {
	    # could fill in the details, but it's obviously knack
	    warn "Double agent: '$l{agent}' --> '$agent'\n"
	} else {
	    my (@parts) = ($l{agent} =~ $useragents{$agent});
	    $agent = join ",", $agent, @parts
		if @parts > 1 || $parts[0] ne '1';
	}
	@l{qw(agent agent~)} = ($agent, $l{agent});
    } else {
	my $countnew = grep /^newua_\d+$/i, keys %useragents; # count new
	my $ua = "NewUA_$countnew";
	$useragents{$ua} = qr{^\Q$l{agent}\E$}; # exact match, numbered from zero
	warn "New useragent: '$l{agent}' --> $ua\n";
	@l{qw(agent agent~)} = ($ua, $l{agent});
    }

    # TOY
    # TODO: This should go by more than one field, and have configurable time ranges on the hostnames
    next if $l{host} =~ /^localhost$|\.house\.internal$|^(pc4-cmbg1-4-cust1|pc2-cmbg1-4-cust218|pc4-camb4-0-cust1)\.cam\.cable\.ntl\.com$/;
    next if $l{host} eq 'localhost' && $l{location} eq '/server-status?auto';

    # TROOP
    next if $l{agent} eq 'mon.d/http.monitor';
    next if $l{location} =~ m{cgi-bin/fom\?cmd=maintenance};
    next if $l{agent} =~ m{^linkcrawl.pl/};

    # Massage location - this is just ickiness for my temp log output
    # TODO: deal with "GET http://www.t80.org", when from t80.org or even elsewhere
    $l{"location~"} = $l{location};
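    # vhost is "-" for combined/common lines, which has no nickname
    $l{location} = (defined $vhosts{$l{vhost}} ? $vhosts{$l{vhost}} : "")
	. $l{location};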
    substr $l{host}, 0, length($l{host})-27, "..."
	if length($l{host}) > 29;

    printf("%s-%s.%s:%s:%s %-30s %10s%-30s %-30s %s\n",
	   @l{qw(Mon DD hh mm ss  host)},
	   ($l{status} eq '200' ? "" : "$l{status} ").
	   ($l{method} eq 'GET' ? "" : "$l{method} "),
	   @l{qw(location agent referer)},
	   );
}

END {
    warn +(join "\n  ", "New useragents seen:",
	   sort map { $useragents{$_} }
	   grep { /^newua_\d+$/i } keys %useragents),
	       "\n";
}

sub attacker # args = ("type", \%fieldshashref, "Raw log line\n")
{
    die "Args != 3" unless @_ == 3;
    my ($type, $l, $line) = @_;
    my $h = $l->{host};
    # Report the first attack of each type per host; rainbow attacks
    # are too common to be worth echoing the raw line for.
    warn( "Attack: $h does a $type\n",
	  $type eq 'rainbow' ? "" : (" >> $line".
	  (defined $$l{location} ? "  > $$l{location}\n" : ""))
	 )
	unless defined $attacker{$h}
	    && defined $attacker{$h}->{$type};
    # TODO: Marking self as attacker is dumb .. do this properly!
    if ($h ne '10.0.36.1' && $h ne 'toy.house.internal' &&
	$type !~ /_ok$/) {
	$attacker{$h} = {} unless $attacker{$h};
	$attacker{$h}->{$type} ++;
    }
    # TODO: Note new host here .. or ...
    # TODO: Separate repeat attacks from new ones instead of hiding repeats
    # TODO: Check the request returned 40x or 50x, otherwise we may be in trouble!
}


sub by_lognum {
    my ($al, $an, $ax) = $a =~ /(.*)\.(\d+)(\.[^.]+)?$/;
    my ($bl, $bn, $bx) = $b =~ /(.*)\.(\d+)(\.[^.]+)?$/;
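    ($al, $an, $ax) = ("", "", "") unless defined $al;
    ($bl, $bn, $bx) = ("", "", "") unless defined $bl;
    # The optional extension is undef when absent (eg. "access.log.1")
    $ax = "" unless defined $ax;
    $bx = "" unless defined $bx;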
    die "Argh, same root comparing $a <=> $b\n"
	if $al eq $bl && $an eq $bn && $ax ne $bx;
    my $numcmp = $an ne "" && $bn ne "" ? $an <=> $bn : 0;
    return( $al cmp $bl || $numcmp || $a  cmp $b );
}
__DATA__

