[mca1001] / bin / sort_revname.pl  

mca1001: bin/sort_revname.pl

File: [mca1001] / bin / sort_revname.pl (download) (as text)
Revision: 1.2, Wed Jul 14 22:47:59 2004 UTC (6 years, 1 month ago) by mca1001
Branch: MAIN
CVS Tags: HEAD
Changes since 1.1: +115 -11 lines
Deal with directories - now needs to stat everything; more docs,
including better explanation of sort order; -l param for newline
filename list (no longer accepts filename list source as param).

#! /usr/bin/perl -Tw


=head1 NAME

sort_revname.pl - sort list of filenames by least significant name

=head1 DESCRIPTION

Filter for use on list of filenames.  Puts files with similar path
names together, for the purpose of increasing the final compression
ratio.

Works on the output of commands such as GNU L<find(1)>:

  find dir1 dir2 dir3 -print0 | sort_revname.pl |

before piping in to a GNU L<tar(1)> such as

  tar cfj squished.tbz2 --null --no-recursion -T -

=head2 Sort Order

Directories come first.  This is why the C<--no-recursion> is needed.
They are left in the order we get them, on the assumption that C<find>
puts parents before children.  Thanks, Richard.

Non-files are next.  They are also left in the order we get them.
They take so little space it shouldn't really matter.

Regular files are sorted by comparing each piece of the path name,
starting at the right hand end.

The extension is currently regarded as a piece, so C<.jpg> will be
together and near to C<.jpeg> ... but far away from C<.JPEG> .  It
isn't obvious that this is always a useful thing to do...?

Various forms of backup filename are regarded as equivalent to the
original, so C<hello.txt> sorts just before C<hello.txt.~1.5~> and
C<hello.txt~> .  C<hello.txt2> will come after these, rather than
between the two backup files.

After that, the directory names are considered.  C<foo/a/hello.txt>
comes before C<bar/b/hello.txt>, on account of the directory names
C<a> and C<b>.

Finally, C</home/fred/cvswork/fooproj/source.c> will sort right next
to C</home/liz/cvswork/fooproj/source.c> , and this is the whole point
of the exercise.  The compression algorithm now has a more
compressible input.

=head1 CAVEATS

The script needs to C<stat> each filename you give it, so make sure
they are either absolute or relative to the directory from which the
script runs.

I haven't tested it much, I don't use or need it.  I just wrote it
because I felt like it, and I used to have a script that did something
similar.

The assumption is that files with similar names will have similar
contents.  This does not hold with (for example) the 2.6.7 Linux
kernel source.  If you have a better way of putting similar files next
to each other, do that instead.

For most uses of a tarball, you don't care what order the files are
stored.  If you're doing something unusual (e.g. storing multiple
copies of the same file) then you will.  You'd better know when you're
doing this!

Using this script will exacerbate the normal skew on your backups.
This could be a problem if your files change while you're reading the
directory - thanks Ian.

=head1 DEMO

(nb. the directories are missing from this one, but it's a good enough
demo anyway.)

I shall "back up" two mostly identical copies of the assorted junk
currently in my home CVS working copy, with and without the filter,
each with L<bzip2(1)> and L<gzip(1)> compression:

  find ~/cvswork-toy /tmp/cvswork-toy -type f -print0 | sort_revname.pl | tar cfj - --null -T - | wc -c

gets the best compression ratio.

			Bytes	Ratio
 No compression		9738240	   0%
 gzip (default: -6)	3747840	  61%
 bzip2			3461120	  64%
 gzip + filter		3153920	  68%
 bzip2 + filter		2304000	  76%

=head1 TWEAKS

It may pay to tweak the parts of the filename that are used, the
ordering and the parts of names that are ignored.

It might be worthwhile bringing the length of the files into this
somewhere.

If you have many files then you'll want the Schwartzian Transform to
speed up the sort, but I'm guessing that this will be negligible
compared to the bzipping time. (Erk, not now I'm C<stat>ting each file
lots of times!  OK, that's cached...  but the ST would be neater.)

Ideally it would be best to sort by the output of some sort of
similarity metric, if you can imagine one.  For identical files a
checksum will work nicely if slowly; but then for files only slightly
different it could be worse than just taking them off in directory
order.

=head1 AUTHOR

Question and other helpful comments arose at
L<http://cam.pm.org/archive/2004-July/001088.html>

(c) 2004 Matthew Astley  L<http://www.t8o.org/~mca1001/>

This program is Free software; you can redistribute it and/or modify
it under the same terms as Perl itself.

$Id: sort_revname.pl,v 1.2 2004/07/14 22:47:59 mca1001 Exp $

=cut


use strict;

# Line in/out separators...
if (@ARGV && $ARGV[0] eq '-l') {
    # Linefeed - less safe but more convenient for humans
    $/ = $, = $\ = "\n";
} else {
    # null-terminated file list
    $/ = "\0";
    $, = $\ = "\0";
}


# do it
my @file = <STDIN>;

chomp @file;

print sort fancy @file;


# Comparison for two filenames,
# starting at the leafname and working back up the tree
sub fancy {
    my @type = (filetype($a), filetype($b));

    if ($type[0] != $type[1]) {
	# Put directories first, then stuff, then regular files
	return $type[0] <=> $type[1];
    }

    if ($type[0] != 2) {
	# If they're not files, leave them in the same order
	return 0;
    }

    my @A = break($a);
    my @B = break($b);

    for (my $i=-1; -$i <= @A && -$i <= @B; --$i) {
	my $c = $A[$i] cmp $B[$i];
	return $c if $c;
    }

    # Files and their emacs backups will sort the same (for current
    # value of &break) so far, so plain sort to fix them
    return $a cmp $b;
}


# Arg = filename
# Returns 0 (directory), 1 (symlink, socket, pipe, device...), 2 (regular file)
my %filetype_cache;
sub filetype {
    my $f = shift;
    return $filetype_cache{$f} if exists $filetype_cache{$f};

    my $t;
    if (-l $f) {
	$t = 1;
    } elsif (-f _) {
	$t = 2;
    } elsif (-d _) {
	$t = 0;
    } elsif (! -e _) {
	warn "Can't stat $f, will sort to the top\n";
	# Might be safer to stat the lot up front, if anything isn't
	# found we assume we are in the wrong place and pass names
	# through untouched...?
	$t = 0;
    } else {
	$t = 1;
    }

    $filetype_cache{$f} = $t;
    return $t;
}


# Break a filename into bits,
# in a somewhat arbitrary and destructive way,
# for the purpose of sorting.
sub break {
    my $file = shift;
    $file =~ s{\.~[\d.]+~$}{}; # sort CVS versioned backups with current
    $file =~ s{/\.?#}{/}; # emacs autosave files and lock links
    $file =~ s{[~#+]+$}{}; # emacs backups, copy conventions I sometimes use
    return split m{[\./]}, $file;
}

Repository owner

Powered by ViewCVS 1.0-dev
(Powered by ViewCVS)

ViewCVS and CVS Help