Broken things, fixed

$Id: index.html,v 1.15 2006/04/16 02:12:53 mca1001 Exp $

This page is dedicated to those who bother to use a search engine to find answers to their questions.

Naturally caveats will apply, including but not limited to

these are problems I had with my hardware and software, so you will find a strong bias towards Debian GNU/Linux on Intel hardware, at least at the moment.
sometimes I forget the details or don't bother to write any of it down.
the information may be incomplete, misleading or incorrect.
I expect you to engage your brain rather than following my instructions blindly.
I don't always read all the docs because I'm an impatient sausage, and apparently this doesn't bite me often enough for me to change this habit. Lucky me!

Anyway hopefully this will be of some use to someone, even if only as a starting point for debugging.

...later... satisfied visitors so far: one, that I know of. Thanks for dropping me a line, Zak.

Monitor wobbles (shaking picture on multisync/multiscan computer monitor)

A problem at work with a wobbling picture bad enough to give our office manager a headache, circa 1998. Featuring various clues:

Nothing wrong with the monitor or computer, it was the location in the room that did it.
The speed of the wobble changed with the screen resolution and size.
The magnitude of the wobble changed with the time of day. A large but obscure clue was that boiling the kettle (on the other side of the office) made it significantly worse. When the five large storage heaters kicked in the picture was unusable.

Sometimes you can get a little wobble if you put a small power supply (PSU) under or near the monitor and it isn't magnetically shielded. Just move it away.

This was bigger, and there was nothing obvious in range. My diagnostic tool was a fairly coarse open solenoid wound on a ferrite rod, plus an oscilloscope. It's quite likely that a crystal earpiece or mic input to a tape recorder would have been adequate instead of a 'scope.

In this case I found the 50Hz (mains frequency) field emanating from the cable trunking, specifically the earth bond wire for an outside water pipe. This metal pipe came out of the ground inside the building and went out through the wall to a tap. It was earth bonded internally - this is normal practice for internal metalwork, although potentially dangerous to people outside the office in the event of an earth fault.

A current clamp (non-contact ammeter) registered up to 10 amps in this earth bond. This unbalanced current was causing the huge field and disrupting the picture on the CRT. The current was flowing because the metal pipe served as an excellent earth and return path to the substation, so the current that normally just takes the neutral return was sharing this route back home.
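
For a rough feel of why 10 amps in one wire matters, here is a back-of-envelope sketch. It assumes the earth bond behaves like a long straight wire with no return conductor running alongside it (the "unbalanced" case), and uses the textbook formula B = mu_0 * I / (2 * pi * r). The distances are made-up examples, not measurements from the office.

```python
import math

MU_0 = 4 * math.pi * 1e-7  # permeability of free space, in T*m/A


def field_from_long_wire(current_amps, distance_m):
    """Flux density in tesla at distance_m from a long straight wire.

    Assumes no nearby return conductor; a normal live/neutral pair would
    largely cancel, which is why the unbalanced earth current stood out.
    """
    if distance_m <= 0:
        raise ValueError("distance must be positive")
    return MU_0 * current_amps / (2 * math.pi * distance_m)


if __name__ == "__main__":
    # 10 A is the clamp reading from the text; the distances are hypothetical.
    for distance in (0.5, 1.0, 2.0):
        microtesla = field_from_long_wire(10, distance) * 1e6
        print(f"{distance} m: about {microtesla:.1f} microtesla")
```

The field only falls off as 1/r, so moving the monitor a little further from the trunking would not have helped much.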

Generally you can't just remove earth bonds. Talk to a qualified electrician. What we did was cut the water supply permanently (it was mostly used for washing cars IIRC), remove the external tap and cringe the pipe stub back into the concrete floor. Since no metal was exposed we could remove the earth bond, and then the picture was as steady as any other.

Mozilla 1.0 secure website (https) error boxes

There is a problem with your certificate database [Error Code: -8174]

Long running problem from about May 2002, fixed 2002-10-29. Other reports of the same problem,

This is caused by the setting in Edit > Preferences... > Privacy & Security > Validation > OCSP > "Use OCSP to validate only certificates that specify an OCSP service URL". Changing it back to "Do not use OCSP for certificate validation" makes the problem go away.

Of course, finding this out took quite a while because my method was to brainwash mozilla (move the configuration directory away and let it make another), then move things back until it broke again.

I haven't bothered to investigate what the problem is exactly. It's probably fixed in the next release anyway...

Mozilla 1.0 keeps going to find.com

This is another long standing problem (since early 2002), and something to do with the history sidebar. Certainly if I view a "folder" of history items such as "4 days ago" I get sent to find.com. It used to redirect to a holding page, but now apparently someone has bought it. Anyway, I didn't want to go there...

At some point the behaviour changed, so that clicking a link would sometimes take me to find.com. I grepped the .mozilla settings directory for 'find.com' and removed some apparently pointless entries (I forget which, sorry). Half of the problem solved, anyway.

Building linux-wlan-ng-modules-MYKERNEL and linux-wlan-ng on Debian Woody

I wanted to try the linux-wlan-ng under Woody because I don't really want to run the testing release on this laptop. Unfortunately I've come to this rather late in the day (2002-11-5).

Normally building a testing package from source under the current stable release doesn't seem to cause any problems, but this time it turned out to be more 'fun' than I was expecting. I blundered through, fetching bits I needed and trying to compile/install them, until I found a route that worked. It may not be a correct route, but I have installable packages which appear to contain the relevant files...

Get linux-wlan-ng source, version 0.1.15-3 as it turns out.
Find an old, archived bug report which draws my attention to the fact that I've failed to check the Build-Depends (in debian/control) before trying to build, and that I need debhelper-4.0.13. Woody is on 4.0.32, and testing is on some new thing that depends on vast swathes of the rest of testing.
Go to Joey Hess's viewcvs and grab the tarballs for debconf-1.1.1 and debhelper-4.0.13. Unpack, build in the usual way with debian/rules binary. There are additional dependencies on docbook docbook-dsssl jade tidy, but these are not troublesome to install.
Install new packages. Takes two passes of dpkg -i, because a file moved between packages I think.
Now I can build the package I want. Doing this without using make-kpkg to invoke the build scripts is also fun. Try something like,
```
cd linux-wlan-ng-0.1.15
fakeroot debian/rules binary
fakeroot debian/rules KSRC=~/compile/linux-2.4.19 KDREV=5 KVERS=2.4.19-relapse PSRC=/usr/src/modules/pcmcia-cs binary
```
The first command builds the utilities (hmm, I think this may be possible without the new debhelper .. but I needed the modules anyway). The second builds modules and requires the kernel source and other stuff to be present. The extra make variables are:
1. the configured and make depped kernel tree,
2. the revision of that kernel that I'm using (not sure why I need this),
3. the whole version name of the kernel. I roll my kernels with something like
```
time make-kpkg --rootcmd fakeroot --config menuconfig --revision 5 --append-to-version -relapse kernel_image
```
  so the modules live in /lib/modules/2.4.19-relapse.
4. an unpacked (and configured, but this isn't necessary IIRC) copy of pcmcia-source. Oh, I had symlinked ln -s /usr/src/modules/pcmcia-cs ../ from the build directory too, and I suspect this may make the PSRC variable unnecessary. Never mind.
I realised later that using apt-src might have made some of this easier. Never mind.
Downgrade debconf, debconf-utils and debhelper to the Woody versions because I'm superstitious, and I think that they might break something else I do in some hard-to-debug way...

Needless to say, that's the "to cut a long story short" version. I suspect I would have been scuppered without access to the old versions of the source. I could always have used the upstream (original version) tarball I suppose. 8-)
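
Checking the Build-Depends before starting would have saved some of the blundering. As a minimal sketch (not a replacement for the real Debian tools), the snippet below pulls the Build-Depends field out of the text of a debian/control file, including folded continuation lines. The sample control text is invented for illustration; only the debhelper version number comes from the story above.

```python
import re


def build_depends(control_text):
    """Return Build-Depends entries from the text of a debian/control file.

    Each entry is a (package, constraint) pair; constraint is None when no
    version is given. Alternatives ("a | b") are returned unparsed, as
    (item, None).
    """
    value = None
    for line in control_text.splitlines():
        if value is None:
            if line.lower().startswith("build-depends:"):
                value = line.split(":", 1)[1]
        elif line[:1] in (" ", "\t"):
            value += " " + line.strip()  # folded continuation line
        else:
            break  # next field (or blank line) ends the value
    if value is None:
        return []

    entries = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        match = re.match(r"^(.*?)\s*\(([^)]*)\)\s*$", item)
        if "|" not in item and match:
            entries.append((match.group(1), match.group(2).strip()))
        else:
            entries.append((item, None))
    return entries


if __name__ == "__main__":
    # Hypothetical control file, not the real linux-wlan-ng one.
    sample = (
        "Source: example-package\n"
        "Build-Depends: debhelper (>= 4.0.13), example-dev,\n"
        " another-tool\n"
        "Standards-Version: 3.5.8\n"
    )
    for package, constraint in build_depends(sample):
        print(package, constraint or "(any version)")
```

Comparing the listed versions against what is installed is left out on purpose; Debian version ordering has enough corner cases that dpkg should do that part.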

XFree86 <= v4.2.1 libX11 fails, thinking that the socket() call didn't work

(The version number may not be relevant at this point [2002-01-09], because future versions are likely to continue to have the problem, and if it gets fixed the fix may be backported quite easily.)

When trying to make a socket (tcp or unix domain) connection to an X11 server, you get an error like this,

_X11TransSocketOpen: socket() failed for tcp
_X11TransSocketOpenCOTSClient: Unable to open socket for tcp
_X11TransOpen: transport open failed for tcp/localhost:10

You faff about for a bit trying various things with Xauthority files and DISPLAY environment variables, then you give up and try strace. After various libraries are opened and presumably loaded,

JDK-1.3.1_06_SUN/jre/lib/i386/libawt.so
JDK-1.3.1_06_SUN/jre/lib/i386/libmlib_image.so
/usr/X11R6/lib/libXp.so.6
/usr/X11R6/lib/libXt.so.6
/usr/X11R6/lib/libXext.so.6
/usr/X11R6/lib/libXtst.so.6
/usr/X11R6/lib/libX11.so.6
/usr/X11R6/lib/libSM.so.6
/usr/X11R6/lib/libICE.so.6
JDK-1.3.1_06_SUN/jre/lib/i386/libfontmanager.so
JDK-1.3.1_06_SUN/jre/lib/i386/libawt.so

libX11 (as it turns out) tries to make a socket and connect it to the specified X server. The socket() call works normally, but then errors are printed,

socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 456
write(2, "_X11Trans", 9)                = 9
write(2, "SocketOpen: socket() failed for "..., 36) = 36
write(2, "_X11Trans", 9)                = 9
write(2, "SocketOpenCOTSClient: Unable to "..., 52) = 52
write(2, "_X11Trans", 9)                = 9
write(2, "Open: transport open failed for "..., 49) = 49

Turns out there's a limit to the number of file descriptors you can have open already, at the point you try to make a connection to your X server. This isn't on all systems, but apparently on some flavours of Linux at least [haven't decoded the #if conditions yet either].

The current work-around is to not have more than 255 file descriptors open at the time you try to connect to the X server, whether this is by tcp or unix domain socket.

You're probably only going to bump into this if you're running Java, I think. Only in Java would you need to open so very many files at once, and still need to connect to X afterwards (for example, to render a graph in your servlet).
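
The effect is easy to reproduce without Java or X. The sketch below holds a few hundred descriptors open and then asks for a socket; because the kernel hands out the lowest free number, the socket comes back numbered 300 or more. The cut-off of 256 and the two-step check are assumptions based on what the debugger showed, not code copied from libX11, and the function names are made up for the demonstration.

```python
import os
import resource
import socket

# Assumption: the cut-off comes from the ">= 256" check described in this
# section; it is not read from any X11 header here.
ASSUMED_FD_CUTOFF = 256


def allow_open_files(needed):
    """Raise the soft RLIMIT_NOFILE if it is below `needed` (and we may)."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY or soft >= needed:
        return
    new_soft = needed if hard == resource.RLIM_INFINITY else min(needed, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))


def socket_fd_with_files_open(n_files):
    """Hold n_files descriptors open, create a socket, return its fd number."""
    allow_open_files(n_files + 64)
    held = []
    try:
        for _ in range(n_files):
            held.append(os.open(os.devnull, os.O_RDONLY))
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            return sock.fileno()
        finally:
            sock.close()
    finally:
        for fd in held:
            os.close(fd)


def fails_assumed_check(fd):
    """Mimic the two tests seen in the debugger: fd < 0, then fd >= 256."""
    return fd < 0 or fd >= ASSUMED_FD_CUTOFF


if __name__ == "__main__":
    fd = socket_fd_with_files_open(300)
    print("socket() returned", fd, "-> rejected:", fails_assumed_check(fd))
```

Note that the socket() call itself succeeds, exactly as in the strace output; it is only the later range check that turns a perfectly good descriptor into an error.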

How I found the problem

Gave up throcking the various combinations of X server, webapp, Tomcat server and userid after a few hours.
Ran Tomcat under strace to find out what system calls were being made. This was complicated by the fact that strace seemed unable to follow the threads the JVM uses. Fortunately it looks like the top thread does the work.
Stared at the sequence
1. JVM: "Can I have a socket please?"
2. Linux: "Yes, here you go, it's number 456"
3. JVM: "Help! Help! I couldn't get a socket!"
in disbelief.
Failed to twig that the socket call was being made by libX11.so, of which I could easily have obtained a debugging version.
Fired up ddd (Data Display Debugger) on Tomcat. Note that this wasn't Java-level debugging, it was machine code level. Very tedious, and no debugging symbols.
This is from memory, and it's a summary of the method I eventually settled on. There were many dead ends, but I believe the significant steps were something like
1. Ran DEBUG_PROG=ddd java, which starts ddd and tells it what we're going to play with. Also, this sets up the various environment variables.
2. Rustled up a set of command arguments for Tomcat, something like -classpath bin/bootstrap.jar org.apache.catalina.startup.Bootstrap start should do the trick.
3. Fed these args into "Run command..." dialog.
  Maybe I could have done something more cunning with the catalina.sh debug command. Never mind.
4. In the GDB options, enabled "Shared library events"
5. In the "Signals..." dialog, switched off 'Stop' for realtime signal 32, so the thing would just run. (Using "Continue without signal" in the machine code debugger caused the signal to be blocked, I think; in any event, the JVM stopped.)
6. Hit 'Run' and opened the machine code window. After each shared library event, I tried to set a breakpoint on socket.
7. With a breakpoint set, I continued through the first few socket calls (you can get some idea how many there will be from the strace output). Once I could set a breakpoint on XOpenDisplay I knew that libX11 had just loaded and was about to be used.
  I could have then disabled the breakpoint on XOpenDisplay and waited for the socket call, but instead I single stepped through reams of junk that I didn't need to. Debugging is like that when you're not sure what you're doing.
8. We get to the socket call we're interested in. If you blink you can easily miss it - what you'll see (at least under i386 and ARM, probably other architectures too) is some shuffling around of pointers to argument lists, then a call to "int $0x80" or some SWI. When it comes back, there's a check for error condition and return.
9. The next stack frame up checks for < 0, then checks >= 256 (failed!) and burbles off to generate the error. I got the name of the function and left.
Ran grep across the shared libraries and the JVM - like I said, I hadn't realised this was in libX11 - to find out where the function lives.
Downloaded the source for libX11 (there's a lot of it) and found the function.

You can see the patch I suggested to help clarify the error. I don't want to get involved in messing with OPEN_MAX and its relative TRANS_OPEN_MAX because I don't understand why they're limited to 256 under GNU, when the limit is obviously higher. Maybe later...

For now, the work-around is just to ensure that AWT (in the case of Java) is initialised before loads of other things are opened. It's possible that this could be kludged by opening something pointless early on, and then closing it before requesting the X connection. This is likely to be a fragile solution though!
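
The "open something pointless early" kludge can be sketched outside Java. This is only an illustration of the idea under the assumption that descriptors are handed out lowest-free-first; the helper names are hypothetical, and a plain socket stands in for the X connection.

```python
import os
import resource
import socket


def reserve_low_fd():
    """Open something pointless early, purely to hold a low fd number."""
    return os.open(os.devnull, os.O_RDONLY)


def release_then_open(placeholder_fd, opener):
    """Close the placeholder and call opener() straight away; the kernel
    hands out the lowest free number, so the new descriptor lands low."""
    os.close(placeholder_fd)
    return opener()


def demo(n_other_files=300):
    """Return (placeholder number, socket number) with many files open."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    needed = n_other_files + 64
    if soft != resource.RLIM_INFINITY and soft < needed:
        limit = needed if hard == resource.RLIM_INFINITY else min(needed, hard)
        resource.setrlimit(resource.RLIMIT_NOFILE, (limit, hard))

    placeholder = reserve_low_fd()  # taken before anything else is opened
    others = []
    try:
        for _ in range(n_other_files):
            others.append(os.open(os.devnull, os.O_RDONLY))
        sock = release_then_open(
            placeholder,
            lambda: socket.socket(socket.AF_INET, socket.SOCK_STREAM),
        )
        try:
            return placeholder, sock.fileno()
        finally:
            sock.close()
    finally:
        for fd in others:
            os.close(fd)


if __name__ == "__main__":
    placeholder, sock_fd = demo()
    print("placeholder was", placeholder, "and the socket got", sock_fd)
```

The fragility is visible in release_then_open: anything else that opens a file between the close and the opener() call (another thread, say) will take the low number instead.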

How I should have found the problem

Well if I'd looked at X11 in any detail before, I would have realised that the actual opening of the connection is done by a library. I could have looked up which library in a standard X tutorial and saved myself about six hours.
If I had more experience with file descriptors at the low level, I would have seen the FAQs on the topic of OPEN_MAX; there are many. Then it would have been a small step to realise that when socket() returns a number of 256 or more, there may be trouble ... but think of all the fun I would have missed, learning about ddd!
I could have grepped for the symbol names mentioned in the error message (or fragment of error message itself), across the JVM and shared libraries again. Gotchas here include
- Error messages are frequently composite, so you have to search for small fragments or risk missing the thing you're looking for.
- Clever me forgot that the X libraries aren't in /usr/lib, they're in /usr/X11R6/lib. Yes, even after ignoring the filenames mentioned in the strace output. Doh!
But still, I enjoyed reading up on i386 assembler, and I especially enjoyed being grateful I don't have to write in it. ARM assembler is much more human-friendly.

The main lesson to be learned is that running grep is a quick and easy way to get the computer to do your work for you. No idea where the error came from? Search the whole filesystem for a unique-looking string and see what comes back. Bonus points for searching a small enough subset of the disk that it mostly fits in RAM - performance on the next grep pass with different arguments will be greatly improved.
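
For anyone who wants the same trick without grep, here is a minimal sketch that walks a directory tree and reports files whose raw bytes contain a fragment of the error message. The function names are made up; the directory and the fragment in the usage example are the ones from this story.

```python
import os


def file_contains(path, needle, chunk_size=1 << 20):
    """True if the raw bytes of the file at path contain needle."""
    overlap = len(needle) - 1
    tail = b""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                return False
            window = tail + chunk
            if needle in window:
                return True
            # keep just enough to catch a match straddling two chunks
            tail = window[-overlap:] if overlap > 0 else b""


def files_containing(root, fragment):
    """Return sorted paths of regular files under root containing fragment."""
    needle = fragment.encode("ascii")
    if not needle:
        raise ValueError("fragment must not be empty")
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # skip symlinks so each library is reported once, by its real name
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                if file_contains(path, needle):
                    hits.append(path)
            except OSError:
                continue  # unreadable file: skip it, as grep -s would
    return sorted(hits)


if __name__ == "__main__":
    # A short fragment, because the full message is assembled from pieces.
    for hit in files_containing("/usr/X11R6/lib", "transport open failed"):
        print(hit)
```

Searching for a small fragment rather than the whole line matters here: the strace output shows the message being written in several separate pieces.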

Hopefully this will serve someone as a crash course in how to cheat when debugging things, and serve others as an example of why error messages should always be verbose. Especially when something unusual or unpredictable might have happened. 8-)

Epson Perfection 1260 Photo (USB ID 0x04B8 0x011D)

It seems that support for this scanner had only recently been added [2003-01], and since Dad's machine normally runs stable (Woody), several upgrades were needed. The procedure for choosing and installing the scanner went something like this:

Look at the list of USB scanners supported by Linux
Pick the Epson 1260 because it is supported, and note that it uses the Plustek driver
Order, wait, forget. (It was a Christmas present for Dad, hence the delays and the mixed use of I/we).
Scanner arrives but I'm not there to install the software. Wait some more.
Old kernel with an old scanner.o module doesn't recognise the thing, so modprobe parameters were required. It's fixed at about 2.4.18 I think.
Configure SANE's Epson driver to use the device, because it's an Epson, right? (Not exactly, no).
```
scanner.c: read_scanner(0): funky result:-75. Consult Documentation/usb/scanner.txt.
```
Lots of those in the syslog/console/dmesg, when the driver tries to talk to the scanner.

RTFM again, slap self and switch over to using the Plustek driver. The Epson driver doesn't like the 1260 because it uses a different chipset. I had forgotten about this.

After changing the configuration, we get a "Segmentation Fault". Some relevant-looking fragments of the diagnostics were

export SANE_DEBUG_DLL=12
export SANE_DEBUG_PLUSTEK=12
 ...
[sanei_debug] Setting debug level of dll to 12.
[dll] sane_init: SANE dll backend version 1.0.8 from sane-backends 1.0.11
 ...
[plustek] Plustek backend V0.45-4, part of sane-backends 1.0.11
 ...
[plustek] drvopen()
[plustek] usbDev_open(auto,0x04B8-0x011D)
[plustek] Found device at >/dev/usb/scanner0<
[plustek] Vendor ID=0x04B8, Product ID=0x011D
[plustek] usbio_DetectLM983x

This appears to have been caused by the earlier use of the Epson driver; after a reboot we can't reproduce the segfault.

Scanner now works. The next problem after this was the gnome-panel falling over because it ran out of disk space and trashed all its settings. Oops.

This was a while ago now though, so I apologise for the slightly rough program output and lack of detail.

So, Matthew gets his come-uppance for not reading the docs properly, and then more trouble for failing to explain that scanners eat disk space very quickly.

Gnome Panel, segmentation fault on startup

Just a quickie, since it is probably dealt with elsewhere. This applies to gnome-panel v1.4.0.2 or perhaps v1.4.0.6, which was current in Debian Woody at the time [2003-02].

Disk partition is filled shortly after the scanner starts working. 8-}
gnome-panel tries to save its updated settings to disk, and fails. Instead it truncates each file it touches.
Next time it starts up, the panel dies with a "Segmentation fault" and pops up the gnome_segv program. The X session is unusable (solution: select an alternative "session" at the GDM login dialog).
Upgraded to gnome-panel v1.4.2
It copes with the corrupted files by resetting itself and losing all the settings.
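
The truncation happens when a program overwrites its settings file in place and the write fails part way. A common way to avoid it, sketched below, is to write a temporary file in the same directory and rename it over the original only once the write has succeeded. This is a general technique and an assumption about what would have helped; it is not a description of how any version of gnome-panel saves its settings.

```python
import os
import tempfile


def save_atomically(path, data):
    """Replace the file at path with data (bytes), or leave it untouched.

    The new content goes to a temporary file in the same directory first;
    if writing fails (disk full, say) the original is never opened for
    writing, so it cannot end up truncated.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the bytes reached the disk
        os.replace(tmp_path, path)  # atomic on POSIX within one filesystem
    except BaseException:
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

The temporary file has to live on the same partition as the target, otherwise the rename stops being atomic.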

A kernel upgrade makes the USB mouse stop working under X

This happened to me when I upgraded from 2.2.20 to 2.4.21-pre7 [2003-04]. /dev/input/mice worked fine until I upgraded. When I ran menuconfig for 2.4.21-pre7, there seemed to be a lot of options that needed changing, so I merrily went and changed them.

I'm not sure when in kernel history the CONFIG_USB_HIDINPUT option appeared under CONFIG_USB_HID (hid.c) in "USB support", but if you don't enable it then the mouse events don't get passed down the chain to input.o (input.c) and mousedev.o (mousedev.c).

Of course this is obvious when you read the option carefully. On the other hand if you're not sure what you changed when you upgraded stuff, it could be tricky to find.
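
One way to make this less tricky is to check the kernel's .config for the options you care about before rebuilding. A minimal sketch, assuming the usual .config line formats ("CONFIG_X=y" and "# CONFIG_X is not set"); the sample text is invented, and only the two option names come from the paragraph above.

```python
import re

_SET = re.compile(r"^(CONFIG_[A-Za-z0-9_]+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_[A-Za-z0-9_]+) is not set$")


def parse_kernel_config(text):
    """Map option name to its value; options marked "is not set" map to "n"."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        match = _SET.match(line)
        if match:
            options[match.group(1)] = match.group(2)
            continue
        match = _UNSET.match(line)
        if match:
            options[match.group(1)] = "n"
    return options


def missing_options(text, required):
    """Return the required options that are unset or absent from the config."""
    options = parse_kernel_config(text)
    return [name for name in required if options.get(name, "n") == "n"]


if __name__ == "__main__":
    # Hypothetical fragment of a .config, for illustration only.
    sample = "CONFIG_USB_HID=m\n# CONFIG_USB_HIDINPUT is not set\n"
    wanted = ["CONFIG_USB_HID", "CONFIG_USB_HIDINPUT"]
    print("not enabled:", missing_options(sample, wanted))
```

Running the same check against the old and new .config files is a quick way to see what changed during the upgrade.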

Things to note when making sound recordings with a Vaio Picturebook C1XD

I recently [2004-02] recorded a public talk for a friend who couldn't attend. The resulting sound quality is probably adequate, but it could have been much better.

Many of these hints will apply in other situations, but I would imagine that larger laptops won't suffer quite so badly in the "noise" department. It seems likely that electrical screening between the laptop's sound systems and data busses was kept to a minimum to avoid wasting space inside the machine.

Microphone and set-up

I used a fairly cheap desktop microphone from Maplin, connected to my laptop in the usual way. It seems to be a lot less directional than the packet would have you believe, but I did forget that people's voices are quite directional!

Putting the laptop back in the rucksack was probably a good idea. The hard disc is very noisy, and of course it must run for the duration.

Putting the microphone on the table behind the speaker was a mistake. On a chair in the front row of the audience might have been better, or perhaps on the floor at the front, although either of these positions would probably pick up a lot of noise from the audience, the wooden floor and the chairs moving about.

I did a quick test of sound quality before the talk began, but I used the tiny built in speakers for this test. It was enough to check that I had a signal, but not enough to check the quality. Next time I will take my headphones.

Because the mic signal was fairly weak, the electrical noise inside the laptop swamped the signal. The noise consists of crackly white noise and the sound of the CPU chopping on and off.

Post-processing

Having taken a two hour sound sample I needed to cut it up into tracks according to the logical flow of the talk and edit out a few very loud noises and pauses. I found the Sweep sound editor more than adequate for this, but I was glad I had recently thrown a gigabyte of RAM into this machine!

After doing the slice and dice with the main sample, I chopped out lots of snatches of "silence" (i.e. just machine noise) and stuck them together. Then I filtered the lot through dnoise (part of csound) - it's magical! The noise just vanishes! It does leave some artifacts, but you can tweak around to reduce those. Here are my Makefile rules:

%.cdr: %.wav
	sox $< $@

%.wav: %.aiff nz.wav
	csound -U dnoise -S2 -m-40 -n5 -N 4096 -t1 -W -i nz.wav -o $@ $<

(yeah, watch the tabs). I don't remember why I saved AIFF format from Sweep, but it was convenient enough to use wav as an intermediate for generating MP3s and CD-R ready files.

Jewel case inserts (CD labels)

I tried an assortment of CD labelling programs from the Debian distribution and found disc-cover was my favourite.

Reading Windows Unicode text files with Emacs

This is probably covered somewhere else, but just in case...

I think these files are generated by Windows NT onwards when you put accented characters in a Notepad file and save it. The simple text part may look right if you cat it to a unix terminal, but if you view it with less it will probably show something like

<FF><FE> ^@S^@t^@a^@r^@t^@ ^@o^@f^@ ^@R^@e^@p^@o^@r^@t^@ ^@^M^@

The file(1) program reports them as

Little-endian UTF-16 Unicode English character data, with CRLF line terminators

You can load them into emacs with the sequence C-x <RET> c utf-16-le-dos <RET> C-x C-f filename <RET>, which means to do the find-file command in this Windows friendly Unicode mode.

Hmm, there's probably a way to use this to save out a utf-16-le file, but I'm getting the error

write-region: Symbol's function definition is void: utf-16-le-pre-write-conversion

Never mind.
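
Outside Emacs, the same conversion is only a few lines. The sketch below decodes bytes that begin with a UTF-16 byte-order mark (the <FF><FE> seen in less) and turns CRLF into plain newlines, and goes the other way for saving. The function names are made up for the example; it is not a description of what Emacs does internally.

```python
import codecs


def decode_windows_unicode(raw):
    """Decode bytes that start with a UTF-16 byte-order mark; CRLF -> LF."""
    if raw.startswith(codecs.BOM_UTF16_LE):
        text = raw[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    elif raw.startswith(codecs.BOM_UTF16_BE):
        text = raw[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    else:
        raise ValueError("no UTF-16 byte-order mark found")
    return text.replace("\r\n", "\n")


def encode_windows_unicode(text):
    """Encode text as BOM + little-endian UTF-16 with CRLF line endings."""
    body = text.replace("\r\n", "\n").replace("\n", "\r\n")
    return codecs.BOM_UTF16_LE + body.encode("utf-16-le")


if __name__ == "__main__":
    # Bytes built by hand to resemble the start of such a file.
    raw = b"\xff\xfeS\x00t\x00a\x00r\x00t\x00\r\x00\n\x00"
    print(repr(decode_windows_unicode(raw)))
```

Every ASCII character is followed by a zero byte in this encoding, which is where the ^@ characters in the less output come from.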

Update of video=neofb:picturebook patch for 2.6.16.5

[2006-04-16] This is just a trivial update of somebody else's patch which I found via this blog. It didn't apply cleanly to 2.6.16.5 so I did the merge and diffed again. See also /etc/fb.modes info.

neofb.2.6.16.6.patch