SOLVED -- obscure bug
shapj@us.ibm.com
Thu, 16 Sep 1999 17:16:52 -0400
I finally found the bug from the previous note. This was pretty obscure.
[Jay: I'm copying you both to let you know why the revision has been delayed and
also on the chance that you may want to forward this note to the OsKit core
developer list. The same bug may bite them someday, and having seen it before
may let them recognize it. Also, it may provide a useful anecdote about the
hazards of hasty implementations.]
Here's what was happening -- far and away the most obscure bug I've seen in this
effort. Perhaps one to add to your respective stashes of kernel bug lore.
The problem was that I was taking a page fault on the active instruction when
resuming a small process that had just been invoked by a large process. It was
obvious! :-).
Preconditions:
1. The kernel, assuming that the hardware address space pointer matches the
   invoker address space pointer, does the following optimization:
     If we are transferring a string to a small space OR
     If sender and receiver spaces are the same THEN
       1. Probe the receive buffer using the active mapping table
       2. Do not build a kernel window into the receive buffer area.
2. The kernel was designed to save the address space on the way IN to the
   kernel. This was pretty foolish, as there are only two ways OUT of the kernel
   and lots of ways IN. Note a basic failure of software engineering here.
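For readers who want the optimization in precondition 1 spelled out, here is a
minimal sketch of the decision in C. The types and names (MappingTable, Process,
can_skip_kernel_window) are hypothetical and invented for illustration; they are
not the kernel's actual identifiers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's real structures. */
typedef struct MappingTable { int id; } MappingTable;

typedef struct Process {
    bool is_small;          /* true for a small-space process */
    MappingTable *space;    /* recorded mapping table pointer */
} Process;

/*
 * Returns true when the receive buffer may be probed through the active
 * mapping table, so that no kernel window into the receive area is needed.
 * This is only safe if the sender's recorded space really is the one loaded
 * in the hardware register.
 */
bool can_skip_kernel_window(const Process *sender, const Process *receiver)
{
    assert(sender != NULL && receiver != NULL);
    return receiver->is_small || sender->space == receiver->space;
}
```

Note that the second test compares recorded pointers, so it is only as
trustworthy as the bookkeeping behind them.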
Sequence of Events:
Some large process A transfers control to some small process. Because the
target is a small process, the kernel doesn't bother to update the address space
pointer in the hardware register. The problem is that it failed to record the
fact that the small process was now using the A space as its active address
space. THE RECORDED MAPPING TABLE POINTER IN THE SMALL PROCESS IS NOW STALE. As
mentioned, the kernel was designed to save the address space on the way IN to
the kernel. Because of a mistake in the code, this was not being done correctly
when the kernel took a page fault returning to user mode -- the page fault
manifests as a general protection fault, and must be recognized specially by the
kernel; I had failed to insert the "record the active space" code at the
recovery point.
Life continues on for a bit, and eventually the small process calls some other
large process B. Because of the recording failure, the kernel believes that the
second condition above (sender and receiver spaces are the same) is true. This is
VERY BAD. In the worst case, the stale mapping table may even have been reclaimed.
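The recording failure can be shown in a toy model. The sketch below is an
illustration under stated assumptions, not the kernel's code: hw_space stands in
for the hardware address space register, and the record_active_space flag (a
made-up parameter) selects between the buggy and the corrected behavior.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's real structures. */
typedef struct MappingTable { int id; } MappingTable;

typedef struct Process {
    bool is_small;
    MappingTable *space;    /* recorded mapping table pointer */
} Process;

/* Models the hardware address space register. */
static MappingTable *hw_space;

/*
 * Transfer control to target. A large process gets its own space loaded.
 * A small process keeps running in whatever space is already loaded, so
 * the kernel must record that fact or the process's pointer goes stale.
 */
void switch_to(Process *target, bool record_active_space)
{
    assert(target != NULL);
    if (!target->is_small) {
        hw_space = target->space;
        return;
    }
    if (record_active_space)
        target->space = hw_space;
}

/* True when the recorded pointer disagrees with the loaded space. */
bool recorded_space_is_stale(const Process *current)
{
    return current->space != hw_space;
}
```

With recording turned off, a small process invoked from large process A still
carries its old pointer, and every later comparison against that pointer is
made on bad data.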
In the particular case at hand, it proved that the OLD mapping table was the
dedicated kernel mapping table. As luck would have it, the kernel mapping table
contains TWO mappings of the kernel: one at its segment-relocated address and
the other based at 0x0. Retaining the latter is necessary to allow calls to the
BIOS32 routines. It was the latter that killed me, as it allowed the gate jump
logic to overwrite the kernel.
Eventually, the paranoia checks in the consistency checker correctly detected
that the kernel was mangled and committed ritual suicide on behalf of the entire
system rather than let a checkpoint occur.
Why this was Hard:
After identifying the deceased kernel data structure, data forensics quickly
confirmed that the area had been overwritten by something suspiciously like the
data pattern used by the test case. The puzzle was *how*, as the mapping tables
appeared to show good mappings. Further, the intended receive address was in
low memory, so almost every mapping table had valid writable mappings there.
The kernel debugger has some code to dump mapping structures for the
send/receive strings. Because the correct mapping table was obscured, this was
not helpful in tracking down the bug.
The bug is timing dependent, as it is influenced by the cleaning logic. I'd
been seeing it off and on for quite some time without being able to reliably
reproduce it.
Morals (such as they are):
1. Document your assumptions using assert().
2. Do things in the smallest possible number of places.
3. Always snapshot a buggy kernel before debugging so you can rip out all the
tracing crud that you put in while randomly hunting for the bug.
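Morals 1 and 2 might look something like the following in C. This is a sketch
only: the names (hw_space, record_active_space, use_fast_path) are hypothetical,
and the real kernel's structure may differ. The idea is to do the recording in
one function that every exit path calls, and to state the assumption behind the
fast path as an assert().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's real structures. */
typedef struct MappingTable { int id; } MappingTable;

typedef struct Process {
    bool is_small;
    MappingTable *space;    /* recorded mapping table pointer */
} Process;

/* Models the hardware address space register. */
static MappingTable *hw_space;

/* The one place where the recorded pointer is synchronized with hardware. */
void record_active_space(Process *p)
{
    assert(p != NULL);
    p->space = hw_space;
}

/* Decide whether the receive buffer can be probed via the active table. */
bool use_fast_path(const Process *invoker, const Process *receiver)
{
    /* The assumption the optimization rests on, stated where it is used. */
    assert(invoker->space == hw_space);
    return receiver->is_small || invoker->space == receiver->space;
}
```

Had the assertion been in place, a stale pointer would have stopped the kernel
at the point of the bad decision rather than after the damage was done.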
Jonathan S. Shapiro, Ph. D.
IBM T.J. Watson Research Center
Email: shapj@us.ibm.com
Phone: +1 914 784 7085 (Tieline: 863)
Fax: +1 914 784 7595