Fw: Some very thought-provoking ideas about OS architecture.
Mon, 21 Jun 1999 11:49:21 -0400
Reminder: If you reply, reply to both cap-talk and email@example.com
---------------------- Forwarded by Jonathan S Shapiro/Watson/IBM on 06/21/99 11:48 AM ---------------------------
Steve Underwood <firstname.lastname@example.org> on 06/21/99 05:56:22 AM
To: Jonathan S Shapiro/Watson/IBM@IBMUS
Subject: Re: Fw: Some very thought-provoking ideas about OS architecture.
>>> [Alan Cox:] Another peril is that external interfaces don't always
>>> allow replay of events.
>> [Eric Raymond:] A much more serious objection, I agree.
> There is a more fundamental issue: replaying events to external
> connections violates security, because the connections need to be
> reauthenticated. Device drivers are part of the kernel, which is outside
> of the checkpoint boundary, so the issue you are getting at doesn't
> really apply where devices are concerned. Low-level code (e.g. the net
> stack) must be written with awareness of the checkpoint mechanism. The
> operating system explicitly rescinds all connections on restart in order
> to ensure that reauthentication occurs.
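For concreteness, the rescind-on-restart policy described above might be
sketched like this (all names here are hypothetical illustrations, not the
actual EROS interfaces):

```python
# Hypothetical sketch of rescind-on-restart; NOT actual EROS code.
# On recovery, the system revokes every external connection, so any
# holder's next operation fails and forces reauthentication.

class Connection:
    def __init__(self, peer):
        self.peer = peer
        self.valid = True          # becomes False when rescinded

    def send(self, data):
        if not self.valid:
            raise ConnectionError("connection rescinded; reauthenticate")
        return len(data)

class System:
    def __init__(self):
        self.connections = []

    def open(self, peer):
        conn = Connection(peer)
        self.connections.append(conn)
        return conn

    def restart_from_checkpoint(self):
        # The checkpoint image may contain Connection objects, but their
        # external state is gone, so every one is explicitly rescinded.
        for conn in self.connections:
            conn.valid = False

sys_ = System()
c = sys_.open("peer-a")
c.send(b"hello")                   # works before the crash
sys_.restart_from_checkpoint()
try:
    c.send(b"hello again")
except ConnectionError:
    pass                           # caller must now reauthenticate
```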
There seems to be tremendous vagueness in this. You highlight more
problems than Alan did, but give no explanation of a cure. Replaying any
events is impossible - the rest of the world just won't play ball.
Picking up where things left off is also impossible - you don't know
where that was; you only know about the last checkpoint. Clearly network
connections must be rescinded, but what do you do about that? My
understanding is that when the machine comes back up it will step back
precisely to the last checkpoint, and run from there. I guess all its
TCP connections will suddenly show an error and die, and the software
must reconnect to the rest of the universe. Right? What then happens to
other stateful external interactions? Do you have some mechanism that
will cause all external stateful activities to die, so they don't
continue in an undefined way? I'm thinking of UDP interactions;
interfaces with custom equipment; printing; and so on. Custom interfaces
would be very difficult to deal with in a totally generalised way.
Persistence is a curse, not a benefit, when applied to these things.
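From the application side, the recovery story being questioned here -
every stateful channel erroring on first use after a restore, followed by
reauthentication - would look roughly like this (a sketch with
illustrative names, not EROS code):

```python
# Sketch of application-side recovery after a restore: the old channel
# fails on first use, and the software reauthenticates from scratch.
# All class and function names here are illustrative, not from EROS.

class DeadChannel:
    """Stands in for a connection rescinded at restart."""
    def send(self, msg):
        raise ConnectionError("rescinded at restart")

class LiveChannel:
    """Stands in for a freshly authenticated connection."""
    def __init__(self, peer):
        self.peer = peer
    def send(self, msg):
        return f"ok:{msg}"

def resume(channel, credentials, reconnect):
    """Try the old channel; on error, reauthenticate and retry once."""
    try:
        return channel.send("ping"), channel
    except ConnectionError:
        fresh = reconnect(credentials)     # full reauthentication
        return fresh.send("ping"), fresh

reply, chan = resume(DeadChannel(), "secret",
                     lambda cred: LiveChannel("peer"))
# reply == "ok:ping", and chan is the fresh, reauthenticated channel
```

Note that this only covers TCP-like channels that fail loudly; the UDP
and custom-equipment cases raised above have no such error signal, which
is exactly the difficulty.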
> The real win, and the answer to Alan's question, is that it's not about
> sequential writes. It's about bulk block transfers. As you modify
> objects, changes are written to a write-ahead log, which is where the
> disk spends most of its time. Empirically, you are rarely more than
> three tracks from where you want to be. This significantly reduces seek
> time. Also, the log is logically append-only, which means that few
> writes (even if they touch different objects) require a seek at all. In
> this sense, there is one large sequential block write occurring.
> Eventually, the log fills up, a checkpoint transaction is declared, and
> a bulk transfer of the data to its home locations occurs. What Eric
> describes as a sequential block write isn't really what happens here.
> What really happens is that you reach into the log, pull out an
> overflowing handful of blocks to be moved to their home locations, and
> then transfer them in a single arm sweep across the disk. In practice,
> many of the mutates are clustered, so you end up with seeks only between
> clusters. Essentially it's a sorted bulk transfer. If the machine fails
> during transfer you are okay, because a complete snapshot exists in the
> checkpoint log.
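The log-then-migrate scheme in the quoted text can be illustrated with a
toy model (purely a sketch of the idea, not EROS's implementation):
mutations append sequentially to a log, and at checkpoint time the dirty
blocks are sorted by home address so the migration becomes one monotonic
arm sweep.

```python
# Toy model of the checkpoint migration described above (not EROS code).
# Writes append to a log with no seek; at checkpoint time, the newest
# version of each dirty block is sorted by home address so the bulk
# transfer is a single monotonic sweep across the disk.

class CheckpointLog:
    def __init__(self):
        self.log = []                  # append-only: (home_addr, data)

    def write(self, home_addr, data):
        self.log.append((home_addr, data))   # sequential append

    def checkpoint(self, disk):
        # Keep only the newest version of each block, then visit home
        # addresses in sorted order so the arm moves in one direction.
        newest = dict(self.log)        # later entries win
        for addr in sorted(newest):
            disk[addr] = newest[addr]
        self.log.clear()
        return sorted(newest)          # the order the arm visited

disk = {}
wal = CheckpointLog()
wal.write(90, "a"); wal.write(10, "b"); wal.write(90, "a2")
order = wal.checkpoint(disk)
# order == [10, 90]; disk[90] holds the newest version, "a2"
```

If a crash interrupts the sweep, the log still holds a complete snapshot,
so the migration can simply be re-run.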
This description is very write-oriented. What about reads? In a real
world situation the machine doesn't just sit there processing, changing
its data, and needing writes to make it persistent. Most applications
manage a large data set, and are forever hopping around that data. If the
RAM is much smaller than the disk (like in anything but a TPC benchmark
test) there will be lots of disk reading. In most systems this is
through explicit reads. I assume in EROS there is a tremendous number of
page faults and VM swaps (am I wrong?). Doesn't this rather pull the
large disk block operation concept apart?
Another issue. Does the read/write clustering offer that much benefit
with modern disk drives? They keep bringing down the seek times, as the
head assemblies get smaller and lighter, but they can't damp the
vibration much faster. The seek times are approaching the settling times,
so clustering the reads and writes around a few tracks doesn't have the
performance benefit it used to have (though it might help with wear and
tear). Writing a large stream of sectors certainly has benefits, but
does writing non-contiguous sectors in a small area of the disk still
offer much over widely spaced ones?
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to email@example.com
Please read the FAQ at http://www.tux.org/lkml/