post-mortem on networking into kernel
Jonathan S. Shapiro
jsshapiro@earthlink.net
Sun, 3 May 1998 14:47:28 -0400
After reading the various reactions to moving the networking stack
into the kernel, I'm going to hold off, and continue with the
user-level implementation.
To get this done in finite time, I'm going to make things so that all
connections are lost on restart (or very shortly thereafter), because
the entire networking stack is demolished on restart. Preserving
connections is potentially complicated, and I want to do the simpler
thing first.
Also, just so that people know:
There have been a variety of measurements that emphasize the
performance benefit of being able to handle back to back packets. I'm
not aware of any data on the size of these packets. If you know of
any, please send pointers. Also, if you know of any data on how
performance varies with the *number* of successfully accepted back to
back packets, I'd be very grateful.
In moving the protocol stack out of the kernel, I've been concerned
about context switch and data copying overheads.
Here is a "back of the envelope" calculation for NFS behavior:
Consider a 100 Mbit ethernet. An 8192 byte NFS packet (UDP, 8k being
a commonly chosen size) turns into 6 ethernet frames, of which the
last is only partially filled. These are queued essentially all at
once, so it seems that 6 frames is a reasonable first-order assumption
about back to back packet arrival.
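
As a rough check on the six-frame figure, here is a minimal sketch of the
fragmentation arithmetic. It assumes the 8192 bytes are the UDP payload, a
standard 1500 byte ethernet MTU, and a 20 byte IPv4 header with no options;
none of those constants are stated above, and RPC/NFS headers are ignored.

```python
import math

MTU = 1500        # ethernet payload bytes (assumed standard MTU)
IP_HEADER = 20    # IPv4 header without options (assumption)
UDP_HEADER = 8    # UDP header, carried only in the first fragment

def fragment_sizes(udp_payload, mtu=MTU):
    """Return the IP payload bytes carried by each fragment of one datagram."""
    # Fragment data other than the last must be a multiple of 8 bytes.
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    total = udp_payload + UDP_HEADER
    count = math.ceil(total / per_fragment)
    sizes = [per_fragment] * (count - 1)
    sizes.append(total - per_fragment * (count - 1))
    return sizes

sizes = fragment_sizes(8192)
print(len(sizes), sizes)   # 6 [1480, 1480, 1480, 1480, 1480, 800]
```

Under these assumptions the last frame carries 800 of a possible 1480
bytes, which matches the "partially filled" sixth frame.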
A 100 Mbit ethernet is roughly a 12.5 Mbyte ethernet. Assume that
there is a low likelihood of contention on the ethernet during this
transmission (a worst case assumption), and that the packets will
therefore be placed on the wire in a sequential burst. Assume that we
are talking about transmitting drivers capable of back to back
transmission (most 100 Mbit drivers are, at least under Linux, and
probably under OpenBSD).
The interrupts for these packets will therefore arrive at intervals of
1518/(12.5*1024*1024) seconds, or roughly 116 microseconds, on the
receiving CPU.
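
The interval arithmetic can be reproduced with a few lines. This is only a
sketch of the same back-of-the-envelope formula; the second call uses a
decimal 12.5e6 bytes/sec as an alternative rounding (my assumption, not part
of the estimate above), and neither accounts for preamble or inter-frame gap.

```python
FRAME_BYTES = 1518                     # maximum ethernet frame, as used above
BYTES_PER_SEC = 12.5 * 1024 * 1024     # the estimate's rounding of 100 Mbit/s

def interarrival_us(frame_bytes, bytes_per_sec=BYTES_PER_SEC):
    """Time on the wire for one frame, in microseconds."""
    return frame_bytes / bytes_per_sec * 1e6

print(round(interarrival_us(FRAME_BYTES), 1))           # 115.8
print(round(interarrival_us(FRAME_BYTES, 12.5e6), 1))   # 121.4
```

Either way the answer stays in the 115 to 125 microsecond range, so the
choice of rounding does not change the conclusion.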
The good news is that these processors have supervisor CPIs in the
range of 1.3 to 1.4, and it will take us over 100 instructions to
service the first interrupt, so with high likelihood we will never
actually field the second and subsequent interrupts, provided that we
have someplace to stash the packets.
There are several possible designs at a higher level:
1. One kernel invocation to receive each packet. This is what
I shall try first.
2. A hybrid kernel invocation in which the ethernet driver may
return multiple ethernet frames in a single invocation.
This is probably better than (1), and only slightly harder.
3. In-kernel IP packet reassembly, and return full packet.
4. A shared memory interface. This is significantly harder
than (2), and I don't see that in practice it yields all
that much benefit relative to (2) for this problem.
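
To make the difference between (1) and (2) concrete, here is a toy model
that only counts invocations for a six-frame burst. The class and method
names (FrameQueue, receive_one, receive_many) are hypothetical and are not
the actual kernel interface; it is a sketch of the idea, nothing more.

```python
from collections import deque

class FrameQueue:
    """Toy stand-in for the driver's receive queue (hypothetical names)."""

    def __init__(self):
        self.pending = deque()
        self.invocations = 0

    def enqueue(self, frame):
        self.pending.append(frame)

    def receive_one(self):
        """Design (1): each invocation returns at most one frame."""
        self.invocations += 1
        return self.pending.popleft() if self.pending else None

    def receive_many(self, max_frames):
        """Design (2): one invocation drains up to max_frames queued frames."""
        self.invocations += 1
        count = min(max_frames, len(self.pending))
        return [self.pending.popleft() for _ in range(count)]

def invocations_for_burst(batched, burst=6, max_frames=8):
    """Count invocations needed to drain a burst that is already queued."""
    queue = FrameQueue()
    for i in range(burst):
        queue.enqueue(f"frame{i}")
    while queue.pending:
        if batched:
            queue.receive_many(max_frames)
        else:
            queue.receive_one()
    return queue.invocations

print(invocations_for_burst(batched=False))   # 6
print(invocations_for_burst(batched=True))    # 1
```

The model assumes the whole burst is queued before the receiver runs, which
is the favorable case for (2); if the receiver wakes after the first frame,
the saving is smaller.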
If this is a reasonable SWAG for back-to-back packet behavior we're in
relatively good shape, though data motion is still a critical issue.
If, however, the back to back packets are small (64 bytes, worst
case), then the interarrival time is closer to 4.8 usecs, and we are
smack dead on the kernel invocation time. If we do not do a
multipacket interface, we won't get much of anywhere.
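
A rough way to see why batching matters here: if one invocation costs about
as much as one interarrival time, the invocation cost has to be spread over
several packets before there is any time left for real work. The sketch
below uses the same formula as the 1518 byte case, which gives about 4.88
usecs for 64 bytes. The invocation cost is taken from the statement above;
the 2.0 usec per-packet work figure is purely hypothetical.

```python
import math

INTERARRIVAL_US = 64 / (12.5 * 1024 * 1024) * 1e6   # about 4.88 usecs
INVOCATION_US = 4.8    # taken as roughly equal to the interarrival time
PER_PACKET_US = 2.0    # hypothetical per-packet work, not a measured value

def min_batch(invocation_us, per_packet_us, interarrival_us):
    """Smallest n with invocation_us / n + per_packet_us <= interarrival_us.

    Returns None if the per-packet work alone exceeds the interarrival time,
    in which case no batch size keeps up.
    """
    slack = interarrival_us - per_packet_us
    if slack <= 0:
        return None
    return max(1, math.ceil(invocation_us / slack))

print(min_batch(INVOCATION_US, PER_PACKET_US, INTERARRIVAL_US))   # 2
```

With these made-up numbers a batch of two is already enough, but the result
is very sensitive to the per-packet cost, which is exactly the copying and
data motion overhead still in question.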
The copying and extra invocation times have me running a little
worried, but let's see how it goes.
shap