A transaction is wrapped around every extension invocation, not every kernel operation, so the overhead you see depends on how many extensions lie in the path of the operation.
The L4 IPC overhead is an interesting figure, but I'm not sure how it applies here. Which kernel operations are you concerned with? How long do they take? Is a naked IPC the best benchmark for such an operation? Is a getpid()? How about a call to read() for one byte, when the data is already in the buffer cache? I would guess that on most systems the latter takes more than 1570 cycles, whereas a getpid() can be done in roughly the time an L4 IPC takes.
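If anyone wants to check that guess on their own machine, here is a rough sketch of how the two calls could be timed. It is my own illustration, not a benchmark anyone in this thread has run. The function names `time_getpid` and `time_read1` are made up, it uses pread() at offset 0 as a stand-in for a one-byte read() so the file position never runs off the end, and it reports nanoseconds rather than cycles. It also assumes a POSIX system where getpid() really enters the kernel (some older C libraries cached the result in user space).

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>

/* Monotonic clock reading in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Hypothetical helper: average cost of getpid(), in ns per call.
 * The volatile sink keeps the compiler from discarding the calls. */
double time_getpid(long iters)
{
    volatile pid_t sink;
    uint64_t start = now_ns();
    for (long i = 0; i < iters; i++)
        sink = getpid();
    (void)sink;
    return (double)(now_ns() - start) / (double)iters;
}

/* Hypothetical helper: average cost of a one-byte pread() at offset 0
 * of `path`, in ns per call. Returns a negative value on any error.
 * The untimed first read pulls the page into the buffer cache. */
double time_read1(const char *path, long iters)
{
    char byte;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1.0;
    if (pread(fd, &byte, 1, 0) != 1) {
        close(fd);
        return -1.0;
    }
    uint64_t start = now_ns();
    for (long i = 0; i < iters; i++) {
        if (pread(fd, &byte, 1, 0) != 1) {
            close(fd);
            return -1.0;
        }
    }
    uint64_t elapsed = now_ns() - start;
    close(fd);
    return (double)elapsed / (double)iters;
}
```

Call each function from a main() with something like a million iterations on any readable, non-empty file, then multiply the nanosecond figures by your clock rate in GHz to get an approximate cycle count. The loop and clock overhead are included in the result, so treat the numbers as upper bounds.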
Once again, this is orthogonal to the main line of discussion; I'm picking nits.
And of course, no offense taken -- I just wanted to make sure you had the up-to-date numbers.