Message passing
Jonathan Shapiro
shap@viper.cis.upenn.edu
Fri, 16 Dec 94 14:02:44 -0500
> [string move] is one of the major impediments in building a
> decently fast 370 implementation.
I guess I just don't understand. The semantics are simple, and
the hardware is nearly the only thing in the right place to implement
decisions based on the memory architecture. Certainly the compiler
is very unlikely to know about things such as interleaving, and
whether this move should load the cache or not.
There are several reasons that string moves are undesirable from the
hardware perspective:
1. The necessary hardware design to take advantage of special
alignment and blocking is complex. Enough so that it isn't really
feasible to do it in hardware (microcode is another matter, but
microcode isn't any faster than software). Software implementations
therefore operate at an advantage.
2. The string move operation can touch multiple cache lines late in
its execution. This leads to a hardware design that must be able to
field cache and TLB faults at multiple stages of the pipeline. This
can be handled, but it makes the hardware's I/O interface much more
complicated, and introduces resource contention in the hardware where
none arises in the absence of a block move operation. This is an
especially difficult complication in multi-issue implementations, as
it screws up the machine's ability to quickly decide how many
instructions it can issue.
Similar issues arise, to a lesser degree, in instructions that modify
more than one register (e.g. multiply on low-performance RISC
architectures). Both the bussing and the resource management logic
get screwed up by multiple targets. Machines with load-update
instructions raise similar problems. Write ports on the register file
are a very precious resource.
3. If you have a look at a typical bcopy implementation, you'll find
that the software is very aware of cache blocking and overlap
issues. On a multi-issue RISC machine, and in many cases on a
single-issue RISC machine, the hand-coded bcopy implementation runs at
least as fast as, and sometimes faster than, a hardware block move.
A lot of newer COBOL runtimes avoid string move instructions for this
reason.
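To make point 3 concrete, here is a minimal sketch in C of the kind of
checks a software copy routine makes before it moves any data. It is
not taken from any real library: the name sketch_bcopy and the word_t
typedef are assumptions for illustration only. It shows the overlap
test and the destination-alignment prologue. It does not attempt
cache-line blocking or prefetching, which are machine-specific.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: unsigned long is the natural machine word. */
typedef unsigned long word_t;

/* Hypothetical bcopy-style routine (argument order: src, dst, n). */
void sketch_bcopy(const void *src, void *dst, size_t n)
{
    const unsigned char *s = src;
    unsigned char *d = dst;
    uintptr_t sa = (uintptr_t)s, da = (uintptr_t)d;

    if (n == 0 || sa == da)
        return;

    if (da < sa || da - sa >= n) {
        /* Destination is below the source, or the regions are
         * disjoint: a forward copy never overwrites unread bytes. */

        /* Copy single bytes until the destination is word-aligned. */
        while (n > 0 && (uintptr_t)d % sizeof(word_t) != 0) {
            *d++ = *s++;
            n--;
        }

        /* Copy a word at a time. Each word is loaded into a local
         * before it is stored, so a source that starts less than one
         * word above the destination is still handled correctly.
         * The fixed-size memcpy calls avoid aliasing problems; a
         * compiler normally reduces each to one load or store. */
        while (n >= sizeof(word_t)) {
            word_t w;
            memcpy(&w, s, sizeof w);
            memcpy(d, &w, sizeof w);
            s += sizeof w;
            d += sizeof w;
            n -= sizeof w;
        }

        /* Copy the remaining tail bytes. */
        while (n > 0) {
            *d++ = *s++;
            n--;
        }
    } else {
        /* Destination starts inside the source region: copy from the
         * end so source bytes are read before they are overwritten. */
        s += n;
        d += n;
        while (n > 0) {
            *--d = *--s;
            n--;
        }
    }
}
```

A call such as sketch_bcopy(buf, buf + 3, 20) takes the backward path,
while sketch_bcopy(buf + 5, buf, 20) takes the forward path. A tuned
implementation would go further and pick its block size to match the
cache line, which is exactly the knowledge the software has on hand.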
All of these issues can be, and have been, solved. Solving them slows
down the machine. String move complicates a lot of parts of the
machine that are performance critical.
Putting all of this together, it's unlikely that RISC architectures
will evolve to have string move instructions. Why slow down the
string move?
As it happens, good optimizing compilers are aware of cache blocking
these days. None do uncached moves that I'm aware of, but most make
some effort to eliminate cache thrashing in inner loops.
Jonathan