I've included Iain's entire mail below, as he has asked some very good questions in a very clear way, but neglected to copy the cap-talk list. Apologies for length to the Linux list. I'm going to answer his file system question separately, because there are a lot of issues buried in the question that need to get teased out.
Iain is correct that there is a serious issue here. If your computation modifies all of memory, and does so in a truly random way, and it does so fast enough, and you have managed to get the physical memory on the machine tuned so that it just exactly holds the problem without doing I/O, checkpointing is going to be a problem. I have several reactions to this, none of which really solve the problem from Iain's point of view.
Before I suggest the workarounds, I want to note that Iain's problem identifies a very peculiar border case in which the problem "just fits" in the memory. Were the problem slightly smaller or slightly larger the issues would be quite different. I submit that if you tune your machine carefully enough you can arrive at a similar issue on almost any operating system, and the fact that checkpointing gets you at one very specific point in the optimization space isn't all that compelling a problem.
Also, I want to note that this class of applications simply isn't what EROS was designed to do well.
Given which, and agreeing that in his scenario there is a problem, here are the workarounds that are possible in EROS:
Ultimately, this goes back to some of the earlier comments about tradeoffs. The designs of KeyKOS and EROS are biased for fault tolerance, decomposability, and global consistency, because for the target applications we needed to address these were mission critical requirements. Iain has identified an application for which none of these criteria are important, but the last byte of memory is precious. Given the falling cost of memory and the fact that no system can successfully optimize for everything, I think that we made a reasonable choice even if it isn't right for the problem Iain wants to solve.
And, as I say, we have a design that we think will solve the problem. If pressed, I'ld have to admit that implementing that design is not presently at the top of the queue.
Once again, an insightful question (that's two for two, Iain; keep it up, 'cause the crowd is going wild in the stands :-).
Speaking for myself, I have never taken very seriously the notion that persistence eliminates the need for serialization. Setting aside the robustness issue, there is still a need for data interchange and network interaction, and that invariably requires serialization. There does appear to be a "cut line" that can be drawn in decomposed systems, and many services do not appear to need to do their own serialization.
That said, let's answer your question.
First, it's useful to ask "What is the source of the robustness?" in your MAGIC example. I suggest that it's not the distinction between files and memory. Rather, it's the fact that the file-based form sits on the other side of a protection boundary and uses a simpler representation. The protection boundary eliminates random pointer problems. The simpler representation increases the likelihood that the data is stored consistently.
In KeyKOS, the "file" object is a carefully implemented process. It does indeed have to watch its pointer references closely, but no more or less so than the Linux file system does. Ultimately, the usage pattern in EROS is expected to be simpler -- there is neither need nor benefit to having a separate instance of the editor for every document. Actually, this would impede data interchange rather badly.
So where is the benefit of persistence?
Perhaps Norm Hardy or Charlie Landau will have things to add about this (if so, guys, please respond to BOTH LISTS!!)
Jonathan S. Shapiro, Ph. D.
IBM T.J. Watson Research Center
Email: shapj@us.ibm.com
Phone: +1 914 784 7085 (Tieline: 863)
Fax: +1 914 784 7595
Iain McClatchie <iainmcc@ix.netcom.com> on 06/23/99 08:10:19 PM
To: Jonathan S Shapiro/Watson/IBM@IBMUS
cc: linux-kernel@vger.rutgers.edu
Subject: Questions about EROS
Jonathan,
The discussion about EROS has been quite entertaining. I have three questions.
I run an application which uses Binary Decision DAGs (BDDs) in the optimization of netlists. On one small testcase, this application grabs 100 MB of memory, which it overwrites almost 1000 times over the course of 10 minutes. When I imagine running this application with the BDDs in persistent storage, checkpointed once every five minutes, I imagine one of two things will happen:
Either the application will grab less than half of the available physical RAM, in which case half the RAM will store the checkpoint being written and the other half will store the active process state, or the application will come to a dead stop every five minutes when it runs out of memory attempting to copy 96MB of its COW pages. It will stay stopped like this for 10 to 15 seconds every 5 minutes, which is about a 5% slowdown, or half the difference between a $550 500MHz PIII and a $330 450MHz PIII. Either way it appears that persistence will cost approximately $100 for this admittedly small testcase.
It seems that the checkpointing of massive temporary computation caches is fairly expensive. Do you folks on EROS have some ideas about how to mitigate this cost? For instance, how would an application like Navigator cache the decompressed versions of the JPEG pictures it presents frequently?
Software bugs and robust versus execution states
I use another wonderful application, MAGIC, to edit the layout for my circuits. MAGIC crashes every month or two due to software problems, and yet I have little difficulty maintaining, uncorrupted, the state of hundreds of interconnected layout cells. It appears to me that this ease of use stems from having two representations of the layout cells: one on disk, in ASCII, which is terse, simple, and robust, and another in memory which is larger, complex, fast, but more prone to accidental corruption.
How do persistent object systems contain software errors? At first glance, it seems to me that you would need, at the very least, some form of this dual data representation, along with its attendant operator-directed translations between the two, and a way of putting at least one representation into a seperately controlled, global namespace. Having done this, you've recapitulated the programming effort of having a file system and interfaces to that file system from every program. Wasn't avoiding this effort one of the proposed benefits of a persistent object system in the first place?
Allocation policy
A few months ago we had a wonderful debate on linux-kernel about the goodness of reiserfs, which is a new filesystem being developed by Hans Reiser. I took away from that debate a notion that any decently designed filesystem can find and read data already on the disk in a minimum of seeks. The number of seeks, and the overall performance of the filesystem is determined by their allocation policy: how they lay the data down on the disk.
The filesystem in a persistent object system has the same problem presented to it as the filesystem in a UNIX system, except (a) it is also expected to store many short-lived objects, and (b) the entire contents of memory are regularly swapped to disk. It seems that both (a) and, to a lesser extent, (b) would form a kind of noise would tend to swamp the signal that your filesystem allocator is looking for. If you think of the allocator as being something like the malloc() algorithm in a garbage collector, the programs in a UNIX system mostly seperate the long-lived objects from the short-lived objects before they get to the allocator.
Does EROS have some sort of system for implicitly or explicitly recognising long-lived versus short-lived objects?
-Iain
iainmcc@ix.netcom.com
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/