Issues in Persistence
1 IntroductionGeneral purpose operating systems need to provide a way to write things down in permanent fashion. Any information you can store away and get back after turning the machine off is said to be persistent. 1.1 Kinds of PersistenceHistorically, persistence has been implemented in two ways: explicit and transparent. In an explicit system, a program must request that the operating system write things down by issuing a write() call. Windows, UNIX, VMS, etc. are all explicit persistence systems. In order to have something last when the system is shut down it must be written to an on-disk file. If you accidentally kick the plug out of the wall, the system must do some recovery work when it restarts to clean up damage to the file system. In UNIX, this is done automatically by the fsck utility. In Windows, the user has to run checkdisk periodically. Users typically do not know this, which helps make Windows systems less reliable. In an implicit system, things get written down without any special action on the part of the program. One example of such a system was the early Lisp machines. In these machines, the operating system would periodically write down the entire state of the system: processes, data, video buffer, everything. If you accidentally kick the plug out of the wall, the system returns to whatever state was last written down. To the program, no interruption is ever visible. The catch in an implicit system is that everything gets written down. Doing this efficiently is tricky -- before KeyKOS, nobody really knew how. For some special-purpose applications, transparent persistence isn't what you want. To ensure security, network connections must evaporate when the system restarts. Databases require a means to write things down explicitly in order to support transactions. The general questions about the two approaches are:
In the context of capability systems, we should also ask:
I'll quickly address these questions and turn to how persistence works in EROS. 1.2 PerformanceThe performance issues can be broken down into two kinds:
1.2.1 Recovery TimeAn explicit persistence system is usually in the middle of an update when the system fails. As a result, the file system must be sanitized when the system restarts. Current file system implementations can take 30 minutes or more to recover on a large server after a crash. Until the file system is consistent, the system cannot let you run programs. While log-structured file system designs can recover more quickly, measurements by Ousterhout and Seltzer have shown that these designs don't work well under heavy load. There are actually two approaches to implicit persistence: per-process and transparent. In the per-process approach, each process is saved separately and is responsible for re-establishing consistency with its neighbors. In the transparent approach, the entire machine is saved at once so there is no need for different processes to sort out consistency between themselves. Both approaches must use a transacted log to be recoverable. In EROS, the logging approach is different from what has been tried by Ousterhout, Seltzer, et. al, and is quite fast. The end result is that EROS recovers in about 30 seconds regardless of how many disks are attached to the machine. 1.2.2 Runtime PerformanceSome people argue that explicit persistence should be more efficient, because you only write down what you need to. Transparent persistence is forced to write down things like temporary program state (the heap and the stack) that may not actually be needed. This involves extra work. The data suggests that EROS does a good bit better than conventional designs on comparable hardware, but it is hard to make any principled claim about why this is. Here are some of the issues:
EROS, which is transparently persistent, is able to fully utilize 100% of the available sustained disk write bandwidth on a sustained basis. UNIX, which is explicitly persistent, can only sustain about 20% throughput. Windows does much worse. By ``sustain,'' I mean for a long series of I/O's. It's easy to get good disk performance for a single write. Doing it for a hundred writes is a little more challenging. What was observed in KeyKOS (which was transparently persistent) is that the KeyKOS UNIX emulator did much better on file operations than native UNIX does. What this means to your application is not clear. It does suggest strongly that the benefits of transparent persistence should be carefully examined. 1.3 Burden on ProgramsOne argument in support of transparent persistence is the effect on application software. If you look at a typical program, you will find that a large amount of it is oriented toward dealing with files: opening, reading, writing, and closing them. Files exist for two reasons:
The interchange issue is fairly rare, and in most cases can be accomplished equally well using shared-memory techniques. A small number of utilities are certainly needed to allow tapes compatible with other machines to be written, but this is not central to the activity of a complex application. For many programs, a simpler solution would be to have programs resume after system shutdown with all of their state intact. Suppose, for a moment, that your To-Do manager simply runs forever (as it does if you use a device like a Palm Pilot). In that case, there is no need to write or read files, and almost half of the code for the application can be eliminated. Keeping data in active containers (processes) also makes it simpler to ensure that access to the data is properly managed. If the ``file'' is supposed to be a sequence of records, the managing software can ensure that all updates preserve the record structure. In a file system, this sort of customization would place a considerable burden on the operating system. 1.4 Preferred Solution for Capability DesignsAssuming that it can be made to work efficiently, transparent persistence is the preferred design for capability systems, for two reasons:
The startup security issue is nontrivial. The question is: ``Where does the first program get it's capabilities from?'' The most obvious answer is: ``From the file system.'' The next question ought to be: ``And how does it have the authority to talk to the file system if it doesn't have any capabilities yet?'' The answer is either ``Ulp.'' or ``Well, everything can talk to the file system.'' The design is now in deep trouble. If everything can talk to the file system, then everything can talk to everything else and it becomes impossible to enforce some important principles such as the principle of confinement and the principle of least privilege. As if that wasn't trouble enough, the next question is: ``And how do we decide if the process is authorized to access a given object in the file system?'' If your answer is ``According to the user's identity,'' then congratulations -- you've just re-invented UNIX. 2 Implications of Transparent PersistenceThe main challenge to transparent persistence is performance. The naive approach is to stop and make a snapshot of the entire machine. Let's examine what that would mean (in approximate terms) on a reasonably typical desktop PC: The machine has 32 megabytes of memory. 40% of the memory has been modified and needs to be written down. 12.8 megabytes of data must be transferred to the disk in units of 4096 (the page size), for a total of 3276 disk I/O operations. On a modern drive, it is just as expensive to switch from one head to the next as from one cylinder to the next, so the question to ask is how many tracks (not cylinders) we need to write down. Assume that there are 63 sectors per track, each 512 bytes (a typical modern EIDE drive). This is not true, but modern drives have sufficient on-disk buffering that we can pretend it is true without terrible loss of accuracy. 7 pages fit on a track, so a minimum of 468 tracks must be written. A current-technology disk (as of March 1998) rotates at 5400 rotations per minute, or 11 milliseconds per rotation. The minimum seek time is 4 milliseconds, which is the cost of going from one track to the next. The minimum total cost to write a track is therefore 15 milliseconds. 468 tracks times 15 milliseconds is just a bit above 7 seconds. During those 7 seconds, the content of the machine must not be modified. This is definitely long enough that the user will notice and get upset. For a server machine with 16 times as much memory, this approach is clearly out of the question. Obviously, the naive approach will not work. Several early implementations actually used this approach, which gave checkpointing a bad reputation. The solution is not to try to write everything so aggressively, but rather to mark all of the dirty pages ``copy on write'' and mark them as needing to be written. You can then write them in the background while letting computation continue. This reduces the overhead to something under 10 milliseconds, which is fast enough. 2.1 Writing the OS StructuresIn addition to the user data, however, we must also write down all of the data structures maintained by the operating system on behalf of the processes. If the process were to come back without any of its open files, we would not have a very useful persistence mechanism. More precisely, you must capture all of the persistent operating system state. A great deal of the operating system state is not persistent. Some examples:
Conceptually, capturing the OS state isn't hard to do -- you simply locate all of the operating system data structures to be saved, convert them into an on-disk format (sometimes known as "pickling"), and write them down. The main challenge is that this copy activity must be done all at once, because it is difficult to make the operating system data structures copy on write. Because the data structures need to be copied quickly, it is very inconvenient if the persistent structures are allocated from a heap. If they are, then a garbage-collection sweep pass must be made to find them and pickle them, and the operating system must reserve sufficient memory to pickle them to (because you cannot stop halfway through the job without getting an inconsistent result). Pickling is therefore greatly simplified if the persistent data structures are allocated out of fixed-size tables. Most operating systems are not designed this way. Aside: I would argue that dynamically allocated heap memory is a deceptively easy way to get into trouble in most kernels. It is too easy for one subsystem to allocate memory that another subsystem needs, which creates both hidden sources of deadlock and hidden communication channels within the kernel. It also makes performance difficult to predict. The counterargument is that a common heap allows the operating system to respond to changes in workload without requiring excessive heap memory. This is true, but does not address the issue of simultaneously high demand from multiple workload sources. If a dynamic heap is to be used, it is best to segregate the heap by usage type so that every subsystem is guaranteed a certain minimum portion of the heap sufficient to ensure it's correct execution. Finally, it is important to note that bugs in a persistent operating system last forever. Operating system state is difficult to check for consistency. The greater the amount of state that is "owned" by the operating system, the greater the likelihood that damaged state will get written down. Once written down, there is no recovery short of reloading the system. 2.2 Write-Ahead Checkpoint LogThe other problem with transparent persistence is that the system can fail while you are halfway through writing down the state. This means that you cannot throw away the previous state before you are done writing down the new version. Some portion of the disk must therefore be reserved for a write-ahead log; data in official ``home locations'' must not be updated until a complete snapshot has been accomplished. After it has been successfully written down, the data can be migrated to it's official home. You might think that this extra migration load causes more I/O to happen. It does, but for the reasons discussed in section 1.2 the impact is different from what you might expect. 3 The Database ProblemBy now, I hope, you will have seen that checkpointing might be made to work. If you are reading this on EROS, a checkpoint has probably occurred at some point while you were reading and you didn't notice The checkpointing overhead on EROS is less than 0.3%, so a lot of other things get in the way first. When an ATM machine hands you money, it is important that the bank remember this fact. Suppose that the server takes a checkpoint, hands you some money, and then fails before the next checkpoint. When it comes back, it will not remember that it has given you any money. Aside: A strong case can be made that this is a perfectly sound design. No matter how good the system design is, some amount of money will be lost in the face of certain kinds of failures. The question is how much and how often. A bank experiencing one machine failure every two years might well decide that it was willing to accept a loss of up to $2 million every two years if this enabled the bank to execute twice as many transactions. On such a machine, a reasonable checkpoint policy is every 5 minutes or $2 million in withdrawals (whichever comes first). As the customer whose deposit got lost, you are likely to be less forgiving. Recall, however, that the ATM machine contains the check or envelope you deposited, so these transactions can be reapplied. For all that it might be a reasonable design, the bank isn't going to take that view. A mechanism is needed to remember this sort of information if the system fails. A transparently persistent system must provide such a mechanism. In EROS, the solution is something called journaling. A process holding the journaling capability can say ``mark this page as a journaled page.'' It can later come back and say ``write this page down quickly, because I need it to survive a machine failure.'' This mechanism provides the necessary underpinnings for the database system. 4 ConclusionsThe moral of the story is that mechanisms for explicit persistence are necessary to support databases applications. My view, however, is that these applications are the exceptional case. Most applications should not have to deal with files or persistence at all. EROS shows that transparent persistence is feasible and efficient. Copyright 1998 by Jonathan Shapiro. All rights reserved. For terms of redistribution, see the GNU General Public License |