[eros-cvs] cvs commit: eros/src/doc/www/design-notes 00DesignNotes.html IA32-Emulation.html

Jonathan S. Shapiro shap@eros.cs.jhu.edu
Fri, 15 Mar 2002 19:11:53 -0500


shap        02/03/15 19:11:53

  Modified:    src/doc/www/design-notes 00DesignNotes.html
                        IA32-Emulation.html
  Log:
  More updates on the IA32 emulation note

Revision  Changes    Path
1.37      +1 -1      eros/src/doc/www/design-notes/00DesignNotes.html

Index: 00DesignNotes.html
===================================================================
RCS file: /cvs/eros/src/doc/www/design-notes/00DesignNotes.html,v
retrieving revision 1.36
retrieving revision 1.37
diff -u -r1.36 -r1.37
--- 00DesignNotes.html	15 Mar 2002 14:09:48 -0000	1.36
+++ 00DesignNotes.html	16 Mar 2002 00:11:53 -0000	1.37
@@ -127,7 +127,7 @@
 		  </tr>
 		  <tr valign="top">
 		    <td width = "50%">
-		      <a href="IA32-Emulation.html">Support for IA-32 Segmentation</a>
+		      <a href="IA32-Emulation.html">IA-32 Emulation</a>
 		      <img src="../img/small-new.gif" alt="NEW:" align="top">
 		    </td>
 		    <td width = "50%">



1.2       +456 -602  eros/src/doc/www/design-notes/IA32-Emulation.html

Index: IA32-Emulation.html
===================================================================
RCS file: /cvs/eros/src/doc/www/design-notes/IA32-Emulation.html,v
retrieving revision 1.1
retrieving revision 1.2
diff -u -r1.1 -r1.2
--- IA32-Emulation.html	15 Mar 2002 14:09:48 -0000	1.1
+++ IA32-Emulation.html	16 Mar 2002 00:11:53 -0000	1.2
@@ -1,690 +1,544 @@
 <html>
   <head>
-    <title>Support for IA-32 Segmentation</title>
+    <title>IA-32 Emulation</title>
   </head>
   <BODY BGCOLOR="#ffeedd" text="#000000" link="#0000ee" vlink="#551a8b" alink="#ff0000">
     <table>
-      <tr valign=top>
+      <tr valign="top">
 	<td width="10%">&nbsp;</td>
 	<td>
+	  <p>
+	    <img src="../img/construction.gif" align="left">
+	    (3/15/2002) This note is actively being authored at this
+	    time. Therefore, much of what it says is utterly wrong,
+	    flat nonsense, or simply balderdash. Now why didn't the
+	    W3C dudes build a smiley face into the HTML character set?
+	  </p>
+	  <br clear="left">
 	  <center>
-	    <h1>Support for IA-32 Segmentation</h1>
+	    <h1>IA-32 Emulation</h1>
 	  </center>
 	  <p>
-	    The primary goal of segment register support is to run
-	    protected-mode programs that rely on relocation and bounds
-	    as implemented by the IA32 segment registers -- i.e. code
-	    that is not flat model code. In addition, we <em>may</em>
-	    need to support call gates where the destination is in the
-	    same ring as the source -- these may conceivably be used
-	    by some applications to implement dynamic binding. It is
-	    unclear at this point to what degree we need to deal with
-	    call gates.
-	  </p>
-	  <p>
-	    There is no requirement to support segments for code other
-	    than user mode code. In all cases where a segment entry
-	    would lead to a privilege level change, it is satisfactory
-	    to generate a fault that causes control to be transferred
-	    to a keeper.
-	  </p>
-	  <p>
-	    There are a LOT of ways that this can go badly
-	    wrong. Those of you who actually know the chip
-	    <em>please</em> check me on all of this.
-	  </p>
-	  <h2>1. General Background</h2>
-	  <p>
-	    The purpose of this exercise is to support operating
-	    system emulation of guest operating systems. The goal is
-	    to take a hybrid approach in which user-mode guest code
-	    runs in a normal EROS process, and supervisor code is
-	    interpreted or dynamically compiled. To do this, it is
-	    clearly necessary that we be able to emulate that portion
-	    of the segmentation behavior that is observable from
-	    ring-3 (user) code.
-	  </p>
-	  <p>
-	    While doing this, we would like to minimize the degree to
-	    which the operating system relies on the emulator to be
-	    correct. Thus, the operating system should be defensive
-	    about segment table values that might lead to protection
-	    compromise.
-	  </p>
-	  <p>
-	    On the x86, segment descriptors live in one of two tables:
-	    the local descriptor table (LDT) and the global descriptor
-	    table (GDT).  The LDT is (by convention) a per-process
-	    table. The GDT is (by convention) a global table.  The
-	    locations of these tables are named by the <b>LDTR</b> and
-	    <b>GDTR</b> registers, respectively. Only ring 0
-	    (privileged) code can reload these registers.
-	  </p>
-	  <p>
-	    You have to hunt in the manual to find it, but segment
-	    descriptor loads are checked by the paging system in
-	    <em>supervisor</em> mode even when executed from
-	    non-privileged code. This means that the table can (and
-	    should) sit in kernel memory. This is actually nice, as it
-	    means that if we want to play games with the table we
-	    don't have to find an unused location within application
-	    memory to hold it.
-	  </p>
-	  <p>
-	    All of the segment table entries that are potentially
-	    ``dangerous'' are easily identified, and can be turned off
-	    if a shadow segment table is used. One possible problem
-	    with shadow segment tables is that the <b>SGDT</b>,
-	    <b>SLDT</b>, <b>STR</b> instructions reveal the location
-	    of this table in the kernel portion of the virtual address
-	    space. If an application subsequently passes this value
-	    back to the operating system in a fashion that is expected
-	    to be meaningful, this could lead to difficulty. I am not
-	    aware of any operating system in which this is a problem
-	    in practice.
-	  </p>
-	  <p>
-	    I've spent some time looking at the Pentium Pro manual,
-	    and it appears that shadow segment tables should be
-	    possible. By this, I mean that the operating system should
-	    be able to build a duplicate segment table in which
-	    dangerous stuff has been removed, and nonetheless create
-	    the appearance to all ring-3 processes that the requested
-	    segment table exists. This is possible because:
+	    This note describes some early thoughts on how to do
+	    complete IA-32 emulation in a user-level domain. The
+	    approach requires kernel support, but allows the majority
+	    of the emulation to occur in user mode. Some of the
+	    mechanisms needed had already been contemplated. The
+	    major new introduction is
+	    support for segment tables and the impact of this on the
+	    assumptions of the underlying EROS kernel.
+	  </p>
+	  <p>
+	    Some of what follows was crystalized by a long weekend
+	    session with Kevin Lawton (see: <a
+	    href="http://www.plex86.org">plex86</a>). I had initially hoped
+	    to borrow heavily in our implementation from
+	    plex86. Borrowing is certainly still possible, but it is
+	    now clear that the two pieces most likely to survive a
+	    port to EROS -- the interpreter and the JIT compiler --
+	    would require significant modification. While there is
+	    much in common between the strategy outlined here and the
+	    one taken by plex86, the details are quite different.
+	  </p>
+	  <h2>1. General Approach</h2>
+	  <p>
+	    The first thing to say about running x86 code is that the
+	    hardware is decently good at it and software
+	    isn't. Emulating the behavior of segmentation and paging
+	    with a pure software solution carries considerable
+	    overhead: 30 to 40 instructions of JIT-generated code per
+	    memory-mode instruction. Of these instructions, most go to
+	    simulating the behavior of the page translation and
+	    segmentation logic.
 	  </p>
-	  <ol>
-	    <li>
+	  <ul>
 	      <p>
-		The <b>LGDT</b> and <b>LLDT</b> instructions
-		can only be performed by ring 0 code.
+		<b>Ancient History:</b> I was one of the early authors
+		of HaLsim -- a 64 bit RISC processor simulator. The
+		situation is better for RISC chips because
+		segmentation doesn't need to be emulated, but
+		simulating page tables is still horrendously hard to
+		do well.
 	      </p>
-	    </li>
-	    <li>
+	  </ul>
+	  <p>
+	    The good news is that we have an engine ready to hand that
+	    already knows how to do this: the IA-32 (a.k.a. Pentium)
+	    chip. The bad news is that this chip <em>really</em>
+	    doesn't want to be virtualized. For a detailed discussion
+	    of (most of) the problems, see:
+	  </p>
+	  <ul>
 	      <p>
-		There exists no instruction by which a ring-3
-		application can learn whether it is running from any
-		particulat global or local descriptor table. This
-		means that shadow tables can be used to deal with
-		protection issues.
+		John Scott Robin and Cynthia Irvine, ``Analysis of the
+		Intel Pentium's Ability to Support a Secure Virtual
+		Machine Monitor,'' <em>Proceedings of the 9th USENIX
+		Security Symposium</em>, Denver, CO, August 2000,
+		pp. 129--144.
 	      </p>
-	    </li>
-	    <li>
 	      <p>
-		Access checks are made at descriptor load time, but
-		the two instructions that expose segment limits and
-		access rights re-read the table. This means that
-		access limit forgery and access right forgery can (if
-		necessary) be accomplished with the support of the
-		trace (single-step) bit.
+		At the time of this writing, a copy of this paper
+		could be found on the USENIX site <a
+		href="http://www.usenix.org/publications/library/proceedings/sec2000/robin.html">here</a>.
 	      </p>
-	    </li>
-	  </ol>
-
-	  <p>
-	    The processor runs in one of two modes: real-mode and
-	    protected mode. I will deal with each in turn. There is
-	    also a "virtual 8086" mode, which I will not address in
-	    this note. As a design matter, our objective is only to
-	    manage segment registers when operating in protected mode
-	    as user-mode (non privileged) code.
-	  </p>
-	  <h2>1. Structure of Selectors</h2>
-	  <p>
-	    Execution proceeds by loading entries from these
-	    tables. The "selectors" (indexes) specified by the
-	    application take the form:
-	  </p>
-	  <center>
-	    <table border="all">
-	      <tr>
-		<td>
-		  index[13]
-		</td>
-		<td>
-		  L/G[1]
-		</td>
-		<td>
-		  RPL[2]
-		</td>
-	      </tr>
-	    </table>
-	  </center>
+	  </ul>
 	  <p>
-	    The meaning of these fields is as follows:
+	    The challenge in using the hardware for user level code is
+	    that we need to provide segment and paging behavior
+	    consistent with that provided by the original operating
+	    system. In plex86 this is done by providing a kernel
+	    subsystem that builds a real page table (and, I am
+	    guessing, a shadow segment table) on the side and
+	    ``warping'' between the guest application and the host
+	    operating system. Plex86 then implements non-application
+	    code using a variety of interpretation techniques ranging
+	    from single-instruction interpretation to (eventually) JIT
+	    compilation. That is, plex86 implements a microkernel
+	    within a kernel.
+	  </p>
+	  <p>
+	    Unfortunately, the amount of code that plex86 places in
+	    the kernel to support this is considerable (including a
+	    JIT compiler), and would probably preclude assurance
+	    evaluation for EROS if we were to do that. I don't think
+	    we will need to, because EROS has some advantages over Linux
+	    where emulation is concerned:
 	  </p>
 	  <ul>
-	    <table>
-	      <tr valign="top">
-		<td><b>Field</b></td>
-		<td><b>Meaning</b></td>
-	      </tr>
-	      <tr valign="top">
-		<td>
-		  L/G
-		</td>
-		<td>
-		  <p>
-		    The L/G field indicates whether to load from the
-		    local or global table. So long as neither table
-		    contains "unsafe" entries this field is
-		    protection-neutral.
-		  </p>
-		</td>
-	      </tr>
-	      <tr valign="top">
-		<td>
-		  RPL
-		</td>
-		<td>
+	    <li>
+	      <p>
+		The hardware paging system is directly exported. By
+		manipulating address space trees, it is possible to
+		exactly simulate the behavior of the IA-32 paging
+		system. The primary challenge is getting the
+		``accessed'' bit. This issue is discussed extensively
+		in:
+	      </p>
+	      <ul>
 		  <p>
-		    The RPL field indicates the requested privilege
-		    level; for user code the only legal RPL value is
-		    "11b" (indicating ring 3, i.e. non-privileged
-		    code) -- provided the corresponding entry in the
-		    LDT/GDT is a non-privileged entry, attempts by the
-		    application to specify a privileged RPL will yield
-		    a fault. It is not necessary for us to take any
-		    measures to check the requested RPL for validity
-		    at instruction execution time so long as we
-		    properly constrain the entries in the respective
-		    tables.
+		    Paul A. Karger, Mary Ellen Zurko, Douglas
+		    W. Bonin, Andrew H. Mason, and Clifford E. Kahn,
+		    ``A Retrospective on the VAX VMM Security
+		    Kernel,'' <em>IEEE Transactions on Software
+		    Engineering</em>, IEEE, <b>17</b>(11), Nov 1991,
+		    pp. 1147-1165
 		  </p>
-		</td>
-	      </tr>
-	      <tr valign="top">
-		<td>
-		  index
-		</td>
-		<td>
 		  <p>
-		    The index field is simply an offset into the
-		    descriptor table. Each descriptor table can be up
-		    to 64k, which means that we may need to reserve
-		    space on a per-process basis for this.
+		    I am not aware of an online copy of this paper.
 		  </p>
-		</td>
-	      </tr>
-	    </table>
+	      </ul>
+	      <p>
+		It simply isn't possible to fully emulate the paging
+		semantics using the <em>mmap</em>(2) interface. Even
+		ignoring considerations of performance, Kevin had no
+		choice but to implement plex86 as part of the Linux
+		kernel to get the semantics right.
+	      </p>
+	    </li>
+	    <li>
+	      <p>
+		The EROS fault handling path is much faster than
+		Linux's. It is therefore conceivable that we can get
+		away with doing the interpretation in user
+		land. Fundamentally, my belief is that the EROS
+		context switch time is inherently faster than the
+		Windows user/supervisor crossing time. So long as that
+		is true, we should get away with doing all of this in
+		user code <em>provided</em> that we can somehow beat
+		the overhead of simulating segmentation and paging in
+		the kernel.
+	      </p>
+	    </li>
+	    <li>
+	      <p>
+		EROS does not enforce a 1:1 relationship between
+		address spaces and processes. The emulator can build
+		multiple address space trees (one per guest address
+		space) and simply switch spaces as required -- this
+		has exactly the effect of changing the master address
+		space pointer register, but in a way that preserves
+		the persistence properties of the EROS system.
+	      </p>
+	      <p>
+		It is difficult to describe how much easier this makes
+		things to someone who hasn't built or seriously
+		studied a shadow paging system.
+	      </p>
+	    </li>
+	    <li>
+	      <p>
+		The EROS kernel can, if necessary, be moved ``out of
+		the way.''
+	      </p>
+	      <p>
+		One of the problems with emulation is that the IA-32
+		requires the kernel to be mapped in the user address
+		space. As a result, the host kernel and some portion
+		of the guest space overlap. Invariably, they overlap
+		where the guest kernel wants to live.
+	      </p>
+	      <p>
+		This is especially painful when (a) the guest kernel
+		is designed to live at the high end, and (b) it has
+		been placed there by linking it using a high starting
+		address. When this combination of things occurs, the
+		guest OS cannot be relocated out of the way using
+		segmentation. The native OS therefore has to
+		move. Plex86 switches the entire address space, using
+		a small trampoline to switch back when the guest
+		environment faults.
+	      </p>
+	      <p>
+		In 1991, Norm Hardy and I had a disagreement about
+		whether the operating system should, in principle,
+		reside in the application address space. Norm came
+		from the IBM 370 world, where the answer was
+		self-evidently ``no and hell no.'' This was also
+		feasible on the 88K. After a <em>lot</em> of work we
+		figured out how to achieve the same effect with the
+		x86. I've included a description below.
+	      </p>
+	    </li>
 	  </ul>
 	  <p>
-	    The key point to note here is that so long as the content
-	    of the segment tables is properly managed, there is
-	    nothing in the selector to worry about from a security
-	    point of view.
-	  </p>
-	  <h2>2. Structure of Segment Tables</h2>
-	  <p>
-	    Entries in the descriptor tables can be either "system" or
-	    "application" entries. Application entries describe code
-	    or data segments. I shall deal with those first because
-	    system entries are quite different and considerably more
-	    complex.
+	    If we have to, we can consider moving this code into the
+	    kernel <em>after</em> it is debugged.
 	  </p>
 	  <p>
-	    There are three pieces of tremendous good news in all of
-	    what follows:
+	    To make a long story short, it is possible (with
+	    considerable effort) to run emulated application (ring 3)
+	    code in an EROS process using the native hardware under a
+	    modest set of assumptions. This emulation is not exact,
+	    but it can be made good enough to fool most of the
+	    operating systems out there -- most notably Windows,
+	    Linux, and EROS. The key kernel requirements to support
+	    such execution are:
 	  </p>
-	  <ol>
+	  <ul>
 	    <li>
 	      <p>
-		The <b>LGDT</b> and <b>LLDT</b> instructions
-		can only be performed by ring 0 code.
+		Provide kernel support for the portion of segment
+		semantics that is visible from ring 3 (application)
+		code.
 	      </p>
-	    </li>
-	    <li>
 	      <p>
-		There exists no instruction by which a ring-3
-		application can learn whether it is running from any
-		particulat global or local descriptor table. This
-		means that shadow tables can be used to deal with
-		protection issues.
+		Note that not all of this simulation needs to be
+		fast. As long as we can transparently fault the
+		process when it makes low-frequency inquiries, we can
+		do a pretty minimal implementation. Also, some
+		``leaks'' in the illusion are tolerable in practice.
 	      </p>
 	    </li>
 	    <li>
 	      <p>
-		Access checks are made at descriptor load time, but
-		the two instructions that expose segment limits and
-		access rights re-read the table. This means that
-		access limit forgery and access right forgery can (if
-		necessary) be accomplished with the support of the
-		trace (single-step) bit.
+		Implement the bouncing kernel tricks. In practice,
+		this may end up being avoidable, as we can relocate a
+		hostile guest kernel in the JIT compiler if needed.
 	      </p>
 	    </li>
-	  </ol>
-	  <h2>2.1 Code and Data (Application) Entries</h2>
+	  </ul>
 	  <p>
-	    For application entries, the logical fields of the
-	    descriptor are as follows. I'm re-combining some bitfields
-	    that are scattered in the actual representation for
-	    purposes of this discussion. To support emulation, none of
-	    these fields should be directly modifiable by the
-	    application. Where a given field carries a protection
-	    issue <em>from the perspective of the EROS kernel</em>, I
-	    describe it.
+	    The balance of this note describes what the ring-3 visible
+	    semantics of segmentation actually are, how we can use the
+	    EROS paging logic to simulate the native paging logic, and
+	    (in abstract) how we will execute guest kernel code. It
+	    also discusses some miscellaneous virtualization issues
+	    surrounding visible system registers. The actual
+	    implementation of the guest code executive is left for
+	    another design note, as it is a topic unto itself.
+	  </p>
+	  <h2>2. Segment Semantics Visible in Ring 3</h2>
+	  <p>
+	    In the following discussion, we assume that the
+	    application code is running in ring 3. That is, all
+	    instructions are executed with non-supervisor access
+	    modes. An exception: the IA-32 always performs the
+	    descriptor table reads for segment register loads using
+	    supervisor-mode references.
+	  </p>
+	  <p>
+	    To determine what is needed to preserve the desired
+	    illusion, we first need to enumerate what a ring 3
+	    application can learn and which of these things are
+	    important. For a more detailed explanation of issues, see
+	    section 3.1 of the Robin and Irvine paper. A key issue in
+	    the following discussion: which things allow
+	    <em>detection</em> of emulation (which we can tolerate)
+	    and which cause emulation to break.
+	  </p>
+	  <h3>2.1 Location of System Tables</h3>
+	  <p>
+	    Ring-3 code can use the <b>SGDT</b> and <b>SIDT</b>
+	    instructions to learn the linear address and the
+	    <em>size</em> of the global descriptor table and the
+	    interrupt descriptor table, respectively. <b>SLDT</b>
+	    reveals the GDT selector of the local descriptor table.
+	  </p>
+	  <p>
+	    This is essentially useless information, and I don't know
+	    of any application that has a reason to use these
+	    instructions from ring 3 unless it is checking to see if
+	    emulation is going on. I do not see emulation detection as
+	    a big issue unless it breaks something. Actually, it's a
+	    lousy strategy for detection, because the virtual
+	    addresses of all tables are likely to change from kernel
+	    version to kernel version as a result of recompilation.
+	  <p>
+	    The only way I can see that these values can be
+	    problematic is if the guest application later passes the
+	    discovered location back to the guest OS and a comparison
+	    of locations is made. This is a problem because we will
+	    almost certainly need to implement a shadow LDT/GDT, so
+	    the reported location will not match the location expected
+	    by the guest OS.
+	  </p>
+	  <p>
+	    Regrettably, these instructions directly reveal the
+	    content of protected system registers. There appears to be
+	    no straightforward way to prevent this
+	    revelation. Fortunately, applications don't actually
+	    execute these instructions.
+	  </p>
+	  <p>
+	    If this becomes an impediment, we can probably arrange to
+	    place the GDT, LDT, and IDT at the same virtual addresses
+	    where the guest OS placed them. I propose we defer this
+	    until it proves to be a problem.
+	  </p>
+	  <h3>2.2 Exposure of Segment Table Content</h3>
+	  <p>
+	    Several instructions partially expose the values of a segment
+	    table entry. None of these instructions has security
+	    implications <em>per se</em> (i.e. it would be okay to
+	    emulate them), but each reveals something to ring 3 code
+	    about the content of the segment table:
 	  </p>
 	  <table>
 	    <tr valign="top">
-	      <td><b>Field</b></td>
-	      <td><b>Meaning</b></td>
+	      <td><b>Instruction</b></td>
+	      <td><b>Action</b></td>
 	    </tr>
 	    <tr valign="top">
-	      <td>S=1</td>
+	      <td>LAR</td>
 	      <td>
 		<p>
-		  Part of entry type field.
+		  Loads the ``access rights'' field from a segment
+		  table entry.
 		</p>
 		<p>
-		  <b>Protection issues:</b> determines
-		  interpretation of rest of fields. Non-application
-		  entries can be frightfully powerful.
+		  This reveals various permissions bits in any entry
+		  that is <em>visible</em> from ring 3. For example, a
+		  guest operating system might expose a read-only
+		  segment holding state shared between guest OS and
+		  guest application. This is really a mapping of guest
+		  OS state into the guest application, and would need
+		  to be honored.
 		</p>
 	      </td>
 	    </tr>
 	    <tr valign="top">
-	      <td>Base</td>
-	      <td>
-		<p>
-		  The base of the segment.
-		</p>
-		<p>
-		  <b>Protection issues:</b> see discussion of limit,
-		  below.
-		</p>
-	      </td>
-	    </tr>
-	    <tr valign="top">
-	      <td>Limit</td>
-	      <td>
-		<p>
-		  The limit of the segment.
-		</p>
-		<p>
-		  <b>Protection issues:</b>
-		</p>
-		<p>
-		  The EROS IA32 implementation currently relies on
-		  page-based protection to protect the kernel, so
-		  the base and limit fields pose no hazard from this
-		  perspective. However, there is a problem with
-		  ``small spaces.''
-		</p>
-		<p>
-		  EROS/IA32 implements small spaces by restricting
-		  the limit of an ordinary program to 3 Gbytes,
-		  placing the kernel at 3.5 Gbytes, and using the
-		  range from 3 Gbytes to 3.5 Gbytes to hold small
-		  spaces.
-		</p>
-		<p>
-		  Regrettably, the processor exposes the current
-		  value of the limit field to applications via the
-		  non-privileged <b>LSL</b> instruction. I do not
-		  know of any real programs that invoke this
-		  instruction, but pure support may demand that we
-		  allow for it. As it happens, we have to be able to
-		  cope with this anyway (see below on kernel
-		  placement).
-		</p>
-		<p>
-		  The best ``solution'' to this problem is to
-		  introduce into the kernel implementation the idea
-		  that not all large address spaces support small
-		  address spaces. This eliminates the hazarded
-		  region between 3Gbytes and 3.5Gbytes at the cost
-		  of adding one extra instruction to the IPC path to
-		  test if a small space switch is feasible. I
-		  believe this cost is acceptable.
-		</p>
-		<p>
-		  A further concern here is the need to provide a
-		  ``transparently relocatable kernel.'' (see
-		  below).
-		</p>
-		<p>
-		  If both the transparently relocatable kernel and
-		  the small spaces issues can be fully addressed
-		  using the paging system, then the base/limit
-		  fields cease to have protection implications.
-		</p>
-	      </td>
-	    </tr>
-	    <tr valign="top">
-	      <td>G</td>
-	      <td>
-		<p>
-		  The granularity (pages, bytes) of the limit
-		  field.
-		</p>
-		<p>
-		  <b>Protection issues:</b> none if the base/limit
-		  issues identified above have been addressed.
-		</p>
-	      </td>
-	    </tr>
-	    <tr valign="top">
-	      <td>B</td>
-	      <td>Big. Determines whether stack pointer is 16-bit or
-		32-bit.</td>
-	    </tr>
-	    <tr valign="top">
-	      <td>D</td>
+	      <td>LSL</td>
 	      <td>
 		<p>
-		  Default. Determines whether instructions are
-		  16-bit operations or 32-bit operations by
-		  default. That is, it determines the behavior of
-		  the "size" prefix.
+		  Loads the segment limit field.
 		</p>
 		<p>
-		  <b>Protection issues:</b> none.
+		  This reveals the length of the segment. If the
+		  segment is accessible to the application at all,
+		  revealing its length is relatively harmless. For
+		  emulation, however, it is unfortunate that this is
+		  done. It would be nice to be able to add two
+		  ``invisible'' entries to the GDT, for example,
+		  without revealing the length change. It turns out
+		  that we can do this. We will discuss how below.
 		</p>
 	      </td>
 	    </tr>
 	    <tr valign="top">
-	      <td>AVL</td>
+	      <td>VERR, VERW</td>
 	      <td>
 		<p>
-		  Available for system use. May have subsequently
-		  claimed by AMD as part of the 64-bit support.
+		  Verify for reading, writing.
 		</p>
 		<p>
-		  <b>Protection issues:</b> none.
+		  Reveals to the application whether a segment can be
+		  read (respectively: written).
 		</p>
 	      </td>
 	    </tr>
 	    <tr valign="top">
-	      <td>P</td>
+	      <td>STR</td>
 	      <td>
 		<p>
-		  Present -- indicates whether the entry is
-		  valid. Attempts to load invalid entries cause a
-		  fault.
+		  Store task register.
 		</p>
 		<p>
-		  <b>Protection issues:</b> if not present, no other
-		  protection issues need to be considered.
-		</p>
-	      </td>
-	    </tr>
-	    <tr valign="top">
-	      <td>DPL</td>
-	      <td>
-		<p>
-		  Descriptor privilege level. Identifies the least
-		  privilege level that is permitted to access this
-		  segment.
-		</p>
-		<p>
-		  <b>Protection issues:</b> None for ring 3 code
-		</p>
-		<p>
-		  There is some confusion created in the
-		  specification of DPL in the Pentium manual because
-		  "greater privilege" means "smaller PL". Thus, a
-		  ring 3 application can only access segments whose
-		  DPL is 3. Attempts to access a segment with a
-		  DPL&lt;3 will cause an exception.
-		</p>
-	      </td>
-	    </tr>
-	    <tr valign="top">
-	      <td>Type</td>
-	      <td>
-		<p>
-		  The type field is a four bit field. The first bit
-		  indicates whether the segment is a code or a data
-		  segment. Code cannot be fetched from non-code
-		  segments. The remaining fields get interpreted
-		  according to whether the segment is a code or a
-		  data segment.
-		</p>
-		<p>
-		  For code segments:
-		</p>
-		<table>
-		  <tr valign="top">
-		    <td>C</td>
-		    <td>
-		      <p>
-			Indicates whether this is a "conforming"
-			segment. In practice, all of the segments
-			we need to deal with should be conforming.
-		      </p>
-		      <p>
-			<b>Protection issues:</b> None for ring 3 code
-		      </p>
-		    </td>
-		  </tr>
-		  <tr valign="top">
-		    <td>R</td>
-		    <td>
-		      <p>
-			Readable -- indicates whether reads can be
-			performed of the code using load
-			instructions.
-		      </p>
-		      <p>
-			<b>Protection issues:</b> None. EROS
-			provides read/write protection in the
-			paging system, so we can safely view this
-			as a discretionary access control
-			restriction imposed by the
-			application.
-		      </p>
-		    </td>
-		  </tr>
-		  <tr valign="top">
-		    <td>A</td>
-		    <td>
-		      <p>
-			Accessed -- indicates whether the segment
-			has been accessed (loaded).
-		      </p>
-		      <p>
-			<b>Protection issues:</b> None.
-		      </p>
-		    </td>
-		  </tr>
-		</table>
-		<p>
-		  For data segments:
-		</p>
-		<table>
-		  <tr valign="top">
-		    <td>E</td>
-		    <td>
-		      <p>
-			Indicates the expansion direction (up or
-			down). As a rule, data segments grow up
-			and stack segments grow down, but this is
-			purely a convention. Pragmatic impact is
-			to determine whether push and pop
-			instructions decrement or increment the
-			stack pointer.
-		      </p>
-		      <p>
-			<b>Protection issues:</b> None.
-		      </p>
-		    </td>
-		  </tr>
-		  <tr valign="top">
-		    <td>W</td>
-		    <td>
-		      <p>
-			Writable -- indicates whether writes can be
-			performed to this segment.
-		      </p>
-		      <p>
-			<b>Protection issues:</b> None. See above
-			in discussion of code readability.
-		      </p>
-		    </td>
-		  </tr>
-		  <tr valign="top">
-		    <td>A</td>
-		    <td>
-		      <p>
-			Accessed -- indicates whether the segment
-			has been accessed (loaded).
-		      </p>
-		      <p>
-			<b>Protection issues:</b> None.
-		      </p>
-		    </td>
-		  </tr>
-		</table>
-		<p>
-		  So there are relatively few issues with the type
-		  field.
+		  Reveals to the application the segment number
+		  (index) from which the task register was loaded.
 		</p>
 	      </td>
 	    </tr>
 	  </table>
 	  <p>
-	    Given the preceding analysis, I would be tempted to say
-	    that we could safely let an application manage its own
-	    descriptor tables so long as it ran in ring-3 and did not
-	    diddle with the system/application bit. Applications that
-	    do this would not carry small spaces, and would pay a
-	    corresponding penalty for context switches. Unfortunately,
-	    the system/application bit kills this.
+	    As discussed in Robin and Irvine, these instructions
+	    present various problems for execution of supervisor
+	    code. However, that isn't the problem we are trying to
+	    solve, and for ring 3 code things are not so bad:
+	  </p>
+	  <h4>2.2.1 VERR, VERW</h4>
+	  <p>
+	    In ring 3, the <b>VERR</b>, <b>VERW</b> instructions
+	    reveal information only about segments that are accessible
+	    to ring 3 code anyway. That is, they are not sensitive
+	    <em>when invoked from ring 3</em>.
+	  </p>
+	  <h4>2.2.2 LSL</h4>
+	  <p>
+	    There is a general issue with TSS, call gate, and task
+	    gate segments that needs to be addressed below. Here we
+	    consider only the implications of the <b>LSL</b>
+	    instruction in the unlikely case that a TSS segment is
+	    created with DPL=3 (no current operating systems do so).
+	  <p>
+	    The LSL instruction, when executed from ring 3, reveals
+	    the length of accessible segments plus TSS segments whose
+	    DPL value is 3. It is not our responsibility to stop the
+	    guest OS from revealing stupidity to the guest
+	    application. We need only be concerned about revealing
+	    information about the host OS TSS segments (if any).
+	  </p>
+	  <p>
+	    The EROS kernel uses a singleton master TSS with a DPL of
+	    0. Even if it used a DPL of 3, revealing the length of a
+	    statically created kernel structure does not create either
+	    a significant disclosure or a channel of communication. In
+	    effect, this behavior reveals a non-sensitive constant to
+	    the guest application.
+	  </p>
+	  <p>
+	    There <em>is</em> one potential revelation concerning the
+	    TSS limit: the guest OS may make use of the I/O permission
+	    bitmap, and the difference between the size of the guest
+	    OS TSS and the size of the EROS TSS might reveal the
+	    fact of emulation hosting <em>if</em> the DPL of the guest
+	    OS TSS is set to 3 (i.e. if the guest OS author was a
+	    complete idiot). Revealing the fact of emulation may be a
+	    foregone conclusion in any case, but we do not need to
+	    reveal it <em>here</em>.
+	  </p>
+	  <p>
+	    Alternatively, note that the <b>LSL</b> instruction does
+	    <em>not</em> reveal the linear base address of the task
+	    segment. Therefore, the EROS kernel could resolve the
+	    problem by maintaining a dummy TSS region and using false
+	    TSS entries in the shadow descriptor table that point to
+	    this dummy TSS and reflect appropriate sizes.  This is the
+	    preferred resolution for reasons discussed below.
+	  </p>
+	  <p>
+	    Finally, note that all of this silliness is required only
+	    to support idiot operating systems that set the DPL value
+	    to 3. We'll do it. Someday. A long time from now.
+	  </p>
+	  <h4>2.2.3 LAR</h4>
+	  <p>
+	    The <b>LAR</b> instruction raises many of the same issues
+	    as the <b>LSL</b> instruction. As with <b>LSL</b>, it
+	    reveals information about code/data segments accessible
+	    from ring 3, but this is not sensitive. Like <b>LSL</b>,
+	    it reveals potentially sensitive information about TSS
+	    segments to ring 3 code. It also reveals information about
+	    call gates and task gates. As before, these are an issue
+	    only when the segment entry has DPL=3.
+	  </p>
+	  <p>
+	    The statements about TSS segments made under the
+	    discussion of <b>LSL</b> apply equally well to the
+	    <b>LAR</b> instruction. <b>LAR</b> reveals that a TSS
+	    segment exists and what access rights exist to it, but
+	    does not reveal anything about the nature of the process
+	    that will be invoked.
+	  </p>
+	  <h4>2.2.4 STR</h4>
+	  <p>
+	    The <b>STR</b> instruction reveals the identity of the
+	    descriptor table entry from which the current task was
+	    loaded. This instruction is not used by applications in
+	    most systems. The primary requirement to simulate this
+	    instruction's behavior correctly for ring 3 code is to
+	    ensure that any TSS entry in the shadow descriptor tables
+	    appears at the same location as the corresponding entry in
+	    the original descriptor table. This can be done without
+	    actually implementing multiple TSS segments in the
+	    operating system.
+	  </p>
+	  <h3>2.3 TSS, Task Gates, and Call Gates</h3>
+	  <p>
+	    For performance reasons, current IA-32 operating systems
+	    generally use a single, supervisor-only TSS and do not use
+	    task gates or call gates. In a nutshell, it's faster to
+	    simulate this behavior in software than to let this sorry
+	    excuse for a processor do the work. In such systems, no
+	    segment of these types will exist with DPL=3. Since we are
+	    only doing native execution of ring 3 code, the
+	    virtualization issues associated with simulating the
+	    behavior of these misfeatures disappear.
+	  </p>
+	  <h4>2.3.1 Call Gates</h4>
+	  <p>
+	    Call gates are nastily complicated, but not really that
+	    bad to manage. The ``solution'' is for the EROS kernel to
+	    provide a set of call-gate entry points in the kernel that
+	    accept zero arguments (and therefore construct a uniform
+	    stack frame). Each call gate is directed to a unique
+	    kernel entry point that records the identity of the
+	    descriptor table selector used in the code. This selector
+	    is passed to the keeper of the guest application, which is
+	    the program performing the supervisor-mode
+	    emulation. Given access to the selector invoked, the
+	    emulator can use the original (non-shadow) descriptor
+	    table to work out what should be done.
+	  </p>
+	  <p>
+	    If it is absolutely essential to do so, the EROS kernel
+	    could also arrange to record the argument words and
+	    encapsulate these upward into the keeper invocation. This
+	    would penalize the normal capability invocation path, and
+	    I am therefore somewhat reluctant to do it. Efficient
+	    emulation is important, and this decision should therefore
+	    be dictated by performance measurement.
+	  </p>
+	  <h4>2.3.2 TSS, Task Gates</h4>
+	  <p>
+	    Fortunately, transfers via a jump or call to a TSS or task
+	    gate segment do not make provision for passing arguments
+	    or specifying an entry point. Further, while the privilege
+	    level necessary to <em>access</em> the task is revealed by
+	    various instructions, the privilege level at which the
+	    destination task actually executes thankfully is not. This
+	    means that the ``honey pot'' solution works: create a
+	    dedicated singleton TSS whose sole purpose is to be the
+	    destination of all emulated TSS and task gate
+	    transfers. 
+	  </p>
+	  <p>
+	    The honey pot TSS is configured to begin executing EROS
+	    kernel code. It immediately unwinds the task linkages (in
+	    order to become available for next time), switches back to
+	    the expected kernel TSS using the <b>LTR</b> instruction,
+	    marks the guest application as having trapped to the
+	    emulator, and resumes it, causing a fault into the keeper.
 	  </p>
-	  <h2>2.2 System Entries</h2>
+	  <h2>3. Use of Segmentation in the Emulator</h2>
 	  <p>
 	  </p>
-	  <table>
-	    <tr valign="top">
-	      <td><b>Field</b></td>
-	      <td><b>Meaning</b></td>
-	    </tr>
-	    <tr valign="top">
-	      <td>S=0</td>
-	      <td>
-		<p>
-		  Part of entry type field.
-		</p>
-		<p>
-		  <b>Protection issues:</b> determines
-		  interpretation of rest of fields. Non-application
-		  entries can be frightfully powerful.
-		</p>
-	      </td>
-	    </tr>
-	    <tr valign="top">
-	      <td>Base</td>
-	      <td>
-		<p>
-		  The base of the segment.
-		</p>
-		<p>
-		  <b>Protection issues:</b> see discussion of limit,
-		  below.
-		</p>
-	      </td>
-	    </tr>
-	    <tr valign="top">
-	      <td>Limit</td>
-	      <td>
-		<p>
-		  The limit of the segment.
-		</p>
-		<p>
-		  <b>Protection issues:</b>
-		</p>
-		<p>
-		  For system segments, the base and limit generally
-		  describe the locations of operating-system managed
-		  tables, and must therefore be considered sensitive.
-		</p>
-	      </td>
-	    </tr>
-	    <tr valign="top">
-	      <td>G</td>
-	      <td>
-		<p>
-		  The granularity (pages, bytes) of the limit
-		  field.
-		</p>
-		<p>
-		  <b>Protection issues:</b> sensitive, per discussion
-		  under limit, above.
-		</p>
-	      </td>
-	    </tr>
-	    <tr valign="top">
-	      <td>P</td>
-	      <td>
-		<p>
-		  Present -- indicates whether the entry is
-		  valid. Attempts to load invalid entries cause a
-		  fault.
-		</p>
-		<p>
-		  <b>Protection issues:</b> if not present, no other
-		  protection issues need to be considered.
-		</p>
-	      </td>
-	    </tr>
-	    <tr valign="top">
-	      <td>DPL</td>
-	      <td>
-		<p>
-		  Descriptor privilege level. Identifies the least
-		  privilege level that is permitted to access this
-		  segment.
-		</p>
-		<p>
-		  <b>Protection issues:</b> None for ring 3 code
-		</p>
-		<p>
-		  There is some confusion created in the
-		  specification of DPL in the Pentium manual because
-		  "greater privilege" means "smaller PL". Thus, a
-		  ring 3 application can only access segments whose
-		  DPL is 3. Attempts to access a segment with a
-		  DPL&lt;3 will cause an exception.
-		</p>
-	      </td>
-	    </tr>
-	    <tr valign="top">
-	      <td>Type</td>
-	      <td>
-		<p>
-		  Indicates the type of the system segment. The
-		  currently defined types are:
-		</p>
-		<table>
-		  <tr valign="top">
-		</table>
-		<p>
-		  So there are relatively few issues with the type
-		  field.
-		</p>
-	      </td>
-	    </tr>
-	  </table>
+	  <h2>4. Shadow Paging</h2>
+	  <p>
+	    The remaining com
 	  <hr> <em>Copyright 2002 by Jonathan Shapiro.  All rights
 	    reserved.  For terms of redistribution, see the <a
 	    href="../legal/license/GPL.html">GNU General Public
 	    License</a></em>
-	  <h2>??. Implementation Thoughts</h2>
-	  <p>
-	    My current intention is to grab an architecture-specific
-	    capability slot in the process root for this purpose. This
-	    slot will point to a small space (one node that in turn
-	    holds pages) in which the LDT and GDT will reside. For
-	    reasons of security the application must not have write
-	    access to this area. For reasons of emulation success the
-	    application should not in most cases have read access to
-	    this space either. Therefore, it is my initial intention
-	    that this slot should NOT be readable by a wielder of the
-	    process key. It may be possible for the kernel to guard
-	    against abuse, in which case we can relax this, but for
-	    the moment let's assume that anybody who manipulates a
-	    segment table must be trusted and package that authority
-	    as a separate key (analogous to the current situation with
-	    process tool protecting the brand).
-	  </p>
 	</td>
 	<td width="10%">&nbsp;</td>
       </tr valign=top>