[eros-cvs] cvs commit: eros/src/doc/www/design-notes 00DesignNotes.html IA32-Emulation.html
Jonathan S. Shapiro
shap@eros.cs.jhu.edu
Fri, 15 Mar 2002 19:11:53 -0500
shap 02/03/15 19:11:53
Modified: src/doc/www/design-notes 00DesignNotes.html
IA32-Emulation.html
Log:
More updates on the IA32 emulation note
Revision Changes Path
1.37 +1 -1 eros/src/doc/www/design-notes/00DesignNotes.html
Index: 00DesignNotes.html
===================================================================
RCS file: /cvs/eros/src/doc/www/design-notes/00DesignNotes.html,v
retrieving revision 1.36
retrieving revision 1.37
diff -u -r1.36 -r1.37
--- 00DesignNotes.html 15 Mar 2002 14:09:48 -0000 1.36
+++ 00DesignNotes.html 16 Mar 2002 00:11:53 -0000 1.37
@@ -127,7 +127,7 @@
</tr>
<tr valign="top">
<td width = "50%">
- <a href="IA32-Emulation.html">Support for IA-32 Segmentation</a>
+ <a href="IA32-Emulation.html">IA-32 Emulation</a>
<img src="../img/small-new.gif" alt="NEW:" align="top">
</td>
<td width = "50%">
1.2 +456 -602 eros/src/doc/www/design-notes/IA32-Emulation.html
Index: IA32-Emulation.html
===================================================================
RCS file: /cvs/eros/src/doc/www/design-notes/IA32-Emulation.html,v
retrieving revision 1.1
retrieving revision 1.2
diff -u -r1.1 -r1.2
--- IA32-Emulation.html 15 Mar 2002 14:09:48 -0000 1.1
+++ IA32-Emulation.html 16 Mar 2002 00:11:53 -0000 1.2
@@ -1,690 +1,544 @@
<html>
<head>
- <title>Support for IA-32 Segmentation</title>
+ <title>IA-32 Emulation</title>
</head>
<BODY BGCOLOR="#ffeedd" text="#000000" link="#0000ee" vlink="#551a8b" alink="#ff0000">
<table>
- <tr valign=top>
+ <tr valign="top">
<td width="10%"> </td>
<td>
+ <p>
+ <img src="../img/construction.gif" align="left">
+ (3/15/2002) This note is actively being authored at this
+ time. Therefore, much of what it says is utterly wrong,
+ flat nonsense, or simply balderdash. Now why didn't the
+ W3C dudes build a smiley face into the HTML character set?
+ </p>
+ <br clear="left">
<center>
- <h1>Support for IA-32 Segmentation</h1>
+ <h1>IA-32 Emulation</h1>
</center>
<p>
- The primary goal of segment register support is to run
- protected-mode programs that rely on relocation and bounds
- as implemented by the IA32 segment registers -- i.e. code
- that is not flat model code. In addition, we <em>may</em>
- need to support call gates where the destination is in the
- same ring as the source -- these may conceivably be used
- by some applications to implement dynamic binding. It is
- unclear at this point to what degree we need to deal with
- call gates.
- </p>
- <p>
- There is no requirement to support segments for code other
- than user mode code. In all cases where a segment entry
- would lead to a privilege level change, it is satisfactory
- to generate a fault that causes control to be transferred
- to a keeper.
- </p>
- <p>
- There are a LOT of ways that this can go badly
- wrong. Those of you who actually know the chip
- <em>please</em> check me on all of this.
- </p>
- <h2>1. General Background</h2>
- <p>
- The purpose of this exercise is to support operating
- system emulation of guest operating systems. The goal is
- to take a hybrid approach in which user-mode guest code
- runs in a normal EROS process, and supervisor code is
- interpreted or dynamically compiled. To do this, it is
- clearly necessary that we be able to emulate that portion
- of the segmentation behavior that is observable from
- ring-3 (user) code.
- </p>
- <p>
- While doing this, we would like to minimize the degree to
- which the operating system relies on the emulator to be
- correct. Thus, the operating system should be defensive
- about segment table values that might lead to protection
- compromise.
- </p>
- <p>
- On the x86, segment descriptors live in one of two tables:
- the local descriptor table (LDT) and the global descriptor
- table (GDT). The LDT is (by convention) a per-process
- table. The GDT is (by convention) a global table. The
- locations of these tables are named by the <b>LDTR</b> and
- <b>GDTR</b> registers, respectively. Only ring 0
- (privileged) code can reload these registers.
- </p>
- <p>
- You have to hunt in the manual to find it, but segment
- descriptor loads are checked by the paging system in
- <em>supervisor</em> mode even when executed from
- non-privileged code. This means that the table can (and
- should) sit in kernel memory. This is actually nice, as it
- means that if we want to play games with the table we
- don't have to find an unused location within application
- memory to hold it.
- </p>
- <p>
- All of the segment table entries that are potentially
- ``dangerous'' are easily identified, and can be turned off
- if a shadow segment table is used. One possible problem
- with shadow segment tables is that the <b>SGDT</b>,
- <b>SLDT</b>, <b>STR</b> instructions reveal the location
- of this table in the kernel portion of the virtual address
- space. If an application subsequently passes this value
- back to the operating system in a fashion that is expected
- to be meaningful, this could lead to difficulty. I am not
- aware of any operating system in which this is a problem
- in practice.
- </p>
- <p>
- I've spent some time looking at the Pentium Pro manual,
- and it appears that shadow segment tables should be
- possible. By this, I mean that the operating system should
- be able to build a duplicate segment table in which
- dangerous stuff has been removed, and nonetheless create
- the appearance to all ring-3 processes that the requested
- segment table exists. This is possible because:
+ This note describes some early thoughts on how to do
+ complete IA-32 emulation in a user-level domain. The
+ approach requires kernel support, but allows the majority
+ of the emulation to occur in user mode. Some of the
+ mechanisms needed had already been contemplated. The
+ major new introduction is
+ support for segment tables and the impact of this on the
+ assumptions of the underlying EROS kernel.
+ </p>
+ <p>
+ Some of what follows was crystallized by a long weekend
+ session with Kevin Lawton (see: <a
+ href="http://www.plex86.org">plex86</a>). I had initially hoped
+ to borrow heavily in our implementation from
+ plex86. Borrowing is certainly still possible, but it is
+ now clear that the two pieces most likely to survive a
+ port to EROS -- the interpreter and the JIT compiler --
+ would require significant modification. While there is
+ much in common between the strategy outlined here and the
+ one taken by plex86, the details are quite different.
+ </p>
+ <h2>1. General Approach</h2>
+ <p>
+ The first thing to say about running x86 code is that the
+ hardware is decently good at it and software
+ isn't. Emulating the behavior of segmentation and paging
+ with a pure software solution carries considerable
+ overhead: 30 to 40 instructions of JIT-generated code per
+ memory-mode instruction. Of these instructions, most go to
+ simulating the behavior of the page translation and
+ segmentation logic.
</p>
- <ol>
- <li>
+ <ul>
<p>
- The <b>LGDT</b> and <b>LLDT</b> instructions
- can only be performed by ring 0 code.
+ <b>Ancient History:</b> I was one of the early authors
+ of HaLsim -- a 64 bit RISC processor simulator. The
+ situation is better for RISC chips because
+ segmentation doesn't need to be emulated, but
+ simulating page tables is still horrendously hard to
+ do well.
</p>
- </li>
- <li>
+ </ul>
+ <p>
+ The good news is that we have an engine ready to hand that
+ already knows how to do this: the IA-32 (a.k.a. Pentium)
+ chip. The bad news is that this chip <em>really</em>
+ doesn't want to be virtualized. For a detailed discussion
+ of (most of) the problems, see:
+ </p>
+ <ul>
<p>
- There exists no instruction by which a ring-3
- application can learn whether it is running from any
- particulat global or local descriptor table. This
- means that shadow tables can be used to deal with
- protection issues.
+ John Scott Robin and Cynthia Irvine, ``Analysis of the
+ Intel Pentium's Ability to Support a Secure Virtual
+ Machine Monitor,'' <em>Proceedings of the 9th USENIX
+ Security Symposium</em>, Denver, CO, August 2000,
+ pp. 129--144.
</p>
- </li>
- <li>
<p>
- Access checks are made at descriptor load time, but
- the two instructions that expose segment limits and
- access rights re-read the table. This means that
- access limit forgery and access right forgery can (if
- necessary) be accomplished with the support of the
- trace (single-step) bit.
+ At the time of this writing, a copy of this paper
+ could be found on the USENIX site <a
+ href="http://www.usenix.org/publications/library/proceedings/sec2000/robin.html">here</a>.
</p>
- </li>
- </ol>
-
- <p>
- The processor runs in one of two modes: real-mode and
- protected mode. I will deal with each in turn. There is
- also a "virtual 8086" mode, which I will not address in
- this note. As a design matter, our objective is only to
- manage segment registers when operating in protected mode
- as user-mode (non privileged) code.
- </p>
- <h2>1. Structure of Selectors</h2>
- <p>
- Execution proceeds by loading entries from these
- tables. The "selectors" (indexes) specified by the
- application take the form:
- </p>
- <center>
- <table border="all">
- <tr>
- <td>
- index[13]
- </td>
- <td>
- L/G[1]
- </td>
- <td>
- RPL[2]
- </td>
- </tr>
- </table>
- </center>
+ </ul>
<p>
- The meaning of these fields is as follows:
+ The challenge in using the hardware for user level code is
+ that we need to provide segment and paging behavior
+ consistent with that provided by the original operating
+ system. In plex86 this is done by providing a kernel
+ subsystem that builds a real page table (and, I am
+ guessing, a shadow segment table) on the side and
+ ``warping'' between the guest application and the host
+ operating system. Plex86 then implements non-application
+ code using a variety of interpretation techniques ranging
+ from single-instruction interpretation to (eventually) JIT
+ compilation. That is, plex86 implements a microkernel
+ within a kernel.
+ </p>
+ <p>
+ Unfortunately, the amount of code that plex86 places in
+ the kernel to support this is considerable (including a
+ JIT compiler), and would probably preclude assurance
+ evaluation for EROS if we were to do that. I don't think
+ we will need to, because EROS has some advantages over Linux
+ where emulation is concerned:
</p>
<ul>
- <table>
- <tr valign="top">
- <td><b>Field</b></td>
- <td><b>Meaning</b></td>
- </tr>
- <tr valign="top">
- <td>
- L/G
- </td>
- <td>
- <p>
- The L/G field indicates whether to load from the
- local or global table. So long as neither table
- contains "unsafe" entries this field is
- protection-neutral.
- </p>
- </td>
- </tr>
- <tr valign="top">
- <td>
- RPL
- </td>
- <td>
+ <li>
+ <p>
+ The hardware paging system is directly exported. By
+ manipulating address space trees, it is possible to
+ exactly simulate the behavior of the IA-32 paging
+ system. The primary challenge is getting the
+ ``accessed'' bit. This issue is discussed extensively
+ in:
+ </p>
+ <ul>
<p>
- The RPL field indicates the requested privilege
- level; for user code the only legal RPL value is
- "11b" (indicating ring 3, i.e. non-privileged
- code) -- provided the corresponding entry in the
- LDT/GDT is a non-privileged entry, attempts by the
- application to specify a privileged RPL will yield
- a fault. It is not necessary for us to take any
- measures to check the requested RPL for validity
- at instruction execution time so long as we
- properly constrain the entries in the respective
- tables.
+ Paul A. Karger, Mary Ellen Zurko, Douglas
+ W. Bonin, Andrew H. Mason, and Clifford E. Kahn,
+ ``A Retrospective on the VAX VMM Security
+ Kernel,'' <em>IEEE Transactions on Software
+ Engineering</em>, IEEE, <b>17</b>(11), Nov 1991,
+ pp. 1147--1165.
</p>
- </td>
- </tr>
- <tr valign="top">
- <td>
- index
- </td>
- <td>
<p>
- The index field is simply an offset into the
- descriptor table. Each descriptor table can be up
- to 64k, which means that we may need to reserve
- space on a per-process basis for this.
+ I am not aware of an online copy of this paper.
</p>
- </td>
- </tr>
- </table>
+ </ul>
+ <p>
+ It simply isn't possible to fully emulate the paging
+ semantics using the <em>mmap</em>(2) interface. Even
+ ignoring considerations of performance, Kevin had no
+ choice but to implement plex86 as part of the Linux
+ kernel to get the semantics right.
+ </p>
+ </li>
+ <li>
+ <p>
+ The EROS fault handling path is much faster than that
+ of Linux. It is therefore conceivable that we can get
+ away with doing the interpretation in user
+ land. Fundamentally, my belief is that the EROS
+ context switch time is inherently faster than the
+ Windows user/supervisor crossing time. So long as that
+ is true, we should get away with doing all of this in
+ user code <em>provided</em> that we can somehow beat
+ the overhead of simulating segmentation and paging in
+ the kernel.
+ </p>
+ </li>
+ <li>
+ <p>
+ EROS does not enforce a 1:1 relationship between
+ address spaces and processes. The emulator can build
+ multiple address space trees (one per guest address
+ space) and simply switch spaces as required -- this
+ has exactly the effect of changing the master address
+ space pointer register, but in a way that preserves
+ the persistence properties of the EROS system.
+ </p>
+ <p>
+ It is difficult to describe how much easier this makes
+ things to someone who hasn't built or seriously
+ studied a shadow paging system.
+ </p>
+ </li>
+ <li>
+ <p>
+ The EROS kernel can, if necessary, be moved ``out of
+ the way.''
+ </p>
+ <p>
+ One of the problems with emulation is that the IA-32
+ requires the kernel to be mapped in the user address
+ space. As a result, the host kernel and some portion
+ of the guest space overlap. Invariably, they overlap
+ where the guest kernel wants to live.
+ </p>
+ <p>
+ This is especially painful when (a) the guest kernel
+ is designed to live at the high end, and (b) it has
+ been placed there by linking it using a high starting
+ address. When this combination of things occurs, the
+ guest OS cannot be relocated out of the way using
+ segmentation. The native OS therefore has to
+ move. Plex86 switches the entire address space, using
+ a small trampoline to switch back when the guest
+ environment faults.
+ </p>
+ <p>
+ In 1991, Norm Hardy and I had a disagreement about
+ whether the operating system should, in principle,
+ reside in the application address space. Norm came
+ from the IBM 370 world, where the answer was
+ self-evidently ``no and hell no.'' This was also
+ feasible on the 88K. After a <em>lot</em> of work we
+ figured out how to achieve the same effect with the
+ x86. I've included a description below.
+ </p>
+ </li>
</ul>
<p>
- The key point to note here is that so long as the content
- of the segment tables is properly managed, there is
- nothing in the selector to worry about from a security
- point of view.
- </p>
- <h2>2. Structure of Segment Tables</h2>
- <p>
- Entries in the descriptor tables can be either "system" or
- "application" entries. Application entries describe code
- or data segments. I shall deal with those first because
- system entries are quite different and considerably more
- complex.
+ If we have to, we can consider moving this code into the
+ kernel <em>after</em> it is debugged.
</p>
<p>
- There are three pieces of tremendous good news in all of
- what follows:
+ To make a long story short, it is possible (with
+ considerable effort) to run emulated application (ring 3)
+ code in an EROS process using the native hardware under a
+ modest set of assumptions. This emulation is not exact,
+ but it can be made good enough to fool most of the
+ operating systems out there -- most notably Windows,
+ Linux, and EROS. The key kernel requirements to support
+ such execution are:
</p>
- <ol>
+ <ul>
<li>
<p>
- The <b>LGDT</b> and <b>LLDT</b> instructions
- can only be performed by ring 0 code.
+ Provide kernel support for the portion of segment
+ semantics that is visible from ring 3 (application)
+ code.
</p>
- </li>
- <li>
<p>
- There exists no instruction by which a ring-3
- application can learn whether it is running from any
- particulat global or local descriptor table. This
- means that shadow tables can be used to deal with
- protection issues.
+ Note that not all of this simulation needs to be
+ fast. As long as we can transparently fault the
+ process when it makes low-frequency inquiries, we can
+ do a pretty minimal implementation. Also, some
+ ``leaks'' in the illusion are tolerable in practice.
</p>
</li>
<li>
<p>
- Access checks are made at descriptor load time, but
- the two instructions that expose segment limits and
- access rights re-read the table. This means that
- access limit forgery and access right forgery can (if
- necessary) be accomplished with the support of the
- trace (single-step) bit.
+ Implement the bouncing kernel tricks. In practice,
+ this may end up being avoidable, as we can relocate a
+ hostile guest kernel in the JIT compiler if needed.
</p>
</li>
- </ol>
- <h2>2.1 Code and Data (Application) Entries</h2>
+ </ul>
<p>
- For application entries, the logical fields of the
- descriptor are as follows. I'm re-combining some bitfields
- that are scattered in the actual representation for
- purposes of this discussion. To support emulation, none of
- these fields should be directly modifiable by the
- application. Where a given field carries a protection
- issue <em>from the perspective of the EROS kernel</em>, I
- describe it.
+ The balance of this note describes what the ring-3 visible
+ semantics of segmentation actually are, how we can use the
+ EROS paging logic to simulate the native paging logic, and
+ (in abstract) how we will execute guest kernel code. It
+ also discusses some miscellaneous virtualization issues
+ surrounding visible system registers. The actual
+ implementation of the guest code executive is left for
+ another design note, as it is a topic unto itself.
+ </p>
+ <h2>2. Segment Semantics Visible in Ring 3</h2>
+ <p>
+ In the following discussion, we assume that the
+ application code is running in ring 3. That is, all
+ instructions are executed with non-supervisor access
+ modes. An exception: the IA-32 always performs segment
+ register loads using supervisor-mode references.
+ </p>
+ <p>
+ To determine what is needed to preserve the desired
+ illusion, we first need to enumerate what a ring 3
+ application can learn and which of these things are
+ important. For a more detailed explanation of issues, see
+ section 3.1 of the Robin and Irvine paper. A key issue in
+ the following discussion: which things allow
+ <em>detection</em> of emulation (which we can tolerate)
+ vs. cause emulation to break.
+ </p>
+ <h3>2.1 Location of System Tables</h3>
+ <p>
+ Ring-3 code can use the <b>SGDT</b> and <b>SIDT</b>
+ instructions to learn the virtual memory location and
+ <em>size</em> of the global descriptor table and the
+ interrupt descriptor table, respectively. The <b>SLDT</b>
+ instruction reveals the selector of the GDT entry that
+ describes the local descriptor table.
+ </p>
+ <p>
+ This is essentially useless information, and I don't know
+ any application that has a reason to use these
+ instructions from ring 3 unless it is checking to see if
+ emulation is going on. I do not see emulation detection as
+ a big issue unless it breaks something. Actually, it's a
+ lousy strategy for detection, because the virtual
+ addresses of all tables are likely to change from kernel
+ version to kernel version as a result of recompilation.
+ </p>
+ <p>
+ The only way I can see that these values can be
+ problematic is if the guest application later passes the
+ discovered location back to the guest OS and a comparison
+ of locations is made. This is a problem because we will
+ almost certainly need to implement a shadow LDT/GDT, so
+ the reported location will not match the location expected
+ by the guest OS.
+ </p>
+ <p>
+ Regrettably, these instructions directly reveal the
+ content of protected system registers. There appears to be
+ no straightforward way to prevent this
+ revelation. Fortunately, applications don't actually
+ execute these instructions.
+ </p>
+ <p>
+ If this becomes an impediment, we can probably arrange to
+ place the GDT, LDT, and IDT at the same virtual addresses
+ where the guest OS placed them. I propose we defer this
+ until it proves to be a problem.
+ </p>
+ <h3>2.2 Exposure of Segment Table Content</h3>
+ <p>
+ Four instructions partially expose the values of a segment
+ table entry. None of these instructions has security
+ implications <em>per se</em> (i.e. it would be okay to
+ emulate them), but each reveals something to ring 3 code
+ about the content of the segment table:
</p>
<table>
<tr valign="top">
- <td><b>Field</b></td>
- <td><b>Meaning</b></td>
+ <td><b>Instruction</b></td>
+ <td><b>Action</b></td>
</tr>
<tr valign="top">
- <td>S=1</td>
+ <td>LAR</td>
<td>
<p>
- Part of entry type field.
+ Loads the ``access rights'' field from a segment
+ table entry.
</p>
<p>
- <b>Protection issues:</b> determines
- interpretation of rest of fields. Non-application
- entries can be frightfully powerful.
+ This reveals various permission bits in any entry
+ that is <em>visible</em> from ring 3. For example, a
+ guest operating system might expose a read-only segment
+ holding state shared between the guest OS and the guest
+ application. This is really a mapping of guest OS
+ state into the guest application, and would need to
+ be honored.
</p>
</td>
</tr>
<tr valign="top">
- <td>Base</td>
- <td>
- <p>
- The base of the segment.
- </p>
- <p>
- <b>Protection issues:</b> see discussion of limit,
- below.
- </p>
- </td>
- </tr>
- <tr valign="top">
- <td>Limit</td>
- <td>
- <p>
- The limit of the segment.
- </p>
- <p>
- <b>Protection issues:</b>
- </p>
- <p>
- The EROS IA32 implementation currently relies on
- page-based protection to protect the kernel, so
- the base and limit fields pose no hazard from this
- perspective. However, there is a problem with
- ``small spaces.''
- </p>
- <p>
- EROS/IA32 implements small spaces by restricting
- the limit of an ordinary program to 3 Gbytes,
- placing the kernel at 3.5 Gbytes, and using the
- range from 3 Gbytes to 3.5 Gbytes to hold small
- spaces.
- </p>
- <p>
- Regrettably, the processor exposes the current
- value of the limit field to applications via the
- non-privileged <b>LSL</b> instruction. I do not
- know of any real programs that invoke this
- instruction, but pure support may demand that we
- allow for it. As it happens, we have to be able to
- cope with this anyway (see below on kernel
- placement).
- </p>
- <p>
- The best ``solution'' to this problem is to
- introduce into the kernel implementation the idea
- that not all large address spaces support small
- address spaces. This eliminates the hazarded
- region between 3Gbytes and 3.5Gbytes at the cost
- of adding one extra instruction to the IPC path to
- test if a small space switch is feasible. I
- believe this cost is acceptable.
- </p>
- <p>
- A further concern here is the need to provide a
- ``transparently relocatable kernel.'' (see
- below).
- </p>
- <p>
- If both the transparently relocatable kernel and
- the small spaces issues can be fully addressed
- using the paging system, then the base/limit
- fields cease to have protection implications.
- </p>
- </td>
- </tr>
- <tr valign="top">
- <td>G</td>
- <td>
- <p>
- The granularity (pages, bytes) of the limit
- field.
- </p>
- <p>
- <b>Protection issues:</b> none if the base/limit
- issues identified above have been addressed.
- </p>
- </td>
- </tr>
- <tr valign="top">
- <td>B</td>
- <td>Big. Determines whether stack pointer is 16-bit or
- 32-bit.</td>
- </tr>
- <tr valign="top">
- <td>D</td>
+ <td>LSL</td>
<td>
<p>
- Default. Determines whether instructions are
- 16-bit operations or 32-bit operations by
- default. That is, it determines the behavior of
- the "size" prefix.
+ Loads the segment limit field.
</p>
<p>
- <b>Protection issues:</b> none.
+ This reveals the length of the segment. If the
+ segment is accessible to the application at all,
+ revealing its length is relatively harmless. For
+ emulation, however, it is unfortunate that this is
+ done. It would be nice to be able to add two
+ ``invisible'' entries to the GDT, for example,
+ without revealing the length change. It turns out
+ that we can do this. We will discuss how below.
</p>
</td>
</tr>
<tr valign="top">
- <td>AVL</td>
+ <td>VERR, VERW</td>
<td>
<p>
- Available for system use. May have subsequently
- claimed by AMD as part of the 64-bit support.
+ Verify for reading, writing.
</p>
<p>
- <b>Protection issues:</b> none.
+ Reveals to the application whether a segment can be
+ read (respectively: written).
</p>
</td>
</tr>
<tr valign="top">
- <td>P</td>
+ <td>STR</td>
<td>
<p>
- Present -- indicates whether the entry is
- valid. Attempts to load invalid entries cause a
- fault.
+ Store task register.
</p>
<p>
- <b>Protection issues:</b> if not present, no other
- protection issues need to be considered.
- </p>
- </td>
- </tr>
- <tr valign="top">
- <td>DPL</td>
- <td>
- <p>
- Descriptor privilege level. Identifies the least
- privilege level that is permitted to access this
- segment.
- </p>
- <p>
- <b>Protection issues:</b> None for ring 3 code
- </p>
- <p>
- There is some confusion created in the
- specification of DPL in the Pentium manual because
- "greater privilege" means "smaller PL". Thus, a
- ring 3 application can only access segments whose
- DPL is 3. Attempts to access a segment with a
- DPL<3 will cause an exception.
- </p>
- </td>
- </tr>
- <tr valign="top">
- <td>Type</td>
- <td>
- <p>
- The type field is a four bit field. The first bit
- indicates whether the segment is a code or a data
- segment. Code cannot be fetched from non-code
- segments. The remaining fields get interpreted
- according to whether the segment is a code or a
- data segment.
- </p>
- <p>
- For code segments:
- </p>
- <table>
- <tr valign="top">
- <td>C</td>
- <td>
- <p>
- Indicates whether this is a "conforming"
- segment. In practice, all of the segments
- we need to deal with should be conforming.
- </p>
- <p>
- <b>Protection issues:</b> None for ring 3 code
- </p>
- </td>
- </tr>
- <tr valign="top">
- <td>R</td>
- <td>
- <p>
- Readable -- indicates whether reads can be
- performed of the code using load
- instructions.
- </p>
- <p>
- <b>Protection issues:</b> None. EROS
- provides read/write protection in the
- paging system, so we can safely view this
- as a discretionary access control
- restriction imposed by the
- application.
- </p>
- </td>
- </tr>
- <tr valign="top">
- <td>A</td>
- <td>
- <p>
- Accessed -- indicates whether the segment
- has been accessed (loaded).
- </p>
- <p>
- <b>Protection issues:</b> None.
- </p>
- </td>
- </tr>
- </table>
- <p>
- For data segments:
- </p>
- <table>
- <tr valign="top">
- <td>E</td>
- <td>
- <p>
- Indicates the expansion direction (up or
- down). As a rule, data segments grow up
- and stack segments grow down, but this is
- purely a convention. Pragmatic impact is
- to determine whether push and pop
- instructions decrement or increment the
- stack pointer.
- </p>
- <p>
- <b>Protection issues:</b> None.
- </p>
- </td>
- </tr>
- <tr valign="top">
- <td>W</td>
- <td>
- <p>
- Writable -- indicates whether writes can be
- performed to this segment.
- </p>
- <p>
- <b>Protection issues:</b> None. See above
- in discussion of code readability.
- </p>
- </td>
- </tr>
- <tr valign="top">
- <td>A</td>
- <td>
- <p>
- Accessed -- indicates whether the segment
- has been accessed (loaded).
- </p>
- <p>
- <b>Protection issues:</b> None.
- </p>
- </td>
- </tr>
- </table>
- <p>
- So there are relatively few issues with the type
- field.
+ Reveals to the application the segment number
+ (index) from which the task register was loaded.
</p>
</td>
</tr>
</table>
<p>
- Given the preceding analysis, I would be tempted to say
- that we could safely let an application manage its own
- descriptor tables so long as it ran in ring-3 and did not
- diddle with the system/application bit. Applications that
- do this would not carry small spaces, and would pay a
- corresponding penalty for context switches. Unfortunately,
- the system/application bit kills this.
+ As discussed in Robin and Irvine, these instructions
+ present various problems for execution of supervisor
+ code. However, that isn't the problem we are trying to
+ solve, and for ring 3 code things are not so bad:
+ </p>
+ <h4>2.2.1 VERR, VERW</h4>
+ <p>
+ In ring 3, the <b>VERR</b>, <b>VERW</b> instructions
+ reveal information only about segments that are accessible
+ to ring 3 code anyway. That is, they are not sensitive
+ <em>when invoked from ring 3</em>.
+ </p>
+ <h4>2.2.2 LSL</h4>
+ <p>
+ There is a general issue with TSS, call gate, and task
+ gate segments that needs to be addressed below. Here we
+ consider only the implications of the <b>LSL</b>
+ instruction in the unlikely case that a TSS segment is
+ created with DPL=3 (no current operating systems do so).
+ </p>
+ <p>
+ The LSL instruction, when executed from ring 3, reveals
+ the length of accessible segments plus TSS segments whose
+ DPL value is 3. It is not our responsibility to stop the
+ guest OS from revealing stupidity to the guest
+ application. We need only be concerned about revealing
+ information about the host OS TSS segments (if any).
+ </p>
+ <p>
+ The EROS kernel uses a singleton master TSS with a DPL of
+ 0. Even if it used a DPL of 3, revealing the length of a
+ statically created kernel structure does not create either
+ a significant disclosure or a channel of communication. In
+ effect, this behavior reveals a non-sensitive constant to
+ the guest application.
+ </p>
+ <p>
+ There <em>is</em> one potential revelation concerning the
+ TSS limit: the guest OS may make use of the I/O permission
+ bitmap, and the difference between the size of the guest
+ OS TSS and the size of the EROS TSS might reveal the
+ fact of emulation hosting <em>if</em> the DPL of the guest
+ OS TSS is set to 3 (i.e. if the guest OS author was a
+ complete idiot). Revealing the fact of emulation may be a
+ foregone conclusion in any case, but we do not need to
+ reveal it <em>here</em>.
+ </p>
+ <p>
+ Alternatively, note that the <b>LSL</b> instruction does
+ <em>not</em> reveal the linear base address of the task
+ segment. Therefore, the EROS kernel could resolve the
+ problem by maintaining a dummy TSS region and using false
+ TSS entries in the shadow descriptor table that point to
+ this dummy TSS and reflect appropriate sizes. This is the
+ preferred resolution for reasons discussed below.
+ </p>
+ <p>
+ Finally, note that all of this silliness is required only
+ to support idiot operating systems that set the DPL value
+ to 3. We'll do it. Someday. A long time from now.
+ </p>
+ <h4>2.2.3 LAR</h4>
+ <p>
+ The <b>LAR</b> instruction raises many of the same issues
+ as the <b>LSL</b> instruction. As with <b>LSL</b>, it
+ reveals information about code/data segments accessible
+ from ring 3, but this is not sensitive. Like <b>LSL</b>,
+ it reveals potentially sensitive information about TSS
+ segments to ring 3 code. It also reveals information about
+ call gates and task gates. As before, these are an issue
+ only when the segment entry DPL value is 3.
+ </p>
+ <p>
+ The statements about TSS segments made under the
+ discussion of <b>LSL</b> apply equally well to the
+ <b>LAR</b> instruction. <b>LAR</b> reveals that a TSS
+ segment exists and what access rights exist to it, but
+ does not reveal anything about the nature of the process
+ that will be invoked.
+ </p>
+ <h4>2.2.4 STR</h4>
+ <p>
+ The <b>STR</b> instruction reveals the identity of the
+ descriptor table entry from which the current task was
+ loaded. This instruction is not used by applications in
+ most systems. The primary requirement to simulate this
+ instruction's behavior correctly for ring 3 code is to
+ ensure that any TSS entry in the shadow descriptor tables
+ appears at the same location as the corresponding entry in
+ the original descriptor table. This can be done without
+ actually implementing multiple TSS segments in the
+ operating system.
+ </p>
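+ <p>
+ The placement rule can be sketched in C as follows. The slot
+ classification and the function names are hypothetical; the
+ only fact relied on is that a selector's table index is its
+ upper 13 bits. Every mirrored slot may alias the same
+ underlying TSS.
+ </p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Coarse classification of a descriptor table slot (hypothetical). */
enum slot_kind { SLOT_EMPTY, SLOT_OTHER, SLOT_TSS };

/* A selector is index:13 | TI:1 | RPL:2, so the index is sel >> 3. */
unsigned selector_index(uint16_t sel)
{
    return (unsigned)(sel >> 3);
}

/*
 * For every TSS slot in the guest table, mark the slot with the same
 * index in the shadow table as a TSS. Non-TSS slots are left alone in
 * this sketch. Returns the number of slots mirrored.
 */
size_t mirror_tss_slots(const enum slot_kind *guest,
                        enum slot_kind *shadow, size_t n)
{
    size_t i, mirrored = 0;
    assert(guest != NULL && shadow != NULL);
    for (i = 0; i < n; i++) {
        if (guest[i] == SLOT_TSS) {
            shadow[i] = SLOT_TSS;
            mirrored++;
        }
    }
    return mirrored;
}

/*
 * True if the selector the guest expects STR to return names a TSS
 * slot in the shadow table, i.e. the task register can be loaded with
 * that same selector.
 */
int str_is_consistent(const enum slot_kind *shadow, size_t n,
                      uint16_t guest_tr)
{
    unsigned idx = selector_index(guest_tr);
    return idx < n && shadow[idx] == SLOT_TSS;
}
```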
+ <h3>2.3 TSS, Task Gates, and Call Gates</h3>
+ <p>
+ For performance reasons, current IA-32 operating systems
+ generally use a single, supervisor-only TSS and do not use
+ task gates or call gates. In a nutshell, it's faster to
+ simulate this behavior in software than to let this sorry
+ excuse for a processor do the work. In such systems, no
+ segment of these types will exist with DPL=3. Since we are
+ only doing native execution of ring 3 code, the
+ virtualization issues associated with simulating the
+ behavior of these misfeatures disappear.
+ </p>
+ <h4>2.3.1 Call Gates</h4>
+ <p>
+ Call gates are nastily complicated, but not really that
+ bad to manage. The ``solution'' is for the EROS kernel to
+ provide a set of call-gate entry points in the kernel that
+ accept zero arguments (and therefore construct a uniform
+ stack frame). Each call gate is directed to a unique
+ kernel entry point that records the identity of the
+ descriptor table selector used in the code. This selector
+ is passed to the keeper of the guest application, which is
+ the program performing the supervisor-mode
+ emulation. Given access to the selector invoked, the
+ emulator can use the original (non-shadow) descriptor
+ table to work out what should be done.
+ </p>
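+ <p>
+ As a rough illustration, the bookkeeping behind the unique
+ entry points might look like the C model below. The table
+ size, structure, and function names are assumptions made for
+ this sketch; the real entry points would be kernel assembly
+ stubs, and the keeper invocation itself is not modeled.
+ </p>

```c
#include <assert.h>
#include <stdint.h>

#define MAX_GATE_STUBS 8 /* arbitrary size chosen for this sketch */

/* One kernel entry stub per guest call gate (hypothetical model). */
struct gate_stub {
    int in_use;
    uint16_t selector; /* guest selector this stub stands for */
};

static struct gate_stub gate_stubs[MAX_GATE_STUBS];

/*
 * Bind a guest call-gate selector to a stub. Binding the same selector
 * again returns the existing stub. Returns the stub number, or -1 if
 * no stub is free.
 */
int bind_call_gate(uint16_t guest_selector)
{
    int i, free_slot = -1;
    for (i = 0; i < MAX_GATE_STUBS; i++) {
        if (gate_stubs[i].in_use && gate_stubs[i].selector == guest_selector)
            return i;
        if (!gate_stubs[i].in_use && free_slot < 0)
            free_slot = i;
    }
    if (free_slot >= 0) {
        gate_stubs[free_slot].in_use = 1;
        gate_stubs[free_slot].selector = guest_selector;
    }
    return free_slot;
}

/*
 * What a stub does on entry: recover the selector that was invoked so
 * that it can be handed to the keeper.
 */
uint16_t call_gate_entered(int stub)
{
    assert(stub >= 0 && stub < MAX_GATE_STUBS && gate_stubs[stub].in_use);
    return gate_stubs[stub].selector;
}

/* Keeper side: index of the gate in the original (non-shadow) table. */
unsigned original_table_index(uint16_t selector)
{
    return (unsigned)(selector >> 3);
}
```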
+ <p>
+ If it is absolutely essential to do so, the EROS kernel
+ could also arrange to record the argument words and
+ encapsulate these upward into the keeper invocation. This
+ would penalize the normal capability invocation path, and
+ I am therefore somewhat reluctant to do it. Efficient
+ emulation is important, and this decision should therefore
+ be dictated by performance measurement.
+ </p>
+ <h4>2.3.2 TSS, Task Gates</h4>
+ <p>
+ Fortunately, transfers via a jump or call to a TSS or task
+ gate segment do not make provision for passing arguments
+ or specifying an entry point. Further, while the privilege
+ level necessary to <em>access</em> the task is revealed by
+ various instructions, the privilege level at which the
+ destination task actually executes thankfully is not. This
+ means that the ``honey pot'' solution works: create a
+ dedicated singleton TSS whose sole purpose is to be the
+ destination of all emulated TSS and task gate
+ transfers.
+ </p>
+ <p>
+ The honey pot TSS is configured to begin executing EROS
+ kernel code. That code immediately unwinds the task linkages (in
+ order to become available for next time), switches back to
+ the expected kernel TSS using the <b>LTR</b> instruction,
+ marks the guest application as having trapped to the
+ emulator, and resumes it, causing a fault into the keeper.
</p>
- <h2>2.2 System Entries</h2>
+ <h2>3. Use of Segmentation in the Emulator</h2>
<p>
</p>
- <table>
- <tr valign="top">
- <td><b>Field</b></td>
- <td><b>Meaning</b></td>
- </tr>
- <tr valign="top">
- <td>S=0</td>
- <td>
- <p>
- Part of entry type field.
- </p>
- <p>
- <b>Protection issues:</b> determines
- interpretation of rest of fields. Non-application
- entries can be frightfully powerful.
- </p>
- </td>
- </tr>
- <tr valign="top">
- <td>Base</td>
- <td>
- <p>
- The base of the segment.
- </p>
- <p>
- <b>Protection issues:</b> see discussion of limit,
- below.
- </p>
- </td>
- </tr>
- <tr valign="top">
- <td>Limit</td>
- <td>
- <p>
- The limit of the segment.
- </p>
- <p>
- <b>Protection issues:</b>
- </p>
- <p>
- For system segments, the base and limit generally
- describe the locations of operating-system managed
- tables, and must therefore be considered sensitive.
- </p>
- </td>
- </tr>
- <tr valign="top">
- <td>G</td>
- <td>
- <p>
- The granularity (pages, bytes) of the limit
- field.
- </p>
- <p>
- <b>Protection issues:</b> sensitive, per discussion
- under limit, above.
- </p>
- </td>
- </tr>
- <tr valign="top">
- <td>P</td>
- <td>
- <p>
- Present -- indicates whether the entry is
- valid. Attempts to load invalid entries cause a
- fault.
- </p>
- <p>
- <b>Protection issues:</b> if not present, no other
- protection issues need to be considered.
- </p>
- </td>
- </tr>
- <tr valign="top">
- <td>DPL</td>
- <td>
- <p>
- Descriptor privilege level. Identifies the least
- privilege level that is permitted to access this
- segment.
- </p>
- <p>
- <b>Protection issues:</b> None for ring 3 code
- </p>
- <p>
- There is some confusion created in the
- specification of DPL in the Pentium manual because
- "greater privilege" means "smaller PL". Thus, a
- ring 3 application can only access segments whose
- DPL is 3. Attempts to access a segment with a
- DPL<3 will cause an exception.
- </p>
- </td>
- </tr>
- <tr valign="top">
- <td>Type</td>
- <td>
- <p>
- Indicates the type of the system segment. The
- currently defined types are:
- </p>
- <table>
- <tr valign="top">
- </table>
- <p>
- So there are relatively few issues with the type
- field.
- </p>
- </td>
- </tr>
- </table>
+ <h2>4. Shadow Paging</h2>
+ <p>
+ The remaining com
<hr> <em>Copyright 2002 by Jonathan Shapiro. All rights
reserved. For terms of redistribution, see the <a
href="../legal/license/GPL.html">GNU General Public
License</a></em>
- <h2>??. Implementation Thoughts</h2>
- <p>
- My current intention is to grab an architecture-specific
- capability slot in the process root for this purpose. This
- slot will point to a small space (one node that in turn
- holds pages) in which the LDT and GDT will reside. For
- reasons of security the application must not have write
- access to this area. For reasons of emulation success the
- application should not in most cases have read access to
- this space either. Therefore, it is my initial intention
- that this slot should NOT be readable by a wielder of the
- process key. It may be possible for the kernel to guard
- against abuse, in which case we can relax this, but for
- the moment let's assume that anybody who manipulates a
- segment table must be trusted and package that authority
- as a separate key (analogous to the current situation with
- process tool protecting the brand).
- </p>
</td>
<td width="10%"> </td>
</tr valign=top>