Kernel Malloc in EROS Version 2.0

EROS Goes Dynamic Allocation

    The move to user-mode drivers has an ironic consequence: while the kernel now does almost no dynamic allocation of resources, it is now more important than ever to have a dynamic memory allocator. This note describes the design (and the design assumptions) of the new kernel memory allocator.


Since 1980 (or so), the KeyKOS/EROS family has successfully avoided incorporating a dynamic memory allocator in the kernel. Even as we are removing devices from the kernel, two forces are pushing us to now provide a kernel malloc() implementation, and to rejigger various parts of the implementation to use it:

  • Several recent architectural ``enhancements'' (note skepticism) designed to extend the lifespan of 32-bit architectures have led to machines whose physical memory may greatly exceed the size of the virtual address space. While this is a basically brain-dead idea, we wish to be able to support these machines well.

    Newer x86 machines, for example, can have up to 64 Gbytes of physical memory. Not all of this can be mapped into a 4 Gbyte virtual address space, and certainly not all of it can be mapped into the 1 Gbyte reserved for the kernel and small spaces.

  • The move to user-mode drivers means that the kernel is completely unaware of device memory maps until well after bootstrap and memory initialization have occurred. Simultaneously, the approach to physical memory ranges (see Physical Memory Regions in Changes in EROS Version 2.0) in the version 2 kernel requires that core table entries exist for all physical memory regions. This requires that core table entries be dynamically allocatable after system startup.

This note attempts to resolve how all of this should work.

Requirements and Non-Requirements

Since the kernel is no longer responsible for DMA or programmed I/O (PIO), it is no longer necessary that there be any simple mapping from kernel virtual addresses to physical addresses in order to facilitate physical DMA (PDMA). That is, the old requirement that KVA=PA+offset is no longer relevant. Note that this requirement is unsupportable if the range of legal physical addresses exceeds the range of kernel virtual addresses. This was a known (and dreaded) problem with the old design. In the new design, PIO and PDMA are supported as follows:

  • For PIO, the virtual/physical congruency problem is now an application-level virtual memory mapping problem. As long as no access based on physical addresses is required, there really isn't a mapping issue here at all.

  • For DMA support, the kernel must still be able to allocate contiguous physical ranges so that driver applications can build DVA=PA+offset mappings.

    Note: This isn't strictly necessary, but simulating virtual DMA on architectures where the hardware provides only physical DMA is a royal pain in the ass. Given this, I've decided to support contiguous DMA to the extent of providing a physically contiguous page range allocation in the kernel.
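
    The note doesn't specify the allocator's interface, but a contiguous page range allocation of the sort described might be sketched as a first-fit scan over a free-frame map. All of the names below (alloc_contig_frames, frame_free, the toy machine size) are invented for illustration:

    ```c
    #include <stdint.h>
    #include <stddef.h>
    #include <stdbool.h>

    #define FRAME_SIZE  4096u
    #define NUM_FRAMES  1024u           /* toy machine: 4 MB of physical memory */

    static bool frame_free[NUM_FRAMES]; /* true => frame is unallocated */

    void frame_init(void)
    {
        for (size_t i = 0; i < NUM_FRAMES; i++)
            frame_free[i] = true;
    }

    /* First-fit scan for 'count' physically contiguous free frames.
     * Returns the base physical address, or (uint64_t)-1 on failure.
     * A driver can then build a DVA=PA+offset mapping over the result. */
    uint64_t alloc_contig_frames(size_t count)
    {
        size_t run = 0;
        if (count == 0 || count > NUM_FRAMES)
            return (uint64_t)-1;
        for (size_t i = 0; i < NUM_FRAMES; i++) {
            run = frame_free[i] ? run + 1 : 0;
            if (run == count) {
                size_t base = i + 1 - count;
                for (size_t j = base; j <= i; j++)
                    frame_free[j] = false;   /* mark the run allocated */
                return (uint64_t)base * FRAME_SIZE;
            }
        }
        return (uint64_t)-1;
    }
    ```

    A real implementation would also need to honor alignment and address-limit constraints imposed by the DMA hardware; those are omitted here.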

From the kernel perspective, the move to user-level drivers alters the mapping requirements fairly dramatically:

  • It is no longer necessary for pages to be mapped by the kernel at all, except ephemerally (during bzero or copyin/copyout). Ephemeral mappings can be dealt with by providing a per-CPU (SMP issue) virtual page frame at which these copyin/copyout/zero activities can occur. No long term mapping of page space is required at all, because no kernel-initiated DMA exists in the driver-externalized kernel.

  • In contrast, mapping tables are frequently updated by the kernel and therefore should be mapped in kernel virtual space. This suggests that mapping tables should no longer be part of page space.

  • Given that the KVA=PA+offset rule is now dinosaur fodder, the kernel requires a heap from which to allocate node space, page space, depend table entries, thread tables, and so forth.

  • Note that none of the kernel-allocated data structures is ever the target of DMA, so there is no need for the kernel heap to be backed by contiguous physical pages. This implies that the heap can be allocated in response to demand, which is very useful (read on).

  • All of the kernel-allocated structures now need to come from the heap, which means that they are all now allocated using malloc() (or equivalent).

  • Because the kernel no longer manages I/O space or driver memory, there is no longer a general need for device-related dynamic allocation in the kernel. That is now a driver (i.e. an application) problem.

  • Because the kernel must implement physical memory ranges in support of drivers, and because these physical ranges cannot be known until the drivers run, and because these physical ranges require associated core table entries, it follows that:

    • Core table entries cannot be preallocated at kernel initialization time,

    • Core table entries must therefore be dynamically allocated.

  • In contrast to UNIX, all calls to kernel malloc() can block.
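
The ephemeral-mapping scheme mentioned above (a per-CPU virtual page frame at which copyin/copyout/bzero occur) might be sketched as follows. In a real kernel, map_ephemeral would rewrite the PTE of a reserved per-CPU kernel virtual page and flush the corresponding TLB entry; here the "physical memory" and the mapping are simulated so the sketch is runnable, and all names are hypothetical:

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096u

/* Simulated physical memory (16 frames). */
static uint8_t phys_mem[16 * PAGE_SIZE];

/* Install a frame at this CPU's reserved ephemeral window.  The real
 * version edits a per-CPU PTE; the simulation just returns a pointer
 * into the simulated physical memory. */
static void *map_ephemeral(unsigned cpu, uint64_t pa)
{
    (void)cpu;                  /* selects which per-CPU window to use */
    return &phys_mem[pa];
}

static void unmap_ephemeral(unsigned cpu)
{
    (void)cpu;                  /* real version: invalidate the TLB entry */
}

/* Zero a physical frame without any long-term kernel mapping of it. */
void frame_bzero(unsigned cpu, uint64_t pa)
{
    void *va = map_ephemeral(cpu, pa);
    memset(va, 0, PAGE_SIZE);
    unmap_ephemeral(cpu);
}
```

Because each CPU owns its own window, no cross-CPU TLB shootdown is needed for these transient mappings.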

When hot-pluggable devices are taken into account, the move to dynamic allocation is not so much good as inevitable.

Pragmatic Impact

The pragmatic impact of these changes can be boiled down to two architectural changes:

  • The old kernel divided memory at boot time into node space, page space, depend space, core tables, etc.; and treated these spaces as co-equal, in the sense that each was equally primary. The new kernel first allocates a grand page space consisting of all available physical pages, and then allocates other spaces from this.

  • The physical memory management logic and the kernel heap allocation logic must now be integrated, because the kernel heap allocator (i.e. the malloc() routine) may sometimes require the ability to allocate more physical pages to grow the heap. In the steady state these pages will be obtained from the page cache, by which we mean that they will be obtained by consulting the physical memory metadata structures and ``reclaiming'' pages that were allocated to the page cache.
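
The integration between the heap allocator and the physical memory logic can be sketched as a bump allocator whose grow path consults the core table and reclaims a frame. This is a toy, not the EROS implementation: the names (kmalloc, heap_grow, the PtFree/PtPageCache/PtKernelHeap states) are assumptions, a real grow path would age the page cache and block the caller, and the tail of a partially used page is simply abandoned here:

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE  4096u
#define NUM_FRAMES 64u

/* Toy core table: each entry records how its frame is currently used. */
enum frame_use { PtFree, PtPageCache, PtKernelHeap };
static enum frame_use core_table[NUM_FRAMES];

static uint8_t phys_mem[NUM_FRAMES * PAGE_SIZE];
static uint8_t *heap_ptr = phys_mem;     /* empty heap: ptr == end */
static uint8_t *heap_end = phys_mem;

/* Grow path: consult the core table and reclaim a frame for the heap.
 * (Real version: prefer free frames, then reclaim page-cache frames,
 * blocking the caller if necessary -- malloc() may block.) */
static int heap_grow(void)
{
    for (size_t i = 0; i < NUM_FRAMES; i++) {
        if (core_table[i] == PtFree || core_table[i] == PtPageCache) {
            core_table[i] = PtKernelHeap;
            heap_ptr = &phys_mem[i * PAGE_SIZE];
            heap_end = heap_ptr + PAGE_SIZE;
            return 1;
        }
    }
    return 0;
}

void *kmalloc(size_t n)
{
    n = (n + 15u) & ~(size_t)15u;        /* 16-byte alignment */
    if (n == 0 || n > PAGE_SIZE)         /* sketch: sub-page requests only */
        return NULL;
    if (heap_ptr + n > heap_end && !heap_grow())
        return NULL;
    void *p = heap_ptr;
    heap_ptr += n;
    return p;
}
```

The point of the sketch is the dependency direction: kmalloc() bottoms out in the core table, which is why the core table itself cannot be allocated this way at startup (the chicken-and-egg problem discussed next).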

The second issue leads to a ``chicken and egg'' problem. When initializing physical memory, we must construct core table entries corresponding to each physical page. These entries must be allocated by malloc(), but until we initialize physical memory, the malloc() routine has nothing to work from, because it obtains physical frames by selecting available frames using the core table entries.


This circular bootstrap dependency is resolved at startup time while allocating the first physical memory region. A computation is performed to determine the size of the core table array corresponding to the highest physical memory region. The necessary number of pages are ``donated'' to the heap before the corresponding core tables are allocated.

Once the kernel heap has been ``pre-expanded,'' the core table entries corresponding to the top physical region are then allocated using malloc(). Those core-table entries corresponding to the page frames that were pre-donated to the heap are marked (by hand) as PtKernelHeap. The remainder are placed on the free frame list.

Now that there is a list of free frames, subsequent calls to the kernel malloc() routine can be satisfied by reclaiming frames using the core table.
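
The donation computation might look like the following. The CoreTableEntry layout shown is invented for illustration (the real EROS entry carries more state); the arithmetic is just the round-up described above:

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE 4096u

/* Hypothetical core table entry; the real one has more fields. */
typedef struct CoreTableEntry {
    uint8_t  type;        /* e.g. PtFreeFrame or PtKernelHeap */
    uint32_t frameNdx;
} CoreTableEntry;

enum { PtFreeFrame = 0, PtKernelHeap = 1 };

/* Number of whole pages needed to hold 'nFrames' core table entries.
 * These are the pages that must be donated to the heap up front. */
size_t core_table_pages(size_t nFrames)
{
    size_t bytes = nFrames * sizeof(CoreTableEntry);
    return (bytes + PAGE_SIZE - 1) / PAGE_SIZE;   /* round up */
}

/* After the entries are allocated, mark the donated frames by hand
 * as heap frames; the remainder become free frames. */
void init_core_table(CoreTableEntry *ct, size_t nFrames, size_t donated)
{
    for (size_t i = 0; i < nFrames; i++) {
        ct[i].frameNdx = (uint32_t)i;
        ct[i].type = (i < donated) ? PtKernelHeap : PtFreeFrame;
    }
}
```

For example, with an 8-byte entry, a 1000-frame region needs two donated pages of core table, after which the remaining 998 frames go on the free list.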

Closing Thoughts

Assuming that the KeyKOS "1 node per page" rule continues to hold for EROS, and that there are 7 nodes per page, it seems promising that EROS should comfortably manage physical memories that are roughly 8 times the size of the available kernel virtual address space.

Copyright 2001 by Jonathan Shapiro. All rights reserved. For terms of redistribution, see the GNU General Public License.