On Allocators, Databases, Filesystems, mmap, and flat memory spaces

Posted on
data-structures

PART ONE

Allocators, databases, and filesystems have one major feature in commmon: they all allow storing arbitrary-sized data within a larger region, and retrieving it by some identifier. The particulars of the region management are largely hidden from the caller.

On top of this basic ability, some layers of indirection can be added.

  • Filesystems add support for fragmentation. Databases diverge on this topic; they might fragment for dynamic row formats, but not for fixed. In either case, supporting this requires a layer of indirection not provided by allocators.

  • Filesystems add support for querying data by an additional index (namely a string filename). Supporting this requires a layer of indirection. Databases take this further by adding support for querying data by arbitrary indexes. Allocators do not offer this layer of indirection.

The following behaviours can be noted:

  • In the case of variably-sized files, all three systems support growing an existing file. Allocators would move the file, filesystems would fragment the file, and databases might do either.

  • A region might have cause to be right-extended. Databases add support for automatically right-extending the region when the region (tablespace) cannot accommodate an INSERT/UPDATE. Filesystems can be right-extended when unmounted, but will otherwise abort writes when the region is near capacity. Allocators do the same.

                                 +---------------------------------------------+
                                 |                                             |
+------------------------------+ |                  +-------------------------+|
|                              | |                  |                         ||
| +---------+----------------+ | | +---------+      | +---------+    +------+ ||
| |String / + prefix lookup \| +-->+ inode / +--+---+>| pointer +--->| data | ||
| |       \ for directories /| | | |  rowid  |  |   | +---------+    +------+ ||
| +--------------------------+ | | +---------+  |   |                         ||
| FILESYSTEMS                  | |              |   +-------------------------+|
|                              | |              |   ALLOCATORS                 |
| Arbitrary indexed query      | |              |                              |
|                              | |              +-----> pointer  +-->  data    |
+------------------------------+ |              +-----> pointer  +-->  data    |
DATABASES                        |                                             |
                                 +---------------------------------------------+
                                 FRAGMENTATION SUPPORT                          

PART TWO

This leads to some interesting conceptual digressions.

You could have a filesystem that didn’t fragment, and instead, rewrote/moved files whenever they needed to be right-extended without space available.

Filesystems easily expose their free space, an allocator does not necessarily do so, and a database may grow indefinitely.

It’s conventional for filesystems and databases to easily expose the size of an object within the region, but there’s no corresponding malloc/free API to retrieve the total object size of a pointer. This leads to hacks like over-allocating and storing the size “behind” the returned allocation; but for filesystems and databases to keep sizes known, they must either perform a similar trick, or store the size at a different level of abstraction.

You could have a filesystem that automatically right-extended its partition when low on space, much like a database would grow its tablespace files.

You could have a filesystem that supported arbitrary queries. fopen() only takes a string argument. opendir()/readdir() only take string arguments. You could declare custom indexes on files. Longhorn’s WinFS never made it out the door, but KDE Nepomuk and external tagging and search systems make this a reality without tying in too closely with the underlying filesystem.

You could layer on external tagging systems on allocators. Although at this point you are implementing application-dependent lookup tables in a generic way, kind of like putting the STL into memory.h.

You could have a database that did not support advanced queries, and only supported a single type of key (hi, NoSQL! you’re really a filesystem).

You could have an allocator that did support fragmentation, but to cope with the necessary extra level of indirection, either

  • you must implemented direct modification via an MMU - this does happen in real operating systems, and gives way to the 4KB page size; or

PART THREE

  • you must copy the data into callers instead of allowing direct modification, like filesystems and databases do (excluding some fixed-row-format, in-process “databases” like BDB).

The x86 is referred to as having a flat memory space - everything you need is accessible by pointer dereference, without any messy segment selection. But it’s not really the case. As well as outb() / inb(), filesystems must be accessed via special functions (fopen() / fread() et al) and not by pointer dereference.

The MMU example above shows us that this does not necessarily rise from the existence of fragmentation, and mmap() allows the filesystem to be treated as a normal part of the flat memory space by “allocating” some memory which is specially handled by the kernel. This is good because fewer mechanisms are required to access all the system’s resources, and as we have seen above, the architectural similarity makes this a quite watertight abstraction.

But it is bad, because mmap() cannot be really implemented in C in userspace. It’s not possible to write your own setter/getter implementations for memory handlers which will be used by other pointer dereferences. Not even in C++ with abuse of overloading operator * and operator &. Although it is possible to provide your own malloc()/free() implementation, this issue prevents you from transparently implementing your own swap files.

It would be possible with some kind of FUSE derivative, e.g. on Linux wrapping struct vm_area_struct in mm/memory.c.