This article is mirrored on my blog.

I have been working on a voxel-based game for the last several months. As part of its development, I created a binary versioning and migration system that I haven't seen before.

Game data goals

Arguably the most important aspect of a game's technical architecture is its data model. The choice of the data model has wide-ranging impacts on iteration speed, development time, game design, player experience, performance, and more.

I had several goals in mind when deciding this game's data model:

  • No loading screens. Computer hardware has gotten to the point where gigabytes of data can be read from SSDs in a single second. With discipline, I should be able to calculate when my game can display its first playable frame (i.e., from launching the executable to regular gameplay) based on how much data I need to load and how fast my disk can supply the data. There is work that needs to be done beyond just loading from the disk, such as negotiating creation of the game window and especially uploading to the GPU. However, I want to keep the pipeline from data on disk to gameplay-ready data as short as possible.
  • Limited/no dynamic allocation. While it is fun and easy to use a dynamic array like C++'s std::vector, it makes it much harder to reason about your memory. You don't know how much you use, how often you allocate, when you reallocate, or how much your game can even handle unless you exercise restraint or create your own version with lots of added tracking. One of my most common sources of crashes in my previous programs is referencing data in vectors that have been reallocated. I believe I can make more robust programs by setting memory constraints rather than treating memory as a mysterious essentially unlimited resource.
  • Simple. If a C array of e.g. PlayerData players[256] is sufficient, then it should be used rather than something more complex. Performance does cause tension with simplicity, so a middle ground must be found in each case.

The plain old data model

The game's data model is highly unusual for one made in 2023:

  • All world state data (essentially anything the game remembers frame-to-frame, excluding things like data uploaded to the GPU or caches for performance) is stored in one massive "plain old data" (POD) structure.
  • To maintain the POD-ness of the world state, all dynamic things are limited to a fixed size and stored in C arrays. Each set of data has some way of indicating it is an empty slot or not, usually through a zero or non-zero ID.
  • The world state is saved and loaded to disk or transferred over the network in its native binary form, i.e. through single calls to fwrite, fread, or write, respectively (see the sketch after this list).
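
For illustration, a minimal C sketch of the shape of this data model might look like the following. The struct and field names here (WorldStateData, PlayerData, DoorData, and so on) are hypothetical stand-ins, not the game's actual declarations:

#include <stdbool.h>
#include <stdio.h>

// Hypothetical plain-old-data world state: fixed-size C arrays only, no
// pointers, no heap allocation. A zero ID marks an empty slot.
typedef struct PlayerData
{
  unsigned int id;       // 0 = empty slot
  float position[3];
} PlayerData;

typedef struct DoorData
{
  unsigned int id;       // 0 = empty slot
  int voxelPosition[3];
  bool isOpen;
} DoorData;

typedef struct WorldStateData
{
  PlayerData players[16];
  DoorData doors[1024];
  // ...everything else the game remembers frame-to-frame...
} WorldStateData;

// The entire state can be saved with a single call.
bool SaveWorldState(const WorldStateData* state, FILE* file)
{
  return fwrite(state, sizeof(*state), 1, file) == 1;
}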

There are several advantages to this approach:

  • Memory is much easier to reason about when it is all in one contiguous block rather than scattered wherever the heap allocator decided to put it.
  • Similarly, cache locality is helped by enforcing use of arrays rather than e.g. linked lists. This doesn't "solve" locality because there still needs to be thought put into e.g. "array of structs" vs. "struct of arrays" and references between things.
  • The game state can be copied, saved, loaded, and moved trivially. There are no pointers--IDs are used instead. Pointers are taken to items in the state during e.g. simulation, but never as a mechanism for two items in the state to refer to each other.
  • Stability is improved by enforcing constraints. If the game has some runaway spawning issue that creates thousands of something, it will simply stop creating more of them once it fills the predefined space rather than continuing until memory is exhausted. This means the game is "artificially" limited in e.g. how many characters may exist, but these limitations are set high enough and in a way that won't impact the intended gameplay design. This also makes it possible to actually test the game with a "full" world, e.g. with the maximum number of characters, doors, players, etc. If such a test is made to pass acceptably, I can be much more confident that the game is stable and robust even in extreme scenarios.
  • Deciding on fixed sizes for things forces me to understand what I am trying to create. For example, if I decide the maximum number of non-player characters is 1024, the potential game design is different than one with only 16, or one with 100,000. Any of those values could result in cool gameplay and design, but by forcing myself to pick a number, I eliminate possibilities. This sounds bad, but is actually good for giving me a clearer picture of my goal.

There are of course disadvantages:

  • Array elements must have some way to indicate that they are empty or active. I do this on a case-by-case basis, e.g. a simple num-used, a field in the element (e.g. is-used or a non-zero ID), or lack of references to that element.
  • Fixed sizes mean I must handle the case where there are no more free slots. How to do so needs to be decided per mechanic, e.g. if it's a list of visual particles, it's fine to just stop creating them (or to replace older ones), whereas if the player is trying to place down a door, they need some explanation as to why they cannot do so (see the sketch after this list). The alternative data strategies (malloc'ing, dynamic arrays, etc.) should also force you to think about these edge cases, e.g. when malloc returns NULL (out of memory), but game programmers rarely consider this, which leads to crashes.
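
Continuing the hypothetical structures from the earlier sketch, handling the "no free slots" case might look roughly like this:

#include <stddef.h>

// Hypothetical example: find a free door slot, or report that the array is
// full so the caller can explain the failure to the player.
DoorData* TryCreateDoor(WorldStateData* state)
{
  int numDoors = (int)(sizeof(state->doors) / sizeof(state->doors[0]));
  for (int i = 0; i < numDoors; ++i)
  {
    if (state->doors[i].id == 0)  // 0 = empty slot
    {
      state->doors[i].id = (unsigned int)(i + 1);  // any non-zero ID works
      return &state->doors[i];
    }
  }
  // No free slots left: the caller decides what to do, e.g. tell the player
  // "you cannot place any more doors."
  return NULL;
}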

Versioning plain old data

Data models can be simple right until the point where you need to reckon with changes to the data model "schema" over time.

There are several C++ projects that handle structure-to-binary serialization and structure versioning (list not exhaustive):

I'm not interested in hearing about more of these, because I'm only using Cakelisp and some third-party libraries in exclusively C. In addition, these all fail one of my main constraints, which is that the serialization "schema" must not be separate from the structure declaration.

Luckily, Cakelisp is designed to implement these sorts of things through full-power macro execution at compile time. Similar to my introspection system, binary versioning is implemented through a collection of macros and compile-time code generation phases.

I cooked up my own binary versioning scheme. Here's the game's world-state-data:

(def-versioned-struct world-state-data (version 7)
  space voxel-space (live (1 1 .))
  chunks (array 1024 voxel-chunk) (live (1 1 .))
  players (array 16 player-data) (live (1 1 .))
  accounts (array 16 player-account) (live (1 1 .))
  characters (array 1024 character) (live (1 4 .))
  character-profiles (array 1024 character-profile) (live (1 4 .))
  doors (array 1024 door-data) (live (1 2 2) (2 3 4) (3 5 5) (4 6 .))
  rooms (array 1024 room-data) (live (1 7 .))
  world-time world-time-data (live (1 1 .)))

This looks the same as a regular struct in Cakelisp, only with another expression after the field's type. This expression is the field's "versionings".

To explain, let's take one field, chunks (array 1024 voxel-chunk) (live (1 1 .)):

  • chunks is the name of the field.
  • (array 1024 voxel-chunk) is the type of the field. This means 1024 elements of type voxel-chunk contiguously arranged in memory.
  • (live (1 1 .)) is the versionings. live indicates the field is still relevant at runtime (more on this later). The (1 1 .) indicates that voxel-chunk version 1 (the first number) is valid in world-state-data version 1 (the second number) and onward (indicated by the ., essentially an ellipsis or regex-style "any version fits here").

The versioning format is live or dead followed by, for struct fields, a list of version tuples that look like (sub-struct-version start-version end-version), or, for plain-old-data fields, simply a version range start-version end-version.

The doors field shows more complex history. (live (1 2 2) (2 3 4) (3 5 5) (4 6 .)) indicates that the door-data schema at version 1 was valid in the world-state-data at version 2, but door-data at version 2 was used instead in world-state-data versions 3 through 4 inclusive. You should be able to figure out the rest of the versionings from there.
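
Conceptually, each of these entries records just enough to answer "was this field present at struct version N, and at which sub-struct version?" A rough C sketch of that record (not the actual compile-time representation Cakelisp builds) could be:

#include <limits.h>
#include <stdbool.h>

// Rough sketch of what one versioning entry records. endVersion set to
// USHRT_MAX stands in for ".", i.e. "valid in every later version as well".
typedef struct FieldVersioning
{
  bool isLive;                      // live or dead at runtime
  unsigned short subStructVersion;  // unused for plain-old-data fields
  unsigned short startVersion;      // first struct version the field appears in
  unsigned short endVersion;        // last struct version, or USHRT_MAX for "."
} FieldVersioning;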

Within door-data it is clear why its version has changed over time:

(def-versioned-struct door-data (version 4)
  dead-type uint32_t (dead 3 3)
  dead-position fixed-vec3 (dead 1 3)
  position voxel-position (live (1 4 .))
  type door-type (live 4)
  orientation cardinal-orientation (live 2)
  is-open bool (live 1))

In version 4 I removed dead-type and dead-position, added position, and added type. I actually changed the type of position from fixed-vec3 to voxel-position, but the easiest way to do this was to create a new field and rename the old one.

Virtual types

By including this version information with each field, we can construct the type at any version in its history. I call these "virtual types", because all the C compiler knows about is the "real" type used at runtime, i.e. the struct with only live fields.

Here's door-data expanded into all of its virtual types:

(def-versioned-struct door-data (version 1)
  position fixed-vec3
  is-open bool)

(def-versioned-struct door-data (version 2)
  position fixed-vec3
  orientation cardinal-orientation
  is-open bool)

(def-versioned-struct door-data (version 3)
  type uint32_t
  position fixed-vec3
  orientation cardinal-orientation
  is-open bool)

(def-versioned-struct door-data (version 4)
  position voxel-position
  type door-type
  orientation cardinal-orientation
  is-open bool)

These definitions do not exist to the compiler except the latest one (in this case, version 4). Instead, when data is detected in an older version (via the version header), the virtual type is constructed by reading the version log and adding, in order, the fields that were live in that version.

Constructing virtual types is possible by following the C compiler's struct padding and alignment rules. As part of validation, code is generated to check at runtime where the compiler says fields are vs. where the versionings think they are, in order to catch any differences between platforms or compilers.
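
To give a feel for what following those rules means, here is a simplified, self-contained C sketch (not the generated Cakelisp code) that lays out a struct by hand and cross-checks it against the compiler with offsetof. It assumes each field is aligned to its natural alignment, which holds for plain-old-data types on typical AMD64 ABIs:

#include <assert.h>
#include <stddef.h>
#include <stdio.h>

// Align offset up to the next multiple of alignment (a power of two).
static size_t AlignUp(size_t offset, size_t alignment)
{
  return (offset + alignment - 1) & ~(alignment - 1);
}

// Compute field offsets the way the compiler would, given each field's size
// and alignment in declaration order.
static void ComputeVirtualLayout(const size_t* sizes, const size_t* alignments,
                                 size_t* offsetsOut, int numFields)
{
  size_t offset = 0;
  for (int i = 0; i < numFields; ++i)
  {
    offset = AlignUp(offset, alignments[i]);
    offsetsOut[i] = offset;
    offset += sizes[i];
  }
}

// Validation for a hypothetical runtime struct: compare what the versionings
// computed against where the compiler actually put the fields.
typedef struct ExampleDoor
{
  int position[3];
  unsigned char isOpen;
} ExampleDoor;

int main(void)
{
  size_t sizes[] = {sizeof(int[3]), sizeof(unsigned char)};
  size_t alignments[] = {_Alignof(int), _Alignof(unsigned char)};
  size_t offsets[2] = {0};
  ComputeVirtualLayout(sizes, alignments, offsets, 2);
  assert(offsets[0] == offsetof(ExampleDoor, position));
  assert(offsets[1] == offsetof(ExampleDoor, isOpen));
  printf("virtual layout matches the compiler's layout\n");
  return 0;
}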

Migration

The first use-case this system was applied to was the game's on-disk save files. I needed to be able to add and remove fields during development without invalidating my previous save files. This becomes critical once the game goes live, because any invalidation after that point would destroy players' save files. I know I'll need to modify the game after this point, so I might as well bite the bullet and get it working early.

The save file format is simple:

  • The characters CAKEVERS (short for "Cakelisp versioned data") act as a magic number
  • The version number, an unsigned short, indicates what virtual version the struct was saved at
  • The binary struct data, completely unchanged from runtime

This makes saving the game state the following steps:

  • Write the header
  • fwrite the world state

It can hardly get simpler than this, and the only gains I can see for speed would be to use newer I/O interfaces rather than a blocking fwrite.
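
In C terms, the save path could look roughly like this; the header encoding (eight magic characters followed by an unsigned short) matches the description above, while WorldStateData and WORLD_STATE_DATA_VERSION are the hypothetical names from the earlier sketches:

#include <stdbool.h>
#include <stdio.h>

#define WORLD_STATE_DATA_VERSION 7  // hypothetical current runtime version

bool SaveVersionedWorldState(const WorldStateData* state, const char* filename)
{
  FILE* file = fopen(filename, "wb");
  if (!file)
    return false;

  // Header: magic number, then the version the struct was saved at.
  const char magic[] = "CAKEVERS";
  unsigned short version = WORLD_STATE_DATA_VERSION;
  bool success = fwrite(magic, sizeof(magic) - 1, 1, file) == 1 &&
                 fwrite(&version, sizeof(version), 1, file) == 1 &&
                 // The binary struct data, completely unchanged from runtime.
                 fwrite(state, sizeof(*state), 1, file) == 1;
  fclose(file);
  return success;
}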

Loading is a bit more complex due to data migration. First, the magic number is checked to ensure this is a Cakelisp versioned binary file. Next, the version number is checked. If the version matches the runtime version, the data is simply fread into memory. This is the optimal loading case.

If the version does not match, migration starts. If the data version is newer than the runtime version, the file must have been created by a newer version of the game and cannot be read. If the data version is older, we lazily construct the virtual type at that version. We then virtually construct the next version after the on-disk version, migrating field by field, and repeat until we reach the runtime version. Two buffers are needed so we can migrate from one version into the next, ping-ponging back and forth until we reach the target version.
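
A rough C sketch of that loading flow, continuing the names from the save sketch; SizeOfVirtualVersion and MigrateOneVersion are placeholders for the version-log machinery, and the scratch buffers are assumed to be large enough for any virtual version:

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

// Hypothetical helpers the versioning system would provide.
size_t SizeOfVirtualVersion(unsigned short version);
void MigrateOneVersion(const void* from, unsigned short fromVersion,
                       void* to, unsigned short toVersion);

bool LoadVersionedWorldState(WorldStateData* stateOut, FILE* file,
                             void* scratchA, void* scratchB)
{
  char magic[8];
  unsigned short dataVersion = 0;
  if (fread(magic, sizeof(magic), 1, file) != 1 ||
      memcmp(magic, "CAKEVERS", sizeof(magic)) != 0 ||
      fread(&dataVersion, sizeof(dataVersion), 1, file) != 1)
    return false;

  if (dataVersion == WORLD_STATE_DATA_VERSION)
    // Optimal case: binary-compatible, read straight into place.
    return fread(stateOut, sizeof(*stateOut), 1, file) == 1;

  if (dataVersion > WORLD_STATE_DATA_VERSION)
    return false;  // written by a newer version of the game

  // Read the old virtual version, then migrate one version at a time,
  // ping-ponging between the two scratch buffers.
  if (fread(scratchA, SizeOfVirtualVersion(dataVersion), 1, file) != 1)
    return false;
  void* current = scratchA;
  void* next = scratchB;
  for (unsigned short version = dataVersion;
       version < WORLD_STATE_DATA_VERSION; ++version)
  {
    MigrateOneVersion(current, version, next, version + 1);
    void* swap = current;
    current = next;
    next = swap;
  }
  memcpy(stateOut, current, sizeof(*stateOut));
  return true;
}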

There are four field changes that can occur:

  • A new field was added in the later version. Nothing needs to happen during migration.
  • A field is present in both the current and later versions. It needs to be copied to the later version, but can literally just be memcpy'd thanks to being binary-compatible.
  • A field is live in the current version but now dead in the later version. This could mean loss of data, so the system checks for a "migration handler", a hand-written function which the programmer provides to e.g. convert the data into a new form or save it off somewhere else. This can be made "strict" where no fields may go dead without the programmer explicitly saying "this data should be thrown away" or providing a handler.
  • A nested versioned struct is version X in the current version but version Y in the next version. This case is thankfully simple: recursively migrate that nested struct using the current version as X and the target as Y. Arrays are handled by recursing on each element one-by-one.

Custom migration handlers provide the programmer with the ability to change types, reorganize data, etc. as desired without losing data. Because they are arbitrarily complex, they can complete any migration necessary, and could even prompt the user if they need to be involved.
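
To make the per-field cases concrete, here is a small sketch of the two interesting ones: a binary-compatible copy and a hand-written handler for a field that went dead. The types and the handler signature are illustrative only; the real handlers are Cakelisp functions, and the actual fixed-vec3/voxel-position conversion depends on their real representations:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Illustrative stand-ins for fixed-vec3 and voxel-position, whose real
// definitions are not shown in this article.
typedef struct FixedVec3 { int32_t x, y, z; } FixedVec3;
typedef struct VoxelPosition { int32_t x, y, z; } VoxelPosition;

// A field live and binary-compatible in both versions: migration is a plain
// memcpy from the old layout's offset to the new layout's offset.
static void CopyCompatibleField(const void* oldStruct, size_t oldOffset,
                                void* newStruct, size_t newOffset,
                                size_t fieldSize)
{
  memcpy((char*)newStruct + newOffset,
         (const char*)oldStruct + oldOffset, fieldSize);
}

// A field that went dead: a hand-written handler receives the old data and
// the new field so nothing is silently thrown away.
static void MigrateDoorPositionHandler(const FixedVec3* deadPosition,
                                       VoxelPosition* newPosition)
{
  // Placeholder conversion; the real one would account for how fixed-vec3
  // and voxel-position actually encode coordinates.
  newPosition->x = deadPosition->x;
  newPosition->y = deadPosition->y;
  newPosition->z = deadPosition->z;
}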

Limitations

There are some constraints with using this system:

  • It was designed for plain-old data, and as such will not e.g. follow pointers.
  • Absolute pointers of course would need to be fixed up after loading from disk. Other varieties of pointers would be compatible, however.
  • There is the possibility of human error when modifying the versionings. Thanks to Cakelisp's arbitrary compile-time code execution, compile-time validation of versionings is performed whenever possible. For example, common changes like modifying the version of a substructure will cause errors if any referencing structures indicate the older version as being live. There is still the chance that numbers are modified in a way that the computer could not detect.
  • The versionings add visual clutter to struct declarations. This is, I believe, unavoidable without A) destroying history or B) relying on scary things like version control integration to deduce struct history. In development, I can decide to stop supporting versions older than some number and actually delete the versionings before that point. Once the game goes live, this is no longer viable, because player saves should always be migratable to any newer version of the game.

There are of course assumptions made that allow this setup to work:

  • The sizes of types are assumed to be the same on every inter-operable platform
  • The structure packing/padding of types is assumed to be the same
  • The endianness is assumed to be the same

These, I feel, are safe assumptions when targeting AMD64 architecture machines. I figure I can worry about compatibility with other architectures when I actually need to ship on one.

Because the schema is part of the code, endianness and even size conversions can happen automatically at the migration stage. This means even if I do end up having to break these assumptions later, I will have enough information to convert existing data.

Credit where credit is due

My friend Eric Alzheimer originally came up with the idea to include the live and dead log in-line and to use it for automated data migration. This insight came from his reading of database research from the 80's, but those techniques don't seem to have seen much use in a C/C++ serialization context. I'll get in touch with him and see if he can provide more information on where this idea came from.

Eric's idea for migration handlers was to create a super structure which is a union of all fields in all versions of the structure, then pass that to the hand-written handler functions. I decided instead to virtualize the structure by reading the versionings at runtime, because it was simpler to write the code generator for and generates less code.

What's next

Padding and the use of POD arrays means that there are a whole lot of zeros in the binary data on disk. It compresses extremely well, so I want to eventually start compressing it once the trade-off of write/read speed vs. compression/decompression speed favors compression.

I need to create a blend of binary versioning and my previous introspection system. The former is focused on getting data ready to go fast, whereas the latter is focused on allowing for various per-field processing on the data at runtime (e.g. automatically creating editors or watching for and describing field-by-field changes).

Writing custom migration handlers is a bit cumbersome. I should be able to come up with a variety of generic conversions (e.g. unsigned char to unsigned int).

I also want to support increasing/decreasing array sizes without creating new fields. This is clearly possible if you think of the versioning data as a log of changes to the schema. For example, an entry like (resize 2 256 1024) would be saying "from version 2 this field changed from size 256 to size 1024". The migration code would see that and simply copy 256 elements from version 2 into the 1024 element array in version 3.
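
A small sketch of what that resize migration could do, copying the overlapping elements and leaving any new ones zeroed when growing (shrinking could instead require a handler for the excess):

#include <stddef.h>
#include <string.h>

// Sketch of migrating a resized array field: copy the elements the two
// versions have in common; any additional elements in the larger array are
// left zeroed.
static void MigrateResizedArray(const void* oldArray, size_t oldCount,
                                void* newArray, size_t newCount,
                                size_t elementSize)
{
  size_t commonCount = oldCount < newCount ? oldCount : newCount;
  memset(newArray, 0, newCount * elementSize);
  memcpy(newArray, oldArray, commonCount * elementSize);
}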

Conclusion

The complete system is available here, pinned to a version that will make sense relative to this article.

I'd love to hear your thoughts on this approach to data migration.