This book is a work in progress, comments are welcome to: johno(at)johno(dot)se


Automated Persistence

Introduction

The work on the persistence scheme for Ground control 2 brought to mind an interesting prospect: What if persistence was a fundamental feature of C++? Wouldn't that solve a lot of problems? What if a programmer could simply mark memory as persistent, and each time an application was executed, he would find that memory in the exact state it was the last time the application terminated?

Indeed, such functionality is possible to implement. Since, in the Ground control 2 scheme, a way was devised to uniquely identify each instance of each class, a programmer can start considering each instance of a class as a persistent resource. This also means that all issues related to external file formats, load / save code, etc, can be ignored; from a client's perspective, data is simply persistent. Who cares how? Files themselves are no longer of any concern to the application programmer.

Third party tools still exist, of course, and are often necessary, and as they save their data in physical files a programmer is still not completely free from having to deal with file parsing issues. But consider what is possible if a given third party tool has a programming API, which allows programmers to write custom exporters. Traditionally, one would write an exporter plug-in that manipulates the in-memory data of the third party tool and saves it out to a format that is friendly to the application in question. But with automated persistence, all that needs to be done in the exporter plug-in is to create an instance of the target class (the class that the application will ultimately use) and write to the various data members of this instance. Then the instance is automatically persisted. When the application wants access to that particular instance, it will simply instantiate it, identifying it uniquely through the combination of class name and instance key. Again, knowledge about files and their internal formats can be eliminated completely.

Consider the possibilities for custom written editors that are more tightly coupled to the application at hand. With automated persistence, coupled with a healthy dose of Model / View architecture, one could build editor functionality right into the application itself. Think of the gains in terms of workflow. It would be possible to have an application in which a developer can playtest a level, see something wrong, switch to editing mode and correct the problem immediately, whereafter playtesting can continue directly. All changes are persistent. Everything is persistent. Don't worry about it.

In those cases where certain implementations or hardware require a "re-start" of certain subsystems, this could be explicitly communicated to the user. Quake III Arena does this when the user manipulates certain aspects of the 3d renderer, notifying the user that the video subsystem must be restarted for changes to take effect. Even though this takes some time, it is far superior to having to exit the application and restart it. TODO: couple this to core::Versioned...

Automated persistence totally obliterates the conceptual differences between "external asset data" and "run-time state". It's all just data, and since anything can be made persistent, this radically alters the way data in general can be viewed. Consider the general rule for member variables of a class; that they should be initialized to valid values in the constructor, since otherwise they will contain random data left over from the last application to use that particular memory. With automated persistence, this data will no longer be random; it will be the data left over from the last execution of the application at hand. The correct data, as it was when we left it last. It is easy to see how this fundamentally changes the premise of all programming.

Programmers no longer have to worry about WHERE the data values in their datastructures ultimately come from. They can happily forge ahead, creating datastructures as they see fit for the programming problems they are trying to solve, and later this data can be manipulated as needed. The data source is truly irrelevant. It could be a simple in-application editor, built later. It could be the output of an exporter plug-in, executed in a completely different application. It simply does not matter anymore.

As in the Ground control 2 implementation, the "pull" nature of data access is something that also allows for greater levels of backwards compatibility, completely automatically. Since automated persistence is code-centric, instead of file-format-centric, data schema changes (i.e. class members change size or type, are added or removed, etc) will not invalidate the persistent data as a whole, only on a per-member basis. Any member whose data cannot be found when data is loaded (i.e. a "new" member) will simply keep the default value it was given in the constructor. Old data that no longer has a matching member is simply never accessed.

Load failures, due to incompatible formats, would no longer be such a "binary" thing; i.e. it either worked or it didn't. It would become something more of a "fuzzy" issue; this object was 70% loaded, or 99% loaded. This all depends on the severity of data schema changes. Consider what this would mean for migrations between object formats; indeed, this kind of migration functionality became critical during Ground control 2, and was finally implemented as a "purge everything that can't be understood" operation in the JuiceMaker tool.
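The per-member "pull" loading and the resulting "fuzzy" load result can be sketched roughly as follows. This is only an illustration under stated assumptions: MemberLoader, Pull(), LoadedFraction(), StoredMembers and the Unit class are hypothetical names, not part of the Ground control 2 code, and only int members are handled.

```cpp
#include <cassert>
#include <map>
#include <string>

//hypothetical stand-in for what was stored for one instance:
//member name -> last saved value
using StoredMembers = std::map<std::string, int>;

//collects the "pull" requests made by the current version of a class
class MemberLoader
{
public:
    explicit MemberLoader(const StoredMembers& someStored) : myStored(someStored) {}

    //pull one member by name; if it was never stored, aTarget keeps
    //the default value it was given in the constructor
    void Pull(const std::string& aName, int& aTarget)
    {
        ++myRequested;
        const auto it = myStored.find(aName);
        if (it != myStored.end())
        {
            aTarget = it->second;
            ++myFound;
        }
    }

    //fraction of the requested members that were restored
    double LoadedFraction() const
    {
        return myRequested == 0 ? 1.0 : static_cast<double>(myFound) / myRequested;
    }

private:
    const StoredMembers& myStored;
    int myRequested = 0;
    int myFound = 0;
};

//current version of an example class: "armor" was removed, "speed" was added
struct Unit
{
    int myHealth = 100; //constructor defaults
    int mySpeed = 5;

    double Load(const StoredMembers& someStored)
    {
        MemberLoader loader(someStored);
        loader.Pull("health", myHealth);
        loader.Pull("speed", mySpeed);
        return loader.LoadedFraction();
    }
};
```

Given stored data containing "health" and an obsolete "armor", Load() restores myHealth, leaves mySpeed at its default, never touches "armor", and reports the object as half loaded.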

Indeed, programmers can also easily write conversion code, again without having to delve into details of file formats. If the name of a class member needs to be changed for some reason, or data moved to some other place, the programmer can simply add the new member variable to the class, and then write a simple program that instantiates the class and copies the data from the old member to the new, even performing transformations if required. Then, after the program has finished execution, the old class member can be safely removed. The data has been effectively moved, without loss. In this manner, batch conversions can be handled quite simply, and the process indeed resembles what is often done via SQL/DDL code when reordering data schemas in relational databases.

What about MBuild? Well, MBuild exists due to the various complexities inherent in dealing with numerous file formats, custom or not. With automated persistence, all such problems simply go away. It no longer becomes a question of "what files are required by the application", as programmers simply no longer deal in files. They deal in an automatically persistent programming language. TODO: talk about parsing specific asset-tags out of the source code of the application as opposed to using Juice or similar to explicitly name all assets. Probably requires a chapter discussing what MBuild was and the problem it attempted to solve. Also mention GC's "oldcrap" solution.

Implementation details

An automated persistence solution could work in a similar fashion to Ground control 2. The main concepts of EXCO_Persistent could remain, in that all persistent classes should inherit from such a baseclass, and instances are uniquely identified by class name and instance key. Members are uniquely identified within the scope of the instance, using the PersistXXX() interfaces as before.

The main difference, however, would be that loading / saving would not be "events" or "operations"; rather, objects are simply persistent. When an object is instantiated, an instance key is passed to the baseclass constructor, thereby identifying it uniquely and automatically restoring it to its last state (when LoadMembers() is called). For these reasons, there is no need for any centralized "persistence manager", as persistence more closely resembles a fundamental language feature, like construction / destruction.

Of course, the system should be flexible. For this reason, the actual calls to EXCO_Persistent::LoadMembers() and EXCO_Persistent::SaveMembers() are at the discretion of the programmer, not part of the baseclass implementation (i.e. in baseclass constructors or destructors). The simplest usage would be to have a call to LoadMembers() at the very end of custom class constructors, after all members have been mapped, and also to have a call to SaveMembers() at the beginning of these classes' destructors. Alternatives are of course possible, and that is as it should be. The system should in no way impose limitations on the programmers.
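The simplest usage described above might look something like the following sketch. It is not the Ground control 2 implementation: the Persistent class stands in for EXCO_Persistent, PersistInt() stands in for one of the PersistXXX() interfaces, and Storage() is a hypothetical in-memory map that takes the place of real physical storage.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

//stand-in for physical storage:
//(class name, instance key) -> member name -> value
using InstanceId = std::pair<std::string, int>;
inline std::map<InstanceId, std::map<std::string, int>>& Storage()
{
    static std::map<InstanceId, std::map<std::string, int>> ourStorage;
    return ourStorage;
}

class Persistent
{
public:
    Persistent(const Persistent&) = delete; //two objects must not share an identity
    Persistent& operator=(const Persistent&) = delete;

protected:
    Persistent(const std::string& aClassName, int anInstanceKey)
        : myId(aClassName, anInstanceKey) {}
    ~Persistent() = default;

    //map a member variable to a name that is unique within the instance
    void PersistInt(const std::string& aName, int& aMember) { myMembers[aName] = &aMember; }

    //overwrite every mapped member for which a stored value exists
    void LoadMembers()
    {
        const auto stored = Storage().find(myId);
        if (stored == Storage().end())
            return;
        for (auto& member : myMembers)
        {
            const auto value = stored->second.find(member.first);
            if (value != stored->second.end())
                *member.second = value->second;
        }
    }

    //write the current value of every mapped member to storage
    void SaveMembers() const
    {
        for (const auto& member : myMembers)
            Storage()[myId][member.first] = *member.second;
    }

private:
    InstanceId myId;
    std::map<std::string, int*> myMembers;
};

class Mission : public Persistent
{
public:
    explicit Mission(int anInstanceKey) : Persistent("Mission", anInstanceKey)
    {
        PersistInt("score", myScore);
        LoadMembers(); //last: all members have been mapped by now
    }
    ~Mission() { SaveMembers(); } //first: the members are still alive here

    int myScore = 0;
};
```

Apart from flexibility, there is a plain C++ reason for keeping these calls out of the baseclass: while a baseclass constructor runs the derived class has not yet mapped its members, and by the time a baseclass destructor runs the derived members have already been destroyed.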

For example, in Ground control 2 we only had a single mission in memory at once, as the player played missions one at a time. How would one differentiate between different missions, as the game obviously had, if there was only a single instance of the "mission" object instantiated at once? One alternative would be to parameterize the "mission" object with different instance keys, each individual key denoting an individual map. It would be a question of playing mission "1" or "2" or "37". Of course, further meta-information about each map would be stored in some directory (like campaign data, etc).

Another alternative would be to allow the entire persistence system to be parameterized with a "database" concept, which would allow instances to be loaded from various physical locations (different directories or files, depending on how the actual persistence was implemented). With this kind of solution, the "mission" object could always be instance 1 of its class, but the various "mission instances" could reside in different databases (i.e. Mission1, Mission2, Mission3, corresponding to directories or files, depending on implementation of the persistence backend).

For flexibility, there should indeed probably be some kind of "database" concept that parameterizes the entire persistence backend, so that different applications can run in parallel, using completely separate databases. Again, this should be at the discretion of the programmer for maximum flexibility.
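As a rough sketch of the "database" idea, the backend below is parameterized with a database name, so that the same class name and instance key resolve to different data in different databases. PersistenceBackend, Save() and Load() are hypothetical names, instance data is reduced to a single string, and an in-memory map stands in for directories or files.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <tuple>
#include <utility>

//a "database" here is just a name that selects one isolated set of instances
class PersistenceBackend
{
public:
    explicit PersistenceBackend(std::string aDatabase) : myDatabase(std::move(aDatabase)) {}

    void Save(const std::string& aClass, int aKey, const std::string& someData)
    {
        Storage()[std::make_tuple(myDatabase, aClass, aKey)] = someData;
    }

    //returns true and fills someData if the instance exists in this database
    bool Load(const std::string& aClass, int aKey, std::string& someData) const
    {
        const auto it = Storage().find(std::make_tuple(myDatabase, aClass, aKey));
        if (it == Storage().end())
            return false;
        someData = it->second;
        return true;
    }

private:
    //(database, class name, instance key)
    using Id = std::tuple<std::string, std::string, int>;

    //in-memory stand-in for the physical storage shared by all databases
    static std::map<Id, std::string>& Storage()
    {
        static std::map<Id, std::string> ourStorage;
        return ourStorage;
    }

    std::string myDatabase;
};
```

With this in place, instance 1 of "Mission" can exist once in the database "Mission1" and once in "Mission2", and two applications using different database names never see each other's instances.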

Dynamically allocated members revisited

The problems encountered in Ground control 2 with dynamically allocated members remain. A number of potential solutions come to mind.

Following in the footsteps of the Ground control 2 solution, there could exist a PersistXXX() method for each type of standard collection class that we use (MC_List, MC_GrowingArray, MC_KeyTree, etc). This is probably the solution that incurs the least amount of client implementation overhead. These methods would know how to persist the internal state of these collections, but there would have to be multiple implementations for each collection, depending on whether the collection holds pointers to dynamically allocated objects, objects by value, or basic datatypes (like ints). This would involve quite an amount of work, but since all of this code is reusable as a part of the persistence system itself, it would be worth the effort. Compare this to the work already done on custom types like MC_String and MC_LocString.

Another potential solution leans more towards relational data organization. Given that each class keeps track of all instances in a static list (in order to guarantee unique instance keys), one could extend the baseclass interface with methods such as the following:

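template <class X>
class EXCO_Persistent
{
public:

    //instantiate all instances of X that exist in the given database
    static void AutoInstantiate(const char* aDatabase);

    //access to the instances of X (KEY is the instance key type)
    static X* First();
    static X* Find(const KEY aKey);
    X* Next();
};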

AutoInstantiate() would automatically instantiate all instances of the class that exist in the given database (using a string id here is just an example). Then, all instances of a class could be accessed via the methods First(), Find(), and Next(), like so:

void Operation()
{
    EXG_Unit* u = EXG_Unit::First();

    //perform operation A on all units
    while(u)
    {
        u->OperationA();
        u = u->Next();
    }

    //perform operation B on unit 47
    u = EXG_Unit::Find(47);
    if(u)
        u->OperationB();

    //create a new object with a specific instance key
    //this will assert if the key is in use
    new EXG_Unit(48);

    //create a new object with an auto-assigned instance key
    new EXG_Unit();
}

This would impose a certain coding style upon the programmer. The basic idea is that all objects are globally accessible, and there are no custom container classes at all; rather, the above style is used for all access to instances of any persistent type that need to be dynamically allocated. Notice how this closely resembles how data is organized in relational databases; EXG_Unit is here a "table", and 47 is here a "row" in the "table" EXG_Unit.

Of course, this style of implementation may not be suitable for all applications, but this kind of functionality can easily be included in the persistence scheme with no additional overhead for those applications that do not wish to use it. Again, flexibility for the programmer.
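One possible way to back First(), Find() and Next() is sketched below. It makes a number of assumptions: the baseclass is called Persistent here rather than EXCO_Persistent, KEY is taken to be an int, the per-class "static list" is a std::map ordered by instance key, auto-assigned keys are simply one past the highest key in use, and AutoInstantiate() is left out since it depends on the persistence backend.

```cpp
#include <cassert>
#include <map>

template <class X>
class Persistent
{
public:
    using KEY = int;

    Persistent(const Persistent&) = delete; //instance keys must stay unique
    Persistent& operator=(const Persistent&) = delete;

    //instance with the lowest key, or nullptr if there are none
    static X* First()
    {
        return Instances().empty() ? nullptr : static_cast<X*>(Instances().begin()->second);
    }

    //instance with the given key, or nullptr if it does not exist
    static X* Find(const KEY aKey)
    {
        const auto it = Instances().find(aKey);
        return it == Instances().end() ? nullptr : static_cast<X*>(it->second);
    }

    //next instance in key order, or nullptr at the end
    X* Next()
    {
        const auto it = Instances().upper_bound(myKey);
        return it == Instances().end() ? nullptr : static_cast<X*>(it->second);
    }

    KEY GetKey() const { return myKey; }

protected:
    //register under a specific key; asserts if the key is in use
    explicit Persistent(const KEY aKey) : myKey(aKey)
    {
        assert(Instances().count(aKey) == 0);
        Instances()[aKey] = this;
    }

    //register under an auto-assigned key
    Persistent() : Persistent(NextFreeKey()) {}

    ~Persistent() { Instances().erase(myKey); }

private:
    static std::map<KEY, Persistent*>& Instances()
    {
        static std::map<KEY, Persistent*> ourInstances;
        return ourInstances;
    }

    static KEY NextFreeKey()
    {
        return Instances().empty() ? 1 : Instances().rbegin()->first + 1;
    }

    const KEY myKey;
};

//example persistent class; each class gets its own instance "table"
class Unit : public Persistent<Unit>
{
public:
    Unit() = default;
    explicit Unit(int aKey) : Persistent<Unit>(aKey) {}
};
```

Since the list is a static inside a class template, every persistent class gets its own "table" without writing any extra code, and a class that never calls First(), Find() or Next() pays only for the registration in its constructor and destructor.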

Version control issues

As with all multi-user environments that involve several people working on the same assets, some kind of version control system is required. Automated persistence does not do away with this problem; rather, it brings new issues to light. Typically, version control systems work at a file level, and since an automated persistence system such as has been discussed here aims to do away with the concept of files, at least from the user perspective, the path forward is not entirely clear.

Given an automated persistence system as has been described, developers would no longer speak in terms of physical files, they would speak in terms of objects. Instead of asking "who has unittypes.juice checked out?" they will ask "who has instance 37 of EXCO_UnitType checked out?". Immediately it is obvious that a finer level of granularity should be possible with this kind of system; instead of checking out an entire file which represents all EXCO_UnitType instances (this example is from Ground control 2) they can check out a single instance of EXCO_UnitType instead, allowing people to work more easily in parallel.

But since "there are no files", how does the version control system work? If one chose to implement the persistence backend by storing everything in a given database to a single file, and a traditional file based version control system was used externally, then only a single user could edit any of the application's data at any given time. There is only a single file which represents all data in the entire application! Considering the amount of asset data in Ground control 2, this file would be enormous, and obviously limiting access to a single user at a time is totally out of the question.

One could instead choose to implement persistence by storing each class in a database in a single file. Given a file based version control system, we would be limited to a single user per class. This might work well, but it depends greatly on the application at hand. One might argue that this is a good tradeoff, especially if developers are performing data schema modifications and need to do batch conversions. On the other hand, what about level designers who need to be able to edit different instances of a mission object in parallel?

It follows naturally then that one could implement persistence by storing each instance of each class in a database in its own file. One file per instance. For the most part, these files would be quite small, and would work quite well with a traditional file based version control system, allowing for a very fine level of check-out granularity (per object). Depending on the file system used, there may be negative aspects to great numbers of very small files.

As can be seen, there are many tradeoffs associated with each approach. For these reasons, it is probably a good idea to decouple the actual details of how persistence is physically achieved from the general persistence baseclasses, so that a choice can be made on a per-project level. This way, one could have any number of concrete physical persistence implementations, including 1 file per database, 1 file per class, 1 file per instance, or perhaps even a version that stores data in a relational database (as mentioned before, classes map very well to tables, and instances to rows in these tables...)
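The decoupling could be expressed as a small interface that decides which physical file holds a given instance, with one implementation per granularity. The sketch below is hypothetical throughout: StorageLayout, FileFor(), the three concrete classes, the directory structure and the ".db" extension are all made up for illustration.

```cpp
#include <cassert>
#include <string>

//decides which physical file holds a given instance
class StorageLayout
{
public:
    virtual ~StorageLayout() = default;
    virtual std::string FileFor(const std::string& aDatabase,
                                const std::string& aClass,
                                int anInstanceKey) const = 0;
};

//1 file per database: everything ends up in the same file
class FilePerDatabase : public StorageLayout
{
public:
    std::string FileFor(const std::string& aDatabase,
                        const std::string& /*aClass*/,
                        int /*anInstanceKey*/) const override
    {
        return aDatabase + ".db";
    }
};

//1 file per class: all instances of a class share a file
class FilePerClass : public StorageLayout
{
public:
    std::string FileFor(const std::string& aDatabase,
                        const std::string& aClass,
                        int /*anInstanceKey*/) const override
    {
        return aDatabase + "/" + aClass + ".db";
    }
};

//1 file per instance: the finest granularity a file based system can offer
class FilePerInstance : public StorageLayout
{
public:
    std::string FileFor(const std::string& aDatabase,
                        const std::string& aClass,
                        int anInstanceKey) const override
    {
        return aDatabase + "/" + aClass + "/" + std::to_string(anInstanceKey) + ".db";
    }
};
```

The persistence baseclasses would only ever talk to a StorageLayout, so a project could pick FilePerInstance during development and switch layouts without touching any persistent class.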

With physical persistence decoupled, it would be possible to have the development team work on a "multi-user friendly" version of physical persistence, and when it comes time to distribute a release, then a single file version could be used instead, possibly even including encryption and / or compression. This is very similar to the reasoning behind MF_File / SDF.

Considering again using a relational database as the physical persistence implementation, the granularity of change would actually be even finer than instance level; it would be at member variable level. Depending on the team size and how / when data is changed, perhaps a classical relational database approach would work fine, even without version control in place. This may or may not be feasible, all depending on how often changes are loaded from (LoadMembers()) and committed back to (SaveMembers()) actual physical storage. Again, very project specific.

If very fine access granularity is required, and it is found that loading from and committing to persistent storage is infrequent, then some kind of built-in version control functionality is required. The persistence baseclass would need to expose some kind of CheckOut() and CheckIn() interfaces, and the persistence backend would most definitely need to have actual physical persistence decoupled from the general interface, to allow the user to check out instances / members and keep them checked out across application executions. Some kind of local persistence would be required to support such extended check outs, as well as to support rollbacks. Added to this are the questions of user accounts, access rights, and everything else associated with version control.
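To give an idea of the bookkeeping behind CheckOut() and CheckIn(), here is a minimal sketch of a per-instance check-out table. It is purely illustrative: the CheckOutTable class, its signatures and the plain string user names are assumptions, the table lives only in memory, and local persistence of check outs, rollbacks, member level granularity and access rights are all left out.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

//tracks which user has which instance checked out
class CheckOutTable
{
public:
    //returns true if aUser now holds the instance (or already did)
    bool CheckOut(const std::string& aClass, int aKey, const std::string& aUser)
    {
        const auto result = myOwners.emplace(std::make_pair(aClass, aKey), aUser);
        return result.second || result.first->second == aUser;
    }

    //releases the instance; only the current holder may check it in
    bool CheckIn(const std::string& aClass, int aKey, const std::string& aUser)
    {
        const auto it = myOwners.find(std::make_pair(aClass, aKey));
        if (it == myOwners.end() || it->second != aUser)
            return false;
        myOwners.erase(it);
        return true;
    }

    //empty string if nobody has the instance checked out
    std::string Owner(const std::string& aClass, int aKey) const
    {
        const auto it = myOwners.find(std::make_pair(aClass, aKey));
        return it == myOwners.end() ? std::string() : it->second;
    }

private:
    //(class name, instance key) -> user holding the check out
    std::map<std::pair<std::string, int>, std::string> myOwners;
};
```

Owner() is what would answer the question "who has instance 37 of EXCO_UnitType checked out?". A real system would have to keep this table in shared storage so that check outs survive across application executions.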

Summary

None of these concepts are new; they fall inside the domain of object-relational mapping research. Many software packages within the realm of the Java Enterprise platform use similar approaches (Hibernate, Java Persistence API), and map database tables to classes. However, it is clear that automated persistence of this kind at a "language extension" level, and also not necessarily bound directly to relational databases, would be most useful.
