A brief history of persistence in software engineering

While evaluating two new projects with an expected lifespan of 10 years, free from any legacy constraints, I recently had to decide how to provide data persistence.

Starting from scratch in 2015 presents the architect with quite a lot of interesting options, especially because more and more mature solutions are available which do not strictly pursue the relational database approach.

In this article I will summarize my personal “history” of how I dealt with persistence and which path I am going to take.

The 80s: No persistence at all.

It was not uncommon to use a computer which simply was incapable of preserving the contents of its volatile memory beyond a power-cycle. Before the purchase of a magnetic drive, one did not have to care about saving anything.

Early 90s: One file I/O per functional unit.

In a Borland Pascal project I worked on during my school days in 1994, I chose a quite straightforward approach: each functional unit had its own persist/unpersist procedure, which wrote a file in the pattern (moduleName)(concern)(version).(initials of the girl I was into) to disk. Changes were handled by switching between versions as well as by replaying a series of files, depending on the kind of data the module was working on.

The application itself was modular enough to follow DRY as well as possible, but due to the serialization code it was necessary to (deliberately) violate the Single Responsibility principle.

This approach may have a couple of drawbacks, but it is still used nowadays, primarily on lean embedded systems with limited storage and processing capabilities. Honestly, I haven't maintained an Oracle cluster on my dishwasher for quite a while.

Mid 90s: Centralized persistence provider and first steps towards IoC

Yes, I might actually have written the first tiny IoC container in Borland Pascal in 1995, when I was no longer sufficiently satisfied with the first approach. Still, I had to take care of setting up an appropriate file format for all events that might occur, including I/O failures. The approach worked perfectly, but as an electronics student I was asking myself what the big boys would have done. A first step was converting the binary configuration into an easy-to-read XML file.

As with the solution above, I confess that I still regularly use XML configuration, especially because of its simplicity and the good tooling around XML, such as a lean way of generating deployment-specific XML config by applying XSLT stylesheets.

XML had one big disadvantage, though: one cannot simply append new data to an existing file without either violating XML syntax or leaving the I/O in a risky state for an unacceptable period of time, so I had to introduce a log-and-replay mechanism, ending up with very high effort for basic work.

Late 90s: Enter SQL databases

Starting the first bigger projects with C/C++, I found that the approach above was fine for many matters and resilience purposes, but the freebies provided by SQL databases were very tempting. So I began to put all persistent data into the SQL database, abandoning the pretty (but platform-dependent) approach of serializing Pascal records and C data structures to binary or XML, which appeared to be a good trade. Once indexing and sophisticated searching and aggregation of big datasets came into play, it became uneconomical to implement that low-level data-centric functionality myself.

Using a database for saving configuration and process data frees the developer from dealing with most of the pitfalls of loading and saving data himself, but not without asking for some things in exchange:

We add complexity to the project, as we have to deal with importing to and exporting from an external unit that approaches persistence in a different way than our application does.
Testing is painful - testing the persistence units makes it necessary to set up the expected state of the database at every step of the roundtrip.
The code gets bloated with SQL-like structures, regardless of whether plain SQL or an abstraction language is used.
Deployment gets more complex, as a simple file copy may not suffice to set up the database.
Having configuration and data which might be provided by users in the same space increases the need for additional security measures.

These lines from one of my projects from the early 2000s, which already include a lot of effort to reduce the SQL bloat, may demonstrate what I am talking about:

std::string szSqlBeginCallLog = mSql->buildSqlInsertString();

mSql->addInsertStringPair(szSqlBeginCallLog, GSM_LOCATIONAREA, mSql->sanitize(connection->lac));
//…
mSql->insert(szSqlBeginCallLog, &report);

The abstraction code was actually quite sophisticated for a C++ program; nevertheless it was SQL-centric and felt “foreign” in an object-oriented application, not to mention the performance impact caused by the constant string assembly.

In my Java projects, prepared statements came into play, which were a noticeably good step towards building a facade around the fact that SQL neither wants nor attempts to mix with the principles of OO very well. It became possible to provide an SQL snippet and fill its placeholders in subsequent instructions, which mixed significantly better with OO and made deduplication and testing a lot more comfortable.

PreparedStatement pstmt = connection.prepareStatement(
    "UPDATE tblCallLog SET " +
    "duration = ?, " +
    //…
    "maxGpsActivationDistance = ? " +
    "WHERE idCallLog=" + this.kdbIndex);
pstmt.setInt(1, item.duration);
//…
pstmt.setDouble(7, item.maxGpsActivationDistance);
pstmt.executeUpdate();

Early 2000s: Hibernate, we have been waiting for you.

Welcome Object Relational Mapping (ORM).

Provide a Java Bean, add some redundant XML mapping information and just tell a quite pragmatic API to persist those objects to the database. Fantastic, except that you still had to write SQL snippets in many cases, for example when you were searching for something particular and it was not possible to just process the search in memory due to scalability issues.

Fortunately, it soon became possible to configure Hibernate using Java annotations, which allowed hiding some more SQL from the developer's daily work. A lot of configuration clutter was also avoided by establishing a convention-over-configuration pattern.

import java.io.Serializable;
import java.util.Date;
import javax.persistence.*;

@Entity
public class Landing implements Serializable {

    @Id
    @GeneratedValue
    private Integer landingId;

    String airportCode;

    @Temporal(javax.persistence.TemporalType.DATE)
    java.util.Date dtmIssued;

    String strFlightNumber;

    @ManyToOne
    @JoinColumn(name = "aircraft_id")
    private Aircraft instOfAircraft;

    public Landing(String airportCode, Date dtmIssued) {
        this.airportCode = airportCode;
        this.dtmIssued = dtmIssued;
    }

    public Landing() {
    }

    // Lots of getters and setters, hopefully autogenerated

}

There still was one last trace of SQL in the game: queries. In the early days of Hibernate, it was common to perform queries by uglifying Java code with SQL snippets. Regardless of the fact that Hibernate was a big step forward, I was quite uncomfortable with strings having a major impact on the control flow of my application.

List result = session.createQuery(
    "from Airport airport where strAptCodeICAO = '" + strAptCodeICAO + "'")
    .list();

This is when the Criteria API of Hibernate came into play, which provided an internal Domain Specific Language which compiled to SQL. I am deliberately using the Hibernate syntax here, as the JPA equivalent includes additional boilerplate which would make it less trivial to read the code below.

Criteria crit = session.createCriteria(Airport.class)
    .add(Restrictions.eq("strAptCodeICAO", strAptCodeICAO));
crit.setMaxResults(1);
List lstResult = crit.list();

Using an ORM, the developer gained a lot of advantages:

No relational breach - Just think in Objects and Repositories, the framework will eventually take care of the database
No strings affecting the control flow of the software
Syntax Checking and completion
The framework brings its own support for concurrency concerns, such as locking and versioning (a minimal sketch follows after this list).
The framework brings caching, so various use cases actually run faster than they would with plain SQL for the same effort.
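To illustrate the locking and versioning point, here is a minimal sketch of optimistic locking with a JPA/Hibernate version field. The entity fields shown here are my own illustration, not taken from an actual project.

import java.io.Serializable;
import javax.persistence.*;

@Entity
public class Aircraft implements Serializable {

    @Id
    @GeneratedValue
    private Integer aircraftId;

    // The ORM increments this column on every update and rejects concurrent
    // writers whose snapshot is stale, typically with an OptimisticLockException.
    @Version
    private Integer version;

    private String registration;

    // getters and setters omitted
}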

The disadvantage is that, under the hood, there still is an RDBMS.

Agility and Relational Databases

Relational databases and agile do not seem like best friends, which might simply be a matter of age: relational databases have been around for a long time, while the tendency to embrace change and iterative work is much, much younger. When dealing with ORMs and change/feature requests, questions concerning schema generation and updates aren't far away.

Initial Database Setup

When starting a project that uses an ORM, the team usually decides on one of the approaches below to initially create the database:

Code-First - Generate the database structure from your code and mapping information.
Database-First - Match the code to a new or existing database
Big Generator - Generate both code and database by an external generator

Code-first appears to be the most attractive option for developers - just write the code, add some mapping information and let the ORM take care of the database maintenance. As I will mention in the next section, this approach may have several drawbacks in terms of performance and compatibility. Database-first, on the other hand, usually results in code with broken naming conventions or lots of mapping information, combined with high testing and integration effort when schema or code are updated. Generators that take care of both DB and code, in my opinion, combine the cons of the other approaches, and, let's be honest - who likes generated code?
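As a rough sketch of what code-first looks like in practice with Hibernate (the connection URL and class names are assumptions for illustration; hibernate.hbm2ddl.auto is the standard schema-generation switch):

import org.hibernate.SessionFactory;
import org.hibernate.cfg.Configuration;

// "update" lets Hibernate create missing tables/columns on startup;
// many teams switch to "validate" (or external migrations) in production.
Configuration cfg = new Configuration()
        .addAnnotatedClass(Landing.class)
        .addAnnotatedClass(Aircraft.class)
        .setProperty("hibernate.connection.url", "jdbc:postgresql://localhost/flightlog")
        .setProperty("hibernate.hbm2ddl.auto", "update");

SessionFactory sessionFactory = cfg.buildSessionFactory();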

Updates

Updates implying changes to the database structure and/or the mapping code have not been addressed to perfection yet. We have developed several ways to deal with them, including migration scripts, database version management (such as Liquibase) and even auto-updates in code-first applications. Even if all of them have a realistic chance to eventually succeed, they all add complexity and risk to your deployment pipeline.

Performance and ORMs

The only aspect in which ORMs had (and, in my opinion, still have) considerable drawbacks is performance tuning. The complexity increases seriously when addressing database-specific functionality such as indexes, partitioning or even triggers and stored procedures, and it also becomes more difficult for developers with different SQL skills to collaborate once the first change requests drop in.

So, putting the gray hat on, performance and ORMs means:

Having the risk of the ORM building bad SQL or taking a trivial approach which does not take advantage of optimizations offered by the DBMS.
Configuring database performance in your mapping XML or via annotations - and going back to your SQL editor to check whether the indexes are actually there and survived the last migration. Welcome to XML or annotation hell. (A small sketch of index configuration via annotations follows after this list.)
Using a database-first approach, so you can configure indexes with your favourite SQL toolstack, and carefully matching your code by hand or using a tool which builds the code from the database. Either way, you end up with generated code which does not necessarily meet your style and quality requirements, or with additional effort to match and test your database structure against your mapping code.
Putting additional effort into avoiding failed migration cycles when your migration touches indexes or partitions on LARGE databases and your server keeps hitting the timeout.
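One way to keep at least simple indexes close to the code is the JPA 2.1 @Index annotation; a sketch based on the call-log table from the earlier example (index and field names are illustrative, not from the original project):

import javax.persistence.*;

@Entity
@Table(name = "tblCallLog",
       indexes = @Index(name = "idx_calllog_duration", columnList = "duration"))
public class CallLog {

    @Id
    @GeneratedValue
    private Integer idCallLog;

    private Integer duration;

    // Partitioning, triggers and stored procedures still live outside the
    // mapping and have to be maintained elsewhere, e.g. in migration scripts.
}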

Tuning performance with an ORM always means acting very carefully and not being afraid to sidestep known patterns. In other words, dealing with performance and ORMs is likely to seriously increase the complexity of a persistence unit, reducing the comfort benefits that made the developer go for an ORM in the first place.

Double-checking the ORM approach

When developing an application which deals with data tailored for relational databases, an ORM just feels like an interface to a database driver, so one might simply forget about everything under the surface. As soon as the data structures become more complex and dynamic, resulting in additional effort to comply with the given environmental limits, the following concern may be raised:

"Why do I have to put multiple layers of code on top of my persistency unit instead of getting it right in the first place?"

The answers were simple:

Management/Customer/your girlfriend enforces technology constraints in terms of the database system used
Legacy code which is tightly coupled with a certain database system
External reporting solutions or any other poor interface which directly consumes the database and violates the Highlander principle
Lack of proven, mature alternative solutions with powerful toolchains
Experience and acceptance of the status quo, combined with lack of resources to investigate something new
And, as always: "You never get fired for buying Oracle/MS SQL Server."

The above results in a set of constraints regarding parts of the machinery, making it necessary to adapt to them. If there is no real fit for your purpose, the engineer has to make something else fit, which, in this case, is an object-relational mapper.

Since the statements above were, I confess, quite reasonable for 2008/2009, let us skip ahead a few years.

2009-2015: Graph and Document Databases (lumped together with others as “NoSQL”)

In the meantime a lot of great projects had evolved, and on several customer projects I had the chance to deploy mid- and large-scale solutions which employed document- and graph-oriented databases (Cassandra, MongoDB, and Neo4j) as a partial or even the sole persistence unit.

Persisting an Object in, for example, MongoDb consisted of nothing more than saving a serialized instance to a document store:

var collection = database.GetCollection("Orders");

ProductionOrder po = new ProductionOrder() {
    InsertDate = DateTime.UtcNow,
    Title = "Bewijs",
    Article = article,
};

collection.Insert(po);

Querying is possible in many ways. For example, in the MongoDb API for C# it is simply possible to retrieve an IQueryable from a collection, so one can use the expression language known from querying other collections such as Lists or Dictionaries (Maps).

var results = collection.AsQueryable()
    .Where(po => po.Title == "Bewijs");

The first thing that catches the eye: Both insert and query operations use a testable and comfortable DSL out of the box.

There was also news on the ORM side, such as the Hibernate port to .NET (NHibernate) and the Microsoft approach to ORM, known as Entity Framework (EF). On a bigger C#/.NET project I was not satisfied with either of them, so I wrote my own ORM with a NoSQL-centric approach which employed an RDBMS under the hood.

In 2013 I began to play with MongoDB and even deployed a first mid-scale project which used a MongoDB cluster as the only persistence provider; in 2015 I became a certified MongoDB developer.

I considered Neo4j and MongoDB a massive step forward, as they addressed my question regarding the need to wrap an RDBMS in thousands of lines of adaptation code instead of getting it right in the first place.

It's almost like choosing a new stereo - if the sound in the default setting is inappropriate and you need multiple tweaks to make it acceptable, I am sure this is not the combination that will close the deal. But if the pure essence of the product, in this case a database engine, feels just like the data structures you are actually working with and only needs a few domain-specific adjustments which actually move you forward, chances are much better.

In the project I basically built around Neo4j, I managed my data as graphs, relations/edges and attributes in the first place, so my data storage did not appear as a piece of foreign tissue; instead it was a native and natural part of the project, enabling me to focus on the actual problem instead of caring about how to encode/decode my data.
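A minimal sketch of what that feels like with the Neo4j embedded Java API (3.x-era syntax; the labels, property names and the graphDb instance are assumptions for illustration, not the original project's model):

import org.neo4j.graphdb.*;

// graphDb is an already started GraphDatabaseService instance.
try (Transaction tx = graphDb.beginTx()) {
    Node airport = graphDb.createNode(Label.label("Airport"));
    airport.setProperty("icao", "EDKA");

    Node landing = graphDb.createNode(Label.label("Landing"));
    landing.setProperty("flightNumber", "ABC123");

    // The relationship/edge is a first-class citizen and may carry attributes itself.
    Relationship r = landing.createRelationshipTo(airport, RelationshipType.withName("LANDED_AT"));
    r.setProperty("runway", "25");

    tx.success();
}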

The same applies to a project I used MongoDB for: a document store with heavy use of the aggregation pipeline was quite close to the actual work I was doing.

Combined with an in-memory key-value store for mid-term data (first memcached, later Redis), I handled multiple projects in a row which did not use an RDBMS at all.

2014: Blurring the edges

In Germany, we call something a “Paukenschlag” (literally a timpani stroke) when it is groundbreaking and appears to be significantly different from the current approaches to our challenges.

I had written a tiny ORM in 2010 which used a text column in a database table to store additional JSON data while, as a good ORM should, hiding that fact from the developer. I never was satisfied with the migration/auto-migration features of the ORMs on the market, primarily due to their poor scalability and design.
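The core trick was nothing more than this condensed, hypothetical sketch (table, column and JSON content are made up; the original used its own serializer): fixed, queryable attributes get real columns, everything else travels as JSON text.

import java.sql.PreparedStatement;

// "extra" holds everything that has no dedicated column; the ORM serialized and
// deserialized it transparently, so callers never saw the JSON.
// connection is an open java.sql.Connection.
String extra = "{\"names\":{\"de\":\"Aachen\",\"nl\":\"Aken\"}}";

try (PreparedStatement ps = connection.prepareStatement(
        "INSERT INTO airports (iata_code, extra_json) VALUES (?, ?)")) {
    ps.setString(1, "AAH");
    ps.setString(2, extra);
    ps.executeUpdate();
}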

About two years later, PostgreSQL came with JSON support, which, at first glance, just seemed like a trivial text column, but:

All Queries were able to refer to JSON paths
JSON paths could be added to indexes, just like "normal" columns.

A table in PostgreSQL could be defined as:

CREATE TABLE "tblAirports" (
  id uuid NOT NULL,
  icao_code character varying(5),
  iata_code character varying(3) NOT NULL,
  data json NOT NULL,
  CONSTRAINT "tblAirports_pkey" PRIMARY KEY (id)
)

After inserting some test data into our anything-but-complete airport database…

EDKA  AAH  {"names" : {"de" : "Aachen", "nl" : "Aken"}}
EKCH  CPH  {"names" : {"de" : "Kopenhagen", "da" : "København"}}

… one could place a query such as:

select iata_code, data->'names'->'de' as name from public."tblAirports"

// Result
"iata_code";"name"
"AAH";"Aachen"
"CPH";"Kopenhagen"

Or this way, if you prefer the location names in Danish. Notice that AAH does not have a Danish name in the example above, so the column remains empty.

select iata_code, data->'names'->'da' as name from public."tblAirports"

// Result
"iata_code";"name"
"AAH";""
"CPH";"København"

PostgreSQL provides an expression language which allows multiple ways to traverse a JSON document, pursuing an approach close to XPath for XML. A data row may have multiple JSON columns and even no “classic” columns at all.
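Since the JSON path behaves much like a normal column, it should also be indexable and queryable from application code; here is a hedged sketch via plain JDBC (index and variable names are mine, not from an actual project; connection is an open java.sql.Connection):

import java.sql.*;

// Expression index on the German name inside the JSON document.
try (Statement st = connection.createStatement()) {
    st.execute("CREATE INDEX idx_airport_name_de ON \"tblAirports\" ((data->'names'->>'de'))");
}

// Parameterized query against the same JSON path.
try (PreparedStatement ps = connection.prepareStatement(
        "SELECT iata_code FROM \"tblAirports\" WHERE data->'names'->>'de' = ?")) {
    ps.setString(1, "Aachen");
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            System.out.println(rs.getString("iata_code")); // prints AAH
        }
    }
}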

So, what do I like about that approach?

It allows document-oriented storage enriched with relational features, which provides sufficient flexibility to cope with data modeled in either a relational or a document-oriented way.
Document storage might be the way our databases will work in 10 years, so it is no bad idea to introduce it step by step.
If I were a marketer, I would sell it as the performance of an RDBMS combined with the flexibility and time-to-market of any shiny "NoSQL" database. If I had to report to a superior, this might suffice to justify the missing expenses for an Oracle license.
I am able to respond to the deal-breaker statements expressed in 2008.
No real need for ORMs anymore; we can live quite well with basic abstraction stacks, dropping a lot of overhead and glue code. And even though I have worked with ORMs since the early 2000s and even programmed several myself (actually pursuing the same approach in C++, Java and C#), I honestly consider them an unnecessary piece of software.
Talking about ORMs - no need for migration support. Enough said.

What’s now?

I basically had to make this decision for two projects for different customers/employers. Depending on the experience and skill levels of the other developers involved, I decided on two approaches:

First one: I chose PostgreSQL for the reasons above, combined with Redis (bye-bye memcached, we had a good time) for authentication and session data. An abstraction stack (about 400 lines of Java code) was sufficient to hide away the SQL hell and provide an API which feels like MongoDB (a rough sketch of such an API surface follows after this list).
Second one: We updated to the latest MySQL version and used my own ORM, which actually emulates the behaviour of PostgreSQL. As soon as the .NET connector passes our pretests, a smooth change to PostgreSQL should not be a problem.
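A rough sketch of what the surface of such a thin abstraction might look like (class and method names here are invented for illustration; the real stack is obviously more complete):

import java.util.List;

// Hypothetical facade: callers think in documents and collections, while the
// implementation speaks PostgreSQL (JSON columns) or the MySQL-based emulation.
public interface DocumentStore {

    // Persist a POJO; the implementation decides which attributes become real
    // columns and which end up in the JSON payload.
    <T> void save(String collection, T document);

    // Query by a JSON path, e.g. find("airports", "names.de", "Aachen", Airport.class).
    <T> List<T> find(String collection, String jsonPath, Object value, Class<T> type);
}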

What’s next?

As software engineering and ops are changing very fast these days, any estimate of the path we will take over the next few years carries a certain amount of inaccuracy - which I consider quite enjoyable.

I think that in 10 years we will still be able to choose between multiple approaches to DBMS - relational, graph and document-oriented - without being biased toward any of them, so we will simply use the solution which does the job best.
