Data Serialization Frameworks (not only) in Java

Given an overall systems architecture or infrastructure that ticks the “IoT” box, there will probably be a place where data transfer volume and message size come into play, for instance when constrained devices use a potentially unreliable or expensive connection, such as a cellular data link. An embedded monitoring device whose purpose is to deliver real-time telemetry to the core system of a car manufacturer will quickly reach a point where JSON encoding consumes more computing power than the actual job.

In those cases, I often prefer to take a step back from fancy human-readable protocols towards size-aware binary protocols. On the other hand, even though I have implemented countless proprietary binary protocol providers and consumers, I would not roll my own today, especially when this would mean writing the same thing for multiple languages in the very common case of polyglot architectures, such as Java on the server side and Golang on the device side.

How Data Serialization Frameworks work

At this point, data serialization frameworks enter the stage. What they have in common is that they provide APIs for multiple languages (usually C/C++, Java, Python, Golang) and sometimes their own (optional and out of scope here) TCP/IP client/server implementation.

Schema or Schemaless

During the evaluation of several alternatives, three different types will be encountered:
  • Interface definition driven: Client and server code, including data structures and de/serialization, is derived from a definition written in a meta-language, the “interface definition language” (IDL).
  • Schemaless: The framework just provides a source and sink for information; client and server need to be aware of the sequence and meaning of a given transmission.
  • Extended schemaless: A schemaless framework is extended with custom functionality to make it schema/structure-aware again, for instance when it is used as a workhorse for a library such as Jackson.

As always, there is no silver bullet here; the correct choice depends on the technical environment. When the aim is simply to save some bytes on an existing (de)serialization with Jackson, exchanging the JSON backend for its Protobuf or CBOR counterpart provides a quick improvement for almost no effort. When different provider/consumer versions with complicated schema evolutions are expected, it might make sense to use the schema and migration support of an IDL-driven framework. And when it is preferred to handle marshalling in one's own code, or simply to work without a schema to save library overhead, schemaless might be a compelling option.

Expected reduction of transmitted data

Depending on the payload transferred, binary serialization saves between 40% and 70% compared with XML or JSON due to the missing markup overhead and formatting. This article will draw a comparison between several options, including JSON, later.

A walkthrough

Given a system which reports telemetry from a new engine family of your friendly local automobile manufacturer via, for example, Google Protocol Buffers, a typical “hello world” example would consist of:
  1. Schema/IDL design
  2. Client code generation
  3. Message composition and serialization
  4. Message transfer
  5. Deserialization

Schema design

In most frameworks, the IDL formats look pretty similar to common programming languages; Protocol Buffers, for example, has certain similarities to C-style languages:
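A hypothetical schema for the engine telemetry example could look like this (all field and message names are illustrative, not taken from a real product):

```proto
syntax = "proto3";

message Telemetry {
  string uuid = 1;                 // device identifier, serialized as a string here
  repeated CurvePoint points = 2;  // e.g. a torque curve with 32 points
}

message CurvePoint {
  uint32 rpm = 1;
  uint32 torque = 2;
}
```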
 

Client code generation

To use the data structures defined in our interface definition in actual applications, code for the corresponding target platform (e.g. Java) has to be generated. This purpose is served by a command-line application, but in production this step is usually integrated into the build process, for example by using a Maven plugin, a stage in a Makefile, or at least a go-generate declaration. Depending on the overall project setup, the task of generating client code could also be performed on a CI server and pulled in as a standard dependency later.

This is the first place to mention that data serialization frameworks are decidedly non-opinionated with regard to how to integrate them into any given structure, which provides flexibility and, on the other hand, requires some thought regarding the architecture.

For Protocol Buffers, the most basic CLI call would be:
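Assuming the schema file is called telemetry.proto (an illustrative name), the protoc compiler generates the Java sources in one call:

```shell
protoc --java_out=src/main/java telemetry.proto
```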

A more comfortable way is to include the code generation in a Maven build by using a generator plugin:
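One commonly used option is the third-party protobuf-maven-plugin; a minimal configuration sketch (version is an assumption, the plugin picks up .proto files from src/main/proto by default):

```xml
<plugin>
  <groupId>org.xolstice.maven.plugins</groupId>
  <artifactId>protobuf-maven-plugin</artifactId>
  <version>0.6.1</version>
  <executions>
    <execution>
      <goals>
        <!-- generates Java sources from src/main/proto during the build -->
        <goal>compile</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```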
 

Message Composition and Serialization

The generated Java code usually includes builder patterns, so in many cases it is no longer necessary to use libraries such as Immutables or Lombok. Initializing a Java version of the Telemetry class defined in the IDL above could be performed as follows:
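A sketch, assuming classes Telemetry and CurvePoint generated from a schema with a uuid field and a repeated curve point field (the setter and adder names depend on the actual schema; newBuilder() and toByteArray() are part of the protobuf-java runtime):

```java
// Compose the message via the generated builder and serialize it.
Telemetry.Builder builder = Telemetry.newBuilder()
        .setUuid(java.util.UUID.randomUUID().toString());
for (int i = 0; i < 32; i++) {
    builder.addPoints(CurvePoint.newBuilder()
            .setRpm(i * 250)       // illustrative sample values
            .setTorque(180 + i)
            .build());
}
byte[] proto = builder.build().toByteArray();
```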
 

 

The “proto” byte array includes the fully serialized and ready-to-transfer class instance:

 

Compared to the same information represented by JSON, capped to the same message size, it becomes obvious that there is a certain difference in message sizes:

While the binary version above already includes almost all 32 curve points defined in the loop above, the JSON version stops after the 4th point. Using a more efficient way of serializing the UUID at the beginning of the telegram would have increased the difference even further.

Message Transfer and Deserialization

After transfer, the byte array can be converted back into the original object in the same or any other language and on any architecture, for example in Java:
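A sketch, again assuming a generated Telemetry class; parseFrom is the standard decoding entry point of the protobuf-java runtime:

```java
// Rebuild the object from the received bytes; this throws an
// InvalidProtocolBufferException if the data does not fit the schema.
Telemetry decoded = Telemetry.parseFrom(proto);
```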

 

 

Integration Considerations

A common trait of all the serialization frameworks discussed here is that they are not opinionated about anything beyond the core task of (de)serializing a data structure to binary, which means there are a few things to address during the selection and integration phase.

In exchange, it is trivial to integrate them into any messaging system, such as:
  • TCP/IP raw sockets
  • Messaging: MQTT, AMQP, Kafka, …
  • Encapsulated in machine protocols (e.g. OPC UA)
  • XMPP, RSS, …
  • Shared memory

Size optimization versus access costs

When a field is declared as a uint32, it is usually expected to end up in the serialized data exactly as defined. Depending on the original use case (or configuration) of the serializer, certain optimizations apply, such as:

  • Reducing the encoded size of a field if it contains a value which can be expressed with less bytes
  • Padding a field to an expected size to allow cheap and random access.

In Protocol Buffers, the output message is automatically reduced if the original value can be represented in fewer bytes:
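The mechanism behind this is base-128 varint encoding. The following self-contained sketch (not the library's actual code) shows the principle — each byte carries 7 payload bits, and the high bit flags whether more bytes follow:

```java
import java.io.ByteArrayOutputStream;

public class VarintDemo {
    // Protobuf-style base-128 varint: small values need fewer bytes.
    public static byte[] encodeVarint(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {
            // lower 7 bits, continuation bit set
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);  // final byte, continuation bit clear
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(encodeVarint(1).length);              // 1 byte
        System.out.println(encodeVarint(300).length);            // 2 bytes
        System.out.println(encodeVarint(1_000_000_000).length);  // 5 bytes
    }
}
```

A uint32 field therefore occupies between 1 and 5 bytes on the wire, depending on its value.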
 

In the default setting of Cap'n Proto, fields are padded to their maximum size, so a device which is interested in only a subset of a message can simply fetch that subset of the received data and save computing power on decoding.

Error detection

By default, there is no concept of error detection, such as checksums or signatures. If it is required, the developer has to take care of it after serializing the message.
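As a minimal sketch using only the Java standard library, a CRC32 checksum can be appended to the serialized payload before transfer and verified on the receiving side (the framing format here is an arbitrary choice, not part of any serialization framework):

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

public class ChecksumDemo {
    // Frame a serialized payload as: payload bytes + 4-byte big-endian CRC32.
    public static byte[] withChecksum(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload);
        return ByteBuffer.allocate(payload.length + 4)
                .put(payload)
                .putInt((int) crc.getValue())
                .array();
    }

    // Recompute the checksum over the payload and compare it to the trailer.
    public static boolean verify(byte[] framed) {
        ByteBuffer buf = ByteBuffer.wrap(framed);
        byte[] payload = new byte[framed.length - 4];
        buf.get(payload);
        int expected = buf.getInt();
        CRC32 crc = new CRC32();
        crc.update(payload);
        return (int) crc.getValue() == expected;
    }
}
```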
 
 
As long as the byte stream contains valid data which can be converted into the given structure, no error will be raised. If the transport is potentially unreliable, measures such as attaching a simple checksum or message digest should be taken.

No type announcements

The serialized messages don’t contain any information regarding their data type, so if multiple message types are transferred and the framework does not provide a substitute, such as the oneof feature of Protobuf, this has to be dealt with, too.
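With Protobuf, a common pattern is an envelope message whose oneof field announces the payload type; a sketch (the message names are illustrative):

```proto
message Envelope {
  oneof payload {
    Telemetry telemetry = 1;
    FirmwareStatus firmware_status = 2;
  }
}
```

The receiver then switches on the populated oneof case instead of guessing the type from the raw bytes.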

API comparison

Let’s show some code. In this example, a telemetry telegram will be composed and serialized into a byte array using the Java API of the given framework. The results will be compared in terms of message size and processing time, using JMH (Java Microbenchmark Harness).

To have a non-binary format to compare against, we use Jackson to serialize the test data to JSON.

Google Protocol Buffers

Protocol Buffers uses an interface definition language; for the demo use case, a representation could be:
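An illustrative schema for the telemetry telegram (names are assumptions):

```proto
syntax = "proto3";

message Telemetry {
  string uuid = 1;
  repeated CurvePoint points = 2;
}

message CurvePoint {
  uint32 rpm = 1;
  uint32 torque = 2;
}
```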
 

Encoding:
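With the generated classes, encoding boils down to the builder calls sketched below (hypothetical generated code; toByteArray() is the protobuf-java serialization call):

```java
Telemetry telemetry = Telemetry.newBuilder()
        .setUuid(java.util.UUID.randomUUID().toString())
        // ... add the curve points here ...
        .build();
byte[] payload = telemetry.toByteArray();
```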

Cap'n Proto

Cap'n Proto and Google Protocol Buffers have a lot of similarities, for the simple reason that they were designed by the same developer. Cap'n Proto was designed to be a faster and tidier alternative and successor to Protocol Buffers, which holds true in certain combinations and scenarios.

By design, Cap'n Proto also relies on interface definitions, which look just a little bit different from their Protobuf counterparts:
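An illustrative Cap'n Proto schema for the same telegram (the 64-bit file ID on the first line would normally be generated with `capnp id`):

```capnp
@0xdbb9ad1f14bf0b36;  # unique file ID

struct Telemetry {
  uuid @0 :Text;
  points @1 :List(CurvePoint);
}

struct CurvePoint {
  rpm @0 :UInt32;
  torque @1 :UInt32;
}
```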
 

Compiling the IDL to Java code is possible either by using the command line or by integrating a Maven plugin:

Building and serializing an object in Java, however, is more involved, which appears to originate from the (suspected) intention to give the Java implementation a feel comparable to its C/C++ counterpart:
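A sketch based on the capnproto-java runtime, assuming generated Telemetry/CurvePoint stubs (the factory and init method names depend on the schema):

```java
import java.io.ByteArrayOutputStream;
import java.nio.channels.Channels;

import org.capnproto.MessageBuilder;
import org.capnproto.Serialize;
import org.capnproto.StructList;

// Cap'n Proto builds the message inside an arena-style MessageBuilder.
MessageBuilder message = new MessageBuilder();
Telemetry.Builder telemetry = message.initRoot(Telemetry.factory);
telemetry.setUuid(java.util.UUID.randomUUID().toString());

// Lists must be initialized with their final size up front.
StructList.Builder<CurvePoint.Builder> points = telemetry.initPoints(32);
for (int i = 0; i < 32; i++) {
    points.get(i).setRpm(i * 250);
    points.get(i).setTorque(180 + i);
}

ByteArrayOutputStream out = new ByteArrayOutputStream();
Serialize.write(Channels.newChannel(out), message);
byte[] payload = out.toByteArray();
```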

Apache AVRO

Apache Avro can operate both in schemaless and in schema-driven mode; for a better comparison, this report will focus on the schema-driven way.
Like in Protobuf or Cap'n Proto, an IDL is compiled into platform-specific stubs, with the comfort of a Maven plugin taking care of the Java side.
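An illustrative Avro schema (.avsc) for the telemetry record (field names and namespace are assumptions):

```json
{
  "type": "record",
  "name": "Telemetry",
  "namespace": "de.example.telemetry",
  "fields": [
    {"name": "uuid", "type": "string"},
    {"name": "points", "type": {"type": "array", "items": {
      "type": "record",
      "name": "CurvePoint",
      "fields": [
        {"name": "rpm", "type": "int"},
        {"name": "torque", "type": "int"}
      ]
    }}}
  ]
}
```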
 
 
Maven plugin:
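The official avro-maven-plugin generates the Java classes during the build; a minimal configuration sketch (version and paths are assumptions):

```xml
<plugin>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-maven-plugin</artifactId>
  <version>1.11.3</version>
  <executions>
    <execution>
      <phase>generate-sources</phase>
      <goals>
        <!-- compiles .avsc files into Java classes -->
        <goal>schema</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```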
 

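Building and serializing a record with the generated class could then look like this (a sketch, assuming a generated Telemetry class and a List of CurvePoint instances built beforehand):

```java
import java.io.ByteArrayOutputStream;

import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.specific.SpecificDatumWriter;

// Avro separates the datum (the object) from the encoder (the wire format).
Telemetry telemetry = Telemetry.newBuilder()
        .setUuid(java.util.UUID.randomUUID().toString())
        .setPoints(points)  // a List<CurvePoint> composed beforehand
        .build();

ByteArrayOutputStream out = new ByteArrayOutputStream();
DatumWriter<Telemetry> writer = new SpecificDatumWriter<>(Telemetry.class);
BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
writer.write(telemetry, encoder);
encoder.flush();
byte[] payload = out.toByteArray();
```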
 

 

Concise Binary Object Representation (CBOR)

CBOR is often neglected, but it is a competitive option because of its low memory footprint and wide platform support. CBOR itself does not provide schema-driven operation, but it is often used as a simple serialization workhorse in frameworks such as Jackson.

Serializing an object, such as the engine telemetry example used here, without a schema is straightforward:
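A sketch using Jackson's CBOR backend (requires the jackson-dataformat-cbor module; Telemetry is assumed to be a plain POJO here):

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.dataformat.cbor.CBORFactory;

// Exactly the same Jackson API as for JSON, only the factory differs.
ObjectMapper mapper = new ObjectMapper(new CBORFactory());
byte[] payload = mapper.writeValueAsBytes(telemetry);
```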

 

Decoding the message requires knowledge of the schema used for encoding and works the same way in reverse.

Performance and Message Size Comparison

While message sizes are easily comparable, performance measurements have limited applicability due to the mass of combinations which could occur. A certain binary serialization framework may have superior performance when serializing with the Java implementation and poor performance when deserializing the message with, for example, MicroPython on an ESP. To have common ground here, the observed unit is the serialization time of the engine telemetry above in Java, measured with JMH (Java Microbenchmark Harness).
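The measurements follow the usual JMH pattern; a sketch of one benchmark method (buildTelemetry() is a hypothetical helper composing the test message, and the serialization call depends on the framework under test):

```java
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;

public class SerializationBenchmark {

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    public byte[] protobufEncode() {
        // Returning the result prevents JMH dead-code elimination.
        return buildTelemetry().toByteArray();
    }
}
```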
 
Technology              | Message size [bytes] | Encoding time [ns]
JSON (with Jackson)     | 1537                 | 69000
Java Serialization      | 1022                 |  5000
Google Protocol Buffers |  817                 |  2700
Cap'n Proto             |  664                 |  5280
CBOR                    |  509                 | 14000
AVRO                    |  577                 |  5700

JSON, obviously, is very large and slow due to the heavy string processing involved, which is even more pronounced when deserializing JSON. Plain Java binary serialization was acceptable and did not require any additional library, but lacks interoperability.

The multi-language binary serialization frameworks delivered comparable performance in terms of computing time, but AVRO stands out with its small message size even before any optimization.

Which one to use?

The easiest and most correct answer is: “It depends”. When talking exclusively to Java clients with considerable processing power, the choice may largely be driven by the features and comfort a certain solution delivers, such as the lightweight client/server implementations of gRPC. It may also be an option not to bother with any of them if no change in communication peers is expected during the entire product lifetime.

When it is foreseeable that constrained devices (let’s call them IoT devices) are involved and size and speed are worth considering, I would recommend playing around with the options above and checking how they integrate into the overall architecture. Personally, I have mostly ended up with Protobuf, although for a current project involving messaging between a Java server and Golang workers via RabbitMQ, AVRO was the better option.

dreese.de v8 (c)1994-2018 Manfred Dreese