Data Serialization Frameworks (not only) in Java

In any systems architecture or infrastructure that gets the “IoT” box ticked, there will probably be a place where data transfer and message size have to be taken into account, for instance when constrained devices use a potentially unreliable or expensive link such as a cellular data connection. An embedded monitoring device that delivers real-time telemetry to the core system of a car manufacturer, for example, will quickly reach the point where JSON encoding demands more computing power than the actual job.

In those cases, I often prefer to step back from fancy human-readable protocols to size-aware binary protocols. On the other hand, even though I have implemented countless proprietary binary protocol providers and consumers, I would not implement my own today, especially when this would mean writing the same thing for multiple languages in the very common case of a polyglot architecture, such as Java on the server and Golang on the device side.

How Data Serialization Frameworks work

At this point, data serialization frameworks enter the stage. What they have in common is that they provide APIs for multiple languages (usually C/C++, Java, Python, Golang) and sometimes their own TCP/IP client/server implementation, which is optional and out of scope here.

Schema or Schemaless

When evaluating the alternatives, three different types will be encountered:
  • Interface definition driven: Client and server code, including data structures and (de)serialization, is derived from a definition written in a meta-language, the “Interface Definition Language” (IDL).
  • Schemaless: The framework just provides a source and a sink for information; client and server need to be aware of the sequence and meaning of the fields in a given transmission.
  • Extended schemaless: A schemaless framework is extended with custom functionality to make it schema/structure-aware again, for instance when it is used as a workhorse for a library such as Jackson.
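The schemaless variant can be sketched with nothing but the JDK. This is a minimal illustration, not the API of any of the frameworks discussed here: the class name and the three fields (engine id, rpm, temperature) are assumptions, and the only "schema" is the order in which writer and reader touch the fields.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical schemaless exchange: the agreed field order is the only contract.
final class SchemalessSketch {

    // Writer side: emits engine id, rpm and temperature in a fixed sequence.
    static byte[] write(String engineId, int rpm, double temperature) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(sink)) {
            out.writeUTF(engineId);       // 2 length bytes + UTF-8 bytes
            out.writeInt(rpm);            // 4 bytes
            out.writeDouble(temperature); // 8 bytes
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    // Reader side: has to read exactly the same sequence, nothing in the
    // payload tells it what the bytes mean.
    static String read(byte[] payload) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(payload))) {
            String engineId = in.readUTF();
            int rpm = in.readInt();
            double temperature = in.readDouble();
            return engineId + ";" + rpm + ";" + temperature;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] payload = write("V8-001", 3200, 91.5);
        System.out.println(payload.length + " bytes: " + read(payload));
    }
}
```

If reader and writer ever disagree on the sequence, the reader will still decode something, just not what was sent, which is exactly the trade-off of working without a schema.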

As always, there is no silver bullet here; the correct choice depends on the technical environment. When the aim is simply to save some bytes on an existing (de)serialization with Jackson, swapping the JSON backend for its Protobuf or CBOR counterpart provides a quick improvement for almost no effort. When different provider/consumer versions with complicated schema evolutions are expected, it might make sense to use the schema and migration support of an IDL-driven framework. And when it is preferred to handle marshalling in your own code, or simply to work without a schema to save library overhead, schemaless might be a compelling option.

Expected reduction of transmitted data

Depending on the payload transferred, binary serialization saves between 40% and 70% compared with XML or JSON because it avoids their structural overhead and formatting. This article will draw a comparison between several options, including JSON, later.

A walkthrough

Given a system which reports telemetry from a new engine family of your friendly local automobile manufacturer via, e.g., Google Protocol Buffers, a typical “hello world” example would consist of:
  1. Schema/IDL design
  2. Client code generation
  3. Message composition and serialization
  4. Message transfer
  5. Deserialization

Schema design

In most frameworks, the IDL formats look quite similar to common programming languages; Protocol Buffers, for example, resembles C-style languages:
 

Client code generation

To use the data structures defined in our interface definition in actual applications, code for the corresponding target platform (e.g. Java) has to be generated. This is done by a command-line application, but in production it is usually integrated into the build process, for example by using a Maven plugin, a stage in a Makefile or at least a go-generate declaration. Depending on the overall project setup, the task of generating client code could also be performed on a CI server, with the result pulled in as a standard dependency later.

This is the first place to mention that data serialization frameworks are vehemently unopinionated about how to integrate them into any given structure, which provides flexibility but, on the other hand, requires some thought regarding the architecture.

For Protocol Buffers, the most basic CLI call would be:

A more comfortable way would be to include the code generation in a Maven build by using the generator plugin:
 

Message Composition and Serialization

The generated Java code usually includes builder patterns, so in many cases libraries such as Immutables or Lombok are no longer required. Initializing a Java version of the Telemetry class defined in the IDL above could be performed as follows:
 

 

The “proto” byte array includes the fully serialized and ready-to-transfer class instance:

 

Compared to the same information represented by JSON, capped to the same message size, it becomes obvious that there is a certain difference in message sizes:

While the binary version above already includes almost all 32 curve points defined in the loop above, the JSON version stops after the 4th point. Using a more efficient way of serializing the UUID in the beginning of the telegram would have increased the difference even further.

Message Transfer and Deserialization

After transfer, the byte array can be converted back into the original object in the same or any other language and on any architecture, for example in Java:

 

 

Integration Considerations

Common to all known serialization frameworks is that they are not opinionated about anything beyond the core task of (de)serializing a data structure to binary, which means that there are a few things to address during the selection and integration phase.

In exchange, it is trivial to integrate them into any messaging system, such as:
  • TCP/IP raw sockets
  • Messaging: MQTT, AMQP, Kafka, …
  • Encapsulated in machine protocols (e.g. OPC UA)
  • XMPP, RSS, …
  • Shared memory

Size optimization versus access costs

When a field is declared as a uint32, it is usually expected to end up in the serialized data exactly as defined. Depending on the original use case (or configuration) of the serializer, certain optimizations apply, such as:

  • Reducing the encoded size of a field if it contains a value which can be expressed with fewer bytes
  • Padding a field to an expected size to allow cheap and random access.

In Protocol Buffers, the encoded output is automatically shortened if the original value can be represented in fewer bytes:
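The mechanism behind this is the base-128 variable-length integer ("varint"): each byte carries 7 payload bits, and the most significant bit signals that another byte follows. The following is a minimal hand-written sketch of that encoding for illustration; the class and method names are assumptions and this is not the Protobuf library API.

```java
import java.io.ByteArrayOutputStream;

// Illustrative base-128 varint codec, as used for integer fields in Protocol Buffers.
final class VarintSketch {

    // Encodes an unsigned value: 7 bits per byte, high bit set while more bytes follow.
    static byte[] encode(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
        return out.toByteArray();
    }

    // Decodes a single varint from the start of the given bytes.
    static long decode(byte[] bytes) {
        long result = 0;
        int shift = 0;
        for (byte b : bytes) {
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break; // high bit clear: last byte of this varint
            }
            shift += 7;
        }
        return result;
    }

    public static void main(String[] args) {
        // A uint32 holding 1 needs one byte, 300 needs two, the maximum needs five.
        System.out.println(encode(1).length);
        System.out.println(encode(300).length);
        System.out.println(encode(4294967295L).length);
    }
}
```

The price of the compact form is that a reader cannot jump to a field by offset; it has to decode the preceding fields to know where the next one starts.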
 

In the default setting of Cap'n Proto, fields are padded to their maximum size, so a device which is only interested in a subset of a message can simply fetch that subset from the data received and save the computing power otherwise spent on decoding.
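The principle of fixed-width fields can be sketched with a plain ByteBuffer. The layout below (offsets, field names, little-endian order) is a made-up example for illustration and does not reproduce the actual Cap'n Proto wire format; it only shows why padded fields allow a reader to pick one value without decoding the rest.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical fixed layout: every field lives at a known offset.
final class FixedLayoutSketch {
    static final int RPM_OFFSET = 0;          // 4 bytes
    static final int PRESSURE_OFFSET = 4;     // 4 bytes
    static final int TEMPERATURE_OFFSET = 8;  // 8 bytes
    static final int MESSAGE_SIZE = 16;

    // Writes all fields at their fixed positions, regardless of their values.
    static byte[] write(int rpm, int pressure, double temperature) {
        ByteBuffer buf = ByteBuffer.allocate(MESSAGE_SIZE).order(ByteOrder.LITTLE_ENDIAN);
        buf.putInt(RPM_OFFSET, rpm);
        buf.putInt(PRESSURE_OFFSET, pressure);
        buf.putDouble(TEMPERATURE_OFFSET, temperature);
        return buf.array();
    }

    // Reads only the temperature; the other fields are never touched.
    static double readTemperature(byte[] message) {
        return ByteBuffer.wrap(message)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getDouble(TEMPERATURE_OFFSET);
    }

    public static void main(String[] args) {
        byte[] message = write(1, 2, 91.5);
        System.out.println(message.length + " bytes, temperature " + readTemperature(message));
    }
}
```

Even small values such as an rpm of 1 occupy the full four bytes here, which is the size cost paid for the cheap access.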

Error detection

By default, there is no concept of error detection, such as checksumming or signing. If it is required, the developer has to take care of it after serializing the message.
 
 
As long as the byte stream contains valid data which can be converted into the given structure, no error will be raised. If the transport is potentially unreliable, measures such as attaching a simple checksum or message digest should be taken.
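Attaching a checksum can be as simple as appending a CRC32 to the serialized bytes and verifying it before deserialization. The sketch below uses only the JDK; the frame layout (payload followed by a 4-byte CRC) and the names seal and open are assumptions chosen for illustration.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.zip.CRC32;

// Hypothetical framing: [serialized payload][4-byte CRC32 of the payload].
final class ChecksumSketch {

    // Appends the CRC32 of the payload to the payload.
    static byte[] seal(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        return ByteBuffer.allocate(payload.length + 4)
                .put(payload)
                .putInt((int) crc.getValue())
                .array();
    }

    // Verifies the trailing CRC32 and returns the payload, or throws if it does not match.
    static byte[] open(byte[] frame) {
        if (frame.length < 4) {
            throw new IllegalArgumentException("frame too short");
        }
        byte[] payload = Arrays.copyOf(frame, frame.length - 4);
        int expected = ByteBuffer.wrap(frame, frame.length - 4, 4).getInt();
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        if ((int) crc.getValue() != expected) {
            throw new IllegalArgumentException("checksum mismatch");
        }
        return payload;
    }
}
```

A CRC only guards against accidental corruption. If tampering is a concern, a keyed message digest or a signature would be needed instead.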

No type announcements

The serialized messages don’t contain any information about their data type, so if multiple message types are transferred and the framework does not provide a substitute, such as the oneof feature of Protobuf, this has to be dealt with, too.
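One simple way to deal with this is to prefix every serialized message with a type tag and dispatch on it at the receiver. The following sketch assumes a one-byte tag and two made-up message types; the names are hypothetical and the body would be whatever bytes the serialization framework produced.

```java
import java.util.Arrays;

// Hypothetical envelope: [1-byte type tag][serialized message body].
final class TypeTagSketch {
    static final byte TYPE_TELEMETRY = 1;
    static final byte TYPE_ALARM = 2;

    // Prepends the type tag to an already serialized body.
    static byte[] wrap(byte type, byte[] body) {
        byte[] frame = new byte[body.length + 1];
        frame[0] = type;
        System.arraycopy(body, 0, frame, 1, body.length);
        return frame;
    }

    // Inspects the tag and hands the body to the matching (here: dummy) handler.
    static String dispatch(byte[] frame) {
        byte[] body = Arrays.copyOfRange(frame, 1, frame.length);
        switch (frame[0]) {
            case TYPE_TELEMETRY:
                return "telemetry(" + body.length + " bytes)";
            case TYPE_ALARM:
                return "alarm(" + body.length + " bytes)";
            default:
                throw new IllegalArgumentException("unknown message type " + frame[0]);
        }
    }
}
```

In a real consumer, each case would call the deserializer generated for that message type instead of returning a string.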

API comparison

Let's show some code. In this example, a telemetry telegram will be composed and serialized into a byte array using the Java API of the given framework. The results will be compared in terms of message size and processing time, using the Java Microbenchmark Harness (JMH).

To have a non-binary format to compare against, we use Jackson to serialize the test data to JSON.

Google Protocol Buffers

Protocol Buffers uses an interface definition language. For the demo use case, a representation could be:
 

Encoding:

Cap'n Proto

Cap'n Proto and Google Protocol Buffers have a lot of similarities, for the simple reason that they have been designed by the same developer. Cap'n Proto was designed to be a faster and tidier alternative and successor to Protocol Buffers, a claim which holds in certain combinations and scenarios.

By design, Cap'n Proto also relies on interface definitions, which look just a little different from their Protobuf counterparts:
 

Compiling the IDLs to Java code is possible either from the command line or by integrating a Maven dependency:

Building and serializing an object in Java, however, is more complicated, which appears to stem from the (suspected) intention to give the Java implementation a feel comparable to its C/C++ counterpart:

Apache Avro

Apache Avro can operate in both schemaless and schema-driven mode; for a better comparison, this report will focus on the schema-driven way.
As in Protobuf or Cap'n Proto, an IDL is compiled into platform-specific stubs, with the comfort of a Maven plugin taking care of the Java side.
 
 
Maven plugin:
 

 

 

Concise Binary Object Representation (CBOR)

CBOR is often neglected, but it is a competitive option because of its low memory footprint and wide platform support. CBOR itself does not provide schema-driven operation, but it is often used as a simple serialization workhorse in frameworks such as Jackson.

Serializing an object, such as the engine telemetry example used here, without a schema is straightforward:

 

Decoding the message requires knowledge of the structure used for encoding, and works the other way around.

Performance and Message Size comparison

While message sizes are easily comparable, performance measurements have limited applicability due to the mass of combinations which could occur. A certain binary serialization framework may have superior performance when serializing with the Java implementation and poor performance when deserializing the message, e.g. with MicroPython on an ESP. To have a common ground here, the observed unit is the serialization time of the engine telemetry above with Java, measured with the Java Microbenchmark Harness (JMH).
 
Technology                | Message size [bytes] | Encoding time [ns]
JSON (with Jackson)       | 1537                 | 69000
Java Serialization        | 1022                 | 5000
Google Protocol Buffers   | 817                  | 2700
Cap'n Proto               | 664                  | 5280
CBOR                      | 509                  | 14000
Avro                      | 577                  | 5700

JSON, unsurprisingly, is very large and slow due to the high effort spent on string processing, which is even higher when deserializing JSON. Plain Java binary serialization was acceptable and did not require any additional library, but lacks interoperability.

The multi-language binary serialization frameworks delivered comparable performance in terms of computing time, but Avro stands out for its small message size before any optimization.

Which one to use?

The easiest and most correct answer is: “It depends”. When talking only to Java clients with considerable processing power, the choice may largely be driven by the features and comfort a certain solution can deliver, such as the lightweight clients and servers of gRPC. It may also be an option not to bother with any of them if no change in communication peers is expected during the entire product lifetime.

When it is foreseeable that any constrained device (let’s call them IoT devices) is involved and size and speed are worth considering, I would recommend playing around with the options above and checking how they integrate into the overall architecture. Personally, I have mostly ended up with Protobuf, although for a current project that involves messaging between a Java server and Golang workers via RabbitMQ, Avro was the better option.

Traefik “Træfik” – A reverse proxy for containerized deployments

While I was preparing the deployment of a private pet project, I got the impression that my approach had significant room for improvement in the front-facing reverse-proxy department. The project consists of a scalable set of microservices serving several tasks in the backend, tied together with a message bus protocol.
 
While the backend was perfectly capable of adapting to its environment and even supports multi-target deployment, either to a Kubernetes cluster or to docker-compose on my little VPS, the situation was much worse on the frontend. There, I still had to configure nginx manually in terms of backend addresses and use additional technology for handling infrastructure changes, such as scaling services up or down, not to mention obtaining Let's Encrypt certificates, which simply felt wrong and not agile.

A little research revealed that I am not the only one with those concerns, and that there is a very powerful reverse proxy called Træfik that specifically addresses the requirements of containerized applications.

How it works

Træfik is basically an HTTP/HTTPS server and reverse proxy built on the Golang HTTP package, which does not sound that exciting at first. The notable thing about it, though, is its support for container orchestration frameworks such as Docker (vanilla, docker-compose, Swarm) or Kubernetes: it obtains all the information the operator already defined when setting up the actual container.

Traefik uses a declarative, rule-based approach to automatically set up reverse proxies and route requests accordingly. For instance, a Docker container could be labeled with multiple statements on its listening ports, protocol, hostnames and paths, which would be picked up by Traefik instantly. The same applies to changes in scaling or newly started containers: every change causes Traefik to adapt its own proxy configuration without any further operator interaction.

Example Træfik deployment strategy with docker-compose

On constrained hardware, such as a personal vServer, a lightweight approach such as docker-compose can provide an interesting alternative to a Kubernetes installation. As Traefik also supports docker-compose, it can be completely configured with key-value pairs in the labels section of the service items in the docker-compose file.

For instance, the docker-compose webservice below would start a Spring Boot service on Port 8080 and tell Traefik that:

  • The service serves the kp-backend service group
  • It should listen on APP_HOSTNAME, defined in the .env file or an environment variable
  • Only requests below /api/v1/trends should be routed to an instance of that container
  • Port 8080 is listening for HTTP requests

 

 
Relevance for the proxy can be expressed with the traefik.enable statement. In the example below, the enable directive prohibits Traefik from setting up a host for the service.
 

The Traefik config file

The configuration file is quite concise and fits on one screen. The file below sets up a basic Traefik installation with HTTPS, HTTPS enforcement and automatic Let's Encrypt certificate management.

Example: A full-featured file with Let's Encrypt support and HTTPS enforcement:

 

Using Google Protocol Buffers as Glue between Java and Golang

This is a companion article for my first talk at the Go User Group “Gophers” Aachen about Protocol Buffers and coupling Go services with Java.

In a distributed system, it is necessary to go beyond communication via shared memory, and even to cross technology borders. This is nothing unusual: it is more or less standard for a contemporary system to be designed as a distributed, interlinked collection of software modules which are not necessarily implemented in the same technology or even on the same computer architecture.

Serving the task of establishing a communication between multiple processes of any kind basically does not take much more than a REST API, a technology which is supported perfectly in frameworks such as Spring. But, regardless of the sense a REST service makes when providing an open API, Protobuf has everything to be superior to REST in backend communications, because:

  • Low overhead and size. Protobuf is a binary protocol with minimum payload and verbosity. For instance, a type- and architecture-safe float or double value takes 4 or 8 bytes respectively, plus a one-byte field tag.
  • Not opinionated regarding the transport. Protobuf messages can safely be transported over buses and 1:1 communication architectures.
  • No excuse for missing API documentation. It is literally not possible to build any messages without providing a corresponding Interface Definition Language file set.
  • Auto-generated Protocol Buffer objects. Protobuf supports encoding and decoding messages in any major programming language, even across language or architecture borders.

For demonstration purposes, I built a simple Java SpringBoot service that periodically generates Protobuf-encoded messages and uses MQTT to deliver them to a tiny service implemented in Go, serving the really important purpose to compute a moving average.

MQTT Broker Setup

If your lab does not provide an MQTT broker, just use the simple docker-compose recipe from the repository at the end of this article to bring up a working broker:

Protocol buffer Setup

Protobuf uses Interface Definition Language files, IDLs for short, as a meta-definition of the message format. From this format, the client and server code for all languages, such as C/C++, Java, C#, Go etc., is derived. For the demo case used here, a simple protobuf file suffices:

Obviously, it just defines a message named “SensorReading”, containing a String representation of the sensor's name and a Double with the temperature. The explicit package names accommodate the namespace conventions of the various languages.

To generate the actual language-specific client code, the tool protoc comes into play. Nevertheless, I very quickly refrained from using it directly, in favor of an auto-generation toolchain or a step in my builds. For Java and Go, we use a Maven plugin and a go-generate directive.

Protobuf and Java

Generating the protobuf Java stubs is not much more than introducing a Maven Plugin:

After the next Maven run, the generated Protobuf code will be available at target/generated-sources, and the IDE should be able to use this code for auto-completion right away. Creating a Protobuf object instance ready for serialization then does not differ much from using libraries like Immutables or Lombok:

Defining a Publisher for the message..

.. and implementing it:

After starting the service, it periodically generates messages on the MQTT bus, using the topic “gophers/measurements”. Without a consumer though, the setup does not yet make much sense.

The Consumer

In Go, consuming a protobuf message from MQTT is, as usual, straightforward. With Paho, Eclipse provides a very proven MQTT client, and there is an official Protobuf implementation for Go. After importing the Paho MQTT library, it is required to specify which topics to listen on and which callbacks to invoke after a message was received.

The callback has to implement the interface func(MQTT.Client, MQTT.Message). Here, it unmarshals the Protobuf message, prints its contents to stdout and performs some basic computations with it.

 

Message Size Deathmatch

In the Java Example, there is a test which illustrates the size advantages of messages serialized with Protocol Buffers, compared to JSON.

Given the tests above, the output from JUnit is:

This means that a message that takes 158 bytes in JSON only requires 58 bytes in protobuf. Speaking from experience, the ratio of 1:3 easily increases to 1:8 when the messages become more complex.

Alternatives and Conclusion

Protobuf is not the only framework serving the purpose of flexible exchange of binary-encoded information. Cut from the same cloth, with a lot of similarities, there is Thrift. An alternative without the need to provide IDLs in the first place is Avro, and some people also use CBOR. I used Protobuf for this example because of its stability and excellent Java and Go support. I very much like it for backend communication and have used it since the early 2000s. But, because of the tooling and the “long live the standards” maxim, I would always favor REST for APIs exposed to third parties.

 

You can find the code and all details in my Github profile:

https://github.com/codecyclist/gophersAachen-golang-protobuf

Integration Testing with Jimfs virtual Filesystems

In various test cases, primarily those covering components that interact with the filesystem and use filesystem entities for convention-based operation, it makes perfect sense to provide a filesystem resource in an expected state to assess the behavior of the component under test.

To provide a resource in an expected state, the classic approaches are:

  • Have a handcrafted directory with testdata on the developer machine
  • Couple the testdata with the source in the repository

Both methods are quite poor. The first will definitely lead to trouble when the application gets built on a CI server or any other machine, or when the fact that the local filesystem is anything but immutable shows its evil face. The second includes hand-crafted static data that has to be kept in sync with the application contracts and logic. Although this might have been an acceptable approach in the 90s, please do not do this today.

More contemporary approaches are:

  • Have a configurable location on a ramfs/tmpfs with freshly prepared test data in each @Before* part of the test.
  • In JUnit 4, use the TemporaryFolder rule, which promises to clean up the resources after testing.

The above are quite decent, but they still depend on local, platform-specific and non-reproducible resources, a fact that may (and will) corrupt the actual test. So, why not use a layer which emulates a filesystem on the java.nio level and provides a throwaway, in-memory filesystem which gets assembled in an expected state on each test run?

Jimfs

Jimfs provides a virtual in-memory filesystem which behaves more or less exactly like the default filesystem. Being developed by Google and quite feature-complete, it drives all my tests which require File or Path dependencies.

Example code

First, jimfs needs to be imported. For Maven, the import is:

A minimal test with Jimfs in JUnit4:

From here on, the Jimfs filesystem behaves like the normal filesystem, so let's populate it with some test data.

There is one single trap: according to the Jimfs repository and the Java documentation, Path.toFile() is only supported for paths of the default filesystem and throws an UnsupportedOperationException for any other provider. So any method which ends up using a Path hosted by Jimfs should create an InputStream from the Path instead, which might come for free if the filesystem-centric code already relies on the Java 8 Files.walk() functionality.
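A minimal sketch of filesystem-centric code that stays clear of toFile() is shown below. It uses only java.nio, so it should work with a Path from any provider; the class and method names are assumptions, and the sampleFile helper uses the default filesystem merely as a stand-in so that the example runs without additional dependencies.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Collectors;

// Reads from a Path via java.nio streams only, never via java.io.File.
final class PathReaderSketch {

    // Opens the stream through Files, which delegates to the Path's own provider.
    static String readText(Path path) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(Files.newInputStream(path), StandardCharsets.UTF_8))) {
            return reader.lines().collect(Collectors.joining("\n"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Stand-in for test data setup; in a Jimfs test the Path would come from the in-memory filesystem.
    static Path sampleFile(String content) {
        try {
            Path file = Files.createTempFile("sketch", ".txt");
            Files.write(file, content.getBytes(StandardCharsets.UTF_8));
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readText(sampleFile("warp\ncore")));
    }
}
```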

Memory filesystems in tests may be a bad smell

In Java 8, there are a lot of ways to abstract filesystem-driven components from the actual storage. When the testing strategy requires the use of an in-memory filesystem, it might be a sign that a component is coupled to the filesystem too tightly.

My usual way to abstract the filesystem from my other code is to use a Supplier<ContractType>, which can easily be mocked or stubbed for components which depend on it, while the supplier, as the last outpost before the filesystem, is the right component to be tested against a virtual filesystem.
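The idea can be sketched as follows. The LineCounter component and its contract type (a list of lines) are made-up examples for illustration; the point is only that the component depends on a Supplier and never sees a Path or File.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

final class SupplierSketch {

    // Hypothetical component under test: knows nothing about the filesystem.
    static final class LineCounter {
        private final Supplier<List<String>> lines;

        LineCounter(Supplier<List<String>> lines) {
            this.lines = lines;
        }

        // Counts the lines that contain more than whitespace.
        long countNonEmpty() {
            return lines.get().stream()
                    .filter(line -> !line.trim().isEmpty())
                    .count();
        }
    }

    public static void main(String[] args) {
        // In a unit test, the supplier is stubbed with a lambda instead of touching a filesystem.
        LineCounter counter = new LineCounter(() -> Arrays.asList("alpha", "", "beta"));
        System.out.println(counter.countNonEmpty());
    }
}
```

In production, the supplier would wrap the actual file access, and only that thin wrapper needs a test against a real or virtual filesystem.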

So, memory filesystems may be a bad smell when used in the wrong place, but they are really useful when used right.

Word Frequency in 5 Programming Languages (Java, Scala, Go, C++, R)

Java 8

 

Golang

Scala

R

C++

Alexa Skill development and testing with Java

This article is supposed to give a brief overview of Amazon Alexa Skills development with the Alexa Java API. The majority of tutorials on Alexa skills appear to be targeted at node.js developers, so I would like to highlight the Java way and point out some things that I missed in the official trainings and some things that I would have solved differently.

For demonstration purposes, I wrote a simple fun application which finds crew members on the spaceship Enterprise. A user could ask Alexa questions like “Where is captain Picard” or “Ask Enterprise where Captain Picard is”. The application makes no sense at all, but it demonstrates everything a developer has to know to implement basic skills of their own.

The Speechlet interface

Providing an Alexa-enabled application requires the developer to provide an implementation of the Speechlet interface, which is:

The functions are quite straightforward: the session-related functions handle init and cleanup work when a session is created or terminated, a session in the Alexa domain being the lifetime of a conversation with the user. onIntent gets invoked on any voice interaction the Alexa backend is able to map to an intent based on the predefined utterances schema.

Let's take our nonsense Enterprise crew resolver:

We deliberately do not deliver cards to the customer's application; if a card is required, there is another signature of the newTellResponse method. As we do not have access to the board computer and there is no Amazon Alexa service for spaceships or even a region outside the earth's atmosphere yet, we inject a mock resolver for testing purposes.

Mocking an Alexa call for JUnit

Testing the data providers behind the Alexa API works just like everyday testing, but mocking an Alexa request does not appear to be part of the primary feature set of the API, which means that we have to completely mock the request before passing it to our handler.

Fortunately, Amazon used a library related to immutables.net for their API, so it is possible to handcraft an IntentRequest which closely resembles an actual search request for Captain Picard as follows:

What I would like to be changed in the Java API

Builders all the way

Builders are good. Please use them on the response types as well. For example, the code below feels very, very 90s:

Way too much ceremony. What I would like to be able to write without providing my own facades is:

Easy testing

Mocking a request for semi-end-to-end testing as in the example above works, but it is not really comfortable. I would appreciate a function which exports the request to a JSON file, together with a corresponding input function. This would make it easy to mock the request without using the builders directly.

Besides, once a speech has been created, it is not possible to extract the speech text from it without applying dirty reflection to break the private property barrier. Why not just provide a getSsml() member to make e2e testers happy?

Plaintext or Ssml?

Honestly, I do not want to use the Plaintext response at all. Ssml is a superset of Plaintext and allows more detailed control over the way Alexa text-to-speech works, for instance if it is a requirement to spell out a word instead of speaking it. So, why not just use Ssml all the way and improve the speech renderer so it does not crash if no <speak></speak> tags are present?
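Until then, a tiny helper on the skill side can make sure every response text is valid SSML before it is handed to the API. The helper below is a hypothetical sketch in plain Java and not part of the Alexa SDK; it assumes that the SSML root element is <speak> and that the input is either plain text without XML special characters or already well-formed SSML content.

```java
// Hypothetical helper: guarantees that a response text is wrapped in <speak> tags.
final class SsmlSketch {

    // Returns the text unchanged if it is already wrapped, otherwise wraps it.
    static String ensureSpeak(String text) {
        String trimmed = text.trim();
        if (trimmed.startsWith("<speak>") && trimmed.endsWith("</speak>")) {
            return trimmed;
        }
        return "<speak>" + trimmed + "</speak>";
    }

    public static void main(String[] args) {
        System.out.println(ensureSpeak("Captain Picard is on the bridge."));
        System.out.println(ensureSpeak("<speak>Captain Picard is on the bridge.</speak>"));
    }
}
```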