Encoding and Evolution

0c|4d617274696e|f214 -> Help me decode!!

Programs work with data in two different representations:

  1. In memory, data is kept in objects, structs, lists, arrays, hash tables, trees, etc. These data structures are optimised for efficient access and manipulation by the CPU (typically using pointers).
  2. When we want to write data to a file or send it over the network, we have to encode it as some kind of self-contained sequence of bytes (e.g. a JSON document). Since a pointer wouldn’t make sense to any other process, this sequence-of-bytes representation looks quite different from the data structures that are normally used in memory.

Thus, we need some kind of translation between the two representations. The translation from in-memory to a byte sequence is called encoding or serialization, and the reverse is called decoding or deserialization.
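As a small illustration of the two directions, here is a minimal sketch using Python's standard json module. The record and its field names are made up for the example.

```python
import json

# In-memory representation: a dict holding a string, an int and a list.
user = {"name": "Martin", "favorite_number": 1337, "interests": ["hacking"]}

# Encoding (serialization): in-memory object -> self-contained byte sequence.
encoded = json.dumps(user).encode("utf-8")

# Decoding (deserialization): byte sequence -> a fresh in-memory object.
decoded = json.loads(encoded.decode("utf-8"))
```

The decoded value is equal to the original but is a separate object: nothing about the original's memory layout survives the trip, only the bytes.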

In most cases, a change to an application’s features also requires a change to the data it stores.

Relational databases generally assume all the data conforms to one schema and changing the schema is a tedious process. Schema-on-read databases don’t enforce a schema, so the database can contain a mixture of older and newer data formats.

When the data format changes, the application code usually has to change with it (for example, when a new field is added, the code starts to read and write that field).

  1. With server-side applications, we may want to perform a rolling upgrade — deploying the new version to a few nodes at a time, checking whether the new version is running smoothly and gradually working our way through all the nodes.
  2. With client-side applications, we’re at the mercy of the user who may or may not install the update for some time.

This means that old and new versions of the code, and old and new data formats, may all coexist in the system at the same time. So the system needs to maintain compatibility in both directions:

  1. Backward compatibility — Newer code can read data that was written by older code
  2. Forward compatibility — Older code can read the data that was written by newer code
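The two directions can be sketched with a pair of readers. This is only an illustration: the function names and the `email` field are hypothetical, and JSON stands in for whatever encoding the system actually uses.

```python
import json

def read_user_v1(raw: bytes) -> dict:
    """Old reader: knows only `name` and ignores fields it does not recognise."""
    record = json.loads(raw)
    return {"name": record["name"]}

def read_user_v2(raw: bytes) -> dict:
    """New reader: also knows `email` and falls back to a default when absent."""
    record = json.loads(raw)
    return {"name": record["name"], "email": record.get("email", "")}

old_data = json.dumps({"name": "Martin"}).encode("utf-8")
new_data = json.dumps({"name": "Martin", "email": "m@example.com"}).encode("utf-8")

backward = read_user_v2(old_data)  # newer code reads data written by older code
forward = read_user_v1(new_data)   # older code reads data written by newer code
```

Backward compatibility works here because the new reader supplies a default for the missing field. Forward compatibility works because the old reader tolerates a field it has never heard of.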

Language-specific encoding. Many programming languages come with built-in support for encoding in-memory objects into byte sequences. For example, Java has java.io.Serializable, Ruby has Marshal, and Python has pickle.

These encoding libraries are very convenient because they allow in-memory objects to be saved and restored with minimal additional code. However, such an encoding is often tied to a particular programming language, and reading the data in another language is very difficult. If we store or transmit data in such an encoding, we are committing ourselves to our current programming language for potentially a very long time, and we make it hard to integrate our systems with those of other organisations, which may use different languages.

Therefore, it’s generally a bad idea to use a language’s built-in encoding for anything other than very transient purposes.
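A minimal sketch of this trade-off, using Python's pickle on a standard-library date object (the `joined` field name is made up for the example):

```python
import datetime
import json
import pickle

joined = datetime.date(2024, 1, 15)

# pickle round-trips the object with no extra code...
payload = pickle.dumps(joined)
restored = pickle.loads(payload)

# ...but the payload refers to a Python class by name, which only a Python
# runtime can resolve back into an object.
names_python_class = b"datetime" in payload

# A language-neutral alternative: encode the date as an ISO 8601 string in JSON.
portable = json.dumps({"joined": joined.isoformat()})
```

The pickled bytes are convenient inside one Python program, whereas the JSON text can be parsed by any language at the cost of writing the conversion by hand.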

Standardised encodings like JSON and XML can be written and read by many programming languages, and they are also human-readable. The drawback is that both JSON and XML use a lot of space compared to binary formats.

This drawback has led to the development of a profusion of binary encodings for JSON (MessagePack, BSON, BJSON, BISON) and for XML (WBXML, Fast Infoset), but these encodings provide only a modest size reduction, because without a schema they still have to include every field name in the encoded data.

There are other, schema-based binary encoding libraries like Apache Thrift (originally developed at Facebook), Protocol Buffers (developed at Google) and Avro, which are open source and support a fairly wide range of programming languages.
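The hex string at the top of this post can be read as a length-prefixed string followed by an integer, both written as variable-length zigzag integers, which is the style Avro's binary encoding uses. Here is a minimal sketch assuming that layout. The function names are made up for illustration, and the "one string, then one integer" record shape is an assumption rather than something the bytes themselves state.

```python
from typing import Tuple

def read_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Read a base-128 varint (low bits first); return (value, next position)."""
    value, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift  # the low 7 bits carry data
        if (byte & 0x80) == 0:           # high bit clear marks the last byte
            return value, pos
        shift += 7

def zigzag_decode(n: int) -> int:
    """Map 0, 1, 2, 3, ... back to 0, -1, 1, -2, ..."""
    return (n >> 1) ^ -(n & 1)

def decode_record(buf: bytes) -> Tuple[str, int]:
    """Assumed record layout: one string followed by one integer."""
    raw_len, pos = read_varint(buf, 0)
    length = zigzag_decode(raw_len)
    text = buf[pos:pos + length].decode("utf-8")
    pos += length
    raw_num, pos = read_varint(buf, pos)
    return text, zigzag_decode(raw_num)

record = decode_record(bytes.fromhex("0c4d617274696ef214"))
# -> ("Martin", 1337)
```

Notice that the nine bytes contain no field names or type tags. The reader can only make sense of them because it already knows the schema, which is what lets schema-based formats be so compact.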

There are many ways data can flow from one process to another. These processes need to be compatible, i.e. data encoded by one process must be decodable by every other process that reads it.

Dataflow through databases

Suppose you add a field to a record schema, and the newer code writes a value for that new field to the database. Subsequently, an older version of the code (which doesn’t know about the field yet) reads the record, updates it and writes it back. If you decode a database value into model objects in the application and later re-encode those model objects, the unknown field might be lost in that translation process.
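The failure mode, and one way to avoid it, can be sketched as follows. Everything here is hypothetical: the field names, the `KNOWN_FIELDS` set and both update functions are invented for the example, with JSON strings standing in for database values.

```python
import json

KNOWN_FIELDS = {"name", "email"}  # fields the (hypothetical) old code models

def lossy_update(raw: str, new_email: str) -> str:
    """Decode into a fixed model, then re-encode: unknown fields are dropped."""
    record = json.loads(raw)
    model = {key: record[key] for key in KNOWN_FIELDS if key in record}
    model["email"] = new_email
    return json.dumps(model)

def preserving_update(raw: str, new_email: str) -> str:
    """Change only what is needed and carry every other field through untouched."""
    record = json.loads(raw)
    record["email"] = new_email
    return json.dumps(record)

# Written by newer code, which added `photo_url`.
stored = json.dumps(
    {"name": "Martin", "email": "old@example.com", "photo_url": "/p/1.png"}
)

lost = json.loads(lossy_update(stored, "new@example.com"))
kept = json.loads(preserving_update(stored, "new@example.com"))
```

After the lossy update, `photo_url` is gone from the record. After the preserving update, it is still there, even though the old code never looked at it.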

So, we need to process data cautiously!

Dataflow through services

In the case of APIs, the client encodes a request, the server decodes the request and encodes a response, and the client finally decodes the response.

RESTful APIs most commonly use JSON for responses, and JSON or URI-encoded/form-encoded parameters for requests.
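The four encode/decode steps can be sketched without a network, using only the Python standard library. The `handle_request` function, the parameter names and the user record are all stand-ins invented for the example.

```python
import json
from urllib.parse import parse_qs, urlencode

# Client: encode the request parameters as a form-encoded query string.
request_query = urlencode({"user_id": 42, "fields": "name,email"})

def handle_request(query: str) -> bytes:
    """Server: decode the request, then encode a JSON response body."""
    params = parse_qs(query)  # values arrive as lists of strings
    user_id = int(params["user_id"][0])
    fields = params["fields"][0].split(",")
    user = {"name": "Martin", "email": "m@example.com"}  # stand-in for a lookup
    body = {"id": user_id, **{field: user[field] for field in fields}}
    return json.dumps(body).encode("utf-8")

# Client: decode the response.
response = json.loads(handle_request(request_query))
```

Client and server never share objects, only bytes, so each side can be upgraded independently as long as both keep understanding the encoded request and response.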
