Avro vs Protobuf
On my latest project the team is fully onboard with Avro for shipping data between microservices. I’ve come from a typical Protobuf background, and the shift to Avro has been somewhat painful; I’m not sure I would recommend it over Protobuf. Here are my high-level thoughts:
Writing schemas in JSON is daft
Protobuf schemas are lightweight, obviously composable and, by convention, encourage good documentation within the schema. By contrast, Avro schemas are written in JSON, meaning no comments, and while you can keep things DRY, doing so is not particularly obvious and is prone to breakage.
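To illustrate, here’s a rough sketch (the message and import path are invented, but the comment syntax and import mechanism are standard Protobuf):

syntax = "proto3";

// Composition is a one-line import.
import "common/address.proto";

// Doc comments sit right next to the things they describe.
message Customer {
  string id = 1;       // stable external identifier
  Address address = 2; // reused from common/address.proto
}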
There is a human-readable form of the Avro schema, IDL. It looks much more like Protobuf, though still not as nice. Unfortunately I haven’t had much chance to use it; it may well solve some of my issues, so I will definitely look at it for my next Avro project.
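From the documentation, an optional-plus-required record in IDL looks something like this (a sketch only, given I haven’t used IDL in anger):

// Avro IDL: C-style syntax, and comments are allowed.
protocol Example {
  record SomeRecord {
    union { null, string } optional_field = null;
    string required_field;
  }
}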
The way things are made optional is idiotic
Avro doesn’t really have the concept of an optional field. Instead you declare the field with a union type that includes null. This means that for anything optional your code is littered with blocks that look like:
{
  "name": "some_field",
  "type": ["null", "string"],
  "default": null
}
When you get to having optional records with optional fields it’s worse:
{
  "name": "some_field",
  "type": [
    "null",
    {
      "name": "SomeRecord",
      "type": "record",
      "fields": [
        {
          "name": "optional_field",
          "type": ["null", "string"],
          "default": null
        },
        {
          "name": "required_field",
          "type": "string"
        }
      ]
    }
  ],
  "default": null
}
The equivalent expressed in Protobuf would be:
message SomeRecord {
  optional string optional_field = 1;
  string required_field = 2;
}

optional SomeRecord some_field = 1;
I think it’s hard to disagree that the Protobuf version is clearer.
The other thing that bugs me about Avro types is that the type is declared as the string "null", but the value passed as the default is the literal null, which is not "null". This points to JSON perhaps not being the best language in which to define something so heavily type-based.
The schema can start to become coupled to the code
With Protobuf there’s a strict schema definition and, whatever language you’re using, that schema will always work. Not necessarily so for Avro. Take this Avro schema:
{
  "name": "some_field",
  "type": "string"
}
Run it through certain tooling and you could end up with something like:
{
  "name": "some_field",
  "type": {
    "avro.java.string": "String",
    "type": "string"
  }
}
Now we’ve got a reference to Java classes in a schema that may, or may not, break a Python Avro reader. There are ways around this, but the fact is the schema is no longer really defining the data specification; it’s starting to define the implementation, and that leads to some bad practices.
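For example, if you’re generating Java classes with the avro-maven-plugin, that property is injected when the plugin’s stringType is set to String; leaving it as CharSequence keeps the schema clean (a sketch of the relevant plugin configuration, with the rest of the POM elided):

<plugin>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-maven-plugin</artifactId>
  <configuration>
    <!-- String injects avro.java.string; CharSequence leaves the schema alone -->
    <stringType>CharSequence</stringType>
  </configuration>
</plugin>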
Some of the tooling is wonky
Because the schema generally travels with the data in Avro, certain tools think it’s OK to modify the schema as they go. For example, Kafka Connect will deserialize Avro data to its own internal format and then re-serialize it, potentially with an altered schema. I struggle to think when this would be the behaviour you want, and again it speaks to Avro schemas no longer being a data schema but more a reference to an implementation.
With Protobuf, I’m sure you could do something just as bad but I haven’t seen it done yet.
Schemas travelling with the data is not that useful
One of the main touted benefits of Avro is that you can pass the schema around with the data. From an application development point of view I just don’t see how that’s useful. The only place it has some benefit is potential integration with tools like Presto that can directly query Avro files, whereas for Protobuf you would need a separate transformation stage. Given that Avro isn’t a good data lake format, this should be a pretty rare scenario.
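To be fair, the embedded schema is at least easy to get at; avro-tools, for instance, can pull the writer schema straight out of a data file (the file name here is hypothetical):

# the writer schema travels in the file header
java -jar avro-tools.jar getschema some_records.avro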
Is Protobuf the answer to all these problems?
Well, the short answer is not completely, and it has its own problems with build tools, but I’ve generally had a much better experience dealing with Protobuf. The biggest challenge has been inspecting the data once it’s in Protobuf format, but there are enough scripts and Stack Overflow guides on quickly pulling something together that it’s no longer really an issue.
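In fact protoc alone gets you surprisingly far (the file and message names here are hypothetical):

# no schema to hand: dump fields by number
protoc --decode_raw < some_record.bin

# with the schema: dump fields by name
protoc --decode=SomeRecord some_record.proto < some_record.bin

For any future projects that’s definitely the serialization format I’ll go to.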