Friday 11 October 2013

Embedding a Primary Key in a UUID / GUID for CQRS / ES

For some time now I have been working on an Event Sourced system. Today I am going to describe my UUID/GUID primary key approach that I devised. Ignoring much about the event sourced technology and the merits thereof; primary key's are common and critical and their usage in an Event Sourced system is very critical.

The usual route for a primary key in a system is an increasing number (Person ID = 102).

With Event Sourcing and CQRS, the UUID or GUID is often identified as a good implementation choice for your primary key

If you are interested in the topic, then these two resources are a good read. (nothing to do with Event Sourcing)

But I have a few issues with the UUID. For one, it's not very client friendly, nor is it easily memorable. I want a system wide unique reference for all my objects ; a UUID is great for that; and I want it to also function as a friendly and memorable primary key.. how ? Well here below is my solution.

A little about Event Sourcing

With event sourcing, the primary route to storage is not through the "Domain Object".

Most developers are familiar with an Object-Relational Mapping system (Hibernate, NHibernate etc). (Object persisted to a table). This usual approach is that a Domain object is persisted (from a class or class heirarchy structure) to a database table. Event Sourcing instead stores the "event" that creates or modifies the Domain Object, and not the Object itself. A rebuild or some other query is a replay of the Events stream that was persisted.

I explain all that to say that an Event Sourced (CQRS) system can be designed without "tradtional primary keys".

Each event has a key, and the objects created can have keys, but what is common is to use UUIDs across all domain objects.

My Design Goal

I wanted some traditional primary keys; and I wanted the UUID. In short, the system should be simple to use; being both client friendly and a memorable primary key for use.

When you have a UUID as the primary key and it is used in API's and get's passed around the office as "Look up client 3422" you need this API to be simple.

An example: /person/fbe645f0-3031-41e3-aa6e-0800200c9a66 is just not a nice URI

However,

/person/2078

or

/person/5f9

They are simple.

What I have embarked on is to segment my UUID (my primary keys) so that I utilise the entire UUID space - a part for randomness, a part for the Group (or table ID if you like) and a part for the object; in this example case, the person.

The make up of my Custom UUID

The UUID is a 128 bit number, represented as 32 hex characters (with some dashes for legibility)

The UUID specification reserves some bits for version and variant.

With much (actually 30 minutes) thought, I have decided to use Version 6. (it doesn't exist in the IETF RFC 4122, they just went 1, 2, 3, 4 and 5)

and just make the Variant 'a'.

but ignoring all that, what is special about my UUID is that I embed a primary key and a "group" ID into the UUID.

So, my UUID's look like

pppppppp-pppp-6ggg-arrr-rrrrrrrrrrrr

where p is the Primary key (sequential incremementing)

where 6 is the Version (always 6)

where g is the Group ID, (akin to an identifier of the table)

where a is the variant (always a)

where r is the random part

This is my working notes: 

00000000-0000-6000-a000-000000000000
                    -----------------
                    60 bits Random bits (time or other)
                   -
                   x == a, b, 8 or 9
               ---
               group ID - of 12 bits value is (0 to 4095)
              
              6 == always 6 - (Version 4, Random UUID)
-------- ----
Primary key ID of 48 bits  (max 281,474,976,710,656)

So I have enough bits for a primary ID (halfway between Int *32 bits and Long *64 bits)

The Version

I decided to use 6 as the Version, 1 - 3 are other uses, 4 is Random or Psuedo random, which is almost mine, but not quite and Version 5 is a SHA-1 Hash; the RFC 4122 stipulates the following

Process of identifier assignment:

Generating a UUID does not require that a registration authority
be contacted. One algorithm requires a unique value over space
for each generator. This value is typically an IEEE 802 MAC
address, usually already available on network-connected hosts.
The address can be assigned from an address block obtained from
the IEEE registration authority. If no such address is available,
or privacy concerns make its use undesirable, Section 4.5
specifies two alternatives. Another approach is to use version 3
or version 4 UUIDs as defined below.

The sentence I want to draw attention to is - "Generating a UUID does not require that a registration authority be contacted"

But then again, because I am using Version 6, it's probably not a UUID.

My reasoning to a Version 6 are:

  • My UUID is not a Version 4, because it is only partially random ( a part of it )
  • It is not any other Version (1-3, or 5)
  • Version 6 wasn't used
  • If someone has an issue with my use, then I'll call it instead a LUUID, a Local UUID

Moving on.

The Group ID

I wanted to have a unique key that represents all objects across the space. In this way I can now have a REST URI that looks like

/any/<uuid>

and the system can appropriately redirect to the correct resource, by looking at the group ID. (pattern matching the -6xxx- )

For example, if we have the UUID

000000231-22e-6aa1-a789-28ef27ab7c62

This is

  • aa1 for the Group ID for 'person'
  • 23122e for the primary key

/any/000000231-22e-6aa1-a789-28ef27ab7c62

and with a match we can redirect the request to

/person/000000231-22e-6aa1-a789-28ef27ab7c62

Clashes in the Primary Key with a Distributed System

CQRS guru Greg Young says of the UUID (and Event Sourced systems)

Having the client originate Ids normally in the form of UUIDs is extremely valuable in distributed systems.

And this is true, but I don't want to give the client that honour, I want to allocate them a UUID (for many reasons) - but that comes with a drawback of needing to cater for clashes. (there is always a tradeoff with IT)

For my UUIDs, the primary key is sequential, at point of allocation. (231-22e, 231-22f, 231-230 etc). Having the end of the "UUID" random (e.g.: -a789-28ef27ab7c62)means I can cluster (multiple systems) and allow for "duplicates" even in the primary key space, (where two servers generate the same ID and Group ID).

Leaving the UUID as it is (with a duplicate in the PK ID space) is ok but not ideal. So we will need to cater for that. (and of course for the lottery day when two systems generate the same UUID even down to the random part!)

Let me explain that : Assume that I have a web application with two servers that generate "people",

   www.myco.com  

  • server1-myco; and
  • server2-myco
 
If server1-repo and server2-repo BOTH generate a UUID but the random part differs, e.g.
 
server1 - 000000231-a32-6aa1-a789-28ef27ab7c62   (Jim)
server2 - 000000231-a32-6aa1-a25e-6ac5c6ef127c   (Mary)

then, technically I have two unique ID's, but the initial shorter ID's clash. This is not exactly ideal - because I want my friendly PK ID's to be unique also. If I want to minimise that "duplicate", then here are some options.

  1. Single ID generation node (creates a single point of failure)
  2. Regular re-assignment in case of a clash
  3. Generate and check with peers
  4. Block reserving

Options 2, 3 and 4 will be my preferred. With a distributed system, I have that PK ID issue anyways, so it's not different because I am playing with UUID's.

But what I really like is how I can use the primary key all by itself, outside of the /UUID

for example. 


Dear Mr Client,

Welcome as a supplier to Company X. For future reference, your client ID is 23122e.

...


or

/person/23122e

I don't have to use the full UUID everywhere, but rather just where I need it (in the event sourced messages), and as a unique global ID on the system.

Debugging a system will be easier too, looking at logs, someone with a little knowledge will recognise 'people' UUID's vs 'building', or 'schedule' UUID's (because they will distinguish the group ID after '-6xxx-' as the unique group identifier); in effect people will learn those 3 characters and identify what the group Id is.

UUID generation

So now I hear you wonder, how do I generate these mythical UUID's ?

Simply really.

This is the call for the generating the UUID (reusing the java.util.UUID class, just giving two lower and upper longs (64 bits each)

  /**

   * Create a new UUID given some ID as the groupID and an already sequenced ID

   */

  def createUuid(groupId: Int, id: Long): Uuid = {

 

    val randomBytes = new Array[Byte](8)

    secureRandom.nextBytes(randomBytes)

    val randomLong = java.nio.ByteBuffer.wrap(randomBytes).getLong()

 

    // groupID has to be 12bits as the 4L is going in over the top.

    return Uuid(new java.util.UUID(groupId | (6L << 12) | (id << 16), (10L << 60) | (randomLong >>> 4)).toString())

 

  }

 

The Primary Key is an incremental sequence on the code that creates a new person, or group, or employee (etc)

In Scala, it is simply a matter of adding in a Trait for the "class" you want ID's sequenced for/ UUID's generated for. e.g:  I add with UuidGenerator[Person] and this gives a newUuid() method 

class PersonProcessor(val repository: Repository[Uuid, Person]) extends AbstractProcessor[Person]

with UuidGenerator[Person] 

{ this: Emitter =>

  

  def klass = classOf[Person]

 

...

      createPerson(newUuid, cmd )

 

The implementation of that newUuid() method looks as follows:


trait UuidGenerator[D] {

  

  implicit def klass: Class[_]

  

  private val ids = Map.empty[String, Long]

  

  private[this]def className = klass.getClass().getCanonicalName()

 

  /**

   * return the next ID

   */

  def newUuid:Uuid = {

    val idKey = className

    val currentId = ids.getOrElseUpdate(idKey, 0L)

    ids += (idKey -> (currentId + 1))

    return Uuid.createUuid(klass, currentId + 1)

  }

...


And Uuid.createUuid looks like : 


 

  def createUuid(klass: Class[_], id: Long): Uuid = createUuid(groupId(klass), id)

 

The Group ID is simply CRC-12, or 12 bit CRC over the classname. This is because I had 12 bits to spare where I placed the groupId -6aa1 So given that all 'Person' domain objects extend from com.soqqo.system.domain.Person my groupId's are consistent there.

 

On bootstrapping my system, I make sure that all "in use" groupId's are not clashing on CRC12-ing them - could happen - and if it does I'll deal with that then. (just System.halt bootstrap .. and change code to suit)

 

This is the crc12 implementation in Scala. I haven't tested that brutally, but it DOES generate unique < 4096 ID's for random byte's passed in, so it is working the way I need it to.

  def crc12(toHash: String) = {

 

    /**

     * ************************************************************************

     *  Using direct calculation

     * ************************************************************************

     */

 

    varcrc:Int = 0xFFF; // initial contents of LFBSR

    varpoly: Int = 0xF01; // reverse polynomial

    var bytes = toHash.getBytes()

 

    for (b: Byte <- bytes) {

      var temp = (crc ^ b) & 0xff;

 

      // read 8 bits one at a time

      for (i <- 0 to 7) {

        if ((temp & 1) == 1) temp = (temp >>> 1) ^ poly;

        else temp = (temp >>> 1);

      }

      crc = (crc >>> 8) ^ temp;

    }

 

    // flip bits

    crc = crc ^ 0xfff;

 

    crc;

 

  }

 

Summary

I hope this helps someone on any of the weird topics I have covered here. I will share the Uuid and UuidGenerator classes happily for anyone that wants them. They are anything special, but rather a lot of thinking about how I wanted my Uuid's to be utilised in the system.

My system entails:

1. spray.io - Web router on top of Spray Can
2. Event Sourced (now also known as akka-persistence)
3. AngularJS

Enjoy!

Current 5 booksmarks @ del.icio.us/pefdus