When you horizontally partitioning data across multiple SQL Azure databases or using Data Sync Server for SQL Azure, there might come a time when you need to write to a member database without causing primary key merge conflicts. In this case you need to be able to generate a primary key that is unique across all databases. In this article we will discuss different techniques to generate primary keys and their advantages and disadvantage.
UniqueIdentifier
One way to generate a unique primary keys is to use the NEWID() function in Transact-SQL, which generates a GUID as a uniqueidentifier data type. The GUID is guaranteed to be unique across all databases.
Advantages:
It is a native type to SQL Azure.
It is infinitely big and you will never run out of GUIDs
Works with both horizontal partitioning and Data Sync Services.
Disadvantage:
The disadvantages of using this technique is that based on the GUID; there is no way to identify what database generated it. This can cause extra complications when doing horizontal partitioning.
The uniqueidentifier data type is large and will add to the size of your row.
Bigint
Another option is to use a bigint data type in place of an int. In this technique, the primary key is generated from being an identity column; however each identity in each database starts at a different offset. Different offset create the non-conflicting primary keys.
The first question most people ask, is bigint data type big enough to represent all the primary keys need. The bigInt data type can be as large as 9,223,372,036,854,775,807 because it is stored in 8 bytes. This is 4,294,967,298 times bigger than the maximum size of an int data type: 2,147,483,647. This means that you could potentially have 4 billion SQL Azure databases horizontally partitioned with tables of around 2 billion rows. More information about data types and sizes can be found here.
On the first SQL Azure database you would create the table like this:
CREATE TABLE TEST(x bigint PRIMARY KEY IDENTITY (1,1))
On the second SQL Azure database you would create the table like this:
CREATE TABLE TEST(x bigint PRIMARY KEY IDENTITY (2147483648,1))
And continue incrementing the seed value for each database in the horizontal partitioning.
Advantages:
It is easier to upgrade from a legacy tables that used an int data type as the primary key to a bigint data type (the legacy table would be the first partition).
You can reparation easier than some of the other techniques, since moving rows involve a straight forward case statement (not a recalculated hash).
The data tier code implementing the partitioning can figure out which partition that the primary key is in, unlike a using a uniqueidentifier for a primary key.
The bigint data type consumes 8 bytes of space, which is smaller than the uniqueidentifier data type that take up 16 bytes.
Disadvantages:
The database schema for each partition is different.
This technique works well for horizontal partitioning, but not for Data Sync Service.
Primary Key Pool
In this technique a single identity database is built where all the primary keys are stored, however none of the data. This identity database just has a set of matching tables that contain a single column of integers (int data type) as an auto incrementing identity. When an insert is needed on any of the tables across the whole partition, the data tier code inserts into the identity database and fetches the @@IDENTITY. This primary key from the identity database is used as the primary key to insert into the member database or the partition. Because the identity database is generating the keys there is never a conflict.
Read more: SQL Azure team blog
UniqueIdentifier
One way to generate a unique primary keys is to use the NEWID() function in Transact-SQL, which generates a GUID as a uniqueidentifier data type. The GUID is guaranteed to be unique across all databases.
Advantages:
It is a native type to SQL Azure.
It is infinitely big and you will never run out of GUIDs
Works with both horizontal partitioning and Data Sync Services.
Disadvantage:
The disadvantages of using this technique is that based on the GUID; there is no way to identify what database generated it. This can cause extra complications when doing horizontal partitioning.
The uniqueidentifier data type is large and will add to the size of your row.
Bigint
Another option is to use a bigint data type in place of an int. In this technique, the primary key is generated from being an identity column; however each identity in each database starts at a different offset. Different offset create the non-conflicting primary keys.
The first question most people ask, is bigint data type big enough to represent all the primary keys need. The bigInt data type can be as large as 9,223,372,036,854,775,807 because it is stored in 8 bytes. This is 4,294,967,298 times bigger than the maximum size of an int data type: 2,147,483,647. This means that you could potentially have 4 billion SQL Azure databases horizontally partitioned with tables of around 2 billion rows. More information about data types and sizes can be found here.
On the first SQL Azure database you would create the table like this:
CREATE TABLE TEST(x bigint PRIMARY KEY IDENTITY (1,1))
On the second SQL Azure database you would create the table like this:
CREATE TABLE TEST(x bigint PRIMARY KEY IDENTITY (2147483648,1))
And continue incrementing the seed value for each database in the horizontal partitioning.
Advantages:
It is easier to upgrade from a legacy tables that used an int data type as the primary key to a bigint data type (the legacy table would be the first partition).
You can reparation easier than some of the other techniques, since moving rows involve a straight forward case statement (not a recalculated hash).
The data tier code implementing the partitioning can figure out which partition that the primary key is in, unlike a using a uniqueidentifier for a primary key.
The bigint data type consumes 8 bytes of space, which is smaller than the uniqueidentifier data type that take up 16 bytes.
Disadvantages:
The database schema for each partition is different.
This technique works well for horizontal partitioning, but not for Data Sync Service.
Primary Key Pool
In this technique a single identity database is built where all the primary keys are stored, however none of the data. This identity database just has a set of matching tables that contain a single column of integers (int data type) as an auto incrementing identity. When an insert is needed on any of the tables across the whole partition, the data tier code inserts into the identity database and fetches the @@IDENTITY. This primary key from the identity database is used as the primary key to insert into the member database or the partition. Because the identity database is generating the keys there is never a conflict.
Read more: SQL Azure team blog