I came across a really good post about how to design a scalable partitioning strategy for Table Storage in Windows Azure. Here’s the link to the original documentation on MSDN.
My understanding
1) Azure Table Storage uses multiple servers to serve the queries and operations on your table.
2) A single server can serve multiple partitions, but a given partition is guaranteed to be served by exactly one server at a time.
3) If a server serving many partitions becomes too hot, it offloads some partitions to a different server. While this offloading is taking place, the client can get a Server Busy response from the service, so a retry strategy should be applied at the client (either a fixed backoff or, preferably, an exponential backoff).
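The retry idea above can be sketched in a few lines. This is a minimal, SDK-agnostic sketch: `ServerBusyError` is a hypothetical stand-in for the service's busy response, and the jitter factor is my own addition to keep many clients from retrying in lockstep.

```python
import random
import time


class ServerBusyError(Exception):
    """Hypothetical stand-in for the storage service's 'Server Busy' response."""


def with_exponential_backoff(operation, max_retries=5, base_delay=0.5):
    """Run `operation`, retrying on busy responses with a doubling delay."""
    for attempt in range(max_retries):
        try:
            return operation()
        except ServerBusyError:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            # 2**attempt doubles the wait each time; the random factor adds jitter.
            delay = base_delay * (2 ** attempt) * (0.5 + random.random() / 2)
            time.sleep(delay)
```

A fixed backoff would simply drop the `2 ** attempt` term; the exponential variant is preferable because it backs off aggressively while a partition is being moved.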
4) Choosing an appropriate partition key is the “key” to designing scalable storage.
a) Too many rows in a single partition can result in bottlenecks, as too many rows have to be served by a single partition server. A partition server has a limit of 500 entities per second; if the application exceeds this throughput, the table service will start rejecting requests. The advantage, though, is that the application can perform batched operations on the entities within that partition, reducing transaction costs.
b) Too few rows in a single partition will make querying the table less efficient: queries that span partitions have to hit multiple partition servers to produce an aggregate result, which increases costs.
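The 500-entities-per-second figure gives a quick sizing rule of thumb. The sketch below assumes load spreads evenly across partitions, which is rarely true in practice, so treat the result as a lower bound only:

```python
import math

# Per-partition throughput limit quoted in the text above (entities/second).
PARTITION_THROUGHPUT_LIMIT = 500


def min_partitions(peak_entities_per_sec):
    """Lower bound on the number of partitions needed to stay under the
    per-partition limit, assuming perfectly even load distribution."""
    return math.ceil(peak_entities_per_sec / PARTITION_THROUGHPUT_LIMIT)
```

For example, a peak load of 1800 entities/second needs at least 4 partitions; a skewed key distribution would need more.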
5) If you use sequential unique values as partition keys, Azure is intelligent enough to group them into range partitions itself. Querying data from a single range partition may improve query performance, but operations like insert/update/delete will definitely be impacted. E.g. if the partition keys are 001, 002, 003, …, 008, Azure may create range partitions 001–003, 004–006, and 007–008. The next insert (009) then lands in the last range partition, which means all new inserts are handled by a single server. To remove such a possibility, the application should consider GUIDs as partition keys in such a scenario.
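The append-only hot spot is easy to see by looking at key ordering. This small sketch contrasts the sequential keys from the example with GUID keys (the exact range-partition boundaries Azure picks are not observable from the client, so only the ordering argument is shown):

```python
import uuid

# Sequential keys sort in insertion order, so every new key lands at the
# end of the key space -- the last range partition takes every write.
sequential_keys = [f"{i:03d}" for i in range(1, 9)]  # "001" .. "008"
new_key = "009"

# GUID keys are effectively random, so new inserts scatter across the key
# space and Azure can spread the resulting range partitions over servers.
guid_keys = [str(uuid.uuid4()) for _ in range(8)]
```

Here `new_key` sorts after every existing sequential key, which is exactly the pattern that funnels all inserts to one server; a fresh GUID has no such tendency.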
6) The factors which affect the choice of partition key are:
a) Entity group transactions :- If the application needs to perform batch updates, the participating entities must reside in a single partition. E.g. in an expense system, the expense items can be inserted as a batch along with the expense report entity. In such a case ExpenseReportId is a better partition key, ensuring that the report and its items lie in a single partition.
b) Number of partitions :- The number of partitions and their size determine how the table performs under load. It’s very difficult to hit the sweet spot. Using many small partitions is generally a good idea, since Azure can easily load-balance them across partition servers; this can affect cross-partition queries, though.
c) Queries :- The kinds of queries in the system drive the choice of partition key as well. The properties the queries filter on are candidates for the partition key.
d) Storage operations :- If storage operations on the table are seldom, this is not a factor. For frequent operations, the partition key determines how the operations are distributed across servers; if all operations end up served by a single server (as in the range-partition case above), performance degrades.
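The entity-group-transaction rule from point a) above can be sketched as a batch builder. This is not a specific Azure SDK call; the entities are plain dicts with the `PartitionKey`/`RowKey` properties Table Storage requires, and `build_expense_batch` is a hypothetical helper that enforces the single-partition constraint:

```python
import uuid


def build_expense_batch(report, items):
    """Group an expense report and its items into one batch, enforcing the
    entity-group-transaction rule: every entity shares one PartitionKey."""
    partition_key = report["ReportId"]  # ExpenseReportId as partition key
    batch = [{"PartitionKey": partition_key, "RowKey": "report", **report}]
    for item in items:
        batch.append({
            "PartitionKey": partition_key,          # same partition for all
            "RowKey": str(uuid.uuid4()),            # unique within the partition
            **item,
        })
    # A batch spanning partitions would be rejected by the service anyway;
    # failing fast here surfaces the design error at build time.
    assert len({e["PartitionKey"] for e in batch}) == 1
    return batch
```

With this shape, the report and all of its items can be submitted as one atomic entity group transaction against a single partition server.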
Now applying this to a live problem would be a good idea :).
Njoi!!!
~Abhishek