Friday, August 24, 2012

Performance Counters in Windows Azure Web Role

I tried adding some performance counters for Windows Azure Web Role in the OnStart method of the WebRole and interestingly it was not getting enabled. The code just executed fine with no results present in the “WADPerformanceCountersTable”.

There are a few things which need to be done correctly to ensure that this works

1) The WebRole code should be running in the elevated context. In order to make sure this happens you should add a tag in ServiceDefinition.csdef file under the WebRole which is trying to add the counter. See screenshot below


2) If you are trying to update an existing deployment the new performance counters may not be picked up. To ensure that they are delete the existing deployment and create a new one. It seems that that the performance counters are picked up based on an xml file under the “wad-control-container” blob. This xml file is made for each deployment and for some reason it was not getting updated for my case.

Hope this helps…


Tuesday, August 14, 2012

Block Blob Vs. Page Blob

Windows Azure storage services provide ability to store large binary files using Blob storage. So if I have a large binary file which I want to store on Azure I can use the Blob Storage.

But now comes a tricky part. There are 2 types of blob storage available on Azure viz. you can store a large file as a Block Blob or a Page Blob.

The scenarios in which to use Page Vs. Block Blob according to MSDN documentation here is that Block Blobs are to manage the upload of large files in parallel while Page Blobs are good for Random read and write operations on a Blob. This one went right over my head.

What if I have a conflicting scenario e.g. I have a huge VHD file which I want to be uploaded in parallel and once uploaded do random read and write on it as needed. So I am not really sure which one should I use in such a case. I guess I would go for the Page Blob because upload is a one time operation for VHD file and after that my hypervisor will keep modifying the file for the life of the file. So it is more important that the random read and write operations are faster then the upload speed.

The technical difference can be summarized in the below table


Block Blob

Page Blob

Max Size 200 GB 1 TB
Chunk Size Each Block can be 4 MB Each Page can be 512 Bytes
Write Operation Multiple blocks can be modified and then a Commit Operation commits the modifications in a single go The write operations happens in place and the commit is done then and there rather then waiting for a separate.

So all in all.

Parallel upload == Block Blob

Random Read Write == Page Blob

Hope this helps


Wednesday, August 1, 2012

Designing a scalable Table Storage in Windows Azure

I came across a really good post about how to design a scalable partitioning strategy for Table Storage in Windows Azure. Here’s the link to the original documentation on msdn.

My understanding

1) Azure table storage has multiple servers which serve the queries on your table

2) Table storage can use multiple servers to perform queries and operations on your table. A single server can work on multiple partitions but it is guaranteed that a single partition will always be served by a single server.

3) If a server serving many partitions becomes too hot then it offloads some partitions to a different server. During the time this offloading is taking place the client can get Server Too Busy message from the service. So a retry strategy should be applied at the client (Either a fixed backoff or preferably exponential backoff)

4) Choosing an appropriate partition key is the “key” to design a scalable storage.

a) Too many rows in a single partition can result in bottlenecks as this would mean too many rows have to be served by a single partition. A partition server has a limit of 500 entities per second. If the application hits this throughput then the table service won’t be able to serve the requests. The advantage though is that application can perform batched operations on all the entities in a table thus reducing the costs.

b) Too few rows in a single partition will make querying the table less efficient. The queries will have to hit multiple partition servers to get an aggregate result thus increasing the costs.

5) In case you decide to use unique values in partition keys azure is intelligent to make range partitions itself. The range partitions are created by azure intelligently. If you query the data from a single range partition it may increase the performance of queries but operations like insert/update/delete will definitely be impacted. e.g. if partition key is 001,002,003,…,008 then azure may create a range partition from 001-003, 004-006, 007-008. This means that the next insert will go to the third server. Which means all the inserts are handled by a single server. In order to remove such a possibility application should consider GUIDs as partition keys in such a scenario.

6) The factors which affect the choice of partitioning key are

a) Entity group transactions :- If the application needs to perform batch updates then the participating entities should reside in a single partition. e.g. in an expense system the expense items can be inserted as a batch along-with expense report entity. In such a case ExpenseReportId will be a better partitioning key to ensure that report and items lie in a single partition.

b) Number of Partitions :- Number of partitions and their size will determine how the table will perform under load. Its a very difficult to hit the sweet spot. Using multiple small partitions will generally be a good idea. Azure can easily load balance multiple partitions on multiple partition servers. This can affect the queries though.

c) Queries :- The kind of queries in the system will drive the choice of partitioning key as well. The keys on which the queries are performed are a candidate for partitioning key.

d) Storage Operations :- If the storage operations happen seldom on the table then this is not a factor. For frequent operations partitioning key will determine how the operations are served by the service. If all the operations are served by a single server (e.g. in range partition case) then this can degrade the performance.

Now applying it on a live problem would be a good idea :).