Sharding 101

When RDBMS was invented, it caters to relatively small volumes of transactions compared to Google of today. When scalability and high availability is the name of the game, sharding was devised to cope with rivers of data.

What is Database Sharding? (courtesy of CodeFutures)

The concept of Database Sharding has been gaining popularity over the past several years, due to the enormous growth in transaction volume and size of business application databases. This is particularly true for many successful online service providers, Software as a Service (SaaS) companies, and social networking Web sites.

Database Sharding can be simply defined as a “shared-nothing” partitioning scheme for large databases across a number of servers, enabling new levels of database performance and scalability achievable. If you think of broken glass, you can get the concept of sharding – breaking your database down into smaller chunks called “shards” and spreading those across a number of distributed servers.

The term “sharding” was coined by Google engineers, and popularized through their publication of the Big Table architecture. However, the concept of “shared-nothing” database partitioning has been around for a decade or more and there have been many implementations over this period, especially high profile in-house built solutions by Internet leaders such as eBay, Amazon, Digg, Flickr, Skype, YouTube, Facebook, Friendster, and Wikipedia.

In a word, database sharding is distributing data (writes) and gathering data (reads) as if it is one database only, with scalability and high availability tucked in. Sort of a router for your data of Cisco quality.

The Traditional Scalability and HA Options

  • Master/slave
  • Cluster computing
  • Table partitioning
  • Federated tables

If you are like Google or Facebook who can hire top-notch programmers to do the scalability and HA (failover, replication) code for you, good for you. However, if programming is not your core business, database sharding is difficult. Here’s why: (courtesy of Scalebase)

It requires you to rewrite most of your Data Access Layer from scratch. And while it’s difficult to do when you write your own SQL code, it’s even more complex when using O/R mapping tools, as most are not “sharding oriented”.

But even after writing the initial sharding code, you might run into issues. For instance, a common problem occurs when scaling requires adding more shards. Usually, internally written sharding code supports a fixed number of shards, and adding shards requires massive code rewrites – as well as the major downtime required when moving data from one shard to another.

Other parts of the infrastructure also change when using a sharded database. For example, the reporting application must now be aware of the sharding logic, since you want to collect data from multiple databases rather than just one. And if the reporting application is an off-the-shelf product, you’re out of luck. You’ll have to write the reporting application from scratch.

Backup is an issue. Database Administration is an issue. And more complexities just continue to pop up.

So Database Sharding is a great solution for Database scaling, but it’s complex and costly.

Add to that some more issues (again courtesy of CodeFutures):

  • database reliability (the C in ACID)
  • distributed queries
  • avoidance of cross-shard joins (through replication of global/lookup tables)
  • key management (hash, list, and range)
  • support for multiple shard schemes

So what are the options?


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s