Skip to main content

Data Partitioning

  • a technique to break up a big database (DB) into many smaller parts
  • It is the process of splitting up a DB/table across multiple machines
    • to improve the manageability, performance, availability, and load balancing of an application
  • after a certain scale point
    • it is cheaper and more feasible to scale horizontally
      • by adding more machines
    • It is more expensive to grow it vertically by adding beefier servers

Partitioning Methods

  • 3 most popular schemes used by various large scale applications
    1. Horizontal partitioning
      • put different rows into different tables
      • also called as range based partitioning
        • as we are storing different ranges of data in separate tables
      • also called as Data Sharding
      • key problem
        • if the value whose range is used for partitioning isn’t chosen carefully
          • the partitioning scheme will lead to unbalanced servers
    2. Vertical Partitioning
      • divide our data to store tables related to a specific feature in their own server
      • it is straightforward to implement and has a low impact on the application
      • main problem
        • if our application experiences additional growth
          • it may be necessary to further partition a feature specific DB across various servers
    3. Directory Based Partitioning
      • A loosely coupled approach to work around issues mentioned in the above schemes
        • is to create a lookup service
          • which knows your current partitioning scheme and abstracts it away from the DB access code
      • to find out where a particular data entity resides
        • we query the directory server that holds the mapping between each tuple key to its DB server
      • loosely coupled approach means
        • can perform tasks like
          • adding servers to the DB pool
          • changing our partitioning scheme without having an impact on the application

Partitioning Criteria

  1. Key or Hash-based partitioning
    • apply a hash function to some key attributes of the entity we are storing; that yields the partition number
    • This approach should ensure a uniform allocation of data among servers
    • fundamental problem
      • it effectively fixes the total number of DB servers
        • adding new servers = changing the hash function
          • this would require redistribution of data and downtime for the service
            • A workaround for this problem is to use Consistent Hashing
  2. List partitioning
    • each partition is assigned a list of values
      • whenever we want to insert a new record, we will see which partition contains our key and then store it there
  3. Round-robin partitioning a simple strategy that ensures uniform data distribution With ‘n’ partitions, the ‘i’ tuple is assigned to partition (i mod n) d. Composite partitioning combine any of the above partitioning schemes to devise a new scheme e.g.: first applying a list partitioning scheme and then a hash based partitioning Consistent hashing could be considered a composite of hash and list partitioning where the hash reduces the key space to a size that can be listed

Common Problems of Data Partitioning

  • On a partitioned database, there are certain extra constraints on the different operations that can be performed
    • constraints are due to operations across multiple tables rows in the same table will no longer run on the same server
    • Constraints and additional complexities introduced by partitioning
      1. Joins and Denormalization
        • Performing joins on a database which is running on one server is straightforward
          • but will no longer be so once once a database is partitioned and spread across multiple machines
            • because joins that span database partitions will not be feasible
        • Such joins will not be performance efficient since data has to be compiled from multiple servers
        • solution: denormalize the database
          • so that queries that previously required joins can be performed from a single table
          • however, service now has to deal with all the perils of denormalization such as data inconsistency
      2. Referential integrity
        • trying to enforce data integrity constraints such as foreign keys in a partitioned database can be difficult
        • Most of RDBMS do not support foreign keys constraints across databases on different database servers
          • this means that apps that require referential integrity on partitioned databases have to enforce it in app code
            • in such cases, applications have to run regular SQL jobs to clean up dangling references
      3. Rebalancing
        • many reasons we have to change our partitioning scheme
          1. The data distribution is not uniform
          2. There is a lot of load on a partition
  • In such cases, either we have to create more DB partitions or have to rebalance existing partitions
    • this means the partitioning scheme changed and all existing data moved to new locations
    • Doing this without incurring downtime is extremely difficult
  • Using a scheme like directory based partitioning make rebalancing a palatable experience
    • at the cost of increasing the complexity of the system
    • and creating a new single point of failure (i.e. the lookup service/database)