- a technique to break up a big database (DB) into many smaller parts
- It is the process of splitting up a DB/table across multiple machines
- to improve the manageability, performance, availability, and load balancing of an application
- after a certain scale point
- it is cheaper and more feasible to scale horizontally
- It is more expensive to grow it vertically by adding beefier servers
Partitioning Methods
- 3 most popular schemes used by various large scale applications
- Horizontal partitioning
- put different rows into different tables
- also called as range based partitioning
- as we are storing different ranges of data in separate tables
- also called as Data Sharding
- key problem
- if the value whose range is used for partitioning isn’t chosen carefully
- the partitioning scheme will lead to unbalanced servers
- Vertical Partitioning
- divide our data to store tables related to a specific feature in their own server
- it is straightforward to implement and has a low impact on the application
- main problem
- if our application experiences additional growth
- it may be necessary to further partition a feature specific DB across various servers
- Directory Based Partitioning
- A loosely coupled approach to work around issues mentioned in the above schemes
- is to create a lookup service
- which knows your current partitioning scheme and abstracts it away from the DB access code
- to find out where a particular data entity resides
- we query the directory server that holds the mapping between each tuple key to its DB server
- loosely coupled approach means
- can perform tasks like
- adding servers to the DB pool
- changing our partitioning scheme without having an impact on the application
Partitioning Criteria
- Key or Hash-based partitioning
- apply a hash function to some key attributes of the entity we are storing; that yields the partition number
- This approach should ensure a uniform allocation of data among servers
- fundamental problem
- it effectively fixes the total number of DB servers
- adding new servers = changing the hash function
- this would require redistribution of data and downtime for the service
- A workaround for this problem is to use Consistent Hashing
- List partitioning
- each partition is assigned a list of values
- whenever we want to insert a new record, we will see which partition contains our key and then store it there
- Round-robin partitioning
- a simple strategy that ensures uniform data distribution
- With ‘n’ partitions, the ‘i’ tuple is assigned to partition (i mod n)
d. Composite partitioning
- combine any of the above partitioning schemes to devise a new scheme
- e.g.: first applying a list partitioning scheme and then a hash based partitioning
- Consistent hashing could be considered a composite of hash and list partitioning
- where the hash reduces the key space to a size that can be listed
Common Problems of Data Partitioning
- On a partitioned database, there are certain extra constraints on the different operations that can be performed
- constraints are due to operations across multiple tables rows in the same table will no longer run on the same server
- Constraints and additional complexxities introduced by partitioning
- Joins and Denormalization
- Performing joins on a database which is running on one server is straightforward
- but will no longer be so once once a database is partitioned and spread across multiple machines
- because joins that span database partitions will not be feasible
- Such joins will not be performance efficient since data has to be compiled from multiple servers
- solution: denormalize the database
- so that queries that previously required joins can be performed from a single table
- however, service now has to deal with all the perils of denormalization such as data inconsistency
- Referential integrity
- trying to enforce data integrity constraints such as foreign keys in a partitioned database can be difficult
- Most of RDBMS do not support foreign keys constraints across databases on different database servers
- this means that apps that require referential integrity on partitioned databases have to enforce it in app code
- in such cases, applications have to run regular SQL jobs to clean up dangling references
- Rebalancing
- many reasons we have to change our partitioning scheme
- The data distribution is not uniform
- There is a lot of load on a partition
- In such cases, either we have to create more DB partitions or have to rebalance existing partitions
- this means the partitioning scheme changed and all existing data moved to new locations
- Doing this without incurring downtime is extremely difficult
- Using a scheme like directory based partitioning make rebalancing a palatable experience
- at the cost of increasing the complexity of the system
- and creating a new single point of failure (i.e. the lookup service/database)