How Notion Scaled to 100 Million Users Without Their Database Exploding

Brief Summary

Notion faced a major performance crisis in 2021 as its database grew to terabytes of data, causing slowdowns and threatening customer churn. To address this, the team implemented a multi-step sharding strategy, first splitting the database into 32 physical instances with 15 logical shards each. They used PgBouncer for connection pooling and ran a data migration process to move existing data onto the new shards. By late 2022, however, performance issues resurfaced, prompting a further reshard into 96 instances with fewer logical shards per instance. They used logical replication for data synchronization and a dark read testing strategy to verify data consistency before switching to the new sharded system. The resharding project significantly improved performance, increased capacity, and future-proofed Notion's infrastructure to handle continued user growth.

  • Notion's database reached terabytes of data, causing performance issues.
  • They implemented a multi-step sharding strategy to address the problem.
  • The resharding project significantly improved performance and increased capacity.

Database Crisis

Notion's database, built on PostgreSQL, experienced significant performance issues in 2021 due to its massive size. The database, which stores all data as "blocks" (including text, images, and pages), had grown to terabytes, even after compression. The single, monolithic database struggled to handle the load, causing slowdowns and increased latency for users. The VACUUM process, responsible for reclaiming storage space occupied by deleted data, was stalling, further degrading performance. Worse, the database was approaching transaction ID (TXID) wraparound: PostgreSQL identifies transactions with a 32-bit counter, and if the oldest unfrozen rows age past roughly two billion transactions, the database stops accepting writes, effectively becoming read-only and causing a major disruption for users.
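
For a concrete sense of the wraparound risk, here is a minimal Python sketch of the kind of monitoring that catches it early. The connection string and alert threshold are placeholders, not Notion's actual setup; only the age(datfrozenxid) query is standard PostgreSQL.

```python
import psycopg2

# Placeholder connection; point this at the instance you want to watch.
conn = psycopg2.connect("dbname=postgres host=localhost")

WRAPAROUND_LIMIT = 2**31          # PostgreSQL TXIDs are 32-bit: ~2.1B hard ceiling
ALERT_THRESHOLD = 1_500_000_000   # assumed: alert well before emergency autovacuum

with conn.cursor() as cur:
    # age(datfrozenxid) = how many transactions old the oldest unfrozen row is
    cur.execute("SELECT datname, age(datfrozenxid) FROM pg_database ORDER BY 2 DESC")
    for datname, xid_age in cur.fetchall():
        pct = 100 * xid_age / WRAPAROUND_LIMIT
        status = "DANGER" if xid_age > ALERT_THRESHOLD else "ok"
        print(f"{datname}: {xid_age:,} XIDs old ({pct:.1f}% of limit) [{status}]")
conn.close()
```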

Initial Sharding Strategy

To address the performance issues, Notion decided to implement horizontal scaling by sharding their database. They split the database into 32 physical instances, each containing 15 logical shards. The partition key for sharding was the workspace ID, as users typically access data within a single workspace. This ensured that related data remained within the same shard. The new sharded setup needed to handle existing data and scale to meet projected growth for at least two years. To achieve this, they chose instances with at least 60,000 IOPS (Input/Output Operations Per Second) and set limits of 500GB per table and 10TB per physical instance.
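
Notion's exact hash function is not public, but the routing logic can be sketched as follows. The crc32 choice, shard naming, and numbering scheme below are illustrative assumptions; only the 32 x 15 topology comes from the article.

```python
import uuid
import zlib

NUM_LOGICAL_SHARDS = 480    # 32 physical instances x 15 logical shards each
SHARDS_PER_INSTANCE = 15

def route(workspace_id: str) -> tuple[int, int]:
    """Map a workspace ID to (physical_instance, logical_shard).

    crc32 is used here because, unlike Python's builtin hash(), it is
    stable across processes; the real scheme is an assumption.
    """
    digest = zlib.crc32(uuid.UUID(workspace_id).bytes)
    logical_shard = digest % NUM_LOGICAL_SHARDS
    physical_instance = logical_shard // SHARDS_PER_INSTANCE
    return physical_instance, logical_shard

instance, shard = route("9b0f47a2-54c5-4a4e-8ac1-33f4f4e9e1d0")
print(f"workspace -> instance db-{instance:02d}, schema shard_{shard:03d}")
```

Because every block in a workspace hashes to the same logical shard, queries that stay within one workspace never have to fan out across instances.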

Data Migration and Routing

Notion explored several options for migrating existing data to the new sharded system. Writing directly to both old and new databases simultaneously was considered but posed a risk of data inconsistency. PostgreSQL's logical replication feature was also considered but was not feasible because workspace IDs were missing from the old database. They ultimately opted for an audit log approach: changes to the old database were recorded, and a catch-up script populated the new databases with the necessary data. The migration took approximately three days on a powerful m5.24xlarge instance with 96 vCPUs. During the migration, record versions were compared to ensure more recent data was never overwritten. Once the migration was complete, the connection pooling layer (PgBouncer) was updated to route queries to the appropriate shards.
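
The version comparison can be sketched as an idempotent upsert. The blocks table, version column, and audit-entry shape below are hypothetical; Notion's real schema and audit-log format are not public.

```python
def apply_audit_entry(cur, entry: dict) -> None:
    """Replay one audit-log entry into a new shard without ever
    overwriting a row that is already newer there (hypothetical schema)."""
    cur.execute(
        """
        INSERT INTO blocks (id, workspace_id, data, version)
        VALUES (%(id)s, %(workspace_id)s, %(data)s, %(version)s)
        ON CONFLICT (id) DO UPDATE
            SET data = EXCLUDED.data, version = EXCLUDED.version
            WHERE blocks.version < EXCLUDED.version
        """,
        entry,
    )
```

The WHERE clause on the conflict branch is what makes the catch-up script safe to re-run: stale audit entries simply become no-ops.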

Resharding for Continued Growth

Despite the initial success of the sharding strategy, Notion ran into performance issues again in late 2022. Some shards were experiencing high CPU utilization, approaching full disk bandwidth utilization, and hitting connection limits in the PgBouncer setup. To address this, they sharded further, increasing the number of physical instances from 32 to 96. They provisioned smaller instances for the new shards, reducing per-instance CPU, IOPS, and cost. Each old instance, which held 15 logical schemas, was replaced by three new instances holding only five logical schemas each; because each new instance carried a lighter expected load, it could use a smaller instance type.
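
The split works out cleanly because 15 schemas divide into three groups of five. A small sketch of the mapping, with instance and shard numbering assumed for illustration:

```python
SHARDS_PER_OLD_INSTANCE = 15
SHARDS_PER_NEW_INSTANCE = 5

def split_instance(old_instance: int) -> dict[int, list[int]]:
    """Assign one old instance's 15 logical schemas to three new
    instances of 5 schemas each (numbering is an assumption)."""
    mapping: dict[int, list[int]] = {}
    for offset in range(SHARDS_PER_OLD_INSTANCE):
        shard = old_instance * SHARDS_PER_OLD_INSTANCE + offset
        mapping.setdefault(shard // SHARDS_PER_NEW_INSTANCE, []).append(shard)
    return mapping

print(split_instance(0))
# {0: [0, 1, 2, 3, 4], 1: [5, 6, 7, 8, 9], 2: [10, 11, 12, 13, 14]}
```

Keeping whole schemas together means no row ever has to be re-hashed; data only moves between machines, never between logical shards.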

Data Synchronization and Optimization

For data synchronization, Notion utilized PostgreSQL's logical replication feature to continuously apply changes to the new databases. They set up three publications on each existing database, each covering five logical schemas. On the new databases, subscriptions were created to consume these publications, effectively copying the relevant data. To optimize the synchronization process, they delayed index creation until after the data transfer, reducing the syncing time from three days to 12 hours.
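
Wiring that topology up with stock PostgreSQL primitives might look like the sketch below. It assumes PostgreSQL 15's FOR TABLES IN SCHEMA shorthand (older versions would list tables explicitly), and the DSNs and schema names are placeholders.

```python
import psycopg2

OLD_DSN = "host=old-shard-00 dbname=notion"      # placeholder source instance
NEW_DSN = "host=new-shard-{:02d} dbname=notion"  # placeholder targets

# On the existing instance: three publications, five logical schemas each.
old = psycopg2.connect(OLD_DSN)
old.autocommit = True
with old.cursor() as cur:
    for pub in range(3):
        schemas = ", ".join(f"shard_{pub * 5 + i:03d}" for i in range(5))
        cur.execute(f"CREATE PUBLICATION pub_{pub} FOR TABLES IN SCHEMA {schemas}")
old.close()

# On each new instance: one subscription consuming its publication.
for pub in range(3):
    new = psycopg2.connect(NEW_DSN.format(pub))
    new.autocommit = True  # CREATE SUBSCRIPTION cannot run inside a transaction
    with new.cursor() as cur:
        cur.execute(
            f"CREATE SUBSCRIPTION sub_{pub} "
            f"CONNECTION '{OLD_DSN}' PUBLICATION pub_{pub}"
        )
    new.close()
```

Creating indexes only after the initial table sync finishes is what cut the copy from three days to 12 hours: bulk-loading into an unindexed table avoids maintaining B-trees row by row.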

PgBouncer Resharding and Dark Read Testing

During the transition, Notion ran 100 PgBouncer instances, each managing 32 database entries. The plan was to add 96 new entries to PgBouncer, temporarily pointing them at the existing shards, and gradually migrate data to the new shards. However, testing revealed a critical issue: since each old shard mapped to three new shards, they had to either shrink each PgBouncer instance's connection pool to a third of its size or triple the total number of connections to Postgres. Tripling connections would overload the databases, while shrinking pools would cause query backlogs. The solution was to shard the PgBouncer cluster itself, creating four new clusters, each managing 24 databases. This allowed them to increase connections per PgBouncer instance per shard, limit total connections per Postgres instance, and isolate PgBouncer issues to a smaller portion of the fleet.
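
The connection arithmetic behind that decision can be roughed out as below. The per-backend pool size and the even split of 100 poolers into four clusters of 25 are assumptions; only the cluster and database counts come from the article.

```python
POOL_SIZE = 8  # assumed server connections each PgBouncer keeps per backend DB

def conns_per_postgres(poolers_routing_to_it: int, pool_size: int = POOL_SIZE) -> int:
    """Server-side connections one Postgres instance must accept."""
    return poolers_routing_to_it * pool_size

# Before: one flat cluster of 100 PgBouncers, all routing to every shard.
print(conns_per_postgres(100))  # 800 connections per Postgres instance

# After: four clusters of ~25 PgBouncers, each cluster owning 24 databases,
# so any given Postgres instance only ever sees one cluster's worth of poolers.
print(conns_per_postgres(25))   # 200, leaving headroom to grow pool sizes
```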

Before rolling out these changes to production, Notion implemented a "dark read" testing strategy. This involved fetching data from both the new and old databases when requests were made, comparing the results for consistency, and logging any discrepancies. To minimize impact on user experience, they limited the comparison to queries returning up to five rows and sampled a small portion of requests. A one-second delay was introduced before issuing the dark read query to allow time for replication catch-up. The testing showed nearly 100% identical data.
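
A minimal sketch of that dark-read pattern, with the sampling rate, database handles, and inline sleep as simplifications (in production the shadow comparison would run off the hot path rather than block the request):

```python
import logging
import random
import time

SAMPLE_RATE = 0.01  # assumed fraction of requests to shadow-compare
MAX_ROWS = 5        # per the article: skip comparisons on larger result sets

def fetch_with_dark_read(query, params, old_db, new_db):
    """Serve from the old (authoritative) shard; occasionally re-read the
    new shard and log any mismatch. DB handles are illustrative stand-ins."""
    rows = old_db.execute(query, params)
    if len(rows) <= MAX_ROWS and random.random() < SAMPLE_RATE:
        time.sleep(1)  # give logical replication a second to catch up
        shadow = new_db.execute(query, params)
        if sorted(rows) != sorted(shadow):
            logging.warning("dark-read mismatch: %s %s", query, params)
    return rows
```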

Failover and Success

The failover to the new sharded system followed a careful sequence:

  • Halt new connections while allowing ongoing queries to complete.
  • Confirm the new databases were fully caught up on replication.
  • Update PgBouncer to point to the new databases.
  • Reverse the replication direction, streaming changes from the new databases back to the old one as a precaution.

Once these steps were completed, traffic was resumed against the new shards. The resharding project was a significant success for Notion: capacity increased, performance improved (CPU and IOPS utilization dropped significantly), and the architecture was future-proofed to handle continued user growth and rising data demands.
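
The same sequence, expressed as a runnable orchestration sketch in which every helper is a hypothetical stub standing in for real infrastructure:

```python
class PgBouncerCluster:
    """Stub for the pooling layer; real control would use PgBouncer's admin console."""
    def pause(self): print("PAUSE: stop handing out new connections")
    def repoint(self, old, new): print(f"repoint {old} -> {new}")
    def resume(self): print("RESUME: traffic flows to the new shards")

def drain_inflight_queries(shard): print(f"{shard}: in-flight queries finished")
def wait_for_catchup(old, new): print(f"{new}: fully caught up with {old}")
def reverse_replication(source, target):
    print(f"{source} now streams changes back to {target} as a safety net")

def failover(old_shard, new_shards, pgbouncer):
    pgbouncer.pause()                          # 1. halt new connections...
    drain_inflight_queries(old_shard)          #    ...but let running queries finish
    for shard in new_shards:
        wait_for_catchup(old_shard, shard)     # 2. confirm zero replication lag
    pgbouncer.repoint(old_shard, new_shards)   # 3. route to the new databases
    for shard in new_shards:
        reverse_replication(shard, old_shard)  # 4. keep the old DB as a fallback
    pgbouncer.resume()

failover("db-00", ["db-00a", "db-00b", "db-00c"], PgBouncerCluster())
```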
