imgix Blog | Behind the Scenes: Improving Google Spanner Efficiency

Google Spanner Efficiency Header Image

The imgix Asset Manager is a key component of our platform. We're dedicated to supporting our users' every asset management requirement — be it the magic of AI-powered image tagging, vigilant alerts for risky content, or empowering advanced searches. The reliability and efficiency of the infrastructure behind our Asset Manager isn’t just crucial; it’s the cornerstone of our commitment to providing an experience that's not just efficient but resonates with the passion we have for high-quality images and serving our users.

As our database of images and videos grew, our operational capacity faced increasing challenges. Even though this isn't something our customers notice today or is likely to grab their attention, it's our deep-seated commitment to make sure our operations remain fast, reliable, and resilient well into the future. We're driven by the vision of standing resilient against any surge in demand as imgix continues to expand its horizons. In the sections that follow, we're eager to unfold the narrative of how we poured our passion into refining our operations, always staying ahead of the evolving demands on our infrastructure.

The Challenge: Free Kitefin From Operational Strains

Under the hood, we use Google Cloud Spanner and Dataflow to drive the essential functions of Asset Manager. This encompasses a wide range of tasks, including video transcoding, image AI tagging, and tracking upload statuses. Spanner is our largest database, as every asset served from our CDN triggers a corresponding write operation in our Spanner database.

At the core of Asset Manager lies a set of Dataflow pipelines internally referred to as "Kitefin." Kitefin systematically traverses our Spanner database, identifying new assets and monitoring changes in existing ones, ensuring consistency between the data in Spanner and your Origin source. As the number of assets in our database grew over time, the demand on our infrastructure multiplied, due to Kitefin's need to constantly analyze the database's evolving state. This increased our risk of performance degradation, including a longer wait time or timeout to refresh new and updated assets in Asset Manager. It could also raise our operational costs substantially. The following are steps we undertook to achieve a 30% reduction in capacity consumption.

Remedy 1 - Query Optimization

Our initial implementation of Kitefin involved fetching assets from multiple sources simultaneously using partitionable queries, leveraging Apache Beam as the underlying library for Dataflow. This approach seemed promising due to Apache Beam's ability to effectively parallelize work. However, the query performance was inconsistent, primarily because the partitioned queries didn’t consistently batch sources in a deterministic manner.

These queries gradually became more time-consuming, occasionally resulting in timeouts that eventually became frequent. In response, we decided to reevaluate our approach to querying Spanner.

After conducting profiling and experimenting with different ideas, we determined that issuing multiple queries individually for each data source yielded the most consistent performance. These queries were executed in parallel within a non-autoscaling worker pool, minimizing the number of connections made to Spanner. To further enhance query speed, we also created a dedicated read index tailored for these jobs. While this optimization required a few terabytes of additional storage and took several days to build, it significantly improved query efficiency.

Furthermore, this optimization had a positive impact on our Dataflow billing, as it reduced the CPU time required to execute these queries.

Here are the results after we deployed that change:

Google Dataflow CPU consumption chart

Remedy 2 - Giving Spanner a Data Boost

From the jump, we provisioned an ample number of Spanner compute nodes to effectively handle the bursty demands placed by Kitefin. This strategic provisioning enabled us to keep our database's CPU usage well below Google's recommended threshold, even during unexpected spikes in asset processing demands. To our somewhat expected relief, this ensured that our database operations didn’t interfere with the performance of other services relying on Spanner.

Furthermore, when Google introduced Spanner Data Boost in June 2023, we quickly started to evaluate its potential as the on-demand compute resources for Spanner workloads. Soon after, we transitioned our most substantial tasks to leverage this new capability. Utilizing on-demand compute resources proved significantly more cost-effective compared to running provisioned nodes around the clock. As a result, we reduced our Spanner node count by 40%, leading to notable cost savings without compromising our operational efficiency.

While some of our compute costs were shifted to Data Boost expenses, our overall savings from reduced capacity more than made up for the additional Data Boost costs.

These are the results after enabling the feature:

Total CPU Utilization chart

Takeaways

When querying data across multiple shards, it can be more performant to query the shards individually rather than querying multiple shards simultaneously.
Data Boost is convenient as it enables us to maintain a serverless pool of compute resources to handle sudden bursts of load effectively.

This journey underscores our commitment to continually refining our technology stack and operational strategies to meet the evolving demands of our services. Through these efforts, we've achieved a balance between enhanced performance and cost savings, ensuring that Asset Manager remains a reliable and efficient component of our infrastructure.