Managed cloud databases bring speed, scale and new features for ecobee customers

An ecobee home isn’t just smart, it’s intelligent. It learns, adjusts, and adapts based on your needs, behaviors, and preferences. We design meaningful solutions that include smart cameras, light switches, and thermostats that work so well together, they fade into the background and become an essential part of your everyday life.

Our very first product was the world’s very first smart thermostat (yes, really), and we launched it in 2007. In developing SmartThermostat, we originally used a homegrown software stack built on relational databases that we kept scaling out. ecobee thermostats send device telemetry data to the back end. This data drives the Home IQ feature, which gives users data visualizations of how their HVAC system is performing and how well it is maintaining their comfort settings. There’s also the eco+ feature, which supercharges the SmartThermostat to be even more efficient, helping customers make the best use of peak hours when cooling or heating their home. As more and more ecobee thermostats came online, we found ourselves running out of space. The volume of telemetry data we had to handle kept growing, and we found it really challenging to scale out our existing solution in our colocated data center.

Graph showing 95th percentile latency dropping to 0.1 of what it was on our previous storage solution, a 10x performance improvement, after switching to Bigtable.

In addition, we were seeing lag time when we ran high-priority jobs on our database replica. We spent a lot of sprint time just debugging and fixing recurring issues. To meet our aggressive product development goals, we had to move quickly to find a better-designed, more flexible solution.

Choosing cloud for speed and scale

With the scalability and capacity problems we were having, we looked to cloud services, and knew we wanted a managed service. We first adopted BigQuery as a solution to use with our data store. For cooler data, meaning anything older than six months, we read from BigQuery, which reduces the amount we keep in the hot data store.

The pay-per-query model wasn’t the right fit for our development databases, though, so we explored Google Cloud’s database services. We started by understanding the access patterns of the data we’d be running on the database, and found that the database didn’t have to be relational. The data didn’t have a defined schema but did require low latency and high scalability. We also had tens of terabytes of data to migrate to the new solution. We found that Cloud Bigtable would be our best option to fill our need for horizontal scale, expanded read rate capacity, and disk that would scale as far as we needed, instead of disk that would hold us back. We’re now able to scale to as many SmartThermostats as we need and handle all of that data.
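
To make the access-pattern reasoning concrete, here is a minimal sketch of one common way to key time-series telemetry in a wide-column store such as Bigtable: prefix each row key with the device identifier and append a reversed timestamp, so that a device’s newest readings sort first. The key layout, the `telemetry_row_key` helper, the `MAX_TS_MS` constant, and the sample IDs are illustrative assumptions, not ecobee’s actual schema.

```python
from datetime import datetime, timezone

# Hypothetical upper bound used to reverse timestamps (milliseconds since epoch).
MAX_TS_MS = 10**13 - 1


def telemetry_row_key(thermostat_id: str, reading_time: datetime) -> bytes:
    """Build an illustrative row key: <thermostat_id>#<reversed timestamp>."""
    ts_ms = int(reading_time.timestamp() * 1000)
    reversed_ts = MAX_TS_MS - ts_ms
    # Zero-pad so lexicographic byte order matches numeric order.
    return f"{thermostat_id}#{reversed_ts:013d}".encode("utf-8")


older = telemetry_row_key("thermostat-0001", datetime(2021, 1, 1, tzinfo=timezone.utc))
newer = telemetry_row_key("thermostat-0001", datetime(2021, 1, 2, tzinfo=timezone.utc))

# Within one device prefix, the newer reading sorts before the older one.
print(sorted([older, newer])[0] == newer)
```

With a layout like this, reading the latest data for one thermostat becomes a short scan over a single key prefix, which suits a store that is optimized for low-latency reads of contiguous key ranges.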

Home IQ system monitor dashboard showing HVAC runtimes and home temperature over time. This is powered by data in Bigtable.

Enjoying the results of a better back end

The biggest advantage we’ve seen since switching to Bigtable is the cost savings. By migrating all our data, hot and cold, to Bigtable, we significantly reduced the cost of running Home IQ features and cut the feature’s latency to a tenth of what it was. Our Google Cloud cost went from about $30,000 per month to about $10,000 per month, a reduction of roughly two-thirds, once we added Bigtable, even as we scaled our usage to more use cases. Those are profound improvements.

We’ve also saved a ton of engineering time with Bigtable on the back end. Another huge benefit is that we can use traffic routing, so it’s much easier to shift traffic to different clusters based on workload. We currently use single-cluster routing to route writes and high-priority workloads to our primary cluster, while batch and other low-priority workloads get routed to our secondary cluster. The cluster an application uses is configured through its specific application profile. The drawback with this setup is that if a cluster becomes unavailable, there is visible customer impact in terms of latency spikes, and this hurts our service-level objectives (SLOs). Also, switching traffic to another cluster with this setup is manual. We have plans to switch to multi-cluster routing to mitigate these issues, since Bigtable will automatically switch operations to another cluster in the event a cluster is unavailable.
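
The difference between the two routing policies can be sketched with a small, self-contained model. This is not the Bigtable client API: the `AppProfile` class, the profile and cluster names, and the `route` function are hypothetical and only mimic the behavior described above.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class AppProfile:
    """Toy stand-in for an application profile (not the real Bigtable API)."""
    name: str
    # With single-cluster routing, requests go only to this cluster.
    single_cluster: Optional[str] = None
    # With multi-cluster routing, any available cluster may serve the request.
    multi_cluster: bool = False


def route(profile: AppProfile, available: Set[str]) -> Optional[str]:
    """Return the cluster that would serve a request, or None if none can."""
    if profile.multi_cluster:
        # Pick deterministically for the sketch; the real service makes this choice itself.
        return min(available) if available else None
    return profile.single_cluster if profile.single_cluster in available else None


serving = AppProfile("serving", single_cluster="primary")
batch = AppProfile("batch", single_cluster="secondary")
failover = AppProfile("serving-multi", multi_cluster=True)

healthy = {"primary", "secondary"}
degraded = {"secondary"}  # the primary cluster is unavailable

print(route(serving, healthy))    # primary
print(route(batch, healthy))      # secondary
print(route(serving, degraded))   # None: traffic must be switched manually
print(route(failover, degraded))  # secondary: automatic failover
```

The trade-off the model makes visible is that single-cluster routing isolates workloads from each other, while multi-cluster routing trades that isolation for availability.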

And the benefits of using a managed service are huge. Now that we’re not constantly managing our infrastructure, there are so many possibilities to explore. We’re focused now on improving our product’s features and scaling it out. We use Terraform to manage our infrastructure, so scaling up is now as simple as applying a Terraform change. Our Bigtable instance is well-sized to support our current load, and scaling up that instance to support more thermostats is easy. Given our existing access patterns, we’ll only have to scale Bigtable usage as our storage needs increase. Since we only keep data for a retention period of eight months, this will be driven by the number of thermostats online.
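
As a back-of-the-envelope illustration of why storage tracks the number of thermostats once retention is fixed, the sketch below estimates stored bytes and the nodes needed to hold them. Every number in it (bytes per reading, readings per day, per-node storage capacity, 30-day months) is a placeholder assumption, not an ecobee or Bigtable figure, and the helper names are hypothetical.

```python
import math

RETENTION_DAYS = 8 * 30  # eight-month retention, approximated as 30-day months


def estimated_storage_bytes(thermostats: int, readings_per_day: int,
                            bytes_per_reading: int,
                            retention_days: int = RETENTION_DAYS) -> int:
    """Steady-state storage: every device holds `retention_days` of readings."""
    return thermostats * readings_per_day * bytes_per_reading * retention_days


def nodes_for_storage(total_bytes: int, bytes_per_node: int) -> int:
    """Minimum node count so that storage fits, ignoring CPU and throughput."""
    return max(1, math.ceil(total_bytes / bytes_per_node))


# Placeholder inputs: 288 readings/day (one every 5 minutes), 200 bytes each.
one_million = estimated_storage_bytes(1_000_000, 288, 200)
two_million = estimated_storage_bytes(2_000_000, 288, 200)

print(two_million == 2 * one_million)  # storage grows linearly with devices
print(nodes_for_storage(one_million, bytes_per_node=4 * 10**12))
```

Because retention caps how much history each device contributes, the only term that keeps growing is the device count, so capacity planning reduces to multiplying by the number of thermostats online.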

Ops heatmap showing what our hot key ranges are. Note, though, that the hot key ranges are in the 10 microOps/row/min band, which is still very low.

The Cloud Console also offers a continually updated heat map that shows how keys are being accessed, how many rows exist, how much CPU is being used, and more. That’s really helpful in ensuring we design good key structure and key formats going forward. We also set up alerts on Bigtable in our monitoring system and use heuristics so we know when to add more clusters.
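
One simple form such a heuristic could take is a sustained-threshold check: flag the need for more capacity only when CPU utilization stays above a target for several consecutive samples, so that a single spike does not trigger scaling. The 70% threshold, the window size, and the `needs_more_capacity` helper are hypothetical and do not describe ecobee’s actual alerting rules.

```python
from typing import Sequence


def needs_more_capacity(cpu_samples: Sequence[float],
                        threshold: float = 0.70,
                        window: int = 3) -> bool:
    """True if the last `window` samples all exceed `threshold`."""
    if len(cpu_samples) < window:
        return False  # not enough data to decide
    return all(sample > threshold for sample in cpu_samples[-window:])


print(needs_more_capacity([0.40, 0.95, 0.45, 0.50]))  # False: one brief spike
print(needs_more_capacity([0.55, 0.72, 0.78, 0.81]))  # True: sustained load
```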

Now, when our customers see up-to-the-minute energy use in their homes, and when thermostats switch automatically to cool or heat as needed, that information is all backed by Bigtable.