This post originally appeared on the ecobee engineering blog.
My team at ecobee spent months debugging build failures on the staging site for ecobee.com until we finally discovered a way to solve them with automated Slack notifications.
Production vs. staging
ecobee.com is an e-commerce website, which we populate with content from Contentful and Shopify.
Our designers and writers create new pages by building and assembling “blocks” of content in Contentful. As they work, they can preview their changes on a staging version of our site.
Here’s the live version of ecobee.com:
And here’s the staging version of ecobee.com:
The staging version looks very similar to the live version because, well, that’s the point. It’s generated using the exact same codebase as the live site, and all the content for both versions of the site comes from Contentful.
If you aren’t familiar, here’s what Contentful looks like:
This example is the “Page” entry that creates the homepage for ecobee.com. It’s basically a series of fields: some are required; some are not.
If you mark a field as required, Contentful will confirm that field is populated before it lets you publish the entry. (The same applies to other field validations like max length, number vs. text, etc.) And because the live site only consumes published entries, we know all the fields have been validated (e.g. no required fields are empty), and the data the website receives will match what we expect.
So far, so good.
However, the staging site is a little different: it connects to the same production Contentful data as the live site, but also includes any draft entries that have not yet been published.
That’s by design, so that editors can preview their in-progress changes on staging as they go.
But those draft entries have one major problem: their fields have not been validated.
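One common way to get this split is to read from Contentful's Content Delivery API (published entries only) in production and from its Content Preview API (published plus draft entries) on staging. The sketch below assumes that setup; the post does not say exactly how ecobee wires it up, and SiteEnv and contentfulHost() are hypothetical names.

```typescript
// Hypothetical names; only the two hostnames come from Contentful's public APIs.
type SiteEnv = "production" | "staging";

// The Delivery API serves published entries only.
// The Preview API also serves unpublished drafts, which have not been validated.
function contentfulHost(env: SiteEnv): string {
  return env === "production" ? "cdn.contentful.com" : "preview.contentful.com";
}

console.log(contentfulHost("production")); // cdn.contentful.com
console.log(contentfulHost("staging")); // preview.contentful.com
```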
Broken staging builds
The challenge we faced was that we kept seeing errors like this:
This is the deploy log for a build of our staging website on Netlify. Over and over, we encountered errors saying “Cannot read property X of null” or “Cannot read property X of undefined”, which indicated that our codebase trusted a value existed (because its Contentful field was required) and tried to use it, but failed because that value was unexpectedly empty.
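To make the failure mode concrete, here is a small reproduction. The PageFields shape and readHref() are made up for illustration. The type claims the field always exists, but a draft entry can still deliver null at runtime.

```typescript
// Hypothetical shape of an entry whose `link` field is marked required in Contentful.
interface PageFields {
  link: { href: string };
}

// The type says `link` always exists, but a draft entry can still deliver null at runtime.
const draft = JSON.parse('{"link": null}') as PageFields;

function readHref(fields: PageFields): string {
  return fields.link.href; // throws a TypeError when `link` is null
}

let failure: unknown = undefined;
try {
  readHref(draft);
} catch (err) {
  failure = err;
}
console.log(failure instanceof TypeError); // true
```

The exact wording of the error message depends on the JavaScript engine version, so the sketch only checks the error type.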
So, why didn’t Contentful make sure those required fields were populated?
Draft entries cannot be trusted
The basic reason is that unpublished draft entries (which staging consumes) are not validated by Contentful because Contentful only applies validations when an entry is published.
So, while our production builds were stable (since the live site only consumes validated, published entries), our staging builds were constantly exposed to the risk that a field coming from a draft entry could contain invalid data (or no data at all).
This issue was minor when our team only had one or two content editors. But as our team grew to include half a dozen or more editors working separately and constantly triggering new builds, incomplete data began to hit staging on a daily basis.
Debugging in the dark
When one of these errors occurred, the staging website would stop updating, and one or more devs would have to put aside their work and jump into the build logs to locate an error like the one shown above.
Next came the hard part: trying to figure out which field in Contentful was responsible for the problem.
For example, all the error in the screenshot above says is that the Link component was passed an empty href. It doesn’t say which field in Contentful that empty value came from.
Trying to track down those empty fields was no fun. Each hunt took up a lot of time and involved a lot of head scratching and rooting through Contentful, searching for the root of each problem. Work ground to a halt for editors and developers while we hunted, sometimes for up to an hour.
We were determined to free ourselves from these thankless debugging sessions.
Custom null checks everywhere
To do that, we started adding defensive null checks all over our codebase.
Even though we didn’t have to worry about the existence of required values in production, for the sake of our editors having a working preview site to look at, we realized we needed to treat required fields as unreliable and start guarding against their lack of existence.
So, we started adding things like this all over our codebase that intercepted empty values and identified exactly which Contentful field to fix:
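A guard of this kind might look roughly like the following. This is a minimal sketch, not ecobee's actual code: the LinkBlock shape, the checkLinkHref() helper, and the message wording are all assumptions made for illustration.

```typescript
// Hypothetical shape of a link "block" coming from Contentful.
interface LinkBlock {
  entryName: string; // human-readable name of the Contentful entry
  href?: string | null; // required in Contentful, but possibly empty in a draft
}

// Returns an editor-friendly error message, or null when the field is populated.
function checkLinkHref(block: LinkBlock): string | null {
  if (block.href == null || block.href.trim() === "") {
    return `Contentful entry "${block.entryName}" has an empty "href" field. Please fill it in.`;
  }
  return null;
}

console.log(checkLinkHref({ entryName: "Homepage hero link", href: null }));
console.log(checkLinkHref({ entryName: "Footer link", href: "/support" })); // null
```

The point of the guard is that the message names the entry and the field, which the raw “Cannot read property” error never does.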
Slack to the rescue
We also created a helper, called logError(), to decide whether each error message should appear just in the console or also be posted to Slack:
So, we ask, “Are we on the production site? Or staging? Or just in development?”, and based on the answer we decide if we should stop the build (production only) and/or post the error to Slack (production and staging only). In all environments, we also log the error to the console.
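The decision logic described above can be sketched as follows. Only the name logError() comes from this post. The BuildEnv type, the planForError() helper, and the injected postToSlack callback are assumptions that keep the example self-contained.

```typescript
type BuildEnv = "production" | "staging" | "development";

interface ErrorPlan {
  logToConsole: boolean;
  postToSlack: boolean;
  stopBuild: boolean;
}

// Decide what should happen to an error in each environment.
function planForError(env: BuildEnv): ErrorPlan {
  return {
    logToConsole: true, // every environment
    postToSlack: env !== "development", // production and staging only
    stopBuild: env === "production", // production only
  };
}

// Sketch of logError(): the Slack poster is passed in so nothing here touches the network.
function logError(
  message: string,
  env: BuildEnv,
  postToSlack: (text: string) => void
): void {
  const plan = planForError(env);
  if (plan.logToConsole) console.error(message);
  if (plan.postToSlack) postToSlack(message);
  if (plan.stopBuild) throw new Error(message);
}
```

Throwing is just one way to stop a build. Exiting the process with a non-zero status code would work as well.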
We then created a dedicated Slack channel and invited all our editors to it, created a Slack bot, and used the Slack Web API to automatically post these Slack messages for us:
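For readers who have not used the Slack Web API, posting a message comes down to an authenticated POST to its chat.postMessage method. The sketch below only builds that request. The SlackRequest shape, the buildSlackMessageRequest() helper, the token, and the channel ID are placeholders, not values from our setup.

```typescript
// Shape of an HTTP request that can be handed to fetch() or any other HTTP client.
interface SlackRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Builds a request for Slack's chat.postMessage Web API method.
// `token` is a bot token and `channel` is a channel ID; both are placeholders here.
function buildSlackMessageRequest(
  token: string,
  channel: string,
  text: string
): SlackRequest {
  return {
    url: "https://slack.com/api/chat.postMessage",
    method: "POST",
    headers: {
      Authorization: `Bearer ${token}`,
      "Content-Type": "application/json; charset=utf-8",
    },
    body: JSON.stringify({ channel, text }),
  };
}

const exampleRequest = buildSlackMessageRequest(
  "xoxb-placeholder-token",
  "C0123456789",
  "Staging build failed: a required Contentful field is empty."
);
console.log(exampleRequest.url); // https://slack.com/api/chat.postMessage
```

Sending it is then a single call in any HTTP client, for example fetch(exampleRequest.url, exampleRequest) in environments that provide fetch.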
Now, whenever a build error occurs and is caused by a Contentful data entry problem we know about, the message goes straight to that editor-focused channel so they can hop into Contentful and fix the issue without dev assistance:
An endless process
But, we weren’t done.
We started noticing that after we had fixed the 10 or so errors we had seen over and over, 10 or so new ones appeared. We hadn’t seen them before because the earlier errors had always failed the build first.
It slowly dawned on us that with our current approach of inserting custom null checks and error messages in individual components, this problem would never end. We’d have to null-check every single value we considered required and continue doing that whenever we wrote new code.
That was not going to be scalable for us in the future.
Automated validation step
So, we came up with a better idea.
To avoid having to spread null checks all over our codebase and worry about this forever, early in our build we now query all of the data from Contentful that our website will use along with the rules for every field, and then compare the two.
We take each field and its rules and pass it to a function that asks if it’s invalid. For example, for requiredness, we check if the field’s rules say it’s required and if its value is empty. If both are true at the same time, the field is invalid:
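A requiredness check along these lines could be sketched as below. The FieldRules shape is a simplified assumption (Contentful's real content type definitions carry more detail), and isEmpty() and isMissingRequired() are hypothetical names.

```typescript
// Simplified, hypothetical view of one field's rules from a Contentful content type.
interface FieldRules {
  id: string;
  required: boolean;
}

// Treat null, undefined, blank strings, and empty arrays as "empty".
function isEmpty(value: unknown): boolean {
  if (value === null || value === undefined) return true;
  if (typeof value === "string") return value.trim() === "";
  if (Array.isArray(value)) return value.length === 0;
  return false;
}

// A field fails the requiredness check only when it is required AND empty.
function isMissingRequired(rules: FieldRules, value: unknown): boolean {
  return rules.required && isEmpty(value);
}

console.log(isMissingRequired({ id: "href", required: true }, null)); // true
console.log(isMissingRequired({ id: "subtitle", required: false }, null)); // false
console.log(isMissingRequired({ id: "href", required: true }, "/products")); // false
```

Other validations, such as max length or number vs. text, would follow the same pattern: one small predicate per rule.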
Then, we have a single place where we call our logError() helper as many times as necessary:
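That single validation pass might look something like this sketch. The EntryToCheck shape and validateEntries() are assumptions, and the logger is passed in as a callback so the example runs on its own.

```typescript
// Simplified, hypothetical shapes for the data pulled from Contentful early in the build.
interface FieldRules {
  id: string;
  required: boolean;
}

interface EntryToCheck {
  name: string; // human-readable entry name
  rules: FieldRules[]; // rules from the entry's content type
  fields: Record<string, unknown>; // the entry's actual field values
}

// Walk every field of every entry once and report each problem through the logger.
// Returns the number of problems found.
function validateEntries(
  entries: EntryToCheck[],
  logError: (message: string) => void
): number {
  let problems = 0;
  for (const entry of entries) {
    for (const rule of entry.rules) {
      const value = entry.fields[rule.id];
      const empty = value === null || value === undefined || value === "";
      if (rule.required && empty) {
        problems += 1;
        logError(`Entry "${entry.name}": required field "${rule.id}" is empty.`);
      }
    }
  }
  return problems;
}
```

Because every field is checked in one pass, a single build surfaces all known problems at once instead of stopping at the first one.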
The updated version of these Slack error notifications looks like this:
This updated template tells editors the name of the broken Contentful entry, which field in the entry has the problem, what the problem is, how to fix it, a link to the entry, and context about the environment where the build error occurred.
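As a rough illustration of such a template, the sketch below assembles those six pieces into a plain-text message. The FieldProblem shape, formatSlackMessage(), the wording, and the Contentful web app URL pattern are assumptions, not a copy of our real template.

```typescript
// Hypothetical description of one invalid field.
interface FieldProblem {
  entryName: string;
  fieldName: string;
  problem: string;
  fix: string;
  spaceId: string;
  entryId: string;
  environment: string; // for example "staging"
}

// Builds the text of one Slack notification, one piece of information per line.
function formatSlackMessage(p: FieldProblem): string {
  // Assumed URL pattern for opening an entry in the Contentful web app.
  const link = `https://app.contentful.com/spaces/${p.spaceId}/entries/${p.entryId}`;
  return [
    `Entry: ${p.entryName}`,
    `Field: ${p.fieldName}`,
    `Problem: ${p.problem}`,
    `How to fix: ${p.fix}`,
    `Link: ${link}`,
    `Environment: ${p.environment}`,
  ].join("\n");
}

console.log(
  formatSlackMessage({
    entryName: "Homepage",
    fieldName: "Hero link",
    problem: "This required field is empty.",
    fix: "Add a URL to the field and save the entry.",
    spaceId: "SPACE_ID",
    entryId: "ENTRY_ID",
    environment: "staging",
  })
);
```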
Happy editors and developers
We’ve been living with this solution for the past few months and our editors and developers are both much happier!
There are no more confusing debugging sessions. Whenever a staging build breaks, the solution is posted to Slack a second later and an editor can fix it themselves within a minute.
This has eliminated an annoying source of context switching and restored hours of developer and editor productivity each week.
Please share your ideas!
So, that’s how the “.com” team at ecobee is currently tackling the challenge of using the same codebase for production and staging but sending that codebase different data in each environment.
If you’ve encountered this issue before and have tips or best practices that might help us, please let us know! We realize this is likely a common problem for sites with a content management system, so we’d love to hear how you approached solving this issue.