As the complexity of your smart home increases, it’s important to implement some kind of system monitoring to ensure the long-term reliability and stability of the different components that make up the system.
Technology failed me when I had my bike stolen not too long ago. When integration problems remain undetected for weeks or months, you create a false sense of reliability – placing trust in a system that is already failing and just waiting to show its symptoms.
To give some back story, on the night of the incident, the IP camera facing the driveway was not recording. The alarm system did not trigger because one motion sensor was out of battery and the other one had been turned off the day before due to some yard work. A Wifi enabled outdoor light did not turn on because of a break in Wifi connection at the time. This combination of failures resulted in the perfect opportunity for the bike thief.
This made me realise how wrongly people approach monitoring in a smart home context. Most efforts on home automation forums consist of monitoring host metrics like CPU and memory usage, percentage of available disk space, and internet connection details such as download/upload rates and ping statistics.
It is a fun project to get this info into Home Assistant and create some nice dashboards in Lovelace using different cards available in Home Assistant Community Store. Over time, however, these dashboards are forgotten because there’s never a good reason to look at them, and it’s easy to lose interest along the lines of “System was working fine yesterday, therefore it is working fine today”.
The problem can be summarised like this:
- Tracking the wrong metrics that do not relate to the functionality relied upon on a daily basis
- Performing manual data analysis via dashboards
Start tracking meaningful metrics
CPU and RAM are not indicative of how the system is functioning from the user’s point of view. When was the last time a Wifi reconnection loop in one of your lights manifested itself as erratic CPU usage on the machine running Home Assistant?
In 5 years of home automation, my CPU usage was never all that exciting — and for good reason. The hardware is set up to adequately support the services I am running. Ensuring your home automation server runs on reliable computing and networking hardware is a prerequisite! Once this is achieved, monitoring the underlying infrastructure does not give any useful insights about the system once normal “business-as-usual” operation begins. Especially when there are more important metrics to collect and insights to be made aware of.
How do you know which metrics to track?
Look at the value chain of your smart home system. Where do you derive the most value? What are the most useful and critical functions your smart home performs for you? Do you have long chains of integrations that are on the critical path of some core functionality?
Example 1: If you depend on camera streams for your alarm system, set up pings to each camera’s IP address. This asserts network connectivity at the very least. A better approach would be to monitor the camera streaming API directly, as this is what is consumed by Home Assistant.
Example 2: Monitor all Wifi enabled devices with connectivity checks. This creates awareness of seldom used devices such as outdoor security lights.
Example 3: Monitor the uptime of critical services such as RF gateways and network video recorders, and possibly even automate sensible recovery steps such as automatic reboots. These services are at the core of the system or perform critical functions. Downtime is not an option without severely impacting usability or safety.
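To make the connectivity checks from the examples above more concrete, here is a minimal sketch in Python. It is not what Home Assistant does internally. It assumes each device exposes some TCP port you can connect to, and the device names, addresses and ports are placeholders I made up for illustration.

```python
import socket
from typing import Callable

# Hypothetical device list: names, addresses and ports are placeholders.
DEVICES = {
    "driveway_camera": ("192.168.1.50", 554),  # assumed RTSP port of the camera
    "security_light": ("192.168.1.60", 80),    # assumed web UI port of the light
}


def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def unreachable(devices: dict[str, tuple[str, int]],
                probe: Callable[[str, int], bool] = is_reachable) -> list[str]:
    """Return the sorted names of devices the probe could not reach."""
    return sorted(name for name, (host, port) in devices.items()
                  if not probe(host, port))


# Typical use: run unreachable(DEVICES) on a schedule and alert on any result.
```

A successful TCP connection only proves the device is on the network, which matches the "at the very least" caveat in Example 1. Checking the actual stream endpoint would be a stronger test.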
Choose what to monitor
Monitor the critical devices you depend on every day to make sure they are operating reliably. My bike theft went undetected because a motion sensor was out of battery, a Wifi light was stuck in a reconnection boot loop and the camera was not transmitting a live stream at the time. Who knows how long these components were malfunctioning? It’s the same problem with smoke alarms. If you don’t test them regularly you have no idea if they are working correctly in the event of an emergency.
It’s easy enough to test smoke alarms. Smart homes differ due to the sheer number of services, potential points of failure and things to potentially monitor on a regular basis.
System monitoring for your smart home is like the automated “smoke detector” check for hundreds of detectors.
The following are some suggestions for good monitoring candidates in a home automation context:
- Monitor battery-operated devices
If your device sends battery information captured in Home Assistant, set up an alert to monitor the battery status.
- Monitor sensor events
If a sensor fails to send telemetry data for 5 minutes it may have connection issues.
- Monitor each individual application
Monitor each individual application installed on your server – from Home Assistant itself to the NVR running in a docker container.
- Monitor MQTT topics
Many devices transmit over MQTT and you should take advantage of this via monitoring. Should they stop sending telemetry data for an extended period of time, something must be wrong.
- Monitor your camera’s object detection
This is a critical service! Does the camera normally record 15 motion events a day and there have not been any events in the last 2 days? This looks like a problem.
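The checks suggested in the list above boil down to a few simple comparisons. The following Python sketch shows three of them under some assumptions of mine: battery levels are percentages, the 20% threshold is arbitrary, and the function names are hypothetical rather than part of any Home Assistant API.

```python
from datetime import datetime, timedelta


def low_batteries(levels: dict[str, float], threshold: float = 20.0) -> list[str]:
    """Names of devices whose battery percentage is below the threshold."""
    return sorted(name for name, pct in levels.items() if pct < threshold)


def silent_sensors(last_seen: dict[str, datetime], now: datetime,
                   max_silence: timedelta = timedelta(minutes=5)) -> list[str]:
    """Names of sensors that have not reported within max_silence."""
    return sorted(name for name, seen in last_seen.items()
                  if now - seen > max_silence)


def camera_is_quiet(event_times: list[datetime], now: datetime,
                    window: timedelta = timedelta(days=2)) -> bool:
    """True if no motion event was recorded inside the window ending at now."""
    return not any(now - window <= t <= now for t in event_times)
```

The 5 minute and 2 day defaults mirror the figures used in the list. In practice you would tune them per device, since a door sensor and a temperature sensor report at very different rates.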
Home Assistant is the glue holding together multiple integrations and ecosystems. Consider how you are chaining devices from different ecosystems and how many dependencies this is creating. The longer the chain of integration, the more likely it becomes that any link breaks, which causes the whole process to fail. Once you identify these chains, think about ways to break them up. This is general advice for improving the reliability of your system.
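A quick bit of arithmetic shows why long chains hurt. The sketch below assumes every link fails independently and works 99% of the time, which is an illustrative figure and not a measurement from my system.

```python
import math


def chain_reliability(link_reliabilities: list[float]) -> float:
    """Probability that every link works, assuming independent failures."""
    return math.prod(link_reliabilities)


print(round(chain_reliability([0.99] * 2), 3))  # 0.98
print(round(chain_reliability([0.99] * 5), 3))  # 0.951
```

Going from two links to five turns roughly a 2% chance of failure into roughly a 5% chance, even though no individual link got worse.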
Next, I discuss how I implemented automated analysis and alerting based on heartbeat health checks.
Implement Automated Analysis and Alerting
System monitoring involves not only measuring and recording metrics but also doing some automated analysis to detect anomalies and create alerts when an action is required.
Pretty graphs and animated icons aside… even the flashiest dashboards get boring on day 3, especially when anomalies are few and far between. Do you really expect to monitor those graphs long-term? How much do you trust yourself to pay close attention and pick up on anomalies when things do fail?
I’d much rather have a computer do this analysis and alert me with precise, actionable notifications that already contain all the information required to begin further analysis. Introducing…
Alerta – System Monitoring Dashboard
I implemented the system dashboard called Alerta which uses a push approach to monitoring. Rather than pinging devices individually from a central server, Alerta relies on each device checking in periodically with a heartbeat request.
This heartbeat is a simple message saying “Hey I’m still here and operating normally”. If a heartbeat is missed, Alerta automatically generates an alert to indicate a problem with the device. For example, when my security light operates normally, it sends periodic heartbeats to Alerta. Of course, this is only possible when it is operating normally and connected to Wifi. If there is any Wifi issue it is unable to send the heartbeat message. Alerta reacts to this by raising an alert. Should the device start working again, the alert is automatically removed as manual intervention is no longer required.
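The heartbeat idea is simple enough to sketch in a few lines of Python. To be clear, this is not Alerta's code. It is a minimal illustration of the push model described above, with a hypothetical class name and an injectable clock so the behaviour can be tried out without waiting.

```python
import time
from typing import Callable


class HeartbeatMonitor:
    """Tracks the latest heartbeat per device and reports overdue ones."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._deadlines: dict[str, float] = {}

    def beat(self, origin: str, timeout: float) -> None:
        """Record a heartbeat; the device counts as healthy until now + timeout."""
        self._deadlines[origin] = self._clock() + timeout

    def overdue(self) -> list[str]:
        """Devices whose deadline has passed and that should have an open alert."""
        now = self._clock()
        return sorted(origin for origin, deadline in self._deadlines.items()
                      if deadline < now)
```

Because a fresh heartbeat simply pushes the deadline forward, a device that recovers drops out of the overdue list by itself, which is the same self-clearing behaviour described above.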
You can see a sample alert in the images below.
Heartbeat monitoring for Tasmota devices
My lights run on Tasmota – a popular IoT firmware that hardly needs introduction in the home automation community. Unfortunately, Tasmota is not capable of sending HTTP heartbeat requests itself because it is built around a comprehensive MQTT API. As a workaround, I created a small service called alerta-mqtt-gateway (https://github.com/danobot/alerta-mqtt-gateway) that listens to specific MQTT topics and relays a heartbeat to Alerta every time a message is received.
The following is an extract from my configuration file. It monitors a Tasmota device on tele/tv_led/STATE and two RF devices transmitting on home/gw/433toMQTT.
topics:
  tele/tv_led/STATE:
    origin: tv_led
    timeout: 130
  home/gw/433toMQTT:
    type: json
    attribute: value
    listeners:
      - value: 2678906
        heartbeat:
          origin: mtn-living-room
          timeout: 604800
          tags:
            - motion
      - value: 13690522
        heartbeat:
          origin: mtn-living-room-secondary
          timeout: 604800
          tags:
            - motion
Note that you might have to adjust your Tasmota device’s TelePeriod setting to send status information more frequently. The timeout value indicates how long, in seconds, Alerta should wait for the next heartbeat before raising an alert (604800 seconds is 7 days).
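To illustrate how a configuration like the extract above could be turned into heartbeats, here is a rough Python sketch. It is not the gateway's actual source. The function name is hypothetical, and the nesting of the config dictionary is my reading of the extract: a topic either maps straight to one device, or it is shared and the device is picked by the value of one JSON attribute.

```python
import json

# The configuration extract expressed as a dictionary (nesting is assumed).
CONFIG = {
    "topics": {
        "tele/tv_led/STATE": {"origin": "tv_led", "timeout": 130},
        "home/gw/433toMQTT": {
            "type": "json",
            "attribute": "value",
            "listeners": [
                {"value": 2678906,
                 "heartbeat": {"origin": "mtn-living-room",
                               "timeout": 604800, "tags": ["motion"]}},
                {"value": 13690522,
                 "heartbeat": {"origin": "mtn-living-room-secondary",
                               "timeout": 604800, "tags": ["motion"]}},
            ],
        },
    }
}


def heartbeats_for(topic: str, payload: str, config: dict) -> list[dict]:
    """Translate one MQTT message into zero or more heartbeat payloads."""
    rule = config.get("topics", {}).get(topic)
    if rule is None:
        return []  # topic is not monitored
    if "listeners" not in rule:
        # Simple case: any message on the topic counts as a heartbeat.
        return [{"origin": rule["origin"], "timeout": rule["timeout"],
                 "tags": rule.get("tags", [])}]
    # Shared topic: select the device by the value of one JSON attribute.
    try:
        value = json.loads(payload).get(rule["attribute"])
    except (ValueError, AttributeError):
        return []  # payload is not a JSON object, nothing to match against
    return [{"origin": item["heartbeat"]["origin"],
             "timeout": item["heartbeat"]["timeout"],
             "tags": item["heartbeat"].get("tags", [])}
            for item in rule["listeners"] if item["value"] == value]
```

Each returned dictionary carries the origin, timeout and tags that would then be relayed to Alerta as one heartbeat. A message from an unknown RF code on the shared topic yields nothing, so neighbours' remotes do not keep your sensors "alive".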
Next, we talk about Home Assistant and Lovelace integration.
Home Assistant integration
The monitoring solution would not be complete without integrating it back into Home Assistant. I set up a simple REST sensor to query the number of alerts from Alerta’s HTTP API.
packages/health_check.yaml
sensor:
  - name: alerta
    platform: rest
    resource: http://tower.local/api/alerts?status=open
    headers:
      Authorization: Key soverysecret
    value_template: "{{value_json.total | int}}"
    unit_of_measurement: ""
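If you want to test the API call outside of Home Assistant first, the same request can be sketched in Python. This mirrors the sensor configuration above and assumes, as the value template does, that the response is a JSON object with a total field. The function names are hypothetical, and falling back to 0 on bad input is a choice made for this sketch.

```python
import json
import urllib.request


def build_alert_request(base_url: str, api_key: str) -> urllib.request.Request:
    """Build the same GET request the REST sensor sends."""
    return urllib.request.Request(
        f"{base_url}/alerts?status=open",
        headers={"Authorization": f"Key {api_key}"},
    )


def open_alert_count(response_body: str) -> int:
    """Read the number of open alerts from the response's total field."""
    try:
        return int(json.loads(response_body).get("total", 0))
    except (ValueError, TypeError, AttributeError):
        return 0  # fallback for unparsable input, chosen for this sketch


# Typical use (performs a real network call, so it is left as a comment):
# request = build_alert_request("http://tower.local/api", "soverysecret")
# with urllib.request.urlopen(request, timeout=5) as response:
#     print(open_alert_count(response.read().decode()))
```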
Lovelace Conditional Card
The following card is displayed if alerts exist:
type: conditional
conditions:
  - entity: sensor.alerta
    # show the card whenever the alert count is not zero
    state_not: '0'
card:
  type: picture-entity
  tap_action:
    action: url
    url_path: 'http://tower.local/alerts'
  image: /local/images/alert.jpg
  show_state: false
  name: New System Alert
  entity: sensor.alerta
Dashboards monitoring dashboards monitoring people
I said before that monitoring dashboards is the wrong way to go about system monitoring. This is true if you use them to manually detect problems rather than to raise awareness of problems. It’s a stark difference.
The alert image does not show unless there are actionable alerts (99% of the time, there are none!).
Push notifications have been set up and I use those for other purposes (the doorbell, for example). The thing is, I really couldn’t care less about the Wifi connectivity of the fan in my bedroom when I am on a night out. I don’t need to be on call 24/7 and respond immediately.
These types of smart home problems can be actioned soon enough when I see the alert on the dashboard. The point is, the alert is a positive alert – something definitely failed and requires me to fix it. All data analysis and dull monitoring were done in an automated manner. I am not a slave to the dashboard, nor am I required to check it constantly; however, when something goes wrong, I am made aware of it unintrusively. The image below shows the wealth of information collected in Alerta. Most notably, it shows the host, identifying which IoT device stopped sending heartbeats.
Conclusion
We covered how to do system monitoring for smart homes in this post, including an introduction to Alerta as a plug-and-play solution for compiling and deduplicating alerts from dozens of sources. I look forward to feeding more heartbeats to Alerta to ensure system faults are detected immediately. Monitoring was on my to-do list for a long time and it never received the attention it deserves. Maybe my bike theft could have been avoided.
If you want to receive a push notification on a night out that your shed light has been unresponsive for 15min… it’s up to you 🙂 Personally, I would find that too intrusive considering this is for a hobby smart home and not a multi-million dollar asset management system.