2.1.1. Self-healing device enabler
2.1.1.1. Introduction
This enabler aims at providing the IoT devices with the capabilities of actively attempting to recover themselves from abnormal states, based on a pre-established routines schedule. Hence, it should not require high computation capabilities in order to be deployed on any customizable device.
2.1.1.2. Features
As it can be seen in the figure below, the self-healing device enabler is divided in three components:
Self-detector: The goal of this component is to collect information from the IoT device.
Self-monitor: The Self monitor component is responsible for assessing the device’s state of health. It collects and analyses data from multiple sources of information received from the self-detector, such as memory usage, CPU usage, or network connection metrics, providing a health score. The health score metrics are fed to a predefined set of thesholds that determines whether the device is in a healthy state or not.
Self-remediator: When the Self-monitor component notices malfunctioning symptoms, it informs to this component, which itss job is to determine from a set of remediation processes, which should be used for a proper treatment. If after the remediation, the device is not back to its normal state, another remediation process from the list may be triggered.
2.1.1.3. Place in architecture
The self-healing device enabler is part of the Self-* transversal plane. However, since its functionalities are limited to edge devices, it is directly related with the edge and devices horizontal plane.
2.1.1.4. User guide
First of all, please proceed with the Installation following the commands explained in that section.
Next, the user can open a browser and in the address bar type the following:
http://localhost:1885/
.This will automatically open a node-red browser with the predefined self-healing features.
They can be divided between CPU/RAM usage and network availability:
2.1.1.4.1. CPU usage and RAM usage
The following figure illustrates the nodes and flows associated to these self-healing functionalities:
2.1.1.4.1.1. Self-detector
The following command obtains the overall CPU usage:
echo $[100-$(vmstat 1 2|tail -1|awk '{print $15}')]
The following command obtains the overall RAM usage:
free | grep Mem | awk '{print $3/$2 * 100.0}'
2.1.1.4.1.2. Self-monitor
If CPU usage > 80% (customizable parameter via API, as described below) during an amount of time, look for the process with the most CPU consumption. To do so, the following command gets the process with the most CPU consumption:
ps -eo pcpu,pid | sort -rnk 1 | head -1
. In this command, sort is being used with the -r (reverse), the -n (numeric) and the -k (key) options which are telling the command to sort the output in reverse numeric order based on the first column (cpu usage) in the output from ps.If RAM usage > 80% (customizable parameter via API, as described below) during an amount of time, look for the process with the most RAM consumption. To do so, the following command gets the process with the most RAM consumption:
ps -eo pmem,pid | sort -rnk 1 | head -1
. In this command, sort is being used with the -r (reverse), the -n (numeric) and the -k (key) options which are telling the command to sort the output in reverse numeric order based on the first column (memory usage) in the output from ps.
2.1.1.4.1.3. Self-Remediator
The following remediation command is performed (either with CPU usage or with RAM usage): CMD=``cat /proc/<pid>/cmdline |sed 's/\x0/ /g' \`\` && kill <pid> && \`\`$CMD\`\` &
. It will reload the process <pid>
with the kill command
2.1.1.4.2. Network availability
The following figure illustrates the nodes and flows associated to this self-healing functionality:
2.1.1.4.2.1. Self-detector
A Node-RED node to ping a remote server, for use as a keep-alive check) and use this node to ping a known service like www.google.es (IP = 8.8.8.8)
2.1.1.4.2.2. Self-monitor
If response of the ping node is false, the ethernet connection could be off.
2.1.1.4.2.3. Self-Remediator
Restarts network manager service with the following command:
sudo service network-manager restart
2.1.1.4.3. REST API endpoints
The currently supported REST API endpoints are:
Method |
Endpoint |
Description |
Payload (if needed) |
Response format |
---|---|---|---|---|
POST |
/cpuusage |
Change the maximum threshold of CPU usage to XX |
?threshold=XX |
|
POST |
/ramusage |
Change the maximum threshold of RAM usage to XX |
?threshold=XX |
|
POST |
/network |
Change the IP address over which the service should ping to check network availability |
?IP=XX |
2.1.1.5. Prerequisites
Linux
Docker, Docker-compose
The main technological dependencies of the self-healing enabler Docker are:
Technology |
Justification |
Component(s) |
---|---|---|
Node-RED |
Is a low-code programming tool for wiring together hardware devices, APIs and online services. Provides all it is needed to implement self-healing devices (hardware and software access) |
Self-detector, Self-monitor, Self-remediator |
Unix commands |
Used to access device hardware & software |
Self-detector, Self-monitor, Self-remediator |
Javascript |
Main language for developing custom functions over all components of the enabler. Selected for its familiarity |
Self-detector, Self-monitor, Self-remediator |
2.1.1.6. Installation
Download or clone the repository
Once downloaded, browse to that folder with a terminal
Run the following command:
docker-compose up
2.1.1.7. Configuration options
The threshold of the CPU and RAM usage can be configured via the supported restAPI endpoints
2.1.1.8. Version control and release
Version 1.0 includes:
CPU usage, RAM usage, and network accessibility are the HW resources monitored by the enabler
Kill PID (for the CPU, RAM usage monitoring), and restarting of the device’s network manager are the remediation actions supported
Version 2.0 will include:
Additional metrics (e.g., storage status, battery status)
More advanced monitoring options (i.e., smarter ML-based solutions)
Other remediation rules (e.g., isolate the device, shut down network ports, or reboot)
2.1.1.9. License
Apache 2.0