Alarm

From WICE Wiki v2.89
Revision as of 14:08, 3 February 2020 by Julian (talk | contribs) (Added a definition of warnings and errors)

There is a lot going on in the system: thousands of WCUs are uploading data and doing all sorts of things. Keeping track of all this is a tough job, and this is where the alarms are at your disposal. This view is meant to collect all relevant information regarding the status of individual WCUs. It can keep track of the status of resources other than WCUs, but at the moment WCUs are the only resources covered.

A set of alarms in the alarm tab.

By default, the alarms from the last 24 hours are kept in the list. The list is updated automatically with new alarms once every 10 minutes. Alarms already on the list are not removed, so over time the list will contain alarms that are more than 24 hours old. A figure of the alarm panel is on the right. An alarm consists of five parts: a resource identifier, a time, a message, a severity and a category.

A resource identifier can take basically any form; it is a string that identifies the source of the alarm. In the figure to the right there are three distinct WCU resources with the ids wcu::04-1B-94-00-20-8C, wcu::00-09-D8-02-B7-4A and wcu::04-1B-94-00-20-76.

Time is simply the time at which the alarm was triggered.

A message is a textual description of what happened. In this case the alarms were raised because certificates are about to expire. The message looks as follows: 'Certificate expires on 20181019-184056 +02:00'.
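The expiry time embedded in such a message can be extracted programmatically. The following is a minimal sketch in Python, assuming every message of this kind follows the pattern of the single example above. The function name parse_cert_expiry and the regular expression are hypothetical and not part of WICE.

```python
import re
from datetime import datetime

# Pattern assumed from the one example message shown on this page.
_EXPIRY_RE = re.compile(r"Certificate expires on (\d{8}-\d{6} [+-]\d{2}:\d{2})")


def parse_cert_expiry(message: str) -> datetime:
    """Return the expiry time found in a 'Certificate expires on ...' message."""
    match = _EXPIRY_RE.search(message)
    if match is None:
        raise ValueError(f"not a certificate expiry message: {message!r}")
    # %z accepts UTC offsets written with a colon (+02:00) on Python 3.7 and later.
    return datetime.strptime(match.group(1), "%Y%m%d-%H%M%S %z")


expiry = parse_cert_expiry("Certificate expires on 20181019-184056 +02:00")
print(expiry.isoformat())  # 2018-10-19T18:40:56+02:00
```

The result is a timezone-aware datetime, so it can be compared directly with other timezone-aware timestamps.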

Severity is a way to communicate the urgency of the alarm. Severity is currently divided into two levels, Error and Warning. Error implies that something unexpected has happened which requires attention. Warning implies that a resource has entered a state that could potentially (but not necessarily) be harmful.

Category is a way to categorize an alarm. This is particularly useful for searching.
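The five parts of an alarm can be pictured as a small record. The sketch below is only an illustration of the description above, not the data model WICE actually uses. The class and field names are hypothetical, and the trigger time, severity and category in the example are made up for the purpose of the illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum


class Severity(Enum):
    ERROR = "Error"      # something unexpected happened and requires attention
    WARNING = "Warning"  # a state that could potentially be harmful


@dataclass(frozen=True)
class Alarm:
    resource: str        # identifies the source, e.g. "wcu::04-1B-94-00-20-8C"
    time: datetime       # when the alarm was triggered
    message: str         # textual description of what happened
    severity: Severity
    category: str        # used mainly for searching


example = Alarm(
    resource="wcu::04-1B-94-00-20-8C",
    time=datetime(2018, 9, 20, 12, 0, tzinfo=timezone.utc),  # made-up trigger time
    message="Certificate expires on 20181019-184056 +02:00",
    severity=Severity.WARNING,           # assumed; the page does not state it
    category="wcu::info::cert::expire",  # assumed to match the alarm identifier
)
print(example.resource, example.severity.value)
```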

The remark column tells what has happened with the alarm. If it is empty, the alarm has simply been raised. If a letter is present, it is either an A, which means that the alarm has been acknowledged, or a G, which means that the alarm situation has been resolved, either as a consequence of the acknowledgement or automatically. An example of automatic resolution is the case where the SD card in the WCU has reached, say, 85% usage and the usage later drops below 80% when data is uploaded. Hovering over the remark column shows the date and user of the acknowledgement if the alarm was acknowledged, and the date if it was closed automatically.
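As a small illustration, the remark letters can be mapped to their meaning as follows. This is a sketch only: the names REMARK_MEANING and describe_remark are hypothetical, and it assumes the column holds at most a single letter.

```python
# Meaning of the remark column as described above.
REMARK_MEANING = {
    "": "raised",        # empty: the alarm has only been raised
    "A": "acknowledged",
    "G": "resolved",     # by acknowledgement or automatically
}


def describe_remark(remark: str) -> str:
    """Translate the content of the remark column into a readable state."""
    key = remark.strip()
    if key not in REMARK_MEANING:
        raise ValueError(f"unknown remark: {remark!r}")
    return REMARK_MEANING[key]


print(describe_remark("A"))
```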

Functions

There are a few functions in this panel. You can update the alarm list, reload the alarm list, acknowledge an alarm and search among the alarms in the table.

Update alarm list

As said earlier, the list is automatically updated every 10 minutes. If you do not feel like waiting, simply press the 'Update alarms' button. This will fetch new alarms from the server.

Reload alarm list

Pressing the 'Reload alarms' button clears the current set of alarms and fetches a new set from the last 24 hours. Usually this is not needed, but it is here for your convenience. If you use the fetch interval and change it, you need to press this button for the change to take effect.

Acknowledge alarm

When an alarm has been 'taken care of', you can acknowledge it by pressing the 'Acknowledge alarm' button. The alarm will then be marked as 'acknowledged'.

Search alarm

Alarm list filtered on resource '20'

Over time, many alarms will accumulate and the list will become rather long. To find alarms of a specific category or for a specific resource, you can enter text in the filter entries of the columns to filter the list. If you enter text in more than one filter entry, the filters are combined with an and-operation. As an example, the figure to the right shows the list filtered on WCUs having '20' in their id. As you can see, there are two WCUs that match this filter expression. Regular expressions cannot be used here; it is a simple text search.
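The filtering behaviour described above can be sketched in a few lines of Python. This is an illustration only, not the code used by the panel: the function name filter_alarms and the dictionary layout are hypothetical, the category values are made up, and the match is assumed to be case-sensitive, which the page does not state.

```python
from typing import Dict, List


def filter_alarms(alarms: List[Dict[str, str]], filters: Dict[str, str]) -> List[Dict[str, str]]:
    """Keep the alarms where every non-empty filter text occurs in its column."""
    active = {column: text for column, text in filters.items() if text}
    return [
        alarm
        for alarm in alarms
        # Plain substring test per column; all active filters must match (and-operation).
        if all(text in alarm.get(column, "") for column, text in active.items())
    ]


alarms = [
    {"resource": "wcu::04-1B-94-00-20-8C", "category": "wcu::info::cert::expire"},
    {"resource": "wcu::00-09-D8-02-B7-4A", "category": "wcu::info::cert::expire"},
    {"resource": "wcu::04-1B-94-00-20-76", "category": "wcu::info::sdcard::use_percent"},
]

matches = filter_alarms(alarms, {"resource": "20"})
print([alarm["resource"] for alarm in matches])
```

With the three resource ids from the figure, the filter '20' keeps the two WCUs whose ids contain that text. Adding a second filter, for example on the category column, narrows the result further because both conditions must hold.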

Alarm fetch interval

At the top is a set of controls for filtering alarms on criteria other than the text in the columns. First is the checkbox "Use fetch interval". Checking this box enables the two items "Time unit" and "Time interval", which make it possible to fetch alarms from more than 24 hours back. The default setting is 24 hours. The time unit determines how the number in the time interval is interpreted; for example, changing the time unit to Days while leaving the time interval at 24 fetches alarms from 24 days back. Available time units are hours, days and months. You can also choose to include closed alarms by checking the box "Include closed alarms" and to include acknowledged alarms by checking "Include acknowledged alarms". The remark column is populated only when at least one of these two check boxes is checked.
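How the time unit and the time interval combine can be sketched as below. The names UNIT_LENGTH and fetch_window_start are hypothetical, and treating a month as 30 days is an assumption made for this example; the page does not say how WICE interprets a month.

```python
from datetime import datetime, timedelta

# Length of one time unit. A month as 30 days is an assumption for this sketch.
UNIT_LENGTH = {
    "hours": timedelta(hours=1),
    "days": timedelta(days=1),
    "months": timedelta(days=30),
}


def fetch_window_start(now: datetime, interval: int = 24, unit: str = "hours") -> datetime:
    """Return the earliest trigger time covered by the fetch interval."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    return now - interval * UNIT_LENGTH[unit]


now = datetime(2020, 2, 3, 14, 8)
print(fetch_window_start(now))              # default: 24 hours back
print(fetch_window_start(now, 24, "days"))  # same number, unit changed to days
```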

Available alarms

Presented below is the set of alarms currently available.

Certificate expires

This alarm is identified with 'wcu::info::cert::expire'. Currently, alarms start to appear 30 days before the certificate expiry date. The alarm is triggered at most once per 24 hours.
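The two rules above, the 30-day window and the limit of one alarm per 24 hours, can be combined as in the following sketch. It is not the server's implementation: the function and constant names are hypothetical, and the sketch assumes that alarms keep being raised after the expiry date has passed, which the page does not specify.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

WARNING_WINDOW = timedelta(days=30)  # alarms start 30 days before expiry
MIN_INTERVAL = timedelta(hours=24)   # at most one alarm per 24 hours


def should_raise_expiry_alarm(
    now: datetime, expires: datetime, last_raised: Optional[datetime] = None
) -> bool:
    """Decide whether a certificate expiry alarm should be raised now."""
    if now < expires - WARNING_WINDOW:
        return False  # still more than 30 days until expiry
    # Inside the window: raise unless an alarm was raised in the last 24 hours.
    return last_raised is None or now - last_raised >= MIN_INTERVAL


expires = datetime(2018, 10, 19, 18, 40, 56, tzinfo=timezone(timedelta(hours=2)))
now = datetime(2018, 9, 20, tzinfo=timezone.utc)
print(should_raise_expiry_alarm(now, expires))
```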

Certificate not present

This alarm is identified with 'wcu::info::cert::not_present'. It means there is no certificate at all on the WCU. The alarm is automatically closed when the WCU reports that there is a certificate installed on the WCU. It is triggered at most once per 24 hours.

Certificate password missing

The alarm is identified with 'wcu::info::cert::password::missing'. There is a certificate on the WCU and the private key is encrypted, but no password has been supplied to decrypt it with. As soon as a correct password is supplied, this alarm will be automatically closed.

Certificate unlock failed

The alarm is identified with 'wcu::info::cert::unlock::failed'. The most common cause of this alarm is that the wrong password has been supplied to decrypt the private key. As soon as a correct password is supplied, this alarm will be automatically closed.

SD card usage

The alarm is identified with 'wcu::info::sdcard::use_percent'. Currently, an alarm is raised when the usage percentage of the SD card on the WCU is 80% or more. These alarms are triggered at most once every 24 hours. If the usage percentage is 95% or more, an alarm can be triggered up to once every 10 minutes. Once the usage percentage drops below 80%, the alarms are automatically closed.
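The thresholds and rate limits above can be summarised in a small decision function. This is a sketch under stated assumptions, not the actual server logic: the names sdcard_action, RAISE_THRESHOLD and CRITICAL_THRESHOLD are hypothetical, and the return value "close" simply means that any open alarm for the resource would be closed.

```python
from datetime import datetime, timedelta
from typing import Optional

RAISE_THRESHOLD = 80     # percent; alarms are raised at or above this usage
CRITICAL_THRESHOLD = 95  # percent; alarms may repeat every 10 minutes


def sdcard_action(use_percent: float, now: datetime, last_raised: Optional[datetime]) -> str:
    """Return 'raise', 'wait' or 'close' for a reported SD card usage."""
    if use_percent < RAISE_THRESHOLD:
        return "close"  # usage dropped below 80%: open alarms are closed
    if use_percent >= CRITICAL_THRESHOLD:
        min_interval = timedelta(minutes=10)
    else:
        min_interval = timedelta(hours=24)
    if last_raised is None or now - last_raised >= min_interval:
        return "raise"
    return "wait"  # an alarm was raised too recently


now = datetime(2020, 2, 3, 14, 8)
print(sdcard_action(85, now, None))
print(sdcard_action(85, now, now - timedelta(hours=1)))
print(sdcard_action(96, now, now - timedelta(minutes=10)))
print(sdcard_action(79, now, now - timedelta(hours=1)))
```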

Switch in INT position

The alarm is identified with 'wcu::info::start_switch::int'. An alarm is raised when the WCU reports that the switch is in the int position. It is automatically closed when the WCU reports that the switch is in position ext.