Distributed Alarming System and CISN Alarm System Requirements
Kalpesh Solanki
Rationale
An event is an abstract concept: it describes a “happening” in the physical world or within a computer system. An alarm is an independent entity with only two states, on and off. In the off state an alarm waits for a particular type of event to occur; once it becomes aware of such an event through some communication mechanism, it switches to the on state. How each state manifests itself depends on the alarm.
Figure 1: Alarm State Diagram
Figure 1 shows the state diagram of an alarm. An event triggers a state transition from ‘OFF’ to ‘ON’, and an action triggers a state transition from ‘ON’ to ‘OFF’. The goal of an alarm is to remain in its steady state, which is the ‘OFF’ state.
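The state diagram can be sketched in a few lines of Python. This is only an illustration of the two-state behavior described above; the class and method names (Alarm, on_event, on_action_done) are assumptions made for the example and are not part of any existing system.

```python
from enum import Enum


class AlarmState(Enum):
    OFF = "OFF"
    ON = "ON"


class Alarm:
    """Two-state alarm: an event turns it ON, an action turns it OFF."""

    def __init__(self, name):
        self.name = name
        self.state = AlarmState.OFF  # steady state

    def on_event(self):
        # An event moves the alarm from OFF to ON; if it is already ON,
        # the state simply stays ON.
        self.state = AlarmState.ON

    def on_action_done(self):
        # A completed action returns the alarm to its steady OFF state.
        self.state = AlarmState.OFF


alarm = Alarm("demo")            # hypothetical alarm name
alarm.on_event()
state_after_event = alarm.state  # AlarmState.ON
alarm.on_action_done()
state_after_action = alarm.state  # AlarmState.OFF
```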
It follows that any alarm system must allow three types of entities (events, alarms and actions) to communicate with each other. The relationships among these entities may be one-to-one, one-to-many or many-to-many. For example, in a many-to-many model, many different events can trigger a state change in one particular alarm, and a single event can trigger a state change in many different types of alarms.
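A many-to-many relationship between events, alarms and actions could be held in a small registry like the one below. This is a sketch under assumptions: AlarmRegistry, its methods, and the event and alarm names are all hypothetical, and actions are modeled as plain callables that return True on success.

```python
from collections import defaultdict


class AlarmRegistry:
    """Maps event types to alarms and alarms to actions (many-to-many)."""

    def __init__(self):
        self.alarms_by_event = defaultdict(set)    # event type -> alarm names
        self.actions_by_alarm = defaultdict(list)  # alarm name -> callables
        self.active = set()                        # alarms currently ON

    def subscribe(self, event_type, alarm_name):
        self.alarms_by_event[event_type].add(alarm_name)

    def attach_action(self, alarm_name, action):
        self.actions_by_alarm[alarm_name].append(action)

    def publish(self, event_type):
        """Turn ON every alarm subscribed to this event type."""
        triggered = sorted(self.alarms_by_event.get(event_type, ()))
        self.active.update(triggered)
        return triggered

    def run_actions(self, alarm_name):
        """Run all actions of an ON alarm; turn it OFF if every one succeeds."""
        if alarm_name not in self.active:
            return False
        results = [action() for action in self.actions_by_alarm[alarm_name]]
        ok = bool(results) and all(results)  # no actions means nothing clears it
        if ok:
            self.active.discard(alarm_name)
        return ok


registry = AlarmRegistry()
registry.subscribe("earthquake", "pager_alarm")
registry.subscribe("earthquake", "email_alarm")     # one event, many alarms
registry.subscribe("sensor_outage", "email_alarm")  # many events, one alarm
registry.attach_action("email_alarm", lambda: True)
triggered = registry.publish("earthquake")
```

Here one "earthquake" event turns on two alarms, while "email_alarm" can be turned on by either event type.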
An alarm system provides operations to generate events, to create alarms and actions, and to establish relationships between them. Figure 1 suggests that the alarm itself is a single-domain entity; however, the idea can be extended to multiple domains. In a multi-domain model, the alarm state is distributed among different domains (e.g. different computer systems), but the behavior must remain the same. Any alarm system that supports multi-domain alarms is a distributed alarm system, regardless of its implementation. Events and actions are associated with alarm states; since the alarm state is distributed, events and actions can live in different domains.
Figure 2: Distributed Alarming System Model

Figure 2 shows the fundamental model of the distributed alarming system. It is not biased towards any implementation approach. It shows how the system keeps the alarm in a consistent state (steady state). The time sequence is as follows:
A “happening” is recognized as an event and triggers a state transition in an alarm (distributed alarm) on computer B.
Alarm ‘ON’ state is propagated to computer C.
Alarm ‘ON’ state is propagated to computer A (but the state hasn’t been updated yet).
Alarm ‘ON’ state is propagated to computer D (but the state hasn’t been updated yet).
Alarm state transition from ‘OFF’ to ‘ON’ on computer C triggers the action.
The action on computer C triggers the state transition from ‘ON’ to ‘OFF’.
Alarm ‘OFF’ state is propagated to computer B.
Alarm ‘OFF’ state is propagated to computer D.
Alarm ‘OFF’ state is propagated to computer A.
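The time sequence above can be replayed with a toy in-memory model. This is a simplified sketch: the Node class and propagate function are hypothetical, propagation is instantaneous (so the intermediate "not updated yet" moments are not modeled), and the order in which peers are updated follows dictionary order rather than the order in Figure 2.

```python
class Node:
    """One computer holding a local copy of the distributed alarm state."""

    def __init__(self, name):
        self.name = name
        self.state = "OFF"


def propagate(nodes, origin, new_state, log):
    """Set new_state on the origin, then copy it to every other node."""
    nodes[origin].state = new_state
    for name, node in nodes.items():
        if name != origin:
            node.state = new_state
            log.append(f"{new_state}: {origin} -> {name}")


nodes = {name: Node(name) for name in "ABCD"}
log = []

# The event turns the alarm ON at B, and B propagates ON to C, A and D.
propagate(nodes, "B", "ON", log)
states_after_event = {n: nodes[n].state for n in nodes}

# C sees the OFF -> ON transition, runs the action, and propagates OFF.
action_ran_on = "C" if nodes["C"].state == "ON" else None
propagate(nodes, "C", "OFF", log)
states_after_action = {n: nodes[n].state for n in nodes}
```

After the second call every node is back in the steady ‘OFF’ state, and the log holds three ‘ON’ messages from B followed by three ‘OFF’ messages from C.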
The framework of a distributed alarming system could itself provide alarm state tracking and reliability. In any distributed system, state synchronization is a major problem; the type and cost of the system are determined by the trade-offs made to solve it.
The following requirements force the system to be a distributed one:
Components involved are geographically distributed.
Avoiding a single point of failure.
Adding reliability by redundancy.
It follows that the CISN Alarming System will be a distributed alarming system.
Reliability
This section presents an algorithm for achieving reliability in the distributed alarming system model discussed in Section 1. A system that keeps the alarm state in sync regardless of node failures within it can be called reliable to a certain degree.
To achieve reliability we need to extend the concept of actions. In Section 1 the action is stored only on computer C; if computer C goes down, the action will not be executed and the system-wide state of the alarm will not be updated to ‘OFF’. It is therefore natural to treat an action as part of an alarm, stored with the alarm on all nodes. However, we must make sure that only one instance of the alarm action gets executed.
Algorithm
Let’s represent an alarm as a tuple of three components: alarm, action and timeout.
Tuple: <alarm, action, timeout>. At alarm creation time, we store this tuple on all nodes, with a different timeout value on each node.
For example,
EQAlarm is an alias for the alarm that turns on when an earthquake event occurs.
Email_to_XYZ is an alias for the action that sends an email to some location.
The tuple is stored on all nodes as follows:
Computer A: <EQAlarm, Email_to_XYZ, 10 seconds>
Computer B: <EQAlarm, Email_to_XYZ, 20 seconds>
Computer C: <EQAlarm, Email_to_XYZ, 0 seconds>
Computer D: <EQAlarm, Email_to_XYZ, 30 seconds>
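As an illustration, the per-node tuples above could be held in a structure like the following. The AlarmEntry type and its field names are assumptions for this sketch; the values are the ones listed above. Sorting by timeout shows the order in which nodes would attempt the action.

```python
from typing import NamedTuple


class AlarmEntry(NamedTuple):
    alarm: str
    action: str
    timeout_s: int  # seconds this node waits before attempting the action


entries = {
    "A": AlarmEntry("EQAlarm", "Email_to_XYZ", 10),
    "B": AlarmEntry("EQAlarm", "Email_to_XYZ", 20),
    "C": AlarmEntry("EQAlarm", "Email_to_XYZ", 0),
    "D": AlarmEntry("EQAlarm", "Email_to_XYZ", 30),
}

# Nodes attempt the action in order of increasing timeout: C, A, B, D.
attempt_order = sorted(entries, key=lambda node: entries[node].timeout_s)

# Distinct timeouts give an unambiguous order of attempts.
timeouts_are_distinct = (
    len({e.timeout_s for e in entries.values()}) == len(entries)
)
```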
The algorithm is as follows:
If (alarm state = ‘OFF’)
    Exit.
Else
    Wait for timeout seconds.
    If (alarm state = ‘OFF’)
        Exit.
    Execute action.
    If (action finished properly)
        Propagate alarm ‘OFF’ state to all nodes.
When the event occurs on computer B, it propagates the ‘ON’ state to all nodes. Computer C times out first and executes the action Email_to_XYZ. If the action is successful, the state on all nodes turns ‘OFF’. If the action is not successful, computer A times out next and executes Email_to_XYZ, and so on.
With this algorithm we can tolerate the failure of up to N-1 nodes, where N is the total number of participating nodes.
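The failover behavior can be checked with a small simulation. This is a sketch under simplifying assumptions, not a real implementation: run_failover is a hypothetical function, time is modeled by visiting nodes in order of increasing timeout instead of actually waiting, ‘OFF’ propagation is instantaneous, and a failed node is one that never wakes up.

```python
def run_failover(timeouts, action_succeeds, down=frozenset()):
    """Simulate the timeout-staggered algorithm on a logical clock.

    timeouts:        node name -> timeout in seconds
    action_succeeds: node name -> True if the action would succeed there
    down:            nodes that have failed and never wake up
    Returns (node that cleared the alarm or None, nodes that tried).
    """
    # The event has already propagated 'ON' to every live node.
    state = {node: "ON" for node in timeouts if node not in down}
    attempts = []
    cleared_by = None
    # Visiting live nodes by increasing timeout mimics each node waking
    # up after its own wait.
    for node in sorted(state, key=timeouts.get):
        if state[node] == "OFF":
            continue                  # alarm already cleared: exit
        attempts.append(node)         # execute action
        if action_succeeds.get(node, False):
            for other in state:       # propagate 'OFF' to all live nodes
                state[other] = "OFF"
            cleared_by = node
    return cleared_by, attempts


timeouts = {"A": 10, "B": 20, "C": 0, "D": 30}
all_ok = dict.fromkeys(timeouts, True)

normal = run_failover(timeouts, all_ok)                           # ("C", ["C"])
c_down = run_failover(timeouts, all_ok, down={"C"})               # ("A", ["A"])
only_d = run_failover(timeouts, all_ok, down={"A", "B", "C"})     # ("D", ["D"])
c_fails = run_failover(timeouts, {**all_ok, "C": False})          # ("A", ["C", "A"])
```

With three of the four nodes down, the remaining node D still clears the alarm, which matches the N-1 claim above. Note that the sketch does not model a node that fails midway through an action or a network partition; those cases would need separate treatment.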
System Scope
This section lists some of my assumptions regarding the CISN Alarming System.
0 It will be designed especially so that Caltech, Berkeley and Menlo Park can back each other up.
1 It is not intended for private parties to interact with the system.
2 Higher initial setup cost in terms of time and effort is acceptable.
3 Business privacy is not required between participating organizations.
Requirements
This is an informal list of requirements; it does not place them in the context of the distributed alarming model discussed above.
All participants are allowed to generate events.
A new event will be propagated to all participants.
The system will be able to handle new types of events without modifying the system behavior.
An event must satisfy pre-specified criteria to generate alarms.
An event will be able to trigger many different types of alarms.
Triggered alarms will get fired with high reliability.
If one participant is not able to fire an alarm, other participants will take over the task to fire that particular alarm.
Low human interaction.
Low maintenance.
Applications will be able to interface with the system.
N-1 node failure tolerance. If N-1 nodes in the system go offline, the system will still work.
The system will process a very high rate of events (for example, 100 events/sec).
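To make the requirements on event criteria more concrete, the sketch below shows one possible way to express "an event must satisfy pre-specified criteria" and "an event can trigger many alarms". Everything here is an assumption for illustration: events are modeled as plain dictionaries so that new event types need no code changes, the rule list and the alarms_for function are hypothetical, and the "magnitude" attribute and its threshold are made-up example values.

```python
def alarms_for(event, rules):
    """Return the names of all alarms whose criteria the event satisfies."""
    return [name for name, criteria in rules if criteria(event)]


# Each rule pairs an alarm name with a predicate over event attributes.
rules = [
    ("EQAlarm", lambda e: e.get("type") == "earthquake"),
    ("BigEQAlarm",
     lambda e: e.get("type") == "earthquake" and e.get("magnitude", 0) >= 5.0),
]

large = alarms_for({"type": "earthquake", "magnitude": 5.6}, rules)
small = alarms_for({"type": "earthquake", "magnitude": 3.1}, rules)
other = alarms_for({"type": "heartbeat"}, rules)
```

The large earthquake event triggers both alarms, the small one triggers only EQAlarm, and an event type that no rule mentions triggers nothing.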