

Distributed Alarming System and CISN Alarm System Requirements

Kalpesh Solanki

Rationale

An event is an abstract concept: it describes a “happening” in the physical world or within a computer system. An alarm is an independent entity with only two states, on and off. In the off state, an alarm waits for a particular type of event to occur; once it becomes aware of such an event through some communication mechanism, it switches to the on state. How each state manifests itself depends on the alarm.

Figure 1: Alarm State Diagram

Figure 1 shows the state diagram of an alarm. An event triggers a state transition from ‘OFF’ to ‘ON’, and an action triggers a state transition from ‘ON’ to ‘OFF’. The goal of an alarm is to stay in its steady state, which is the ‘OFF’ state.
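The two-state behavior described above can be sketched in a few lines of Python. This is only an illustration: the class and method names are hypothetical, and the choice to ignore a repeated event while the alarm is already ‘ON’ is an assumption, since the text does not say what should happen in that case.

```python
from enum import Enum


class AlarmState(Enum):
    OFF = "OFF"  # steady state: waiting for an event
    ON = "ON"    # an event has occurred; waiting for an action


class Alarm:
    """Two-state alarm: events turn it ON, actions turn it OFF."""

    def __init__(self, name):
        self.name = name
        self.state = AlarmState.OFF  # alarms start in the steady state

    def on_event(self):
        # An event moves the alarm from OFF to ON. A repeated event while
        # already ON leaves the state unchanged (an assumption).
        self.state = AlarmState.ON

    def on_action(self):
        # A completed action returns the alarm to the steady state.
        self.state = AlarmState.OFF


demo_alarm = Alarm("demo_alarm")  # hypothetical alarm name
demo_alarm.on_event()
state_after_event = demo_alarm.state
demo_alarm.on_action()
state_after_action = demo_alarm.state
```

After the event the alarm is ‘ON’, and after the action it is back in the steady ‘OFF’ state.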

From the above, it is clear that any alarm system must allow three types of entities (events, alarms and actions) to communicate with each other. The relationship between these entities can be one-to-one, one-to-many or many-to-many. For example, in a many-to-many model, many different events can trigger a state change in one particular alarm, and a single event can trigger a state change in many different types of alarms.
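A many-to-many relationship between events and alarms could be held in a simple lookup table. The sketch below is one possible representation, not a prescribed design; the event types and alarm names are made up for illustration.

```python
from collections import defaultdict

# event type -> names of the alarms it triggers (all names are hypothetical)
subscriptions = defaultdict(set)


def subscribe(alarm_name, event_type):
    """Record that alarm_name should turn ON when event_type occurs."""
    subscriptions[event_type].add(alarm_name)


def alarms_triggered_by(event_type):
    """Return the sorted names of all alarms that event_type turns ON."""
    return sorted(subscriptions.get(event_type, ()))


def events_triggering(alarm_name):
    """Return the sorted event types that can turn alarm_name ON."""
    return sorted(e for e, alarms in subscriptions.items() if alarm_name in alarms)


subscribe("pager_alarm", "earthquake")
subscribe("email_alarm", "earthquake")
subscribe("pager_alarm", "station_offline")
```

Here one event type ("earthquake") triggers two alarms, and one alarm ("pager_alarm") is triggered by two event types.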

An alarm system will provide operations to generate events, to create alarms and actions, and to establish relationships between them. The diagram above suggests that an alarm is a single-domain entity, but the idea can be extended to multiple domains. In a multi-domain model, the alarm state is distributed among different domains (e.g. different computer systems), but the behavior must remain the same. Any alarm system that supports multi-domain alarms is a distributed alarm system, regardless of its implementation. Events and actions are associated with the alarm states; since the alarm state is distributed, events and actions can live in different domains.

Figure 2: Distributed Alarming System Model

Figure 2 shows the fundamental model of the distributed alarming system. It is not biased towards any implementation approach. It shows how the system keeps the alarm in a consistent state (steady state). The time sequence is as follows:

A “happening” is selected as an event and triggers a state transition in an alarm (a distributed alarm) on computer B.
The alarm ‘ON’ state is propagated to computer C.
The alarm ‘ON’ state is propagated to computer A (but the state hasn’t been updated yet).
The alarm ‘ON’ state is propagated to computer D (but the state hasn’t been updated yet).
The alarm state transition from ‘OFF’ to ‘ON’ on computer C triggers the action.
The action on computer C triggers a state transition from ‘ON’ to ‘OFF’.
The alarm ‘OFF’ state is propagated to computer B.
The alarm ‘OFF’ state is propagated to computer D.
The alarm ‘OFF’ state is propagated to computer A.
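The sequence above can be walked through with a small in-process simulation. This is a simplified sketch, not an implementation of the model: propagation is synchronous here, whereas in the sequence above some nodes receive the ‘ON’ state before they have applied it, and the order in which peers are updated is arbitrary. The class and function names are hypothetical.

```python
class Node:
    """One computer holding a local copy of the distributed alarm state."""

    def __init__(self, name):
        self.name = name
        self.state = "OFF"
        self.peers = []

    def set_state(self, state, log):
        self.state = state
        log.append(f"{self.name}: {state}")

    def broadcast(self, state, log):
        # Propagate this node's new state to every other node.
        for peer in self.peers:
            peer.set_state(state, log)


def run_sequence():
    nodes = {name: Node(name) for name in "ABCD"}
    for node in nodes.values():
        node.peers = [p for p in nodes.values() if p is not node]
    log = []

    # The event is detected on computer B, which turns its copy ON
    # and propagates the ON state to the other computers.
    nodes["B"].set_state("ON", log)
    nodes["B"].broadcast("ON", log)

    # Computer C holds the action; running it turns the alarm OFF
    # there, and the OFF state is propagated to everyone else.
    nodes["C"].set_state("OFF", log)
    nodes["C"].broadcast("OFF", log)
    return nodes, log


nodes, log = run_sequence()
```

At the end of the run every node is back in the ‘OFF’ state, which is the consistent steady state the model aims for.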

The framework of a distributed alarming system could include alarm state tracking and reliability. As in any distributed system, state synchronization is a major problem; the type and cost of a distributed system are determined by the trade-offs made to solve that problem.

The following requirements force the system to be a distributed one:
Components involved are geographically distributed.
Avoiding a single point of failure.
Adding reliability by redundancy.

It follows that the CISN Alarming System will be a distributed alarming system.

Reliability

This section presents an algorithm for achieving reliability in the distributed alarming system model discussed in Section 1. When a system is able to keep the alarm state in sync regardless of node failures within the system, it is fair to say that the system is reliable to a certain degree.

To achieve reliability we need to extend the concept of actions. In Section 1 the action is stored only on computer C; if computer C goes down, the action will not be executed and the system-wide state of the alarm will not be updated to ‘OFF’. It is therefore natural to treat an action as part of an alarm and keep it with the alarm on all nodes. However, we must make sure that only one instance of the alarm action gets executed.

Algorithm

Let’s represent an alarm as a tuple of three components: alarm, action and timeout.

Tuple: <alarm, action, timeout>. At the time of alarm creation, we store this tuple on all the nodes, with a different timeout value on each.

For example,

EQAlarm is an alias for the alarm that turns on when an earthquake event occurs.

Email_to_XYZ is an alias for the action that sends an email to some location.

The tuple is stored on all nodes as follows:

Computer A: <EQAlarm, Email_to_XYZ, 10 seconds>

Computer B: <EQAlarm, Email_to_XYZ, 20 seconds>

Computer C: <EQAlarm, Email_to_XYZ, 0 seconds>

Computer D: <EQAlarm, Email_to_XYZ, 30 seconds>

The algorithm is as follows:

If (alarm state = ‘OFF’)
    Exit.
Else wait for timeout seconds.
If (alarm state = ‘OFF’)
    Exit.
Execute action.
If (action finished properly)
    Propagate alarm ‘OFF’ state to all nodes.

When the event occurs on computer B, B propagates the ‘ON’ state to all nodes. Computer C times out first and executes the action Email_to_XYZ. If the action is successful, the state on all nodes turns ‘OFF’. If it is not, computer A times out next and executes Email_to_XYZ, and so on in order of increasing timeout.

With this algorithm the system tolerates the failure of up to N-1 nodes, where N is the total number of participating nodes.
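The failover behavior can be illustrated with a short simulation in Python. It is a sketch under several assumptions that are not in the text: timeouts are simulated by visiting nodes in order of increasing timeout rather than by waiting, a live node always completes the action successfully, and the gaps between timeouts are long enough for the ‘OFF’ state to reach every node before the next timeout expires. The `alive` flag and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class NodeEntry:
    """The <alarm, action, timeout> tuple as stored on one computer."""
    computer: str
    alarm: str
    action: str
    timeout_s: int
    alive: bool = True  # hypothetical flag used to simulate node failure


def resolve_alarm(entries):
    """Simulate the algorithm once the ON state has reached every node.

    Returns (final_state, computer_that_ran_the_action or None).
    """
    for entry in sorted(entries, key=lambda e: e.timeout_s):
        # When this node's timeout expires the alarm is still ON, because
        # no node with a shorter timeout has completed the action.
        if not entry.alive:
            continue  # a failed node never executes its action
        # Execute the action, then propagate 'OFF' to all nodes.
        return "OFF", entry.computer
    return "ON", None  # every node failed; the alarm stays ON


def example_entries(down=()):
    """Build the four tuples from the text; `down` lists failed computers."""
    timeouts = {"A": 10, "B": 20, "C": 0, "D": 30}
    return [
        NodeEntry(name, "EQAlarm", "Email_to_XYZ", t, alive=name not in down)
        for name, t in timeouts.items()
    ]


print(resolve_alarm(example_entries()))
print(resolve_alarm(example_entries(down=("C",))))
print(resolve_alarm(example_entries(down=("C", "A", "B"))))
```

With all nodes up, computer C runs the action. With C down, computer A takes over. With three of the four nodes down, computer D still runs it, which is the N-1 case.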

System Scope

This section lists some of my assumptions regarding the CISN Alarming System.

0 It will be designed especially so that Caltech, Berkeley and Menlo Park can back each other up.

1 It is not intended for private parties to interact with the system.

2 Higher initial setup cost in terms of time and effort is acceptable.

3 Business privacy is not required between participating organizations.

Requirements

This is an informal list of requirements; it does not place them in the context of the distributed alarming model discussed above.

All participants are allowed to generate events.
A new event will be propagated to all participants.
The system will be able to handle new types of events without modifying the system behavior.
An event must satisfy pre-specified criteria to generate alarms.
An event will be able to trigger many different types of alarms.
Triggered alarms will be fired with high reliability.
If one participant is not able to fire an alarm, other participants will take over the task of firing that alarm.
Low human interaction.
Low maintenance.
Applications will be able to interface with the system.
N-1 node failure tolerance: if N-1 nodes in the system go offline, the system will still work.
The system will process a very high rate of events (for example, 100 events/sec).
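One way to read the requirements on criteria and on new event types is to keep the criteria as data rather than as code paths, so that supporting a new event type means adding a rule instead of changing the system. The Python sketch below illustrates that reading only; the rule names, event fields and thresholds are all hypothetical.

```python
# Each rule pairs an alarm name with a criterion (a predicate over the
# event). All names, fields and thresholds here are hypothetical.
rules = [
    ("EQAlarm", lambda e: e["type"] == "earthquake" and e["magnitude"] >= 3.0),
    ("LargeEQAlarm", lambda e: e["type"] == "earthquake" and e["magnitude"] >= 6.0),
]


def alarms_for(event):
    """Return the names of all alarms whose criteria the event satisfies."""
    return [name for name, criterion in rules if criterion(event)]


strong = alarms_for({"type": "earthquake", "magnitude": 6.5})
moderate = alarms_for({"type": "earthquake", "magnitude": 4.0})
other = alarms_for({"type": "station_offline"})
```

A single event can trigger several alarms (the strong earthquake matches both rules), and an event that satisfies no criterion triggers none.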



Last Updated: 2005-11-07 © Copyright 2004, California Institute of Technology. All Rights Reserved. Permission required to reproduce any portion of this site.