Hyperic HQ

Down/Recovery alert may fire immediately after recovery alert fires

Details

  • Type: Bug
  • Status: Closed
  • Priority: Minor
  • Resolution: Cannot Reproduce
  • Affects Version/s: 4.2.0
  • Fix Version/s: None
  • Component/s: Alerts
  • Environment:
    4.2.0 #1182

Description

This is not always reproducible, but it happens often enough to be of some concern.

The scenario is as follows:

Have a group of platforms down with a down alert fired for each of them.
All the platforms come up within a short interval (under 10 minutes) causing recovery alerts to fire for each of them.
For one or two platforms, the platform availability down alert fires immediately after the first recovery alert fires, and within 1 minute the recovery alert fires again (1 minute is the platform's availability metric collection interval).
The duplicate down and recovery alerts are not expected.
I did see a spike in the availability inserter queue size in hq-stats, and the spike lasted less than 30 seconds. The spike may be a symptom of the availability inserter slowing down, causing the backfiller to fire the down alert.
See the attached screenshot: in this case the platform resource was brought down at 1:01 PM (along with many other platforms) and was brought back up at ~2:10 PM. This caused 2 down and 2 up alerts to fire instead of just one of each.
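
The hypothesis above (a slow availability inserter leading the backfiller to see stale data) can be sketched roughly as follows. This is only an illustrative model, not Hyperic's actual code: the `LaggingInserter` class, the `backfiller_sees_down` function, the platform name `web01`, and the two-interval grace period are all assumptions made for the example.

```python
from collections import deque

INTERVAL = 60  # seconds; the platform availability collection interval from the report


class LaggingInserter:
    """Hypothetical stand-in for an availability inserter with a pending queue."""

    def __init__(self):
        self.queue = deque()   # points reported by agents but not yet persisted
        self.persisted = {}    # platform -> timestamp of the last persisted point

    def enqueue(self, platform, ts):
        self.queue.append((platform, ts))

    def drain(self):
        # Persist everything that is waiting in the queue.
        while self.queue:
            platform, ts = self.queue.popleft()
            self.persisted[platform] = max(ts, self.persisted.get(platform, ts))


def backfiller_sees_down(inserter, platform, now, grace=2 * INTERVAL):
    """Assumed backfill rule: no persisted point within `grace` seconds means down."""
    last = inserter.persisted.get(platform)
    return last is None or now - last > grace


inserter = LaggingInserter()
inserter.persisted["web01"] = 0       # last point persisted before the outage
inserter.enqueue("web01", 4140)       # platform is back up, but the point is still queued

stale = backfiller_sees_down(inserter, "web01", now=4150)   # queue not drained yet
inserter.drain()
fresh = backfiller_sees_down(inserter, "web01", now=4150)   # same instant, after drain

print(stale, fresh)
```

In this model the backfiller's verdict depends only on what has been persisted, so a short queue backlog is enough to make a recovered platform look down for one check.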

Activity

Jennifer Hickey added a comment -

This can happen in the existing implementation if backfill sends a "Down" alert after the recovery alert has fired. Timestamps are not taken into account here b/c the state has cleared with the firing of the recovery alert. I saw this once or twice myself, but figured this was expected behavior, since the alert definition is actually enabled once the recovery alert fires, so the down alert should fire again. As long as it recovers from the old down again, is this a problem? I'm assuming you weren't still in escalation when the down happened, as duplicate alerts should be suppressed during that time (i.e. the down alert was enabled when it fired)...
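
The behavior described in this comment can be illustrated with a small model. It is a sketch under stated assumptions, not the Hyperic implementation: `DownAlertDefinition`, `replay`, and the `use_timestamps` flag are hypothetical names, and the timestamp guard is one possible way to ignore a late backfilled "Down", not something the product is known to offer.

```python
class DownAlertDefinition:
    """Toy model: a down alert that is disabled while fired and re-enabled on recovery."""

    def __init__(self, use_timestamps=False):
        self.enabled = True
        self.use_timestamps = use_timestamps  # hypothetical guard, off by default
        self.last_recovery_ts = None
        self.fired = []

    def on_down(self, observed_ts):
        if not self.enabled:
            return False  # already fired and not yet recovered: duplicate suppressed
        if (self.use_timestamps and self.last_recovery_ts is not None
                and observed_ts <= self.last_recovery_ts):
            return False  # down observation predates the recovery: ignore it
        self.enabled = False
        self.fired.append(("down", observed_ts))
        return True

    def on_recovery(self, observed_ts):
        if self.enabled:
            return False  # no outstanding down alert to recover from
        self.enabled = True  # the recovery re-enables the down alert definition
        self.last_recovery_ts = observed_ts
        self.fired.append(("up", observed_ts))
        return True


def replay(defn):
    defn.on_down(100)       # original outage
    defn.on_recovery(4200)  # platform reports available again
    defn.on_down(4190)      # late backfilled "Down" for a time before the recovery
    defn.on_recovery(4260)  # next collection interval reports available
    return [kind for kind, _ in defn.fired]


print(replay(DownAlertDefinition()))
print(replay(DownAlertDefinition(use_timestamps=True)))
```

Without the guard the replay yields two down and two up alerts, matching the report; with the guard the late "Down" is dropped because its timestamp is older than the recovery.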

Kashyap Parikh added a comment -

Correct, dups are not firing within escalation (that would be a big issue). The only use case where this could be annoying to a sys admin is when the collection interval for a resource is 15 minutes or more: the next recovery will take that much longer to fire, and if this happens in the middle of the night, 15 minutes may be too long to wait before confirming that the resource is collecting fine. But if this is acceptable edge-case behavior, I am fine with not fixing it and documenting it instead.

On a side note, the FishEye comment above (for whoever can see it) is not relevant. There's a typo in the checkin comment.

Kashyap Parikh added a comment -

CF42

David Wiener added a comment -

Closed due to being outdated


Dates

  • Created:
    Updated:
    Resolved:
    Last comment:
    1 year, 37 weeks, 2 days ago