
Hyperic HQ

Cancelled availability measurements from ScheduleThread should report as down

Details

  • Type: Improvement
  • Status: Closed
  • Priority: Major
  • Resolution: Fixed
  • Affects Version/s: 4.5.1
  • Fix Version/s: 4.5.1.2
  • Component/s: Deprecated: Agent:Core
  • Case Links:
    none
  • Regression:
    No
  • Tags:

Description

The ability to cancel metric collections was added in HQ 4.5.1. By default the timeout value for this setting is 5s. If this timeout is exceeded, the ScheduleThread will attempt to cancel the metric collection.

As documented, the ability to cancel a collection still requires that the plugin be in an interruptible state. This means wait(), sleep() or non-blocking read(). If this is not the case, the metric is not actually cancelled and the ScheduleThread moves on.
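A minimal sketch of why the interruptible state matters, using plain java.util.concurrent rather than the actual ScheduleThread code. The class name CancelSketch, the method cancelStopsWorker, and the millisecond values are assumptions chosen for illustration. A collection blocked in sleep() stops when interrupted, while a loop that never checks the interrupt flag keeps running after the cancel:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration; not taken from the HQ agent source.
class CancelSketch {
    /** Interruptible: Thread.sleep() throws InterruptedException on interrupt. */
    static final Runnable SLEEPING = () -> {
        try {
            Thread.sleep(5_000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // restore the flag and return promptly
        }
    };

    /** Not interruptible: the loop never checks the interrupt flag (bounded to 2 s here). */
    static final Runnable SPINNING = () -> {
        long end = System.nanoTime() + TimeUnit.SECONDS.toNanos(2);
        while (System.nanoTime() < end) {
            // busy work; an interrupt has no effect
        }
    };

    /**
     * Runs the collection on a worker thread, interrupts it once timeoutMs has
     * elapsed, and returns true if the worker stopped within graceMs.
     */
    static boolean cancelStopsWorker(Runnable collection, long timeoutMs, long graceMs) {
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "collector");
            t.setDaemon(true); // a hung collection must not keep the JVM alive
            return t;
        });
        try {
            Future<?> f = pool.submit(collection);
            try {
                f.get(timeoutMs, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                f.cancel(true); // delivers Thread.interrupt() to the worker
            }
            pool.shutdown();
            return pool.awaitTermination(graceMs, TimeUnit.MILLISECONDS);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("sleeping stopped: " + cancelStopsWorker(SLEEPING, 100, 1_000));
        System.out.println("spinning stopped: " + cancelStopsWorker(SPINNING, 100, 300));
    }
}
```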

In the past this was not a big deal: a hung ScheduleThread stopped all metrics from being reported, so the Backfiller on the server would eventually mark the entire platform as down. With the new design this is no longer the case. If any server or service Availability metric hangs, it is not reported as down until that particular metric collection times out.

This request for improvement is to have the ScheduleThread automatically send a down metric data point for any cancelled Availability metric.
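The requested behavior could look roughly like the following sketch. It is not the committed fix; DownOnCancelSketch and collectAvailability are hypothetical names, the values 1.0 and 0.0 stand in for up and down, and treating a plugin exception as down is an assumption made here only to keep the example total:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration; not taken from the HQ agent source.
class DownOnCancelSketch {
    static final double AVAIL_UP = 1.0;
    static final double AVAIL_DOWN = 0.0;

    /**
     * Collects one availability value. If the plugin does not answer within
     * timeoutMs, the collection is cancelled and a down value is reported
     * immediately instead of waiting for the plugin.
     */
    static double collectAvailability(Callable<Double> plugin, long timeoutMs) {
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "collector");
            t.setDaemon(true);
            return t;
        });
        Future<Double> f = pool.submit(plugin);
        try {
            return f.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            f.cancel(true);    // attempt to interrupt the hung collection
            return AVAIL_DOWN; // report down for the cancelled metric
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return AVAIL_DOWN;
        } catch (ExecutionException e) {
            return AVAIL_DOWN; // assumption: a failing plugin counts as down
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(collectAvailability(() -> AVAIL_UP, 1_000));
        System.out.println(collectAvailability(() -> {
            Thread.sleep(5_000); // simulated hang
            return AVAIL_UP;
        }, 100));
    }
}
```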

Activity

Ryan Morgan added a comment -

Fixed in 4.5.1.2. Insert a down data point for any canceled Availability metric:

To git@git.springsource.org:hq/hq.git
300d80a..7ae7d7d 4.5.1.x -> 4.5.1.x

Kashyap Parikh added a comment -

Tested the ScheduleThread fix that sends a down value for canceled measurements by adding a sleep in the Agent's getValue(), as shown below. To test, set the sleep time to 10 minutes and expect the agent server to report a down value every time collection is scheduled (since it should be canceled after 5 seconds). What I am seeing is that a down data point is reported along with an up one in the same report. This is due to graceful handling of the ScheduleThread interrupt in MeasurementPlugin. We should avoid reporting both 0 and 1 availability data points in the same interval.

Here's the change in AgentMeasurementPlugin.java to simulate collection hang:

@@ -68,6 +70,18 @@ public class AgentMeasurementPlugin
     }
     public MetricValue getValue(Metric metric) {...

         if(metric.toString().equals(AVAIL_TMPL)){
+            Properties p = agent.getBootConfig().getBootProperties();
+            String hang = p.getProperty("hang");
+            if (hang != null) {
+                try {
+                    Thread.sleep(Long.parseLong(hang.trim()));
+                } catch (InterruptedException e) {
+                    // TODO Auto-generated catch block
+                    e.printStackTrace();
+                }
+            }
             return new MetricValue(Metric.AVAIL_UP);
         }

Ryan Morgan added a comment -

I reopened this bug based on Kashyap's findings, thinking we might have a way to work around this issue.

As it turns out, we don't have much control over what the measurement plugin reports in the case of a failed cancellation. In the code above, I'd say it's a plugin bug that an interrupt does not cause an exception to be raised or a down value to be returned.
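To make the plugin-side point concrete, here is a small sketch contrasting a getValue()-style method that swallows the interrupt with one that returns a down value when interrupted. All names (InterruptAwarePluginSketch, swallowingGetValue, interruptAwareGetValue, runAndInterrupt) are hypothetical, and plain doubles stand in for MetricValue; raising an exception instead of returning down would be the other option mentioned above:

```java
import java.util.function.DoubleSupplier;

// Hypothetical illustration; not taken from the HQ agent source.
class InterruptAwarePluginSketch {
    static final double AVAIL_UP = 1.0;
    static final double AVAIL_DOWN = 0.0;

    /** Mirrors the test change: the interrupt is swallowed and the method falls through to up. */
    static double swallowingGetValue(long hangMs) {
        try {
            Thread.sleep(hangMs);
        } catch (InterruptedException e) {
            // swallowed: the caller's cancellation is lost
        }
        return AVAIL_UP;
    }

    /** Treats an interrupt as a cancelled collection and reports down. */
    static double interruptAwareGetValue(long hangMs) {
        try {
            Thread.sleep(hangMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the flag set for callers
            return AVAIL_DOWN;
        }
        return AVAIL_UP;
    }

    /** Runs getValue on its own thread, interrupts it after delayMs, returns its result. */
    static double runAndInterrupt(DoubleSupplier getValue, long delayMs) {
        double[] result = {Double.NaN}; // NaN means the thread produced nothing
        Thread t = new Thread(() -> result[0] = getValue.getAsDouble());
        t.setDaemon(true);
        t.start();
        try {
            Thread.sleep(delayMs);
            t.interrupt();
            t.join(2_000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return result[0];
    }

    public static void main(String[] args) {
        System.out.println(runAndInterrupt(() -> swallowingGetValue(5_000), 100));
        System.out.println(runAndInterrupt(() -> interruptAwareGetValue(5_000), 100));
    }
}
```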

In the example above:
1. ScheduleThread cancels metric collection sending an interrupt to the measurement plugin
2. ScheduleThread inserts avail=0
3. InterruptedException caught in measurement plugin, falls through returning avail=1

In this case we could potentially check the result of the interrupt by checking whether the measurement plugin was still running. Here it would not be, so avail=1 would be reported and the ScheduleThread would skip sending the down measurement.

Consider the case though where the interrupt would fail, leaving the measurement collection running:

1. ScheduleThread cancels metric collection sending an interrupt to the measurement plugin
2. ScheduleThread inserts avail=0 since measurement plugin still running
3. Some time later, the collection times out and from that point the plugin could return any value it chooses.

So inserting some logic to detect if the plugin is still collecting here is not sufficient. The ScheduleThread would need to know the future behavior of the plugin, which of course cannot be known.
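Both scenarios can be modeled with a toy scheduler that applies the "is the plugin still running?" check after the interrupt. This is only a sketch under assumed names (StillRunningCheckSketch, collect, swallowsInterrupt, ignoresInterrupt) and made-up timings; the final join() exists solely so the example can show what the plugin reports later, which a real scheduler could not wait for:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.DoubleSupplier;

// Hypothetical illustration; not taken from the HQ agent source.
class StillRunningCheckSketch {
    static final double AVAIL_UP = 1.0;
    static final double AVAIL_DOWN = 0.0;

    /**
     * Interrupts the plugin after timeoutMs, waits graceMs, and inserts a down
     * point only if the plugin thread is still alive at that moment.
     */
    static List<Double> collect(DoubleSupplier plugin, long timeoutMs, long graceMs) {
        List<Double> report = new CopyOnWriteArrayList<>();
        Thread worker = new Thread(() -> report.add(plugin.getAsDouble()));
        worker.setDaemon(true);
        worker.start();
        try {
            worker.join(timeoutMs);
            if (worker.isAlive()) {
                worker.interrupt();
                worker.join(graceMs);
                if (worker.isAlive()) {
                    report.add(AVAIL_DOWN); // interrupt failed: insert down
                }
            }
            worker.join(); // sketch only: wait to observe the late value
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return report;
    }

    /** Stops on interrupt but still returns up (the first scenario). */
    static double swallowsInterrupt() {
        try {
            Thread.sleep(5_000);
        } catch (InterruptedException e) {
            // falls through
        }
        return AVAIL_UP;
    }

    /** Ignores the interrupt and returns up about one second later (the second scenario). */
    static double ignoresInterrupt() {
        long end = System.nanoTime() + 1_000_000_000L;
        while (System.nanoTime() < end) {
            // never checks the interrupt flag
        }
        return AVAIL_UP;
    }

    public static void main(String[] args) {
        System.out.println(collect(StillRunningCheckSketch::swallowsInterrupt, 100, 300));
        System.out.println(collect(StillRunningCheckSketch::ignoresInterrupt, 100, 200));
    }
}
```

In the first call the check suppresses the down point and only the plugin's up value is reported. In the second, the down point is inserted and the plugin's own value still arrives afterwards, which is the outcome the check cannot prevent.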

Do we think this case would be frequent? I'm not sure this test reflects the real-world cases. In those cases we see:

1. Plugin hangs but eventually times out returning avail=0
2. Plugin hangs but never times out

In both these cases I think the current solution would be sufficient.

Kashyap Parikh added a comment -

Closing based on Ryan's comments above.


Dates

  • Created:
    Updated:
    Resolved:
    Last comment:
    3 years, 12 weeks, 4 days ago