Hyperic HQ

Oracle plugin needs to throw MetricUnreachableException if managed Oracle instance is not available due to read timeout

Details

  • Type: Bug
  • Status: Closed
  • Priority: Major
  • Resolution: Deferred
  • Affects Version/s: 4.2.0
  • Fix Version/s: None
  • Component/s: Plugins
  • Case Links:
    none
  • Regression:
    No

Description

Related to HHQ-1695: if a managed Oracle server cannot be contacted to collect metrics because of a read timeout, the agent keeps trying to collect every metric for that resource. Because each timeout appears to take about 6 seconds (even though the plugin defines the timeout as 60000), metric collection for other resources on the same agent may not occur on time, which causes false availability alerts.

In lieu of fixing HHQ-1695, this situation can be improved by throwing a MetricUnreachableException from OracleMeasurementPlugin's getValue() method. This tells the measurement plugin manager not to attempt collection of the other metrics on that resource for this interval. The stack trace is as follows:

2010-04-05 04:35:54,170 ERROR [ScheduleThread] [OracleMeasurementPlugin] Io exception: The Network Adapter could not establish the connection
java.sql.SQLException: Io exception: The Network Adapter could not establish the connection
at oracle.jdbc.dbaccess.DBError.throwSqlException(DBError.java:134)
at oracle.jdbc.dbaccess.DBError.throwSqlException(DBError.java:179)
at oracle.jdbc.dbaccess.DBError.throwSqlException(DBError.java:333)
at oracle.jdbc.driver.OracleConnection.<init>(OracleConnection.java:404)
at oracle.jdbc.driver.OracleDriver.getConnectionInstance(OracleDriver.java:468)
at oracle.jdbc.driver.OracleDriver.connect(OracleDriver.java:314)
at java.sql.DriverManager.getConnection(Unknown Source)
at java.sql.DriverManager.getConnection(Unknown Source)
at org.hyperic.hq.plugin.oracle.OracleMeasurementPlugin.getConnection(OracleMeasurementPlugin.java:97)
at org.hyperic.hq.product.JDBCMeasurementPlugin.getCachedConnection(JDBCMeasurementPlugin.java:199)
at org.hyperic.hq.product.JDBCMeasurementPlugin.getCachedConnection(JDBCMeasurementPlugin.java:188)
at org.hyperic.hq.plugin.oracle.OracleMeasurementPlugin.tablespaceIsOffline(OracleMeasurementPlugin.java:514)
at org.hyperic.hq.plugin.oracle.OracleMeasurementPlugin.getQuery(OracleMeasurementPlugin.java:449)
at org.hyperic.hq.product.JDBCMeasurementPlugin.getQueryValue(JDBCMeasurementPlugin.java:227)
at org.hyperic.hq.plugin.oracle.OracleMeasurementPlugin.getQueryValue(OracleMeasurementPlugin.java:547)
at org.hyperic.hq.product.JDBCMeasurementPlugin.getValue(JDBCMeasurementPlugin.java:138)
at org.hyperic.hq.product.MeasurementPluginManager.getPluginValue(MeasurementPluginManager.java:176)
at org.hyperic.hq.product.MeasurementPluginManager.getValue(MeasurementPluginManager.java:274)
at org.hyperic.hq.measurement.agent.server.ScheduleThread.getValue(ScheduleThread.java:298)
at org.hyperic.hq.measurement.agent.server.ScheduleThread.collect(ScheduleThread.java:387)
at org.hyperic.hq.measurement.agent.server.ScheduleThread.collect(ScheduleThread.java:344)
at org.hyperic.hq.measurement.agent.server.ScheduleThread.collect(ScheduleThread.java:490)
at org.hyperic.hq.measurement.agent.server.ScheduleThread.run(ScheduleThread.java:512)
at java.lang.Thread.run(Unknown Source)
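
The proposal in the description can be sketched roughly as follows. This is only an illustration under stated assumptions, not the plugin's actual code: the MetricUnreachableException class here is a local stand-in for the Hyperic class of the same name (whose real constructors are not shown in this issue), and MetricQuery and OracleGetValueSketch are hypothetical names introduced for the example.

```java
import java.sql.SQLException;

// Hypothetical stand-in for Hyperic's MetricUnreachableException;
// the real class and its constructors are not shown in this issue.
class MetricUnreachableException extends Exception {
    MetricUnreachableException(String message, Throwable cause) {
        super(message, cause);
    }
}

// Hypothetical abstraction over the JDBC query the plugin runs for one metric.
interface MetricQuery {
    double run() throws SQLException;
}

class OracleGetValueSketch {
    // Translate a JDBC failure into "unreachable" so the caller can skip the
    // remaining metrics of this resource for the current interval.
    static double getValue(MetricQuery query) throws MetricUnreachableException {
        try {
            return query.run();
        } catch (SQLException e) {
            throw new MetricUnreachableException(
                "Oracle instance unreachable: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        MetricQuery failing = () -> {
            throw new SQLException(
                "Io exception: The Network Adapter could not establish the connection");
        };
        try {
            getValue(failing);
        } catch (MetricUnreachableException e) {
            System.out.println("unreachable: " + e.getCause().getMessage());
        }
    }
}
```

Keeping the original SQLException as the cause preserves the driver's message for the agent log while still signalling the caller to stop.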

Activity

Todd Rader added a comment -

At what point(s) are you proposing to throw the MetricUnreachableException, and should it be thrown always when a SQLException is hit? It looks like there is no wrapped IOException, just the exception message.

Ryan Morgan added a comment -

This exception needs to be thrown from the MeasurementPlugin's getValue() method. Initially I was thinking we'd throw this based on the error code returned from the exception, but in thinking about it more it would seem that any Exception should probably throw unreachable since those will likely be caused by the database not being available.

Maybe we should run this by Scott as well.
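
The two policies discussed in this comment (decide by error code, or treat any exception as unreachable) might look like the sketch below. It is illustrative only: UnreachablePolicySketch and its method names are hypothetical, and the vendor code 17002 is an assumption about what the Oracle JDBC driver reports for "Io exception" that should be checked against the driver version in use.

```java
import java.sql.SQLException;

class UnreachablePolicySketch {
    // Assumption: vendor code reported by the Oracle JDBC driver for
    // "Io exception"; verify against the driver version in use.
    static final int ORACLE_IO_EXCEPTION = 17002;

    // Narrow policy: only a known network-level error code counts as unreachable.
    static boolean unreachableByErrorCode(SQLException e) {
        return e.getErrorCode() == ORACLE_IO_EXCEPTION;
    }

    // Broad policy: every SQLException counts as unreachable, on the theory
    // that most failures mean the database is not available.
    static boolean unreachableByAnyException(SQLException e) {
        return true;
    }

    public static void main(String[] args) {
        SQLException io = new SQLException("Io exception", "08006", 17002);
        SQLException missingTable = new SQLException("table or view does not exist", "42000", 942);
        System.out.println(unreachableByErrorCode(io));
        System.out.println(unreachableByErrorCode(missingTable));
        System.out.println(unreachableByAnyException(missingTable));
    }
}
```

The narrow policy avoids masking query bugs as outages; the broad policy is simpler and matches the reasoning that most failures here are availability problems.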

Scott Feldstein added a comment -

This bug really sucks. I am not confident about the side effects of throwing MetricUnreachableException, but there are still issues even if we do go that route:

1) Availability should never throw a MetricUnreachableException, since we need to explicitly return AVAIL_DOWN to tell the HQ Server that the resource is down. Therefore the senderThread will still hang.
2) The only thing that MetricUnreachableException does is avoid more than one iteration of going further into the resource to collect all of its metrics if it actually is down. It will still try to collect the metrics at least once a cycle.

I think we need to step back on this one and make a quick fix to the oracle timeout and figure out a better strategy down the road.
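
Point 1 above can be illustrated with a small sketch: the availability metric reports AVAIL_DOWN instead of throwing, while every other metric signals unreachable. Everything here is an assumption made for illustration: the exception class is a local stand-in, the AVAIL_UP and AVAIL_DOWN values are assumed to mirror Hyperic's constants, and the "Availability" metric name and AvailabilitySketch class are hypothetical.

```java
// Hypothetical stand-in for Hyperic's MetricUnreachableException.
class MetricUnreachableException extends Exception {
    MetricUnreachableException(String message) {
        super(message);
    }
}

class AvailabilitySketch {
    // Assumed to mirror Hyperic's availability constants.
    static final double AVAIL_UP = 1.0;
    static final double AVAIL_DOWN = 0.0;

    // metricName, connected and rawValue are illustrative inputs; a real plugin
    // would derive them from the metric template and the JDBC call.
    static double getValue(String metricName, boolean connected, double rawValue)
            throws MetricUnreachableException {
        boolean isAvailability = "Availability".equals(metricName);
        if (connected) {
            return isAvailability ? AVAIL_UP : rawValue;
        }
        if (isAvailability) {
            // The server must be told explicitly that the resource is down.
            return AVAIL_DOWN;
        }
        // Any other metric: tell the caller to skip this resource for now.
        throw new MetricUnreachableException(metricName + " unreachable");
    }

    public static void main(String[] args) throws Exception {
        System.out.println(getValue("Availability", false, 0.0)); // 0.0
        try {
            getValue("User Calls", false, 0.0);
        } catch (MetricUnreachableException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Note that the availability path still has to wait out the connection attempt before it can return AVAIL_DOWN, which is why the timeout itself remains a concern.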

Ryan Morgan added a comment -

The point of the UnreachableException is it will only collect one metric per resource, which should help, but yes we will still run into issues if there are 10 or more oracle resources on the agent. In that case a lower timeout will help, so I suggest we look into that as well.
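
The arithmetic behind this concern, assuming metrics are collected serially and each down resource costs one timed-out attempt per cycle, can be sketched as below. TimeoutBudgetSketch is a hypothetical name, the 6-second figure is the observed timeout from the description, and DriverManager.setLoginTimeout is the standard JDBC call for a login timeout in seconds; whether a given driver honors it is driver-dependent.

```java
import java.sql.DriverManager;

class TimeoutBudgetSketch {
    // Worst-case seconds the collector is blocked when every listed resource
    // is down and one metric attempt per resource runs into the timeout.
    static int worstCaseBlockSeconds(int downResources, int timeoutSeconds) {
        return downResources * timeoutSeconds;
    }

    public static void main(String[] args) {
        // 10 unreachable Oracle resources at roughly 6 seconds each.
        System.out.println(worstCaseBlockSeconds(10, 6)); // 60

        // Lowering the JDBC login timeout shrinks the worst case,
        // provided the driver respects this setting.
        DriverManager.setLoginTimeout(2);
        System.out.println(worstCaseBlockSeconds(10, DriverManager.getLoginTimeout())); // 20
    }
}
```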

Todd Rader added a comment -

Ryan, is this addressed by the ScheduleThread improvements?

Ryan Morgan added a comment -


The changes to the ScheduleThread will improve this situation greatly. The root cause still persists, but at least now when we run into this condition only Oracle resource types will be affected. I'm guessing that's ok, since there isn't much we can do here. (Aside from tinkering with timeouts and exception handling)

Yoav Epelman added a comment -

Bulk change to new components

Idan Hod added a comment -

As part of our continuous effort to improve product quality, the Hyperic product team has decided to implement a "zero bug policy" methodology.

Following this methodology, only defects that are planned to be handled in the near future will remain open. Any other defect will be deferred, with the option to be reevaluated if the need arises, or if changes to the Hyperic road-map make such defect a candidate for a fix.

We believe this new process will help create clarity and focus in the Hyperic road-map and ultimately benefit our customer base.

This bug has been deferred as part of the new policy.

We appreciate your cooperation and continued contribution to the improvement of Hyperic.


Dates

  • Created:
    Updated:
    Resolved:
    Last comment:
    41 weeks ago