(Breaking down common problems to expedite their diagnosis and resolution)
Device interaction is the foundation for the HP Network Automation (NA) product. Each vendor’s new device model is intentionally different from the next, so NA uses device drivers to standardize the data that each device returns. This extra layer between NA and managed devices opens the door to potential bugs, both in the NA core product and in the NA device drivers. Of course, there can also be bugs in the implementation of the device’s own operating system. This blog post helps you identify and classify a number of common error conditions. We run through some common situations to demonstrate whether an error is environmental (for example, device errors, network issues, and environmental settings), in the NA core product (across all device drivers), or related to a specific NA device driver. For the latter two cases, we suggest the troubleshooting information to collect and provide to HP Support personnel. Providing this information at the beginning of a support call saves everyone time and money.
How NA interacts with devices
Our concern is primarily with device driver issues, so let’s first go over how NA interacts with devices in general. Almost all interaction with devices is accomplished through tasks. Here is a reduced example of the Take Snapshot task.
In the image, the left columns of boxes represent parts of the NA core product while the two colored boxes collectively represent the NA device driver. The image illustrates the process of the most basic of device tasks, the snapshot.
The top half represents the device interaction phase, during which Network Automation:
Establishes the device connection
Captures the output
Stores the data (in this case, the running configuration) to the NA database
The lower half comprises the parsing phase, during which the driver receives the database information and parses out the device-specific information (in this case, the Device Information diagnostic which collects host name, model name, operating system version, and so on).
Side note: Checkpoint snapshot as a complement to a snapshot-related hotfix
Many problems related to the parsing phase of a snapshot require a checkpoint snapshot after installing the appropriate hotfix. Opening the Device Information diagnostic page runs the driver’s parser as the page is loaded, so the page shows the correct data. However, the Device Home page pulls from the NA database, which still contains the old values, so the page shows old data. This difference is caused by an efficiency step used by NA: if the top half (device interaction) yields a configuration that has not changed, NA skips the lower half (parsing) on the premise that if the input didn’t change, the output won’t change. NA has no way to know whether the driver’s parser code (step 11) has changed. You can run a checkpoint snapshot to force NA to run the lower half (parsing), updating the variables as per the fix.
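The skip-parsing optimization can be illustrated with a toy model. The sketch below is not NA's code: the class, the hash-based change check, and both parser functions are hypothetical names invented for illustration. It only shows why a regular snapshot leaves stale parsed values in place after a parser fix, and why a checkpoint snapshot refreshes them.

```python
import hashlib


class SnapshotStore:
    """Toy model of 'skip parsing when the configuration is unchanged'."""

    def __init__(self, parser):
        self.parser = parser
        self.last_digest = None
        self.parsed = None

    def snapshot(self, config_text, checkpoint=False):
        digest = hashlib.sha256(config_text.encode()).hexdigest()
        # Parse only when the input changed, or when a checkpoint forces it.
        if checkpoint or digest != self.last_digest:
            self.parsed = self.parser(config_text)
            self.last_digest = digest
        return self.parsed


def buggy_parser(config_text):
    # Stands in for a driver parser that fails to extract the host name.
    return {"hostname": None}


def fixed_parser(config_text):
    # Stands in for the parser after a hotfix.
    for line in config_text.splitlines():
        if line.startswith("hostname "):
            return {"hostname": line.split(None, 1)[1]}
    return {"hostname": None}


config = "hostname edge-router-1\n"   # made-up configuration text
store = SnapshotStore(buggy_parser)
store.snapshot(config)                # initial snapshot stores the bad value

store.parser = fixed_parser           # "install the hotfix"
after_regular = store.snapshot(config)                      # input unchanged, parsing skipped
after_checkpoint = store.snapshot(config, checkpoint=True)  # parsing forced

print(after_regular)
print(after_checkpoint)
```

In this model the regular snapshot after the fix still returns the stale value, because the unchanged input short-circuits the parser; only the forced run picks up the new parser.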
Environment vs. Driver vs. NA Core
How do we know where the problem occurs? Here are some defining types of task-related errors, broken down into categories.
There can certainly be overlap between these categories, though many times the problem falls neatly into a single category. Knowing how to identify what kind of error you’re seeing will speed the support process, and will enable you to solve it yourself (for environmental issues) or to direct your request to the right parties with the necessary logs.
Environmental error conditions are commonly correctable by the user in their environment, without involving HP Support.
The most basic of errors is the connection failure, which is usually caused by an issue in the network at large. It may be device-centric, as with improperly set up DNS or gateway information, or it could be in the network and caused by a firewall or other network appliance.
In this situation, the solution is to work with your network engineers to resolve the issue because the problem is not with NA. To confirm this situation, open a connection to the device from the OS console of the NA server. If that connection also fails, it mirrors what NA is experiencing. If a device does not use the default port, you may need to set the SSH Port or Telnet Port device access variables or add the custom variables http_port or https_port to the credentials used in NA.
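One way to run this kind of check from the NA server's operating system is a plain TCP connection test. The following is a minimal sketch using only the Python standard library; the function name is our own, and the host name in the usage comment is a placeholder. A successful TCP connect shows only that the port is reachable from the server, not that a login would succeed.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Example usage (placeholder host name; 22 and 23 are the default SSH and
# Telnet ports, so substitute the custom port if the device uses one):
#
#   for name, port in (("SSH", 22), ("Telnet", 23)):
#       print(name, port, can_connect("device.example.com", port))
```

If this test fails from the NA server but succeeds from another host, that points toward a firewall or routing issue between the NA server and the device rather than toward NA itself.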
Another common scenario is the failure of file transfers initiated as part of a snapshot. These failures can manifest in two ways, either as a caught failure (left side example) or an uncaught failure (right side example). In the caught failure, the error condition is handled properly and the task rolls to the next protocol, but it’s still a failure that may reduce overall task efficiency. In the uncaught failure, the timeout can cause a synchronization problem and destabilize the task, resulting in failure.
In each case, the problem is that the protocol server in question (on the NA core server) can’t be reached by the device. The most likely cause is a network issue, but there can be other causes, including:
A protocol error in the device OS
An unexpected interactive question from the device
The protocol server on the NA core server running on a non-default port
This last cause most often occurs when the OS sshd is left on port 22. The most common SCP use case requires that the NA proxy/SCP server be running on port 22, which can raise issues with internal IT policies.
A third environmental error case revolves around authentication errors. In this case, the supplied credentials may be incorrect, or the rotation of password rules may cause rule-specific settings (such as device access variables) not to be applied properly.
While the fact of the failure is often easy to detect, the cause of the failure commonly is not. Because passwords are hidden for security reasons, one key is to look at the password rule name, which is included in the Connect line each time a new device connection is attempted. Because NA tends to use the first rule that works, there can also be issues with multiple rules that work for initial login but have different security privileges after that. Because the privilege limitation is hit after the login completes, NA may not try other rules because it believes the problem is not with the credentials.
Correctly identifying the credentials being used is critical to fixing the problem. You can often solve the problem by specifying which rule Network Automation tries first (on the Edit Device page).
One last group of problems that can often be fixed environmentally consists of three specific issues in the CLI Driver Discovery task, related to the handling of device wakeup and of More prompts issued by the device. SNMP is not affected, so these issues most often arise in networks where SNMP is restricted or not allowed at all.
The first of these three involves the use of wakeup characters, specifically the Ctrl-U to wake up the terminal. Some devices (incorrectly) interpret this character as a printable character, which causes the prompt identification process to fail. NA doesn’t have a usable prompt, so it falls back to using the regular expression pattern “\S[\S\s]+$” along with a short timeout and just takes what it gets back, rather than expecting a firm prompt.
This issue prevents NA from properly handling the More prompt (that is, --More--), because there is no way for NA to know whether it has encountered a device prompt or has stalled at a More prompt that needs to be paged through. If the later pages of the command data contain additional information, that information is not made available to the drivers’ detection routines, possibly resulting in a failed discovery.
To fix this scenario, set the device access variable skip_ctrl_u to true.
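To see why the fallback pattern cannot tell a finished command from a stalled pager, consider the small sketch below. The fallback pattern is the one quoted above; the "router#" prompt, the firm prompt pattern, and both session buffers are made up for illustration and do not come from a real device or from NA's code.

```python
import re

# Fallback pattern quoted in the text above.
FALLBACK = re.compile(r"\S[\S\s]+$")
# Hypothetical firm prompt pattern for a device whose prompt is "router#".
FIRM_PROMPT = re.compile(r"router#\s*$")

# Made-up session buffers, not captured from a real device.
finished = "Example OS, Version 1.0\r\nrouter#"
stalled = "Example OS, Version 1.0\r\n --More-- "

for label, buffer in (("finished", finished), ("stalled", stalled)):
    print(
        label,
        "| fallback matches:", FALLBACK.search(buffer) is not None,
        "| firm prompt matches:", FIRM_PROMPT.search(buffer) is not None,
    )
```

The fallback pattern matches any buffer that contains non-whitespace text, so both buffers look "complete" to it. Only a firm prompt pattern separates the two cases, which is why restoring proper prompt detection matters.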
The second scenario specifically affects SSH connections in discovery, while Telnet is unaffected. The bug is caused by the third-party SSH client code that NA uses. This code interprets a timeout (due to lack of expected patterns) as a session disconnection. This interpretation can be worked around in other driver tasks but, for architectural reasons, not in discovery tasks. When a More prompt is encountered, the timeout occurs. NA sends a space character to continue, but by then it’s too late: the session is already considered dead. Future attempts to interact with the device are unsuccessful, and the discovery task fails. This situation occurs most often with devices whose show version information is paged. To work around this case, set the device access variable PollRead to true, or set
in the system-wide RCX settings (adjustable_options.rcx). This configuration applies the PollRead variable to all discovery tasks.
A third error condition can be caused when the standard timeout variable is set to too large a value. NA relies at points on the combination of a “[\S\s]+?” regex pattern and a time delay, and uses the standard timeout (if it is set) instead of a default of 4 seconds. If the standard timeout is set too high, the device could idle out the connection due to the large sleep+timeout that is used, causing the discovery task to fail. Unless a device is particularly slow and needs a long time to establish the initial connection, the standard timeout should be set to a reasonable value of less than 30 seconds. Large values are typically not needed for other cases because NA only starts the timeout clock when no data is incoming.
There are several kinds of problems that are directly related to a fault in the driver code.
The most obvious is a straightforward ‘syntax error’ message, which, while rare, can happen due to a typo, especially inside rarely used code blocks. This situation requires a hotfix, and the message indicating where the error occurred tells HP Support where to apply the fix.
A second error case involves a CLI timeout. There are two common causes for this timeout. The most common cause is that the device response does not match expectations. It may be due to an unexpected permission error or a message that is particular to the environment. There are many cases that can’t be easily simulated in a lab environment.
A related but different scenario involves the driver’s script getting out of sync, often due to unexpected instances of the device prompt occurring within the data stream. Some devices, like the F5 BigIP in this case, perform syntax verification as the user types the command or scroll the command line to accommodate a long command. In this process, the device reissues the prompt with some ANSI markup characters that preserve the terminal context. Because Network Automation matches the prompt to manage the data buffer, NA may think the command is completed (and all related data captured) before the actual command executes. The image shows NA moving on to the set cli config-output-format command while the data buffer indicates that the previous command is still being handled. This behavior can cause CLI timeouts due to (now) unexpected messages. Even if the task succeeds, the compromised data capture stores the output from each command in the wrong spot, resulting in data values not being parsed properly.
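The synchronization problem can be reproduced in miniature. In the sketch below, the prompt string, the command, and the session data are all hypothetical, and the reader function is a deliberately naive stand-in, not NA's buffer handling. The device echoes the command, redraws the line with an ANSI erase sequence and a fresh prompt, and only then prints the real output.

```python
import re

# Hypothetical prompt and session data, invented for illustration.
PROMPT = re.compile(r"device# ")

# Echoed command, then a redraw (ANSI erase-line, carriage return, prompt),
# then the actual command output and the final prompt.
stream = (
    "show running-config\x1b[K\r"
    "device# show running-config\r\n"
    "line 1\r\nline 2\r\n"
    "device# "
)


def read_until_prompt(buffer, prompt):
    """Split the buffer at the first prompt match, as a naive reader would."""
    match = prompt.search(buffer)
    if match is None:
        return buffer, ""
    return buffer[:match.start()], buffer[match.end():]


captured, leftover = read_until_prompt(stream, PROMPT)
print(repr(captured))   # what gets stored as this command's output
print(repr(leftover))   # what is still in the buffer for the next command
```

The redrawn prompt ends the capture early, so the stored output contains only the echoed command, while the real output stays in the buffer and is attributed to whatever command is sent next.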
The final category of driver-related problems is the failure to parse specific values from the stored configuration data. Before contacting HP Support, evaluate the previously discussed cases to make sure that they are not the primary cause of the failure. An important thing to note when contacting HP Support is whether the value in question appears in the configuration or the other data collected, as seen in the device session log. In some cases, the desired value (for example, a card serial number) was not parsed because it did not appear on the devices used to create the device driver or because it was represented in a different form on those devices. In the former case, the command needed to collect the value is not included in the driver snapshot, so the value is not in the parsed output. In this case, provide HP Support with the full command line for the command and also a sample of that command’s output, highlighting the value to be parsed. Providing this information up front reduces the inevitable churn needed to get this output and facilitates the parsing update and resultant testing.
There are many other categories of driver error, all of which require the intervention of HP Support. The resulting tickets should be referred to the NA Content team and packaged with the logging described in What Logs to Provide.
NA Core Issues
This section refers to issues that involve the NA Core product and are commonly felt across multiple device types and models. We will not discuss these issues in this post except to note that issues that affect more than one device type should be identified as such when contacting HP Support so they can direct their internal discussion accordingly.
Driver Logging Requirements
Network Automation offers a number of logging options. Selecting the correct options ensures full coverage of the problem in question while minimizing impact to the customer’s system and the time required for HP Support personnel to review the logs. All of these together work toward fixing your problem faster.
What Logs to Provide
NA offers two methods to enable jboss logging: global logs and per-task logs. Per-task logging is enabled by selecting logging settings when creating a task, while global logs are set through the Admin > Troubleshooting menu. Per-task logging is preferable in most cases because it is restricted to the running task and has minimal system impact. Global logging may be needed for randomly occurring errors, group tasks, or automatically scheduled tasks, because those offer no opportunity to select per-task logging options. Global logging is captured for the system as a whole, including all running tasks, so be sure to note the TaskID when filing a support request; it is crucial to sifting through what may be a very large haystack. Also, when using global logs, ensure that the TaskID is present in the jboss_wrapperX.log files within the troubleshooting package. Log files wrap after 10MB, which can happen quickly with more verbose logging settings. If the task is not present, include multiple “history” files and ensure that at least one of them contains the TaskID. Submitting a logging package that doesn’t contain the task in question guarantees extra cycles of churn in processing.
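Before submitting a troubleshooting package, you can check which log files actually mention the task. The sketch below is one possible way to do that, under two assumptions that you should verify against your own logs: that the files match the glob pattern jboss_wrapper*.log*, and that the TaskID appears in the log text as a standalone number. The function name and parameters are our own.

```python
import re
from pathlib import Path


def logs_containing_task(log_dir, task_id, pattern="jboss_wrapper*.log*"):
    """Return the names of log files under log_dir that mention task_id."""
    # Assumption: the task ID appears in the log text as a standalone number.
    needle = re.compile(rf"\b{re.escape(str(task_id))}\b")
    hits = []
    for path in sorted(Path(log_dir).glob(pattern)):
        if not path.is_file():
            continue
        # Read line by line so that large (10MB) logs are not loaded at once.
        with path.open(errors="replace") as handle:
            if any(needle.search(line) for line in handle):
                hits.append(path.name)
    return hits


# Example usage (placeholder directory and task ID):
#
#   print(logs_containing_task("/path/to/extracted/package", 123456))
```

An empty result suggests that the logs wrapped before the package was collected, in which case older history files need to be included.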
For each of the jboss logging methods, knowing what logs to select is just as crucial. For almost all driver-related tasks, three central logs contain all needed information:
device/driver/discovery should also be included for Driver Discovery tasks.
Remember that logging settings are hierarchical, so selecting “device/driver,” for example, includes all of the logs underneath. While using device/driver may save you a few seconds in clicking, it may waste tens of minutes in log processing due to the verbosity of the other logs that are now interleaved within the required log files.
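The hierarchy works like a path prefix. As a rough illustration (a sketch of the concept, not NA's implementation), the function below treats a log as enabled when it, or any of its ancestors, is selected. The name device/driver/example_child is a hypothetical placeholder for the other logs that live under device/driver.

```python
def is_enabled(log_name, selected):
    """Return True if log_name or any of its ancestors is in the selected set."""
    parts = log_name.split("/")
    return any("/".join(parts[:i]) in selected for i in range(1, len(parts) + 1))


logs = [
    "device/driver/discovery",
    "device/driver/example_child",   # hypothetical sibling log
    "device/session/log",
]

narrow = [name for name in logs if is_enabled(name, {"device/driver/discovery"})]
broad = [name for name in logs if is_enabled(name, {"device/driver"})]

print(narrow)
print(broad)
```

Selecting the parent pulls in every child beneath it, which is where the extra interleaved output comes from; selecting only the specific logs you need keeps the files focused.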
In addition to jboss logging, the task Session Log is a key part of debugging task-related problems. To obtain a session log, run a task with the Store session log check box selected and then copy the text from the task window after it completes. This text contains many Send, Receive, and Expect lines, similar to the images shown in the examples above. Similar data is available through the device/session/log logging setting, but this data is more difficult to process. Copying the data from the browser window often speeds processing dramatically and also provides data needed to simulate the problem in unit tests, which enables more reliable hotfixes.
While we can work with either jboss or session logging, including both reduces churn and ensures that all avenues are covered, speeding the resolution of your problem.
What Tasks to Run
A question that comes up is what tasks should be run to fill or update the data in a given part of NA’s device information. This information is also useful when applying a hotfix to ensure that the proper data is updated. Briefly stated, the mapping is as follows:
Now you can facilitate the process of identifying and correcting problems with Network Automation device drivers. We look forward to fixing more (or would that be fewer?) of your NA device driver bugs.
The most recent driver pack is available from HP Live Network under Driver Packs > Network Automation Version 9.x - 10 Driver Packs.
Under the HP Software Network Management Vendor Relationship Program, the NNMi and NA driver teams work hand-in-glove with the top hardware vendors in the networking space to provide the broadest, deepest device support in the timeliest manner. If any of your hardware vendors does not already participate in the Vendor Relationship Program, please help us establish that connection by contacting HP Support.
John has been a member of the HP NA driver team for 8 years. For the last several years, he has been the lead developer for HP NA Content, and also handles the ongoing development of customer enhancements and defect fixes.