What Happens with Remaining Devices After ServiceGateway-Related Processes Restart?

Article Original Creation Date: 2014-03-21

Overview

This article answers frequently asked questions about what happens in the background to the remaining devices after ServiceGateway-related processes restart.

Reference Information

The Customer uses a specific Plugin to provision devices. That plugin is a workflow step in the Default Configuration Synchronization policy.

Question: After a restart, what happens to the remaining devices that were being processed by the policy? Does ServiceGateway (SG) try to process them again?

Answer: Any devices that are in the SPRT_SG_POLICY_DEVICE_HOLD table will be processed after the system restarts.

In the case of a timer-based policy, they will be processed the next time the policy is in an active window, provided that the concurrency limit has not been reached.
In the case of an event-based policy, any devices that are queued in SPRT_SG_POLICY_DEVICE_HOLD can also be processed the next time the timer runs. The timer is what clears event-based devices that are queued in the _HOLD table, but only if the concurrency limit for the policy has not yet been reached.
As for the concurrency limit, it is measured against the number of records in SPRT_SG_POLICY_DEVICE for a given policy. SG does not look at the history tables to determine what is in progress. In SG 4.3.3 and earlier we did check the history for things like policy action retries, but we never check it to see if anything is currently in progress, or to determine what else we need to process for a policy. That all comes from the SPRT_SG_POLICY_DEVICE and SPRT_SG_POLICY_DEVICE_HOLD tables.
Because of this, clearing the history does not affect the remaining devices for a policy. That information is retained in SPRT_SG_POLICY_DEVICE_HOLD.
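
The selection rule described above can be sketched as follows. This is a minimal illustration, not ServiceGateway code: it runs against an in-memory SQLite database, and the POLICY_ID and DEVICE_ID columns, the concurrency_limit parameter, the ordering by DEVICE_ID, and both helper functions are assumptions made for the example. Only the two table names come from this article.

```python
import sqlite3


def devices_eligible_after_restart(conn, policy_id, concurrency_limit):
    """Return held devices that could be picked up on the next timer run.

    Assumption: both tables carry POLICY_ID and DEVICE_ID columns.
    """
    # The in-progress count comes from SPRT_SG_POLICY_DEVICE only;
    # the history tables are never consulted for this.
    (in_progress,) = conn.execute(
        "SELECT COUNT(*) FROM SPRT_SG_POLICY_DEVICE WHERE POLICY_ID = ?",
        (policy_id,),
    ).fetchone()
    free_slots = max(concurrency_limit - in_progress, 0)
    # Ordering by DEVICE_ID is only here to make the example deterministic.
    rows = conn.execute(
        "SELECT DEVICE_ID FROM SPRT_SG_POLICY_DEVICE_HOLD "
        "WHERE POLICY_ID = ? ORDER BY DEVICE_ID LIMIT ?",
        (policy_id, free_slots),
    ).fetchall()
    return [row[0] for row in rows]


def build_hold_demo_db():
    """Create a toy database: two devices in progress, three on hold."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE SPRT_SG_POLICY_DEVICE (POLICY_ID INTEGER, DEVICE_ID TEXT)"
    )
    conn.execute(
        "CREATE TABLE SPRT_SG_POLICY_DEVICE_HOLD (POLICY_ID INTEGER, DEVICE_ID TEXT)"
    )
    conn.executemany(
        "INSERT INTO SPRT_SG_POLICY_DEVICE VALUES (?, ?)",
        [(1, "dev-a"), (1, "dev-b")],
    )
    conn.executemany(
        "INSERT INTO SPRT_SG_POLICY_DEVICE_HOLD VALUES (?, ?)",
        [(1, "dev-c"), (1, "dev-d"), (1, "dev-e")],
    )
    return conn
```

With two devices in progress and a hypothetical limit of 3, only one held device fits. With a limit of 2, none do, which is why stuck rows left in SPRT_SG_POLICY_DEVICE can block held devices from being processed.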

Question: When does a device end up in the SPRT_SG_POLICY_DEVICE_HOLD table?

Answer: The only way a device will end up in SPRT_SG_POLICY_DEVICE_HOLD is if SG puts it there.

So in the case of a timer-based policy this will be when building the device list.
In the case of an event-based policy this will be if the device is already in progress in another policy, or if the concurrency limit for the policy has already been reached.
SG will also move a device from SPRT_SG_POLICY_DEVICE to SPRT_SG_POLICY_DEVICE_HOLD if the policy workflow marks the device as SKIPPED.
If a device is actively being processed, it will be in SPRT_SG_POLICY_DEVICE, and not in SPRT_SG_POLICY_DEVICE_HOLD. So if the system crashes while the device is being processed, the Database (DB) state will remain as it is, showing that the device is in progress. Once the system comes back online, the device will still be in SPRT_SG_POLICY_DEVICE, and this will count towards the concurrency limit for that policy, even though the device is no longer being processed. This is what we call a stuck device, and this is one of the scenarios that runAgedResultsCheck is designed to handle.
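
The placement rules for an event-based policy can be summarized in a few lines. This is a sketch of the decision only, under the assumption that the three inputs are already known; the Placement enum and both function names are hypothetical and do not correspond to any SG API.

```python
from enum import Enum


class Placement(Enum):
    """Which table a device ends up in (table names from this article)."""
    PROCESS = "SPRT_SG_POLICY_DEVICE"
    HOLD = "SPRT_SG_POLICY_DEVICE_HOLD"


def place_event_device(in_progress_elsewhere, in_progress_count, concurrency_limit):
    """Decide where an event-based device goes when its event arrives.

    in_progress_count is the number of rows in SPRT_SG_POLICY_DEVICE for
    the policy, so stuck devices left over from a crash are counted too.
    """
    if in_progress_elsewhere or in_progress_count >= concurrency_limit:
        return Placement.HOLD
    return Placement.PROCESS


def place_after_workflow(marked_skipped):
    """A device marked SKIPPED by the workflow is moved to the hold table.

    Returns None otherwise, since this article does not describe any other
    movement at this point.
    """
    return Placement.HOLD if marked_skipped else None
```

For example, with a hypothetical limit of 3, a policy that has three stuck rows left in SPRT_SG_POLICY_DEVICE sends every new event-based device to the hold table until those rows are cleaned up.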

Question: Is it advisable in some circumstances to delete queued (event-based) devices from the SPRT_SG_POLICY_DEVICE_HOLD table?

Answer: If you are removing entries from SPRT_NC_INFORM_EVENT, then it makes sense to also remove QUEUED devices from SPRT_SG_POLICY_DEVICE_HOLD.

Question: After a system outage, what steps should be taken to lose as few as possible of the devices that were in the middle of the provisioning process, and so keep phone calls from customers to a minimum?

Answer: The issue we run into when a system goes down is what we call stuck devices. This is the case where the DB shows them as being in progress, but they are not actually being processed by the system. The timer job called runAgedResultsCheck is designed to handle these types of devices, although the number of stuck devices affects how long the job takes to clean them up.

This is what runAgedResultsCheck does:

Clean open action history records. These are records in SPRT_SG_POLICY_ACTION_HISTORY that have an ACTION_EXEC_START_TIME older than the Policy Action Timeout preference, and ACTION_EXEC_END_TIME is NULL. For these records we set the ACTION_EXEC_END_TIME to the current time, and set the RESULT_CODE to -3456 (timeout), MESSAGE to "The policy action has timed out", and WORKFLOW to STOP_FAIL.
- If this particular policy action is NOT flagged to be retried on a policy action timeout, we also close the “open” device history records, which are described in more detail in the next item.
Clean open device history records. These are records in SPRT_SG_POLICY_DEVICE_HISTORY that have an EXECUTION_START_TIME older than the Policy Action Timeout, and EXECUTION_END_TIME is NULL, and there are no open action history records that are associated with this device history record. For these records we set the EXECUTION_END_TIME to the current time, the STATUS to FAILED, and the MESSAGE to "The policy action has timed out".
Clean open device records. These are records in SPRT_SG_POLICY_DEVICE that have an EXECUTION_START_TIME older than the Policy Action Timeout, and that have no associated open records in SPRT_SG_POLICY_DEVICE_HISTORY. We simply delete these records from the table. We also increment the FAILED_COUNT in SPRT_SG_POLICY_EXEC_HISTORY by the number of records that are deleted for each policy.
Clean open SPRT_NC_CPE records. This is essentially making sure that SPRT_NC_CPE_WF_STATE.NC_CPE_WORKFLOW_STATE is set to NULL and SPRT_NC_CPE_CONNECTION.NC_CPE_CONNECTION is set to 0 for any devices where SPRT_NC_CPE.NC_CPE_LAST_CONTACT_TIME is older than the Policy Action Timeout.
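
The first three cleanup steps above can be sketched against an in-memory SQLite database. This is an illustration under stated assumptions, not the actual runAgedResultsCheck implementation: the key columns that link the tables (ID, DEVICE_HISTORY_ID, POLICY_ID, DEVICE_ID), the use of integer timestamps, and the reading of "older than the Policy Action Timeout" as "started before now minus timeout" are all assumptions. Retry handling and the SPRT_NC_CPE step are left out.

```python
import sqlite3

TIMEOUT_MESSAGE = "The policy action has timed out"

# Assumed schema: only the table names and the non-key columns come from the article.
SCHEMA = """
CREATE TABLE SPRT_SG_POLICY_ACTION_HISTORY (
    DEVICE_HISTORY_ID INTEGER, ACTION_EXEC_START_TIME INTEGER,
    ACTION_EXEC_END_TIME INTEGER, RESULT_CODE INTEGER, MESSAGE TEXT, WORKFLOW TEXT);
CREATE TABLE SPRT_SG_POLICY_DEVICE_HISTORY (
    ID INTEGER PRIMARY KEY, POLICY_ID INTEGER, DEVICE_ID TEXT,
    EXECUTION_START_TIME INTEGER, EXECUTION_END_TIME INTEGER, STATUS TEXT, MESSAGE TEXT);
CREATE TABLE SPRT_SG_POLICY_DEVICE (
    POLICY_ID INTEGER, DEVICE_ID TEXT, EXECUTION_START_TIME INTEGER);
CREATE TABLE SPRT_SG_POLICY_EXEC_HISTORY (POLICY_ID INTEGER, FAILED_COUNT INTEGER);
"""


def aged_results_check(conn, now, timeout):
    """Run the three cleanup steps; return {policy_id: deleted device rows}."""
    cutoff = now - timeout
    # Step 1: close open action history records that have timed out.
    conn.execute(
        "UPDATE SPRT_SG_POLICY_ACTION_HISTORY "
        "SET ACTION_EXEC_END_TIME = ?, RESULT_CODE = -3456, "
        "MESSAGE = ?, WORKFLOW = 'STOP_FAIL' "
        "WHERE ACTION_EXEC_END_TIME IS NULL AND ACTION_EXEC_START_TIME < ?",
        (now, TIMEOUT_MESSAGE, cutoff),
    )
    # Step 2: close open device history records with no open action records.
    conn.execute(
        "UPDATE SPRT_SG_POLICY_DEVICE_HISTORY "
        "SET EXECUTION_END_TIME = ?, STATUS = 'FAILED', MESSAGE = ? "
        "WHERE EXECUTION_END_TIME IS NULL AND EXECUTION_START_TIME < ? "
        "AND NOT EXISTS (SELECT 1 FROM SPRT_SG_POLICY_ACTION_HISTORY a "
        "WHERE a.DEVICE_HISTORY_ID = SPRT_SG_POLICY_DEVICE_HISTORY.ID "
        "AND a.ACTION_EXEC_END_TIME IS NULL)",
        (now, TIMEOUT_MESSAGE, cutoff),
    )
    # Step 3: delete aged device rows with no open device history, and
    # add the number deleted per policy to FAILED_COUNT.
    stale = (
        "EXECUTION_START_TIME < ? AND NOT EXISTS ("
        "SELECT 1 FROM SPRT_SG_POLICY_DEVICE_HISTORY h "
        "WHERE h.POLICY_ID = SPRT_SG_POLICY_DEVICE.POLICY_ID "
        "AND h.DEVICE_ID = SPRT_SG_POLICY_DEVICE.DEVICE_ID "
        "AND h.EXECUTION_END_TIME IS NULL)"
    )
    counts = conn.execute(
        f"SELECT POLICY_ID, COUNT(*) FROM SPRT_SG_POLICY_DEVICE "
        f"WHERE {stale} GROUP BY POLICY_ID",
        (cutoff,),
    ).fetchall()
    conn.execute(f"DELETE FROM SPRT_SG_POLICY_DEVICE WHERE {stale}", (cutoff,))
    for policy_id, deleted in counts:
        conn.execute(
            "UPDATE SPRT_SG_POLICY_EXEC_HISTORY "
            "SET FAILED_COUNT = FAILED_COUNT + ? WHERE POLICY_ID = ?",
            (deleted, policy_id),
        )
    return dict(counts)


def build_cleanup_demo_db():
    """Toy data: dev-a was interrupted at t=0, dev-b started recently at t=950."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO SPRT_SG_POLICY_DEVICE_HISTORY "
        "VALUES (1, 7, 'dev-a', 0, NULL, NULL, NULL)"
    )
    conn.execute(
        "INSERT INTO SPRT_SG_POLICY_ACTION_HISTORY "
        "VALUES (1, 0, NULL, NULL, NULL, NULL)"
    )
    conn.execute("INSERT INTO SPRT_SG_POLICY_DEVICE VALUES (7, 'dev-a', 0)")
    conn.execute("INSERT INTO SPRT_SG_POLICY_DEVICE VALUES (7, 'dev-b', 950)")
    conn.execute("INSERT INTO SPRT_SG_POLICY_EXEC_HISTORY VALUES (7, 0)")
    return conn
```

The order matters in this sketch: each step only touches rows whose dependent records were already closed by the previous step, which is why a stuck device is removed from SPRT_SG_POLICY_DEVICE only once the timeout has elapsed for all of its open records.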

So in essence runAgedResultsCheck performs some rather complicated checks to ensure that the correct data in the DB gets updated.

Letting runAgedResultsCheck run should result in no data loss. You just need to wait until the Policy Action Timeout has elapsed before you will see it clean anything up.

Priyanka Bhotika