Watchdog design for CF app

Tue, December 21, 2004, 04:02 PM under MobileAndEmbedded
I encourage you to read my previous entry on watchdogs, even if you are familiar with the topic. It will ensure we are on the same page on principles and terminology.

CF's non-deterministic nature, combined with some operations that can indeed take hundreds of milliseconds to complete (e.g. form loading), dictates the design choices. Cutting to the chase, I'd say your only viable option is to have another blind process as the dog and then choose a high interval for kicking it from the CF app (i.e. >9 seconds). Further, I suggest the dog is an eVC app with as little baggage as possible. For the kick, you can choose between a variety of IPC mechanisms, and I suggest using a named event. The CF app signals the event and the eVC app waits on it, with a timeout twice as large as the kick interval.

The CF app will signal the event exactly like I have described here (i.e. cmdWrite_Click method). Obviously you create the named event once and store the handle. On a timer's tick (System.Windows.Forms.Timer; *not* a Threading.Timer) signal the event; the Forms timer ticks on the UI thread, so each kick shows that the UI thread is still responsive. Also, signal it before any method that will take some time to complete. You can basically kick the dog from a number of places, but the important one is the kick from the UI thread, on a timer whose interval is half the timeout defined in the dog process.
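
To make the timing relationship concrete, here is a small sketch in C. It only does the arithmetic: the function names are invented for this example, and it assumes the dog restarts its wait at each kick and ignores scheduling latency.

```c
#include <assert.h>

/* Timer interval for the CF app: half the dog's wait timeout. */
unsigned long kick_interval_ms(unsigned long dog_timeout_ms)
{
    return dog_timeout_ms / 2;
}

/* How late a kick may arrive after its scheduled tick before the dog's
   wait times out. The dog's deadline is dog_timeout_ms after the
   previous kick; the next tick is due interval_ms after that kick. */
unsigned long stall_slack_ms(unsigned long dog_timeout_ms,
                             unsigned long interval_ms)
{
    assert(interval_ms <= dog_timeout_ms);
    return dog_timeout_ms - interval_ms;
}
```

With a 20000 ms timeout in the dog, the timer interval comes out at 10000 ms, and the UI thread can be stuck for roughly another 10000 ms past a scheduled tick before the unit is reset. That slack is what absorbs the occasional slow form load.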

So the eVC app acting as the dog has code similar to this pseudocode:
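// in a thread

while (1)
{
    // Wait up to 20 s (twice the CF app's kick interval) for a kick
    if (WaitForSingleObject(hDogEvent, 20000) != WAIT_OBJECT_0)
    {
        // No kick arrived in time (or the wait failed): RESET UNIT
    }
    // Clear the event; needed if it was created as manual-reset
    ResetEvent(hDogEvent);
}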
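
The != WAIT_OBJECT_0 test above treats a timeout and a failed wait alike. If you want the stored reset reason to tell them apart, one possible refinement is sketched below. The DOG_ names, dog_classify and dog_step are hypothetical and not part of any SDK; the three constants are local copies of the Win32 values so that the sketch compiles off-device, and wait_fn stands in for the real WaitForSingleObject call on the event handle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local copies of the Win32 wait results. On the unit, use
   WAIT_OBJECT_0, WAIT_TIMEOUT and WAIT_FAILED from the SDK headers. */
#define DOG_WAIT_OBJECT_0 0x00000000u
#define DOG_WAIT_TIMEOUT  0x00000102u
#define DOG_WAIT_FAILED   0xFFFFFFFFu

typedef enum {
    DOG_KICKED,           /* event was signalled: keep looping       */
    DOG_RESET_NO_KICK,    /* timeout expired without a kick          */
    DOG_RESET_WAIT_ERROR  /* wait failed or returned something else  */
} dog_action;

/* Map the result of the wait to what the dog should do next. */
dog_action dog_classify(uint32_t wait_result)
{
    if (wait_result == DOG_WAIT_OBJECT_0) return DOG_KICKED;
    if (wait_result == DOG_WAIT_TIMEOUT)  return DOG_RESET_NO_KICK;
    return DOG_RESET_WAIT_ERROR;
}

/* One iteration of the dog loop, with the wait call passed in. */
dog_action dog_step(uint32_t (*wait_fn)(uint32_t timeout_ms),
                    uint32_t timeout_ms)
{
    assert(wait_fn != NULL);
    return dog_classify(wait_fn(timeout_ms));
}
```

Both reset outcomes still end in a reset; the only difference is what gets recorded as the reason beforehand.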

We implemented the above on our platform with a few additions.
1. The dog itself kicks the hardware watchdog on our unit. So should anything go wrong with that process (effectively should something go wrong in the OS), our unit resets.
2. The dog is also the launcher of the CF app on unit startup: it starts the CF app and keeps the handle to the process. On a separate thread, it waits on that process handle; if the wait returns (interpreted as the app having exited), it also resets the unit. [Note there is no mechanism for the user to terminate the CF app but, in the case where the process is nuked for whatever reason, the dog does not have to wait for the interval and instead resets the unit immediately.]
3. Before resetting the unit, the dog stores the reset reason in the registry with a timestamp (the app also does that in cases where it legitimately resets the unit, e.g. language change). The diagnostic value of this is obvious.
4. If I told you, I'd have to kill you :-)
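
As an illustration of item 3, here is a minimal, portable sketch of building a timestamped reset-reason record. It is not the code running on the unit: the reason names, the record layout and format_reset_record are made up for the example, and on the device the resulting string would be written to the registry (for instance with RegSetValueEx) rather than just returned in a buffer.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical reason codes; the actual set is not listed in this post. */
typedef enum {
    RESET_KICK_TIMEOUT,   /* dog saw no kick within its timeout          */
    RESET_APP_EXITED,     /* wait on the CF app's process handle returned */
    RESET_APP_REQUESTED   /* app reset on purpose, e.g. language change   */
} reset_reason;

static const char *reason_text(reset_reason r)
{
    switch (r) {
    case RESET_KICK_TIMEOUT:  return "kick timeout";
    case RESET_APP_EXITED:    return "app exited";
    case RESET_APP_REQUESTED: return "app requested";
    }
    return "unknown";
}

/* Writes "<UTC timestamp> <reason>" into buf. Returns the number of
   characters written, or -1 if buf is too small or the time is invalid. */
int format_reset_record(char *buf, size_t size, reset_reason r, time_t when)
{
    char stamp[32];
    struct tm *utc;
    int n;

    assert(buf != NULL && size > 0);
    utc = gmtime(&when);
    if (utc == NULL)
        return -1;
    if (strftime(stamp, sizeof stamp, "%Y-%m-%dT%H:%M:%SZ", utc) == 0)
        return -1;
    n = snprintf(buf, size, "%s %s", stamp, reason_text(r));
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}
```

Keeping the reason and the timestamp in one value means a diagnostics page can read back a single string and display it as is.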

And as an aside, my CF app displays on a diagnostics page the amount of time it has been running, along with the reason it was last reset. We have a unit in the office that has been going for >100 days (although I am sure there are others in the field that have been going for longer ;-)