It’s the 4th of July. Exactly sixteen years ago today the Mars Pathfinder landed to a media fanfare and began to transmit data back to Earth. Days later, the flow of information and images was interrupted by a series of total system resets. How this problem was diagnosed and resolved still makes for a fascinating tale for software engineers.[1]
Diagnosing the issue
The Pathfinder's applications were scheduled by the VxWorks RTOS. Since VxWorks provides pre-emptive priority scheduling of threads, tasks were executed as threads with priorities determined by their relative urgency.
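The meteorological data gathering task ran as an infrequent, low priority thread and used the information bus, access to which was synchronized with mutual exclusion locks (mutexes). Other higher priority threads took precedence when necessary, including a very high priority bus management task, which also accessed the bus with mutexes. Most of the time this worked well. Occasionally, however, the bus management task had to wait for a mutex that the meteorological task was still holding. If a long-running communications task, with higher priority than the meteorological task but lower than the bus management task, became ready during that window, it pre-empted the meteorological task. The meteorological task could then not release the mutex, so the bus management task could not run either.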
Soon, a watchdog timer noticed that the bus management task had not been executed for some time, concluded that something had gone wrong, and ordered a total system reset. (Engineers later confessed that system resets had occurred during pre-flight tests. They put these down to a hardware glitch and returned to focusing on the mission-critical landing software.)
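The watchdog idea can be sketched in a few lines of Python. This is a minimal illustration only, not the Pathfinder's flight code: the function name `first_reset_tick`, the tick-based clock and the timeout value are all assumptions made for the example.

```python
def first_reset_tick(kick_ticks, timeout, horizon):
    """Return the first tick at which a watchdog would order a reset, or None.

    kick_ticks: ticks at which the monitored task ran and "kicked" the watchdog
    timeout:    ticks the watchdog tolerates without a kick (hypothetical value)
    horizon:    number of ticks to simulate
    """
    kicks = set(kick_ticks)
    last_kick = 0
    for tick in range(horizon):
        if tick in kicks:
            last_kick = tick          # the task ran, so restart the countdown
        elif tick - last_kick > timeout:
            return tick               # too long without a kick: order a reset
    return None


# A task that runs on ticks 0-3 and then stalls trips the watchdog at tick 8.
print(first_reset_tick(range(4), timeout=4, horizon=20))   # 8
# A task that runs on every tick never trips it.
print(first_reset_tick(range(20), timeout=4, horizon=20))  # None
```

Note that the watchdog only knows that the task stopped running, not why, which is why the resets alone did not point to the cause.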
Finding a solution
Engineers worked frantically on a lab replica to diagnose and fix the problem, eventually spotting a priority inversion. A priority inversion occurs when a high priority task is indirectly pre-empted by a medium priority task, "inverting" the relative priorities of the two tasks (see Figure 1). This is a clear violation of the priority model, which says that a high priority task can be prevented from running only by higher priority tasks and, briefly, by low priority tasks that will quickly finish using a resource they share with it.
To fix the problem, they turned on a boolean parameter that indicates whether priority inheritance should be performed by the mutex. The mutex in question had been initialized with the parameter off; had it been on, the priority inversion would have been prevented.
Under priority inheritance, the task that holds the mutex (a mutual exclusion semaphore in VxWorks) inherits the priority of a higher priority task when that task requests the semaphore. In Figure 1, task “low” would inherit the priority of task “high” when “high” requested the semaphore. This allows “low” to pre-empt “medium”, finish its work with the semaphore and release it to “high”.
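To see the effect, here is a small tick-by-tick simulation of the three tasks on a single CPU, run once without and once with priority inheritance. It is a sketch under stated assumptions, not a model of VxWorks: the `Task` class, the `simulate` function, the release times and the work amounts are all invented for illustration, and a task that wants the mutex is treated as blocked the instant it becomes ready.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    priority: int     # larger number = more urgent
    release: int      # tick at which the task becomes ready
    work: int         # ticks of CPU time the task needs
    uses_mutex: bool  # True if the task needs the shared mutex while it works

    def __post_init__(self):
        self.remaining = self.work


def simulate(inherit, horizon=40):
    """Run three tasks on one CPU; return {task name: tick at which it finished}."""
    tasks = [
        Task("low", 1, release=0, work=3, uses_mutex=True),
        Task("high", 3, release=1, work=2, uses_mutex=True),
        Task("medium", 2, release=2, work=10, uses_mutex=False),
    ]
    owner = None  # task currently holding the mutex, if any
    finished = {}
    for tick in range(horizon):
        ready = [t for t in tasks if t.release <= tick and t.remaining > 0]
        if not ready:
            if len(finished) == len(tasks):
                break
            continue
        # Tasks that want the mutex while another task holds it cannot run.
        blocked = {t.name for t in ready
                   if t.uses_mutex and owner is not None and t is not owner}
        effective = {t.name: t.priority for t in ready}
        if inherit and blocked:
            # Priority inheritance: the owner borrows the highest priority
            # among the tasks waiting for its mutex.
            effective[owner.name] = max(effective[owner.name],
                                        max(effective[n] for n in blocked))
        runnable = [t for t in ready if t.name not in blocked]
        current = max(runnable, key=lambda t: effective[t.name])
        if current.uses_mutex:
            owner = current           # take (or keep) the mutex
        current.remaining -= 1
        if current.remaining == 0:
            finished[current.name] = tick + 1
            if owner is current:
                owner = None          # release the mutex on completion
    return finished


print(simulate(inherit=False))  # {'medium': 12, 'low': 13, 'high': 15}
print(simulate(inherit=True))   # {'low': 3, 'high': 5, 'medium': 15}
```

Without inheritance, “high” needs only two ticks of CPU time but finishes at tick 15, after “medium” has run to completion. With inheritance, “low” borrows the priority of “high”, releases the mutex at tick 3, and “high” finishes at tick 5.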
The initialization parameters for the mutex which caused the problem (and those for two others which could have caused the same problem) were stored in global variables, whose addresses were in symbol tables also included in the launch software. Because VxWorks contains a C language interpreter intended to allow developers to type in C expressions and functions to be executed during system debugging, it was possible to upload a short C program to the spacecraft which, when interpreted, changed the values of these variables from FALSE to TRUE. This put an end to the system resets.
What did engineers learn?
- only detailed traces of actual system behavior enabled the faulty execution sequence to be captured and identified – a black box diagnosis without traces would have been impossible;
- the presence of "debugging" facilities in the system was extremely important – the problem could not have been corrected without the ability to modify the system;
- spending extra time to ensure priority inheritance correctness at the testing stage, even at some additional performance cost, would have been invaluable.
The origins of the solution
When the keynote speaker referred to the paper which first identified the priority inversion problem and proposed the solution, something extraordinary happened: the authors were all in the room, and they received a rapturous reception. The original paper was:
L. Sha, R. Rajkumar, and J. P. Lehoczky. Priority Inheritance Protocols: An Approach to Real-Time Synchronization. IEEE Transactions on Computers, vol. 39, no. 9, pp. 1175-1185, Sep. 1990.
[1] This summary is based on a note written by Mike Jones in December 1997 following an IEEE Real-Time Systems Symposium keynote address by David Wilner, Chief Technical Officer of Wind River Systems.