ASSURE: Automatic Software Self-healing Using REscue points

Software failures in server applications are a significant threat to system availability. ASSURE is a system that introduces rescue points, which recover software from unknown faults while maintaining both system integrity and availability by mimicking system behavior under known error conditions. Rescue points are locations in existing application code that handle a given set of programmer-anticipated failures; they are automatically repurposed and tested to safely enable recovery from a larger class of unanticipated faults. When a fault occurs at an arbitrary location in the program, ASSURE restores execution to an appropriate rescue point and induces the program to recover by virtualizing its existing error-handling facilities. Rescue points are identified using fuzzing, implemented using a fast coordinated checkpoint-restart mechanism that handles multi-process and multi-threaded applications, and, after testing, injected into production code using binary patching. We have implemented an ASSURE prototype for Linux that operates without application source code and without changes to the base operating system kernel.
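
The recovery idea described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not ASSURE's implementation: ASSURE works on unmodified binaries using process checkpoint-restart and binary patching, whereas the code below stands in for that with an in-memory snapshot and a Python exception. All names here (Fault, parse_header, handle_request, with_rescue_point) are hypothetical and chosen for illustration.

```python
import copy


class Fault(Exception):
    """Stands in for an unanticipated fault, such as a crash deep in a call chain."""


def parse_header(state, request):
    # Unanticipated fault: the programmer never planned for an over-long request.
    if len(request) > 16:
        raise Fault("overrun in parse_header")
    state["parsed"].append(request)


def handle_request(state, request):
    """Candidate rescue point: it already returns -1 for an anticipated failure."""
    if not request:  # programmer-anticipated error condition
        return -1
    state["in_flight"] += 1  # side effect that must be undone if a fault occurs
    parse_header(state, request)
    state["in_flight"] -= 1
    return 0


def with_rescue_point(func, error_value):
    """Wrap func so that a Fault rolls state back and forces the known error return."""
    def wrapper(state, *args):
        snapshot = copy.deepcopy(state)  # toy checkpoint taken on entry
        try:
            return func(state, *args)
        except Fault:
            state.clear()
            state.update(snapshot)  # toy restart: discard partial side effects
            return error_value      # mimic behavior under a known error condition
    return wrapper


rescued_handle = with_rescue_point(handle_request, -1)

state = {"in_flight": 0, "parsed": []}
ok = rescued_handle(state, "GET /")     # normal path
bad = rescued_handle(state, "A" * 64)   # faults inside parse_header, then recovers
```

In this toy model the caller of rescued_handle sees the same -1 it already knows how to handle, and the partially updated in_flight counter is rolled back, which is the property the rescue point is meant to provide. A real system also has to roll back state outside the process's memory and to test that the forced error return is actually safe; the sketch ignores both.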

Figure: Example of a rescue point

 
Related Papers
ASSURE: Automatic Software Self-healing Using REscue points
Stelios Sidiroglou, Oren Laadan, Nico Viennot, Carlos-Rene Perez, Angelos D. Keromytis, and Jason Nieh
In ASPLOS 2009, Washington, DC, March 2009