On a cluster such as the SCC, jobs that run for hours or days can occasionally be aborted midstream for a variety of reasons: power failure, exceeding the walltime limit, or a scheduled shutdown, to name a few. Rerunning such jobs from the beginning causes delay and wastes system resources. Checkpoint restarting is a common technique that researchers use to address this issue. It essentially means saving data to disk periodically so that, if need be, you can restart the job from the point at which your data was last saved. Checkpoint restarting can be handled either through a batch scheduler (if supported) or the old-fashioned way, by writing to disk yourself. On the SCC, the OGS batch scheduler's checkpointing feature is not supported. RCS has developed a simple MATLAB checkpointing tool that provides a user-friendly, systematic way to implement checkpointing in your code. Included here is a MATLAB example that demonstrates how to use this RCS checkpointing tool to rerun your job should a system abort occur.
Applicable Platforms
Linux, MS Windows, Mac.
Scope of the Checkpointing Tools
Two types of applications come to mind:
- To restart a job that was terminated for any of a variety of reasons: system crash, system kill, run time exceeding the requested wall clock limit, etc. The current version uses the loop iteration count as the criterion for checkpointing. This may be extended to accommodate other criteria, such as elapsed time.
- For jobs that take longer than the SCC batch maximum time limit, divide the job into smaller jobs so that each fits within the system’s maximum wall clock limit. Checkpointing bridges the jobs. The current version of the package supports this type of application.
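The elapsed-time criterion mentioned in the first item could be sketched as follows. This is only an illustration and not part of the RCS package: it uses plain save rather than chkin and chkpt, and the file name, the 0.5-second interval, and the counter nSaves are all assumptions made for the demo.

```matlab
% Sketch: checkpoint on elapsed wall-clock time instead of iteration count.
% matfile, saveInterval and nSaves are illustrative, not part of the RCS tools.
matfile      = fullfile(tempdir, 'elapsed_demo.mat');  % hypothetical checkpoint file
saveInterval = 0.5;      % seconds between saves (assumed value)
Niter        = 20;       % total number of iterations
A            = 0;        % running sum, as in the arithmetic-series example
nSaves       = 0;        % counts checkpoint events, for reporting only
lastSave     = tic;      % timer started at the most recent checkpoint

for iter = 1:Niter
    A = A + iter;        % stand-in for the real computation
    pause(0.1);          % stand-in for the time the real work takes
    if toc(lastSave) >= saveInterval
        save(matfile, 'iter', 'A');   % overwrite the previous checkpoint
        lastSave = tic;               % restart the timer
        nSaves = nSaves + 1;
    end
end
fprintf('Finished %d iterations with %d checkpoint saves\n', Niter, nSaves);
```

In a real job the interval would more likely be minutes than fractions of a second; the point is only that the save decision depends on toc rather than on mod(iter, frequency).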
Design Goals of Checkpointing Tools
The goals are modest: to make available a simple and user-friendly checkpointing package for a limited class of applications that have relatively straightforward code implementations. It takes 20 to 30 minutes to understand what the package does and to determine whether it is suitable for your application. In brief, you spell out which variables you want to save for restarting and then save them every so often.
Example
- The checkpointing tools (chkin.m, chkpt.m, README), as well as a usage example (test_checkpoint.m, test_restart.m), are available for download
- The example computes the sum of a simple arithmetic series, A = 1 + 2 + 3 + . . . + N, in a loop
- Elements of the arithmetic series map one-to-one to the iteration index. At each iteration, A is the partial sum up to that point. For N = 50, A = N*(N+1)/2 = 1275
- Each time an iteration is performed, the code confirms with “Completed iteration X”
- The checkpointing frequency is defined (see below) as every 7 iterations
- A one-second pause (in each iteration) is added to slow down computation to give you ample time to issue Ctrl-c to simulate system kill. Make sure at least one checkpoint event (7 iterations) has happened before issuing the kill action
- A mat-file, mydata.mat, is generated. If the job is killed at Iteration 17, data is saved first at Iteration 7, then overwritten at Iteration 14. The solutions for Iterations 15, 16, and 17 are not saved. The companion code, test_restart.m, will load mydata.mat, generated by test_checkpoint, and then start executing from Iteration 15 and complete at Iteration 50 (Niter)
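The numbers in this scenario can be checked with a few lines of arithmetic. The sketch below is not part of the package; the variable names killedAt, lastSaved, A_saved, and A_rest are introduced here purely for illustration.

```matlab
% Sketch: verify the kill-at-17 scenario by direct arithmetic.
Niter     = 50;    % total number of iterations
frequency = 7;     % checkpoint every 7 iterations
killedAt  = 17;    % iteration at which the job is killed (assumed scenario)

% Last iteration at which a checkpoint was written before the kill
lastSaved = frequency * floor(killedAt / frequency);    % 14

% Partial sum stored in the checkpoint: 1 + 2 + ... + lastSaved
A_saved = lastSaved * (lastSaved + 1) / 2;              % 105

% Work left for the restart: iterations lastSaved+1 through Niter
A_rest  = sum(lastSaved + 1 : Niter);                   % 1170

A_total = A_saved + A_rest;                             % 1275
fprintf('restart at %d; saved sum %d; final sum %d\n', ...
        lastSaved + 1, A_saved, A_total);
```

The restart therefore repeats Iterations 15 through 17, which were computed but never saved, and still arrives at the same final sum as an uninterrupted run.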
How to run test_checkpoint and test_restart
The demonstration example can be run in either interactive or batch mode
- Run interactively
    >> test_checkpoint
    Completed iteration 1
    Completed iteration 2
    . . . . . .
    Completed iteration 6
    Checkpointing frequency is every 7 iterations. Data updated at iteration 7
    Completed iteration 7
    . . . . . .
    Completed iteration 13
    Checkpointing frequency is every 7 iterations. Data updated at iteration 14
    Completed iteration 14
    Completed iteration 15
    Completed iteration 16
    Completed iteration 17
    ****** Simulate a system kill right here with Ctrl-c
    Operation terminated by user during test_checkpoint (line 65)
Then run the restart job. Because the last checkpoint was written at Iteration 14, execution resumes at Iteration 15
    >> test_restart
    Completed iteration 15
    . . . . . .
    Completed iteration 48
    Completed iteration 49
    Completed iteration 50
    iter = 50; A = 1275
- Run in batch mode
    scc1% qsub -b y 'matlab -singleCompThread -r "test_checkpoint, exit"'
    scc1% qstat -u kadin
    2168148 . . . kadin r 12/12/2013 09:11:42 budge@scc-ha1.scc.bu.edu 1
Before issuing qdel to simulate a system kill, wait until at least 7 seconds have passed since the job's start time (shown above). Each checkpoint cycle takes 7 seconds (1 second per iteration due to pause(1)), so killing the job earlier would leave no checkpoint to restart from.
    scc1% qdel 2168148
Then, run the restarting job
scc1% qsub -b y 'matlab -singleCompThread -r "test_restart, exit"'
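With the two-script workflow above, you have to decide which script to submit. One possible variation, sketched below, is a single script that checks whether a checkpoint file exists and resumes from it if so, so the same command can be submitted repeatedly. This is an assumption-laden illustration rather than part of the RCS package: it uses plain save and load instead of chkin and chkpt, omits the pause, and the file name resume_demo.mat is hypothetical.

```matlab
% Sketch: one script that either starts fresh or resumes from a checkpoint.
% The file name and the use of plain save/load are illustrative assumptions.
matfile   = fullfile(tempdir, 'resume_demo.mat');  % hypothetical checkpoint file
Niter     = 50;    % total number of iterations
frequency = 7;     % checkpoint every 7 iterations

if exist(matfile, 'file') == 2
    load(matfile, 'iter', 'A');    % restore loop index and partial sum
    iter1 = iter + 1;              % continue with the next iteration
else
    A     = 0;                     % fresh start
    iter1 = 1;
end

for iter = iter1:Niter
    A = A + iter;                  % running sum of the arithmetic series
    if mod(iter, frequency) == 0
        save(matfile, 'iter', 'A');   % overwrite the previous checkpoint
    end
end
fprintf('iter = %d; A = %d\n', iter, A);

% The run completed, so remove the checkpoint; otherwise the next
% submission would resume from it instead of starting over.
if exist(matfile, 'file') == 2
    delete(matfile);
end
```

Deleting the checkpoint on success is a design choice: it keeps a finished job from being mistaken for an interrupted one, at the cost of losing the last saved state.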
Checkpointing tools
- Use chkin.m to designate variables for checkpointing
    function s = chkin(s, field)
    % Purpose:
    %   Adds or creates one or more empty fields specified by cell array "field"
    % Example 1: >> s = chkin(s, {'a'});                  % check in 1 item
    % Example 2: >> s = chkin(s, {'a', 'A', 'myArray'});  % check in multiple items
    %
    % Date: December 7, 2013
    % Kadin Tseng, RCS, kadin@bu.edu
    for k=1:length(field)
        s.(field{k}) = [];
    end
    end % end of function
- Use chkpt.m to perform checkpointing
    % function chkpt(matfile, s)
    % Purpose:
    %   This script m-file is part of the checkpoint restart package for
    %   users to periodically save data to a file during a batch job run.
    %   In the event that the batch job got terminated, a user can rerun the job
    %   from the point of the last save instead of starting from the beginning.
    % Input:
    %   matfile -- data storage file name (e.g., mydata.mat)
    %   s is a struct array that stores the various data that the application
    %   code needs to save for restarting.
    % Note:
    %   For the objective of restarting, data saved overwrites previous saves
    %   to save disk storage as well as for convenience in restarting.
    %
    % December 7, 2013
    % Kadin Tseng, RCS, kadin@bu.edu
    %
    for k=1:nNames
        s.(chkNames{k}) = eval(chkNames{k}); % update all variables first
    end
    save(matfile, '-struct', 's'); % Only fields of s are saved, not s
- Checkpointing Example, test_checkpoint.m
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    % Usage:
    % >> test_checkpoint
    %
    % December 8, 2013
    % Kadin Tseng, RCS, Boston University (kadin@bu.edu)
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    a = 123;        % a constant
    n = 456;        % another constant
    Niter = 50;     % total number of iterations
    frequency = 7;  % how often to perform checkpointing (intermediate save)
    %=========================================================================
    % Before the start of the iteration loop, "check-in" each variable
    % that should be checkpointed in the event of restarting the job
    matfile = 'mydata.mat';      % mandatory; name of checkpoint mat-file
    s = struct();                % mandatory; create struct for checkpointing
    s = chkin(s,{'iter'});       % mandatory; iter is iteration loop index
    s = chkin(s,{'frequency'});  % mandatory; frequency is checkpointing period
                                 % i.e., how often to perform a save
    s = chkin(s,{'Niter'});      % mandatory; total number of iterations
    s = chkin(s,{'a' 'n' 'A'});  % OK to check in multiple items
    % continue until all variables are checked in. Note that you are only
    % checking in the variables, they don't need to have been already defined
    chkNames = fieldnames(s);    % the full list of variables to checkpoint
    nNames = length(chkNames);   % number of variables in list
    %=========================================================================
    A = 0;   % initialize arithmetic sum A
    for iter=1:Niter
        A = A + iter;   % computes the running sum
        pause(1);       % slow down to let you kill job with ^c
                        % to simulate system kill
        %=====================================================================
        % Checkpoints periodically (determined by the constant frequency)
        if mod(iter, frequency) == 0
            chkpt   % performs checkpointing (save) every *frequency* iterations
            fprintf(1, ['Checkpointing frequency is every %2d iterations. ' ...
                'Data updated at iteration %3d\n'], ...
                frequency, iter);   % Confirm after each checkpointing event
        end
        %=====================================================================
        fprintf(1, 'Completed iteration %d\n', iter);
    end   % ending iteration loop
    Sas = Niter*(Niter+1)/2;   % correct sum of arithmetic series
    fprintf(1,'iter = %d; A = %d; Correct answer = %d\n', iter, A, Sas);
- Restarting with test_restart.m
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    % Usage:
    % >> test_restart
    %
    % December 8, 2013
    % Kadin Tseng, RCS, Boston University (kadin@bu.edu)
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    %%% Any information defined in original code that has not
    %%% been checkpointed should be placed here.
    %===============================================================
    matfile = 'mydata';   % should match that in test_checkpoint.m
    load(matfile);        % retrieve data from matfile
    iter1 = iter+1;       % iter is the last time test_checkpoint issued
                          % a save; we start computing on the next step
    %===============================================================
    %%%%%A = 0;           % should not initialize in restart
    for iter=iter1:Niter  % iteration loop starts from when it was last saved
        A = A + iter;     % computes the sum up to iter
        fprintf(1, 'Completed iteration %d\n', iter);
    end   % ending iteration loop
    Sas = Niter*(Niter+1)/2;   % correct sum of arithmetic series
    fprintf(1,'iter = %d; A = %d; Correct answer = %d\n', iter, A, Sas);
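To see how the check-in, checkpoint, and restart steps fit together, the following self-contained sketch reproduces the same mechanics inline, without calling chkin.m or chkpt.m. It is an illustration under assumptions: the variable values mimic a checkpoint at Iteration 14, the temporary file name comes from tempname, and the names saved and r are introduced only for inspection.

```matlab
% Sketch: the check-in / checkpoint / restart mechanics, written inline.
% Values and file name are illustrative; chkin.m and chkpt.m are not called.
iter = 14;  A = 105;  Niter = 50;    % stand-ins for application variables

% Check-in step: create one empty field per variable name (what chkin does)
names = {'iter', 'A', 'Niter'};
s = struct();
for k = 1:numel(names)
    s.(names{k}) = [];
end
chkNames = fieldnames(s);            % list of variables to checkpoint

% Checkpoint step: copy current values into s, then save its fields
% (the same two operations chkpt performs)
for k = 1:numel(chkNames)
    s.(chkNames{k}) = eval(chkNames{k});
end
matfile = [tempname '.mat'];         % hypothetical temporary checkpoint file
save(matfile, '-struct', 's');       % fields become top-level variables

% Restart step: inspect what the file holds and read the values back
saved = who('-file', matfile);       % names of the variables in the file
r     = load(matfile);               % load into a struct to inspect values
delete(matfile);                     % clean up the temporary file

fprintf('Stored variables: %s\n', strjoin(sort(saved)', ', '));
fprintf('Restart would begin at iteration %d with A = %d\n', r.iter + 1, r.A);
```

Because save is called with '-struct', the file contains iter, A, and Niter as ordinary variables rather than a single struct named s, which is why test_restart.m can use a bare load(matfile) and then refer to iter and A directly.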