A training helper that checkpoints models and computes summaries. Please use tf.compat.v1.train.MonitoredTrainingSession instead.

    tf.compat.v1.train.Supervisor(
      graph=None, ready_op=USE_DEFAULT, ready_for_local_init_op=USE_DEFAULT,
      is_chief=True, init_op=USE_DEFAULT, init_feed_dict=None,
      local_init_op=USE_DEFAULT, logdir=None, summary_op=USE_DEFAULT,
      saver=USE_DEFAULT, global_step=USE_DEFAULT, save_summaries_secs=120,
      save_model_secs=600, recovery_wait_secs=30, stop_grace_secs=120,
      checkpoint_basename='model.ckpt', session_manager=None,
      summary_writer=USE_DEFAULT, init_fn=None, local_init_run_options=None
    )

The Supervisor is a small wrapper around a Coordinator, a Saver, and a SessionManager that takes care of common needs of TensorFlow training programs.

Use for a single program:

    with tf.Graph().as_default():
      # ...add operations to the graph, including a train_op...
      # Create a Supervisor that will checkpoint the model in '/tmp/mydir'.
      sv = Supervisor(logdir='/tmp/mydir')
      # Get a TensorFlow session managed by the supervisor.
      with sv.managed_session(FLAGS.master) as sess:
        # Use the session to train the graph.
        while not sv.should_stop():
          sess.run(train_op)

Within the with sv.managed_session() block all variables in the graph have been initialized. In addition, a few services have been started to checkpoint the model and add summaries to the event log.

If the program crashes and is restarted, the managed session automatically reinitializes variables from the most recent checkpoint.

The supervisor is notified of any exception raised by one of the services. After an exception is raised, should_stop() returns True. In that case the training loop should also stop. This is why the training loop has to check for sv.should_stop().

Exceptions that indicate that the training inputs have been exhausted, such as tf.errors.OutOfRangeError, also cause sv.should_stop() to return True but are not re-raised from the with block: they indicate a normal termination.

Use for multiple replicas:

To train with replicas you deploy the same program in a Cluster. One of the tasks must be identified as the chief: the task that handles initialization, checkpoints, summaries, and recovery. The other tasks depend on the chief for these services. The only change you have to make to the single-program code is to indicate whether the program is running as the chief.

    # Choose one task as the chief. This could be based on
    # server_def.task_index, for example.
    is_chief = (server_def.task_index == 0)
    server = tf.distribute.Server(server_def)

    with tf.Graph().as_default():
      # ...build the graph...
      # Create a Supervisor that uses a log directory on a shared file system.
      sv = Supervisor(logdir='/shared_directory/...', is_chief=is_chief)
      # Get a Session in a TensorFlow server on the cluster.
      with sv.managed_session(server.target) as sess:
        while not sv.should_stop():
          sess.run(train_op)

In the chief task, the Supervisor works exactly as in the first example above. In the other tasks sv.managed_session() waits for the model to have been initialized before returning a session to the training code; the non-chief tasks depend on the chief task for initializing the model.

If one of the tasks crashes and restarts, managed_session() checks whether the model is already initialized. If it is, it simply creates a session and returns it to the training code, which proceeds normally.
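Since the Supervisor is deprecated in favor of tf.compat.v1.train.MonitoredTrainingSession, the single-program pattern above maps onto it fairly directly. The following is only a minimal sketch, not code from the original text: the toy loss, the 100-step stopping hook, and the '/tmp/mydir' checkpoint directory are assumptions chosen to make the example self-contained. MonitoredTrainingSession initializes or restores variables and saves checkpoints and summaries in the background, much like managed_session() does.

    import tensorflow.compat.v1 as tf
    tf.disable_v2_behavior()

    with tf.Graph().as_default():
      # Toy model, assumed only for illustration: drive a single variable to zero.
      global_step = tf.train.get_or_create_global_step()
      x = tf.Variable(3.0, name='x')
      loss = tf.square(x)
      train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
          loss, global_step=global_step)

      # checkpoint_dir plays the role of the Supervisor's logdir: variables are
      # initialized or restored from the latest checkpoint found there, and new
      # checkpoints and summaries are written by background hooks.
      with tf.train.MonitoredTrainingSession(
          checkpoint_dir='/tmp/mydir',
          is_chief=True,
          hooks=[tf.train.StopAtStepHook(last_step=100)]) as sess:
        while not sess.should_stop():
          sess.run(train_op)

Restarting the program with the same checkpoint_dir resumes training from the most recent checkpoint, which mirrors the crash-recovery behavior described for managed_session() above.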