
Logging with TensorBoard

City University of Hong Kong
from util import *

Logging the results

To call additional functions during training, we can add them to the callbacks parameter of the model's fit method. For instance:

from tqdm.keras import TqdmCallback

if input("Train? [Y/n]").lower() != "n":
    model.fit(
        ds_b["train"],
        epochs=6,
        validation_data=ds_b["test"],
        verbose=0,
        callbacks=[TqdmCallback(verbose=2)],
    )

The above code uses TqdmCallback() to create a callback that displays a graphical progress bar:

  • Setting verbose=0 for the method fit disables the default text-based progress bar.
  • Setting verbose=2 for the class TqdmCallback shows and keeps the progress bars for training each batch. Try changing verbose to other values to see different effects, as in the sketch after this list.
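
For instance, here is a minimal sketch (reusing the same model and ds_b as above) with verbose=1, which shows a single, transient batch-level bar for the current epoch instead of keeping one bar per epoch:

if input("Train? [Y/n]").lower() != "n":
    model.fit(
        ds_b["train"],
        epochs=6,
        validation_data=ds_b["test"],
        verbose=0,  # disable the default Keras progress bar
        callbacks=[TqdmCallback(verbose=1)],  # epoch bar + transient batch bar
    )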

An important use of callback functions is to save the models and results during training for further analysis. We define the following function train_model for this purpose:

  • Take a look at the docstring to learn its basic usage, and then
  • learn the implementations in the source code.
import datetime
import os

import pytz


def train_model(
    model,
    fit_params={},
    log_root=".",
    save_log_params=None,
    save_model_params=None
):
    """Train and test the model, and return the log directory path name.

    Parameters
    ----------
    model: tf.keras.Model
        Compiled model to train.
    fit_params: dict
        Dictionary of parameters to pass to model.fit.
    log_root: str
        Root directory under which the log directory is created.
    save_log_params: dict
        Dictionary of parameters to pass to
        tf.keras.callbacks.TensorBoard to save the results for TensorBoard.
        The default value None means no logging of the results.
    save_model_params: dict
        Dictionary of parameters to pass to
        tf.keras.callbacks.ModelCheckpoint to save the model to checkpoint
        files.
        The default value None means no saving of the models.

    Returns
    -------
    str: log directory path that points to a subfolder of log_root named
        using the current time.
    """
    # use a subfolder named by the current time to distinguish repeated runs
    log_dir = os.path.join(
        log_root,
        datetime.datetime.now(tz=pytz.timezone("Asia/Hong_Kong")).strftime(
            "%Y%m%d-%H%M%S"
        ),
    )

    # extract any user-supplied callbacks so they are not passed to fit twice;
    # copy the list to avoid modifying the caller's list
    callbacks = fit_params.pop("callbacks", []).copy()

    if save_log_params is not None:
        # add callback to save the training log for further analysis by tensorboard
        callbacks.append(tf.keras.callbacks.TensorBoard(log_dir, **save_log_params))

    if save_model_params is not None:
        # save the model as checkpoint files after each training epoch
        callbacks.append(
            tf.keras.callbacks.ModelCheckpoint(
                os.path.join(log_dir, "{epoch}.ckpt"), **save_model_params
            )
        )

    # training + testing (validation)
    model.fit(
        ds_b["train"], validation_data=ds_b["test"], callbacks=callbacks, **fit_params
    )

    return log_dir

For example:

fit_params = {"epochs": 6, "callbacks": [TqdmCallback()], "verbose": 0}
log_root = os.path.join(user_home, "log")  # log folder
save_log_params = {"update_freq": 100, "histogram_freq": 1}
save_model_params = {"save_weights_only": True, "verbose": 1}

if input("Train? [Y/n]").lower() != "n":
    model = compile_model(create_simple_model())
    log_dir = train_model(
        model,
        fit_params=fit_params,
        log_root=log_root,
        save_log_params=save_log_params,
        save_model_params=save_model_params
    )

By passing save_model_params to the tf.keras.callbacks.ModelCheckpoint callback, the model is saved to log_dir at the end of each epoch.

!ls {log_dir}
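
If storage is a concern, one possible variation (a hedged sketch, not used in the rest of this notebook) is to have ModelCheckpoint save only when the monitored metric improves, instead of writing a checkpoint every epoch:

save_model_params = {
    "save_weights_only": True,
    "save_best_only": True,  # save a checkpoint only when val_loss improves
    "monitor": "val_loss",
    "verbose": 1,
}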

Saving the model is useful because it often takes a long time to train a neural network. To reload the model from the latest checkpoint and continue to train it:

if input("Continue to train? [Y/n]").lower() != "n":
    # load the weights of the previously trained model
    restored_model = compile_model(create_simple_model())
    restored_model.load_weights(tf.train.latest_checkpoint(log_dir))
    # continue to train
    train_model(restored_model, log_root=log_root, save_log_params=save_log_params)
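
Since the checkpoints are named by epoch ("{epoch}.ckpt"), you can also restore the weights from a particular epoch instead of the latest one. A hypothetical example, assuming the checkpoint for epoch 3 was saved:

if input("Restore epoch 3? [Y/n]").lower() != "n":
    restored_model = compile_model(create_simple_model())
    # load the checkpoint prefix written by ModelCheckpoint for epoch 3
    restored_model.load_weights(os.path.join(log_dir, "3.ckpt"))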

By providing tf.keras.callbacks.TensorBoard as a callback function to the fit method earlier, the training logs can be analyzed using TensorBoard.

if input('Execute? [Y/n]').lower() != 'n':
    %load_ext tensorboard
    %tensorboard --logdir {log_root}
import tensorboard as tb

tb.notebook.list()

The SCALARS tab shows the curves of training and validation losses/accuracies after different batches/epochs. The curves often jitter because gradient descent is stochastic (random). To see the typical performance, a smoothing factor $\theta\in [0,1]$ can be applied on the left panel. The smoothed curve $\bar{l}(t)$ of the original curve $l(t)$ is defined as

\begin{align}
\bar{l}(t) = \theta\, \bar{l}(t-1) + (1-\theta)\, l(t)
\end{align}

which is called the moving average. Try changing the smoothing factor on the left panel to see the effect.
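
As an illustration (a sketch, not part of the notebook's training pipeline), the moving average defined above can be computed directly in NumPy for a synthetic loss curve:

import numpy as np


def smooth(l, theta):
    """Return the moving average of the sequence l with smoothing factor theta."""
    l_bar = np.empty(len(l))
    l_bar[0] = l[0]  # initialize the average with the first value
    for t in range(1, len(l)):
        l_bar[t] = theta * l_bar[t - 1] + (1 - theta) * l[t]
    return l_bar


# synthetic decreasing loss with noise, mimicking a jittery training curve
loss = 1 / np.arange(1, 101) + 0.05 * np.random.randn(100)
print(loss[-5:])
print(smooth(loss, theta=0.9)[-5:])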

Solution to Exercise 1

A large smoothing factor leads to a large bias when using the smoothed empirical loss or performance to estimate the actual loss or performance. The estimate is likely overly pessimistic since a large weight is placed on the loss or performance of the model at earlier epochs, when it had been trained for fewer epochs.

We can also visualize the input images in TensorBoard:

  • Run the following cell to write the images to the log directory.
  • Click the refresh button on the top of the previous TensorBoard panel.
  • Click the IMAGES tab to show the images.
if input("Execute? [Y/n]").lower() != "n":
    file_writer = tf.summary.create_file_writer(log_dir)

    with file_writer.as_default():
        # Don't forget to reshape.
        images = np.reshape(
            [image for (image, label) in ds["train"].take(25)], (-1, 28, 28, 1)
        )
        tf.summary.image("25 training data examples", images, max_outputs=25, step=0)
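
tf.summary is not limited to images. As a hedged sketch (the scalar name and values here are made up for illustration), custom scalars can be written to the same log directory and will appear under the SCALARS tab after a refresh:

if input("Execute? [Y/n]").lower() != "n":
    file_writer = tf.summary.create_file_writer(log_dir)

    with file_writer.as_default():
        # hypothetical example: log an exponentially decaying learning rate
        for step in range(100):
            tf.summary.scalar("custom/learning_rate", 0.001 * 0.95**step, step=step)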

In addition to presenting the results, TensorBoard is useful for debugging deep learning.

TensorBoard can also simultaneously show the logs of different runs stored in different subfolders of the log directory.

You can select different runs on the left panel to compare their performance.

Note that loading the logs into TensorBoard may consume a lot of memory. You can list the running TensorBoard notebook instances and kill those you no longer need by running !kill {pid}.

while (pid := input('pid to kill? (press enter to exit)')):
    !kill {pid}

Enhancements

def create_dropout_model():
    tf.keras.backend.clear_session()
    model = tf.keras.models.Sequential(
        [
            tf.keras.layers.Input(shape=(28, 28, 1)),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(128, activation=tf.keras.activations.relu),
            tf.keras.layers.Dropout(0.2),  # dropout
            tf.keras.layers.Dense(10, activation=tf.keras.activations.softmax),
        ],
        name="Dropout",
    )
    return model


model = compile_model(create_dropout_model())
model.summary()

if input("Train? [Y/n]").lower() != "n":
    ### BEGIN SOLUTION
    fit_params = {"epochs": 6, "callbacks": [TqdmCallback()], "verbose": 0}
    save_log_params = {}
    save_model_params = None
    log_dir = train_model(
        model,
        fit_params=fit_params,
        log_root=log_root,
        save_log_params=save_log_params,
        save_model_params=save_model_params
    )
    ### END SOLUTION
def create_cnn_model():
    tf.keras.backend.clear_session()
    model = tf.keras.models.Sequential(
        [
            tf.keras.layers.Input(shape=(28, 28, 1)),
            tf.keras.layers.Conv2D(32, 3, activation="relu"),
            tf.keras.layers.MaxPooling2D(),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(64, activation="relu"),
            tf.keras.layers.Dense(10, activation="softmax"),
        ],
        name="CNN",
    )
    return model


model = compile_model(create_cnn_model())
model.summary()

if input("Train? [Y/n]").lower() != "n":
    ### BEGIN SOLUTION
    fit_params = {"epochs": 6, "callbacks": [TqdmCallback()], "verbose": 0}
    save_log_params = {}
    save_model_params = None
    log_dir = train_model(
        model,
        fit_params=fit_params,
        log_root=log_root,
        save_log_params=save_log_params,
        save_model_params=save_model_params
    )
    ### END SOLUTION

Cleanup

If you run out of storage, you should remove some of the log files:

if input('Remove all logs? [Y/n]').lower() != 'n':
    !rm -rf {log_root}