4 Lines to Using TPUs on Google's Colab

Google Colab is a massive contribution to the democratization of machine learning. Not only are GPUs available for free (1x K80 at the time of writing), you can also use Google's TPUs (Tensor Processing Units) for free. While there are some limitations, fairly large, non-trivial models can be trained so long as you have access to the internet and a relatively modern browser. What's more, it should not take more than a few minutes of your time to try it out.

Select TPU runtime

In the Colab menu, choose "Runtime -> Change runtime type". In the window that appears, under Hardware Accelerator, select TPU.

4 lines to TPU

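# Requires TensorFlow 1.x (tf.contrib was removed in 2.0).
import os
import tensorflow as tf
from tensorflow.contrib.tpu.python.tpu import keras_support

# Colab sets COLAB_TPU_ADDR when a TPU runtime is selected.
tpu_grpc_url = "grpc://" + os.environ["COLAB_TPU_ADDR"]
tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(tpu_grpc_url)
strategy = keras_support.TPUDistributionStrategy(tpu_cluster_resolver)
# `model` is your existing Keras model.
model = tf.contrib.tpu.keras_to_tpu_model(model, strategy=strategy)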

That's it. (It could actually be a single line, if you don't mind long lines.)

If you already have a working Keras model, this is all you need to get it running on a TPU in Colab. Train it as usual with model.fit_generator(...).
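
The snippet above raises a KeyError if the notebook is not attached to a TPU runtime, because COLAB_TPU_ADDR is then missing. A minimal sketch of a friendlier check follows; the helper name colab_tpu_grpc_url is made up for illustration and is not part of Colab or TensorFlow.

```python
import os


def colab_tpu_grpc_url(environ=os.environ):
    """Return the TPU gRPC URL, or None if no TPU runtime is attached.

    Hypothetical helper: it only reads the COLAB_TPU_ADDR variable used
    in the snippet above and does not contact the TPU.
    """
    addr = environ.get("COLAB_TPU_ADDR")
    if not addr:
        return None
    return "grpc://" + addr


url = colab_tpu_grpc_url()
if url is None:
    print("No TPU found. Check Runtime -> Change runtime type.")
else:
    print("TPU available at", url)
```

If this prints a URL, the same string can be passed to TPUClusterResolver in place of tpu_grpc_url.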


Extra Stuff/Notes

For completeness...

Getting your code and data onto Colab

This is probably the hardest thing to do. Colab runtimes are given about 50GB of temporary storage (approximately 30GB usable). If your code and data are on GitHub or somewhere else publicly accessible, command line tools are available from within the notebook.

They can be downloaded easily like this:

!git clone <your-code.git>
!wget http://your.data.server/dataset.tar.gz

Or you can click on the '>' on the left to open a side panel where you can upload files.
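
Since the usable space is limited, it can be worth checking how much is actually free before downloading a large dataset. A small sketch using only the Python standard library; the function name free_space_gb is an illustrative choice, and the number it reports depends on the runtime you are given.

```python
import shutil


def free_space_gb(path="."):
    """Return the free disk space, in gigabytes, on the disk holding `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.free / 1e9


print("Free space: %.1f GB" % free_space_gb())
```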

Notes and common problems

  • Copying back to the CPU takes a while, so reducing the number of checkpoints speeds up training significantly.
  • A learning rate scheduler is required, even if it just returns a constant rate.
  • The initial compilation of the TPU model might take quite a while, especially for very large models.
  • Error messages might be a little cryptic. I would definitely get a model running properly locally before running it on a TPU.
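
For the learning rate scheduler note above, a constant schedule can be as small as the sketch below. The factory name make_constant_schedule and the rate 0.001 are assumptions chosen for illustration, not values from the original notebook.

```python
def make_constant_schedule(rate):
    """Build a schedule function that always returns the same learning rate."""

    def schedule(epoch, lr=None):
        # Keras may call this with just the epoch, or with the epoch and
        # the current rate; both are ignored here.
        return float(rate)

    return schedule


constant_lr = make_constant_schedule(0.001)  # arbitrary example rate
print(constant_lr(0), constant_lr(10))
```

The resulting function would then be wrapped as tf.keras.callbacks.LearningRateScheduler(constant_lr) and passed in the callbacks list of model.fit_generator(...).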

Runnable Notebook

https://colab.research.google.com/drive/12aezd43epJ-lmQdpvowIoqXfoXvFiQMX