Questions
What is the difference between `tf.data` and `tf.keras.preprocessing.image_dataset_from_directory`?
The Scenario
You are an ML engineer at a self-driving car company. You are training an image classification model on a large dataset of images. Training is very slow, and you have noticed that the GPU is often idle, waiting for the CPU to load and preprocess the next batch of data.
You are currently using the `tf.keras.preprocessing.image_dataset_from_directory` utility to load the data. While this was easy to set up, you suspect that it is not performant enough for your use case.
The Challenge
Explain why the `image_dataset_from_directory` utility might not be performant enough for this task. How would you use the `tf.data` API to build a high-performance input pipeline that can keep the GPU saturated with data?
A junior engineer might not recognize that the data loading pipeline is the bottleneck. They might try to solve the problem by using a larger GPU or by reducing the complexity of the model, neither of which addresses the root cause.
A senior engineer would immediately suspect that the data loading pipeline is the bottleneck. They would be able to explain how to use the `tf.data` API to build a high-performance input pipeline with prefetching and caching. They would also be able to explain how to use the TensorFlow Profiler to diagnose performance issues.
Step 1: Diagnose the Bottleneck
The first step is to confirm that the input pipeline really is the bottleneck. We can use the TensorFlow Profiler (available through TensorBoard) to do this. The profiler shows a timeline of the operations executed on the CPU and the GPU; if the GPU is frequently idle while the CPU is busy loading data, the data loading pipeline is the bottleneck. A minimal way to capture a profile is sketched below.
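As a rough sketch (the log directory and the profiled step range are just example choices), the profiler can be driven either from the `TensorBoard` Keras callback or with the `tf.profiler.experimental` API:

```python
import tensorflow as tf

# Option 1: profile a range of training steps via the TensorBoard callback
# (here steps 10-20; 'logs/profile' is an arbitrary example log directory).
tb_callback = tf.keras.callbacks.TensorBoard(log_dir='logs/profile', profile_batch=(10, 20))
# model.fit(train_dataset, epochs=1, callbacks=[tb_callback])

# Option 2: profile an arbitrary region of code explicitly.
tf.profiler.experimental.start('logs/profile')
# ... run a few training steps here ...
tf.profiler.experimental.stop()
```

The resulting trace can then be opened in TensorBoard's Profile tab to check whether the GPU sits idle between steps while the input pipeline runs on the CPU.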
Step 2: Why `image_dataset_from_directory` Is Not Enough
The `image_dataset_from_directory` utility is a convenient way to create a `tf.data.Dataset` from a directory of images. Under the hood it already builds a `tf.data` pipeline, but it exposes very little of that pipeline, so there is limited room to tune how files are listed, decoded, and fed to the GPU (see the baseline sketch after the table below).
| Feature | `image_dataset_from_directory` | Hand-built `tf.data` pipeline |
|---|---|---|
| Performance | Can become the bottleneck on large datasets; exposes few options for tuning throughput. | Can be heavily optimized with parallel loading and decoding, caching, and prefetching. |
| Flexibility | Limited; expects one specific directory structure (one sub-directory per class). | Very flexible; can load data from a variety of sources and formats. |
| Control | High-level abstraction with limited control over how files are read and decoded. | Full control over every stage of the loading and preprocessing pipeline. |
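For reference, the current (baseline) loading code probably looks something like this minimal sketch; the dataset path and image size are placeholders:

```python
import tensorflow as tf

# Convenient, but most of the pipeline (file listing, decoding, parallelism) is hidden.
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    'path/to/my/dataset',    # one sub-directory per class
    image_size=(224, 224),   # example size
    batch_size=32,
)
```

Because this already returns a `tf.data.Dataset`, `.cache()` and `.prefetch()` can still be chained onto it, but the earlier stages stay out of reach, which is where a hand-built pipeline helps.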
Step 3: Build a High-Performance `tf.data` Pipeline
Here’s how we can use the `tf.data` API to build a high-performance input pipeline:
1. Create a dataset of file paths:

```python
import tensorflow as tf

# The '*/*' glob assumes one sub-directory per class containing the image files.
list_ds = tf.data.Dataset.list_files('path/to/my/dataset/*/*')
```

2. Load and pre-process the data:
We can use the `map` method to load and pre-process the images in parallel, letting `num_parallel_calls=tf.data.AUTOTUNE` pick the degree of parallelism automatically.
```python
def parse_image(filename):
    # ... load and decode the image, and derive the label from the file path ...
    return image, label

dataset = list_ds.map(parse_image, num_parallel_calls=tf.data.AUTOTUNE)
```

3. Cache the data:
If the dataset is small enough to fit in memory, we can use the `cache` method to keep it there after the first epoch, so the files do not have to be re-read and re-decoded on every epoch. (If it does not fit in memory, `cache` can instead be given a file path to cache to local disk.)

```python
dataset = dataset.cache()
```

4. Shuffle and batch the data:
```python
dataset = dataset.shuffle(buffer_size=1000).batch(batch_size=32)
```

5. Prefetch the data:
The `prefetch` method lets the CPU prepare the next batches while the GPU is busy with the current one, overlapping data preparation with model execution. This can significantly improve performance.

```python
dataset = dataset.prefetch(buffer_size=tf.data.AUTOTUNE)
```

By using these techniques, we can build a high-performance `tf.data` pipeline that keeps the GPU saturated with data and significantly reduces the training time.
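Putting the pieces together, a minimal self-contained sketch of the pipeline might look like the following. The dataset path, image size, and buffer sizes are placeholder assumptions, and the label handling (mapping directory names to integer class indices) is one possible way to fill in the `parse_image` stub above:

```python
import pathlib
import tensorflow as tf

DATA_DIR = pathlib.Path('path/to/my/dataset')   # placeholder path, one sub-directory per class
class_names = sorted(p.name for p in DATA_DIR.glob('*') if p.is_dir())

def parse_image(filename):
    # The label is the parent directory name, turned into an integer class index.
    parts = tf.strings.split(filename, '/')
    label = tf.argmax(tf.cast(parts[-2] == class_names, tf.int32))
    # Read, decode, and resize the image (the 224x224 size is an example choice).
    image = tf.io.decode_jpeg(tf.io.read_file(filename), channels=3)
    image = tf.image.resize(image, [224, 224])
    return image, label

dataset = (
    tf.data.Dataset.list_files(str(DATA_DIR / '*/*'))
    .map(parse_image, num_parallel_calls=tf.data.AUTOTUNE)  # parallel load + decode
    .cache()                                                # avoid re-decoding every epoch
    .shuffle(buffer_size=1000)
    .batch(batch_size=32)
    .prefetch(buffer_size=tf.data.AUTOTUNE)                 # overlap CPU work with GPU compute
)

# The dataset now yields (image, label) batches and can be passed directly to model.fit.
```

Compared with the original utility, every stage here (file listing, decoding, parallelism, caching, prefetching) is explicit, so each one can be tuned or profiled individually.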
Practice Question
You are training a model on a very large dataset that does not fit in memory. Which of the `tf.data` methods used in the pipeline above would you NOT use, and why?