Putting AI to the Test: Generative Adversarial Networks vs. Diffusion Models


Here at Aurora Solar, we take roof modeling very seriously. In fact, we even have a dedicated team of highly trained designers who support our customers with on-demand roof models via our Expert Design Services offering.

As we’ve grown over the years, we’ve been able to gather insights from over half a million roof images modeled by our Expert Design team, including segmentation labels for each roof face, tree, and obstruction (things like vent pipes, skylights, and so on).

For each roof, we have labels that look like the following:

Using this comprehensive database of highly accurate training data, our team has developed and deployed our own artificial intelligence (AI) and computer vision models. 

Today, our customers leverage these powerful AI capabilities through our Aurora AI and Lead Capture AI products. For any site entered into the system, our roof modeling pipeline takes roof images and LIDAR data as input and automatically outputs 3D roof models for our customers.

Just for fun — and to help improve our understanding of generative models more generally — we can also investigate the inverse problem: Can we take the labels as inputs and generate a realistic-looking roof image?

More generally, this task is known as image generation. Historically, it has been tackled with generative adversarial networks, also known as GANs. Recently, a new class of models, called diffusion models, has emerged as an option for image generation, and these models can match, and sometimes exceed, the performance of GANs.

In this blog post, we’ll compare using a diffusion model against a GAN for the task of conditional roof image generation.

Problem statement

The task we’re investigating is conditional generation: Given roof face, obstruction, and tree labels, generate a realistic image of the roof.

Given roof faces, obstructions, and trees, the model should generate a realistic image of the roof.
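For readers curious about what the conditioning input might look like in practice, here's a minimal sketch of one common approach: rasterizing the labels into a multi-channel mask image that the generator takes as input. This is an illustrative encoding, not necessarily the exact format our pipeline uses.

```python
import numpy as np

def labels_to_condition(roof_face_masks, obstruction_masks, tree_masks):
    """Rasterize per-object binary masks into a 3-channel conditioning image.

    Each argument is assumed to be a list of H x W binary arrays, one per
    labeled object. This is an illustrative encoding, not Aurora's actual
    input format.
    """
    # All masks are assumed to share the same height and width.
    height, width = roof_face_masks[0].shape
    condition = np.zeros((3, height, width), dtype=np.float32)
    for channel, masks in enumerate([roof_face_masks, obstruction_masks, tree_masks]):
        for mask in masks:
            condition[channel] = np.maximum(condition[channel], mask.astype(np.float32))
    return condition  # shape (3, H, W), values in {0, 1}
```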

How a GAN is trained

A generative adversarial network works by training a pair of networks. One is the generator, and the other is the discriminator. The generator is the network that generates the fake images. The discriminator, on the other hand, is a classification network that looks at an image and predicts whether it is real or fake. Both networks are trained simultaneously, with the discriminator trying to classify real and fake images correctly, and the generator trying to fool the discriminator into classifying its fake images as real.

The generator generates fake images, while the discriminator tries to distinguish real images from fake ones.

As the discriminator gets better at telling real and fake images apart, the images the generator produces improve as well. (Note: the number of training epochs is the number of times a model works through the entire dataset; in each epoch, the model sees every example in the dataset once.)

Images generated by the generator over time.
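To make the adversarial setup a bit more concrete, here's a minimal PyTorch-style training step for a GAN. The `generator`, `discriminator`, and optimizers are placeholders, and we've left out the conditioning on the label image, so treat this as an illustrative sketch rather than the exact network we trained.

```python
import torch
import torch.nn.functional as F

def gan_training_step(generator, discriminator, g_opt, d_opt, real_images, latent_dim=128):
    """One adversarial training step: update the discriminator, then the generator."""
    batch_size = real_images.size(0)
    device = real_images.device

    # Discriminator step: classify real images as 1 and fake images as 0.
    z = torch.randn(batch_size, latent_dim, device=device)
    fake_images = generator(z).detach()  # detach so only the discriminator is updated here
    d_real = discriminator(real_images)
    d_fake = discriminator(fake_images)
    d_loss = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) \
           + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator step: try to fool the discriminator into predicting "real" for fakes.
    z = torch.randn(batch_size, latent_dim, device=device)
    d_on_fake = discriminator(generator(z))
    g_loss = F.binary_cross_entropy_with_logits(d_on_fake, torch.ones_like(d_on_fake))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()

    return d_loss.item(), g_loss.item()
```

In a conditional setup like ours, the label masks would typically also be fed to both networks, but the structure of the two losses stays the same.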

How a diffusion model is trained

In contrast, to train a diffusion model, we take an image and gradually add noise to it over many steps until it is indistinguishable from an image of pure noise:

Gradually adding noise to an image until it is just pure noise. (From https://developer.nvidia.com/blog/improving-diffusion-models-as-an-alternative-to-gans-part-1/)

Then, the network is trained to predict a slightly denoised image given the noised image at each step.

Diffusion Model Cat is unimpressed.

By repeatedly applying the diffusion network to its own predictions, we can go from an image of pure noise to a realistic noiseless image.
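In code, the training objective and the sampling loop look roughly like the following DDPM-style sketch. The `model`, the noise schedule (`alphas` and `alphas_cumprod`), and the omitted conditioning are all simplifications, so treat this as an illustration of the idea rather than our actual implementation.

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(model, optimizer, images, alphas_cumprod, num_steps=1000):
    """One DDPM-style training step: noise the image, then predict the added noise."""
    batch = images.size(0)
    t = torch.randint(0, num_steps, (batch,), device=images.device)
    noise = torch.randn_like(images)
    a_bar = alphas_cumprod.to(images.device)[t].view(batch, 1, 1, 1)
    noisy = a_bar.sqrt() * images + (1 - a_bar).sqrt() * noise  # forward (noising) process
    loss = F.mse_loss(model(noisy, t), noise)  # the model learns to predict the added noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def sample(model, shape, alphas, alphas_cumprod, num_steps=1000):
    """Start from pure noise and apply the model step by step to denoise."""
    x = torch.randn(shape)
    for t in reversed(range(num_steps)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        pred_noise = model(x, t_batch)
        a, a_bar = alphas[t], alphas_cumprod[t]
        # Estimate the slightly less noisy image at the previous step.
        x = (x - (1 - a) / (1 - a_bar).sqrt() * pred_noise) / a.sqrt()
        if t > 0:
            x = x + (1 - a).sqrt() * torch.randn_like(x)  # re-add a little noise except at the last step
    return x
```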

Results

Here is a comparison of outputs from each method:

As you can see, the diffusion model's outputs are noticeably more realistic. The GAN outputs have visible artifacts, and neighboring buildings are not modeled as convincingly.

Fréchet inception distance

The Fréchet inception distance (FID) is a common metric used to evaluate the realism of generated images. The lower the score, the more realistic the images. After about seven days of training*, the GAN achieved an FID of 40.2, while the diffusion model achieved 31.3, meaning that its images match the distribution of real images better. This corroborates what we see in the images above.

(* Each model was trained for seven days on an NVIDIA A100 GPU.)
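For reference, FID compares the mean and covariance of Inception-network features computed on real and generated images. Given two precomputed feature matrices (the feature extraction step is omitted here), the score can be computed with a few lines of NumPy/SciPy:

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(real_features, fake_features):
    """Compute FID between two sets of Inception features, each of shape (N, 2048)."""
    mu_r, mu_f = real_features.mean(axis=0), fake_features.mean(axis=0)
    cov_r = np.cov(real_features, rowvar=False)
    cov_f = np.cov(fake_features, rowvar=False)

    # Matrix square root of the product of the two covariance matrices.
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)
    covmean = covmean.real  # discard tiny imaginary parts from numerical error

    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```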

Runtime

One drawback of diffusion models is that they take much longer to generate images than GANs. For instance, it takes our GAN about two minutes to generate 4,000 images, while it takes the diffusion model two days to do the same. That works out to roughly 0.03 seconds versus 40 seconds per image, a speed difference of more than 1,000x. That said, the diffusion model lets us trade image quality for runtime by using fewer diffusion steps. For example, if we speed up generation to about one second per image, the FID increases from 31.3 to 55.9.
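To give a feel for how that trade-off works, the sketch below samples over a strided subset of timesteps in a deterministic, DDIM-like way; fewer steps means fewer passes through the network, at the cost of image quality. Again, this illustrates the general technique rather than our production sampler.

```python
import torch

@torch.no_grad()
def sample_strided(model, shape, alphas_cumprod, num_steps=50, train_steps=1000):
    """DDIM-style deterministic sampling over a strided subset of timesteps."""
    timesteps = torch.linspace(train_steps - 1, 0, num_steps).long().tolist()
    x = torch.randn(shape)
    for i, t in enumerate(timesteps):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = model(x, t_batch)  # predicted noise at this step
        a_bar = alphas_cumprod[t]
        a_bar_prev = alphas_cumprod[timesteps[i + 1]] if i + 1 < num_steps else torch.tensor(1.0)
        x0 = (x - (1 - a_bar).sqrt() * eps) / a_bar.sqrt()           # predicted clean image
        x = a_bar_prev.sqrt() * x0 + (1 - a_bar_prev).sqrt() * eps   # jump to the next (earlier) step
    return x
```

Each skipped step is one fewer forward pass through the network, which is where the runtime savings come from.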

How the models respond to input changes

One other thing we were curious about was how the models would respond to changes in the input. We moved an obstruction across the image, and this is how each model’s output changed.

Here’s the obstruction input:

Here’s how the GAN output changed:

And here’s how the diffusion model output changed:

We can see that the GAN’s outputs are a lot more unstable than those of the diffusion model, which makes for a more interesting GIF, but suggests that the generation is a lot less controllable. The diffusion model’s outputs hardly change except for the moving obstruction itself, whereas the entire GAN image changes as the obstruction moves.

Why it matters

At this point, you might be saying, “This is really cool, but what does it have to do with me?”

Well, as a little background, we have a dataset of hundreds of thousands of roofs that we trained these generative models on, and we've seen above how the two kinds of generative models (GANs and diffusion models) perform on it. Our roof-modeling pipeline, which takes roof images and LIDAR data as input and automatically outputs 3D roof models, is trained on the same dataset.

Diffusion models were invented primarily for generating realistic images, and that is still their main use today. In the future, though, they could also be used to improve the results of our roof-modeling pipeline. This is work our team is looking into.

Conclusion

As the computer vision literature and our own findings suggest, diffusion models can indeed produce realistic images that match or even surpass those generated by GANs. However, diffusion models take much longer to generate images than GANs.

Over the next few years, we will hopefully see developments that speed up the generation of images with diffusion models.

Acknowledgements

Many thanks to Maxwell Siegelman for implementing and training the diffusion model. Thanks as well to Sherry Huang for her feedback on this blog post.

References and further reading

Featured image courtesy of Project Sunroof.
