Explaining StyleGAN

The pace of technological progress in GANs over the past few years has been remarkable. In seven short years we have gone from grainy black-and-white 64×64 renders to near picture-perfect 1024×1024 photos.

Figure: 4.5 years of GAN progress on face generation. 2014 [7], 2015 [10], 2016 [11], 2017 [12], 2018 [13].

It all starts with the dataset. The model is very particular about the parameters of its input images. First, the resolution must be a power of 2: 128×128, 512×512, 1024×1024, and so on. Say you want to generate faces using StyleGAN. The neural network needs many images to reference while training, because it has to learn not only facial features but also the texture, lighting, and everything else that goes into a realistic photo.
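As a minimal sketch of that preparation step, the snippet below center-crops and resizes a folder of photos to a power-of-2 resolution using Pillow. The folder names are hypothetical placeholders, not part of any StyleGAN tooling:

```python
# A minimal dataset-prep sketch: crop each photo to a square, then resize
# to a power-of-2 resolution. "faces_raw/" and "faces_1024/" are made-up paths.
from pathlib import Path
from PIL import Image

SRC, DST, SIZE = Path("faces_raw"), Path("faces_1024"), 1024  # SIZE must be a power of 2
DST.mkdir(exist_ok=True)

for path in SRC.glob("*.jpg"):
    img = Image.open(path).convert("RGB")
    # Center-crop to a square so the resize doesn't distort the face.
    side = min(img.size)
    left, top = (img.width - side) // 2, (img.height - side) // 2
    img = img.crop((left, top, left + side, top + side))
    img.resize((SIZE, SIZE), Image.LANCZOS).save(DST / path.name)
```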

When it comes to training a model, a popular comparison is the game of Cops & Robbers. The robber, who for the sake of the comparison is an identity counterfeiter, tries to create fake identities, while the cop tries to catch fake IDs. The generator network (the robber) produces an image, and the discriminator network (the cop) looks at it and judges whether it is real. After each round, the losing network slightly tweaks its weights to do better next time. A tweak isn't always successful, which is why you may see the occasional drop in quality. This process repeats thousands of times until the user stops the training. Once training is complete, we can ask the robber to generate images for whatever purpose we choose.
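To make the loop concrete, here is a minimal sketch in PyTorch. The two networks are deliberately tiny placeholders rather than a real face-generation architecture, and in practice both networks update every round; only the alternating judge-and-tweak structure matters here:

```python
# A bare-bones sketch of the cop-and-robber loop. The architectures are
# toy stand-ins: a 100-d noise vector in, a flattened 64x64 image out.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 64 * 64))  # robber
D = nn.Sequential(nn.Linear(64 * 64, 256), nn.ReLU(), nn.Linear(256, 1))    # cop
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
loss_fn = nn.BCEWithLogitsLoss()

def train_step(real):                 # real: (batch, 64*64) tensor of images
    batch = real.size(0)
    fake = G(torch.randn(batch, 100))

    # Cop's turn: learn to score real images as 1 and fakes as 0.
    opt_d.zero_grad()
    d_loss = loss_fn(D(real), torch.ones(batch, 1)) + \
             loss_fn(D(fake.detach()), torch.zeros(batch, 1))
    d_loss.backward()
    opt_d.step()

    # Robber's turn: tweak itself so its fakes get scored as "real".
    opt_g.zero_grad()
    g_loss = loss_fn(D(fake), torch.ones(batch, 1))
    g_loss.backward()
    opt_g.step()
```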

With the original GAN, this was the extent of the architecture: two simple neural networks (the robber and the cop) combined into one complex system. But human nature is never satisfied. In 2015, the individual networks were upgraded to more complex Convolutional Neural Networks (the DCGAN approach), and a series of other variants of the original GAN followed, such as CoGAN, ArtGAN, and DiscoGAN.
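What did that 2015 upgrade look like? Below is a rough sketch of a convolutional (DCGAN-style) generator in PyTorch. The layer sizes are illustrative, not taken from any particular paper; the point is that transposed convolutions gradually upsample a noise vector into an image instead of producing it in one linear jump:

```python
# A DCGAN-style generator sketch: a (batch, 100, 1, 1) noise tensor is
# upsampled through transposed convolutions into a 64x64 RGB image.
import torch.nn as nn

generator = nn.Sequential(
    nn.ConvTranspose2d(100, 512, 4, 1, 0), nn.BatchNorm2d(512), nn.ReLU(),  # 1x1  -> 4x4
    nn.ConvTranspose2d(512, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.ReLU(),  # 4x4  -> 8x8
    nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(),  # 8x8  -> 16x16
    nn.ConvTranspose2d(128, 64, 4, 2, 1),  nn.BatchNorm2d(64),  nn.ReLU(),  # 16x16 -> 32x32
    nn.ConvTranspose2d(64, 3, 4, 2, 1),    nn.Tanh(),                       # 32x32 -> 64x64
)
```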

There was one major problem with these newer GANs, however. The goal of the neural-network game of Cops & Robbers is for the generated images to trick the discriminator, not necessarily to be high quality. If the generator produced an image that looked too clean, the discriminator could simply learn to flag it as "too good to be true" and reject it. So the robber got a little tricky, as robbers do: it would generate images that look noticeably off to the human eye but that the discriminator would still classify as real. Obviously, this hurts the final output of the model, even though the network technically reached its goal.
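For the curious, the objective from the original 2014 GAN paper makes this concrete. The generator $G$ is scored only on whether it fools the discriminator $D$; nothing in the formula directly rewards image quality:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$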

This issue was addressed in a major way in 2018 with NVIDIA's StyleGAN. It takes a progressive approach to generating images. First, it trains the original GAN's simple networks at a very low resolution, such as 4×4. This runs for about 100 rounds, during which the discriminator struggles to tell real from fake because everything looks equally blurry at that size. The process then gradually increases the complexity of the models and the resolution, doubling step by step until it reaches the full output resolution. On top of that, StyleGAN lets us inject custom characteristics, such as blonde hair, giving us more control over what the network generates.
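In pseudocode, that progressive schedule looks something like the sketch below. Here train_rounds_at() is a hypothetical placeholder for the Cops & Robbers loop above, and real implementations also fade new layers in smoothly rather than switching resolutions abruptly:

```python
# A toy sketch of progressive growing: train at each power-of-2 resolution
# before doubling it. train_rounds_at() is hypothetical, standing in for
# running the adversarial loop at the given image size.
resolution = 4
while resolution <= 1024:
    train_rounds_at(resolution, rounds=100)  # roughly 100 rounds per stage
    resolution *= 2                          # 4 -> 8 -> 16 -> ... -> 1024
```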

That was 2018; it's now 2021 and things have improved further. The 2018 StyleGAN still serves as the base, which speaks to its quality. With StyleGAN2-ADA, the discriminator becomes, well, more discriminating: its training images are adaptively augmented so it doesn't overfit, which raises the quality of the final generations, especially on smaller datasets. Training has also become much faster and more efficient, making it practical for individuals to train a model themselves.

So what are we waiting for? Let's try it for ourselves.
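As a starting point, here is a sketch of generating a single face with NVIDIA's stylegan2-ada-pytorch repository. It assumes you run it from a clone of that repo (the dnnlib and legacy modules ship with it), that you have a CUDA GPU, and that the pretrained FFHQ pickle URL is still live:

```python
# Sketch: generate one face with a pretrained StyleGAN2-ADA model.
# Run from inside a clone of github.com/NVlabs/stylegan2-ada-pytorch.
import numpy as np
import PIL.Image
import torch

import dnnlib   # ships with the stylegan2-ada-pytorch repo
import legacy   # ditto

url = 'https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada-pytorch/pretrained/ffhq.pkl'
device = torch.device('cuda')
with dnnlib.util.open_url(url) as f:
    G = legacy.load_network_pkl(f)['G_ema'].to(device)  # trained generator

z = torch.from_numpy(np.random.RandomState(42).randn(1, G.z_dim)).to(device)
label = torch.zeros([1, G.c_dim], device=device)        # FFHQ is unconditional
img = G(z, label, truncation_psi=0.7, noise_mode='const')

# Convert from [-1, 1] floats to an 8-bit RGB image and save it.
img = (img.permute(0, 2, 3, 1) * 127.5 + 128).clamp(0, 255).to(torch.uint8)
PIL.Image.fromarray(img[0].cpu().numpy(), 'RGB').save('fake_face.png')
```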
