neuron.ai

Synthetic Data II

In the first part of our series on synthetic data, we introduced the basic concepts and applications of synthetic data, looked at the main problems of creating it and how to overcome them, and provided some references to useful GitHub repositories. In this second part, we focus entirely on one method for creating synthetic data: Generative Adversarial Networks (GANs).

Generative Adversarial Networks (GANs)

Synthetic data has become increasingly popular in the field of Artificial Intelligence. Since diverse data is important for improving a model's performance and reducing the risk of overfitting, generating synthetic data is a common practice. One way to use synthetic data is to augment real-world datasets, which can be done with generative models such as Generative Adversarial Networks (GANs). GANs generate new samples similar to the real data, increasing the size and diversity of the dataset and thereby improving the performance of the model.

Generative Adversarial Networks (GANs) are deep learning architectures designed to generate new, unseen data that resembles the input data. They consist of two main components: a generator network and a discriminator network. The generator takes a random input and produces new data, while the discriminator takes in both generated and real data and tries to distinguish between them. The two networks are trained together in an adversarial process: the generator tries to create data that fools the discriminator, while the discriminator tries to identify the generated data correctly. GANs have been used successfully in various tasks such as image generation, video generation, text generation, and audio synthesis.
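
To make the two-network structure concrete, here is a minimal sketch of a generator and a discriminator written in PyTorch (the framework choice, the layer sizes, and the flattened 28x28 image shape are illustrative assumptions, not details from any of the cited papers):

import torch
import torch.nn as nn

latent_dim = 100          # size of the random input vector (assumption)
image_dim = 28 * 28       # flattened image size for MNIST-like data (assumption)

# The generator maps a random noise vector to a synthetic sample.
generator = nn.Sequential(
    nn.Linear(latent_dim, 256),
    nn.ReLU(),
    nn.Linear(256, image_dim),
    nn.Tanh(),            # outputs scaled to [-1, 1]
)

# The discriminator maps a (real or generated) sample to a probability of being real.
discriminator = nn.Sequential(
    nn.Linear(image_dim, 256),
    nn.LeakyReLU(0.2),
    nn.Linear(256, 1),
    nn.Sigmoid(),         # probability that the input is real
)

z = torch.randn(16, latent_dim)       # a batch of random inputs
fake_images = generator(z)            # new data produced by the generator
p_real = discriminator(fake_images)   # discriminator's judgement of the generated data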

GANs were first introduced by Ian Goodfellow et al. in 2014 [1]. Since then, they have been widely used in computer vision, natural language processing, time series prediction, and other fields, and they are particularly powerful at generating high-quality images and videos. Figure 1 shows a visualization of samples from the model in [1], where the rightmost column shows the nearest training example of the neighboring sample, demonstrating that the model has not memorized the training set and that the samples are fair random draws, not cherry-picked.

Figure 1: MNIST visualization of samples from the model described in [1].

GANs for Synthetic Data Generation

There are several popular GAN architectures; Deep Convolutional Generative Adversarial Networks (DCGANs) are a common choice for image generation. They use deep convolutional neural networks for both the generator and the discriminator to generate high-resolution images. There are also several variants of GANs, such as Wasserstein GANs (WGANs) and Boundary Equilibrium Generative Adversarial Networks (BEGANs), which generate more realistic data thanks to a more stable training process.
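
As an illustration of the DCGAN idea, a generator built from transposed convolutions might look like the sketch below. It follows the general DCGAN recipe (transposed convolutions with batch normalization and ReLU), but the channel counts and the 64x64 RGB output resolution are assumptions made for illustration:

import torch
import torch.nn as nn

latent_dim = 100

# DCGAN-style generator: a stack of transposed convolutions upsamples a noise
# vector (treated as a 1x1 feature map) into a 64x64 RGB image.
dcgan_generator = nn.Sequential(
    nn.ConvTranspose2d(latent_dim, 512, kernel_size=4, stride=1, padding=0),  # -> 4x4
    nn.BatchNorm2d(512),
    nn.ReLU(),
    nn.ConvTranspose2d(512, 256, kernel_size=4, stride=2, padding=1),         # -> 8x8
    nn.BatchNorm2d(256),
    nn.ReLU(),
    nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1),         # -> 16x16
    nn.BatchNorm2d(128),
    nn.ReLU(),
    nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),          # -> 32x32
    nn.BatchNorm2d(64),
    nn.ReLU(),
    nn.ConvTranspose2d(64, 3, kernel_size=4, stride=2, padding=1),            # -> 64x64
    nn.Tanh(),
)

z = torch.randn(8, latent_dim, 1, 1)   # noise reshaped to a 1x1 spatial map
images = dcgan_generator(z)            # shape: (8, 3, 64, 64)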

Another key aspect of GANs is the adversarial loss function, which measures the difference between the generated and real data and is used to train both the generator and the discriminator. The generator tries to create data as similar as possible to the real data in order to minimize this loss, while the discriminator tries to maximize it by identifying the generated data correctly; this is what creates the competition between them. Additional loss functions can also be used to improve the quality of the generated data; for instance, pixel-wise losses such as the mean squared error can be used for image generation.
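
A minimal sketch of one adversarial training step, using the common binary cross-entropy formulation in PyTorch, is given below. The networks and optimizers are passed in as arguments (for example, the generator and discriminator from the earlier sketch), and the latent size of 100 is an arbitrary assumption:

import torch
import torch.nn as nn

bce = nn.BCELoss()

def train_step(generator, discriminator, opt_g, opt_d, real_images, latent_dim=100):
    # One adversarial update: first the discriminator, then the generator.
    batch = real_images.size(0)
    real_labels = torch.ones(batch, 1)
    fake_labels = torch.zeros(batch, 1)

    # Discriminator step: push real samples toward 1 and generated samples toward 0.
    fake_images = generator(torch.randn(batch, latent_dim)).detach()  # no generator gradients here
    d_loss = bce(discriminator(real_images), real_labels) + \
             bce(discriminator(fake_images), fake_labels)
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: try to make the discriminator label generated samples as real.
    g_loss = bce(discriminator(generator(torch.randn(batch, latent_dim))), real_labels)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()

When a paired target image is available, a pixel-wise term such as nn.MSELoss() or nn.L1Loss() between the generated and target images can simply be added to g_loss, as mentioned above.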

Another characteristic of GANs is the use of a noise vector as input to the generator, which allows it to produce unseen data. This noise vector is usually drawn from a simple distribution, such as a normal or uniform distribution.
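
In PyTorch, for example, such a noise batch can be drawn from either distribution in one line (the batch size of 64 and latent size of 100 are arbitrary choices):

import torch

z_normal = torch.randn(64, 100)              # standard normal noise
z_uniform = torch.rand(64, 100) * 2 - 1      # uniform noise in [-1, 1]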

Research Overview

Several papers discuss the use of GANs for synthetic data generation, including the original GAN architecture, unsupervised representation learning, improvements in stability and convergence, and advanced techniques such as progressive growing and self-attention mechanisms. Here, we focus on papers with code or real-world applications.

In [2], the authors present a conditional GAN architecture for image-to-image translation and evaluate the Pix2Pix model on several translation tasks, including edges-to-shoes, edges-to-handbags, and cityscape-to-facade translation. They show that the model outperforms several state-of-the-art image-to-image translation models in terms of both accuracy and efficiency, and that it can handle complex tasks such as creating photo-realistic images from sketches.
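
The key difference from an unconditional GAN is that both networks see the input image: the generator translates it, and the discriminator judges (input, output) pairs rather than outputs alone. The sketch below shows only that conditioning step, with a small convolutional discriminator standing in for the actual Pix2Pix architecture; the layer sizes and the 256x256 resolution are assumptions:

import torch
import torch.nn as nn

# Illustrative conditional discriminator: it receives the input image and an output
# image concatenated along the channel axis (3 + 3 = 6 channels) and produces
# patch-wise real/fake scores for the pair.
cond_discriminator = nn.Sequential(
    nn.Conv2d(6, 64, kernel_size=4, stride=2, padding=1),
    nn.LeakyReLU(0.2),
    nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),
    nn.LeakyReLU(0.2),
    nn.Conv2d(128, 1, kernel_size=4, stride=1, padding=1),
)

input_image = torch.randn(1, 3, 256, 256)      # e.g. an edge map or label map
output_image = torch.randn(1, 3, 256, 256)     # real or generated photo
pair = torch.cat([input_image, output_image], dim=1)
patch_scores = cond_discriminator(pair)        # one score per image patch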

The paper [3] introduces the CycleGAN architecture for unpaired image-to-image translation. Unlike traditional GANs, it is designed to work with unpaired image samples, which allows the model to perform image-to-image translation without large amounts of paired training data. The CycleGAN architecture consists of two generator networks and two discriminator networks. Each generator takes an image from one domain and transforms it into an image in the other domain, while the two discriminators evaluate the generated images and determine whether they are real or fake. The training objective also includes a cycle-consistency loss, which ensures that an image transformed into the other domain can be transformed back into the original image. This loss helps to preserve the structural and semantic information of the original image and to improve the quality of the generated one. Figure 2 gives some examples of the results of the proposed model.

The model is evaluated on several image-to-image translation tasks, including horse-to-zebra, summer-to-winter, and apple-to-orange translation. It has been shown to outperform several image-to-image translation models in terms of both accuracy and robustness, and it can handle a wide range of tasks, including those that involve complex and diverse image structures and styles.

Figure 2: Collection style transfer: input images are transferred into the artistic styles of Monet, Van Gogh, Cezanne, and Ukiyo-e [3].
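
The cycle-consistency loss described above is compact to express: translate an image to the other domain and back, then penalize the difference to the original image. Below is a minimal sketch, with G_AB and G_BA standing in for CycleGAN's two generator networks; the L1 penalty and the weight of 10 follow a common CycleGAN setup but are configuration choices, not requirements:

import torch.nn as nn

l1 = nn.L1Loss()

def cycle_consistency_loss(G_AB, G_BA, real_A, real_B, weight=10.0):
    # Translate A -> B -> back to A, and B -> A -> back to B, then penalize
    # how far the reconstructions drift from the original images.
    reconstructed_A = G_BA(G_AB(real_A))
    reconstructed_B = G_AB(G_BA(real_B))
    return weight * (l1(reconstructed_A, real_A) + l1(reconstructed_B, real_B))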

In [4], an application of GANs to image inpainting is presented, in which missing or corrupted parts of an image are filled in. A type of convolution called partial convolution is introduced to handle irregularly shaped holes in images, and the authors also propose a feature-map normalization technique that helps the network balance the feature activations between the hole and the surrounding regions. The proposed model is evaluated on image inpainting tasks, including rectangular and irregular hole inpainting, where it outperforms other methods, and it can handle more complex cases such as filling in multiple holes of different shapes and sizes.
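
The core of a partial convolution is that the kernel aggregates only valid (non-hole) pixels, rescales the result by how many valid pixels it saw, and updates the mask so that filled-in locations become valid for the next layer. The following is a simplified sketch of that forward pass, not the exact formulation from [4] (bias handling and other details are omitted):

import torch
import torch.nn as nn
import torch.nn.functional as F

class PartialConv2d(nn.Module):
    # Simplified partial convolution: convolve only valid pixels and update the mask.

    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding, bias=False)
        # Fixed all-ones kernel used to count the valid pixels under each window.
        self.register_buffer("ones", torch.ones(1, 1, kernel_size, kernel_size))
        self.padding = padding

    def forward(self, x, mask):
        # mask is 1 for valid pixels and 0 inside holes, with shape (N, 1, H, W).
        valid_count = F.conv2d(mask, self.ones, padding=self.padding)
        out = self.conv(x * mask)
        # Rescale by (window size / number of valid pixels) wherever any valid pixel exists.
        scale = self.ones.numel() / valid_count.clamp(min=1.0)
        out = out * scale * (valid_count > 0).float()
        # A location becomes valid if its window contained at least one valid pixel.
        new_mask = (valid_count > 0).float()
        return out, new_mask

x = torch.randn(1, 3, 64, 64)
mask = torch.ones(1, 1, 64, 64)
mask[:, :, 20:40, 20:40] = 0            # a rectangular hole
out, updated_mask = PartialConv2d(3, 16)(x, mask)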

For video generation, [5] introduces a GAN-based deep learning approach for video-to-video synthesis. The model, called vid2vid, learns the mapping between an input video and an output video. It includes a spatio-temporal generator that can produce high-resolution video frames and discriminators that judge whether the generated video is realistic. By employing two discriminators, one for spatial information and one for temporal information, the model can preserve the content and motion information in the generated video. Additionally, an optical flow prediction module is introduced to handle large motions and occlusions in the input and output videos. The approach is evaluated on several video-to-video synthesis tasks, including generating high-resolution videos, changing the weather conditions in a video, and changing the scene content in a video.
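
The two-discriminator idea can be sketched independently of the actual vid2vid architecture: one network scores individual frames, another scores short clips across time. The 3D convolutions below are only an illustrative stand-in for the paper's temporal discriminator, and all layer sizes are assumptions:

import torch
import torch.nn as nn

# Illustrative spatial discriminator: judges a single frame.
frame_discriminator = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1),
    nn.LeakyReLU(0.2),
    nn.Conv2d(64, 1, kernel_size=4, stride=1, padding=1),
)

# Illustrative temporal discriminator: judges a short clip via 3D convolutions.
clip_discriminator = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=(3, 4, 4), stride=(1, 2, 2), padding=(1, 1, 1)),
    nn.LeakyReLU(0.2),
    nn.Conv3d(64, 1, kernel_size=(3, 4, 4), stride=(1, 1, 1), padding=(1, 1, 1)),
)

generated_clip = torch.randn(1, 3, 8, 128, 128)               # (batch, channels, time, H, W)
frame_scores = frame_discriminator(generated_clip[:, :, 0])   # score one frame
clip_scores = clip_discriminator(generated_clip)              # score the whole clip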

 

In [6], the authors present a GAN-based approach for generating videos with scene dynamics, using a two-stage training process. The first stage trains a scene dynamics network to capture the complex scene dynamics, while the second stage trains a video generator network to synthesize realistic video sequences based on the learned scene dynamics. The adversarial loss function is designed to penalize the video generator network for synthesizing unrealistic scene dynamics.

In [7], a video generation approach named Stochastic Video Generation (SVG) learns a prior over the target video distribution by training a separate network on a large dataset of real videos. The study demonstrates that the learned prior can be transferred to new tasks, allowing the SVG network to generate realistic video sequences for a new task with little fine-tuning.

As for text generation, [8] presents a generative model for sequence data such as text, music, and speech. The model, called SeqGAN, is a sequence generative adversarial network that uses reinforcement learning to adjust the generator's parameters based on the reward signal provided by the discriminator.
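
That reinforcement-learning step can be sketched as a REINFORCE-style update: sample a sequence from the generator, let the discriminator's probability that the sequence is real act as the reward, and weight the log-likelihood of the sampled tokens by that reward. The sketch below assumes a generator that returns per-step token logits and a discriminator that scores complete sequences; both interfaces are illustrative stand-ins rather than the actual SeqGAN networks:

import torch

def policy_gradient_step(generator, discriminator, optimizer, batch_size, seq_len):
    # Sample sequences token by token from the generator's own distribution.
    tokens = torch.zeros(batch_size, 0, dtype=torch.long)
    log_probs = []
    for _ in range(seq_len):
        logits = generator(tokens)                    # assumed to return (batch, vocab) logits
        dist = torch.distributions.Categorical(logits=logits)
        next_token = dist.sample()
        log_probs.append(dist.log_prob(next_token))
        tokens = torch.cat([tokens, next_token.unsqueeze(1)], dim=1)

    # Reward: the discriminator's estimated probability that each sequence is real.
    reward = discriminator(tokens).detach().view(-1)

    # REINFORCE: maximize the reward-weighted log-likelihood of the sampled tokens.
    loss = -(torch.stack(log_probs, dim=1).sum(dim=1) * reward).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()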

Concerning synthetic audio data generation, [9] addresses text-to-speech synthesis with a generative model. The ClariNet model is an end-to-end system that generates high-quality speech waveforms from text input in real time. It is based on a parallel wave-generation architecture, which allows speech waveforms to be produced with high audio quality and low computational cost. The model is trained on a large corpus of speech data, and it can easily be fine-tuned on new speaker data for personalized speech synthesis.

 

Conclusion

To sum up, although GANs are known to be challenging to train, and finding the right architecture and hyperparameters can be difficult, they are considered an efficient method for synthetic data generation because they have been used successfully in such a wide variety of tasks.

Ayşenur Gilik

I am a researcher at Pro2Future working on sustainability with explainable AI. I am currently pursuing a Ph.D. at the Institute of Pervasive Computing at JKU in Linz, Austria. I finished my M.Sc. in Electronics Engineering at Kadir Has University in İstanbul, Türkiye, where I worked as a teaching assistant in the Electrical-Electronics Engineering Department for four years. I have primarily worked on machine learning and artificial intelligence and their applications to different engineering problems. My professional interests are machine learning, artificial intelligence, computer vision, sustainability, transparent and trustworthy systems, and teaching; my personal interests are literature, cinema, writing, and photography.

References

[1] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2020). Generative adversarial networks. Communications of the ACM, 63(11), 139-144.

[2] Isola, P., Zhu, J. Y., Zhou, T., & Efros, A. A. (2017). Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1125-1134).

[3] Zhu, J. Y., Park, T., Isola, P., & Efros, A. A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2223-2232).

[4] Liu, G., Reda, F. A., Shih, K. J., Wang, T. C., Tao, A., & Catanzaro, B. (2018). Image inpainting for irregular holes using partial convolutions. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 85-100).

[5] Wang, T. C., Liu, M. Y., Zhu, J. Y., Liu, G., Tao, A., Kautz, J., & Catanzaro, B. (2018). Video-to-video synthesis. arXiv preprint arXiv:1808.06601.

[6] Vondrick, C., Pirsiavash, H., & Torralba, A. (2016). Generating videos with scene dynamics. In Advances in Neural Information Processing Systems 29.

[7] Denton, E., & Fergus, R. (2018). Stochastic video generation with a learned prior. arXiv preprint arXiv:1802.07687.

[8] Yu, L., Zhang, W., Wang, J., & Yu, Y. (2017). SeqGAN: Sequence generative adversarial nets with policy gradient. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 31, No. 1).

[9] Ping, W., Peng, K., & Chen, J. (2018). ClariNet: Parallel wave generation in end-to-end text-to-speech. arXiv preprint arXiv:1807.07281.