Evolving Architectures for Colorization (Colorvision – Issue 2 (Jan/Feb 2025))

Charles UMESI


Introduction

Over the last decade, CNNs (convolutional neural networks) have been the preferred neural architecture for machine learning-related computer vision projects, and they have also been used for image colorization (such as by Zhang et al. (2016)). Other architectural families have since been applied to colorization, such as cINNs (conditional invertible neural networks) by Ardizzone et al. (2019). GANs (generative adversarial networks) have also shown promise in colorization, producing significantly better results than CNNs (Dalal et al., 2021); TriColVid, for example, currently uses a GAN by Elshazly et al. (2023). Another family of architectures being explored in colorization is the transformer (Kumar et al., 2021). This group of neural networks has all but replaced RNNs (recurrent neural networks) in predictive text and natural language processing as a whole, and transformers are also used in text-to-image and text-to-video applications (such as those from OpenAI). Given their use in image and video applications and their dominance in AI more broadly, it is perhaps unsurprising that this architecture has been tried in image colorization (Kumar et al., 2021). More recently still, Shafiq and Lee (2024) have combined a GAN with a transformer. The design of neural architectures for colorization is far from settled.

Main

Given the challenges of accurate colorization, the neural networks being developed for the task have become progressively more complex and extensive. GANs have a more extensive architecture than CNNs, and transformers (generally) have a more extensive architecture than GANs. Unlike a CNN, which consists of just one neural network, a GAN and a transformer (in most cases) each consist of two neural networks. In the case of a GAN, the two networks are termed the generator and discriminator; in the case of the transformer (generally speaking), the two networks are described as the encoder and decoder. Not surprisingly, the combined GAN-transformer has an even more complex and extensive architecture. Diagrams of these architectures, as well as details of how these networks process pixel information from an image, can be found in the papers cited in the previous paragraph.
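The two-network structure of a GAN can be sketched in a few lines. The layer sizes and fully connected layers below are illustrative assumptions for clarity only; none of the cited papers use this design, and a real colorization GAN would use convolutional blocks and an adversarial training loop.

```python
# Minimal structural sketch (untrained) of a GAN's two networks:
# a generator that maps latent noise to an "image" vector, and a
# discriminator that maps an image vector to a realism score.
import numpy as np

rng = np.random.default_rng(0)

def dense_relu(x, w, b):
    """One fully connected layer with ReLU, standing in for the
    convolutional blocks a real colorization network would use."""
    return np.maximum(w @ x + b, 0.0)

# Generator: 16-d noise vector -> 128-d flattened color output.
g_w, g_b = rng.standard_normal((128, 16)) * 0.1, np.zeros(128)

def generator(z):
    return dense_relu(z, g_w, g_b)

# Discriminator: 128-d image vector -> single real/fake score.
d_w, d_b = rng.standard_normal((1, 128)) * 0.1, np.zeros(1)

def discriminator(img):
    return dense_relu(img, d_w, d_b)

z = rng.standard_normal(16)     # latent noise
fake = generator(z)             # shape (128,): generated "colors"
score = discriminator(fake)     # shape (1,): realism score
print(fake.shape, score.shape)
```

During training, the discriminator would be rewarded for telling generated colorizations from ground truths, and the generator for fooling it; that adversarial pressure is what the cited GAN results rely on.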

There are various reasons developers cite for proposing a specific architecture for colorization, but colorization that is not fully satisfactory is a common theme. The hybrid GAN-transformer by Shafiq and Lee (2024) was intended to address color bleeding. The self-attention capabilities of transformers are noteworthy, but the vision transformer model developed by Kumar et al. (2021) is limited in its ability to detect edges. The authors of both models point to various improvements in colorization from their architectures. The vision transformer by Kumar et al. (2021) has a better FID score than GANs, which in turn have better FID scores than a CNN (Ardizzone et al., 2019). Shafiq and Lee (2024), on the other hand, used PSNR scores, and their hybrid model performed better than the vision transformer by Kumar et al. (2021). But given that PSNR and FID measure different things, it is difficult to say conclusively whether the hybrid model is a true improvement over the vision transformer. Whilst there is a clear scientific need to assess the quality of colorization objectively using various metrics, the perception of that quality will undoubtedly be subjective. The author of this article found noticeable visual discrepancies between ground truths and generated colors for both the hybrid model (Shafiq and Lee, 2024) and the vision transformer (Kumar et al., 2021); that gap between the objective and the subjective will need to be bridged.
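Of the two metrics discussed above, PSNR is simple enough to show in full (FID, by contrast, requires a pretrained feature network and a distributional comparison, so it cannot be reduced to a few lines). The images below are toy random data, not outputs from any of the cited models.

```python
# Worked example of PSNR (peak signal-to-noise ratio), the
# pixel-level metric used by Shafiq and Lee (2024). Higher is
# better; identical images score infinity.
import numpy as np

def psnr(reference, generated, max_val=255.0):
    """PSNR in decibels between two same-shaped images."""
    mse = np.mean((reference.astype(np.float64)
                   - generated.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")   # images are identical
    return 10.0 * np.log10(max_val ** 2 / mse)

rng = np.random.default_rng(1)
truth = rng.integers(0, 256, size=(32, 32, 3)).astype(np.float64)
noisy = np.clip(truth + rng.normal(0.0, 5.0, truth.shape), 0, 255)

print(round(psnr(truth, noisy), 1))   # roughly mid-30s dB here
```

Because PSNR compares pixels directly while FID compares feature distributions, a model can improve one score without improving the other, which is why the article argues the two sets of results are hard to compare head-to-head.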

The cINN by Ardizzone et al. (2019) is intriguing, as this type of network combines a plain INN with a VGG-type CNN. Plain INNs are unsuitable for colorization because their architecture restricts the use of pooling and batch normalization layers, which are important in neural networks involved in colorization. The architecture by Ardizzone et al. (2019) overcomes this problem by incorporating the VGG-CNN as a conditioning network. Their model has a slightly better FID score than the GAN the team compared it with, although that score is not as good as that of the vision transformer by Kumar et al. (2021). One criticism of the paper by Ardizzone et al. (2019) is that almost no ground truths were provided, so it is difficult to comment on the quality of the perceived improvement.
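The exact invertibility that defines an INN can be illustrated with an affine coupling layer, a standard building block of such networks: the input is split in half, one half is rescaled and shifted using terms computed from the other half, and the whole operation can be run backwards exactly. The tiny subnet below is a stand-in assumption, not the VGG-style conditioning network Ardizzone et al. actually use.

```python
# Minimal affine coupling layer: forward() and inverse() are exact
# mirrors of each other, the defining property of an INN.
import numpy as np

rng = np.random.default_rng(2)
w = rng.standard_normal((4, 2)) * 0.1   # toy subnet weights

def subnet(x_half):
    """Maps one half of the input to a log-scale and a shift term."""
    h = np.tanh(w @ x_half)
    return h[:2], h[2:]                  # (log_s, t), each length 2

def coupling_forward(x):
    x1, x2 = x[:2], x[2:]
    log_s, t = subnet(x1)
    return np.concatenate([x1, x2 * np.exp(log_s) + t])

def coupling_inverse(y):
    y1, y2 = y[:2], y[2:]
    log_s, t = subnet(y1)                # y1 == x1, so the same
    return np.concatenate([y1, (y2 - t) * np.exp(-log_s)])  # terms recur

x = rng.standard_normal(4)
y = coupling_forward(x)
print(np.allclose(coupling_inverse(y), x))   # True: exactly invertible
```

Because the scale and shift must be recomputable from an untouched half of the signal, such layers cannot freely discard information the way pooling does, which is the architectural restriction the conditioning network in the cINN works around.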

There is evidently more work to be done on neural network development for colorization. The growing extent of these architectures alludes to the complexity of image processing, which the visual cortex of the brain handles effortlessly; aspects of that biological processing are likely to be factored in as newer architectures emerge. It remains uncertain, however, which of these architectures, transformers included (despite having replaced RNNs elsewhere), will eventually displace CNNs and GANs, with their known weaknesses, as the architecture of choice for colorization, assuming that happens at all.

Conclusion

The development of colorization is occurring on multiple fronts, one of them being neural architecture, of which several types have emerged in the last few years. In general, these architectures are more complex and extensive than previous colorization models. One group in particular, transformers, which have not only replaced RNNs but are today the dominant architecture in many areas of AI (thanks in part to models developed by OpenAI), is now being tried in colorization, an area currently dominated by CNNs and GANs. Objective metrics show that on certain parameters these newer models, including vision transformers, perform better than their predecessors. However, there is a perceived discrepancy between ground truths and the colorizations these models produce, something future architectures will also have to address. On that basis, it remains to be seen whether any of these newer models, transformers included, will displace CNNs and GANs in colorization tasks in the near future.

References

Ardizzone et al. (2019) “Guided Image Generation with Conditional Invertible Neural Networks”, arXiv:1907.02392v3 [cs.CV], DOI: https://doi.org/10.48550/arXiv.1907.02392.

Dalal et al. (2021) “Image Colorization Progress: A Review of Deep Learning Techniques for Automation of Colorization”, IJATCSE, 10: 2908–15, DOI: https://doi.org/10.30534/ijatcse/2021/401042021.

Elshazly et al. (2023) “Image Colorization Using GANs”. Available from: https://www.kaggle.com/code/ziyadelshazly/image-colorization-using-gans/notebook [Accessed June 24, 2024].

Kumar et al. (2021) “Colorization Transformer”, arXiv:2102.04432v2 [cs.CV], DOI: https://doi.org/10.48550/arXiv.2102.04432.

Shafiq and Lee (2024) “Transforming Color: Novel Image Colorization Method”, arXiv:2410.04799v1 [cs.CV], DOI: https://doi.org/10.3390/electronics13132511.

Zhang et al. (2016) “Colorful Image Colorization”, arXiv:1603.08511 [cs.CV], DOI: https://doi.org/10.48550/arXiv.1603.08511.