Transforming the Landscape of Computer Vision: The Rise of Transformer Models

Rebeca Sarai G. G.
10 min readJan 3, 2023

Computer vision transformer models are a relatively new approach to a variety of tasks within computer vision, such as image classification, object detection, and segmentation. These models have gained popularity in recent years due to their ability to handle large amounts of data and model long-range dependencies, which has led some to question whether they are superior to traditional convolutional neural networks (CNNs). In this article, we will explore some of the key differences between these two approaches and consider whether computer vision transformer models are indeed better than CNNs.

Computer vision

Computer vision is a field of artificial intelligence that focuses on enabling computers to understand and interpret visual data from the world around them, such as images and videos. Transformer models have recently emerged as a powerful tool for a variety of tasks within computer vision, including image classification, object detection, and segmentation.

Image processing

Image processing is the field of artificial intelligence that focuses on enabling computers to understand and interpret visual data from the world around them, such as images and videos. It involves a range of techniques and algorithms that are used to extract useful information from images and video, such as identifying objects, recognizing patterns, and detecting features.

Image processing has a wide range of applications, including computer vision, medical imaging, remote sensing, and video analysis. It can be used to perform tasks such as image enhancement, image restoration, image compression, and image recognition.

There are many different approaches to image processing, including machine learning, computer vision, and signal processing. Machine learning approaches typically involve training a model on a large dataset of images in order to recognize patterns and make predictions. Computer vision approaches typically involve designing algorithms that can understand and interpret visual data in a way that is similar to how the human visual system works. Signal processing approaches involve manipulating and analyzing the raw data in an image in order to extract useful information.

Overall, image processing is an important field of artificial intelligence that has the potential to revolutionize a wide range of industries and applications.

Image processing and its relation with computer vision

Image processing algorithms are used to extract information from images, restore and compress image and video data, and build new experiences in virtual and augmented reality. Computer vision uses image processing to recognize and categorize image data.

Image processing is a key component of computer vision, as it involves a range of techniques and algorithms that are used to extract useful information from images and video.

Computer vision is a broad field that encompasses a wide range of tasks, including image classification, object detection, image segmentation, and video analysis. Image processing is an important part of these tasks, as it involves analyzing and manipulating the raw data in images and videos in order to extract useful information.

There are many different approaches to image processing, including machine learning, computer vision, and signal processing. These approaches are used to design algorithms and systems that can understand and interpret visual data in a way that is similar to how the human visual system works.

Overall, image processing is a crucial part of the field of computer vision, as it provides the tools and techniques needed to extract useful information from visual data. This enables computers to understand and interpret the world around them in a way that is similar to how humans do.

Learn image processing with Python

If you would like to learn more about image processing to process, transform, and manipulate images at your will, even when they come in thousands, check out my course in Datacamp.

Anonymize faces on images using Python in this course on Datacamp, starting for free.

You will also learn to restore damaged images, perform noise reduction, smart-resize images, count the number of dots on a dice, apply facial detection, and much more, using scikit-image. After completing this course, you will be able to apply your knowledge to different domains such as machine learning and artificial intelligence, machine and robotic vision, space and medical image analysis, retailing, and many more. Take the step and dive into the wonderful world that is computer vision!

Transformer models

A transformer model is a type of deep learning architecture that was originally introduced in the paper “Attention is All You Need” (Vaswani et al., 2017). It is primarily used for natural language processing tasks, such as language translation, but has since been adapted for use in computer vision and other domains.

One of the key innovations of the transformer model is its use of self-attention mechanisms, which allow it to model long-range dependencies in sequential data. In the context of computer vision, this means that the model can consider the relationship between different pixels in an image or different frames in a video when making predictions.

One of the main benefits of using transformer models for computer vision tasks is their ability to handle large amounts of data effectively. Traditional convolutional neural networks (CNNs), which have been the dominant approach for many computer vision tasks, can struggle with very large input sizes, such as high-resolution images or long video sequences. Transformer models, on the other hand, can process input of any length, making them well-suited for tasks that require handling large amounts of data.

There have been several notable successes in using transformer models for computer vision tasks. In 2020, a team from Google Research introduced a transformer-based model called ViT (Visual Transformer) that achieved state-of-the-art results on several image classification benchmarks. ViT works by treating an image as a sequence of patches, which are then processed by the transformer model. This allows it to effectively capture both local and global features in the image.

Another notable example of a transformer model for computer vision is DETR (DEtection TRansformer), which was introduced by Facebook AI in 2020. DETR is a fully end-to-end object detection model that uses a transformer architecture to process the input image and generate object predictions. It has been shown to outperform many other state-of-the-art object detection models on a number of benchmarks.

Vision Transformer model overview. It splits an image into fixed-size patches, linearly embeds each of them, adds position embeddings, and feeds the resulting sequence of vectors to a standard Transformer encoder. In order to perform classification, we use the standard approach of adding an extra learnable “classification token” to the sequence. The illustration of the Transformer encoder was inspired by Vaswani et al. (2017). Source: AN IMAGE IS WORTH 16X16 WORDS

Here are a few examples of use cases for transformer models in AI:

  • Language translation: Transformer models were originally developed for the task of machine translation, and have achieved state-of-the-art results on many benchmarks. They are particularly effective at handling long and complex sentences, as they can model long-range dependencies between words and phrases.
  • Text classification: Transformer models can be used to classify text documents or other types of written content into different categories, such as spam emails or news articles. They are particularly effective at handling large amounts of text data, as they can process input of any length.
  • Image classification: Transformer models have been adapted for use in computer vision tasks, such as image classification, where they have been shown to be effective at capturing both local and global features in the input data. They can be used to classify images into different categories, such as animals, objects, or landscapes.
  • Object detection: Transformer models can be used for object detection in images and videos, where they can effectively model the relationships between different elements in the input data. This allows them to identify and classify objects within the input, as well as locate them within the scene.
  • Speech recognition: Transformer models have also been applied to speech recognition tasks, where they can effectively model the long-range dependencies between different phonemes and words in spoken language. This allows them to transcribe spoken language into written text with high accuracy.
  • Reinforcement learning: Transformer models have been used in reinforcement learning tasks, where they can effectively learn to make decisions and take actions in an environment in order to maximize a reward. This has applications in areas such as autonomous robotics and game playing.

Overall, transformer models have demonstrated strong performance on a wide range of tasks in AI and are likely to continue to be an important tool in the field in the future.

A few real-life examples of how transformer models are being used today:

  • Google Translate: One of the most well-known examples of transformer models in action is Google Translate, which uses a transformer-based model to translate text and speech between languages. This allows users to communicate with people in different countries and languages, opening up new opportunities for international business and cultural exchange.
  • Facebook’s News Feed: Transformer models are also used by Facebook to personalize and rank the stories that appear in users’ news feeds. The model considers a variety of factors, such as the user’s past interactions and preferences, as well as the content of the story itself, to determine which stories are most relevant to each user.
  • Self-driving cars: Transformer models are being explored for use in autonomous vehicles, where they can be used to process sensor data and make driving decisions in real time. This has the potential to significantly improve the safety and efficiency of transportation systems.
  • Healthcare: Transformer models are being applied to healthcare tasks, such as diagnosing diseases and predicting patient outcomes. For example, a transformer-based model developed by researchers at MIT was able to accurately diagnose skin cancer from images with a high degree of accuracy.
  • Disaster response: Transformer models have also been used in disaster response scenarios, where they can be used to process and analyze large amounts of data from sensors and other sources to identify patterns and trends that can help predict and mitigate the impact of disasters.

These are just a few examples of the many real-life uses of transformer models in AI. It is likely that we will see even more applications of these powerful models in the future.

Transformers that join vision with language

In 2022, a new vision transformer “LiT: Zero-Shot Transfer with Locked-image Tuning” was released to accomplish to match text to images. This simple yet effective setup provides the best of both worlds: strong image representations from pre-training, plus flexible zero-shot transfer to new tasks via contrastive learning. LiT achieves state-of-the-art zero-shot classification accuracy, significantly closing the gap between the two styles of learning.

Diagram explaining how ultimodal contrastive learning trains models to produce similar representations for closely matched images and texts.
Multimodal contrastive learning trains models to produce similar representations for closely matched images and texts. Source: Google blog.

Contrastive learning

Contrastive learning on image-text data refers to a type of machine learning approach that aims to learn the relationship between images and the corresponding text descriptions of those images. This is typically done by training a model to predict whether a given text description is related to a given image or not.

The resulting matrix after running LiT on these five images (below) and the five input texts (Top left). This has been generated using code from this colab.

To do this, a dataset of images and their corresponding text descriptions is typically used. The model is then trained to maximize the similarity between the image and its corresponding text description, while simultaneously minimizing the similarity between the image and unrelated text descriptions. This process helps the model to learn the relationship between the visual content of an image and the words used to describe it.

Contrastive learning on image-text data has a wide range of applications, including image classification, object detection, and image generation. It can also be used to improve the performance of natural language processing tasks, such as language translation, by providing additional context and information about the visual content of the input.

Overall, contrastive learning on image-text data is a powerful approach for learning the relationship between images and text, and has the potential to improve the performance of a variety of AI tasks.

Contrastive learning models learn representations from “positive” and “negative” examples, such that representations for “positive” examples are similar to each other but different from “negative” examples.

Multimodal contrastive learning applies this to pairs of images and associated texts. An image encoder computes representations from images, and a text encoder does the same for texts. Each image representation is encouraged to be close to the representation of its associated text (“positive”), but distinct from the representation of other texts (“negatives”) in the data, and vice versa. This has typically been done with randomly initialized models (“from scratch”), meaning the encoders have to simultaneously learn representations and how to match them.

LiT tuning bridges this gap: we contrastively train a text model to compute representations well aligned with the powerful ones available from a pre-trained image encoder. Importantly, for this to work well, the image encoder should be “locked“, that is: it should not be updated during training. This may be unintuitive since one usually expects the additional information from further training to increase performance, but we find that locking the image encoder consistently leads to better results.

This matrix is the result of some experiments performed by me, testing the ability to describe people’s facial attributes. The model can weigh on the importance of the tokens. Also, be affected when the input text is long enough to cover multiple “correct” attributes but one of two words is incorrect. The code can be found in this colab.

Conclusion

In summary, computer vision transformers are better than convolutional neural networks for the following reasons:

  • Can handle input of any length, making them well-suited for tasks that require processing large amounts of data, such as high-resolution images or long video sequences.
  • Can model long-range dependencies in the input data through the use of self-attention mechanisms, which is useful for tasks that require understanding the context or relationships between different elements in the input, such as object detection or segmentation.
  • Are not limited by the size or shape of the input data, as they can process inputs of any size or shape.
  • Do not require pre-processing or feature engineering, as they can learn directly from the raw input data.
  • Can effectively capture both local and global features in the input data, making them well-suited for tasks that require a combination of both.

It is important to note that transformer models are not always superior to CNNs for computer vision tasks. CNNs are often faster and more efficient to train and deploy than transformer models, as they have fewer parameters and require fewer computational resources. Additionally, CNNs are often better suited for tasks that require detecting local patterns or features in the input data, such as edge detection or texture classification.

To conclude, transformer models have emerged as a promising approach for a variety of tasks within computer vision, including image classification, object detection, and segmentation. Their ability to handle large amounts of data and model long-range dependencies makes them well-suited for many applications in this field. While there is still much research to be done to fully understand the capabilities and limitations of transformer models for computer vision, it is clear that they have the potential to significantly advance the state of the art in this field.

--

--

Rebeca Sarai G. G.

Computer scientist 👩🏻‍💻 Tech and innovation enthusiast 🇻🇪🇪🇸. You can learn Image processing with me: https://tinyurl.com/Image-Processing-Python