State of Computer Vision 2023: From Vision Transformers to Neural Radiance Fields

Jul 06, 2023

Large language model development (LLM) development is still happening at a rapid pace. At the same time, leaving AI regulation debates aside, LLM news seem to be arriving at a just slightly slower rate than usual.

This is a good opportunity to give the spotlight to computer vision once in a while, discussing the current state of research and development in this field. And this theme also goes nicely with a recap of CVPR 2023 in Vancouver, which was a wonderful conference at probably the nicest conference venue I have attended so far.

The jam-packed (or rather gem-packed) CVPR magazine listing the 2359 papers accepted at CVPR this year.

PS: If the newsletter appears truncated or clipped, that's because some email providers may truncate longer email messages. In this case, you can access the full article at https://magazine.sebastianraschka.com.

Articles & Trends

This year, CVPR 2023 has accepted an impressive total of 2359 papers, resulting in a vast array of posters for participants to explore. As I navigated through the sea of posters and engaged in conversations with attendees and researchers, I found that most research focused on one of the following 4 themes:

Vision Transformers
Generative AI for Vision: Diffusion Models and GANs
NeRF: Neural Radiance Fields
Object Detection and Segmentation

Below, I will be giving a brief introduction to these four subfields. Subsequently, I will also highlight an intriguing paper selected from the conference proceedings, which relates to each field.

(Interested readers can find the full list of accepted papers here.)

1) Vision Transformers

Following in the footsteps of successful language transformers and large language models, vision transformers (ViTs) initially appeared in 2020 via An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.

The main concept behind ViTs is similar to that of language transformers: It uses the same self-attention mechanism in its multihead attention blocks. However, instead of tokenizing words, ViTs are tokenizing images.

As discussed in a previous Ahead of AI article, Understanding Encoder And Decoder LLMs, the original transformer for language modeling (via Attention Is All You Need) was an encoder-decoder architecture. And modern language transformers can be grouped into encoder-only architectures like BERT (often used for classification) and decoder-only architectures like GPT (mostly used for text generation).

The original ViT model resembles encoder-like architectures similar to BERT, encoding embedding image patches. We can then attach a classification head for an image classification task.

Main idea behind vision transformers (ViTs) for classification

Note that ViTs typically have much more parameters than convolutional neural networks (CNNs), and as a rule of thumb, they require more training data to achieve good modeling performance. This is why we usually adopt pretrained ViTs instead of training them from scratch. (Unlike language transformers, ViTs are typically pretrained in supervised, rather than unsupervised or self-supervised, fashion.)

For example, as I recently showed in my CVPR talk on Accelerating PyTorch Model Training Using Mixed-Precision and Fully Sharded Data Parallelism, we can achieve much better classification performance if we start with pretrained transformers rather than training them on the target dataset from scratch.

Training a ViT from scratch versus finetuning a pretrained ViT

ViTs like DeiT and Swin are extremely popular architectures due to their state-of-the-art performance on computer vision tasks.

Still, one major criticism of ViTs is that they are relatively resource-intensive and less efficient than CNNs. This may be among the major contributing factors why ViTs have not seen widespread adoption in practical applications despite being among the hottest research topics in computer vision.

Efficient ViT: Memory Efficient Vision Transformer with Cascaded Group Attention

As mentioned above, ViTs are relatively resource-intensive, which holds them back from more widespread adoption in practice. In the CVPR paper Efficient ViT: Memory Efficient Vision Transformer with Cascaded Group Attention, researchers introduce a new, efficient architecture to address this limitation, making ViTs more suitable for real-time applications.

Annotated figure from https://arxiv.org/abs/2305.07027

The main innovations in this paper include using

a single memory-bound multihead self-attention (MHSA) block between fully connected layers (FC layers);
cascaded group attention.

Let's start with point 1: the MHSA sandwiched between FC layers. According to other studies, memory inefficiencies are mainly caused by MHSA rather than FC layers (Ivanov et al. 2020 and Kao et al. 2021). To address this, the researchers use additional FC layers to allow more communication between feature channels. But they reduce the number of attention layers 1 in contrast to, e.g., the popular Swin Transformer. In addition, they also shrink the channel dimensions for the Q (query) and K (key) matrices in MHSA.

Key architecture changes in EfficientViT (annoated figure from https://arxiv.org/abs/2305.07027)

The cascaded group attention here is inspired by group convolutions, which were used in AlexNet, for example.

Group convolutions, also known as channel-wise or depthwise convolutions, are a variation of the standard convolution operation. In contrast to regular convolutions, group convolution divides the input channels into several groups. Each group performs its convolution operations independently. For instance, if an input has 64 channels, and the grouping parameter is set to 2, the input would be split into two groups of 32 channels each, and these groups would then be convolved independently. This approach not only reduces computational cost but can also increase model diversity by enforcing a kind of regularization, leading to potentially improved modeling performance in some tasks.

Impressively, EfficientViT does not only run up to 6x faster than a MobileViT, but it also runs 2.3 on an iPhone 11 at similar accuracy.

As a bottom line, due to their good predictive performance, ViTs are here to stay (at least as a popular research target). Now, with efficiency improvements, I think we will also see a more widespread adoption in production in the upcoming years.

2) Generative AI for Vision: Diffusion Models

Thanks to open-source models like Stable Diffusion, which reimplemented the model proposed in the High-Resolution Image Synthesis with Latent Diffusion Models paper, most are probably familiar with diffusion models by now. If not, here's a short rundown.

Diffusion models are fundamentally generative models. During training, random noise is added to the input data over a series of steps to perturb it. Then, in a reverse process, the model learns how to invert the noise and denoise the output to recover the original data. During inference, a diffusion model can then be used to generate new data from noise inputs.

(Of course, there are many more generative AI models besides diffusion models, as I discuss in Chapter 9 of Machine Learning Q and AI.)

Most diffusion models use CNNs and employ a U-Net based on a CNN. U-Net is known for its distinctive "U" shape, originally used for image segmentation tasks. It consists of an encoder part to capture context and a decoder that allows precise localization.

Diffusion models traditionally repurposed U-Net to model the conditional distribution at each diffusion step, providing the mapping from a noise distribution to the target data distribution. Or, in other words, a U-Net is be used to predict the noise to be added or subtracted at each step of the diffusion process. U-Nets are particularly attractive here because they can combine local and global information.

Latent diffusion model (known as Stable Diffusion) via https://arxiv.org/abs/2112.10752

One interesting question is whether it's beneficial to swap the CNN backbone with a ViT, which is something researchers explored in the paper below.

All are Worth Words: A ViT Backbone for Diffusion Models

In this All are Worth Words: A ViT Backbone for Diffusion Models paper, researchers try to swap the convolutional U-Net backbone in a diffusion model with a ViT, which they refer to as U-ViT. Note that this is not the first attempt next to previous methods like GenViT and VQ-Diffusion, for example. However, the proposed U-ViT seems to be the best attempt so far.

The main contributions are additional "long" skip connections and an additional convolutional block before the output. Note that the "long" skip connections are in addition to the regular skip connections in the transformer blocks.

Similar to regular ViTs, explained in the previous section, the inputs are the "patchified" images, plus an additional time token (for the diffusion step) and condition token (for class-conditional generation).

The new U-ViT architecture introduced in https://arxiv.org/abs/2209.12152

The figure below shows an ablation study to justify these various design choices (nice!).

The researchers evaluated the new architecture on all three main tasks:

Unconditional image generation;
Class-conditional image generation;
Text-to-image generation.

On unconditional image generation, the new U-ViT diffusion model is competitive but worse than other diffusion models. On conditional image generation tasks, the new U-ViT diffusion model rivals the best GAN and outperforms other diffusion models. And on Text-to-image generation, it outperforms other models (GANs and diffusion models) trained on the same dataset.

U-ViT conditional and text-to-image generation via https://arxiv.org/abs/2209.12152

It's refreshing to see that the paper was accepted even though the model did not outperform all other models on several tasks. It's a new model with new architectural ideas, and it will be interesting to see where ViTs take diffusion models in the upcoming years.

(As a very small caveat, the text-to-image training and evaluation datasets were relatively small compared to some other models. And it's unclear how the computational performance of this architecture compares to previous CNN-based diffusion models. However, I assume the main motivation behind smaller datasets is that it's an academic project (versus a major project at a big tech company) so it might be resource related. Even with using the smaller, more established benchmark MS-COCO dataset, there is nothing wrong.

3) Neural Radiance Fields (NeRF)

In short, Neural Radiance Fields (NeRF) is a relatively new method (initially proposed in 2020) for synthesizing novel views of complex 3D scenes from a set of 2D images: NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. This is accomplished by modeling the scenes as volumetric fields of neural network-generated colors and densities. The NeRF model, trained on a small set of images taken from different viewpoints, learns to output the color and opacity of a point in 3D space given its 3D coordinates and viewing direction.

Illustration of new 3D renderings via NeRFs, learned from 2D input images. (Figure from https://arxiv.org/abs/2003.08934)

Since there's a lot of technical jargon in Neural Radiance Fields, let's take a brief step back and derive where these terms come from.

First, we need to define neural fields, a fancy term for describing a neural network that serves as a trainable (or parameterizable) function that generates a "field" of output values across the input space, for example, a 3D scene.

The idea is to get the network to overfit to a specific 3D scene, which can then be used to generate novel views of the scene with high accuracy. This is somewhat similar to the concept of spline interpolation in numerical analysis, where a curve is "overfit" to a set of data points to generate a smooth and precise representation of the underlying function.

The algorithm is summarized in the figure below, taken from the excellent Neural Fields in Visual Computing and Beyond paper by Xie et al.

A typical feed-forward neural field algorithm illustrated in https://arxiv.org/abs/2111.11426

In the case of NeRF, the network learns to output color and opacity values for a given point in space, conditioned on the viewing direction, which allows for creating a realistic 3D scene representation. Since NeRF predicts the color and intensity of light at every point along a specific viewing direction in 3D space, it essentially models a field of radiance values, hence the term Radiance Field. By the way, note that we technically refer to a NeRF representation as 5D because it considers three spatial dimensions (x, y, z coordinates in 3D space) and two additional dimensions representing the viewing direction defined by two angles, theta and phi.

NeRFs are interesting because they have many potential applications, basically everything related to high-quality 3D reconstructions. This includes 3D scanning, virtual and augmented reality, and 3D modeling and motion capturing in movies and video games.

ABLE-NeRF: Attention-Based Rendering with Learnable Embeddings for Neural Radiance Field

As described above, the basic idea behind NeRF is to model a 3D scene as a continuous volume of radiance values. Instead of storing explicit 3D objects or voxel grids, the 3D scene is stored as a function (a neural network) that maps 3D coordinates to colors (RGB values) and densities.

This network is trained using a set of 2D images of the scene taken from different viewpoints. And then, when rendering a scene, NeRF takes as input a 3D coordinate and a viewing direction (a camera ray or beam), and it outputs the RGB color value and the volume density at that location.

This way, NeRFs can generate novel views of objects with photo-realistic qualities. However, one of the shortcomings of NeRFs is that glossy objects often look blurry, and the colors of translucent objects are often murky.

In the paper ABLE-NeRF: Attention-Based Rendering with Learnable Embeddings for Neural Radiance Field, researchers address these shortcomings and improve the visual quality of translucent and glossy surfaces by introducing a self-attention-based framework and learnable embeddings.

Improved realism via ABLE-NeRF (pictures from https://arxiv.org/abs/2303.13817)

One additional detail worth mentioning about regular NeRFs is that they use a volumetric rendering equation derived from the physics of light transport. The basic idea in volumetric rendering is to integrate the contributions of all points along a ray of light as it travels through a 3D scene. The rendering process involves sampling points along each ray, querying the neural network for the radiance and density values at each point, and then integrating these to compute the final color of the pixel.

The proposed ABLE-NeRF diverges from such physics-based volumetric rendering and uses an attention-based network instead, which is responsible for determining a ray's color. Further, they added masked attention so that specific points can't attend occluded points to incorporate the physical restrictions of the real world. Lastly, they also add another transformer module for learnable embeddings to capture the view-dependent appearance caused by indirect illumination. (Ablations studies show that the learnable embeddings are important for enhancing visual quality.) The big picture of the proposed method is summarized in the annotated figure below.

Annotated figure of the ABLE-NeRF method from https://arxiv.org/abs/2303.13817

Based on the renderings comparing ABLE-NeRF to a reference NeRF (shown above/earlier), the results look truly impressive. Of course, the results are not perfect, as one can see when comparing the renderings to the ground truth, but ABLE-NeRF is roughly 90% there. It's pretty impressive how far we have come regarding 3D scene reconstruction from images.

4) Object Detection and Segmentation

Object detection and segmentation are classic computer vision tasks, so it probably doesn't require a lengthy introduction. However, to briefly highlight the difference between these two tasks: object detection about predicting bounding boxes and the associated labels; segmentation classifies each pixel to distinguish between foreground and background objects.

Object detection (top) and segmentation (bottom). Figures from YOLO paper (https://arxiv.org/abs/1506.02640) and Mask R-CNN paper (https://arxiv.org/abs/1703.06870v3)

Furthermore, we can distinguish between three segmentation categories, which I outlined below.

1. Semantic segmentation. This technique labels each pixel in an image with a class of objects (e.g., car, dog, or house). However, it doesn't distinguish between different instances of an object. For example, if there are three cars in an image, all of them would be labeled as "car" without distinguishing between car 1, car 2, and car 3.

2. Instance segmentation. This technique takes semantic segmentation a step further by differentiating between individual instances of an object. So, in the same scenario of an image with three cars, instance segmentation would separately identify each one (e.g., car 1, car 2, and car 3). So, this technique not only classifies each pixel but also identifies distinct object instances, giving it a distinct object ID.

3. Panoptic segmentation. This technique combines both semantic segmentation and instance segmentation. In panoptic segmentation, each pixel is assigned a semantic label as well as an instance ID. To differentiate panoptic segmentation a bit better from instance segmentation, the latter focuses on identifying each instance of each recognized object in the scene. I.e., instance segmentation is concerned primarily with "things" - identifiable objects like cars, people, or animals. On the other hand, panoptic segmentation aims to provide a comprehensive understanding of the scene by labeling every pixel with either a class label (for "stuff", uncountable regions like sky, grass, etc.) or an instance label (for "things", countable objects like cars, people, etc.).

Figure from https://arxiv.org/abs/1801.00868

Some popular algorithms for object detection include R-CNN and its variants (Fast R-CNN, Faster R-CNN), YOLO (You Only Look Once), and SSD (Single Shot MultiBox Detector).

Models for segmentation include U-Net (discussed in the diffusion model section earlier), Mask R-CNN (a variant of Faster R-CNN with segmentation capabilities), and DeepLab, among others.

Both object detection and segmentation are important for applications like self-driving cars, medical imaging, and, of course, video surveillance. Note that object detection and segmentation are also often used together, where object detection can provide a coarse localization of objects, which segmentation algorithms can then be finetuned for more precise boundary and shape analysis.

Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation

The Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation paper is an extension of the DINO method (DETR with Improved deNoising anchOr boxes).

And DETR (DEtection TRansformer) is an end-to-end object detection model introduced by Facebook AI in the paper End-to-End Object Detection with Transformers, which (surprise, surprise) uses a transformer architecture. Unlike traditional object detection models, DETR treats object detection as a direct set prediction problem, eliminating the need for handcrafted anchor boxes and non-maximum suppression procedures and providing a simpler and more flexible approach to object detection.

While CNN-based methods nowadays unify object detection (a region-specific task) and segmentation (a pixel-level task) to improve the overall performance on both tasks (basically a form of multitask learning), this is not the case for transformer-based object detection and segmentation systems.

However, this is what the proposed Mask DINO system aims to address. By extending DINO, Mask DINO outperforms all existing object detection and segmentation systems. The idea behind Mask DINO is summarized in the annotated figure below.

Annotated figure from the Mask DINO paper, https://arxiv.org/abs/2206.02777

This magazine is a personal passion project that does not offer direct compensation. However, for those who wish to support me, please consider purchasing a copy of one of my books. If you find them insightful and beneficial, please feel free to recommend them to your friends and colleagues.

Machine Learning with PyTorch and Scikit-Learn, Machine Learning Q and AI, and Build a Large Language Model (from Scratch)

Your support means a great deal! Thank you!

Adil Sheraz

Jul 11, 2023

Nicely written, learned a lot. Thanks for sharing

Expand full comment

Byndnglsh

Jul 9, 2023

Looking forward to the next "Research Highlights in Three Sentences or Less."

As I have little interest in computer vision, I didn't get anything out of this issue. Frankly, I have only a passing interest in generative art -- broadly defined -- as well. Multimodal is great, but multimodal doesn't necessarily have anything to do with generative art, CV, ...

6 more comments...