'Dive into Deep Learning/D2L Computer Vision'에 해당되는 글 13건

2023.08.21 D2L - 14.12. Neural Style Transfer 1
2023.08.21 D2L - 14.11. Fully Convolutional Networks
2023.08.21 D2L - 14.10. Transposed Convolution 1
2023.08.20 D2L - 14.9. Semantic Segmentation and the Dataset
2023.08.20 D2L - 14.8. Region-based CNNs (R-CNNs)
2023.08.19 D2L - 14.7. Single Shot Multibox Detection
2023.08.19 D2L - 14.6. The Object Detection Dataset
2023.08.19 D2L - 14.5. Multiscale Object Detection
2023.08.19 D2L - 14.4. Anchor Boxes
2023.08.19 D2L - 14.3. Object Detection and Bounding Boxes

Dive into Deep Learning/D2L Computer Vision

D2L - 14.12. Neural Style Transfer

2023. 8. 21. 04:49 | Posted by 솔웅

https://d2l.ai/chapter_computer-vision/neural-style.html

14.12. Neural Style Transfer — Dive into Deep Learning 1.0.3 documentation

d2l.ai

14.12. Neural Style Transfer

If you are a photography enthusiast, you may be familiar with the filter. It can change the color style of photos so that landscape photos become sharper or portrait photos have whitened skins. However, one filter usually only changes one aspect of the photo. To apply an ideal style to a photo, you probably need to try many different filter combinations. This process is as complex as tuning the hyperparameters of a model.

사진 애호가라면 필터에 익숙할 것입니다. 사진의 색상 스타일을 변경하여 풍경 사진을 더 선명하게 하거나 인물 사진의 피부를 하얗게 만들 수 있습니다. 그러나 하나의 필터는 일반적으로 사진의 한 측면만 변경합니다. 사진에 이상적인 스타일을 적용하려면 아마도 다양한 필터 조합을 시도해야 할 것입니다. 이 프로세스는 모델의 하이퍼파라미터 조정만큼 복잡합니다.

In this section, we will leverage layerwise representations of a CNN to automatically apply the style of one image to another image, i.e., style transfer (Gatys et al., 2016). This task needs two input images: one is the content image and the other is the style image. We will use neural networks to modify the content image to make it close to the style image in style. For example, the content image in Fig. 14.12.1 is a landscape photo taken by us in Mount Rainier National Park in the suburbs of Seattle, while the style image is an oil painting with the theme of autumn oak trees. In the output synthesized image, the oil brush strokes of the style image are applied, leading to more vivid colors, while preserving the main shape of the objects in the content image.

이 섹션에서는 CNN의 레이어별 표현을 활용하여 한 이미지의 스타일을 다른 이미지에 자동으로 적용합니다(예: 스타일 전송)(Gatys et al., 2016). 이 작업에는 두 개의 입력 이미지가 필요합니다. 하나는 콘텐츠 이미지이고 다른 하나는 스타일 이미지입니다. 신경망을 사용하여 콘텐츠 이미지를 스타일에서 스타일 이미지에 가깝게 수정합니다. 예를 들어 그림 14.12.1의 콘텐츠 이미지는 시애틀 교외의 레이니어산 국립공원에서 저희가 촬영한 풍경 사진이고, 스타일 이미지는 가을 참나무를 주제로 한 유화입니다. 출력된 합성 이미지에서는 스타일 이미지의 오일 브러시 스트로크가 적용되어 콘텐츠 이미지에서 오브젝트의 주요 형태를 유지하면서 보다 선명한 색상을 연출합니다.

Fig. 14.12.1  Given content and style images, style transfer outputs a synthesized image.

14.12.1. Method

Fig. 14.12.2 illustrates the CNN-based style transfer method with a simplified example. First, we initialize the synthesized image, for example, into the content image. This synthesized image is the only variable that needs to be updated during the style transfer process, i.e., the model parameters to be updated during training. Then we choose a pretrained CNN to extract image features and freeze its model parameters during training. This deep CNN uses multiple layers to extract hierarchical features for images. We can choose the output of some of these layers as content features or style features. Take Fig. 14.12.2 as an example. The pretrained neural network here has 3 convolutional layers, where the second layer outputs the content features, and the first and third layers output the style features.

그림 14.12.2는 CNN 기반의 style transfer 방법을 단순화한 예와 함께 보여준다. 먼저, 예를 들어 합성 이미지를 콘텐츠 이미지로 초기화합니다. 이 합성 이미지는 스타일 전송 프로세스 중에 업데이트해야 하는 유일한 변수, 즉 훈련 중에 업데이트할 모델 매개변수입니다. 그런 다음 사전 훈련된 CNN을 선택하여 훈련 중에 이미지 특징을 추출하고 모델 매개변수를 동결합니다. 이 심층 CNN은 여러 계층을 사용하여 이미지의 계층적 특징을 추출합니다. 이러한 레이어 중 일부의 출력을 콘텐츠 기능 또는 스타일 기능으로 선택할 수 있습니다. 그림 14.12.2를 예로 들어 보겠습니다. 여기서 사전 훈련된 신경망에는 3개의 컨볼루션 레이어가 있습니다. 여기서 두 번째 레이어는 콘텐츠 기능을 출력하고 첫 번째 및 세 번째 레이어는 스타일 기능을 출력합니다.

Fig. 14.12.2  CNN-based style transfer process. Solid lines show the direction of forward propagation and dotted lines show backward propagation.

Next, we calculate the loss function of style transfer through forward propagation (direction of solid arrows), and update the model parameters (the synthesized image for output) through backpropagation (direction of dashed arrows). The loss function commonly used in style transfer consists of three parts: (i) content loss makes the synthesized image and the content image close in content features; (ii) style loss makes the synthesized image and style image close in style features; and (iii) total variation loss helps to reduce the noise in the synthesized image. Finally, when the model training is over, we output the model parameters of the style transfer to generate the final synthesized image.

다음으로 순방향 전파(실선 화살표 방향)를 통해 스타일 전송의 손실 함수를 계산하고 역전파(파선 화살표 방향)를 통해 모델 매개변수(출력용 합성 이미지)를 업데이트합니다. 스타일 전송에서 일반적으로 사용되는 손실 함수는 세 부분으로 구성됩니다. (i) 콘텐츠 손실은 합성 이미지와 콘텐츠 이미지를 콘텐츠 기능에 가깝게 만듭니다. (ii) 스타일 손실은 합성 이미지와 스타일 이미지를 스타일 특징에 가깝게 만듭니다. (iii) 전체 변형 손실은 합성 이미지의 노이즈를 줄이는 데 도움이 됩니다. 마지막으로 모델 학습이 끝나면 style transfer의 모델 파라미터를 출력하여 최종 합성 이미지를 생성합니다.

In the following, we will explain the technical details of style transfer via a concrete experiment.

다음에서는 구체적인 실험을 통해 스타일 이전의 기술적 세부 사항을 설명합니다.

14.12.2. Reading the Content and Style Images

First, we read the content and style images. From their printed coordinate axes, we can tell that these images have different sizes.

먼저 콘텐츠와 스타일 이미지를 읽습니다. 인쇄된 좌표축에서 이러한 이미지의 크기가 다르다는 것을 알 수 있습니다.

%matplotlib inline
import torch
import torchvision
from torch import nn
from d2l import torch as d2l

d2l.set_figsize()
content_img = d2l.Image.open('../img/rainier.jpg')
d2l.plt.imshow(content_img);

위 코드는 딥러닝 모델을 사용하여 이미지를 분석하고 시각화하는 예제입니다.

%matplotlib inline은 Jupyter Notebook에서 Matplotlib 그래프를 인라인으로 표시하는 명령입니다. 그래프를 보다 쉽게 시각화하기 위해 사용됩니다.
import torch는 파이토치 라이브러리를 임포트합니다.
import torchvision은 torchvision 라이브러리를 임포트합니다. torchvision은 컴퓨터 비전 관련 작업을 위한 데이터셋, 모델 등을 제공합니다.
from torch import nn은 파이토치의 nn 모듈을 가져옵니다. 이 모듈은 신경망 구성 요소를 정의하는 데 사용됩니다.
from d2l import torch as d2l은 D2L 라이브러리에서 torch 모듈을 가져와 d2l이란 이름으로 사용한다는 것을 의미합니다.
d2l.set_figsize()는 Matplotlib 그래프의 크기를 설정하는 함수입니다.
content_img = d2l.Image.open('../img/rainier.jpg')는 'rainier.jpg' 이미지 파일을 열어 content_img에 저장합니다. d2l.Image.open은 D2L 라이브러리의 이미지 처리 함수로, 이미지 파일을 열고 처리하는 기능을 제공합니다.
d2l.plt.imshow(content_img)는 content_img를 Matplotlib을 사용하여 이미지로 표시합니다. 이를 통해 'rainier.jpg' 이미지가 출력됩니다.

즉, 이 코드는 'rainier.jpg' 이미지를 열어서 Matplotlib을 사용하여 화면에 표시하는 예제입니다.

style_img = d2l.Image.open('../img/autumn-oak.jpg')
d2l.plt.imshow(style_img);

위 코드는 스타일 이미지를 열고 화면에 표시하는 예제입니다.

style_img = d2l.Image.open('../img/autumn-oak.jpg')는 'autumn-oak.jpg' 이미지 파일을 열어 style_img에 저장합니다. d2l.Image.open은 D2L 라이브러리의 이미지 처리 함수로, 이미지 파일을 열고 처리하는 기능을 제공합니다.
d2l.plt.imshow(style_img)는 style_img를 Matplotlib을 사용하여 이미지로 표시합니다. 이를 통해 'autumn-oak.jpg' 이미지가 출력됩니다.

즉, 이 코드는 'autumn-oak.jpg' 이미지를 열어서 Matplotlib을 사용하여 화면에 표시하는 예제입니다.

14.12.3. Preprocessing and Postprocessing

Below, we define two functions for preprocessing and postprocessing images. The preprocess function standardizes each of the three RGB channels of the input image and transforms the results into the CNN input format. The postprocess function restores the pixel values in the output image to their original values before standardization. Since the image printing function requires that each pixel has a floating point value from 0 to 1, we replace any value smaller than 0 or greater than 1 with 0 or 1, respectively.

아래에서 이미지 전처리 및 후처리를 위한 두 가지 함수를 정의합니다. 전처리 함수는 입력 이미지의 3개 RGB 채널 각각을 표준화하고 결과를 CNN 입력 형식으로 변환합니다. 후처리 기능은 출력 이미지의 픽셀 값을 표준화 이전의 원래 값으로 복원합니다. 이미지 인쇄 기능은 각 픽셀이 0에서 1까지의 부동 소수점 값을 가져야 하므로 0보다 작은 값 또는 1보다 큰 값을 각각 0 또는 1로 바꿉니다.

rgb_mean = torch.tensor([0.485, 0.456, 0.406])
rgb_std = torch.tensor([0.229, 0.224, 0.225])

def preprocess(img, image_shape):
    transforms = torchvision.transforms.Compose([
        torchvision.transforms.Resize(image_shape),
        torchvision.transforms.ToTensor(),
        torchvision.transforms.Normalize(mean=rgb_mean, std=rgb_std)])
    return transforms(img).unsqueeze(0)

def postprocess(img):
    img = img[0].to(rgb_std.device)
    img = torch.clamp(img.permute(1, 2, 0) * rgb_std + rgb_mean, 0, 1)
    return torchvision.transforms.ToPILImage()(img.permute(2, 0, 1))

위 코드는 이미지 전처리와 후처리를 수행하는 함수들을 정의하는 예제입니다.

rgb_mean과 rgb_std는 이미지 전처리 및 후처리에 사용될 평균과 표준편차를 정의한 텐서입니다. 이 값들은 일반적으로 이미지를 정규화(normalize)하는 데 사용됩니다.
preprocess(img, image_shape) 함수는 이미지를 주어진 image_shape로 리사이징하고, ToTensor 변환과 Normalize 변환을 순차적으로 적용하여 이미지를 전처리합니다. 마지막에 .unsqueeze(0)를 사용하여 배치 차원을 추가합니다. 이 함수는 이미지를 모델의 입력으로 사용하기 위해 변환하는데 사용됩니다.
postprocess(img) 함수는 모델의 출력 이미지를 원본 이미지 형식으로 복원하는데 사용됩니다. 먼저, 이미지 텐서에서 배치 차원을 제거한 다음, 평균과 표준편차를 사용하여 역정규화(normalize)를 수행합니다. 마지막으로 ToPILImage 변환을 사용하여 텐서를 PIL 이미지로 변환합니다. 이 함수는 모델의 출력을 실제 이미지로 변환하여 시각화하는데 사용됩니다.

요약하면, 이 코드는 이미지를 모델 입력으로 전처리하거나, 모델의 출력을 실제 이미지로 후처리하는 함수들을 정의하는 예제입니다.

14.12.4. Extracting Features

We use the VGG-19 model pretrained on the ImageNet dataset to extract image features (Gatys et al., 2016).

우리는 ImageNet 데이터 세트에서 사전 훈련된 VGG-19 모델을 사용하여 이미지 특징을 추출합니다(Gatys et al., 2016).

pretrained_net = torchvision.models.vgg19(pretrained=True)

위 코드는 torchvision 라이브러리를 사용하여 미리 학습된 VGG-19 모델을 가져오는 예제입니다.

torchvision.models.vgg19(pretrained=True)은 VGG-19 모델을 가져오는 함수입니다. 이 함수는 미리 학습된 가중치를 가진 VGG-19 모델을 반환합니다. pretrained=True로 설정하면 사전 학습된 가중치가 자동으로 로드됩니다.

VGG-19은 컨볼루션 신경망 아키텍처로, 주로 이미지 분류 작업에 사용됩니다. 사전 학습된 모델을 사용하면 이미지 특징 추출 및 전이 학습에 유용한 특성을 제공할 수 있습니다.

Downloading: "https://download.pytorch.org/models/vgg19-dcbb9e9d.pth" to /home/ci/.cache/torch/hub/checkpoints/vgg19-dcbb9e9d.pth
100%|██████████| 548M/548M [00:02<00:00, 213MB/s]

In order to extract the content features and style features of the image, we can select the output of certain layers in the VGG network. Generally speaking, the closer to the input layer, the easier to extract details of the image, and vice versa, the easier to extract the global information of the image. In order to avoid excessively retaining the details of the content image in the synthesized image, we choose a VGG layer that is closer to the output as the content layer to output the content features of the image. We also select the output of different VGG layers for extracting local and global style features. These layers are also called style layers. As mentioned in Section 8.2, the VGG network uses 5 convolutional blocks. In the experiment, we choose the last convolutional layer of the fourth convolutional block as the content layer, and the first convolutional layer of each convolutional block as the style layer. The indices of these layers can be obtained by printing the pretrained_net instance.

이미지의 콘텐츠 특징과 스타일 특징을 추출하기 위해 VGG 네트워크에서 특정 레이어의 출력을 선택할 수 있습니다. 일반적으로 입력 레이어에 가까울수록 이미지의 세부 정보를 추출하기 쉽고 그 반대의 경우도 이미지의 전체 정보를 추출하기 쉽습니다. 합성된 이미지에서 콘텐츠 이미지의 세부 사항이 과도하게 유지되는 것을 피하기 위해 출력에 가까운 VGG 레이어를 콘텐츠 레이어로 선택하여 이미지의 콘텐츠 특징을 출력합니다. 또한 로컬 및 글로벌 스타일 기능을 추출하기 위해 다양한 VGG 레이어의 출력을 선택합니다. 이러한 레이어를 스타일 레이어라고도 합니다. 섹션 8.2에서 언급했듯이 VGG 네트워크는 5개의 컨볼루션 블록을 사용합니다. 실험에서는 네 번째 컨볼루션 블록의 마지막 컨볼루션 레이어를 콘텐츠 레이어로 선택하고 각 컨볼루션 블록의 첫 번째 컨볼루션 레이어를 스타일 레이어로 선택합니다. 이러한 레이어의 인덱스는 pretrained_net 인스턴스를 인쇄하여 얻을 수 있습니다.

style_layers, content_layers = [0, 5, 10, 19, 28], [25]

위 코드는 스타일 전이 네트워크를 구성하는 데 사용되는 레이어의 인덱스를 지정하는 예제입니다.

style_layers: 스타일 이미지의 스타일 특징을 포착하는 데 사용되는 레이어의 인덱스 리스트입니다. VGG-19 모델에서는 이들 인덱스에 해당하는 레이어는 스타일 정보를 추출하는데 적절합니다. 스타일 이미지의 다양한 특징을 반영하기 위해 여러 개의 레이어를 선택합니다.
content_layers: 내용 이미지의 내용 특징을 포착하는 데 사용되는 레이어의 인덱스 리스트입니다. VGG-19 모델에서는 이들 인덱스에 해당하는 레이어가 내용 정보를 추출하는데 적절합니다. 일반적으로 스타일 이미지와는 달리 내용 이미지의 특징은 더 깊은 레이어에서 추출됩니다.

이렇게 선택한 스타일 레이어와 내용 레이어의 인덱스는 스타일 전이 네트워크에서 각각의 이미지로부터 추출한 스타일과 내용의 특징을 활용하여 새로운 이미지를 생성하는데 사용됩니다.

When extracting features using VGG layers, we only need to use all those from the input layer to the content layer or style layer that is closest to the output layer. Let’s construct a new network instance net, which only retains all the VGG layers to be used for feature extraction.

VGG 레이어를 사용하여 피처를 추출할 때 입력 레이어에서 콘텐츠 레이어 또는 출력 레이어에 가장 가까운 스타일 레이어까지 모두 사용하면 됩니다. 기능 추출에 사용할 모든 VGG 레이어만 유지하는 새로운 네트워크 인스턴스 네트워크를 구성해 보겠습니다.

net = nn.Sequential(*[pretrained_net.features[i] for i in
                      range(max(content_layers + style_layers) + 1)])

위 코드는 스타일 전이 네트워크를 구성하기 위해 미리 학습된 VGG-19 모델에서 특정 레이어들을 추출하여 Sequential 모듈로 묶는 과정을 수행합니다.

pretrained_net.features: VGG-19 모델의 features 부분은 컨볼루션 레이어들을 포함하는 Sequential 모듈입니다. 이 부분이 입력 이미지를 컨볼루션 레이어들을 통해 특성으로 변환하는 부분입니다.
content_layers와 style_layers: 위에서 설명한 것처럼 내용과 스타일을 추출할 레이어들의 인덱스 리스트입니다.
range(max(content_layers + style_layers) + 1): 선택한 내용 레이어와 스타일 레이어의 최대 인덱스까지의 범위를 생성합니다. 이 범위는 VGG-19 모델의 features 모듈에서 추출할 레이어들의 인덱스 범위를 의미합니다.

nn.Sequential을 사용하여 pretrained_net의 features 모듈에서 선택한 인덱스 범위에 해당하는 레이어들을 추출하여 순차적으로 연결한 net을 생성합니다. 이렇게 생성된 net은 스타일 전이 네트워크에서 입력 이미지를 특성으로 변환하는 역할을 수행합니다.

Given the input X, if we simply invoke the forward propagation net(X), we can only get the output of the last layer. Since we also need the outputs of intermediate layers, we need to perform layer-by-layer computation and keep the content and style layer outputs.

입력 X가 주어지면 순방향 전파 net(X)를 호출하기만 하면 마지막 레이어의 출력만 얻을 수 있습니다. 중간 레이어의 출력도 필요하므로 레이어별 계산을 수행하고 콘텐츠 및 스타일 레이어 출력을 유지해야 합니다.

def extract_features(X, content_layers, style_layers):
    contents = []
    styles = []
    for i in range(len(net)):
        X = net[i](X)
        if i in style_layers:
            styles.append(X)
        if i in content_layers:
            contents.append(X)
    return contents, styles

위 코드는 입력 이미지를 스타일 전이 네트워크를 통해 특성으로 추출하는 함수를 정의하는 부분입니다.

X: 입력 이미지를 나타내는 텐서입니다.
content_layers: 내용을 추출할 레이어들의 인덱스 리스트입니다.
style_layers: 스타일을 추출할 레이어들의 인덱스 리스트입니다.

함수는 입력 이미지 X를 net에 통과시켜 가면 각 레이어에서의 출력을 계산합니다. 만약 현재 레이어의 인덱스가 style_layers에 속한다면, 해당 레이어의 출력을 스타일을 추출하는 데 사용될 스타일 리스트인 styles에 추가합니다. 마찬가지로 현재 레이어의 인덱스가 content_layers에 속한다면, 해당 레이어의 출력을 내용을 추출하는 데 사용될 내용 리스트인 contents에 추가합니다.

따라서 함수의 반환값은 내용과 스타일을 추출한 특성을 나타내는 리스트인 contents와 styles입니다.

Two functions are defined below: the get_contents function extracts content features from the content image, and the get_styles function extracts style features from the style image. Since there is no need to update the model parameters of the pretrained VGG during training, we can extract the content and the style features even before the training starts. Since the synthesized image is a set of model parameters to be updated for style transfer, we can only extract the content and style features of the synthesized image by calling the extract_features function during training.

아래에 두 가지 함수가 정의되어 있습니다. get_contents 함수는 콘텐츠 이미지에서 콘텐츠 특징을 추출하고 get_styles 함수는 스타일 이미지에서 스타일 특징을 추출합니다. 훈련 중에 사전 훈련된 VGG의 모델 매개변수를 업데이트할 필요가 없기 때문에 훈련이 시작되기 전에 콘텐츠와 스타일 특징을 추출할 수 있습니다. 합성 이미지는 스타일 전달을 위해 업데이트할 모델 매개변수의 집합이므로 훈련 중에 extract_features 함수를 호출해야만 합성 이미지의 콘텐츠 및 스타일 특징을 추출할 수 있습니다.

def get_contents(image_shape, device):
    content_X = preprocess(content_img, image_shape).to(device)
    contents_Y, _ = extract_features(content_X, content_layers, style_layers)
    return content_X, contents_Y

def get_styles(image_shape, device):
    style_X = preprocess(style_img, image_shape).to(device)
    _, styles_Y = extract_features(style_X, content_layers, style_layers)
    return style_X, styles_Y

위 코드는 입력 이미지에서 내용과 스타일을 추출하는 함수들을 정의하는 부분입니다.

get_contents(image_shape, device): 이 함수는 내용 이미지로부터 내용을 추출하는 역할을 합니다.
- image_shape: 이미지의 크기를 지정합니다.
- device: 사용할 디바이스를 지정합니다.
함수는 content_img를 preprocess 함수를 사용하여 전처리하고 content_layers와 style_layers를 사용하여 extract_features 함수를 통해 내용 특성을 추출합니다. 반환값으로는 전처리된 내용 이미지 content_X와 추출된 내용 특성 리스트 contents_Y를 반환합니다.
get_styles(image_shape, device): 이 함수는 스타일 이미지로부터 스타일을 추출하는 역할을 합니다.
- image_shape: 이미지의 크기를 지정합니다.
- device: 사용할 디바이스를 지정합니다.
함수는 style_img를 preprocess 함수를 사용하여 전처리하고 content_layers와 style_layers를 사용하여 extract_features 함수를 통해 스타일 특성을 추출합니다. 반환값으로는 전처리된 스타일 이미지 style_X와 추출된 스타일 특성 리스트 styles_Y를 반환합니다.

14.12.5. Defining the Loss Function

Now we will describe the loss function for style transfer. The loss function consists of the content loss, style loss, and total variation loss.

이제 스타일 전송을 위한 손실 함수에 대해 설명하겠습니다. 손실 함수는 콘텐츠 손실, 스타일 손실 및 총 변형 손실로 구성됩니다.

14.12.5.1. Content Loss

Similar to the loss function in linear regression, the content loss measures the difference in content features between the synthesized image and the content image via the squared loss function. The two inputs of the squared loss function are both outputs of the content layer computed by the extract_features function.

선형 회귀의 손실 함수와 유사하게 콘텐츠 손실은 제곱 손실 함수를 통해 합성 이미지와 콘텐츠 이미지 간의 콘텐츠 특징 차이를 측정합니다. 제곱 손실 함수의 두 입력은 모두 extract_features 함수로 계산된 콘텐츠 계층의 출력입니다.

def content_loss(Y_hat, Y):
    # We detach the target content from the tree used to dynamically compute
    # the gradient: this is a stated value, not a variable. Otherwise the loss
    # will throw an error.
    return torch.square(Y_hat - Y.detach()).mean()

위 코드는 내용 손실 함수를 정의하는 부분입니다.

content_loss(Y_hat, Y) 함수는 추출된 내용 특성 Y_hat와 목표 내용 특성 Y 사이의 손실을 계산합니다. Y를 detach() 함수를 사용하여 연결된 계산 그래프에서 분리하여 계산하므로, 이것은 변수가 아닌 값으로 취급되어 경사도 계산 시 에러를 방지할 수 있습니다. 함수 내에서는 torch.square() 함수를 사용하여 제곱을 계산하고, 그 후에 평균을 계산하여 내용 손실을 반환합니다.

14.12.5.2. Style Loss

Style loss, similar to content loss, also uses the squared loss function to measure the difference in style between the synthesized image and the style image. To express the style output of any style layer, we first use the extract_features function to compute the style layer output. Suppose that the output has 1 example, c channels, height ℎ, and width w, we can transform this output into matrix X with c rows and ℎw columns. This matrix can be thought of as the concatenation of c vectors x1,…,xo, each of which has a length of ℎw. Here, vector xi represents the style feature of channel i.

스타일 손실은 콘텐츠 손실과 마찬가지로 제곱 손실 함수를 사용하여 합성 이미지와 스타일 이미지 간의 스타일 차이를 측정합니다. 스타일 레이어의 스타일 출력을 표현하기 위해 먼저 extract_features 함수를 사용하여 스타일 레이어 출력을 계산합니다. 출력에 1개의 채널, 높이 ℎ, 너비 w가 있다고 가정하면 이 출력을 c 행과 ℎw 열이 있는 행렬 X로 변환할 수 있습니다. 이 행렬은 각각 길이가 ℎw인 c 벡터 x1,…,xo의 연결로 생각할 수 있습니다. 여기서 벡터 xi는 채널 i의 스타일 특징을 나타냅니다.

In the Gram matrix of these vectors XX**⊤∈ℝ**c×c, element xij in row i and column j is the dot product of vectors xi and xj. It represents the correlation of the style features of channels i and j. We use this Gram matrix to represent the style output of any style layer. Note that when the value of ℎw is larger, it likely leads to larger values in the Gram matrix. Note also that the height and width of the Gram matrix are both the number of channels c. To allow style loss not to be affected by these values, the gram function below divides the Gram matrix by the number of its elements, i.e., cℎw.

이러한 벡터 XX**⊤∈ℝ**c×c의 그람 행렬에서 i행과 j열의 요소 xij는 벡터 xi와 xj의 내적입니다. 채널 i와 j의 스타일 특징의 상관관계를 나타냅니다. 이 그램 매트릭스를 사용하여 모든 스타일 레이어의 스타일 출력을 나타냅니다. ℎw의 값이 클수록 그람 행렬에서 더 큰 값이 될 가능성이 높습니다. 또한 그램 행렬의 높이와 너비는 모두 채널 c의 수입니다. 스타일 손실이 이러한 값의 영향을 받지 않도록 하기 위해 아래의 gram 함수는 Gram 행렬을 해당 요소의 수, 즉 cℎw로 나눕니다.

def gram(X):
    num_channels, n = X.shape[1], X.numel() // X.shape[1]
    X = X.reshape((num_channels, n))
    return torch.matmul(X, X.T) / (num_channels * n)

위 코드는 그람 행렬 함수를 정의하는 부분입니다.

gram(X) 함수는 입력으로 들어온 특성 맵 X의 그람 행렬을 계산합니다. 그람 행렬은 특성 맵의 채널 간 상관 관계를 캡처하는데 사용되며, 스타일 손실을 계산하는 데 활용됩니다.

함수 내에서는 먼저 X의 크기를 조정하여 num_channels을 채널 수로, n을 각 채널에서의 픽셀 수로 변환합니다. 그런 다음 torch.matmul() 함수를 사용하여 X와 X.T (전치된 X)의 행렬 곱을 계산하고, 이를 (num_channels * n)로 나누어 그람 행렬을 구합니다. 이렇게 구한 그람 행렬은 스타일 손실 계산에 사용됩니다.

Obviously, the two Gram matrix inputs of the squared loss function for style loss are based on the style layer outputs for the synthesized image and the style image. It is assumed here that the Gram matrix gram_Y based on the style image has been precomputed.

분명히 스타일 손실에 대한 제곱 손실 함수의 두 그램 매트릭스 입력은 합성 이미지와 스타일 이미지에 대한 스타일 레이어 출력을 기반으로 합니다. 여기서는 스타일 이미지를 기반으로 하는 그램 행렬 gram_Y가 미리 계산되었다고 가정합니다.

def style_loss(Y_hat, gram_Y):
    return torch.square(gram(Y_hat) - gram_Y.detach()).mean()

위 코드는 스타일 손실 함수를 정의하는 부분입니다.

style_loss(Y_hat, gram_Y) 함수는 스타일 이미지의 그람 행렬 gram_Y와 생성된 이미지의 특성 맵 Y_hat 간의 스타일 손실을 계산합니다. 스타일 손실은 생성된 이미지의 특성 맵이 스타일 이미지의 스타일을 얼마나 잘 따라가는지를 나타내는 지표입니다.

함수 내에서는 Y_hat의 그람 행렬을 계산하기 위해 gram(Y_hat) 함수를 호출하고, 이를 gram_Y와 비교하여 차이의 제곱을 계산한 후 평균을 취합니다. 이로써 스타일 손실을 구하게 됩니다. gram_Y를 detach()하여 그람 행렬의 계산 그래프를 분리하여 오류가 발생하지 않도록 합니다.

Gram Matrix 란?

A Gram Matrix, also known as a Gramian Matrix or a Gramian, is a mathematical construct used in linear algebra and signal processing. It is derived from the inner products of a set of vectors and provides insights into the relationships between these vectors. The Gram Matrix is particularly useful in various applications, including image processing, machine learning, and the analysis of functions.

Gram 행렬은 선형 대수 및 신호 처리에서 사용되는 수학적인 개념으로, 벡터들의 내적(inner product)에서 유도되며 이러한 벡터들 간의 관계에 대한 통찰력을 제공합니다. Gram 행렬은 이미지 처리, 기계 학습, 함수 분석과 같은 다양한 응용 분야에서 특히 유용하게 활용됩니다.

Given a set of vectors in an -dimensional space, the Gram Matrix is defined as follows:

이 -차원 공간에서 주어진 벡터들의 집합이라고 할 때, Gram 행렬 은 다음과 같이 정의됩니다:

x_1^T x_1 & x_1^T x_2 & \ldots & x_1^T x_n \\

x_2^T x_1 & x_2^T x_2 & \ldots & x_2^T x_n \\

\vdots & \vdots & \ddots & \vdots \\

x_n^T x_1 & x_n^T x_2 & \ldots & x_n^T x_n \\

\end{bmatrix}\]

Here, \(x_i^T\) represents the transpose of vector \(x_i\), and \(x_i^T x_j\) represents the inner product (dot product) between vectors \(x_i\) and \(x_j\).

여기서 \(x_i^T\)는 벡터 \(x_i\)의 전치(transpose)를 나타내며, \(x_i^T x_j\)는 벡터 \(x_i\)와 \(x_j\)의 내적(점곱)을 나타냅니다.

The Gram Matrix is symmetric and positive semi-definite, meaning its eigenvalues are non-negative. It captures the relationships between the vectors in terms of their similarities or correlations. In image processing, for example, the Gram Matrix of a set of image feature vectors can be used to analyze texture or style information, which is the basis for techniques like neural style transfer in deep learning.

Gram 행렬은 대칭적이고 양의 준정부호(positive semi-definite)를 가집니다. 즉, 고유값은 비음수입니다. 이는 벡터들 간의 유사성 또는 상관 관계를 포착합니다. 예를 들어 이미지 처리에서, 이미지 특징 벡터들의 집합의 Gram 행렬은 텍스처나 스타일 정보를 분석하는 데 사용될 수 있으며, 이는 딥 러닝에서의 신경망 스타일 전이(neural style transfer)와 같은 기법의 기초입니다.

In machine learning, the Gram Matrix is used in kernel methods for support vector machines (SVMs) and other algorithms that require pairwise similarities between data points. It allows these algorithms to operate in a higher-dimensional space without explicitly computing the transformed feature vectors.

기계 학습에서는 Gram 행렬은 서포트 벡터 머신(SVM)의 커널 방법 및 데이터 포인트 간의 쌍별 유사성을 필요로하는 다른 알고리즘에서 사용됩니다. 이는 이러한 알고리즘들이 변환된 특징 벡터를 명시적으로 계산하지 않고도 더 높은 차원 공간에서 작동할 수 있게 합니다.

Overall, the Gram Matrix is a versatile mathematical tool with applications ranging from signal processing to machine learning, offering insights into the relationships and similarities between vectors in various contexts.

종합적으로, Gram 행렬은 다양한 의미 맥락에서 벡터 간의 관계와 유사성에 대한 통찰력을 제공하는 다재다능한 수학적 도구입니다. 이는 신호 처리부터 기계 학습까지 다양한 분야에서 응용 가능한 중요한 개념입니다.

14.12.5.3. Total Variation Loss

Sometimes, the learned synthesized image has a lot of high-frequency noise, i.e., particularly bright or dark pixels. One common noise reduction method is total variation denoising. Denote by xi,j the pixel value at coordinate (i,j). Reducing total variation loss makes values of neighboring pixels on the synthesized image closer.

때로는 학습된 합성 이미지에 많은 고주파 노이즈, 즉 특히 밝거나 어두운 픽셀이 있습니다. 일반적인 노이즈 감소 방법 중 하나는 전체 변형 노이즈 제거입니다. 좌표(i,j)의 픽셀 값을 xi,j로 표시합니다. 총 변형 손실을 줄이면 합성 이미지의 인접 픽셀 값이 더 가까워집니다.

def tv_loss(Y_hat):
    return 0.5 * (torch.abs(Y_hat[:, :, 1:, :] - Y_hat[:, :, :-1, :]).mean() +
                  torch.abs(Y_hat[:, :, :, 1:] - Y_hat[:, :, :, :-1]).mean())

위 코드는 전체 변위 손실 함수를 정의하는 부분입니다.

tv_loss(Y_hat) 함수는 생성된 이미지의 특성 맵 Y_hat에 대한 전체 변위 손실을 계산합니다. 전체 변위 손실은 생성된 이미지의 픽셀 간의 변화 정도를 나타내며, 이미지의 부드러움과 노이즈 제거에 기여합니다.

함수 내에서는 Y_hat의 특성 맵에 대해 수직 방향 및 수평 방향으로 픽셀 간의 차이를 계산하고, 절댓값을 취하여 평균을 계산합니다. 이렇게 계산된 두 차이의 평균을 더한 후 0.5를 곱하여 최종적인 전체 변위 손실을 계산합니다.

14.12.5.4. Loss Function

The loss function of style transfer is the weighted sum of content loss, style loss, and total variation loss. By adjusting these weight hyperparameters, we can balance among content retention, style transfer, and noise reduction on the synthesized image.

스타일 전송의 손실 함수는 콘텐츠 손실, 스타일 손실 및 총 변형 손실의 가중 합입니다. 이러한 가중치 하이퍼파라미터를 조정하여 합성 이미지의 콘텐츠 유지, 스타일 전송 및 노이즈 감소 간의 균형을 맞출 수 있습니다.

content_weight, style_weight, tv_weight = 1, 1e4, 10

def compute_loss(X, contents_Y_hat, styles_Y_hat, contents_Y, styles_Y_gram):
    # Calculate the content, style, and total variance losses respectively
    contents_l = [content_loss(Y_hat, Y) * content_weight for Y_hat, Y in zip(
        contents_Y_hat, contents_Y)]
    styles_l = [style_loss(Y_hat, Y) * style_weight for Y_hat, Y in zip(
        styles_Y_hat, styles_Y_gram)]
    tv_l = tv_loss(X) * tv_weight
    # Add up all the losses
    l = sum(styles_l + contents_l + [tv_l])
    return contents_l, styles_l, tv_l, l

위 코드는 이미지 생성을 위한 전체 손실을 계산하는 함수를 정의하는 부분입니다.

compute_loss(X, contents_Y_hat, styles_Y_hat, contents_Y, styles_Y_gram) 함수는 생성된 이미지 X와 이에 대한 콘텐츠 및 스타일 특성 맵의 예측 contents_Y_hat 및 styles_Y_hat, 그리고 정답 콘텐츠 및 스타일 그람 매트릭스 contents_Y와 styles_Y_gram을 사용하여 전체 손실을 계산합니다.

함수 내에서는 콘텐츠 손실, 스타일 손실, 그리고 전체 변위 손실을 계산합니다. 각각의 손실에는 content_weight, style_weight, tv_weight 가중치가 곱해진 후 합산되어 최종 손실 l이 계산됩니다.

이렇게 계산된 각 손실들을 반환하며, 생성된 이미지의 품질을 조정하는 데 사용됩니다.

14.12.6. Initializing the Synthesized Image

In style transfer, the synthesized image is the only variable that needs to be updated during training. Thus, we can define a simple model, SynthesizedImage, and treat the synthesized image as the model parameters. In this model, forward propagation just returns the model parameters.

스타일 전송에서 합성 이미지는 학습 중에 업데이트해야 하는 유일한 변수입니다. 따라서 간단한 모델인 SynthesizedImage를 정의하고 합성된 이미지를 모델 매개변수로 처리할 수 있습니다. 이 모델에서 정방향 전파는 모델 매개변수만 반환합니다.

class SynthesizedImage(nn.Module):
    def __init__(self, img_shape, **kwargs):
        super(SynthesizedImage, self).__init__(**kwargs)
        self.weight = nn.Parameter(torch.rand(*img_shape))

    def forward(self):
        return self.weight

위 코드는 신경망 모델을 정의하는 클래스인 SynthesizedImage를 나타냅니다.

SynthesizedImage 클래스는 nn.Module을 상속하여 정의되며, 생성된 이미지를 나타내는 모델입니다. 생성된 이미지는 nn.Parameter를 통해 모델의 학습 가능한 매개변수로서 정의됩니다. 생성된 이미지의 크기는 img_shape로 주어지며, 이는 self.weight 매개변수의 형태가 됩니다.

forward 메서드는 해당 모델의 순전파(forward) 연산을 수행하는 부분으로, 단순히 self.weight 값을 반환합니다. 이렇게 정의된 클래스는 이미지 생성 과정에서 모델을 사용하여 이미지를 조작하고 업데이트하는 데 활용될 수 있습니다.

Next, we define the get_inits function. This function creates a synthesized image model instance and initializes it to the image X. Gram matrices for the style image at various style layers, styles_Y_gram, are computed prior to training.

다음으로 get_inits 함수를 정의합니다. 이 함수는 합성된 이미지 모델 인스턴스를 생성하고 이미지 X로 초기화합니다. 다양한 스타일 레이어에서 스타일 이미지에 대한 그램 매트릭스(style_Y_gram)는 훈련 전에 계산됩니다.

def get_inits(X, device, lr, styles_Y):
    gen_img = SynthesizedImage(X.shape).to(device)
    gen_img.weight.data.copy_(X.data)
    trainer = torch.optim.Adam(gen_img.parameters(), lr=lr)
    styles_Y_gram = [gram(Y) for Y in styles_Y]
    return gen_img(), styles_Y_gram, trainer

위 코드는 생성된 이미지의 초기화 및 학습에 필요한 초기화 요소를 얻는 함수 get_inits를 나타냅니다.

이 함수는 다음과 같은 역할을 수행합니다:

X: 원본 이미지
device: 사용할 디바이스 (CPU 또는 GPU)
lr: 학습률
styles_Y: 스타일 이미지의 Gram 행렬들

위 매개변수들을 이용하여 다음 작업을 수행합니다:

SynthesizedImage 모델의 인스턴스 gen_img를 생성하고, 해당 모델의 가중치를 원본 이미지 X의 데이터로 초기화합니다.
styles_Y의 각 스타일 이미지에 대한 Gram 행렬을 계산하여 styles_Y_gram에 저장합니다.
gen_img, styles_Y_gram, 그리고 Adam 옵티마이저를 생성하여 반환합니다.

이렇게 생성된 이미지와 초기화에 필요한 정보는 실제 이미지를 생성하고 업데이트하는 학습 과정에서 활용될 것입니다.

14.12.7. Training

When training the model for style transfer, we continuously extract content features and style features of the synthesized image, and calculate the loss function. Below defines the training loop.

스타일 전달을 위해 모델을 훈련할 때 합성 이미지의 콘텐츠 특징과 스타일 특징을 지속적으로 추출하고 손실 함수를 계산합니다. 아래는 훈련 루프를 정의합니다.

def train(X, contents_Y, styles_Y, device, lr, num_epochs, lr_decay_epoch):
    X, styles_Y_gram, trainer = get_inits(X, device, lr, styles_Y)
    scheduler = torch.optim.lr_scheduler.StepLR(trainer, lr_decay_epoch, 0.8)
    animator = d2l.Animator(xlabel='epoch', ylabel='loss',
                            xlim=[10, num_epochs],
                            legend=['content', 'style', 'TV'],
                            ncols=2, figsize=(7, 2.5))
    for epoch in range(num_epochs):
        trainer.zero_grad()
        contents_Y_hat, styles_Y_hat = extract_features(
            X, content_layers, style_layers)
        contents_l, styles_l, tv_l, l = compute_loss(
            X, contents_Y_hat, styles_Y_hat, contents_Y, styles_Y_gram)
        l.backward()
        trainer.step()
        scheduler.step()
        if (epoch + 1) % 10 == 0:
            animator.axes[1].imshow(postprocess(X))
            animator.add(epoch + 1, [float(sum(contents_l)),
                                     float(sum(styles_l)), float(tv_l)])
    return X

위 코드는 스타일 전이 네트워크를 학습하는 함수 train을 나타냅니다.

이 함수는 다음과 같은 역할을 수행합니다:

X: 원본 이미지
contents_Y: 내용 이미지의 특징
styles_Y: 스타일 이미지의 특징
device: 사용할 디바이스 (CPU 또는 GPU)
lr: 학습률
num_epochs: 에포크 수
lr_decay_epoch: 학습률 감소를 적용할 에포크 간격

위 매개변수들을 이용하여 다음 작업을 수행합니다:

생성된 이미지 X, 스타일 이미지의 Gram 행렬들 styles_Y_gram, 그리고 Adam 옵티마이저를 초기화하고 얻습니다.
StepLR 스케줄러를 생성하여 학습률을 조정하는데 사용합니다.
애니메이션을 위한 Animator를 초기화합니다.
주어진 에포크 수만큼 학습을 반복하면서 다음을 수행합니다:
- 옵티마이저의 그래디언트를 초기화합니다.
- 생성된 이미지 X를 스타일 전이 네트워크에 통과시켜 내용과 스타일 특징을 추출합니다.
- 내용 손실, 스타일 손실, 총 변동 손실을 계산합니다.
- 총 손실에 대한 역전파를 수행하고 옵티마이저를 업데이트합니다.
- 스케줄러를 업데이트하여 학습률을 조정합니다.
- 일정 에포크마다 생성된 이미지를 시각화하고, 내용 손실, 스타일 손실, 변동 손실을 기록합니다.
최종적으로 생성된 이미지 X를 반환합니다.

Now we start to train the model. We rescale the height and width of the content and style images to 300 by 450 pixels. We use the content image to initialize the synthesized image.

이제 모델 학습을 시작합니다. 콘텐츠 및 스타일 이미지의 높이와 너비를 300 x 450픽셀로 재조정합니다. 콘텐츠 이미지를 사용하여 합성 이미지를 초기화합니다.

device, image_shape = d2l.try_gpu(), (300, 450)  # PIL Image (h, w)
net = net.to(device)
content_X, contents_Y = get_contents(image_shape, device)
_, styles_Y = get_styles(image_shape, device)
output = train(content_X, contents_Y, styles_Y, device, 0.3, 500, 50)

위 코드는 주어진 내용 이미지와 스타일 이미지를 사용하여 스타일 전이 네트워크를 학습하고, 생성된 이미지를 반환하는 작업을 수행합니다.

device와 image_shape를 설정합니다. device는 학습에 사용할 디바이스(CPU 또는 GPU)를 결정하고, image_shape는 생성될 이미지의 크기를 결정합니다.
net을 지정된 디바이스로 옮깁니다. 이는 네트워크를 선택한 디바이스에서 실행하도록 하는 역할을 합니다.
get_contents 함수를 사용하여 내용 이미지의 특징을 추출합니다. 추출한 내용 이미지와 특징을 content_X와 contents_Y로 저장합니다.
get_styles 함수를 사용하여 스타일 이미지의 특징을 추출합니다. 추출한 스타일 이미지와 특징을 styles_X와 styles_Y로 저장합니다.
train 함수를 호출하여 스타일 전이 네트워크를 학습합니다. content_X, contents_Y, styles_Y, device, 학습률(0.3), 에포크 수(500), 학습률 감소 에포크(50)를 매개변수로 전달합니다. 학습된 결과로 생성된 이미지를 output에 저장합니다.

즉, 위 코드는 주어진 내용 이미지와 스타일 이미지를 이용하여 스타일 전이 네트워크를 학습하고, 학습된 결과로 생성된 이미지를 반환합니다.

We can see that the synthesized image retains the scenery and objects of the content image, and transfers the color of the style image at the same time. For example, the synthesized image has blocks of color like those in the style image. Some of these blocks even have the subtle texture of brush strokes.

합성된 이미지는 콘텐츠 이미지의 풍경과 사물을 그대로 유지함과 동시에 스타일 이미지의 색상을 전달함을 알 수 있다. 예를 들어 합성 이미지에는 스타일 이미지와 같은 색상 블록이 있습니다. 이러한 블록 중 일부는 브러시 스트로크의 미묘한 질감을 가지고 있습니다.

14.12.8. Summary

The loss function commonly used in style transfer consists of three parts: (i) content loss makes the synthesized image and the content image close in content features; (ii) style loss makes the synthesized image and style image close in style features; and (iii) total variation loss helps to reduce the noise in the synthesized image.
스타일 전송에서 일반적으로 사용되는 손실 함수는 세 부분으로 구성됩니다. (i) 콘텐츠 손실은 합성 이미지와 콘텐츠 이미지를 콘텐츠 기능에 가깝게 만듭니다. (ii) 스타일 손실은 합성 이미지와 스타일 이미지를 스타일 특징에 가깝게 만듭니다. (iii) 전체 변형 손실은 합성 이미지의 노이즈를 줄이는 데 도움이 됩니다.
We can use a pretrained CNN to extract image features and minimize the loss function to continuously update the synthesized image as model parameters during training.
사전 훈련된 CNN을 사용하여 이미지 특징을 추출하고 손실 함수를 최소화하여 훈련 중에 합성된 이미지를 모델 매개변수로 지속적으로 업데이트할 수 있습니다.
We use Gram matrices to represent the style outputs from the style layers.
스타일 레이어의 스타일 출력을 나타내기 위해 그램 매트릭스를 사용합니다.

14.12.9. Exercises

How does the output change when you select different content and style layers?
Adjust the weight hyperparameters in the loss function. Does the output retain more content or have less noise?
Use different content and style images. Can you create more interesting synthesized images?
Can we apply style transfer for text? Hint: you may refer to the survey paper by Hu et al. (2022).

'Dive into Deep Learning > D2L Computer Vision' 카테고리의 다른 글

D2L - 14.11. Fully Convolutional Networks (0)	2023.08.21
D2L - 14.10. Transposed Convolution (1)	2023.08.21
D2L - 14.9. Semantic Segmentation and the Dataset (0)	2023.08.20
D2L - 14.8. Region-based CNNs (R-CNNs) (0)	2023.08.20
D2L - 14.7. Single Shot Multibox Detection (0)	2023.08.19
D2L - 14.6. The Object Detection Dataset (0)	2023.08.19
D2L - 14.5. Multiscale Object Detection (0)	2023.08.19
D2L - 14.4. Anchor Boxes (0)	2023.08.19
D2L - 14.3. Object Detection and Bounding Boxes (0)	2023.08.19
D2L - 14.2. Fine-Tuning (0)	2023.08.19

Dive into Deep Learning/D2L Computer Vision

D2L - 14.11. Fully Convolutional Networks

2023. 8. 21. 04:10 | Posted by 솔웅

https://d2l.ai/chapter_computer-vision/fcn.html

14.11. Fully Convolutional Networks — Dive into Deep Learning 1.0.3 documentation

d2l.ai

As discussed in Section 14.9, semantic segmentation classifies images in pixel level. A fully convolutional network (FCN) uses a convolutional neural network to transform image pixels to pixel classes (Long et al., 2015). Unlike the CNNs that we encountered earlier for image classification or object detection, a fully convolutional network transforms the height and width of intermediate feature maps back to those of the input image: this is achieved by the transposed convolutional layer introduced in Section 14.10. As a result, the classification output and the input image have a one-to-one correspondence in pixel level: the channel dimension at any output pixel holds the classification results for the input pixel at the same spatial position.

섹션 14.9에서 논의한 바와 같이 의미론적 분할 semantic segmentation은 픽셀 수준에서 이미지를 분류합니다. 완전 컨볼루션 네트워크 fully convolutional network (FCN)는 컨볼루션 신경망을 사용하여 이미지 픽셀을 픽셀 클래스로 변환합니다(Long et al., 2015). 이전에 이미지 분류 또는 객체 감지를 위해 접한 CNN과 달리 완전 컨볼루션 네트워크는 중간 피처 맵의 높이와 너비를 입력 이미지의 높이와 너비로 다시 변환합니다. 결과적으로 분류 출력과 입력 이미지는 픽셀 수준에서 일대일 대응을 갖습니다. 모든 출력 픽셀의 채널 차원은 동일한 공간 위치에서 입력 픽셀에 대한 분류 결과를 유지합니다.

Fully Convolutional Networks 란?

"Fully Convolutional Networks"는 완전 합성곱 신경망이라고도 불리며, 주로 이미지 분할과 같은 컴퓨터 비전 작업에 사용되는 딥러닝 아키텍처를 의미합니다.

기존의 전통적인 합성곱 신경망(Convolutional Neural Networks, CNN)은 주로 분류 작업을 위해 설계되었으며, 입력 이미지를 각기 다른 레이어로 전달하여 점진적으로 특징을 추출한 후, 마지막에 소프트맥스 레이어로 클래스 확률을 출력하는 구조를 가지고 있습니다.

반면에 Fully Convolutional Networks (FCN)는 입력 이미지를 픽셀 단위로 분류하는 이미지 분할 작업에 특화된 구조입니다. FCN은 일반적인 CNN의 특징 추출 부분을 유지하면서, 마지막 레이어를 완전 합성곱 레이어로 변경하여 원본 이미지의 공간적인 정보를 보존할 수 있도록 합니다. 이를 통해 각 픽셀에 대한 클래스 레이블을 예측하게 됩니다.

FCN은 픽셀 단위 분류 작업에서 효과적이며, 이미지 분할, 객체 탐지, 시맨틱 세그멘테이션 등의 컴퓨터 비전 작업에서 널리 사용되는 아키텍처입니다. FCN은 입력 이미지의 크기에 따라 유연하게 대응할 수 있으며, 특정 객체나 물체의 경계를 정확하게 분할할 수 있는 장점을 가지고 있습니다.

%matplotlib inline
import torch
import torchvision
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l

이 코드는 주피터 노트북(Jupyter Notebook)에서 실행되는 경우에 사용되는 매직 명령 %matplotlib inline을 시작으로, 필요한 모듈과 라이브러리를 가져오는 부분입니다.

%matplotlib inline: 주피터 노트북 환경에서 Matplotlib 그래프를 인라인으로 표시하도록 설정하는 매직 명령입니다.
import torch: 파이토치 라이브러리를 가져옵니다.
import torchvision: 파이토치의 컴퓨터 비전 관련 기능을 확장한 torchvision 라이브러리를 가져옵니다.
from torch import nn: 파이토치의 nn (neural network) 모듈에서 neural network 관련 기능들을 가져옵니다.
from torch.nn import functional as F: 파이토치의 nn 모듈에서 함수형 인터페이스를 가져와 F로 별칭을 지정합니다. 이것은 주로 활성화 함수나 손실 함수와 같은 함수적 기능을 가리키기 위해 사용됩니다.
from d2l import torch as d2l: "Dive into Deep Learning" 책의 도구 함수를 사용하기 위해 d2l 라이브러리를 가져옵니다. torch 별칭을 d2l로 변경하여 사용합니다.

이 코드 조각은 파이토치와 관련된 라이브러리와 함수를 가져옴으로써 딥러닝 모델을 구축하고 훈련하기 위한 기본적인 환경을 설정합니다.

14.11.1. The Model

Here we describe the basic design of the fully convolutional network model. As shown in Fig. 14.11.1, this model first uses a CNN to extract image features, then transforms the number of channels into the number of classes via a 1×1 convolutional layer, and finally transforms the height and width of the feature maps to those of the input image via the transposed convolution introduced in Section 14.10. As a result, the model output has the same height and width as the input image, where the output channel contains the predicted classes for the input pixel at the same spatial position.

여기서 우리는 완전 컨벌루션 네트워크 모델의 기본 설계를 설명합니다. 그림 14.11.1과 같이 이 모델은 먼저 CNN을 사용하여 이미지 특징을 추출한 다음 1×1 컨벌루션 레이어를 통해 채널 수를 클래스 수로 변환하고 마지막으로 특징 맵의 높이와 너비를 변환합니다. 섹션 14.10에서 소개한 전치 컨벌루션을 통해 입력 이미지의 이미지로 변환합니다. 결과적으로 모델 출력은 입력 이미지와 동일한 높이와 너비를 가지며 출력 채널에는 동일한 공간 위치에서 입력 픽셀에 대해 예측된 클래스가 포함됩니다.

Fig. 14.11.1  Fully convolutional network.

Below, we use a ResNet-18 model pretrained on the ImageNet dataset to extract image features and denote the model instance as pretrained_net. The last few layers of this model include a global average pooling layer and a fully connected layer: they are not needed in the fully convolutional network.

아래에서는 ImageNet 데이터 세트에서 사전 훈련된 ResNet-18 모델을 사용하여 이미지 기능을 추출하고 모델 인스턴스를 pretrained_net으로 표시합니다. 이 모델의 마지막 몇 계층에는 전역 평균 풀링 계층과 완전 연결 계층이 포함됩니다. 완전 컨벌루션 네트워크에서는 필요하지 않습니다.

pretrained_net = torchvision.models.resnet18(pretrained=True)
list(pretrained_net.children())[-3:]

이 코드는 사전 훈련된 ResNet-18 모델을 불러오고, 해당 모델의 마지막 세 개의 레이어를 리스트로 반환하는 과정을 나타냅니다.

pretrained_net: 사전 훈련된 ResNet-18 모델을 불러옵니다.
torchvision.models.resnet18(pretrained=True): 사전 훈련된 ResNet-18 모델을 불러옵니다. pretrained=True는 사전 훈련된 가중치를 사용하겠다는 의미입니다.
list(pretrained_net.children()): 불러온 ResNet-18 모델의 모든 하위 모듈을 리스트로 반환합니다. 이는 네트워크의 각 레이어를 순차적으로 표현한 것입니다.
[-3:]: 리스트의 마지막 세 개의 원소를 선택합니다. 이는 불러온 ResNet-18 모델의 마지막 세 개의 레이어를 의미합니다.

따라서, 위 코드는 사전 훈련된 ResNet-18 모델을 불러온 뒤, 해당 모델의 마지막 세 개의 레이어를 리스트로 반환하는 과정을 나타냅니다.

Downloading: "https://download.pytorch.org/models/resnet18-f37072fd.pth" to /home/ci/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth
100%|██████████| 44.7M/44.7M [00:00<00:00, 56.3MB/s]

[Sequential(
   (0): BasicBlock(
     (conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
     (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
     (relu): ReLU(inplace=True)
     (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
     (downsample): Sequential(
       (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
       (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
     )
   )
   (1): BasicBlock(
     (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
     (relu): ReLU(inplace=True)
     (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
   )
 ),
 AdaptiveAvgPool2d(output_size=(1, 1)),
 Linear(in_features=512, out_features=1000, bias=True)]

위 출력은 ResNet-18 모델의 구조를 설명합니다. 아래에서부터 위로 각각의 레이어와 그 내용을 설명하겠습니다.

Linear 레이어:
- 입력 특성 개수: 512
- 출력 특성 개수: 1000
- 활성화 함수: 없음
- 주어진 입력 특성에 대한 선형 변환을 수행하는 fully connected 레이어입니다. 이것은 마지막 분류 레이어로, 모델의 출력이 클래스 수에 대한 확률 분포를 나타냅니다.
AdaptiveAvgPool2d 레이어:
- 출력 크기: (1, 1)
- 입력으로 들어오는 특성 맵을 주어진 출력 크기에 맞게 자동으로 평균 풀링하는 레이어입니다. 입력 이미지의 크기에 관계 없이 일정한 크기의 출력을 생성합니다.
Sequential 레이어 (2개의 BasicBlock 레이어 포함):
- 첫 번째 BasicBlock:
  - conv1: 입력 특성 개수 256, 출력 특성 개수 512의 3x3 컨볼루션 레이어
  - bn1: 배치 정규화 레이어
  - relu: ReLU 활성화 함수
  - conv2: 입력 특성 개수 512, 출력 특성 개수 512의 3x3 컨볼루션 레이어
  - bn2: 배치 정규화 레이어
  - downsample: 다운샘플링을 위한 Sequential 레이어
- 두 번째 BasicBlock:
  - 위와 같은 구조를 가진 BasicBlock 레이어
BasicBlock은 ResNet의 기본 구성 요소입니다. 입력 특성을 컨볼루션과 배치 정규화를 거쳐 변환한 뒤, 활성화 함수 ReLU를 적용하고 다시 컨볼루션과 배치 정규화를 거칩니다. 이러한 구성으로 인해 ResNet은 깊게 쌓여도 그레이디언트 소실 문제를 해결하며 성능을 높일 수 있습니다.

따라서 위의 출력은 ResNet-18 모델의 구조를 나타내며, 각 레이어의 구성 요소를 설명합니다.

Next, we create the fully convolutional network instance net. It copies all the pretrained layers in the ResNet-18 except for the final global average pooling layer and the fully connected layer that are closest to the output.

다음으로 완전 컨볼루션 네트워크 인스턴스 네트워크를 만듭니다. 최종 전역 평균 풀링 계층과 출력에 가장 가까운 완전 연결 계층을 제외하고 ResNet-18의 모든 사전 훈련된 계층을 복사합니다.

net = nn.Sequential(*list(pretrained_net.children())[:-2])

위 코드는 미리 학습된 ResNet-18 모델의 마지막 두 레이어를 제외한 모든 레이어를 새로운 신경망 모델로 가져오는 과정을 나타냅니다.

pretrained_net은 미리 학습된 ResNet-18 모델을 나타냅니다.
pretrained_net.children()는 모델의 모든 하위 레이어를 가져오는 메소드입니다.
list(pretrained_net.children())[:-2]는 마지막 두 레이어를 제외한 모든 레이어를 리스트로 가져옵니다.
nn.Sequential(*...)는 주어진 레이어들을 순차적으로 연결한 새로운 신경망 모델을 생성합니다.

따라서 net은 미리 학습된 ResNet-18 모델의 마지막 두 레이어를 제외한 모든 레이어를 가지는 새로운 신경망 모델입니다. 이렇게 하면 기존의 미리 학습된 모델에서 일부 레이어를 재활용하고, 이에 새로운 레이어를 추가하여 원하는 작업에 맞는 모델을 빠르게 구축할 수 있습니다.

Given an input with height and width of 320 and 480 respectively, the forward propagation of net reduces the input height and width to 1/32 of the original, namely 10 and 15.

높이와 너비가 각각 320과 480인 입력이 주어지면 net의 정방향 전파는 입력 높이와 너비를 원본의 1/32, 즉 10과 15로 줄입니다.

X = torch.rand(size=(1, 3, 320, 480))
net(X).shape

위 코드는 미리 학습된 ResNet-18 모델의 마지막 두 레이어를 제외한 나머지 레이어로 입력 데이터를 전달하여 출력의 형태를 확인하는 과정을 나타냅니다.

torch.rand(size=(1, 3, 320, 480))은 크기가 (1, 3, 320, 480)인 랜덤한 값을 가지는 입력 데이터를 생성합니다. 이는 배치 크기가 1이고 채널 수가 3이며, 높이가 320이고 너비가 480인 이미지 데이터를 나타냅니다.
net(X)는 위에서 생성한 입력 데이터 X를 미리 학습된 ResNet-18 모델인 net에 통과시켜 출력을 계산하는 과정입니다.
.shape는 텐서의 크기(shape)를 반환하는 속성입니다. 따라서 net(X).shape는 네트워크의 출력 형태를 나타내게 됩니다.

이 코드를 실행하면 입력 데이터 X를 네트워크에 전달한 결과의 출력 형태를 확인할 수 있습니다.

torch.Size([1, 512, 10, 15])

Next, we use a 1×1 convolutional layer to transform the number of output channels into the number of classes (21) of the Pascal VOC2012 dataset. Finally, we need to increase the height and width of the feature maps by 32 times to change them back to the height and width of the input image. Recall how to calculate the output shape of a convolutional layer in Section 7.3. Since (320−64+16×2+32)/32=10 and (480−64+16×2+32)/32=15, we construct a transposed convolutional layer with stride of 32, setting the height and width of the kernel to 64, the padding to 16. In general, we can see that for stride s, padding s/2 (assuming s/2 is an integer), and the height and width of the kernel 2s, the transposed convolution will increase the height and width of the input by s times.

다음으로 1×1 컨벌루션 레이어를 사용하여 출력 채널 수를 Pascal VOC2012 데이터 세트의 클래스 수(21)로 변환합니다. 마지막으로 피처 맵의 높이와 너비를 32배 늘려 입력 이미지의 높이와 너비로 다시 변경해야 합니다. 섹션 7.3에서 컨볼루션 레이어의 출력 형태를 계산하는 방법을 기억하십시오. (320-64+16×2+32)/32=10이고 (480-64+16×2+32)/32=15이므로 보폭이 32인 전치 컨볼루션 레이어를 구성하고 커널은 64로, 패딩은 16으로 높이와 너비를 설정합니다. 일반적으로 보폭 s, 패딩 s/2(s/2가 정수라고 가정) 및 커널 2s의 높이와 너비에 대해 전치 컨볼루션이 입력의 높이와 너비를 s배로 증가시키는 것을 볼 수 있습니다. .

num_classes = 21
net.add_module('final_conv', nn.Conv2d(512, num_classes, kernel_size=1))
net.add_module('transpose_conv', nn.ConvTranspose2d(num_classes, num_classes,
                                    kernel_size=64, padding=16, stride=32))

위 코드는 기존의 신경망 모델 net에 두 개의 추가 레이어를 추가하는 과정을 나타냅니다.

num_classes = 21: 클래스의 개수를 나타내는 변수로, 이 경우에는 21개의 클래스를 분류하는 문제를 가정하고 있습니다.
nn.Conv2d(512, num_classes, kernel_size=1): final_conv라는 이름의 1x1 컨볼루션 레이어를 생성하고, 입력 채널 수가 512이며 출력 채널 수가 num_classes인 컨볼루션 레이어를 추가합니다. 이 레이어는 마지막 컨볼루션 레이어를 대체하는 역할을 합니다.
nn.ConvTranspose2d(num_classes, num_classes, kernel_size=64, padding=16, stride=32): transpose_conv라는 이름의 전치 컨볼루션 레이어를 생성하고, 입력 채널 수와 출력 채널 수가 모두 num_classes인 전치 컨볼루션 레이어를 추가합니다. 이 레이어는 특성 맵의 크기를 늘려주는 데 사용되며, 이 경우에는 크기를 32배 확대하고 패딩과 스트라이드를 조절하여 원하는 출력 크기를 얻습니다.

따라서 위 코드는 미리 학습된 신경망 net의 마지막 레이어를 대체하고, 그 뒤에 전치 컨볼루션 레이어를 추가하여 원하는 출력을 생성할 수 있도록 수정하는 과정을 나타냅니다. 이는 주로 세그멘테이션 등의 문제에서 사용되며, 네트워크의 출력 크기를 조정하고 클래스별 특성 맵을 얻을 때 유용합니다.

14.11.2. Initializing Transposed Convolutional Layers

We already know that transposed convolutional layers can increase the height and width of feature maps. In image processing, we may need to scale up an image, i.e., upsampling. Bilinear interpolation is one of the commonly used upsampling techniques. It is also often used for initializing transposed convolutional layers.

우리는 transposed convolutional layer가 feature map의 height와 width를 증가시킬 수 있다는 것을 이미 알고 있습니다. 이미지 처리에서 이미지를 확장해야 할 수도 있습니다. 즉, 업샘플링이 필요할 수 있습니다. 쌍선형 보간 Bilinear interpolation 은 일반적으로 사용되는 업샘플링 기술 중 하나입니다. 또한 전치된 컨볼루션 레이어를 초기화하는 데 자주 사용됩니다.

To explain bilinear interpolation, say that given an input image we want to calculate each pixel of the upsampled output image. In order to calculate the pixel of the output image at coordinate (x,y), first map (x,y) to coordinate (x′,y′) on the input image, for example, according to the ratio of the input size to the output size. Note that the mapped x′ and y′ are real numbers. Then, find the four pixels closest to coordinate (x′,y′) on the input image. Finally, the pixel of the output image at coordinate (x,y) is calculated based on these four closest pixels on the input image and their relative distance from (x′,y′).

쌍선형 보간법 bilinear interpolation을 설명하기 위해 입력 이미지가 주어졌을 때 업샘플링된 출력 이미지의 각 픽셀을 계산하려고 한다고 가정해 보겠습니다. 좌표 (x,y)에서 출력 이미지의 픽셀을 계산하기 위해 먼저 입력 크기의 비율에 따라 (x,y)를 입력 이미지의 좌표 (x′,y′)에 매핑합니다. 출력 크기에. 매핑된 x′ 및 y′는 실수입니다. 그런 다음 입력 이미지에서 좌표(x′,y′)에 가장 가까운 4개의 픽셀을 찾습니다. 마지막으로 좌표 (x,y)에 있는 출력 이미지의 픽셀은 입력 이미지에서 가장 가까운 4개의 픽셀과 (x',y')로부터의 상대적인 거리를 기반으로 계산됩니다.

bilinear interpolation이란?

'Bilinear Interpolation'은 이미지나 그래픽의 크기나 해상도를 변경하는 과정에서 사용되는 보간 기법 중 하나입니다. 보간(interpolation)은 주어진 데이터 포인트 사이에 새로운 값을 추정하는 기술을 의미하며, 이를 통해 부드럽게 변환된 이미지나 그래픽을 생성할 수 있습니다.

이 중 'Bilinear Interpolation'은 2차원 평면에서 사용되며, 두 개의 인접한 데이터 포인트 사이에서 새로운 값을 추정합니다. 간단히 말해, 2차원 좌표 상에서 (x, y) 위치에 존재하는 값은 주변 4개의 데이터 포인트를 사용하여 계산됩니다. 이렇게 인접한 4개의 데이터 포인트를 가중치를 적용하여 선형 보간한 결과를 사용하여 새로운 값을 추정합니다.

이 기법은 이미지나 그래픽의 크기 변경, 회전, 이동 등에 사용되며, 특히 이미지 리사이징이나 세그멘테이션과 같은 작업에서 주로 활용됩니다. 'Bilinear'이라는 이름은 가중치를 계산하는 데에 선형 보간을 사용한다는 특성을 나타내며, 두 개의 축에 대한 보간이 동시에 이루어지므로 'Bilinear'이라는 이름이 붙여졌습니다.

Upsampling of bilinear interpolation can be implemented by the transposed convolutional layer with the kernel constructed by the following bilinear_kernel function. Due to space limitations, we only provide the implementation of the bilinear_kernel function below without discussions on its algorithm design.

쌍선형 보간법 bilinear interpolation 의 업샘플링은 다음 bilinear_kernel 함수로 구성된 커널을 사용하여 전치된 컨벌루션 계층 transposed convolutional layer에 의해 구현될 수 있습니다. 공간 제한으로 인해 알고리즘 설계에 대한 논의 없이 bilinear_kernel 함수의 구현만 제공합니다.

def bilinear_kernel(in_channels, out_channels, kernel_size):
    factor = (kernel_size + 1) // 2
    if kernel_size % 2 == 1:
        center = factor - 1
    else:
        center = factor - 0.5
    og = (torch.arange(kernel_size).reshape(-1, 1),
          torch.arange(kernel_size).reshape(1, -1))
    filt = (1 - torch.abs(og[0] - center) / factor) * \
           (1 - torch.abs(og[1] - center) / factor)
    weight = torch.zeros((in_channels, out_channels,
                          kernel_size, kernel_size))
    weight[range(in_channels), range(out_channels), :, :] = filt
    return weight

위 코드는 주어진 입력 채널 수와 출력 채널 수, 그리고 커널 크기에 따라 이차원의 'Bilinear Interpolation' 커널을 생성하는 함수를 정의하고 있습니다. 이 커널은 주로 합성곱 신경망(Convolutional Neural Network, CNN)에서 전치 합성곱 연산에 활용되며, 주로 세그멘테이션(segmentation)과 같은 작업에서 이미지를 업샘플링하는 데 사용됩니다.

여기서 in_channels은 입력 채널 수, out_channels는 출력 채널 수, kernel_size는 커널의 크기(가로와 세로)입니다. 이 커널은 입력 이미지의 각 채널별로 다른 가중치를 가지며, 이를 통해 입력 이미지의 특성을 보존하면서 업샘플링됩니다.

커널을 생성하는 과정은 다음과 같습니다:

factor 변수는 커널 크기의 중앙을 계산합니다.
center 변수는 중앙 위치를 결정합니다.
og는 커널 크기에 따른 좌표 그리드를 생성합니다.
filt는 각 좌표 위치에 대한 커널 가중치를 계산합니다. 이것은 'Bilinear Interpolation'에 따라 계산된 값으로, 가운데에 가까운 값일수록 높은 가중치를 가지며, 커널 크기를 기준으로 대칭적인 값을 가집니다.
weight 텐서는 이렇게 계산된 filt 값을 이용하여 입력 채널 수와 출력 채널 수에 따른 4차원 텐서를 생성합니다. 이 텐서의 각 채널에 커널 가중치 값을 할당합니다.

이러한 커널은 전치 합성곱 연산에서 업샘플링을 수행할 때 사용되며, 주로 모델의 특성 맵 크기를 늘리는 데 활용됩니다.

Let’s experiment with upsampling of bilinear interpolation that is implemented by a transposed convolutional layer. We construct a transposed convolutional layer that doubles the height and weight, and initialize its kernel with the bilinear_kernel function.

전치된 컨볼루션 레이어에 의해 구현되는 쌍선형 보간의 업샘플링을 실험해 봅시다. 높이와 무게를 두 배로 늘리는 전치 컨볼루션 레이어를 구성하고 bilinear_kernel 함수로 커널을 초기화합니다.

conv_trans = nn.ConvTranspose2d(3, 3, kernel_size=4, padding=1, stride=2,
                                bias=False)
conv_trans.weight.data.copy_(bilinear_kernel(3, 3, 4));

위 코드는 입력 채널 수와 출력 채널 수가 모두 3인 커스텀 'Bilinear Interpolation' 커널을 생성하여 이를 사용하여 전치 합성곱 연산을 수행하는 부분입니다.

해석을 해보겠습니다:

conv_trans는 전치 합성곱 레이어를 생성하는데 사용됩니다. nn.ConvTranspose2d는 전치 합성곱을 수행하는 클래스입니다. 여기서 입력 채널 수와 출력 채널 수 모두 3이며, 커널 크기는 4x4이고, 패딩은 1, 스트라이드는 2로 설정됩니다.
conv_trans.weight.data.copy_(bilinear_kernel(3, 3, 4))는 위에서 정의한 bilinear_kernel 함수를 사용하여 커스텀 'Bilinear Interpolation' 커널을 생성하고, 이를 conv_trans 레이어의 가중치 텐서로 복사하는 역할을 합니다. 이를 통해 'Bilinear Interpolation'의 특성을 가지는 커널이 전치 합성곱 레이어에 적용됩니다.

따라서 위 코드는 입력 이미지의 특성 맵을 업샘플링하여 크기를 확장하는 역할을 수행합니다. 이를 통해 세그멘테이션과 같은 작업에서 높은 해상도의 예측을 얻을 수 있습니다.

Read the image X and assign the upsampling output to Y. In order to print the image, we need to adjust the position of the channel dimension.

이미지 X를 읽고 업샘플링 출력을 Y에 할당합니다. 이미지를 인쇄하려면 채널 차원의 위치를 조정해야 합니다.

img = torchvision.transforms.ToTensor()(d2l.Image.open('../img/catdog.jpg'))
X = img.unsqueeze(0)
Y = conv_trans(X)
out_img = Y[0].permute(1, 2, 0).detach()

위 코드는 주어진 이미지에 대해 Bilinear Interpolation을 사용하여 업샘플링을 수행하는 과정을 보여주는 부분입니다.

해석을 해보겠습니다:

img는 d2l.Image.open('../img/catdog.jpg')를 통해 로드된 이미지입니다. 이 이미지를 텐서 형식으로 변환하기 위해 torchvision.transforms.ToTensor()를 사용하여 텐서로 변환합니다.
X는 img를 모델에 입력으로 사용하기 위해 차원을 확장한 것입니다. 일반적으로 모델에 입력할 때 배치 차원을 추가하기 위해 사용됩니다.
conv_trans(X)는 이전에 정의한 conv_trans 레이어를 사용하여 입력 이미지 X에 대한 업샘플링을 수행한 결과를 얻습니다. 이렇게 얻은 결과인 Y는 업샘플링된 이미지입니다.
out_img는 Y 텐서에서 0번째 배치 차원을 선택하여 이미지를 추출합니다. 그리고 이를 .permute(1, 2, 0)를 통해 채널 순서를 변경하여 이미지의 형식을 HWC (높이, 너비, 채널)로 변경합니다. .detach()는 연산을 추적하지 않도록 하기 위해 사용되며, 역전파에 영향을 주지 않습니다.

결과적으로 out_img는 업샘플링된 이미지를 나타내며, 해당 이미지를 시각화하거나 분석하는 데 사용할 수 있습니다.

As we can see, the transposed convolutional layer increases both the height and width of the image by a factor of two. Except for the different scales in coordinates, the image scaled up by bilinear interpolation and the original image printed in Section 14.3 look the same.

보시다시피 transposed convolutional layer는 이미지의 높이와 너비를 2배로 늘립니다. 좌표의 스케일이 다른 것을 제외하면 쌍선형 보간에 의해 확대된 이미지와 14.3절에서 인쇄된 원본 이미지는 동일하게 보입니다.

d2l.set_figsize()
print('input image shape:', img.permute(1, 2, 0).shape)
d2l.plt.imshow(img.permute(1, 2, 0));
print('output image shape:', out_img.shape)
d2l.plt.imshow(out_img);

위 코드는 입력 이미지와 Bilinear Interpolation을 통해 얻은 업샘플링된 출력 이미지를 시각화하는 부분입니다.

해석을 해보겠습니다:

d2l.set_figsize()는 시각화할 그림의 크기를 설정하는 함수입니다. 보통 d2l 라이브러리에서 제공하는 시각화 함수를 사용할 때 함께 사용됩니다.
print('input image shape:', img.permute(1, 2, 0).shape)는 입력 이미지의 형태를 출력합니다. 이미지의 채널 순서를 변경하여 (높이, 너비, 채널) 형식으로 변경하고, .shape를 사용하여 해당 형태를 출력합니다.
d2l.plt.imshow(img.permute(1, 2, 0))는 d2l 라이브러리의 시각화 함수를 사용하여 입력 이미지를 표시합니다. img.permute(1, 2, 0)를 사용하여 채널 순서를 변경하고 이미지를 시각화합니다.
print('output image shape:', out_img.shape)는 업샘플링된 출력 이미지의 형태를 출력합니다.
d2l.plt.imshow(out_img)는 업샘플링된 출력 이미지를 시각화합니다.

결과적으로 코드는 입력 이미지와 업샘플링된 출력 이미지를 시각화하여 보여주는 역할을 합니다.

input image shape: torch.Size([561, 728, 3])
output image shape: torch.Size([1122, 1456, 3])

In a fully convolutional network, we initialize the transposed convolutional layer with upsampling of bilinear interpolation. For the 1×1 convolutional layer, we use Xavier initialization.

완전 컨벌루션 네트워크에서 쌍선형 보간법의 업샘플링으로 전치된 컨벌루션 레이어를 초기화합니다. 1×1 컨볼루션 레이어의 경우 Xavier 초기화를 사용합니다.

W = bilinear_kernel(num_classes, num_classes, 64)
net.transpose_conv.weight.data.copy_(W);

위 코드는 미리 정의된 네트워크 net의 ConvTranspose2d 레이어의 가중치를 Bilinear Interpolation에 사용되는 가중치로 설정하는 부분입니다.

해석을 해보겠습니다:

W = bilinear_kernel(num_classes, num_classes, 64)은 bilinear_kernel 함수를 사용하여 Bilinear Interpolation에 사용되는 가중치 행렬 W를 생성합니다. 이 함수는 입력 채널 수와 출력 채널 수가 같으며, 커널 크기는 64x64입니다. 이렇게 생성된 W는 Bilinear Interpolation을 수행하는 필터 역할을 합니다.
net.transpose_conv.weight.data.copy_(W)은 미리 정의된 net의 ConvTranspose2d 레이어의 가중치를 Bilinear Interpolation에 사용되는 W로 설정합니다. net.transpose_conv.weight.data는 레이어의 가중치 데이터를 의미하며, .copy_(W)를 사용하여 W의 값을 이 가중치에 복사합니다.

이렇게 설정된 가중치로 인해 ConvTranspose2d 레이어는 입력 이미지를 업샘플링하고 Bilinear Interpolation을 수행하여 출력 이미지를 생성하게 됩니다.

14.11.3. Reading the Dataset

We read the semantic segmentation dataset as introduced in Section 14.9. The output image shape of random cropping is specified as 320×480: both the height and width are divisible by 32.

섹션 14.9에서 소개한 시맨틱 분할 데이터 세트를 읽습니다. 임의 자르기의 출력 이미지 모양은 320×480으로 지정됩니다. 높이와 너비는 모두 32로 나눌 수 있습니다.

batch_size, crop_size = 32, (320, 480)
train_iter, test_iter = d2l.load_data_voc(batch_size, crop_size)

위 코드는 Pascal VOC 데이터셋을 사용하여 학습과 테스트 데이터로 나누어 불러오는 부분입니다.

해석을 해보겠습니다:

batch_size는 미니배치의 크기를 나타내며, 한 번의 iteration에서 처리되는 데이터 샘플의 수입니다.
crop_size는 이미지를 잘라내는 크기입니다. VOC 데이터셋의 이미지를 모두 동일한 크기로 자르기 위해 사용됩니다.
train_iter, test_iter = d2l.load_data_voc(batch_size, crop_size)는 d2l.load_data_voc 함수를 호출하여 Pascal VOC 데이터셋을 불러오고 학습용 데이터와 테스트용 데이터로 나눈 미니배치 반복자 train_iter와 test_iter를 생성합니다. 이 반복자들은 학습 및 테스트에 사용됩니다.

즉, 위 코드는 Pascal VOC 데이터셋을 불러와서 학습용 데이터와 테스트용 데이터로 나누고, 미니배치 반복자를 생성하는 과정을 나타냅니다.

read 1114 examples
read 1078 examples

14.11.4. Training

Now we can train our constructed fully convolutional network. The loss function and accuracy calculation here are not essentially different from those in image classification of earlier chapters. Because we use the output channel of the transposed convolutional layer to predict the class for each pixel, the channel dimension is specified in the loss calculation. In addition, the accuracy is calculated based on correctness of the predicted class for all the pixels.

이제 우리는 구축한 완전 컨벌루션 네트워크를 훈련할 수 있습니다. 여기서 손실 함수와 정확도 계산은 이전 장의 이미지 분류와 본질적으로 다르지 않습니다. 각 픽셀의 클래스를 예측하기 위해 전치된 컨벌루션 레이어의 출력 채널을 사용하기 때문에 손실 계산에서 채널 차원이 지정됩니다. 또한 정확도는 모든 픽셀에 대해 예측된 클래스의 정확도를 기반으로 계산됩니다.

def loss(inputs, targets):
    return F.cross_entropy(inputs, targets, reduction='none').mean(1).mean(1)

num_epochs, lr, wd, devices = 5, 0.001, 1e-3, d2l.try_all_gpus()
trainer = torch.optim.SGD(net.parameters(), lr=lr, weight_decay=wd)
d2l.train_ch13(net, train_iter, test_iter, loss, trainer, num_epochs, devices)

위 코드는 주어진 모델을 학습시키는 과정을 나타내고 있습니다.

def loss(inputs, targets) 함수는 입력 데이터와 실제 타겟 데이터를 받아와서 Cross Entropy 손실을 계산하는 함수입니다. 여기서 reduction='none'으로 설정하여 각 데이터 포인트의 손실 값을 계산하고, 이를 평균하여 최종 손실을 구합니다.
num_epochs는 에포크 수를 나타냅니다. 에포크는 학습 데이터를 몇 번 반복해서 사용하여 모델을 업데이트하는 단위입니다.
lr은 학습률(learning rate)로, 모델의 가중치를 업데이트할 때 얼마나 큰 스텝으로 업데이트할지를 결정합니다.
wd는 가중치 감쇠(weight decay)로, 가중치의 크기를 제어하여 오버피팅을 방지하고 모델의 일반화 성능을 높이는데 사용됩니다.
devices는 사용 가능한 GPU 장치를 찾는 함수인 d2l.try_all_gpus()를 호출하여 GPU 장치 목록을 가져옵니다.
trainer는 옵티마이저로, torch.optim.SGD를 사용하여 네트워크의 파라미터를 업데이트합니다.
d2l.train_ch13(net, train_iter, test_iter, loss, trainer, num_epochs, devices)는 모델을 학습시키는 함수를 호출합니다. 이 함수는 네트워크 모델, 학습용 데이터 반복자, 테스트용 데이터 반복자, 손실 함수, 옵티마이저, 에포크 수, GPU 장치 정보를 인자로 받아서 모델을 학습시키는 프로세스를 실행합니다.

따라서 위 코드는 주어진 모델을 주어진 데이터로 학습시키는 과정을 수행하는 코드입니다.

loss 0.449, train acc 0.861, test acc 0.852
226.7 examples/sec on [device(type='cuda', index=0), device(type='cuda', index=1)]

14.11.5. Prediction

When predicting, we need to standardize the input image in each channel and transform the image into the four-dimensional input format required by the CNN.

예측할 때 각 채널의 입력 이미지를 표준화하고 이미지를 CNN에서 요구하는 4차원 입력 형식으로 변환해야 합니다.

def predict(img):
    X = test_iter.dataset.normalize_image(img).unsqueeze(0)
    pred = net(X.to(devices[0])).argmax(dim=1)
    return pred.reshape(pred.shape[1], pred.shape[2])

위 코드는 주어진 이미지에 대한 Semantic Segmentation 예측을 수행하는 함수를 정의한 것입니다.

def predict(img) 함수는 이미지를 입력으로 받아와서 Semantic Segmentation을 예측합니다.
X = test_iter.dataset.normalize_image(img).unsqueeze(0)는 입력 이미지를 네트워크의 입력 형식에 맞게 전처리하는 과정입니다. 입력 이미지를 모델의 입력에 맞게 정규화(Normalization)하고, 배치 차원을 추가하여 모델에 한 번에 처리할 수 있도록 준비합니다.
pred = net(X.to(devices[0])).argmax(dim=1)는 전처리된 입력 이미지를 네트워크에 전달하고, 네트워크의 출력을 가져와서 가장 높은 값의 클래스 인덱스를 선택합니다. 이를 통해 각 픽셀에 대한 예측 클래스 인덱스를 얻습니다.
return pred.reshape(pred.shape[1], pred.shape[2])는 예측된 클래스 인덱스를 원본 이미지 크기에 맞게 재조정하여 반환합니다. 반환된 클래스 인덱스 맵은 이미지의 각 픽셀에 대한 클래스 정보를 포함하게 됩니다.

즉, predict 함수는 주어진 이미지에 대해 Semantic Segmentation 모델로 예측을 수행하고, 예측된 클래스 인덱스 맵을 반환하는 역할을 합니다.

To visualize the predicted class of each pixel, we map the predicted class back to its label color in the dataset.

각 픽셀의 예측된 클래스를 시각화하기 위해 예측된 클래스를 데이터 세트의 레이블 색상에 다시 매핑합니다.

def label2image(pred):
    colormap = torch.tensor(d2l.VOC_COLORMAP, device=devices[0])
    X = pred.long()
    return colormap[X, :]

위 코드는 예측된 클래스 인덱스 맵을 컬러 이미지로 변환하는 함수를 정의한 것입니다.

def label2image(pred) 함수는 예측된 클래스 인덱스 맵을 입력으로 받아서 컬러 이미지로 변환합니다.
colormap = torch.tensor(d2l.VOC_COLORMAP, device=devices[0])는 VOC 데이터셋에 사용되는 클래스의 컬러 맵을 Torch Tensor로 변환하여 준비합니다. 각 클래스마다 대응하는 RGB 컬러 값을 가지고 있습니다.
X = pred.long()는 예측된 클래스 인덱스 맵을 정수 형태로 변환합니다. 일반적으로 클래스 인덱스는 정수 값으로 표현됩니다.
return colormap[X, :]는 클래스 인덱스 맵에 대응하는 컬러 값을 colormap에서 찾아서 반환합니다. 따라서 각 픽셀에 대한 클래스 인덱스에 해당하는 컬러로 구성된 컬러 이미지가 반환됩니다.

이 함수를 사용하면 Semantic Segmentation 모델이 예측한 클래스 인덱스 맵을 실제 컬러 이미지로 변환하여 시각화할 수 있습니다.

Images in the test dataset vary in size and shape. Since the model uses a transposed convolutional layer with stride of 32, when the height or width of an input image is indivisible by 32, the output height or width of the transposed convolutional layer will deviate from the shape of the input image. In order to address this issue, we can crop multiple rectangular areas with height and width that are integer multiples of 32 in the image, and perform forward propagation on the pixels in these areas separately. Note that the union of these rectangular areas needs to completely cover the input image. When a pixel is covered by multiple rectangular areas, the average of the transposed convolution outputs in separate areas for this same pixel can be input to the softmax operation to predict the class.

테스트 데이터 세트의 이미지는 크기와 모양이 다양합니다. 이 모델은 보폭이 32인 전치 컨볼루션 레이어를 사용하기 때문에 입력 이미지의 높이 또는 너비가 32로 나눌 수 없는 경우 전치된 컨볼루션 레이어의 출력 높이 또는 너비가 입력 이미지의 모양에서 벗어날 것입니다. 이 문제를 해결하기 위해 이미지에서 높이와 너비가 32의 정수배인 여러 직사각형 영역을 자르고 이 영역의 픽셀에 대해 개별적으로 순방향 전파를 수행할 수 있습니다. 이러한 직사각형 영역의 합집합은 입력 이미지를 완전히 덮어야 합니다. 픽셀이 여러 직사각형 영역으로 덮여 있는 경우 동일한 픽셀에 대해 별도의 영역에서 전치된 컨볼루션 출력의 평균을 소프트맥스 연산에 입력하여 클래스를 예측할 수 있습니다.

For simplicity, we only read a few larger test images, and crop a 320×480 area for prediction starting from the upper-left corner of an image. For these test images, we print their cropped areas, prediction results, and ground-truth row by row.

간단하게 하기 위해 몇 개의 더 큰 테스트 이미지만 읽고 이미지의 왼쪽 상단 모서리에서 시작하여 예측을 위해 320×480 영역을 자릅니다. 이러한 테스트 이미지의 경우 잘린 영역, 예측 결과 및 실측 정보를 행별로 인쇄합니다.

voc_dir = d2l.download_extract('voc2012', 'VOCdevkit/VOC2012')
test_images, test_labels = d2l.read_voc_images(voc_dir, False)
n, imgs = 4, []
for i in range(n):
    crop_rect = (0, 0, 320, 480)
    X = torchvision.transforms.functional.crop(test_images[i], *crop_rect)
    pred = label2image(predict(X))
    imgs += [X.permute(1,2,0), pred.cpu(),
             torchvision.transforms.functional.crop(
                 test_labels[i], *crop_rect).permute(1,2,0)]
d2l.show_images(imgs[::3] + imgs[1::3] + imgs[2::3], 3, n, scale=2);

위 코드는 VOC 데이터셋의 테스트 이미지와 해당 이미지에 대한 예측 결과, 그리고 실제 레이블 이미지를 시각화하는 예제입니다.

voc_dir = d2l.download_extract('voc2012', 'VOCdevkit/VOC2012')는 VOC 데이터셋을 다운로드하고 압축을 해제하여 voc_dir에 저장합니다.
test_images, test_labels = d2l.read_voc_images(voc_dir, False)는 VOC 데이터셋의 테스트 이미지와 레이블 이미지를 읽어옵니다.
n, imgs = 4, []는 시각화할 이미지 개수 n과 결과 이미지를 저장할 리스트 imgs를 초기화합니다.
for i in range(n):는 n 개의 이미지에 대해 반복합니다.
crop_rect = (0, 0, 320, 480)는 이미지를 잘라낼 영역을 정의합니다.
X = torchvision.transforms.functional.crop(test_images[i], *crop_rect)는 테스트 이미지 i를 잘라낸 후 X에 저장합니다.
pred = label2image(predict(X))는 잘라낸 이미지 X에 대한 예측을 수행한 뒤, 예측 결과를 컬러 이미지로 변환하여 pred에 저장합니다.
imgs += [X.permute(1,2,0), pred.cpu(), ...]는 잘라낸 원본 이미지, 예측 결과 이미지, 실제 레이블 이미지를 리스트에 추가합니다. .permute(1, 2, 0)는 이미지의 차원 순서를 변경하여 HWC 형식으로 변환합니다.
d2l.show_images(imgs[::3] + imgs[1::3] + imgs[2::3], 3, n, scale=2)는 원본 이미지, 예측 결과 이미지, 실제 레이블 이미지를 한 줄에 3개씩 표시하고, 이를 n 행으로 나열하여 시각화합니다. scale=2는 이미지 크기를 확대하여 표시합니다.

이 코드를 통해 VOC 데이터셋의 테스트 이미지에 대한 Semantic Segmentation 모델의 예측 결과와 실제 레이블 이미지를 시각화할 수 있습니다.

14.11.6. Summary

The fully convolutional network first uses a CNN to extract image features, then transforms the number of channels into the number of classes via a 1×1 convolutional layer, and finally transforms the height and width of the feature maps to those of the input image via the transposed convolution.

완전 컨벌루션 네트워크는 먼저 CNN을 사용하여 이미지 특징을 추출합니다. 그런 다음 1×1 컨벌루션 레이어를 통해 채널 수를 클래스 수로 변환합니다. 마지막으로 transposed convolution을 통해 feature map의 높이와 너비를 입력 이미지의 높이와 너비로 변환합니다.
In a fully convolutional network, we can use upsampling of bilinear interpolation to initialize the transposed convolutional layer.

완전 컨벌루션 네트워크에서는 쌍선형 보간의 업샘플링을 사용하여 전치된 컨볼루션 레이어를 초기화할 수 있습니다.

14.11.7. Exercises

If we use Xavier initialization for the transposed convolutional layer in the experiment, how does the result change?
Can you further improve the accuracy of the model by tuning the hyperparameters?
Predict the classes of all pixels in test images.
The original fully convolutional network paper also uses outputs of some intermediate CNN layers (Long et al., 2015). Try to implement this idea.

'Dive into Deep Learning > D2L Computer Vision' 카테고리의 다른 글

D2L - 14.12. Neural Style Transfer (1)	2023.08.21
D2L - 14.10. Transposed Convolution (1)	2023.08.21
D2L - 14.9. Semantic Segmentation and the Dataset (0)	2023.08.20
D2L - 14.8. Region-based CNNs (R-CNNs) (0)	2023.08.20
D2L - 14.7. Single Shot Multibox Detection (0)	2023.08.19
D2L - 14.6. The Object Detection Dataset (0)	2023.08.19
D2L - 14.5. Multiscale Object Detection (0)	2023.08.19
D2L - 14.4. Anchor Boxes (0)	2023.08.19
D2L - 14.3. Object Detection and Bounding Boxes (0)	2023.08.19
D2L - 14.2. Fine-Tuning (0)	2023.08.19

Dive into Deep Learning/D2L Computer Vision

D2L - 14.10. Transposed Convolution

2023. 8. 21. 01:26 | Posted by 솔웅

https://d2l.ai/chapter_computer-vision/transposed-conv.html

14.10. Transposed Convolution — Dive into Deep Learning 1.0.3 documentation

d2l.ai

14.10. Transposed Convolution

The CNN layers we have seen so far, such as convolutional layers (Section 7.2) and pooling layers (Section 7.5), typically reduce (downsample) the spatial dimensions (height and width) of the input, or keep them unchanged. In semantic segmentation that classifies at pixel-level, it will be convenient if the spatial dimensions of the input and output are the same. For example, the channel dimension at one output pixel can hold the classification results for the input pixel at the same spatial position.

우리가 지금까지 본 컨볼루션 레이어(섹션 7.2) 및 풀링 레이어(섹션 7.5)와 같은 CNN 레이어는 일반적으로 입력의 공간 차원 spatial dimensions (높이 및 너비)을 줄이거나 변경하지 않고 유지합니다. 픽셀 단위로 분류하는 시맨틱 분할semantic segmentation에서는 입력과 출력의 공간적 차원이 같으면 편리할 것이다. 예를 들어, 한 출력 픽셀의 채널 차원은 동일한 공간 위치에서 입력 픽셀에 대한 분류 결과를 보유할 수 있습니다.

To achieve this, especially after the spatial dimensions are reduced by CNN layers, we can use another type of CNN layers that can increase (upsample) the spatial dimensions of intermediate feature maps. In this section, we will introduce transposed convolution, which is also called fractionally-strided convolution (Dumoulin and Visin, 2016), for reversing downsampling operations by the convolution.

이를 달성하기 위해 특히 공간 차원이 CNN 레이어에 의해 축소된 후 중간 피쳐 맵의 공간 차원을 증가(업샘플링)할 수 있는 다른 유형의 CNN 레이어를 사용할 수 있습니다. 이 섹션에서는 convolution에 의한 downsampling 연산을 반전시키기 위해 fractionally-strided convolution(Dumoulin and Visin, 2016)이라고도 하는 transposed convolution을 소개합니다.

import torch
from torch import nn
from d2l import torch as d2l

Transposed Convolution이란?

Transposed convolution, also known as fractionally strided convolution or deconvolution, is an operation in convolutional neural networks (CNNs) that performs an upscaling operation on an input feature map. Unlike regular convolution, which reduces the spatial resolution of the feature map, transposed convolution increases the spatial resolution of the feature map.

전치 합성곱(Transposed Convolution), 또는 분수 합성곱 또는 역 합성곱(deconvolution)으로도 알려져 있으며, 컨볼루션 신경망(CNN)에서 입력 특성 맵에 대한 업스케일링 작업을 수행하는 연산입니다. 일반적인 컨볼루션과는 달리, 전치 합성곱은 특성 맵의 공간 해상도를 증가시킵니다.

The operation involves using learnable parameters (also known as convolutional kernels) to map each input pixel to multiple output pixels. This process effectively enlarges the feature map by inserting zeros or duplicating values in between the original input values. Transposed convolution is commonly used for tasks like image segmentation, image generation, and increasing the spatial resolution of feature maps.

이 연산은 학습 가능한 매개변수(컨볼루션 커널이라고도 함)를 사용하여 각 입력 픽셀을 여러 출력 픽셀로 매핑합니다. 이 과정은 원래 입력 값 사이에 0을 삽입하거나 값을 복제하여 특성 맵을 확장합니다. 전치 합성곱은 이미지 세그멘테이션, 이미지 생성, 특성 맵의 공간 해상도 증가와 같은 작업에 널리 사용됩니다.

The main steps of the transposed convolution operation are as follows:

전치 합성곱 연산의 주요 단계는 다음과 같습니다:

Input: The input feature map has a certain height and width, and it is commonly represented as a 3D tensor with dimensions (channels, height, width).

입력: 입력 특성 맵은 특정한 높이와 너비를 갖고 있으며, 보통 (채널, 높이, 너비) 차원의 3D 텐서로 나타납니다.
Convolution: A set of learnable convolutional kernels (filters) is applied to the input feature map. Each kernel produces a response map by convolving with the input.

컨볼루션: 학습 가능한 컨볼루션 커널(필터) 집합이 입력 특성 맵에 적용됩니다. 각 커널은 입력과 컨볼루션하여 응답 맵을 생성합니다.
Upsampling: The response maps are then upsampled to increase the spatial resolution. This is achieved by inserting zeros or duplicating values in between the original input values, effectively increasing the size of the feature map.

업스케일링: 응답 맵은 업스케일링되어 공간 해상도를 증가시킵니다. 이를 위해 원래 입력 값 사이에 0을 삽입하거나 값을 복제하여 특성 맵의 크기를 키웁니다.
Output: The final output is an upsampled feature map with higher spatial resolution, which can be used for subsequent tasks like segmentation or generating higher-resolution images.

출력: 최종 출력은 공간 해상도가 높아진 업스케일링된 특성 맵으로, 세그멘테이션이나 더 높은 해상도의 이미지 생성과 같은 후속 작업에 사용할 수 있습니다.

Transposed convolution is commonly used in various architectures, such as in generative adversarial networks (GANs) for image generation and in decoder portions of autoencoders for feature reconstruction. It provides a way to recover spatial information that might have been lost during earlier convolutional operations, enabling the network to generate fine-grained details and high-resolution representations.

전치 합성곱은 이미지 생성을 위한 생성적 적대 신경망(GAN)이나 특성 재구성을 위한 오토인코더의 디코더 부분과 같은 다양한 아키텍처에서 자주 사용됩니다. 이를 통해 네트워크는 이전 컨볼루션 연산 중에 손실될 수 있는 공간 정보를 복구하여 미세한 세부 정보와 고해상도 표현을 생성할 수 있는 방법을 제공 합니다.

GAN이란 ?

GAN stands for "Generative Adversarial Network." It is a class of machine learning models that consists of two neural networks, known as the generator and the discriminator, that are trained simultaneously through a competitive process.

The key idea behind GANs is to have two networks compete against each other in a game-like scenario:

GAN은 "생성적 적대 신경망"을 의미합니다. 이는 생성자(generator)와 판별자(discriminator) 두 개의 신경망으로 구성되어 있으며, 이들이 경쟁적인 과정을 통해 동시에 훈련되는 머신 러닝 모델의 한 유형입니다.

GAN의 핵심 아이디어는 두 개의 신경망을 게임과 유사한 상황에서 서로 경쟁하게 하는 것입니다:

Generator: The generator network takes random noise as input and generates data, such as images, audio, or text. It learns to create data that is similar to real data from a given dataset.

생성자: 생성자 신경망은 무작위한 잡음을 입력으로 받아 이미지, 음성 또는 텍스트와 같은 데이터를 생성합니다. 이는 주어진 데이터셋의 실제 데이터와 유사한 데이터를 생성하는 방법을 학습합니다.
Discriminator: The discriminator network, also called the critic, takes both real data from the training dataset and generated data from the generator as input. It learns to distinguish between real and generated data.

판별자: 판별자 신경망, 또는 비평자(critic)라고도 하는 이 네트워크는 훈련 데이터셋의 실제 데이터와 생성자로부터 생성된 데이터를 모두 입력으로 받습니다. 판별자는 실제와 생성된 데이터를 구분하는 방법을 학습합니다.

During training, the generator aims to produce data that is so realistic that it can fool the discriminator into believing that it is real data. On the other hand, the discriminator tries to correctly classify whether the input data is real or generated. This process creates a feedback loop where the generator improves its ability to create realistic data, while the discriminator becomes better at distinguishing real from generated data.

훈련 중에 생성자는 판별자를 속여 실제 데이터처럼 보이는 데이터를 생성하는 것을 목표로 합니다. 반면, 판별자는 입력 데이터가 실제 데이터인지 생성된 데이터인지 올바르게 분류하려고 합니다. 이 과정은 생성자가 실제 데이터와 거의 구별되지 않는 실제 같은 데이터를 생성하도록 개선되는 반면, 판별자는 실제와 생성된 데이터를 구별하는 능력을 향상시킵니다.

The training of a GAN involves an adversarial process, where the generator and discriminator are iteratively updated to improve their performance. As training progresses, the generator becomes more adept at creating data that is increasingly difficult for the discriminator to differentiate from real data. The ultimate goal is for the generator to produce data that is indistinguishable from real data, effectively learning the underlying distribution of the training dataset.

GAN의 훈련은 생성자와 판별자를 반복적으로 업데이트하여 성능을 개선하는 적대적인 과정을 포함합니다. 훈련이 진행됨에 따라 생성자는 판별자가 실제 데이터와 구별하기 어려운 데이터를 더 잘 생성하게 되며, 최종 목표는 생성자가 실제 데이터와 구별할 수 없을 정도로 실제와 똑같은 데이터를 생성하도록 하는 것입니다. 이로써 생성자는 훈련 데이터셋의 기반이 되는 분포를 학습합니다.

GANs have demonstrated remarkable success in various applications, including image generation, style transfer, data augmentation, super-resolution, and more. They have also led to the development of more advanced GAN variants, such as Conditional GANs, CycleGANs, and Progressive GANs, each with their own specific applications and improvements

GAN은 이미지 생성, 스타일 변환, 데이터 증강, 초해상도 등 다양한 응용 분야에서 놀라운 성과를 보여주었습니다. 더 나아가 조건부 GAN, CycleGAN, Progressive GAN 등과 같은 고급 GAN 변형들이 개발되어 각각 특정 응용 및 개선에 활용되고 있습니다.

14.10.1. Basic Operation

Ignoring channels for now, let’s begin with the basic transposed convolution operation with stride of 1 and no padding. Suppose that we are given a nℎ×nw input tensor and a kℎ×kw kernel. Sliding the kernel window with stride of 1 for nw times in each row and nℎ times in each column yields a total of nℎnw intermediate results. Each intermediate result is a (nℎ+kℎ−1)×(nw+kw−1) tensor that are initialized as zeros. To compute each intermediate tensor, each element in the input tensor is multiplied by the kernel so that the resulting kℎ×kw tensor replaces a portion in each intermediate tensor. Note that the position of the replaced portion in each intermediate tensor corresponds to the position of the element in the input tensor used for the computation. In the end, all the intermediate results are summed over to produce the output.

지금은 채널을 무시하고 보폭이 1이고 패딩이 없는 기본 전치 컨볼루션 작업 transposed convolution operation 부터 시작하겠습니다. nℎ×nw 입력 텐서와 kℎ×kw 커널이 주어졌다고 가정합니다. 커널 창을 각 행에서 nw번, 각 열에서 nℎ번 stride 1로 슬라이딩하면 총 nℎnw 중간 결과가 생성됩니다. 각 중간 결과는 0으로 초기화되는 (nℎ+kℎ−1)×(nw+kw−1) 텐서입니다. 각 중간 텐서를 계산하기 위해 입력 텐서의 각 요소에 커널을 곱하여 결과 kℎ×kw 텐서가 각 중간 텐서의 일부를 대체합니다. 각 중간 텐서에서 대체된 부분의 위치는 계산에 사용된 입력 텐서의 요소 위치에 해당합니다. 결국 모든 중간 결과가 합산되어 출력이 생성됩니다.

As an example, Fig. 14.10.1 illustrates how transposed convolution with a 2×2 kernel is computed for a 2×2 input tensor.

예를 들어, 그림 14.10.1은 2×2 입력 텐서에 대해 2×2 커널을 사용한 전치 컨볼루션이 계산되는 방법을 보여줍니다.

Fig. 14.10.1  Transposed convolution with a 2×2 kernel. The shaded portions are a portion of an intermediate tensor as well as the input and kernel tensor elements used for the computation.

We can implement this basic transposed convolution operation trans_conv for a input matrix X and a kernel matrix K.

입력 행렬 X와 커널 행렬 K에 대해 이 기본 전치 합성곱 연산transposed convolution operation trans_conv를 구현할 수 있습니다.

def trans_conv(X, K):
    h, w = K.shape
    Y = torch.zeros((X.shape[0] + h - 1, X.shape[1] + w - 1))
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            Y[i: i + h, j: j + w] += X[i, j] * K
    return Y

이 코드는 2D 행렬 X와 커널 행렬 K 간의 2D 전치 합성곱(Transposed Convolution) 연산을 수행하는 함수를 정의한 내용입니다.

def trans_conv(X, K):: trans_conv 함수를 정의합니다. 이 함수는 2D 전치 합성곱 연산을 수행합니다.
- h, w = K.shape: 커널 K의 높이와 너비를 가져옵니다.
- Y = torch.zeros((X.shape[0] + h - 1, X.shape[1] + w - 1)): 출력 결과를 저장할 2D 행렬 Y를 초기화합니다. 크기는 입력 X의 높이와 너비에 커널 K의 크기를 더한 값으로 설정됩니다.
- for i in range(X.shape[0]):: 입력 X의 행 인덱스에 대해 반복합니다.
  - for j in range(X.shape[1]):: 입력 X의 열 인덱스에 대해 반복합니다.
    - Y[i: i + h, j: j + w] += X[i, j] * K: 입력 X의 (i, j) 위치에서 커널 K를 가중치로 사용하여 출력 Y의 해당 영역에 누적합니다.
- return Y: 전치 합성곱 연산 결과인 출력 행렬 Y를 반환합니다.

이 함수는 입력과 커널로 주어진 데이터에 대해 2D 전치 합성곱 연산을 수행하여 출력 결과를 생성합니다.

In contrast to the regular convolution (in Section 7.2) that reduces input elements via the kernel, the transposed convolution broadcasts input elements via the kernel, thereby producing an output that is larger than the input. We can construct the input tensor X and the kernel tensor K from Fig. 14.10.1 to validate the output of the above implementation of the basic two-dimensional transposed convolution operation.

커널을 통해 입력 요소를 줄이는 일반 컨볼루션(섹션 7.2)과 달리 전치 컨볼루션은 커널을 통해 입력 요소를 브로드캐스트하여 입력보다 큰 출력을 생성합니다. 그림 14.10.1에서 입력 텐서 X와 커널 텐서 K를 구성하여 기본 2차원 전치 컨볼루션 연산의 위 구현 결과를 검증할 수 있습니다.

X = torch.tensor([[0.0, 1.0], [2.0, 3.0]])
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]])
trans_conv(X, K)

이 코드는 주어진 입력 행렬 X와 커널 K에 대해 2D 전치 합성곱 연산을 수행하는 과정을 나타냅니다.

X = torch.tensor([[0.0, 1.0], [2.0, 3.0]]): 2x2 크기의 입력 행렬 X를 생성합니다.
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]]): 2x2 크기의 커널 K를 생성합니다.
trans_conv(X, K): 앞에서 정의한 trans_conv 함수에 입력 행렬 X와 커널 K를 전달하여 전치 합성곱 연산을 수행합니다.

이 코드는 입력 행렬과 커널을 사용하여 전치 합성곱 연산을 실행한 결과를 반환합니다. 결과적으로 2D 전치 합성곱 연산의 결과를 확인할 수 있습니다.

tensor([[ 0.,  0.,  1.],
        [ 0.,  4.,  6.],
        [ 4., 12.,  9.]])

Alternatively, when the input X and kernel K are both four-dimensional tensors, we can use high-level APIs to obtain the same results.

또는 입력 X와 커널 K가 모두 4차원 텐서인 경우 고급 API를 사용하여 동일한 결과를 얻을 수 있습니다.

X, K = X.reshape(1, 1, 2, 2), K.reshape(1, 1, 2, 2)
tconv = nn.ConvTranspose2d(1, 1, kernel_size=2, bias=False)
tconv.weight.data = K
tconv(X)

이 코드는 주어진 입력 행렬 X와 커널 K에 대해 PyTorch의 합성곱 전치(Convolutional Transpose) 연산을 수행하는 과정을 나타냅니다.

X, K = X.reshape(1, 1, 2, 2), K.reshape(1, 1, 2, 2): 입력 행렬 X와 커널 K를 4차원 텐서로 변환합니다. 이를 위해 reshape 함수를 사용하여 각각 (1, 1, 2, 2) 크기의 4차원 텐서로 변환합니다. 이는 PyTorch 합성곱 연산의 입력 형식에 맞추기 위함입니다.
tconv = nn.ConvTranspose2d(1, 1, kernel_size=2, bias=False): PyTorch의 nn.ConvTranspose2d 클래스를 사용하여 합성곱 전치 연산을 정의합니다. 입력 채널과 출력 채널을 각각 1로 설정하고, 커널 크기를 2로 설정하며, 편향을 사용하지 않도록 bias=False로 설정합니다.
tconv.weight.data = K: 합성곱 전치 연산의 가중치(weight)를 커널 K로 설정합니다.
tconv(X): 정의한 합성곱 전치 연산 tconv를 입력 텐서 X에 적용하여 결과를 계산합니다. 이는 입력 X를 커널 K를 사용하여 전치 합성곱 연산을 수행한 결과입니다.

이 코드는 PyTorch의 합성곱 전치 연산을 활용하여 주어진 입력 행렬과 커널에 대한 결과를 계산하고 반환하는 과정을 나타냅니다.

tensor([[[[ 0.,  0.,  1.],
          [ 0.,  4.,  6.],
          [ 4., 12.,  9.]]]], grad_fn=<ConvolutionBackward0>)

14.10.2. Padding, Strides, and Multiple Channels

Different from in the regular convolution where padding is applied to input, it is applied to output in the transposed convolution. For example, when specifying the padding number on either side of the height and width as 1, the first and last rows and columns will be removed from the transposed convolution output.

패딩이 입력에 적용되는 일반 컨볼루션과 달리 전치 컨볼루션에서는 출력에 패딩이 적용됩니다. 예를 들어 높이와 너비 양쪽의 패딩 수를 1로 지정하면 Transposed Convolution 출력에서 첫 번째와 마지막 행과 열이 제거됩니다.

tconv = nn.ConvTranspose2d(1, 1, kernel_size=2, padding=1, bias=False)
tconv.weight.data = K
tconv(X)

이 코드는 주어진 입력 행렬 X와 커널 K에 대해 PyTorch의 패딩이 포함된 합성곱 전치(Convolutional Transpose) 연산을 수행하는 과정을 나타냅니다.

tconv = nn.ConvTranspose2d(1, 1, kernel_size=2, padding=1, bias=False): PyTorch의 nn.ConvTranspose2d 클래스를 사용하여 패딩이 포함된 합성곱 전치 연산을 정의합니다. 입력 채널과 출력 채널을 각각 1로 설정하고, 커널 크기를 2로 설정하며, 패딩을 1로 설정하고, 편향을 사용하지 않도록 bias=False로 설정합니다. 이렇게 함으로써 입력 이미지에 패딩이 적용된 합성곱 전치 연산을 수행하게 됩니다.
tconv.weight.data = K: 합성곱 전치 연산의 가중치(weight)를 커널 K로 설정합니다.
tconv(X): 정의한 패딩이 포함된 합성곱 전치 연산 tconv를 입력 텐서 X에 적용하여 결과를 계산합니다. 이는 입력 X에 패딩이 적용된 상태에서 커널 K를 사용하여 전치 합성곱 연산을 수행한 결과입니다.

이 코드는 PyTorch의 패딩이 포함된 합성곱 전치 연산을 활용하여 주어진 입력 행렬과 커널에 대한 결과를 계산하고 반환하는 과정을 나타냅니다.

tensor([[[[4.]]]], grad_fn=<ConvolutionBackward0>)

In the transposed convolution, strides are specified for intermediate results (thus output), not for input. Using the same input and kernel tensors from Fig. 14.10.1, changing the stride from 1 to 2 increases both the height and weight of intermediate tensors, hence the output tensor in Fig. 14.10.2.

Transposed Convolution에서 보폭은 입력이 아닌 중간 결과(따라서 출력)에 지정됩니다. 그림 14.10.1의 동일한 입력 및 커널 텐서를 사용하여 stride를 1에서 2로 변경하면 중간 텐서의 높이와 무게가 모두 증가하므로 그림 14.10.2의 출력 텐서가 증가합니다.

Fig. 14.10.2  Transposed convolution with a 2×2 kernel with stride of 2. The shaded portions are a portion of an intermediate tensor as well as the input and kernel tensor elements used for the computation.

The following code snippet can validate the transposed convolution output for stride of 2 in Fig. 14.10.2.

다음 코드 스니펫은 그림 14.10.2에서 보폭이 2인 전치 컨볼루션 출력을 검증할 수 있습니다.

tconv = nn.ConvTranspose2d(1, 1, kernel_size=2, stride=2, bias=False)
tconv.weight.data = K
tconv(X)

이 코드는 주어진 입력 행렬 X와 커널 K에 대해 PyTorch의 스트라이드(stride)가 적용된 합성곱 전치(Convolutional Transpose) 연산을 수행하는 과정을 나타냅니다.

tconv = nn.ConvTranspose2d(1, 1, kernel_size=2, stride=2, bias=False): PyTorch의 nn.ConvTranspose2d 클래스를 사용하여 스트라이드가 적용된 합성곱 전치 연산을 정의합니다. 입력 채널과 출력 채널을 각각 1로 설정하고, 커널 크기를 2로 설정하며, 스트라이드를 2로 설정하고, 편향을 사용하지 않도록 bias=False로 설정합니다. 이렇게 함으로써 입력 이미지에 스트라이드가 적용된 합성곱 전치 연산을 수행하게 됩니다.
tconv.weight.data = K: 합성곱 전치 연산의 가중치(weight)를 커널 K로 설정합니다.
tconv(X): 정의한 스트라이드가 적용된 합성곱 전치 연산 tconv를 입력 텐서 X에 적용하여 결과를 계산합니다. 이는 입력 X에 스트라이드가 적용된 상태에서 커널 K를 사용하여 전치 합성곱 연산을 수행한 결과입니다.

이 코드는 PyTorch의 스트라이드가 적용된 합성곱 전치 연산을 활용하여 주어진 입력 행렬과 커널에 대한 결과를 계산하고 반환하는 과정을 나타냅니다.

tensor([[[[0., 0., 0., 1.],
          [0., 0., 2., 3.],
          [0., 2., 0., 3.],
          [4., 6., 6., 9.]]]], grad_fn=<ConvolutionBackward0>)

For multiple input and output channels, the transposed convolution works in the same way as the regular convolution. Suppose that the input has ci channels, and that the transposed convolution assigns a kℎ×kw kernel tensor to each input channel. When multiple output channels are specified, we will have a ci×kℎ×kw kernel for each output channel.

다중 입력 및 출력 채널의 경우 전치 컨볼루션은 일반 컨볼루션과 동일한 방식으로 작동합니다. 입력에 ci 채널이 있고 전치 컨벌루션이 각 입력 채널에 kℎ×kw 커널 텐서를 할당한다고 가정합니다. 여러 출력 채널이 지정된 경우 각 출력 채널에 대해 ci×kℎ×kw 커널이 있습니다.

As in all, if we feed X into a convolutional layer f to output Y=f(X) and create a transposed convolutional layer g with the same hyperparameters as f except for the number of output channels being the number of channels in X, then g(Y) will have the same shape as X. This can be illustrated in the following example.

대체로 X를 컨볼루션 레이어 f에 공급하여 Y=f(X)를 출력하고 출력 채널 수가 X의 채널 수인 것을 제외하고 f와 동일한 하이퍼파라미터로 전치된 컨볼루션 레이어 g를 생성하면 g(Y)는 X와 동일한 모양을 갖습니다. 이는 다음 예에서 설명할 수 있습니다.

X = torch.rand(size=(1, 10, 16, 16))
conv = nn.Conv2d(10, 20, kernel_size=5, padding=2, stride=3)
tconv = nn.ConvTranspose2d(20, 10, kernel_size=5, padding=2, stride=3)
tconv(conv(X)).shape == X.shape

이 코드는 주어진 입력 텐서 X에 대해 합성곱(Convolution)과 합성곱 전치(Convolutional Transpose) 연산을 차례로 적용하고, 최종 결과의 형태를 확인하는 과정을 나타냅니다.

X = torch.rand(size=(1, 10, 16, 16)): 크기가 (1, 10, 16, 16)인 랜덤한 값을 가지는 4차원 텐서 X를 생성합니다. 이는 1개의 채널, 10개의 필터, 16x16 크기의 이미지를 의미합니다.
conv = nn.Conv2d(10, 20, kernel_size=5, padding=2, stride=3): 입력 채널이 10이고 출력 채널이 20인 합성곱 연산(nn.Conv2d)을 정의합니다. 커널 크기를 5로 설정하고, 패딩을 2로 설정하여 출력 크기를 입력과 같게 만들어주고, 스트라이드를 3으로 설정하여 스트라이드가 적용된 합성곱을 정의합니다.
tconv = nn.ConvTranspose2d(20, 10, kernel_size=5, padding=2, stride=3): 입력 채널이 20이고 출력 채널이 10인 합성곱 전치 연산(nn.ConvTranspose2d)을 정의합니다. 커널 크기를 5로 설정하고, 패딩을 2로 설정하여 출력 크기를 입력과 같게 만들어주고, 스트라이드를 3으로 설정하여 스트라이드가 적용된 합성곱 전치를 정의합니다.
tconv(conv(X)).shape == X.shape: 먼저 입력 X에 합성곱 conv를 적용한 결과에 합성곱 전치 tconv를 적용한 결과의 형태가 입력 X의 형태와 같은지 비교합니다. 이 비교 결과는 True 또는 False가 됩니다. 이 비교는 합성곱과 합성곱 전치 연산을 차례로 적용했을 때 입력과 동일한 형태의 결과가 나오는지 확인하는 것입니다.

이 코드는 PyTorch의 합성곱과 합성곱 전치 연산을 적용하고 그 결과의 형태를 확인하는 과정을 나타내며, 해당 비교를 통해 합성곱과 합성곱 전치 연산의 효과를 확인할 수 있습니다.

True

14.10.3. Connection to Matrix Transposition

The transposed convolution is named after the matrix transposition. To explain, let’s first see how to implement convolutions using matrix multiplications. In the example below, we define a 3×3 input X and a 2×2 convolution kernel K, and then use the corr2d function to compute the convolution output Y.

전치 컨볼루션은 행렬 전치의 이름을 따서 명명되었습니다. 설명을 위해 먼저 행렬 곱셈을 사용하여 컨볼루션을 구현하는 방법을 살펴보겠습니다. 아래 예에서는 3×3 입력 X와 2×2 컨볼루션 커널 K를 정의한 다음 corr2d 함수를 사용하여 컨볼루션 출력 Y를 계산합니다.

X = torch.arange(9.0).reshape(3, 3)
K = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
Y = d2l.corr2d(X, K)
Y

이 코드는 주어진 입력 행렬 X와 커널 K에 대해 2D 상관(Correlation) 연산을 수행하는 과정을 나타냅니다.

X = torch.arange(9.0).reshape(3, 3): 0부터 8까지의 값을 가지는 1차원 텐서를 3x3 크기의 2차원 행렬 X로 재구성합니다. 이는 입력 행렬입니다.
K = torch.tensor([[1.0, 2.0], [3.0, 4.0]]): 주어진 값을 가지는 2x2 크기의 커널 행렬 K를 생성합니다. 이는 2D 상관 연산에 사용될 커널입니다.
Y = d2l.corr2d(X, K): 2D 상관 연산 함수 d2l.corr2d를 사용하여 입력 행렬 X와 커널 K에 대해 상관 연산을 수행한 결과인 출력 행렬 Y를 계산합니다.

이 코드는 주어진 입력 행렬과 커널에 대해 2D 상관 연산을 수행하고 그 결과인 출력 행렬 Y를 반환하는 과정을 나타냅니다.

tensor([[27., 37.],
        [57., 67.]])

Next, we rewrite the convolution kernel K as a sparse weight matrix W containing a lot of zeros. The shape of the weight matrix is (4, 9), where the non-zero elements come from the convolution kernel K.

다음으로 컨볼루션 커널 K를 많은 0을 포함하는 희소 가중치 행렬 W로 다시 작성합니다. 가중치 행렬의 모양은 (4, 9)이며, 여기서 0이 아닌 요소는 컨볼루션 커널 K에서 나옵니다.

def kernel2matrix(K):
    k, W = torch.zeros(5), torch.zeros((4, 9))
    k[:2], k[3:5] = K[0, :], K[1, :]
    W[0, :5], W[1, 1:6], W[2, 3:8], W[3, 4:] = k, k, k, k
    return W

W = kernel2matrix(K)
W

이 코드는 주어진 2x2 크기의 커널 K를 행렬 형태로 변환하는 과정을 나타냅니다.

def kernel2matrix(K): 커널 K를 입력으로 받아서 행렬 형태로 변환하는 함수 kernel2matrix를 정의합니다.
k, W = torch.zeros(5), torch.zeros((4, 9)): 크기가 5인 1차원 텐서 k와 크기가 (4, 9)인 2차원 텐서 W를 생성합니다. k는 행렬 W를 구성하기 위한 임시 벡터입니다.
k[:2], k[3:5] = K[0, :], K[1, :]: K의 첫 번째 행과 두 번째 행의 값을 k의 첫 번째 원소와 세 번째, 네 번째 원소에 대입합니다. 이를 통해 k 벡터에 커널 K의 값들이 할당됩니다.
W[0, :5], W[1, 1:6], W[2, 3:8], W[3, 4:] = k, k, k, k: 행렬 W의 각 행에 k의 값들을 할당하여 행렬 형태로 변환합니다. 각 행은 서로 다른 위치에서 시작하는 값을 가지도록 할당됩니다.
return W: 변환된 행렬 W를 반환합니다.
W = kernel2matrix(K): 주어진 커널 K에 대해 kernel2matrix 함수를 호출하여 커널을 행렬 형태로 변환한 결과인 행렬 W를 얻습니다.

이 코드는 주어진 2x2 크기의 커널을 행렬 형태로 변환하여 행렬 W로 반환하는 과정을 나타냅니다.

tensor([[1., 2., 0., 3., 4., 0., 0., 0., 0.],
        [0., 1., 2., 0., 3., 4., 0., 0., 0.],
        [0., 0., 0., 1., 2., 0., 3., 4., 0.],
        [0., 0., 0., 0., 1., 2., 0., 3., 4.]])

Concatenate the input X row by row to get a vector of length 9. Then the matrix multiplication of W and the vectorized X gives a vector of length 4. After reshaping it, we can obtain the same result Y from the original convolution operation above: we just implemented convolutions using matrix multiplications.

입력 X를 행별로 연결하여 길이가 9인 벡터를 얻습니다. 그런 다음 W와 벡터화된 X의 행렬 곱셈은 길이가 4인 벡터를 제공합니다. 이를 재구성한 후 위의 원래 컨볼루션 연산에서 동일한 결과 Y를 얻을 수 있습니다. 방금 행렬 곱셈을 사용하여 컨볼루션을 구현했습니다.

Y == torch.matmul(W, X.reshape(-1)).reshape(2, 2)

이 코드는 주어진 입력 행렬 X와 변환된 행렬 W를 사용하여 행렬-벡터 곱셈을 수행한 결과인 출력 행렬 Y가 정확히 일치하는지를 확인하는 조건을 나타냅니다.

Y: 위 코드에서 계산한 행렬 Y입니다.
torch.matmul(W, X.reshape(-1)): 행렬 W와 입력 행렬 X를 벡터로 펼친 뒤, 행렬-벡터 곱셈을 수행한 결과입니다.
.reshape(2, 2): 이전 계산 결과를 2x2 크기의 행렬로 재구성합니다.
==: 두 행렬이 원소별로 일치하는지를 비교하는 연산자입니다.

따라서, 위 코드는 변환된 행렬 W와 입력 행렬 X를 사용하여 계산한 출력 행렬 Y가 정확히 일치하는지 여부를 확인하고 그 결과를 반환하는 조건을 나타냅니다.

tensor([[True, True],
        [True, True]])

Likewise, we can implement transposed convolutions using matrix multiplications. In the following example, we take the 2×2 output Y from the above regular convolution as input to the transposed convolution. To implement this operation by multiplying matrices, we only need to transpose the weight matrix W with the new shape (9,4).

마찬가지로 행렬 곱셈을 사용하여 전치 컨볼루션을 구현할 수 있습니다. 다음 예제에서는 위의 일반 컨볼루션에서 2×2 출력 Y를 전치 컨볼루션의 입력으로 사용합니다. 행렬을 곱하여 이 연산을 구현하려면 가중치 행렬 W를 새 모양(9,4)으로 바꾸면 됩니다.

Z = trans_conv(Y, K)
Z == torch.matmul(W.T, Y.reshape(-1)).reshape(3, 3)

이 코드는 주어진 출력 행렬 Y와 변환된 행렬 W를 사용하여 역변환 행렬-벡터 곱셈을 수행한 결과인 행렬 Z가 정확히 일치하는지를 확인하는 조건을 나타냅니다.

Z: trans_conv 함수를 사용하여 계산한 행렬 Z입니다.
trans_conv(Y, K): 출력 행렬 Y와 주어진 커널 K를 사용하여 역변환 행렬-벡터 곱셈을 수행한 결과입니다.
torch.matmul(W.T, Y.reshape(-1)): 행렬 W의 전치 행렬과 Y를 벡터로 펼친 뒤, 역변환 행렬-벡터 곱셈을 수행한 결과입니다.
.reshape(3, 3): 이전 계산 결과를 3x3 크기의 행렬로 재구성합니다.
==: 두 행렬이 원소별로 일치하는지를 비교하는 연산자입니다.

따라서, 위 코드는 주어진 출력 행렬 Y와 변환된 행렬 W를 사용하여 역변환 행렬-벡터 곱셈을 수행한 결과인 행렬 Z가 정확히 일치하는지 여부를 확인하고 그 결과를 반환하는 조건을 나타냅니다.

tensor([[True, True, True],
        [True, True, True],
        [True, True, True]])

Consider implementing the convolution by multiplying matrices. Given an input vector x and a weight matrix W, the forward propagation function of the convolution can be implemented by multiplying its input with the weight matrix and outputting a vector y=Wx. Since backpropagation follows the chain rule and ∇xy=W⊤, the backpropagation function of the convolution can be implemented by multiplying its input with the transposed weight matrix W⊤. Therefore, the transposed convolutional layer can just exchange the forward propagation function and the backpropagation function of the convolutional layer: its forward propagation and backpropagation functions multiply their input vector with W⊤ and W, respectively.

행렬을 곱하여 컨벌루션을 구현하는 것을 고려하십시오. 입력 벡터 x와 가중치 행렬 W가 주어지면 컨볼루션의 순방향 전파 함수는 입력에 가중치 행렬을 곱하고 벡터 y=Wx를 출력하여 구현할 수 있습니다. 역전파는 체인 규칙과 ∇xy=W⊤를 따르기 때문에 컨볼루션의 역전파 함수는 전치된 가중치 행렬 W⊤를 입력에 곱하여 구현할 수 있습니다. 따라서 전치된 컨볼루션 레이어는 컨볼루션 레이어의 순방향 전파 함수와 역전파 함수를 교환할 수 있습니다. 순방향 전파 및 역전파 함수는 각각 입력 벡터에 W⊤ 및 W를 곱합니다.

14.10.4. Summary

In contrast to the regular convolution that reduces input elements via the kernel, the transposed convolution broadcasts input elements via the kernel, thereby producing an output that is larger than the input.

커널을 통해 입력 요소를 줄이는 일반 컨볼루션과 달리 전치 컨볼루션은 커널을 통해 입력 요소를 브로드캐스팅하여 입력보다 큰 출력을 생성합니다.
If we feed X into a convolutional layer f to output Y=f(X) and create a transposed convolutional layer g with the same hyperparameters as f except for the number of output channels being the number of channels in X, then g(Y) will have the same shape as X.

X를 컨볼루션 레이어 f에 공급하여 Y=f(X)를 출력하고 출력 채널 수가 X의 채널 수인 것을 제외하고 f와 동일한 하이퍼파라미터로 전치된 컨볼루션 레이어 g를 생성하면 g(Y) X와 같은 모양을 갖게 됩니다.
We can implement convolutions using matrix multiplications. The transposed convolutional layer can just exchange the forward propagation function and the backpropagation function of the convolutional layer.

행렬 곱셈을 사용하여 컨볼루션을 구현할 수 있습니다. Transposed convolutional layer는 convolutional layer의 forward propagation function과 backpropagation function을 교환할 수 있습니다.

14.10.5. Exercises

In Section 14.10.3, the convolution input X and the transposed convolution output Z have the same shape. Do they have the same value? Why?
Is it efficient to use matrix multiplications to implement convolutions? Why?

'Dive into Deep Learning > D2L Computer Vision' 카테고리의 다른 글

D2L - 14.12. Neural Style Transfer (1)	2023.08.21
D2L - 14.11. Fully Convolutional Networks (0)	2023.08.21
D2L - 14.9. Semantic Segmentation and the Dataset (0)	2023.08.20
D2L - 14.8. Region-based CNNs (R-CNNs) (0)	2023.08.20
D2L - 14.7. Single Shot Multibox Detection (0)	2023.08.19
D2L - 14.6. The Object Detection Dataset (0)	2023.08.19
D2L - 14.5. Multiscale Object Detection (0)	2023.08.19
D2L - 14.4. Anchor Boxes (0)	2023.08.19
D2L - 14.3. Object Detection and Bounding Boxes (0)	2023.08.19
D2L - 14.2. Fine-Tuning (0)	2023.08.19

Dive into Deep Learning/D2L Computer Vision

D2L - 14.9. Semantic Segmentation and the Dataset

2023. 8. 20. 22:40 | Posted by 솔웅

https://d2l.ai/chapter_computer-vision/semantic-segmentation-and-dataset.html

14.9. Semantic Segmentation and the Dataset — Dive into Deep Learning 1.0.3 documentation

d2l.ai

14.9. Semantic Segmentation and the Dataset

When discussing object detection tasks in Section 14.3–Section 14.8, rectangular bounding boxes are used to label and predict objects in images. This section will discuss the problem of semantic segmentation, which focuses on how to divide an image into regions belonging to different semantic classes. Different from object detection, semantic segmentation recognizes and understands what are in images in pixel level: its labeling and prediction of semantic regions are in pixel level. Fig. 14.9.1 shows the labels of the dog, cat, and background of the image in semantic segmentation. Compared with in object detection, the pixel-level borders labeled in semantic segmentation are obviously more fine-grained.

섹션 14.3–섹션 14.8에서 개체 감지 작업을 논의할 때 사각형 경계 상자는 이미지의 개체에 레이블을 지정하고 예측하는 데 사용됩니다. 이 섹션에서는 이미지를 다른 의미semantic 클래스에 속하는 영역으로 나누는 방법에 중점을 둔 의미 분할 semantic segmentation 문제에 대해 설명합니다. 객체 감지와 달리 semantic segmentation은 픽셀 수준에서 이미지에 있는 것을 인식하고 이해합니다. 의미 영역 semantic regions 의 라벨링 및 예측은 픽셀 수준에서 이루어집니다. 그림 14.9.1은 시맨틱 세그멘테이션에서 이미지의 개, 고양이, 배경의 라벨을 보여준다. 개체 감지와 비교할 때 시맨틱 분할semantic segmentation에서 레이블이 지정된 픽셀 수준 경계는 분명히 더 세분화됩니다.

Fig. 14.9.1  Labels of the dog, cat, and background of the image in semantic segmentation.

14.9.1. Image Segmentation and Instance Segmentation

There are also two important tasks in the field of computer vision that are similar to semantic segmentation, namely image segmentation and instance segmentation. We will briefly distinguish them from semantic segmentation as follows.

컴퓨터 비전 분야에는 시맨틱 분할semantic segmentation과 유사한 두 가지 중요한 작업, 즉 이미지 분할image segmentation과 인스턴스 분할instance segmentation이 있습니다. 다음과 같이 시맨틱 분할과 간략하게 구분합니다.

Image segmentation divides an image into several constituent regions. The methods for this type of problem usually make use of the correlation between pixels in the image. It does not need label information about image pixels during training, and it cannot guarantee that the segmented regions will have the semantics that we hope to obtain during prediction. Taking the image in Fig. 14.9.1 as input, image segmentation may divide the dog into two regions: one covers the mouth and eyes which are mainly black, and the other covers the rest of the body which is mainly yellow.

이미지 분할은 이미지를 여러 구성 영역으로 나눕니다. 이러한 유형의 문제에 대한 방법은 일반적으로 이미지의 픽셀 간의 상관 관계를 사용합니다. 훈련하는 동안 이미지 픽셀에 대한 레이블 정보가 필요하지 않으며 분할된 영역이 예측 중에 얻고자 하는 의미 체계를 가질 것이라고 보장할 수 없습니다. 그림 14.9.1의 이미지를 입력으로 사용하면 이미지 분할은 개를 두 영역으로 나눌 수 있습니다. 하나는 주로 검은색인 입과 눈을 덮고 다른 하나는 주로 노란색인 몸의 나머지 부분을 덮습니다.
Instance segmentation is also called simultaneous detection and segmentation. It studies how to recognize the pixel-level regions of each object instance in an image. Different from semantic segmentation, instance segmentation needs to distinguish not only semantics, but also different object instances. For example, if there are two dogs in the image, instance segmentation needs to distinguish which of the two dogs a pixel belongs to.

인스턴스 분할은 동시 감지 및 분할이라고도 합니다. 이미지에서 각 개체 인스턴스의 픽셀 수준 영역을 인식하는 방법을 연구합니다. 시맨틱 분할과 달리 인스턴스 분할은 시맨틱뿐만 아니라 다른 객체 인스턴스도 구분해야 합니다. 예를 들어, 이미지에 두 마리의 개가 있는 경우 인스턴스 분할은 픽셀이 속한 두 개 중 어느 것을 구별해야 합니다.

14.9.2. The Pascal VOC2012 Semantic Segmentation Dataset

One of the most important semantic segmentation dataset is Pascal VOC2012. In the following, we will take a look at this dataset.

가장 중요한 시맨틱 분할 데이터 세트는 Pascal VOC2012입니다. 다음에서는 이 데이터 세트를 살펴보겠습니다.

%matplotlib inline
import os
import torch
import torchvision
from d2l import torch as d2l

The tar file of the dataset is about 2 GB, so it may take a while to download the file. The extracted dataset is located at ../data/VOCdevkit/VOC2012.

데이터셋의 tar 파일은 약 2GB이므로 파일을 다운로드하는데 다소 시간이 걸릴 수 있습니다. 추출된 데이터 세트는 ../data/VOCdevkit/VOC2012에 있습니다.

#@save
d2l.DATA_HUB['voc2012'] = (d2l.DATA_URL + 'VOCtrainval_11-May-2012.tar',
                           '4e443f8a2eca6b1dac8a6c57641b67dd40621a49')

voc_dir = d2l.download_extract('voc2012', 'VOCdevkit/VOC2012')

#@save: 주석으로, 코드 조각을 문서로 저장할 때 사용되는 주석입니다.
d2l.DATA_HUB['voc2012']: d2l.DATA_HUB 딕셔너리에 'voc2012'라는 키를 가진 항목을 추가합니다. 이 항목은 데이터 다운로드를 위한 URL 및 체크섬 정보를 포함하는 튜플입니다.
d2l.DATA_URL + 'VOCtrainval_11-May-2012.tar': d2l.DATA_URL과 문자열을 합쳐서 VOC2012 데이터의 URL을 생성합니다.
'4e443f8a2eca6b1dac8a6c57641b67dd40621a49': 다운로드한 파일의 체크섬(해시 값)입니다.
voc_dir = d2l.download_extract('voc2012', 'VOCdevkit/VOC2012'): 'voc2012' 데이터를 다운로드하고 압축을 해제하여 voc_dir 변수에 저장합니다. download_extract 함수는 주어진 키에 해당하는 데이터를 다운로드하고 압축을 해제합니다. 여기서 'voc2012' 데이터는 VOC2012 데이터셋을 의미하며, VOCdevkit/VOC2012 경로에 압축을 해제합니다.

이 코드는 'voc2012' 데이터를 다운로드하고 압축을 해제하여 데이터가 저장될 디렉토리 경로인 voc_dir을 설정하는 내용을 나타냅니다. VOC2012 데이터셋은 객체 검출과 이미지 분류 등 다양한 컴퓨터 비전 작업에 사용되는 데이터셋입니다.

Downloading ../data/VOCtrainval_11-May-2012.tar from http://d2l-data.s3-accelerate.amazonaws.com/VOCtrainval_11-May-2012.tar...

After entering the path ../data/VOCdevkit/VOC2012, we can see the different components of the dataset. The ImageSets/Segmentation path contains text files that specify training and test samples, while the JPEGImages and SegmentationClass paths store the input image and label for each example, respectively. The label here is also in the image format, with the same size as its labeled input image. Besides, pixels with the same color in any label image belong to the same semantic class. The following defines the read_voc_images function to read all the input images and labels into the memory.

../data/VOCdevkit/VOC2012 경로를 입력하면 데이터 세트의 다양한 구성 요소를 볼 수 있습니다. ImageSets/Segmentation 경로에는 교육 및 테스트 샘플을 지정하는 텍스트 파일이 포함되어 있으며 JPEGImages 및 SegmentationClass 경로는 각각 각 예제에 대한 입력 이미지와 레이블을 저장합니다. 여기의 레이블도 레이블이 지정된 입력 이미지와 크기가 같은 이미지 형식입니다. 게다가 어떤 레이블 이미지에서 같은 색상을 가진 픽셀은 같은 의미 클래스에 속합니다. 다음은 모든 입력 이미지와 레이블을 메모리로 읽어들이는 read_voc_images 함수를 정의합니다.

#@save
def read_voc_images(voc_dir, is_train=True):
    """Read all VOC feature and label images."""
    txt_fname = os.path.join(voc_dir, 'ImageSets', 'Segmentation',
                             'train.txt' if is_train else 'val.txt')
    mode = torchvision.io.image.ImageReadMode.RGB
    with open(txt_fname, 'r') as f:
        images = f.read().split()
    features, labels = [], []
    for i, fname in enumerate(images):
        features.append(torchvision.io.read_image(os.path.join(
            voc_dir, 'JPEGImages', f'{fname}.jpg')))
        labels.append(torchvision.io.read_image(os.path.join(
            voc_dir, 'SegmentationClass' ,f'{fname}.png'), mode))
    return features, labels

train_features, train_labels = read_voc_images(voc_dir, True)

#@save: 주석으로, 코드 조각을 문서로 저장할 때 사용되는 주석입니다.
def read_voc_images(voc_dir, is_train=True):: read_voc_images라는 함수를 정의합니다. 이 함수는 VOC 데이터셋의 특징 이미지와 라벨 이미지를 읽어오는 역할을 합니다.
- voc_dir: VOC 데이터셋이 저장된 디렉토리 경로입니다.
- is_train=True: 훈련 데이터인지 여부를 나타내는 변수입니다.
- txt_fname = os.path.join(voc_dir, 'ImageSets', 'Segmentation', 'train.txt' if is_train else 'val.txt'): 훈련 데이터 또는 검증 데이터 이미지 파일 이름이 나열된 텍스트 파일의 경로를 생성합니다.
- mode = torchvision.io.image.ImageReadMode.RGB: 읽어온 이미지를 RGB 형식으로 읽어오도록 설정합니다.
- with open(txt_fname, 'r') as f:: 텍스트 파일을 열고 f라는 이름으로 사용합니다.
  - images = f.read().split(): 파일에서 읽은 이미지 파일 이름들을 공백으로 분리하여 리스트로 저장합니다.
- features, labels = [], []: 특징 이미지와 라벨 이미지를 저장할 빈 리스트 features와 labels를 생성합니다.
- for i, fname in enumerate(images):: 이미지 파일 이름들을 반복하면서 처리합니다.
  - features.append(torchvision.io.read_image(os.path.join(voc_dir, 'JPEGImages', f'{fname}.jpg'))): 이미지 파일을 읽어와 특징 이미지 리스트에 추가합니다.
  - labels.append(torchvision.io.read_image(os.path.join(voc_dir, 'SegmentationClass' ,f'{fname}.png'), mode)): 라벨 이미지 파일을 읽어와 라벨 이미지 리스트에 추가합니다.
- return features, labels: 읽어온 특징 이미지와 라벨 이미지 리스트를 반환합니다.
train_features, train_labels = read_voc_images(voc_dir, True): VOC 데이터셋에서 훈련 데이터의 특징 이미지와 라벨 이미지를 읽어와 각각 train_features와 train_labels 변수에 할당합니다.

이 코드는 VOC 데이터셋에서 훈련 데이터의 특징 이미지와 라벨 이미지를 읽어오는 함수 read_voc_images를 정의하고 이를 호출하여 train_features와 train_labels를 생성하는 내용을 나타냅니다.

We draw the first five input images and their labels. In the label images, white and black represent borders and background, respectively, while the other colors correspond to different classes.

처음 5개의 입력 이미지와 레이블을 그립니다. 레이블 이미지에서 흰색과 검은색은 각각 테두리와 배경을 나타내고 나머지 색상은 다른 클래스에 해당합니다.

n = 5
imgs = train_features[:n] + train_labels[:n]
imgs = [img.permute(1,2,0) for img in imgs]
d2l.show_images(imgs, 2, n);

n = 5: 변수 n에 5를 할당합니다. 이는 표시할 이미지의 개수를 의미합니다.
imgs = train_features[:n] + train_labels[:n]: train_features 리스트에서 처음부터 n개의 이미지와 train_labels 리스트에서 처음부터 n개의 이미지를 선택하여 리스트 imgs에 결합합니다. 이렇게 하면 특징 이미지와 해당하는 라벨 이미지가 번갈아가며 포함된 리스트가 생성됩니다.
imgs = [img.permute(1,2,0) for img in imgs]: imgs 리스트 내의 이미지들의 차원을 변경합니다. 각 이미지의 차원을 (채널, 높이, 너비)에서 (높이, 너비, 채널)로 변경하면 이미지가 보다 일반적인 형태로 표현됩니다.
d2l.show_images(imgs, 2, n);: imgs 리스트에 포함된 이미지들을 2개의 열로 나누어서 n개씩 그립니다. d2l.show_images 함수는 이미지를 그리는 함수로서 여러 이미지를 그룹으로 나누어서 표시합니다.

이 코드는 훈련 데이터셋에서 특징 이미지와 라벨 이미지를 일정 개수만큼 선택하고, 차원을 변경한 후에 이를 그룹으로 나누어서 표시하는 내용을 나타냅니다.

Next, we enumerate the RGB color values and class names for all the labels in this dataset.

다음으로 이 데이터 세트의 모든 레이블에 대한 RGB 색상 값과 클래스 이름을 열거합니다.

#@save
VOC_COLORMAP = [[0, 0, 0], [128, 0, 0], [0, 128, 0], [128, 128, 0],
                [0, 0, 128], [128, 0, 128], [0, 128, 128], [128, 128, 128],
                [64, 0, 0], [192, 0, 0], [64, 128, 0], [192, 128, 0],
                [64, 0, 128], [192, 0, 128], [64, 128, 128], [192, 128, 128],
                [0, 64, 0], [128, 64, 0], [0, 192, 0], [128, 192, 0],
                [0, 64, 128]]

#@save
VOC_CLASSES = ['background', 'aeroplane', 'bicycle', 'bird', 'boat',
               'bottle', 'bus', 'car', 'cat', 'chair', 'cow',
               'diningtable', 'dog', 'horse', 'motorbike', 'person',
               'potted plant', 'sheep', 'sofa', 'train', 'tv/monitor']

#@save: 주석으로, 코드 조각을 문서로 저장할 때 사용되는 주석입니다.
VOC_COLORMAP: VOC 데이터셋에서 클래스 인덱스에 따라 색상을 부여하는 컬러맵을 정의한 변수입니다. 각 클래스에 대해 RGB 값으로 색상을 지정하여 클래스를 시각화할 때 사용됩니다.
VOC_CLASSES: VOC 데이터셋에서 정의된 클래스 이름들을 나열한 변수입니다. 이 변수에는 배경과 다양한 객체 클래스들의 이름이 포함되어 있습니다.

이 코드는 VOC 데이터셋에서 클래스들의 컬러맵과 클래스 이름들을 정의하는 내용을 나타냅니다. VOC 데이터셋은 객체 검출, 분할 등의 작업에 사용되는 데이터셋으로, 이러한 정보들은 시각화 및 레이블 관련 작업에서 사용됩니다.

With the two constants defined above, we can conveniently find the class index for each pixel in a label. We define the voc_colormap2label function to build the mapping from the above RGB color values to class indices, and the voc_label_indices function to map any RGB values to their class indices in this Pascal VOC2012 dataset.

위에서 정의한 두 상수를 사용하면 레이블의 각 픽셀에 대한 클래스 인덱스를 편리하게 찾을 수 있습니다. 우리는 voc_colormap2label 함수를 정의하여 위의 RGB 색상 값에서 클래스 인덱스로의 매핑을 구축하고 voc_label_indices 함수를 정의하여 RGB 값을 이 Pascal VOC2012 데이터 세트의 클래스 인덱스로 매핑합니다.

#@save
def voc_colormap2label():
    """Build the mapping from RGB to class indices for VOC labels."""
    colormap2label = torch.zeros(256 ** 3, dtype=torch.long)
    for i, colormap in enumerate(VOC_COLORMAP):
        colormap2label[
            (colormap[0] * 256 + colormap[1]) * 256 + colormap[2]] = i
    return colormap2label

#@save
def voc_label_indices(colormap, colormap2label):
    """Map any RGB values in VOC labels to their class indices."""
    colormap = colormap.permute(1, 2, 0).numpy().astype('int32')
    idx = ((colormap[:, :, 0] * 256 + colormap[:, :, 1]) * 256
           + colormap[:, :, 2])
    return colormap2label[idx]

#@save: 주석으로, 코드 조각을 문서로 저장할 때 사용되는 주석입니다.
voc_colormap2label(): VOC 데이터셋의 클래스 컬러맵을 이용하여 RGB 값에서 클래스 인덱스로 매핑하는 함수입니다. VOC 데이터셋의 클래스 컬러맵을 이용하여 RGB 값의 조합을 클래스 인덱스로 변환하는 매핑을 생성하고 반환합니다.
- colormap2label = torch.zeros(256 ** 3, dtype=torch.long): 256^3 크기의 모든 조합에 대한 클래스 인덱스를 담는 텐서를 생성합니다.
- for i, colormap in enumerate(VOC_COLORMAP):: VOC 컬러맵의 각 색상에 대해 반복합니다.
  - (colormap[0] * 256 + colormap[1]) * 256 + colormap[2]: RGB 값을 클래스 인덱스로 변환하여 텐서의 해당 위치에 클래스 인덱스를 저장합니다.
voc_label_indices(colormap, colormap2label): 주어진 VOC 라벨 컬러맵을 이용하여 RGB 값을 클래스 인덱스로 변환하는 함수입니다.
- colormap.permute(1, 2, 0).numpy().astype('int32'): 라벨 컬러맵을 높이-너비-채널 형태에서 너비-채널-높이 형태로 변경하고, 넘파이 배열로 변환하며 데이터 타입을 'int32'로 변환합니다.
- ((colormap[:, :, 0] * 256 + colormap[:, :, 1]) * 256 + colormap[:, :, 2]): RGB 값을 클래스 인덱스로 변환하는 계산을 수행하여 클래스 인덱스 텐서 idx를 생성합니다.
- return colormap2label[idx]: 생성한 클래스 인덱스 텐서 idx를 이용하여 RGB 값의 클래스 인덱스로 변환한 결과를 반환합니다.

이 코드는 VOC 데이터셋의 라벨 컬러맵과 RGB 값들을 클래스 인덱스로 변환하기 위한 함수들을 정의하는 내용을 나타냅니다. 이러한 함수들은 데이터셋 시각화 및 레이블 처리 작업에서 사용됩니다.

For example, in the first example image, the class index for the front part of the airplane is 1, while the background index is 0.

예를 들어, 첫 번째 예제 이미지에서 비행기 앞부분의 클래스 인덱스는 1이고 배경 인덱스는 0입니다.

y = voc_label_indices(train_labels[0], voc_colormap2label())
y[105:115, 130:140], VOC_CLASSES[1]

y = voc_label_indices(train_labels[0], voc_colormap2label()): train_labels 리스트에서 첫 번째 라벨 이미지를 이용하여 voc_colormap2label() 함수를 통해 라벨 컬러맵을 클래스 인덱스로 변환한 결과를 변수 y에 할당합니다.
y[105:115, 130:140], VOC_CLASSES[1]: y 변수의 특정 범위를 선택하여 그 값을 출력하고, VOC_CLASSES 리스트의 두 번째 클래스 이름을 출력합니다.

이 코드는 VOC 데이터셋의 첫 번째 라벨 이미지에 대해 라벨 컬러맵을 클래스 인덱스로 변환한 결과를 변수 y에 저장하고, 이 변수에서 특정 영역의 값을 선택하여 출력하는 내용을 나타냅니다. 또한, VOC_CLASSES 리스트에서 두 번째 클래스의 이름을 출력합니다. 이를 통해 라벨 이미지의 레이블 정보를 확인할 수 있습니다.

(tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
         [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
         [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
         [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
         [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
         [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
         [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
         [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
         [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
         [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]]),
 'aeroplane')

14.9.2.1. Data Preprocessing

In previous experiments such as in Section 8.1–Section 8.4, images are rescaled to fit the model’s required input shape. However, in semantic segmentation, doing so requires rescaling the predicted pixel classes back to the original shape of the input image. Such rescaling may be inaccurate, especially for segmented regions with different classes. To avoid this issue, we crop the image to a fixed shape instead of rescaling. Specifically, using random cropping from image augmentation, we crop the same area of the input image and the label.

섹션 8.1–섹션 8.4와 같은 이전 실험에서 이미지는 모델의 필수 입력 모양에 맞게 크기가 조정됩니다. 그러나 의미론적 분할에서 그렇게 하려면 예측된 픽셀 클래스를 입력 이미지의 원래 모양으로 다시 스케일링해야 합니다. 이러한 크기 조정은 특히 클래스가 다른 분할 영역의 경우 부정확할 수 있습니다. 이 문제를 방지하기 위해 크기를 다시 조정하는 대신 이미지를 고정된 모양으로 자릅니다. 특히, 이미지 확대에서 임의 자르기를 사용하여 입력 이미지와 레이블의 동일한 영역을 자릅니다.

#@save
def voc_rand_crop(feature, label, height, width):
    """Randomly crop both feature and label images."""
    rect = torchvision.transforms.RandomCrop.get_params(
        feature, (height, width))
    feature = torchvision.transforms.functional.crop(feature, *rect)
    label = torchvision.transforms.functional.crop(label, *rect)
    return feature, label

imgs = []
for _ in range(n):
    imgs += voc_rand_crop(train_features[0], train_labels[0], 200, 300)

imgs = [img.permute(1, 2, 0) for img in imgs]
d2l.show_images(imgs[::2] + imgs[1::2], 2, n);

#@save: 주석으로, 코드 조각을 문서로 저장할 때 사용되는 주석입니다.
voc_rand_crop(feature, label, height, width): 특징 이미지와 라벨 이미지를 주어진 높이와 너비에 맞춰 무작위로 잘라내는 함수입니다.
- rect = torchvision.transforms.RandomCrop.get_params(feature, (height, width)): 무작위로 자르기 위한 좌표 정보를 얻습니다. 이 때, get_params 함수는 무작위 자르기를 위해 필요한 좌표 정보를 반환합니다.
- feature = torchvision.transforms.functional.crop(feature, *rect): 주어진 좌표 정보를 이용하여 특징 이미지를 무작위로 잘라냅니다.
- label = torchvision.transforms.functional.crop(label, *rect): 주어진 좌표 정보를 이용하여 라벨 이미지도 무작위로 잘라냅니다.
imgs = []: 빈 리스트 imgs를 생성합니다.
for _ in range(n):: n번 반복하는 루프를 실행합니다.
- imgs += voc_rand_crop(train_features[0], train_labels[0], 200, 300): voc_rand_crop 함수를 사용하여 특징 이미지와 라벨 이미지를 200x300 크기로 무작위로 자르고, imgs 리스트에 추가합니다.
imgs = [img.permute(1, 2, 0) for img in imgs]: imgs 리스트 내의 이미지들의 차원을 변경합니다. 각 이미지의 차원을 (채널, 높이, 너비)에서 (높이, 너비, 채널)로 변경하면 이미지가 보다 일반적인 형태로 표현됩니다.
d2l.show_images(imgs[::2] + imgs[1::2], 2, n);: imgs 리스트에 포함된 이미지들을 두 개의 열로 나누어서 그룹으로 나눈 뒤에 표시합니다. 이 때, 첫 번째 그룹은 imgs의 짝수 인덱스 이미지들로, 두 번째 그룹은 홀수 인덱스 이미지들로 구성됩니다.

이 코드는 VOC 데이터셋의 첫 번째 특징 이미지와 라벨 이미지를 무작위로 자르고, 이를 시각화하여 표시하는 내용을 나타냅니다. 이를 통해 데이터 증강과 이미지 처리 과정을 이해할 수 있습니다.

14.9.2.2. Custom Semantic Segmentation Dataset Class

We define a custom semantic segmentation dataset class VOCSegDataset by inheriting the Dataset class provided by high-level APIs. By implementing the __getitem__ function, we can arbitrarily access the input image indexed as idx in the dataset and the class index of each pixel in this image. Since some images in the dataset have a smaller size than the output size of random cropping, these examples are filtered out by a custom filter function. In addition, we also define the normalize_image function to standardize the values of the three RGB channels of input images.

상위 수준 API에서 제공하는 Dataset 클래스를 상속하여 사용자 지정 시맨틱 세분화 데이터 세트 클래스 VOCSegDataset을 정의합니다. __getitem__ 함수를 구현하면 데이터셋에서 idx로 인덱싱된 입력 이미지와 이 이미지의 각 픽셀의 클래스 인덱스에 임의로 액세스할 수 있습니다. 데이터 세트의 일부 이미지는 임의 자르기의 출력 크기보다 크기가 작기 때문에 이러한 예는 사용자 정의 필터 기능으로 필터링됩니다. 또한 normalize_image 함수를 정의하여 입력 이미지의 세 RGB 채널 값을 표준화합니다.

#@save
class VOCSegDataset(torch.utils.data.Dataset):
    """A customized dataset to load the VOC dataset."""

    def __init__(self, is_train, crop_size, voc_dir):
        self.transform = torchvision.transforms.Normalize(
            mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
        self.crop_size = crop_size
        features, labels = read_voc_images(voc_dir, is_train=is_train)
        self.features = [self.normalize_image(feature)
                         for feature in self.filter(features)]
        self.labels = self.filter(labels)
        self.colormap2label = voc_colormap2label()
        print('read ' + str(len(self.features)) + ' examples')

    def normalize_image(self, img):
        return self.transform(img.float() / 255)

    def filter(self, imgs):
        return [img for img in imgs if (
            img.shape[1] >= self.crop_size[0] and
            img.shape[2] >= self.crop_size[1])]

    def __getitem__(self, idx):
        feature, label = voc_rand_crop(self.features[idx], self.labels[idx],
                                       *self.crop_size)
        return (feature, voc_label_indices(label, self.colormap2label))

    def __len__(self):
        return len(self.features)

#@save: 주석으로, 코드 조각을 문서로 저장할 때 사용되는 주석입니다.
class VOCSegDataset(torch.utils.data.Dataset):: torch.utils.data.Dataset 클래스를 상속하여 커스텀 데이터셋 클래스 VOCSegDataset을 정의합니다. 이 클래스는 VOC 데이터셋을 불러오고 처리하기 위한 기능을 제공합니다.
- self.transform: VOC 데이터셋의 이미지를 정규화하는 변환을 수행하는 객체를 생성합니다.
- self.crop_size: 임의로 자를 크기를 저장하는 변수입니다.
- features, labels = read_voc_images(voc_dir, is_train=is_train): read_voc_images 함수를 이용하여 특징 이미지와 라벨 이미지를 불러옵니다.
- self.features = [self.normalize_image(feature) for feature in self.filter(features)]: 특징 이미지들을 필터링하고 정규화합니다.
- self.labels = self.filter(labels): 라벨 이미지들도 필터링합니다.
- self.colormap2label = voc_colormap2label(): 라벨 컬러맵을 클래스 인덱스로 매핑하는 매핑을 생성합니다.
- def normalize_image(self, img):: 이미지를 정규화하는 함수입니다.
- def filter(self, imgs):: 이미지들을 필터링하는 함수입니다.
- def __getitem__(self, idx):: 데이터셋에서 특정 인덱스의 아이템을 가져오는 메서드입니다.
- def __len__(self):: 데이터셋의 길이를 반환하는 메서드입니다.

이 코드는 VOC 데이터셋을 커스텀 데이터셋 클래스로 정의하는 내용을 나타냅니다. 이 클래스는 데이터셋을 불러오고 처리하는 기능을 제공하여 모델 학습 등에 활용될 수 있습니다.

14.9.2.3. Reading the Dataset

We use the custom VOCSegDataset class to create instances of the training set and test set, respectively. Suppose that we specify that the output shape of randomly cropped images is 320×480. Below we can view the number of examples that are retained in the training set and test set.

맞춤형 VOCSegDataset 클래스를 사용하여 훈련 세트와 테스트 세트의 인스턴스를 각각 생성합니다. 임의로 자른 이미지의 출력 모양이 320×480이라고 지정한다고 가정합니다. 아래에서 훈련 세트와 테스트 세트에 보관된 예의 수를 볼 수 있습니다.

crop_size = (320, 480)
voc_train = VOCSegDataset(True, crop_size, voc_dir)
voc_test = VOCSegDataset(False, crop_size, voc_dir)

crop_size = (320, 480): 임의로 자를 크기를 (320, 480)로 지정합니다.
voc_train = VOCSegDataset(True, crop_size, voc_dir): VOCSegDataset 클래스를 이용하여 VOC 데이터셋의 학습용 데이터셋 voc_train을 생성합니다. True는 학습 데이터셋을 의미하며, crop_size와 voc_dir은 해당 클래스의 생성자에 필요한 인자입니다.
voc_test = VOCSegDataset(False, crop_size, voc_dir): VOCSegDataset 클래스를 이용하여 VOC 데이터셋의 테스트용 데이터셋 voc_test를 생성합니다. False는 테스트 데이터셋을 의미하며, crop_size와 voc_dir은 해당 클래스의 생성자에 필요한 인자입니다.

이 코드는 학습용과 테스트용 VOC 데이터셋을 생성하는 내용을 나타냅니다. 이를 통해 데이터셋을 생성하여 모델 학습 및 평가에 활용할 수 있습니다.

read 1114 examples
read 1078 examples

Setting the batch size to 64, we define the data iterator for the training set. Let’s print the shape of the first minibatch. Different from in image classification or object detection, labels here are three-dimensional tensors.

배치 크기를 64로 설정하고 훈련 세트에 대한 데이터 반복자를 정의합니다. 첫 번째 미니 배치의 모양을 인쇄해 봅시다. 이미지 분류 또는 객체 감지와 달리 여기서 레이블은 3차원 텐서입니다.

batch_size = 64
train_iter = torch.utils.data.DataLoader(voc_train, batch_size, shuffle=True,
                                    drop_last=True,
                                    num_workers=d2l.get_dataloader_workers())
for X, Y in train_iter:
    print(X.shape)
    print(Y.shape)
    break

batch_size = 64: 배치 크기를 64로 지정합니다.
train_iter = torch.utils.data.DataLoader(voc_train, batch_size, shuffle=True, drop_last=True, num_workers=d2l.get_dataloader_workers()): torch.utils.data.DataLoader 클래스를 사용하여 학습용 데이터셋 voc_train을 불러오는 데이터 로더 train_iter를 생성합니다. 이 데이터 로더는 배치 크기를 batch_size로 설정하고, 데이터를 섞어서 가져오며(shuffle=True), 마지막 배치를 무시합니다(drop_last=True), 병렬로 데이터를 로딩할 때 사용할 워커의 수를 d2l.get_dataloader_workers()로 설정합니다.
for X, Y in train_iter:: train_iter에서 배치 단위로 데이터를 가져옵니다.
- print(X.shape): 현재 배치의 특징 이미지 X의 모양을 출력합니다.
- print(Y.shape): 현재 배치의 라벨 이미지 Y의 모양을 출력합니다.
break: 첫 번째 배치만 확인하고 반복을 종료합니다.

이 코드는 데이터 로더를 사용하여 학습용 데이터셋을 배치 단위로 불러와서 모양을 확인하는 내용을 나타냅니다. 이를 통해 데이터 로딩 및 배치 처리 과정을 이해할 수 있습니다.

torch.Size([64, 3, 320, 480])
torch.Size([64, 320, 480])

14.9.2.4. Putting It All Together

Finally, we define the following load_data_voc function to download and read the Pascal VOC2012 semantic segmentation dataset. It returns data iterators for both the training and test datasets.

마지막으로 Pascal VOC2012 시맨틱 분할 데이터 세트를 다운로드하고 읽기 위해 다음 load_data_voc 함수를 정의합니다. 교육 및 테스트 데이터 세트 모두에 대한 데이터 반복자를 반환합니다.

#@save
def load_data_voc(batch_size, crop_size):
    """Load the VOC semantic segmentation dataset."""
    voc_dir = d2l.download_extract('voc2012', os.path.join(
        'VOCdevkit', 'VOC2012'))
    num_workers = d2l.get_dataloader_workers()
    train_iter = torch.utils.data.DataLoader(
        VOCSegDataset(True, crop_size, voc_dir), batch_size,
        shuffle=True, drop_last=True, num_workers=num_workers)
    test_iter = torch.utils.data.DataLoader(
        VOCSegDataset(False, crop_size, voc_dir), batch_size,
        drop_last=True, num_workers=num_workers)
    return train_iter, test_iter

#@save: 주석으로, 코드 조각을 문서로 저장할 때 사용되는 주석입니다.
def load_data_voc(batch_size, crop_size):: load_data_voc 함수를 정의합니다. 이 함수는 VOC 시맨틱 세그멘테이션 데이터셋을 불러오는 역할을 합니다.
- voc_dir = d2l.download_extract('voc2012', os.path.join('VOCdevkit', 'VOC2012')): VOC 데이터셋을 다운로드하고 압축을 해제한 경로를 voc_dir에 저장합니다.
- num_workers = d2l.get_dataloader_workers(): 데이터 로더가 사용할 워커의 수를 가져옵니다.
- train_iter = torch.utils.data.DataLoader(VOCSegDataset(True, crop_size, voc_dir), batch_size, shuffle=True, drop_last=True, num_workers=num_workers): 학습 데이터셋을 위한 데이터 로더 train_iter를 생성합니다. True는 학습 데이터셋을 의미하며, VOCSegDataset 클래스를 사용하여 데이터셋을 불러옵니다.
- test_iter = torch.utils.data.DataLoader(VOCSegDataset(False, crop_size, voc_dir), batch_size, drop_last=True, num_workers=num_workers): 테스트 데이터셋을 위한 데이터 로더 test_iter를 생성합니다. False는 테스트 데이터셋을 의미하며, VOCSegDataset 클래스를 사용하여 데이터셋을 불러옵니다.
- return train_iter, test_iter: 학습 데이터셋과 테스트 데이터셋의 데이터 로더를 반환합니다.

이 코드는 VOC 시맨틱 세그멘테이션 데이터셋을 불러오기 위한 함수를 정의한 내용을 나타냅니다. 이 함수를 통해 모델 학습 및 평가에 활용할 수 있는 데이터 로더를 얻을 수 있습니다.

14.9.3. Summary

Semantic segmentation recognizes and understands what are in an image in pixel level by dividing the image into regions belonging to different semantic classes.

시맨틱 분할은 이미지를 서로 다른 시맨틱 클래스에 속하는 영역으로 나누어 이미지에 있는 것을 픽셀 수준으로 인식하고 이해합니다.
One of the most important semantic segmentation dataset is Pascal VOC2012.

가장 중요한 시맨틱 분할 데이터 세트 중 하나는 Pascal VOC2012입니다.
In semantic segmentation, since the input image and label correspond one-to-one on the pixel, the input image is randomly cropped to a fixed shape rather than rescaled.

시맨틱 분할에서 입력 이미지와 레이블은 픽셀에서 일대일로 대응하므로 입력 이미지는 크기를 다시 조정하지 않고 고정된 모양으로 무작위로 자릅니다.

14.9.4. Exercises

How can semantic segmentation be applied in autonomous vehicles and medical image diagnostics? Can you think of other applications?
Recall the descriptions of data augmentation in Section 14.1. Which of the image augmentation methods used in image classification would be infeasible to be applied in semantic segmentation?

'Dive into Deep Learning > D2L Computer Vision' 카테고리의 다른 글

D2L - 14.12. Neural Style Transfer (1)	2023.08.21
D2L - 14.11. Fully Convolutional Networks (0)	2023.08.21
D2L - 14.10. Transposed Convolution (1)	2023.08.21
D2L - 14.8. Region-based CNNs (R-CNNs) (0)	2023.08.20
D2L - 14.7. Single Shot Multibox Detection (0)	2023.08.19
D2L - 14.6. The Object Detection Dataset (0)	2023.08.19
D2L - 14.5. Multiscale Object Detection (0)	2023.08.19
D2L - 14.4. Anchor Boxes (0)	2023.08.19
D2L - 14.3. Object Detection and Bounding Boxes (0)	2023.08.19
D2L - 14.2. Fine-Tuning (0)	2023.08.19

Dive into Deep Learning/D2L Computer Vision

D2L - 14.8. Region-based CNNs (R-CNNs)

2023. 8. 20. 21:58 | Posted by 솔웅

import torch
import torchvision

X = torch.arange(16.).reshape(1, 1, 4, 4)
X

https://d2l.ai/chapter_computer-vision/rcnn.html

14.8. Region-based CNNs (R-CNNs) — Dive into Deep Learning 1.0.3 documentation

d2l.ai

14.8. Region-based CNNs (R-CNNs)

Besides single shot multibox detection described in Section 14.7, region-based CNNs or regions with CNN features (R-CNNs) are also among many pioneering approaches of applying deep learning to object detection (Girshick et al., 2014). In this section, we will introduce the R-CNN and its series of improvements: the fast R-CNN (Girshick, 2015), the faster R-CNN (Ren et al., 2015), and the mask R-CNN (He et al., 2017). Due to limited space, we will only focus on the design of these models.

섹션 14.7에서 설명한 단일 샷 멀티박스 감지 외에도 지역 기반 CNN 또는 CNN 기능이 있는 지역(R-CNN)도 객체 감지에 딥 러닝을 적용하는 많은 선구적인 접근 방식 중 하나입니다(Girshick et al., 2014). 이 섹션에서는 R-CNN과 일련의 개선 사항인 fast R-CNN(Girshick, 2015), faster R-CNN(Ren et al., 2015) 및 mask R-CNN(He 외, 2017). 제한된 공간으로 인해 이러한 모델의 디자인에만 집중할 것입니다.

14.8.1. R-CNNs

The R-CNN first extracts many (e.g., 2000) region proposals from the input image (e.g., anchor boxes can also be considered as region proposals), labeling their classes and bounding boxes (e.g., offsets). (Girshick et al., 2014)

R-CNN은 먼저 입력 이미지에서 많은(예: 2000개) region proposals을 추출하고(예: 앵커 상자도 region proposals으로 간주할 수 있음) 해당 클래스 및 경계 상자(예: 오프셋)에 레이블을 지정합니다.

Then a CNN is used to perform forward propagation on each region proposal to extract its features. Next, features of each region proposal are used for predicting the class and bounding box of this region proposal.

그런 다음 CNN을 사용하여 각 region proposal 에 대한 순방향 전파를 수행하여 해당 features을 추출합니다. 다음으로, 각 region proposal 의 features은 이 region proposal 의 클래스와 경계 상자를 예측하는 데 사용됩니다.

Fig. 14.8.1 shows the R-CNN model. More concretely, the R-CNN consists of the following four steps:

그림 14.8.1은 R-CNN 모델을 보여준다. 보다 구체적으로 R-CNN은 다음 네 단계로 구성됩니다.

Perform selective search to extract multiple high-quality region proposals on the input image (Uijlings et al., 2013). These proposed regions are usually selected at multiple scales with different shapes and sizes. Each region proposal will be labeled with a class and a ground-truth bounding box.

입력 이미지에서 여러 고품질 region proposals을 추출하기 위해 선택적 검색을 수행합니다(Uijlings et al., 2013). 이러한 제안된 영역은 일반적으로 모양과 크기가 다른 여러 척도에서 선택됩니다. 각 region proposals은 클래스와 ground-truth 경계 상자로 레이블이 지정됩니다.
Choose a pretrained CNN and truncate it before the output layer. Resize each region proposal to the input size required by the network, and output the extracted features for the region proposal through forward propagation.

사전 훈련된 CNN을 선택하고 출력 계층 앞에서 자릅니다. 각 region proposal의 크기를 네트워크에서 요구하는 입력 크기로 조정하고 순방향 전파를 통해 추출된 region proposal의 -features- 특징을 출력합니다.
Take the extracted features and labeled class of each region proposal as an example. Train multiple support vector machines to classify objects, where each support vector machine individually determines whether the example contains a specific class.

각 지역 제안 region proposal 의 추출된 기능 features 과 레이블이 지정된 클래스를 예로 들어 보겠습니다. 여러 서포트 벡터 머신을 훈련시켜 개체를 분류합니다. 여기서 각 서포트 벡터 머신은 예제에 특정 클래스가 포함되어 있는지 여부를 개별적으로 결정합니다.
Take the extracted features and labeled bounding box of each region proposal as an example. Train a linear regression model to predict the ground-truth bounding box.

각 지역 제안 region proposal의 추출된 특징과 레이블이 지정된 경계 상자를 예로 들어 보겠습니다. ground-truth 경계 상자를 예측하도록 선형 회귀 모델을 훈련합니다.

Although the R-CNN model uses pretrained CNNs to effectively extract image features, it is slow. Imagine that we select thousands of region proposals from a single input image: this requires thousands of CNN forward propagations to perform object detection. This massive computing load makes it infeasible to widely use R-CNNs in real-world applications.

R-CNN 모델은 사전 훈련된 CNN을 사용하여 이미지 특징을 효과적으로 추출하지만 속도가 느립니다. 단일 입력 이미지에서 수천 개의 지역 제안 region proposals을 선택한다고 상상해보십시오. 개체 감지를 수행하려면 수천 개의 CNN 정방향 전파가 필요합니다. 이 막대한 컴퓨팅 부하로 인해 실제 응용 프로그램에서 R-CNN을 널리 사용하는 것이 불가능합니다.

Region Proposal 이란?

A "Region Proposal" refers to a process in object detection tasks where potential regions or bounding boxes containing objects of interest are proposed within an image. In object detection, the goal is to identify and localize objects within an image, often by drawing bounding boxes around them. However, directly applying object detection algorithms to all possible image regions can be computationally expensive and inefficient.

"Region Proposal(영역 제안)"은 객체 검출 작업에서 이미지 내에 관심 대상 객체가 포함될 가능성이 있는 영역 또는 경계 상자를 제안하는 과정을 의미합니다. 객체 검출에서의 목표는 이미지 내에서 객체를 식별하고 위치를 파악하는 것인데, 모든 가능한 이미지 영역에 대해 직접 객체 검출 알고리즘을 적용하는 것은 계산적으로 부담스럽고 비효율적일 수 있습니다.

To address this, region proposal methods are used to narrow down the search space for object detection. These methods aim to generate a relatively small set of potential regions or bounding boxes that are likely to contain objects. These proposals serve as the input to the subsequent object detection algorithm, significantly reducing the computational burden.

이러한 문제를 해결하기 위해 영역 제안 방법을 사용하여 객체 검출을 위한 검색 공간을 좁힙니다. 이러한 방법은 객체가 포함될 가능성이 높은 상대적으로 작은 영역이나 경계 상자 세트를 생성하는 것을 목표로 합니다. 이러한 제안은 후속 객체 검출 알고리즘의 입력으로 사용되며, 계산 부담을 크게 줄여줍니다.

There are various techniques for generating region proposals. Some common methods include:

영역 제안을 생성하는 다양한 기법이 있습니다. 일반적인 방법은 다음과 같습니다:

Sliding Window Approach: A small window of fixed size is moved across the image at different scales and positions. Bounding boxes are generated for each window location. This approach can be computationally intensive due to the need to slide the window over the entire image.

슬라이딩 윈도우 접근법: 고정 크기의 작은 창을 이미지 상에서 다양한 스케일과 위치로 이동시키면서 각 창 위치에 대한 경계 상자를 생성합니다. 이 접근법은 전체 이미지에 대해 창을 이동시켜야 하기 때문에 계산적으로 비싼 경우가 있을 수 있습니다.
Selective Search: This method generates region proposals by grouping pixels based on their similarity in color, texture, and intensity. The algorithm combines regions of varying sizes and shapes to create a diverse set of proposals.

Selective Search(선택적 검색): 이 방법은 색상, 질감, 강도의 유사성을 기반으로 픽셀을 그룹화하여 영역 제안을 생성합니다. 알고리즘은 다양한 크기와 모양의 영역을 결합하여 다양한 제안 세트를 만듭니다.
EdgeBoxes: EdgeBoxes uses the edges in an image to generate object proposals. It identifies object-like regions by examining the number of bounding box edges that overlap with the object's edge.

EdgeBoxes: EdgeBoxes는 이미지 내의 가장자리를 활용하여 객체 제안을 생성합니다. 객체의 가장자리와 겹치는 경계 상자 가장자리의 수를 조사하여 객체와 유사한 영역을 식별합니다.
Region Proposal Networks (RPNs): RPNs are part of modern object detection frameworks like Faster R-CNN. They use convolutional neural networks to generate region proposals directly from the input image. RPNs learn to propose regions that are likely to contain objects.

Region Proposal Networks (RPNs): RPN은 Faster R-CNN과 같은 현대적인 객체 검출 프레임워크의 일부입니다. RPN은 컨볼루션 신경망을 사용하여 입력 이미지로부터 직접적으로 영역 제안을 생성합니다. RPN은 객체가 포함될 가능성이 높은 영역을 제안하도록 학습됩니다.

After generating region proposals, these proposals are typically ranked or scored based on their likelihood of containing objects. High-scoring proposals are then passed to the subsequent object classification and localization steps of the object detection pipeline.

영역 제안을 생성한 후, 이러한 제안은 일반적으로 객체 포함 가능성에 따라 순위 또는 점수를 매깁니다. 높은 점수를 받은 제안은 후속 객체 분류 및 위치 파악 단계로 전달됩니다.

In summary, "Region Proposal" refers to the process of generating potential bounding boxes or regions in an image that are likely to contain objects. This process helps reduce the search space for object detection tasks and improves the efficiency of object detection algorithms.

요약하면, "Region Proposal(영역 제안)"은 이미지 내에 있는 객체가 포함될 가능성이 높은 경계 상자나 영역을 생성하는 과정을 의미합니다. 이 과정을 통해 객체 검출 작업의 검색 공간이 축소되며, 객체 검출 알고리즘의 효율성이 향상됩니다.

14.8.2. Fast R-CNN

The main performance bottleneck of an R-CNN lies in the independent CNN forward propagation for each region proposal, without sharing computation. Since these regions usually have overlaps, independent feature extractions lead to much repeated computation. One of the major improvements of the fast R-CNN from the R-CNN is that the CNN forward propagation is only performed on the entire image (Girshick, 2015).

R-CNN의 주요 성능 병목 현상은 계산을 공유하지 않고 각 region proposal에 대한 독립적인 CNN 순방향 전파에 있습니다. 이러한 영역은 일반적으로 겹치기 때문에 독립적인 특징 추출은 많은 반복 계산으로 이어집니다. R-CNN에서 빠른 R-CNN의 주요 개선 사항 중 하나는 CNN 순방향 전파가 전체 이미지에서만 수행된다는 것입니다(Girshick, 2015).

Fig. 14.8.2 describes the fast R-CNN model. Its major computations are as follows:

그림 14.8.2는 빠른 R-CNN 모델을 설명합니다. 주요 계산은 다음과 같습니다.

Compared with the R-CNN, in the fast R-CNN the input of the CNN for feature extraction is the entire image, rather than individual region proposals. Moreover, this CNN is trainable. Given an input image, let the shape of the CNN output be 1×c×ℎ1×w1.

R-CNN과 비교할 때 빠른 R-CNN에서 특징 feature 추출을 위한 CNN의 입력은 개별 영역 제안region proposals 이 아닌 전체 이미지입니다. 게다가, 이 CNN은 학습이 가능합니다. 입력 이미지가 주어지면 CNN 출력의 모양을 1×c×ℎ1×w1이라고 합니다.
Suppose that selective search generates n region proposals. These region proposals (of different shapes) mark regions of interest (of different shapes) on the CNN output. Then these regions of interest further extract features of the same shape (say height ℎ2 and width w2 are specified) in order to be easily concatenated. To achieve this, the fast R-CNN introduces the region of interest (RoI) pooling layer: the CNN output and region proposals are input into this layer, outputting concatenated features of shape n×c×ℎ2×w2 that are further extracted for all the region proposals.

선택적 검색이 n개의 지역 제안 region proposals을 생성한다고 가정합니다. 이러한 영역 제안 region proposals(다른 모양)은 CNN 출력에서 관심 영역 regions of interest (다른 모양)을 표시합니다. 그런 다음 이러한 관심 영역은 쉽게 연결하기 위해 동일한 모양(높이 ℎ2 및 너비 w2가 지정됨)의 특징을 추가로 추출합니다. 이를 달성하기 위해 빠른 R-CNN은 관심 영역(regions of interest -RoI) 풀링 레이어를 도입합니다. CNN 출력 및 영역 제안 region proposa 이 이 레이어에 입력됩니다. 모든 영역 제안 region proposa에 대해 추가로 추출되는 shape n×c×ℎ2×w2의 연결된 특징 features 을 출력합니다.
Using a fully connected layer, transform the concatenated features into an output of shape n×d, where d depends on the model design.

완전 연결 레이어를 사용하여 연결된 피처를 모양 shape n×d의 출력으로 변환합니다. 여기서 d는 모델 디자인에 따라 다릅니다.
Predict the class and bounding box for each of the n region proposals. More concretely, in class and bounding box prediction, transform the fully connected layer output into an output of shape n×q (q is the number of classes) and an output of shape n×4, respectively. The class prediction uses softmax regression.

n개의 영역 제안 region proposals 각각에 대한 클래스 및 경계 상자를 예측합니다. 보다 구체적으로 클래스 및 경계 상자 예측에서 완전 연결 계층 출력을 각각 모양 n×q(q는 클래스 수) 및 모양 n×4의 출력으로 변환합니다. 클래스 예측은 softmax 회귀를 사용합니다.

The region of interest pooling layer proposed in the fast R-CNN is different from the pooling layer introduced in Section 7.5. In the pooling layer, we indirectly control the output shape by specifying sizes of the pooling window, padding, and stride. In contrast, we can directly specify the output shape in the region of interest pooling layer.

fast R-CNN에서 제안하는 region of interest pooling layer은 섹션 7.5에서 소개한 풀링 계층과 다릅니다. 풀링 레이어에서 풀링 창, 패딩 및 보폭의 크기를 지정하여 출력 모양을 간접적으로 제어합니다. 반대로 관심 영역 풀링 레이어에서 출력 모양을 직접 지정할 수 있습니다.

Region of Interest Pooling Layer (RoI Layer) 란?

The Region of Interest (RoI) pooling layer is a component commonly used in convolutional neural networks (CNNs) for object detection and instance segmentation tasks. Its primary purpose is to extract fixed-size feature representations from variable-sized regions of an input feature map. The RoI pooling layer plays a critical role in handling objects of different sizes and positions within an image and is crucial for enabling CNNs to perform accurate object localization and recognition.

관심 영역 풀링 레이어(Region of Interest pooling layer)는 객체 탐지 및 인스턴스 분할 작업에 널리 사용되는 컨볼루션 신경망(CNN) 구성 요소입니다. 주요 목적은 입력 특성 맵의 가변 크기 영역에서 고정 크기의 특성 표현을 추출하는 것입니다. 관심 영역 풀링 레이어는 이미지 내 다양한 크기와 위치의 객체를 처리하는 중요한 역할을 하며, 정확한 객체 지역화와 인식을 위한 CNN의 핵심 역할을 합니다.

In object detection tasks, a CNN generates a set of candidate bounding box predictions for potential objects within an image. These bounding boxes can vary in size and aspect ratio based on the object's location and scale. The RoI pooling layer is applied to these predicted bounding boxes to obtain a consistent spatial resolution of feature representations, which can then be fed into subsequent layers for classification and regression.

객체 탐지 작업에서 CNN은 이미지 내 잠재적인 객체에 대한 후보 바운딩 박스 예측 집합을 생성합니다. 이러한 바운딩 박스는 객체의 위치와 크기에 따라 크기와 종횡비가 다를 수 있습니다. 관심 영역 풀링 레이어는 이러한 예측된 바운딩 박스에 적용되어 고정된 공간 해상도(예: 7x7)의 특성 표현을 얻습니다. 이로써 후속 레이어로 분류 및 회귀 작업에 입력될 수 있습니다.

The RoI pooling layer operates as follows:

관심 영역 풀링 레이어는 다음과 같이 작동합니다:

Input: The input consists of a feature map generated by the preceding layers of the CNN and a set of predicted bounding boxes (RoIs).

입력: 입력은 CNN의 이전 레이어에서 생성된 특성 맵과 예측된 바운딩 박스(RoIs) 집합으로 구성됩니다.
Bounding Box Adaptation: Each RoI is adapted and aligned to a fixed spatial resolution (e.g., 7x7) using RoI pooling. This involves dividing the RoI into a grid of equal cells and selecting the maximum value from each cell. This pooling operation ensures that each RoI is transformed into a consistent feature map with the same spatial dimensions.

바운딩 박스 적: 각 RoI는 관심 영역 풀링을 사용하여 고정된 공간 해상도(예: 7x7)에 맞게 조정됩니다. 이는 RoI를 동일한 크기의 셀 그리드로 나누고 각 셀에서 최대 값을 선택하는 것을 포함합니다. 이 풀링 연산을 통해 각 RoI가 동일한 공간 차원을 가진 일관된 특성 맵으로 변환됩니다.
Feature Extraction: The RoI pooling operation generates fixed-size feature maps for each RoI. These feature maps can be directly fed into fully connected layers for classification or regression tasks.

특성 추출: 관심 영역 풀링 작업은 각 RoI에 대해 고정된 크기의 특성 맵을 생성합니다. 이러한 특성 맵은 분류나 회귀 작업을 위해 직접 완전 연결 레이어에 입력될 수 있습니다.

By using the RoI pooling layer, CNNs can effectively handle objects of different sizes, enabling the network to focus on relevant object details and reducing the impact of variations in object scale and position. This layer is a key component in modern object detection architectures like Faster R-CNN and Mask R-CNN.

관심 영역 풀링 레이어를 사용함으로써 CNN은 다양한 크기의 객체를 효과적(효율)으로 처리할 수 있으며, 네트워크는 관련성 있는 객체 세부 정보에 집중하고 객체 크기와 위치의 변동을 줄일 수 있습니다. 이 레이어는 Faster R-CNN 및 Mask R-CNN과 같은 현대적인 객체 탐지 아키텍처에서 핵심 구성 요소입니다.

Overall, the RoI pooling layer plays a pivotal role in enabling CNNs to process variable-sized regions of an input feature map, thereby facilitating accurate object detection, localization, and segmentation in complex scenes.

전반적으로, 관심 영역 풀링 레이어는 입력 특성 맵의 가변 크기 영역을 처리하는 데 중요한 역할을하여 복잡한 장면에서 정확한 객체 탐지, 지역화 및 분할을 용이하게 하는 데 중요한 역할을 합니다.

For example, let’s specify the output height and width for each region as ℎ2 and w2, respectively. For any region of interest window of shape ℎ×w, this window is divided into a ℎ2×w2 grid of subwindows, where the shape of each subwindow is approximately (ℎ/ℎ2)×(w/w2). In practice, the height and width of any subwindow shall be rounded up, and the largest element shall be used as output of the subwindow. Therefore, the region of interest pooling layer can extract features of the same shape even when regions of interest have different shapes.

예를 들어 각 영역의 출력 높이와 너비를 각각 ℎ2와 w2로 지정해 보겠습니다. ℎ×w 모양의 관심 영역 창에 대해 이 창은 ℎ2×w2 그리드의 하위 창으로 나뉘며 각 하위 창의 모양은 대략 (ℎ/ℎ2)×(w/w2)입니다. 실제로 모든 하위 창의 높이와 너비는 반올림되고 가장 큰 요소가 하위 창의 출력으로 사용됩니다. 따라서 관심 영역 풀링 계층은 관심 영역의 모양이 다른 경우에도 동일한 모양의 특징을 추출할 수 있습니다.

As an illustrative example, in Fig. 14.8.3, the upper-left 3×3 region of interest is selected on a 4×4 input. For this region of interest, we use a 2×2 region of interest pooling layer to obtain a 2×2 output. Note that each of the four divided subwindows contains elements 0, 1, 4, and 5 (5 is the maximum); 2 and 6 (6 is the maximum); 8 and 9 (9 is the maximum); and 10.

예를 들어, 그림 14.8.3에서 왼쪽 상단의 3×3 관심 영역이 4×4 입력에서 선택됩니다. 이 관심 영역에 대해 2×2 관심 영역 풀링 계층을 사용하여 2×2 출력을 얻습니다. 4개의 분할된 하위 창 각각에는 요소 0, 1, 4 및 5(5가 최대값)가 포함되어 있습니다. 2 및 6(6이 최대값); 8 및 9(9가 최대); 및 10.

Fig. 14.8.3  A 2×2 region of interest pooling layer.

Below we demonstrate the computation of the region of interest pooling layer. Suppose that the height and width of the CNN-extracted features X are both 4, and there is only a single channel.

아래에서는 관심 영역 풀링 계층의 계산을 보여줍니다. CNN에서 추출한 특징 X의 높이와 너비가 모두 4이고 채널이 하나만 있다고 가정합니다.

import torch
import torchvision

X = torch.arange(16.).reshape(1, 1, 4, 4)
X

import torch: 파이토치 라이브러리를 가져옵니다.
import torchvision: 파이토치 비전 라이브러리를 가져옵니다. 이미지 관련 작업을 위한 함수와 모델을 포함하고 있습니다.
X = torch.arange(16.).reshape(1, 1, 4, 4): 0부터 15까지의 숫자로 이루어진 배열을 생성합니다. .reshape(1, 1, 4, 4)를 사용하여 배열을 4x4 크기의 4차원 텐서로 변환합니다. 이렇게 변환하면 (batch_size, channels, height, width) 형태로 텐서를 표현하게 됩니다. 여기서 batch_size와 channels는 1로 설정되었습니다.
X: 생성된 텐서 X를 출력합니다.

이 코드는 4x4 크기의 1채널 이미지를 표현하는 4차원 텐서 X를 생성하고 출력하는 내용을 나타냅니다.

tensor([[[[ 0.,  1.,  2.,  3.],
          [ 4.,  5.,  6.,  7.],
          [ 8.,  9., 10., 11.],
          [12., 13., 14., 15.]]]])

Let’s further suppose that the height and width of the input image are both 40 pixels and that selective search generates two region proposals on this image. Each region proposal is expressed as five elements: its object class followed by the (x,y)-coordinates of its upper-left and lower-right corners.

또한 입력 이미지의 높이와 너비가 모두 40픽셀이고 선택적 검색이 이 이미지에 대해 두 개의 영역 제안region proposals을 생성한다고 가정해 보겠습니다. 각 영역 제안은 5개의 요소로 표현됩니다. 개체 클래스와 왼쪽 상단 및 오른쪽 하단 모서리의 (x,y) 좌표가 뒤따릅니다.

rois = torch.Tensor([[0, 0, 0, 20, 20], [0, 0, 10, 30, 30]])

rois: rois라는 변수에 텐서 값을 할당합니다.
torch.Tensor: 파이토치에서 텐서를 생성하는 클래스입니다. 행렬이나 다차원 배열과 유사한 데이터 구조를 나타냅니다.
[[0, 0, 0, 20, 20], [0, 0, 10, 30, 30]]: 2개의 리스트를 가진 리스트로 이루어진 텐서를 생성합니다. 각 리스트는 5개의 숫자로 이루어져 있으며, 각각의 리스트는 ROI(Region of Interest) 정보를 나타냅니다. 각 ROI 정보는 [batch index, class index, x1, y1, x2, y2] 형태로 표현됩니다. 여기서 batch index는 배치 내에서 몇 번째 이미지인지, class index는 객체 클래스를 의미하며, (x1, y1)은 좌상단 좌표, (x2, y2)는 우하단 좌표를 나타냅니다.
==> 요소가 5개인데 설명은 6개로 돼 있음. 교재 내용을 보면 object class,x1,y1,x2,y2 가 맞는 것 같

이 코드는 두 개의 ROI 정보를 담은 텐서 rois를 생성하는 내용을 나타냅니다.

Because the height and width of X are 1/10 of the height and width of the input image, the coordinates of the two region proposals are multiplied by 0.1 according to the specified spatial_scale argument. Then the two regions of interest are marked on X as X[:, :, 0:3, 0:3] and X[:, :, 1:4, 0:4], respectively. Finally in the 2×2 region of interest pooling, each region of interest is divided into a grid of sub-windows to further extract features of the same shape 2×2.

X의 높이와 너비는 입력 이미지의 높이와 너비의 1/10이기 때문에 두 영역 제안의 좌표는 지정된 spatial_scale 인수에 따라 0.1이 곱해집니다. 그런 다음 두 개의 관심 영역이 X에 각각 X[:, :, 0:3, 0:3] 및 X[:, :, 1:4, 0:4]로 표시됩니다. 마지막으로 2×2 관심 영역 풀링에서는 각 관심 영역을 하위 창의 그리드로 나누어 동일한 모양의 2×2 특징을 더 추출합니다.

torchvision.ops.roi_pool(X, rois, output_size=(2, 2), spatial_scale=0.1)

torchvision.ops.roi_pool: torchvision 패키지 내의 ops 모듈에서 roi_pool 함수를 호출합니다. 이 함수는 RoI (Region of Interest) Pooling 연산을 수행합니다.
X: 입력 데이터로서, RoI Pooling 연산을 수행할 대상 텐서입니다.
rois: RoI 정보를 담은 텐서입니다. 각 RoI 정보는 [batch index, class index, x1, y1, x2, y2] 형태로 구성되어 있습니다.
output_size=(2, 2): RoI Pooling 연산의 출력 크기를 지정합니다. 여기서는 2x2 크기의 출력을 생성하도록 지정되었습니다.
spatial_scale=0.1: RoI Pooling 연산에서 입력과 출력의 크기 비율을 나타내는 값입니다. 이 값은 입력 RoI의 좌표를 출력 RoI 좌표로 변환하는 데 사용됩니다.

이 코드는 입력 데이터 X와 RoI 정보를 사용하여 RoI Pooling 연산을 수행하는 내용을 나타냅니다. RoI Pooling은 주어진 RoI 영역을 정해진 크기로 출력하는 연산으로, 객체 검출과 같은 작업에서 주로 사용됩니다.

tensor([[[[ 5.,  6.],
          [ 9., 10.]]],


        [[[ 9., 11.],
          [13., 15.]]]])

14.8.3. Faster R-CNN

To be more accurate in object detection, the fast R-CNN model usually has to generate a lot of region proposals in selective search. To reduce region proposals without loss of accuracy, the faster R-CNN proposes to replace selective search with a region proposal network (Ren et al., 2015).

개체 감지에서 보다 정확하려면 fast R-CNN 모델은 일반적으로 선택적 검색에서 많은 영역 제안을 생성해야 합니다. 정확도 손실 없이 지역 제안 region proposals을 줄이기 위해 Faster R-CNN은 선택적 검색을 지역 제안 region proposals 네트워크로 대체할 것을 제안합니다(Ren et al., 2015).

Fig. 14.8.4 shows the faster R-CNN model. Compared with the fast R-CNN, the faster R-CNN only changes the region proposal method from selective search to a region proposal network. The rest of the model remain unchanged. The region proposal network works in the following steps:

그림 14.8.4는 더 빠른 R-CNN 모델을 보여줍니다. 빠른 R-CNN과 비교할 때 더 빠른 R-CNN은 지역 제안 region proposal 방법만 선택적 검색에서 지역 제안 region proposal 네트워크로 변경합니다. 나머지 모델은 변경되지 않습니다. 지역 제안 네트워크는 다음 단계로 작동합니다.

Use a 3×3 convolutional layer with padding of 1 to transform the CNN output to a new output with c channels. In this way, each unit along the spatial dimensions of the CNN-extracted feature maps gets a new feature vector of length c.

패딩이 1인 3×3 컨벌루션 레이어를 사용하여 CNN 출력을 c 채널이 있는 새 출력으로 변환합니다. 이러한 방식으로 CNN 추출 기능 맵의 공간 차원을 따라 각 단위는 길이가 c인 새로운 기능 벡터를 얻습니다.
Centered on each pixel of the feature maps, generate multiple anchor boxes of different scales and aspect ratios and label them.

기능 맵의 각 픽셀을 중심으로 다양한 크기와 가로 세로 비율의 여러 앵커 상자를 생성하고 레이블을 지정합니다.
Using the length-c feature vector at the center of each anchor box, predict the binary class (background or objects) and bounding box for this anchor box.

각 앵커 상자의 중심에 있는 길이-c 특징 벡터를 사용하여 이 앵커 상자의 이진 클래스(배경 또는 객체) 및 경계 상자를 예측합니다.
Consider those predicted bounding boxes whose predicted classes are objects. Remove overlapped results using non-maximum suppression. The remaining predicted bounding boxes for objects are the region proposals required by the region of interest pooling layer.

예측 클래스가 객체인 예측된 경계 상자를 고려하십시오. 최대가 아닌 억제를 사용하여 중첩된 결과를 제거합니다. 개체에 대한 나머지 예측 경계 상자는 관심 영역 풀링 레이어에 필요한 영역 제안입니다.

It is worth noting that, as part of the faster R-CNN model, the region proposal network is jointly trained with the rest of the model. In other words, the objective function of the faster R-CNN includes not only the class and bounding box prediction in object detection, but also the binary class and bounding box prediction of anchor boxes in the region proposal network. As a result of the end-to-end training, the region proposal network learns how to generate high-quality region proposals, so as to stay accurate in object detection with a reduced number of region proposals that are learned from data.

더 빠른 R-CNN 모델의 일부로 영역 제안 region proposal 네트워크가 모델의 나머지 부분과 공동으로 훈련된다는 점은 주목할 가치가 있습니다. 즉, 더 빠른 R-CNN의 목적 함수는 객체 감지에서 클래스 및 경계 상자 예측뿐만 아니라 영역 제안 region proposal 네트워크에서 앵커 상자의 이진 클래스 및 경계 상자 예측을 포함합니다. 종단 간 교육의 결과, 지역 제안 region proposal 네트워크는 고품질 지역 제안 region proposal을 생성하는 방법을 학습하여 데이터에서 학습된 지역 제안 region proposal의 수를 줄임과 동시 객체 감지에서 정확성을 유지합니다.

14.8.4. Mask R-CNN

In the training dataset, if pixel-level positions of object are also labeled on images, the mask R-CNN can effectively leverage such detailed labels to further improve the accuracy of object detection (He et al., 2017).

학습 데이터 세트에서 개체의 픽셀 수준 위치도 이미지에 레이블이 지정된 경우 마스크 R-CNN은 이러한 세부 레이블을 효과적으로 활용하여 개체 감지의 정확도를 더욱 향상시킬 수 있습니다(He et al., 2017).

As shown in Fig. 14.8.5, the mask R-CNN is modified based on the faster R-CNN. Specifically, the mask R-CNN replaces the region of interest pooling layer with the region of interest (RoI) alignment layer. This region of interest alignment layer uses bilinear interpolation to preserve the spatial information on the feature maps, which is more suitable for pixel-level prediction. The output of this layer contains feature maps of the same shape for all the regions of interest. They are used to predict not only the class and bounding box for each region of interest, but also the pixel-level position of the object through an additional fully convolutional network. More details on using a fully convolutional network to predict pixel-level semantics of an image will be provided in subsequent sections of this chapter.

그림 14.8.5와 같이 더 빠른 R-CNN을 기반으로 마스크 R-CNN이 수정되었습니다. 특히, 마스크 R-CNN은 관심 영역 풀링 레이어를 관심 영역(RoI) 정렬 레이어로 대체합니다. 이 관심 영역 정렬 레이어는 쌍선형 보간을 사용하여 픽셀 수준 예측에 더 적합한 기능 맵의 공간 정보를 보존합니다. 이 레이어의 출력에는 모든 관심 영역에 대해 동일한 모양의 기능 맵이 포함됩니다. 각 관심 영역에 대한 클래스 및 경계 상자뿐만 아니라 추가적인 완전 컨벌루션 네트워크를 통해 개체의 픽셀 수준 위치를 예측하는 데 사용됩니다. 이미지의 픽셀 수준 의미론을 예측하기 위해 완전 컨벌루션 네트워크를 사용하는 방법에 대한 자세한 내용은 이 장의 다음 섹션에서 제공됩니다.

14.8.5. Summary

The R-CNN extracts many region proposals from the input image, uses a CNN to perform forward propagation on each region proposal to extract its features, then uses these features to predict the class and bounding box of this region proposal.

R-CNN은 입력 이미지에서 많은 영역 제안region proposals을 추출하고, CNN을 사용하여 각 영역 제안에 대해 정방향 전파를 수행하여 해당 기능을 추출한 다음 이러한 기능을 사용하여 이 영역 제안의 클래스 및 경계 상자를 예측합니다.
One of the major improvements of the fast R-CNN from the R-CNN is that the CNN forward propagation is only performed on the entire image. It also introduces the region of interest pooling layer, so that features of the same shape can be further extracted for regions of interest that have different shapes.

R-CNN에서 빠른 R-CNN의 주요 개선 사항 중 하나는 CNN 순방향 전파가 전체 이미지에서만 수행된다는 것입니다. 또한 관심 영역 풀링 레이어를 도입하여 모양이 다른 관심 영역에 대해 동일한 모양의 특징을 추가로 추출할 수 있습니다.
The faster R-CNN replaces the selective search used in the fast R-CNN with a jointly trained region proposal network, so that the former can stay accurate in object detection with a reduced number of region proposals.

더 빠른 R-CNN은 빠른 R-CNN에서 사용되는 선택적 검색을 공동으로 훈련된 영역 제안 네트워크로 대체하므로 전자는 감소된 수의 영역 제안으로 객체 감지에서 정확성을 유지할 수 있습니다.
Based on the faster R-CNN, the mask R-CNN additionally introduces a fully convolutional network, so as to leverage pixel-level labels to further improve the accuracy of object detection.

더 빠른 R-CNN을 기반으로 하는 마스크 R-CNN은 픽셀 수준 레이블을 활용하여 개체 감지의 정확도를 더욱 향상시키기 위해 완전히 컨벌루션 네트워크를 추가로 도입합니다.

14.8.6. Exercises

Can we frame object detection as a single regression problem, such as predicting bounding boxes and class probabilities? You may refer to the design of the YOLO model (Redmon et al., 2016).
Compare single shot multibox detection with the methods introduced in this section. What are their major differences? You may refer to Figure 2 of Zhao et al. (2019).

'Dive into Deep Learning > D2L Computer Vision' 카테고리의 다른 글

D2L - 14.12. Neural Style Transfer (1)	2023.08.21
D2L - 14.11. Fully Convolutional Networks (0)	2023.08.21
D2L - 14.10. Transposed Convolution (1)	2023.08.21
D2L - 14.9. Semantic Segmentation and the Dataset (0)	2023.08.20
D2L - 14.7. Single Shot Multibox Detection (0)	2023.08.19
D2L - 14.6. The Object Detection Dataset (0)	2023.08.19
D2L - 14.5. Multiscale Object Detection (0)	2023.08.19
D2L - 14.4. Anchor Boxes (0)	2023.08.19
D2L - 14.3. Object Detection and Bounding Boxes (0)	2023.08.19
D2L - 14.2. Fine-Tuning (0)	2023.08.19

Dive into Deep Learning/D2L Computer Vision

D2L - 14.7. Single Shot Multibox Detection

2023. 8. 19. 21:26 | Posted by 솔웅

https://d2l.ai/chapter_computer-vision/ssd.html

14.7. Single Shot Multibox Detection — Dive into Deep Learning 1.0.3 documentation

d2l.ai

14.7. Single Shot Multibox Detection

In Section 14.3–Section 14.6, we introduced bounding boxes, anchor boxes, multiscale object detection, and the dataset for object detection. Now we are ready to use such background knowledge to design an object detection model: single shot multibox detection (SSD) (Liu et al., 2016). This model is simple, fast, and widely used. Although this is just one of vast amounts of object detection models, some of the design principles and implementation details in this section are also applicable to other models.

섹션 14.3–섹션 14.6에서 경계 상자, 앵커 상자, 멀티스케일 객체 감지 및 객체 감지를 위한 데이터 세트를 소개했습니다. 이제 이러한 배경 지식을 사용하여 객체 감지 모델인 단일 샷 멀티박스 감지(SSD)를 설계할 준비가 되었습니다(Liu et al., 2016). 이 모델은 간단하고 빠르며 널리 사용됩니다. 이것은 방대한 양의 객체 감지 모델 중 하나일 뿐이지만 이 섹션의 일부 설계 원칙 및 구현 세부 정보는 다른 모델에도 적용할 수 있습니다.

Single Shot Multibox Detection (SSD) 란?

Single Shot Multibox Detection (SSD) is a popular object detection algorithm in computer vision. It is designed to efficiently detect and localize objects in images. The "Single Shot" in its name indicates that it performs both object localization and classification in a single forward pass through a neural network, making it a faster and more efficient approach compared to some other object detection methods that involve multiple stages.

'Single Shot Multibox Detection' (SSD)은 컴퓨터 비전에서 널리 사용되는 객체 검출 알고리즘입니다. 이 알고리즘은 이미지 내의 객체를 효율적으로 검출하고 위치를 지정하는 데 사용됩니다. 그 이름에 있는 "Single Shot"은 이 알고리즘이 객체의 위치 지정 및 분류를 하나의 순방향 전파로 처리한다는 것을 의미하며, 다른 객체 검출 방법과 비교하여 더 빠르고 효율적인 방법을 나타냅니다.

Here's a brief overview of how SSD works:

다음은 SSD의 작동 방식에 대한 간략한 설명입니다:

Anchor Boxes: SSD uses a predefined set of anchor boxes of different sizes and aspect ratios that serve as reference bounding boxes. These anchor boxes are centered at various positions across the image.

앵커 박스: SSD는 다양한 크기와 종횡비를 가진 미리 정의된 앵커 박스 세트를 사용합니다. 이 앵커 박스는 이미지 내 다양한 위치에 중심이 맞춰져 있습니다.
Convolutional Backbone: The input image is passed through a convolutional neural network (CNN) backbone. This backbone extracts features from the image, capturing different levels of information.

합성곱 기반: 입력 이미지는 합성곱 신경망(CNN) 백본을 통과합니다. 이 백본은 이미지에서 특징을 추출하여 다양한 정보 수준을 포착합니다.
Multi-scale Feature Maps: SSD uses feature maps from different layers of the CNN to detect objects at various scales. These feature maps capture objects of different sizes due to the different receptive fields of the convolutional layers.

다중 스케일 특징 맵: SSD는 CNN의 다른 레이어에서 특징 맵을 사용하여 다양한 크기의 객체를 검출합니다. 이러한 특징 맵은 합성곱 레이어의 다른 수용 영역으로 인해 다양한 크기의 객체를 포착합니다.
Localization and Classification: For each position in the feature maps, SSD predicts the offsets to adjust the anchor boxes, aiming to match the true object bounding boxes. Additionally, SSD predicts the probability scores for different object classes.

위치 지정 및 분류: 특징 맵 내 각 위치에 대해 SSD는 앵커 박스를 조정하기 위한 오프셋을 예측하여 실제 객체의 경계 상자에 맞추려고 합니다. 또한 SSD는 다른 객체 클래스에 대한 확률 점수를 예측합니다.
Non-Maximum Suppression (NMS): To remove duplicate detections and improve precision, SSD uses non-maximum suppression. This step selects the most confident detection for each object and discards overlapping detections.

비최대 억제 (NMS): 중복 검출을 제거하고 정확도를 향상시키기 위해 SSD는 비최대 억제를 사용합니다. 이 단계에서는 각 객체에 대해 가장 자신있는 검출을 선택하고 겹치는 검출을 제거합니다.

The main advantages of SSD include its real-time performance, as it directly predicts object classes and bounding box coordinates in a single pass, without relying on region proposals. This makes it suitable for applications where speed is crucial, such as real-time object detection in videos or robotics.

SSD의 주요 장점은 객체의 클래스와 경계 상자 좌표를 단일 전파로 직접 예측하기 때문에 실시간 성능을 발휘한다는 점입니다. 이로써 영상이나 로보틱스와 같이 속도가 중요한 응용 분야에 적합합니다.

However, one limitation of SSD is that it might struggle with detecting small objects in high-resolution images, as the predefined anchor boxes might not match well with those objects. Despite this limitation, SSD remains a widely used and effective approach for object detection tasks.

하지만 SSD의 단점 중 하나는 고해상도 이미지에서 작은 객체를 검출하는 데 어려움을 겪을 수 있다는 점입니다. 미리 정의된 앵커 박스가 이러한 객체와 잘 일치하지 않을 수 있기 때문입니다. 이런 한계에도 불구하고 SSD는 여전히 객체 검출 작업에 널리 사용되는 효과적인 방법입니다.

14.7.1. Model

Fig. 14.7.1 provides an overview of the design of single-shot multibox detection. This model mainly consists of a base network followed by several multiscale feature map blocks. The base network is for extracting features from the input image, so it can use a deep CNN. For example, the original single-shot multibox detection paper adopts a VGG network truncated before the classification layer (Liu et al., 2016), while ResNet has also been commonly used. Through our design we can make the base network output larger feature maps so as to generate more anchor boxes for detecting smaller objects. Subsequently, each multiscale feature map block reduces (e.g., by half) the height and width of the feature maps from the previous block, and enables each unit of the feature maps to increase its receptive field on the input image.

그림 14.7.1은 단일 샷 멀티박스 감지 설계의 개요를 제공합니다. 이 모델은 주로 기본 네트워크와 여러 다중 스케일 기능 맵 블록으로 구성됩니다. 기본 네트워크는 입력 이미지에서 특징을 추출하기 위한 것이므로 심층 CNN을 사용할 수 있습니다. 예를 들어 원래의 단일 샷 멀티박스 감지 용지는 분류 계층 앞에서 잘린 VGG 네트워크를 채택하고(Liu et al., 2016), ResNet도 일반적으로 사용되었습니다. 우리의 설계를 통해 우리는 더 작은 물체를 감지하기 위해 더 많은 앵커 박스를 생성하기 위해 기본 네트워크 출력을 더 큰 기능 맵으로 만들 수 있습니다. 그 후, 각 멀티스케일 특징 맵 블록은 이전 블록에서 특징 맵의 높이와 너비를 줄이고(예: 절반) 특징 맵의 각 단위가 입력 이미지에서 수용 필드를 증가시킬 수 있도록 합니다.

Recall the design of multiscale object detection through layerwise representations of images by deep neural networks in Section 14.5. Since multiscale feature maps closer to the top of Fig. 14.7.1 are smaller but have larger receptive fields, they are suitable for detecting fewer but larger objects.

섹션 14.5에서 심층 신경망에 의한 이미지의 계층적 표현을 통한 다중 스케일 객체 감지 설계를 상기하십시오. 그림 14.7.1의 상단에 가까운 멀티스케일 특징 맵은 더 작지만 수용 필드가 더 크기 때문에 더 적지만 더 큰 물체를 감지하는 데 적합합니다.

In a nutshell, via its base network and several multiscale feature map blocks, single-shot multibox detection generates a varying number of anchor boxes with different sizes, and detects varying-size objects by predicting classes and offsets of these anchor boxes (thus the bounding boxes); thus, this is a multiscale object detection model.

간단히 말해서, 기본 네트워크와 여러 멀티스케일 기능 맵 블록을 통해 단일 샷 멀티박스 감지는 다양한 크기의 다양한 앵커 박스를 생성하고 이러한 앵커 박스의 클래스와 오프셋을 예측하여 다양한 크기의 객체를 감지합니다(따라서 경계 상자); 따라서 이것은 다중 스케일 객체 감지 모델입니다.

Fig. 14.7.1  As a multiscale object detection model, single-shot multibox detection mainly consists of a base network followed by several multiscale feature map blocks.

In the following, we will describe the implementation details of different blocks in Fig. 14.7.1. To begin with, we discuss how to implement the class and bounding box prediction.

다음에서는 그림 14.7.1의 여러 블록에 대한 구현 세부 사항을 설명합니다. 먼저 클래스 및 경계 상자 예측을 구현하는 방법에 대해 설명합니다.

14.7.1.1. Class Prediction Layer

Let the number of object classes be q. Then anchor boxes have q+1 classes, where class 0 is background. At some scale, suppose that the height and width of feature maps are ℎ and w, respectively. When α anchor boxes are generated with each spatial position of these feature maps as their center, a total of ℎwα anchor boxes need to be classified. This often makes classification with fully connected layers infeasible due to likely heavy parametrization costs. Recall how we used channels of convolutional layers to predict classes in Section 8.3. Single-shot multibox detection uses the same technique to reduce model complexity.

객체 클래스의 수를 q라 하자. 그런 다음 앵커 상자에는 q+1 클래스가 있으며 여기서 클래스 0은 배경입니다. 어떤 축척에서 특징 맵의 높이와 너비가 각각 ℎ와 w라고 가정합니다. 이러한 특징 맵의 각 공간 위치를 중심으로 α 앵커 박스를 생성하면 총 ℎwα 앵커 박스를 분류해야 한다. 이로 인해 매개변수화 비용이 높을 가능성이 높기 때문에 완전히 연결된 계층으로 분류하는 것이 불가능한 경우가 많습니다. 섹션 8.3에서 클래스를 예측하기 위해 컨벌루션 레이어의 채널을 어떻게 사용했는지 기억하십시오. 단일 샷 멀티박스 감지는 동일한 기술을 사용하여 모델 복잡성을 줄입니다.

Specifically, the class prediction layer uses a convolutional layer without altering width or height of feature maps. In this way, there can be a one-to-one correspondence between outputs and inputs at the same spatial dimensions (width and height) of feature maps. More concretely, channels of the output feature maps at any spatial position (x, y) represent class predictions for all the anchor boxes centered on (x, y) of the input feature maps. To produce valid predictions, there must be α(q+1) output channels, where for the same spatial position the output channel with index i(q+1)+j represents the prediction of the class j (0≤j≤q) for the anchor box i (0≤i<a).

특히 클래스 예측 레이어는 피처 맵의 너비나 높이를 변경하지 않고 컨볼루션 레이어를 사용합니다. 이러한 방식으로 기능 맵의 동일한 공간 차원(너비 및 높이)에서 출력과 입력 간에 일대일 대응이 있을 수 있습니다. 보다 구체적으로, 임의의 공간 위치(x, y)에 있는 출력 기능 맵의 채널은 입력 기능 맵의 (x, y) 중심에 있는 모든 앵커 박스에 대한 클래스 예측을 나타냅니다. 유효한 예측을 생성하려면 α(q+1) 출력 채널이 있어야 합니다. 여기서 동일한 공간 위치에 대해 인덱스 i(q+1)+j가 있는 출력 채널은 클래스 j(0≤j≤q)의 예측을 나타냅니다. 앵커 박스 i의 경우(0≤i<a).

Below we define such a class prediction layer, specifying α and q via arguments num_anchors and num_classes, respectively. This layer uses a 3×3 convolutional layer with a padding of 1. The width and height of the input and output of this convolutional layer remain unchanged.

아래에서 우리는 각각 num_anchors 및 num_classes 인수를 통해 α 및 q를 지정하여 이러한 클래스 예측 계층을 정의합니다. 이 레이어는 패딩이 1인 3×3 컨볼루션 레이어를 사용합니다. 이 컨볼루션 레이어의 입력 및 출력의 너비와 높이는 변경되지 않습니다.

%matplotlib inline
import torch
import torchvision
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l


def cls_predictor(num_inputs, num_anchors, num_classes):
    return nn.Conv2d(num_inputs, num_anchors * (num_classes + 1),
                     kernel_size=3, padding=1)

%matplotlib inline: 이 줄은 Jupyter Notebook 등에서 맷플롯립 그래프를 인라인으로 표시하도록 설정하는 매직 커맨드입니다. 그래프를 노트북 셀 안에서 바로 볼 수 있게 합니다.
import torch: 파이토치 라이브러리를 임포트합니다. 딥러닝 모델을 구성하고 학습시키는 데 사용됩니다.
import torchvision: 파이토치 비전(Vision) 관련 라이브러리를 임포트합니다. 이미지 데이터를 다루고 변환하는 등의 기능을 제공합니다.
from torch import nn: 파이토치의 nn 모듈로부터 필요한 클래스와 함수를 가져옵니다. 신경망 모델을 구성하는 데 사용됩니다.
from torch.nn import functional as F: 파이토치의 functional 모듈로부터 함수를 가져와 F로 별칭을 붙입니다. 신경망 구성 중 함수들을 사용하기 위해 가져옵니다.
from d2l import torch as d2l: 'Dive into Deep Learning' 도서의 코드 라이브러리(d2l)로부터 파이토치 관련 모듈을 가져옵니다.
def cls_predictor(num_inputs, num_anchors, num_classes): 이 함수는 분류(classification) 작업을 위한 예측기 모듈을 생성합니다. num_inputs는 입력 특성 맵의 채널 수를 나타내고, num_anchors는 각 위치에서 예측할 앵커(anchor)의 수를 나타냅니다. num_classes는 분류하려는 클래스의 수입니다.
return nn.Conv2d(num_inputs, num_anchors * (num_classes + 1), kernel_size=3, padding=1): 이 함수는 2D 합성곱(Convolution) 레이어를 생성하여 예측기 모듈을 반환합니다. 합성곱 레이어는 입력 특성 맵과 필터를 사용하여 특성을 추출하는 데 사용됩니다. num_inputs는 입력 채널 수, num_anchors * (num_classes + 1)는 출력 채널 수를 나타냅니다. 여기서 (num_classes + 1)은 각 클래스와 배경(class 0)을 분류하기 위한 채널 수입니다. kernel_size=3은 필터의 크기를 3x3으로 설정하고, padding=1은 입력 주변에 0 패딩을 추가하여 입력과 출력의 크기를 동일하게 유지합니다.

이 코드는 파이토치를 사용하여 클래스 예측을 위한 합성곱 레이어를 생성하는 함수를 정의하는 것입니다. 이러한 레이어는 주로 객체 검출과 분류 작업에 사용됩니다.

14.7.1.2. Bounding Box Prediction Layer

The design of the bounding box prediction layer is similar to that of the class prediction layer. The only difference lies in the number of outputs for each anchor box: here we need to predict four offsets rather than q+1 classes.

경계 상자 예측 계층의 설계는 클래스 예측 계층의 설계와 유사합니다. 유일한 차이점은 각 앵커 박스의 출력 수에 있습니다. 여기서 우리는 q+1 클래스가 아닌 4개의 오프셋을 예측해야 합니다.

def bbox_predictor(num_inputs, num_anchors):
    return nn.Conv2d(num_inputs, num_anchors * 4, kernel_size=3, padding=1)

def bbox_predictor(num_inputs, num_anchors):: 이 함수는 바운딩 박스(prediction box) 예측기 모듈을 생성하는 역할을 합니다. num_inputs는 입력 특성 맵의 채널 수를 나타내며, num_anchors는 각 위치에서 예측할 앵커(anchor)의 수를 나타냅니다.
return nn.Conv2d(num_inputs, num_anchors * 4, kernel_size=3, padding=1): 이 함수는 2D 합성곱(Convolution) 레이어를 생성하여 바운딩 박스 예측기 모듈을 반환합니다. 합성곱 레이어는 입력 특성 맵과 필터를 사용하여 특성을 추출하는 데 사용됩니다. num_inputs는 입력 채널 수, num_anchors * 4는 출력 채널 수를 나타냅니다. 여기서 num_anchors * 4는 각 앵커에 대한 바운딩 박스의 좌표를 예측하기 위한 채널 수입니다. 바운딩 박스의 좌표는 일반적으로 (x_min, y_min, x_max, y_max) 형식으로 표현되며, 이 경우 각 앵커마다 4개의 좌표를 예측해야 하므로 num_anchors * 4 채널이 필요합니다. kernel_size=3은 필터의 크기를 3x3으로 설정하고, padding=1은 입력 주변에 0 패딩을 추가하여 입력과 출력의 크기를 동일하게 유지합니다.

이 코드는 파이토치를 사용하여 바운딩 박스 예측을 위한 합성곱 레이어를 생성하는 함수를 정의하는 것입니다. 이러한 레이어는 객체 검출 작업에서 앵커와 관련된 바운딩 박스의 좌표를 예측하기 위해 사용됩니다.

14.7.1.3. Concatenating Predictions for Multiple Scales

As we mentioned, single-shot multibox detection uses multiscale feature maps to generate anchor boxes and predict their classes and offsets. At different scales, the shapes of feature maps or the numbers of anchor boxes centered on the same unit may vary. Therefore, shapes of the prediction outputs at different scales may vary.

앞에서 언급했듯이 단일 샷 멀티박스 감지는 멀티스케일 기능 맵을 사용하여 앵커 박스를 생성하고 해당 클래스와 오프셋을 예측합니다. 서로 다른 축척에서 피쳐 맵의 모양이나 같은 단위를 중심으로 하는 앵커 박스의 수가 다를 수 있습니다. 따라서 서로 다른 스케일에서 예측 출력의 모양이 다를 수 있습니다.

In the following example, we construct feature maps at two different scales, Y1 and Y2, for the same minibatch, where the height and width of Y2 are half of those of Y1. Let’s take class prediction as an example. Suppose that 5 and 3 anchor boxes are generated for every unit in Y1 and Y2, respectively. Suppose further that the number of object classes is 10. For feature maps Y1 and Y2 the numbers of channels in the class prediction outputs are 5×(10+1)=55 and 3×(10+1)=33, respectively, where either output shape is (batch size, number of channels, height, width).

다음 예제에서는 동일한 미니배치에 대해 Y1과 Y2의 두 가지 다른 축척으로 피처 맵을 구성합니다. 여기서 Y2의 높이와 너비는 Y1의 절반입니다. 클래스 예측을 예로 들어 보겠습니다. Y1과 Y2의 모든 유닛에 대해 각각 5개와 3개의 앵커 박스가 생성된다고 가정합니다. 개체 클래스의 수가 10이라고 가정합니다. 특징 맵 Y1 및 Y2의 경우 클래스 예측 출력의 채널 수는 각각 5×(10+1)=55 및 3×(10+1)=33입니다. 여기서 출력 형태는 (배치 크기, 채널 수, 높이, 너비)입니다.

def forward(x, block):
    return block(x)

Y1 = forward(torch.zeros((2, 8, 20, 20)), cls_predictor(8, 5, 10))
Y2 = forward(torch.zeros((2, 16, 10, 10)), cls_predictor(16, 3, 10))
Y1.shape, Y2.shape

def forward(x, block):: 이 함수는 주어진 입력 x와 블록(block)을 사용하여 순전파(forward pass)를 수행하는 역할을 합니다. 입력 x를 block에 적용하고 결과를 반환합니다.
return block(x): 이 줄은 주어진 블록 block을 입력 x에 적용하여 결과를 반환합니다. 이는 신경망의 순전파 단계에서 레이어를 통과시키는 역할을 합니다.
Y1 = forward(torch.zeros((2, 8, 20, 20)), cls_predictor(8, 5, 10)): torch.zeros((2, 8, 20, 20))는 크기가 2x8x20x20인 입력 데이터를 생성합니다. 이 입력 데이터를 cls_predictor(8, 5, 10) 함수에 적용하여 예측기를 생성합니다. 그리고 forward 함수를 이용하여 입력 데이터를 예측기에 통과시켜 Y1 결과를 얻습니다.
Y2 = forward(torch.zeros((2, 16, 10, 10)), cls_predictor(16, 3, 10)): torch.zeros((2, 16, 10, 10))는 크기가 2x16x10x10인 입력 데이터를 생성합니다. 이 입력 데이터를 cls_predictor(16, 3, 10) 함수에 적용하여 예측기를 생성합니다. 그리고 forward 함수를 이용하여 입력 데이터를 예측기에 통과시켜 Y2 결과를 얻습니다.
Y1.shape, Y2.shape: 이 줄은 Y1과 Y2의 형태(shape)를 확인합니다. .shape는 배열의 크기를 나타내는 속성(attribute)입니다. 이 코드에서는 Y1과 Y2의 출력 크기를 확인하여 결과를 반환합니다.

이 코드는 주어진 입력 데이터를 예측기에 통과시켜 각각 Y1과 Y2 출력을 얻고, 이들 출력의 크기를 확인하는 역할을 합니다. 이는 딥러닝 모델의 순전파 작업을 보여주는 간단한 예시입니다.

(torch.Size([2, 55, 20, 20]), torch.Size([2, 33, 10, 10]))

As we can see, except for the batch size dimension, the other three dimensions all have different sizes. To concatenate these two prediction outputs for more efficient computation, we will transform these tensors into a more consistent format.

보시다시피 배치 크기 차원을 제외하고 다른 세 차원은 모두 크기가 다릅니다. 보다 효율적인 계산을 위해 이 두 예측 출력을 연결하기 위해 이 텐서를 보다 일관된 형식으로 변환합니다.

Note that the channel dimension holds the predictions for anchor boxes with the same center. We first move this dimension to the innermost. Since the batch size remains the same for different scales, we can transform the prediction output into a two-dimensional tensor with shape (batch size, height × width × number of channels). Then we can concatenate such outputs at different scales along dimension 1.

채널 차원은 중심이 같은 앵커 박스에 대한 예측을 보유합니다. 먼저 이 차원을 가장 안쪽으로 이동합니다. 다른 스케일에 대해 배치 크기가 동일하게 유지되므로 예측 출력을 모양(배치 크기, 높이 × 너비 × 채널 수)이 있는 2차원 텐서로 변환할 수 있습니다. 그런 다음 차원 1을 따라 다른 스케일에서 이러한 출력을 연결할 수 있습니다.

def flatten_pred(pred):
    return torch.flatten(pred.permute(0, 2, 3, 1), start_dim=1)

def concat_preds(preds):
    return torch.cat([flatten_pred(p) for p in preds], dim=1)

def flatten_pred(pred):: 이 함수는 예측 결과 텐서(pred)를 2D로 평탄화(flatten)하는 역할을 합니다.
- pred.permute(0, 2, 3, 1): pred의 차원을 변경하여 이미지 각 위치의 예측 값을 마지막 차원으로 가져옵니다. 즉, 이미지 각 위치에 대한 예측 값들이 모두 한 차원으로 모이게 됩니다.
- torch.flatten(...): 위에서 변경한 예측 텐서를 2D로 평탄화합니다. start_dim=1은 평탄화 작업을 시작할 차원을 나타냅니다. 여기서는 첫 번째 차원(배치 차원)을 제외하고 평탄화를 수행합니다.
def concat_preds(preds):: 이 함수는 여러 예측 결과 텐서들(preds)을 연결(concatenate)하여 하나의 큰 2D 텐서로 만드는 역할을 합니다.
- torch.cat(...): 여러 개의 텐서를 지정된 차원(dim)을 기준으로 연결합니다. 여기서는 flatten_pred(p) for p in preds를 통해 preds 안의 각 예측 텐서를 2D로 평탄화하고, 이들을 차원 dim=1을 기준으로 연결하여 하나의 큰 텐서로 만듭니다.

이러한 코드는 주어진 예측 텐서들을 평탄화하거나 연결하여 전처리하는 함수들을 정의합니다. 주로 신경망 출력을 처리하고 객체 검출 작업을 수행할 때 사용될 수 있습니다.

In this way, even though Y1 and Y2 have different sizes in channels, heights, and widths, we can still concatenate these two prediction outputs at two different scales for the same minibatch.

이러한 방식으로 Y1과 Y2가 채널, 높이 및 너비에서 다른 크기를 가지더라도 동일한 미니배치에 대해 두 가지 다른 스케일에서 이 두 예측 출력을 연결할 수 있습니다.

concat_preds([Y1, Y2]).shape

concat_preds([Y1, Y2]): 이 부분은 concat_preds 함수를 사용하여 Y1과 Y2 두 개의 출력을 연결(concatenate)한 결과를 생성합니다. 이 함수는 주어진 출력들을 하나의 큰 2D 텐서로 만들어줍니다.
.shape: 이 코드는 연결한 결과 텐서의 크기(shape)를 확인하는 역할을 합니다. .shape는 텐서의 차원 크기를 나타내는 속성(attribute)입니다.

결과적으로, 이 코드는 Y1과 Y2 두 개의 출력을 연결한 텐서의 크기(shape)를 확인합니다. 이는 객체 검출 모델에서 다양한 예측 결과를 하나의 큰 텐서로 합치는 작업을 수행할 때 사용될 수 있습니다.

14.7.1.4. Downsampling Block

In order to detect objects at multiple scales, we define the following downsampling block down_sample_blk that halves the height and width of input feature maps. In fact, this block applies the design of VGG blocks in Section 8.2.1. More concretely, each downsampling block consists of two 3×3 convolutional layers with padding of 1 followed by a 2×2 max-pooling layer with stride of 2. As we know, 3×3 convolutional layers with padding of 1 do not change the shape of feature maps. However, the subsequent 2×2 max-pooling reduces the height and width of input feature maps by half. For both input and output feature maps of this downsampling block, because 1×2+(3−1)+(3−1)=6, each unit in the output has a 6×6 receptive field on the input. Therefore, the downsampling block enlarges the receptive field of each unit in its output feature maps.

여러 스케일에서 객체를 감지하기 위해 입력 기능 맵의 높이와 너비를 절반으로 줄이는 다음과 같은 다운샘플링 블록 down_sample_blk를 정의합니다. 실제로 이 블록은 섹션 8.2.1의 VGG 블록 설계를 적용합니다. 보다 구체적으로 각 다운샘플링 블록은 패딩이 1인 두 개의 3×3 컨볼루션 레이어와 보폭이 2인 2×2 최대 풀링 레이어로 구성됩니다. 아시다시피 패딩이 1인 3×3 컨볼루션 레이어는 피쳐 맵의 모양. 그러나 후속 2×2 최대 풀링은 입력 기능 맵의 높이와 너비를 절반으로 줄입니다. 이 다운샘플링 블록의 입력 및 출력 기능 맵 모두에 대해 1×2+(3−1)+(3−1)=6이므로 출력의 각 단위는 입력에 6×6 수용 필드를 가집니다. 따라서 다운샘플링 블록은 출력 기능 맵에서 각 유닛의 수용 필드를 확대합니다.

def down_sample_blk(in_channels, out_channels):
    blk = []
    for _ in range(2):
        blk.append(nn.Conv2d(in_channels, out_channels,
                             kernel_size=3, padding=1))
        blk.append(nn.BatchNorm2d(out_channels))
        blk.append(nn.ReLU())
        in_channels = out_channels
    blk.append(nn.MaxPool2d(2))
    return nn.Sequential(*blk)

def down_sample_blk(in_channels, out_channels):: 이 함수는 다운샘플링(down-sampling) 블록을 생성하는 역할을 합니다. 다운샘플링 블록은 입력의 공간 해상도를 줄이는 데 사용되며, 주로 컨볼루션 신경망(Convolutional Neural Network, CNN) 아키텍처에서 활용됩니다. in_channels는 입력 채널 수를, out_channels는 출력 채널 수를 나타냅니다.
blk = []: 빈 리스트 blk를 생성합니다. 이 리스트는 블록 내의 연속적인 레이어를 저장할 역할을 합니다.
for _ in range(2):: 2번 반복하는 루프를 생성합니다. 루프 안에서는 두 개의 컨볼루션 레이어를 생성하고, 이후 레이어들을 순차적으로 추가하게 됩니다.
- blk.append(nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)): 컨볼루션 레이어를 생성하여 blk 리스트에 추가합니다. 이 레이어는 입력 채널 수 in_channels에서 출력 채널 수 out_channels로 변환합니다. kernel_size=3은 커널(필터) 크기를 3x3으로 설정하고, padding=1은 입력 주변에 0 패딩을 추가하여 입력과 출력의 크기를 동일하게 유지합니다.
- blk.append(nn.BatchNorm2d(out_channels)): 배치 정규화(Batch Normalization) 레이어를 추가합니다. 이는 신경망 내부에서 입력 데이터의 평균과 분산을 조정하여 학습 안정성을 향상시키는 역할을 합니다.
- blk.append(nn.ReLU()): ReLU(Rectified Linear Unit) 활성화 함수 레이어를 추가합니다. ReLU는 비선형성을 도입하여 네트워크가 복잡한 함수를 학습할 수 있게 합니다.
- in_channels = out_channels: 다음 레이어에서의 입력 채널 수를 현재 출력 채널 수로 업데이트합니다.
blk.append(nn.MaxPool2d(2)): 최대 풀링(Max Pooling) 레이어를 추가합니다. 최대 풀링은 입력의 공간 해상도를 줄이는 데 사용되며, 특성 맵에서 가장 큰 값을 선택하여 다운샘플링합니다. 2는 풀링 영역의 크기를 2x2로 설정하는 것을 의미합니다.
return nn.Sequential(*blk): 생성한 레이어들을 순차적으로 실행하는 nn.Sequential 컨테이너를 생성하고 반환합니다. *blk는 리스트 내의 레이어들을 펼쳐서 순차적으로 컨테이너에 추가하라는 의미입니다.

이 함수는 입력 채널 수와 출력 채널 수를 입력으로 받아 다운샘플링 블록을 생성하는 역할을 합니다. 이 블록은 CNN에서 특성 추출을 위해 사용되며, 컨볼루션, 배치 정규화, 활성화 함수, 풀링 등의 레이어로 구성됩니다.

In the following example, our constructed downsampling block changes the number of input channels and halves the height and width of the input feature maps.

다음 예에서 구성된 다운샘플링 블록은 입력 채널 수를 변경하고 입력 기능 맵의 높이와 너비를 절반으로 줄입니다.

forward(torch.zeros((2, 3, 20, 20)), down_sample_blk(3, 10)).shape

forward(torch.zeros((2, 3, 20, 20)), down_sample_blk(3, 10)): 이 부분은 forward 함수를 사용하여 입력 데이터와 다운샘플링 블록(down_sample_blk(3, 10))을 통과시킨 결과를 생성합니다. torch.zeros((2, 3, 20, 20))는 크기가 2x3x20x20인 입력 데이터를 생성합니다. down_sample_blk(3, 10)는 입력 채널 수가 3이고 출력 채널 수가 10인 다운샘플링 블록을 생성합니다. 이 입력 데이터를 다운샘플링 블록에 통과시켜 결과를 얻습니다.
.shape: 이 코드는 결과 텐서의 크기(shape)를 확인하는 역할을 합니다. .shape는 텐서의 차원 크기를 나타내는 속성(attribute)입니다.

결과적으로, 이 코드는 입력 데이터를 다운샘플링 블록을 통과시켜 얻은 결과 텐서의 크기(shape)를 확인합니다. 다운샘플링은 풀링과 컨볼루션 레이어 등을 통해 입력 데이터의 공간 해상도를 줄이는 과정을 말하며, 결과적으로 텐서의 크기가 변할 수 있습니다.

14.7.1.5. Base Network Block

The base network block is used to extract features from input images. For simplicity, we construct a small base network consisting of three downsampling blocks that double the number of channels at each block. Given a 256×256 input image, this base network block outputs 32×32 feature maps (256/2**3=32).

기본 네트워크 블록은 입력 이미지에서 특징을 추출하는 데 사용됩니다. 단순화를 위해 각 블록에서 채널 수를 두 배로 늘리는 3개의 다운샘플링 블록으로 구성된 작은 기본 네트워크를 구성합니다. 256×256 입력 이미지가 주어지면 이 기본 네트워크 블록은 32×32 기능 맵(256/2**3=32)을 출력합니다.

def base_net():
    blk = []
    num_filters = [3, 16, 32, 64]
    for i in range(len(num_filters) - 1):
        blk.append(down_sample_blk(num_filters[i], num_filters[i+1]))
    return nn.Sequential(*blk)

forward(torch.zeros((2, 3, 256, 256)), base_net()).shape

def base_net():: 이 함수는 기본 신경망(Base Network)을 생성하는 역할을 합니다. 기본 신경망은 객체 검출 네트워크의 초기 부분으로 사용됩니다.
blk = []: 빈 리스트 blk를 생성합니다. 이 리스트는 신경망 내의 연속된 레이어를 저장할 역할을 합니다.
num_filters = [3, 16, 32, 64]: num_filters는 입력 채널 수와 각 블록에서 사용할 출력 채널 수를 나타내는 리스트입니다. 여기서 [3, 16, 32, 64]는 입력 채널 수가 3이고, 첫 번째 블록의 출력 채널 수가 16, 두 번째 블록의 출력 채널 수가 32, 세 번째 블록의 출력 채널 수가 64임을 나타냅니다.
for i in range(len(num_filters) - 1):: num_filters 리스트의 길이보다 하나 적은 횟수만큼 반복하는 루프를 생성합니다. 이 루프는 각 블록에 대한 다운샘플링 블록을 생성하여 blk 리스트에 추가합니다.
- blk.append(down_sample_blk(num_filters[i], num_filters[i+1])): down_sample_blk 함수를 사용하여 다운샘플링 블록을 생성하고 blk 리스트에 추가합니다. 현재 인덱스 i에 해당하는 출력 채널 수와 그 다음 인덱스 i+1에 해당하는 출력 채널 수를 이용하여 다운샘플링 블록을 생성합니다.
return nn.Sequential(*blk): 생성한 블록들을 순차적으로 실행하는 nn.Sequential 컨테이너를 생성하고 반환합니다. *blk는 리스트 내의 블록들을 펼쳐서 순차적으로 컨테이너에 추가하라는 의미입니다.
forward(torch.zeros((2, 3, 256, 256)), base_net()).shape: 이 코드는 입력 데이터를 기본 신경망에 통과시킨 결과 텐서의 크기(shape)를 확인합니다. torch.zeros((2, 3, 256, 256))는 크기가 2x3x256x256인 입력 데이터를 생성합니다. 이 입력 데이터를 기본 신경망에 통과시켜 결과 텐서를 얻습니다. 결과적으로, 이 코드는 입력 데이터를 기본 신경망에 넣고 얻은 출력 텐서의 크기를 확인합니다.

이 코드는 객체 검출 네트워크의 기본 부분인 기본 신경망을 정의하고, 입력 데이터를 이 신경망에 통과시킨 결과의 크기를 확인하는 역할을 합니다. 기본 신경망은 다양한 다운샘플링 블록을 사용하여 입력 데이터의 공간 해상도를 줄이는 역할을 수행합니다.

14.7.1.6. The Complete Model

The complete single shot multibox detection model consists of five blocks. The feature maps produced by each block are used for both (i) generating anchor boxes and (ii) predicting classes and offsets of these anchor boxes. Among these five blocks, the first one is the base network block, the second to the fourth are downsampling blocks, and the last block uses global max-pooling to reduce both the height and width to 1. Technically, the second to the fifth blocks are all those multiscale feature map blocks in Fig. 14.7.1.

완전한 단일 샷 멀티박스 감지 모델은 5개의 블록으로 구성됩니다. 각 블록에서 생성된 기능 맵은 (i) 앵커 상자 생성 및 (ii) 이러한 앵커 상자의 클래스 및 오프셋 예측에 모두 사용됩니다. 이 5개의 블록 중 첫 번째는 기본 네트워크 블록이고 두 번째에서 네 번째는 다운샘플링 블록이며 마지막 블록은 글로벌 최대 풀링을 사용하여 높이와 너비를 모두 1로 줄입니다. 기술적으로 두 번째에서 다섯 번째 블록은 그림 14.7.1의 모든 멀티스케일 특징 맵 블록입니다.

def get_blk(i):
    if i == 0:
        blk = base_net()
    elif i == 1:
        blk = down_sample_blk(64, 128)
    elif i == 4:
        blk = nn.AdaptiveMaxPool2d((1,1))
    else:
        blk = down_sample_blk(128, 128)
    return blk

def get_blk(i):: 이 함수는 블록(block)을 반환하는 역할을 합니다. 블록은 객체 검출 네트워크의 다양한 구성 요소 중 하나입니다. 함수는 입력으로 정수 i를 받습니다.
if i == 0:: 만약 i가 0이라면 아래 코드를 실행합니다. 이 부분은 i가 0일 때의 블록을 정의하는 부분입니다.
- blk = base_net(): base_net() 함수를 호출하여 기본 신경망 블록을 생성하고 blk 변수에 할당합니다.
elif i == 1:: 만약 i가 1이라면 아래 코드를 실행합니다. 이 부분은 i가 1일 때의 블록을 정의하는 부분입니다.
- blk = down_sample_blk(64, 128): down_sample_blk 함수를 호출하여 입력 채널 수가 64이고 출력 채널 수가 128인 다운샘플링 블록을 생성하고 blk 변수에 할당합니다.
elif i == 4:: 만약 i가 4라면 아래 코드를 실행합니다. 이 부분은 i가 4일 때의 블록을 정의하는 부분입니다.
- blk = nn.AdaptiveMaxPool2d((1,1)): nn.AdaptiveMaxPool2d((1,1))를 사용하여 2D 적응형 최대 풀링 블록을 생성하고 blk 변수에 할당합니다. 이 블록은 입력 데이터의 크기에 상관없이 고정된 크기의 풀링을 수행합니다.
else:: 만약 i가 위의 조건들에 해당하지 않는 경우, 즉 0, 1, 4가 아닌 경우 아래 코드를 실행합니다. 이 부분은 위의 조건들에 해당하지 않는 경우의 블록을 정의하는 부분입니다.
- blk = down_sample_blk(128, 128): down_sample_blk 함수를 호출하여 입력 채널 수와 출력 채널 수가 모두 128인 다운샘플링 블록을 생성하고 blk 변수에 할당합니다.
return blk: 생성한 블록을 반환합니다.

이 함수는 입력 정수 i에 따라 다양한 유형의 블록을 생성하여 반환하는 역할을 합니다. 이러한 블록들은 객체 검출 네트워크의 다양한 구성을 나타내며, 입력 i에 따라 다운샘플링 블록, 적응형 풀링 블록 등을 생성할 수 있습니다.

Now we define the forward propagation for each block. Different from in image classification tasks, outputs here include (i) CNN feature maps Y, (ii) anchor boxes generated using Y at the current scale, and (iii) classes and offsets predicted (based on Y) for these anchor boxes.

이제 각 블록에 대한 순방향 전파를 정의합니다. 이미지 분류 작업과 달리 여기서 출력에는 (i) CNN 기능 맵 Y, (ii) 현재 스케일에서 Y를 사용하여 생성된 앵커 상자, (iii) 이러한 앵커 상자에 대해 예측된 클래스 및 오프셋(Y 기반)이 포함됩니다.

def blk_forward(X, blk, size, ratio, cls_predictor, bbox_predictor):
    Y = blk(X)
    anchors = d2l.multibox_prior(Y, sizes=size, ratios=ratio)
    cls_preds = cls_predictor(Y)
    bbox_preds = bbox_predictor(Y)
    return (Y, anchors, cls_preds, bbox_preds)

def blk_forward(X, blk, size, ratio, cls_predictor, bbox_predictor):: 이 함수는 입력 데이터 X와 여러 가지 요소들을 활용하여 객체 검출 네트워크의 순전파 과정을 수행하는 역할을 합니다. 함수는 다양한 인자들을 입력으로 받습니다.
Y = blk(X): 입력 데이터 X를 블록 blk에 통과시켜 특성 맵 Y를 생성합니다. blk는 객체 검출 네트워크에서 사용되는 블록을 나타내며, 입력 데이터를 변환하고 특성을 추출하는 역할을 합니다.
anchors = d2l.multibox_prior(Y, sizes=size, ratios=ratio): d2l.multibox_prior 함수를 사용하여 특성 맵 Y에 대한 앵커(anchor) 박스들을 생성합니다. 앵커는 객체 검출에서 다양한 크기와 종횡비를 가지는 사각형 박스를 의미하며, 객체의 위치와 크기를 추론하는 데 사용됩니다. sizes는 앵커 박스의 크기 리스트를, ratios는 앵커 박스의 종횡비 리스트를 나타냅니다.
cls_preds = cls_predictor(Y): 특성 맵 Y를 입력으로 받아 클래스 예측기(cls_predictor)를 통과시켜 객체 클래스에 대한 예측값 cls_preds를 생성합니다. 이 예측값은 각 앵커 박스에 대한 객체 클래스 확률을 의미합니다.
bbox_preds = bbox_predictor(Y): 특성 맵 Y를 입력으로 받아 바운딩 박스 예측기(bbox_predictor)를 통과시켜 바운딩 박스 예측값 bbox_preds를 생성합니다. 이 예측값은 각 앵커 박스에 대한 바운딩 박스의 좌표를 의미합니다.
return (Y, anchors, cls_preds, bbox_preds): 이 코드는 생성한 여러 결과를 튜플로 묶어서 반환합니다. 반환되는 결과에는 특성 맵 Y, 앵커 박스들, 클래스 예측값, 바운딩 박스 예측값이 포함됩니다. 이러한 결과들은 객체 검출 네트워크의 순전파 결과로 활용됩니다.

이 함수는 객체 검출 네트워크에서 입력 데이터를 블록에 통과시켜 특성을 추출하고, 앵커 박스, 클래스 예측값, 바운딩 박스 예측값을 생성하는 과정을 수행하는 역할을 합니다. 이러한 정보들은 객체 검출 작업에서 중요한 역할을 하며, 네트워크가 위치와 클래스 정보를 예측하는 데 사용됩니다.

Recall that in Fig. 14.7.1 a multiscale feature map block that is closer to the top is for detecting larger objects; thus, it needs to generate larger anchor boxes. In the above forward propagation, at each multiscale feature map block we pass in a list of two scale values via the sizes argument of the invoked multibox_prior function (described in Section 14.4). In the following, the interval between 0.2 and 1.05 is split evenly into five sections to determine the smaller scale values at the five blocks: 0.2, 0.37, 0.54, 0.71, and 0.88. Then their larger scale values are given by √0.2×0.37=0.272, √0.37×0.54=0.447, and so on.

그림 14.7.1에서 상단에 가까운 멀티스케일 특징 맵 블록은 더 큰 물체를 감지하기 위한 것임을 상기하십시오. 따라서 더 큰 앵커 박스를 생성해야 합니다. 위의 순방향 전파에서 각 멀티스케일 기능 맵 블록에서 호출된 multibox_prior 함수의 크기 인수를 통해 두 개의 스케일 값 목록을 전달합니다(섹션 14.4에서 설명). 다음에서는 0.2와 1.05 사이의 간격을 5개 섹션으로 균일하게 분할하여 5개 블록(0.2, 0.37, 0.54, 0.71 및 0.88)에서 더 작은 척도 값을 결정합니다. 그런 다음 더 큰 척도 값은 √0.2×0.37=0.272, √0.37×0.54=0.447 등으로 지정됩니다.

sizes = [[0.2, 0.272], [0.37, 0.447], [0.54, 0.619], [0.71, 0.79],
         [0.88, 0.961]]
ratios = [[1, 2, 0.5]] * 5
num_anchors = len(sizes[0]) + len(ratios[0]) - 1

sizes: 이 부분은 앵커 박스(anchor box)의 크기를 나타내는 리스트입니다. 앵커 박스는 객체 검출에서 다양한 크기와 종횡비를 가진 사각형 박스를 의미합니다. 각 리스트 요소는 [w, h] 형태로, [0.2, 0.272], [0.37, 0.447] 등과 같이 사각형의 너비와 높이를 나타냅니다.
ratios: 이 부분은 앵커 박스의 종횡비(ratio)를 나타내는 리스트입니다. 종횡비는 박스의 너비와 높이 비율을 의미합니다. [1, 2, 0.5]와 같이 리스트 안에 종횡비를 나타내는 숫자들을 나열합니다. 이때 * 5는 리스트를 5번 복사하여 크기를 늘린 것을 의미합니다.
num_anchors: 이 부분은 총 앵커 박스의 개수를 계산하는 변수입니다. 앵커 박스의 개수는 크기와 종횡비에 따라 달라지며, 이 변수는 이를 계산하는데 사용됩니다. len(sizes[0])는 첫 번째 크기 리스트에 있는 요소의 개수를 나타내며, len(ratios[0])는 첫 번째 종횡비 리스트에 있는 요소의 개수를 나타냅니다. 마지막으로 - 1을 하여 중복으로 세어진 1개의 종횡비를 제외합니다.

이 코드는 객체 검출 작업에서 사용되는 앵커 박스의 크기와 종횡비를 정의하고, 총 앵커 박스의 개수를 계산하는 역할을 합니다. 이러한 앵커 박스는 네트워크가 객체의 위치와 크기를 예측하는 데 사용되는 중요한 개념입니다.

Now we can define the complete model TinySSD as follows.

이제 완전한 모델 TinySSD를 다음과 같이 정의할 수 있습니다.

class TinySSD(nn.Module):
    def __init__(self, num_classes, **kwargs):
        super(TinySSD, self).__init__(**kwargs)
        self.num_classes = num_classes
        idx_to_in_channels = [64, 128, 128, 128, 128]
        for i in range(5):
            # Equivalent to the assignment statement `self.blk_i = get_blk(i)`
            setattr(self, f'blk_{i}', get_blk(i))
            setattr(self, f'cls_{i}', cls_predictor(idx_to_in_channels[i],
                                                    num_anchors, num_classes))
            setattr(self, f'bbox_{i}', bbox_predictor(idx_to_in_channels[i],
                                                      num_anchors))

    def forward(self, X):
        anchors, cls_preds, bbox_preds = [None] * 5, [None] * 5, [None] * 5
        for i in range(5):
            # Here `getattr(self, 'blk_%d' % i)` accesses `self.blk_i`
            X, anchors[i], cls_preds[i], bbox_preds[i] = blk_forward(
                X, getattr(self, f'blk_{i}'), sizes[i], ratios[i],
                getattr(self, f'cls_{i}'), getattr(self, f'bbox_{i}'))
        anchors = torch.cat(anchors, dim=1)
        cls_preds = concat_preds(cls_preds)
        cls_preds = cls_preds.reshape(
            cls_preds.shape[0], -1, self.num_classes + 1)
        bbox_preds = concat_preds(bbox_preds)
        return anchors, cls_preds, bbox_preds

class TinySSD(nn.Module):: 이 부분은 nn.Module을 상속하는 TinySSD 클래스를 정의하는 부분입니다. TinySSD는 객체 검출을 위한 작은 SSD(Single Shot MultiBox Detector) 모델을 정의합니다.
def __init__(self, num_classes, **kwargs):: 클래스의 초기화 메서드입니다. num_classes는 검출하려는 클래스의 개수를 의미합니다.
- self.num_classes = num_classes: 클래스 변수 num_classes를 설정하여 클래스의 개수를 저장합니다.
- idx_to_in_channels = [64, 128, 128, 128, 128]: 입력 채널 수에 대한 리스트입니다. 각 블록에 대한 입력 채널 수를 나타냅니다.
- 루프를 통해 각 블록에 대한 동작을 설정합니다.
  - setattr(self, f'blk_{i}', get_blk(i)): get_blk 함수를 사용하여 i번째 블록을 생성하고 해당 블록을 self에 속성으로 추가합니다.
  - setattr(self, f'cls_{i}', cls_predictor(...)): cls_predictor 함수를 사용하여 클래스 예측기를 생성하고 해당 예측기를 self에 속성으로 추가합니다.
  - setattr(self, f'bbox_{i}', bbox_predictor(...)): bbox_predictor 함수를 사용하여 바운딩 박스 예측기를 생성하고 해당 예측기를 self에 속성으로 추가합니다.
def forward(self, X):: 순전파 메서드입니다. 네트워크의 입력을 받아서 순전파 연산을 수행합니다.
- 빈 리스트들을 생성하여 앵커, 클래스 예측값, 바운딩 박스 예측값을 저장할 공간을 할당합니다.
- 루프를 통해 각 블록에 대한 동작을 수행합니다.
  - X, anchors[i], cls_preds[i], bbox_preds[i] = blk_forward(...): blk_forward 함수를 사용하여 블록 i에 입력 X를 통과시키고 앵커, 클래스 예측값, 바운딩 박스 예측값을 얻습니다.
- anchors = torch.cat(anchors, dim=1): 앵커들을 텐서로 변환하고 차원 1을 기준으로 연결합니다.
- cls_preds = concat_preds(cls_preds): 클래스 예측값들을 연결(concatenate)하여 하나의 큰 텐서로 만듭니다.
- cls_preds = cls_preds.reshape(...): 클래스 예측값의 차원을 변환하여 클래스 개수에 맞게 재구성합니다.
- bbox_preds = concat_preds(bbox_preds): 바운딩 박스 예측값들을 연결하여 하나의 큰 텐서로 만듭니다.
- 최종적으로, 앵커, 클래스 예측값, 바운딩 박스 예측값을 반환합니다.

이 클래스는 작은 SSD 객체 검출 모델을 정의하고, 입력 데이터를 순전파하여 앵커, 클래스 예측값, 바운딩 박스 예측값을 생성하는 역할을 합니다. 클래스 예측과 바운딩 박스 예측을 통해 객체 검출을 수행하는 네트워크 구조를 나타냅니다.

We create a model instance and use it to perform forward propagation on a minibatch of 256×256 images X.

모델 인스턴스를 생성하고 이를 사용하여 256×256 이미지 X의 미니배치에서 정방향 전파를 수행합니다.

As shown earlier in this section, the first block outputs 32×32 feature maps. Recall that the second to fourth downsampling blocks halve the height and width and the fifth block uses global pooling. Since 4 anchor boxes are generated for each unit along spatial dimensions of feature maps, at all the five scales a total of (322+162+82+42+1)×4=5444 anchor boxes are generated for each image.

이 섹션의 앞부분에 표시된 것처럼 첫 번째 블록은 32×32 기능 맵을 출력합니다. 두 번째에서 네 번째 다운샘플링 블록은 높이와 너비를 절반으로 줄이고 다섯 번째 블록은 전역 풀링을 사용한다는 점을 기억하세요. 특징 맵의 공간 차원을 따라 각 단위에 대해 4개의 앵커 상자가 생성되므로 모든 5개 축척에서 각 이미지에 대해 총 (322+162+82+42+1)×4=5444개의 앵커 상자가 생성됩니다.

net = TinySSD(num_classes=1)
X = torch.zeros((32, 3, 256, 256))
anchors, cls_preds, bbox_preds = net(X)

print('output anchors:', anchors.shape)
print('output class preds:', cls_preds.shape)
print('output bbox preds:', bbox_preds.shape)

net = TinySSD(num_classes=1): TinySSD 클래스를 사용하여 객체 검출을 위한 작은 SSD 모델 net을 생성합니다. num_classes=1은 검출하려는 클래스의 개수를 1로 설정한다는 의미입니다.
X = torch.zeros((32, 3, 256, 256)): 크기가 32x3x256x256인 입력 데이터 X를 생성합니다. 이 입력 데이터는 배치 크기가 32이고 채널 수가 3인 이미지로 구성되며, 이미지의 가로 및 세로 크기는 각각 256입니다.
anchors, cls_preds, bbox_preds = net(X): 생성한 모델 net에 입력 데이터 X를 통과시켜 앵커, 클래스 예측값, 바운딩 박스 예측값을 얻습니다. anchors, cls_preds, bbox_preds 변수에 각각의 결과가 할당됩니다.
print('output anchors:', anchors.shape): anchors의 크기를 출력합니다. 이 부분은 네트워크의 순전파 결과로 생성된 앵커의 크기를 확인하는 역할을 합니다.
print('output class preds:', cls_preds.shape): cls_preds의 크기를 출력합니다. 이 부분은 네트워크의 순전파 결과로 생성된 클래스 예측값의 크기를 확인하는 역할을 합니다.
print('output bbox preds:', bbox_preds.shape): bbox_preds의 크기를 출력합니다. 이 부분은 네트워크의 순전파 결과로 생성된 바운딩 박스 예측값의 크기를 확인하는 역할을 합니다.

이 코드는 작은 SSD 모델에 입력 데이터를 통과시켜 앵커, 클래스 예측값, 바운딩 박스 예측값을 생성하고, 각 결과의 크기를 확인하는 역할을 수행합니다.

14.7.2. Training

Now we will explain how to train the single shot multibox detection model for object detection.

이제 객체 감지를 위해 단일 샷 멀티박스 감지 모델을 훈련하는 방법을 설명합니다.

14.7.2.1. Reading the Dataset and Initializing the Model

To begin with, let’s read the banana detection dataset described in Section 14.6.

먼저 섹션 14.6에 설명된 바나나 감지 데이터 세트를 읽어 보겠습니다.

batch_size = 32
train_iter, _ = d2l.load_data_bananas(batch_size)

batch_size = 32: 배치 크기를 32로 설정합니다. 배치 크기는 한 번의 학습 단계에서 처리되는 데이터의 개수를 나타냅니다. 여기서는 32개의 데이터가 하나의 배치로 묶여 처리될 것입니다.
train_iter, _ = d2l.load_data_bananas(batch_size): d2l.load_data_bananas(batch_size) 함수를 호출하여 바나나 객체 검출 데이터셋을 로드합니다. 이 함수는 학습 데이터를 미니배치로 나누어 제공하는 데이터 로더를 생성합니다. 여기서 반환되는 train_iter는 학습 데이터셋의 미니배치를 순회하는 반복자(iterator)입니다. 코드에서 _는 두 번째 반환값을 무시하는 변수입니다. 이렇게 생성된 데이터 로더를 사용하여 모델을 학습하게 될 것입니다.

이 코드는 배치 크기를 설정하고 바나나 객체 검출 데이터셋을 미니배치로 나누어 학습용 데이터 로더를 생성하는 역할을 합니다.

There is only one class in the banana detection dataset. After defining the model, we need to initialize its parameters and define the optimization algorithm.

바나나 감지 데이터 세트에는 하나의 클래스만 있습니다. 모델을 정의한 후 매개변수를 초기화하고 최적화 알고리즘을 정의해야 합니다.

device, net = d2l.try_gpu(), TinySSD(num_classes=1)
trainer = torch.optim.SGD(net.parameters(), lr=0.2, weight_decay=5e-4)

device, net = d2l.try_gpu(), TinySSD(num_classes=1): d2l.try_gpu() 함수를 사용하여 GPU 디바이스를 시도하고, 그 결과를 device 변수에 저장합니다. 그 다음, TinySSD 클래스를 사용하여 객체 검출을 위한 작은 SSD 모델 net을 생성합니다. num_classes=1은 검출하려는 클래스의 개수를 1로 설정한다는 의미입니다.
trainer = torch.optim.SGD(net.parameters(), lr=0.2, weight_decay=5e-4): torch.optim.SGD를 사용하여 확률적 경사 하강법(Stochastic Gradient Descent) 옵티마이저 trainer를 생성합니다. net.parameters()는 모델의 학습 가능한 매개변수들을 가져옵니다. lr=0.2는 학습률(learning rate)을 0.2로 설정한다는 의미이며, weight_decay=5e-4는 L2 정규화를 위한 가중치 감쇠(weight decay)를 설정합니다. 이 옵티마이저는 모델을 학습할 때 매개변수 업데이트를 수행하는데 사용됩니다.

이 코드는 GPU 디바이스를 시도하고 작은 SSD 모델을 생성한 후, 확률적 경사 하강법 옵티마이저를 설정하는 역할을 합니다. 이러한 설정을 통해 모델을 GPU로 이동하고, 옵티마이저를 사용하여 모델의 학습 과정을 설정합니다.

14.7.2.2. Defining Loss and Evaluation Functions

Object detection has two types of losses. The first loss concerns classes of anchor boxes: its computation can simply reuse the cross-entropy loss function that we used for image classification. The second loss concerns offsets of positive (non-background) anchor boxes: this is a regression problem. For this regression problem, however, here we do not use the squared loss described in Section 3.1.3. Instead, we use the ℓ1 norm loss, the absolute value of the difference between the prediction and the ground-truth. The mask variable bbox_masks filters out negative anchor boxes and illegal (padded) anchor boxes in the loss calculation. In the end, we sum up the anchor box class loss and the anchor box offset loss to obtain the loss function for the model.

객체 감지에는 두 가지 유형의 손실이 있습니다. 첫 번째 손실은 앵커 박스의 클래스와 관련이 있습니다. 해당 계산은 이미지 분류에 사용한 교차 엔트로피 손실 함수를 단순히 재사용할 수 있습니다. 두 번째 손실은 포지티브(배경이 아닌) 앵커 상자의 오프셋과 관련됩니다. 이것은 회귀 문제입니다. 그러나이 회귀 문제의 경우 여기에서는 섹션 3.1.3에서 설명한 제곱 손실을 사용하지 않습니다. 대신, 우리는 예측과 ground-truth 사이의 차이의 절대값인 ℓ1 놈 손실을 사용합니다. 마스크 변수 bbox_masks는 손실 계산에서 음의 앵커 상자와 불법(패딩된) 앵커 상자를 필터링합니다. 마지막으로 앵커 박스 클래스 손실과 앵커 박스 오프셋 손실을 합산하여 모델에 대한 손실 함수를 얻습니다.

cls_loss = nn.CrossEntropyLoss(reduction='none')
bbox_loss = nn.L1Loss(reduction='none')

def calc_loss(cls_preds, cls_labels, bbox_preds, bbox_labels, bbox_masks):
    batch_size, num_classes = cls_preds.shape[0], cls_preds.shape[2]
    cls = cls_loss(cls_preds.reshape(-1, num_classes),
                   cls_labels.reshape(-1)).reshape(batch_size, -1).mean(dim=1)
    bbox = bbox_loss(bbox_preds * bbox_masks,
                     bbox_labels * bbox_masks).mean(dim=1)
    return cls + bbox

cls_loss = nn.CrossEntropyLoss(reduction='none'): nn.CrossEntropyLoss 함수를 사용하여 클래스 예측값과 클래스 실제값 사이의 손실(loss)를 계산하는데 사용될 cls_loss를 정의합니다. reduction='none'은 손실을 각 샘플에 대해 개별적으로 계산하겠다는 의미입니다.
bbox_loss = nn.L1Loss(reduction='none'): nn.L1Loss 함수를 사용하여 바운딩 박스 예측값과 실제 바운딩 박스 사이의 L1 손실(loss)를 계산하는데 사용될 bbox_loss를 정의합니다. reduction='none'은 손실을 각 샘플에 대해 개별적으로 계산하겠다는 의미입니다.
def calc_loss(cls_preds, cls_labels, bbox_preds, bbox_labels, bbox_masks):: 손실을 계산하는 함수를 정의합니다. 이 함수는 클래스 예측값, 클래스 실제값, 바운딩 박스 예측값, 바운딩 박스 실제값, 바운딩 박스 마스크를 입력으로 받습니다.
- batch_size, num_classes = cls_preds.shape[0], cls_preds.shape[2]: 배치 크기와 클래스 개수를 가져옵니다.
- cls_loss(cls_preds.reshape(-1, num_classes), cls_labels.reshape(-1)): 클래스 예측값과 실제값을 모양을 변형하여 크로스 엔트로피 손실을 계산합니다. reshape(-1, num_classes)는 데이터를 배치 크기와 클래스 개수에 맞게 재구성하여 계산합니다.
- .reshape(batch_size, -1).mean(dim=1): 계산된 클래스 손실을 다시 배치 크기에 맞게 재구성하고, 각 배치의 평균을 계산합니다.
- bbox_loss(bbox_preds * bbox_masks, bbox_labels * bbox_masks).mean(dim=1): 바운딩 박스 예측값과 실제값에 바운딩 박스 마스크를 적용하여 L1 손실을 계산하고, 각 배치의 평균을 계산합니다.
- return cls + bbox: 클래스 손실과 바운딩 박스 손실을 더하여 최종 손실을 반환합니다.

이 함수는 클래스 예측과 바운딩 박스 예측 간의 손실을 계산하는데 사용되며, 클래스 손실과 바운딩 박스 손실을 더하여 최종적인 학습 손실을 계산하는 역할을 합니다.

We can use accuracy to evaluate the classification results. Due to the used ℓ1 norm loss for the offsets, we use the mean absolute error to evaluate the predicted bounding boxes. These prediction results are obtained from the generated anchor boxes and the predicted offsets for them.

정확도를 사용하여 분류 결과를 평가할 수 있습니다. 오프셋에 사용된 ℓ1 노름 손실로 인해 예측된 경계 상자를 평가하기 위해 평균 절대 오차를 사용합니다. 이러한 예측 결과는 생성된 앵커 박스와 그에 대한 예측된 오프셋에서 얻습니다.

def cls_eval(cls_preds, cls_labels):
    # Because the class prediction results are on the final dimension,
    # `argmax` needs to specify this dimension
    return float((cls_preds.argmax(dim=-1).type(
        cls_labels.dtype) == cls_labels).sum())

def bbox_eval(bbox_preds, bbox_labels, bbox_masks):
    return float((torch.abs((bbox_labels - bbox_preds) * bbox_masks)).sum())

def cls_eval(cls_preds, cls_labels):: 클래스 예측값과 실제 클래스 레이블 사이의 일치 여부를 평가하는 함수를 정의합니다. 이 함수는 클래스 예측값, 실제 클래스 레이블을 입력으로 받습니다.
- cls_preds.argmax(dim=-1): 클래스 예측값에서 가장 높은 값의 인덱스를 찾습니다. 여기서 dim=-1은 마지막 차원을 기준으로 최대값을 찾는다는 의미입니다.
- .type(cls_labels.dtype) == cls_labels: 클래스 예측값의 최대 인덱스와 실제 클래스 레이블이 같은지를 비교합니다. 이 결과는 불리언 값으로 반환됩니다.
- .sum(): 불리언 값을 모두 합산하여 일치하는 클래스 개수를 구합니다.
- return float(...): 일치하는 클래스 개수를 부동소수점 형태로 반환합니다.
def bbox_eval(bbox_preds, bbox_labels, bbox_masks):: 바운딩 박스 예측값과 실제 바운딩 박스 레이블 사이의 차이를 평가하는 함수를 정의합니다. 이 함수는 바운딩 박스 예측값, 실제 바운딩 박스 레이블, 바운딩 박스 마스크를 입력으로 받습니다.
- torch.abs((bbox_labels - bbox_preds) * bbox_masks): 실제 바운딩 박스 레이블과 예측 바운딩 박스 값 사이의 차이를 계산하고, 이를 바운딩 박스 마스크와 곱하여 차이를 마스크된 영역에만 적용합니다. 이렇게 계산된 결과는 절댓값을 취합니다.
- .sum(): 절댓값 차이를 모두 합산하여 바운딩 박스 예측과 실제 값 사이의 차이를 구합니다.
- return float(...): 바운딩 박스 값의 차이를 부동소수점 형태로 반환합니다.

이러한 함수들은 클래스 예측과 바운딩 박스 예측의 정확도와 일치도를 평가하는 데 사용됩니다. cls_eval 함수는 클래스 예측 정확도를 평가하고, bbox_eval 함수는 바운딩 박스 예측과 실제 값 사이의 차이를 평가합니다.

14.7.2.3. Training the Model

When training the model, we need to generate multiscale anchor boxes (anchors) and predict their classes (cls_preds) and offsets (bbox_preds) in the forward propagation. Then we label the classes (cls_labels) and offsets (bbox_labels) of such generated anchor boxes based on the label information Y. Finally, we calculate the loss function using the predicted and labeled values of the classes and offsets. For concise implementations, evaluation of the test dataset is omitted here.

모델을 교육할 때 다중 스케일 앵커 상자(앵커)를 생성하고 순방향 전파에서 해당 클래스(cls_preds) 및 오프셋(bbox_preds)을 예측해야 합니다. 그런 다음 레이블 정보 Y를 기반으로 이렇게 생성된 앵커 박스의 클래스(cls_labels)와 오프셋(bbox_labels)에 레이블을 지정합니다. 마지막으로 클래스와 오프셋의 예측 및 레이블 값을 사용하여 손실 함수를 계산합니다. 간결한 구현을 위해 테스트 데이터 세트의 평가는 여기에서 생략됩니다.

num_epochs, timer = 20, d2l.Timer()
animator = d2l.Animator(xlabel='epoch', xlim=[1, num_epochs],
                        legend=['class error', 'bbox mae'])
net = net.to(device)
for epoch in range(num_epochs):
    # Sum of training accuracy, no. of examples in sum of training accuracy,
    # Sum of absolute error, no. of examples in sum of absolute error
    metric = d2l.Accumulator(4)
    net.train()
    for features, target in train_iter:
        timer.start()
        trainer.zero_grad()
        X, Y = features.to(device), target.to(device)
        # Generate multiscale anchor boxes and predict their classes and
        # offsets
        anchors, cls_preds, bbox_preds = net(X)
        # Label the classes and offsets of these anchor boxes
        bbox_labels, bbox_masks, cls_labels = d2l.multibox_target(anchors, Y)
        # Calculate the loss function using the predicted and labeled values
        # of the classes and offsets
        l = calc_loss(cls_preds, cls_labels, bbox_preds, bbox_labels,
                      bbox_masks)
        l.mean().backward()
        trainer.step()
        metric.add(cls_eval(cls_preds, cls_labels), cls_labels.numel(),
                   bbox_eval(bbox_preds, bbox_labels, bbox_masks),
                   bbox_labels.numel())
    cls_err, bbox_mae = 1 - metric[0] / metric[1], metric[2] / metric[3]
    animator.add(epoch + 1, (cls_err, bbox_mae))
print(f'class err {cls_err:.2e}, bbox mae {bbox_mae:.2e}')
print(f'{len(train_iter.dataset) / timer.stop():.1f} examples/sec on '
      f'{str(device)}')

num_epochs, timer = 20, d2l.Timer(): 학습 에포크 수를 20으로 설정하고, 시간 측정을 위한 d2l.Timer() 객체를 생성합니다.
animator = d2l.Animator(xlabel='epoch', xlim=[1, num_epochs], legend=['class error', 'bbox mae']): 그래프를 그리기 위한 d2l.Animator() 객체를 생성합니다. 이 그래프는 에포크별 클래스 오류와 바운딩 박스 평균 절대 오차(MAE)를 나타낼 것입니다.
net = net.to(device): 모델 net을 GPU 디바이스로 이동시킵니다.
for epoch in range(num_epochs):: 에포크 수만큼 반복합니다.
- metric = d2l.Accumulator(4): 훈련 중에 모니터링할 메트릭 값을 저장할 d2l.Accumulator() 객체를 생성합니다. 이 메트릭은 훈련 정확도, 훈련 정확도의 샘플 개수, 절댓값 오차, 절댓값 오차의 샘플 개수를 나타냅니다.
- net.train(): 모델을 훈련 모드로 설정합니다.
- for features, target in train_iter:: 훈련 데이터 로더에서 미니배치를 가져옵니다.
  - timer.start(): 시간 측정을 시작합니다.
  - trainer.zero_grad(): 옵티마이저의 그라디언트를 초기화합니다.
  - X, Y = features.to(device), target.to(device): 입력 데이터와 레이블을 GPU 디바이스로 이동시킵니다.
  - anchors, cls_preds, bbox_preds = net(X): 입력 데이터를 모델에 통과시켜 앵커, 클래스 예측값, 바운딩 박스 예측값을 얻습니다.
  - bbox_labels, bbox_masks, cls_labels = d2l.multibox_target(anchors, Y): 앵커와 레이블을 이용하여 바운딩 박스 레이블, 바운딩 박스 마스크, 클래스 레이블을 생성합니다.
  - l = calc_loss(...): 계산된 레이블과 예측값을 사용하여 손실 함수를 계산합니다.
  - l.mean().backward(): 손실의 평균을 구하고, 역전파를 수행합니다.
  - trainer.step(): 옵티마이저를 사용하여 모델의 매개변수를 업데이트합니다.
  - metric.add(...): 훈련 중에 계산된 메트릭 값을 누적합니다. 클래스 정확도와 바운딩 박스 MAE를 계산하여 누적합니다.
- cls_err, bbox_mae = ...: 에포크별 클래스 오류와 바운딩 박스 평균 절대 오차를 계산합니다.
- animator.add(...): 그래프를 그리기 위해 그래프 데이터를 추가합니다. 에포크와 클래스 오류, 바운딩 박스 MAE를 그래프에 추가합니다.
print(f'class err {cls_err:.2e}, bbox mae {bbox_mae:.2e}'): 학습 종료 후, 최종 클래스 오류와 바운딩 박스 MAE를 출력합니다.
print(f'{len(train_iter.dataset) / timer.stop():.1f} examples/sec on ' f'{str(device)}'): 학습 속도를 출력합니다. 데이터셋에 대한 처리 속도와 사용한 디바이스(GPU) 정보도 함께 출력합니다.

이 코드는 작은 SSD 모델을 학습하는 주요 학습 루프를 나타내며, 각 에포크마다 클래스 오류와 바운딩 박스 MAE를 계산하여 모니터링합니다.

14.7.3. Prediction

During prediction, the goal is to detect all the objects of interest on the image. Below we read and resize a test image, converting it to a four-dimensional tensor that is required by convolutional layers.

예측하는 동안 목표는 이미지에서 관심 대상을 모두 감지하는 것입니다. 아래에서는 테스트 이미지를 읽고 크기를 조정하여 컨볼루션 레이어에 필요한 4차원 텐서로 변환합니다.

X = torchvision.io.read_image('../img/banana.jpg').unsqueeze(0).float()
img = X.squeeze(0).permute(1, 2, 0).long()

Using the multibox_detection function below, the predicted bounding boxes are obtained from the anchor boxes and their predicted offsets. Then non-maximum suppression is used to remove similar predicted bounding boxes.

아래의 multibox_detection 함수를 사용하여 앵커 박스와 예측된 오프셋에서 예측된 경계 상자를 얻습니다. 그런 다음 유사한 예측 경계 상자를 제거하기 위해 최대가 아닌 억제가 사용됩니다.

def predict(X):
    net.eval()
    anchors, cls_preds, bbox_preds = net(X.to(device))
    cls_probs = F.softmax(cls_preds, dim=2).permute(0, 2, 1)
    output = d2l.multibox_detection(cls_probs, bbox_preds, anchors)
    idx = [i for i, row in enumerate(output[0]) if row[0] != -1]
    return output[0, idx]

output = predict(X)

def predict(X):: 입력 이미지 X에 대한 객체 검출 예측을 수행하는 함수를 정의합니다.
- net.eval(): 모델을 평가 모드로 설정합니다. 평가 모드에서는 드롭아웃이나 배치 정규화와 같은 레이어의 동작이 변경되어 평가에 적합하게 동작합니다.
- anchors, cls_preds, bbox_preds = net(X.to(device)): 입력 이미지를 모델에 통과시켜 앵커, 클래스 예측값, 바운딩 박스 예측값을 얻습니다. X를 GPU 디바이스로 이동시켜 계산합니다.
- cls_probs = F.softmax(cls_preds, dim=2).permute(0, 2, 1): 클래스 예측값에 소프트맥스 함수를 적용하여 클래스 확률을 계산합니다. dim=2는 클래스 차원을 기준으로 소프트맥스를 계산한다는 의미이며, .permute(0, 2, 1)은 클래스 차원을 뒤로 옮깁니다.
- output = d2l.multibox_detection(cls_probs, bbox_preds, anchors): 클래스 확률과 바운딩 박스 예측값, 앵커 정보를 이용하여 객체 검출을 수행합니다. 이 함수는 객체 검출 결과를 반환합니다.
- idx = [i for i, row in enumerate(output[0]) if row[0] != -1]: 검출 결과 중에서 신뢰도가 -1이 아닌 객체 인덱스를 찾습니다.
- return output[0, idx]: 신뢰도가 -1이 아닌 객체에 대한 검출 결과를 반환합니다.
output = predict(X): 입력 이미지 X에 대한 객체 검출을 수행하여 결과를 output에 저장합니다.

이 코드는 학습된 모델을 사용하여 입력 이미지에 대한 객체 검출을 수행하고, 결과를 반환하는 함수를 정의하며, 실제로 객체 검출을 수행하여 결과를 output에 저장하는 과정을 나타냅니다.

Finally, we display all the predicted bounding boxes with confidence 0.9 or above as output.

마지막으로 신뢰도가 0.9 이상인 모든 예측 경계 상자를 출력으로 표시합니다.

def display(img, output, threshold):
    d2l.set_figsize((5, 5))
    fig = d2l.plt.imshow(img)
    for row in output:
        score = float(row[1])
        if score < threshold:
            continue
        h, w = img.shape[:2]
        bbox = [row[2:6] * torch.tensor((w, h, w, h), device=row.device)]
        d2l.show_bboxes(fig.axes, bbox, '%.2f' % score, 'w')

display(img, output.cpu(), threshold=0.9)

def display(img, output, threshold):: 입력 이미지 img와 객체 검출 결과 output, 그리고 신뢰도 임계값 threshold를 받아 객체 검출 결과를 시각화하는 함수를 정의합니다.
- d2l.set_figsize((5, 5)): 그림 크기를 설정합니다.
- fig = d2l.plt.imshow(img): 입력 이미지를 Matplotlib의 이미지로 표시하고, 해당 이미지 객체를 fig에 저장합니다.
- for row in output:: 객체 검출 결과 output의 각 행에 대해서 반복합니다.
  - score = float(row[1]): 객체 검출 결과에서 신뢰도를 가져옵니다. row[1]은 신뢰도 값을 나타냅니다.
  - if score < threshold:: 만약 신뢰도가 임계값보다 작으면 넘어갑니다.
  - h, w = img.shape[:2]: 입력 이미지의 높이와 너비를 가져옵니다.
  - bbox = [row[2:6] * torch.tensor((w, h, w, h), device=row.device)]: 객체 검출 결과에서 바운딩 박스 정보를 가져온 후, 바운딩 박스 좌표를 이미지 크기에 맞게 스케일 조정합니다.
  - d2l.show_bboxes(fig.axes, bbox, '%.2f' % score, 'w'): 스케일 조정된 바운딩 박스와 신뢰도 값을 사용하여 Matplotlib의 축에 바운딩 박스와 레이블을 표시합니다.
display(img, output.cpu(), threshold=0.9): 입력 이미지 img와 객체 검출 결과 output를 시각화합니다. 여기서 output의 텐서를 CPU로 이동시켜야 Matplotlib과 호환됩니다. 신뢰도 임계값은 0.9로 설정되어 0.9보다 높은 신뢰도를 가진 객체만 표시됩니다.

이 코드는 객체 검출 결과를 시각화하여 신뢰도 임계값을 넘는 객체들을 표시하는 함수를 정의하고, 실제 이미지와 객체 검출 결과를 사용하여 시각화하는 과정을 나타냅니다.

14.7.4. Summary

Single shot multibox detection is a multiscale object detection model. Via its base network and several multiscale feature map blocks, single-shot multibox detection generates a varying number of anchor boxes with different sizes, and detects varying-size objects by predicting classes and offsets of these anchor boxes (thus the bounding boxes).
단일 샷 멀티박스 감지는 다중 스케일 객체 감지 모델입니다. 기본 네트워크와 여러 멀티스케일 기능 맵 블록을 통해 단일 샷 멀티박스 감지는 다양한 크기의 다양한 앵커 박스를 생성하고 이러한 앵커 박스(따라서 경계 박스)의 클래스와 오프셋을 예측하여 다양한 크기의 객체를 감지합니다.

When training the single-shot multibox detection model, the loss function is calculated based on the predicted and labeled values of the anchor box classes and offsets.
단일 샷 멀티박스 감지 모델을 훈련할 때 손실 함수는 앵커 박스 클래스 및 오프셋의 예측 및 레이블 값을 기반으로 계산됩니다.

14.7.5. Exercises

Can you improve the single-shot multibox detection by improving the loss function? For example, replace ℓ1 norm loss with smooth ℓ1 norm loss for the predicted offsets. This loss function uses a square function around zero for smoothness, which is controlled by the hyperparameter σ:

When σ is very large, this loss is similar to the ℓ1 norm loss. When its value is smaller, the loss function is smoother.

def smooth_l1(data, scalar):
    out = []
    for i in data:
        if abs(i) < 1 / (scalar ** 2):
            out.append(((scalar * i) ** 2) / 2)
        else:
            out.append(abs(i) - 0.5 / (scalar ** 2))
    return torch.tensor(out)

sigmas = [10, 1, 0.5]
lines = ['-', '--', '-.']
x = torch.arange(-2, 2, 0.1)
d2l.set_figsize()

for l, s in zip(lines, sigmas):
    y = smooth_l1(x, scalar=s)
    d2l.plt.plot(x, y, l, label='sigma=%.1f' % s)
d2l.plt.legend();

def smooth_l1(data, scalar):: 입력 데이터 data와 스칼라 값 scalar를 받아서 smooth L1 손실 함수 값을 계산하는 함수를 정의합니다.
- out = []: 결과 값을 저장할 빈 리스트를 생성합니다.
- for i in data:: 입력 데이터의 각 원소에 대해서 반복합니다.
  - if abs(i) < 1 / (scalar ** 2):: 만약 입력 데이터의 절댓값이 1 / (scalar ** 2)보다 작으면,
    - out.append(((scalar * i) ** 2) / 2): 출력 리스트에 (scalar * i)^2 / 2 값을 추가합니다.
  - else:: 그렇지 않으면,
    - out.append(abs(i) - 0.5 / (scalar ** 2)): 출력 리스트에 abs(i) - 0.5 / (scalar ** 2) 값을 추가합니다.
- return torch.tensor(out): 계산된 결과를 텐서 형태로 반환합니다.
sigmas = [10, 1, 0.5]: 다양한 시그마 값 리스트를 정의합니다.
lines = ['-', '--', '-.']: 그래프에서 사용할 라인 스타일을 정의합니다.
x = torch.arange(-2, 2, 0.1): -2부터 2까지 0.1 간격으로 숫자들을 생성하여 x를 정의합니다.
d2l.set_figsize(): 그래프 크기를 설정합니다.
for l, s in zip(lines, sigmas):: 라인 스타일과 시그마 값을 하나씩 가져오면서 반복합니다.
- y = smooth_l1(x, scalar=s): smooth_l1 함수를 사용하여 x 값에 대한 smooth L1 손실 함수 값을 계산하여 y에 저장합니다.
- d2l.plt.plot(x, y, l, label='sigma=%.1f' % s): 계산된 x와 y 값을 사용하여 그래프를 그리고, 라벨에 시그마 값을 표시합니다.
d2l.plt.legend(): 그래프에 범례를 표시합니다.

이 코드는 다양한 시그마 값을 사용하여 smooth L1 손실 함수를 그래프로 표현하는 작업을 수행합니다.

def focal_loss(gamma, x):
    return -(1 - x) ** gamma * torch.log(x)

x = torch.arange(0.01, 1, 0.01)
for l, gamma in zip(lines, [0, 1, 5]):
    y = d2l.plt.plot(x, focal_loss(gamma, x), l, label='gamma=%.1f' % gamma)
d2l.plt.legend();

def focal_loss(gamma, x):: 파라미터 gamma와 입력 x를 받아 focal loss 값을 계산하는 함수를 정의합니다.
- return -(1 - x) ** gamma * torch.log(x): focal loss를 계산하여 반환합니다. focal loss는 -(1 - x)^gamma * log(x) 형태로 정의됩니다.
x = torch.arange(0.01, 1, 0.01): 0.01부터 1까지 0.01 간격으로 숫자들을 생성하여 x를 정의합니다.
for l, gamma in zip(lines, [0, 1, 5]):: 라인 스타일과 감마 값을 하나씩 가져오면서 반복합니다.
- y = d2l.plt.plot(x, focal_loss(gamma, x), l, label='gamma=%.1f' % gamma): x 값에 대해 해당 감마 값을 사용하여 focal loss를 계산하고, 그래프를 그립니다. 라벨에 감마 값을 표시합니다.
d2l.plt.legend(): 그래프에 범례를 표시합니다.

이 코드는 다양한 감마 값과 입력에 대한 focal loss 함수를 그래프로 표현하는 작업을 수행합니다.

Due to space limitations, we have omitted some implementation details of the single shot multibox detection model in this section. Can you further improve the model in the following aspects:
1. When an object is much smaller compared with the image, the model could resize the input image bigger.
2. There are typically a vast number of negative anchor boxes. To make the class distribution more balanced, we could downsample negative anchor boxes.
3. In the loss function, assign different weight hyperparameters to the class loss and the offset loss.
4. Use other methods to evaluate the object detection model, such as those in the single shot multibox detection paper (Liu et al., 2016).

'Dive into Deep Learning > D2L Computer Vision' 카테고리의 다른 글

D2L - 14.12. Neural Style Transfer (1)	2023.08.21
D2L - 14.11. Fully Convolutional Networks (0)	2023.08.21
D2L - 14.10. Transposed Convolution (1)	2023.08.21
D2L - 14.9. Semantic Segmentation and the Dataset (0)	2023.08.20
D2L - 14.8. Region-based CNNs (R-CNNs) (0)	2023.08.20
D2L - 14.6. The Object Detection Dataset (0)	2023.08.19
D2L - 14.5. Multiscale Object Detection (0)	2023.08.19
D2L - 14.4. Anchor Boxes (0)	2023.08.19
D2L - 14.3. Object Detection and Bounding Boxes (0)	2023.08.19
D2L - 14.2. Fine-Tuning (0)	2023.08.19

Dive into Deep Learning/D2L Computer Vision

D2L - 14.6. The Object Detection Dataset

2023. 8. 19. 20:24 | Posted by 솔웅

imgs = (batch[0][:10].permute(0, 2, 3, 1)) / 255
axes = d2l.show_images(imgs, 2, 5, scale=2)
for ax, label in zip(axes, batch[1][:10]):
    d2l.show_bboxes(ax, [label[0][1:5] * edge_size], colors=['w'])

https://d2l.ai/chapter_computer-vision/object-detection-dataset.html

14.6. The Object Detection Dataset — Dive into Deep Learning 1.0.3 documentation

d2l.ai

14.6. The Object Detection Dataset

There is no small dataset such as MNIST and Fashion-MNIST in the field of object detection. In order to quickly demonstrate object detection models, we collected and labeled a small dataset. First, we took photos of free bananas from our office and generated 1000 banana images with different rotations and sizes. Then we placed each banana image at a random position on some background image. In the end, we labeled bounding boxes for those bananas on the images.

객체 감지 분야에는 MNIST, Fashion-MNIST와 같은 작은 데이터셋이 없습니다. 객체 감지 모델을 신속하게 시연하기 위해 작은 데이터 세트를 수집하고 레이블을 지정했습니다. 먼저 사무실에서 무료로 제공되는 바나나 사진을 찍어 회전과 크기가 다른 1000개의 바나나 이미지를 생성했습니다. 그런 다음 일부 배경 이미지의 임의 위치에 각 바나나 이미지를 배치했습니다. 결국 이미지의 바나나에 대한 경계 상자에 레이블을 지정했습니다.

14.6.1. Downloading the Dataset

The banana detection dataset with all the image and csv label files can be downloaded directly from the Internet.

모든 이미지 및 csv 레이블 파일이 포함된 바나나 감지 데이터 세트는 인터넷에서 직접 다운로드할 수 있습니다.

%matplotlib inline
import os
import pandas as pd
import torch
import torchvision
from d2l import torch as d2l

#@save
d2l.DATA_HUB['banana-detection'] = (
    d2l.DATA_URL + 'banana-detection.zip',
    '5de26c8fce5ccdea9f91267273464dc968d20d72')

위 코드는 데이터셋을 다운로드하기 위한 설정과 관련된 내용을 포함하고 있습니다. 코드를 한 줄씩 살펴보겠습니다.

%matplotlib inline: 이는 Jupyter Notebook 환경에서 Matplotlib 그래프를 인라인으로 표시하도록 설정하는 명령입니다. Matplotlib 그래프가 노트북 내에서 바로 표시되도록 해줍니다.
import os: 파이썬 내장 라이브러리인 os 모듈을 불러옵니다. 파일 및 디렉토리 경로 관련 기능을 사용할 수 있도록 해줍니다.
import pandas as pd: 데이터 분석과 조작을 위한 판다스 라이브러리를 불러옵니다. 관례상 판다스는 pd로 축약하여 사용됩니다.
import torch: 파이토치 라이브러리를 불러옵니다. 딥러닝 모델을 구축하고 훈련하기 위한 도구를 제공합니다.
import torchvision: 파이토치에서 제공하는 컴퓨터 비전 관련 라이브러리입니다. 이미지 데이터셋 및 전처리 기능을 포함하고 있습니다.
from d2l import torch as d2l: "Dive into Deep Learning" 교재의 파이토치 관련 도우미 함수들을 가져옵니다. d2l은 이러한 도우미 함수들을 사용하기 위한 이름 공간입니다.
d2l.DATA_HUB['banana-detection']: 데이터셋을 다운로드하기 위한 정보를 딕셔너리로 정의합니다. 해당 정보에는 데이터셋의 다운로드 URL과 데이터셋의 해시값이 포함됩니다. 데이터셋의 해시값은 데이터의 무결성을 검증하기 위해 사용됩니다.
- 'banana-detection.zip': 데이터셋의 압축 파일 이름입니다.
- '5de26c8fce5ccdea9f91267273464dc968d20d72': 데이터셋 파일의 해시값입니다. 이 값은 데이터 파일의 무결성을 확인하기 위해 사용됩니다. 데이터가 변조되지 않았는지 확인하는 용도로 사용됩니다.

14.6.2. Reading the Dataset

We are going to read the banana detection dataset in the read_data_bananas function below. The dataset includes a csv file for object class labels and ground-truth bounding box coordinates at the upper-left and lower-right corners.

아래의 read_data_bananas 함수에서 바나나 감지 데이터 세트를 읽을 것입니다. 데이터 세트에는 개체 클래스 레이블에 대한 csv 파일과 왼쪽 상단 및 오른쪽 하단 모서리의 실측 경계 상자 좌표가 포함되어 있습니다.

#@save
def read_data_bananas(is_train=True):
    """Read the banana detection dataset images and labels."""
    data_dir = d2l.download_extract('banana-detection')
    csv_fname = os.path.join(data_dir, 'bananas_train' if is_train
                             else 'bananas_val', 'label.csv')
    csv_data = pd.read_csv(csv_fname)
    csv_data = csv_data.set_index('img_name')
    images, targets = [], []
    for img_name, target in csv_data.iterrows():
        images.append(torchvision.io.read_image(
            os.path.join(data_dir, 'bananas_train' if is_train else
                         'bananas_val', 'images', f'{img_name}')))
        # Here `target` contains (class, upper-left x, upper-left y,
        # lower-right x, lower-right y), where all the images have the same
        # banana class (index 0)
        targets.append(list(target))
    return images, torch.tensor(targets).unsqueeze(1) / 256

위 코드는 바나나 검출 데이터셋의 이미지와 레이블을 읽어오는 함수를 정의하고 있습니다. 코드를 한 줄씩 살펴보겠습니다.

def read_data_bananas(is_train=True):: 이 함수는 바나나 검출 데이터셋의 이미지와 레이블을 읽어오는 기능을 수행합니다. is_train 매개변수는 데이터셋이 훈련용인지 검증용인지를 나타내며, 기본값은 True입니다.
data_dir = d2l.download_extract('banana-detection'): 데이터셋을 다운로드하고 압축을 해제하여 저장할 디렉토리를 지정합니다. download_extract 함수를 사용하여 데이터셋을 다운로드하고 압축을 해제합니다.
csv_fname = os.path.join(data_dir, 'bananas_train' if is_train else 'bananas_val', 'label.csv'): 데이터셋의 레이블 정보가 담긴 CSV 파일의 경로를 지정합니다. is_train이 True인 경우 훈련용 데이터셋의 레이블 파일 경로를, 그렇지 않은 경우 검증용 데이터셋의 레이블 파일 경로를 지정합니다.
csv_data = pd.read_csv(csv_fname): Pandas를 사용하여 CSV 파일을 읽어 데이터를 데이터프레임 형태로 가져옵니다.
csv_data = csv_data.set_index('img_name'): 데이터프레임에서 'img_name' 열을 인덱스로 설정합니다. 각 이미지 파일의 이름을 인덱스로 사용하여 이미지와 레이블을 연결할 수 있도록 합니다.
images, targets = [], []: 이미지와 레이블을 저장하기 위한 빈 리스트를 생성합니다.
for img_name, target in csv_data.iterrows():: 데이터프레임을 순회하면서 이미지 이름과 레이블 정보를 하나씩 가져옵니다.
- images.append(torchvision.io.read_image(...)): 이미지 파일을 읽어와서 리스트에 추가합니다. torchvision.io.read_image 함수를 사용하여 이미지 파일을 읽습니다.
- targets.append(list(target)): 레이블 정보를 리스트에 추가합니다. 레이블은 클래스와 바운딩 박스의 좌표로 구성되어 있습니다.
return images, torch.tensor(targets).unsqueeze(1) / 256: 읽어온 이미지와 레이블을 반환합니다. 레이블을 텐서로 변환하고 바운딩 박스의 좌표 값을 [0, 1] 범위로 스케일링하여 반환합니다.

By using the read_data_bananas function to read images and labels, the following BananasDataset class will allow us to create a customized Dataset instance for loading the banana detection dataset.

read_data_bananas 함수를 사용하여 이미지와 레이블을 읽으면 다음 BananasDataset 클래스를 통해 바나나 감지 데이터 세트를 로드하기 위한 사용자 지정 Dataset 인스턴스를 만들 수 있습니다.

#@save
class BananasDataset(torch.utils.data.Dataset):
    """A customized dataset to load the banana detection dataset."""
    def __init__(self, is_train):
        self.features, self.labels = read_data_bananas(is_train)
        print('read ' + str(len(self.features)) + (f' training examples' if
              is_train else f' validation examples'))

    def __getitem__(self, idx):
        return (self.features[idx].float(), self.labels[idx])

    def __len__(self):
        return len(self.features)

위 코드는 PyTorch의 Dataset 클래스를 상속받아 바나나 검출 데이터셋을 사용하기 위한 사용자 정의 데이터셋 클래스를 정의하고 있습니다. 코드를 한 줄씩 살펴보겠습니다.

class BananasDataset(torch.utils.data.Dataset):: BananasDataset은 torch.utils.data.Dataset 클래스를 상속받아 사용자 정의 데이터셋 클래스를 정의하고 있습니다.
def __init__(self, is_train):: 클래스의 생성자 메서드입니다. 데이터셋의 초기화 작업을 수행합니다. is_train 매개변수는 데이터셋이 훈련용인지 검증용인지를 나타내며, 해당 데이터셋을 읽어와서 self.features와 self.labels에 저장합니다. 또한 읽어온 예제의 개수를 출력합니다.
def __getitem__(self, idx):: 데이터셋의 특정 인덱스 idx에 해당하는 샘플을 반환합니다. 해당 샘플의 이미지 데이터와 레이블을 튜플로 반환합니다. 이미지 데이터는 float 타입으로 변환되어 반환됩니다.
def __len__(self):: 데이터셋의 전체 샘플 개수를 반환합니다. 데이터셋의 길이를 알려주는 역할을 합니다.

이렇게 정의된 BananasDataset 클래스를 사용하면 데이터로더를 생성하여 모델 학습에 활용할 수 있습니다. 이 클래스는 데이터셋을 효율적으로 관리하고 모델 학습에 필요한 샘플과 레이블을 제공하는데 도움을 줍니다.

Finally, we define the load_data_bananas function to return two data iterator instances for both the training and test sets. For the test dataset, there is no need to read it in random order.

마지막으로 훈련 세트와 테스트 세트 모두에 대해 두 개의 데이터 반복자 인스턴스를 반환하도록 load_data_bananas 함수를 정의합니다. 테스트 데이터 세트의 경우 무작위 순서로 읽을 필요가 없습니다.

#@save
def load_data_bananas(batch_size):
    """Load the banana detection dataset."""
    train_iter = torch.utils.data.DataLoader(BananasDataset(is_train=True),
                                             batch_size, shuffle=True)
    val_iter = torch.utils.data.DataLoader(BananasDataset(is_train=False),
                                           batch_size)
    return train_iter, val_iter

위의 코드는 '바나나 감지 데이터셋'을 로드하는 함수인 load_data_bananas를 정의하는 코드입니다. 아래에 코드를 한국어로 설명해보겠습니다.

load_data_bananas(batch_size): 이 함수는 바나나 감지 데이터셋을 로드하는 역할을 합니다. batch_size라는 매개변수를 입력으로 받습니다.
"""바나나 감지 데이터셋을 로드합니다.""": 이 줄은 함수의 설명을 나타내는 주석입니다. 함수가 어떤 역할을 하는지 간단하게 설명하고 있습니다.
train_iter = torch.utils.data.DataLoader(BananasDataset(is_train=True), batch_size, shuffle=True): train_iter라는 변수에는 훈련용 데이터를 로드하기 위한 DataLoader 객체가 저장됩니다. BananasDataset(is_train=True)는 훈련 데이터용 바나나 감지 데이터셋을 나타냅니다. batch_size는 한 번에 처리할 데이터의 개수를 나타내며, shuffle=True는 데이터를 섞어서 순서를 랜덤하게 만듭니다.
val_iter = torch.utils.data.DataLoader(BananasDataset(is_train=False), batch_size): val_iter라는 변수에는 검증용 데이터를 로드하기 위한 DataLoader 객체가 저장됩니다. BananasDataset(is_train=False)는 검증 데이터용 바나나 감지 데이터셋을 나타냅니다. batch_size는 한 번에 처리할 데이터의 개수를 나타냅니다.
return train_iter, val_iter: 이 함수는 train_iter와 val_iter 두 가지 값을 반환합니다. 이 두 변수는 각각 훈련용 데이터와 검증용 데이터를 로드하는 DataLoader 객체입니다.

이 함수는 주어진 batch_size에 따라 훈련용과 검증용 바나나 감지 데이터를 DataLoader를 통해 로드하고, 이를 반환하는 역할을 합니다.

Let’s read a minibatch and print the shapes of both images and labels in this minibatch. The shape of the image minibatch, (batch size, number of channels, height, width), looks familiar: it is the same as in our earlier image classification tasks. The shape of the label minibatch is (batch size, m, 5), where m is the largest possible number of bounding boxes that any image has in the dataset.

미니배치를 읽고 이 미니배치에서 이미지와 레이블의 모양을 모두 인쇄해 보겠습니다. 이미지 미니배치의 모양(배치 크기, 채널 수, 높이, 너비)은 친숙해 보입니다. 이전 이미지 분류 작업과 동일합니다. 레이블 미니배치의 모양은 (배치 크기, m, 5)입니다. 여기서 m은 이미지가 데이터세트에 포함할 수 있는 최대 경계 상자 수입니다.

Although computation in minibatches is more efficient, it requires that all the image examples contain the same number of bounding boxes to form a minibatch via concatenation. In general, images may have a varying number of bounding boxes; thus, images with fewer than m bounding boxes will be padded with illegal bounding boxes until m is reached. Then the label of each bounding box is represented by an array of length 5. The first element in the array is the class of the object in the bounding box, where -1 indicates an illegal bounding box for padding. The remaining four elements of the array are the (x, y)-coordinate values of the upper-left corner and the lower-right corner of the bounding box (the range is between 0 and 1). For the banana dataset, since there is only one bounding box on each image, we have m=1.

미니 배치의 계산이 더 효율적이지만 연결을 통해 미니 배치를 형성하려면 모든 이미지 예제에 동일한 수의 경계 상자가 포함되어야 합니다. 일반적으로 이미지에는 다양한 수의 경계 상자가 있을 수 있습니다. 따라서 m 미만의 경계 상자가 있는 이미지는 m에 도달할 때까지 잘못된 경계 상자로 채워집니다. 그런 다음 각 경계 상자의 레이블은 길이가 5인 배열로 표시됩니다. 배열의 첫 번째 요소는 경계 상자에 있는 개체의 클래스이며 여기서 -1은 패딩에 대한 잘못된 경계 상자를 나타냅니다. 배열의 나머지 4개 요소는 경계 상자의 왼쪽 위 모서리와 오른쪽 아래 모서리의 (x, y) 좌표 값입니다(범위는 0과 1 사이). 바나나 데이터셋의 경우 각 이미지에 경계 상자가 하나만 있으므로 m=1입니다.

batch_size, edge_size = 32, 256
train_iter, _ = load_data_bananas(batch_size)
batch = next(iter(train_iter))
batch[0].shape, batch[1].shape

batch_size, edge_size = 32, 256: 이 줄은 batch_size와 edge_size라는 두 개의 변수에 각각 32와 256 값을 할당합니다. batch_size는 한 번에 처리할 데이터의 개수이고, edge_size는 가로 또는 세로 방향의 이미지 크기를 나타냅니다.
train_iter, _ = load_data_bananas(batch_size): 이 줄은 load_data_bananas 함수를 이용하여 바나나 감지 훈련 데이터를 로드합니다. batch_size 매개변수를 통해 한 번에 처리할 데이터의 개수를 설정합니다. 반환되는 값 중에서 첫 번째 값인 train_iter는 훈련 데이터를 로드하는 DataLoader 객체를 나타냅니다. 두 번째 값인 _는 반환되는 두 번째 객체를 무시하는 역할을 합니다.
batch = next(iter(train_iter)): 이 줄은 훈련 데이터의 첫 번째 배치를 가져와서 batch 변수에 저장합니다. train_iter는 DataLoader 객체이므로, iter(train_iter)를 통해 이를 반복 가능한(iterable) 객체로 변환하고, next() 함수를 통해 첫 번째 배치를 가져옵니다.
batch[0].shape, batch[1].shape: 이 줄은 현재 로드된 배치의 첫 번째 요소와 두 번째 요소의 크기(shape)를 확인합니다. batch[0]은 이미지 데이터를 나타내며, batch[1]은 해당 이미지에 대한 레이블을 나타냅니다. .shape는 배열의 크기를 확인하는 메서드입니다.

이 코드는 주어진 batch_size와 edge_size를 사용하여 바나나 감지 훈련 데이터를 로드하고, 첫 번째 배치의 이미지 데이터와 레이블의 크기를 확인하는 역할을 합니다.

14.6.3. Demonstration

Let’s demonstrate ten images with their labeled ground-truth bounding boxes. We can see that the rotations, sizes, and positions of bananas vary across all these images. Of course, this is just a simple artificial dataset. In practice, real-world datasets are usually much more complicated.

레이블이 지정된 ground-truth 경계 상자가 있는 10개의 이미지를 시연해 보겠습니다. 바나나의 회전, 크기 및 위치가 이 모든 이미지에서 다르다는 것을 알 수 있습니다. 물론 이것은 단순한 인공 데이터 세트일 뿐입니다. 실제로 실제 데이터 세트는 일반적으로 훨씬 더 복잡합니다.

imgs = (batch[0][:10].permute(0, 2, 3, 1)) / 255
axes = d2l.show_images(imgs, 2, 5, scale=2)
for ax, label in zip(axes, batch[1][:10]):
    d2l.show_bboxes(ax, [label[0][1:5] * edge_size], colors=['w'])

imgs = (batch[0][:10].permute(0, 2, 3, 1)) / 255: 이 줄은 batch 변수에서 첫 번째 10개의 이미지 데이터를 선택하고, 차원 순서를 변경하며 픽셀 값 범위를 0에서 1 사이로 스케일링하여 imgs 변수에 저장합니다. 이미지의 픽셀 값은 원래 0에서 255 범위이지만, 여기서는 0에서 1 범위로 스케일을 조정합니다. permute(0, 2, 3, 1)는 차원을 순서대로 변경하는데, [0, 2, 3, 1] 순서로 변경함으로써 이미지의 차원 순서를 변경합니다.
axes = d2l.show_images(imgs, 2, 5, scale=2): imgs에 저장된 이미지 데이터를 2행 5열의 그리드 형태로 시각화하고, 각 이미지의 크기를 2배로 확대하여 axes 변수에 저장합니다. d2l.show_images 함수는 이미지들을 그리드로 나타내주는 도우미 함수입니다.
for ax, label in zip(axes, batch[1][:10]):: axes와 batch[1][:10]를 순회하면서 각 이미지와 해당 이미지의 레이블을 처리합니다.
d2l.show_bboxes(ax, [label[0][1:5] * edge_size], colors=['w']): d2l.show_bboxes 함수를 사용하여 이미지 위에 레이블에 해당하는 바운딩 박스를 그립니다. ax는 이미지를 나타내는 축(axes) 객체입니다. label[0][1:5] * edge_size는 레이블에서 바운딩 박스의 좌표를 추출하고 edge_size를 곱하여 실제 이미지 크기에 맞게 스케일링합니다. colors=['w']는 바운딩 박스의 색상을 흰색으로 설정합니다.

이 코드는 훈련 데이터의 첫 번째 10개 이미지를 시각화하고, 각 이미지에 해당하는 바운딩 박스를 그려서 바나나를 감지하는 레이블을 시각적으로 확인합니다.

14.6.4. Summary

The banana detection dataset we collected can be used to demonstrate object detection models.
우리가 수집한 바나나 감지 데이터 세트는 객체 감지 모델을 시연하는 데 사용할 수 있습니다.
The data loading for object detection is similar to that for image classification. However, in object detection the labels also contain information of ground-truth bounding boxes, which is missing in image classification.
객체 감지를 위한 데이터 로딩은 이미지 분류와 유사합니다. 그러나 객체 감지에서 레이블에는 이미지 분류에서 누락된 ground-truth 경계 상자의 정보도 포함됩니다.

14.6.5. Exercises

Demonstrate other images with ground-truth bounding boxes in the banana detection dataset. How do they differ with respect to bounding boxes and objects?
Say that we want to apply data augmentation, such as random cropping, to object detection. How can it be different from that in image classification? Hint: what if a cropped image only contains a small portion of an object?

'Dive into Deep Learning > D2L Computer Vision' 카테고리의 다른 글

D2L - 14.11. Fully Convolutional Networks (0)	2023.08.21
D2L - 14.10. Transposed Convolution (1)	2023.08.21
D2L - 14.9. Semantic Segmentation and the Dataset (0)	2023.08.20
D2L - 14.8. Region-based CNNs (R-CNNs) (0)	2023.08.20
D2L - 14.7. Single Shot Multibox Detection (0)	2023.08.19
D2L - 14.5. Multiscale Object Detection (0)	2023.08.19
D2L - 14.4. Anchor Boxes (0)	2023.08.19
D2L - 14.3. Object Detection and Bounding Boxes (0)	2023.08.19
D2L - 14.2. Fine-Tuning (0)	2023.08.19
D2L - 14.1. Image Augmentation (0)	2023.08.19

Dive into Deep Learning/D2L Computer Vision

D2L - 14.5. Multiscale Object Detection

2023. 8. 19. 19:59 | Posted by 솔웅

https://d2l.ai/chapter_computer-vision/multiscale-object-detection.html

14.5. Multiscale Object Detection — Dive into Deep Learning 1.0.3 documentation

d2l.ai

14.5. Multiscale Object Detection

In Section 14.4, we generated multiple anchor boxes centered on each pixel of an input image. Essentially these anchor boxes represent samples of different regions of the image. However, we may end up with too many anchor boxes to compute if they are generated for every pixel. Think of a 561×728 input image. If five anchor boxes with varying shapes are generated for each pixel as their center, over two million anchor boxes (561×728×5) need to be labeled and predicted on the image.

섹션 14.4에서 입력 이미지의 각 픽셀을 중심으로 여러 앵커 상자를 생성했습니다. 기본적으로 이러한 앵커 상자는 이미지의 다른 영역 샘플을 나타냅니다. 그러나 앵커 상자가 모든 픽셀에 대해 생성되는 경우 계산하기에는 앵커 상자가 너무 많아질 수 있습니다. 561×728 입력 이미지를 생각해 보십시오. 각 픽셀마다 다양한 모양의 앵커 박스 5개를 중심으로 생성하면 200만 개가 넘는 앵커 박스(561×728×5)를 이미지에 라벨링하고 예측해야 합니다.

14.5.1. Multiscale Anchor Boxes

You may realize that it is not difficult to reduce anchor boxes on an image. For instance, we can just uniformly sample a small portion of pixels from the input image to generate anchor boxes centered on them. In addition, at different scales we can generate different numbers of anchor boxes of different sizes. Intuitively, smaller objects are more likely to appear on an image than larger ones. As an example, 1×1, 1×2, and 2×2 objects can appear on a 2×2 image in 4, 2, and 1 possible ways, respectively. Therefore, when using smaller anchor boxes to detect smaller objects, we can sample more regions, while for larger objects we can sample fewer regions.

이미지에서 앵커 박스를 줄이는 것이 어렵지 않다는 것을 알 수 있습니다. 예를 들어 입력 이미지에서 픽셀의 작은 부분을 균일하게 샘플링하여 중앙에 앵커 상자를 생성할 수 있습니다. 또한 크기가 다르면 크기가 다른 다양한 수의 앵커 박스를 생성할 수 있습니다. 직관적으로 작은 물체가 큰 물체보다 이미지에 나타날 가능성이 더 큽니다. 예를 들어, 1×1, 1×2, 2×2 객체는 각각 4, 2, 1가지 가능한 방식으로 2×2 이미지에 나타날 수 있습니다. 따라서 더 작은 앵커 상자를 사용하여 더 작은 객체를 감지할 때 더 많은 영역을 샘플링할 수 있는 반면 더 큰 객체의 경우 더 적은 영역을 샘플링할 수 있습니다.

To demonstrate how to generate anchor boxes at multiple scales, let’s read an image. Its height and width are 561 and 728 pixels, respectively.

여러 척도에서 앵커 상자를 생성하는 방법을 시연하기 위해 이미지를 읽어 보겠습니다. 높이와 너비는 각각 561픽셀과 728픽셀입니다.

%matplotlib inline
import torch
from d2l import torch as d2l

img = d2l.plt.imread('../img/catdog.jpg')
h, w = img.shape[:2]
h, w

(561, 728)

위 코드는 이미지 파일을 로드하고 해당 이미지의 높이와 너비를 계산하는 부분입니다.

%matplotlib inline: 이 코드는 Jupyter Notebook 등에서 Matplotlib 그림을 인라인으로 표시하도록 설정하는 명령입니다. Matplotlib을 사용하여 그래프나 이미지를 출력할 때 주피터 노트북 내에서 바로 볼 수 있게 합니다.
import torch: 파이토치 라이브러리를 임포트합니다. 딥러닝과 텐서 연산을 위해 사용됩니다.
from d2l import torch as d2l: d2l (Dive into Deep Learning) 라이브러리에서 파이토치에 대한 별칭을 d2l로 설정합니다. 이를 통해 d2l 라이브러리의 함수와 클래스를 이 코드에서 사용할 수 있게 됩니다.
img = d2l.plt.imread('../img/catdog.jpg'): 이미지 파일을 읽어와 img 변수에 저장합니다. 이미지 파일의 경로는 ../img/catdog.jpg로 설정되어 있습니다. 해당 경로에 이미지 파일이 있어야 합니다.
h, w = img.shape[:2]: 읽어온 이미지의 높이와 너비를 h와 w 변수에 저장합니다. .shape 속성을 사용하여 이미지의 형태를 확인하고, [:2] 슬라이싱을 통해 높이와 너비 정보만 추출합니다. 이렇게 함으로써 이미지의 높이와 너비를 h와 w 변수에 각각 할당합니다.

따라서 이 코드는 이미지 파일을 로드하고 해당 이미지의 높이와 너비를 확인하는 부분입니다.

Recall that in Section 7.2 we call a two-dimensional array output of a convolutional layer a feature map. By defining the feature map shape, we can determine centers of uniformly sampled anchor boxes on any image.

섹션 7.2에서 우리는 컨볼루션 레이어의 2차원 배열 출력을 피처 맵이라고 부릅니다. feature 맵 모양을 정의하여 모든 이미지에서 균일하게 샘플링된 앵커 박스의 중심을 결정할 수 있습니다.

The display_anchors function is defined below. We generate anchor boxes (anchors) on the feature map (fmap) with each unit (pixel) as the anchor box center. Since the (x,y)-axis coordinate values in the anchor boxes (anchors) have been divided by the width and height of the feature map (fmap), these values are between 0 and 1, which indicate the relative positions of anchor boxes in the feature map.

display_anchors 함수는 아래에 정의되어 있습니다. 각 단위(픽셀)를 앵커 상자 중심으로 하여 feature 맵(fmap)에 앵커 상자(앵커)를 생성합니다. 앵커 박스(anchor)의 (x,y)축 좌표 값을 특징 맵(fmap)의 너비와 높이로 나누었으므로 이 값은 0과 1 사이이며, 이것은 feature 맵에서 앵커 상자의 상대적 위치를 나타냅니다.

Since centers of the anchor boxes (anchors) are spread over all units on the feature map (fmap), these centers must be uniformly distributed on any input image in terms of their relative spatial positions. More concretely, given the width and height of the feature map fmap_w and fmap_h, respectively, the following function will uniformly sample pixels in fmap_h rows and fmap_w columns on any input image. Centered on these uniformly sampled pixels, anchor boxes of scale s (assuming the length of the list s is 1) and different aspect ratios (ratios) will be generated.

앵커 상자(앵커)의 중심은 기능 맵(fmap)의 모든 단위에 분산되어 있으므로 이러한 중심은 상대적인 공간 위치 측면에서 모든 입력 이미지에 균일하게 분포되어야 합니다. 보다 구체적으로, 기능 맵 fmap_w 및 fmap_h의 너비와 높이가 각각 주어지면 다음 함수는 모든 입력 이미지에서 fmap_h 행과 fmap_w 열의 픽셀을 균일하게 샘플링합니다. 이렇게 균일하게 샘플링된 픽셀을 중심으로 축척 s(목록 s의 길이가 1이라고 가정) 및 다른 종횡비(비율)의 앵커 상자가 생성됩니다.

def display_anchors(fmap_w, fmap_h, s):
    d2l.set_figsize()
    # Values on the first two dimensions do not affect the output
    fmap = torch.zeros((1, 10, fmap_h, fmap_w))
    anchors = d2l.multibox_prior(fmap, sizes=s, ratios=[1, 2, 0.5])
    bbox_scale = torch.tensor((w, h, w, h))
    d2l.show_bboxes(d2l.plt.imshow(img).axes,
                    anchors[0] * bbox_scale)

위 코드는 앵커 박스를 생성하고 시각적으로 표시하는 함수를 정의하는 부분입니다.

def display_anchors(fmap_w, fmap_h, s):: display_anchors라는 함수를 정의합니다. 이 함수는 세 개의 인자를 받습니다: fmap_w, fmap_h, s.
d2l.set_figsize(): 그래프의 크기를 설정하는 함수입니다. 이를 통해 플롯된 그림이 더 크게 표시될 수 있도록 설정합니다.
fmap = torch.zeros((1, 10, fmap_h, fmap_w)): fmap은 4D 텐서로, 형태는 (1, 10, fmap_h, fmap_w)입니다. 첫 번째 차원은 배치 크기, 두 번째 차원은 채널 수, 세 번째와 네 번째 차원은 특징 맵의 높이와 너비입니다. 여기서 특징 맵의 값을 0으로 초기화합니다.
anchors = d2l.multibox_prior(fmap, sizes=s, ratios=[1, 2, 0.5]): d2l.multibox_prior 함수를 사용하여 앵커 박스를 생성합니다. 이 함수는 입력으로 특징 맵, 앵커 박스의 크기(sizes), 앵커 박스의 종횡비(ratios)를 받습니다. fmap은 앞서 생성한 4D 텐서이며, sizes는 앵커 박스의 크기를 의미합니다. ratios는 앵커 박스의 종횡비를 나타냅니다.
bbox_scale = torch.tensor((w, h, w, h)): bbox_scale은 이미지의 높이와 너비를 포함하는 1D 텐서입니다. 이미지의 너비와 높이를 통해 크기 조절에 사용됩니다.
d2l.show_bboxes(d2l.plt.imshow(img).axes, anchors[0] * bbox_scale): d2l.show_bboxes 함수를 호출하여 앵커 박스를 이미지 위에 시각적으로 표시합니다. 이 함수는 두 개의 인자를 받습니다. 첫 번째 인자는 이미지가 그려질 축(axes)을 나타내며, 두 번째 인자는 앵커 박스의 좌표를 나타냅니다. 이 좌표를 bbox_scale로 크기를 조절한 뒤 앵커 박스를 시각적으로 표시합니다.

이 함수는 입력으로 주어진 특징 맵 크기와 앵커 박스의 크기, 종횡비를 활용하여 앵커 박스를 생성하고 시각적으로 표시하는 역할을 합니다.

First, let’s consider detection of small objects. In order to make it easier to distinguish when displayed, the anchor boxes with different centers here do not overlap: the anchor box scale is set to 0.15 and the height and width of the feature map are set to 4. We can see that the centers of the anchor boxes in 4 rows and 4 columns on the image are uniformly distributed.

먼저 작은 물체 감지에 대해 살펴보겠습니다. 표시할 때 쉽게 구별할 수 있도록 여기에서 중심이 다른 앵커 상자는 겹치지 않습니다. 앵커 상자 크기는 0.15로 설정되고 기능 맵의 높이와 너비는 4로 설정됩니다. 중심이 서로 다른 것을 볼 수 있습니다. 이미지에서 4행 4열의 앵커 상자가 균일하게 분포되어 있습니다.

display_anchors(fmap_w=4, fmap_h=4, s=[0.15])

위 코드는 display_anchors 함수를 호출하여 특정한 파라미터로 앵커 박스를 생성하고 시각화하는 과정을 보여줍니다.

display_anchors(fmap_w=4, fmap_h=4, s=[0.15]): display_anchors 함수를 호출하는 부분입니다. 함수에 세 개의 인자를 전달합니다. fmap_w는 특징 맵의 너비를, fmap_h는 특징 맵의 높이를, s는 앵커 박스의 크기를 의미하는 리스트를 전달합니다.
- fmap_w=4, fmap_h=4: 특징 맵의 너비와 높이를 각각 4로 설정합니다.
- s=[0.15]: 앵커 박스의 크기를 0.15로 설정합니다.

이렇게 설정된 파라미터로 display_anchors 함수가 호출되면, 해당 특징 맵과 앵커 박스의 크기를 활용하여 앵커 박스를 생성하고 이미지 위에 시각적으로 표시합니다. 이를 통해 특정 크기의 앵커 박스가 특정 위치에 어떻게 배치되는지를 확인할 수 있습니다.

We move on to reduce the height and width of the feature map by half and use larger anchor boxes to detect larger objects. When the scale is set to 0.4, some anchor boxes will overlap with each other.

계속해서 기능 맵의 높이와 너비를 절반으로 줄이고 더 큰 앵커 상자를 사용하여 더 큰 개체를 감지합니다. 배율을 0.4로 설정하면 일부 앵커 상자가 서로 겹칩니다.

display_anchors(fmap_w=2, fmap_h=2, s=[0.4])

Finally, we further reduce the height and width of the feature map by half and increase the anchor box scale to 0.8. Now the center of the anchor box is the center of the image.

마지막으로 기능 맵의 높이와 너비를 절반으로 줄이고 앵커 상자 크기를 0.8로 늘립니다. 이제 앵커 상자의 중심이 이미지의 중심이 됩니다.

display_anchors(fmap_w=1, fmap_h=1, s=[0.8])

14.5.2. Multiscale Detection

Since we have generated multiscale anchor boxes, we will use them to detect objects of various sizes at different scales. In the following we introduce a CNN-based multiscale object detection method that we will implement in Section 14.7.

14.7. Single Shot Multibox Detection — Dive into Deep Learning 1.0.3 documentation

d2l.ai

멀티스케일 앵커 박스를 생성했으므로 이를 사용하여 다양한 스케일에서 다양한 크기의 객체를 감지할 것입니다. 다음에서는 14.7절에서 구현할 CNN 기반 멀티스케일 객체 감지 방법을 소개합니다.

At some scale, say that we have c feature maps of shape ℎ×w. Using the method in Section 14.5.1, we generate ℎw sets of anchor boxes, where each set has α anchor boxes with the same center. For example, at the first scale in the experiments in Section 14.5.1, given ten (number of channels) 4×4 feature maps, we generated 16 sets of anchor boxes, where each set contains 3 anchor boxes with the same center. Next, each anchor box is labeled with the class and offset based on ground-truth bounding boxes. At the current scale, the object detection model needs to predict the classes and offsets of ℎw sets of anchor boxes on the input image, where different sets have different centers.

어떤 규모에서 모양 ℎ×w의 특징 맵이 c개 있다고 가정해 보겠습니다. 섹션 14.5.1의 방법을 사용하여 ℎw 앵커 상자 세트를 생성합니다. 여기서 각 세트에는 동일한 중심을 가진 α 앵커 상자가 있습니다. 예를 들어 섹션 14.5.1 실험의 첫 번째 스케일에서 10개(채널 수)의 4×4 기능 맵이 주어졌을 때 우리는 16개의 앵커 상자 세트를 생성했으며 각 세트에는 동일한 중심을 가진 3개의 앵커 상자가 포함되어 있습니다. 다음으로, 각 앵커 상자는 실측 경계 상자를 기반으로 클래스 및 오프셋으로 레이블이 지정됩니다. 현재 규모에서 물체 감지 모델은 입력 이미지에서 앵커 상자의 ℎw 세트의 클래스와 오프셋을 예측해야 합니다. 여기서 세트마다 중심이 다릅니다.

Assume that the c feature maps here are the intermediate outputs obtained by the CNN forward propagation based on the input image. Since there are ℎw different spatial positions on each feature map, the same spatial position can be thought of as having c units. According to the definition of receptive field in Section 7.2, these c units at the same spatial position of the feature maps have the same receptive field on the input image: they represent the input image information in the same receptive field. Therefore, we can transform the c units of the feature maps at the same spatial position into the classes and offsets of the α anchor boxes generated using this spatial position. In essence, we use the information of the input image in a certain receptive field to predict the classes and offsets of the anchor boxes that are close to that receptive field on the input image.

여기서 c 특성 맵은 입력 이미지를 기반으로 CNN 정방향 전파에 의해 얻은 중간 출력이라고 가정합니다. 각 특징 맵에는 ℎw개의 서로 다른 공간 위치가 있으므로 동일한 공간 위치는 c 단위를 갖는 것으로 생각할 수 있습니다. 7.2 절의 수용 필드의 정의에 따르면, 특징 맵의 동일한 공간 위치에 있는 이러한 c 단위는 입력 이미지에서 동일한 수용 필드를 갖습니다. 즉, 동일한 수용 필드에서 입력 이미지 정보를 나타냅니다. 따라서 동일한 공간 위치에 있는 피처 맵의 c 단위를 이 공간 위치를 사용하여 생성된 α 앵커 상자의 클래스 및 오프셋으로 변환할 수 있습니다. 본질적으로 우리는 입력 이미지의 수용 필드에 가까운 앵커 박스의 클래스와 오프셋을 예측하기 위해 특정 수용 필드의 입력 이미지 정보를 사용합니다.

When the feature maps at different layers have varying-size receptive fields on the input image, they can be used to detect objects of different sizes. For example, we can design a neural network where units of feature maps that are closer to the output layer have wider receptive fields, so they can detect larger objects from the input image.

다른 레이어의 기능 맵이 입력 이미지에 다양한 크기의 수용 필드를 가지고 있는 경우 다양한 크기의 객체를 감지하는 데 사용할 수 있습니다. 예를 들어 출력 레이어에 더 가까운 기능 맵 단위가 더 넓은 수용 필드를 갖는 신경망을 설계하여 입력 이미지에서 더 큰 객체를 감지할 수 있습니다.

In a nutshell, we can leverage layerwise representations of images at multiple levels by deep neural networks for multiscale object detection. We will show how this works through a concrete example in Section 14.7.

간단히 말해서, 우리는 다중 규모 객체 감지를 위해 심층 신경망에 의해 여러 수준에서 이미지의 계층적 표현을 활용할 수 있습니다. 섹션 14.7에서 구체적인 예를 통해 이것이 어떻게 작동하는지 보여줄 것입니다.

14.5.3. Summary

At multiple scales, we can generate anchor boxes with different sizes to detect objects with different sizes.

다양한 규모에서 다양한 크기의 앵커 박스를 생성하여 다양한 크기의 물체를 감지할 수 있습니다.
By defining the shape of feature maps, we can determine centers of uniformly sampled anchor boxes on any image.

특징 맵의 모양을 정의함으로써 모든 이미지에서 균일하게 샘플링된 앵커 박스의 중심을 결정할 수 있습니다.
We use the information of the input image in a certain receptive field to predict the classes and offsets of the anchor boxes that are close to that receptive field on the input image.

특정 수용 필드의 입력 이미지 정보를 사용하여 입력 이미지의 해당 수용 필드에 가까운 앵커 박스의 클래스와 오프셋을 예측합니다.
Through deep learning, we can leverage its layerwise representations of images at multiple levels for multiscale object detection.

딥 러닝을 통해 멀티스케일 객체 감지를 위해 여러 수준에서 이미지의 레이어별 표현을 활용할 수 있습니다.

14.5.4. Exercises

According to our discussions in Section 8.1, deep neural networks learn hierarchical features with increasing levels of abstraction for images. In multiscale object detection, do feature maps at different scales correspond to different levels of abstraction? Why or why not?
At the first scale (fmap_w=4, fmap_h=4) in the experiments in Section 14.5.1, generate uniformly distributed anchor boxes that may overlap.
Given a feature map variable with shape 1×c×ℎ×w, where c, ℎ, and w are the number of channels, height, and width of the feature maps, respectively. How can you transform this variable into the classes and offsets of anchor boxes? What is the shape of the output?

'Dive into Deep Learning > D2L Computer Vision' 카테고리의 다른 글

D2L - 14.10. Transposed Convolution (1)	2023.08.21
D2L - 14.9. Semantic Segmentation and the Dataset (0)	2023.08.20
D2L - 14.8. Region-based CNNs (R-CNNs) (0)	2023.08.20
D2L - 14.7. Single Shot Multibox Detection (0)	2023.08.19
D2L - 14.6. The Object Detection Dataset (0)	2023.08.19
D2L - 14.4. Anchor Boxes (0)	2023.08.19
D2L - 14.3. Object Detection and Bounding Boxes (0)	2023.08.19
D2L - 14.2. Fine-Tuning (0)	2023.08.19
D2L - 14.1. Image Augmentation (0)	2023.08.19
D2L - 14. Computer Vision (0)	2023.08.18

Dive into Deep Learning/D2L Computer Vision

D2L - 14.4. Anchor Boxes

2023. 8. 19. 07:45 | Posted by 솔웅

https://d2l.ai/chapter_computer-vision/anchor.html

14.4. Anchor Boxes — Dive into Deep Learning 1.0.0 documentation

d2l.ai

14.4. Anchor Boxes

Object detection algorithms usually sample a large number of regions in the input image, determine whether these regions contain objects of interest, and adjust the boundaries of the regions so as to predict the ground-truth bounding boxes of the objects more accurately. Different models may adopt different region sampling schemes. Here we introduce one of such methods: it generates multiple bounding boxes with varying scales and aspect ratios centered on each pixel. These bounding boxes are called anchor boxes. We will design an object detection model based on anchor boxes in Section 14.7.

객체 감지 알고리즘은 일반적으로 입력 이미지에서 많은 수의 영역을 샘플링하고 이러한 영역에 관심 객체가 포함되어 있는지 확인하고 영역의 경계를 조정하여 객체의 실측 경계 상자를 더 정확하게 예측합니다. 다른 모델은 다른 지역 샘플링 방식을 채택할 수 있습니다. 여기서는 이러한 방법 중 하나를 소개합니다. 각 픽셀을 중심으로 다양한 크기와 종횡비로 여러 경계 상자를 생성합니다. 이러한 경계 상자를 앵커 상자라고 합니다. 섹션 14.7에서 앵커 박스를 기반으로 객체 감지 모델을 설계할 것입니다.

First, let’s modify the printing accuracy just for more concise outputs.

먼저 보다 간결한 출력을 위해 인쇄 정확도를 수정해 보겠습니다.

%matplotlib inline
import torch
from d2l import torch as d2l

torch.set_printoptions(2)  # Simplify printing accuracy

위 코드는 PyTorch를 사용하여 텐서 출력의 정밀도를 조정하는 예시입니다.

%matplotlib inline: Jupyter 노트북에서 Matplotlib 그래프를 인라인으로 표시하기 위한 매직 명령입니다.
import torch: PyTorch 라이브러리를 임포트합니다.
from d2l import torch as d2l: 'd2l'이라는 이름으로 PyTorch의 기능을 사용할 수 있도록 임포트합니다.
torch.set_printoptions(2): 출력 정밀도를 소수점 둘째 자리까지 설정합니다. 이렇게 하면 텐서의 값을 간소화된 형식으로 출력할 수 있습니다.

즉, 코드는 PyTorch 텐서의 출력 정밀도를 조정하여 더 간소하고 읽기 쉬운 형식으로 값을 표시하는 예시를 보여줍니다.

Anchor box란?

"Anchor Box"는 객체 탐지(Object Detection) 및 객체 위치 예측 작업에서 사용되는 개념입니다. 객체 탐지 모델에서 객체의 위치와 클래스를 예측할 때, 이미지 내에서 다양한 크기와 종횡비를 가진 객체에 대해 예측을 수행해야 합니다. 이때 Anchor Box는 이러한 다양한 객체의 크기와 종횡비에 대응하는 사전 정의된 사각형 형태의 박스입니다.

Anchor Box는 주로 합성곱 신경망(Convolutional Neural Network, CNN) 기반의 객체 탐지 모델에서 사용됩니다. 모델은 이미지를 입력으로 받아 여러 스케일의 특징 맵을 추출합니다. Anchor Box는 이러한 특징 맵의 각 위치에서 객체를 예측하기 위한 후보 영역을 제시하는 역할을 합니다.

Anchor Box는 크게 두 가지 파라미터로 정의됩니다:

크기(Sizes): 다양한 객체 크기에 대응하기 위해 정의되는 박스의 가로와 세로 크기입니다. 예를 들어, 작은 객체에 대응하는 작은 크기의 박스와 큰 객체에 대응하는 큰 크기의 박스를 정의할 수 있습니다.
종횡비(Ratios): 객체의 종횡비에 대응하기 위해 정의되는 박스의 가로와 세로의 비율입니다. 종횡비는 주로 1:1, 1:2, 2:1과 같은 비율로 정의되며, 다양한 객체 모양에 대응하기 위해 사용됩니다.

객체 탐지 모델은 특징 맵의 각 위치에서 여러 개의 Anchor Box를 생성하고, 이를 기반으로 객체의 위치와 클래스를 예측합니다. 이후 예측된 정보와 실제 객체 위치를 비교하여 모델을 훈련하게 됩니다. Anchor Box를 이용하면 다양한 크기와 모양의 객체에 대한 예측을 효과적으로 수행할 수 있습니다.

14.4.1. Generating Multiple Anchor Boxes

Suppose that the input image has a height of ℎ and width of w. We generate anchor boxes with different shapes centered on each pixel of the image. Let the scale be s∈(0,1] and the aspect ratio (ratio of width to height) is r>0. Then the width and height of the anchor box are ws√r and ℎs/√r, respectively. Note that when the center position is given, an anchor box with known width and height is determined.

입력 이미지의 높이가 ℎ이고 너비가 w라고 가정합니다. 이미지의 각 픽셀을 중심으로 다양한 모양의 앵커 박스를 생성합니다. 스케일을 s∈(0,1]로 하고 종횡비(너비와 높이의 비율)를 r>0이라고 하면 앵커 박스의 너비와 높이는 각각 ws√r과 ℎs/√r입니다. 중심 위치가 주어지면 너비와 높이가 알려진 앵커 박스가 결정됩니다.

To generate multiple anchor boxes with different shapes, let’s set a series of scales s1,…,sn and a series of aspect ratios r1,…,rm. When using all the combinations of these scales and aspect ratios with each pixel as the center, the input image will have a total of wℎnm anchor boxes. Although these anchor boxes may cover all the ground-truth bounding boxes, the computational complexity is easily too high. In practice, we can only consider those combinations containing s1 or r1:

모양이 다른 여러 앵커 박스를 생성하기 위해 일련의 축척 s1,…,sn과 일련의 종횡비 r1,…,rm을 설정해 보겠습니다. 각 픽셀을 중심으로 이러한 배율과 종횡비의 모든 조합을 사용할 때 입력 이미지는 총 wℎnm 앵커 상자를 갖게 됩니다. 이러한 앵커 상자가 실측 경계 상자를 모두 덮을 수 있지만 계산 복잡도가 너무 높습니다. 실제로는 s1 또는 r1을 포함하는 조합만 고려할 수 있습니다.

That is to say, the number of anchor boxes centered on the same pixel is n+m−1. For the entire input image, we will generate a total of wℎ(n+m−1) anchor boxes.

즉, 동일한 픽셀을 중심으로 하는 앵커 박스의 수는 n+m-1입니다. 전체 입력 이미지에 대해 총 wℎ(n+m−1)개의 앵커 상자를 생성합니다.

The above method of generating anchor boxes is implemented in the following multibox_prior function. We specify the input image, a list of scales, and a list of aspect ratios, then this function will return all the anchor boxes.

앵커 박스를 생성하는 위의 방법은 다음 multibox_prior 함수에서 구현됩니다. 입력 이미지, 축척 목록 및 종횡비 목록을 지정하면 이 함수는 모든 앵커 상자를 반환합니다.

#@save
def multibox_prior(data, sizes, ratios):
    """Generate anchor boxes with different shapes centered on each pixel."""
    in_height, in_width = data.shape[-2:]
    device, num_sizes, num_ratios = data.device, len(sizes), len(ratios)
    boxes_per_pixel = (num_sizes + num_ratios - 1)
    size_tensor = torch.tensor(sizes, device=device)
    ratio_tensor = torch.tensor(ratios, device=device)
    # Offsets are required to move the anchor to the center of a pixel. Since
    # a pixel has height=1 and width=1, we choose to offset our centers by 0.5
    offset_h, offset_w = 0.5, 0.5
    steps_h = 1.0 / in_height  # Scaled steps in y axis
    steps_w = 1.0 / in_width  # Scaled steps in x axis

    # Generate all center points for the anchor boxes
    center_h = (torch.arange(in_height, device=device) + offset_h) * steps_h
    center_w = (torch.arange(in_width, device=device) + offset_w) * steps_w
    shift_y, shift_x = torch.meshgrid(center_h, center_w, indexing='ij')
    shift_y, shift_x = shift_y.reshape(-1), shift_x.reshape(-1)

    # Generate `boxes_per_pixel` number of heights and widths that are later
    # used to create anchor box corner coordinates (xmin, xmax, ymin, ymax)
    w = torch.cat((size_tensor * torch.sqrt(ratio_tensor[0]),
                   sizes[0] * torch.sqrt(ratio_tensor[1:])))\
                   * in_height / in_width  # Handle rectangular inputs
    h = torch.cat((size_tensor / torch.sqrt(ratio_tensor[0]),
                   sizes[0] / torch.sqrt(ratio_tensor[1:])))
    # Divide by 2 to get half height and half width
    anchor_manipulations = torch.stack((-w, -h, w, h)).T.repeat(
                                        in_height * in_width, 1) / 2

    # Each center point will have `boxes_per_pixel` number of anchor boxes, so
    # generate a grid of all anchor box centers with `boxes_per_pixel` repeats
    out_grid = torch.stack([shift_x, shift_y, shift_x, shift_y],
                dim=1).repeat_interleave(boxes_per_pixel, dim=0)
    output = out_grid + anchor_manipulations
    return output.unsqueeze(0)

위 코드는 'multibox_prior'라는 함수로, 다양한 크기와 비율의 앵커 박스를 이미지 내의 각 픽셀 중심에 생성하는 기능을 수행합니다.

multibox_prior(data, sizes, ratios): 이 함수는 입력 데이터 텐서 data, 앵커 박스의 크기 목록 sizes, 그리고 앵커 박스의 비율 목록 ratios를 받습니다. 이 함수는 이미지 내의 각 픽셀 중심을 기준으로 다양한 크기와 비율의 앵커 박스를 생성하여 반환합니다.

함수 내부에서 다음 단계가 수행됩니다:

입력 데이터의 높이와 너비를 가져옵니다.
크기와 비율 텐서를 생성하고 디바이스를 설정합니다.
픽셀 중심의 이동을 위한 오프셋을 계산합니다.
y축과 x축에 대한 스케일된 스텝 값을 계산합니다.
중심 포인트를 생성하여 앵커 박스의 중심을 지정합니다.
다양한 크기와 비율을 활용하여 각 픽셀 중심에서 앵커 박스의 크기를 조정합니다.
생성된 앵커 박스를 반환합니다.

이렇게 생성된 앵커 박스는 객체 감지 작업에서 사용되며, 여러 크기와 비율의 박스를 이미지의 각 픽셀에 적용함으로써 객체를 탐지하는데 사용됩니다.

각 라인별로 분석해 보겠습니다.

#@save
def multibox_prior(data, sizes, ratios):
    """Generate anchor boxes with different shapes centered on each pixel."""

함수 정의가 시작됩니다. 함수 이름은 multibox_prior이며, 입력으로 data, sizes, ratios를 받습니다. 함수 내용은 "Generate anchor boxes with different shapes centered on each pixel.(각 픽셀을 중심으로 다른 모양의 앵커 상자 생성)"라는 주석으로 설명되어 있습니다.

    in_height, in_width = data.shape[-2:]
    device, num_sizes, num_ratios = data.device, len(sizes), len(ratios)
    boxes_per_pixel = (num_sizes + num_ratios - 1)

입력 데이터인 data의 높이와 너비를 가져와 in_height와 in_width로 할당합니다. 또한 data의 디바이스 정보를 가져와 device에 할당하고, sizes와 ratios의 길이를 이용하여 앵커 박스의 크기와 비율 개수를 각각 num_sizes와 num_ratios에 할당합니다. boxes_per_pixel은 앵커 박스 개수를 의미합니다.

    size_tensor = torch.tensor(sizes, device=device)
    ratio_tensor = torch.tensor(ratios, device=device)

sizes와 ratios를 텐서로 변환하고, 디바이스 정보를 설정하여 size_tensor와 ratio_tensor에 저장합니다.

    offset_h, offset_w = 0.5, 0.5
    steps_h = 1.0 / in_height  # Scaled steps in y axis
    steps_w = 1.0 / in_width  # Scaled steps in x axis

앵커 박스를 픽셀 중심으로 옮기기 위한 오프셋 값을 설정합니다. 또한 y축과 x축의 스케일된 스텝 값을 계산합니다.

    center_h = (torch.arange(in_height, device=device) + offset_h) * steps_h
    center_w = (torch.arange(in_width, device=device) + offset_w) * steps_w
    shift_y, shift_x = torch.meshgrid(center_h, center_w, indexing='ij')
    shift_y, shift_x = shift_y.reshape(-1), shift_x.reshape(-1)

y축과 x축에 대한 중심 포인트 값을 계산합니다. 그리고 torch.meshgrid를 사용하여 중심 포인트들을 조합하여 shift_y와 shift_x를 생성합니다.

    w = torch.cat((size_tensor * torch.sqrt(ratio_tensor[0]),
                   sizes[0] * torch.sqrt(ratio_tensor[1:])))\
                   * in_height / in_width  # Handle rectangular inputs
    h = torch.cat((size_tensor / torch.sqrt(ratio_tensor[0]),
                   sizes[0] / torch.sqrt(ratio_tensor[1:])))

앵커 박스의 너비와 높이 값을 계산합니다. 크기와 비율을 이용하여 각 앵커 박스의 크기를 정의합니다.

    anchor_manipulations = torch.stack((-w, -h, w, h)).T.repeat(
                                        in_height * in_width, 1) / 2

앵커 박스의 변화량을 계산합니다. 이 변화량은 앵커 박스의 좌표를 조정하는데 사용됩니다.

    out_grid = torch.stack([shift_x, shift_y, shift_x, shift_y],
                dim=1).repeat_interleave(boxes_per_pixel, dim=0)
    output = out_grid + anchor_manipulations
    return output.unsqueeze(0)

앵커 박스 중심의 그리드를 생성하고, 앵커 박스의 변화량을 더하여 최종 앵커 박스의 좌표를 계산합니다. 그리고 이를 반환하기 전에 차원을 조정하여 반환합니다.

We can see that the shape of the returned anchor box variable Y is (batch size, number of anchor boxes, 4).

반환된 앵커 박스 변수 Y의 모양이 (배치 크기, 앵커 박스 수, 4)임을 알 수 있습니다.

img = d2l.plt.imread('../img/catdog.jpg')
h, w = img.shape[:2]

print(h, w)
X = torch.rand(size=(1, 3, h, w))  # Construct input data
Y = multibox_prior(X, sizes=[0.75, 0.5, 0.25], ratios=[1, 2, 0.5])
Y.shape

위 코드는 다음과 같은 내용을 수행합니다. 코드를 한 줄씩 설명하겠습니다.

d2l.plt.imread 함수를 사용하여 이미지 파일을 읽어옵니다. 읽어온 이미지는 img에 저장됩니다.

이미지의 높이와 너비를 가져와 h와 w에 저장합니다.

이미지의 높이와 너비를 출력합니다.

높이 h, 너비 w를 가지고 형상이 (1, 3, h, w)인 랜덤 값을 가지는 텐서 X를 생성합니다. 이는 신경망의 입력 데이터를 대표하는 것입니다. 여기서 3은 RGB 채널의 개수를 나타냅니다.

생성한 입력 데이터 X를 이용하여 multibox_prior 함수를 호출하여 앵커 박스들을 생성합니다. 앵커 박스의 크기(sizes)와 비율(ratios)을 지정합니다.

생성된 앵커 박스들의 형상을 출력합니다. 이는 (앵커 박스 개수, 4)의 형태로 출력됩니다. 4는 앵커 박스의 좌표 정보를 의미하며, (x, y, width, height) 형식으로 구성됩니다.

After changing the shape of the anchor box variable Y to (image height, image width, number of anchor boxes centered on the same pixel, 4), we can obtain all the anchor boxes centered on a specified pixel position. In the following, we access the first anchor box centered on (250, 250). It has four elements: the (x,y)-axis coordinates at the upper-left corner and the (x,y)-axis coordinates at the lower-right corner of the anchor box. The coordinate values of both axes are divided by the width and height of the image, respectively.

앵커 박스 변수 Y의 모양을 (이미지 높이, 이미지 너비, 같은 픽셀을 중심으로 하는 앵커 박스 수, 4)로 변경하면 지정된 픽셀 위치를 중심으로 하는 모든 앵커 박스를 얻을 수 있습니다. 다음에서는 (250, 250)을 중심으로 하는 첫 번째 앵커 상자에 액세스합니다. 앵커 상자의 왼쪽 위 모서리에 있는 (x,y)축 좌표와 오른쪽 아래 모서리에 있는 (x,y)축 좌표의 네 가지 요소가 있습니다. 두 축의 좌표 값은 각각 이미지의 너비와 높이로 나뉩니다.

boxes = Y.reshape(h, w, 5, 4)
boxes[250, 250, 0, :]

위 코드는 다음과 같은 내용을 수행합니다. 코드를 한 줄씩 설명하겠습니다.

Y 텐서를 형상 (h, w, 5, 4)로 변형하여 boxes에 저장합니다. 여기서 h와 w는 이미지의 높이와 너비이고, 5는 각 픽셀마다 생성된 앵커 박스의 개수, 4는 각 앵커 박스의 좌표 정보를 나타냅니다.

boxes 텐서의 (250, 250) 위치의 픽셀에 생성된 첫 번째 앵커 박스의 좌표 정보를 출력합니다. 여기서 0은 첫 번째 앵커 박스를 나타냅니다. 결과는 해당 픽셀 위치에서 첫 번째 앵커 박스의 (x, y, width, height) 좌표 정보가 출력됩니다.

#@save
def show_bboxes(axes, bboxes, labels=None, colors=None):
    """Show bounding boxes."""

    def make_list(obj, default_values=None):
        if obj is None:
            obj = default_values
        elif not isinstance(obj, (list, tuple)):
            obj = [obj]
        return obj

    labels = make_list(labels)
    colors = make_list(colors, ['b', 'g', 'r', 'm', 'c'])
    for i, bbox in enumerate(bboxes):
        color = colors[i % len(colors)]
        rect = d2l.bbox_to_rect(bbox.detach().numpy(), color)
        axes.add_patch(rect)
        if labels and len(labels) > i:
            text_color = 'k' if color == 'w' else 'w'
            axes.text(rect.xy[0], rect.xy[1], labels[i],
                      va='center', ha='center', fontsize=9, color=text_color,
                      bbox=dict(facecolor=color, lw=0))

위 코드는 다음과 같은 목적을 가진 함수인 show_bboxes를 정의하고 있습니다.

axes: Matplotlib의 축(axes) 객체입니다. 여기서 바운딩 박스를 그릴 위치를 지정합니다.
bboxes: 바운딩 박스 좌표 정보를 담고 있는 텐서입니다.
labels: 각 바운딩 박스에 대한 레이블 정보입니다.
colors: 각 바운딩 박스에 대한 색상 정보입니다.

입력으로 받은 객체 obj를 리스트로 만들어 반환하는 함수입니다. default_values가 지정되어 있지 않은 경우에는 None을 기본값으로 사용합니다.

labels와 colors를 리스트로 변환합니다. 기본적으로 colors에는 파란색, 초록색, 빨간색, 자주색, 청록색 순서의 색상이 사용됩니다.

enumerate 함수를 사용하여 바운딩 박스와 해당 인덱스를 반복합니다.
bbox를 이용하여 Matplotlib의 Rectangle 객체를 생성하고, 해당 객체를 축에 추가합니다.
만약 labels가 존재하고, 현재 인덱스가 레이블 개수보다 작으면, 해당 레이블 정보를 바운딩 박스 중심에 출력합니다. 이때 텍스트 색상과 배경 색상이 조정되어 텍스트가 더 잘 보이도록 설정됩니다.

As we just saw, the coordinate values of the x and y axes in the variable boxes have been divided by the width and height of the image, respectively. When drawing anchor boxes, we need to restore their original coordinate values; thus, we define variable bbox_scale below. Now, we can draw all the anchor boxes centered on (250, 250) in the image. As you can see, the blue anchor box with a scale of 0.75 and an aspect ratio of 1 well surrounds the dog in the image.

방금 본 것처럼 변수 상자의 x 및 y축 좌표 값은 각각 이미지의 너비와 높이로 나뉩니다. 앵커 상자를 그릴 때 원래 좌표 값을 복원해야 합니다. 따라서 아래에 변수 bbox_scale을 정의합니다. 이제 이미지에서 (250, 250)을 중심으로 모든 앵커 상자를 그릴 수 있습니다. 보시다시피 스케일 0.75, 종횡비 1 well인 파란색 앵커 상자가 이미지에서 강아지를 잘 둘러싸고 있습니다.

d2l.set_figsize()
bbox_scale = torch.tensor((w, h, w, h))
fig = d2l.plt.imshow(img)
show_bboxes(fig.axes, boxes[250, 250, :, :] * bbox_scale,
            ['s=0.75, r=1', 's=0.5, r=1', 's=0.25, r=1', 's=0.75, r=2',
             's=0.75, r=0.5'])

위 코드는 바운딩 박스를 이미지 위에 그려주는 작업을 수행하는 코드입니다.

그래프의 크기를 설정하는 함수입니다. d2l은 딥러닝 모델 관련 기능을 제공하는 패키지입니다.

bbox_scale은 바운딩 박스의 크기를 조정하기 위한 스케일을 나타냅니다. 이미지의 폭과 높이 값을 담고 있습니다.

img를 Matplotlib의 imshow 함수를 이용하여 이미지로 표시합니다. 그림을 그릴 축(axes) 객체를 fig에 저장합니다.

show_bboxes 함수를 호출하여 바운딩 박스를 이미지 위에 그립니다. boxes[250, 250, :, :] * bbox_scale는 선택한 픽셀 위치에서 생성된 여러 바운딩 박스의 좌표를 의미합니다.
각 바운딩 박스에 대한 레이블도 함께 지정하여 출력합니다. 이를 통해 바운딩 박스의 크기와 종횡비에 대한 정보를 확인할 수 있습니다.

14.4.2. Intersection over Union (IoU)

We just mentioned that an anchor box “well” surrounds the dog in the image. If the ground-truth bounding box of the object is known, how can “well” here be quantified? Intuitively, we can measure the similarity between the anchor box and the ground-truth bounding box. We know that the Jaccard index can measure the similarity between two sets. Given sets A and B, their Jaccard index is the size of their intersection divided by the size of their union:

우리는 방금 이미지에서 개를 둘러싼 앵커 박스 "well"을 언급했습니다. 객체의 실측 경계 상자가 알려진 경우 여기에서 "well"을 어떻게 정량화할 수 있습니까? 직관적으로 앵커 상자와 실측 경계 상자 사이의 유사성을 측정할 수 있습니다. 우리는 Jaccard 지수가 두 세트 간의 유사성을 측정할 수 있다는 것을 알고 있습니다. 집합 A와 B가 주어지면 Jaccard 지수는 교집합의 크기를 합집합의 크기로 나눈 값입니다.

Jaccard Index란?

The "Jaccard index," also known as the "Intersection over Union (IoU)," is a statistical measure used to evaluate the similarity or overlap between two sets. It is commonly used in the field of image segmentation, object detection, and binary classification tasks to measure the accuracy of predicted regions or bounding boxes compared to the ground truth.

"Jaccard 지수," 또는 "교집합 비율 (IoU, Intersection over Union),"은 두 집합 간의 유사성 또는 중첩을 평가하는 통계적 지표입니다. 이미지 분할, 객체 탐지 및 이진 분류 작업 분야에서 예측된 영역이나 바운딩 박스와 실제값 간의 정확성을 측정하는 데 주로 사용됩니다.

In the context of object detection or image segmentation, the Jaccard index calculates the ratio of the area of overlap between the predicted region and the ground truth region to the total area encompassed by both regions. Mathematically, the Jaccard index is defined as:

객체 탐지나 이미지 분할의 맥락에서 Jaccard 지수는 예측된 영역과 실제값 영역 사이의 중첩 영역의 면적을 전체 영역으로 나눈 비율을 계산합니다. 수학적으로 Jaccard 지수는 다음과 같이 정의됩니다.

Where:

represents the set of pixels or elements in the predicted region.
는 예측된 영역의 픽셀 또는 요소의 집합을 나타냅니다.
represents the set of pixels or elements in the ground truth region.
는 실제값 영역의 픽셀 또는 요소의 집합을 나타냅니다.
denotes the size of the intersection between sets and .
는 집합 와 사이의 교집합의 크기를 나타냅니다.
denotes the size of the union of sets and .
는 집합 와 의 합집합의 크기를 나타냅니다.

The Jaccard index ranges from 0 to 1, where 0 indicates no overlap between the sets (no agreement), and 1 indicates complete overlap (perfect agreement). In the context of object detection, a higher Jaccard index implies a better alignment between the predicted bounding box and the ground truth bounding box.

Jaccard 지수는 0부터 1까지의 범위를 가지며, 0은 집합 간에 중첩이 없음을 나타내며 (일치하지 않음), 1은 완전한 중첩을 나타냅니다 (완전히 일치). 객체 탐지의 맥락에서 높은 Jaccard 지수는 예측된 바운딩 박스와 실제값 바운딩 박스 간의 좋은 정렬을 나타냅니다.

The Jaccard index is a valuable metric for evaluating the accuracy of algorithms in tasks that involve identifying regions of interest or bounding boxes, as it provides insight into the quality of predictions by measuring their spatial agreement with the actual ground truth.

Jaccard 지수는 관심 영역이나 바운딩 박스를 식별하는 작업에서 알고리즘의 정확성을 평가하는 데 유용한 지표로, 예측의 공간적인 일치를 측정하여 실제값과의 품질을 평가합니다.

In fact, we can consider the pixel area of any bounding box as a set of pixels. In this way, we can measure the similarity of the two bounding boxes by the Jaccard index of their pixel sets. For two bounding boxes, we usually refer their Jaccard index as intersection over union (IoU), which is the ratio of their intersection area to their union area, as shown in Fig. 14.4.1. The range of an IoU is between 0 and 1: 0 means that two bounding boxes do not overlap at all, while 1 indicates that the two bounding boxes are equal.

사실 경계 상자의 픽셀 영역을 픽셀 집합으로 간주할 수 있습니다. 이러한 방식으로 픽셀 집합의 Jaccard 인덱스로 두 경계 상자의 유사성을 측정할 수 있습니다. 두 개의 경계 상자에 대해 우리는 일반적으로 그림 14.4.1과 같이 교차 영역과 결합 영역의 비율인 Jaccard 인덱스를 IoU(교차점 오버 유니온)라고 합니다. IoU의 범위는 0에서 1 사이입니다. 0은 두 경계 상자가 전혀 겹치지 않음을 의미하고 1은 두 경계 상자가 동일함을 나타냅니다.

Fig. 14.4.1  IoU is the ratio of the intersection area to the union area of two bounding boxes.

For the remainder of this section, we will use IoU to measure the similarity between anchor boxes and ground-truth bounding boxes, and between different anchor boxes. Given two lists of anchor or bounding boxes, the following box_iou computes their pairwise IoU across these two lists.

이 섹션의 나머지 부분에서는 IoU를 사용하여 앵커 상자와 ground-truth 경계 상자, 서로 다른 앵커 상자 간의 유사성을 측정합니다. 두 개의 앵커 또는 경계 상자 목록이 주어지면 다음 box_iou는 이 두 목록에서 쌍별 IoU를 계산합니다.

#@save
def box_iou(boxes1, boxes2):
    """Compute pairwise IoU across two lists of anchor or bounding boxes."""
    box_area = lambda boxes: ((boxes[:, 2] - boxes[:, 0]) *
                              (boxes[:, 3] - boxes[:, 1]))
    # Shape of `boxes1`, `boxes2`, `areas1`, `areas2`: (no. of boxes1, 4),
    # (no. of boxes2, 4), (no. of boxes1,), (no. of boxes2,)
    areas1 = box_area(boxes1)
    areas2 = box_area(boxes2)
    # Shape of `inter_upperlefts`, `inter_lowerrights`, `inters`: (no. of
    # boxes1, no. of boxes2, 2)
    inter_upperlefts = torch.max(boxes1[:, None, :2], boxes2[:, :2])
    inter_lowerrights = torch.min(boxes1[:, None, 2:], boxes2[:, 2:])
    inters = (inter_lowerrights - inter_upperlefts).clamp(min=0)
    # Shape of `inter_areas` and `union_areas`: (no. of boxes1, no. of boxes2)
    inter_areas = inters[:, :, 0] * inters[:, :, 1]
    union_areas = areas1[:, None] + areas2 - inter_areas
    return inter_areas / union_areas

위 코드는 두 개의 바운딩 박스 리스트 간에 서로 겹치는 면적의 비율인 IoU (Intersection over Union)를 계산하는 함수를 정의하는 코드입니다. 코드를 한 줄씩 설명하겠습니다.

box_iou 함수는 두 개의 바운딩 박스 리스트 간의 IoU를 계산하는 함수입니다. 이 함수는 앵커 박스(anchor box) 또는 바운딩 박스(bounding box) 리스트를 입력으로 받습니다.

box_area는 입력으로 들어온 바운딩 박스들의 면적을 계산하는 람다(lambda) 함수입니다. 바운딩 박스의 너비와 높이를 이용하여 면적을 계산합니다.

boxes1 리스트와 boxes2 리스트의 각각의 바운딩 박스에 대한 면적을 계산하여 areas1과 areas2에 저장합니다.

겹치는 영역의 좌상단 점과 우하단 점을 계산하여 inter_upperlefts와 inter_lowerrights에 저장합니다. inters는 겹치는 영역의 너비와 높이입니다. clamp(min=0) 함수를 이용하여 음수 값을 0으로 만듭니다.

겹치는 영역의 너비와 높이를 이용하여 겹치는 영역의 면적과 합집합의 면적을 계산합니다.

계산된 겹치는 영역의 면적과 합집합의 면적을 이용하여 IoU를 계산하고 반환합니다. 이를 통해 두 개의 바운딩 박스 리스트 간의 IoU를 계산할 수 있습니다.

14.4.3. Labeling Anchor Boxes in Training Data

In a training dataset, we consider each anchor box as a training example. In order to train an object detection model, we need class and offset labels for each anchor box, where the former is the class of the object relevant to the anchor box and the latter is the offset of the ground-truth bounding box relative to the anchor box. During the prediction, for each image we generate multiple anchor boxes, predict classes and offsets for all the anchor boxes, adjust their positions according to the predicted offsets to obtain the predicted bounding boxes, and finally only output those predicted bounding boxes that satisfy certain criteria.

교육 데이터 세트에서 각 앵커 상자를 교육 예제로 간주합니다. 개체 감지 모델을 훈련하려면 각 앵커 상자에 대한 클래스 및 오프셋 레이블이 필요합니다. 여기서 전자는 앵커 상자와 관련된 개체의 클래스입니다. 후자는 앵커 상자에 대한 실측 경계 상자의 오프셋입니다. 예측하는 동안 각 이미지에 대해 여러 개의 앵커 상자를 생성하고 모든 앵커 상자에 대한 클래스와 오프셋을 예측하고 예측된 경계 상자를 얻기 위해 예측된 오프셋에 따라 위치를 조정하고 마지막으로 특정 기준을 충족하는 예측된 경계 상자만 출력합니다. .

As we know, an object detection training set comes with labels for locations of ground-truth bounding boxes and classes of their surrounded objects. To label any generated anchor box, we refer to the labeled location and class of its assigned ground-truth bounding box that is closest to the anchor box. In the following, we describe an algorithm for assigning closest ground-truth bounding boxes to anchor boxes.

아시다시피 객체 감지 훈련 세트에는 ground-truth 경계 상자의 위치와 둘러싸인 객체의 클래스에 대한 레이블이 함께 제공됩니다. 생성된 앵커 상자에 레이블을 지정하려면 앵커 상자에 가장 가까운 할당된 ground-truth 경계 상자의 레이블이 지정된 위치와 클래스를 참조합니다. 다음에서는 가장 가까운 ground-truth 경계 상자를 앵커 상자에 할당하는 알고리즘을 설명합니다.

14.4.3.1. Assigning Ground-Truth Bounding Boxes to Anchor Boxes

Given an image, suppose that the anchor boxes are A1,A2,…,Ana and the ground-truth bounding boxes are B1,B2,…,Bnb, where na≥nb. Let’s define a matrix X∈ℝ**na×nb, whose element xij in the ith row and jth column is the IoU of the anchor box Ai and the ground-truth bounding box Bj. The algorithm consists of the following steps:

주어진 이미지에서 앵커 상자가 A1,A2,…,Ana이고 ground-truth 경계 상자가 B1,B2,…,Bnb라고 가정합니다. 행렬 X∈ℝ**na×nb를 정의해 보겠습니다. 여기서 i번째 행과 j번째 열의 요소 xij는 앵커 상자 Ai와 ground-truth 경계 상자 Bj의 IoU입니다. 알고리즘은 다음 단계로 구성됩니다.

Find the largest element in matrix X and denote its row and column indices as i1 and j1, respectively. Then the ground-truth bounding box Bj1 is assigned to the anchor box Ai1. This is quite intuitive because Ai1 and Bj1 are the closest among all the pairs of anchor boxes and ground-truth bounding boxes. After the first assignment, discard all the elements in the i1th row and the j1th column in matrix X.

행렬 X에서 가장 큰 요소를 찾고 해당 행 및 열 인덱스를 각각 i1 및 j1로 나타냅니다. 그런 다음 ground-truth 경계 상자 Bj1이 앵커 상자 Ai1에 할당됩니다. 이것은 Ai1과 Bj1이 모든 앵커 박스와 ground-truth bounding box 쌍 중에서 가장 가깝기 때문에 매우 직관적입니다. 첫 번째 할당 후 행렬 X의 i1번째 행과 j1번째 열의 모든 요소를 버립니다.
Find the largest of the remaining elements in matrix X and denote its row and column indices as i2 and j2, respectively. We assign ground-truth bounding box Bj2 to anchor box Ai2 and discard all the elements in the i2th row and the j2th column in matrix X.

행렬 X에서 나머지 요소 중 가장 큰 요소를 찾고 해당 행 및 열 인덱스를 각각 i2 및 j2로 나타냅니다. ground-truth 경계 상자 Bj2를 앵커 상자 Ai2에 할당하고 행렬 X의 i2번째 행과 j2번째 열에 있는 모든 요소를 버립니다.
At this point, elements in two rows and two columns in matrix X have been discarded. We proceed until all elements in nb columns in matrix X are discarded. At this time, we have assigned a ground-truth bounding box to each of nb anchor boxes.

이 시점에서 행렬 X의 두 행과 두 열의 요소는 버려졌습니다. 행렬 X의 nb 열에 있는 모든 요소가 버려질 때까지 진행합니다. 이때 각 nb 앵커 박스에 ground-truth 경계 상자를 할당했습니다.
Only traverse through the remaining na−nb anchor boxes. For example, given any anchor box Ai, find the ground-truth bounding box Bj with the largest IoU with Ai throughout the ith row of matrix X, and assign Bj to Ai only if this IoU is greater than a predefined threshold.

나머지 na-nb 앵커 박스를 통해서만 트래버스하십시오. 예를 들어 앵커 상자 Ai가 주어지면 행렬 X의 i번째 행 전체에서 Ai와 함께 가장 큰 IoU가 있는 실측 경계 상자 Bj를 찾고 이 IoU가 미리 정의된 임계값보다 큰 경우에만 Bj를 Ai에 할당합니다.

Let’s illustrate the above algorithm using a concrete example. As shown in Fig. 14.4.2 (left), assuming that the maximum value in matrix X is x23, we assign the ground-truth bounding box B3 to the anchor box A2. Then, we discard all the elements in row 2 and column 3 of the matrix, find the largest x71 in the remaining elements (shaded area), and assign the ground-truth bounding box B1 to the anchor box A7. Next, as shown in Fig. 14.4.2 (middle), discard all the elements in row 7 and column 1 of the matrix, find the largest x54 in the remaining elements (shaded area), and assign the ground-truth bounding box B4 to the anchor box A5. Finally, as shown in Fig. 14.4.2 (right), discard all the elements in row 5 and column 4 of the matrix, find the largest x92 in the remaining elements (shaded area), and assign the ground-truth bounding box B2 to the anchor box A9. After that, we only need to traverse through the remaining anchor boxes A1,A3,A4,A6,A8 and determine whether to assign them ground-truth bounding boxes according to the threshold.

구체적인 예를 사용하여 위의 알고리즘을 설명하겠습니다. 그림 14.4.2(왼쪽)에 표시된 대로 행렬 X의 최대값이 x23이라고 가정하고 앵커 상자 A2에 ground-truth 경계 상자 B3을 할당합니다. 그런 다음 행렬의 2행과 3열의 모든 요소를 버리고 나머지 요소(음영 영역)에서 가장 큰 x71을 찾고 ground-truth 경계 상자 B1을 앵커 상자 A7에 할당합니다. 다음으로 그림 14.4.2(가운데)와 같이 행렬의 7행 1열의 원소를 모두 버리고 나머지 원소(음영 영역)에서 가장 큰 x54를 찾아 ground-truth bounding box B4를 할당한다. 앵커 박스 A5에. 마지막으로 그림 14.4.2(오른쪽)과 같이 행렬의 5행 4열의 원소를 모두 버리고 나머지 원소(음영 영역)에서 가장 큰 x92를 찾아,실측 경계 상자 B2를 앵커 상자 A9에 할당합니다. 그런 다음 나머지 앵커 상자 A1,A3,A4,A6,A8을 통과하고 임계값에 따라 실측 경계 상자를 할당할지 여부만 결정하면 됩니다.

Fig. 14.4.2  Assigning ground-truth bounding boxes to anchor boxes.

This algorithm is implemented in the following assign_anchor_to_bbox function.

이 알고리즘은 다음 assign_anchor_to_bbox 함수에서 구현됩니다.

#@save
def assign_anchor_to_bbox(ground_truth, anchors, device, iou_threshold=0.5):
    """Assign closest ground-truth bounding boxes to anchor boxes."""
    num_anchors, num_gt_boxes = anchors.shape[0], ground_truth.shape[0]
    # Element x_ij in the i-th row and j-th column is the IoU of the anchor
    # box i and the ground-truth bounding box j
    jaccard = box_iou(anchors, ground_truth)
    # Initialize the tensor to hold the assigned ground-truth bounding box for
    # each anchor
    anchors_bbox_map = torch.full((num_anchors,), -1, dtype=torch.long,
                                  device=device)
    # Assign ground-truth bounding boxes according to the threshold
    max_ious, indices = torch.max(jaccard, dim=1)
    anc_i = torch.nonzero(max_ious >= iou_threshold).reshape(-1)
    box_j = indices[max_ious >= iou_threshold]
    anchors_bbox_map[anc_i] = box_j
    col_discard = torch.full((num_anchors,), -1)
    row_discard = torch.full((num_gt_boxes,), -1)
    for _ in range(num_gt_boxes):
        max_idx = torch.argmax(jaccard)  # Find the largest IoU
        box_idx = (max_idx % num_gt_boxes).long()
        anc_idx = (max_idx / num_gt_boxes).long()
        anchors_bbox_map[anc_idx] = box_idx
        jaccard[:, box_idx] = col_discard
        jaccard[anc_idx, :] = row_discard
    return anchors_bbox_map

위 코드는 앵커 박스(Anchor Boxes)에 대해 가장 가까운 실제 바운딩 박스(Ground-Truth Bounding Boxes)를 할당하는 함수를 정의하는 코드입니다. 코드를 한 줄씩 설명하겠습니다.

def assign_anchor_to_bbox(ground_truth, anchors, device, iou_threshold=0.5):
    """Assign closest ground-truth bounding boxes to anchor boxes."""

assign_anchor_to_bbox 함수는 앵커 박스와 실제 바운딩 박스 간에 IoU를 계산하여 가장 가까운 바운딩 박스를 앵커 박스에 할당하는 함수입니다.

num_anchors, num_gt_boxes = anchors.shape[0], ground_truth.shape[0]

num_anchors는 앵커 박스의 개수, num_gt_boxes는 실제 바운딩 박스의 개수를 나타냅니다.

jaccard = box_iou(anchors, ground_truth)

box_iou 함수를 이용하여 앵커 박스와 실제 바운딩 박스 간의 IoU를 계산하여 jaccard에 저장합니다.

anchors_bbox_map = torch.full((num_anchors,), -1, dtype=torch.long,
                              device=device)

각 앵커 박스에 해당하는 가장 가까운 실제 바운딩 박스의 인덱스를 저장하기 위한 텐서인 anchors_bbox_map을 생성합니다. 초기 값은 -1로 설정합니다.

max_ious, indices = torch.max(jaccard, dim=1)
anc_i = torch.nonzero(max_ious >= iou_threshold).reshape(-1)
box_j = indices[max_ious >= iou_threshold]
anchors_bbox_map[anc_i] = box_j

max_ious는 각 앵커 박스에 대한 가장 큰 IoU 값을, indices는 해당 IoU 값을 가지는 실제 바운딩 박스의 인덱스를 나타냅니다. iou_threshold 이상의 IoU 값을 가지는 앵커 박스들에 대해 가장 가까운 실제 바운딩 박스의 인덱스를 anchors_bbox_map에 할당합니다.

col_discard = torch.full((num_anchors,), -1)
row_discard = torch.full((num_gt_boxes,), -1)

할당된 바운딩 박스와 IoU 값을 표시하기 위해 사용할 보조 텐서인 col_discard와 row_discard를 생성합니다.

for _ in range(num_gt_boxes):
    max_idx = torch.argmax(jaccard)  # Find the largest IoU
    box_idx = (max_idx % num_gt_boxes).long()
    anc_idx = (max_idx / num_gt_boxes).long()
    anchors_bbox_map[anc_idx] = box_idx
    jaccard[:, box_idx] = col_discard
    jaccard[anc_idx, :] = row_discard

각 실제 바운딩 박스에 대해 가장 큰 IoU 값을 가지는 앵커 박스를 찾아 해당 앵커 박스의 인덱스와 실제 바운딩 박스의 인덱스를 anchors_bbox_map에 할당합니다. 할당한 후 해당 앵커 박스와 실제 바운딩 박스에 대한 IoU 값을 무효화하기 위해 jaccard의 해당 열과 행을 col_discard와 row_discard로 설정합니다.

return anchors_bbox_map

앵커 박스와 실제 바운딩 박스 간의 할당 결과를 반환합니다.

14.4.3.2. Labeling Classes and Offsets

Now we can label the class and offset for each anchor box. Suppose that an anchor box A is assigned a ground-truth bounding box B. On the one hand, the class of the anchor box A will be labeled as that of B. On the other hand, the offset of the anchor box A will be labeled according to the relative position between the central coordinates of B and A together with the relative size between these two boxes. Given varying positions and sizes of different boxes in the dataset, we can apply transformations to those relative positions and sizes that may lead to more uniformly distributed offsets that are easier to fit. Here we describe a common transformation. Given the central coordinates of A and B as (xa,ya) and (xb,yb), their widths as wa and wb and their heights as ℎa and ℎb, respectively. We may label the offset of A as

이제 각 앵커 상자의 클래스와 오프셋에 레이블을 지정할 수 있습니다. 앵커 상자 A에 실측 경계 상자 B가 할당되었다고 가정합니다. 한편으로 앵커 상자 A의 클래스는 B의 클래스로 레이블이 지정됩니다. 다른 한편으로 앵커 상자 A의 오프셋은 다음과 같습니다. 이 두 상자 사이의 상대 크기와 함께 B와 A의 중심 좌표 사이의 상대 위치에 따라 레이블이 지정됩니다. 데이터 세트에 있는 다양한 상자의 다양한 위치와 크기가 주어지면 더 쉽게 맞출 수 있는 보다 균일하게 분포된 오프셋으로 이어질 수 있는 상대적인 위치와 크기에 변환을 적용할 수 있습니다. 여기서는 일반적인 변환에 대해 설명합니다. A와 B의 중심 좌표가 (xa,ya)와 (xb,yb)로 주어지면 너비는 wa와 wb, 높이는 각각 ℎa와 ℎb입니다. A의 오프셋을 다음과 같이 표시할 수 있습니다.

#@save
def offset_boxes(anchors, assigned_bb, eps=1e-6):
    """Transform for anchor box offsets."""
    c_anc = d2l.box_corner_to_center(anchors)
    c_assigned_bb = d2l.box_corner_to_center(assigned_bb)
    offset_xy = 10 * (c_assigned_bb[:, :2] - c_anc[:, :2]) / c_anc[:, 2:]
    offset_wh = 5 * torch.log(eps + c_assigned_bb[:, 2:] / c_anc[:, 2:])
    offset = torch.cat([offset_xy, offset_wh], axis=1)
    return offset

위 코드는 앵커 박스와 해당 앵커 박스에 할당된 실제 바운딩 박스 사이의 변환을 계산하는 함수를 정의하는 코드입니다. 코드를 한 줄씩 설명하겠습니다.

offset_boxes 함수는 앵커 박스와 해당 앵커 박스에 할당된 실제 바운딩 박스 사이의 변환을 계산하는 함수입니다.

d2l.box_corner_to_center 함수를 이용하여 앵커 박스와 할당된 바운딩 박스를 모두 상단 왼쪽 모서리에서 중심 좌표와 너비, 높이 형태로 변환합니다.

offset_xy는 중심 좌표의 변화를 나타내는 텐서로, (할당된 바운딩 박스 중심 좌표 - 앵커 박스 중심 좌표) / 앵커 박스 너비 및 높이로 계산됩니다. offset_wh는 너비와 높이의 변화를 나타내는 텐서로, 5 * log(1e-6 + 할당된 바운딩 박스 높이 / 앵커 박스 높이)로 계산됩니다. 이때 eps는 로그 계산에서 0으로 나누기를 방지하는 작은 값입니다.

offset_xy와 offset_wh를 수평 방향으로 연결하여 하나의 텐서인 offset을 생성합니다.

계산된 변환을 반환합니다. 이 변환은 앵커 박스를 기준으로 해당 앵커 박스에 할당된 실제 바운딩 박스의 변화량을 나타내게 됩니다.

If an anchor box is not assigned a ground-truth bounding box, we just label the class of the anchor box as “background”. Anchor boxes whose classes are background are often referred to as negative anchor boxes, and the rest are called positive anchor boxes. We implement the following multibox_target function to label classes and offsets for anchor boxes (the anchors argument) using ground-truth bounding boxes (the labels argument). This function sets the background class to zero and increments the integer index of a new class by one.

앵커 상자에 실측 경계 상자가 할당되지 않은 경우 앵커 상자의 클래스를 "배경"으로 레이블을 지정합니다. 클래스가 배경인 앵커 박스는 종종 네거티브 앵커 박스라고 하고 나머지는 포지티브 앵커 박스라고 합니다. 다음 multibox_target 함수를 구현하여 ground-truth 경계 상자(label 인수)를 사용하여 앵커 상자(anchors 인수)의 클래스 및 오프셋에 레이블을 지정합니다. 이 함수는 백그라운드 클래스를 0으로 설정하고 새 클래스의 정수 인덱스를 1씩 증가시킵니다.

#@save
def multibox_target(anchors, labels):
    """Label anchor boxes using ground-truth bounding boxes."""
    batch_size, anchors = labels.shape[0], anchors.squeeze(0)
    batch_offset, batch_mask, batch_class_labels = [], [], []
    device, num_anchors = anchors.device, anchors.shape[0]
    for i in range(batch_size):
        label = labels[i, :, :]
        anchors_bbox_map = assign_anchor_to_bbox(
            label[:, 1:], anchors, device)
        bbox_mask = ((anchors_bbox_map >= 0).float().unsqueeze(-1)).repeat(
            1, 4)
        # Initialize class labels and assigned bounding box coordinates with
        # zeros
        class_labels = torch.zeros(num_anchors, dtype=torch.long,
                                   device=device)
        assigned_bb = torch.zeros((num_anchors, 4), dtype=torch.float32,
                                  device=device)
        # Label classes of anchor boxes using their assigned ground-truth
        # bounding boxes. If an anchor box is not assigned any, we label its
        # class as background (the value remains zero)
        indices_true = torch.nonzero(anchors_bbox_map >= 0)
        bb_idx = anchors_bbox_map[indices_true]
        class_labels[indices_true] = label[bb_idx, 0].long() + 1
        assigned_bb[indices_true] = label[bb_idx, 1:]
        # Offset transformation
        offset = offset_boxes(anchors, assigned_bb) * bbox_mask
        batch_offset.append(offset.reshape(-1))
        batch_mask.append(bbox_mask.reshape(-1))
        batch_class_labels.append(class_labels)
    bbox_offset = torch.stack(batch_offset)
    bbox_mask = torch.stack(batch_mask)
    class_labels = torch.stack(batch_class_labels)
    return (bbox_offset, bbox_mask, class_labels)

위 코드는 앵커 박스와 실제 바운딩 박스에 기반하여 앵커 박스에 라벨을 지정하는 함수인 multibox_target를 정의하는 코드입니다. 코드를 한 줄씩 설명하겠습니다.

multibox_target 함수는 앵커 박스와 실제 바운딩 박스를 기반으로 앵커 박스에 라벨을 지정하는 함수입니다.

batch_size는 배치 내의 데이터 수를 나타냅니다. labels 텐서의 크기에서 배치 차원을 제거한 뒤, 앵커 박스 텐서를 할당합니다.

계산 결과를 저장하기 위한 빈 리스트인 batch_offset, batch_mask, batch_class_labels를 생성합니다.

for i in range(batch_size):
    label = labels[i, :, :]
    anchors_bbox_map = assign_anchor_to_bbox(label[:, 1:], anchors, device)

각 배치 데이터에 대하여 반복하면서 실제 바운딩 박스 정보를 가져오고, 해당 실제 바운딩 박스와 앵커 박스 간의 IoU를 계산하여 할당을 수행합니다.

bbox_mask = ((anchors_bbox_map >= 0).float().unsqueeze(-1)).repeat(1, 4)

할당된 바운딩 박스가 있는 앵커 박스의 인덱스를 이용하여 바운딩 박스 마스크를 생성합니다. 할당된 바운딩 박스가 있는 위치의 마스크 값은 1이고, 그렇지 않은 위치의 마스크 값은 0입니다.

class_labels = torch.zeros(num_anchors, dtype=torch.long, device=device)
assigned_bb = torch.zeros((num_anchors, 4), dtype=torch.float32, device=device)

클래스 라벨과 할당된 바운딩 박스의 초기화를 위한 텐서를 생성합니다.

indices_true = torch.nonzero(anchors_bbox_map >= 0)
bb_idx = anchors_bbox_map[indices_true]
class_labels[indices_true] = label[bb_idx, 0].long() + 1
assigned_bb[indices_true] = label[bb_idx, 1:]

할당된 바운딩 박스가 있는 위치에 대해서 실제 바운딩 박스의 클래스 라벨과 좌표 정보를 class_labels와 assigned_bb에 할당합니다. 이때, 클래스 라벨은 1부터 시작하여 라벨링됩니다.

offset = offset_boxes(anchors, assigned_bb) * bbox_mask
batch_offset.append(offset.reshape(-1))
batch_mask.append(bbox_mask.reshape(-1))
batch_class_labels.append(class_labels)

할당된 바운딩 박스를 기준으로 앵커 박스와 실제 바운딩 박스 사이의 변환을 계산합니다. 이 변환을 offset으로 저장하고, batch_offset 리스트에 추가합니다. 마찬가지로 바운딩 박스 마스크와 클래스 라벨 정보도 batch_mask와 batch_class_labels 리스트에 추가합니다.

bbox_offset = torch.stack(batch_offset)
bbox_mask = torch.stack(batch_mask)
class_labels = torch.stack(batch_class_labels)

batch_offset, batch_mask, batch_class_labels를 스택하여 최종적으로 배치 단위의 변환 정보, 마스크 정보, 클래스 라벨 정보를 생성합니다.

return (bbox_offset, bbox_mask, class_labels)

계산된 변환 정보, 마스크 정보, 클래스 라벨 정보를 반환합니다. 이 정보는 학습할 때 앵커 박스의 위치와 클래스에 대한 정보를 지원하기 위해 사용됩니다.

14.4.3.3. An Example

Let’s illustrate anchor box labeling via a concrete example. We define ground-truth bounding boxes for the dog and cat in the loaded image, where the first element is the class (0 for dog and 1 for cat) and the remaining four elements are the (x,y)-axis coordinates at the upper-left corner and the lower-right corner (range is between 0 and 1). We also construct five anchor boxes to be labeled using the coordinates of the upper-left corner and the lower-right corner: A0,…,A4 (the index starts from 0). Then we plot these ground-truth bounding boxes and anchor boxes in the image.

구체적인 예를 통해 앵커 박스 라벨링을 설명하겠습니다. 로드된 이미지에서 개와 고양이에 대한 ground-truth 경계 상자를 정의합니다. 여기서 첫 번째 요소는 클래스(개는 0, 고양이는 1)이고 나머지 4개 요소는 (x,y)축 좌표입니다. 왼쪽 위 모서리와 오른쪽 아래 모서리(범위는 0과 1 사이)입니다. 또한 왼쪽 위 모서리와 오른쪽 아래 모서리의 좌표인 A0,…,A4(인덱스는 0부터 시작)를 사용하여 레이블을 지정할 5개의 앵커 상자를 구성합니다. 그런 다음 이러한 실측 경계 상자와 앵커 상자를 이미지에 플로팅합니다.

ground_truth = torch.tensor([[0, 0.1, 0.08, 0.52, 0.92],
                         [1, 0.55, 0.2, 0.9, 0.88]])
anchors = torch.tensor([[0, 0.1, 0.2, 0.3], [0.15, 0.2, 0.4, 0.4],
                    [0.63, 0.05, 0.88, 0.98], [0.66, 0.45, 0.8, 0.8],
                    [0.57, 0.3, 0.92, 0.9]])

fig = d2l.plt.imshow(img)
show_bboxes(fig.axes, ground_truth[:, 1:] * bbox_scale, ['dog', 'cat'], 'k')
show_bboxes(fig.axes, anchors * bbox_scale, ['0', '1', '2', '3', '4']);

Using the multibox_target function defined above, we can label classes and offsets of these anchor boxes based on the ground-truth bounding boxes for the dog and cat. In this example, indices of the background, dog, and cat classes are 0, 1, and 2, respectively. Below we add an dimension for examples of anchor boxes and ground-truth bounding boxes.

위에서 정의한 multibox_target 함수를 사용하여 개와 고양이에 대한 ground-truth 경계 상자를 기반으로 이러한 앵커 상자의 클래스와 오프셋에 레이블을 지정할 수 있습니다. 이 예제에서 background, dog, cat 클래스의 인덱스는 각각 0, 1, 2입니다. 아래에서 앵커 박스와 ground-truth 경계 상자의 예에 대한 차원을 추가합니다.

labels = multibox_target(anchors.unsqueeze(dim=0),
                         ground_truth.unsqueeze(dim=0))

위 코드는 multibox_target 함수를 사용하여 앵커 박스에 라벨을 할당하는 과정을 설명합니다.

anchors.unsqueeze(dim=0)는 anchors 텐서에 차원을 하나 추가하여 배치 차원을 생성합니다. anchors 텐서는 앵커 박스의 좌표 정보를 가지고 있는 텐서입니다. 이것은 실제 데이터와의 계산을 위해 배치 차원을 맞추기 위한 과정입니다.
ground_truth.unsqueeze(dim=0)도 마찬가지로 실제 바운딩 박스의 정보를 가지고 있는 텐서에 배치 차원을 추가합니다. 이 과정 역시 실제 데이터와의 계산을 위해 배치 차원을 맞추기 위한 것입니다.
이렇게 앵커 박스와 실제 바운딩 박스의 정보를 하나의 배치로 변환한 뒤, multibox_target 함수를 호출하여 해당 배치에 대한 라벨링을 수행합니다. 이때, multibox_target 함수는 앵커 박스에 대한 라벨 정보를 계산하고 반환합니다.

There are three items in the returned result, all of which are in the tensor format. The third item contains the labeled classes of the input anchor boxes.

반환된 결과에는 3개의 항목이 있으며 모두 텐서 형식입니다. 세 번째 항목에는 입력 앵커 상자의 레이블이 지정된 클래스가 포함되어 있습니다.

Let’s analyze the returned class labels below based on anchor box and ground-truth bounding box positions in the image. First, among all the pairs of anchor boxes and ground-truth bounding boxes, the IoU of the anchor box A4 and the ground-truth bounding box of the cat is the largest. Thus, the class of A4 is labeled as the cat. Taking out pairs containing A4 or the ground-truth bounding box of the cat, among the rest the pair of the anchor box A1 and the ground-truth bounding box of the dog has the largest IoU. So the class of A1 is labeled as the dog. Next, we need to traverse through the remaining three unlabeled anchor boxes: A0, A2, and A3. For A0, the class of the ground-truth bounding box with the largest IoU is the dog, but the IoU is below the predefined threshold (0.5), so the class is labeled as background; for A2, the class of the ground-truth bounding box with the largest IoU is the cat and the IoU exceeds the threshold, so the class is labeled as the cat; for A3, the class of the ground-truth bounding box with the largest IoU is the cat, but the value is below the threshold, so the class is labeled as background.

이미지의 앵커 상자 및 ground-truth 경계 상자 위치를 기반으로 아래 반환된 클래스 레이블을 분석해 보겠습니다. 먼저, 앵커 박스와 ground-truth bounding box의 모든 쌍 중에서 앵커 박스 A4와 cat의 ground-truth bounding box의 IoU가 가장 크다. 따라서 A4의 클래스는 고양이로 표시됩니다. A4 또는 고양이의 ground-truth bounding box를 포함하는 쌍을 꺼내고 나머지 중에서 앵커 상자 A1과 dog의 ground-truth bounding box의 쌍이 가장 큰 IoU를 갖습니다. 따라서 A1의 class 개로 분류됩니다. 다음으로 레이블이 지정되지 않은 나머지 3개의 앵커 상자(A0, A2 및 A3)를 통과해야 합니다. A0의 경우 IoU가 가장 큰 ground-truth 경계 상자의 클래스는 개이지만 IoU는 사전 정의된 임계값(0.5) 미만이므로 클래스는 배경으로 레이블이 지정됩니다. A2의 경우 IoU가 가장 큰 ground-truth 경계 상자의 클래스는 고양이이고 IoU는 임계값을 초과하므로 클래스는 고양이로 레이블이 지정됩니다. A3의 경우 IoU가 가장 큰 ground-truth bounding box의 클래스는 cat이지만 값이 임계값 미만이므로 클래스를 background로 레이블링합니다.

labels[2]

위 코드는 labels 텐서에서 인덱스 2에 해당하는 데이터를 추출하는 작업을 수행합니다.

labels 텐서는 multibox_target 함수의 결과로 생성된 텐서입니다. 이 텐서는 앵커 박스에 할당된 라벨 정보를 담고 있습니다.
[2]는 텐서의 인덱스 연산자로, 해당 인덱스에 해당하는 데이터를 추출합니다. 인덱스 2는 배치 내에서 세 번째 데이터를 의미합니다.

따라서 labels[2]는 세 번째 배치에 해당하는 앵커 박스의 라벨 정보를 반환합니다. 이 정보에는 각 앵커 박스에 할당된 클래스 라벨과 바운딩 박스의 오프셋 정보가 포함될 수 있습니다.

tensor([[0, 1, 2, 0, 2]])

The second returned item is a mask variable of the shape (batch size, four times the number of anchor boxes). Every four elements in the mask variable correspond to the four offset values of each anchor box. Since we do not care about background detection, offsets of this negative class should not affect the objective function. Through elementwise multiplications, zeros in the mask variable will filter out negative class offsets before calculating the objective function.

두 번째 반환된 항목은 모양(배치 크기, 앵커 상자 수의 4배)의 마스크 변수입니다. 마스크 변수의 모든 4개 요소는 각 앵커 상자의 4개 오프셋 값에 해당합니다. 우리는 배경 감지에 관심이 없기 때문에 이 네거티브 클래스의 오프셋은 목적 함수에 영향을 미치지 않아야 합니다. 요소별 곱셈을 통해 마스크 변수의 0은 목적 함수를 계산하기 전에 음수 클래스 오프셋을 필터링합니다.

labels[1]

위 코드는 labels 텐서에서 인덱스 1에 해당하는 데이터를 추출하는 작업을 수행합니다.

labels 텐서는 multibox_target 함수의 결과로 생성된 텐서입니다. 이 텐서는 앵커 박스에 할당된 라벨 정보를 담고 있습니다.
[1]는 텐서의 인덱스 연산자로, 해당 인덱스에 해당하는 데이터를 추출합니다. 인덱스 1은 두 번째 차원 내의 데이터를 의미합니다.

따라서 labels[1]은 앵커 박스에 할당된 바운딩 박스의 오프셋 정보를 반환합니다. 이 정보는 각 앵커 박스마다 정답 바운딩 박스와의 오프셋 변환을 나타내는 값들로 구성되어 있습니다.

tensor([[0., 0., 0., 0., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 1., 1.,
         1., 1.]])

The first returned item contains the four offset values labeled for each anchor box. Note that the offsets of negative-class anchor boxes are labeled as zeros.

첫 번째로 반환된 항목에는 각 앵커 상자에 대해 레이블이 지정된 4개의 오프셋 값이 포함되어 있습니다. 네거티브 클래스 앵커 상자의 오프셋은 0으로 레이블이 지정됩니다.

labels[0]

위 코드는 labels 텐서에서 인덱스 0에 해당하는 데이터를 추출하는 작업을 수행합니다.

labels 텐서는 multibox_target 함수의 결과로 생성된 텐서입니다. 이 텐서는 앵커 박스에 할당된 라벨 정보를 담고 있습니다.
[0]는 텐서의 인덱스 연산자로, 해당 인덱스에 해당하는 데이터를 추출합니다. 인덱스 0은 첫 번째 차원 내의 데이터를 의미합니다.

따라서 labels[0]은 각 앵커 박스에 대한 바운딩 박스의 오프셋 변환 값들을 가지고 있는 텐서를 반환합니다. 이 값들은 해당 앵커 박스의 위치와 크기를 바탕으로 실제 정답 바운딩 박스와의 오프셋을 나타내는 값들입니다.

tensor([[-0.00e+00, -0.00e+00, -0.00e+00, -0.00e+00,  1.40e+00,  1.00e+01,
          2.59e+00,  7.18e+00, -1.20e+00,  2.69e-01,  1.68e+00, -1.57e+00,
         -0.00e+00, -0.00e+00, -0.00e+00, -0.00e+00, -5.71e-01, -1.00e+00,
          4.17e-06,  6.26e-01]])

14.4.4. Predicting Bounding Boxes with Non-Maximum Suppression

During prediction, we generate multiple anchor boxes for the image and predict classes and offsets for each of them. A predicted bounding box is thus obtained according to an anchor box with its predicted offset. Below we implement the offset_inverse function that takes in anchors and offset predictions as inputs and applies inverse offset transformations to return the predicted bounding box coordinates.

예측하는 동안 이미지에 대한 여러 앵커 상자를 생성하고 각각에 대한 클래스와 오프셋을 예측합니다. 예측된 오프셋을 갖는 앵커 박스에 따라 예측된 바운딩 박스가 얻어진다. 아래에서는 앵커 및 오프셋 예측을 입력으로 받아들이고 역 오프셋 변환을 적용하여 예측된 경계 상자 좌표를 반환하는 offset_inverse 함수를 구현합니다.

#@save
def offset_inverse(anchors, offset_preds):
    """Predict bounding boxes based on anchor boxes with predicted offsets."""
    anc = d2l.box_corner_to_center(anchors)
    pred_bbox_xy = (offset_preds[:, :2] * anc[:, 2:] / 10) + anc[:, :2]
    pred_bbox_wh = torch.exp(offset_preds[:, 2:] / 5) * anc[:, 2:]
    pred_bbox = torch.cat((pred_bbox_xy, pred_bbox_wh), axis=1)
    predicted_bbox = d2l.box_center_to_corner(pred_bbox)
    return predicted_bbox

위 코드는 앵커 박스와 예측된 오프셋을 사용하여 예측된 바운딩 박스를 계산하는 함수인 offset_inverse를 정의합니다.

anchors: 앵커 박스의 좌표와 크기 정보를 가지고 있는 텐서입니다.
offset_preds: 앵커 박스의 예측된 오프셋 정보를 가지고 있는 텐서입니다.

위 함수는 다음과 같은 단계로 동작합니다:

d2l.box_corner_to_center(anchors): 입력으로 들어온 앵커 박스의 좌상단, 우하단 좌표를 중심 좌표와 폭/높이로 변환합니다.
offset_preds[:, :2] * anc[:, 2:] / 10 + anc[:, :2]: 예측된 오프셋의 x, y 값을 앵커 박스의 크기로 스케일링한 후 중심 좌표를 더하여 예측된 바운딩 박스의 중심 좌표를 계산합니다.
torch.exp(offset_preds[:, 2:] / 5) * anc[:, 2:]: 예측된 오프셋의 폭, 높이 값을 지수함수로 변환한 후 앵커 박스의 크기로 스케일링하여 예측된 바운딩 박스의 폭과 높이를 계산합니다.
torch.cat((pred_bbox_xy, pred_bbox_wh), axis=1): 중심 좌표와 폭/높이 값을 합쳐서 예측된 바운딩 박스의 정보를 생성합니다.
d2l.box_center_to_corner(pred_bbox): 예측된 바운딩 박스의 중심 좌표와 폭/높이 정보를 좌상단, 우하단 좌표로 변환하여 예측된 바운딩 박스를 생성합니다.

이렇게 생성된 predicted_bbox는 앵커 박스와 예측된 오프셋을 이용하여 예측된 바운딩 박스의 좌상단, 우하단 좌표 정보를 가지고 있는 텐서입니다.

When there are many anchor boxes, many similar (with significant overlap) predicted bounding boxes can be potentially output for surrounding the same object. To simplify the output, we can merge similar predicted bounding boxes that belong to the same object by using non-maximum suppression (NMS).

앵커 박스가 많으면 동일한 객체를 둘러싸는 유사한(상당한 중첩이 있는) 예측 경계 상자가 많이 출력될 수 있습니다. 출력을 단순화하기 위해 NMS(Non-Maximum Suppression)를 사용하여 동일한 객체에 속하는 유사한 예측 경계 상자를 병합할 수 있습니다.

Here is how non-maximum suppression works. For a predicted bounding box B, the object detection model calculates the predicted likelihood for each class. Denoting by p the largest predicted likelihood, the class corresponding to this probability is the predicted class for B. Specifically, we refer to p as the confidence (score) of the predicted bounding box B. On the same image, all the predicted non-background bounding boxes are sorted by confidence in descending order to generate a list L. Then we manipulate the sorted list L in the following steps:

non-maximum suppression이 작동하는 방식은 다음과 같습니다. 예측된 경계 상자 B의 경우 객체 감지 모델은 각 클래스에 대한 예측 가능성을 계산합니다. 가장 큰 예측 가능도를 p로 표시하면 이 확률에 해당하는 클래스가 B에 대한 예측 클래스입니다. 구체적으로 p는 예측된 경계 상자 B의 신뢰도(점수)입니다. 배경 경계 상자는 목록 L을 생성하기 위해 내림차순으로 신뢰도로 정렬됩니다. 그런 다음 다음 단계에서 정렬된 목록 L을 조작합니다.

Select the predicted bounding box B1 with the highest confidence from L as a basis and remove all non-basis predicted bounding boxes whose IoU with B1 exceeds a predefined threshold ϵ from L. At this point, L keeps the predicted bounding box with the highest confidence but drops others that are too similar to it. In a nutshell, those with non-maximum confidence scores are suppressed.

L에서 신뢰도가 가장 높은 예측 경계 상자 B1을 기준으로 선택하고 B1과의 IoU가 L에서 미리 정의된 임계값 ϵ를 초과하는 비기준 예측 경계 상자를 모두 제거합니다. 이 시점에서 L은 예측 경계 상자를 가장 높은 신뢰도로 유지합니다. 그러나 너무 유사한 다른 항목은 삭제합니다. 간단히 말해서 최대 신뢰도 점수가 아닌 항목은 표시되지 않습니다.
Select the predicted bounding box B2 with the second highest confidence from L as another basis and remove all non-basis predicted bounding boxes whose IoU with B2 exceeds ϵ from L.

L에서 두 번째로 신뢰도가 높은 예측 경계 상자 B2를 다른 기준으로 선택하고 B2의 IoU가 L에서 ϵ를 초과하는 비기준 예측 경계 상자를 모두 제거합니다.
Repeat the above process until all the predicted bounding boxes in L have been used as a basis. At this time, the IoU of any pair of predicted bounding boxes in L is below the threshold ϵ; thus, no pair is too similar with each other.

L의 모든 예측된 경계 상자가 기준으로 사용될 때까지 위의 프로세스를 반복합니다. 이때 L에 있는 예측된 경계 상자 쌍의 IoU는 임계값 ϵ 미만입니다. 따라서 어떤 쌍도 서로 너무 유사하지 않습니다.
Output all the predicted bounding boxes in the list L.

목록 L의 모든 예측 경계 상자를 출력합니다.

The following nms function sorts confidence scores in descending order and returns their indices.

다음 nms 함수는 신뢰도 점수를 내림차순으로 정렬하고 해당 인덱스를 반환합니다.

#@save
def nms(boxes, scores, iou_threshold):
    """Sort confidence scores of predicted bounding boxes."""
    B = torch.argsort(scores, dim=-1, descending=True)
    keep = []  # Indices of predicted bounding boxes that will be kept
    while B.numel() > 0:
        i = B[0]
        keep.append(i)
        if B.numel() == 1: break
        iou = box_iou(boxes[i, :].reshape(-1, 4),
                      boxes[B[1:], :].reshape(-1, 4)).reshape(-1)
        inds = torch.nonzero(iou <= iou_threshold).reshape(-1)
        B = B[inds + 1]
    return torch.tensor(keep, device=boxes.device)

위 코드는 비최대 억제(NMS, Non-Maximum Suppression) 알고리즘을 사용하여 중복된 예측된 바운딩 박스를 제거하는 함수인 nms를 정의합니다.

boxes: 예측된 바운딩 박스의 좌표 정보를 가지고 있는 텐서입니다.
scores: 예측된 바운딩 박스의 신뢰도 점수를 가지고 있는 텐서입니다.
iou_threshold: IoU 임계값으로, 이 값을 넘는 예측된 바운딩 박스들은 중복으로 간주됩니다.

위 함수는 다음과 같은 단계로 동작합니다:

torch.argsort(scores, dim=-1, descending=True): 예측된 바운딩 박스의 신뢰도 점수를 내림차순으로 정렬하여 인덱스를 얻습니다.
B[0]: 가장 높은 신뢰도 점수를 가진 바운딩 박스의 인덱스를 선택합니다.
keep.append(i): 선택된 바운딩 박스의 인덱스를 keep 리스트에 추가합니다.
iou = box_iou(boxes[i, :].reshape(-1, 4), boxes[B[1:], :].reshape(-1, 4)).reshape(-1): 선택된 바운딩 박스와 다른 모든 바운딩 박스 간의 IoU를 계산합니다.
inds = torch.nonzero(iou <= iou_threshold).reshape(-1): IoU 값이 임계값보다 작거나 같은 바운딩 박스들의 인덱스를 찾습니다.
B = B[inds + 1]: 중복으로 간주되지 않은 바운딩 박스들의 인덱스를 업데이트합니다.
위 과정을 반복하면서 중복으로 간주되지 않은 바운딩 박스들의 인덱스를 keep 리스트에 추가합니다.
return torch.tensor(keep, device=boxes.device): 최종적으로 선택된 바운딩 박스들의 인덱스를 텐서로 반환합니다.

이렇게 선택된 바운딩 박스들은 중복된 예측된 바운딩 박스들을 제거하는 데 사용됩니다.

We define the following multibox_detection to apply non-maximum suppression to predicting bounding boxes. Do not worry if you find the implementation a bit complicated: we will show how it works with a concrete example right after the implementation.

예측 경계 상자에 최대가 아닌 억제를 적용하기 위해 다음 multibox_detection을 정의합니다. 구현이 약간 복잡하다고 생각되더라도 걱정하지 마십시오. 구현 직후 구체적인 예를 통해 어떻게 작동하는지 보여드리겠습니다.

#@save
def multibox_detection(cls_probs, offset_preds, anchors, nms_threshold=0.5,
                       pos_threshold=0.009999999):
    """Predict bounding boxes using non-maximum suppression."""
    device, batch_size = cls_probs.device, cls_probs.shape[0]
    anchors = anchors.squeeze(0)
    num_classes, num_anchors = cls_probs.shape[1], cls_probs.shape[2]
    out = []
    for i in range(batch_size):
        cls_prob, offset_pred = cls_probs[i], offset_preds[i].reshape(-1, 4)
        conf, class_id = torch.max(cls_prob[1:], 0)
        predicted_bb = offset_inverse(anchors, offset_pred)
        keep = nms(predicted_bb, conf, nms_threshold)
        # Find all non-`keep` indices and set the class to background
        all_idx = torch.arange(num_anchors, dtype=torch.long, device=device)
        combined = torch.cat((keep, all_idx))
        uniques, counts = combined.unique(return_counts=True)
        non_keep = uniques[counts == 1]
        all_id_sorted = torch.cat((keep, non_keep))
        class_id[non_keep] = -1
        class_id = class_id[all_id_sorted]
        conf, predicted_bb = conf[all_id_sorted], predicted_bb[all_id_sorted]
        # Here `pos_threshold` is a threshold for positive (non-background)
        # predictions
        below_min_idx = (conf < pos_threshold)
        class_id[below_min_idx] = -1
        conf[below_min_idx] = 1 - conf[below_min_idx]
        pred_info = torch.cat((class_id.unsqueeze(1),
                               conf.unsqueeze(1),
                               predicted_bb), dim=1)
        out.append(pred_info)
    return torch.stack(out)

위 코드는 Non-Maximum Suppression(NMS) 알고리즘을 사용하여 예측된 바운딩 박스에 대한 최종 예측 결과를 생성하는 함수인 multibox_detection을 정의합니다.

이 함수는 다음과 같은 단계로 동작합니다:

device, batch_size = cls_probs.device, cls_probs.shape[0]: 디바이스와 배치 크기 정보를 가져옵니다.
anchors = anchors.squeeze(0): 앵커 박스의 차원을 조정합니다.
cls_probs[i], offset_preds[i].reshape(-1, 4): i번째 배치의 클래스 확률과 예측된 오프셋 값을 가져옵니다.
conf, class_id = torch.max(cls_prob[1:], 0): 클래스 확률 중 가장 높은 값을 선택하고 그 인덱스와 값을 가져옵니다.
offset_inverse(anchors, offset_pred): 예측된 오프셋 값을 사용하여 예측된 바운딩 박스를 계산합니다.
keep = nms(predicted_bb, conf, nms_threshold): NMS를 사용하여 중복된 예측된 바운딩 박스를 제거한 인덱스를 가져옵니다.
중복되지 않은 바운딩 박스 인덱스와 모든 인덱스를 합쳐서 유니크한 값과 빈도수를 계산합니다.
중복되지 않은 인덱스들을 찾아 배경 클래스(-1)로 설정하고, 클래스 및 신뢰도 값을 갱신합니다.
pos_threshold 값보다 작은 신뢰도 값을 가진 인덱스를 찾아 배경 클래스로 설정하고, 신뢰도 값을 업데이트합니다.
최종 예측 정보를 생성하여 반환합니다.

즉, 이 함수는 NMS를 사용하여 중복된 예측된 바운딩 박스를 제거하고, 최종 예측된 클래스와 신뢰도, 바운딩 박스 정보를 생성합니다.

Now let’s apply the above implementations to a concrete example with four anchor boxes. For simplicity, we assume that the predicted offsets are all zeros. This means that the predicted bounding boxes are anchor boxes. For each class among the background, dog, and cat, we also define its predicted likelihood.

이제 위의 구현을 4개의 앵커 상자가 있는 구체적인 예에 적용해 보겠습니다. 단순화를 위해 예측된 오프셋이 모두 0이라고 가정합니다. 이는 예측된 경계 상자가 앵커 상자임을 의미합니다. 배경, 개, 고양이 중 각 클래스에 대해 예측 가능성도 정의합니다.

anchors = torch.tensor([[0.1, 0.08, 0.52, 0.92], [0.08, 0.2, 0.56, 0.95],
                      [0.15, 0.3, 0.62, 0.91], [0.55, 0.2, 0.9, 0.88]])
offset_preds = torch.tensor([0] * anchors.numel())
cls_probs = torch.tensor([[0] * 4,  # Predicted background likelihood
                      [0.9, 0.8, 0.7, 0.1],  # Predicted dog likelihood
                      [0.1, 0.2, 0.3, 0.9]])  # Predicted cat likelihood

위 코드는 예시 데이터를 사용하여 multibox_detection 함수를 호출하는 부분입니다. 여기서 주어진 anchors, offset_preds, cls_probs 데이터는 예측된 바운딩 박스의 위치, 오프셋, 클래스 확률을 나타내는 정보입니다.

anchors: 예측된 바운딩 박스의 앵커 정보입니다. 여기서 각 행은 각 바운딩 박스의 좌상단과 우하단 좌표를 나타냅니다.
offset_preds: 예측된 바운딩 박스의 오프셋 정보입니다. 여기서는 모든 값이 0으로 초기화되어 있습니다.
cls_probs: 예측된 바운딩 박스의 클래스 확률 정보입니다. 각 행은 각 바운딩 박스에 대한 클래스 확률을 나타냅니다. 첫 번째 행은 배경 클래스(클래스 0)에 대한 확률을 나타내고, 나머지 두 행은 각각 개(dog)와 고양이(cat) 클래스에 대한 확률을 나타냅니다.

이렇게 예측된 앵커, 오프셋, 클래스 확률 정보를 가지고 multibox_detection 함수를 호출하면, 예측된 바운딩 박스들을 NMS를 통해 최적화하여 최종 예측 결과를 생성합니다.

We can plot these predicted bounding boxes with their confidence on the image.

이미지에 대한 확신을 가지고 예측된 경계 상자를 그릴 수 있습니다.

fig = d2l.plt.imshow(img)
show_bboxes(fig.axes, anchors * bbox_scale,
            ['dog=0.9', 'dog=0.8', 'dog=0.7', 'cat=0.9'])

위 코드는 이미지 위에 예측된 앵커 박스와 해당 클래스에 대한 확률 정보를 표시하는 부분입니다.

fig = d2l.plt.imshow(img): 입력 이미지를 시각화하는 부분입니다. d2l.plt.imshow() 함수는 이미지를 화면에 표시합니다.
show_bboxes(fig.axes, anchors * bbox_scale, ['dog=0.9', 'dog=0.8', 'dog=0.7', 'cat=0.9']): show_bboxes 함수를 호출하여 앵커 박스와 해당 클래스에 대한 확률 정보를 시각화합니다. 이 함수는 이미지 위에 각 앵커 박스를 사각형으로 그리고, 사각형 안에 해당 클래스와 확률을 표시합니다. anchors는 예측된 앵커 박스들을 나타내는 텐서입니다. bbox_scale은 이미지 크기를 앵커 박스 크기에 맞게 조정하기 위한 스케일링 인자입니다. 마지막으로, 사각형 안에 나타나는 문자열은 해당 클래스와 확률을 나타냅니다.

Now we can invoke the multibox_detection function to perform non-maximum suppression, where the threshold is set to 0.5. Note that we add a dimension for examples in the tensor input.

이제 multibox_detection 함수를 호출하여 임계값이 0.5로 설정된 비최대 억제를 수행할 수 있습니다. 텐서 입력에 예제에 대한 차원을 추가합니다.

We can see that the shape of the returned result is (batch size, number of anchor boxes, 6). The six elements in the innermost dimension gives the output information for the same predicted bounding box. The first element is the predicted class index, which starts from 0 (0 is dog and 1 is cat). The value -1 indicates background or removal in non-maximum suppression. The second element is the confidence of the predicted bounding box. The remaining four elements are the (x,y)-axis coordinates of the upper-left corner and the lower-right corner of the predicted bounding box, respectively (range is between 0 and 1).

반환된 결과의 모양이 (배치 크기, 앵커 상자 수, 6)임을 알 수 있습니다. 가장 안쪽 차원의 6개 요소는 동일한 예측 경계 상자에 대한 출력 정보를 제공합니다. 첫 번째 요소는 0부터 시작하는 예측 클래스 인덱스입니다(0은 개, 1은 고양이). 값 -1은 최대가 아닌 억제에서 백그라운드 또는 제거를 나타냅니다. 두 번째 요소는 예측된 경계 상자의 신뢰도입니다. 나머지 4개의 요소는 각각 예측된 경계 상자의 왼쪽 위 모서리와 오른쪽 아래 모서리의 (x,y)축 좌표입니다(범위는 0과 1 사이).

output = multibox_detection(cls_probs.unsqueeze(dim=0),
                            offset_preds.unsqueeze(dim=0),
                            anchors.unsqueeze(dim=0),
                            nms_threshold=0.5)
output

위 코드는 다중 객체 탐지 모델의 출력을 계산하는 부분입니다.

cls_probs.unsqueeze(dim=0): cls_probs 텐서를 배치 차원을 추가하여 모델의 출력 형태로 변환합니다. 이것은 클래스 확률 예측 텐서입니다.
offset_preds.unsqueeze(dim=0): offset_preds 텐서를 배치 차원을 추가하여 모델의 출력 형태로 변환합니다. 이것은 박스 오프셋 예측 텐서입니다.
anchors.unsqueeze(dim=0): anchors 텐서를 배치 차원을 추가하여 모델의 출력 형태로 변환합니다. 이것은 앵커 박스의 좌표 텐서입니다.
nms_threshold=0.5: 비최대 억제(NMS)에 사용되는 임계값을 설정합니다. NMS는 겹치는 박스 중에서 가장 확률이 높은 박스만 선택하는 기술입니다.

위의 모델 출력 계산을 multibox_detection 함수에 적용하여 다중 객체 탐지 결과를 얻습니다. 결과로 얻는 텐서는 다중 객체 탐지 결과입니다. 이 결과 텐서에는 예측된 객체의 클래스 ID, 확률 및 박스 좌표 정보가 포함되어 있습니다.

tensor([[[ 0.00,  0.90,  0.10,  0.08,  0.52,  0.92],
         [ 1.00,  0.90,  0.55,  0.20,  0.90,  0.88],
         [-1.00,  0.80,  0.08,  0.20,  0.56,  0.95],
         [-1.00,  0.70,  0.15,  0.30,  0.62,  0.91]]])

After removing those predicted bounding boxes of class -1, we can output the final predicted bounding box kept by non-maximum suppression.

클래스 -1의 예측 경계 상자를 제거한 후 비최대 억제로 유지되는 최종 예측 경계 상자를 출력할 수 있습니다.

fig = d2l.plt.imshow(img)
for i in output[0].detach().numpy():
    if i[0] == -1:
        continue
    label = ('dog=', 'cat=')[int(i[0])] + str(i[1])
    show_bboxes(fig.axes, [torch.tensor(i[2:]) * bbox_scale], label)

위 코드는 다중 객체 탐지 모델의 결과를 시각화하는 부분입니다.

fig = d2l.plt.imshow(img): 입력 이미지를 시각화합니다.
for i in output[0].detach().numpy():: 다중 객체 탐지 모델의 결과를 하나씩 반복하면서 처리합니다. output 텐서의 첫 번째 배치의 결과를 사용합니다.
if i[0] == -1:: 만약 객체의 클래스 ID가 -1이면 (배경인 경우) 건너뜁니다.
label = ('dog=', 'cat=')[int(i[0])] + str(i[1]): 클래스 ID에 따라 객체의 레이블을 생성합니다. 클래스 ID 0은 'dog=', 클래스 ID 1은 'cat='을 갖습니다. 그리고 해당 클래스의 확률 값을 문자열로 추가합니다.
show_bboxes(fig.axes, [torch.tensor(i[2:]) * bbox_scale], label): show_bboxes 함수를 사용하여 이미지에 예측된 박스를 그려줍니다. i[2:]는 박스의 좌표 정보를 나타내며, bbox_scale을 곱하여 실제 이미지 크기에 맞게 스케일 조정한 값으로 설정됩니다. 그리고 label을 표시하여 어떤 클래스의 객체인지와 확률을 함께 나타냅니다.

이 코드는 모델의 출력을 바탕으로 예측된 객체의 박스를 시각화하고, 각 박스 위에 해당하는 레이블과 확률 값을 표시하는 부분입니다.

In practice, we can remove predicted bounding boxes with lower confidence even before performing non-maximum suppression, thereby reducing computation in this algorithm. We may also post-process the output of non-maximum suppression, for example, by only keeping results with higher confidence in the final output.

실제로 non-maximum suppression를 수행하기 전에도 낮은 신뢰도로 예측된 경계 상자를 제거할 수 있으므로 이 알고리즘의 계산을 줄일 수 있습니다. 예를 들어 최종 출력에서 신뢰도가 더 높은 결과만 유지하여 최대가 아닌 억제의 출력을 사후 처리할 수도 있습니다.

NMS (Non-Maximum Suppression)이란?

Non-Maximum Suppression (NMS) is a post-processing technique commonly used in object detection tasks to reduce duplicate or overlapping bounding box predictions and retain only the most relevant and accurate ones. NMS helps refine the output of object detection algorithms by selecting the most appropriate bounding boxes for detected objects while eliminating redundant and overlapping detections.

비최대 억제(NMS, Non-Maximum Suppression)는 객체 탐지 작업에서 널리 사용되는 후 처리 기법으로, 중복되거나 겹치는 바운딩 박스 예측을 줄이고 가장 관련성 높고 정확한 예측만 보존하는 데 사용됩니다. NMS는 객체 탐지 알고리즘의 출력을 개선하기 위해 감지된 객체에 가장 적절한 바운딩 박스를 선택하고 중복 및 겹침을 없애는 역할을 합니다.

The primary goal of NMS is to eliminate multiple bounding box predictions that cover the same object. This is often the case when object detection models generate multiple bounding boxes for the same object due to different scales, positions, or overlapping receptive fields. NMS considers the confidence scores associated with each bounding box prediction to determine the most likely accurate ones.

NMS의 주요 목표는 동일한 객체를 포함하는 여러 바운딩 박스 예측을 제거하는 것입니다. 이는 객체 탐지 모델이 서로 다른 크기, 위치 또는 겹치는 수용 영역 때문에 동일한 객체에 대해 여러 개의 바운딩 박스를 생성하는 경우가 많기 때문입니다. NMS는 각 바운딩 박스 예측과 연관된 신뢰도 점수를 고려하여 가장 정확한 것을 판별합니다.

Here's how the Non-Maximum Suppression algorithm works:

다음은 비최대 억제 알고리즘의 작동 방식입니다.

Input: The algorithm takes a set of bounding box predictions along with their associated confidence scores.

입력: 알고리즘은 바운딩 박스 예측 세트와 연관된 신뢰도 점수를 가져옵니다.
Sorting: The bounding box predictions are first sorted based on their confidence scores in descending order. This ensures that the bounding box with the highest confidence score is considered first.

정렬: 바운딩 박스 예측은 먼저 신뢰도 점수를 내림차순으로 정렬됩니다. 이렇게 하면 가장 높은 신뢰도 점수를 가진 바운딩 박스가 먼저 고려됩니다.
Suppression: Starting from the bounding box with the highest confidence score, NMS compares it with the remaining bounding boxes. If the Intersection over Union (IoU) between the two bounding boxes exceeds a certain threshold (often defined by the user), the bounding box with the lower confidence score is suppressed or discarded.

억제: 가장 높은 신뢰도 점수를 가진 바운딩 박스부터 시작하여 NMS는 나머지 바운딩 박스와 비교합니다. 두 바운딩 박스 사이의 교집합 비율 (IoU)이 일정한 임계값을 초과하면 낮은 신뢰도 점수를 가진 바운딩 박스가 억제되거나 삭제됩니다.
Iteration: The process is repeated for the remaining bounding boxes, excluding those that have been suppressed.

반복: 남은 바운딩 박스에 대해 프로세스가 반복됩니다. 억제된 바운딩 박스를 제외하고 수행됩니다.
Output: The result is a reduced set of non-overlapping and accurate bounding box predictions, each associated with its corresponding confidence score.

출력: 결과는 겹치지 않는 정확한 바운딩 박스 예측의 줄어든 세트이며, 각각 해당하는 신뢰도 점수가 포함됩니다.

The threshold used in NMS plays a critical role in controlling the level of overlap allowed between bounding boxes. If the threshold is set too high, it may remove valid bounding boxes. Conversely, if it is set too low, redundant bounding boxes might still be present.

NMS에서 사용되는 임계값은 바운딩 박스 간에 허용되는 겹침 수준을 조절하는 데 중요한 역할을 합니다. 임계값이 너무 높게 설정되면 유효한 바운딩 박스가 제거될 수 있습니다. 반대로 너무 낮게 설정하면 중복 바운딩 박스가 여전히 존재할 수 있습니다.

NMS is an effective technique to refine object detection results, improving the precision of the algorithm and reducing the likelihood of multiple detections for the same object. It is a crucial step in achieving accurate and reliable object localization and classification in tasks like object detection and instance segmentation.

NMS는 객체 탐지 결과를 개선하는 효과적인 기술로, 알고리즘의 정밀도를 향상시키고 동일한 객체에 대한 다중 감지 가능성을 줄이는 데 중요한 역할을 합니다. 객체 탐지 및 인스턴스 분할과 같은 작업에서 정확하고 신뢰할 수 있는 객체 위치 및 분류를 달성하는 중요한 단계입니다.

14.4.5. Summary

We generate anchor boxes with different shapes centered on each pixel of the image.

이미지의 각 픽셀을 중심으로 다양한 모양의 앵커 박스를 생성합니다.
Intersection over union (IoU), also known as Jaccard index, measures the similarity of two bounding boxes. It is the ratio of their intersection area to their union area.

Jaccard 인덱스라고도 하는 IoU(Intersection over Union)는 두 경계 상자의 유사성을 측정합니다. 교차 영역과 결합 영역의 비율입니다.
In a training set, we need two types of labels for each anchor box. One is the class of the object relevant to the anchor box and the other is the offset of the ground-truth bounding box relative to the anchor box.

트레이닝 세트에서는 각 앵커 박스에 대해 두 가지 유형의 레이블이 필요합니다. 하나는 앵커 상자와 관련된 객체의 클래스이고 다른 하나는 앵커 상자에 대한 실측 경계 상자의 오프셋입니다.
During prediction, we can use non-maximum suppression (NMS) to remove similar predicted bounding boxes, thereby simplifying the output.

예측 중에 비최대 억제(NMS)를 사용하여 유사한 예측된 경계 상자를 제거하여 출력을 단순화할 수 있습니다.

14.4.6. Exercises

Change values of sizes and ratios in the multibox_prior function. What are the changes to the generated anchor boxes?
Construct and visualize two bounding boxes with an IoU of 0.5. How do they overlap with each other?
Modify the variable anchors in Section 14.4.3 and Section 14.4.4. How do the results change?
Non-maximum suppression is a greedy algorithm that suppresses predicted bounding boxes by removing them. Is it possible that some of these removed ones are actually useful? How can this algorithm be modified to suppress softly? You may refer to Soft-NMS (Bodla et al., 2017).
Rather than being hand-crafted, can non-maximum suppression be learned?

'Dive into Deep Learning > D2L Computer Vision' 카테고리의 다른 글

D2L - 14.10. Transposed Convolution (1)	2023.08.21
D2L - 14.9. Semantic Segmentation and the Dataset (0)	2023.08.20
D2L - 14.8. Region-based CNNs (R-CNNs) (0)	2023.08.20
D2L - 14.7. Single Shot Multibox Detection (0)	2023.08.19
D2L - 14.6. The Object Detection Dataset (0)	2023.08.19
D2L - 14.5. Multiscale Object Detection (0)	2023.08.19
D2L - 14.3. Object Detection and Bounding Boxes (0)	2023.08.19
D2L - 14.2. Fine-Tuning (0)	2023.08.19
D2L - 14.1. Image Augmentation (0)	2023.08.19
D2L - 14. Computer Vision (0)	2023.08.18

Dive into Deep Learning/D2L Computer Vision

D2L - 14.3. Object Detection and Bounding Boxes

2023. 8. 19. 03:51 | Posted by 솔웅

https://d2l.ai/chapter_computer-vision/bounding-box.html

14.3. Object Detection and Bounding Boxes — Dive into Deep Learning 1.0.0 documentation

d2l.ai

In earlier sections (e.g., Section 8.1–Section 8.4), we introduced various models for image classification. In image classification tasks, we assume that there is only one major object in the image and we only focus on how to recognize its category. However, there are often multiple objects in the image of interest. We not only want to know their categories, but also their specific positions in the image. In computer vision, we refer to such tasks as object detection (or object recognition).

이전 섹션(예: 섹션 8.1–섹션 8.4)에서 이미지 분류를 위한 다양한 모델을 소개했습니다. 이미지 분류 작업에서는 이미지에 주요 객체가 하나만 있다고 가정하고 해당 범주를 인식하는 방법에만 집중합니다. 그러나 관심 있는 이미지에는 여러 개체가 있는 경우가 많습니다. 우리는 그들의 범주뿐만 아니라 이미지에서의 특정 위치도 알고 싶어합니다. 컴퓨터 비전에서는 객체 감지(또는 객체 인식)와 같은 작업을 참조합니다.

Object detection has been widely applied in many fields. For example, self-driving needs to plan traveling routes by detecting the positions of vehicles, pedestrians, roads, and obstacles in the captured video images. Besides, robots may use this technique to detect and localize objects of interest throughout its navigation of an environment. Moreover, security systems may need to detect abnormal objects, such as intruders or bombs.

객체 감지는 많은 분야에서 광범위하게 적용되었습니다. 예를 들어 자율주행차는 촬영한 영상에서 차량, 보행자, 도로, 장애물 등의 위치를 감지해 주행 경로를 계획해야 한다. 게다가 로봇은 이 기술을 사용하여 환경을 탐색하는 동안 관심 대상을 감지하고 위치를 파악할 수 있습니다. 또한 보안 시스템은 침입자나 폭탄과 같은 비정상적인 물체를 감지해야 할 수도 있습니다.

In the next few sections, we will introduce several deep learning methods for object detection. We will begin with an introduction to positions (or locations) of objects.

다음 몇 섹션에서는 객체 감지를 위한 몇 가지 딥 러닝 방법을 소개합니다. 객체의 위치(또는 위치)에 대한 소개부터 시작하겠습니다.

%matplotlib inline
import torch
from d2l import torch as d2l

위 코드는 Matplotlib을 사용하여 이미지와 그래프를 Jupyter Notebook에서 표시하는 설정을 수행하는 것입니다.

%matplotlib inline: 이 코드는 Jupyter Notebook에서 Matplotlib의 그림을 노트북 셀 안에 직접 표시하도록 설정하는 명령입니다. 그래프나 이미지를 출력하면 노트북 안에서 바로 확인할 수 있게 됩니다.
import torch: PyTorch 라이브러리를 가져오는 코드입니다.
from d2l import torch as d2l: d2l (Dive into Deep Learning) 라이브러리에서 torch 모듈을 가져오는데, 이렇게 함으로써 d2l 라이브러리의 torch 모듈을 d2l이라는 이름으로 사용할 수 있게 됩니다. 이 라이브러리는 딥러닝과 관련된 여러 유용한 함수와 도구들을 제공합니다.

We will load the sample image to be used in this section. We can see that there is a dog on the left side of the image and a cat on the right. They are the two major objects in this image.

이 섹션에서 사용할 샘플 이미지를 로드합니다. 이미지의 왼쪽에는 개가 있고 오른쪽에는 고양이가 있는 것을 볼 수 있습니다. 이 이미지에서 두 개의 주요 객체입니다.

d2l.set_figsize()
img = d2l.plt.imread('../img/catdog.jpg')
d2l.plt.imshow(img);

위 코드는 d2l 라이브러리를 사용하여 이미지를 표시하는 과정을 수행하는 코드입니다.

d2l.set_figsize(): Matplotlib 그래프의 크기를 설정하는 함수입니다. d2l 라이브러리의 set_figsize() 함수를 호출하여 그래프의 크기를 미리 정의된 기본 크기로 설정합니다.
img = d2l.plt.imread('../img/catdog.jpg'): 이미지 파일을 읽어와 변수 img에 저장합니다. 이미지 파일은 ../img/catdog.jpg 경로에서 읽어옵니다. d2l.plt.imread() 함수는 이미지 파일을 NumPy 배열로 변환하여 반환합니다.
d2l.plt.imshow(img): Matplotlib의 imshow() 함수를 사용하여 이미지를 표시합니다. img에 저장된 NumPy 배열 이미지를 그래프 상에 표시합니다. 이 때, 이미지는 해당 경로의 파일을 읽어온 것이므로 이 코드는 경로에 해당하는 이미지를 출력합니다.

14.3.1. Bounding Boxes

In object detection, we usually use a bounding box to describe the spatial location of an object. The bounding box is rectangular, which is determined by the x and y coordinates of the upper-left corner of the rectangle and the such coordinates of the lower-right corner. Another commonly used bounding box representation is the (x,y)-axis coordinates of the bounding box center, and the width and height of the box.

객체 감지에서 우리는 일반적으로 bounding box를 사용하여 객체의 공간적 위치(spatial location)를 설명합니다. 경계 상자 bounding box는 직사각형이며 직사각형의 왼쪽 위 모서리의 x 및 y 좌표와 오른쪽 아래 모서리의 이러한 좌표에 의해 결정됩니다. 일반적으로 사용되는 또 다른 경계 상자 bounding box 표현은 경계 상자 bounding box 중심의 (x,y)축 좌표와 상자의 너비 및 높이입니다.

Here we define functions to convert between these two representations: box_corner_to_center converts from the two-corner representation to the center-width-height presentation, and box_center_to_corner vice versa. The input argument boxes should be a two-dimensional tensor of shape (n, 4), where n is the number of bounding boxes.

여기서 우리는 이 두 가지 표현 사이를 변환하는 함수를 정의합니다. box_corner_to_center는 두 모서리 표현에서 중앙 너비 표현으로 변환하고 box_center_to_corner는 그 반대로 변환합니다. 입력 인수 상자는 모양이 (n, 4)인 2차원 텐서여야 합니다. 여기서 n은 경계 상자의 수입니다.

#@save
def box_corner_to_center(boxes):
    """Convert from (upper-left, lower-right) to (center, width, height)."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    cx = (x1 + x2) / 2
    cy = (y1 + y2) / 2
    w = x2 - x1
    h = y2 - y1
    boxes = torch.stack((cx, cy, w, h), axis=-1)
    return boxes

#@save
def box_center_to_corner(boxes):
    """Convert from (center, width, height) to (upper-left, lower-right)."""
    cx, cy, w, h = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    x1 = cx - 0.5 * w
    y1 = cy - 0.5 * h
    x2 = cx + 0.5 * w
    y2 = cy + 0.5 * h
    boxes = torch.stack((x1, y1, x2, y2), axis=-1)
    return boxes

위 코드는 바운딩 박스 좌표 변환에 사용되는 두 개의 함수를 정의하는 코드입니다.

box_corner_to_center(boxes): 이 함수는 주어진 바운딩 박스의 좌표를 (상단 왼쪽, 하단 오른쪽) 형식에서 (중심, 너비, 높이) 형식으로 변환하는 역할을 합니다.
- 입력: boxes는 바운딩 박스의 좌표를 나타내는 텐서입니다. 각 행은 (상단 왼쪽 x, 상단 왼쪽 y, 하단 오른쪽 x, 하단 오른쪽 y) 형식으로 된 좌표를 가지고 있습니다.
- 출력: 변환된 바운딩 박스의 좌표를 나타내는 텐서입니다. 각 행은 (중심 x, 중심 y, 너비, 높이) 형식으로 된 좌표를 가지고 있습니다.
box_center_to_corner(boxes): 이 함수는 주어진 바운딩 박스의 좌표를 (중심, 너비, 높이) 형식에서 (상단 왼쪽, 하단 오른쪽) 형식으로 변환하는 역할을 합니다.
- 입력: boxes는 바운딩 박스의 좌표를 나타내는 텐서입니다. 각 행은 (중심 x, 중심 y, 너비, 높이) 형식으로 된 좌표를 가지고 있습니다.
- 출력: 변환된 바운딩 박스의 좌표를 나타내는 텐서입니다. 각 행은 (상단 왼쪽 x, 상단 왼쪽 y, 하단 오른쪽 x, 하단 오른쪽 y) 형식으로 된 좌표를 가지고 있습니다.

We will define the bounding boxes of the dog and the cat in the image based on the coordinate information. The origin of the coordinates in the image is the upper-left corner of the image, and to the right and down are the positive directions of the x and y axes, respectively.

좌표 정보를 기반으로 이미지에서 강아지와 고양이의 경계 상자를 정의합니다. 이미지에서 좌표의 원점은 이미지의 왼쪽 위 모서리이고 오른쪽과 아래는 각각 x축과 y축의 양의 방향입니다.

# Here `bbox` is the abbreviation for bounding box
dog_bbox, cat_bbox = [60.0, 45.0, 378.0, 516.0], [400.0, 112.0, 655.0, 493.0]

위 코드에서는 개와 고양이의 바운딩 박스 좌표를 리스트로 정의하고 있습니다.

dog_bbox: 개의 바운딩 박스 좌표입니다. 각 값은 순서대로 (상단 왼쪽 x, 상단 왼쪽 y, 하단 오른쪽 x, 하단 오른쪽 y) 형식으로 바운딩 박스의 좌표를 나타냅니다.
cat_bbox: 고양이의 바운딩 박스 좌표입니다. 각 값은 순서대로 (상단 왼쪽 x, 상단 왼쪽 y, 하단 오른쪽 x, 하단 오른쪽 y) 형식으로 바운딩 박스의 좌표를 나타냅니다.

이렇게 정의된 바운딩 박스 좌표는 이미지에서 개와 고양이 객체의 위치를 나타내기 위해 사용될 수 있습니다.

We can verify the correctness of the two bounding box conversion functions by converting twice.

두 번 변환하여 두 경계 상자 변환 함수의 정확성을 확인할 수 있습니다.

boxes = torch.tensor((dog_bbox, cat_bbox))
box_center_to_corner(box_corner_to_center(boxes)) == boxes

tensor([[True, True, True, True],
        [True, True, True, True]])

위 코드는 바운딩 박스 좌표 변환 함수를 사용하여 변환 전과 변환 후의 좌표가 동일한지 확인하는 작업을 수행합니다.

boxes 텐서를 정의합니다. 이는 개와 고양이의 바운딩 박스 좌표가 포함되어 있습니다.
box_corner_to_center 함수를 사용하여 바운딩 박스 좌표를 (상자의 중심, 너비, 높이) 형식으로 변환합니다.
변환된 좌표를 box_center_to_corner 함수를 사용하여 다시 (상단 왼쪽 x, 상단 왼쪽 y, 하단 오른쪽 x, 하단 오른쪽 y) 형식으로 변환합니다.
마지막으로 변환된 좌표와 원래 좌표 boxes가 동일한지 비교합니다.

즉, 이 코드는 바운딩 박스 좌표 변환 함수들이 서로 상호 호환되는지 검증하는 과정을 나타냅니다. 만약 변환 함수들이 정확하게 작동하고 서로 호환된다면, 변환 전과 후의 좌표는 동일해야 합니다.

Let’s draw the bounding boxes in the image to check if they are accurate. Before drawing, we will define a helper function bbox_to_rect. It represents the bounding box in the bounding box format of the matplotlib package.

정확한지 확인하기 위해 이미지에 경계 상자를 그려 봅시다. 그리기 전에 도우미 함수 bbox_to_rect를 정의합니다. matplotlib 패키지의 경계 상자 형식으로 경계 상자를 나타냅니다.

#@save
def bbox_to_rect(bbox, color):
    """Convert bounding box to matplotlib format."""
    # Convert the bounding box (upper-left x, upper-left y, lower-right x,
    # lower-right y) format to the matplotlib format: ((upper-left x,
    # upper-left y), width, height)
    return d2l.plt.Rectangle(
        xy=(bbox[0], bbox[1]), width=bbox[2]-bbox[0], height=bbox[3]-bbox[1],
        fill=False, edgecolor=color, linewidth=2)

위 코드는 바운딩 박스 좌표를 Matplotlib의 형식으로 변환하는 함수인 bbox_to_rect를 정의합니다.

bbox: 변환할 바운딩 박스 좌표입니다.
color: Matplotlib에서 사용할 선의 색상입니다.

이 함수는 바운딩 박스의 좌표를 (상단 왼쪽 x, 상단 왼쪽 y, 하단 오른쪽 x, 하단 오른쪽 y) 형식에서 Matplotlib 형식인 ((상단 왼쪽 x, 상단 왼쪽 y), width, height) 형식으로 변환합니다. 변환된 정보를 바탕으로 plt.Rectangle을 생성하여 반환합니다. 이 때 선의 색상과 선의 굵기 등을 설정하여 시각적인 효과를 부여합니다. 반환된 객체는 그래프에 추가하여 바운딩 박스를 시각적으로 나타낼 수 있습니다.

After adding the bounding boxes on the image, we can see that the main outline of the two objects are basically inside the two boxes.

이미지에 테두리 상자를 추가한 후 두 개체의 주요 윤곽선이 기본적으로 두 상자 안에 있음을 알 수 있습니다.

fig = d2l.plt.imshow(img)
fig.axes.add_patch(bbox_to_rect(dog_bbox, 'blue'))
fig.axes.add_patch(bbox_to_rect(cat_bbox, 'red'));

위 코드는 Matplotlib를 사용하여 이미지 위에 바운딩 박스를 그리는 예시입니다.

fig = d2l.plt.imshow(img): 이미지를 플로팅합니다.
fig.axes.add_patch(bbox_to_rect(dog_bbox, 'blue')): 강아지 바운딩 박스를 파란색 선으로 그립니다.
fig.axes.add_patch(bbox_to_rect(cat_bbox, 'red')): 고양이 바운딩 박스를 빨간색 선으로 그립니다.

이렇게 하면 bbox_to_rect 함수를 사용하여 바운딩 박스를 Matplotlib의 Rectangle로 변환하고, 해당 Rectangle 객체를 이미지 위에 추가하여 바운딩 박스를 시각적으로 표시할 수 있습니다.

14.3.2. Summary

Object detection not only recognizes all the objects of interest in the image, but also their positions. The position is generally represented by a rectangular bounding box.
개체 감지는 이미지에서 관심 있는 모든 개체뿐만 아니라 해당 개체의 위치도 인식합니다. 위치는 일반적으로 직사각형 경계 상자로 표시됩니다.
We can convert between two commonly used bounding box representations.
일반적으로 사용되는 두 가지 경계 상자 표현 간에 변환할 수 있습니다.

14.3.3. Exercises

Find another image and try to label a bounding box that contains the object. Compare labeling bounding boxes and categories: which usually takes longer?
Why is the innermost dimension of the input argument boxes of box_corner_to_center and box_center_to_corner always 4?

'Dive into Deep Learning > D2L Computer Vision' 카테고리의 다른 글

D2L - 14.10. Transposed Convolution (1)	2023.08.21
D2L - 14.9. Semantic Segmentation and the Dataset (0)	2023.08.20
D2L - 14.8. Region-based CNNs (R-CNNs) (0)	2023.08.20
D2L - 14.7. Single Shot Multibox Detection (0)	2023.08.19
D2L - 14.6. The Object Detection Dataset (0)	2023.08.19
D2L - 14.5. Multiscale Object Detection (0)	2023.08.19
D2L - 14.4. Anchor Boxes (0)	2023.08.19
D2L - 14.2. Fine-Tuning (0)	2023.08.19
D2L - 14.1. Image Augmentation (0)	2023.08.19
D2L - 14. Computer Vision (0)	2023.08.18

1 2

공지사항

최근에 올라온 글

최근에 달린 댓글

최근에 받은 트랙백

글 보관함

카테고리

'Dive into Deep Learning/D2L Computer Vision'에 해당되는 글 13건

14.12. Neural Style Transfer

14.12.1. Method

14.12.2. Reading the Content and Style Images

14.12.3. Preprocessing and Postprocessing

14.12.4. Extracting Features

14.12.5. Defining the Loss Function

14.12.5.1. Content Loss

14.12.5.2. Style Loss

14.12.5.3. Total Variation Loss

14.12.5.4. Loss Function

14.12.6. Initializing the Synthesized Image

14.12.7. Training

14.12.8. Summary

14.12.9. Exercises

'Dive into Deep Learning > D2L Computer Vision' 카테고리의 다른 글

14.11.1. The Model

14.11.2. Initializing Transposed Convolutional Layers

14.11.3. Reading the Dataset

14.11.4. Training

14.11.5. Prediction

14.11.6. Summary

14.11.7. Exercises

'Dive into Deep Learning > D2L Computer Vision' 카테고리의 다른 글

14.10. Transposed Convolution

14.10.1. Basic Operation

14.10.2. Padding, Strides, and Multiple Channels

14.10.3. Connection to Matrix Transposition

14.10.4. Summary

14.10.5. Exercises

'Dive into Deep Learning > D2L Computer Vision' 카테고리의 다른 글

14.9. Semantic Segmentation and the Dataset

14.9.1. Image Segmentation and Instance Segmentation

14.9.2. The Pascal VOC2012 Semantic Segmentation Dataset

14.9.2.1. Data Preprocessing

14.9.2.2. Custom Semantic Segmentation Dataset Class

14.9.2.3. Reading the Dataset

14.9.2.4. Putting It All Together

14.9.3. Summary

14.9.4. Exercises

'Dive into Deep Learning > D2L Computer Vision' 카테고리의 다른 글

14.8. Region-based CNNs (R-CNNs)

14.8.1. R-CNNs

14.8.2. Fast R-CNN

14.8.3. Faster R-CNN

14.8.4. Mask R-CNN

14.8.5. Summary

14.8.6. Exercises

'Dive into Deep Learning > D2L Computer Vision' 카테고리의 다른 글

14.7. Single Shot Multibox Detection

14.7.1. Model

14.7.1.1. Class Prediction Layer

14.7.1.2. Bounding Box Prediction Layer

14.7.1.3. Concatenating Predictions for Multiple Scales

14.7.1.4. Downsampling Block

14.7.1.5. Base Network Block

14.7.1.6. The Complete Model

14.7.2. Training

14.7.2.1. Reading the Dataset and Initializing the Model

14.7.2.2. Defining Loss and Evaluation Functions

14.7.2.3. Training the Model

14.7.3. Prediction

14.7.4. Summary

14.7.5. Exercises

'Dive into Deep Learning > D2L Computer Vision' 카테고리의 다른 글

14.6. The Object Detection Dataset

14.6.1. Downloading the Dataset

14.6.2. Reading the Dataset

14.6.3. Demonstration

14.6.4. Summary

14.6.5. Exercises

'Dive into Deep Learning > D2L Computer Vision' 카테고리의 다른 글

14.5. Multiscale Object Detection

14.5.1. Multiscale Anchor Boxes