IT 기술 따라잡기 :: D2L - 14.11. Fully Convolutional Networks

Dive into Deep Learning/D2L Computer Vision

D2L - 14.11. Fully Convolutional Networks

2023. 8. 21. 04:10 | Posted by 솔웅

https://d2l.ai/chapter_computer-vision/fcn.html

14.11. Fully Convolutional Networks — Dive into Deep Learning 1.0.3 documentation

d2l.ai

As discussed in Section 14.9, semantic segmentation classifies images in pixel level. A fully convolutional network (FCN) uses a convolutional neural network to transform image pixels to pixel classes (Long et al., 2015). Unlike the CNNs that we encountered earlier for image classification or object detection, a fully convolutional network transforms the height and width of intermediate feature maps back to those of the input image: this is achieved by the transposed convolutional layer introduced in Section 14.10. As a result, the classification output and the input image have a one-to-one correspondence in pixel level: the channel dimension at any output pixel holds the classification results for the input pixel at the same spatial position.

섹션 14.9에서 논의한 바와 같이 의미론적 분할 semantic segmentation은 픽셀 수준에서 이미지를 분류합니다. 완전 컨볼루션 네트워크 fully convolutional network (FCN)는 컨볼루션 신경망을 사용하여 이미지 픽셀을 픽셀 클래스로 변환합니다(Long et al., 2015). 이전에 이미지 분류 또는 객체 감지를 위해 접한 CNN과 달리 완전 컨볼루션 네트워크는 중간 피처 맵의 높이와 너비를 입력 이미지의 높이와 너비로 다시 변환합니다. 결과적으로 분류 출력과 입력 이미지는 픽셀 수준에서 일대일 대응을 갖습니다. 모든 출력 픽셀의 채널 차원은 동일한 공간 위치에서 입력 픽셀에 대한 분류 결과를 유지합니다.

Fully Convolutional Networks 란?

"Fully Convolutional Networks"는 완전 합성곱 신경망이라고도 불리며, 주로 이미지 분할과 같은 컴퓨터 비전 작업에 사용되는 딥러닝 아키텍처를 의미합니다.

기존의 전통적인 합성곱 신경망(Convolutional Neural Networks, CNN)은 주로 분류 작업을 위해 설계되었으며, 입력 이미지를 각기 다른 레이어로 전달하여 점진적으로 특징을 추출한 후, 마지막에 소프트맥스 레이어로 클래스 확률을 출력하는 구조를 가지고 있습니다.

반면에 Fully Convolutional Networks (FCN)는 입력 이미지를 픽셀 단위로 분류하는 이미지 분할 작업에 특화된 구조입니다. FCN은 일반적인 CNN의 특징 추출 부분을 유지하면서, 마지막 레이어를 완전 합성곱 레이어로 변경하여 원본 이미지의 공간적인 정보를 보존할 수 있도록 합니다. 이를 통해 각 픽셀에 대한 클래스 레이블을 예측하게 됩니다.

FCN은 픽셀 단위 분류 작업에서 효과적이며, 이미지 분할, 객체 탐지, 시맨틱 세그멘테이션 등의 컴퓨터 비전 작업에서 널리 사용되는 아키텍처입니다. FCN은 입력 이미지의 크기에 따라 유연하게 대응할 수 있으며, 특정 객체나 물체의 경계를 정확하게 분할할 수 있는 장점을 가지고 있습니다.

%matplotlib inline
import torch
import torchvision
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l

이 코드는 주피터 노트북(Jupyter Notebook)에서 실행되는 경우에 사용되는 매직 명령 %matplotlib inline을 시작으로, 필요한 모듈과 라이브러리를 가져오는 부분입니다.

%matplotlib inline: 주피터 노트북 환경에서 Matplotlib 그래프를 인라인으로 표시하도록 설정하는 매직 명령입니다.
import torch: 파이토치 라이브러리를 가져옵니다.
import torchvision: 파이토치의 컴퓨터 비전 관련 기능을 확장한 torchvision 라이브러리를 가져옵니다.
from torch import nn: 파이토치의 nn (neural network) 모듈에서 neural network 관련 기능들을 가져옵니다.
from torch.nn import functional as F: 파이토치의 nn 모듈에서 함수형 인터페이스를 가져와 F로 별칭을 지정합니다. 이것은 주로 활성화 함수나 손실 함수와 같은 함수적 기능을 가리키기 위해 사용됩니다.
from d2l import torch as d2l: "Dive into Deep Learning" 책의 도구 함수를 사용하기 위해 d2l 라이브러리를 가져옵니다. torch 별칭을 d2l로 변경하여 사용합니다.

이 코드 조각은 파이토치와 관련된 라이브러리와 함수를 가져옴으로써 딥러닝 모델을 구축하고 훈련하기 위한 기본적인 환경을 설정합니다.

14.11.1. The Model

Here we describe the basic design of the fully convolutional network model. As shown in Fig. 14.11.1, this model first uses a CNN to extract image features, then transforms the number of channels into the number of classes via a 1×1 convolutional layer, and finally transforms the height and width of the feature maps to those of the input image via the transposed convolution introduced in Section 14.10. As a result, the model output has the same height and width as the input image, where the output channel contains the predicted classes for the input pixel at the same spatial position.

여기서 우리는 완전 컨벌루션 네트워크 모델의 기본 설계를 설명합니다. 그림 14.11.1과 같이 이 모델은 먼저 CNN을 사용하여 이미지 특징을 추출한 다음 1×1 컨벌루션 레이어를 통해 채널 수를 클래스 수로 변환하고 마지막으로 특징 맵의 높이와 너비를 변환합니다. 섹션 14.10에서 소개한 전치 컨벌루션을 통해 입력 이미지의 이미지로 변환합니다. 결과적으로 모델 출력은 입력 이미지와 동일한 높이와 너비를 가지며 출력 채널에는 동일한 공간 위치에서 입력 픽셀에 대해 예측된 클래스가 포함됩니다.

Fig. 14.11.1  Fully convolutional network.

Below, we use a ResNet-18 model pretrained on the ImageNet dataset to extract image features and denote the model instance as pretrained_net. The last few layers of this model include a global average pooling layer and a fully connected layer: they are not needed in the fully convolutional network.

아래에서는 ImageNet 데이터 세트에서 사전 훈련된 ResNet-18 모델을 사용하여 이미지 기능을 추출하고 모델 인스턴스를 pretrained_net으로 표시합니다. 이 모델의 마지막 몇 계층에는 전역 평균 풀링 계층과 완전 연결 계층이 포함됩니다. 완전 컨벌루션 네트워크에서는 필요하지 않습니다.

pretrained_net = torchvision.models.resnet18(pretrained=True)
list(pretrained_net.children())[-3:]

이 코드는 사전 훈련된 ResNet-18 모델을 불러오고, 해당 모델의 마지막 세 개의 레이어를 리스트로 반환하는 과정을 나타냅니다.

pretrained_net: 사전 훈련된 ResNet-18 모델을 불러옵니다.
torchvision.models.resnet18(pretrained=True): 사전 훈련된 ResNet-18 모델을 불러옵니다. pretrained=True는 사전 훈련된 가중치를 사용하겠다는 의미입니다.
list(pretrained_net.children()): 불러온 ResNet-18 모델의 모든 하위 모듈을 리스트로 반환합니다. 이는 네트워크의 각 레이어를 순차적으로 표현한 것입니다.
[-3:]: 리스트의 마지막 세 개의 원소를 선택합니다. 이는 불러온 ResNet-18 모델의 마지막 세 개의 레이어를 의미합니다.

따라서, 위 코드는 사전 훈련된 ResNet-18 모델을 불러온 뒤, 해당 모델의 마지막 세 개의 레이어를 리스트로 반환하는 과정을 나타냅니다.

Downloading: "https://download.pytorch.org/models/resnet18-f37072fd.pth" to /home/ci/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth
100%|██████████| 44.7M/44.7M [00:00<00:00, 56.3MB/s]

[Sequential(
   (0): BasicBlock(
     (conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
     (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
     (relu): ReLU(inplace=True)
     (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
     (downsample): Sequential(
       (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
       (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
     )
   )
   (1): BasicBlock(
     (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
     (relu): ReLU(inplace=True)
     (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
   )
 ),
 AdaptiveAvgPool2d(output_size=(1, 1)),
 Linear(in_features=512, out_features=1000, bias=True)]

위 출력은 ResNet-18 모델의 구조를 설명합니다. 아래에서부터 위로 각각의 레이어와 그 내용을 설명하겠습니다.

Linear 레이어:
- 입력 특성 개수: 512
- 출력 특성 개수: 1000
- 활성화 함수: 없음
- 주어진 입력 특성에 대한 선형 변환을 수행하는 fully connected 레이어입니다. 이것은 마지막 분류 레이어로, 모델의 출력이 클래스 수에 대한 확률 분포를 나타냅니다.
AdaptiveAvgPool2d 레이어:
- 출력 크기: (1, 1)
- 입력으로 들어오는 특성 맵을 주어진 출력 크기에 맞게 자동으로 평균 풀링하는 레이어입니다. 입력 이미지의 크기에 관계 없이 일정한 크기의 출력을 생성합니다.
Sequential 레이어 (2개의 BasicBlock 레이어 포함):
- 첫 번째 BasicBlock:
  - conv1: 입력 특성 개수 256, 출력 특성 개수 512의 3x3 컨볼루션 레이어
  - bn1: 배치 정규화 레이어
  - relu: ReLU 활성화 함수
  - conv2: 입력 특성 개수 512, 출력 특성 개수 512의 3x3 컨볼루션 레이어
  - bn2: 배치 정규화 레이어
  - downsample: 다운샘플링을 위한 Sequential 레이어
- 두 번째 BasicBlock:
  - 위와 같은 구조를 가진 BasicBlock 레이어
BasicBlock은 ResNet의 기본 구성 요소입니다. 입력 특성을 컨볼루션과 배치 정규화를 거쳐 변환한 뒤, 활성화 함수 ReLU를 적용하고 다시 컨볼루션과 배치 정규화를 거칩니다. 이러한 구성으로 인해 ResNet은 깊게 쌓여도 그레이디언트 소실 문제를 해결하며 성능을 높일 수 있습니다.

따라서 위의 출력은 ResNet-18 모델의 구조를 나타내며, 각 레이어의 구성 요소를 설명합니다.

Next, we create the fully convolutional network instance net. It copies all the pretrained layers in the ResNet-18 except for the final global average pooling layer and the fully connected layer that are closest to the output.

다음으로 완전 컨볼루션 네트워크 인스턴스 네트워크를 만듭니다. 최종 전역 평균 풀링 계층과 출력에 가장 가까운 완전 연결 계층을 제외하고 ResNet-18의 모든 사전 훈련된 계층을 복사합니다.

net = nn.Sequential(*list(pretrained_net.children())[:-2])

위 코드는 미리 학습된 ResNet-18 모델의 마지막 두 레이어를 제외한 모든 레이어를 새로운 신경망 모델로 가져오는 과정을 나타냅니다.

pretrained_net은 미리 학습된 ResNet-18 모델을 나타냅니다.
pretrained_net.children()는 모델의 모든 하위 레이어를 가져오는 메소드입니다.
list(pretrained_net.children())[:-2]는 마지막 두 레이어를 제외한 모든 레이어를 리스트로 가져옵니다.
nn.Sequential(*...)는 주어진 레이어들을 순차적으로 연결한 새로운 신경망 모델을 생성합니다.

따라서 net은 미리 학습된 ResNet-18 모델의 마지막 두 레이어를 제외한 모든 레이어를 가지는 새로운 신경망 모델입니다. 이렇게 하면 기존의 미리 학습된 모델에서 일부 레이어를 재활용하고, 이에 새로운 레이어를 추가하여 원하는 작업에 맞는 모델을 빠르게 구축할 수 있습니다.

Given an input with height and width of 320 and 480 respectively, the forward propagation of net reduces the input height and width to 1/32 of the original, namely 10 and 15.

높이와 너비가 각각 320과 480인 입력이 주어지면 net의 정방향 전파는 입력 높이와 너비를 원본의 1/32, 즉 10과 15로 줄입니다.

X = torch.rand(size=(1, 3, 320, 480))
net(X).shape

위 코드는 미리 학습된 ResNet-18 모델의 마지막 두 레이어를 제외한 나머지 레이어로 입력 데이터를 전달하여 출력의 형태를 확인하는 과정을 나타냅니다.

torch.rand(size=(1, 3, 320, 480))은 크기가 (1, 3, 320, 480)인 랜덤한 값을 가지는 입력 데이터를 생성합니다. 이는 배치 크기가 1이고 채널 수가 3이며, 높이가 320이고 너비가 480인 이미지 데이터를 나타냅니다.
net(X)는 위에서 생성한 입력 데이터 X를 미리 학습된 ResNet-18 모델인 net에 통과시켜 출력을 계산하는 과정입니다.
.shape는 텐서의 크기(shape)를 반환하는 속성입니다. 따라서 net(X).shape는 네트워크의 출력 형태를 나타내게 됩니다.

이 코드를 실행하면 입력 데이터 X를 네트워크에 전달한 결과의 출력 형태를 확인할 수 있습니다.

torch.Size([1, 512, 10, 15])

Next, we use a 1×1 convolutional layer to transform the number of output channels into the number of classes (21) of the Pascal VOC2012 dataset. Finally, we need to increase the height and width of the feature maps by 32 times to change them back to the height and width of the input image. Recall how to calculate the output shape of a convolutional layer in Section 7.3. Since (320−64+16×2+32)/32=10 and (480−64+16×2+32)/32=15, we construct a transposed convolutional layer with stride of 32, setting the height and width of the kernel to 64, the padding to 16. In general, we can see that for stride s, padding s/2 (assuming s/2 is an integer), and the height and width of the kernel 2s, the transposed convolution will increase the height and width of the input by s times.

다음으로 1×1 컨벌루션 레이어를 사용하여 출력 채널 수를 Pascal VOC2012 데이터 세트의 클래스 수(21)로 변환합니다. 마지막으로 피처 맵의 높이와 너비를 32배 늘려 입력 이미지의 높이와 너비로 다시 변경해야 합니다. 섹션 7.3에서 컨볼루션 레이어의 출력 형태를 계산하는 방법을 기억하십시오. (320-64+16×2+32)/32=10이고 (480-64+16×2+32)/32=15이므로 보폭이 32인 전치 컨볼루션 레이어를 구성하고 커널은 64로, 패딩은 16으로 높이와 너비를 설정합니다. 일반적으로 보폭 s, 패딩 s/2(s/2가 정수라고 가정) 및 커널 2s의 높이와 너비에 대해 전치 컨볼루션이 입력의 높이와 너비를 s배로 증가시키는 것을 볼 수 있습니다. .

num_classes = 21
net.add_module('final_conv', nn.Conv2d(512, num_classes, kernel_size=1))
net.add_module('transpose_conv', nn.ConvTranspose2d(num_classes, num_classes,
                                    kernel_size=64, padding=16, stride=32))

위 코드는 기존의 신경망 모델 net에 두 개의 추가 레이어를 추가하는 과정을 나타냅니다.

num_classes = 21: 클래스의 개수를 나타내는 변수로, 이 경우에는 21개의 클래스를 분류하는 문제를 가정하고 있습니다.
nn.Conv2d(512, num_classes, kernel_size=1): final_conv라는 이름의 1x1 컨볼루션 레이어를 생성하고, 입력 채널 수가 512이며 출력 채널 수가 num_classes인 컨볼루션 레이어를 추가합니다. 이 레이어는 마지막 컨볼루션 레이어를 대체하는 역할을 합니다.
nn.ConvTranspose2d(num_classes, num_classes, kernel_size=64, padding=16, stride=32): transpose_conv라는 이름의 전치 컨볼루션 레이어를 생성하고, 입력 채널 수와 출력 채널 수가 모두 num_classes인 전치 컨볼루션 레이어를 추가합니다. 이 레이어는 특성 맵의 크기를 늘려주는 데 사용되며, 이 경우에는 크기를 32배 확대하고 패딩과 스트라이드를 조절하여 원하는 출력 크기를 얻습니다.

따라서 위 코드는 미리 학습된 신경망 net의 마지막 레이어를 대체하고, 그 뒤에 전치 컨볼루션 레이어를 추가하여 원하는 출력을 생성할 수 있도록 수정하는 과정을 나타냅니다. 이는 주로 세그멘테이션 등의 문제에서 사용되며, 네트워크의 출력 크기를 조정하고 클래스별 특성 맵을 얻을 때 유용합니다.

14.11.2. Initializing Transposed Convolutional Layers

We already know that transposed convolutional layers can increase the height and width of feature maps. In image processing, we may need to scale up an image, i.e., upsampling. Bilinear interpolation is one of the commonly used upsampling techniques. It is also often used for initializing transposed convolutional layers.

우리는 transposed convolutional layer가 feature map의 height와 width를 증가시킬 수 있다는 것을 이미 알고 있습니다. 이미지 처리에서 이미지를 확장해야 할 수도 있습니다. 즉, 업샘플링이 필요할 수 있습니다. 쌍선형 보간 Bilinear interpolation 은 일반적으로 사용되는 업샘플링 기술 중 하나입니다. 또한 전치된 컨볼루션 레이어를 초기화하는 데 자주 사용됩니다.

To explain bilinear interpolation, say that given an input image we want to calculate each pixel of the upsampled output image. In order to calculate the pixel of the output image at coordinate (x,y), first map (x,y) to coordinate (x′,y′) on the input image, for example, according to the ratio of the input size to the output size. Note that the mapped x′ and y′ are real numbers. Then, find the four pixels closest to coordinate (x′,y′) on the input image. Finally, the pixel of the output image at coordinate (x,y) is calculated based on these four closest pixels on the input image and their relative distance from (x′,y′).

쌍선형 보간법 bilinear interpolation을 설명하기 위해 입력 이미지가 주어졌을 때 업샘플링된 출력 이미지의 각 픽셀을 계산하려고 한다고 가정해 보겠습니다. 좌표 (x,y)에서 출력 이미지의 픽셀을 계산하기 위해 먼저 입력 크기의 비율에 따라 (x,y)를 입력 이미지의 좌표 (x′,y′)에 매핑합니다. 출력 크기에. 매핑된 x′ 및 y′는 실수입니다. 그런 다음 입력 이미지에서 좌표(x′,y′)에 가장 가까운 4개의 픽셀을 찾습니다. 마지막으로 좌표 (x,y)에 있는 출력 이미지의 픽셀은 입력 이미지에서 가장 가까운 4개의 픽셀과 (x',y')로부터의 상대적인 거리를 기반으로 계산됩니다.

bilinear interpolation이란?

'Bilinear Interpolation'은 이미지나 그래픽의 크기나 해상도를 변경하는 과정에서 사용되는 보간 기법 중 하나입니다. 보간(interpolation)은 주어진 데이터 포인트 사이에 새로운 값을 추정하는 기술을 의미하며, 이를 통해 부드럽게 변환된 이미지나 그래픽을 생성할 수 있습니다.

이 중 'Bilinear Interpolation'은 2차원 평면에서 사용되며, 두 개의 인접한 데이터 포인트 사이에서 새로운 값을 추정합니다. 간단히 말해, 2차원 좌표 상에서 (x, y) 위치에 존재하는 값은 주변 4개의 데이터 포인트를 사용하여 계산됩니다. 이렇게 인접한 4개의 데이터 포인트를 가중치를 적용하여 선형 보간한 결과를 사용하여 새로운 값을 추정합니다.

이 기법은 이미지나 그래픽의 크기 변경, 회전, 이동 등에 사용되며, 특히 이미지 리사이징이나 세그멘테이션과 같은 작업에서 주로 활용됩니다. 'Bilinear'이라는 이름은 가중치를 계산하는 데에 선형 보간을 사용한다는 특성을 나타내며, 두 개의 축에 대한 보간이 동시에 이루어지므로 'Bilinear'이라는 이름이 붙여졌습니다.

Upsampling of bilinear interpolation can be implemented by the transposed convolutional layer with the kernel constructed by the following bilinear_kernel function. Due to space limitations, we only provide the implementation of the bilinear_kernel function below without discussions on its algorithm design.

쌍선형 보간법 bilinear interpolation 의 업샘플링은 다음 bilinear_kernel 함수로 구성된 커널을 사용하여 전치된 컨벌루션 계층 transposed convolutional layer에 의해 구현될 수 있습니다. 공간 제한으로 인해 알고리즘 설계에 대한 논의 없이 bilinear_kernel 함수의 구현만 제공합니다.

def bilinear_kernel(in_channels, out_channels, kernel_size):
    factor = (kernel_size + 1) // 2
    if kernel_size % 2 == 1:
        center = factor - 1
    else:
        center = factor - 0.5
    og = (torch.arange(kernel_size).reshape(-1, 1),
          torch.arange(kernel_size).reshape(1, -1))
    filt = (1 - torch.abs(og[0] - center) / factor) * \
           (1 - torch.abs(og[1] - center) / factor)
    weight = torch.zeros((in_channels, out_channels,
                          kernel_size, kernel_size))
    weight[range(in_channels), range(out_channels), :, :] = filt
    return weight

위 코드는 주어진 입력 채널 수와 출력 채널 수, 그리고 커널 크기에 따라 이차원의 'Bilinear Interpolation' 커널을 생성하는 함수를 정의하고 있습니다. 이 커널은 주로 합성곱 신경망(Convolutional Neural Network, CNN)에서 전치 합성곱 연산에 활용되며, 주로 세그멘테이션(segmentation)과 같은 작업에서 이미지를 업샘플링하는 데 사용됩니다.

여기서 in_channels은 입력 채널 수, out_channels는 출력 채널 수, kernel_size는 커널의 크기(가로와 세로)입니다. 이 커널은 입력 이미지의 각 채널별로 다른 가중치를 가지며, 이를 통해 입력 이미지의 특성을 보존하면서 업샘플링됩니다.

커널을 생성하는 과정은 다음과 같습니다:

factor 변수는 커널 크기의 중앙을 계산합니다.
center 변수는 중앙 위치를 결정합니다.
og는 커널 크기에 따른 좌표 그리드를 생성합니다.
filt는 각 좌표 위치에 대한 커널 가중치를 계산합니다. 이것은 'Bilinear Interpolation'에 따라 계산된 값으로, 가운데에 가까운 값일수록 높은 가중치를 가지며, 커널 크기를 기준으로 대칭적인 값을 가집니다.
weight 텐서는 이렇게 계산된 filt 값을 이용하여 입력 채널 수와 출력 채널 수에 따른 4차원 텐서를 생성합니다. 이 텐서의 각 채널에 커널 가중치 값을 할당합니다.

이러한 커널은 전치 합성곱 연산에서 업샘플링을 수행할 때 사용되며, 주로 모델의 특성 맵 크기를 늘리는 데 활용됩니다.

Let’s experiment with upsampling of bilinear interpolation that is implemented by a transposed convolutional layer. We construct a transposed convolutional layer that doubles the height and weight, and initialize its kernel with the bilinear_kernel function.

전치된 컨볼루션 레이어에 의해 구현되는 쌍선형 보간의 업샘플링을 실험해 봅시다. 높이와 무게를 두 배로 늘리는 전치 컨볼루션 레이어를 구성하고 bilinear_kernel 함수로 커널을 초기화합니다.

conv_trans = nn.ConvTranspose2d(3, 3, kernel_size=4, padding=1, stride=2,
                                bias=False)
conv_trans.weight.data.copy_(bilinear_kernel(3, 3, 4));

위 코드는 입력 채널 수와 출력 채널 수가 모두 3인 커스텀 'Bilinear Interpolation' 커널을 생성하여 이를 사용하여 전치 합성곱 연산을 수행하는 부분입니다.

해석을 해보겠습니다:

conv_trans는 전치 합성곱 레이어를 생성하는데 사용됩니다. nn.ConvTranspose2d는 전치 합성곱을 수행하는 클래스입니다. 여기서 입력 채널 수와 출력 채널 수 모두 3이며, 커널 크기는 4x4이고, 패딩은 1, 스트라이드는 2로 설정됩니다.
conv_trans.weight.data.copy_(bilinear_kernel(3, 3, 4))는 위에서 정의한 bilinear_kernel 함수를 사용하여 커스텀 'Bilinear Interpolation' 커널을 생성하고, 이를 conv_trans 레이어의 가중치 텐서로 복사하는 역할을 합니다. 이를 통해 'Bilinear Interpolation'의 특성을 가지는 커널이 전치 합성곱 레이어에 적용됩니다.

따라서 위 코드는 입력 이미지의 특성 맵을 업샘플링하여 크기를 확장하는 역할을 수행합니다. 이를 통해 세그멘테이션과 같은 작업에서 높은 해상도의 예측을 얻을 수 있습니다.

Read the image X and assign the upsampling output to Y. In order to print the image, we need to adjust the position of the channel dimension.

이미지 X를 읽고 업샘플링 출력을 Y에 할당합니다. 이미지를 인쇄하려면 채널 차원의 위치를 조정해야 합니다.

img = torchvision.transforms.ToTensor()(d2l.Image.open('../img/catdog.jpg'))
X = img.unsqueeze(0)
Y = conv_trans(X)
out_img = Y[0].permute(1, 2, 0).detach()

위 코드는 주어진 이미지에 대해 Bilinear Interpolation을 사용하여 업샘플링을 수행하는 과정을 보여주는 부분입니다.

해석을 해보겠습니다:

img는 d2l.Image.open('../img/catdog.jpg')를 통해 로드된 이미지입니다. 이 이미지를 텐서 형식으로 변환하기 위해 torchvision.transforms.ToTensor()를 사용하여 텐서로 변환합니다.
X는 img를 모델에 입력으로 사용하기 위해 차원을 확장한 것입니다. 일반적으로 모델에 입력할 때 배치 차원을 추가하기 위해 사용됩니다.
conv_trans(X)는 이전에 정의한 conv_trans 레이어를 사용하여 입력 이미지 X에 대한 업샘플링을 수행한 결과를 얻습니다. 이렇게 얻은 결과인 Y는 업샘플링된 이미지입니다.
out_img는 Y 텐서에서 0번째 배치 차원을 선택하여 이미지를 추출합니다. 그리고 이를 .permute(1, 2, 0)를 통해 채널 순서를 변경하여 이미지의 형식을 HWC (높이, 너비, 채널)로 변경합니다. .detach()는 연산을 추적하지 않도록 하기 위해 사용되며, 역전파에 영향을 주지 않습니다.

결과적으로 out_img는 업샘플링된 이미지를 나타내며, 해당 이미지를 시각화하거나 분석하는 데 사용할 수 있습니다.

As we can see, the transposed convolutional layer increases both the height and width of the image by a factor of two. Except for the different scales in coordinates, the image scaled up by bilinear interpolation and the original image printed in Section 14.3 look the same.

보시다시피 transposed convolutional layer는 이미지의 높이와 너비를 2배로 늘립니다. 좌표의 스케일이 다른 것을 제외하면 쌍선형 보간에 의해 확대된 이미지와 14.3절에서 인쇄된 원본 이미지는 동일하게 보입니다.

d2l.set_figsize()
print('input image shape:', img.permute(1, 2, 0).shape)
d2l.plt.imshow(img.permute(1, 2, 0));
print('output image shape:', out_img.shape)
d2l.plt.imshow(out_img);

위 코드는 입력 이미지와 Bilinear Interpolation을 통해 얻은 업샘플링된 출력 이미지를 시각화하는 부분입니다.

해석을 해보겠습니다:

d2l.set_figsize()는 시각화할 그림의 크기를 설정하는 함수입니다. 보통 d2l 라이브러리에서 제공하는 시각화 함수를 사용할 때 함께 사용됩니다.
print('input image shape:', img.permute(1, 2, 0).shape)는 입력 이미지의 형태를 출력합니다. 이미지의 채널 순서를 변경하여 (높이, 너비, 채널) 형식으로 변경하고, .shape를 사용하여 해당 형태를 출력합니다.
d2l.plt.imshow(img.permute(1, 2, 0))는 d2l 라이브러리의 시각화 함수를 사용하여 입력 이미지를 표시합니다. img.permute(1, 2, 0)를 사용하여 채널 순서를 변경하고 이미지를 시각화합니다.
print('output image shape:', out_img.shape)는 업샘플링된 출력 이미지의 형태를 출력합니다.
d2l.plt.imshow(out_img)는 업샘플링된 출력 이미지를 시각화합니다.

결과적으로 코드는 입력 이미지와 업샘플링된 출력 이미지를 시각화하여 보여주는 역할을 합니다.

input image shape: torch.Size([561, 728, 3])
output image shape: torch.Size([1122, 1456, 3])

In a fully convolutional network, we initialize the transposed convolutional layer with upsampling of bilinear interpolation. For the 1×1 convolutional layer, we use Xavier initialization.

완전 컨벌루션 네트워크에서 쌍선형 보간법의 업샘플링으로 전치된 컨벌루션 레이어를 초기화합니다. 1×1 컨볼루션 레이어의 경우 Xavier 초기화를 사용합니다.

W = bilinear_kernel(num_classes, num_classes, 64)
net.transpose_conv.weight.data.copy_(W);

위 코드는 미리 정의된 네트워크 net의 ConvTranspose2d 레이어의 가중치를 Bilinear Interpolation에 사용되는 가중치로 설정하는 부분입니다.

해석을 해보겠습니다:

W = bilinear_kernel(num_classes, num_classes, 64)은 bilinear_kernel 함수를 사용하여 Bilinear Interpolation에 사용되는 가중치 행렬 W를 생성합니다. 이 함수는 입력 채널 수와 출력 채널 수가 같으며, 커널 크기는 64x64입니다. 이렇게 생성된 W는 Bilinear Interpolation을 수행하는 필터 역할을 합니다.
net.transpose_conv.weight.data.copy_(W)은 미리 정의된 net의 ConvTranspose2d 레이어의 가중치를 Bilinear Interpolation에 사용되는 W로 설정합니다. net.transpose_conv.weight.data는 레이어의 가중치 데이터를 의미하며, .copy_(W)를 사용하여 W의 값을 이 가중치에 복사합니다.

이렇게 설정된 가중치로 인해 ConvTranspose2d 레이어는 입력 이미지를 업샘플링하고 Bilinear Interpolation을 수행하여 출력 이미지를 생성하게 됩니다.

14.11.3. Reading the Dataset

We read the semantic segmentation dataset as introduced in Section 14.9. The output image shape of random cropping is specified as 320×480: both the height and width are divisible by 32.

섹션 14.9에서 소개한 시맨틱 분할 데이터 세트를 읽습니다. 임의 자르기의 출력 이미지 모양은 320×480으로 지정됩니다. 높이와 너비는 모두 32로 나눌 수 있습니다.

batch_size, crop_size = 32, (320, 480)
train_iter, test_iter = d2l.load_data_voc(batch_size, crop_size)

위 코드는 Pascal VOC 데이터셋을 사용하여 학습과 테스트 데이터로 나누어 불러오는 부분입니다.

해석을 해보겠습니다:

batch_size는 미니배치의 크기를 나타내며, 한 번의 iteration에서 처리되는 데이터 샘플의 수입니다.
crop_size는 이미지를 잘라내는 크기입니다. VOC 데이터셋의 이미지를 모두 동일한 크기로 자르기 위해 사용됩니다.
train_iter, test_iter = d2l.load_data_voc(batch_size, crop_size)는 d2l.load_data_voc 함수를 호출하여 Pascal VOC 데이터셋을 불러오고 학습용 데이터와 테스트용 데이터로 나눈 미니배치 반복자 train_iter와 test_iter를 생성합니다. 이 반복자들은 학습 및 테스트에 사용됩니다.

즉, 위 코드는 Pascal VOC 데이터셋을 불러와서 학습용 데이터와 테스트용 데이터로 나누고, 미니배치 반복자를 생성하는 과정을 나타냅니다.

read 1114 examples
read 1078 examples

14.11.4. Training

Now we can train our constructed fully convolutional network. The loss function and accuracy calculation here are not essentially different from those in image classification of earlier chapters. Because we use the output channel of the transposed convolutional layer to predict the class for each pixel, the channel dimension is specified in the loss calculation. In addition, the accuracy is calculated based on correctness of the predicted class for all the pixels.

이제 우리는 구축한 완전 컨벌루션 네트워크를 훈련할 수 있습니다. 여기서 손실 함수와 정확도 계산은 이전 장의 이미지 분류와 본질적으로 다르지 않습니다. 각 픽셀의 클래스를 예측하기 위해 전치된 컨벌루션 레이어의 출력 채널을 사용하기 때문에 손실 계산에서 채널 차원이 지정됩니다. 또한 정확도는 모든 픽셀에 대해 예측된 클래스의 정확도를 기반으로 계산됩니다.

def loss(inputs, targets):
    return F.cross_entropy(inputs, targets, reduction='none').mean(1).mean(1)

num_epochs, lr, wd, devices = 5, 0.001, 1e-3, d2l.try_all_gpus()
trainer = torch.optim.SGD(net.parameters(), lr=lr, weight_decay=wd)
d2l.train_ch13(net, train_iter, test_iter, loss, trainer, num_epochs, devices)

위 코드는 주어진 모델을 학습시키는 과정을 나타내고 있습니다.

def loss(inputs, targets) 함수는 입력 데이터와 실제 타겟 데이터를 받아와서 Cross Entropy 손실을 계산하는 함수입니다. 여기서 reduction='none'으로 설정하여 각 데이터 포인트의 손실 값을 계산하고, 이를 평균하여 최종 손실을 구합니다.
num_epochs는 에포크 수를 나타냅니다. 에포크는 학습 데이터를 몇 번 반복해서 사용하여 모델을 업데이트하는 단위입니다.
lr은 학습률(learning rate)로, 모델의 가중치를 업데이트할 때 얼마나 큰 스텝으로 업데이트할지를 결정합니다.
wd는 가중치 감쇠(weight decay)로, 가중치의 크기를 제어하여 오버피팅을 방지하고 모델의 일반화 성능을 높이는데 사용됩니다.
devices는 사용 가능한 GPU 장치를 찾는 함수인 d2l.try_all_gpus()를 호출하여 GPU 장치 목록을 가져옵니다.
trainer는 옵티마이저로, torch.optim.SGD를 사용하여 네트워크의 파라미터를 업데이트합니다.
d2l.train_ch13(net, train_iter, test_iter, loss, trainer, num_epochs, devices)는 모델을 학습시키는 함수를 호출합니다. 이 함수는 네트워크 모델, 학습용 데이터 반복자, 테스트용 데이터 반복자, 손실 함수, 옵티마이저, 에포크 수, GPU 장치 정보를 인자로 받아서 모델을 학습시키는 프로세스를 실행합니다.

따라서 위 코드는 주어진 모델을 주어진 데이터로 학습시키는 과정을 수행하는 코드입니다.

loss 0.449, train acc 0.861, test acc 0.852
226.7 examples/sec on [device(type='cuda', index=0), device(type='cuda', index=1)]

14.11.5. Prediction

When predicting, we need to standardize the input image in each channel and transform the image into the four-dimensional input format required by the CNN.

예측할 때 각 채널의 입력 이미지를 표준화하고 이미지를 CNN에서 요구하는 4차원 입력 형식으로 변환해야 합니다.

def predict(img):
    X = test_iter.dataset.normalize_image(img).unsqueeze(0)
    pred = net(X.to(devices[0])).argmax(dim=1)
    return pred.reshape(pred.shape[1], pred.shape[2])

위 코드는 주어진 이미지에 대한 Semantic Segmentation 예측을 수행하는 함수를 정의한 것입니다.

def predict(img) 함수는 이미지를 입력으로 받아와서 Semantic Segmentation을 예측합니다.
X = test_iter.dataset.normalize_image(img).unsqueeze(0)는 입력 이미지를 네트워크의 입력 형식에 맞게 전처리하는 과정입니다. 입력 이미지를 모델의 입력에 맞게 정규화(Normalization)하고, 배치 차원을 추가하여 모델에 한 번에 처리할 수 있도록 준비합니다.
pred = net(X.to(devices[0])).argmax(dim=1)는 전처리된 입력 이미지를 네트워크에 전달하고, 네트워크의 출력을 가져와서 가장 높은 값의 클래스 인덱스를 선택합니다. 이를 통해 각 픽셀에 대한 예측 클래스 인덱스를 얻습니다.
return pred.reshape(pred.shape[1], pred.shape[2])는 예측된 클래스 인덱스를 원본 이미지 크기에 맞게 재조정하여 반환합니다. 반환된 클래스 인덱스 맵은 이미지의 각 픽셀에 대한 클래스 정보를 포함하게 됩니다.

즉, predict 함수는 주어진 이미지에 대해 Semantic Segmentation 모델로 예측을 수행하고, 예측된 클래스 인덱스 맵을 반환하는 역할을 합니다.

To visualize the predicted class of each pixel, we map the predicted class back to its label color in the dataset.

각 픽셀의 예측된 클래스를 시각화하기 위해 예측된 클래스를 데이터 세트의 레이블 색상에 다시 매핑합니다.

def label2image(pred):
    colormap = torch.tensor(d2l.VOC_COLORMAP, device=devices[0])
    X = pred.long()
    return colormap[X, :]

위 코드는 예측된 클래스 인덱스 맵을 컬러 이미지로 변환하는 함수를 정의한 것입니다.

def label2image(pred) 함수는 예측된 클래스 인덱스 맵을 입력으로 받아서 컬러 이미지로 변환합니다.
colormap = torch.tensor(d2l.VOC_COLORMAP, device=devices[0])는 VOC 데이터셋에 사용되는 클래스의 컬러 맵을 Torch Tensor로 변환하여 준비합니다. 각 클래스마다 대응하는 RGB 컬러 값을 가지고 있습니다.
X = pred.long()는 예측된 클래스 인덱스 맵을 정수 형태로 변환합니다. 일반적으로 클래스 인덱스는 정수 값으로 표현됩니다.
return colormap[X, :]는 클래스 인덱스 맵에 대응하는 컬러 값을 colormap에서 찾아서 반환합니다. 따라서 각 픽셀에 대한 클래스 인덱스에 해당하는 컬러로 구성된 컬러 이미지가 반환됩니다.

이 함수를 사용하면 Semantic Segmentation 모델이 예측한 클래스 인덱스 맵을 실제 컬러 이미지로 변환하여 시각화할 수 있습니다.

Images in the test dataset vary in size and shape. Since the model uses a transposed convolutional layer with stride of 32, when the height or width of an input image is indivisible by 32, the output height or width of the transposed convolutional layer will deviate from the shape of the input image. In order to address this issue, we can crop multiple rectangular areas with height and width that are integer multiples of 32 in the image, and perform forward propagation on the pixels in these areas separately. Note that the union of these rectangular areas needs to completely cover the input image. When a pixel is covered by multiple rectangular areas, the average of the transposed convolution outputs in separate areas for this same pixel can be input to the softmax operation to predict the class.

테스트 데이터 세트의 이미지는 크기와 모양이 다양합니다. 이 모델은 보폭이 32인 전치 컨볼루션 레이어를 사용하기 때문에 입력 이미지의 높이 또는 너비가 32로 나눌 수 없는 경우 전치된 컨볼루션 레이어의 출력 높이 또는 너비가 입력 이미지의 모양에서 벗어날 것입니다. 이 문제를 해결하기 위해 이미지에서 높이와 너비가 32의 정수배인 여러 직사각형 영역을 자르고 이 영역의 픽셀에 대해 개별적으로 순방향 전파를 수행할 수 있습니다. 이러한 직사각형 영역의 합집합은 입력 이미지를 완전히 덮어야 합니다. 픽셀이 여러 직사각형 영역으로 덮여 있는 경우 동일한 픽셀에 대해 별도의 영역에서 전치된 컨볼루션 출력의 평균을 소프트맥스 연산에 입력하여 클래스를 예측할 수 있습니다.

For simplicity, we only read a few larger test images, and crop a 320×480 area for prediction starting from the upper-left corner of an image. For these test images, we print their cropped areas, prediction results, and ground-truth row by row.

간단하게 하기 위해 몇 개의 더 큰 테스트 이미지만 읽고 이미지의 왼쪽 상단 모서리에서 시작하여 예측을 위해 320×480 영역을 자릅니다. 이러한 테스트 이미지의 경우 잘린 영역, 예측 결과 및 실측 정보를 행별로 인쇄합니다.

voc_dir = d2l.download_extract('voc2012', 'VOCdevkit/VOC2012')
test_images, test_labels = d2l.read_voc_images(voc_dir, False)
n, imgs = 4, []
for i in range(n):
    crop_rect = (0, 0, 320, 480)
    X = torchvision.transforms.functional.crop(test_images[i], *crop_rect)
    pred = label2image(predict(X))
    imgs += [X.permute(1,2,0), pred.cpu(),
             torchvision.transforms.functional.crop(
                 test_labels[i], *crop_rect).permute(1,2,0)]
d2l.show_images(imgs[::3] + imgs[1::3] + imgs[2::3], 3, n, scale=2);

위 코드는 VOC 데이터셋의 테스트 이미지와 해당 이미지에 대한 예측 결과, 그리고 실제 레이블 이미지를 시각화하는 예제입니다.

voc_dir = d2l.download_extract('voc2012', 'VOCdevkit/VOC2012')는 VOC 데이터셋을 다운로드하고 압축을 해제하여 voc_dir에 저장합니다.
test_images, test_labels = d2l.read_voc_images(voc_dir, False)는 VOC 데이터셋의 테스트 이미지와 레이블 이미지를 읽어옵니다.
n, imgs = 4, []는 시각화할 이미지 개수 n과 결과 이미지를 저장할 리스트 imgs를 초기화합니다.
for i in range(n):는 n 개의 이미지에 대해 반복합니다.
crop_rect = (0, 0, 320, 480)는 이미지를 잘라낼 영역을 정의합니다.
X = torchvision.transforms.functional.crop(test_images[i], *crop_rect)는 테스트 이미지 i를 잘라낸 후 X에 저장합니다.
pred = label2image(predict(X))는 잘라낸 이미지 X에 대한 예측을 수행한 뒤, 예측 결과를 컬러 이미지로 변환하여 pred에 저장합니다.
imgs += [X.permute(1,2,0), pred.cpu(), ...]는 잘라낸 원본 이미지, 예측 결과 이미지, 실제 레이블 이미지를 리스트에 추가합니다. .permute(1, 2, 0)는 이미지의 차원 순서를 변경하여 HWC 형식으로 변환합니다.
d2l.show_images(imgs[::3] + imgs[1::3] + imgs[2::3], 3, n, scale=2)는 원본 이미지, 예측 결과 이미지, 실제 레이블 이미지를 한 줄에 3개씩 표시하고, 이를 n 행으로 나열하여 시각화합니다. scale=2는 이미지 크기를 확대하여 표시합니다.

이 코드를 통해 VOC 데이터셋의 테스트 이미지에 대한 Semantic Segmentation 모델의 예측 결과와 실제 레이블 이미지를 시각화할 수 있습니다.

14.11.6. Summary

The fully convolutional network first uses a CNN to extract image features, then transforms the number of channels into the number of classes via a 1×1 convolutional layer, and finally transforms the height and width of the feature maps to those of the input image via the transposed convolution.

완전 컨벌루션 네트워크는 먼저 CNN을 사용하여 이미지 특징을 추출합니다. 그런 다음 1×1 컨벌루션 레이어를 통해 채널 수를 클래스 수로 변환합니다. 마지막으로 transposed convolution을 통해 feature map의 높이와 너비를 입력 이미지의 높이와 너비로 변환합니다.
In a fully convolutional network, we can use upsampling of bilinear interpolation to initialize the transposed convolutional layer.

완전 컨벌루션 네트워크에서는 쌍선형 보간의 업샘플링을 사용하여 전치된 컨볼루션 레이어를 초기화할 수 있습니다.

14.11.7. Exercises

If we use Xavier initialization for the transposed convolutional layer in the experiment, how does the result change?
Can you further improve the accuracy of the model by tuning the hyperparameters?
Predict the classes of all pixels in test images.
The original fully convolutional network paper also uses outputs of some intermediate CNN layers (Long et al., 2015). Try to implement this idea.

'Dive into Deep Learning > D2L Computer Vision' 카테고리의 다른 글

D2L - 14.12. Neural Style Transfer (1)	2023.08.21
D2L - 14.10. Transposed Convolution (1)	2023.08.21
D2L - 14.9. Semantic Segmentation and the Dataset (0)	2023.08.20
D2L - 14.8. Region-based CNNs (R-CNNs) (0)	2023.08.20
D2L - 14.7. Single Shot Multibox Detection (0)	2023.08.19
D2L - 14.6. The Object Detection Dataset (0)	2023.08.19
D2L - 14.5. Multiscale Object Detection (0)	2023.08.19
D2L - 14.4. Anchor Boxes (0)	2023.08.19
D2L - 14.3. Object Detection and Bounding Boxes (0)	2023.08.19
D2L - 14.2. Fine-Tuning (0)	2023.08.19

IT 기술 따라잡기

공지사항

최근에 올라온 글

최근에 달린 댓글

최근에 받은 트랙백

글 보관함

카테고리

D2L - 14.11. Fully Convolutional Networks

14.11.1. The Model

14.11.2. Initializing Transposed Convolutional Layers

14.11.3. Reading the Dataset

14.11.4. Training

14.11.5. Prediction

14.11.6. Summary

14.11.7. Exercises

'Dive into Deep Learning > D2L Computer Vision' 카테고리의 다른 글

티스토리툴바