
D2L - 8.4. Multi-Branch Networks (GoogLeNet)

2023. 7. 18. 12:31 | Posted by 솔웅

https://d2l.ai/chapter_convolutional-modern/googlenet.html

8.4. Multi-Branch Networks (GoogLeNet) — Dive into Deep Learning 1.0.0-beta0 documentation

d2l.ai

8.4. Multi-Branch Networks (GoogLeNet)

In 2014, GoogLeNet won the ImageNet Challenge (Szegedy et al., 2015), using a structure that combined the strengths of NiN (Lin et al., 2013), repeated blocks (Simonyan and Zisserman, 2014), and a cocktail of convolution kernels. It is arguably also the first network that exhibits a clear distinction among the stem (data ingest), body (data processing), and head (prediction) in a CNN. This design pattern has persisted ever since in the design of deep networks: the stem is given by the first 2–3 convolutions that operate on the image. They extract low-level features from the underlying images. This is followed by a body of convolutional blocks. Finally, the head maps the features obtained so far to the required classification, segmentation, detection, or tracking problem at hand.


The key contribution in GoogLeNet was the design of the network body. It solved the problem of selecting convolution kernels in an ingenious way. While other works tried to identify which convolution, ranging from 1×1 to 11×11, would be best, it simply concatenated multi-branch convolutions. In what follows we introduce a slightly simplified version of GoogLeNet: the original design included a number of tricks to stabilize training through intermediate loss functions applied to multiple layers of the network. They are no longer necessary due to the availability of improved training algorithms.


import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l

The code above imports the modules needed from PyTorch and the d2l package. Line by line:

import torch: imports the PyTorch package, which provides tensor operations and the building blocks for deep learning models.
from torch import nn: imports PyTorch's nn module, which contains the layers, loss functions, and other components used to build models.
from torch.nn import functional as F: imports PyTorch's functional module under the name F. It provides functional versions of activation functions, loss functions, and similar operations.
from d2l import torch as d2l: imports the PyTorch version of the d2l package, which ships the code of the book "Dive into Deep Learning", so that the book's helper classes and functions can be used.

https://youtu.be/xHgGtnef9mA

https://youtu.be/05PCt_JFc84

https://youtu.be/zCagtR4xLMg

https://youtu.be/W9MlakX3vko

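GoogLeNet: the key to understanding it is the Inception block (Inception module)

The Inception module lets even a deep network use relatively few parameters => better model performance

1×1 conv: used to reduce the amount of computation

Auxiliary classifiers: the deeper the network, the more likely vanishing gradients become => the loss is also backpropagated from intermediate layers so that the gradient signal is preserved

In 2014 GoogLeNet placed first and VGG second (MSRA third): VGG has nonetheless been used more widely because it is regarded as strong at producing features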

https://youtu.be/_POOiiV0_3I

8.4.1. Inception Blocks

The basic convolutional block in GoogLeNet is called an Inception block, stemming from the meme “we need to go deeper” of the movie Inception.


Fig. 8.4.1 Structure of the Inception block.

As depicted in Fig. 8.4.1, the inception block consists of four parallel branches. The first three branches use convolutional layers with window sizes of 1×1, 3×3, and 5×5 to extract information from different spatial sizes. The middle two branches also add a 1×1 convolution of the input to reduce the number of channels, reducing the model’s complexity. The fourth branch uses a 3×3 max-pooling layer, followed by a 1×1 convolutional layer to change the number of channels. The four branches all use appropriate padding to give the input and output the same height and width. Finally, the outputs along each branch are concatenated along the channel dimension and comprise the block’s output. The commonly-tuned hyperparameters of the Inception block are the number of output channels per layer, i.e., how to allocate capacity among convolutions of different size.


class Inception(nn.Module):
    # c1--c4 are the number of output channels for each branch
    def __init__(self, c1, c2, c3, c4, **kwargs):
        super(Inception, self).__init__(**kwargs)
        # Branch 1
        self.b1_1 = nn.LazyConv2d(c1, kernel_size=1)
        # Branch 2
        self.b2_1 = nn.LazyConv2d(c2[0], kernel_size=1)
        self.b2_2 = nn.LazyConv2d(c2[1], kernel_size=3, padding=1)
        # Branch 3
        self.b3_1 = nn.LazyConv2d(c3[0], kernel_size=1)
        self.b3_2 = nn.LazyConv2d(c3[1], kernel_size=5, padding=2)
        # Branch 4
        self.b4_1 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
        self.b4_2 = nn.LazyConv2d(c4, kernel_size=1)

    def forward(self, x):
        b1 = F.relu(self.b1_1(x))
        b2 = F.relu(self.b2_2(F.relu(self.b2_1(x))))
        b3 = F.relu(self.b3_2(F.relu(self.b3_1(x))))
        b4 = F.relu(self.b4_2(self.b4_1(x)))
        return torch.cat((b1, b2, b3, b4), dim=1)

The code above implements the Inception class, which defines an Inception block. The block is the building unit of GoogLeNet and applies kernels of several sizes in parallel to extract features at different spatial scales. Line by line:

class Inception(nn.Module):: defines the Inception class as a subclass of nn.Module, the base class for user-defined models in PyTorch.
def __init__(self, c1, c2, c3, c4, **kwargs):: the constructor. c1, c2, c3, and c4 give the numbers of output channels of the four branches; c2 and c3 are pairs holding the channels of the 1x1 reduction and of the convolution that follows it. **kwargs collects any additional keyword arguments.
super(Inception, self).__init__(**kwargs): calls the constructor of the parent class.
self.b1_1 = nn.LazyConv2d(c1, kernel_size=1): branch 1, a convolutional layer with a 1x1 kernel. As with every LazyConv2d, the number of input channels is inferred on the first forward pass.
self.b2_1 = nn.LazyConv2d(c2[0], kernel_size=1): the first layer of branch 2, a convolutional layer with a 1x1 kernel.
self.b2_2 = nn.LazyConv2d(c2[1], kernel_size=3, padding=1): the second layer of branch 2, a convolutional layer with a 3x3 kernel.
self.b3_1 = nn.LazyConv2d(c3[0], kernel_size=1): the first layer of branch 3, a convolutional layer with a 1x1 kernel.
self.b3_2 = nn.LazyConv2d(c3[1], kernel_size=5, padding=2): the second layer of branch 3, a convolutional layer with a 5x5 kernel.
self.b4_1 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1): the first layer of branch 4, a 3x3 max-pooling layer.
self.b4_2 = nn.LazyConv2d(c4, kernel_size=1): the second layer of branch 4, a convolutional layer with a 1x1 kernel.
def forward(self, x):: defines the forward method, which is called when data passes through the model.
b1 = F.relu(self.b1_1(x)): computes branch 1. The input x passes through the 1x1 convolutional layer, followed by the ReLU activation function.
b2 = F.relu(self.b2_2(F.relu(self.b2_1(x)))): computes branch 2. The input x first passes through the 1x1 convolutional layer and ReLU, then through the 3x3 convolutional layer and ReLU again.
b3 = F.relu(self.b3_2(F.relu(self.b3_1(x)))): computes branch 3. The input x first passes through the 1x1 convolutional layer and ReLU, then through the 5x5 convolutional layer and ReLU again.
b4 = F.relu(self.b4_2(self.b4_1(x))): computes branch 4. The input x first passes through the 3x3 max-pooling layer, then through the 1x1 convolutional layer and ReLU.
return torch.cat((b1, b2, b3, b4), dim=1): combines the four branches. torch.cat concatenates their outputs along dimension 1, the channel dimension, and the result is returned as the output of the Inception block.
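The padding choices in the class above can be checked with the usual output-size formula for convolution and pooling, out = floor((n + 2p - k) / s) + 1. The following is a minimal sketch in plain Python that does not call PyTorch; the helper names `out_size` and `inception_out_channels` are made up for this illustration.

```python
def out_size(n, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer (hypothetical helper)."""
    return (n + 2 * padding - kernel) // stride + 1


def inception_out_channels(c1, c2, c3, c4):
    """Channels after concatenating the four branch outputs along the channel axis."""
    return c1 + c2[1] + c3[1] + c4


n = 12  # example height/width entering the block
branch_sizes = {
    "1x1 conv": out_size(n, kernel=1),
    "3x3 conv, padding 1": out_size(n, kernel=3, padding=1),
    "5x5 conv, padding 2": out_size(n, kernel=5, padding=2),
    "3x3 max-pool, stride 1, padding 1": out_size(n, kernel=3, padding=1),
}
for name, size in branch_sizes.items():
    print(f"{name}: {n} -> {size}")

# Only the last layer of each branch contributes to the concatenated output
print(inception_out_channels(64, (96, 128), (16, 32), 32))  # 256
```

Every branch keeps the height and width unchanged, which is what makes concatenation along the channel dimension possible.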

To gain some intuition for why this network works so well, consider the combination of the filters. They explore the image in a variety of filter sizes. This means that details at different extents can be recognized efficiently by filters of different sizes. At the same time, we can allocate different amounts of parameters for different filters.
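To make the allocation of parameters concrete, the sketch below counts weights and biases per branch for an Inception block with channel settings (64, (96, 128), (16, 32), 32) on a 192-channel input, the configuration used later in this section. The helper `conv_params` and the "direct 5x5" comparison are assumptions added for illustration and are not part of the original code.

```python
def conv_params(c_in, c_out, kernel):
    """Weights plus biases of a 2D convolution with a square kernel."""
    return c_in * c_out * kernel * kernel + c_out


c_in = 192  # channels entering the block

branch_params = {
    "branch 1 (1x1)": conv_params(c_in, 64, 1),
    "branch 2 (1x1 -> 3x3)": conv_params(c_in, 96, 1) + conv_params(96, 128, 3),
    "branch 3 (1x1 -> 5x5)": conv_params(c_in, 16, 1) + conv_params(16, 32, 5),
    "branch 4 (pool -> 1x1)": conv_params(c_in, 32, 1),  # pooling has no parameters
}
for name, count in branch_params.items():
    print(f"{name}: {count:,}")

# Hypothetical alternative: a 5x5 convolution applied directly to all 192 channels
reduced = branch_params["branch 3 (1x1 -> 5x5)"]
direct = conv_params(c_in, 32, 5)
print(f"5x5 with 1x1 reduction: {reduced:,}, without: {direct:,}")
```

Under these assumptions the 1×1 reduction cuts the 5×5 branch from 153,632 to 15,920 parameters, roughly a tenfold saving.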


8.4.2. GoogLeNet Model

As shown in Fig. 8.4.2, GoogLeNet uses a stack of a total of 9 inception blocks, arranged into 3 groups with max-pooling in between, and global average pooling in its head to generate its estimates. Max-pooling between inception blocks reduces the dimensionality. At its stem, the first module is similar to AlexNet and LeNet.


We can now implement GoogLeNet piece by piece. Let’s begin with the stem. The first module uses a 64-channel 7×7 convolutional layer.


class GoogleNet(d2l.Classifier):
    def b1(self):
        return nn.Sequential(
            nn.LazyConv2d(64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(), nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

The code above starts the definition of the GoogleNet class, which implements the GoogLeNet architecture. Line by line:

class GoogleNet(d2l.Classifier):: defines the GoogleNet class as a subclass of d2l.Classifier, giving a custom classifier.
def b1(self):: defines the method b1, which builds the first module of GoogLeNet.
return nn.Sequential(...): returns an nn.Sequential container, which applies the listed modules in order.
nn.LazyConv2d(64, kernel_size=7, stride=2, padding=3): a convolutional layer with a 7x7 kernel, 64 output channels, stride 2, and padding 3. The number of input channels is inferred on the first forward pass.
nn.ReLU(): the ReLU activation function.
nn.MaxPool2d(kernel_size=3, stride=2, padding=1): a 3x3 max-pooling layer with stride 2 and padding 1.

The b1 method defined this way builds the first module of the GoogLeNet architecture.

The second module uses two convolutional layers: first, a 64-channel 1×1 convolutional layer, followed by a 3×3 convolutional layer that triples the number of channels. This corresponds to the second branch in the Inception block and concludes the design of the stem. At this point we have 192 channels.


@d2l.add_to_class(GoogleNet)
def b2(self):
    return nn.Sequential(
        nn.LazyConv2d(64, kernel_size=1), nn.ReLU(),
        nn.LazyConv2d(192, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

The code above adds a new method, b2, to the GoogleNet class. Line by line:

@d2l.add_to_class(GoogleNet): a decorator that registers the function below as a method of the GoogleNet class.
def b2(self):: defines the method b2, which builds the second module of GoogLeNet.
return nn.Sequential(...): returns an nn.Sequential container, which applies the listed modules in order.
nn.LazyConv2d(64, kernel_size=1): a convolutional layer with a 1x1 kernel and 64 output channels.
nn.ReLU(): the ReLU activation function.
nn.LazyConv2d(192, kernel_size=3, padding=1): a convolutional layer with a 3x3 kernel, 192 output channels, and padding 1.
nn.ReLU(): the ReLU activation function.
nn.MaxPool2d(kernel_size=3, stride=2, padding=1): a 3x3 max-pooling layer with stride 2 and padding 1.

The b2 method defined this way builds the second module of the GoogLeNet architecture.

The third module connects two complete Inception blocks in series. The number of output channels of the first Inception block is 64+128+32+32=256. This amounts to a ratio of the number of output channels among the four branches of 2:4:1:1. To achieve this, the second and third branches first reduce the 192 input channels to 1/2 and 1/12 of their number, arriving at 96=192/2 and 16=192/12 channels, respectively.


The number of output channels of the second Inception block is increased to 128+192+96+64=480, yielding a ratio of 128:192:96:64=4:6:3:2. As before, we need to reduce the number of intermediate channels in the second and third branches. A scale of 1/2 and 1/8, respectively, suffices, yielding 128=256/2 and 32=256/8 channels. This is captured by the arguments of the following Inception block constructors.


@d2l.add_to_class(GoogleNet)
def b3(self):
    return nn.Sequential(Inception(64, (96, 128), (16, 32), 32),
                         Inception(128, (128, 192), (32, 96), 64),
                         nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

The code above adds a new method, b3, to the GoogleNet class. Line by line:

@d2l.add_to_class(GoogleNet): a decorator that registers the function below as a method of the GoogleNet class.
def b3(self):: defines the method b3, which builds the third module of GoogLeNet.
return nn.Sequential(...): returns an nn.Sequential container, which applies the listed modules in order.
Inception(64, (96, 128), (16, 32), 32): an Inception block built from the class defined earlier. Its four branches output 64, 128, 32, and 32 channels; the second and third branches first reduce the input to 96 and 16 channels with a 1x1 convolution.
Inception(128, (128, 192), (32, 96), 64): a second Inception block. Its four branches output 128, 192, 96, and 64 channels; the second and third branches first reduce the input to 128 and 32 channels.
nn.MaxPool2d(kernel_size=3, stride=2, padding=1): a 3x3 max-pooling layer with stride 2 and padding 1.

The b3 method defined this way builds the third module of the GoogLeNet architecture: two Inception blocks followed by a max-pooling layer.
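The channel arithmetic of the third module can be reproduced in a few lines. This is a sketch in plain Python using only the standard library; `simplest_ratio` and the `summary` list are names invented for the illustration, and the 192 input channels are taken from the output of the second module.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def simplest_ratio(values):
    """Divide all values by their greatest common divisor."""
    g = reduce(gcd, values)
    return tuple(v // g for v in values)


# (c1, c2, c3, c4) as passed to the two Inception constructors in b3
blocks = [(64, (96, 128), (16, 32), 32), (128, (128, 192), (32, 96), 64)]

c_in = 192  # channels entering the third module
summary = []
for c1, c2, c3, c4 in blocks:
    outs = (c1, c2[1], c3[1], c4)
    summary.append((
        sum(outs),                # total output channels
        simplest_ratio(outs),     # ratio among the four branches
        Fraction(c2[0], c_in),    # reduction in branch 2
        Fraction(c3[0], c_in),    # reduction in branch 3
    ))
    c_in = sum(outs)  # the next block receives the concatenated output

for total, ratio, r2, r3 in summary:
    print(total, ratio, r2, r3)
```

The two printed rows match the figures in the text: 256 channels at 2:4:1:1 with reductions 1/2 and 1/12, then 480 channels at 4:6:3:2 with reductions 1/2 and 1/8.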

The fourth module is more complicated. It connects five Inception blocks in series, and they have 192+208+48+64=512, 160+224+64+64=512, 128+256+64+64=512, 112+288+64+64=528, and 256+320+128+128=832 output channels, respectively. The number of channels assigned to these branches is similar to that in the third module: the second branch with the 3×3 convolutional layer outputs the largest number of channels, followed by the first branch with only the 1×1 convolutional layer, the third branch with the 5×5 convolutional layer, and the fourth branch with the 3×3 max-pooling layer. The second and third branches will first reduce the number of channels according to the ratio. These ratios are slightly different in different Inception blocks.


@d2l.add_to_class(GoogleNet)
def b4(self):
    return nn.Sequential(Inception(192, (96, 208), (16, 48), 64),
                         Inception(160, (112, 224), (24, 64), 64),
                         Inception(128, (128, 256), (24, 64), 64),
                         Inception(112, (144, 288), (32, 64), 64),
                         Inception(256, (160, 320), (32, 128), 128),
                         nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

The code above adds a new method, b4, to the GoogleNet class. Line by line:

@d2l.add_to_class(GoogleNet): a decorator that registers the function below as a method of the GoogleNet class.
def b4(self):: defines the method b4, which builds the fourth module of GoogLeNet.
return nn.Sequential(...): returns an nn.Sequential container, which applies the listed modules in order.
Inception(192, (96, 208), (16, 48), 64): an Inception block whose four branches output 192, 208, 48, and 64 channels; the second and third branches first reduce the input to 96 and 16 channels.
Inception(160, (112, 224), (24, 64), 64): an Inception block whose four branches output 160, 224, 64, and 64 channels; the second and third branches first reduce the input to 112 and 24 channels.
Inception(128, (128, 256), (24, 64), 64): an Inception block whose four branches output 128, 256, 64, and 64 channels; the second and third branches first reduce the input to 128 and 24 channels.
Inception(112, (144, 288), (32, 64), 64): an Inception block whose four branches output 112, 288, 64, and 64 channels; the second and third branches first reduce the input to 144 and 32 channels.
Inception(256, (160, 320), (32, 128), 128): an Inception block whose four branches output 256, 320, 128, and 128 channels; the second and third branches first reduce the input to 160 and 32 channels.
nn.MaxPool2d(kernel_size=3, stride=2, padding=1): a 3x3 max-pooling layer with stride 2 and padding 1.

The b4 method defined this way builds the fourth module of the GoogLeNet architecture: five Inception blocks followed by a max-pooling layer.
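Because each block in the fourth module receives the concatenated output of the previous one, the reduction ratios depend on a changing input width. The sketch below traces this in plain Python; the variable names are made up for the illustration, and the starting value of 480 channels is the output of the third module.

```python
from fractions import Fraction

# (c1, c2, c3, c4) for the five Inception blocks of the fourth module
blocks = [
    (192, (96, 208), (16, 48), 64),
    (160, (112, 224), (24, 64), 64),
    (128, (128, 256), (24, 64), 64),
    (112, (144, 288), (32, 64), 64),
    (256, (160, 320), (32, 128), 128),
]

c_in = 480  # channels leaving the third module
rows = []
for c1, c2, c3, c4 in blocks:
    c_out = c1 + c2[1] + c3[1] + c4
    # Fraction of the input channels kept by the 1x1 reduction in branches 2 and 3
    rows.append((c_in, c_out, Fraction(c2[0], c_in), Fraction(c3[0], c_in)))
    c_in = c_out  # the next block receives this block's output

for row in rows:
    print(*row)

# Check which branch is widest in each block
second_is_widest = all(
    c2[1] == max(c1, c2[1], c3[1], c4) for c1, c2, c3, c4 in blocks
)
print(second_is_widest)
```

With these settings the second branch is the widest in every block, while the third and fourth branches end up tied or close to each other.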

The fifth module has two Inception blocks with 256+320+128+128=832 and 384+384+128+128=1024 output channels. The channels are allocated among the branches following the same pattern as in the third and fourth modules, but with different specific values. It should be noted that the fifth module is followed by the output layer. This module uses the global average pooling layer to change the height and width of each channel to 1, just as in NiN. Finally, we turn the output into a two-dimensional array followed by a fully connected layer whose number of outputs is the number of label classes.


@d2l.add_to_class(GoogleNet)
def b5(self):
    return nn.Sequential(Inception(256, (160, 320), (32, 128), 128),
                         Inception(384, (192, 384), (48, 128), 128),
                         nn.AdaptiveAvgPool2d((1,1)), nn.Flatten())

The code above adds a new method, b5, to the GoogleNet class. Line by line:

@d2l.add_to_class(GoogleNet): a decorator that registers the function below as a method of the GoogleNet class.
def b5(self):: defines the method b5, which builds the fifth module of GoogLeNet.
return nn.Sequential(...): returns an nn.Sequential container, which applies the listed modules in order.
Inception(256, (160, 320), (32, 128), 128): an Inception block whose four branches output 256, 320, 128, and 128 channels; the second and third branches first reduce the input to 160 and 32 channels.
Inception(384, (192, 384), (48, 128), 128): an Inception block whose four branches output 384, 384, 128, and 128 channels; the second and third branches first reduce the input to 192 and 48 channels.
nn.AdaptiveAvgPool2d((1,1)): a global average pooling layer. Whatever the input height and width, it averages each channel down to an output of size (1, 1).
nn.Flatten(): flattens everything after the batch dimension, turning the (batch, channels, 1, 1) tensor into a (batch, channels) tensor that can be fed to the fully connected layer.

The b5 method defined this way builds the fifth module of the GoogLeNet architecture: two Inception blocks followed by a global average pooling layer and a flatten layer.
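What global average pooling followed by flattening does to a single example can be shown with nested lists. This is a toy sketch in plain Python with made-up numbers; `global_avg_pool` is a hypothetical helper, not the PyTorch layer itself.

```python
def global_avg_pool(feature_maps):
    """Average each channel's H x W grid down to a single number."""
    pooled = []
    for channel in feature_maps:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


# One example with 2 channels of size 3 x 3 (values chosen for illustration)
example = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[0, 0, 0], [0, 6, 0], [0, 0, 3]],
]

features = global_avg_pool(example)
print(features)  # [5.0, 1.0]
```

Each channel contributes exactly one number, so the 1024 channels leaving the last Inception block become a vector of length 1024 per example regardless of the input resolution.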

Now that we defined all blocks b1 through b5, it is just a matter of assembling them all into a full network.


@d2l.add_to_class(GoogleNet)
def __init__(self, lr=0.1, num_classes=10):
    super(GoogleNet, self).__init__()
    self.save_hyperparameters()
    self.net = nn.Sequential(self.b1(), self.b2(), self.b3(), self.b4(),
                             self.b5(), nn.LazyLinear(num_classes))
    self.net.apply(d2l.init_cnn)

The code above adds the __init__ method to the GoogleNet class. Line by line:

@d2l.add_to_class(GoogleNet): a decorator that registers the function below as a method of the GoogleNet class.
def __init__(self, lr=0.1, num_classes=10):: defines the __init__ method, which initializes an instance of GoogleNet. lr sets the learning rate and num_classes is the number of classes to predict.
super(GoogleNet, self).__init__(): calls the __init__ method of the parent class d2l.Classifier, which is itself built on nn.Module, so that the module is set up before any layers are attached.
self.save_hyperparameters(): calls the save_hyperparameters method provided by d2l, which stores the constructor arguments (here lr and num_classes) as attributes of the instance.
self.net = nn.Sequential(self.b1(), self.b2(), self.b3(), self.b4(), self.b5(), nn.LazyLinear(num_classes)): defines the network. nn.Sequential stacks the modules in order to form the full model. self.b1() through self.b5() are the methods defined earlier, each building one of the five modules, and nn.LazyLinear(num_classes) is the final fully connected layer.
self.net.apply(d2l.init_cnn): applies the d2l.init_cnn initialization function to every submodule of the network.

In summary, the GoogleNet class defines the GoogLeNet architecture: the methods b1 through b5 build the individual modules, nn.Sequential chains them into the full network, and d2l.init_cnn handles weight initialization. The resulting class can then be used for training or inference.

The GoogLeNet model is computationally complex. Note the large number of relatively arbitrary hyperparameters in terms of the number of channels chosen, the number of blocks prior to dimensionality reduction, the relative partitioning of capacity across channels, etc. Much of it is due to the fact that at the time when GoogLeNet was introduced, automatic tools for network definition or design exploration were not yet available. For instance, by now we take it for granted that a competent deep learning framework is capable of inferring dimensionalities of input tensors automatically. At the time, many such configurations had to be specified explicitly by the experimenter, thus often slowing down active experimentation. Moreover, the tools needed for automatic exploration were still in flux and initial experiments largely amounted to costly brute force exploration, genetic algorithms, and similar strategies.


For now the only modification we will carry out is to reduce the input height and width from 224 to 96 to have a reasonable training time on Fashion-MNIST. This simplifies the computation. Let’s have a look at the changes in the shape of the output between the various modules.


model = GoogleNet().layer_summary((1, 1, 96, 96))

The code above creates a GoogleNet model and uses the layer_summary method to check the output shape of each module. In order:

GoogleNet(): creates an instance of the GoogleNet class with the default hyperparameters.
.layer_summary((1, 1, 96, 96)): passes an input of shape (1, 1, 96, 96), that is, one single-channel 96x96 image, through the network and prints the output shape after each top-level module.

Running this code prints one line per module, which makes it easy to follow the structure of the model and to see how the input changes along the way.

Sequential output shape:     torch.Size([1, 64, 24, 24])
Sequential output shape:     torch.Size([1, 192, 12, 12])
Sequential output shape:     torch.Size([1, 480, 6, 6])
Sequential output shape:     torch.Size([1, 832, 3, 3])
Sequential output shape:     torch.Size([1, 1024])
Linear output shape:         torch.Size([1, 10])
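The shapes printed above can be reproduced by hand from the layer settings. The sketch below is plain Python and does not run the model; `out_size`, `halve`, and `googlenet_shapes` are hypothetical helpers, and the channel counts are copied from the module definitions earlier in this section.

```python
def out_size(n, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (n + 2 * padding - kernel) // stride + 1


def halve(n):
    """3x3 max-pooling with stride 2 and padding 1, as used at the end of b1 to b4."""
    return out_size(n, kernel=3, stride=2, padding=1)


def googlenet_shapes(n, num_classes=10):
    """Per-module output shapes (without the batch axis) for an n x n input."""
    shapes = []
    n = halve(out_size(n, kernel=7, stride=2, padding=3))  # b1: 7x7 conv, pool
    shapes.append((64, n, n))
    n = halve(n)                                           # b2: convs keep size, pool
    shapes.append((192, n, n))
    n = halve(n)                                           # b3: Inception keeps size, pool
    shapes.append((480, n, n))
    n = halve(n)                                           # b4
    shapes.append((832, n, n))
    shapes.append((1024,))                                 # b5: global pooling + flatten
    shapes.append((num_classes,))                          # final linear layer
    return shapes


for shape in googlenet_shapes(96):
    print(shape)
```

Calling `googlenet_shapes(224)` gives a 7×7 feature map after the fourth module instead of 3×3, which illustrates how much spatial computation the smaller 96×96 input saves.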

8.4.3. Training

As before, we train our model using the Fashion-MNIST dataset. We transform it to 96×96 pixel resolution before invoking the training procedure.


model = GoogleNet(lr=0.01)
trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
data = d2l.FashionMNIST(batch_size=128, resize=(96, 96))
model.apply_init([next(iter(data.get_dataloader(True)))[0]], d2l.init_cnn)
trainer.fit(model, data)

The code above creates a GoogleNet model and trains it. Line by line:

model = GoogleNet(lr=0.01): creates model, an instance of the GoogleNet class, with the learning rate lr set to 0.01.
trainer = d2l.Trainer(max_epochs=10, num_gpus=1): creates trainer, an instance of d2l.Trainer, with the maximum number of epochs set to 10 and the number of GPUs set to 1.
data = d2l.FashionMNIST(batch_size=128, resize=(96, 96)): loads the Fashion-MNIST dataset into data, with a batch size of 128 and the images resized to 96x96 pixels.
model.apply_init([next(iter(data.get_dataloader(True)))[0]], d2l.init_cnn): initializes the model. The first mini-batch of training inputs is passed through the network so that the lazy layers can infer their shapes, and d2l.init_cnn is then applied to initialize the parameters.
trainer.fit(model, data): trains the model model on the dataset data using trainer.

Running the code above trains the GoogleNet model on Fashion-MNIST. The trained model is held in the variable model and can afterwards be used for prediction or evaluation on other data.
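For a rough sense of the workload these settings imply, the following back-of-the-envelope sketch counts mini-batches and compares image sizes. It assumes the standard Fashion-MNIST training split of 60,000 images, that the final partial batch is kept, and that one parameter update happens per mini-batch; none of this is read from the d2l code.

```python
import math

train_examples = 60_000  # assumed size of the Fashion-MNIST training split
batch_size, max_epochs = 128, 10

# Assumption: the last, smaller batch is not dropped
batches_per_epoch = math.ceil(train_examples / batch_size)
total_updates = batches_per_epoch * max_epochs

# Resizing from the native 28x28 to 96x96 multiplies the pixels per image
pixel_ratio = (96 * 96) / (28 * 28)

print(batches_per_epoch, total_updates, round(pixel_ratio, 2))
```

Under these assumptions one epoch has 469 mini-batches, and each resized image carries close to twelve times as many pixels as the original.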

8.4.4. Discussion

A key feature of GoogLeNet is that it is actually cheaper to compute than its predecessors while simultaneously providing improved accuracy. This marks the beginning of a much more deliberate network design that trades off the cost of evaluating a network with a reduction in errors. It also marks the beginning of experimentation at a block level with network design hyperparameters, even though it was entirely manual at the time. We will revisit this topic in Section 8.8 when discussing strategies for network structure exploration.


Over the following sections we will encounter a number of design choices (e.g., batch normalization, residual connections, and channel grouping) that allow us to improve networks significantly. For now, you can be proud to have implemented what is arguably the first truly modern CNN.


8.4.5. Exercises

1. GoogLeNet was so successful that it went through a number of iterations that progressively improved speed and accuracy. Try to implement and run some of them. They include the following:
    1. Add a batch normalization layer (Ioffe and Szegedy, 2015), as described later in Section 8.5.
    2. Make adjustments to the Inception block (width, choice and order of convolutions), as described in Szegedy et al. (2016).
    3. Use label smoothing for model regularization, as described in Szegedy et al. (2016).
    4. Make further adjustments to the Inception block by adding a residual connection (Szegedy et al., 2017), as described later in Section 8.6.
2. What is the minimum image size for GoogLeNet to work?
3. Can you design a variant of GoogLeNet that works on Fashion-MNIST’s native resolution of 28×28 pixels? How would you need to change the stem, the body, and the head of the network, if anything at all?
4. Compare the model parameter sizes of AlexNet, VGG, NiN, and GoogLeNet. How do the latter two network architectures significantly reduce the model parameter size?
5. Compare the amount of computation needed in GoogLeNet and AlexNet. How does this affect the design of an accelerator chip, e.g., in terms of memory size, memory bandwidth, cache size, the amount of computation, and the benefit of specialized operations?
