D2L - 14.4. Anchor Boxes

2023. 8. 19. 07:45 | Posted by 솔웅

https://d2l.ai/chapter_computer-vision/anchor.html

14.4. Anchor Boxes

Object detection algorithms usually sample a large number of regions in the input image, determine whether these regions contain objects of interest, and adjust the boundaries of the regions so as to predict the ground-truth bounding boxes of the objects more accurately. Different models may adopt different region sampling schemes. Here we introduce one such method: it generates multiple bounding boxes with varying scales and aspect ratios centered on each pixel. These bounding boxes are called anchor boxes. We will design an object detection model based on anchor boxes in Section 14.7.

First, let’s reduce the printing precision for more concise output.

%matplotlib inline
import torch
from d2l import torch as d2l

torch.set_printoptions(2)  # Simplify printing accuracy

The code above adjusts the precision PyTorch uses when printing tensors.

%matplotlib inline: a Jupyter magic command that displays Matplotlib figures inline in the notebook.
import torch: imports the PyTorch library.
from d2l import torch as d2l: imports the PyTorch version of the d2l helper library under the name d2l.
torch.set_printoptions(2): sets the print precision to two digits after the decimal point, so tensor values are printed in a shorter form.

In short, this only changes how tensors are displayed; the stored values are unchanged.

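What is an anchor box?

An anchor box is a concept used in object detection and object localization. When a detection model predicts object locations and classes, it has to handle objects of many sizes and aspect ratios within an image. Anchor boxes are predefined rectangular boxes that cover this range of sizes and aspect ratios.

Anchor boxes are mainly used in object detection models based on convolutional neural networks (CNNs). The model takes an image as input and extracts feature maps at several scales. At each position of these feature maps, anchor boxes serve as candidate regions for predicting objects.

An anchor box is defined mainly by two parameters:

Sizes: the scale of the box, chosen to cover different object sizes. For example, small boxes can be defined for small objects and large boxes for large objects.
Ratios: the ratio of the box's width to its height, chosen to cover different object shapes. Typical values are 1:1, 1:2, and 2:1.

The detection model generates several anchor boxes at each position of the feature map and predicts object locations and classes relative to them. The model is then trained by comparing these predictions with the actual object locations. Using anchor boxes makes it possible to predict objects of varied sizes and shapes effectively.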

14.4.1. Generating Multiple Anchor Boxes

Suppose that the input image has a height of h and a width of w. We generate anchor boxes with different shapes centered on each pixel of the image. Let the scale be s∈(0,1] and the aspect ratio (ratio of width to height) be r>0. Then the width and height of the anchor box are ws√r and hs/√r, respectively. Note that when the center position is given, an anchor box with known width and height is determined.
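As a quick check of the formula above, here is a minimal sketch in plain Python. The helper name anchor_size and the 400×400 image are assumptions made for illustration only; they are not part of the d2l library.

```python
import math

def anchor_size(img_w, img_h, s, r):
    """Pixel width and height of an anchor box with scale s and aspect ratio r,
    following the text's formula: width = w*s*sqrt(r), height = h*s/sqrt(r)."""
    width = img_w * s * math.sqrt(r)
    height = img_h * s / math.sqrt(r)
    return width, height

# Hypothetical square image, scale 0.5, aspect ratio 4
w_px, h_px = anchor_size(img_w=400, img_h=400, s=0.5, r=4.0)
print(w_px, h_px)  # 400.0 100.0
```

For a square image the resulting box has exactly the aspect ratio r. Note that the multibox_prior code shown later multiplies the normalized width by in_height / in_width, so for non-square images its pixel width works out to hs√r rather than ws√r.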

To generate multiple anchor boxes with different shapes, let’s set a series of scales s1,…,sn and a series of aspect ratios r1,…,rm. When using all the combinations of these scales and aspect ratios with each pixel as the center, the input image will have a total of whnm anchor boxes. Although these anchor boxes may cover all the ground-truth bounding boxes, the computational complexity can easily become too high. In practice, we consider only those combinations containing s1 or r1:

$$(s_1, r_1), (s_1, r_2), \ldots, (s_1, r_m), (s_2, r_1), (s_3, r_1), \ldots, (s_n, r_1).$$

That is to say, the number of anchor boxes centered on the same pixel is n+m−1. For the entire input image, we will generate a total of wh(n+m−1) anchor boxes.
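The counting argument can be sketched in a few lines of plain Python. The helper anchor_combinations and the tiny 4×6 image are hypothetical; the scales and ratios are the ones used later in this post.

```python
def anchor_combinations(sizes, ratios):
    """Pair every scale with the first ratio, then the first scale with the
    remaining ratios, giving n + m - 1 combinations per pixel."""
    combos = [(s, ratios[0]) for s in sizes]
    combos += [(sizes[0], r) for r in ratios[1:]]
    return combos

sizes, ratios = [0.75, 0.5, 0.25], [1, 2, 0.5]
combos = anchor_combinations(sizes, ratios)
print(combos)       # [(0.75, 1), (0.5, 1), (0.25, 1), (0.75, 2), (0.75, 0.5)]
print(len(combos))  # 5, i.e. n + m - 1 with n = m = 3

# Total number of anchor boxes for a hypothetical image of height 4 and width 6
img_h, img_w = 4, 6
total = img_h * img_w * len(combos)
print(total)        # 120
```

Using all n·m combinations would give 9 boxes per pixel here instead of 5.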

The above method of generating anchor boxes is implemented in the following multibox_prior function. We specify the input image, a list of scales, and a list of aspect ratios, then this function will return all the anchor boxes.

#@save
def multibox_prior(data, sizes, ratios):
    """Generate anchor boxes with different shapes centered on each pixel."""
    in_height, in_width = data.shape[-2:]
    device, num_sizes, num_ratios = data.device, len(sizes), len(ratios)
    boxes_per_pixel = (num_sizes + num_ratios - 1)
    size_tensor = torch.tensor(sizes, device=device)
    ratio_tensor = torch.tensor(ratios, device=device)
    # Offsets are required to move the anchor to the center of a pixel. Since
    # a pixel has height=1 and width=1, we choose to offset our centers by 0.5
    offset_h, offset_w = 0.5, 0.5
    steps_h = 1.0 / in_height  # Scaled steps in y axis
    steps_w = 1.0 / in_width  # Scaled steps in x axis

    # Generate all center points for the anchor boxes
    center_h = (torch.arange(in_height, device=device) + offset_h) * steps_h
    center_w = (torch.arange(in_width, device=device) + offset_w) * steps_w
    shift_y, shift_x = torch.meshgrid(center_h, center_w, indexing='ij')
    shift_y, shift_x = shift_y.reshape(-1), shift_x.reshape(-1)

    # Generate `boxes_per_pixel` number of heights and widths that are later
    # used to create anchor box corner coordinates (xmin, ymin, xmax, ymax)
    w = torch.cat((size_tensor * torch.sqrt(ratio_tensor[0]),
                   sizes[0] * torch.sqrt(ratio_tensor[1:])))\
                   * in_height / in_width  # Handle rectangular inputs
    h = torch.cat((size_tensor / torch.sqrt(ratio_tensor[0]),
                   sizes[0] / torch.sqrt(ratio_tensor[1:])))
    # Divide by 2 to get half height and half width
    anchor_manipulations = torch.stack((-w, -h, w, h)).T.repeat(
                                        in_height * in_width, 1) / 2

    # Each center point will have `boxes_per_pixel` number of anchor boxes, so
    # generate a grid of all anchor box centers with `boxes_per_pixel` repeats
    out_grid = torch.stack([shift_x, shift_y, shift_x, shift_y],
                dim=1).repeat_interleave(boxes_per_pixel, dim=0)
    output = out_grid + anchor_manipulations
    return output.unsqueeze(0)

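The code above defines the multibox_prior function, which generates anchor boxes of various sizes and ratios centered on each pixel of the image.

multibox_prior(data, sizes, ratios): takes an input tensor data, a list of anchor box scales sizes, and a list of aspect ratios ratios, and returns anchor boxes of those scales and ratios centered on every pixel.

Inside the function, the following steps are performed:

Get the height and width of the input data.
Create the size and ratio tensors on the same device as the input.
Set the offsets that move each anchor center to the center of its pixel.
Compute the normalized step sizes along the y and x axes.
Generate the center points of the anchor boxes.
Compute the half-widths and half-heights for every scale and ratio combination, and add them to each center to obtain corner coordinates.
Return the generated anchor boxes.

The resulting anchor boxes are used in object detection as candidate regions of several sizes and ratios at every pixel of the image.

Let's go through it line by line.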

#@save
def multibox_prior(data, sizes, ratios):
    """Generate anchor boxes with different shapes centered on each pixel."""

The function definition begins here. The function is named multibox_prior and takes data, sizes, and ratios as inputs. Its docstring describes what it does: "Generate anchor boxes with different shapes centered on each pixel."

    in_height, in_width = data.shape[-2:]
    device, num_sizes, num_ratios = data.device, len(sizes), len(ratios)
    boxes_per_pixel = (num_sizes + num_ratios - 1)

The height and width of the input data are read from its last two dimensions and stored in in_height and in_width. The device of data is stored in device, and the lengths of sizes and ratios are stored in num_sizes and num_ratios. boxes_per_pixel is the number of anchor boxes per pixel, n+m−1.

    size_tensor = torch.tensor(sizes, device=device)
    ratio_tensor = torch.tensor(ratios, device=device)

sizes and ratios are converted to tensors on that device and stored in size_tensor and ratio_tensor.

    offset_h, offset_w = 0.5, 0.5
    steps_h = 1.0 / in_height  # Scaled steps in y axis
    steps_w = 1.0 / in_width  # Scaled steps in x axis

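The offsets move each anchor to the center of its pixel. steps_h and steps_w are the step sizes along the y and x axes after the image height and width are normalized to 1.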

    center_h = (torch.arange(in_height, device=device) + offset_h) * steps_h
    center_w = (torch.arange(in_width, device=device) + offset_w) * steps_w
    shift_y, shift_x = torch.meshgrid(center_h, center_w, indexing='ij')
    shift_y, shift_x = shift_y.reshape(-1), shift_x.reshape(-1)

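The normalized center coordinates along the y and x axes are computed. torch.meshgrid then combines them into every (y, x) center, and reshape(-1) flattens the results into shift_y and shift_x.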
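To make the center computation concrete, here is a small plain-Python sketch of the same arithmetic without tensors. The helper pixel_centers and the 2×3 image are assumptions for illustration.

```python
def pixel_centers(n):
    """Normalized center coordinates of n pixels along one axis:
    (i + 0.5) * (1 / n) for i = 0, ..., n - 1."""
    return [(i + 0.5) / n for i in range(n)]

centers = pixel_centers(4)
print(centers)  # [0.125, 0.375, 0.625, 0.875]

# Row-major grid of (x, y) centers for a hypothetical image of height 2 and width 3.
# This mirrors meshgrid(..., indexing='ij') followed by reshape(-1):
# y varies slowest, x varies fastest.
in_height, in_width = 2, 3
grid = [(x, y) for y in pixel_centers(in_height) for x in pixel_centers(in_width)]
print(len(grid))  # 6, one center per pixel
```

Every center lies strictly inside the unit square, which is why the returned anchor coordinates are expressed as fractions of the image width and height.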

    w = torch.cat((size_tensor * torch.sqrt(ratio_tensor[0]),
                   sizes[0] * torch.sqrt(ratio_tensor[1:])))\
                   * in_height / in_width  # Handle rectangular inputs
    h = torch.cat((size_tensor / torch.sqrt(ratio_tensor[0]),
                   sizes[0] / torch.sqrt(ratio_tensor[1:])))

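The widths and heights of the anchor boxes are computed from the scales and ratios, in the order (s1, r1), …, (sn, r1), (s1, r2), …, (s1, rm). The factor in_height / in_width adjusts the normalized width for non-square inputs.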

    anchor_manipulations = torch.stack((-w, -h, w, h)).T.repeat(
                                        in_height * in_width, 1) / 2

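These are the offsets from an anchor's center to its corners: (-w, -h, w, h) divided by 2 gives the half-widths and half-heights, repeated for every pixel.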

    out_grid = torch.stack([shift_x, shift_y, shift_x, shift_y],
                dim=1).repeat_interleave(boxes_per_pixel, dim=0)
    output = out_grid + anchor_manipulations
    return output.unsqueeze(0)

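A grid of anchor centers is built with each center repeated boxes_per_pixel times, and the corner offsets are added to obtain the final (xmin, ymin, xmax, ymax) coordinates. unsqueeze(0) adds a leading batch dimension before returning.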

We can see that the shape of the returned anchor box variable Y is (batch size, number of anchor boxes, 4).

img = d2l.plt.imread('../img/catdog.jpg')
h, w = img.shape[:2]

print(h, w)
X = torch.rand(size=(1, 3, h, w))  # Construct input data
Y = multibox_prior(X, sizes=[0.75, 0.5, 0.25], ratios=[1, 2, 0.5])
Y.shape

The code above does the following, line by line.

d2l.plt.imread reads the image file into img.

The height and width of the image are stored in h and w.

The height and width are printed.

A random tensor X of shape (1, 3, h, w) is created to stand in for the network input. The 3 is the number of RGB channels.

multibox_prior is called on X with the given scales (sizes) and aspect ratios (ratios) to generate the anchor boxes.

The shape of the generated anchor boxes is displayed. It is (1, number of anchor boxes, 4), where the number of anchor boxes is h × w × 5 and the 4 values are the corner coordinates (xmin, ymin, xmax, ymax).

After changing the shape of the anchor box variable Y to (image height, image width, number of anchor boxes centered on the same pixel, 4), we can obtain all the anchor boxes centered on a specified pixel position. In the following, we access the first anchor box centered on (250, 250). It has four elements: the (x,y)-axis coordinates at the upper-left corner and the (x,y)-axis coordinates at the lower-right corner of the anchor box. The coordinate values of both axes are divided by the width and height of the image, respectively.

boxes = Y.reshape(h, w, 5, 4)
boxes[250, 250, 0, :]

The code above does the following, line by line.

Y is reshaped to (h, w, 5, 4) and stored in boxes. Here h and w are the image height and width, 5 is the number of anchor boxes generated per pixel, and 4 is the number of coordinates per anchor box.

The first anchor box centered on pixel (250, 250) is then read out; index 0 selects the first anchor box. The result is its normalized corner coordinates (xmin, ymin, xmax, ymax).

#@save
def show_bboxes(axes, bboxes, labels=None, colors=None):
    """Show bounding boxes."""

    def make_list(obj, default_values=None):
        if obj is None:
            obj = default_values
        elif not isinstance(obj, (list, tuple)):
            obj = [obj]
        return obj

    labels = make_list(labels)
    colors = make_list(colors, ['b', 'g', 'r', 'm', 'c'])
    for i, bbox in enumerate(bboxes):
        color = colors[i % len(colors)]
        rect = d2l.bbox_to_rect(bbox.detach().numpy(), color)
        axes.add_patch(rect)
        if labels and len(labels) > i:
            text_color = 'k' if color == 'w' else 'w'
            axes.text(rect.xy[0], rect.xy[1], labels[i],
                      va='center', ha='center', fontsize=9, color=text_color,
                      bbox=dict(facecolor=color, lw=0))

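The code above defines the show_bboxes function, which takes the following arguments.

axes: the Matplotlib axes object on which the bounding boxes are drawn.
bboxes: a tensor holding the bounding box coordinates.
labels: the label for each bounding box.
colors: the color for each bounding box.

make_list is a helper that returns obj as a list: if obj is None it returns default_values (None unless specified), and if obj is a single value it wraps it in a list.

labels and colors are converted to lists. By default, colors cycles through blue, green, red, magenta, and cyan.

enumerate iterates over the bounding boxes together with their indices.
Each bbox is converted to a Matplotlib Rectangle, which is added to the axes.
If labels is given and the current index is smaller than the number of labels, the label is drawn at the rectangle's anchor point rect.xy (its upper-left corner in image coordinates). The text is black on a white box and white otherwise, on a background of the box color, so that it stays readable.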

As we just saw, the coordinate values of the x and y axes in the variable boxes have been divided by the width and height of the image, respectively. When drawing anchor boxes, we need to restore their original coordinate values; thus, we define variable bbox_scale below. Now, we can draw all the anchor boxes centered on (250, 250) in the image. As you can see, the blue anchor box with a scale of 0.75 and an aspect ratio of 1 well surrounds the dog in the image.

d2l.set_figsize()
bbox_scale = torch.tensor((w, h, w, h))
fig = d2l.plt.imshow(img)
show_bboxes(fig.axes, boxes[250, 250, :, :] * bbox_scale,
            ['s=0.75, r=1', 's=0.5, r=1', 's=0.25, r=1', 's=0.75, r=2',
             's=0.75, r=0.5'])

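The code above draws the bounding boxes on top of the image.

d2l.set_figsize() sets the figure size. d2l is the helper package that accompanies the book.

bbox_scale holds the image width and height, (w, h, w, h), and is used to convert the normalized box coordinates back to pixel coordinates.

d2l.plt.imshow displays img and returns an image object stored in fig; the axes it was drawn on are available as fig.axes.

show_bboxes is then called to draw the boxes on the image. boxes[250, 250, :, :] * bbox_scale gives the pixel coordinates of all the anchor boxes centered on the selected pixel.
A label is passed for each box, so the scale and aspect ratio of every anchor box can be read off the plot.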
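The rescaling by bbox_scale amounts to multiplying x coordinates by the image width and y coordinates by the image height. A minimal sketch in plain Python, where the helper to_pixels, the box, and the 400×200 image size are all made up for illustration:

```python
def to_pixels(box, img_w, img_h):
    """Convert a normalized (xmin, ymin, xmax, ymax) box to pixel coordinates."""
    xmin, ymin, xmax, ymax = box
    return (xmin * img_w, ymin * img_h, xmax * img_w, ymax * img_h)

# Hypothetical normalized box on a hypothetical image of width 400 and height 200
pixel_box = to_pixels((0.25, 0.125, 0.75, 0.875), img_w=400, img_h=200)
print(pixel_box)  # (100.0, 25.0, 300.0, 175.0)
```

This is the same element-wise product as boxes[...] * torch.tensor((w, h, w, h)), written out for a single box.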

14.4.2. Intersection over Union (IoU)

We just mentioned that an anchor box “well” surrounds the dog in the image. If the ground-truth bounding box of the object is known, how can “well” here be quantified? Intuitively, we can measure the similarity between the anchor box and the ground-truth bounding box. We know that the Jaccard index can measure the similarity between two sets. Given sets A and B, their Jaccard index is the size of their intersection divided by the size of their union:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}.$$

What is the Jaccard index?

The Jaccard index, also known as intersection over union (IoU), is a statistical measure of the similarity or overlap between two sets. It is commonly used in image segmentation, object detection, and binary classification to measure how well predicted regions or bounding boxes match the ground truth.

In object detection or image segmentation, the Jaccard index is the ratio of the area of overlap between the predicted region and the ground-truth region to the total area covered by both regions. Mathematically, the Jaccard index is defined as:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

Where:

A represents the set of pixels or elements in the predicted region.
B represents the set of pixels or elements in the ground-truth region.
|A ∩ B| denotes the size of the intersection of sets A and B.
|A ∪ B| denotes the size of the union of sets A and B.

The Jaccard index ranges from 0 to 1, where 0 indicates no overlap between the sets (no agreement), and 1 indicates complete overlap (perfect agreement). In object detection, a higher Jaccard index means the predicted bounding box is better aligned with the ground-truth bounding box.

The Jaccard index is a valuable metric for evaluating algorithms that identify regions of interest or bounding boxes, because it measures how well the predictions agree spatially with the ground truth.

In fact, we can consider the pixel area of any bounding box as a set of pixels. In this way, we can measure the similarity of the two bounding boxes by the Jaccard index of their pixel sets. For two bounding boxes, we usually refer to their Jaccard index as intersection over union (IoU), which is the ratio of their intersection area to their union area, as shown in Fig. 14.4.1. The range of an IoU is between 0 and 1: 0 means that two bounding boxes do not overlap at all, while 1 indicates that the two bounding boxes are equal.

Fig. 14.4.1  IoU is the ratio of the intersection area to the union area of two bounding boxes.
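Before looking at the batched version, it may help to compute IoU for a single pair of boxes by hand. The following is a minimal plain-Python sketch, assuming boxes in corner format (xmin, ymin, xmax, ymax) with positive area; the function name iou and the example boxes are illustrative only.

```python
def iou(box_a, box_b):
    """IoU of two boxes in corner format (xmin, ymin, xmax, ymax).
    Assumes both boxes have positive area."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the intersection, clamped at zero when boxes are disjoint
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7, about 0.143: overlap 1, union 4 + 4 - 1
print(iou((0, 0, 1, 1), (2, 2, 3, 3)))  # 0.0: no overlap
print(iou((0, 0, 2, 2), (0, 0, 2, 2)))  # 1.0: identical boxes
```

The clamp at zero plays the same role as clamp(min=0) in a tensor implementation: without it, disjoint boxes would produce a spurious positive "intersection" from two negative side lengths.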

For the remainder of this section, we will use IoU to measure the similarity between anchor boxes and ground-truth bounding boxes, and between different anchor boxes. Given two lists of anchor or bounding boxes, the following box_iou computes their pairwise IoU across these two lists.

#@save
def box_iou(boxes1, boxes2):
    """Compute pairwise IoU across two lists of anchor or bounding boxes."""
    box_area = lambda boxes: ((boxes[:, 2] - boxes[:, 0]) *
                              (boxes[:, 3] - boxes[:, 1]))
    # Shape of `boxes1`, `boxes2`, `areas1`, `areas2`: (no. of boxes1, 4),
    # (no. of boxes2, 4), (no. of boxes1,), (no. of boxes2,)
    areas1 = box_area(boxes1)
    areas2 = box_area(boxes2)
    # Shape of `inter_upperlefts`, `inter_lowerrights`, `inters`: (no. of
    # boxes1, no. of boxes2, 2)
    inter_upperlefts = torch.max(boxes1[:, None, :2], boxes2[:, :2])
    inter_lowerrights = torch.min(boxes1[:, None, 2:], boxes2[:, 2:])
    inters = (inter_lowerrights - inter_upperlefts).clamp(min=0)
    # Shape of `inter_areas` and `union_areas`: (no. of boxes1, no. of boxes2)
    inter_areas = inters[:, :, 0] * inters[:, :, 1]
    union_areas = areas1[:, None] + areas2 - inter_areas
    return inter_areas / union_areas

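The code above defines a function that computes the IoU (intersection over union), the ratio of the overlapping area to the union area, between two lists of bounding boxes. Line by line:

box_iou takes two lists of anchor boxes or bounding boxes, in corner format (xmin, ymin, xmax, ymax), and computes the IoU of every pair.

box_area is a lambda function that computes the area of each box from its width and height.

The areas of the boxes in boxes1 and boxes2 are stored in areas1 and areas2.

The upper-left and lower-right corners of each pairwise intersection are stored in inter_upperlefts and inter_lowerrights; inserting a new axis with None makes the comparison broadcast over every pair. inters holds the width and height of each intersection, and clamp(min=0) sets negative values to 0 so that non-overlapping pairs get zero area.

The intersection area is the product of that width and height, and the union area is the sum of the two box areas minus the intersection area.

Finally, the intersection area is divided by the union area, and the resulting (no. of boxes1, no. of boxes2) matrix of IoU values is returned.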

14.4.3. Labeling Anchor Boxes in Training Data

In a training dataset, we consider each anchor box as a training example. In order to train an object detection model, we need class and offset labels for each anchor box, where the former is the class of the object relevant to the anchor box and the latter is the offset of the ground-truth bounding box relative to the anchor box. During the prediction, for each image we generate multiple anchor boxes, predict classes and offsets for all the anchor boxes, adjust their positions according to the predicted offsets to obtain the predicted bounding boxes, and finally only output those predicted bounding boxes that satisfy certain criteria.

As we know, an object detection training set comes with labels for locations of ground-truth bounding boxes and classes of their surrounded objects. To label any generated anchor box, we refer to the labeled location and class of its assigned ground-truth bounding box that is closest to the anchor box. In the following, we describe an algorithm for assigning closest ground-truth bounding boxes to anchor boxes.

14.4.3.1. Assigning Ground-Truth Bounding Boxes to Anchor Boxes

Given an image, suppose that the anchor boxes are A_1, A_2, …, A_{n_a} and the ground-truth bounding boxes are B_1, B_2, …, B_{n_b}, where n_a ≥ n_b. Let’s define a matrix X ∈ ℝ^{n_a×n_b}, whose element x_ij in the ith row and jth column is the IoU of the anchor box A_i and the ground-truth bounding box B_j. The algorithm consists of the following steps:

1. Find the largest element in matrix X and denote its row and column indices as i_1 and j_1, respectively. Then the ground-truth bounding box B_{j_1} is assigned to the anchor box A_{i_1}. This is quite intuitive because A_{i_1} and B_{j_1} are the closest among all the pairs of anchor boxes and ground-truth bounding boxes. After the first assignment, discard all the elements in the i_1th row and the j_1th column in matrix X.

2. Find the largest of the remaining elements in matrix X and denote its row and column indices as i_2 and j_2, respectively. We assign ground-truth bounding box B_{j_2} to anchor box A_{i_2} and discard all the elements in the i_2th row and the j_2th column in matrix X.

3. At this point, elements in two rows and two columns in matrix X have been discarded. We proceed until all elements in n_b columns in matrix X are discarded. At this time, we have assigned a ground-truth bounding box to each of n_b anchor boxes.

4. Only traverse through the remaining n_a − n_b anchor boxes. For example, given any anchor box A_i, find the ground-truth bounding box B_j with the largest IoU with A_i throughout the ith row of matrix X, and assign B_j to A_i only if this IoU is greater than a predefined threshold.

Let’s illustrate the above algorithm using a concrete example. As shown in Fig. 14.4.2 (left), assuming that the maximum value in matrix X is x23, we assign the ground-truth bounding box B3 to the anchor box A2. Then, we discard all the elements in row 2 and column 3 of the matrix, find the largest x71 in the remaining elements (shaded area), and assign the ground-truth bounding box B1 to the anchor box A7. Next, as shown in Fig. 14.4.2 (middle), discard all the elements in row 7 and column 1 of the matrix, find the largest x54 in the remaining elements (shaded area), and assign the ground-truth bounding box B4 to the anchor box A5. Finally, as shown in Fig. 14.4.2 (right), discard all the elements in row 5 and column 4 of the matrix, find the largest x92 in the remaining elements (shaded area), and assign the ground-truth bounding box B2 to the anchor box A9. After that, we only need to traverse through the remaining anchor boxes A1,A3,A4,A6,A8 and determine whether to assign them ground-truth bounding boxes according to the threshold.

Fig. 14.4.2  Assigning ground-truth bounding boxes to anchor boxes.
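The four steps can be sketched in plain Python on a small IoU matrix. This is an illustrative version only: the function name assign_ground_truth and the matrix values are made up, and it follows the steps in the order the text lists them, using a strict "greater than" comparison for the threshold as the text states.

```python
def assign_ground_truth(iou_matrix, threshold=0.5):
    """For each anchor (row), return the index of its assigned ground-truth
    box (column), or -1 if none is assigned."""
    n_a, n_b = len(iou_matrix), len(iou_matrix[0])
    assignment = [-1] * n_a
    free_rows, free_cols = set(range(n_a)), set(range(n_b))
    # Steps 1-3: repeatedly take the largest remaining element,
    # then discard its row and column.
    while free_rows and free_cols:
        i, j = max(((r, c) for r in free_rows for c in free_cols),
                   key=lambda rc: iou_matrix[rc[0]][rc[1]])
        assignment[i] = j
        free_rows.discard(i)
        free_cols.discard(j)
    # Step 4: each remaining anchor takes its best box only above the threshold.
    for i in free_rows:
        j = max(range(n_b), key=lambda col: iou_matrix[i][col])
        if iou_matrix[i][j] > threshold:
            assignment[i] = j
    return assignment

# Hypothetical IoU matrix: 4 anchor boxes (rows) x 2 ground-truth boxes (columns)
iou_matrix = [
    [0.1, 0.6],
    [0.9, 0.2],
    [0.3, 0.7],
    [0.2, 0.1],
]
print(assign_ground_truth(iou_matrix))  # [1, 0, 1, -1]
```

Here 0.9 is picked first (anchor 1 gets box 0), then 0.7 (anchor 2 gets box 1). Of the two remaining anchors, anchor 0 passes the threshold with IoU 0.6 and anchor 3 stays unassigned.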

This algorithm is implemented in the following assign_anchor_to_bbox function.

#@save
def assign_anchor_to_bbox(ground_truth, anchors, device, iou_threshold=0.5):
    """Assign closest ground-truth bounding boxes to anchor boxes."""
    num_anchors, num_gt_boxes = anchors.shape[0], ground_truth.shape[0]
    # Element x_ij in the i-th row and j-th column is the IoU of the anchor
    # box i and the ground-truth bounding box j
    jaccard = box_iou(anchors, ground_truth)
    # Initialize the tensor to hold the assigned ground-truth bounding box for
    # each anchor
    anchors_bbox_map = torch.full((num_anchors,), -1, dtype=torch.long,
                                  device=device)
    # Assign ground-truth bounding boxes according to the threshold
    max_ious, indices = torch.max(jaccard, dim=1)
    anc_i = torch.nonzero(max_ious >= iou_threshold).reshape(-1)
    box_j = indices[max_ious >= iou_threshold]
    anchors_bbox_map[anc_i] = box_j
    col_discard = torch.full((num_anchors,), -1)
    row_discard = torch.full((num_gt_boxes,), -1)
    for _ in range(num_gt_boxes):
        max_idx = torch.argmax(jaccard)  # Find the largest IoU
        box_idx = (max_idx % num_gt_boxes).long()
        # Integer (floor) division recovers the row index without float rounding
        anc_idx = torch.div(max_idx, num_gt_boxes, rounding_mode='floor')
        anchors_bbox_map[anc_idx] = box_idx
        jaccard[:, box_idx] = col_discard
        jaccard[anc_idx, :] = row_discard
    return anchors_bbox_map

The code above defines a function that assigns the closest ground-truth bounding box to each anchor box. Line by line:

def assign_anchor_to_bbox(ground_truth, anchors, device, iou_threshold=0.5):
    """Assign closest ground-truth bounding boxes to anchor boxes."""

assign_anchor_to_bbox computes the IoU between anchor boxes and ground-truth bounding boxes and assigns the closest ground-truth box to each anchor box.

num_anchors, num_gt_boxes = anchors.shape[0], ground_truth.shape[0]

num_anchors is the number of anchor boxes and num_gt_boxes is the number of ground-truth bounding boxes.

jaccard = box_iou(anchors, ground_truth)

box_iou computes the IoU between every anchor box and every ground-truth bounding box; the result is stored in jaccard.

anchors_bbox_map = torch.full((num_anchors,), -1, dtype=torch.long,
                              device=device)

anchors_bbox_map is a tensor that will hold, for each anchor box, the index of its assigned ground-truth bounding box. It is initialized to -1, meaning no box is assigned.

max_ious, indices = torch.max(jaccard, dim=1)
anc_i = torch.nonzero(max_ious >= iou_threshold).reshape(-1)
box_j = indices[max_ious >= iou_threshold]
anchors_bbox_map[anc_i] = box_j

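max_ious holds the largest IoU for each anchor box, and indices holds the index of the ground-truth box that attains it. For anchor boxes whose largest IoU is at least iou_threshold, that index is written into anchors_bbox_map. Note that the code applies this threshold step first; the loop that follows then overrides these entries where needed.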

col_discard = torch.full((num_anchors,), -1)
row_discard = torch.full((num_gt_boxes,), -1)

col_discard and row_discard are helper tensors filled with -1. They are used to overwrite a column or row of jaccard so that it is never selected again.

for _ in range(num_gt_boxes):
    max_idx = torch.argmax(jaccard)  # Find the largest IoU
    box_idx = (max_idx % num_gt_boxes).long()
    # Integer (floor) division recovers the row index without float rounding
    anc_idx = torch.div(max_idx, num_gt_boxes, rounding_mode='floor')
    anchors_bbox_map[anc_idx] = box_idx
    jaccard[:, box_idx] = col_discard
    jaccard[anc_idx, :] = row_discard

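The loop runs once per ground-truth bounding box. Each iteration finds the largest remaining IoU in jaccard, recovers the ground-truth index (column) and anchor index (row) from the flattened position, and records the assignment in anchors_bbox_map. The corresponding column and row of jaccard are then set to -1 via col_discard and row_discard so that neither is chosen again.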

return anchors_bbox_map

The function returns the assignment of ground-truth bounding boxes to anchor boxes.

14.4.3.2. Labeling Classes and Offsets

Now we can label the class and offset for each anchor box. Suppose that an anchor box A is assigned a ground-truth bounding box B. On the one hand, the class of the anchor box A will be labeled as that of B. On the other hand, the offset of the anchor box A will be labeled according to the relative position between the central coordinates of B and A together with the relative size between these two boxes. Given varying positions and sizes of different boxes in the dataset, we can apply transformations to those relative positions and sizes that may lead to more uniformly distributed offsets that are easier to fit. Here we describe a common transformation. Let the central coordinates of A and B be (x_a, y_a) and (x_b, y_b), their widths be w_a and w_b, and their heights be h_a and h_b, respectively. We may label the offset of A as

$$\left( \frac{\frac{x_b - x_a}{w_a} - \mu_x}{\sigma_x}, \frac{\frac{y_b - y_a}{h_a} - \mu_y}{\sigma_y}, \frac{\log \frac{w_b}{w_a} - \mu_w}{\sigma_w}, \frac{\log \frac{h_b}{h_a} - \mu_h}{\sigma_h} \right),$$

where the constants take the values μ_x = μ_y = μ_w = μ_h = 0, σ_x = σ_y = 0.1, and σ_w = σ_h = 0.2 in the offset_boxes function below, which is why it multiplies by 10 and by 5.

#@save
def offset_boxes(anchors, assigned_bb, eps=1e-6):
    """Transform for anchor box offsets."""
    c_anc = d2l.box_corner_to_center(anchors)
    c_assigned_bb = d2l.box_corner_to_center(assigned_bb)
    offset_xy = 10 * (c_assigned_bb[:, :2] - c_anc[:, :2]) / c_anc[:, 2:]
    offset_wh = 5 * torch.log(eps + c_assigned_bb[:, 2:] / c_anc[:, 2:])
    offset = torch.cat([offset_xy, offset_wh], axis=1)
    return offset

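The code above defines a function that computes the offset transformation between anchor boxes and the ground-truth bounding boxes assigned to them. Line by line:

offset_boxes takes the anchor boxes and their assigned ground-truth bounding boxes and returns the offset labels.

d2l.box_corner_to_center converts both the anchor boxes and the assigned bounding boxes from corner format (upper-left and lower-right corners) to center format (center coordinates, width, height).

offset_xy is the offset of the center coordinates: 10 * (assigned box center − anchor center) / (anchor width and height). offset_wh is the offset of the width and height: 5 * log(eps + assigned box width and height / anchor width and height). Here eps is a small value that keeps the argument of the logarithm away from zero.

offset_xy and offset_wh are concatenated along dimension 1 into a single tensor, offset.

The offsets are returned. They describe how each assigned ground-truth bounding box is shifted and scaled relative to its anchor box.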
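A worked example for a single pair of boxes may make the transformation easier to follow. The sketch below is plain Python and uses the constants implied by the factors 10 and 5 (standard deviations 0.1 and 0.2, zero means). The helper names encode_offset and decode_offset and the example boxes are assumptions for illustration; the eps term of offset_boxes is omitted here because the example widths and heights are strictly positive.

```python
import math

SIGMA_XY, SIGMA_WH = 0.1, 0.2  # 1 / 10 and 1 / 5; all means are taken to be 0

def corner_to_center(box):
    """(xmin, ymin, xmax, ymax) -> (center x, center y, width, height)."""
    xmin, ymin, xmax, ymax = box
    return (xmin + xmax) / 2, (ymin + ymax) / 2, xmax - xmin, ymax - ymin

def encode_offset(anchor, gt):
    """Offset label of an anchor box relative to its assigned ground-truth box."""
    xa, ya, wa, ha = corner_to_center(anchor)
    xb, yb, wb, hb = corner_to_center(gt)
    return ((xb - xa) / wa / SIGMA_XY,
            (yb - ya) / ha / SIGMA_XY,
            math.log(wb / wa) / SIGMA_WH,
            math.log(hb / ha) / SIGMA_WH)

def decode_offset(anchor, offset):
    """Invert encode_offset: recover the box in corner format from an offset."""
    xa, ya, wa, ha = corner_to_center(anchor)
    dx, dy, dw, dh = offset
    xb, yb = dx * SIGMA_XY * wa + xa, dy * SIGMA_XY * ha + ya
    wb, hb = math.exp(dw * SIGMA_WH) * wa, math.exp(dh * SIGMA_WH) * ha
    return xb - wb / 2, yb - hb / 2, xb + wb / 2, yb + hb / 2

anchor, gt = (0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 5.0, 3.0)
offset = encode_offset(anchor, gt)
print([round(v, 3) for v in offset])  # approximately [10.0, 5.0, 3.466, 0.0]
print([round(v, 6) for v in decode_offset(anchor, offset)])  # recovers gt up to rounding
```

The ground-truth center is one anchor width to the right and half an anchor height down, and the box is twice as wide, which gives roughly 10, 5, and log(2)/0.2 for the first three entries. Decoding the offset against the same anchor returns the ground-truth box, which is how predicted offsets are turned back into boxes at prediction time.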

If an anchor box is not assigned a ground-truth bounding box, we simply label the class of the anchor box as “background”. Anchor boxes whose classes are background are often referred to as negative anchor boxes, and the rest are called positive anchor boxes. We implement the following multibox_target function to label classes and offsets for anchor boxes (the anchors argument) using ground-truth bounding boxes (the labels argument). This function sets the background class to zero and increments the integer index of a new class by one.
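The class-index convention can be illustrated with a small plain-Python sketch. The helper names and the example assignment are hypothetical; the input is a list giving, for each anchor box, the index of its assigned ground-truth box or -1 for none.

```python
def class_labels_for(assignment, gt_classes):
    """Background anchors get 0; an anchor assigned to a box of class c gets c + 1."""
    return [0 if j < 0 else gt_classes[j] + 1 for j in assignment]

def offset_mask_for(assignment):
    """One mask value per offset coordinate: 1.0 for positive anchors, 0.0 for
    background anchors, repeated 4 times per anchor and flattened."""
    return [1.0 if j >= 0 else 0.0 for j in assignment for _ in range(4)]

# Hypothetical example: 4 anchors, 2 ground-truth boxes with classes 0 and 1
assignment = [1, 0, 1, -1]
class_labels = class_labels_for(assignment, gt_classes=[0, 1])
mask = offset_mask_for(assignment)
print(class_labels)  # [2, 1, 2, 0]
print(mask[-4:])     # [0.0, 0.0, 0.0, 0.0] for the background anchor
```

Shifting every real class by one frees index 0 for background, and the mask lets a loss function ignore the offsets of negative anchor boxes.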

#@save
def multibox_target(anchors, labels):
    """Label anchor boxes using ground-truth bounding boxes."""
    batch_size, anchors = labels.shape[0], anchors.squeeze(0)
    batch_offset, batch_mask, batch_class_labels = [], [], []
    device, num_anchors = anchors.device, anchors.shape[0]
    for i in range(batch_size):
        label = labels[i, :, :]
        anchors_bbox_map = assign_anchor_to_bbox(
            label[:, 1:], anchors, device)
        bbox_mask = ((anchors_bbox_map >= 0).float().unsqueeze(-1)).repeat(
            1, 4)
        # Initialize class labels and assigned bounding box coordinates with
        # zeros
        class_labels = torch.zeros(num_anchors, dtype=torch.long,
                                   device=device)
        assigned_bb = torch.zeros((num_anchors, 4), dtype=torch.float32,
                                  device=device)
        # Label classes of anchor boxes using their assigned ground-truth
        # bounding boxes. If an anchor box is not assigned any, we label its
        # class as background (the value remains zero)
        indices_true = torch.nonzero(anchors_bbox_map >= 0)
        bb_idx = anchors_bbox_map[indices_true]
        class_labels[indices_true] = label[bb_idx, 0].long() + 1
        assigned_bb[indices_true] = label[bb_idx, 1:]
        # Offset transformation
        offset = offset_boxes(anchors, assigned_bb) * bbox_mask
        batch_offset.append(offset.reshape(-1))
        batch_mask.append(bbox_mask.reshape(-1))
        batch_class_labels.append(class_labels)
    bbox_offset = torch.stack(batch_offset)
    bbox_mask = torch.stack(batch_mask)
    class_labels = torch.stack(batch_class_labels)
    return (bbox_offset, bbox_mask, class_labels)

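The code above defines multibox_target, which labels anchor boxes using ground-truth bounding boxes. Step by step:

batch_size is the number of examples in the batch, read from labels.shape[0]. anchors.squeeze(0) drops the leading batch dimension of the anchors tensor, so anchors has shape (number of anchor boxes, 4).

The empty lists batch_offset, batch_mask, and batch_class_labels collect the per-example results. device and num_anchors are read from the anchors tensor.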

for i in range(batch_size):
    label = labels[i, :, :]
    anchors_bbox_map = assign_anchor_to_bbox(label[:, 1:], anchors, device)

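For each example in the batch, we take its ground-truth boxes (label[:, 0] holds the class and label[:, 1:] the coordinates) and call assign_anchor_to_bbox, which assigns ground-truth boxes to anchor boxes based on IoU. The result anchors_bbox_map holds, for each anchor box, the index of its assigned ground-truth box, or a negative value if none is assigned.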

bbox_mask = ((anchors_bbox_map >= 0).float().unsqueeze(-1)).repeat(1, 4)

This builds the mask of shape (number of anchor boxes, 4). A row is all ones if the anchor box was assigned a ground-truth box and all zeros otherwise.

class_labels = torch.zeros(num_anchors, dtype=torch.long, device=device)
assigned_bb = torch.zeros((num_anchors, 4), dtype=torch.float32, device=device)

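class_labels and assigned_bb are initialized with zeros, so an anchor box that receives no assignment keeps class 0 (background) and an all-zero box.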

indices_true = torch.nonzero(anchors_bbox_map >= 0)
bb_idx = anchors_bbox_map[indices_true]
class_labels[indices_true] = label[bb_idx, 0].long() + 1
assigned_bb[indices_true] = label[bb_idx, 1:]

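For anchor boxes with an assignment, the class and coordinates of the assigned ground-truth box are written into class_labels and assigned_bb. The class index is incremented by one so that 0 stays reserved for background and object classes start from 1.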

offset = offset_boxes(anchors, assigned_bb) * bbox_mask
batch_offset.append(offset.reshape(-1))
batch_mask.append(bbox_mask.reshape(-1))
batch_class_labels.append(class_labels)

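offset_boxes computes the offset between every anchor box and its assigned box, and multiplying by bbox_mask zeroes the offsets of background anchor boxes. The flattened offsets, the flattened mask, and the class labels are appended to batch_offset, batch_mask, and batch_class_labels.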

bbox_offset = torch.stack(batch_offset)
bbox_mask = torch.stack(batch_mask)
class_labels = torch.stack(batch_class_labels)

The per-example results are stacked into batch-level tensors: offsets and mask of shape (batch size, 4 × number of anchor boxes), and class labels of shape (batch size, number of anchor boxes).

return (bbox_offset, bbox_mask, class_labels)

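The function returns the offsets, the mask, and the class labels as a tuple. During training, these provide the offset targets, the mask that excludes background anchor boxes from the offset loss, and the class targets.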
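
The class and mask part of this labeling can be sketched in plain Python. This is a simplified illustration, not the D2L implementation: the function name label_anchors is hypothetical, and the assignment list is an assumed result of assign_anchor_to_bbox in which -1 marks an unassigned anchor box.

```python
def label_anchors(anchors_bbox_map, gt_classes):
    """Return class labels (0 = background) and a flat 4-per-anchor mask."""
    class_labels, mask = [], []
    for gt_idx in anchors_bbox_map:
        if gt_idx >= 0:
            # Shift the class by one so that 0 stays reserved for background
            class_labels.append(gt_classes[gt_idx] + 1)
            mask.extend([1.0] * 4)
        else:
            class_labels.append(0)
            mask.extend([0.0] * 4)
    return class_labels, mask

anchors_bbox_map = [-1, 0, 1, -1, 1]  # assumed assignment for five anchor boxes
gt_classes = [0, 1]                   # class of each ground-truth box
class_labels, mask = label_anchors(anchors_bbox_map, gt_classes)
print(class_labels)  # [0, 1, 2, 0, 2]
print(mask)
```

Each anchor box contributes four mask entries because it has four offset values.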

14.4.3.3. An Example

Let's illustrate anchor box labeling via a concrete example. We define ground-truth bounding boxes for the dog and cat in the loaded image, where the first element is the class (0 for dog and 1 for cat) and the remaining four elements are the (x, y)-axis coordinates at the upper-left corner and the lower-right corner (range is between 0 and 1). We also construct five anchor boxes to be labeled using the coordinates of the upper-left corner and the lower-right corner: A0, …, A4 (the index starts from 0). Then we plot these ground-truth bounding boxes and anchor boxes in the image.

ground_truth = torch.tensor([[0, 0.1, 0.08, 0.52, 0.92],
                             [1, 0.55, 0.2, 0.9, 0.88]])
anchors = torch.tensor([[0, 0.1, 0.2, 0.3], [0.15, 0.2, 0.4, 0.4],
                        [0.63, 0.05, 0.88, 0.98], [0.66, 0.45, 0.8, 0.8],
                        [0.57, 0.3, 0.92, 0.9]])

fig = d2l.plt.imshow(img)
show_bboxes(fig.axes, ground_truth[:, 1:] * bbox_scale, ['dog', 'cat'], 'k')
show_bboxes(fig.axes, anchors * bbox_scale, ['0', '1', '2', '3', '4']);

Using the multibox_target function defined above, we can label classes and offsets of these anchor boxes based on the ground-truth bounding boxes for the dog and cat. In this example, indices of the background, dog, and cat classes are 0, 1, and 2, respectively. Below we add a dimension for examples of anchor boxes and ground-truth bounding boxes.

labels = multibox_target(anchors.unsqueeze(dim=0),
                         ground_truth.unsqueeze(dim=0))

The code above calls multibox_target to label the anchor boxes.

anchors.unsqueeze(dim=0) adds a leading batch dimension to anchors, the tensor of anchor box coordinates, turning its shape from (5, 4) into (1, 5, 4).
ground_truth.unsqueeze(dim=0) does the same for the ground-truth tensor, turning its shape from (2, 5) into (1, 2, 5). Both calls are needed because multibox_target expects batched inputs.
With both inputs batched, multibox_target computes and returns the labels for this single-example batch.

There are three items in the returned result, all of which are in the tensor format. The third item contains the labeled classes of the input anchor boxes.

Let's analyze the returned class labels below based on anchor box and ground-truth bounding box positions in the image. First, among all the pairs of anchor boxes and ground-truth bounding boxes, the IoU of the anchor box A4 and the ground-truth bounding box of the cat is the largest. Thus, the class of A4 is labeled as the cat. Taking out pairs containing A4 or the ground-truth bounding box of the cat, among the rest the pair of the anchor box A1 and the ground-truth bounding box of the dog has the largest IoU. So the class of A1 is labeled as the dog. Next, we need to traverse through the remaining three unlabeled anchor boxes: A0, A2, and A3. For A0, the class of the ground-truth bounding box with the largest IoU is the dog, but the IoU is below the predefined threshold (0.5), so the class is labeled as background; for A2, the class of the ground-truth bounding box with the largest IoU is the cat and the IoU exceeds the threshold, so the class is labeled as the cat; for A3, the class of the ground-truth bounding box with the largest IoU is the cat, but the value is below the threshold, so the class is labeled as background.

labels[2]

The code above selects the item at index 2 of labels.

labels is the tuple returned by multibox_target, not a single tensor, so the index does not refer to an example in the batch.
labels[2] is the third item of that tuple: the class labels, a tensor of shape (batch size, number of anchor boxes).

Each entry is the class assigned to one anchor box: 0 for background, 1 for dog, and 2 for cat.

tensor([[0, 1, 2, 0, 2]])
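
The IoU values behind this analysis can be checked with a short plain-Python sketch. The iou helper below is a hypothetical stand-in for box_iou that handles one pair of boxes at a time; the coordinates are the ones defined in the example above.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

dog = [0.1, 0.08, 0.52, 0.92]
cat = [0.55, 0.2, 0.9, 0.88]
anchors = [[0, 0.1, 0.2, 0.3], [0.15, 0.2, 0.4, 0.4],
           [0.63, 0.05, 0.88, 0.98], [0.66, 0.45, 0.8, 0.8],
           [0.57, 0.3, 0.92, 0.9]]

# One row per anchor box: [IoU with dog, IoU with cat]
table = [[iou(a, dog), iou(a, cat)] for a in anchors]
for i, (d, c) in enumerate(table):
    print(f'A{i}: dog={d:.2f}, cat={c:.2f}')
```

Rounded to two decimals, the cat column is 0.57 for A2, 0.21 for A3, and 0.75 for A4, and the dog column is 0.05 for A0 and 0.14 for A1; all other entries are zero. Only A2 and A4 clear the 0.5 threshold, while A1 is labeled as the dog because it is the best remaining match for that ground-truth box.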

The second returned item is a mask variable of the shape (batch size, four times the number of anchor boxes). Every four elements in the mask variable correspond to the four offset values of each anchor box. Since we do not care about background detection, offsets of this negative class should not affect the objective function. Through elementwise multiplications, zeros in the mask variable will filter out negative class offsets before calculating the objective function.

labels[1]

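The code above selects the item at index 1 of labels.

labels is the tuple returned by multibox_target.
labels[1] is the second item of that tuple: the mask, a tensor of shape (batch size, 4 × number of anchor boxes). It is not the offsets.

Each group of four entries belongs to one anchor box and is 1 if the anchor box was assigned a ground-truth box and 0 if it is background. In the output below, the groups for A0 and A3 are zeros, matching their background labels.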

tensor([[0., 0., 0., 0., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 1., 1.,
         1., 1.]])

The first returned item contains the four offset values labeled for each anchor box. Note that the offsets of negative-class anchor boxes are labeled as zeros.

labels[0]

The code above selects the item at index 0 of labels.

labels is the tuple returned by multibox_target.
labels[0] is the first item of that tuple: the offsets, a tensor of shape (batch size, 4 × number of anchor boxes).

Each group of four values is the labeled offset between one anchor box and its assigned ground-truth box, as computed by offset_boxes. The groups for the background anchor boxes A0 and A3 are zero (printed as -0.00e+00) because they were multiplied by the mask.

tensor([[-0.00e+00, -0.00e+00, -0.00e+00, -0.00e+00,  1.40e+00,  1.00e+01,
          2.59e+00,  7.18e+00, -1.20e+00,  2.69e-01,  1.68e+00, -1.57e+00,
         -0.00e+00, -0.00e+00, -0.00e+00, -0.00e+00, -5.71e-01, -1.00e+00,
          4.17e-06,  6.26e-01]])

14.4.4. Predicting Bounding Boxes with Non-Maximum Suppression

During prediction, we generate multiple anchor boxes for the image and predict classes and offsets for each of them. A predicted bounding box is thus obtained according to an anchor box with its predicted offset. Below we implement the offset_inverse function that takes in anchors and offset predictions as inputs and applies inverse offset transformations to return the predicted bounding box coordinates.

#@save
def offset_inverse(anchors, offset_preds):
    """Predict bounding boxes based on anchor boxes with predicted offsets."""
    anc = d2l.box_corner_to_center(anchors)
    pred_bbox_xy = (offset_preds[:, :2] * anc[:, 2:] / 10) + anc[:, :2]
    pred_bbox_wh = torch.exp(offset_preds[:, 2:] / 5) * anc[:, 2:]
    pred_bbox = torch.cat((pred_bbox_xy, pred_bbox_wh), axis=1)
    predicted_bbox = d2l.box_center_to_corner(pred_bbox)
    return predicted_bbox

The code above defines offset_inverse, which computes predicted bounding boxes from anchor boxes and predicted offsets.

anchors: a tensor of anchor boxes in (upper-left, lower-right) corner format.
offset_preds: a tensor of predicted offsets, four values per anchor box.

The function works as follows:

d2l.box_corner_to_center(anchors): converts the anchor boxes from corner format to (center x, center y, width, height).
offset_preds[:, :2] * anc[:, 2:] / 10 + anc[:, :2]: scales the predicted x and y offsets by the anchor width and height, divides by 10, and adds the anchor center, giving the predicted center.
torch.exp(offset_preds[:, 2:] / 5) * anc[:, 2:]: divides the predicted width and height offsets by 5, exponentiates them, and multiplies by the anchor width and height, giving the predicted width and height.
torch.cat((pred_bbox_xy, pred_bbox_wh), axis=1): concatenates the center and the size into one tensor in center format.
d2l.box_center_to_corner(pred_bbox): converts the result back to corner format.

The returned predicted_bbox holds the upper-left and lower-right coordinates of each predicted bounding box. Each step undoes the corresponding step of offset_boxes.
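
As a sanity check that the two transformations are inverses, here is a minimal round-trip sketch in plain Python. The functions encode and decode are hypothetical scalar counterparts of offset_boxes and offset_inverse that work directly on boxes in (center x, center y, width, height) format, and the two boxes are illustrative values.

```python
import math

def encode(anchor, box, eps=1e-6):
    """Offset of `box` relative to `anchor` (both in center format)."""
    xa, ya, wa, ha = anchor
    xb, yb, wb, hb = box
    return (10 * (xb - xa) / wa, 10 * (yb - ya) / ha,
            5 * math.log(eps + wb / wa), 5 * math.log(eps + hb / ha))

def decode(anchor, offset):
    """Recover a box (center format) from an anchor and an offset."""
    xa, ya, wa, ha = anchor
    ox, oy, ow, oh = offset
    return (ox * wa / 10 + xa, oy * ha / 10 + ya,
            math.exp(ow / 5) * wa, math.exp(oh / 5) * ha)

anchor = (0.745, 0.6, 0.35, 0.6)  # illustrative anchor box
box = (0.725, 0.54, 0.35, 0.68)   # illustrative ground-truth box
recovered = decode(anchor, encode(anchor, box))
print([round(v, 4) for v in recovered])
```

The recovered box matches the original up to a tiny error introduced by eps, which encode adds inside the logarithm and decode does not subtract.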

When there are many anchor boxes, many similar (with significant overlap) predicted bounding boxes can be potentially output for surrounding the same object. To simplify the output, we can merge similar predicted bounding boxes that belong to the same object by using non-maximum suppression (NMS).

Here is how non-maximum suppression works. For a predicted bounding box B, the object detection model calculates the predicted likelihood for each class. Denoting by p the largest predicted likelihood, the class corresponding to this probability is the predicted class for B. Specifically, we refer to p as the confidence (score) of the predicted bounding box B. On the same image, all the predicted non-background bounding boxes are sorted by confidence in descending order to generate a list L. Then we manipulate the sorted list L in the following steps:

1. Select the predicted bounding box B1 with the highest confidence from L as a basis and remove all non-basis predicted bounding boxes whose IoU with B1 exceeds a predefined threshold ϵ from L. At this point, L keeps the predicted bounding box with the highest confidence but drops others that are too similar to it. In a nutshell, those with non-maximum confidence scores are suppressed.
2. Select the predicted bounding box B2 with the second highest confidence from L as another basis and remove all non-basis predicted bounding boxes whose IoU with B2 exceeds ϵ from L.
3. Repeat the above process until all the predicted bounding boxes in L have been used as a basis. At this time, the IoU of any pair of predicted bounding boxes in L does not exceed the threshold ϵ; thus, no pair is too similar with each other.
4. Output all the predicted bounding boxes in the list L.

The following nms function sorts the predicted bounding boxes by confidence in descending order, applies the steps above, and returns the indices of the boxes that are kept.

#@save
def nms(boxes, scores, iou_threshold):
    """Sort confidence scores of predicted bounding boxes."""
    B = torch.argsort(scores, dim=-1, descending=True)
    keep = []  # Indices of predicted bounding boxes that will be kept
    while B.numel() > 0:
        i = B[0]
        keep.append(i)
        if B.numel() == 1: break
        iou = box_iou(boxes[i, :].reshape(-1, 4),
                      boxes[B[1:], :].reshape(-1, 4)).reshape(-1)
        inds = torch.nonzero(iou <= iou_threshold).reshape(-1)
        B = B[inds + 1]
    return torch.tensor(keep, device=boxes.device)

The code above defines nms, which uses non-maximum suppression (NMS) to remove redundant predicted bounding boxes.

boxes: a tensor of predicted bounding box coordinates.
scores: a tensor of confidence scores, one per predicted bounding box.
iou_threshold: the IoU threshold; a box whose IoU with a kept box exceeds this value is treated as a duplicate.

The function works as follows:

torch.argsort(scores, dim=-1, descending=True): returns the box indices sorted by confidence in descending order.
B[0]: the index of the remaining box with the highest confidence.
keep.append(i): adds that index to the keep list.
iou = box_iou(boxes[i, :].reshape(-1, 4), boxes[B[1:], :].reshape(-1, 4)).reshape(-1): computes the IoU between the selected box and every other box still in B.
inds = torch.nonzero(iou <= iou_threshold).reshape(-1): finds the positions of the boxes whose IoU is at most the threshold.
B = B[inds + 1]: keeps only those boxes in B. The + 1 is needed because inds indexes B[1:], which is shifted by one position relative to B.
The loop repeats until B is empty, each time moving the most confident remaining box into keep.
return torch.tensor(keep, device=boxes.device): returns the kept indices as a tensor.

The returned indices identify the predicted bounding boxes that survive NMS.
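
The index shift in B = B[inds + 1] is easy to get wrong, so here is one loop iteration traced with plain Python lists. The sorted indices and IoU values are made-up numbers for illustration only.

```python
B = [0, 3, 1, 2]                    # box indices sorted by confidence (illustrative)
iou_with_first = [0.0, 0.74, 0.55]  # IoU of box B[0] with each box in B[1:] (illustrative)
iou_threshold = 0.5

# Positions within B[1:] whose IoU does not exceed the threshold
inds = [k for k, v in enumerate(iou_with_first) if v <= iou_threshold]

# Position k in B[1:] is position k + 1 in B, hence the shift by one
B_next = [B[k + 1] for k in inds]
print(inds, B_next)  # [0] [3]
```

Without the shift, position 0 would point back at B[0], the box that was just moved into keep.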

We define the following multibox_detection to apply non-maximum suppression to predicting bounding boxes. Do not worry if you find the implementation a bit complicated: we will show how it works with a concrete example right after the implementation.

#@save
def multibox_detection(cls_probs, offset_preds, anchors, nms_threshold=0.5,
                       pos_threshold=0.009999999):
    """Predict bounding boxes using non-maximum suppression."""
    device, batch_size = cls_probs.device, cls_probs.shape[0]
    anchors = anchors.squeeze(0)
    num_classes, num_anchors = cls_probs.shape[1], cls_probs.shape[2]
    out = []
    for i in range(batch_size):
        cls_prob, offset_pred = cls_probs[i], offset_preds[i].reshape(-1, 4)
        conf, class_id = torch.max(cls_prob[1:], 0)
        predicted_bb = offset_inverse(anchors, offset_pred)
        keep = nms(predicted_bb, conf, nms_threshold)
        # Find all non-`keep` indices and set the class to background
        all_idx = torch.arange(num_anchors, dtype=torch.long, device=device)
        combined = torch.cat((keep, all_idx))
        uniques, counts = combined.unique(return_counts=True)
        non_keep = uniques[counts == 1]
        all_id_sorted = torch.cat((keep, non_keep))
        class_id[non_keep] = -1
        class_id = class_id[all_id_sorted]
        conf, predicted_bb = conf[all_id_sorted], predicted_bb[all_id_sorted]
        # Here `pos_threshold` is a threshold for positive (non-background)
        # predictions
        below_min_idx = (conf < pos_threshold)
        class_id[below_min_idx] = -1
        conf[below_min_idx] = 1 - conf[below_min_idx]
        pred_info = torch.cat((class_id.unsqueeze(1),
                               conf.unsqueeze(1),
                               predicted_bb), dim=1)
        out.append(pred_info)
    return torch.stack(out)

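The code above defines multibox_detection, which applies non-maximum suppression (NMS) to produce the final predictions.

The function works as follows:

device, batch_size = cls_probs.device, cls_probs.shape[0]: reads the device and the batch size.
anchors = anchors.squeeze(0): drops the leading batch dimension of anchors.
cls_probs[i], offset_preds[i].reshape(-1, 4): takes the class probabilities of the i-th example and reshapes its predicted offsets to four values per anchor box.
conf, class_id = torch.max(cls_prob[1:], 0): for each anchor box, takes the largest probability among the non-background classes (row 0, the background, is skipped) together with the index of that class. Because of the skipped row, class_id 0 is the first object class.
offset_inverse(anchors, offset_pred): computes the predicted bounding boxes from the anchors and the predicted offsets.
keep = nms(predicted_bb, conf, nms_threshold): returns the indices of the boxes kept by NMS.
keep is concatenated with all anchor indices and unique(return_counts=True) is applied. An index that appears only once is not in keep.
Those non-kept indices are set to class -1, and class_id, conf, and predicted_bb are reordered so that the kept boxes come first.
Boxes whose confidence is below pos_threshold are also set to class -1, and their confidence is replaced by 1 - conf.
The class, confidence, and box coordinates are concatenated into one row of six values per anchor box, and the per-example results are stacked and returned.

In short, the function removes redundant predicted bounding boxes with NMS and returns the predicted class, confidence, and coordinates of every box, with suppressed and background boxes marked by class -1.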
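
The unique-with-counts step that finds the suppressed indices can be mirrored in plain Python with collections.Counter. This is a sketch of the idea rather than the tensor code; the keep list is an assumed NMS result for four anchor boxes.

```python
from collections import Counter

keep = [0, 3]             # assumed indices kept by NMS
all_idx = list(range(4))  # indices of all four anchor boxes

# A kept index appears twice in the concatenation, a suppressed one only once
counts = Counter(keep + all_idx)
non_keep = sorted(i for i, c in counts.items() if c == 1)

# Kept boxes first, then suppressed boxes, as in `all_id_sorted`
all_id_sorted = keep + non_keep
print(non_keep)       # [1, 2]
print(all_id_sorted)  # [0, 3, 1, 2]
```

The explicit sorted call is there because the sketch assumes, as torch's unique does by default, that the unique values come back in ascending order.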

Now let's apply the above implementations to a concrete example with four anchor boxes. For simplicity, we assume that the predicted offsets are all zeros. This means that the predicted bounding boxes are anchor boxes. For each class among the background, dog, and cat, we also define its predicted likelihood.

anchors = torch.tensor([[0.1, 0.08, 0.52, 0.92], [0.08, 0.2, 0.56, 0.95],
                        [0.15, 0.3, 0.62, 0.91], [0.55, 0.2, 0.9, 0.88]])
offset_preds = torch.tensor([0] * anchors.numel())
cls_probs = torch.tensor([[0] * 4,  # Predicted background likelihood
                          [0.9, 0.8, 0.7, 0.1],  # Predicted dog likelihood
                          [0.1, 0.2, 0.3, 0.9]])  # Predicted cat likelihood

The code above only defines the example inputs; multibox_detection itself is called further below.

anchors: four anchor boxes, one per row, given by their upper-left and lower-right corner coordinates.
offset_preds: the predicted offsets, four per anchor box. All values are zero here, so each predicted bounding box coincides with its anchor box.
cls_probs: the predicted class likelihoods. Each row is a class and each column is an anchor box: the first row is the background (class 0), and the other two rows are the dog and the cat.

Given these anchors, offsets, and class likelihoods, multibox_detection removes redundant predicted bounding boxes with NMS and returns the final predictions.
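
To see what confidence and class each anchor box receives, the step conf, class_id = torch.max(cls_prob[1:], 0) can be sketched in plain Python. The nested list below copies the values of cls_probs; the loop is only an illustration of the column-wise maximum, not the code used by multibox_detection.

```python
cls_probs = [[0, 0, 0, 0],          # background
             [0.9, 0.8, 0.7, 0.1],  # dog
             [0.1, 0.2, 0.3, 0.9]]  # cat

foreground = cls_probs[1:]  # skip the background row
conf, class_id = [], []
for column in zip(*foreground):  # one column per anchor box
    best = max(range(len(column)), key=lambda k: column[k])
    class_id.append(best)        # 0 = dog, 1 = cat
    conf.append(column[best])

print(conf)      # [0.9, 0.8, 0.7, 0.9]
print(class_id)  # [0, 0, 0, 1]
```

The first three anchor boxes are predicted as dog and the last one as cat, which is where the labels dog=0.9, dog=0.8, dog=0.7, and cat=0.9 come from.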

We can plot these predicted bounding boxes with their confidence on the image.

fig = d2l.plt.imshow(img)
show_bboxes(fig.axes, anchors * bbox_scale,
            ['dog=0.9', 'dog=0.8', 'dog=0.7', 'cat=0.9'])

The code above draws the anchor boxes on the image together with their predicted class and confidence.

fig = d2l.plt.imshow(img): displays the input image.
show_bboxes(fig.axes, anchors * bbox_scale, ['dog=0.9', 'dog=0.8', 'dog=0.7', 'cat=0.9']): draws each anchor box as a rectangle and labels it with the given string. anchors holds the box coordinates in the range between 0 and 1, and multiplying by bbox_scale, defined earlier, rescales them to the pixel coordinates of the image. Each string gives the predicted class and its confidence.

Now we can invoke the multibox_detection function to perform non-maximum suppression, where the threshold is set to 0.5. Note that we add a dimension for examples in the tensor input.

We can see that the shape of the returned result is (batch size, number of anchor boxes, 6). The six elements in the innermost dimension give the output information for the same predicted bounding box. The first element is the predicted class index, which starts from 0 (0 is dog and 1 is cat). The value -1 indicates background or removal in non-maximum suppression. The second element is the confidence of the predicted bounding box. The remaining four elements are the (x, y)-axis coordinates of the upper-left corner and the lower-right corner of the predicted bounding box, respectively (range is between 0 and 1).

output = multibox_detection(cls_probs.unsqueeze(dim=0),
                            offset_preds.unsqueeze(dim=0),
                            anchors.unsqueeze(dim=0),
                            nms_threshold=0.5)
output

The code above calls multibox_detection on the example inputs.

cls_probs.unsqueeze(dim=0): adds a batch dimension to the class likelihood tensor.
offset_preds.unsqueeze(dim=0): adds a batch dimension to the predicted offset tensor.
anchors.unsqueeze(dim=0): adds a batch dimension to the anchor box coordinate tensor.
nms_threshold=0.5: sets the IoU threshold used by NMS. Among boxes that overlap more than this threshold, only the one with the highest confidence is kept.

The returned tensor holds, for every anchor box, the predicted class index, the confidence, and the box coordinates.

tensor([[[ 0.00,  0.90,  0.10,  0.08,  0.52,  0.92],
         [ 1.00,  0.90,  0.55,  0.20,  0.90,  0.88],
         [-1.00,  0.80,  0.08,  0.20,  0.56,  0.95],
         [-1.00,  0.70,  0.15,  0.30,  0.62,  0.91]]])

After removing those predicted bounding boxes of class -1, we can output the final predicted bounding boxes kept by non-maximum suppression.

fig = d2l.plt.imshow(img)
for i in output[0].detach().numpy():
    if i[0] == -1:
        continue
    label = ('dog=', 'cat=')[int(i[0])] + str(i[1])
    show_bboxes(fig.axes, [torch.tensor(i[2:]) * bbox_scale], label)

The code above visualizes the result of multibox_detection.

fig = d2l.plt.imshow(img): displays the input image.
for i in output[0].detach().numpy(): iterates over the rows of the first (and only) example in output, one row per predicted bounding box.
if i[0] == -1: skips the row if its class is -1, meaning background or suppressed by NMS.
label = ('dog=', 'cat=')[int(i[0])] + str(i[1]): builds the label from the class index (0 gives 'dog=', 1 gives 'cat=') followed by the confidence as a string.
show_bboxes(fig.axes, [torch.tensor(i[2:]) * bbox_scale], label): draws the predicted box. i[2:] holds the box coordinates, which are multiplied by bbox_scale to match the pixel size of the image, and label shows the class and confidence.

In short, the code draws only the boxes kept by NMS and labels each with its class and confidence.

In practice, we can remove predicted bounding boxes with lower confidence even before performing non-maximum suppression, thereby reducing computation in this algorithm. We may also post-process the output of non-maximum suppression, for example, by only keeping results with higher confidence in the final output.
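
Such post-processing can be sketched in a few lines of plain Python. The rows below are copied from the output printed above; the confidence cutoff min_conf is a hypothetical value chosen for illustration.

```python
# Rows of (class index, confidence, x1, y1, x2, y2), copied from `output` above
rows = [[0.0, 0.90, 0.10, 0.08, 0.52, 0.92],
        [1.0, 0.90, 0.55, 0.20, 0.90, 0.88],
        [-1.0, 0.80, 0.08, 0.20, 0.56, 0.95],
        [-1.0, 0.70, 0.15, 0.30, 0.62, 0.91]]

min_conf = 0.5  # hypothetical cutoff for the final output

# Drop background or suppressed rows, then drop low-confidence rows
detections = [r for r in rows if r[0] != -1 and r[1] >= min_conf]
for class_idx, conf, *box in detections:
    print(('dog', 'cat')[int(class_idx)], conf, box)
```

Raising min_conf trades recall for precision: fewer boxes are reported, but the ones that remain are those the model is more confident about.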

What is NMS (Non-Maximum Suppression)?

Non-Maximum Suppression (NMS) is a post-processing technique commonly used in object detection to reduce duplicate or heavily overlapping bounding box predictions and retain only the most confident ones. It refines the output of a detector by keeping one box per detected object and discarding the redundant ones.

The primary goal of NMS is to eliminate multiple bounding box predictions that cover the same object. This situation is common because object detection models generate several bounding boxes for the same object at different scales and positions or from overlapping receptive fields. NMS uses the confidence score of each bounding box prediction to decide which one to keep.

Here's how the Non-Maximum Suppression algorithm works:

1. Input: The algorithm takes a set of bounding box predictions along with their associated confidence scores.
2. Sorting: The bounding box predictions are first sorted by confidence score in descending order, so that the bounding box with the highest confidence score is considered first.
3. Suppression: Starting from the bounding box with the highest confidence score, NMS compares it with the remaining bounding boxes. If the Intersection over Union (IoU) between the two bounding boxes exceeds a certain threshold (often defined by the user), the bounding box with the lower confidence score is suppressed, that is, discarded.
4. Iteration: The process is repeated for the remaining bounding boxes, excluding those that have been suppressed.
5. Output: The result is a reduced set of bounding box predictions in which no pair has an IoU above the threshold, each with its corresponding confidence score.

The threshold used in NMS controls how much overlap is allowed between the bounding boxes that are kept. If the threshold is set too high, few boxes are suppressed and redundant bounding boxes for the same object may remain. Conversely, if it is set too low, boxes with only modest overlap are suppressed, which can remove valid detections of distinct objects that are close to each other.

NMS is an effective technique to refine object detection results: it improves precision by reducing the likelihood of multiple detections for the same object. It is a standard step in tasks such as object detection and instance segmentation.
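
The effect of the threshold can be tried out with a small plain-Python sketch of greedy NMS. The helpers iou and nms_sketch are hypothetical and written for this illustration, not taken from a library; the boxes and scores are the four predicted boxes from the example above.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms_sketch(boxes, scores, iou_threshold):
    """Greedy NMS: return indices of kept boxes, most confident first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Keep only boxes that do not overlap the chosen box too much
        order = [j for j in order
                 if iou(boxes[best], boxes[j]) <= iou_threshold]
    return keep

boxes = [[0.1, 0.08, 0.52, 0.92], [0.08, 0.2, 0.56, 0.95],
         [0.15, 0.3, 0.62, 0.91], [0.55, 0.2, 0.9, 0.88]]
scores = [0.9, 0.8, 0.7, 0.9]

print(nms_sketch(boxes, scores, 0.5))  # [0, 3]
print(nms_sketch(boxes, scores, 0.8))  # [0, 3, 1, 2]
```

With a threshold of 0.5 the two lower-scoring dog boxes are suppressed, while a threshold of 0.8 lets all four boxes through, including the redundant ones.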

14.4.5. Summary

We generate anchor boxes with different shapes centered on each pixel of the image.
Intersection over union (IoU), also known as Jaccard index, measures the similarity of two bounding boxes. It is the ratio of their intersection area to their union area.
In a training set, we need two types of labels for each anchor box. One is the class of the object relevant to the anchor box and the other is the offset of the ground-truth bounding box relative to the anchor box.
During prediction, we can use non-maximum suppression (NMS) to remove similar predicted bounding boxes, thereby simplifying the output.

14.4.6. Exercises

Change values of sizes and ratios in the multibox_prior function. What are the changes to the generated anchor boxes?
Construct and visualize two bounding boxes with an IoU of 0.5. How do they overlap with each other?
Modify the variable anchors in Section 14.4.3 and Section 14.4.4. How do the results change?
Non-maximum suppression is a greedy algorithm that suppresses predicted bounding boxes by removing them. Is it possible that some of these removed ones are actually useful? How can this algorithm be modified to suppress softly? You may refer to Soft-NMS (Bodla et al., 2017).
Rather than being hand-crafted, can non-maximum suppression be learned?
