'Dive into Deep Learning'에 해당되는 글 123건

2023.12.12 Train and tune a deep learning model at scalewith Amazon SageMaker 1
2023.12.03 D2L - 3.7. Weight Decay 1
2023.12.03 D2L - 3.6. Generalization 1
2023.10.14 D2L - 2.7. Documentation 2
2023.10.14 D2L - 2.6. Probability and Statistics
2023.10.12 D2L - 2.5. Automatic Differentiation
2023.10.12 D2L - 2.4. Calculus : 미적분 1
2023.10.11 D2L - 2.3. Linear Algebra - 선형 대수학 1
2023.10.09 D2L - 2.2. Data Preprocessing
2023.10.09 D2L - 2.1. Data Manipulation

Dive into Deep Learning/Scratch

Train and tune a deep learning model at scalewith Amazon SageMaker

2023. 12. 12. 13:53 | Posted by 솔웅

https://aws.amazon.com/ko/tutorials/train-tune-deep-learning-model-amazon-sagemaker/?nc1=h_ls

Train and tune a deep learning model at scale with Amazon SageMaker

aws.amazon.com

Train and tune a deep learning model at scale

with Amazon SageMaker

In this tutorial, you learn how to use Amazon SageMaker to build, train, and tune a TensorFlow deep learning model.

이 자습서에서는 Amazon SageMaker를 사용하여 TensorFlow 딥 러닝 모델을 구축, 훈련 및 조정하는 방법을 알아봅니다.

Amazon SageMaker is a fully managed service that provides machine learning (ML) developers and data scientists with the ability to build, train, and deploy ML models quickly. Amazon SageMaker provides you with everything you need to train and tune models at scale without the need to manage infrastructure. You can use Amazon SageMaker Studio, the first integrated development environment (IDE) for machine learning, to quickly visualize experiments and track training progress without ever leaving the familiar Jupyter Notebook interface. Within Amazon SageMaker Studio, you can use Amazon SageMaker Experiments to track, evaluate, and organize experiments easily.

Amazon SageMaker는 기계 학습(ML) 개발자와 데이터 과학자에게 ML 모델을 신속하게 구축, 교육 및 배포할 수 있는 기능을 제공하는 완전관리형 서비스입니다. Amazon SageMaker는 인프라를 관리할 필요 없이 대규모로 모델을 훈련하고 조정하는 데 필요한 모든 것을 제공합니다. 기계 학습을 위한 최초의 통합 개발 환경(IDE)인 Amazon SageMaker Studio를 사용하면 친숙한 Jupyter Notebook 인터페이스를 벗어나지 않고도 실험을 신속하게 시각화하고 훈련 진행 상황을 추적할 수 있습니다. Amazon SageMaker Studio 내에서 Amazon SageMaker 실험을 사용하여 실험을 쉽게 추적, 평가 및 구성할 수 있습니다.

In this tutorial, you learn how to:

이 자습서에서는 다음 방법을 알아봅니다.

Set up Amazon SageMaker Studio
Amazon SageMaker Studio 설정
Download a public dataset using an Amazon SageMaker Studio Notebook and upload it to Amazon S3
Amazon SageMaker Studio Notebook을 사용하여 공개 데이터 세트를 다운로드하고 Amazon S3에 업로드합니다.
Create an Amazon SageMaker Experiment to track and manage training jobs
훈련 작업을 추적하고 관리하기 위한 Amazon SageMaker 실험 생성
Run a TensorFlow training job on a fully managed GPU instance using one-click training with Amazon SageMaker
Amazon SageMaker의 원클릭 교육을 사용하여 완전 관리형 GPU 인스턴스에서 TensorFlow 교육 작업 실행
Improve accuracy by running a large-scale Amazon SageMaker Automatic Model Tuning job to find the best model hyperparameters
대규모 Amazon SageMaker 자동 모델 튜닝 작업을 실행하여 최상의 모델 하이퍼파라미터를 찾아 정확성을 향상시킵니다.
Visualize training results
훈련 결과 시각화

You’ll be using the CIFAR-10 dataset to train a model in TensorFlow to classify images into 10 classes. This dataset consists of 60,000 32x32 color images, split into 40,000 images for training, 10,000 images for validation and 10,000 images for testing.

CIFAR-10 데이터 세트를 사용하여 TensorFlow에서 모델을 훈련하여 이미지를 10개 클래스로 분류하게 됩니다. 이 데이터 세트는 60,000개의 32x32 컬러 이미지로 구성되어 있으며 훈련용 이미지 40,000개, 검증용 이미지 10,000개, 테스트용 이미지 10,000개로 나뉩니다.

이 튜토리얼의 비용은 약 $100입니다.

Amazon SageMaker Studio에 온보딩하고 Amazon SageMaker Studio 제어판을 설정하려면 다음 단계를 완료하십시오.

참고: 자세한 내용은 Amazon SageMaker 설명서의 Get Started with Amazon SageMaker Studio 를 참조하십시오.

a. Amazon SageMaker console 에 로그인합니다.

참고: 오른쪽 상단에서 SageMaker Studio를 사용할 수 있는 AWS 리전을 선택하십시오. 지역 목록은 Onboard to Amazon SageMaker Studio 을 참조하십시오.

b. Amazon SageMaker 탐색 창에서 Amazon SageMaker Studio를 선택합니다.

참고: Amazon SageMaker Studio를 처음 사용하는 경우 Studio onboarding process 를 완료해야 합니다. 온보딩 시 인증 방법으로 AWS Single Sign-On(AWS SSO) 또는 AWS Identity and Access Management(IAM)를 사용하도록 선택할 수 있습니다. IAM 인증을 사용하는 경우 빠른 시작 또는 표준 설정 절차를 선택할 수 있습니다. 어떤 옵션을 선택해야 할지 잘 모르겠으면 Onboard to Amazon SageMaker Studio 을 참조하고 IT 관리자에게 도움을 요청하세요. 단순화를 위해 이 자습서에서는 빠른 시작 절차를 사용합니다.

c. 시작하기 상자에서 빠른 시작을 선택하고 사용자 이름을 지정합니다.

d. 실행 역할에서 IAM 역할 생성을 선택합니다. 표시되는 대화 상자에서 모든 S3 버킷을 선택하고 역할 생성을 선택합니다.
Amazon SageMaker는 필요한 권한이 있는 역할을 생성하고 이를 인스턴스에 할당합니다.

e. Submit을 클릭하세요.

Amazon SageMaker Studio 노트북은 훈련 스크립트를 구축하고 테스트하는 데 필요한 모든 것이 포함된 원클릭 Jupyter 노트북입니다. SageMaker Studio에는 실험 추적 및 시각화도 포함되어 있어 전체 기계 학습 워크플로를 한 곳에서 쉽게 관리할 수 있습니다.

SageMaker 노트북을 생성하고, 데이터 세트를 다운로드하고, 데이터 세트를 TensorFlow 지원 TFRecord 형식으로 변환한 다음 데이터 세트를 Amazon S3에 업로드하려면 다음 단계를 완료하십시오.

참고: 자세한 내용은 Amazon SageMaker 설명서의 Use Amazon SageMaker Studio Notebooks 사용을 참조하십시오.

a. Amazon SageMaker Studio 제어판에서 Open Studio를 선택합니다.

b. JupyterLab의 파일 메뉴에서 새 실행 프로그램을 선택합니다. 노트북 및 컴퓨팅 리소스 섹션의 SageMaker 이미지 선택에서 TensorFlow 1.15 Python 3.6(CPU에 최적화됨)을 선택합니다. 그런 다음 노트북에서 Python 3을 선택합니다.

참고: 이 단계에서는 데이터 세트를 다운로드하고, 교육 스크립트를 작성하고, Amazon SageMaker 교육 작업을 제출하고, 결과를 시각화하는 SageMaker 노트북을 실행하는 데 사용되는 CPU 인스턴스를 선택합니다. 훈련 작업 자체는 5단계에서 볼 수 있는 GPU 인스턴스와 같이 지정할 수 있는 별도의 인스턴스 유형에서 실행됩니다.

c. 다음 코드 블록을 복사하여 코드 셀에 붙여넣고 실행을 선택합니다.

이 코드는 generate_cifar10_tfrecords.py 스크립트를 다운로드하고 CIFAR-10 dataset 를 다운로드한 후 TFRecord 형식으로 변환합니다.

참고: 코드가 실행되는 동안 대괄호 사이에 *가 나타납니다. 몇 초 후에 코드 실행이 완료되고 *가 숫자로 대체됩니다.

https://github.com/aws/amazon-sagemaker-examples/blob/master/advanced_functionality/tensorflow_bring_your_own/utils/generate_cifar10_tfrecords.py

다음 코드를 복사하여 코드 셀에 붙여넣고 실행을 선택합니다.

!pip install ipywidgets
!python generate_cifar10_tfrecords.py --data-dir cifar10

d. 기본 Amazon SageMaker Amazon S3 버킷에 데이터세트를 업로드합니다. 다음 코드를 복사하여 코드 셀에 붙여넣고 실행을 선택합니다.

데이터세트의 Amazon S3 위치가 출력으로 표시되어야 합니다.

import time, os, sys
import sagemaker, boto3
import numpy as np
import pandas as pd

sess = boto3.Session()
sm   = sess.client('sagemaker')
role = sagemaker.get_execution_role()
sagemaker_session = sagemaker.Session(boto_session=sess)

datasets = sagemaker_session.upload_data(path='cifar10', key_prefix='datasets/cifar10-dataset')
datasets

이제 Amazon S3에서 데이터 세트를 다운로드하고 준비했으므로 Amazon SageMaker 실험을 생성할 수 있습니다. 실험은 동일한 기계 학습 프로젝트와 관련된 처리 및 학습 작업의 모음입니다. Amazon SageMaker Experiments는 훈련 실행을 자동으로 관리하고 추적합니다.

새 실험을 만들려면 다음 단계를 완료하세요.

참고: 자세한 내용은 Amazon SageMaker 설명서의 실험을 참조하십시오.

from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent

training_experiment = Experiment.create(
                                experiment_name = "sagemaker-training-experiments", 
                                description     = "Experiment to track cifar10 training trials", 
                                sagemaker_boto_client=sm)

b. 왼쪽 도구 모음에서 구성 요소 및 레지스트리(삼각형 아이콘)를 선택한 다음 실험 및 시험을 선택합니다. 새 실험 Sagemaker-training-experiments가 목록에 나타납니다.

CIFAR-10 데이터 세트에서 분류기를 훈련하려면 훈련 스크립트가 필요합니다. 이 단계에서는 TensorFlow 학습 작업을 위한 평가판 및 학습 스크립트를 만듭니다. 각 시도는 엔드 투 엔드 훈련 작업의 반복입니다. 훈련 작업 외에도 평가판에서는 전처리, 후처리 작업은 물론 데이터 세트 및 기타 메타데이터도 추적할 수 있습니다. 단일 실험에는 여러 시도가 포함될 수 있으므로 Amazon SageMaker Studio 실험 창 내에서 시간 경과에 따른 여러 반복을 쉽게 추적할 수 있습니다.

TensorFlow 학습 작업을 위한 새로운 평가판 및 학습 스크립트를 생성하려면 다음 단계를 완료하세요.

참고: 자세한 내용은 Amazon SageMaker 설명서의 Use TensorFlow with Amazon SageMaker 을 참조하십시오.

a. Jupyter Notebook에서 다음 코드 블록을 복사하여 코드 셀에 붙여넣고 실행을 선택합니다.

이 코드는 새로운 시도를 생성하고 이를 4단계에서 생성한 실험과 연결합니다.

single_gpu_trial = Trial.create(
    trial_name = 'sagemaker-single-gpu-training', 
    experiment_name = training_experiment.experiment_name,
    sagemaker_boto_client = sm,
)

trial_comp_name = 'single-gpu-training-job'
experiment_config = {"ExperimentName": training_experiment.experiment_name, 
                       "TrialName": single_gpu_trial.trial_name,
                       "TrialComponentDisplayName": trial_comp_name}

각 시도는 엔드 투 엔드 훈련 작업의 반복입니다. 훈련 작업 외에도 시도에서는 전처리 작업, 후처리 작업, 데이터 세트 및 기타 메타데이터를 추적할 수도 있습니다. 단일 실험에는 여러 시도가 포함될 수 있으므로 Amazon SageMaker Studio 실험 창 내에서 시간 경과에 따른 여러 반복을 쉽게 추적할 수 있습니다.

b. 왼쪽 도구 모음에서 구성 요소 및 레지스트리(삼각형 아이콘)를 선택합니다. Sagemaker-training-experiments를 두 번 클릭하여 관련 시도를 표시합니다. 새로운 평가판 Sagemaker-single-Gpu-training이 목록에 나타납니다.

c. 파일 메뉴에서 새로 만들기를 선택한 다음 텍스트 파일을 선택합니다. 코드 편집기에서 다음 TensorFlow 코드를 복사하여 새로 생성된 파일에 붙여넣습니다.

이 스크립트는 TensorFlow 코드를 구현하여 CIFAR-10 데이터 세트를 읽고 resnet50 모델을 교육합니다.

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.layers import Input, Dense, Flatten
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.optimizers import Adam, SGD
import argparse
import os
import re
import time

HEIGHT = 32
WIDTH = 32
DEPTH = 3
NUM_CLASSES = 10

def single_example_parser(serialized_example):
    """Parses a single tf.Example into image and label tensors."""
    # Dimensions of the images in the CIFAR-10 dataset.
    # See http://www.cs.toronto.edu/~kriz/cifar.html for a description of the
    # input format.
    features = tf.io.parse_single_example(
        serialized_example,
        features={
            'image': tf.io.FixedLenFeature([], tf.string),
            'label': tf.io.FixedLenFeature([], tf.int64),
        })
    image = tf.decode_raw(features['image'], tf.uint8)
    image.set_shape([DEPTH * HEIGHT * WIDTH])

    # Reshape from [depth * height * width] to [depth, height, width].
    image = tf.cast(
        tf.transpose(tf.reshape(image, [DEPTH, HEIGHT, WIDTH]), [1, 2, 0]),
        tf.float32)
    label = tf.cast(features['label'], tf.int32)
    
    image = train_preprocess_fn(image)
    label = tf.one_hot(label, NUM_CLASSES)
    
    return image, label

def train_preprocess_fn(image):

    # Resize the image to add four extra pixels on each side.
    image = tf.image.resize_with_crop_or_pad(image, HEIGHT + 8, WIDTH + 8)

    # Randomly crop a [HEIGHT, WIDTH] section of the image.
    image = tf.image.random_crop(image, [HEIGHT, WIDTH, DEPTH])

    # Randomly flip the image horizontally.
    image = tf.image.random_flip_left_right(image)
    return image

def get_dataset(filenames, batch_size):
    """Read the images and labels from 'filenames'."""
    # Repeat infinitely.
    dataset = tf.data.TFRecordDataset(filenames).repeat().shuffle(10000)

    # Parse records.
    dataset = dataset.map(single_example_parser, num_parallel_calls=tf.data.experimental.AUTOTUNE)

    # Batch it up.
    dataset = dataset.batch(batch_size, drop_remainder=True)
    return dataset

def get_model(input_shape, learning_rate, weight_decay, optimizer, momentum):
    input_tensor = Input(shape=input_shape)
    base_model = keras.applications.resnet50.ResNet50(include_top=False,
                                                          weights='imagenet',
                                                          input_tensor=input_tensor,
                                                          input_shape=input_shape,
                                                          classes=None)
    x = Flatten()(base_model.output)
    predictions = Dense(NUM_CLASSES, activation='softmax')(x)
    model = Model(inputs=base_model.input, outputs=predictions)
    return model

def main(args):
    # Hyper-parameters
    epochs       = args.epochs
    lr           = args.learning_rate
    batch_size   = args.batch_size
    momentum     = args.momentum
    weight_decay = args.weight_decay
    optimizer    = args.optimizer

    # SageMaker options
    training_dir   = args.training
    validation_dir = args.validation
    eval_dir       = args.eval

    train_dataset = get_dataset(training_dir+'/train.tfrecords',  batch_size)
    val_dataset   = get_dataset(validation_dir+'/validation.tfrecords', batch_size)
    eval_dataset  = get_dataset(eval_dir+'/eval.tfrecords', batch_size)
    
    input_shape = (HEIGHT, WIDTH, DEPTH)
    model = get_model(input_shape, lr, weight_decay, optimizer, momentum)
    
    # Optimizer
    if optimizer.lower() == 'sgd':
        opt = SGD(lr=lr, decay=weight_decay, momentum=momentum)
    else:
        opt = Adam(lr=lr, decay=weight_decay)

    # Compile model
    model.compile(optimizer=opt,
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    
    # Train model
    history = model.fit(train_dataset, steps_per_epoch=40000 // batch_size,
                        validation_data=val_dataset, 
                        validation_steps=10000 // batch_size,
                        epochs=epochs)
                        
    
    # Evaluate model performance
    score = model.evaluate(eval_dataset, steps=10000 // batch_size, verbose=1)
    print('Test loss    :', score[0])
    print('Test accuracy:', score[1])
    
    # Save model to model directory
    model.save(f'{os.environ["SM_MODEL_DIR"]}/{time.strftime("%m%d%H%M%S", time.gmtime())}', save_format='tf')


#%%
if __name__ == "__main__":
    
    parser = argparse.ArgumentParser()
    # Hyper-parameters
    parser.add_argument('--epochs',        type=int,   default=10)
    parser.add_argument('--learning-rate', type=float, default=0.01)
    parser.add_argument('--batch-size',    type=int,   default=128)
    parser.add_argument('--weight-decay',  type=float, default=2e-4)
    parser.add_argument('--momentum',      type=float, default='0.9')
    parser.add_argument('--optimizer',     type=str,   default='sgd')

    # SageMaker parameters
    parser.add_argument('--model_dir',        type=str)
    parser.add_argument('--training',         type=str,   default=os.environ['SM_CHANNEL_TRAINING'])
    parser.add_argument('--validation',       type=str,   default=os.environ['SM_CHANNEL_VALIDATION'])
    parser.add_argument('--eval',             type=str,   default=os.environ['SM_CHANNEL_EVAL'])
    
    args = parser.parse_args()
    main(args)

d. 파일 메뉴에서 파일 이름 바꾸기를 선택합니다. 새 이름 상자에서 cifar10-training-sagemaker.py를 복사하여 붙여넣고 이름 바꾸기를 선택합니다. (새 확장자가 .txt가 아니라 .py인지 확인하세요.) 그런 다음 파일을 선택하고 Python 파일 저장을 선택합니다.

이 단계에서는 Amazon SageMaker를 사용하여 TensorFlow 훈련 작업을 실행합니다. Amazon SageMaker를 사용하면 모델 훈련이 쉽습니다. Amazon S3의 데이터 세트 위치와 훈련 인스턴스 유형을 지정하면 Amazon SageMaker가 훈련 인프라를 관리합니다.

TensorFlow 학습 작업을 실행한 후 결과를 시각화하려면 다음 단계를 완료하세요.

참고: 자세한 내용은 Amazon SageMaker 설명서의 Use TensorFlow with Amazon SageMaker 을 참조하십시오.

a. Jupyter Notebook에서 다음 코드 블록을 복사하여 코드 셀에 붙여넣고 실행을 선택합니다. 그런 다음 코드를 자세히 살펴보세요.

참고: ResourceLimitExceeded가 나타나면 인스턴스 유형을 ml.c5.xlarge로 변경하세요.

참고: 사용 중단 경고는 무시해도 됩니다(예: sagemaker.deprecations:train_instance_type의 이름이 변경되었습니다...). 이 경고는 버전 변경으로 인한 것이며 학습 실패를 일으키지 않습니다.

from sagemaker.tensorflow import TensorFlow

hyperparams={'epochs'       : 30,
             'learning-rate': 0.01,
             'batch-size'   : 256,
             'weight-decay' : 2e-4,
             'momentum'     : 0.9,
             'optimizer'    : 'adam'}

bucket_name = sagemaker_session.default_bucket()
output_path = f's3://{bucket_name}/jobs'
metric_definitions = [{'Name': 'val_acc', 'Regex': 'val_acc: ([0-9\\.]+)'}]

tf_estimator = TensorFlow(entry_point          = 'cifar10-training-sagemaker.py', 
                          output_path          = f'{output_path}/',
                          code_location        = output_path,
                          role                 = role,
                          train_instance_count = 1, 
                          train_instance_type  = 'ml.g4dn.xlarge',
                          framework_version    = '1.15.2', 
                          py_version           = 'py3',
                          script_mode          = True,
                          metric_definitions   = metric_definitions,
                          sagemaker_session    = sagemaker_session,
                          hyperparameters      = hyperparams)

job_name=f'tensorflow-single-gpu-{time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())}'
tf_estimator.fit({'training'  : datasets,
                  'validation': datasets,
                  'eval'      : datasets},
                 job_name = job_name,
                 experiment_config=experiment_config)

이 코드는 세 부분으로 구성됩니다.

- 학습 작업 하이퍼파라미터를 지정합니다.
- Amazon SageMaker Estimator 함수를 호출하고 훈련 작업 세부 정보(훈련 스크립트 이름, 훈련할 인스턴스 유형, 프레임워크 버전 등)를 제공합니다.
- 훈련 작업을 시작하기 위해 fit 함수를 호출합니다.

Amazon SageMaker는 요청된 인스턴스를 자동으로 프로비저닝하고, 데이터 세트를 다운로드하고, TensorFlow 컨테이너를 가져오고, 훈련 스크립트를 다운로드하고, 훈련을 시작합니다.

이 예에서는 GPU 인스턴스인 ml.g4dn.xlarge에서 실행할 Amazon SageMaker 훈련 작업을 제출합니다. 딥 러닝 훈련은 계산 집약적이며 결과를 더 빠르게 얻으려면 GPU 인스턴스가 권장됩니다.

학습이 완료되면 최종 정확도 결과, 학습 시간 및 청구 가능 시간이 표시됩니다.

b. 교육 요약을 봅니다. 왼쪽 도구 모음에서 구성 요소 및 레지스트리(삼각형 아이콘)를 선택합니다. sagemaker-training-experiments를 두 번 클릭한 다음 sagemaker-single-gpu-training을 두 번 클릭하고 훈련 작업에 대해 새로 생성된 Single-Gpu-training-job 평가판 구성 요소를 두 번 클릭합니다. 측정항목을 선택합니다.

c. 결과를 시각화합니다. 차트를 선택한 다음 차트 추가를 선택합니다. 차트 속성 창에서 다음을 선택합니다.

차트 유형: 선
X축 차원: Epoch
Y축: val_acc_EVAL_avg
학습이 진행됨에 따라 평가 정확도의 변화를 보여주는 그래프가 표시되고 6a단계의 최종 정확도로 끝납니다.

Step 7. Tune the model with Amazon SageMaker automatic model tuning

이 단계에서는 Amazon SageMaker 자동 모델 튜닝 작업을 실행하여 최상의 하이퍼파라미터를 찾고 6단계에서 얻은 훈련 정확도를 향상시킵니다. 모델 튜닝 작업을 실행하려면 Amazon SageMaker에 고정 값이 아닌 하이퍼파라미터 범위를 제공해야 합니다. 하이퍼파라미터 공간을 탐색하고 자동으로 최적의 값을 찾을 수 있다는 것입니다.
자동 모델 튜닝 작업을 실행하려면 다음 단계를 완료하세요.

참고: 자세한 내용은 Amazon SageMaker 설명서의 Perform Automatic Model Tuning 을 참조하십시오.

a. Jupyter Notebook에서 다음 코드 블록을 복사하여 코드 셀에 붙여넣고 실행을 선택합니다. 그런 다음 코드를 자세히 살펴보세요.

참고: ResourceLimitExceeded가 나타나면 인스턴스 유형을 ml.c5.xlarge로 변경하세요.

from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner

hyperparameter_ranges = {
    'epochs'        : IntegerParameter(5, 30),
    'learning-rate' : ContinuousParameter(0.001, 0.1, scaling_type='Logarithmic'), 
    'batch-size'    : CategoricalParameter(['128', '256', '512']),
    'momentum'      : ContinuousParameter(0.9, 0.99),
    'optimizer'     : CategoricalParameter(['sgd', 'adam'])
}

objective_metric_name = 'val_acc'
objective_type = 'Maximize'
metric_definitions = [{'Name': 'val_acc', 'Regex': 'val_acc: ([0-9\\.]+)'}]

tf_estimator = TensorFlow(entry_point          = 'cifar10-training-sagemaker.py', 
                          output_path          = f'{output_path}/',
                          code_location        = output_path,
                          role                 = role,
                          train_instance_count = 1, 
                          train_instance_type  = 'ml.g4dn.xlarge',
                          framework_version    = '1.15', 
                          py_version           = 'py3',
                          script_mode          = True,
                          metric_definitions   = metric_definitions,
                          sagemaker_session    = sagemaker_session)

tuner = HyperparameterTuner(estimator             = tf_estimator,
                            objective_metric_name = objective_metric_name,
                            hyperparameter_ranges = hyperparameter_ranges,
                            metric_definitions    = metric_definitions,
                            max_jobs              = 16,
                            max_parallel_jobs     = 8,
                            objective_type        = objective_type)

job_name=f'tf-hpo-{time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())}'
tuner.fit({'training'  : datasets,
           'validation': datasets,
           'eval'      : datasets},
            job_name = job_name)

이 코드는 네 부분으로 구성됩니다.

- 하이퍼파라미터의 값 범위를 지정합니다. 이는 정수 범위(예: Epoch 번호), 연속 범위(예: 학습률) 또는 범주형 값(예: 최적화 유형 sgd 또는 adam)일 수 있습니다.
- 6단계의 것과 유사한 Estimator 함수를 호출합니다.
- 하이퍼파라미터 범위, 최대 작업 수, 실행할 병렬 작업 수를 포함하는 HyperparameterTuner 객체를 생성합니다.
- 초매개변수 조정 작업을 시작하기 위해 맞춤 함수를 호출합니다.

참고: max_jobs 변수를 16에서 더 작은 숫자로 줄여 튜닝 작업 비용을 절약할 수 있습니다. 그러나 튜닝 작업 수를 줄이면 더 나은 모델을 찾을 가능성이 줄어듭니다. max_parallel_jobs 변수를 max_jobs보다 작거나 같은 숫자로 줄일 수도 있습니다. max_parallel_jobs가 max_jobs와 같으면 결과를 더 빨리 얻을 수 있습니다. 리소스 오류가 발생하지 않도록 max_parallel_jobs가 AWS 계정의 인스턴스 제한보다 낮은지 확인하십시오.

b. 최고의 하이퍼파라미터를 확인하세요. Amazon SageMaker 콘솔을 열고 왼쪽 탐색 창의 훈련 아래에서 하이퍼파라미터 튜닝 작업을 선택하고 튜닝 작업을 선택한 다음 최상의 훈련 작업을 선택합니다. 6단계의 결과(60%)에 비해 훈련 정확도(80%)가 향상되는 것을 확인할 수 있습니다.

참고: 결과는 다를 수 있습니다. max_jobs를 늘리고, 하이퍼파라미터 범위를 완화하고, 다른 모델 아키텍처를 탐색하여 결과를 더욱 향상시킬 수 있습니다.

이 단계에서는 이 실습에서 사용한 리소스를 종료합니다.

중요: 적극적으로 사용되지 않는 리소스를 종료하면 비용이 절감되므로 모범 사례입니다. 리소스를 종료하지 않으면 계정에 요금이 청구됩니다.

학습 작업을 중지합니다.

1. Amazon SageMaker 콘솔을 엽니다.
2. 교육 아래 왼쪽 탐색 창에서 교육 작업을 선택합니다.
3. 진행 중 상태의 교육 작업이 없는지 확인합니다. 진행 중인 훈련 작업의 경우 작업이 훈련을 마칠 때까지 기다리거나 훈련 작업 이름을 선택하고 중지를 선택할 수 있습니다.

(선택 사항) 모든 훈련 아티팩트 정리: 모든 훈련 아티팩트(모델, 사전 처리된 데이터 세트 등)를 정리하려면 Jupyter Notebook에서 다음 코드를 복사하여 붙여넣고 실행을 선택합니다.

참고: ACCOUNT_NUMBER를 계좌 번호로 바꿔야 합니다.

!aws s3 rm --recursive s3://sagemaker-us-west-2-ACCOUNT_NUMBER/datasets/cifar10-dataset
!aws s3 rm --recursive s3://sagemaker-us-west-2-ACCOUNT_NUMBER/jobs

Conclusion

Congratulations! You created, trained, and tuned a TensorFlow deep learning model with Amazon SageMaker.

축하해요! Amazon SageMaker를 사용하여 TensorFlow 딥 러닝 모델을 생성, 교육 및 조정했습니다.

You can continue your machine learning journey with SageMaker by following the next steps section below.

아래의 다음 단계 섹션에 따라 SageMaker를 사용하여 기계 학습 여정을 계속할 수 있습니다.

저작자표시 (새창열림)

'Dive into Deep Learning > Scratch' 카테고리의 다른 글

ReLU에서는 왜 양수가 아닌 음수를 버리는가? ChatGPT에게 물어보기. (0)	2023.07.08
딥러닝에서의 연쇄법칙이란? The Chain Rule in Deep Learning (0)	2023.07.08
D2L - Setup (0)	2023.06.17

Dive into Deep Learning/D2L Linear Neural Networks

D2L - 3.7. Weight Decay

2023. 12. 3. 13:00 | Posted by 솔웅

https://d2l.ai/chapter_linear-regression/weight-decay.html

3.7. Weight Decay — Dive into Deep Learning 1.0.3 documentation

d2l.ai

3.7. Weight Decay

Now that we have characterized the problem of overfitting, we can introduce our first regularization technique. Recall that we can always mitigate overfitting by collecting more training data. However, that can be costly, time consuming, or entirely out of our control, making it impossible in the short run. For now, we can assume that we already have as much high-quality data as our resources permit and focus the tools at our disposal when the dataset is taken as a given.

이제 과적합 문제를 특성화했으므로 첫 번째 정규화 기술을 소개할 수 있습니다. 더 많은 훈련 데이터를 수집하면 항상 과적합을 완화할 수 있다는 점을 기억하세요. 그러나 이는 비용이 많이 들고, 시간이 많이 걸리거나 완전히 통제할 수 없어 단기적으로 불가능할 수 있습니다. 지금은 리소스가 허용하는 만큼의 고품질 데이터를 이미 보유하고 있다고 가정하고 데이터 세트를 주어진 것으로 간주할 때 사용할 수 있는 도구에 집중할 수 있습니다.

Recall that in our polynomial regression example (Section 3.6.2.1) we could limit our model’s capacity by tweaking the degree of the fitted polynomial. Indeed, limiting the number of features is a popular technique for mitigating overfitting. However, simply tossing aside features can be too blunt an instrument. Sticking with the polynomial regression example, consider what might happen with high-dimensional input. The natural extensions of polynomials to multivariate data are called monomials, which are simply products of powers of variables. The degree of a monomial is the sum of the powers. For example, x1**2x2, and x3x5**2 (

)are both monomials of degree 3.

다항식 회귀 예제(섹션 3.6.2.1)에서 피팅된 다항식의 차수를 조정하여 모델의 용량을 제한할 수 있다는 점을 기억하세요. 실제로 특성 수를 제한하는 것은 과적합을 완화하는 데 널리 사용되는 기술입니다. 그러나 단순히 기능을 제쳐두는 것은 너무 무뚝뚝한 도구가 될 수 있습니다. 다항식 회귀 예제를 계속 사용하면서 고차원 입력에서 어떤 일이 발생할 수 있는지 생각해 보세요. 다변량 데이터에 대한 다항식의 자연스러운 확장을 단항식이라고 하며 이는 단순히 변수 거듭제곱의 곱입니다. 단항식의 차수는 거듭제곱의 합입니다. 예를 들어 x1**2x2와 x3x5**2 ()는 모두 3차 단항식입니다.

Note that the number of terms with degree d blows up rapidly as d grows larger. Given k variables, the number of monomials of degree d is (k−1+d k−1) (

). Even small changes in degree, say from 2 to 3, dramatically increase the complexity of our model. Thus we often need a more fine-grained tool for adjusting function complexity.

d가 커짐에 따라 차수 d를 갖는 항의 수가 급격히 증가한다는 점에 유의하십시오. k개의 변수가 주어지면 d차 단항식의 수는 (k−1+d k−1)입니다. 2에서 3까지의 작은 변화조차도 모델의 복잡성을 극적으로 증가시킵니다. 따라서 함수 복잡성을 조정하기 위해 보다 세분화된 도구가 필요한 경우가 많습니다.

%matplotlib inline
import torch
from torch import nn
from d2l import torch as d2l

3.7.1. Norms and Weight Decay

Rather than directly manipulating the number of parameters, weight decay, operates by restricting the values that the parameters can take. More commonly called ℓ2 regularization outside of deep learning circles when optimized by minibatch stochastic gradient descent, weight decay might be the most widely used technique for regularizing parametric machine learning models. The technique is motivated by the basic intuition that among all functions f, the function f=0 (assigning the value 0 to all inputs) is in some sense the simplest, and that we can measure the complexity of a function by the distance of its parameters from zero. But how precisely should we measure the distance between a function and zero? There is no single right answer. In fact, entire branches of mathematics, including parts of functional analysis and the theory of Banach spaces, are devoted to addressing such issues.

매개변수 수를 직접 조작하는 대신 가중치 감소는 매개변수가 취할 수 있는 값을 제한하여 작동합니다. 미니배치 확률적 경사 하강법으로 최적화할 때 딥 러닝 분야 외부에서 더 일반적으로 ℓ2 정규화라고 불리는 가중치 감소는 파라메트릭 기계 학습 모델을 정규화하는 데 가장 널리 사용되는 기술일 수 있습니다. 이 기술은 모든 함수 f 중에서 함수 f=0(모든 입력에 값 0을 할당하는)이 어떤 의미에서는 가장 단순하며 함수의 복잡성을 함수의 거리로 측정할 수 있다는 기본적인 직관에 의해 동기가 부여되었습니다. 매개변수는 0부터 시작됩니다. 하지만 함수와 0 사이의 거리를 얼마나 정확하게 측정해야 할까요? 정답은 하나도 없습니다. 실제로 기능 분석의 일부와 바나흐 공간 이론을 포함한 수학의 전체 분야가 이러한 문제를 해결하는 데 전념하고 있습니다.

One simple interpretation might be to measure the complexity of a linear function f(x)=w⊤x by some norm of its weight vector, e.g., ‖w‖**2. Recall that we introduced the ℓ2 norm and ℓ1 norm, which are special cases of the more general ℓp norm, in Section 2.3.11. The most common method for ensuring a small weight vector is to add its norm as a penalty term to the problem of minimizing the loss. Thus we replace our original objective, minimizing the prediction loss on the training labels, with new objective, minimizing the sum of the prediction loss and the penalty term. Now, if our weight vector grows too large, our learning algorithm might focus on minimizing the weight norm ‖w‖**2 rather than minimizing the training error. That is exactly what we want. To illustrate things in code, we revive our previous example from Section 3.1 for linear regression. There, our loss was given by

한 가지 간단한 해석은 선형 함수 f(x)=w⊤x의 복잡성을 해당 가중치 벡터의 일부 표준(예: "w"**2)으로 측정하는 것입니다. 섹션 2.3.11에서 보다 일반적인 ℓp 노름의 특수한 경우인 ℓ2 노름과 ℓ1 노름을 소개했음을 기억하세요. 작은 가중치 벡터를 보장하는 가장 일반적인 방법은 손실을 최소화하는 문제에 페널티 항으로 해당 노름을 추가하는 것입니다. 따라서 우리는 훈련 라벨의 예측 손실을 최소화하는 원래 목표를 예측 손실과 페널티 항의 합을 최소화하는 새로운 목표로 대체합니다. 이제 가중치 벡터가 너무 커지면 학습 알고리즘은 훈련 오류를 최소화하는 대신 가중치 표준 "w"**2를 최소화하는 데 중점을 둘 수 있습니다. 그것이 바로 우리가 원하는 것입니다. 코드로 내용을 설명하기 위해 선형 회귀에 대한 섹션 3.1의 이전 예제를 되살립니다. 거기에서 우리의 손실은 다음과 같습니다.

Recall that x**(i) are the features, y**(i) is the label for any data example i, and (w,b) are the weight and bias parameters, respectively. To penalize the size of the weight vector, we must somehow add ‖w‖**2 to the loss function, but how should the model trade off the standard loss for this new additive penalty? In practice, we characterize this trade-off via the regularization constant λ , a nonnegative hyperparameter that we fit using validation data:

x**(i)는 특징이고, y**(i)는 데이터 예제 i에 대한 레이블이며, (w,b)는 각각 가중치 및 편향 매개변수라는 점을 기억하세요. 가중치 벡터의 크기에 페널티를 적용하려면 어떻게든 손실 함수에 "w"**2를 추가해야 합니다. 하지만 모델은 이 새로운 추가 페널티에 대한 표준 손실을 어떻게 교환해야 할까요? 실제로 우리는 검증 데이터를 사용하여 피팅한 음이 아닌 하이퍼파라미터인 정규화 상수 λ를 통해 이러한 절충안을 특성화합니다.

For λ =0, we recover our original loss function. For λ >0, we restrict the size of ‖W‖. We divide by 2 by convention: when we take the derivative of a quadratic function, the 2 and 1/2 cancel out, ensuring that the expression for the update looks nice and simple. The astute reader might wonder why we work with the squared norm and not the standard norm (i.e., the Euclidean distance). We do this for computational convenience. By squaring the ℓ2 norm, we remove the square root, leaving the sum of squares of each component of the weight vector. This makes the derivative of the penalty easy to compute: the sum of derivatives equals the derivative of the sum.

λ =0인 경우 원래의 손실 함수를 복구합니다. λ >0인 경우 "W" 크기를 제한합니다. 관례에 따라 2로 나눕니다. 이차 함수의 미분을 취하면 2와 1/2이 상쇄되어 업데이트에 대한 식이 멋지고 단순해 보입니다. 기민한 독자라면 왜 우리가 표준 표준(예: 유클리드 거리)이 아닌 제곱 표준을 사용하여 작업하는지 궁금할 것입니다. 우리는 계산상의 편의를 위해 이렇게 합니다. ℓ2 노름을 제곱함으로써 제곱근을 제거하고 가중치 벡터의 각 구성요소의 제곱합을 남깁니다. 이는 페널티의 미분을 계산하기 쉽게 만듭니다. 미분의 합은 합계의 미분과 같습니다.

Moreover, you might ask why we work with the ℓ2 norm in the first place and not, say, the ℓ1 norm. In fact, other choices are valid and popular throughout statistics. While ℓ2-regularized linear models constitute the classic ridge regression algorithm, ℓ1-regularized linear regression is a similarly fundamental method in statistics, popularly known as lasso regression. One reason to work with the ℓ2 norm is that it places an outsize penalty on large components of the weight vector. This biases our learning algorithm towards models that distribute weight evenly across a larger number of features. In practice, this might make them more robust to measurement error in a single variable. By contrast, ℓ1 penalties lead to models that concentrate weights on a small set of features by clearing the other weights to zero. This gives us an effective method for feature selection, which may be desirable for other reasons. For example, if our model only relies on a few features, then we may not need to collect, store, or transmit data for the other (dropped) features.

게다가 왜 우리가 ℓ1 표준이 아닌 ℓ2 표준으로 작업하는지 궁금할 수도 있습니다. 실제로 통계 전반에 걸쳐 다른 선택이 유효하고 널리 사용됩니다. ℓ2 정규화 선형 모델이 고전적인 능선 회귀 알고리즘을 구성하는 반면, ℓ1 정규화 선형 회귀는 lasso 회귀로 널리 알려진 통계의 유사한 기본 방법입니다. ℓ2 표준을 사용하는 한 가지 이유는 가중치 벡터의 큰 구성요소에 특대 페널티를 적용한다는 것입니다. 이는 우리의 학습 알고리즘을 더 많은 수의 특성에 균등하게 가중치를 분배하는 모델로 편향시킵니다. 실제로 이는 단일 변수의 측정 오류에 더욱 강력해질 수 있습니다. 대조적으로, ℓ1 페널티는 다른 가중치를 0으로 지워서 작은 특성 집합에 가중치를 집중시키는 모델로 이어집니다. 이는 다른 이유로 바람직할 수 있는 특징 선택을 위한 효과적인 방법을 제공합니다. 예를 들어 모델이 몇 가지 기능에만 의존하는 경우 다른(삭제된) 기능에 대한 데이터를 수집, 저장 또는 전송할 필요가 없을 수 있습니다.

Using the same notation in (3.1.11), minibatch stochastic gradient descent updates for ℓ2-regularized regression as follows:

(3.1.11)의 동일한 표기법을 사용하여 ℓ2 정규 회귀에 대한 미니배치 확률적 경사하강법 업데이트는 다음과 같습니다.

As before, we update w based on the amount by which our estimate differs from the observation. However, we also shrink the size of w towards zero. That is why the method is sometimes called “weight decay”: given the penalty term alone, our optimization algorithm decays the weight at each step of training. In contrast to feature selection, weight decay offers us a mechanism for continuously adjusting the complexity of a function. Smaller values of λ correspond to less constrained w, whereas larger values of λ constrain w more considerably. Whether we include a corresponding bias penalty b**2 can vary across implementations, and may vary across layers of a neural network. Often, we do not regularize the bias term. Besides, although ℓ2 regularization may not be equivalent to weight decay for other optimization algorithms, the idea of regularization through shrinking the size of weights still holds true.

이전과 마찬가지로 추정값이 관측값과 다른 정도에 따라 w를 업데이트합니다. 그러나 w의 크기도 0으로 축소합니다. 이것이 바로 이 방법을 "가중치 감소"라고 부르는 이유입니다. 페널티 항만 주어지면 우리의 최적화 알고리즘은 훈련의 각 단계에서 가중치를 감소시킵니다. 기능 선택과 달리 가중치 감소는 기능의 복잡성을 지속적으로 조정하는 메커니즘을 제공합니다. λ의 값이 작을수록 w가 덜 제한되는 반면, λ의 값이 클수록 w가 더 크게 제한됩니다. 해당 바이어스 페널티 b**2를 포함하는지 여부는 구현에 따라 다를 수 있으며 신경망의 계층에 따라 다를 수 있습니다. 종종 우리는 편향 항을 정규화하지 않습니다. 게다가 ℓ2 정규화는 다른 최적화 알고리즘의 가중치 감소와 동일하지 않을 수 있지만 가중치 크기 축소를 통한 정규화 아이디어는 여전히 유효합니다.

3.7.2. High-Dimensional Linear Regression

We can illustrate the benefits of weight decay through a simple synthetic example.

간단한 합성 예를 통해 체중 감소의 이점을 설명할 수 있습니다.

First, we generate some data as before:

먼저 이전과 같이 일부 데이터를 생성합니다.

In this synthetic dataset, our label is given by an underlying linear function of our inputs, corrupted by Gaussian noise with zero mean and standard deviation 0.01. For illustrative purposes, we can make the effects of overfitting pronounced, by increasing the dimensionality of our problem to d=200 and working with a small training set with only 20 examples.

이 합성 데이터 세트에서 레이블은 평균이 0이고 표준 편차가 0.01인 가우스 노이즈로 인해 손상된 입력의 기본 선형 함수로 제공됩니다. 설명을 위해 문제의 차원을 d=200으로 늘리고 20개의 예제만 있는 작은 훈련 세트로 작업하여 과적합의 효과를 뚜렷하게 만들 수 있습니다.

class Data(d2l.DataModule):
    def __init__(self, num_train, num_val, num_inputs, batch_size):
        self.save_hyperparameters()
        n = num_train + num_val
        self.X = torch.randn(n, num_inputs)
        noise = torch.randn(n, 1) * 0.01
        w, b = torch.ones((num_inputs, 1)) * 0.01, 0.05
        self.y = torch.matmul(self.X, w) + b + noise

    def get_dataloader(self, train):
        i = slice(0, self.num_train) if train else slice(self.num_train, None)
        return self.get_tensorloader([self.X, self.y], train, i)

3.7.3. Implementation from Scratc

Now, let’s try implementing weight decay from scratch. Since minibatch stochastic gradient descent is our optimizer, we just need to add the squared ℓ2 penalty to the original loss function.

이제 처음부터 가중치 감소를 구현해 보겠습니다. 미니배치 확률적 경사하강법이 우리의 최적화 프로그램이므로 원래 손실 함수에 제곱된 ℓ2 페널티를 추가하기만 하면 됩니다.

3.7.3.1. Defining ℓ2 Norm Penalty

Perhaps the most convenient way of implementing this penalty is to square all terms in place and sum them.

아마도 이 페널티를 구현하는 가장 편리한 방법은 모든 항을 제곱하고 합하는 것입니다.

def l2_penalty(w):
    return (w ** 2).sum() / 2

3.7.3.2. Defining the Model

In the final model, the linear regression and the squared loss have not changed since Section 3.4, so we will just define a subclass of d2l.LinearRegressionScratch. The only change here is that our loss now includes the penalty term.

최종 모델에서는 선형 회귀와 제곱 손실이 섹션 3.4 이후로 변경되지 않았으므로 d2l.LinearRegressionScratch의 하위 클래스만 정의하겠습니다. 여기서 유일한 변경 사항은 이제 손실에 페널티 기간이 포함된다는 것입니다.

class WeightDecayScratch(d2l.LinearRegressionScratch):
    def __init__(self, num_inputs, lambd, lr, sigma=0.01):
        super().__init__(num_inputs, lr, sigma)
        self.save_hyperparameters()

    def loss(self, y_hat, y):
        return (super().loss(y_hat, y) +
                self.lambd * l2_penalty(self.w))

The following code fits our model on the training set with 20 examples and evaluates it on the validation set with 100 examples.

다음 코드는 20개의 예제가 있는 훈련 세트에 모델을 맞추고 100개의 예제가 있는 검증 세트에서 모델을 평가합니다.

data = Data(num_train=20, num_val=100, num_inputs=200, batch_size=5)
trainer = d2l.Trainer(max_epochs=10)

def train_scratch(lambd):
    model = WeightDecayScratch(num_inputs=200, lambd=lambd, lr=0.01)
    model.board.yscale='log'
    trainer.fit(model, data)
    print('L2 norm of w:', float(l2_penalty(model.w)))

3.7.3.3. Training without Regularization

We now run this code with lambd = 0, disabling weight decay. Note that we overfit badly, decreasing the training error but not the validation error—a textbook case of overfitting.

이제 이 코드를 Lambd = 0으로 실행하여 가중치 감소를 비활성화합니다. 우리는 과적합을 심하게 하여 학습 오류를 줄였지만 검증 오류는 줄이지 않았습니다. 이는 과적합의 교과서적인 사례입니다.

train_scratch(0)

L2 norm of w: 0.009948714636266232

3.7.3.4. Using Weight Decay

Below, we run with substantial weight decay. Note that the training error increases but the validation error decreases. This is precisely the effect we expect from regularization.

아래에서는 상당한 체중 감소를 보여줍니다. 학습 오류는 증가하지만 검증 오류는 감소합니다. 이것이 바로 우리가 정규화에서 기대하는 효과입니다.

train_scratch(3)

L2 norm of w: 0.0017270983662456274

3.7.4. Concise Implementation

Because weight decay is ubiquitous in neural network optimization, the deep learning framework makes it especially convenient, integrating weight decay into the optimization algorithm itself for easy use in combination with any loss function. Moreover, this integration serves a computational benefit, allowing implementation tricks to add weight decay to the algorithm, without any additional computational overhead. Since the weight decay portion of the update depends only on the current value of each parameter, the optimizer must touch each parameter once anyway.

가중치 감소는 신경망 최적화에서 어디에나 존재하기 때문에 딥 러닝 프레임워크는 모든 손실 함수와 결합하여 쉽게 사용할 수 있도록 최적화 알고리즘 자체에 가중치 감소를 통합하여 이를 특히 편리하게 만듭니다. 또한 이러한 통합은 추가 계산 오버헤드 없이 알고리즘에 가중치 감소를 추가하는 구현 트릭을 허용하므로 계산상의 이점을 제공합니다. 업데이트의 가중치 감소 부분은 각 매개변수의 현재 값에만 의존하므로 최적화 프로그램은 어쨌든 각 매개변수를 한 번 터치해야 합니다.

Below, we specify the weight decay hyperparameter directly through weight_decay when instantiating our optimizer. By default, PyTorch decays both weights and biases simultaneously, but we can configure the optimizer to handle different parameters according to different policies. Here, we only set weight_decay for the weights (the net.weight parameters), hence the bias (the net.bias parameter) will not decay.

아래에서는 최적화 프로그램을 인스턴스화할 때 Weight_decay를 통해 직접 가중치 감소 하이퍼파라미터를 지정합니다. 기본적으로 PyTorch는 가중치와 편향을 동시에 감소시키지만, 다양한 정책에 따라 다양한 매개변수를 처리하도록 최적화 프로그램을 구성할 수 있습니다. 여기서는 가중치(net.weight 매개변수)에 대해서만 Weight_decay를 설정하므로 편향(net.bias 매개변수)은 감소하지 않습니다.

class WeightDecay(d2l.LinearRegression):
    def __init__(self, wd, lr):
        super().__init__(lr)
        self.save_hyperparameters()
        self.wd = wd

    def configure_optimizers(self):
        return torch.optim.SGD([
            {'params': self.net.weight, 'weight_decay': self.wd},
            {'params': self.net.bias}], lr=self.lr)

The plot looks similar to that when we implemented weight decay from scratch. However, this version runs faster and is easier to implement, benefits that will become more pronounced as you address larger problems and this work becomes more routine.

플롯은 처음부터 가중치 감소를 구현했을 때와 유사해 보입니다. 그러나 이 버전은 더 빠르게 실행되고 구현하기가 더 쉬우므로 더 큰 문제를 해결하고 이 작업이 더 일상화될수록 이점이 더욱 뚜렷해집니다.

model = WeightDecay(wd=3, lr=0.01)
model.board.yscale='log'
trainer.fit(model, data)

print('L2 norm of w:', float(l2_penalty(model.get_w_b()[0])))

L2 norm of w: 0.013779522851109505

So far, we have touched upon one notion of what constitutes a simple linear function. However, even for simple nonlinear functions, the situation can be much more complex. To see this, the concept of reproducing kernel Hilbert space (RKHS) allows one to apply tools introduced for linear functions in a nonlinear context. Unfortunately, RKHS-based algorithms tend to scale poorly to large, high-dimensional data. In this book we will often adopt the common heuristic whereby weight decay is applied to all layers of a deep network.

지금까지 우리는 단순한 선형 함수를 구성하는 개념 중 하나를 다루었습니다. 그러나 단순한 비선형 함수의 경우에도 상황은 훨씬 더 복잡할 수 있습니다. 이를 확인하기 위해 RKHS(커널 힐베르트 공간 재현) 개념을 사용하면 비선형 맥락에서 선형 함수에 대해 도입된 도구를 적용할 수 있습니다. 불행하게도 RKHS 기반 알고리즘은 대규모 고차원 데이터에 제대로 확장되지 않는 경향이 있습니다. 이 책에서 우리는 딥 네트워크의 모든 계층에 가중치 감소를 적용하는 일반적인 경험적 방법을 자주 채택할 것입니다.

3.7.5. Summary

Regularization is a common method for dealing with overfitting. Classical regularization techniques add a penalty term to the loss function (when training) to reduce the complexity of the learned model. One particular choice for keeping the model simple is using an ℓ2 penalty. This leads to weight decay in the update steps of the minibatch stochastic gradient descent algorithm. In practice, the weight decay functionality is provided in optimizers from deep learning frameworks. Different sets of parameters can have different update behaviors within the same training loop.

정규화는 과적합을 처리하는 일반적인 방법입니다. 고전적인 정규화 기술은 학습된 모델의 복잡성을 줄이기 위해 (훈련 시) 손실 함수에 페널티 항을 추가합니다. 모델을 단순하게 유지하기 위한 한 가지 특별한 선택은 다음을 사용하는 것입니다.
패널티. 이로 인해 미니배치 확률적 경사하강법 알고리즘의 업데이트 단계에서 가중치 감소가 발생합니다. 실제로 가중치 감소 기능은 딥러닝 프레임워크의 최적화 프로그램에서 제공됩니다. 서로 다른 매개변수 세트는 동일한 훈련 루프 내에서 서로 다른 업데이트 동작을 가질 수 있습니다.

3.7.6. Exercises

저작자표시 (새창열림)

'Dive into Deep Learning > D2L Linear Neural Networks' 카테고리의 다른 글

D2L - 3.6. Generalization (1)	2023.12.03
D2L - 4.7. Environment and Distribution Shift (0)	2023.06.27
D2L - 4.6. Generalization in Classification (0)	2023.06.27
D2L - 4.5. Concise Implementation of Softmax Regression¶ (0)	2023.06.27
D2L - 4.4. Softmax Regression Implementation from Scratch (0)	2023.06.26
D2L - 4.3. The Base Classification Model (0)	2023.06.26
D2L - 4.2. The Image Classification Dataset (0)	2023.06.26
D2L 4.1. Softmax Regression (0)	2023.06.26
D2L - 4. Linear Neural Networks for Classification (0)	2023.06.26
D2L - Local Environment Setting (0)	2023.06.26

Dive into Deep Learning/D2L Linear Neural Networks

D2L - 3.6. Generalization

2023. 12. 3. 12:43 | Posted by 솔웅

https://d2l.ai/chapter_linear-regression/generalization.html

3.6. Generalization — Dive into Deep Learning 1.0.3 documentation

d2l.ai

3.6. Generalization

Consider two college students diligently preparing for their final exam. Commonly, this preparation will consist of practicing and testing their abilities by taking exams administered in previous years. Nonetheless, doing well on past exams is no guarantee that they will excel when it matters. For instance, imagine a student, Extraordinary Ellie, whose preparation consisted entirely of memorizing the answers to previous years’ exam questions. Even if Ellie were endowed with an extraordinary memory, and thus could perfectly recall the answer to any previously seen question, she might nevertheless freeze when faced with a new (previously unseen) question. By comparison, imagine another student, Inductive Irene, with comparably poor memorization skills, but a knack for picking up patterns. Note that if the exam truly consisted of recycled questions from a previous year, Ellie would handily outperform Irene. Even if Irene’s inferred patterns yielded 90% accurate predictions, they could never compete with Ellie’s 100% recall. However, even if the exam consisted entirely of fresh questions, Irene might maintain her 90% average.

최종 시험을 부지런히 준비하는 두 명의 대학생을 생각해 보십시오. 일반적으로 이 준비는 전년도에 시행된 시험을 통해 자신의 능력을 연습하고 테스트하는 것으로 구성됩니다. 그럼에도 불구하고 과거 시험에서 좋은 성적을 냈다고 해서 중요한 순간에 뛰어난 성적을 거둘 것이라는 보장은 없습니다. 예를 들어, 전년도 시험 문제에 대한 답을 암기하는 것만으로 준비를 했던 Extraordinary Ellie라는 학생을 상상해 보십시오. Ellie가 특별한 기억력을 부여받아 이전에 본 질문에 대한 답을 완벽하게 기억할 수 있다고 하더라도 새로운(이전에는 볼 수 없었던) 질문에 직면하면 그녀는 얼어붙을 수도 있습니다. 이에 비해 암기 능력은 비교적 낮지만 패턴을 파악하는 능력이 있는 또 다른 학생인 Induction Irene을 상상해 보십시오. 시험이 실제로 전년도의 질문을 재활용하여 구성되었다면 Ellie가 Irene보다 더 좋은 성적을 냈을 것입니다. 아이린이 추론한 패턴이 90% 정확한 예측을 내놨다고 해도 엘리의 100% 회상과 결코 경쟁할 수는 없습니다. 그러나 시험이 완전히 새로운 문제로 구성되더라도 아이린은 평균 90%를 유지할 수 있습니다.

As machine learning scientists, our goal is to discover patterns. But how can we be sure that we have truly discovered a general pattern and not simply memorized our data? Most of the time, our predictions are only useful if our model discovers such a pattern. We do not want to predict yesterday’s stock prices, but tomorrow’s. We do not need to recognize already diagnosed diseases for previously seen patients, but rather previously undiagnosed ailments in previously unseen patients. This problem—how to discover patterns that generalize—is the fundamental problem of machine learning, and arguably of all of statistics. We might cast this problem as just one slice of a far grander question that engulfs all of science: when are we ever justified in making the leap from particular observations to more general statements?

기계 학습 과학자로서 우리의 목표는 패턴을 발견하는 것입니다. 하지만 단순히 데이터를 암기한 것이 아니라 실제로 일반적인 패턴을 발견했다는 것을 어떻게 확신할 수 있습니까? 대부분의 경우 예측은 모델이 그러한 패턴을 발견한 경우에만 유용합니다. 우리는 어제의 주가를 예측하고 싶지 않고 내일의 주가를 예측하고 싶습니다. 우리는 이전에 본 환자에 대해 이미 진단된 질병을 인식할 필요가 없으며, 이전에 보지 못한 환자의 이전에 진단되지 않은 질병을 인식할 필요가 있습니다. 일반화되는 패턴을 발견하는 방법이라는 문제는 기계 학습과 모든 통계의 근본적인 문제입니다. 우리는 이 문제를 모든 과학을 포괄하는 훨씬 더 큰 질문의 한 조각으로 간주할 수 있습니다. 특정 관찰에서 보다 일반적인 진술로 도약하는 것이 언제 정당화될 수 있습니까?

In real life, we must fit our models using a finite collection of data. The typical scales of that data vary wildly across domains. For many important medical problems, we can only access a few thousand data points. When studying rare diseases, we might be lucky to access hundreds. By contrast, the largest public datasets consisting of labeled photographs, e.g., ImageNet (Deng et al., 2009), contain millions of images. And some unlabeled image collections such as the Flickr YFC100M dataset can be even larger, containing over 100 million images (Thomee et al., 2016). However, even at this extreme scale, the number of available data points remains infinitesimally small compared to the space of all possible images at a megapixel resolution. Whenever we work with finite samples, we must keep in mind the risk that we might fit our training data, only to discover that we failed to discover a generalizable pattern.

실생활에서는 유한한 데이터 모음을 사용하여 모델을 맞춰야 합니다. 해당 데이터의 일반적인 규모는 도메인에 따라 크게 다릅니다. 많은 중요한 의료 문제의 경우 우리는 수천 개의 데이터 포인트에만 접근할 수 있습니다. 희귀 질병을 연구할 때 운이 좋게도 수백 가지 질병에 접근할 수 있습니다. 대조적으로, ImageNet(Deng et al., 2009)과 같이 레이블이 지정된 사진으로 구성된 가장 큰 공개 데이터 세트에는 수백만 개의 이미지가 포함되어 있습니다. 그리고 Flickr YFC100M 데이터 세트와 같은 일부 레이블이 없는 이미지 컬렉션은 1억 개가 넘는 이미지를 포함하여 훨씬 더 클 수 있습니다(Thomee et al., 2016). 그러나 이러한 극단적인 규모에서도 사용 가능한 데이터 포인트의 수는 메가픽셀 해상도에서 가능한 모든 이미지의 공간에 비해 무한히 작은 상태로 유지됩니다. 유한한 샘플로 작업할 때마다 훈련 데이터를 적합했지만 일반화 가능한 패턴을 발견하지 못했다는 사실을 발견하게 될 위험을 염두에 두어야 합니다.

The phenomenon of fitting closer to our training data than to the underlying distribution is called overfitting, and techniques for combatting overfitting are often called regularization methods. While it is no substitute for a proper introduction to statistical learning theory (see Boucheron et al. (2005), Vapnik (1998)), we will give you just enough intuition to get going. We will revisit generalization in many chapters throughout the book, exploring both what is known about the principles underlying generalization in various models, and also heuristic techniques that have been found (empirically) to yield improved generalization on tasks of practical interest.

기본 분포보다 훈련 데이터에 더 가깝게 피팅되는 현상을 과적합이라고 하며, 과적합을 방지하는 기술을 종종 정규화 방법이라고 합니다. 이것이 통계적 학습 이론에 대한 적절한 소개를 대체할 수는 없지만(Boucheron et al.(2005), Vapnik(1998) 참조), 시작하는 데 충분한 직관을 제공할 것입니다. 우리는 다양한 모델의 일반화 기본 원리에 대해 알려진 내용과 실제 관심 있는 작업에 대해 개선된 일반화를 산출하기 위해 (경험적으로) 발견된 경험적 기법을 탐구하면서 책 전체의 여러 장에서 일반화를 다시 살펴볼 것입니다.

3.6.1. Training Error and Generalization Error

In the standard supervised learning setting, we assume that the training data and the test data are drawn independently from identical distributions. This is commonly called the IID assumption. While this assumption is strong, it is worth noting that, absent any such assumption, we would be dead in the water. Why should we believe that training data sampled from distribution P(X,Y) should tell us how to make predictions on test data generated by a different distribution Q(X,Y)? Making such leaps turns out to require strong assumptions about how P and Q are related. Later on we will discuss some assumptions that allow for shifts in distribution but first we need to understand the IID case, where P(⋅)=Q(⋅).

표준 지도 학습 설정에서는 훈련 데이터와 테스트 데이터가 동일한 분포에서 독립적으로 추출된다고 가정합니다. 이를 일반적으로 IID 가정이라고 합니다. 이 가정은 강력하지만, 그러한 가정이 없다면 우리는 물 속에서 죽을 것이라는 점은 주목할 가치가 있습니다. 분포 P(X,Y)에서 샘플링된 훈련 데이터가 다른 분포 Q(X,Y)에 의해 생성된 테스트 데이터에 대해 예측하는 방법을 알려주어야 하는 이유는 무엇입니까? 그러한 도약을 위해서는 P와 Q가 어떻게 관련되어 있는지에 대한 강력한 가정이 필요하다는 것이 밝혀졌습니다. 나중에 우리는 분포의 변화를 허용하는 몇 가지 가정에 대해 논의할 것이지만 먼저 P(⋅)=Q(⋅)인 IID 사례를 이해해야 합니다.

To begin with, we need to differentiate between the training error Remp, which is a statistic calculated on the training dataset, and the generalization error R, which is an expectation taken with respect to the underlying distribution. You can think of the generalization error as what you would see if you applied your model to an infinite stream of additional data examples drawn from the same underlying data distribution. Formally the training error is expressed as a sum (with the same notation as Section 3.1):

우선, 훈련 데이터세트에서 계산된 통계인 훈련 오류 Remp와 기본 분포에 대한 기대값인 일반화 오류 R을 구별해야 합니다. 일반화 오류는 동일한 기본 데이터 분포에서 추출된 추가 데이터 예제의 무한한 스트림에 모델을 적용한 경우 표시되는 오류로 생각할 수 있습니다. 공식적으로 훈련 오류는 합계로 표현됩니다(섹션 3.1과 동일한 표기법 사용).

while the generalization error is expressed as an integral:

일반화 오류는 적분으로 표현됩니다.

Problematically, we can never calculate the generalization error R exactly. Nobody ever tells us the precise form of the density function p(x,y). Moreover, we cannot sample an infinite stream of data points. Thus, in practice, we must estimate the generalization error by applying our model to an independent test set constituted of a random selection of examples X′ and labels y′ that were withheld from our training set. This consists of applying the same formula that was used for calculating the empirical training error but to a test set X′,y′.

문제는 일반화 오류 R을 정확하게 계산할 수 없다는 점입니다. 밀도 함수 p(x,y)의 정확한 형태를 알려주는 사람은 아무도 없습니다. 게다가 무한한 데이터 포인트 스트림을 샘플링할 수도 없습니다. 따라서 실제로는 훈련 세트에서 보류된 X' 및 레이블 y'의 무작위 선택으로 구성된 독립적인 테스트 세트에 모델을 적용하여 일반화 오류를 추정해야 합니다. 이는 경험적 훈련 오류를 계산하는 데 사용된 것과 동일한 공식을 테스트 세트 X',y'에 적용하는 것으로 구성됩니다.

Crucially, when we evaluate our classifier on the test set, we are working with a fixed classifier (it does not depend on the sample of the test set), and thus estimating its error is simply the problem of mean estimation. However the same cannot be said for the training set. Note that the model we wind up with depends explicitly on the selection of the training set and thus the training error will in general be a biased estimate of the true error on the underlying population. The central question of generalization is then when should we expect our training error to be close to the population error (and thus the generalization error).

결정적으로 테스트 세트에서 분류기를 평가할 때 고정된 분류기를 사용하여 작업하므로(테스트 세트의 샘플에 의존하지 않음) 오류를 추정하는 것은 단순히 평균 추정의 문제입니다. 그러나 훈련 세트에 대해서도 마찬가지입니다. 우리가 마무리하는 모델은 훈련 세트의 선택에 명시적으로 의존하므로 훈련 오류는 일반적으로 기본 모집단의 실제 오류에 대한 편향된 추정치입니다. 일반화의 핵심 질문은 언제 훈련 오류가 모집단 오류(따라서 일반화 오류)에 가까워질 것으로 예상해야 하는가입니다.

3.6.1.1. Model Complexit

In classical theory, when we have simple models and abundant data, the training and generalization errors tend to be close. However, when we work with more complex models and/or fewer examples, we expect the training error to go down but the generalization gap to grow. This should not be surprising. Imagine a model class so expressive that for any dataset of n examples, we can find a set of parameters that can perfectly fit arbitrary labels, even if randomly assigned. In this case, even if we fit our training data perfectly, how can we conclude anything about the generalization error? For all we know, our generalization error might be no better than random guessing.

고전 이론에서는 단순한 모델과 풍부한 데이터가 있을 때 훈련 및 일반화 오류가 가까운 경향이 있습니다. 그러나 더 복잡한 모델 및/또는 더 적은 수의 예제를 사용하면 학습 오류는 줄어들지만 일반화 격차는 커질 것으로 예상됩니다. 이것은 놀라운 일이 아닙니다. n개의 예제로 구성된 데이터세트에 대해 무작위로 할당되더라도 임의의 레이블에 완벽하게 맞는 매개변수 집합을 찾을 수 있을 만큼 표현력이 뛰어난 모델 클래스를 상상해 보세요. 이 경우 훈련 데이터를 완벽하게 적합하더라도 일반화 오류에 대해 어떻게 결론을 내릴 수 있습니까? 우리가 아는 한, 일반화 오류는 무작위 추측보다 나을 것이 없을 수도 있습니다.

In general, absent any restriction on our model class, we cannot conclude, based on fitting the training data alone, that our model has discovered any generalizable pattern (Vapnik et al., 1994). On the other hand, if our model class was not capable of fitting arbitrary labels, then it must have discovered a pattern. Learning-theoretic ideas about model complexity derived some inspiration from the ideas of Karl Popper, an influential philosopher of science, who formalized the criterion of falsifiability. According to Popper, a theory that can explain any and all observations is not a scientific theory at all! After all, what has it told us about the world if it has not ruled out any possibility? In short, what we want is a hypothesis that could not explain any observations we might conceivably make and yet nevertheless happens to be compatible with those observations that we in fact make.

일반적으로 모델 클래스에 대한 제한이 없으면 훈련 데이터만 피팅하는 것만으로는 모델이 일반화 가능한 패턴을 발견했다고 결론을 내릴 수 없습니다(Vapnik et al., 1994). 반면, 모델 클래스가 임의의 레이블을 맞출 수 없다면 패턴을 발견했을 것입니다. 모델 복잡성에 대한 학습 이론적인 아이디어는 반증 가능성의 기준을 공식화한 영향력 있는 과학 철학자 칼 포퍼(Karl Popper)의 아이디어에서 영감을 얻었습니다. 포퍼에 따르면, 모든 관찰을 설명할 수 있는 이론은 전혀 과학 이론이 아닙니다! 결국, 어떤 가능성도 배제하지 않는다면 세상은 우리에게 무엇을 말해주는 것일까요? 간단히 말해서, 우리가 원하는 것은 우리가 할 수 있는 어떤 관찰도 설명할 수 없지만 그럼에도 불구하고 실제로 우리가 하는 관찰과 양립할 수 있는 가설입니다.

Now what precisely constitutes an appropriate notion of model complexity is a complex matter. Often, models with more parameters are able to fit a greater number of arbitrarily assigned labels. However, this is not necessarily true. For instance, kernel methods operate in spaces with infinite numbers of parameters, yet their complexity is controlled by other means (Schölkopf and Smola, 2002). One notion of complexity that often proves useful is the range of values that the parameters can take. Here, a model whose parameters are permitted to take arbitrary values would be more complex. We will revisit this idea in the next section, when we introduce weight decay, your first practical regularization technique. Notably, it can be difficult to compare complexity among members of substantially different model classes (say, decision trees vs. neural networks).

이제 모델 복잡성에 대한 적절한 개념을 정확히 구성하는 것은 복잡한 문제입니다. 매개변수가 더 많은 모델은 임의로 할당된 레이블을 더 많이 수용할 수 있는 경우가 많습니다. 그러나 이것이 반드시 사실은 아닙니다. 예를 들어, 커널 방법은 무한한 수의 매개변수가 있는 공간에서 작동하지만 그 복잡성은 다른 수단으로 제어됩니다(Schölkopf 및 Smola, 2002). 종종 유용하다고 입증되는 복잡성에 대한 한 가지 개념은 매개변수가 취할 수 있는 값의 범위입니다. 여기서 매개변수가 임의의 값을 취하도록 허용된 모델은 더 복잡합니다. 첫 번째 실용적인 정규화 기술인 가중치 감소를 소개하는 다음 섹션에서 이 아이디어를 다시 살펴보겠습니다. 특히, 실질적으로 다른 모델 클래스(예: 의사결정 트리와 신경망)의 구성원 간의 복잡성을 비교하는 것은 어려울 수 있습니다.

At this point, we must stress another important point that we will revisit when introducing deep neural networks. When a model is capable of fitting arbitrary labels, low training error does not necessarily imply low generalization error. However, it does not necessarily imply high generalization error either! All we can say with confidence is that low training error alone is not enough to certify low generalization error. Deep neural networks turn out to be just such models: while they generalize well in practice, they are too powerful to allow us to conclude much on the basis of training error alone. In these cases we must rely more heavily on our holdout data to certify generalization after the fact. Error on the holdout data, i.e., validation set, is called the validation error.

이 시점에서 우리는 심층 신경망을 도입할 때 다시 살펴볼 또 다른 중요한 점을 강조해야 합니다. 모델이 임의의 레이블을 맞출 수 있는 경우 훈련 오류가 낮다고 해서 반드시 일반화 오류가 낮다는 의미는 아닙니다. 그러나 이것이 반드시 높은 일반화 오류를 의미하는 것은 아닙니다! 우리가 자신있게 말할 수 있는 것은 낮은 훈련 오류만으로는 낮은 일반화 오류를 인증하는 데 충분하지 않다는 것입니다. 심층 신경망은 바로 그러한 모델임이 밝혀졌습니다. 실제로는 잘 일반화되지만 훈련 오류만으로 많은 결론을 내릴 수 없을 정도로 강력합니다. 이러한 경우 사실 이후 일반화를 인증하기 위해 홀드아웃 데이터에 더 많이 의존해야 합니다. 홀드아웃 데이터, 즉 검증 세트에 대한 오류를 검증 오류라고 합니다.

3.6.2. Underfitting or Overfitting?

When we compare the training and validation errors, we want to be mindful of two common situations. First, we want to watch out for cases when our training error and validation error are both substantial but there is a little gap between them. If the model is unable to reduce the training error, that could mean that our model is too simple (i.e., insufficiently expressive) to capture the pattern that we are trying to model. Moreover, since the generalization gap (Remp−R) between our training and generalization errors is small, we have reason to believe that we could get away with a more complex model. This phenomenon is known as underfitting.

훈련 오류와 검증 오류를 비교할 때 두 가지 일반적인 상황에 유의하고 싶습니다. 먼저, 훈련 오류와 검증 오류가 모두 상당하지만 그 사이에 약간의 차이가 있는 경우를 주의하고 싶습니다. 모델이 훈련 오류를 줄일 수 없다면 이는 모델이 너무 단순하여(즉, 표현력이 부족하여) 모델링하려는 패턴을 포착할 수 없음을 의미할 수 있습니다. 더욱이 훈련 오류와 일반화 오류 사이의 일반화 격차(Remp-R)가 작기 때문에 더 복잡한 모델을 사용하여 벗어날 수 있다고 믿을 이유가 있습니다. 이 현상을 과소적합이라고 합니다.

On the other hand, as we discussed above, we want to watch out for the cases when our training error is significantly lower than our validation error, indicating severe overfitting. Note that overfitting is not always a bad thing. In deep learning especially, the best predictive models often perform far better on training data than on holdout data. Ultimately, we usually care about driving the generalization error lower, and only care about the gap insofar as it becomes an obstacle to that end. Note that if the training error is zero, then the generalization gap is precisely equal to the generalization error and we can make progress only by reducing the gap.

반면, 위에서 논의한 것처럼 훈련 오류가 검증 오류보다 현저히 낮아 심각한 과적합을 나타내는 경우를 주의해야 합니다. 과적합이 항상 나쁜 것은 아닙니다. 특히 딥 러닝에서는 최고의 예측 모델이 홀드아웃 데이터보다 훈련 데이터에서 훨씬 더 나은 성능을 발휘하는 경우가 많습니다. 궁극적으로 우리는 일반적으로 일반화 오류를 낮추는 데 관심을 갖고, 그 목적에 장애물이 되는 한 격차에만 관심을 갖습니다. 훈련 오류가 0이면 일반화 격차는 일반화 오류와 정확하게 동일하며 격차를 줄여야만 진전을 이룰 수 있습니다.

3.6.2.1. Polynomial Curve Fitting

To illustrate some classical intuition about overfitting and model complexity, consider the following: given training data consisting of a single feature x and a corresponding real-valued label y, we try to find the polynomial of degree d for estimating the label y.

과적합 및 모델 복잡성에 대한 몇 가지 고전적 직관을 설명하기 위해 다음을 고려하십시오. 단일 특성 x와 해당 실제 값 레이블 y로 구성된 훈련 데이터가 주어지면 레이블 y를 추정하기 위해 d차 다항식을 찾으려고 합니다.

This is just a linear regression problem where our features are given by the powers of x, the model’s weights are given by wi, and the bias is given by w0 since x**0=1 for all x. Since this is just a linear regression problem, we can use the squared error as our loss function.

이는 모든 x에 대해 x**0=1이므로 특성이 x의 거듭제곱으로 제공되고 모델의 가중치가 wi로 제공되며 편향이 w0으로 제공되는 선형 회귀 문제입니다. 이것은 선형 회귀 문제이므로 제곱 오차를 손실 함수로 사용할 수 있습니다.

A higher-order polynomial function is more complex than a lower-order polynomial function, since the higher-order polynomial has more parameters and the model function’s selection range is wider. Fixing the training dataset, higher-order polynomial functions should always achieve lower (at worst, equal) training error relative to lower-degree polynomials. In fact, whenever each data example has a distinct value of x, a polynomial function with degree equal to the number of data examples can fit the training set perfectly. We compare the relationship between polynomial degree (model complexity) and both underfitting and overfitting in Fig. 3.6.1.

고차 다항식 함수는 저차 다항식 함수보다 더 복잡합니다. 왜냐하면 고차 다항식은 더 많은 매개변수를 갖고 모델 함수의 선택 범위가 더 넓기 때문입니다. 훈련 데이터 세트를 수정하면 고차 다항식 함수는 항상 저차 다항식에 비해 더 낮은(최악의 경우 동일한) 훈련 오류를 달성해야 합니다. 실제로 각 데이터 예제에 고유한 x 값이 있을 때마다 데이터 예제 수와 동일한 차수를 갖는 다항식 함수가 훈련 세트에 완벽하게 맞을 수 있습니다. 그림 3.6.1에서 다항식 차수(모델 복잡도)와 과소적합 및 과적합 간의 관계를 비교합니다.

Fig. 3.6.1  Influence of model complexity on underfitting and overfitting.

3.6.2.2. Dataset Siz

As the above bound already indicates, another big consideration to bear in mind is dataset size. Fixing our model, the fewer samples we have in the training dataset, the more likely (and more severely) we are to encounter overfitting. As we increase the amount of training data, the generalization error typically decreases. Moreover, in general, more data never hurts. For a fixed task and data distribution, model complexity should not increase more rapidly than the amount of data. Given more data, we might attempt to fit a more complex model. Absent sufficient data, simpler models may be more difficult to beat. For many tasks, deep learning only outperforms linear models when many thousands of training examples are available. In part, the current success of deep learning owes considerably to the abundance of massive datasets arising from Internet companies, cheap storage, connected devices, and the broad digitization of the economy.

위의 한계에서 이미 알 수 있듯이 염두에 두어야 할 또 다른 큰 고려 사항은 데이터 세트 크기입니다. 모델을 수정하면 훈련 데이터 세트에 있는 샘플 수가 줄어들수록 과적합이 발생할 가능성이 더 높아집니다(심각하게도). 훈련 데이터의 양을 늘리면 일반적으로 일반화 오류가 감소합니다. 또한 일반적으로 더 많은 데이터가 해를 끼치 지 않습니다. 고정된 작업 및 데이터 분포의 경우 모델 복잡성이 데이터 양보다 더 빠르게 증가해서는 안 됩니다. 더 많은 데이터가 주어지면 더 복잡한 모델을 맞추려고 시도할 수도 있습니다. 데이터가 충분하지 않으면 단순한 모델을 이기기가 더 어려울 수 있습니다. 많은 작업에서 딥 러닝은 수천 개의 학습 예제를 사용할 수 있는 경우에만 선형 모델보다 성능이 뛰어납니다. 부분적으로 현재 딥 러닝의 성공은 인터넷 회사, 저렴한 스토리지, 연결된 장치 및 경제의 광범위한 디지털화에서 발생하는 풍부한 대규모 데이터 세트에 크게 기인합니다.

3.6.3. Model Selection

Typically, we select our final model only after evaluating multiple models that differ in various ways (different architectures, training objectives, selected features, data preprocessing, learning rates, etc.). Choosing among many models is aptly called model selection.

일반적으로 우리는 다양한 방식(다양한 아키텍처, 교육 목표, 선택한 기능, 데이터 전처리, 학습 속도 등)이 다른 여러 모델을 평가한 후에만 최종 모델을 선택합니다. 여러 모델 중에서 선택하는 것을 적절하게는 모델 선택이라고 합니다.

In principle, we should not touch our test set until after we have chosen all our hyperparameters. Were we to use the test data in the model selection process, there is a risk that we might overfit the test data. Then we would be in serious trouble. If we overfit our training data, there is always the evaluation on test data to keep us honest. But if we overfit the test data, how would we ever know? See Ong et al. (2005) for an example of how this can lead to absurd results even for models where the complexity can be tightly controlled.

원칙적으로 모든 하이퍼파라미터를 선택할 때까지 테스트 세트를 건드리면 안 됩니다. 모델 선택 과정에서 테스트 데이터를 사용한다면 테스트 데이터에 과적합될 위험이 있습니다. 그러면 우리는 심각한 문제에 빠지게 될 것입니다. 훈련 데이터를 과대적합하는 경우 정직성을 유지하기 위해 항상 테스트 데이터에 대한 평가가 있습니다. 하지만 테스트 데이터에 과대적합되면 어떻게 알 수 있을까요? Ong et al. (2005)은 복잡성이 엄격하게 제어될 수 있는 모델의 경우에도 이것이 어떻게 터무니없는 결과로 이어질 수 있는지에 대한 예를 제공합니다.

Thus, we should never rely on the test data for model selection. And yet we cannot rely solely on the training data for model selection either because we cannot estimate the generalization error on the very data that we use to train the model.

따라서 모델 선택을 위해 테스트 데이터에 의존해서는 안 됩니다. 그러나 모델을 훈련하는 데 사용하는 바로 그 데이터에 대한 일반화 오류를 추정할 수 없기 때문에 모델 선택을 위해 훈련 데이터에만 의존할 수는 없습니다.

In practical applications, the picture gets muddier. While ideally we would only touch the test data once, to assess the very best model or to compare a small number of models with each other, real-world test data is seldom discarded after just one use. We can seldom afford a new test set for each round of experiments. In fact, recycling benchmark data for decades can have a significant impact on the development of algorithms, e.g., for image classification and optical character recognition.

실제 적용에서는 그림이 더 흐릿해집니다. 이상적으로는 테스트 데이터를 한 번만 만지는 반면, 최고의 모델을 평가하거나 소수의 모델을 서로 비교하기 위해 실제 테스트 데이터는 한 번만 사용한 후 거의 삭제되지 않습니다. 우리는 각 실험 라운드마다 새로운 테스트 세트를 제공할 여력이 거의 없습니다. 실제로 수십 년 동안 벤치마크 데이터를 재활용하면 이미지 분류, 광학 문자 인식 등의 알고리즘 개발에 상당한 영향을 미칠 수 있습니다.

The common practice for addressing the problem of training on the test set is to split our data three ways, incorporating a validation set in addition to the training and test datasets. The result is a murky business where the boundaries between validation and test data are worryingly ambiguous. Unless explicitly stated otherwise, in the experiments in this book we are really working with what should rightly be called training data and validation data, with no true test sets. Therefore, the accuracy reported in each experiment of the book is really the validation accuracy and not a true test set accuracy.

테스트 세트에 대한 교육 문제를 해결하기 위한 일반적인 방법은 데이터를 세 가지 방식으로 분할하고 교육 및 테스트 데이터세트 외에 검증 세트를 통합하는 것입니다. 그 결과 검증 데이터와 테스트 데이터 사이의 경계가 걱정스러울 정도로 모호한 비즈니스가 불투명해졌습니다. 달리 명시적으로 언급하지 않는 한, 이 책의 실험에서 우리는 실제 테스트 세트 없이 훈련 데이터와 검증 데이터라고 해야 할 것을 실제로 사용하고 있습니다. 따라서 책의 각 실험에서 보고된 정확도는 실제 검증 정확도이지 실제 테스트 세트 정확도가 아닙니다.

3.6.3.1. Cross-Validation

When training data is scarce, we might not even be able to afford to hold out enough data to constitute a proper validation set. One popular solution to this problem is to employ K-fold cross-validation. Here, the original training data is split into K non-overlapping subsets. Then model training and validation are executed K times, each time training on K −1 subsets and validating on a different subset (the one not used for training in that round). Finally, the training and validation errors are estimated by averaging over the results from the K experiments.

훈련 데이터가 부족하면 적절한 검증 세트를 구성하기에 충분한 데이터를 보유할 여력조차 없을 수도 있습니다. 이 문제에 대한 인기 있는 해결책 중 하나는 K-겹 교차 검증을 사용하는 것입니다. 여기서는 원본 훈련 데이터가 K개의 겹치지 않는 하위 집합으로 분할됩니다. 그런 다음 모델 훈련 및 검증이 K번 실행되며, 매번 K −1 하위 집합에 대해 훈련하고 다른 하위 집합(해당 라운드에서 훈련에 사용되지 않은 것)에 대해 검증합니다. 마지막으로 훈련 및 검증 오류는 K 실험 결과를 평균하여 추정됩니다.

3.6.4. Summary

This section explored some of the underpinnings of generalization in machine learning. Some of these ideas become complicated and counterintuitive when we get to deeper models; here, models are capable of overfitting data badly, and the relevant notions of complexity can be both implicit and counterintuitive (e.g., larger architectures with more parameters generalizing better). We leave you with a few rules of thumb:

이 섹션에서는 기계 학습에서 일반화의 몇 가지 토대를 살펴보았습니다. 이러한 아이디어 중 일부는 더 심층적인 모델에 도달하면 복잡해지고 직관에 반하게 됩니다. 여기서 모델은 데이터를 잘못 과적합할 수 있으며 관련 복잡성 개념은 암시적일 수도 있고 반직관적일 수도 있습니다(예: 더 많은 매개변수를 가진 더 큰 아키텍처가 더 잘 일반화됨). 몇 가지 경험 법칙을 알려드리겠습니다.

Use validation sets (or K-fold cross-validation) for model selection;
모델 선택을 위해 검증 세트(또는 K-겹 교차 검증)를 사용합니다.
More complex models often require more data;
더 복잡한 모델에는 더 많은 데이터가 필요한 경우가 많습니다.
Relevant notions of complexity include both the number of parameters and the range of values that they are allowed to take;
복잡성과 관련된 개념에는 매개변수의 수와 허용되는 값의 범위가 모두 포함됩니다.
Keeping all else equal, more data almost always leads to better generalization;
다른 모든 것을 동일하게 유지하면 더 많은 데이터가 거의 항상 더 나은 일반화로 이어집니다.
This entire talk of generalization is all predicated on the IID assumption. If we relax this assumption, allowing for distributions to shift between the train and testing periods, then we cannot say anything about generalization absent a further (perhaps milder) assumption.
일반화에 대한 이 전체 이야기는 모두 IID 가정에 근거합니다. 이 가정을 완화하여 열차와 테스트 기간 사이에 분포가 이동하도록 허용하면 추가(아마도 더 온화한) 가정 없이 일반화에 대해 아무 말도 할 수 없습니다.

3.6.5. Exercises

When can you solve the problem of polynomial regression exactly?
Give at least five examples where dependent random variables make treating the problem as IID data inadvisable.
Can you ever expect to see zero training error? Under which circumstances would you see zero generalization error?
Why is K -fold cross-validation very expensive to compute?
Why is the K -fold cross-validation error estimate biased?
The VC dimension is defined as the maximum number of points that can be classified with arbitrary labels {±1} by a function of a class of functions. Why might this not be a good idea for measuring how complex the class of functions is? Hint: consider the magnitude of the functions.
Your manager gives you a difficult dataset on which your current algorithm does not perform so well. How would you justify to him that you need more data? Hint: you cannot increase the data but you can decrease it.

저작자표시 (새창열림)

'Dive into Deep Learning > D2L Linear Neural Networks' 카테고리의 다른 글

D2L - 3.7. Weight Decay (1)	2023.12.03
D2L - 4.7. Environment and Distribution Shift (0)	2023.06.27
D2L - 4.6. Generalization in Classification (0)	2023.06.27
D2L - 4.5. Concise Implementation of Softmax Regression¶ (0)	2023.06.27
D2L - 4.4. Softmax Regression Implementation from Scratch (0)	2023.06.26
D2L - 4.3. The Base Classification Model (0)	2023.06.26
D2L - 4.2. The Image Classification Dataset (0)	2023.06.26
D2L 4.1. Softmax Regression (0)	2023.06.26
D2L - 4. Linear Neural Networks for Classification (0)	2023.06.26
D2L - Local Environment Setting (0)	2023.06.26

Dive into Deep Learning/D2L Preliminaries

D2L - 2.7. Documentation

2023. 10. 14. 00:35 | Posted by 솔웅

https://d2l.ai/chapter_preliminaries/lookup-api.html

2.7. Documentation — Dive into Deep Learning 1.0.3 documentation

d2l.ai

2.7. Documentation

While we cannot possibly introduce every single PyTorch function and class (and the information might become outdated quickly), the API documentation and additional tutorials and examples provide such documentation. This section provides some guidance for how to explore the PyTorch API.

PyTorch의 모든 기능과 클래스를 모두 소개할 수는 없지만(정보가 빨리 구식이 될 수 있음) API 문서와 추가 튜토리얼 및 예제에서 이러한 문서를 제공합니다. 이 섹션에서는 PyTorch API를 탐색하는 방법에 대한 몇 가지 지침을 제공합니다.

import torch

2.7.1. Functions and Classes in a Module

To know which functions and classes can be called in a module, we invoke the dir function. For instance, we can query all properties in the module for generating random numbers:

모듈에서 어떤 함수와 클래스를 호출할 수 있는지 알기 위해 dir 함수를 호출합니다. 예를 들어 난수 생성을 위해 모듈의 모든 속성을 쿼리할 수 있습니다.

print(dir(torch.distributions))

주어진 코드는 PyTorch에서 확률 분포와 관련된 클래스와 함수를 탐색하기 위해 사용되는 코드입니다. 코드는 PyTorch의 torch 라이브러리를 사용하며, torch.Distributions 모듈 아래에 있는 클래스와 함수를 나열합니다. 아래는 코드의 설명입니다:

import torch: PyTorch 라이브러리를 임포트합니다. PyTorch는 딥 러닝 및 확률적 모델링을 위한 라이브러리로 널리 사용됩니다.
print(dir(torch.Distributions)): torch.Distributions 모듈 아래에 있는 클래스와 함수를 나열하고 출력합니다. 이 명령은 해당 모듈에 포함된 모든 객체 및 함수의 이름을 보여줍니다. 이를 통해 PyTorch에서 제공하는 확률 분포와 관련된 클래스와 함수의 리스트를 확인할 수 있습니다.

PyTorch의 torch.Distributions 모듈에는 다양한 확률 분포를 다루는 클래스와 함수가 포함되어 있으며, 이러한 분포를 사용하여 확률적 모델링을 수행할 수 있습니다. 이 코드를 실행하면 해당 모듈의 내용을 확인할 수 있으며, 확률 분포와 관련된 작업을 수행할 때 유용한 클래스와 함수를 찾을 수 있습니다.

['AbsTransform', 'AffineTransform', 'Bernoulli', 'Beta', 'Binomial', 'CatTransform', 'Categorical', 'Cauchy', 'Chi2', 'ComposeTransform', 'ContinuousBernoulli', 'CorrCholeskyTransform', 'CumulativeDistributionTransform', 'Dirichlet', 'Distribution', 'ExpTransform', 'Exponential', 'ExponentialFamily', 'FisherSnedecor', 'Gamma', 'Geometric', 'Gumbel', 'HalfCauchy', 'HalfNormal', 'Independent', 'IndependentTransform', 'Kumaraswamy', 'LKJCholesky', 'Laplace', 'LogNormal', 'LogisticNormal', 'LowRankMultivariateNormal', 'LowerCholeskyTransform', 'MixtureSameFamily', 'Multinomial', 'MultivariateNormal', 'NegativeBinomial', 'Normal', 'OneHotCategorical', 'OneHotCategoricalStraightThrough', 'Pareto', 'Poisson', 'PositiveDefiniteTransform', 'PowerTransform', 'RelaxedBernoulli', 'RelaxedOneHotCategorical', 'ReshapeTransform', 'SigmoidTransform', 'SoftmaxTransform', 'SoftplusTransform', 'StackTransform', 'StickBreakingTransform', 'StudentT', 'TanhTransform', 'Transform', 'TransformedDistribution', 'Uniform', 'VonMises', 'Weibull', 'Wishart', '__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', 'bernoulli', 'beta', 'biject_to', 'binomial', 'categorical', 'cauchy', 'chi2', 'constraint_registry', 'constraints', 'continuous_bernoulli', 'dirichlet', 'distribution', 'exp_family', 'exponential', 'fishersnedecor', 'gamma', 'geometric', 'gumbel', 'half_cauchy', 'half_normal', 'identity_transform', 'independent', 'kl', 'kl_divergence', 'kumaraswamy', 'laplace', 'lkj_cholesky', 'log_normal', 'logistic_normal', 'lowrank_multivariate_normal', 'mixture_same_family', 'multinomial', 'multivariate_normal', 'negative_binomial', 'normal', 'one_hot_categorical', 'pareto', 'poisson', 'register_kl', 'relaxed_bernoulli', 'relaxed_categorical', 'studentT', 'transform_to', 'transformed_distribution', 'transforms', 'uniform', 'utils', 'von_mises', 'weibull', 'wishart']

Generally, we can ignore functions that start and end with __ (special objects in Python) or functions that start with a single _(usually internal functions). Based on the remaining function or attribute names, we might hazard a guess that this module offers various methods for generating random numbers, including sampling from the uniform distribution (uniform), normal distribution (normal), and multinomial distribution (multinomial).

일반적으로 __(파이썬의 특수 객체)로 시작하고 끝나는 함수나 단일 _(일반적으로 내부 함수)로 시작하는 함수를 무시할 수 있습니다. 나머지 함수 또는 속성 이름을 기반으로 이 모듈이 균일 분포(균일), 정규 분포(정규) 및 다항 분포(다항)로부터의 샘플링을 포함하여 난수를 생성하는 다양한 방법을 제공한다고 추측할 수 있습니다.

2.7.2. Specific Functions and Classes

For specific instructions on how to use a given function or class, we can invoke the help function. As an example, let’s explore the usage instructions for tensors’ ones function.

주어진 함수나 클래스를 사용하는 방법에 대한 구체적인 지침을 보려면 도움말 함수를 호출할 수 있습니다. 예를 들어, 텐서의 함수에 대한 사용 지침을 살펴보겠습니다.

help(torch.ones)

주어진 코드는 PyTorch에서 torch.Ones 함수에 대한 도움말 문서를 출력하는 코드입니다. help() 함수는 파이썬 내장 함수로, 주어진 객체나 함수에 대한 도움말 문서를 출력합니다. 아래는 코드의 설명입니다:

help(torch.Ones): 이 코드는 PyTorch의 torch.Ones 함수에 대한 도움말을 요청합니다. torch.Ones 함수는 텐서를 생성하며, 모든 요소가 1인 텐서를 생성합니다. 이 함수의 도움말 문서에는 함수의 사용법, 매개변수, 반환값 등에 대한 정보가 포함되어 있을 것입니다.

도움말 문서는 함수나 클래스의 사용법을 이해하고 해당 함수 또는 클래스를 효과적으로 활용하기 위해 유용한 정보를 제공합니다. PyTorch와 같은 라이브러리의 도움말 문서를 읽어보면 함수나 클래스의 매개변수와 반환값에 대한 이해를 높일 수 있으며, 코드를 개발하고 디버그하는 데 도움이 됩니다.

Help on built-in function ones in module torch:

ones(...)
    ones(*size, *, out=None, dtype=None, layout=torch.strided, device=None, requires_grad=False) -> Tensor
    
    Returns a tensor filled with the scalar value `1`, with the shape defined
    by the variable argument :attr:`size`.
    
    Args:
        size (int...): a sequence of integers defining the shape of the output tensor.
            Can be a variable number of arguments or a collection like a list or tuple.
    
    Keyword arguments:
        out (Tensor, optional): the output tensor.
        dtype (:class:`torch.dtype`, optional): the desired data type of returned tensor.
            Default: if ``None``, uses a global default (see :func:`torch.set_default_tensor_type`).
        layout (:class:`torch.layout`, optional): the desired layout of returned Tensor.
            Default: ``torch.strided``.
        device (:class:`torch.device`, optional): the desired device of returned tensor.
            Default: if ``None``, uses the current device for the default tensor type
            (see :func:`torch.set_default_tensor_type`). :attr:`device` will be the CPU
            for CPU tensor types and the current CUDA device for CUDA tensor types.
        requires_grad (bool, optional): If autograd should record operations on the
            returned tensor. Default: ``False``.
    
    Example::
    
        >>> torch.ones(2, 3)
        tensor([[ 1.,  1.,  1.],
                [ 1.,  1.,  1.]])
    
        >>> torch.ones(5)
        tensor([ 1.,  1.,  1.,  1.,  1.])

From the documentation, we can see that the ones function creates a new tensor with the specified shape and sets all the elements to the value of 1. Whenever possible, you should run a quick test to confirm your interpretation:

문서에서 ones 함수가 지정된 모양을 가진 새 텐서를 생성하고 모든 요소를 1의 값으로 설정하는 것을 볼 수 있습니다. 가능할 때마다 빠른 테스트를 실행하여 해석을 확인해야 합니다.

torch.ones(4)

주어진 코드는 PyTorch에서 torch.Ones(4)를 사용하여 모든 요소가 1로 초기화된 1차원 텐서를 생성하는 코드입니다. 아래는 코드의 설명입니다:

torch.Ones(4): 이 코드는 PyTorch에서 torch.Ones 함수를 호출하여 4개의 요소로 구성된 1차원 텐서를 생성합니다. torch.Ones 함수는 모든 요소가 1로 초기화된 텐서를 생성하는 함수입니다. 따라서 이 코드는 길이가 4이고 모든 요소가 1로 초기화된 1차원 텐서를 생성합니다.

이렇게 생성된 텐서는 PyTorch를 사용하여 다양한 수치 연산을 수행하는 데 사용할 수 있으며, 데이터 처리 및 머신 러닝 작업에서 유용하게 활용될 수 있습니다.

tensor([1., 1., 1., 1.])

In the Jupyter notebook, we can use ? to display the document in another window. For example, list? will create content that is almost identical to help(list), displaying it in a new browser window. In addition, if we use two question marks, such as list??, the Python code implementing the function will also be displayed.

Jupyter 노트북에서는? 문서를 다른 창에 표시하려면 예를 들어 목록? help(list)와 거의 동일한 콘텐츠를 생성하여 새 브라우저 창에 표시합니다. 또한 list??와 같은 물음표 두 개를 사용하면 해당 함수를 구현하는 Python 코드도 표시됩니다.

The official documentation provides plenty of descriptions and examples that are beyond this book. We emphasize important use cases that will get you started quickly with practical problems, rather than completeness of coverage. We also encourage you to study the source code of the libraries to see examples of high-quality implementations of production code. By doing this you will become a better engineer in addition to becoming a better scientist.

공식 문서는 이 책보다 더 많은 설명과 예제를 제공합니다. 우리는 적용 범위의 완전성보다는 실제 문제를 빠르게 시작할 수 있는 중요한 사용 사례를 강조합니다. 또한 프로덕션 코드의 고품질 구현 예를 보려면 라이브러리의 소스 코드를 연구하는 것이 좋습니다. 이렇게 함으로써 당신은 더 나은 과학자가 될 뿐만 아니라 더 나은 엔지니어가 될 것입니다.

저작자표시 (새창열림)

'Dive into Deep Learning > D2L Preliminaries' 카테고리의 다른 글

D2L - 2.6. Probability and Statistics (0)	2023.10.14
D2L - 2.5. Automatic Differentiation (0)	2023.10.12
D2L - 2.4. Calculus : 미적분 (1)	2023.10.12
D2L - 2.3. Linear Algebra - 선형 대수학 (1)	2023.10.11
D2L - 2.2. Data Preprocessing (0)	2023.10.09
D2L - 2.1. Data Manipulation (0)	2023.10.09
D2L - 2. Preliminaries (0)	2023.10.09

Dive into Deep Learning/D2L Preliminaries

D2L - 2.6. Probability and Statistics

2023. 10. 14. 00:22 | Posted by 솔웅

https://d2l.ai/chapter_preliminaries/probability.html

2.6. Probability and Statistics — Dive into Deep Learning 1.0.3 documentation

d2l.ai

2.6. Probability and Statistics

One way or another, machine learning is all about uncertainty. In supervised learning, we want to predict something unknown (the target) given something known (the features). Depending on our objective, we might attempt to predict the most likely value of the target. Or we might predict the value with the smallest expected distance from the target. And sometimes we wish not only to predict a specific value but to quantify our uncertainty. For example, given some features describing a patient, we might want to know how likely they are to suffer a heart attack in the next year. In unsupervised learning, we often care about uncertainty. To determine whether a set of measurements are anomalous, it helps to know how likely one is to observe values in a population of interest. Furthermore, in reinforcement learning, we wish to develop agents that act intelligently in various environments. This requires reasoning about how an environment might be expected to change and what rewards one might expect to encounter in response to each of the available actions.

어떤 식으로든 머신러닝은 불확실성에 관한 것입니다. 지도 학습 supervised learning 에서는 알려진 것(특성)을 고려하여 알려지지 않은 것(목표)을 예측하려고 합니다. 목표에 따라 목표의 가장 가능성 있는 값을 예측하려고 시도할 수도 있습니다. 또는 대상으로부터 예상되는 거리가 가장 작은 값을 예측할 수도 있습니다. 때로는 특정 값을 예측하는 것뿐만 아니라 불확실성을 정량화하고 싶을 때도 있습니다. 예를 들어, 환자를 설명하는 일부 특징이 주어지면 해당 환자가 내년에 심장마비를 겪을 가능성이 얼마나 되는지 알고 싶을 수 있습니다. 비지도 학습 unsupervised learning 에서는 종종 불확실성에 관심을 갖습니다. 일련의 측정값이 비정상적인지 여부를 확인하려면 관심 모집단에서 값을 관찰할 가능성이 얼마나 되는지 아는 것이 도움이 됩니다. 또한 강화학습 reinforcement learning 에서는 다양한 환경에서 지능적으로 행동하는 에이전트를 개발하고자 합니다. 이를 위해서는 환경이 어떻게 변할 것으로 예상되는지, 그리고 사용 가능한 각 조치에 대한 응답으로 어떤 보상을 받을 수 있는지에 대한 추론이 필요합니다.

Probability is the mathematical field concerned with reasoning under uncertainty. Given a probabilistic model of some process, we can reason about the likelihood of various events. The use of probabilities to describe the frequencies of repeatable events (like coin tosses) is fairly uncontroversial. In fact, frequentist scholars adhere to an interpretation of probability that applies only to such repeatable events. By contrast Bayesian scholars use the language of probability more broadly to formalize reasoning under uncertainty. Bayesian probability is characterized by two unique features: (i) assigning degrees of belief to non-repeatable events, e.g., what is the probability that a dam will collapse?; and (ii) subjectivity. While Bayesian probability provides unambiguous rules for how one should update their beliefs in light of new evidence, it allows for different individuals to start off with different prior beliefs. Statistics helps us to reason backwards, starting off with collection and organization of data and backing out to what inferences we might draw about the process that generated the data. Whenever we analyze a dataset, hunting for patterns that we hope might characterize a broader population, we are employing statistical thinking. Many courses, majors, theses, careers, departments, companies, and institutions have been devoted to the study of probability and statistics. While this section only scratches the surface, we will provide the foundation that you need to begin building models.

확률 Probability 은 불확실성 하에서 추론과 관련된 수학적 분야입니다. 일부 프로세스의 확률적 모델이 주어지면 다양한 이벤트가 발생할 가능성에 대해 추론할 수 있습니다. 반복 가능한 사건(예: 동전 던지기)의 빈도를 설명하기 위해 확률을 사용하는 것은 논란의 여지가 없습니다. 사실, 빈도주의 학자들은 그러한 반복 가능한 사건에만 적용되는 확률의 해석을 고수합니다. 대조적으로 베이지안 학자들은 불확실성 하에서 추론을 공식화하기 위해 확률이라는 언어를 보다 광범위하게 사용합니다. 베이지안 확률은 두 가지 고유한 특징이 특징입니다. (i) 반복 불가능한 사건에 대한 신뢰도 할당(예: 댐이 붕괴할 확률은 얼마입니까?) (ii) 주관성. 베이지안 확률은 새로운 증거에 비추어 자신의 믿음을 업데이트하는 방법에 대한 명확한 규칙을 제공하지만, 개인마다 다른 이전 믿음으로 시작할 수 있습니다. 통계는 데이터 수집 및 구성부터 시작하여 데이터를 생성한 프로세스에 대해 어떤 추론을 이끌어낼 수 있는지 역으로 추론하는 데 도움이 됩니다. 우리는 데이터 세트를 분석하고 더 광범위한 인구를 특징짓는 패턴을 찾을 때마다 통계적 사고를 사용합니다. 많은 과정, 전공, 논문, 경력, 학과, 회사 및 기관에서 확률과 통계 연구에 전념해 왔습니다. 이 섹션에서는 표면적인 내용만 다루지만 모델 구축을 시작하는 데 필요한 기초를 제공합니다.

https://youtu.be/SoKjCUcDBf0?si=M5wjpR8J_O7w3Lai

%matplotlib inline
import random
import torch
from torch.distributions.multinomial import Multinomial
from d2l import torch as d2l

이 코드는 다음 작업을 수행하기 위해 필요한 라이브러리 및 설정을 가져오고 있습니다:

%matplotlib inline: 이는 IPython 환경에서 Matplotlib를 사용하여 그래프 및 이미지를 노트북에 직접 표시하도록 설정합니다. 즉, 그래프를 노트북 내에서 바로 볼 수 있도록 합니다.
random: Python의 기본 모듈로서, 무작위 수를 생성하는 데 사용됩니다.
torch: PyTorch 라이브러리입니다. 딥러닝 작업을 수행하는 데 사용됩니다.
Multinomial: PyTorch에서 제공하는 분포 클래스 중 하나인 다항 분포(Multinomial distribution)에 대한 모듈입니다. 다항 분포는 여러 범주 중 하나를 샘플링하는 확률 모델을 나타냅니다.
d2l: "Dive into Deep Learning" 프로젝트의 PyTorch 버전인 d2l(torch) 라이브러리입니다. 딥러닝 모델을 학습하고 시각화하는 데 도움이 되는 기능과 도구를 제공합니다.

이 코드는 딥러닝 모델을 구축하고 실험하기 위한 환경을 설정하는 데 사용될 수 있습니다.

2.6.1. A Simple Example: Tossing Coins

Imagine that we plan to toss a coin and want to quantify how likely we are to see heads (vs. tails). If the coin is fair, then both outcomes (heads and tails), are equally likely. Moreover if we plan to toss the coin n times then the fraction of heads that we expect to see should exactly match the expected fraction of tails. One intuitive way to see this is by symmetry: for every possible outcome with nh heads and nt=(n−nh) tails, there is an equally likely outcome with nt heads and nh tails. Note that this is only possible if on average we expect to see 1/2 of tosses come up heads and 1/2 come up tails. Of course, if you conduct this experiment many times with n=1000000 tosses each, you might never see a trial where nh=nt exactly.

우리가 동전을 던질 계획을 세우고 앞면과 뒷면이 나올 확률을 정량화하고 싶다고 가정해 보세요. 동전이 공정하다면 두 가지 결과(앞면과 뒷면)의 확률은 동일합니다. 더욱이 동전을 n번 던질 계획이라면 우리가 볼 것으로 예상되는 앞면의 비율은 예상되는 뒷면의 비율과 정확히 일치해야 합니다. 이것을 보는 한 가지 직관적인 방법은 대칭을 이용하는 것입니다. nh개의 앞면과 nt=(n−nh)개의 꼬리가 있는 모든 가능한 결과에 대해 nt개의 앞면과 nh개의 꼬리가 있는 동일한 가능성의 결과가 있습니다. 이것은 평균적으로 던진 숫자 중 1/2이 앞면이 나오고 1/2이 뒷면이 나올 것으로 예상하는 경우에만 가능합니다. 물론 n=1000000번 던지기로 이 실험을 여러 번 수행하면 정확히 nh=nt인 실험을 결코 볼 수 없을 수도 있습니다.

Formally, the quantity 1/2 is called a probability and here it captures the certainty with which any given toss will come up heads. Probabilities assign scores between 0 and 1 to outcomes of interest, called events. Here the event of interest is heads and we denote the corresponding probability P(heads). A probability of 1 indicates absolute certainty (imagine a trick coin where both sides were heads) and a probability of 0 indicates impossibility (e.g., if both sides were tails). The frequencies nh/n and nt/n are not probabilities but rather statistics. Probabilities are theoretical quantities that underly the data generating process. Here, the probability 1/2 is a property of the coin itself. By contrast, statistics are empirical quantities that are computed as functions of the observed data. Our interests in probabilistic and statistical quantities are inextricably intertwined. We often design special statistics called estimators that, given a dataset, produce estimates of model parameters such as probabilities. Moreover, when those estimators satisfy a nice property called consistency, our estimates will converge to the corresponding probability. In turn, these inferred probabilities tell about the likely statistical properties of data from the same population that we might encounter in the future.

공식적으로 1/2이라는 수량을 확률이라고 하며 여기서는 주어진 던지기에서 앞면이 나올 확실성을 포착합니다. 확률은 이벤트라고 하는 관심 결과에 0과 1 사이의 점수를 할당합니다. 여기서 관심 있는 이벤트는 앞면이고 해당 확률 P(앞면)를 나타냅니다. 확률 1은 절대적 확실성(양쪽이 앞면인 트릭 코인을 상상해 보세요)을 나타내고 확률 0은 불가능함(예: 양쪽이 뒷면인 경우)을 나타냅니다. 빈도 nh/n 및 nt/n은 확률이 아니라 통계입니다. 확률은 데이터 생성 프로세스의 기초가 되는 이론적 수량입니다. 여기서 확률 1/2은 코인 자체의 속성입니다. 대조적으로, 통계는 관찰된 데이터의 함수로 계산되는 경험적 수량입니다. 확률적 수량과 통계적 수량에 대한 우리의 관심은 서로 밀접하게 얽혀 있습니다. 우리는 주어진 데이터 세트에서 확률과 같은 모델 매개변수의 추정치를 생성하는 추정기라는 특수 통계를 설계하는 경우가 많습니다. 또한 이러한 추정기가 일관성이라는 좋은 속성을 충족하면 추정치가 해당 확률로 수렴됩니다. 결과적으로 이러한 추론된 확률은 우리가 미래에 접할 수 있는 동일한 모집단의 데이터에 대한 통계적 속성에 대해 알려줍니다.

Suppose that we stumbled upon a real coin for which we did not know the true P(heads). To investigate this quantity with statistical methods, we need to (i) collect some data; and (ii) design an estimator. Data acquisition here is easy; we can toss the coin many times and record all the outcomes. Formally, drawing realizations from some underlying random process is called sampling. As you might have guessed, one natural estimator is the ratio of the number of observed heads to the total number of tosses.

우리가 실제 P(앞면)를 알지 못하는 실제 동전을 우연히 발견했다고 가정해 보십시오. 통계적 방법으로 이 수량을 조사하려면 (i) 일부 데이터를 수집해야 합니다. (ii) 추정기를 설계합니다. 여기서 데이터 수집은 쉽습니다. 동전을 여러 번 던지고 모든 결과를 기록할 수 있습니다. 공식적으로는 일부 기본 무작위 프로세스에서 구현한 그림을 샘플링이라고 합니다. 짐작할 수 있듯이 자연 추정량 중 하나는 총 던지기 횟수에 대한 관찰된 앞면 숫자의 비율입니다.

Now, suppose that the coin was in fact fair, i.e., P(heads)=0.5. To simulate tosses of a fair coin, we can invoke any random number generator. There are some easy ways to draw samples of an event with probability 0.5. For example Python’s random.random yields numbers in the interval [0,1] where the probability of lying in any sub-interval [a,b]⊂[0,1] is equal to b−a. Thus we can get out 0 and 1 with probability 0.5 each by testing whether the returned float number is greater than 0.5:

이제 동전이 실제로 공정했다고 가정합니다. 즉, P(heads)=0.5입니다. 공정한 동전 던지기를 시뮬레이션하기 위해 난수 생성기를 호출할 수 있습니다. 확률이 0.5인 사건의 샘플을 추출하는 몇 가지 쉬운 방법이 있습니다. 예를 들어 Python의 random.random은 하위 구간 [a,b]⊂[0,1]에 속할 확률이 b−a와 같은 구간 [0,1]의 숫자를 생성합니다. 따라서 반환된 부동 소수점 숫자가 0.5보다 큰지 테스트하여 각각 0.5 확률로 0과 1을 얻을 수 있습니다.

num_tosses = 100
heads = sum([random.random() > 0.5 for _ in range(num_tosses)])
tails = num_tosses - heads
print("heads, tails: ", [heads, tails])

이 코드는 동전 던지기 실험을 시뮬레이션하는 데 사용됩니다. 다음과 같이 동작합니다:

num_tosses = 100: num_tosses라는 변수를 정의하고 100으로 설정합니다. 이는 동전을 100번 던지겠다는 것을 나타냅니다.
heads = sum([random.random() > 0.5 for _ in range(num_tosses)]): heads 변수는 동전 던지기 실험에서 앞면(heads)이 나온 횟수를 나타냅니다. 리스트 내포(list comprehension)를 사용하여 0.5보다 큰 임의의 숫자(0~1 범위)를 100번 생성하고, 이 숫자가 0.5보다 크면(True인 경우), random.random() > 0.5는 1(참)이 됩니다. 그리고 sum() 함수를 사용하여 1의 개수를 세어 앞면(heads) 횟수를 계산합니다.
tails = num_tosses - heads: tails 변수는 동전 던지기 실험에서 뒷면(tails)이 나온 횟수를 나타냅니다. 전체 던진 횟수(num_tosses)에서 앞면(heads) 횟수를 뺌으로써 얻습니다.
print("heads, tails: ", [heads, tails]): 앞면(heads) 횟수와 뒷면(tails) 횟수를 출력합니다.

이 코드는 무작위로 동전을 100번 던진 후 앞면(heads)과 뒷면(tails)이 나오는 횟수를 계산하고 출력합니다. 결과는 매번 다를 수 있으며, 대략적으로 동전 던지기의 확률을 시뮬레이션합니다.

heads, tails:  [44, 56]

More generally, we can simulate multiple draws from any variable with a finite number of possible outcomes (like the toss of a coin or roll of a die) by calling the multinomial function, setting the first argument to the number of draws and the second as a list of probabilities associated with each of the possible outcomes. To simulate ten tosses of a fair coin, we assign probability vector [0.5, 0.5], interpreting index 0 as heads and index 1 as tails. The function returns a vector with length equal to the number of possible outcomes (here, 2), where the first component tells us the number of occurrences of heads and the second component tells us the number of occurrences of tails.

보다 일반적으로, 다항 함수를 호출하고 첫 번째 인수를 무승부 횟수로 설정하고 두 번째 인수를 가능한 각 결과와 관련된 확률 목록입니다. 공정한 동전 던지기 10회를 시뮬레이션하기 위해 확률 벡터 [0.5, 0.5]를 할당하여 인덱스 0을 앞면으로 해석하고 인덱스 1을 뒷면으로 해석합니다. 이 함수는 가능한 결과 수(여기서는 2)와 동일한 길이의 벡터를 반환합니다. 여기서 첫 번째 구성 요소는 앞면이 나타나는 횟수를 나타내고 두 번째 구성 요소는 꼬리가 나타나는 횟수를 알려줍니다.

fair_probs = torch.tensor([0.5, 0.5])
Multinomial(100, fair_probs).sample()

이 코드는 다수의 동전 던지기 실험을 시뮬레이션하는 데 사용됩니다. 코드를 간단히 설명하겠습니다:
1. fair_probs = torch.tensor([0.5, 0.5]): fair_probs는 각 동전이 앞면(heads) 또는 뒷면(tails)을 나올 확률을 나타내는 텐서입니다. 여기서 [0.5, 0.5]는 공평한(fair) 동전을 나타내며, 앞면과 뒷면이 나올 확률이 각각 50%입니다.
2. Multinomial(100, fair_probs): 이 부분은 Multinomial 분포에서 무작위 표본(sample)을 생성하는데 사용됩니다. 100은 시행 횟수, fair_probs는 각 결과(앞면 또는 뒷면)가 나올 확률을 나타내는 확률 분포입니다.
3. .sample(): 이 메서드는 Multinomial 분포를 따르는 난수 생성을 수행합니다. 여기서는 100번의 동전 던지기 시뮬레이션을 수행하며, 각 시행에서 앞면 또는 뒷면이 나오는 횟수를 반환합니다.
결과로 나오는 텐서는 [heads_count, tails_count]와 같은 형태를 가지며, heads_count는 앞면이 나온 횟수, tails_count는 뒷면이 나온 횟수를 나타냅니다. 위의 fair_probs에서는 공평한 동전을 사용했으므로, heads_count와 tails_count는 대략적으로 50, 50이 될 것입니다.하지만 시뮬레이션 결과는 무작위성에 의해 실제로는 다를 수 있습니다.

tensor([50., 50.])

Each time you run this sampling process, you will receive a new random value that may differ from the previous outcome. Dividing by the number of tosses gives us the frequency of each outcome in our data. Note that these frequencies, just like the probabilities that they are intended to estimate, sum to 1.

이 샘플링 프로세스를 실행할 때마다 이전 결과와 다를 수 있는 새로운 임의 값을 받게 됩니다. 던지는 횟수로 나누면 데이터의 각 결과 빈도를 알 수 있습니다. 추정하려는 확률과 마찬가지로 이러한 빈도의 합은 1이 됩니다.

Multinomial(100, fair_probs).sample() / 100

이 코드는 앞면(heads)과 뒷면(tails)이 나오는 확률이 0.5로 동일한 공평한 동전을 100번 던진 후에 각 결과의 상대 빈도를 계산합니다. 코드를 설명하겠습니다.

Multinomial(100, fair_probs): 이 부분은 Multinomial 분포에서 100번의 시행을 수행하여 무작위로 결과를 샘플링합니다. 100은 시행 횟수이고, fair_probs는 각 결과(앞면 또는 뒷면)가 나올 확률을 나타내는 확률 분포입니다.
.sample(): 이 메서드는 Multinomial 분포를 따르는 난수 생성을 수행합니다. 따라서 이 부분의 결과는 [heads_count, tails_count]와 같은 형태의 텐서로, heads_count는 앞면이 나오는 횟수, tails_count는 뒷면이 나오는 횟수를 나타냅니다.
/ 100: 여기서 나눗셈 연산을 수행하여 각 결과의 상대적인 빈도를 계산합니다. 각 결과(앞면 또는 뒷면)의 횟수를 100으로 나누면 각 결과가 나오는 상대적인 확률을 얻을 수 있습니다. 이 결과는 공평한 동전에서 앞면 또는 뒷면이 나올 확률에 가까워질 것이며, 대략적으로 [0.5, 0.5]가 될 것입니다.

즉, 이 코드는 공평한 동전 던지기 실험을 100번 수행하여 앞면과 뒷면의 상대적인 빈도(확률)를 계산합니다.

tensor([0.4800, 0.5200])

Here, even though our simulated coin is fair (we ourselves set the probabilities [0.5, 0.5]), the counts of heads and tails may not be identical. That is because we only drew a relatively small number of samples. If we did not implement the simulation ourselves, and only saw the outcome, how would we know if the coin were slightly unfair or if the possible deviation from 1/2 was just an artifact of the small sample size? Let’s see what happens when we simulate 10,000 tosses.

여기서, 시뮬레이션된 코인이 공정하더라도(우리가 직접 확률 [0.5, 0.5] 설정) 앞면과 뒷면의 개수가 동일하지 않을 수 있습니다. 그 이유는 상대적으로 적은 수의 샘플만 추출했기 때문입니다. 시뮬레이션을 직접 구현하지 않고 결과만 본다면 동전이 약간 불공평한지, 아니면 1/2에서 벗어날 수 있는 편차가 단지 작은 샘플 크기의 인공물인지 어떻게 알 수 있습니까? 10,000번의 던지기를 시뮬레이션하면 어떤 일이 일어나는지 살펴보겠습니다.

counts = Multinomial(10000, fair_probs).sample()
counts / 10000

이 코드는 공평한 동전을 10,000번 던지고, 그 결과를 사용하여 각 결과(앞면 또는 뒷면)의 상대적인 확률을 계산합니다. 코드를 단계별로 설명하겠습니다.

Multinomial(10000, fair_probs): 이 부분은 Multinomial 분포에서 10,000번의 시행을 수행하여 무작위로 결과를 샘플링합니다. 10,000은 시행 횟수이고, fair_probs는 각 결과(앞면 또는 뒷면)가 나올 확률을 나타내는 확률 분포입니다. 이 부분에서는 10,000번의 독립적인 동전 던지기 시뮬레이션을 수행합니다.
.sample(): 이 메서드는 Multinomial 분포를 따르는 난수 생성을 수행합니다. 따라서 이 부분의 결과는 [heads_count, tails_count]와 같은 형태의 텐서로, heads_count는 앞면이 나오는 횟수, tails_count는 뒷면이 나오는 횟수를 나타냅니다.
/ 10,000: 마지막으로, 각 결과(앞면 또는 뒷면)의 횟수를 10,000으로 나누면 각 결과가 나오는 상대적인 확률을 얻을 수 있습니다. 이 결과는 공평한 동전에서 앞면 또는 뒷면이 나오는 확률에 가까워질 것이며, 대략적으로 [0.5, 0.5]가 될 것입니다.

따라서 이 코드는 공평한 동전 던지기 실험을 10,000번 수행하여 앞면과 뒷면의 상대적인 빈도(확률)를 계산합니다.

tensor([0.4966, 0.5034])

In general, for averages of repeated events (like coin tosses), as the number of repetitions grows, our estimates are guaranteed to converge to the true underlying probabilities. The mathematical formulation of this phenomenon is called the law of large numbers and the central limit theorem tells us that in many situations, as the sample size n grows, these errors should go down at a rate of (1/√n). Let's get some more intuition by studying how our estimate evolves as we grow the number of tosses from 1 to 10,000.

일반적으로 반복되는 사건(예: 동전 던지기)의 평균의 경우 반복 횟수가 증가함에 따라 우리의 추정치는 실제 기본 확률로 수렴됩니다. 이 현상의 수학적 공식을 대수의 법칙이라고 하며 중심 극한 정리는 많은 상황에서 표본 크기 n이 커짐에 따라 이러한 오류가 (1/√n)의 비율로 감소해야 함을 알려줍니다. 던지기 횟수를 1회에서 10,000회까지 증가시키면서 추정치가 어떻게 변화하는지 연구하여 좀 더 직관력을 갖도록 합시다.

counts = Multinomial(1, fair_probs).sample((10000,))
cum_counts = counts.cumsum(dim=0)
estimates = cum_counts / cum_counts.sum(dim=1, keepdims=True)
estimates = estimates.numpy()

d2l.set_figsize((4.5, 3.5))
d2l.plt.plot(estimates[:, 0], label=("P(coin=heads)"))
d2l.plt.plot(estimates[:, 1], label=("P(coin=tails)"))
d2l.plt.axhline(y=0.5, color='black', linestyle='dashed')
d2l.plt.gca().set_xlabel('Samples')
d2l.plt.gca().set_ylabel('Estimated probability')
d2l.plt.legend();

이 코드는 공평한 동전 던지기 실험에서 상대적인 확률을 추정하고 그 추정치를 그래프로 표현하는 부분입니다.

counts = Multinomial(1, fair_probs).sample((10000,)): 이 부분은 공평한 동전을 1번 던질 때, 앞면(coin=heads) 또는 뒷면(coin=tails)이 나오는 횟수를 10,000번 샘플링합니다. counts는 10,000번의 시뮬레이션 결과를 담고 있는 텐서입니다.
cum_counts = counts.cumsum(dim=0): 누적 횟수(cumulative counts)를 계산합니다. 이것은 각 시뮬레이션 스텝에서 앞면과 뒷면의 상대적인 누적 횟수를 나타냅니다.
estimates = cum_counts / cum_counts.sum(dim=1, keepdims=True): 이 부분에서는 누적 횟수를 누적 합으로 나눠 상대적인 확률을 계산합니다. 이것은 각 시뮬레이션 스텝에서 앞면과 뒷면의 상대적인 확률을 나타냅니다.
estimates = estimates.numpy(): 계산된 확률 추정치를 NumPy 배열로 변환합니다.
그래프를 그리는 부분: 계산된 확률 추정치를 이용하여 앞면(coin=heads)과 뒷면(coin=tails)의 확률 변화를 그래프로 표현합니다. d2l.plt.plot(estimates[:, 0], label=("P(coin=heads)"))는 앞면의 확률을 나타내는 그래프를 그리고, d2l.plt.plot(estimates[:, 1], label=("P(coin=tails)"))는 뒷면의 확률을 나타내는 그래프를 그립니다. d2l.plt.axhline(y=0.5, color='black', linestyle='dashed')는 확률 0.5에 수평 점선을 추가합니다. 나머지 코드는 그래프의 레이블과 축 이름을 설정하고 범례를 추가합니다.

결과적으로 이 코드는 공평한 동전 던지기 실험에서 시뮬레이션된 결과를 기반으로 앞면과 뒷면의 확률을 추정하고, 그 추정치를 그래프로 시각화합니다. 동전 던지기 시뮬레이션 횟수가 증가함에 따라 추정치가 공평한 동전의 확률(0.5)로 수렴하는 것을 관찰할 수 있습니다.

Each solid curve corresponds to one of the two values of the coin and gives our estimated probability that the coin turns up that value after each group of experiments. The dashed black line gives the true underlying probability. As we get more data by conducting more experiments, the curves converge towards the true probability. You might already begin to see the shape of some of the more advanced questions that preoccupy statisticians: How quickly does this convergence happen? If we had already tested many coins manufactured at the same plant, how might we incorporate this information?

각 실선은 동전의 두 값 중 하나에 해당하며 각 실험 그룹 후에 동전이 해당 값을 나타낼 확률을 추정합니다. 검은 점선은 실제 기본 확률을 나타냅니다. 더 많은 실험을 수행하여 더 많은 데이터를 얻을수록 곡선은 실제 확률로 수렴됩니다. 통계학자들을 사로잡는 고급 질문 중 일부의 형태가 이미 보이기 시작했을 수도 있습니다. 이 수렴은 얼마나 빨리 발생합니까? 동일한 공장에서 제조된 많은 동전을 이미 테스트했다면 이 정보를 어떻게 통합할 수 있을까요?

2.6.2. A More Formal Treatment

We have already gotten pretty far: posing a probabilistic model, generating synthetic data, running a statistical estimator, empirically assessing convergence, and reporting error metrics (checking the deviation). However, to go much further, we will need to be more precise.

우리는 이미 확률 모델 제시, 합성 데이터 생성, 통계 추정기 실행, 경험적 수렴 평가, 오류 측정항목 보고(편차 확인) 등 꽤 많은 작업을 수행했습니다. 그러나 더 나아가려면 더 정확해야 합니다.

When dealing with randomness, we denote the set of possible outcomes S and call it the sample space or outcome space. Here, each element is a distinct possible outcome. In the case of rolling a single coin, S={heads,tails}. For a single die, S={1,2,3,4,5,6}. When flipping two coins, possible outcomes are {(heads,heads),(heads,tails),(tails,heads),(tails,tails)}. Events are subsets of the sample space. For instance, the event “the first coin toss comes up heads” corresponds to the set {(heads,heads),(heads,tails)}. Whenever the outcome z of a random experiment satisfies z∈A, then event A has occurred. For a single roll of a die, we could define the events “seeing a 5” (A={5}) and “seeing an odd number” (B={1,3,5}). In this case, if the die came up 5, we would say that both A and B occurred. On the other hand, if z=3, then A did not occur but B did.

무작위성을 다룰 때 가능한 결과 집합 S를 표시하고 이를 표본 공간 또는 결과 공간이라고 부릅니다. 여기서 각 요소는 서로 다른 가능한 결과입니다. 단일 동전을 굴리는 경우 S={heads,tails}입니다. 단일 주사위의 경우 S={1,2,3,4,5,6}입니다. 두 개의 동전을 뒤집을 때 가능한 결과는 {(앞면, 앞면),(앞면,뒷면),(뒷면,앞면),(뒷면,뒷면)}입니다. 사건은 표본 공간의 부분 집합입니다. 예를 들어, "첫 번째 동전 던지기에서 앞면이 나옵니다" 이벤트는 {(앞면, 앞면),(앞면, 뒷면)} 집합에 해당합니다. 무작위 실험의 결과 z가 z∈A를 충족할 때마다 사건 A가 발생합니다. 주사위를 한 번 굴릴 때 "5를 보는 것"(A={5})과 "홀수를 보는 것"(B={1,3,5}) 이벤트를 정의할 수 있습니다. 이 경우 주사위가 5가 나오면 A와 B가 모두 발생했다고 말할 수 있습니다. 반면, z=3이면 A는 발생하지 않았지만 B는 발생했습니다.

A probability function maps events onto real values P:A⊆S→[0,1]. The probability, denoted P(A), of an event A in the given sample space S, has the following properties:

확률 함수는 사건을 실제 값 P:A⊆S→[0,1]에 매핑합니다. 주어진 표본 공간 S에서 사건 A의 확률 P(A)는 다음과 같은 속성을 갖습니다.

The probability of any event A is a nonnegative real number, i.e., P(A)≥0;
어떤 사건 A의 확률은 음이 아닌 실수입니다. 즉, P(A)≥0입니다.
The probability of the entire sample space is 1, i.e., P(S)=1;
전체 표본 공간의 확률은 1입니다. 즉, P(S)=1입니다.
For any countable sequence of events A1,A2,… that are mutually exclusive (i.e., Ai∩Aj=∅ for all i≠j), the probability that any of them happens is equal to the sum of their individual probabilities, i.e., P(⋃_i=1^∞ A_i)=∑_i=1^∞ P(A_i).
상호 배타적인 사건 A1,A2,…의 셀 수 있는 시퀀스에 대해(즉, 모든 i≠j에 대해 Ai∩Aj=∅), 그 중 하나가 발생할 확률은 개별 확률의 합과 같습니다. 즉, P(⋃_i=1^∞ A_i)=∑_i=1^∞ P(A_i).

These axioms of probability theory, proposed by Kolmogorov (1933), can be applied to rapidly derive a number of important consequences. For instance, it follows immediately that the probability of any event A or its complement A′ occurring is 1 (because A∪A′=S). We can also prove that P(∅)=0 because 1=P(S∪S′)=P(S∪∅)=P(S)+P(∅)=1+P(∅). Consequently, the probability of any event A and its complement A′ occurring simultaneously is P(A∩A′)=0. Informally, this tells us that impossible events have zero probability of occurring.

Kolmogorov(1933)가 제안한 확률 이론의 이러한 공리는 여러 가지 중요한 결과를 신속하게 도출하는 데 적용될 수 있습니다. 예를 들어, 임의의 사건 A 또는 그 보수 A'가 발생할 확률은 1입니다(A∪A'=S이기 때문에). 1=P(S∪S′)=P(S∪∅)=P(S)+P(∅)=1+P(∅)이기 때문에 P(∅)=0임을 증명할 수도 있습니다. 결과적으로, 사건 A와 그 보수 A'가 동시에 발생할 확률은 P(A∩A')=0입니다. 비공식적으로 이는 불가능한 사건이 발생할 확률이 0이라는 것을 알려줍니다.

2.6.3. Random Variables

When we spoke about events like the roll of a die coming up odds or the first coin toss coming up heads, we were invoking the idea of a random variable. Formally, random variables are mappings from an underlying sample space to a set of (possibly many) values. You might wonder how a random variable is different from the sample space, since both are collections of outcomes. Importantly, random variables can be much coarser than the raw sample space. We can define a binary random variable like “greater than 0.5” even when the underlying sample space is infinite, e.g., points on the line segment between 0 and 1. Additionally, multiple random variables can share the same underlying sample space. For example “whether my home alarm goes off” and “whether my house was burgled” are both binary random variables that share an underlying sample space. Consequently, knowing the value taken by one random variable can tell us something about the likely value of another random variable. Knowing that the alarm went off, we might suspect that the house was likely burgled.

주사위를 굴려 앞면이 나올 확률이나 첫 번째 동전 던지기에서 앞면이 나올 때와 같은 사건에 대해 이야기할 때 우리는 무작위 변수라는 아이디어를 떠올렸습니다. 공식적으로 확률 변수는 기본 표본 공간에서 (아마도 많은) 값 세트로의 매핑입니다. 둘 다 결과 모음이기 때문에 확률 변수가 표본 공간과 어떻게 다른지 궁금할 것입니다. 중요한 것은 확률 변수가 원시 표본 공간보다 훨씬 더 거칠 수 있다는 것입니다. 기본 표본 공간이 무한하더라도(예: 0과 1 사이 선분의 점) "0.5보다 큼"과 같은 이진 확률 변수를 정의할 수 있습니다. 또한 여러 확률 변수가 동일한 기본 표본 공간을 공유할 수 있습니다. 예를 들어 "내 집 알람이 울리는지 여부"와 "내 집에 도난이 발생했는지 여부"는 모두 기본 표본 공간을 공유하는 이진 무작위 변수입니다. 결과적으로, 하나의 무작위 변수가 취하는 값을 알면 다른 무작위 변수의 가능한 값에 대해 알 수 있습니다. 경보기가 울렸다는 것을 알면 집에 도둑이 들었을 가능성이 있다고 의심할 수 있습니다.

Every value taken by a random variable corresponds to a subset of the underlying sample space. Thus the occurrence where the random variable X takes value v, denoted by X=v, is an event and P(X=v) denotes its probability. Sometimes this notation can get clunky, and we can abuse notation when the context is clear. For example, we might use P(X) to refer broadly to the distribution of X, i.e., the function that tells us the probability that X takes any given value. Other times we write expressions like P(X,Y)=P(X)P(Y), as a shorthand to express a statement that is true for all of the values that the random variables X and Y can take, i.e., for all i,j it holds that P(X=i and Y=j)=P(X=i)P(Y=j). Other times, we abuse notation by writing P(v) when the random variable is clear from the context. Since an event in probability theory is a set of outcomes from the sample space, we can specify a range of values for a random variable to take. For example, P(1≤X≤3) denotes the probability of the event {1≤X≤3}.

무작위 변수가 취한 모든 값은 기본 표본 공간의 하위 집합에 해당합니다. 따라서 확률 변수 X가 X=v로 표시되는 값 v를 취하는 발생은 이벤트이고 P(X=v)는 해당 확률을 나타냅니다. 때로는 이 표기법이 투박해질 수 있으며, 문맥이 명확할 때 표기법을 남용할 수 있습니다. 예를 들어, P(X)를 사용하여 X의 분포, 즉 X가 주어진 값을 취할 확률을 알려주는 함수를 광범위하게 나타낼 수 있습니다. 다른 경우에는 확률 변수 X와 Y가 취할 수 있는 모든 값에 대해 참인 진술을 표현하기 위해 P(X,Y)=P(X)P(Y)와 같은 표현식을 작성합니다. 모든 i,j는 P(X=i and Y=j)=P(X=i)P(Y=j)를 유지합니다. 다른 경우에는 무작위 변수가 문맥에서 명확할 때 P(v)를 작성하여 표기법을 남용합니다. 확률 이론의 사건은 표본 공간의 결과 집합이므로 무작위 변수가 취할 값의 범위를 지정할 수 있습니다. 예를 들어 P(1<X<<3)은 사건 {1<<X<<3}의 확률을 나타냅니다.

Note that there is a subtle difference between discrete random variables, like flips of a coin or tosses of a die, and continuous ones, like the weight and the height of a person sampled at random from the population. In this case we seldom really care about someone’s exact height. Moreover, if we took precise enough measurements, we would find that no two people on the planet have the exact same height. In fact, with fine enough measurements, you would never have the same height when you wake up and when you go to sleep. There is little point in asking about the exact probability that someone is 1.801392782910287192 meters tall. Instead, we typically care more about being able to say whether someone’s height falls into a given interval, say between 1.79 and 1.81 meters. In these cases we work with probability densities. The height of exactly 1.80 meters has no probability, but nonzero density. To work out the probability assigned to an interval, we must take an integral of the density over that interval.

동전 던지기나 주사위 던지기와 같은 이산형 확률 변수와 모집단에서 무작위로 추출된 사람의 체중 및 키와 같은 연속 변수 사이에는 미묘한 차이가 있습니다. 이 경우 우리는 누군가의 정확한 키에 대해 거의 신경 쓰지 않습니다. 더욱이, 우리가 충분히 정확하게 측정한다면, 지구상에 정확히 같은 키를 가진 사람은 두 명도 없다는 것을 알게 될 것입니다. 사실, 충분히 세밀하게 측정하면 잠에서 깰 때와 잠에 들 때의 키가 결코 같지 않을 것입니다. 누군가의 키가 1.801392782910287192미터일 정확한 확률에 대해 묻는 것은 별 의미가 없습니다. 대신, 우리는 일반적으로 누군가의 키가 주어진 간격, 즉 1.79미터에서 1.81미터 사이에 속하는지 여부를 말할 수 있는지에 더 관심을 둡니다. 이 경우 우리는 확률 밀도를 사용하여 작업합니다. 정확히 1.80미터의 높이는 확률은 없지만 밀도는 0이 아닙니다. 구간에 할당된 확률을 계산하려면 해당 구간에 대한 밀도를 적분해야 합니다.

2.6.4. Multiple Random Variables

You might have noticed that we could not even make it through the previous section without making statements involving interactions among multiple random variables (recall P(X,Y)=P(X)P(Y)). Most of machine learning is concerned with such relationships. Here, the sample space would be the population of interest, say customers who transact with a business, photographs on the Internet, or proteins known to biologists. Each random variable would represent the (unknown) value of a different attribute. Whenever we sample an individual from the population, we observe a realization of each of the random variables. Because the values taken by random variables correspond to subsets of the sample space that could be overlapping, partially overlapping, or entirely disjoint, knowing the value taken by one random variable can cause us to update our beliefs about which values of another random variable are likely. If a patient walks into a hospital and we observe that they are having trouble breathing and have lost their sense of smell, then we believe that they are more likely to have COVID-19 than we might if they had no trouble breathing and a perfectly ordinary sense of smell.

여러 확률 변수 간의 상호 작용을 포함하는 진술을 작성하지 않고는 이전 섹션을 완료할 수도 없다는 점을 눈치챘을 것입니다(P(X,Y)=P(X)P(Y)를 기억하세요). 대부분의 기계 학습은 이러한 관계와 관련이 있습니다. 여기서 표본 공간은 관심 모집단, 즉 기업과 거래하는 고객, 인터넷 사진, 생물학자에게 알려진 단백질이 될 것입니다. 각 무작위 변수는 다른 속성의 (알 수 없는) 값을 나타냅니다. 모집단에서 개인을 샘플링할 때마다 각 무작위 변수가 실현되는 것을 관찰합니다. 무작위 변수가 취한 값은 중복되거나, 부분적으로 겹치거나, 완전히 분리될 수 있는 표본 공간의 하위 집합에 해당하기 때문에, 하나의 무작위 변수가 취한 값을 알면 다른 무작위 변수의 어떤 값이 가능성이 높은지에 대한 우리의 믿음을 업데이트할 수 있습니다. . 환자가 병원에 들어왔을 때 호흡 곤란을 겪고 후각을 잃은 것을 관찰하면, 우리는 호흡 곤란이 없고 완전히 평범한 후각이 있는 경우보다 코로나19에 걸릴 가능성이 더 높다고 믿습니다.

When working with multiple random variables, we can construct events corresponding to every combination of values that the variables can jointly take. The probability function that assigns probabilities to each of these combinations (e.g. A=a and B=b) is called the joint probability function and simply returns the probability assigned to the intersection of the corresponding subsets of the sample space. The joint probability assigned to the event where random variables A and B take values a and b, respectively, is denoted P(A=a,B=b), where the comma indicates “and”. Note that for any values a and b, it follows that

여러 확률 변수를 사용하여 작업할 때 변수가 공동으로 취할 수 있는 모든 값 조합에 해당하는 이벤트를 구성할 수 있습니다. 이러한 각 조합(예: A=a 및 B=b)에 확률을 할당하는 확률 함수를 결합 확률 함수라고 하며 단순히 표본 공간의 해당 하위 집합의 교차점에 할당된 확률을 반환합니다. 확률 변수 A와 B가 각각 a와 b 값을 갖는 사건에 할당된 결합 확률은 P(A=a,B=b)로 표시되며, 여기서 쉼표는 "and"를 나타냅니다. 임의의 값 a와 b에 대해 다음이 따른다는 점에 유의하십시오.

since for A=a and B=b to happen, A=a has to happen and B=b also has to happen. Interestingly, the joint probability tells us all that we can know about these random variables in a probabilistic sense, and can be used to derive many other useful quantities, including recovering the individual distributions P(A) and P(B). To recover P(A=a) we simply sum up P(A=a,B=v) over all values v that the random variable B can take: P(A=a)=∑_vP(A=a,B=v).

A=a와 B=b가 발생하려면 A=a가 발생해야 하고 B=b도 발생해야 하기 때문입니다. 흥미롭게도 결합 확률은 우리가 확률적 의미에서 이러한 확률 변수에 대해 알 수 있는 모든 것을 알려주고 개별 분포 P(A) 및 P(B)를 복구하는 것을 포함하여 다른 많은 유용한 양을 도출하는 데 사용될 수 있습니다. P(A=a)를 복구하려면 모든 값 v에 대해 P(A=a,B=v)를 간단히 합산하면 됩니다.확률 변수 B는 P(A=a)=∑_vP(A=a,B=v)를 취할 수 있습니다.

The ratio P(A=a,B=b)/P(A=a)≤1 turns out to be extremely important. It is called the conditional probability, and is denoted via the “∣” symbol:

P(A=a,B=b)/P(A=a)≤1 비율은 매우 중요합니다. 조건부 확률이라고 하며 "∣" 기호로 표시됩니다.

It tells us the new probability associated with the event B=b, once we condition on the fact A=a took place. We can think of this conditional probability as restricting attention only to the subset of the sample space associated with A=a and then renormalizing so that all probabilities sum to 1. Conditional probabilities are in fact just ordinary probabilities and thus respect all of the axioms, as long as we condition all terms on the same event and thus restrict attention to the same sample space. For instance, for disjoint events B and B′, we have that P(B∪B′∣A=a)=P(B∣A=a)+P(B′∣A=a).

A=a가 발생했다는 사실을 조건으로 하면 사건 B=b와 관련된 새로운 확률을 알려줍니다. 이 조건부 확률은 A=a와 관련된 표본 공간의 하위 집합에만 주의를 제한하고 모든 확률의 합이 1이 되도록 재정규화하는 것으로 생각할 수 있습니다. 조건부 확률은 실제로 일반적인 확률이므로 모든 공리를 존중합니다. 모든 항을 동일한 사건에 대해 조건을 지정하여 동일한 표본 공간에 주의를 기울이는 한. 예를 들어, 분리된 사건 B와 B'에 대해 P(B∪B'∣A=a)=P(B∣A=a)+P(B'∣A=a)가 됩니다.

Using the definition of conditional probabilities, we can derive the famous result called Bayes’ theorem. By construction, we have that P(A,B)=P(B∣A)P(A) and P(A,B)=P(A∣B)P(B). Combining both equations yields P(B∣A)P(A) = P(A∣B)P(B) and hence

조건부 확률의 정의를 사용하면 베이즈 정리라는 유명한 결과를 도출할 수 있습니다. 구성에 따르면 P(A,B)=P(B∣A)P(A) 및 P(A,B)=P(A∣B)P(B)가 있습니다. 두 방정식을 결합하면 P(B∣A)P(A) = P(A∣B)P(B)가 생성되므로

Bayes' theorem 이란?

베이즈 정리(Bayes' theorem)는 확률 이론의 중요한 개념 중 하나로, 조건부 확률을 계산하는 데 사용됩니다. 이 정리는 머신 러닝, 통계학, 확률 이론, 생물학, 의학, 자연어 처리 및 다양한 분야에서 다양한 응용을 가지고 있습니다. 베이즈 정리는 역사적으로 영국의 수학자 토머스 베이즈(Thomas Bayes)의 이름을 따서 명명되었습니다.

베이즈 정리의 일반적인 형태는 다음과 같습니다:

P(A|B) = (P(B|A) * P(A)) / P(B)

여기서 각 요소의 의미는 다음과 같습니다:

P(A|B): 사건 B가 발생한 조건에서 사건 A가 발생할 확률, 즉 A의 조건부 확률.
P(B|A): 사건 A가 발생한 조건에서 사건 B가 발생할 확률, 즉 B의 조건부 확률.
P(A): 사건 A가 발생할 사전 확률.
P(B): 사건 B가 발생할 사전 확률.

베이즈 정리의 주요 아이디어는 조건부 확률을 계산할 때, 사전 확률과 관련 사건들 간의 확률을 사용하여 업데이트할 수 있다는 것입니다. 즉, 사건 B가 발생한 후에는 사건 A의 확률을 더 정확하게 추정할 수 있습니다.

베이즈 정리의 응용 분야 중 하나는 베이지안 통계(Bayesian statistics)입니다. 이 방법론은 불확실성을 다루는데 특히 유용하며, 사후 확률을 업데이트하고 불확실성을 줄이는 데 활용됩니다. 머신 러닝에서는 베이지안 모델링을 통해 다양한 문제를 해결하고 불확실성을 고려한 예측을 수행하는데 사용됩니다.

베이즈 정리는 많은 현실 세계의 문제를 다루는데 유용한 도구로, 사후 확률을 업데이트하고 불확실성을 처리하기 위해 다양한 분야에서 활발하게 사용됩니다.

This simple equation has profound implications because it allows us to reverse the order of conditioning. If we know how to estimate P(B∣A), P(A), and P(B), then we can estimate P(A∣B). We often find it easier to estimate one term directly but not the other and Bayes’ theorem can come to the rescue here. For instance, if we know the prevalence of symptoms for a given disease, and the overall prevalences of the disease and symptoms, respectively, we can determine how likely someone is to have the disease based on their symptoms. In some cases we might not have direct access to P(B), such as the prevalence of symptoms. In this case a simplified version of Bayes’ theorem comes in handy:

이 간단한 방정식은 조건화 순서를 뒤집을 수 있기 때문에 심오한 의미를 갖습니다. P(B∣A), P(A), P(B)를 추정하는 방법을 안다면 P(A∣B)를 추정할 수 있습니다. 우리는 종종 한 항을 직접 추정하는 것이 더 쉽지만 다른 항은 그렇지 않다는 것을 알게 되며 여기서 베이즈 정리가 도움이 될 수 있습니다. 예를 들어, 특정 질병에 대한 증상의 유병률과 질병 및 증상의 전반적인 유병률을 각각 안다면 증상을 기반으로 누군가가 질병에 걸릴 가능성이 얼마나 되는지 판단할 수 있습니다. 어떤 경우에는 증상의 유병률과 같이 P(B)에 직접 접근할 수 없을 수도 있습니다. 이 경우 베이즈 정리의 단순화된 버전이 유용합니다.

Since we know that P(A∣B) must be normalized to 1, i.e., ∑_aP(A=a∣B)=1, we can use it to compute

P(A∣B)가 1로 정규화되어야 함, 즉 ∑_aP(A=a∣B)=1이라는 것을 알고 있으므로 이를 사용하여 계산할 수 있습니다.

In Bayesian statistics, we think of an observer as possessing some (subjective) prior beliefs about the plausibility of the available hypotheses encoded in the prior P(H), and a likelihood function that says how likely one is to observe any value of the collected evidence for each of the hypotheses in the class P(E∣H). Bayes’ theorem is then interpreted as telling us how to update the initial prior P(H) in light of the available evidence E to produce posterior beliefs P(H∣E) = P(E∣H)P(H)/P(E). Informally, this can be stated as “posterior equals prior times likelihood, divided by the evidence”. Now, because the evidence P(E) is the same for all hypotheses, we can get away with simply normalizing over the hypotheses.

베이지안 통계에서 우리는 관찰자가 사전 P(H)에 인코딩된 사용 가능한 가설의 타당성에 대한 일부 (주관적) 사전 신념과 수집된 값을 관찰할 가능성이 얼마나 되는지 알려주는 우도 함수를 보유하고 있다고 생각합니다. P(E∣H) 클래스의 각 가설에 대한 증거. 베이즈 정리는 사후 신념 P(H∣E) = P(E∣H)P(H)/P(E)를 생성하기 위해 이용 가능한 증거 E에 비추어 초기 사전 P(H)를 업데이트하는 방법을 알려주는 것으로 해석됩니다. 비공식적으로, 이는 "사후는 이전 가능성과 증거를 나눈 값과 동일합니다"라고 말할 수 있습니다. 이제 증거 P(E)는 모든 가설에 대해 동일하므로 단순히 가설에 대해 정규화하면 됩니다.

Note that ∑_aP(A=a∣B)=1 also allows us to marginalize over random variables. That is, we can drop variables from a joint distribution such as P(A,B). After all, we have that Independence is another fundamentally important concept that forms the backbone of many important ideas in statistics.

∑_aP(A=a∣B)=1을 사용하면 무작위 변수를 소외시킬 수도 있습니다. 즉, P(A,B)와 같은 결합 분포에서 변수를 삭제할 수 있습니다. 결국 우리는 독립성이 통계학의 많은 중요한 아이디어의 중추를 형성하는 또 다른 근본적으로 중요한 개념이라는 것을 알고 있습니다.

In short, two variables are independent if conditioning on the value of A does not cause any change to the probability distribution associated with B and vice versa. More formally, independence, denoted A⊥B, requires that P(A∣B) = P(A) and, consequently, that P(A,B) = P(A∣B)P(B) = P(A)P(B). Independence is often an appropriate assumption. For example, if the random variable A represents the outcome from tossing one fair coin and the random variable B represents the outcome from tossing another, then knowing whether A came up heads should not influence the probability of B coming up heads.

간단히 말해서, A 값에 대한 조건이 B와 관련된 확률 분포에 어떠한 변화도 일으키지 않고 그 반대의 경우에도 두 변수는 독립적입니다. 보다 공식적으로, A⊥B로 표시되는 독립성은 P(A∣B) = P(A)를 요구하며 결과적으로 P(A,B) = P(A∣B)P(B) = P(A)를 충족해야 합니다. P(B). 독립성은 종종 적절한 가정입니다. 예를 들어, 확률 변수 A가 공정한 동전 하나를 던진 결과를 나타내고 확률 변수 B가 다른 동전을 던진 결과를 나타내는 경우 A가 앞면이 나오는지 여부를 아는 것이 B가 앞면이 나올 확률에 영향을 주어서는 안 됩니다.

Independence is especially useful when it holds among the successive draws of our data from some underlying distribution (allowing us to make strong statistical conclusions) or when it holds among various variables in our data, allowing us to work with simpler models that encode this independence structure. On the other hand, estimating the dependencies among random variables is often the very aim of learning. We care to estimate the probability of disease given symptoms specifically because we believe that diseases and symptoms are not independent.

독립성은 일부 기본 분포에서 데이터를 연속적으로 추출할 때(강력한 통계적 결론을 내릴 수 있음) 데이터의 다양한 변수 간에 유지될 때 특히 유용하며 이 독립성 구조를 인코딩하는 더 간단한 모델로 작업할 수 있습니다. 반면, 확률 변수 간의 종속성을 추정하는 것이 학습의 목적인 경우가 많습니다. 우리는 질병과 증상이 독립적이지 않다고 믿기 때문에 특정 증상에 따른 질병의 확률을 추정하는 데 관심을 둡니다.

Note that because conditional probabilities are proper probabilities, the concepts of independence and dependence also apply to them. Two random variables A and B are conditionally independent given a third variable C if and only if P(A,B∣C) = P(A∣C)P(B∣C). Interestingly, two variables can be independent in general but become dependent when conditioning on a third. This often occurs when the two random variables A and B correspond to causes of some third variable C. For example, broken bones and lung cancer might be independent in the general population but if we condition on being in the hospital then we might find that broken bones are negatively correlated with lung cancer. That is because the broken bone explains away why some person is in the hospital and thus lowers the probability that they are hospitalized because of having lung cancer.

조건부 확률은 고유 확률이므로 독립성과 종속성의 개념도 적용됩니다. 두 개의 확률 변수 A와 B는 P(A,B∣C) = P(A∣C)P(B∣C)인 경우에만 세 번째 변수 C가 주어지면 조건부 독립입니다. 흥미롭게도 두 변수는 일반적으로 독립적일 수 있지만 세 번째 변수에 조건을 적용하면 종속성이 됩니다. 이는 두 개의 무작위 변수 A와 B가 세 번째 변수 C의 원인에 해당할 때 자주 발생합니다. 예를 들어, 부러진 뼈와 폐암은 일반 인구 집단에서는 독립적일 수 있지만 병원에 입원하는 것을 조건으로 하면 뼈가 부러진 것을 발견할 수 있습니다. 뼈는 폐암과 음의 상관관계가 있습니다. 이는 부러진 뼈가 어떤 사람이 병원에 있는 이유를 설명하여 폐암으로 인해 입원할 가능성을 낮추기 때문입니다.

And conversely, two dependent random variables can become independent upon conditioning on a third. This often happens when two otherwise unrelated events have a common cause. Shoe size and reading level are highly correlated among elementary school students, but this correlation disappears if we condition on age.

그리고 반대로, 두 개의 종속 확률 변수는 세 번째 변수에 대한 조건에 따라 독립 변수가 될 수 있습니다. 이는 서로 관련이 없는 두 가지 사건에 공통 원인이 있는 경우에 자주 발생합니다. 신발 사이즈와 독서 수준은 초등학생 사이에서 높은 상관관계가 있지만, 연령을 조건으로 하면 이러한 상관관계가 사라집니다.

2.6.5. An Example

Let’s put our skills to the test. Assume that a doctor administers an HIV test to a patient. This test is fairly accurate and fails only with 1% probability if the patient is healthy but reported as diseased, i.e., healthy patients test positive in 1% of cases. Moreover, it never fails to detect HIV if the patient actually has it. We use D1∈{0,1} to indicate the diagnosis (0 if negative and 1 if positive) and H∈{0,1} to denote the HIV status.

우리의 능력을 시험해 봅시다. 의사가 환자에게 HIV 테스트를 실시한다고 가정합니다. 이 테스트는 매우 정확하며 환자가 건강하지만 질병이 있는 것으로 보고된 경우, 즉 건강한 환자가 1%의 사례에서 양성 반응을 보이는 경우 1% 확률로만 실패합니다. 더욱이, 환자가 실제로 HIV에 감염되어 있는 경우에도 HIV를 탐지하는 데 실패하지 않습니다. D1∈{0,1}을 사용하여 진단(음성인 경우 0, 양성인 경우 1)을 나타내고 H∈{0,1}을 사용하여 HIV 상태를 나타냅니다.

Note that the column sums are all 1 (but the row sums do not), since they are conditional probabilities. Let’s compute the probability of the patient having HIV if the test comes back positive, i.e., P(H=1∣D1=1). Intuitively this is going to depend on how common the disease is, since it affects the number of false alarms. Assume that the population is fairly free of the disease, e.g., P(H=1)=0.0015. To apply Bayes’ theorem, we need to apply marginalization to determine

조건부 확률이므로 열 합은 모두 1입니다(행 합은 그렇지 않음). 검사 결과가 양성으로 나오면 환자가 HIV에 감염될 확률을 계산해 보겠습니다(예: P(H=1∣D1=1)). 직관적으로 이것은 잘못된 경보의 수에 영향을 미치기 때문에 질병이 얼마나 흔한지에 따라 달라집니다. 인구에 질병이 전혀 없다고 가정합니다(예: P(H=1)=0.0015). 베이즈 정리를 적용하려면 주변화를 적용하여 다음을 결정해야 합니다.

In other words, there is only a 13.06% chance that the patient actually has HIV, despite the test being pretty accurate. As we can see, probability can be counterintuitive. What should a patient do upon receiving such terrifying news? Likely, the patient would ask the physician to administer another test to get clarity. The second test has different characteristics and it is not as good as the first one.

즉, 테스트가 매우 정확함에도 불구하고 환자가 실제로 HIV에 감염될 확률은 13.06%에 불과합니다. 보시다시피 확률은 직관에 반할 수 있습니다. 이런 무서운 소식을 접한 환자는 어떻게 해야 할까요? 아마도 환자는 명확성을 얻기 위해 의사에게 또 다른 검사를 실시해 달라고 요청할 것입니다. 두 번째 테스트는 특성이 다르고 첫 번째 테스트만큼 좋지 않습니다.

Unfortunately, the second test comes back positive, too. Let’s calculate the requisite probabilities to invoke Bayes’ theorem by assuming conditional independence:

불행히도 두 번째 테스트에서도 양성 반응이 나왔습니다. 조건부 독립을 가정하여 베이즈 정리를 적용하는 데 필요한 확률을 계산해 보겠습니다.

Now we can apply marginalization to obtain the probability that both tests come back positive:

이제 우리는 소외화를 적용하여 두 테스트가 모두 양성으로 돌아올 확률을 얻을 수 있습니다.

Finally, the probability of the patient having HIV given that both tests are positive is

마지막으로, 두 검사 모두 양성인 경우 환자가 HIV에 감염될 확률은 다음과 같습니다.

That is, the second test allowed us to gain much higher confidence that not all is well. Despite the second test being considerably less accurate than the first one, it still significantly improved our estimate. The assumption of both tests being conditionally independent of each other was crucial for our ability to generate a more accurate estimate. Take the extreme case where we run the same test twice. In this situation we would expect the same outcome both times, hence no additional insight is gained from running the same test again. The astute reader might have noticed that the diagnosis behaved like a classifier hiding in plain sight where our ability to decide whether a patient is healthy increases as we obtain more features (test outcomes).

즉, 두 번째 테스트를 통해 우리는 모든 것이 좋지 않다는 훨씬 더 높은 확신을 얻을 수 있었습니다. 두 번째 테스트는 첫 번째 테스트보다 정확도가 상당히 떨어졌음에도 불구하고 여전히 우리의 추정치를 크게 향상시켰습니다. 두 테스트가 서로 조건부 독립이라는 가정은 보다 정확한 추정치를 생성하는 데 매우 중요했습니다. 동일한 테스트를 두 번 실행하는 극단적인 경우를 생각해 보겠습니다. 이 상황에서는 두 번 모두 동일한 결과가 나올 것으로 예상되므로 동일한 테스트를 다시 실행해도 추가적인 통찰력을 얻을 수 없습니다. 기민한 독자는 진단이 더 많은 특징(테스트 결과)을 얻을수록 환자의 건강 여부를 결정하는 능력이 증가하는 눈에 잘 띄는 곳에 숨어 있는 분류기처럼 행동한다는 것을 알아차렸을 것입니다.

2.6.6. Expectations

Often, making decisions requires not just looking at the probabilities assigned to individual events but composing them together into useful aggregates that can provide us with guidance. For example, when random variables take continuous scalar values, we often care about knowing what value to expect on average. This quantity is formally called an expectation. If we are making investments, the first quantity of interest might be the return we can expect, averaging over all the possible outcomes (and weighting by the appropriate probabilities). For instance, say that with 50% probability, an investment might fail altogether, with 40% probability it might provide a 2× return, and with 10% probability it might provide a 10× return 10×. To calculate the expected return, we sum over all returns, multiplying each by the probability that they will occur. This yields the expectation 0.5⋅0+0.4⋅2+0.1⋅10=1.8. Hence the expected return is 1.8×.

종종 결정을 내리려면 개별 사건에 할당된 확률을 살펴보는 것뿐만 아니라 지침을 제공할 수 있는 유용한 집계로 이를 함께 구성해야 합니다. 예를 들어, 확률 변수가 연속적인 스칼라 값을 취하는 경우 우리는 평균적으로 어떤 값을 기대하는지 아는 데 종종 관심을 갖습니다. 이 수량을 공식적으로 기대값이라고 합니다. 투자를 하는 경우 첫 번째 관심 수량은 가능한 모든 결과에 대한 평균을 계산하고 적절한 확률에 따라 가중치를 부여하여 기대할 수 있는 수익일 수 있습니다. 예를 들어, 50% 확률로 투자가 완전히 실패할 수 있고, 40% 확률로 2배의 수익을 제공할 수 있으며, 10% 확률로 10배의 수익을 제공할 수 있다고 가정해 보겠습니다. 기대 수익을 계산하기 위해 우리는 모든 수익을 합산하고 각 수익에 발생할 확률을 곱합니다. 이는 기대값 0.5⋅0+0.4⋅2+0.1⋅10=1.8을 산출합니다. 따라서 기대수익률은 1.8×입니다.

In general, the expectation (or average) of the random variable X is defined as

일반적으로 확률 변수 X의 기대값(또는 평균)은 다음과 같이 정의됩니다.

Likewise, for densities we obtain E[X]=∫x dp(x). Sometimes we are interested in the expected value of some function of x. We can calculate these expectations as

마찬가지로 밀도의 경우 E[X]=∫xdp(x)를 얻습니다. 때때로 우리는 x의 어떤 함수의 기대값에 관심이 있습니다. 우리는 이러한 기대치를 다음과 같이 계산할 수 있습니다.

for discrete probabilities and densities, respectively. Returning to the investment example from above, f might be the utility (happiness) associated with the return. Behavior economists have long noted that people associate greater disutility with losing money than the utility gained from earning one dollar relative to their baseline. Moreover, the value of money tends to be sub-linear. Possessing 100k dollars versus zero dollars can make the difference between paying the rent, eating well, and enjoying quality healthcare versus suffering through homelessness. On the other hand, the gains due to possessing 200k versus 100k are less dramatic. Reasoning like this motivates the cliché that “the utility of money is logarithmic”.

이산 확률과 밀도에 대해 각각. 위의 투자 예로 돌아가면, f는 수익과 관련된 효용(행복)일 수 있습니다. 행동경제학자들은 사람들이 자신의 기준에 비해 1달러를 벌어서 얻는 효용보다 돈을 잃는 데 더 큰 비효용성을 연관시킨다는 점을 오랫동안 지적해 왔습니다. 더욱이 화폐의 가치는 준선형적인 경향이 있습니다. 10만 달러를 소유하는 것과 0달러를 소유하는 것은 집세를 내고, 잘 먹고, 양질의 의료 서비스를 받는 것과 노숙자로 고통받는 것 사이의 차이를 만들 수 있습니다. 반면에 200,000개와 100,000개를 소유함으로써 얻을 수 있는 이득은 덜 극적입니다. 이런 추론은 “돈의 효용은 대수적이다”라는 상투적인 말을 하게 만든다.

If the utility associated with a total loss were −1, and the utilities associated with returns of 1, 2, and 10 were 1, 2 and 4, respectively, then the expected happiness of investing would be 0.5⋅(−1)+0.4⋅2+0.1⋅4=0.7 (an expected loss of utility of 30%). If indeed this were your utility function, you might be best off keeping the money in the bank.

총 손실과 관련된 효용이 −1이고 수익 1, 2, 10과 관련된 효용이 각각 1, 2, 4라면 투자의 기대 행복은 0.5⋅(−1)+0.4가 됩니다. ⋅2+0.1⋅4=0.7(예상 효용 손실 30%). 실제로 이것이 유틸리티 기능이라면 돈을 은행에 보관하는 것이 가장 좋습니다.

For financial decisions, we might also want to measure how risky an investment is. Here, we care not just about the expected value but how much the actual values tend to vary relative to this value. Note that we cannot just take the expectation of the difference between the actual and expected values. This is because the expectation of a difference is the difference of the expectations, i.e., E[X−E[X]]=E[X]−E[E[X]]=0. However, we can look at the expectation of any non-negative function of this difference. The variance of a random variable is calculated by looking at the expected value of the squared differences:

재정적 결정을 위해 투자가 얼마나 위험한지 측정하고 싶을 수도 있습니다. 여기서는 기대값뿐만 아니라 이 값에 비해 실제 값이 얼마나 달라지는 경향이 있는지도 중요합니다. 실제 값과 예상 값의 차이를 기대하는 것만으로는 충분하지 않습니다. 이는 차이에 대한 기대가 기대의 차이, 즉 E[X−E[X]]=E[X]−E[E[X]]=0이기 때문입니다. 그러나 우리는 이 차이의 음이 아닌 함수에 대한 기대를 살펴볼 수 있습니다. 확률 변수의 분산은 차이 제곱의 기대값을 확인하여 계산됩니다.

Here the equality follows by expanding (X−E[X])**2 = X**2 − 2XE[X]+E[X]**2 and taking expectations for each term. The square root of the variance is another useful quantity called the standard deviation. While this and the variance convey the same information (either can be calculated from the other), the standard deviation has the nice property that it is expressed in the same units as the original quantity represented by the random variable.

여기서는 (X−E[X])**2 = X**2 − 2XE[X]+E[X]**2를 확장하고 각 항에 대한 기대값을 취하여 동등성을 따릅니다. 분산의 제곱근은 표준편차라고 불리는 또 다른 유용한 양입니다. 이것과 분산은 동일한 정보를 전달하지만(둘 중 하나를 다른 것으로 계산할 수 있음), 표준 편차는 랜덤 변수가 나타내는 원래 양과 동일한 단위로 표현된다는 좋은 속성을 가지고 있습니다.

Lastly, the variance of a function of a random variable is defined analogously as

마지막으로, 확률 변수의 함수 분산은 다음과 유사하게 정의됩니다.

Returning to our investment example, we can now compute the variance of the investment. It is given by 0.5⋅0+0.4⋅2**2+0.1⋅10**2−1.8**2=8.36. For all intents and purposes this is a risky investment. Note that by mathematical convention mean and variance are often referenced as μ and σ**2. This is particularly the case whenever we use it to parametrize a Gaussian distribution.

투자 예로 돌아가서 이제 투자의 분산을 계산할 수 있습니다. 이는 0.5⋅0+0.4⋅2**2+0.1⋅10**2−1.8**2=8.36으로 제공됩니다. 모든 의도와 목적을 위해 이것은 위험한 투자입니다. 수학적 관례에 따라 평균과 분산은 종종 μ 및 σ**2로 참조됩니다. 특히 가우스 분포를 매개변수화하는 데 사용할 때마다 그렇습니다.

In the same way as we introduced expectations and variance for scalar random variables, we can do so for vector-valued ones. Expectations are easy, since we can apply them elementwise. For instance, μ def= Ex∼p[x] has coordinates μ i = Ex∼p[xi]. Covariances are more complicated. We define them by taking expectations of the outer product of the difference between random variables and their mean:

스칼라 확률 변수에 대한 기대치와 분산을 도입한 것과 같은 방식으로 벡터 값 변수에 대해서도 그렇게 할 수 있습니다. 요소별로 적용할 수 있으므로 기대하기 쉽습니다. 예를 들어 μ def= Ex∼p[x]는 μ i = Ex∼p[xi] 좌표를 갖는다. 공분산은 더 복잡합니다. 우리는 확률 변수와 평균 간의 차이에 대한 외부 곱을 기대하여 이를 정의합니다.

This matrix ∑ is referred to as the covariance matrix. An easy way to see its effect is to consider some vector v of the same size as x. It follows that

As such, ∑ allows us to compute the variance for any linear function of x by a simple matrix multiplication. The off-diagonal elements tell us how correlated the coordinates are: a value of 0 means no correlation, where a larger positive value means that they are more strongly correlated.

따라서 ∑를 사용하면 간단한 행렬 곱셈을 통해 x의 모든 선형 함수에 대한 분산을 계산할 수 있습니다. 비대각선 요소는 좌표의 상관 관계를 알려줍니다. 값이 0이면 상관 관계가 없음을 의미하고 양수 값이 클수록 상관 관계가 더 강하다는 의미입니다.

2.6.7. Discussion

In machine learning, there are many things to be uncertain about! We can be uncertain about the value of a label given an input. We can be uncertain about the estimated value of a parameter. We can even be uncertain about whether data arriving at deployment is even from the same distribution as the training data.

머신러닝에는 불확실한 부분이 많습니다! 입력이 주어지면 레이블의 값이 불확실할 수 있습니다. 매개변수의 추정값이 불확실할 수 있습니다. 배포 시 도착하는 데이터가 교육 데이터와 동일한 분포에서 나온 것인지 여부도 불확실할 수 있습니다.

By aleatoric uncertainty, we mean uncertainty that is intrinsic to the problem, and due to genuine randomness unaccounted for by the observed variables. By epistemic uncertainty, we mean uncertainty over a model’s parameters, the sort of uncertainty that we can hope to reduce by collecting more data. We might have epistemic uncertainty concerning the probability that a coin turns up heads, but even once we know this probability, we are left with aleatoric uncertainty about the outcome of any future toss. No matter how long we watch someone tossing a fair coin, we will never be more or less than 50% certain that the next toss will come up heads. These terms come from mechanical modeling, (see e.g., Der Kiureghian and Ditlevsen (2009) for a review on this aspect of uncertainty quantification). It is worth noting, however, that these terms constitute a slight abuse of language. The term epistemic refers to anything concerning knowledge and thus, in the philosophical sense, all uncertainty is epistemic.

우연적 불확실성이란 문제에 내재된 불확실성, 관찰된 변수에 의해 설명되지 않는 진정한 무작위성으로 인한 불확실성을 의미합니다. 인식론적 불확실성이란 모델 매개변수에 대한 불확실성, 즉 더 많은 데이터를 수집하여 줄일 수 있는 불확실성을 의미합니다. 동전이 앞면이 나올 확률에 대해 인식론적 불확실성이 있을 수 있지만, 이 확률을 알더라도 미래 던지기의 결과에 대한 우연적 불확실성이 남아 있습니다. 누군가가 공정한 동전을 던지는 것을 얼마나 오랫동안 지켜보더라도 우리는 다음 번 던질 때 앞면이 나올 것이라는 확신을 50% 이상 또는 이하로 결코 확신할 수 없습니다. 이러한 용어는 기계적 모델링에서 유래되었습니다(불확도 정량화의 이러한 측면에 대한 검토는 Der Kiureghian 및 Ditlevsen(2009) 참조). 그러나 이러한 용어가 약간의 언어 남용을 구성한다는 점은 주목할 가치가 있습니다. 인식론이라는 용어는 지식에 관한 모든 것을 의미하므로 철학적 의미에서 모든 불확실성은 인식론적입니다.

We saw that sampling data from some unknown probability distribution can provide us with information that can be used to estimate the parameters of the data generating distribution. That said, the rate at which this is possible can be quite slow. In our coin tossing example (and many others) we can do no better than to design estimators that converge at a rate of 1/ √ n, where n is the sample size (e.g., the number of tosses). This means that by going from 10 to 1000 observations (usually a very achievable task) we see a tenfold reduction of uncertainty, whereas the next 1000 observations help comparatively little, offering only a 1.41 times reduction. This is a persistent feature of machine learning: while there are often easy gains, it takes a very large amount of data, and often with it an enormous amount of computation, to make further gains. For an empirical review of this fact for large scale language models see Revels et al. (2016).

우리는 알려지지 않은 확률 분포의 샘플링 데이터가 데이터 생성 분포의 매개변수를 추정하는 데 사용할 수 있는 정보를 제공할 수 있음을 확인했습니다. 즉, 이것이 가능한 속도는 상당히 느릴 수 있습니다. 동전 던지기 예제(및 기타 여러 예제)에서 우리는 1/ √n의 비율로 수렴하는 추정기를 설계하는 것보다 더 나은 것을 할 수 없습니다. 여기서 n은 표본 크기(예: 던지기 횟수)입니다. 이는 10개에서 1000개의 관측값(일반적으로 매우 달성 가능한 작업)으로 이동하면 불확실성이 10배 감소한 반면, 다음 1000개의 관측값은 1.41배만 감소하여 비교적 거의 도움이 되지 않는다는 것을 의미합니다. 이는 기계 학습의 지속적인 특징입니다. 쉽게 얻을 수 있는 경우가 많지만 추가 이득을 얻으려면 매우 많은 양의 데이터와 엄청난 양의 계산이 필요한 경우가 많습니다. 대규모 언어 모델에 대한 이 사실에 대한 실증적 검토는 Revels et al. (2016).

We also sharpened our language and tools for statistical modeling. In the process of that we learned about conditional probabilities and about one of the most important equations in statistics—Bayes’ theorem. It is an effective tool for decoupling information conveyed by data through a likelihood term P(B∣A) that addresses how well observations B match a choice of parameters A, and a prior probability P(A) which governs how plausible a particular choice of A was in the first place. In particular, we saw how this rule can be applied to assign probabilities to diagnoses, based on the efficacy of the test and the prevalence of the disease itself (i.e., our prior).

우리는 또한 통계 모델링을 위한 언어와 도구를 개선했습니다. 그 과정에서 우리는 조건부 확률과 통계에서 가장 중요한 방정식 중 하나인 베이즈 정리에 대해 배웠습니다. 이는 관측치 B가 매개변수 A의 선택과 얼마나 잘 일치하는지를 다루는 가능성 항 P(B∣A)와 매개변수 A의 선택이 얼마나 타당한지를 제어하는 사전 확률 P(A)를 통해 데이터에 의해 전달되는 정보를 분리하는 효과적인 도구입니다. A가 먼저였다. 특히, 우리는 테스트의 효능과 질병 자체의 유병률(예: 이전)을 기반으로 진단에 확률을 할당하기 위해 이 규칙을 적용할 수 있는 방법을 살펴보았습니다.

Lastly, we introduced a first set of nontrivial questions about the effect of a specific probability distribution, namely expectations and variances. While there are many more than just linear and quadratic expectations for a probability distribution, these two already provide a good deal of knowledge about the possible behavior of the distribution. For instance, Chebyshev’s inequality states that P(X|X− μ|≥k σ)≤1/k**2, where μ is the expectation, σ**2 is the variance of the distribution, and k>1 is a confidence parameter of our choosing. It tells us that draws from a distribution lie with at least 50% probability within a [− √ 2σ , √ 2σ ] interval centered on the expectation.

마지막으로, 특정 확률 분포, 즉 기대값과 분산의 효과에 대한 첫 번째 중요하지 않은 질문 세트를 소개했습니다. 확률 분포에 대한 선형 및 2차 기대치보다 더 많은 것이 있지만 이 두 가지는 이미 분포의 가능한 동작에 대한 많은 지식을 제공합니다. 예를 들어, 체비쇼프 부등식은 P(X|X− μ|≥k σ)≤1/k**2라고 말합니다. 여기서 μ는 기대값이고, σ**2는 분포의 분산이고, k>1은 우리가 선택한 신뢰 매개변수. 이는 기대값을 중심으로 [− √ 2σ , √ 2σ ] 간격 내에서 최소 50% 확률로 분포 거짓말에서 도출된다는 것을 알려줍니다.

2.6.8. Exercises

Give an example where observing more data can reduce the amount of uncertainty about the outcome to an arbitrarily low level.

더 많은 데이터를 관찰하면 결과에 대한 불확실성을 임의로 낮은 수준으로 줄일 수 있는 예를 들어보세요.
Give an example where observing more data will only reduce the amount of uncertainty up to a point and then no further. Explain why this is the case and where you expect this point to occur.

더 많은 데이터를 관찰하면 불확실성의 양이 어느 정도 줄어들 뿐 그 이상은 줄어들지 않는 예를 들어보세요. 왜 이런 일이 발생하는지, 그리고 이러한 상황이 어디서 발생할 것으로 예상하는지 설명하세요.
We empirically demonstrated convergence to the mean for the toss of a coin. Calculate the variance of the estimate of the probability that we see a head after drawing n samples.

우리는 동전 던지기의 평균에 대한 수렴을 경험적으로 증명했습니다. n개의 샘플을 뽑은 후 머리가 보일 확률 추정치의 분산을 계산합니다.
1. How does the variance scale with the number of observations?
  
  관측치 수에 따라 분산이 어떻게 확장되나요?
2. Use Chebyshev’s inequality to bound the deviation from the expectation.
  
  체비쇼프 부등식을 사용하여 기대치로부터의 편차를 제한합니다.
3. How does it relate to the central limit theorem?
  
  중심극한정리와 어떤 관련이 있나요?

저작자표시 (새창열림)

'Dive into Deep Learning > D2L Preliminaries' 카테고리의 다른 글

D2L - 2.7. Documentation (2)	2023.10.14
D2L - 2.5. Automatic Differentiation (0)	2023.10.12
D2L - 2.4. Calculus : 미적분 (1)	2023.10.12
D2L - 2.3. Linear Algebra - 선형 대수학 (1)	2023.10.11
D2L - 2.2. Data Preprocessing (0)	2023.10.09
D2L - 2.1. Data Manipulation (0)	2023.10.09
D2L - 2. Preliminaries (0)	2023.10.09

Dive into Deep Learning/D2L Preliminaries

D2L - 2.5. Automatic Differentiation

2023. 10. 12. 23:42 | Posted by 솔웅

https://d2l.ai/chapter_preliminaries/autograd.html

2.5. Automatic Differentiation — Dive into Deep Learning 1.0.3 documentation

d2l.ai

2.5. Automatic Differentiation

Recall from Section 2.4 that calculating derivatives is the crucial step in all the optimization algorithms that we will use to train deep networks. While the calculations are straightforward, working them out by hand can be tedious and error-prone, and these issues only grow as our models become more complex.

섹션 2.4에서 도함수를 계산하는 것이 심층 네트워크를 훈련하는 데 사용할 모든 최적화 알고리즘에서 중요한 단계라는 점을 상기해 보세요. 계산은 간단하지만 손으로 계산하는 것은 지루하고 오류가 발생하기 쉬울 수 있으며 이러한 문제는 모델이 더 복잡해질수록 커집니다.

Fortunately all modern deep learning frameworks take this work off our plates by offering automatic differentiation (often shortened to autograd). As we pass data through each successive function, the framework builds a computational graph that tracks how each value depends on others. To calculate derivatives, automatic differentiation works backwards through this graph applying the chain rule. The computational algorithm for applying the chain rule in this fashion is called backpropagation.

다행스럽게도 모든 최신 딥 러닝 프레임워크는 자동 차별화 automatic differentiation (종종 autograd로 축약됨)를 제공하여 이 작업을 수행합니다. 각 연속 함수를 통해 데이터를 전달할 때 프레임워크는 각 값이 다른 값에 어떻게 의존하는지 추적하는 계산 그래프를 작성합니다. 도함수를 계산하기 위해 자동 미분은 체인 규칙을 적용하여 이 그래프를 통해 역방향으로 작동합니다. 이러한 방식으로 체인 규칙을 적용하는 계산 알고리즘을 역전파라고 합니다.

While autograd libraries have become a hot concern over the past decade, they have a long history. In fact the earliest references to autograd date back over half of a century (Wengert, 1964). The core ideas behind modern backpropagation date to a PhD thesis from 1980 (Speelpenning, 1980) and were further developed in the late 1980s (Griewank, 1989). While backpropagation has become the default method for computing gradients, it is not the only option. For instance, the Julia programming language employs forward propagation (Revels et al., 2016). Before exploring methods, let’s first master the autograd package.

Autograd 라이브러리는 지난 10년 동안 뜨거운 관심사가 되었지만 오랜 역사를 가지고 있습니다. 실제로 autograd에 대한 최초의 언급은 반세기 전으로 거슬러 올라갑니다(Wengert, 1964). 현대 역전파 backpropagation 의 핵심 아이디어는 1980년 박사 학위 논문(Speelpenning, 1980)으로 시작되었으며 1980년대 후반에 더욱 발전되었습니다(Griewank, 1989). 역전파가 기울기를 계산하는 기본 방법이 되었지만 이것이 유일한 옵션은 아닙니다. 예를 들어 Julia 프로그래밍 언어는 순방향 전파 forward propagation 를 사용합니다(Revels et al., 2016). 방법을 살펴보기 전에 먼저 autograd 패키지를 마스터해 보겠습니다.

import torch

2.5.1. A Simple Function

Let’s assume that we are interested in differentiating the function y=2x^⊤x with respect to the column vector x. To start, we assign x an initial value.

열 벡터 x에 대해 함수 y=2x^⊤x를 미분하는 데 관심이 있다고 가정해 보겠습니다. 시작하려면 x에 초기 값을 할당합니다.

x = torch.arange(4.0)
x

이 코드는 파이토치(PyTorch)를 사용하여 텐서를 생성하고 값을 출력하는 간단한 코드입니다. 각 줄의 코드를 설명하겠습니다:

x = torch.arange(4.0): 이 코드는 0부터 3까지의 연속된 실수 값을 가지는 1차원 텐서를 생성합니다. torch는 파이토치 라이브러리를 나타내며, torch.arange(4.0)는 0.0, 1.0, 2.0, 3.0으로 구성된 텐서를 만듭니다. 따라서 x에는 이러한 값들이 저장됩니다.
x: 이 부분은 x를 출력하는 코드입니다. 따라서 코드를 실행하면 텐서 x의 값이 표시됩니다.

실행 결과로 x에는 0.0, 1.0, 2.0, 3.0이 포함된 1차원 텐서가 저장되며, 출력에서는 이 값들이 표시됩니다.

tensor([0., 1., 2., 3.])

Before we calculate the gradient of y with respect to x, we need a place to store it. In general, we avoid allocating new memory every time we take a derivative because deep learning requires successively computing derivatives with respect to the same parameters a great many times, and we might risk running out of memory. Note that the gradient of a scalar-valued function with respect to a vector x is vector-valued with the same shape as x.

x에 대한 y의 기울기를 계산하기 전에 이를 저장할 장소가 필요합니다. 일반적으로 딥 러닝에서는 동일한 매개변수에 대해 여러 번 연속적으로 derivatives 을 계산해야 하고 메모리가 부족할 위험이 있으므로 derivatives 을 가져올 때마다 새 메모리를 할당하는 것을 피합니다. 벡터 x에 대한 스칼라 값 함수의 기울기는 x와 동일한 모양으로 벡터 값을 갖습니다.

# Can also create x = torch.arange(4.0, requires_grad=True)
x.requires_grad_(True)
x.grad  # The gradient is None by default

이 코드는 PyTorch를 사용하여 그레디언트(gradient)를 계산하기 위해 텐서에 requires_grad 속성을 추가하는 방법을 보여줍니다. 한국어로 코드를 설명하겠습니다:

x.requires_grad_(True): 이 코드는 x 텐서의 requires_grad 속성을 True로 설정합니다. 이것은 텐서 x에서 그레디언트를 계산하려는 의도를 나타냅니다. 즉, x의 값이 어떻게 변경되는지 추적하여 나중에 그레디언트를 계산할 수 있도록 합니다.
x.grad: 이 코드는 x 텐서의 그레디언트를 검색합니다. 그러나 여기서는 x의 그레디언트가 아직 계산되지 않았으므로 그 값은 기본적으로 None입니다.

즉, x.requires_grad_(True)를 통해 PyTorch에게 x의 그레디언트를 추적하도록 지시하고, 나중에 해당 그레디언트를 계산할 수 있도록 준비를 마칩니다. 현재는 아직 그레디언트가 계산되지 않았기 때문에 x.grad의 값은 None입니다. 그레디언트는 손실 함수 등의 역전파(backpropagation) 과정에서 계산됩니다.

We now calculate our function of x and assign the result to y.

이제 x의 함수를 계산하고 그 결과를 y에 할당합니다.

y = 2 * torch.dot(x, x)
y

이 코드는 PyTorch를 사용하여 스칼라 값을 계산하고 y에 저장하는 예제입니다.

x는 PyTorch 텐서입니다. 이 코드에서는 벡터 x와 자기 자신을 내적(dot product)한 결과를 활용하려고 합니다.
torch.dot(x, x)는 x 텐서와 x 자기 자신과의 내적을 계산합니다.
2 * torch.dot(x, x)는 내적 결과에 2를 곱한 값을 y에 할당합니다. 즉, y는 2 * x 벡터의 내적값입니다.
따라서 y에는 2 * x 벡터의 내적값이 저장됩니다.

y에는 스칼라 값이 저장되므로 y는 스칼라 텐서입니다. 내적은 벡터 간의 유사도를 계산하는 데 사용되며, 위의 코드에서는 x 벡터와 x 벡터의 유사도를 계산하여 2를 곱한 결과가 y에 저장됩니다.

tensor(28., grad_fn=<MulBackward0>)

We can now take the gradient of y with respect to x by calling its backward method. Next, we can access the gradient via x’s grad attribute.

이제 역방향 메소드를 호출하여 x에 대한 y의 기울기를 얻을 수 있습니다. 다음으로, x의 grad 속성을 통해 그래디언트에 접근할 수 있습니다.

y.backward()
x.grad

이 코드는 PyTorch에서 역전파(backpropagation)를 사용하여 그래디언트(gradient)를 계산하는 예제입니다.

y는 이전 코드에서 정의한 스칼라 값입니다. 이 값을 계산하기 위해 사용된 연산 그래프를 통해 역전파를 수행하려고 합니다.
y.backward()는 y의 그래디언트(도함수)를 계산하는 역전파(backpropagation)를 시작합니다. backward 메서드는 그래디언트를 계산하고 x.grad에 저장합니다.
x.grad는 x 텐서에 대한 그래디언트 값을 나타냅니다. 그래디언트는 손실 함수(여기서는 y)를 x에 대해 편미분한 결과로, x의 각 요소에 대한 미분값이 저장됩니다.

즉, x.grad에는 x 텐서의 각 원소에 대한 미분값이 저장되며, 이것은 역전파를 통해 손실 함수 y를 x에 대해 미분한 결과입니다. 이를 통해 PyTorch를 사용하여 그래디언트를 계산하고, 이후에 그래디언트 기반 최적화 알고리즘(예: 확률적 경사 하강법)을 사용하여 모델을 업데이트할 수 있습니다.

tensor([ 0.,  4.,  8., 12.])

We already know that the gradient of the function y=2x^⊤x with respect to x should be 4x. We can now verify that the automatic gradient computation and the expected result are identical.

우리는 x에 대한 함수 y=2x^⊤x의 기울기가 4x여야 한다는 것을 이미 알고 있습니다. 이제 자동 기울기 계산과 예상 결과가 동일한 것을 확인할 수 있습니다.

x.grad == 4 * x

이 코드는 PyTorch를 사용하여 그래디언트(gradient)를 계산하고 검증하는 부분입니다.

x.grad는 x 텐서의 그래디언트 값을 나타냅니다. 이 그래디언트는 y = 2 * torch.dot(x, x)의 손실 함수에 대한 x에 대한 미분값으로 이전 코드에서 y.backward()를 호출하여 계산되었습니다.
4 * x는 x 텐서의 각 요소에 4를 곱한 결과를 나타냅니다.
x.grad == 4 * x는 x.grad와 4 * x를 비교하여 각 요소가 동일한지 여부를 확인합니다. 이것은 그래디언트 계산이 올바르게 수행되었는지 검증하는 부분입니다. 만약 x.grad의 각 요소가 4 * x와 동일하다면, 그래디언트 계산이 정확하게 이루어졌음을 의미합니다.

이를 통해 코드는 그래디언트를 계산하고 이 값이 수동으로 계산한 기대값과 일치하는지 확인하는 유효성 검사(validation)를 수행합니다.

tensor([True, True, True, True])

Now let’s calculate another function of x and take its gradient. Note that PyTorch does not automatically reset the gradient buffer when we record a new gradient. Instead, the new gradient is added to the already-stored gradient. This behavior comes in handy when we want to optimize the sum of multiple objective functions. To reset the gradient buffer, we can call x.grad.zero_() as follows:

이제 x의 또 다른 함수를 계산하고 그 기울기를 살펴보겠습니다. PyTorch는 새 그래디언트를 기록할 때 그래디언트 버퍼를 자동으로 재설정하지 않습니다. 대신 이미 저장된 그래디언트에 새 그래디언트가 추가됩니다. 이 동작은 여러 목적 함수의 합을 최적화하려고 할 때 유용합니다. 그래디언트 버퍼를 재설정하려면 다음과 같이 x.grad.zero_()를 호출할 수 있습니다.

x.grad.zero_()  # Reset the gradient
y = x.sum()
y.backward()
x.grad

이 코드는 PyTorch를 사용하여 그래디언트(gradient)를 계산하고 초기화하는 부분입니다.

tensor([1., 1., 1., 1.])

x.grad.zero_()은 x 텐서의 그래디언트 값을 초기화합니다. 그래디언트를 초기화하는 이유는 이전 그래디언트 값이 아직 y.backward()로 계산되지 않았을 때 그대로 남아있을 수 있기 때문입니다. 이 함수를 호출하여 모든 그래디언트 값을 0으로 설정합니다.
y = x.sum()는 x 텐서의 모든 요소를 더한 값을 y에 저장합니다.
y.backward()는 y를 사용하여 x 텐서의 그래디언트를 계산합니다. 여기서 y는 x에 대한 함수이며, x의 각 요소에 대한 편미분을 계산합니다.
x.grad는 이렇게 계산된 그래디언트를 나타냅니다. x의 각 요소에 대한 미분값이 저장되어 있습니다.

따라서 이 코드는 그래디언트를 초기화한 다음 y를 사용하여 그래디언트를 계산하고, x.grad에 이 그래디언트 값을 저장합니다.

2.5.2. Backward for Non-Scalar Variables

When y is a vector, the most natural representation of the derivative of y with respect to a vector x is a matrix called the Jacobian that contains the partial derivatives of each component of y with respect to each component of x. Likewise, for higher-order y and x, the result of differentiation could be an even higher-order tensor.

y가 벡터인 경우 벡터 x에 대한 y의 도함수의 가장 자연스러운 표현은 x의 각 구성 요소에 대한 y의 각 구성 요소의 부분 도함수를 포함하는 야코비안(Jacobian)이라는 행렬입니다. 마찬가지로, 고차 y와 x의 경우 미분의 결과는 훨씬 더 고차 텐서가 될 수 있습니다.

Jacobian matrix 란?

The Jacobian matrix, often denoted as J, is a fundamental concept in mathematics and plays a crucial role in various fields, particularly in calculus, linear algebra, and optimization. It is essentially a matrix of all the first-order partial derivatives of a vector-valued function. Let's break down the Jacobian matrix step by step:

야코비안 행렬(Jacobian matrix)은 수학에서 중요한 개념 중 하나로, 주로 미적분학, 선형 대수 및 최적화 분야에서 핵심 역할을 합니다. 이것은 벡터 값 함수의 모든 일차 편미분치를 나타내는 행렬입니다. 야코비안 행렬을 단계별로 살펴보겠습니다.

1. Vector-Valued Function:

The Jacobian matrix is used to represent the derivative of a vector-valued function, which takes one or more input variables and maps them to a vector of output variables.
야코비안 행렬은 하나 이상의 입력 변수를 가지고 이들을 출력 변수 벡터로 매핑하는 벡터 값 함수의 도함수를 나타내는 데 사용됩니다.

2. Components of the Jacobian:

Consider a vector-valued function F(x), where x is an input vector, and F(x) is a vector of functions (F₁(x), F₂(x), ..., Fₙ(x)).
벡터 값 함수 F(x)를 고려해봅시다. 여기서 x는 입력 벡터이고 F(x)는 함수 값 벡터(F₁(x), F₂(x), ..., Fₙ(x))입니다.
The Jacobian matrix J of F(x) consists of all the first-order partial derivatives of these component functions with respect to the input variables:
F(x)의 야코비안 행렬 J는 이러한 구성 함수들의 입력 변수 x에 대한 모든 일차 편미분을 포함합니다:
Here, each entry Jᵢⱼ represents the partial derivative of Fᵢ with respect to xⱼ.
여기서 각 항목 Jᵢⱼ는 Fᵢ가 xⱼ에 대한 편미분을 나타냅니다.

3. Interpretation:

Each row of the Jacobian matrix corresponds to one of the component functions Fᵢ, and each column corresponds to one of the input variables xⱼ.
야코비안 행렬의 각 행은 출력 Fᵢ의 하나의 구성 함수에 해당하고, 각 열은 입력 변수 xⱼ 중 하나에 해당합니다.
The value in the Jᵢⱼ entry represents how much the i-th component of the output Fᵢ changes with respect to a small change in the j-th input variable xⱼ.
Jᵢⱼ 항목의 값은 출력 Fᵢ의 i번째 구성 요소가 입력 변수 xⱼ에 대한 작은 변화에 얼마나 민감한지를 나타냅니다.
Essentially, it quantifies the sensitivity of the component functions to changes in the input variables.
기본적으로, 입력 변수의 변화에 대한 구성 함수의 민감도를 측정합니다.

4. Applications:

The Jacobian matrix is widely used in various fields:
야코비안 행렬은 다양한 분야에서 널리 활용됩니다:
- Calculus: It helps in solving problems related to multivariate calculus, such as gradient descent, optimization, and Taylor series expansions.
- 미적분학: 경사 하강법, 최적화 및 테일러 급수 전개와 관련된 다변수 미적분 문제 해결에 사용됩니다.
- Physics: It plays a role in mechanics, quantum mechanics, and the study of dynamic systems.
- 물리학: 역학, 양자 역학 및 동적 시스템 연구에 역할을 합니다.
- Engineering: Engineers use it in control theory and robotics to understand the relationships between input and output variables.
- 공학: 제어 이론 및 로봇 공학에서 입력과 출력 변수 간의 관계를 이해하는 데 사용됩니다.
- Machine Learning: It is used in neural networks for backpropagation, where the Jacobian matrix helps calculate gradients.
- 기계 학습: 야코비안 행렬은 역전파(backpropagation)와 관련하여 신경망에서 기울기를 계산하는 데 사용됩니다.
- Economics: It has applications in economics models that involve multiple variables and equations.
- 경제학: 여러 변수와 방정식을 포함하는 경제 모델에서 응용됩니다.

In summary, the Jacobian matrix is a powerful mathematical tool used to understand how a vector-valued function responds to small changes in its input variables. It is a fundamental concept in various scientific and engineering disciplines and is a key component of many numerical and analytical techniques.

요약하면 야코비안 행렬은 벡터 값 함수가 입력 변수의 작은 변화에 어떻게 반응하는지를 이해하는 강력한 수학적 도구입니다. 다양한 과학 및 공학 분야에서 중요한 개념이며 많은 수치 및 해석적 기술의 핵심 구성 요소입니다.

While Jacobians do show up in some advanced machine learning techniques, more commonly we want to sum up the gradients of each component of y with respect to the full vector x, yielding a vector of the same shape as x. For example, we often have a vector representing the value of our loss function calculated separately for each example among a batch of training examples. Here, we just want to sum up the gradients computed individually for each example.

Jacobians 행렬은 일부 고급 기계 학습 기술에 나타나기도 하지만, 더 일반적으로는 전체 벡터 x에 대한 y의 각 구성 요소의 기울기를 합산하여 x와 동일한 모양의 벡터를 생성하려고 합니다. 예를 들어, 훈련 예제 배치 중 각 예제에 대해 별도로 계산된 손실 함수 값을 나타내는 벡터가 있는 경우가 많습니다. 여기서는 각 예에 대해 개별적으로 계산된 그래디언트를 요약하려고 합니다.

Because deep learning frameworks vary in how they interpret gradients of non-scalar tensors, PyTorch takes some steps to avoid confusion. Invoking backward on a non-scalar elicits an error unless we tell PyTorch how to reduce the object to a scalar. More formally, we need to provide some vector v such that backward will compute v^⊤ ∂ _xy rather than ∂ _xy . This next part may be confusing, but for reasons that will become clear later, this argument (representing v) is named gradient. For a more detailed description, see Yang Zhang’s Medium post.

딥 러닝 프레임워크는 비 스칼라 텐서의 기울기를 해석하는 방법이 다양하므로 PyTorch는 혼란을 피하기 위해 몇 가지 조치를 취합니다. 스칼라가 아닌 항목을 역으로 호출하면 PyTorch에 객체를 스칼라로 줄이는 방법을 알려주지 않는 한 오류가 발생합니다. 보다 공식적으로, 우리는 역방향이 ∂_xy 대신 v^⊤∂_xy를 계산하도록 일부 벡터 v를 제공해야 합니다. 다음 부분은 혼란스러울 수 있지만 나중에 명확해질 이유로 이 인수(v를 나타냄)의 이름은 그래디언트입니다. 자세한 설명은 Yang Zhang의 Medium 게시물을 참조하세요.

x.grad.zero_()
y = x * x
y.backward(gradient=torch.ones(len(y)))  # Faster: y.sum().backward()
x.grad

이 코드는 PyTorch를 사용하여 텐서 x에 대한 그래디언트를 계산하는 예제입니다. 아래에서 코드를 한 줄씩 설명하겠습니다.

x.grad는 x에 대한 그래디언트(미분)를 나타내는 PyTorch 텐서입니다. 이 코드는 이 그래디언트를 0으로 초기화합니다. 즉, 이전에 계산된 그래디언트를 지우고 새로운 그래디언트를 계산할 준비를 합니다.

새로운 텐서 y를 생성하며, 이는 x의 제곱을 계산한 결과입니다.

y.backward() 메서드는 y에 대한 그래디언트를 계산하는 역전파(backpropagation)를 수행합니다. 여기서 gradient 매개변수는 역전파 시, 역전파 시작점에서의 그래디언트 값을 설정합니다.
gradient=torch.ones(len(y))은 y가 스칼라값이 아니라 벡터(여기서는 x의 길이만큼)라는 것을 고려해, 역전파의 시작점에서 그래디언트를 모든 요소가 1인 벡터로 설정합니다.
이렇게 하면 x에 대한 그래디언트가 각 원소마다 2 * x로 설정됩니다.

이제 x.grad에는 x에 대한 그래디언트 값이 포함되어 있습니다. 이 경우, x.grad의 모든 원소는 2 * x와 동일한 값을 가지게 됩니다.

이 코드는 x를 사용하여 y = x^2를 계산하고, 이후 x에 대한 그래디언트를 역전파를 통해 계산하는 간단한 예제를 보여줍니다. 역전파를 수행하면 x.grad에 그래디언트 값이 저장되므로, 이를 통해 x에 대한 미분을 계산하거나 경사 하강법과 같은 최적화 알고리즘을 수행할 수 있습니다.

tensor([0., 2., 4., 6.])

2.5.3. Detaching Computation

Sometimes, we wish to move some calculations outside of the recorded computational graph. For example, say that we use the input to create some auxiliary intermediate terms for which we do not want to compute a gradient. In this case, we need to detach the respective computational graph from the final result. The following toy example makes this clearer: suppose we have z = x * y and y = x * x but we want to focus on the direct influence of x on z rather than the influence conveyed via y. In this case, we can create a new variable u that takes the same value as y but whose provenance (how it was created) has been wiped out. Thus u has no ancestors in the graph and gradients do not flow through u to x. For example, taking the gradient of z = x * u will yield the result u, (not 3 * x * x as you might have expected since z = x * x * x).

때로는 기록된 계산 그래프 외부로 일부 계산을 이동하고 싶을 때도 있습니다. 예를 들어 입력을 사용하여 기울기를 계산하지 않으려는 일부 보조 중간 항을 생성한다고 가정해 보겠습니다. 이 경우 최종 결과에서 해당 계산 그래프를 분리해야 합니다. 다음 toy 예제는 이를 더 명확하게 해줍니다. z = x * y 및 y = x * x가 있지만 y를 통해 전달되는 영향보다는 x가 z에 미치는 직접적인 영향에 초점을 맞추고 싶다고 가정합니다. 이 경우 y와 동일한 값을 가지지만 출처(생성 방법)가 지워진 새 변수 u를 만들 수 있습니다. 따라서 u에는 그래프에 조상이 없으며 기울기는 u를 통해 x로 흐르지 않습니다. 예를 들어, z = x * u의 기울기를 취하면 결과 u가 생성됩니다(z = x * x * x 이후 예상했던 3 * x * x가 아님).

x.grad.zero_()
y = x * x
u = y.detach()
z = u * x

z.sum().backward()
x.grad == u

이 코드는 PyTorch를 사용하여 그래디언트(미분)와 .detach() 메서드의 역할을 설명하는 예제입니다. 아래에서 코드를 한 줄씩 설명하겠습니다.

x.grad는 x에 대한 그래디언트(미분)를 나타내는 PyTorch 텐서입니다. 이 코드는 이전에 계산된 그래디언트를 지우고 새로운 그래디언트를 계산할 준비를 합니다.

새로운 텐서 y를 생성하며, 이는 x의 제곱을 계산한 결과입니다.

.detach() 메서드는 텐서를 분리(detach)하고 그래디언트 연산을 중단합니다. 이는 u가 y와 동일한 값을 가지지만 그래디언트가 연결되어 있지 않음을 의미합니다.

z는 u와 x의 곱셈으로 계산됩니다.

z의 모든 원소의 합에 대한 그래디언트를 계산합니다. 이러한 그래디언트 계산은 역전파(backpropagation)를 통해 수행됩니다.

이 코드는 x.grad와 u를 비교합니다. 여기서 x.grad는 z.sum()에 대한 그래디언트이며, u는 y를 detach하여 생성된 텐서입니다. 이 둘은 같은 값을 가지므로 이 비교는 True를 반환합니다.

즉, x.grad는 z.sum()에 대한 그래디언트이며, 이는 u를 사용하여 x를 곱한 결과인 z에 대한 그래디언트와 같습니다.detach() 메서드를 사용하여 그래디언트 연산을 분리함으로써 그래디언트가 일부 연산에서 중지되도록 할 수 있습니다.

tensor([True, True, True, True])

Note that while this procedure detaches y’s ancestors from the graph leading to z, the computational graph leading to y persists and thus we can calculate the gradient of y with respect to x.

이 절차가 z로 이어지는 그래프에서 y의 조상을 분리하는 동안 y로 이어지는 계산 그래프는 지속되므로 x에 대한 y의 기울기를 계산할 수 있습니다.

x.grad.zero_()
y.sum().backward()
x.grad == 2 * x

이 코드는 PyTorch를 사용하여 그래디언트(미분)를 계산하는 예제입니다. 아래에서 코드를 한 줄씩 설명하겠습니다.

x.grad는 x에 대한 그래디언트(미분)를 나타내는 PyTorch 텐서입니다. 이 코드는 이전에 계산된 그래디언트를 지우고 새로운 그래디언트를 계산할 준비를 합니다. zero_() 메서드는 그래디언트를 모두 0으로 초기화합니다.

y의 모든 원소의 합에 대한 그래디언트를 계산합니다. 이러한 그래디언트 계산은 역전파(backpropagation)를 통해 수행됩니다. 여기서 y는 x의 제곱인 텐서입니다.

이 코드는 x.grad와 2 * x를 비교합니다. x.grad는 y.sum()에 대한 그래디언트이며, 이것은 y가 x의 제곱이므로 2 * x입니다. 따라서 이 비교는 True를 반환합니다.

즉, x.grad는 y를 x로 미분한 결과이며, 이는 y가 2 * x의 형태를 가지는 관계를 반영합니다. 이것은 연쇄 법칙(Chain Rule)을 통해 계산되며, y가 x에 대한 제곱 함수인 경우 그래디언트는 2 * x가 됩니다.

tensor([True, True, True, True])

2.5.4. Gradients and Python Control Flow

So far we reviewed cases where the path from input to output was well defined via a function such as z = x * x * x. Programming offers us a lot more freedom in how we compute results. For instance, we can make them depend on auxiliary variables or condition choices on intermediate results. One benefit of using automatic differentiation is that even if building the computational graph of a function required passing through a maze of Python control flow (e.g., conditionals, loops, and arbitrary function calls), we can still calculate the gradient of the resulting variable. To illustrate this, consider the following code snippet where the number of iterations of the while loop and the evaluation of the if statement both depend on the value of the input a.

지금까지 우리는 z = x * x * x와 같은 함수를 통해 입력에서 출력까지의 경로가 잘 정의된 사례를 검토했습니다. 프로그래밍은 결과를 계산하는 방법에 있어 훨씬 더 많은 자유를 제공합니다. 예를 들어, 중간 결과에 대한 보조 변수나 조건 선택에 의존하도록 만들 수 있습니다. 자동 미분을 사용하면 미로 같은 Python 제어 흐름(예: 조건문, 루프 및 임의 함수 호출)을 통과해야 하는 함수의 계산 그래프를 작성하더라도 결과 변수의 기울기를 계속 계산할 수 있다는 이점이 있습니다. 이를 설명하기 위해 while 루프의 반복 횟수와 if 문의 평가가 모두 입력 a의 값에 따라 달라지는 다음 코드 조각을 고려해보세요.

def f(a):
    b = a * 2
    while b.norm() < 1000:
        b = b * 2
    if b.sum() > 0:
        c = b
    else:
        c = 100 * b
    return c

이 코드는 파이썬 함수 f(a)를 정의합니다. 이 함수는 입력으로 스칼라 텐서 a를 받아서 다음과 같은 작업을 수행합니다.

b라는 새로운 변수를 생성하고 a에 2를 곱한 값을 할당합니다.
b의 L2 노름(norm)이 1000보다 작을 때까지 b를 2배씩 계속해서 곱해갑니다. 이것은 b가 L2 노름이 1000보다 커질 때까지 반복하는 루프입니다.
만약 b의 원소들의 합이 0보다 크다면, c에 b를 할당합니다.
그렇지 않다면 (즉, b의 원소들의 합이 0 이하인 경우), c에 100을 곱한 b를 할당합니다.
최종적으로 c를 반환합니다.

이 함수는 입력 a를 사용하여 b를 계산하고, 이후에 b의 크기와 합을 고려하여 c를 정합니다. c의 값은 a와 b에 따라 다르며, 함수의 결과로 반환됩니다.

Below, we call this function, passing in a random value, as input. Since the input is a random variable, we do not know what form the computational graph will take. However, whenever we execute f(a) on a specific input, we realize a specific computational graph and can subsequently run backward.

아래에서는 이 함수를 호출하여 임의의 값을 입력으로 전달합니다. 입력이 랜덤 변수이기 때문에 계산 그래프가 어떤 형태를 취할지 알 수 없습니다. 그러나 특정 입력에 대해 f(a)를 실행할 때마다 특정 계산 그래프를 실현하고 이후에 역방향으로 실행할 수 있습니다.

a = torch.randn(size=(), requires_grad=True)
d = f(a)
d.backward()

이 코드는 PyTorch를 사용하여 작성된 것으로 보이는 예제입니다. 코드는 다음과 같은 작업을 수행합니다:

a라는 이름의 스칼라 텐서를 생성합니다. requires_grad=True 매개변수를 사용하여 a에 대한 경사도(gradient)가 계산되도록 설정합니다.
이전에 정의한 f(a) 함수를 호출하여 a를 입력으로 사용하고, 이를 d에 할당합니다.
backward 메서드를 사용하여 d의 경사도를 계산합니다. 이것은 연쇄 법칙(chain rule)을 사용하여 d를 a에 대한 함수로 간주하고, d의 a에 대한 경사도를 계산합니다.

즉, 코드는 a에서 f(a)로의 계산 그래프를 구성하고, f(a)의 결과 d에 대한 a의 경사도를 계산합니다. 이렇게 하면 a.grad에 a에 대한 경사도가 저장됩니다.

Even though our function f is, for demonstration purposes, a bit contrived, its dependence on the input is quite simple: it is a linear function of a with piecewise defined scale. As such, f(a) / a is a vector of constant entries and, moreover, f(a) / a needs to match the gradient of f(a) with respect to a.

함수 f는 데모 목적으로 약간 인위적으로 만들어졌지만 입력에 대한 의존성은 매우 간단합니다. 부분적으로 정의된 스케일을 사용하는 a의 선형 함수입니다. 따라서 f(a) / a는 상수 항목으로 구성된 벡터이며 더욱이 f(a) / a는 a에 대한 f(a)의 기울기와 일치해야 합니다.

a.grad == d / a

이 코드는 a.grad와 d를 a로 나눈 값인 d / a를 비교하여 결과를 확인합니다.

a.grad는 a에 대한 경사도(gradient)를 나타내며, 이 값은 .backward()를 호출하여 계산됩니다.

d는 f(a) 함수의 결과를 나타내는 변수입니다.

따라서, a.grad는 d를 a로 미분한 값이 됩니다. 코드는 이 계산된 경사도 a.grad와 d / a를 비교하여 두 값이 동일한지 확인합니다.

이 비교가 참인 경우, 이는 PyTorch의 자동 미분이 제대로 작동하고 있다는 것을 의미합니다. 즉, d를 a로 미분한 결과가 d / a와 일치한다는 것을 의미합니다.

tensor(True)

Dynamic control flow is very common in deep learning. For instance, when processing text, the computational graph depends on the length of the input. In these cases, automatic differentiation becomes vital for statistical modeling since it is impossible to compute the gradient a priori.

동적 제어 흐름은 딥러닝에서 매우 일반적입니다. 예를 들어 텍스트를 처리할 때 계산 그래프는 입력 길이에 따라 달라집니다. 이러한 경우 사전에 기울기를 계산하는 것이 불가능하기 때문에 통계 모델링에 자동 미분이 필수적입니다.

2.5.5. Discussion

You have now gotten a taste of the power of automatic differentiation. The development of libraries for calculating derivatives both automatically and efficiently has been a massive productivity booster for deep learning practitioners, liberating them so they can focus on less menial. Moreover, autograd lets us design massive models for which pen and paper gradient computations would be prohibitively time consuming. Interestingly, while we use autograd to optimize models (in a statistical sense) the optimization of autograd libraries themselves (in a computational sense) is a rich subject of vital interest to framework designers. Here, tools from compilers and graph manipulation are leveraged to compute results in the most expedient and memory-efficient manner.

이제 자동 미분의 힘을 맛보셨습니다. 도함수를 자동으로 효율적으로 계산하기 위한 라이브러리의 개발은 딥 러닝 실무자들의 생산성을 대폭 향상시켜 그들이 덜 천박한 일에 집중할 수 있도록 해방시켜 주었습니다. 게다가, autograd를 사용하면 펜과 종이의 그라디언트 계산에 엄청난 시간이 소요되는 대규모 모델을 설계할 수 있습니다. 흥미롭게도, 우리가 모델을 최적화하기 위해(통계적 의미에서) autograd를 사용하는 반면, autograd 라이브러리 자체의 최적화(계산적 의미에서)는 프레임워크 설계자들에게 매우 중요한 주제입니다. 여기서는 컴파일러 도구와 그래프 조작을 활용하여 가장 편리하고 메모리 효율적인 방식으로 결과를 계산합니다.

For now, try to remember these basics: (i) attach gradients to those variables with respect to which we desire derivatives; (ii) record the computation of the target value; (iii) execute the backpropagation function; and (iv) access the resulting gradient.

지금은 다음 기본 사항을 기억해 보십시오. (i) 도함수를 원하는 변수에 기울기를 적용합니다. (ii) 목표값의 계산을 기록하고; (iii) 역전파 기능을 실행합니다. (iv) 결과 그래디언트에 액세스합니다.

2.5.6. Exercises

저작자표시 (새창열림)

'Dive into Deep Learning > D2L Preliminaries' 카테고리의 다른 글

D2L - 2.7. Documentation (2)	2023.10.14
D2L - 2.6. Probability and Statistics (0)	2023.10.14
D2L - 2.4. Calculus : 미적분 (1)	2023.10.12
D2L - 2.3. Linear Algebra - 선형 대수학 (1)	2023.10.11
D2L - 2.2. Data Preprocessing (0)	2023.10.09
D2L - 2.1. Data Manipulation (0)	2023.10.09
D2L - 2. Preliminaries (0)	2023.10.09

Dive into Deep Learning/D2L Preliminaries

D2L - 2.4. Calculus : 미적분

2023. 10. 12. 11:53 | Posted by 솔웅

https://d2l.ai/chapter_preliminaries/calculus.html

2.4. Calculus — Dive into Deep Learning 1.0.3 documentation

d2l.ai

2.4. Calculus

For a long time, how to calculate the area of a circle remained a mystery. Then, in Ancient Greece, the mathematician Archimedes came up with the clever idea to inscribe a series of polygons with increasing numbers of vertices on the inside of a circle (Fig. 2.4.1). For a polygon with n vertices, we obtain n triangles. The height of each triangle approaches the radius r as we partition the circle more finely. At the same time, its base approaches 2 π r/n, since the ratio between arc and secant approaches 1 for a large number of vertices. Thus, the area of the polygon approaches n⋅r⋅1/2(2 π r /n)= π r **2.

오랫동안 원의 넓이를 계산하는 방법은 미스터리로 남아 있었습니다. 그런 다음 고대 그리스의 수학자 아르키메데스는 원 내부에 정점 수가 증가하는 일련의 다각형을 새기는 영리한 아이디어를 생각해 냈습니다(그림 2.4.1). n개의 꼭지점을 가진 다각형의 경우 n개의 삼각형을 얻습니다. 원을 더 세밀하게 분할할수록 각 삼각형의 높이는 반지름 r에 가까워집니다. 동시에, 그 밑변은 2π r/n에 가까워집니다. 왜냐하면 호와 시컨트 사이의 비율이 많은 수의 꼭지점에 대해 1에 가까워지기 때문입니다. 따라서 다각형의 면적은 n⋅r⋅1/2(2 π r /n)= π r **2에 가까워집니다.

Fig. 2.4.1  Finding the area of a circle as a limit procedure.

This limiting procedure is at the root of both differential calculus and integral calculus. The former can tell us how to increase or decrease a function’s value by manipulating its arguments. This comes in handy for the optimization problems that we face in deep learning, where we repeatedly update our parameters in order to decrease the loss function. Optimization addresses how to fit our models to training data, and calculus is its key prerequisite. However, do not forget that our ultimate goal is to perform well on previously unseen data. That problem is called generalization and will be a key focus of other chapters.

이 제한 절차는 미분 계산 calculus 과 적분 계산 integral calculus 의 기초입니다. 전자는 인수를 조작하여 함수의 값을 늘리거나 줄이는 방법을 알려줄 수 있습니다. 이는 손실 함수를 줄이기 위해 매개변수를 반복적으로 업데이트하는 딥러닝에서 직면하는 최적화 문제에 유용합니다. 최적화는 모델을 훈련 데이터에 맞추는 방법을 다루며, 미적분학은 핵심 전제 조건입니다. 그러나 우리의 궁극적인 목표는 이전에 볼 수 없었던 데이터를 잘 활용하는 것임을 잊지 마십시오. 이 문제를 일반화 generalization 라고 하며 다른 장에서 중점적으로 다룰 것입니다.

%matplotlib inline
import numpy as np
from matplotlib_inline import backend_inline
from d2l import torch as d2l

주어진 코드는 Python 코드로, Jupyter Notebook 또는 IPython 환경에서 사용됩니다. 코드는 다음과 같은 작업을 수행합니다:

%matplotlib inline: 이 코드는 Jupyter Notebook에서 사용하는 "매직 명령어" 중 하나로, 그래프나 그림을 출력할 때 그림을 노트북 내부에 표시하도록 설정하는 역할을 합니다. 이렇게 하면 그림이 노트북 내에서 바로 볼 수 있습니다.
import numpy as np: NumPy 라이브러리를 불러옵니다. NumPy는 파이썬의 수치 계산과 배열 처리에 유용한 라이브러리로, 주로 다차원 배열과 관련된 작업에 사용됩니다. np는 일반적으로 NumPy의 별칭으로 사용됩니다.
from matplotlib_inline import backend_inline: matplotlib_inline 라이브러리에서 backend_inline 모듈을 가져옵니다. 이 모듈은 Matplotlib 그래프를 노트북 내에서 인라인으로 표시할 때 사용됩니다.
from d2l import torch as d2l: D2L (Dive into Deep Learning) 라이브러리에서 torch 모듈을 가져온 다음, 이 모듈을 d2l이라는 별칭으로 사용합니다. D2L은 딥러닝 및 머신러닝 교육과 관련된 코드와 자료를 제공하는 라이브러리로, torch 모듈은 PyTorch를 기반으로 한 딥러닝 코드를 작성하기 위해 사용됩니다.

이 코드는 주로 딥러닝 및 머신러닝 관련 작업을 수행하고 시각화를 위한 환경 설정을 위해 사용됩니다. 또한 이 코드는 Jupyter Notebook 또는 IPython 환경에서 노트북 내에서 그래프 및 그림을 표시하기 위한 설정을 제공합니다.

2.4.1. Derivatives and Differentiation

Put simply, a derivative is the rate of change in a function with respect to changes in its arguments. Derivatives can tell us how rapidly a loss function would increase or decrease were we to increase or decrease each parameter by an infinitesimally small amount. Formally, for functions ƒ : ℝ → ℝ , that map from scalars to scalars, the derivative of ƒ at a point x is defined as

간단히 말해서, 도함수 derivative 는 인수의 변화에 대한 함수의 변화율입니다. Derivatives 은 각 매개변수를 극소량씩 늘리거나 줄이면 손실 함수가 얼마나 빠르게 증가하거나 감소하는지 알려줄 수 있습니다. 공식적으로, 스칼라에서 스칼라로 매핑되는 함수 f : ℝ → ℝ에 대해 점 x에서 의 도함수는 다음과 같이 정의됩니다.

lim (극한) 이란?

In mathematics, "lim" is an abbreviation for the limit. The limit is a fundamental concept in calculus and analysis that describes the behavior of a function or sequence as it approaches a certain value or approaches infinity or negative infinity.

수학에서 "lim"은 "극한(limit)"의 줄임말로 사용됩니다. 극한은 미적분학과 해석학에서 중요한 개념으로, 함수나 수열이 특정한 값을 향하거나 무한대 또는 음의 무한대를 향하는 과정을 기술합니다.

The limit of a function f(x) as x approaches a particular value, say 'a', is denoted as:

특정 값 'a'로 접근할 때 함수 f(x)의 극한은 다음과 같이 나타납니다:

lim (x → a) f(x)

This notation represents the value that the function approaches as x gets closer and closer to 'a'. In other words, it describes what happens to the function as x gets infinitely close to 'a'.

이 표기법은 x가 'a'에 점점 가까워질 때 함수가 어떻게 동작하는지를 설명합니다. 다시 말해, x가 'a'에 무한히 가까워질 때 함수가 어떻게 행동하는지를 나타냅니다.

Limits are used to study the behavior of functions, analyze continuity, and find derivatives and integrals in calculus. They are a crucial tool for understanding the fundamental properties and characteristics of functions, especially in the context of differential and integral calculus. The concept of limits is also essential in real analysis, where it is used to rigorously define continuity, convergence, and other key mathematical concepts.

극한은 미적분학에서 함수의 행동을 연구하고 연속성을 분석하며, 미분 및 적분을 찾는 데 사용되는 중요한 도구입니다. 극한은 함수의 기본적인 특성과 특징을 이해하는 데 필수적이며, 특히 미분 및 적분 미적분학의 맥락에서 중요합니다. 극한의 개념은 또한 해석학에서 사용되어 연속성, 수렴 및 기타 주요 수학적 개념을 엄밀하게 정의하는 데 필수적입니다.

This term on the right hand side is called a limit and it tells us what happens to the value of an expression as a specified variable approaches a particular value. This limit tells us what the ratio between a perturbation ℎ and the change in the function value f(x+ℎ)− f(x) converges to as we shrink its size to zero.

오른쪽에 있는 용어는 극한 limit 이라고 하며 지정된 변수가 특정 값에 접근할 때 표현식의 값에 어떤 일이 발생하는지 알려줍니다. 이 극한 limit 는 크기를 0으로 줄이면 섭동 perturbation ℎ와 함수 값 f(x+ℎ)− f(x)의 변화 사이의 비율이 수렴되는 것을 알려줍니다.

When f ′(x) exists, f is said to be differentiable at x; and when f ′(x) exists for all x on a set, e.g., the interval [a,b], we say that f is differentiable on this set. Not all functions are differentiable, including many that we wish to optimize, such as accuracy and the area under the receiving operating characteristic (AUC). However, because computing the derivative of the loss is a crucial step in nearly all algorithms for training deep neural networks, we often optimize a differentiable surrogate instead.

f ′(x)가 존재할 때 f는 x에서 미분 가능하다고 합니다. 그리고 f ′(x)가 세트의 모든 x에 대해 존재할 때(예: 구간 [a,b]), 우리는 f가 이 세트에서 미분 가능하다고 말합니다. 정확도, 수신 작동 특성(AUC) 하의 영역 등 최적화하려는 많은 기능을 포함하여 모든 기능이 차별화 가능한 것은 아닙니다. 그러나 손실의 도함수 derivative 를 계산하는 것은 심층 신경망 훈련을 위한 거의 모든 알고리즘에서 중요한 단계이기 때문에 대신 미분 가능한 대리자를 최적화하는 경우가 많습니다.

We can interpret the derivative f ′(x) as the instantaneous rate of change of f(x) with respect to x. Let’s develop some intuition with an example. Define u= f(x)=3x**2 − 4x.

도함수 f'(x)를 x에 대한 f(x)의 순간 변화율로 해석할 수 있습니다. 예를 들어 직관력을 키워 봅시다. u= f(x)=3x**2−4x를 정의합니다.

def f(x):
    return 3 * x ** 2 - 4 * x

주어진 코드는 파이썬에서 정의한 함수를 설명하고 있습니다. 함수 이름은 f이며, 주어진 입력(x)에 대한 출력을 계산하는 방법을 정의합니다. 함수의 내용은 다음과 같이 나타납니다:

이 코드의 설명은 다음과 같습니다:

def f(x):: def 키워드는 파이썬 함수를 정의하기 위해 사용되며, 함수 이름인 f를 정의합니다. 괄호 안에 있는 x는 함수의 입력 매개변수(parameter)입니다. 이 함수는 x라는 입력을 받아서 계산을 수행하고 결과를 반환합니다.
return 3 * x ** 2 - 4 * x: 이 줄은 함수의 본문(body)을 정의합니다. 주어진 x를 사용하여 함수가 수행할 계산을 기술합니다. 여기서는 입력 x를 이용하여 다음의 계산을 수행합니다:
- x를 제곱한 후 3을 곱하고,
- x를 4 곱한 다음 뺍니다.

이 함수는 주어진 x 값에 대한 결과를 반환합니다. 예를 들어, f(2)를 호출하면 x에 2를 대입하여 3 * 2 ** 2 - 4 * 2를 계산하고, 결과로 -4를 반환할 것입니다.

이 함수를 사용하면 주어진 입력값 x에 대한 함수의 출력을 계산할 수 있으며, 이러한 함수 정의는 미적분 및 수학적 모델링과 같은 다양한 수학적 응용 분야에서 사용됩니다.

Setting x=1, we see that f(x+ℎ)− f(x)/ℎ approaches 2 as ℎ approaches 0. While this experiment lacks the rigor of a mathematical proof, we can quickly see that indeed f′(1)=2.

x=1로 설정하면 ℎ가 0에 가까워짐에 따라 f(x+ℎ)− f(x)/ℎ도 2에 가까워지는 것을 알 수 있습니다. 이 실험에는 수학적 증명의 엄격함이 부족하지만, 우리는 실제로 'f′(1)=2'라는 것을 빨리 알 수 있습니다.

for h in 10.0**np.arange(-1, -6, -1):
    print(f'h={h:.5f}, numerical limit={(f(1+h)-f(1))/h:.5f}')

주어진 코드는 파이썬의 반복문을 사용하여 함수 f(x)의 수치 미분(numerical derivative)을 계산하고 출력하는 작업을 수행합니다. 코드는 h 값의 범위를 설정하고, 각 h에 대해 f(x) 함수의 미분 값을 계산하고 출력합니다. 아래는 코드의 설명입니다:

for h in 10.0**np.arange(-1, -6, -1): 이 줄은 h라는 변수를 사용하여 반복을 설정합니다. np.arange(-1, -6, -1)는 -1에서 -6까지 1씩 감소하는 수열을 생성합니다. 그리고 10.0**를 사용하여 각 수열의 값에 10의 지수를 적용하여 h 값을 생성합니다. 이렇게 함으로써 h는 0.1, 0.01, 0.001, 0.0001, 0.00001의 값을 순서대로 가지게 됩니다.
print(f'h={h:.5f}, numerical limit={(f(1+h)-f(1))/h:.5f}'): 이 줄은 각 h 값에 대한 미분 값을 계산하고 출력합니다.
- h={h:.5f}: h의 값을 소수 다섯 번째 자리까지 출력합니다.
- numerical limit={(f(1+h)-f(1))/h:.5f}: f(1+h)에서 f(1)을 뺀 다음 h로 나눈 값을 소수 다섯 번째 자리까지 출력합니다. 이 값은 f(x) 함수의 수치 미분을 나타내며, h 값이 작을수록 정확한 미분 값을 얻을 수 있습니다.

따라서 이 코드는 서로 다른 h 값에 대해 f(x) 함수의 수치 미분 값을 계산하고 출력하여, h 값이 작아질수록 정확한 미분 값을 얻는 과정을 보여줍니다. 이러한 과정은 미분 근사를 이해하고 미분 값을 추정하는 데 사용됩니다.

h=0.10000, numerical limit=2.30000
h=0.01000, numerical limit=2.03000
h=0.00100, numerical limit=2.00300
h=0.00010, numerical limit=2.00030
h=0.00001, numerical limit=2.00003

There are several equivalent notational conventions for derivatives. Given y=f(x), the following expressions are equivalent:

derivatives 에 대한 몇 가지 동등한 표기 규칙이 있습니다. y=f(x)라고 가정하면 다음 표현식은 동일합니다.

where the symbols d/dx and D are differentiation operators. Below, we present the derivatives of some common functions:

여기서 기호 d/dx와 D는 미분 연산자 differentiation operators 입니다. 아래에서는 몇 가지 일반적인 함수의 derivatives 을 제시합니다.

Functions composed from differentiable functions are often themselves differentiable. The following rules come in handy for working with compositions of any differentiable functions f and g, and constant C.

미분 가능한 함수로 구성된 함수는 종종 그 자체로 미분 가능합니다. 다음 규칙은 미분 가능한 함수 f와 g 및 상수 C의 구성 작업에 유용합니다.

Using this, we can apply the rules to find the derivative of 3x**2 − 4x via

이를 사용하여 다음을 통해 3x**2 − 4x의 도함수를 찾는 규칙을 적용할 수 있습니다.

Plugging in x=1 shows that, indeed, the derivative equals 2 at this location. Note that derivatives tell us the slope of a function at a particular location.

x=1을 대입하면 실제로 이 위치에서 도함수는 2와 같다는 것을 알 수 있습니다. 도함수 derivatives 는 특정 위치에서 함수의 기울기를 알려줍니다.

2.4.2. Visualization Utilities

We can visualize the slopes of functions using the matplotlib library. We need to define a few functions. As its name indicates, use_svg_display tells matplotlib to output graphics in SVG format for crisper images. The comment #@save is a special modifier that allows us to save any function, class, or other code block to the d2l package so that we can invoke it later without repeating the code, e.g., via d2l.use_svg_display().

matplotlib 라이브러리를 사용하여 함수의 기울기를 시각화할 수 있습니다. 몇 가지 함수를 정의해야 합니다. 이름에서 알 수 있듯이 use_svg_display는 matplotlib에게 보다 선명한 이미지를 위해 SVG 형식으로 그래픽을 출력하도록 지시합니다. #@save 주석은 함수, 클래스 또는 기타 코드 블록을 d2l 패키지에 저장하여 나중에 코드를 반복하지 않고(예: d2l.use_svg_display()를 통해) 호출할 수 있도록 하는 특수 수정자 special modifier 입니다.

def use_svg_display():  #@save
    """Use the svg format to display a plot in Jupyter."""
    backend_inline.set_matplotlib_formats('svg')

주어진 코드는 Jupyter Notebook 환경에서 그래프나 그림을 SVG(Scalable Vector Graphics) 형식으로 표시하는 함수를 정의하는 파이썬 코드입니다. 아래는 코드의 설명입니다:

def use_svg_display():: 이 코드는 use_svg_display라는 이름의 함수를 정의합니다. 이 함수는 그래프나 그림을 SVG 형식으로 표시하도록 설정하는 역할을 합니다.
backend_inline.set_matplotlib_formats('svg'): 이 함수 내에서는 backend_inline 라이브러리의 set_matplotlib_formats 함수를 호출합니다. 이 함수는 Matplotlib 그래프의 출력 형식을 설정하는 역할을 합니다. 여기서 'svg'를 사용하여 SVG 형식으로 그래프를 설정합니다. SVG 형식은 확대하더라도 이미지가 깨지지 않고 고품질로 표시되며, Jupyter Notebook 환경에서 렌더링할 때 특히 유용합니다.

따라서 이 코드는 Jupyter Notebook 환경에서 그래프를 그릴 때 그래프의 형식을 SVG로 설정하여, 그래프가 화면에 고해상도로 표시되도록 하는 함수를 정의하고 있습니다. 이 함수를 사용하면 Jupyter Notebook에서 품질 좋은 그래프를 생성하고 시각화할 때 유용합니다.

Conveniently, we can set figure sizes with set_figsize. Since the import statement from matplotlib import pyplot as plt was marked via #@save in the d2l package, we can call d2l.plt.

편리하게도 set_figsize를 사용하여 그림 크기를 설정할 수 있습니다. matplotlib import pyplot as plt의 import 문이 d2l 패키지의 #@save를 통해 표시되었으므로 d2l.plt를 호출할 수 있습니다.

def set_figsize(figsize=(3.5, 2.5)):  #@save
    """Set the figure size for matplotlib."""
    use_svg_display()
    d2l.plt.rcParams['figure.figsize'] = figsize

주어진 코드는 Matplotlib를 사용하여 그림의 크기를 설정하는 함수를 정의하는 파이썬 코드입니다. 아래는 코드의 설명입니다:

def set_figsize(figsize=(3.5, 2.5)):: 이 코드는 set_figsize라는 이름의 함수를 정의합니다. 이 함수는 그림의 크기를 설정하는 역할을 합니다. figsize 매개변수를 사용하여 원하는 그림 크기를 지정할 수 있으며, 기본값은 (3.5, 2.5)로 설정되어 있습니다.
use_svg_display(): use_svg_display 함수를 호출하여 그래프를 SVG 형식으로 표시하도록 설정합니다. 이전 질문에 설명한 대로, SVG 형식은 고품질 그래프를 표시하는 데 유용합니다.
d2l.plt.rcParams['figure.figsize'] = figsize: Matplotlib의 rcParams를 사용하여 그림의 크기를 설정합니다. figsize 매개변수에 지정된 크기로 그림을 설정합니다. 이것은 Matplotlib 그래프의 기본 크기를 변경하는 것으로, figsize를 통해 그림의 가로와 세로 크기를 지정할 수 있습니다.

따라서 이 코드는 Matplotlib를 사용하여 그림의 크기를 설정하는 함수를 정의하고 있으며, 그림 크기를 사용자 지정하거나 기본 설정을 변경하는 데 사용됩니다. 이를 통해 생성되는 그림은 원하는 크기로 표시되며, 시각화 결과를 조절할 수 있습니다.

The set_axes function can associate axes with properties, including labels, ranges, and scales.

set_axes 함수는 축을 레이블, 범위, 스케일 등의 속성과 연결할 수 있습니다.

#@save
def set_axes(axes, xlabel, ylabel, xlim, ylim, xscale, yscale, legend):
    """Set the axes for matplotlib."""
    axes.set_xlabel(xlabel), axes.set_ylabel(ylabel)
    axes.set_xscale(xscale), axes.set_yscale(yscale)
    axes.set_xlim(xlim),     axes.set_ylim(ylim)
    if legend:
        axes.legend(legend)
    axes.grid()

주어진 코드는 Matplotlib 그래프의 축(axis) 설정을 수행하는 함수를 설명하는 파이썬 코드입니다. 이 함수를 사용하면 그래프의 축 레이블, 범위, 스케일, 범례, 그리드 등을 설정할 수 있습니다. 아래는 코드의 설명입니다:

def set_axes(axes, xlabel, ylabel, xlim, ylim, xscale, yscale, legend):: 이 코드는 set_axes라는 이름의 함수를 정의합니다. 이 함수는 Matplotlib 그래프의 축 설정을 담당합니다. 함수는 다음과 같은 매개변수를 사용합니다:
- axes: 설정할 축(axis) 객체.
- xlabel: x-축 레이블.
- ylabel: y-축 레이블.
- xlim: x-축 범위.
- ylim: y-축 범위.
- xscale: x-축 스케일.
- yscale: y-축 스케일.
- legend: 범례 설정.
axes.set_xlabel(xlabel), axes.set_ylabel(ylabel): x-축과 y-축의 레이블을 설정합니다.
axes.set_xscale(xscale), axes.set_yscale(yscale): x-축과 y-축의 스케일을 설정합니다. 스케일은 "linear" 또는 "log"와 같은 값으로 설정할 수 있으며, 스케일을 변경하면 축의 데이터 표시 방식이 변경됩니다.
axes.set_xlim(xlim), axes.set_ylim(ylim): x-축과 y-축의 범위를 설정합니다. 범위는 그래프에서 보여질 데이터의 최솟값과 최댓값을 지정합니다.
if legend: axes.legend(legend): 만약 legend 매개변수가 주어지면 그래프에 범례를 추가합니다. 범례는 그래프에서 각 선 또는 데이터 시리즈를 설명하는 레이블을 표시하는 데 사용됩니다.
axes.grid(): 그리드를 추가하여 그래프에 격자 눈금을 표시합니다. 격자 눈금은 데이터의 위치를 더 쉽게 파악할 수 있도록 도와줍니다.

이 함수를 사용하면 Matplotlib 그래프의 축을 사용자 정의하고, 그래프를 보다 명확하게 표시하는 데 도움이 됩니다.

With these three functions, we can define a plot function to overlay multiple curves. Much of the code here is just ensuring that the sizes and shapes of inputs match.

이 세 가지 함수를 사용하면 여러 곡선을 오버레이하는 플롯 함수를 정의할 수 있습니다. 여기 코드의 대부분은 입력의 크기와 모양이 일치하는지 확인하는 것입니다.

#@save
def plot(X, Y=None, xlabel=None, ylabel=None, legend=[], xlim=None,
         ylim=None, xscale='linear', yscale='linear',
         fmts=('-', 'm--', 'g-.', 'r:'), figsize=(3.5, 2.5), axes=None):
    """Plot data points."""

    def has_one_axis(X):  # True if X (tensor or list) has 1 axis
        return (hasattr(X, "ndim") and X.ndim == 1 or isinstance(X, list)
                and not hasattr(X[0], "__len__"))

    if has_one_axis(X): X = [X]
    if Y is None:
        X, Y = [[]] * len(X), X
    elif has_one_axis(Y):
        Y = [Y]
    if len(X) != len(Y):
        X = X * len(Y)

    set_figsize(figsize)
    if axes is None:
        axes = d2l.plt.gca()
    axes.cla()
    for x, y, fmt in zip(X, Y, fmts):
        axes.plot(x,y,fmt) if len(x) else axes.plot(y,fmt)
    set_axes(axes, xlabel, ylabel, xlim, ylim, xscale, yscale, legend)

주어진 코드는 데이터 포인트를 그리는 함수를 설명하는 파이썬 코드입니다. 이 함수를 사용하면 데이터 포인트를 그래프로 표시할 수 있으며, 그래프의 모양, 축, 범위, 스케일, 범례, 그리드 등을 설정할 수 있습니다. 아래는 코드의 설명입니다:

plot(X, Y=None, xlabel=None, ylabel=None, legend=[], xlim=None, ylim=None, xscale='linear', yscale='linear', fmts=('-', 'm--', 'g-.', 'r:'), figsize=(3.5, 2.5), axes=None): 이 함수는 데이터 포인트를 그리는 함수로, 다양한 설정 옵션을 제공합니다. 이 함수는 다음과 같은 매개변수를 사용합니다:
- X: x-축에 대한 데이터 포인트 또는 데이터 포인트 리스트.
- Y: y-축에 대한 데이터 포인트 또는 데이터 포인트 리스트. 기본값은 None이며, 이 경우 X가 y-축 데이터로 사용됩니다.
- xlabel: x-축 레이블.
- ylabel: y-축 레이블.
- legend: 범례 설정. 여러 데이터 시리즈에 대한 범례를 지정할 수 있습니다.
- xlim: x-축 범위 설정.
- ylim: y-축 범위 설정.
- xscale: x-축 스케일 설정. 기본값은 "linear"로 설정되어 있습니다.
- yscale: y-축 스케일 설정. 기본값은 "linear"로 설정되어 있습니다.
- fmts: 그래프의 스타일 설정. 여러 다른 선 스타일을 제공하며, 기본값은 ('-', 'm--', 'g-.', 'r:')로 설정되어 있습니다.
- figsize: 그래프의 크기 설정. 기본값은 (3.5, 2.5)로 설정되어 있습니다.
- axes: 그래프를 그릴 Matplotlib 축 객체. 기본값은 None으로 설정되어 있으며, 필요한 경우 사용자가 직접 설정할 수 있습니다.
def has_one_axis(X): 이 내부 함수는 주어진 데이터(X)가 1차원 배열 또는 리스트인지 확인합니다. 1차원이면 True를 반환하고, 그렇지 않으면 False를 반환합니다.
if has_one_axis(X): X = [X]: 만약 X가 1차원 배열이라면, X를 원소가 하나인 리스트로 변환합니다. 이렇게 함으로써 여러 데이터 시리즈를 다룰 때 편리하게 처리할 수 있습니다.
if Y is None: X, Y = [[]] * len(X), X: 만약 Y가 주어지지 않았다면, X를 y-축 데이터로 사용하기 위해 Y를 X로 설정합니다. 그리고 X를 원소가 비어 있는 리스트로 설정합니다.
set_figsize(figsize): 그래프의 크기를 figsize에 지정된 크기로 설정하는 함수를 호출합니다.
if axes is None: axes = d2l.plt.gca(): 만약 축(axes)가 주어지지 않았다면, 현재 활성화된 Matplotlib 축 객체를 가져와서 사용합니다.
axes.cla(): 축 객체를 초기화하여 이전에 그려진 그래프를 지우고 새로운 그래프를 그릴 준비를 합니다.
for x, y, fmt in zip(X, Y, fmts): axes.plot(x, y, fmt) if len(x) else axes.plot(y, fmt): X와 Y에 대한 데이터 시리즈와 스타일(fmt)을 순회하면서 그래프를 그립니다. 만약 x 데이터 시리즈가 비어 있다면(len(x) == 0), y 데이터 시리즈를 그래프로 그립니다.
set_axes(axes, xlabel, ylabel, xlim, ylim, xscale, yscale, legend): set_axes 함수를 호출하여 축의 레이블, 범위, 스케일, 범례, 그리드 등을 설정합니다.

이 함수를 사용하면 주어진 데이터 포인트를 그래프로 시각화하고, 그래프의 모양과 설정을 유연하게 조절할 수 있습니다. 이는 데이터 시각화 및 그래프 생성에 유용한 도구입니다.

Now we can plot the function u=f(x) and its tangent line y=2x−3 at x=1, where the coefficient 2 is the slope of the tangent line.

이제 x=1에서 함수 u=f(x)와 접선 y=2x−3을 그릴 수 있습니다. 여기서 계수 2는 접선의 기울기입니다.

x = np.arange(0, 3, 0.1)
plot(x, [f(x), 2 * x - 3], 'x', 'f(x)', legend=['f(x)', 'Tangent line (x=1)'])

주어진 코드는 주어진 범위에서 함수 f(x)와 x=1에서의 접선을 그래프로 표시하는 예제 코드입니다. 아래는 코드의 설명입니다:

x = np.arange(0, 3, 0.1): 이 코드는 0에서 3까지의 범위에서 0.1 간격으로 숫자를 생성하여 x에 할당합니다. 이렇게 생성된 x 값은 함수 f(x)와 접선을 그래프로 그릴 때 x-축 값으로 사용됩니다.
plot(x, [f(x), 2 * x - 3], 'x', 'f(x)', legend=['f(x)', 'Tangent line (x=1)']): plot 함수를 호출하여 그래프를 그립니다. 이때, 다음 매개변수들이 사용됩니다:
- x: x-축 데이터로 사용될 범위(0에서 3까지의 값).
- [f(x), 2 * x - 3]: y-축 데이터로 사용될 값. 여기서 f(x)는 함수 f(x)의 값이고, 2 * x - 3은 x=1에서의 접선의 방정식입니다.
- 'x': x-축 레이블로 사용될 문자열.
- 'f(x)': y-축 레이블로 사용될 문자열.
- legend=['f(x)', 'Tangent line (x=1)']: 범례 설정으로, 각 데이터 시리즈에 대한 설명을 제공합니다.

이 코드는 x 범위에서 f(x) 함수와 x=1에서의 접선을 그래프로 그립니다. 이를 통해 함수와 해당 지점에서의 기울기를 시각화할 수 있습니다.

2.4.3. Partial Derivatives and Gradients

Thus far, we have been differentiating functions of just one variable. In deep learning, we also need to work with functions of many variables. We briefly introduce notions of the derivative that apply to such multivariate functions.

지금까지 우리는 단 하나의 변수에 대한 함수를 차별화해 왔습니다. 딥러닝에서는 다양한 변수의 함수를 다루어야 합니다. 우리는 그러한 다변량 함수에 적용되는 미분의 개념을 간략하게 소개합니다.

Let y=f(x1,x2,…,xn) be a function with n variables. The partial derivative of y with respect to its i th parameter xi is

y=f(x1,x2,…,xn)을 n개의 변수를 갖는 함수로 둡니다. i 번째 매개변수 xi에 대한 y의 편도함수 multivariate functions는 다음과 같습니다.

To calculate ∂y/ ∂xi, we can treat x1,…,xi−1,xi+1,…,xn as constants and calculate the derivative of y with respect to xi. The following notational conventions for partial derivatives are all common and all mean the same thing:

∂y/ ∂xi를 계산하려면 x1,…,xi−1,xi+1,…,xn을 상수로 처리하고 xi에 대한 y의 도함수를 계산할 수 있습니다. 부분 도함수에 대한 다음 표기 규칙은 모두 공통적이며 모두 같은 의미입니다.

We can concatenate partial derivatives of a multivariate function with respect to all its variables to obtain a vector that is called the gradient of the function. Suppose that the input of function f: ℝ**n→ ℝ is an n-dimensional vector x=[x1,x2,…,xn]**⊤ and the output is a scalar. The gradient of the function f with respect to x is a vector of n partial derivatives:

모든 변수에 대해 다변량 함수의 부분 도함수를 연결하여 함수의 기울기라고 하는 벡터를 얻을 수 있습니다. 함수 f: ℝ**n→ ℝ의 입력이 n차원 벡터 x=[x1,x2,…,xn]**⊤이고 출력이 스칼라라고 가정합니다. x에 대한 함수 f의 기울기는 n 편도함수의 벡터입니다.

When there is no ambiguity, ∇_xf(x) is typically replaced by ∇f(x). The following rules come in handy for differentiating multivariate functions:

모호성 ambiguity 이 없으면 ∇_xf(x)는 일반적으로 ∇f(x)로 대체됩니다. 다변량 함수를 차별화하는 데 다음 규칙이 유용합니다.

Similarly, for any matrix X, we have ∇_x|X|²_F=2X.

마찬가지로, 임의의 행렬 X에 대해 ∇_x|X|²_F=2X가 됩니다.

2.4.4. Chain Rule

In deep learning, the gradients of concern are often difficult to calculate because we are working with deeply nested functions (of functions (of functions…)). Fortunately, the chain rule takes care of this. Returning to functions of a single variable, suppose that y=f(g(x)) and that the underlying functions y=f(u) and u=g(x) are both differentiable. The chain rule states that

딥 러닝에서는 깊게 중첩된 함수(함수 중(함수…))로 작업하기 때문에 관심 기울기를 계산하기 어려운 경우가 많습니다. 다행히도 체인 규칙이 이를 처리합니다. 단일 변수의 함수로 돌아가서, y=f(g(x))와 기본 함수 y=f(u) 및 u=g(x)가 모두 미분 가능하다고 가정합니다. 체인 규칙은 다음과 같이 명시합니다.

Turning back to multivariate functions, suppose that y=f(u) has variables u1,u2,…,un, where each ui=gi(X) has variables x1,x2,…,xn, i.e., u=g(x). Then the chain rule states that

다변량 함수로 돌아가서, y=f(u)에 변수 u1,u2,…,un이 있다고 가정합니다. 여기서 각 ui=gi(X)에는 변수 x1,x2,…,xn이 있습니다. 즉, u=g(x) . 그런 다음 체인 규칙은 다음과 같이 명시합니다.

where A∈ ℝ ^n×m is a matrix that contains the derivative of vector u with respect to vector x. Thus, evaluating the gradient requires computing a vector–matrix product. This is one of the key reasons why linear algebra is such an integral building block in building deep learning systems.

여기서 A∈ ℝ^n×m은 벡터 x에 대한 벡터 u의 도함수를 포함하는 행렬입니다. 따라서 기울기를 평가하려면 벡터-행렬 곱을 계산해야 합니다. 이것이 선형 대수학이 딥 러닝 시스템을 구축하는 데 필수적인 구성 요소인 주요 이유 중 하나입니다.

2.4.5. Discussion

While we have just scratched the surface of a deep topic, a number of concepts already come into focus: first, the composition rules for differentiation can be applied routinely, enabling us to compute gradients automatically. This task requires no creativity and thus we can focus our cognitive powers elsewhere. Second, computing the derivatives of vector-valued functions requires us to multiply matrices as we trace the dependency graph of variables from output to input. In particular, this graph is traversed in a forward direction when we evaluate a function and in a backwards direction when we compute gradients. Later chapters will formally introduce backpropagation, a computational procedure for applying the chain rule.

우리는 단지 깊은 주제의 표면만 긁었을 뿐이지만 이미 여러 가지 개념에 초점을 맞추고 있습니다. 첫째, 미분을 위한 합성 규칙을 일상적으로 적용하여 기울기를 자동으로 계산할 수 있습니다. 이 작업에는 창의성이 필요하지 않으므로 인지 능력을 다른 곳에 집중할 수 있습니다. 둘째, 벡터 값 함수의 도함수를 계산하려면 출력에서 입력까지 변수의 종속성 그래프를 추적하면서 행렬을 곱해야 합니다. 특히, 이 그래프는 함수를 평가할 때 정방향으로 이동하고 기울기를 계산할 때 역방향으로 이동합니다. 이후 장에서는 체인 규칙을 적용하기 위한 계산 절차인 역전파를 공식적으로 소개할 것입니다.

From the viewpoint of optimization, gradients allow us to determine how to move the parameters of a model in order to lower the loss, and each step of the optimization algorithms used throughout this book will require calculating the gradient.

최적화 관점에서 기울기를 사용하면 손실을 낮추기 위해 모델의 매개변수를 이동하는 방법을 결정할 수 있으며, 이 책 전체에서 사용되는 최적화 알고리즘의 각 단계에서는 기울기 계산이 필요합니다.

2.4.6. Exercises

저작자표시 (새창열림)

'Dive into Deep Learning > D2L Preliminaries' 카테고리의 다른 글

D2L - 2.7. Documentation (2)	2023.10.14
D2L - 2.6. Probability and Statistics (0)	2023.10.14
D2L - 2.5. Automatic Differentiation (0)	2023.10.12
D2L - 2.3. Linear Algebra - 선형 대수학 (1)	2023.10.11
D2L - 2.2. Data Preprocessing (0)	2023.10.09
D2L - 2.1. Data Manipulation (0)	2023.10.09
D2L - 2. Preliminaries (0)	2023.10.09

Dive into Deep Learning/D2L Preliminaries

D2L - 2.3. Linear Algebra - 선형 대수학

2023. 10. 11. 13:26 | Posted by 솔웅

https://d2l.ai/chapter_preliminaries/linear-algebra.html

2.3. Linear Algebra — Dive into Deep Learning 1.0.3 documentation

d2l.ai

2.3. Linear Algebra

By now, we can load datasets into tensors and manipulate these tensors with basic mathematical operations. To start building sophisticated models, we will also need a few tools from linear algebra. This section offers a gentle introduction to the most essential concepts, starting from scalar arithmetic and ramping up to matrix multiplication.

이제 데이터 세트를 텐서에 로드하고 기본적인 수학 연산을 통해 이러한 텐서를 조작할 수 있습니다. 정교한 모델 구축을 시작하려면 선형 대수학의 몇 가지 도구도 필요합니다. 이 섹션에서는 스칼라 산술부터 시작하여 행렬 곱셈까지 가장 필수적인 개념을 부드럽게 소개합니다.

import torch

import torch: 이 코드는 PyTorch 라이브러리를 현재 Python 스크립트 또는 환경으로 가져옵니다. PyTorch는 딥러닝 및 텐서 연산을 위한 라이브러리로, 다양한 딥러닝 모델을 구축하고 학습시키며 텐서 관련 작업을 수행하는 데 사용됩니다.

2.3.1. Scalars

Most everyday mathematics consists of manipulating numbers one at a time. Formally, we call these values scalars. For example, the temperature in Palo Alto is a balmy 72 degrees Fahrenheit. If you wanted to convert the temperature to Celsius you would evaluate the expression c=5/9( ƒ −32), setting ƒ to 72. In this equation, the values 5, 9, and 32 are constant scalars. The variables c and ƒ in general represent unknown scalars.

대부분의 일상 수학은 한 번에 하나씩 숫자를 조작하는 것으로 구성됩니다. 공식적으로는 이러한 값을 스칼라라고 부릅니다. 예를 들어, 팔로알토(Palo Alto)의 기온은 화씨 72도입니다. 온도를 섭씨로 변환하려면 c=5/9( f −32) 표현식을 평가하고 f를 72로 설정합니다. 이 방정식에서 값 5, 9, 32는 상수 스칼라입니다. 변수 c와 f는 일반적으로 알 수 없는 스칼라를 나타냅니다.

We denote scalars by ordinary lower-cased letters (e.g., x, y, and z) and the space of all (continuous) real-valued scalars by ℝ . For expedience, we will skip past rigorous definitions of spaces: just remember that the expression x∈ ℝ is a formal way to say that x is a real-valued scalar. The symbol ∈ (pronounced “in”) denotes membership in a set. For example, x,y∈{0,1} indicates that x and y are variables that can only take values 0 or 1.

스칼라는 일반 소문자(예: x, y, z)로 표시하고 모든(연속) 실수 값 스칼라의 공간은 ℝ로 표시합니다. 편의상 공간에 대한 엄격한 정의는 생략하겠습니다. x∈ ℝ라는 표현은 x가 실수 값 스칼라임을 나타내는 형식적인 방법이라는 점만 기억하세요. 기호 ∈(“in”으로 발음)는 집합에 속한다는 것을 나타냅니다. 예를 들어, x,y∈{0,1}은 x와 y가 0 또는 1 값만 가질 수 있는 변수임을 나타냅니다.

Scalars are implemented as tensors that contain only one element. Below, we assign two scalars and perform the familiar addition, multiplication, division, and exponentiation operations.

스칼라는 하나의 요소만 포함하는 텐서로 구현됩니다. 아래에서는 두 개의 스칼라를 할당하고 익숙한 덧셈, 곱셈, 나눗셈 및 지수 연산을 수행합니다.

x = torch.tensor(3.0)
y = torch.tensor(2.0)

x + y, x * y, x / y, x**y, x-y

주어진 코드는 PyTorch를 사용하여 두 개의 텐서 x와 y를 생성하고, 이를 활용하여 다양한 수학 연산을 수행하는 예제입니다. 아래는 코드의 설명입니다:

x = torch.tensor(3.0): x라는 이름의 PyTorch 텐서를 생성하고, 값으로 3.0을 할당합니다.
y = torch.tensor(2.0): y라는 이름의 PyTorch 텐서를 생성하고, 값으로 2.0을 할당합니다.
x + y: x와 y의 덧셈을 수행합니다. 결과로 새로운 텐서가 생성되며, 이 텐서의 값은 3.0 + 2.0으로 5.0이 됩니다.
x * y: x와 y의 곱셈을 수행합니다. 결과로 새로운 텐서가 생성되며, 이 텐서의 값은 3.0 * 2.0으로 6.0이 됩니다.
x / y: x를 y로 나눗셈을 수행합니다. 결과로 새로운 텐서가 생성되며, 이 텐서의 값은 3.0 / 2.0으로 1.5가 됩니다.
x**y: x를 y 제곱 연산을 수행합니다. 결과로 새로운 텐서가 생성되며, 이 텐서의 값은 3.0의 2.0 제곱으로 9.0이 됩니다.
x - y: x와 y의 뺄셈을 수행합니다. 결과로 새로운 텐서가 생성되며, 이 텐서의 값은 3.0 - 2.0으로 1.0이 됩니다.

코드를 실행하면 각 연산의 결과가 출력됩니다. PyTorch를 사용하면 텐서를 활용하여 다양한 수학적 연산을 수행할 수 있으며, 딥러닝 모델의 학습과 예측 등에 활용됩니다.

(tensor(5.), tensor(6.), tensor(1.5000), tensor(9.), tensor(1.))

2.3.2. Vectors

For current purposes, you can think of a vector as a fixed-length array of scalars. As with their code counterparts, we call these scalars the elements of the vector (synonyms include entries and components). When vectors represent examples from real-world datasets, their values hold some real-world significance. For example, if we were training a model to predict the risk of a loan defaulting, we might associate each applicant with a vector whose components correspond to quantities like their income, length of employment, or number of previous defaults. If we were studying the risk of heart attack, each vector might represent a patient and its components might correspond to their most recent vital signs, cholesterol levels, minutes of exercise per day, etc. We denote vectors by bold lowercase letters, (e.g., x, y, and z).

현재 목적상 벡터를 스칼라의 고정 길이 배열로 생각할 수 있습니다. 해당 코드와 마찬가지로 이러한 스칼라를 벡터의 요소라고 부릅니다(동의어에는 항목과 구성 요소가 포함됨). 벡터가 실제 데이터 세트의 예를 나타내는 경우 해당 값은 실제 의미를 갖습니다. 예를 들어, 대출 불이행 위험을 예측하기 위해 모델을 훈련하는 경우 각 지원자를 소득, 고용 기간 또는 이전 불이행 횟수와 같은 수량에 해당하는 구성요소가 있는 벡터와 연결할 수 있습니다. 심장 마비의 위험을 연구하는 경우 각 벡터는 환자를 나타낼 수 있으며 그 구성 요소는 가장 최근의 활력 징후, 콜레스테롤 수치, 일일 운동 시간(분) 등에 해당할 수 있습니다. 벡터는 굵은 소문자로 표시됩니다(예: x, y, z).

Vectors are implemented as 1 st-order tensors. In general, such tensors can have arbitrary lengths, subject to memory limitations. Caution: in Python, as in most programming languages, vector indices start at 0, also known as zero-based indexing, whereas in linear algebra subscripts begin at 1 (one-based indexing).

벡터는 1차 텐서로 구현됩니다. 일반적으로 이러한 텐서는 메모리 제한에 따라 임의의 길이를 가질 수 있습니다. 주의: Python에서는 대부분의 프로그래밍 언어와 마찬가지로 벡터 인덱스가 0부터 시작합니다(0부터 시작하는 인덱싱이라고도 함). 반면 선형 대수학 첨자는 1(1부터 시작하는 인덱싱)에서 시작합니다.

x = torch.arange(3)
x

주어진 코드는 PyTorch를 사용하여 텐서 x를 생성하는 예제입니다. 아래는 코드의 설명입니다:

x = torch.arange(3): torch.arange() 함수를 사용하여 x라는 이름의 PyTorch 텐서를 생성합니다. torch.arange() 함수는 주어진 범위 내의 정수를 순서대로 생성하는 함수로, 여기서는 0부터 2까지의 정수를 생성하게 됩니다. 따라서 x는 [0, 1, 2]라는 값을 가지는 1차원 텐서가 됩니다.
x: 텐서 x를 출력합니다. 이 코드는 텐서 x의 내용을 화면에 출력하여 확인할 수 있습니다.

결과적으로, 코드를 실행하면 x라는 이름의 PyTorch 텐서가 생성되며, 그 값은 [0, 1, 2]인 1차원 배열로 출력됩니다. PyTorch의 텐서는 수학 및 딥러닝 연산에 사용되며, 다양한 데이터 처리 및 모델 학습 작업에 활용됩니다.

tensor([0, 1, 2])

We can refer to an element of a vector by using a subscript. For example, x₂ denotes the second element of x. Since x₂ is a scalar, we do not bold it. By default, we visualize vectors by stacking their elements vertically.

아래 첨자를 사용하여 벡터의 요소를 참조할 수 있습니다. 예를 들어 x₂는 x의 두 번째 요소를 나타냅니다. x₂는 스칼라이므로 굵게 표시하지 않습니다. 기본적으로 벡터의 요소를 수직으로 쌓아서 벡터를 시각화합니다.

Here x₁,…,x_n are elements of the vector. Later on, we will distinguish between such column vectors and row vectors whose elements are stacked horizontally. Recall that we access a tensor’s elements via indexing.

여기서 x₁,…,x_n은 벡터의 요소입니다. 나중에 이러한 열 벡터와 요소가 가로로 쌓인 행 벡터를 구분할 것입니다. 인덱싱을 통해 텐서의 요소에 액세스한다는 점을 기억하세요.

x[2]

주어진 코드는 PyTorch 텐서 x에서 특정 인덱스에 해당하는 값을 추출하는 예제입니다. 아래는 코드의 설명입니다:

x[2]: 이 코드는 PyTorch 텐서 x에서 인덱스 2에 해당하는 값을 추출하는 작업을 수행합니다. 텐서 x는 0부터 시작하는 인덱스를 가지며, 2번 인덱스는 세 번째 원소를 나타냅니다.

예를 들어, 만약 x가 [0, 1, 2]라는 값을 가진 1차원 텐서라면, x[2]는 2번 인덱스에 해당하는 값인 2를 반환합니다.

이 코드를 실행하면 x 텐서에서 해당 인덱스의 값을 추출하여 반환합니다. 결과적으로, x[2]는 x 텐서의 세 번째 원소에 해당하는 값을 반환합니다.

tensor(2)

To indicate that a vector contains n elements, we write x∈ ℝⁿ. Formally, we call n the dimensionality of the vector. In code, this corresponds to the tensor’s length, accessible via Python’s built-in len function.

벡터에 n개의 요소가 포함되어 있음을 나타내기 위해 x∈ ℝⁿ이라고 씁니다. 공식적으로 n을 벡터의 차원이라고 부릅니다. 코드에서 이는 Python의 내장 len 함수를 통해 액세스할 수 있는 텐서의 길이에 해당합니다.

len(x)

주어진 코드는 PyTorch 텐서 x의 길이(원소의 개수)를 반환하는 예제입니다. 아래는 코드의 설명입니다:

len(x): 이 코드는 PyTorch 텐서 x의 길이를 반환합니다. 텐서의 길이는 해당 텐서에 포함된 원소의 개수를 나타냅니다.

예를 들어, x가 [0, 1, 2]라는 값을 가진 1차원 텐서라면, len(x)는 3을 반환합니다. 즉, 이 텐서에는 3개의 원소가 포함되어 있습니다.

이 코드를 실행하면 x 텐서의 길이가 반환됩니다. 결과적으로, len(x)는 x 텐서에 포함된 원소의 개수를 나타내는 정수를 반환합니다.

We can also access the length via the shape attribute. The shape is a tuple that indicates a tensor’s length along each axis. Tensors with just one axis have shapes with just one element.

Shape 속성을 통해 길이에 접근할 수도 있습니다. 모양은 각 축을 따라 텐서의 길이를 나타내는 튜플입니다. 축이 하나뿐인 텐서는 요소가 하나만 있는 모양을 갖습니다.

x.shape

주어진 코드는 PyTorch 텐서 x의 모양(shape)을 반환하는 예제입니다. 아래는 코드의 설명입니다:

x.shape: 이 코드는 PyTorch 텐서 x의 모양(shape)을 반환합니다. 모양은 텐서가 어떻게 구성되어 있는지를 나타내며, 차원과 각 차원의 크기를 포함합니다.

예를 들어, 만약 x가 2차원 텐서이고 모양이 (3, 4)라면, x.shape는 (3, 4)를 반환합니다. 이는 3개의 행과 4개의 열로 이루어진 2차원 텐서임을 나타냅니다.

이 코드를 실행하면 x 텐서의 모양이 반환됩니다. 결과적으로, x.shape는 텐서의 차원과 각 차원의 크기를 나타내는 튜플을 반환합니다.

torch.Size([3])

Oftentimes, the word “dimension” gets overloaded to mean both the number of axes and the length along a particular axis. To avoid this confusion, we use order to refer to the number of axes and dimensionality exclusively to refer to the number of components.

종종 " dimension 차원"이라는 단어는 축 수와 특정 축의 길이를 모두 의미하는 것으로 오버로드됩니다. 이러한 혼란을 피하기 위해 우리는 축 수를 참조하기 위해 순서 order 를 사용하고 구성 요소 수를 참조하기 위해 차원 dimensionality 을 독점적으로 사용합니다.

2.3.3. Matrices

Just as scalars are 0 th-order tensors and vectors are 1 st-order tensors, matrices are 2 nd-order tensors. We denote matrices by bold capital letters (e.g., X, Y, and Z), and represent them in code by tensors with two axes. The expression A∈ ℝ ^m×n indicates that a matrix A contains m×n real-valued scalars, arranged as m rows and n columns. When m=n, we say that a matrix is square. Visually, we can illustrate any matrix as a table. To refer to an individual element, we subscript both the row and column indices, e.g., a_ij is the value that belongs to A’s i th row and j th column:

스칼라가 0차 텐서이고 벡터가 1차 텐서인 것처럼 행렬은 2차 텐서입니다. 행렬은 굵은 대문자(예: X, Y, Z)로 표시하고 두 개의 축이 있는 텐서로 코드로 표시합니다. A∈ ℝ^m×n 표현식은 행렬 A가 m행과 n열로 배열된 m×n 실수 값 스칼라를 포함함을 나타냅니다. m=n일 때, 행렬은 square 이라고 말합니다. 시각적으로 모든 행렬을 테이블로 설명할 수 있습니다. 개별 요소를 참조하기 위해 행과 열 인덱스를 모두 첨자로 표시합니다. 예를 들어 a_ij는 A의 i 번째 행과 j 번째 열에 속하는 값입니다.

In code, we represent a matrix A∈ ℝ m×n by a 2nd-order tensor with shape (m, n). We can convert any appropriately sized m×n tensor into an m×n matrix by passing the desired shape to reshape:

코드에서는 행렬 A∈ ℝ^m×n을 모양이 (m, n)인 2차 텐서로 표현합니다. 원하는 모양을 전달하여 적절한 크기의 m×n 텐서를 m×n 행렬로 변환할 수 있습니다.

A = torch.arange(6).reshape(3, 2)
A

주어진 코드는 PyTorch를 사용하여 텐서 A를 생성하고, 그 모양을 변경하는 작업을 수행하는 예제입니다. 아래는 코드의 설명입니다:

A = torch.arange(6): torch.arange() 함수를 사용하여 0부터 5까지의 정수로 이루어진 1차원 텐서 A를 생성합니다. 이 텐서는 [0, 1, 2, 3, 4, 5]와 같은 값을 가지게 됩니다.
.reshape(3, 2): 생성한 텐서 A의 모양을 변경합니다. .reshape() 메서드를 사용하여 텐서의 모양을 변경할 수 있으며, 여기서는 (3, 2) 모양으로 변경합니다. 따라서 텐서 A는 3개의 행과 2개의 열로 이루어진 2차원 텐서가 됩니다.
A: 변경된 텐서 A를 출력합니다. 이 코드는 변경된 모양의 텐서 A를 확인할 수 있도록 출력합니다.

결과적으로, 코드를 실행하면 0부터 5까지의 값을 가진 1차원 텐서 A가 생성되고, 그 후 (3, 2) 모양으로 변경된 텐서 A가 출력됩니다. 이처럼 PyTorch를 사용하면 텐서의 모양을 변경하여 데이터를 원하는 형태로 조작할 수 있습니다.

tensor([[0, 1],
        [2, 3],
        [4, 5]])

Sometimes we want to flip the axes. When we exchange a matrix’s rows and columns, the result is called its transpose. Formally, we signify a matrix A’s transpose by A^⊤ and if B=A^⊤, then b_ij=a_ji for all i and j. Thus, the transpose of an m×n matrix is an n×m matrix:

때로는 축을 뒤집고 싶을 때도 있습니다. 행렬의 행과 열을 교환할 때의 결과를 전치 transpose 라고 합니다. 공식적으로, 행렬 A의 전치를 A^⊤로 표시하고 B=A^⊤이면 모든 i와 j에 대해 b_ij=a_ji입니다. 따라서 m×n 행렬의 전치는 n×m 행렬입니다.

In code, we can access any matrix’s transpose as follows:

코드에서는 다음과 같이 모든 행렬의 전치에 액세스할 수 있습니다.

A.T

tensor([[0, 2, 4],
        [1, 3, 5]])

주어진 코드 A.T는 PyTorch 텐서 A의 전치(transpose)를 반환하는 작업을 나타냅니다. 아래는 코드의 설명입니다:

A.T: 이 코드는 텐서 A의 전치(transpose)를 반환합니다. 전치란 원본 행렬 또는 텐서의 행과 열을 바꾼 것을 의미합니다. 즉, 텐서 A의 행은 열로, 열은 행으로 바뀝니다.

예를 들어, 만약 텐서 A가 다음과 같다면:

A = torch.tensor([[0, 1],
                  [2, 3],
                  [4, 5]])

A.T는 다음과 같이 전치된 텐서를 반환합니다:

tensor([[0, 2, 4],
        [1, 3, 5]])

결과적으로, A.T 코드를 실행하면 텐서 A의 전치된 형태인 새로운 텐서가 반환됩니다. 이렇게 하면 원본 텐서의 행과 열이 바뀐 모양의 텐서를 얻을 수 있습니다.

Symmetric matrices are the subset of square matrices that are equal to their own transposes: A=A^⊤. The following matrix is symmetric:

대칭 행렬은 자체 전치와 동일한 정사각형 행렬의 하위 집합입니다(A=A^⊤). 다음 행렬은 대칭입니다.

A = torch.tensor([[1, 2, 3], [2, 0, 4], [3, 4, 5]])
A == A.T

주어진 코드는 PyTorch 텐서 A와 그 전치(A.T) 간의 요소별 비교를 수행하고 결과를 반환하는 예제입니다. 아래는 코드의 설명입니다:

A = torch.tensor([[1, 2, 3], [2, 0, 4], [3, 4, 5]]): 텐서 A를 생성합니다. 이 텐서는 3x3 크기의 2차원 배열을 나타내며, 각 요소의 값은 주어진 값으로 초기화됩니다.
A.T: 텐서 A의 전치(transpose)를 계산합니다. 이렇게 하면 원본 텐서의 행과 열이 바뀐 형태의 텐서가 생성됩니다.
A == A.T: 원본 텐서 A와 그 전치 텐서 A.T 간의 요소별(원소별) 비교를 수행합니다. 두 텐서의 같은 위치에 있는 요소끼리 비교하며, 결과는 두 텐서가 동일한 요소를 가지면 True로, 다른 경우에는 False로 나타납니다.

결과적으로, A == A.T 코드를 실행하면 A와 A의 전치 텐서 A.T 간의 요소별 비교 결과를 반환합니다. 이 코드의 결과는 A와 A.T가 대칭 행렬인 경우에만 모든 요소가 True가 됩니다. 다시 말해, A가 대칭 행렬이라면 A == A.T는 모든 요소가 True를 반환할 것입니다.

tensor([[True, True, True],
        [True, True, True],
        [True, True, True]])

Matrices are useful for representing datasets. Typically, rows correspond to individual records and columns correspond to distinct attributes.

행렬은 데이터 세트를 나타내는 데 유용합니다. 일반적으로 행은 개별 레코드에 해당하고 열은 고유한 속성에 해당합니다.

2.3.4. Tensors

While you can go far in your machine learning journey with only scalars, vectors, and matrices, eventually you may need to work with higher-order tensors. Tensors give us a generic way of describing extensions to nth-order arrays. We call software objects of the tensor class “tensors” precisely because they too can have arbitrary numbers of axes. While it may be confusing to use the word tensor for both the mathematical object and its realization in code, our meaning should usually be clear from context. We denote general tensors by capital letters with a special font face (e.g., X, Y, and Z) and their indexing mechanism (e.g., x_ijk and [X]_1,2i−1,3) follows naturally from that of matrices.

스칼라, 벡터 및 행렬만으로 기계 학습 여정을 멀리할 수 있지만 결국에는 고차 텐서를 사용하여 작업해야 할 수도 있습니다. 텐서는 n차 배열에 대한 확장을 설명하는 일반적인 방법을 제공합니다. 우리는 텐서 클래스의 소프트웨어 객체를 "텐서"라고 부릅니다. 왜냐하면 그들 역시 임의의 수의 축을 가질 수 있기 때문입니다. 수학적 객체와 코드에서의 구현 모두에 대해 텐서라는 단어를 사용하는 것이 혼란스러울 수 있지만 일반적으로 의미는 문맥에서 명확해야 합니다. 일반 텐서를 특수 글꼴(예: X, Y, Z)이 있는 대문자로 표시하며 해당 인덱싱 메커니즘(예: x_ijk 및 [X]_1,2i−1,3)은 자연스럽게 행렬의 인덱싱 메커니즘을 따릅니다.

Tensors will become more important when we start working with images. Each image arrives as a 3rd-order tensor with axes corresponding to the height, width, and channel. At each spatial location, the intensities of each color (red, green, and blue) are stacked along the channel. Furthermore, a collection of images is represented in code by a 4th-order tensor, where distinct images are indexed along the first axis. Higher-order tensors are constructed, as were vectors and matrices, by growing the number of shape components.

이미지 작업을 시작하면 Tensor가 더욱 중요해집니다. 각 이미지는 높이, 너비 및 채널에 해당하는 축이 있는 3차 텐서로 도착합니다. 각 공간 위치에서 각 색상(빨간색, 녹색, 파란색)의 강도가 채널을 따라 누적됩니다. 또한, 이미지 모음은 코드에서 4차 텐서로 표현되며, 여기서 개별 이미지는 첫 번째 축을 따라 인덱싱됩니다. 고차 텐서는 벡터 및 행렬과 마찬가지로 모양 구성 요소의 수를 늘려 구성됩니다.

torch.arange(24).reshape(2, 3, 4)

주어진 코드는 PyTorch를 사용하여 2x3x4 크기의 3차원 텐서를 생성하는 작업을 나타냅니다. 아래는 코드의 설명입니다:

torch.arange(24): 이 코드는 0부터 23까지의 정수를 포함하는 1차원 텐서를 생성합니다. torch.arange() 함수는 주어진 범위 내의 정수를 생성하는 함수입니다.
.reshape(2, 3, 4): 앞서 생성한 1차원 텐서를 2x3x4 크기의 다차원 텐서로 형태를 변경합니다. .reshape() 메서드를 사용하여 텐서의 모양을 변경할 수 있으며, 여기서는 (2, 3, 4) 모양으로 변경합니다. 따라서 이제 텐서는 3차원 배열로 표현되며, 크기는 2개의 "깊이" (depth), 각 깊이당 3개의 "행" (rows), 그리고 각 행당 4개의 "열" (columns)로 구성됩니다.

결과적으로, torch.arange(24).reshape(2, 3, 4) 코드를 실행하면 2x3x4 크기의 3차원 텐서가 생성됩니다. 이 텐서는 다양한 데이터를 저장하거나 다차원 배열 연산을 수행하는 데 사용될 수 있습니다.

tensor([[[ 0,  1,  2,  3],
         [ 4,  5,  6,  7],
         [ 8,  9, 10, 11]],

        [[12, 13, 14, 15],
         [16, 17, 18, 19],
         [20, 21, 22, 23]]])

2.3.5. Basic Properties of Tensor Arithmetic

Scalars, vectors, matrices, and higher-order tensors all have some handy properties. For example, elementwise operations produce outputs that have the same shape as their operands.

스칼라, 벡터, 행렬 및 고차 텐서는 모두 몇 가지 편리한 속성을 가지고 있습니다. 예를 들어 요소별 연산은 피연산자와 모양이 동일한 출력을 생성합니다.

A = torch.arange(6, dtype=torch.float32).reshape(2, 3)
B = A.clone()  # Assign a copy of A to B by allocating new memory
A, A + B

(tensor([[0., 1., 2.],
         [3., 4., 5.]]),
 tensor([[ 0.,  2.,  4.],
         [ 6.,  8., 10.]]))

주어진 코드는 PyTorch를 사용하여 두 개의 텐서 A와 B를 생성하고, 이를 활용하여 텐서 간의 덧셈 연산을 수행하는 작업을 나타냅니다. 아래는 코드의 설명입니다:

A = torch.arange(6, dtype=torch.float32).reshape(2, 3): 먼저, 텐서 A를 생성합니다. torch.arange() 함수는 0부터 시작하여 5까지의 값을 가지는 1차원 텐서를 생성하고, 이를 .reshape(2, 3) 메서드를 사용하여 2x3 크기의 2차원 텐서로 형태를 변경합니다. 결과적으로 A는 다음과 같이 표현됩니다:

tensor([[0., 1., 2.],
        [3., 4., 5.]])

B = A.clone(): 다음으로, 텐서 B를 생성합니다. 여기서 A.clone()을 사용하여 텐서 A의 복사본을 만듭니다. 이 때, 새로운 메모리를 할당하여 A와 동일한 값을 가지는 B를 생성합니다. 이로써 A와 B는 동일한 값을 가지지만 서로 다른 메모리 공간을 사용하게 됩니다.
A, A + B: 텐서 A와 텐서 B를 출력합니다. 그리고 A + B를 사용하여 두 텐서의 요소별 덧셈 연산을 수행한 결과를 출력합니다. 덧셈 연산은 같은 위치에 있는 요소끼리 더해지며, 결과는 다음과 같이 표시됩니다:

(tensor([[0., 1., 2.],
         [3., 4., 5.]]),
 tensor([[ 0.,  2.,  4.],
         [ 6.,  8., 10.]]))

결과적으로, 코드를 실행하면 두 개의 텐서 A와 B가 생성되고, 이를 사용하여 요소별 덧셈 연산이 수행되어 두 개의 텐서가 반환됩니다. 이렇게 PyTorch를 사용하면 텐서 연산을 쉽게 수행하고, 복사본을 만들어 원본 데이터를 보존할 수 있습니다.

The elementwise product of two matrices is called their Hadamard product (denoted ⊙). We can spell out the entries of the Hadamard product of two matrices A,B∈ ℝ ^m×n:

두 행렬의 요소별 곱을 하다마드 곱(Hadamard product)이라고 합니다(기호 ⊙). 두 행렬 A,B∈ ℝ ^m×n의 Hadamard 곱의 항목을 spell out 할 수 있습니다.

A * B

tensor([[ 0.,  1.,  4.],
        [ 9., 16., 25.]])

주어진 코드 A * B는 PyTorch 텐서 A와 B 간의 요소별 곱셈 연산을 나타냅니다. 아래는 코드의 설명입니다:

A * B: 이 코드는 텐서 A와 B 간의 요소별 곱셈을 수행합니다. 요소별 곱셈은 두 텐서의 같은 위치에 있는 요소끼리 곱셈을 수행하며, 결과는 새로운 텐서로 반환됩니다.

예를 들어, A와 B가 다음과 같다면:

A = torch.tensor([[0., 1., 2.],
                  [3., 4., 5.]])
B = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.]])

A * B는 다음과 같이 요소별로 곱셈이 수행된 결과를 반환합니다:

tensor([[ 0.,  2.,  6.],
        [12., 20., 30.]])

결과적으로, A * B 코드를 실행하면 두 개의 텐서 A와 B 간의 요소별 곱셈 연산이 수행되어 새로운 텐서가 반환됩니다. 이렇게 하면 각 요소가 서로 곱해진 결과가 나타납니다. PyTorch를 사용하면 다양한 요소별 연산을 쉽게 수행할 수 있으며, 이를 활용하여 수학적 계산을 수행할 수 있습니다.

Hadamard product (Hadamard 곱)이란?

The Hadamard product, also known as the element-wise product or entrywise product, is a mathematical operation that involves multiplying each corresponding element of two matrices or vectors together.

Hadamard 곱(또는 요소별 곱셈 또는 항별 곱셈)은 두 행렬 또는 벡터의 각 해당 요소를 서로 곱하는 수학적 연산입니다.

Specifically, for two matrices A and B of the same shape (i.e., they have the same number of rows and columns), the Hadamard product C is calculated as follows:

구체적으로, 동일한 모양을 가진 두 행렬 A와 B에 대해 Hadamard 곱 C는 다음과 같이 계산됩니다:

C[i][j] = A[i][j] * B[i][j] for all valid indices i and j.

In other words, each element in the resulting matrix C is obtained by multiplying the corresponding elements of A and B. The Hadamard product differs from the more conventional matrix multiplication (dot product) in which you perform element-wise multiplication and then sum the results.

다시 말해, 결과 행렬 C의 각 요소는 A와 B의 해당 요소를 곱하여 얻어집니다. Hadamard 곱은 요소별 곱셈을 수행하고 결과를 합산하는 일반적인 행렬 곱셈과는 다릅니다.

The Hadamard product is often denoted by a circle with a dot inside it (∘) or by simply using the multiplication symbol (*), without any special operation notation.

Hadamard 곱은 두 행렬 또는 벡터의 각 요소를 독립적으로 처리하려는 경우 요소 간의 관계를 보존하면서 수행할 때 특히 유용합니다. Hadamard 곱은 일반적으로 점(∘)이 내부에 있는 원으로 표시되거나 간단히 곱셈 기호(*)를 사용하여 나타납니다.

The Hadamard product is used in various mathematical and scientific contexts, including linear algebra, signal processing, and statistics. It's particularly useful in cases where you want to perform operations on individual components of matrices or vectors independently, preserving their element-wise relationships.

Hadamard 곱은 선형 대수, 신호 처리 및 통계를 포함한 다양한 수학적 및 과학적 맥락에서 사용됩니다. 이 연산은 행렬이나 벡터의 개별 구성 요소에 대한 연산을 독립적으로 수행하고 요소 간의 관계를 보존할 때 특히 유용합니다.

Adding or multiplying a scalar and a tensor produces a result with the same shape as the original tensor. Here, each element of the tensor is added to (or multiplied by) the scalar.

스칼라와 텐서를 더하거나 곱하면 원래 텐서와 모양이 같은 결과가 생성됩니다. 여기서 텐서의 각 요소는 스칼라에 추가되거나 곱해집니다.

a = 2
X = torch.arange(24).reshape(2, 3, 4)
a + X, (a * X).shape

(tensor([[[ 2,  3,  4,  5],
          [ 6,  7,  8,  9],
          [10, 11, 12, 13]],

         [[14, 15, 16, 17],
          [18, 19, 20, 21],
          [22, 23, 24, 25]]]),
 torch.Size([2, 3, 4]))

주어진 코드는 PyTorch를 사용하여 스칼라 값 a와 3차원 텐서 X 간의 덧셈 연산과 곱셈 연산을 수행하는 작업을 나타냅니다. 아래는 코드의 설명입니다:

a = 2: 변수 a에 숫자 2를 할당합니다. 이 값은 스칼라(하나의 숫자)를 나타냅니다.
X = torch.arange(24).reshape(2, 3, 4): 0부터 23까지의 정수로 이루어진 1차원 텐서를 생성한 후, .reshape(2, 3, 4) 메서드를 사용하여 이를 2x3x4 크기의 3차원 텐서로 형태를 변경합니다. 결과적으로 X는 3차원 배열로 표현되며, 크기는 2개의 "깊이" (depth), 각 깊이당 3개의 "행" (rows), 그리고 각 행당 4개의 "열" (columns)로 구성됩니다.
a + X: 스칼라 값 a와 3차원 텐서 X 간의 요소별 덧셈 연산을 수행합니다. 스칼라 값 a가 X의 각 요소에 더해지며, 결과는 X와 동일한 크기의 텐서로 반환됩니다.
(a * X).shape: 스칼라 값 a와 3차원 텐서 X 간의 요소별 곱셈 연산을 수행합니다. 결과로 나오는 텐서의 모양(shape)을 확인합니다. .shape 속성은 텐서의 모양을 반환합니다.

결과적으로, 코드를 실행하면 두 개의 결과가 반환됩니다:

첫 번째 결과는 a가 X의 각 요소에 더해져서 생성된 텐서입니다. 이 텐서는 X와 동일한 크기를 가지며, 각 요소는 a와 X의 해당 위치 요소를 더한 값입니다.
두 번째 결과는 a가 X의 각 요소에 곱해진 결과인 텐서의 모양(shape)입니다. 이 경우, 곱셈 연산은 모양(shape)을 변경하지 않으므로 (2, 3, 4)와 같은 모양을 가진 원래 X와 동일한 모양을 반환합니다.

이 코드를 통해 PyTorch에서 스칼라와 텐서 간의 연산을 어떻게 수행하는지를 이해할 수 있습니다.

2.3.6. Reduction

Often, we wish to calculate the sum of a tensor’s elements. To express the sum of the elements in a vector x of length n, we write ∑ⁿ _i=1 x_i. There is a simple function for it:

종종 우리는 텐서 요소의 합을 계산하고 싶습니다. 길이 n의 벡터 x에 있는 요소의 합을 표현하기 위해 ∑ⁿ _i=1 x_i라고 씁니다. 이를 위한 간단한 기능이 있습니다:

x = torch.arange(3, dtype=torch.float32)
x, x.sum()

주어진 코드는 PyTorch를 사용하여 1차원 텐서 x를 생성하고, 그 텐서의 합계를 계산하는 작업을 나타냅니다. 아래는 코드의 설명입니다:

x = torch.arange(3, dtype=torch.float32): torch.arange() 함수를 사용하여 0부터 2까지의 정수로 이루어진 1차원 텐서 x를 생성합니다. dtype=torch.float32를 사용하여 텐서의 데이터 타입을 부동소수점(float32)으로 설정합니다.
x: 생성한 텐서 x를 출력합니다. 이 코드는 텐서 x를 확인하기 위해 화면에 출력합니다.
x.sum(): 텐서 x의 합계를 계산합니다. .sum() 메서드는 텐서의 모든 요소의 합을 반환합니다. 여기서는 x의 요소가 [0.0, 1.0, 2.0]이므로 합계는 0.0 + 1.0 + 2.0으로 3.0이 됩니다.

결과적으로, 코드를 실행하면 텐서 x가 생성되고, 이 텐서의 내용이 출력됩니다. 또한 x.sum()을 호출하여 텐서 x의 합계가 계산되고 출력됩니다. 이 코드는 PyTorch를 사용하여 텐서를 생성하고 기본적인 연산을 수행하는 예제를 보여줍니다.

(tensor([0., 1., 2.]), tensor(3.))

To express sums over the elements of tensors of arbitrary shape, we simply sum over all its axes. For example, the sum of the elements of an m×n matrix A could be written ∑^m _i=1∑ⁿ _j=1 a_ij.

임의 형태의 텐서 요소에 대한 합을 표현하려면 단순히 모든 축에 대한 합을 구하면 됩니다. 예를 들어, m×n 행렬 A의 요소들의 합은 ∑^m _i=1∑ⁿ _j=1 a_ij로 쓸 수 있습니다.

A.shape, A.sum()

(torch.Size([2, 3]), tensor(15.))

주어진 코드는 PyTorch 텐서 A의 모양(shape)과 텐서 내의 모든 요소의 합계를 계산하는 작업을 나타냅니다. 아래는 코드의 설명입니다:

A.shape: 텐서 A의 모양(shape)을 확인하는 작업입니다. .shape 속성을 사용하면 텐서의 차원과 각 차원의 크기를 나타내는 튜플이 반환됩니다.
A.sum(): 텐서 A의 모든 요소의 합계를 계산하는 작업입니다. .sum() 메서드를 사용하면 텐서 내의 모든 요소를 합한 결과가 반환됩니다.

결과적으로, 코드를 실행하면 두 가지 결과가 반환됩니다:

첫 번째 결과는 텐서 A의 모양(shape)을 나타내는 튜플입니다. 이 튜플은 텐서의 차원과 각 차원의 크기를 포함하고 있습니다. 예를 들어, (2, 3, 4)와 같은 튜플은 3차원 텐서이며, 첫 번째 차원은 2, 두 번째 차원은 3, 세 번째 차원은 4의 크기를 가집니다.
두 번째 결과는 텐서 A 내의 모든 요소의 합계를 나타내는 값입니다. 이 값은 텐서 내의 모든 요소를 합한 결과이므로, A의 모든 요소를 합친 총합이 됩니다.

이 코드는 텐서의 모양과 요소의 합을 확인하는 데 유용하며, 데이터 분석 및 딥러닝 모델 학습 중에 자주 사용됩니다.

By default, invoking the sum function reduces a tensor along all of its axes, eventually producing a scalar. Our libraries also allow us to specify the axes along which the tensor should be reduced. To sum over all elements along the rows (axis 0), we specify axis=0 in sum. Since the input matrix reduces along axis 0 to generate the output vector, this axis is missing from the shape of the output.

기본적으로 sum 함수를 호출하면 모든 축을 따라 텐서가 줄어들고 결국 스칼라가 생성됩니다. 우리 라이브러리를 사용하면 텐서가 감소되어야 하는 축을 지정할 수도 있습니다. 행(축 0)을 따라 모든 요소를 합산하려면 합산에 axis=0을 지정합니다. 입력 행렬은 출력 벡터를 생성하기 위해 축 0을 따라 감소하므로 이 축은 출력의 모양에서 누락됩니다.

A.shape, A.sum(axis=0).shape

주어진 코드는 PyTorch 텐서 A의 모양(shape)과 axis=0를 사용하여 첫 번째 차원(열)을 따라 합산한 결과의 모양을 확인하는 작업을 나타냅니다. 아래는 코드의 설명입니다:

A.shape: 텐서 A의 모양(shape)을 확인하는 작업입니다. .shape 속성을 사용하면 텐서의 차원과 각 차원의 크기를 나타내는 튜플이 반환됩니다.
A.sum(axis=0): 텐서 A의 첫 번째 차원(열)을 따라 합산한 결과를 계산합니다. axis 매개변수를 사용하여 어떤 차원을 따라 합산할지 지정할 수 있으며, 여기서는 axis=0을 사용하여 첫 번째 차원을 따라 합산합니다.
A.sum(axis=0).shape: 합산된 결과 텐서의 모양(shape)을 확인하는 작업입니다. .shape 속성을 사용하여 모양을 확인합니다.

결과적으로, 코드를 실행하면 두 가지 결과가 반환됩니다:

첫 번째 결과는 원래 텐서 A의 모양(shape)을 나타내는 튜플입니다.
두 번째 결과는 첫 번째 차원(열)을 따라 합산한 결과 텐서의 모양(shape)을 나타내는 튜플입니다. 이 결과 텐서는 첫 번째 차원이 합산되었기 때문에 원래 텐서보다 한 차원이 줄어듭니다.

이 코드를 통해 PyTorch를 사용하여 텐서의 차원 및 차원 간의 연산을 조작하고 모양(shape)을 확인하는 방법을 이해할 수 있습니다.

(torch.Size([2, 3]), torch.Size([3]))

Specifying axis=1 will reduce the column dimension (axis 1) by summing up elements of all the columns.

axis=1을 지정하면 모든 열의 요소를 합산하여 열 차원(축 1)이 줄어듭니다.

A.shape, A.sum(axis=1).shape

주어진 코드는 PyTorch 텐서 A의 모양(shape)과 axis=1을 사용하여 두 번째 차원(행)을 따라 합산한 결과의 모양을 확인하는 작업을 나타냅니다. 아래는 코드의 설명입니다:

A.shape: 텐서 A의 모얥(shape)을 확인하는 작업입니다. .shape 속성을 사용하면 텐서의 차원과 각 차원의 크기를 나타내는 튜플이 반환됩니다.
A.sum(axis=1): 텐서 A의 두 번째 차원(행)을 따라 합산한 결과를 계산합니다. axis 매개변수를 사용하여 어떤 차원을 따라 합산할지 지정할 수 있으며, 여기서는 axis=1을 사용하여 두 번째 차원을 따라 합산합니다.
A.sum(axis=1).shape: 합산된 결과 텐서의 모양(shape)을 확인하는 작업입니다. .shape 속성을 사용하여 모양을 확인합니다.

결과적으로, 코드를 실행하면 두 가지 결과가 반환됩니다:

첫 번째 결과는 원래 텐서 A의 모양(shape)을 나타내는 튜플입니다.
두 번째 결과는 두 번째 차원(행)을 따라 합산한 결과 텐서의 모양(shape)을 나타내는 튜플입니다. 이 결과 텐서는 두 번째 차원이 합산되었기 때문에 원래 텐서보다 한 차원이 줄어듭니다.

이 코드를 통해 PyTorch를 사용하여 텐서의 차원 및 차원 간의 연산을 조작하고 모양(shape)을 확인하는 방법을 이해할 수 있습니다.

(torch.Size([2, 3]), torch.Size([2]))

Reducing a matrix along both rows and columns via summation is equivalent to summing up all the elements of the matrix.

합산을 통해 행과 열 모두에서 행렬을 줄이는 것은 행렬의 모든 요소를 합산하는 것과 같습니다.

A.sum(axis=[0, 1]) == A.sum()  # Same as A.sum()

주어진 코드는 PyTorch 텐서 A에 대해 axis=[0, 1]를 사용하여 두 개의 축(차원)을 따라 합산한 결과와 전체 텐서를 합산한 결과를 비교하는 작업을 나타냅니다. 아래는 코드의 설명입니다:

A.sum(axis=[0, 1]): 텐서 A에 대해 axis=[0, 1]을 사용하여 0번째 차원(열)과 1번째 차원(행)을 따라 합산한 결과를 계산합니다. 여기서 [0, 1]은 텐서의 0번째 차원과 1번째 차원을 모두 합산하라는 의미입니다. 결과는 전체 텐서의 합계를 나타내게 됩니다.
A.sum(): 텐서 A의 모든 요소를 합산한 결과를 계산합니다. .sum() 메서드를 사용하면 텐서 내의 모든 요소를 합산한 결과가 반환됩니다.
A.sum(axis=[0, 1]) == A.sum(): A.sum(axis=[0, 1])과 A.sum()의 결과를 비교하는 작업을 수행합니다. 두 결과가 같은지 여부를 확인하기 위한 비교 연산을 수행하며, 결과는 True 또는 False로 나타납니다.

결과적으로, 코드를 실행하면 A.sum(axis=[0, 1])과 A.sum()의 결과를 비교하여 두 값이 동일한지 여부를 확인하는 논리식이 반환됩니다. 이 코드는 텐서의 다차원 연산과 축을 따라 합산하는 방법을 보여주며, axis=[0, 1]을 사용하여 전체 텐서를 합산한 결과와 A.sum()를 사용한 결과가 동일함을 나타냅니다.

tensor(True)

A related quantity is the mean, also called the average. We calculate the mean by dividing the sum by the total number of elements. Because computing the mean is so common, it gets a dedicated library function that works analogously to sum.

관련 수량은 average 이라고도 하는 평균 mean 입니다. 합계를 전체 요소 수로 나누어 평균을 계산합니다. 평균을 계산하는 것은 매우 일반적이기 때문에 합계와 유사하게 작동하는 전용 라이브러리 함수를 얻습니다.

A.mean(), A.sum() / A.numel()

주어진 코드는 PyTorch 텐서 A의 평균과 텐서의 모든 요소의 합계를 요소의 총 개수로 나눈 결과를 계산하는 작업을 나타냅니다. 아래는 코드의 설명입니다:

A.mean(): 텐서 A의 평균을 계산합니다. .mean() 메서드를 사용하면 텐서 내의 모든 요소의 평균값이 반환됩니다.
A.sum(): 텐서 A의 모든 요소를 합산한 결과를 계산합니다. .sum() 메서드를 사용하면 텐서 내의 모든 요소를 합산한 결과가 반환됩니다.
A.numel(): 텐서 A의 요소의 총 개수를 계산합니다. .numel() 메서드는 텐서 내의 모든 요소의 개수를 반환합니다.
A.sum() / A.numel(): A.sum()을 텐서의 요소 개수(A.numel())로 나눈 결과를 계산합니다. 이렇게 함으로써 텐서의 평균을 구합니다.

결과적으로, 코드를 실행하면 두 가지 결과가 반환됩니다:

첫 번째 결과는 텐서 A의 모든 요소의 평균값을 나타내는 값입니다. 이 값은 텐서의 모든 요소를 더하고 요소의 총 개수로 나눈 결과입니다.
두 번째 결과는 텐서 A의 모든 요소를 합산한 값을 나타내는 값입니다.

이 코드를 통해 PyTorch를 사용하여 텐서의 요소를 평균화하고 총합을 계산하는 방법을 이해할 수 있습니다.

(tensor(2.5000), tensor(2.5000))

Likewise, the function for calculating the mean can also reduce a tensor along specific axes.

마찬가지로, 평균을 계산하는 함수는 특정 축을 따라 텐서를 줄일 수도 있습니다.

A.mean(axis=0), A.sum(axis=0) / A.shape[0]

주어진 코드는 PyTorch 텐서 A에 대해 axis=0을 사용하여 0번째 차원(열)을 따라 평균과 합계를 계산한 결과를 비교하는 작업을 나타냅니다. 아래는 코드의 설명입니다:

A.mean(axis=0): 텐서 A에 대해 axis=0을 사용하여 0번째 차원(열)을 따라 평균을 계산합니다. axis 매개변수를 사용하여 어떤 차원을 따라 평균을 계산할지 지정할 수 있으며, 여기서는 axis=0을 사용하여 0번째 차원(열)을 따라 평균을 계산합니다.
A.sum(axis=0): 텐서 A에 대해 axis=0을 사용하여 0번째 차원(열)을 따라 합계를 계산합니다. axis 매개변수를 사용하여 어떤 차원을 따라 합계를 계산할지 지정할 수 있으며, 여기서는 axis=0을 사용하여 0번째 차원(열)을 따라 합계를 계산합니다.
A.shape[0]: 텐서 A의 0번째 차원(열)의 크기를 확인합니다. A.shape은 텐서의 모양(shape)을 나타내는 튜플을 반환하며, 여기서 [0]을 사용하여 튜플의 첫 번째 요소를 얻습니다.
A.sum(axis=0) / A.shape[0]: A.sum(axis=0)을 0번째 차원(열)의 크기(A.shape[0])로 나눈 결과를 계산합니다. 이렇게 함으로써 0번째 차원을 따라 합산한 평균을 구합니다.

결과적으로, 코드를 실행하면 두 가지 결과가 반환됩니다:

첫 번째 결과는 텐서 A의 0번째 차원(열)을 따라 평균을 계산한 결과입니다.
두 번째 결과는 텐서 A의 0번째 차원(열)을 따라 합산한 결과를 텐서의 0번째 차원 크기로 나눈 평균을 나타내는 값입니다.

이 코드를 통해 PyTorch를 사용하여 텐서의 특정 차원을 따라 평균 및 합계를 계산하고, 이를 텐서의 크기로 나누어 평균을 구하는 방법을 이해할 수 있습니다.

(tensor([1.5000, 2.5000, 3.5000]), tensor([1.5000, 2.5000, 3.5000]))

2.3.7. Non-Reduction Sum

Sometimes it can be useful to keep the number of axes unchanged when invoking the function for calculating the sum or mean. This matters when we want to use the broadcast mechanism.

때로는 합계 또는 평균을 계산하는 함수를 호출할 때 축 수를 변경하지 않고 유지하는 것이 유용할 수 있습니다. 이는 브로드캐스트 메커니즘을 사용하려는 경우 중요합니다.

sum_A = A.sum(axis=1, keepdims=True)
sum_A, sum_A.shape

주어진 코드는 PyTorch 텐서 A에 대해 axis=1을 사용하여 1번째 차원(행)을 따라 합산한 결과를 계산하고, keepdims=True 옵션을 사용하여 결과 텐서의 차원을 유지한 채로 결과를 확인하는 작업을 나타냅니다. 아래는 코드의 설명입니다:

A.sum(axis=1, keepdims=True): 텐서 A에 대해 axis=1을 사용하여 1번째 차원(행)을 따라 합산한 결과를 계산합니다. axis 매개변수를 사용하여 어떤 차원을 따라 합산할지 지정할 수 있으며, 여기서는 axis=1을 사용하여 1번째 차원(행)을 따라 합산합니다. keepdims=True 옵션은 결과 텐서의 차원을 유지하도록 설정합니다.
sum_A: 합산 결과를 저장하는 변수로 A.sum(axis=1, keepdims=True)의 결과를 할당합니다.
sum_A.shape: sum_A 텐서의 모양(shape)을 확인합니다. .shape 속성을 사용하면 텐서의 차원과 각 차원의 크기를 나타내는 튜플을 반환합니다.

결과적으로, 코드를 실행하면 두 가지 결과가 반환됩니다:

첫 번째 결과는 텐서 A의 1번째 차원(행)을 따라 합산한 결과를 나타내는 텐서입니다. 이 텐서의 차원은 원래 텐서와 동일한 차원을 가지며, 합산 결과가 유지됩니다.
두 번째 결과는 sum_A 텐서의 모양(shape)을 나타내는 튜플입니다.

이 코드를 통해 PyTorch를 사용하여 텐서의 특정 차원을 따라 합산하고, 결과 텐서의 차원을 유지하며 차원의 모양(shape)을 확인하는 방법을 이해할 수 있습니다.

(tensor([[ 3.],
         [12.]]),
 torch.Size([2, 1]))

For instance, since sum_A keeps its two axes after summing each row, we can divide A by sum_A with broadcasting to create a matrix where each row sums up to 1.

예를 들어 sum_A는 각 행을 합산한 후 두 개의 축을 유지하므로 브로드캐스팅을 통해 A를 sum_A로 나누어 각 행의 합이 다음과 같은 행렬을 만들 수 있습니다.

A / sum_A

주어진 코드는 PyTorch 텐서 A를 sum_A로 나누는 작업을 나타냅니다. sum_A는 이전 코드에서 A의 1번째 차원(행)을 따라 합산한 결과를 나타내는 텐서입니다. 아래는 코드의 설명입니다:

A / sum_A: 텐서 A를 sum_A로 나누는 작업을 수행합니다. 이 작업은 요소별로 (element-wise) 이루어지며, 텐서 A와 sum_A의 같은 위치에 있는 요소끼리 나눗셈을 수행합니다. 결과는 새로운 텐서로 반환됩니다.

결과적으로, 코드를 실행하면 A와 sum_A의 요소별 나눗셈 연산을 수행한 결과인 새로운 텐서가 반환됩니다. 이 코드를 통해 텐서의 요소끼리 나눗셈 연산을 수행하는 방법을 이해할 수 있습니다. 이러한 연산을 사용하면 데이터 정규화나 스케일링과 같은 다양한 작업을 수행할 수 있습니다.

tensor([[0.0000, 0.3333, 0.6667],
        [0.2500, 0.3333, 0.4167]])

If we want to calculate the cumulative sum of elements of A along some axis, say axis=0 (row by row), we can call the cumsum function. By design, this function does not reduce the input tensor along any axis.

어떤 축(축=0(행 단위))을 따라 A 요소의 누적 합계를 계산하려면 cumsum 함수를 호출할 수 있습니다. 설계상 이 함수는 어떤 축에서도 입력 텐서를 줄이지 않습니다.

A.cumsum(axis=0)

주어진 코드는 PyTorch 텐서 A에 대해 axis=0을 사용하여 0번째 차원(열)을 따라 누적 합계(cumulative sum)를 계산하는 작업을 나타냅니다. 아래는 코드의 설명입니다:

A.cumsum(axis=0): 텐서 A에 대해 axis=0을 사용하여 0번째 차원(열)을 따라 누적 합계를 계산합니다. axis 매개변수를 사용하여 어떤 차원을 따라 누적 합계를 계산할지 지정할 수 있으며, 여기서는 axis=0을 사용하여 0번째 차원(열)을 따라 누적 합계를 계산합니다.

결과적으로, 코드를 실행하면 텐서 A의 0번째 차원(열)을 따라 누적 합계를 계산한 결과를 반환합니다. 이 결과 텐서는 원래 텐서와 동일한 모양(shape)을 가지며, 각 요소는 해당 열까지의 누적 합계를 나타냅니다.

이 코드를 통해 PyTorch를 사용하여 텐서의 특정 차원을 따라 누적 합계를 계산하는 방법을 이해할 수 있습니다. 누적 합계는 주어진 차원의 각 위치에서 해당 위치까지의 합계를 나타내는데 사용됩니다.

tensor([[0., 1., 2.],
        [3., 5., 7.]])

2.3.8. Dot Products

So far, we have only performed elementwise operations, sums, and averages. And if this was all we could do, linear algebra would not deserve its own section. Fortunately, this is where things get more interesting. One of the most fundamental operations is the dot product. Given two vectors x,y∈ ℝ ^d, their dot product x^⊤y (also known as inner product, ⟨x,y⟩) is a sum over the products of the elements at the same position: x^⊤y=∑^d _i=1 x_iy_i.

지금까지는 요소별 연산, 합계, 평균만 수행했습니다. 그리고 이것이 우리가 할 수 있는 전부라면 선형 대수학은 그 자체의 섹션을 가질 자격이 없을 것입니다. 다행히도 여기서 상황이 더욱 흥미로워집니다. 가장 기본적인 연산 중 하나는 내적(dot product)입니다. 두 개의 벡터 x,y∈ ℝ^d가 주어지면, 그 내적 x^⊤y(내적이라고도 함, ⟨x,y⟩)는 동일한 위치에 있는 요소의 곱에 대한 합입니다. x^⊤y=∑^d _i=1 x_iy_i.

y = torch.ones(3, dtype = torch.float32)
x, y, torch.dot(x, y)

주어진 코드는 PyTorch를 사용하여 두 벡터 x와 y를 생성하고, 이 두 벡터의 내적(dot product)을 계산하는 작업을 나타냅니다. 아래는 코드의 설명입니다:

y = torch.ones(3, dtype=torch.float32): 길이가 3인 1차원 텐서 y를 생성합니다. torch.ones() 함수를 사용하여 모든 요소가 1로 초기화된 텐서를 생성하며, dtype 매개변수를 사용하여 데이터 타입을 부동소수점(float32)으로 설정합니다.
x: 벡터 x를 생성한 변수입니다. 코드에서 직접 표시되지 않았지만, x는 이전에 어떤 값으로 초기화되었을 것입니다. x는 y와 동일한 길이를 가져야합니다.
y: 위에서 생성한 벡터 y입니다.
torch.dot(x, y): 벡터 x와 y 사이의 내적(dot product)을 계산합니다. 내적은 두 벡터의 대응하는 요소를 곱한 후 모두 더한 값을 의미합니다. 이 코드는 x와 y 벡터의 내적을 계산하고 결과를 반환합니다.

결과적으로, 코드를 실행하면 x와 y 벡터가 생성되고, 이 두 벡터의 내적이 계산되어 반환됩니다. 내적은 벡터 간의 유사성을 측정하는 데 사용되며, 벡터의 각 성분을 곱한 후 모두 합한 값으로 표현됩니다.

(tensor([0., 1., 2.]), tensor([1., 1., 1.]), tensor(3.))

Equivalently, we can calculate the dot product of two vectors by performing an elementwise multiplication followed by a sum:

마찬가지로 요소별 곱셈과 합을 수행하여 두 벡터의 내적을 계산할 수 있습니다.

torch.sum(x * y)

주어진 코드는 PyTorch를 사용하여 두 벡터 x와 y를 요소별로 곱하고 그 결과를 모두 합산하는 작업을 나타냅니다. 코드에서 올바른 방법으로 작성하기 위해서는 torch.sum() 함수를 사용해야 합니다. 아래는 코드의 설명입니다:

torch.sum(x * y): 벡터 x와 y의 요소별 곱셈을 수행한 후, 그 결과를 모두 합산합니다. 이 코드는 x와 y의 각 성분을 곱한 후 그 결과를 모두 합산하는 동작을 수행합니다.

결과적으로, 코드를 실행하면 두 벡터 x와 y의 요소별 곱셈 결과가 계산되고, 이 결과를 모두 합산한 값이 반환됩니다. 내적과 유사하지만 내적은 두 벡터의 대응하는 요소를 곱한 후 모두 더하는 것이며, 이 코드는 요소별 곱셈 결과를 모두 더합니다.

tensor(3.)

Dot products are useful in a wide range of contexts. For example, given some set of values, denoted by a vector x∈ ℝⁿ, and a set of weights, denoted by w∈ ℝ ⁿ , the weighted sum of the values in x according to the weights w could be expressed as the dot product x^⊤w. When the weights are nonnegative and sum to 1, i.e., (∑ⁿ_i=1 w_i=1), the dot product expresses a weighted average. After normalizing two vectors to have unit length, the dot products express the cosine of the angle between them. Later in this section, we will formally introduce this notion of length.

내적은 다양한 상황에서 유용합니다. 예를 들어, 벡터 x∈ ℝ ⁿ으로 표시되는 일부 값 세트와 w∈ ℝ ⁿ으로 표시되는 가중치 세트가 주어지면 가중치 w에 따른 x 값의 가중 합은 점으로 표현될 수 있습니다. 제품 x^⊤w. 가중치가 음수가 아니고 합이 1인 경우, 즉 (∑ⁿ _i=1 w_i=1), 내적은 가중 평균을 나타냅니다. 두 벡터를 단위 길이로 정규화한 후, 내적은 두 벡터 사이의 각도의 코사인을 나타냅니다. 이 섹션의 뒷부분에서 길이에 대한 개념을 공식적으로 소개하겠습니다.

2.3.9. Matrix–Vector Products

Now that we know how to calculate dot products, we can begin to understand the product between an m×n matrix A and an n-dimensional vector x. To start off, we visualize our matrix in terms of its row vectors where each a^⊤_i∈ ℝⁿ is a row vector representing the i th row of the matrix A.

이제 내적을 계산하는 방법을 알았으므로 m×n 행렬 A와 n차원 벡터 x 사이의 곱을 이해할 수 있습니다. 시작하려면 각 a^⊤_i∈ ℝⁿ이 행렬 A의 i번째 행을 나타내는 행 벡터인 행 벡터 측면에서 행렬을 시각화합니다.

The matrix–vector product Ax is simply a column vector of length m, whose i th element is the dot product a^⊤_i x:

행렬-벡터 곱 Ax는 단순히 길이가 m인 열 벡터이며, i 번째 요소는 내적 a^⊤_i x입니다.

We can think of multiplication with a matrix A∈ ℝ^m×n as a transformation that projects vectors from ℝⁿ to ℝ^m. These transformations are remarkably useful. For example, we can represent rotations as multiplications by certain square matrices. Matrix–vector products also describe the key calculation involved in computing the outputs of each layer in a neural network given the outputs from the previous layer.

행렬 A∈ ℝ ^m×n을 사용한 곱셈을 ℝⁿ에서 ℝ^m으로 벡터를 투영하는 변환으로 생각할 수 있습니다. 이러한 변환은 매우 유용합니다. 예를 들어 회전을 특정 정사각 행렬의 곱셈으로 표현할 수 있습니다. 행렬-벡터 곱은 이전 계층의 출력을 바탕으로 신경망의 각 계층 출력을 계산하는 데 관련된 주요 계산도 설명합니다.

To express a matrix–vector product in code, we use the mv function. Note that the column dimension of A (its length along axis 1) must be the same as the dimension of x (its length). Python has a convenience operator @ that can execute both matrix–vector and matrix–matrix products (depending on its arguments). Thus we can write A@x.

행렬-벡터 곱을 코드로 표현하기 위해 mv 함수를 사용합니다. A의 열 치수(축 1을 따른 길이)는 x의 치수(길이)와 동일해야 합니다. Python에는 행렬-벡터 및 행렬-행렬 곱을 모두 실행할 수 있는 편리한 연산자 @가 있습니다(인수에 따라 다름). 따라서 우리는 A@x를 쓸 수 있습니다.

A.shape, x.shape, torch.mv(A, x), A@x

주어진 코드는 PyTorch를 사용하여 행렬-벡터 곱셈을 수행하는 작업을 나타냅니다. 코드에서 A는 행렬이고, x는 벡터입니다. 코드는 두 가지 다른 방법으로 행렬 A와 벡터 x를 곱셈하고, 그 결과를 비교합니다. 아래는 코드의 설명입니다:

A.shape: 행렬 A의 모양(shape)을 확인하는 작업입니다. .shape 속성을 사용하면 행렬의 차원과 각 차원의 크기를 나타내는 튜플이 반환됩니다.
x.shape: 벡터 x의 모양(shape)을 확인하는 작업입니다. .shape 속성을 사용하면 벡터의 차원과 크기를 나타내는 튜플이 반환됩니다.
torch.mv(A, x): 행렬-벡터 곱셈을 수행하는 함수입니다. 이 함수는 행렬 A와 벡터 x를 곱셈하고 결과 벡터를 반환합니다. mv는 "matrix-vector"의 약자입니다.
A @ x: Python에서는 @ 연산자를 사용하여 행렬-벡터 곱셈을 간결하게 나타낼 수 있습니다. 이 코드는 행렬 A와 벡터 x를 곱셈하고 결과 벡터를 반환합니다.

결과적으로, 코드를 실행하면 다음 네 가지 결과가 반환됩니다:

첫 번째 결과는 행렬 A의 모양(shape)을 나타내는 튜플입니다.
두 번째 결과는 벡터 x의 모양(shape)을 나타내는 튜플입니다.
세 번째 결과는 torch.mv(A, x)를 사용하여 계산된 행렬-벡터 곱셈의 결과 벡터입니다.
네 번째 결과는 A @ x를 사용하여 계산된 행렬-벡터 곱셈의 결과 벡터로, 두 번째 결과와 동일해야 합니다.

이 코드는 행렬과 벡터 간의 곱셈을 수행하는 방법을 보여주며, Python에서 제공되는 간결한 @ 연산자를 사용하여도 동일한 결과를 얻을 수 있습니다.

(torch.Size([2, 3]), torch.Size([3]), tensor([ 5., 14.]), tensor([ 5., 14.]))

2.3.10. Matrix–Matrix Multiplication

Once you have gotten the hang of dot products and matrix–vector products, then matrix–matrix multiplication should be straightforward.

내적과 행렬-벡터 곱에 익숙해지면 행렬-행렬 곱셈이 간단해집니다.

Say that we have two matrices A∈ ℝ^n×k and B∈ ℝ ^k×m:

두 개의 행렬 A∈ ℝ ^n×k와 B∈ ℝ ^k×m이 있다고 가정해 보겠습니다.

Let a^⊤_i∈ ℝ^k denote the row vector representing the i th row of the matrix A and let b_j∈ ℝ^k denote the column vector from the j th column of the matrix B:

a^⊤_i∈ ℝ^k가 행렬 A의 i번째 행을 나타내는 행 벡터를 나타내고 b_j∈ ℝ^k가 행렬 B의 j번째 열의 열 벡터를 나타낸다고 가정합니다.

To form the matrix product C∈ ℝ ^m×n, we simply compute each element c_ij as the dot product between the i th row of A and the j th column of B, i.e., a^⊤_ib_j:

행렬 곱 C∈ ℝ^m×n을 형성하기 위해 각 요소 c_ij를 A의 i번째 행과 B의 j번째 열 사이의 내적, 즉 a^⊤_ib_j로 간단히 계산합니다.

We can think of the matrix–matrix multiplication AB as performing m matrix–vector products or m×n dot products and stitching the results together to form an n×m matrix. In the following snippet, we perform matrix multiplication on A and B. Here, A is a matrix with two rows and three columns, and B is a matrix with three rows and four columns. After multiplication, we obtain a matrix with two rows and four columns.

행렬-행렬 곱셈 AB는 m개의 행렬-벡터 곱 또는 m×n 도트 곱을 수행하고 그 결과를 함께 연결하여 n×m 행렬을 형성하는 것으로 생각할 수 있습니다. 다음 코드에서는 A와 B에 대해 행렬 곱셈을 수행합니다. 여기서 A는 행 2개와 열 3개로 구성된 행렬이고 B는 행 3개와 열 4개로 구성된 행렬입니다. 곱셈을 하면 2개의 행과 4개의 열이 있는 행렬을 얻습니다.

B = torch.ones(3, 4)
torch.mm(A, B), A@B

주어진 코드는 PyTorch를 사용하여 두 행렬 A와 B의 행렬-행렬 곱셈을 수행하는 작업을 나타냅니다. 코드는 두 가지 다른 방법으로 행렬-행렬 곱셈을 수행하고, 그 결과를 비교합니다. 아래는 코드의 설명입니다:

B = torch.ones(3, 4): 3x4 크기의 모든 요소가 1로 초기화된 행렬 B를 생성합니다.
torch.mm(A, B): torch.mm() 함수를 사용하여 두 행렬 A와 B의 행렬-행렬 곱셈을 계산합니다. 이 함수는 두 행렬의 행렬-행렬 곱셈을 수행하고 결과 행렬을 반환합니다.
A @ B: Python에서는 @ 연산자를 사용하여 행렬-행렬 곱셈을 간결하게 나타낼 수 있습니다. 이 코드는 두 행렬 A와 B의 행렬-행렬 곱셈을 계산하고 결과 행렬을 반환합니다.

결과적으로, 코드를 실행하면 다음 네 가지 결과가 반환됩니다:

첫 번째 결과는 torch.mm(A, B)를 사용하여 계산된 행렬-행렬 곱셈의 결과 행렬입니다.
두 번째 결과는 A @ B를 사용하여 계산된 행렬-행렬 곱셈의 결과 행렬로, 첫 번째 결과와 동일해야 합니다.

이 코드는 PyTorch를 사용하여 행렬-행렬 곱셈을 수행하는 방법을 보여주며, Python에서 제공되는 @ 연산자를 사용하여도 동일한 결과를 얻을 수 있습니다.

(tensor([[ 3.,  3.,  3.,  3.],
         [12., 12., 12., 12.]]),
 tensor([[ 3.,  3.,  3.,  3.],
         [12., 12., 12., 12.]]))

The term matrix–matrix multiplication is often simplified to matrix multiplication, and should not be confused with the Hadamard product.

행렬-행렬 곱셈이라는 용어는 행렬 곱셈으로 단순화되는 경우가 많으므로 Hadamard 곱과 혼동해서는 안 됩니다.

2.3.11. Norms

Some of the most useful operators in linear algebra are norms. Informally, the norm of a vector tells us how big it is. For instance, the ℓ₂ norm measures the (Euclidean) length of a vector. Here, we are employing a notion of size that concerns the magnitude of a vector’s components (not its dimensionality).

선형 대수학에서 가장 유용한 연산자 중 일부는 norms 입니다. 비공식적으로 벡터의 노름(Norm)은 벡터가 얼마나 큰지 알려줍니다. 예를 들어, ℓ₂ 노름은 벡터의 (유클리드) 길이를 측정합니다. 여기서는 벡터 구성 요소의 크기(차원이 아닌)와 관련된 크기 개념을 사용합니다.

A norm is a function ‖⋅‖ that maps a vector to a scalar and satisfies the following three properties:

노름(Norm)은 벡터를 스칼라에 매핑하고 다음 세 가지 속성을 충족하는 함수 ‖⋅‖ 입니다.

Given any vector x, if we scale (all elements of) the vector by a scalar α ∈ ℝ , its norm scales accordingly:

벡터 x가 주어지면 벡터의 모든 요소를 스칼라 α ∈ ℝ로 스케일링하면 해당 노름도 그에 따라 스케일링됩니다.

2. For any vectors x and y: norms satisfy the triangle inequality:

2. 모든 벡터 x 및 y에 대해: 노름은 삼각형 부등식을 충족합니다.

3. The norm of a vector is nonnegative and it only vanishes if the vector is zero:

3. 벡터의 노름은 음수가 아니며 벡터가 0인 경우에만 사라집니다.

Many functions are valid norms and different norms encode different notions of size. The Euclidean norm that we all learned in elementary school geometry when calculating the hypotenuse of a right triangle is the square root of the sum of squares of a vector’s elements. Formally, this is called the ℓ₂ norm and expressed as

많은 함수는 유효한 norms 이며, 서로 다른 norms 은 서로 다른 크기 개념을 인코딩합니다. 우리가 초등학교 기하학에서 직각삼각형의 빗변을 계산할 때 배운 유클리드 표준은 벡터 요소의 제곱합의 제곱근입니다. 공식적으로 이를 ℓ₂ 표준이라고 하며 다음과 같이 표현됩니다.

Norms 란?

In mathematics, and more specifically in linear algebra, "norms" refer to functions that assign a positive length or size to vectors (including points in Euclidean space) or other objects. Norms are used to quantify the magnitude or size of an object, and they have various applications in mathematics, physics, engineering, and computer science.

수학에서 "노름(norms)"은 벡터(유클리드 공간의 점을 포함한) 또는 다른 객체에 길이 또는 크기를 할당하는 함수를 의미합니다. 노름은 객체의 크기 또는 크기를 양의 값으로 표현하며, 수학, 물리학, 공학 및 컴퓨터 과학의 다양한 분야에서 응용됩니다.

There are different types of norms, but the most common one is the Euclidean norm, also known as the L2 norm. The Euclidean norm of a vector is calculated by taking the square root of the sum of the squares of its components. In 2D space, it corresponds to the length of a vector from the origin to a point in the plane. In n-dimensional space, it generalizes to:

노름의 다양한 유형이 있지만, 가장 일반적인 것은 유클리드 노름 또는 L2 노름입니다. 벡터의 유클리드 노름은 구성 요소의 제곱의 합의 제곱근을 취하여 계산됩니다. 2차원 공간에서 이것은 평면상의 한 점에서 원점까지의 벡터의 길이에 해당합니다. n차원 공간에서 일반화됩니다:

L2 Norm or Euclidean Norm of a vector x = (x1, x2, ..., xn) is defined as:

벡터 x = (x1, x2, ..., xn)의 L2 노름 또는 유클리드 노름은 다음과 같이 정의됩니다:

||x||₂ = √(x1² + x2² + ... + xn²)

Other common norms include the L1 norm, which is the sum of the absolute values of the components, and the infinity norm (L∞ norm), which is the maximum absolute value of any component. There are other norms as well, such as the Lp norms, which generalize the L1 and L2 norms.

다른 일반적인 노름에는 L1 노름이 있으며, 이것은 구성 요소의 절대값의 합입니다. 그리고 무한대 노름 (L∞ 노름)은 어떤 구성 요소의 최대 절대값입니다. Lp 노름과 같은 다른 노름도 있으며, L1 및 L2 노름을 일반화합니다.

Norms have various applications in machine learning, optimization, signal processing, and many other fields. They are used to measure distances, define regularization terms in machine learning models, and assess the convergence of iterative algorithms, among other things. The choice of norm depends on the specific problem and the properties you want to capture.

노름은 머신러닝, 최적화, 신호 처리 및 기타 다양한 분야에서 응용됩니다. 거리 측정, 머신러닝 모델에서 정규화 용어를 정의하고 반복 알고리즘의 수렴을 평가하는 등 다양한 목적으로 사용됩니다. 노름의 선택은 특정 문제와 캡처하려는 특성에 따라 다릅니다.

The method norm calculates the ℓ2 norm.

method 노름은 ℓ2 노름을 계산합니다.

u = torch.tensor([3.0, -4.0])
torch.norm(u)

주어진 코드는 PyTorch를 사용하여 2차원 벡터 u의 L2 노름(유클리드 노름)을 계산하는 작업을 나타냅니다. 아래는 코드의 설명입니다:

u = torch.tensor([3.0, -4.0]): 길이가 2인 1차원 텐서 u를 생성합니다. 이 벡터는 두 개의 요소로 구성되며, [3.0, -4.0]으로 초기화됩니다. 이러한 벡터는 평면 상의 점을 나타낼 수 있습니다.
torch.norm(u): PyTorch의 torch.norm() 함수를 사용하여 벡터 u의 L2 노름을 계산합니다. L2 노름은 벡터의 모든 요소의 제곱을 합산한 후 그 합의 제곱근을 취한 것으로, 벡터의 크기나 길이를 나타냅니다.

결과적으로, 코드를 실행하면 u 벡터의 L2 노름이 계산되고 그 값이 반환됩니다. 이것은 원점(0, 0)에서 벡터 u의 끝점까지의 거리를 나타내며, L2 노름은 일반적으로 벡터의 크기를 계산하는 데 사용됩니다.

tensor(5.)

The ℓ1 norm is also common and the associated measure is called the Manhattan distance. By definition, the ℓ1 norm sums the absolute values of a vector’s elements:

ℓ1 노름도 일반적이며 관련 측정값을 맨해튼 거리라고 합니다. 정의에 따르면 ℓ1 노름은 벡터 요소의 절대값을 합산합니다.

Compared to the ℓ2 norm, it is less sensitive to outliers. To compute the ℓ1 norm, we compose the absolute value with the sum operation.

ℓ2 norm 에 비해 이상값에 덜 민감합니다. ℓ1 노름을 계산하기 위해 합계 연산을 통해 절대값을 구성합니다.

torch.abs(u).sum()

주어진 코드는 PyTorch를 사용하여 벡터 u의 각 요소의 절댓값을 계산하고, 이 절댓값들을 모두 더하는 작업을 나타냅니다. 아래는 코드의 설명입니다:

torch.Abs(u): torch.Abs() 함수를 사용하여 벡터 u의 각 요소의 절댓값을 계산합니다. 이 함수는 각 요소를 양수로 만들어줍니다.
.sum(): .sum() 메서드를 사용하여 절댓값이 계산된 벡터의 모든 요소를 더합니다. 이로써 벡터의 각 요소의 절댓값의 총합을 얻을 수 있습니다.

결과적으로, 코드를 실행하면 u 벡터의 각 요소의 절댓값을 계산하고, 그 절댓값들을 모두 더한 값이 반환됩니다. 이것은 벡터의 모든 요소의 절댓값을 합산하여 얻은 값으로, 해당 벡터의 크기나 길이를 나타냅니다.

tensor(7.)

Both the ℓ2 and ℓ1 norms are special cases of the more general ℓp norms:

ℓ2 및 ℓ1 노름은 모두 보다 일반적인 ℓp 노름의 특별한 경우입니다.

In the case of matrices, matters are more complicated. After all, matrices can be viewed both as collections of individual entries and as objects that operate on vectors and transform them into other vectors. For instance, we can ask by how much longer the matrix–vector product Xv could be relative to v. This line of thought leads to what is called the spectral norm. For now, we introduce the Frobenius norm, which is much easier to compute and defined as the square root of the sum of the squares of a matrix’s elements:

행렬의 경우 문제는 더 복잡합니다. 결국 행렬은 개별 항목의 모음으로 볼 수도 있고 벡터에 대해 연산을 수행하고 이를 다른 벡터로 변환하는 개체로 볼 수도 있습니다. 예를 들어, 행렬-벡터 곱 Xv가 v에 비해 얼마나 더 길 수 있는지 물어볼 수 있습니다. 이러한 생각은 스펙트럼 표준이라고 불리는 것으로 이어집니다. 지금은 계산하기가 훨씬 쉽고 행렬 요소의 제곱합의 제곱근으로 정의되는 프로베니우스 노름(Frobenius Norm)을 소개합니다.

The Frobenius norm behaves as if it were an ℓ2 norm of a matrix-shaped vector. Invoking the following function will calculate the Frobenius norm of a matrix.

Frobenius 노름은 마치 행렬 모양 벡터의 ℓ2 노름인 것처럼 동작합니다. 다음 함수를 호출하면 행렬의 Frobenius 표준이 계산됩니다.

torch.norm(torch.ones((4, 9)))

주어진 코드는 PyTorch를 사용하여 4x9 크기의 모든 요소가 1로 초기화된 행렬을 생성하고, 이 행렬의 L2 노름(유클리드 노름)을 계산하는 작업을 나타냅니다. 아래는 코드의 설명입니다:

torch.ones((4, 9)): 4x9 크기의 모든 요소가 1로 초기화된 행렬을 생성합니다. torch.ones() 함수를 사용하여 모든 요소가 1인 행렬을 생성할 수 있으며, 입력 매개변수로 원하는 크기 (여기서는 4x9)를 지정합니다.
torch.norm(...): torch.norm() 함수를 사용하여 생성한 4x9 행렬의 L2 노름을 계산합니다. L2 노름은 행렬의 각 요소의 제곱을 합산한 후 그 합의 제곱근을 취한 것으로, 행렬의 크기나 길이를 나타냅니다.

결과적으로, 코드를 실행하면 4x9 크기의 행렬의 L2 노름이 계산되고 그 값이 반환됩니다. 이것은 행렬의 크기나 길이를 나타내며, 모든 요소가 1인 행렬의 경우에는 행렬의 크기와 동일한 값을 가집니다.

tensor(6.)

While we do not want to get too far ahead of ourselves, we already can plant some intuition about why these concepts are useful. In deep learning, we are often trying to solve optimization problems: maximize the probability assigned to observed data; maximize the revenue associated with a recommender model; minimize the distance between predictions and the ground truth observations; minimize the distance between representations of photos of the same person while maximizing the distance between representations of photos of different people. These distances, which constitute the objectives of deep learning algorithms, are often expressed as norms.

우리는 너무 앞서나가고 싶지 않지만 이러한 개념이 왜 유용한지에 대한 직관을 이미 심을 수 있습니다. 딥러닝에서 우리는 종종 최적화 문제를 해결하려고 노력합니다. 관찰된 데이터에 할당된 확률을 최대화합니다. 추천 모델과 관련된 수익을 극대화합니다. 예측과 실제 관찰 사이의 거리를 최소화합니다. 같은 사람의 사진 표현 사이의 거리를 최소화하는 동시에 다른 사람의 사진 표현 사이의 거리를 최대화합니다. 딥러닝 알고리즘의 목표를 구성하는 이러한 거리는 종종 norms 으로 표현됩니다.

2.3.12. Discussion

In this section, we have reviewed all the linear algebra that you will need to understand a significant chunk of modern deep learning. There is a lot more to linear algebra, though, and much of it is useful for machine learning. For example, matrices can be decomposed into factors, and these decompositions can reveal low-dimensional structure in real-world datasets. There are entire subfields of machine learning that focus on using matrix decompositions and their generalizations to high-order tensors to discover structure in datasets and solve prediction problems. But this book focuses on deep learning. And we believe you will be more inclined to learn more mathematics once you have gotten your hands dirty applying machine learning to real datasets. So while we reserve the right to introduce more mathematics later on, we wrap up this section here.

이 섹션에서는 현대 딥러닝의 상당 부분을 이해하는 데 필요한 모든 선형 대수학을 검토했습니다. 하지만 선형 대수학에는 더 많은 내용이 있으며 그 중 대부분은 기계 학습에 유용합니다. 예를 들어, 행렬은 요인으로 분해될 수 있으며 이러한 분해는 실제 데이터세트의 저차원 구조를 드러낼 수 있습니다. 행렬 분해와 고차 텐서에 대한 일반화를 사용하여 데이터 세트의 구조를 발견하고 예측 문제를 해결하는 데 초점을 맞춘 기계 학습의 전체 하위 필드가 있습니다. 하지만 이 책은 딥러닝에 중점을 두고 있습니다. 그리고 실제 데이터 세트에 기계 학습을 적용해 본 후에는 더 많은 수학을 배우고 싶은 마음이 더 커질 것이라고 믿습니다. 따라서 나중에 더 많은 수학을 소개할 권리가 있지만 여기에서 이 섹션을 마무리합니다.

If you are eager to learn more linear algebra, there are many excellent books and online resources. For a more advanced crash course, consider checking out Strang (1993), Kolter (2008), and Petersen and Pedersen (2008).

선형 대수학을 더 배우고 싶다면 훌륭한 책과 온라인 리소스가 많이 있습니다. 보다 고급 집중 과정을 보려면 Strang(1993), Kolter(2008), Petersen 및 Pedersen(2008)을 확인해 보세요.

To recap: 요약하자면:

Scalars, vectors, matrices, and tensors are the basic mathematical objects used in linear algebra and have zero, one, two, and an arbitrary number of axes, respectively.
스칼라, 벡터, 행렬 및 텐서는 선형 대수학에 사용되는 기본 수학적 객체이며 각각 0, 1, 2 및 임의 개수의 축을 갖습니다.
Tensors can be sliced or reduced along specified axes via indexing, or operations such as sum and mean, respectively.
텐서는 인덱싱이나 합계 및 평균과 같은 연산을 통해 지정된 축을 따라 분할하거나 축소할 수 있습니다.
Elementwise products are called Hadamard products. By contrast, dot products, matrix–vector products, and matrix–matrix products are not elementwise operations and in general return objects having shapes that are different from the the operands.
Elementwise products 을 Hadamard products 이라고 합니다. 대조적으로, 내적, 행렬-벡터 곱, 행렬-행렬 곱은 요소별 연산이 아니며 일반적으로 피연산자와 다른 모양을 갖는 객체를 반환합니다.
Compared to Hadamard products, matrix–matrix products take considerably longer to compute (cubic rather than quadratic time).
Hadamard 곱과 비교하여 행렬-행렬 곱은 계산하는 데 상당히 오랜 시간이 걸립니다(2차 시간이 아닌 3차 시간).
Norms capture various notions of the magnitude of a vector (or matrix), and are commonly applied to the difference of two vectors to measure their distance apart.
Norms는 벡터(또는 행렬)의 크기에 대한 다양한 개념을 포착하며 일반적으로 두 벡터의 차이에 적용되어 거리를 측정합니다.
Common vector norms include the ℓ1 and ℓ2 norms, and common matrix norms include the spectral and Frobenius norms.
공통 벡터 노름에는 ℓ1 및 ℓ2 노름이 포함되고, 공통 행렬 노름에는 스펙트럼 및 프로베니우스 노름이 포함됩니다.

저작자표시 (새창열림)

'Dive into Deep Learning > D2L Preliminaries' 카테고리의 다른 글

D2L - 2.7. Documentation (2)	2023.10.14
D2L - 2.6. Probability and Statistics (0)	2023.10.14
D2L - 2.5. Automatic Differentiation (0)	2023.10.12
D2L - 2.4. Calculus : 미적분 (1)	2023.10.12
D2L - 2.2. Data Preprocessing (0)	2023.10.09
D2L - 2.1. Data Manipulation (0)	2023.10.09
D2L - 2. Preliminaries (0)	2023.10.09

Dive into Deep Learning/D2L Preliminaries

D2L - 2.2. Data Preprocessing

2023. 10. 9. 12:33 | Posted by 솔웅

https://d2l.ai/chapter_preliminaries/pandas.html

2.2. Data Preprocessing — Dive into Deep Learning 1.0.3 documentation

d2l.ai

2.2. Data Preprocessing

So far, we have been working with synthetic data that arrived in ready-made tensors. However, to apply deep learning in the wild we must extract messy data stored in arbitrary formats, and preprocess it to suit our needs. Fortunately, the pandas library can do much of the heavy lifting. This section, while no substitute for a proper pandas tutorial, will give you a crash course on some of the most common routines.

지금까지 우리는 ready-made tensors 에 도착한 synthetic data 를 사용해 작업해 왔습니다. 그러나 실제 딥러닝을 적용하려면 임의의 형식으로 저장된 지저분한 데이터를 추출하고 필요에 맞게 전처리해야 합니다. 다행스럽게도 pandas 라이브러리는 많은 무거운 작업을 수행할 수 있습니다. 이 섹션은 적절한 Pandas 튜토리얼을 대체할 수는 없지만 가장 일반적인 루틴 중 일부에 대한 단기 집중 강좌를 제공합니다.

2.2.1. Reading the Dataset

Comma-separated values (CSV) files are ubiquitous for the storing of tabular (spreadsheet-like) data. In them, each line corresponds to one record and consists of several (comma-separated) fields, e.g., “Albert Einstein,March 14 1879,Ulm,Federal polytechnic school,field of gravitational physics”. To demonstrate how to load CSV files with pandas, we create a CSV file below ../data/house_tiny.csv. This file represents a dataset of homes, where each row corresponds to a distinct home and the columns correspond to the number of rooms (NumRooms), the roof type (RoofType), and the price (Price).

Comma-separated values (CSV) 파일은 표 형식(스프레드시트와 같은) 데이터를 저장하는 데 널리 사용됩니다. 여기에서 각 행은 하나의 레코드에 해당하며 여러(쉼표로 구분된) 필드로 구성됩니다(예: "Albert Einstein, March 14 1879, Ulm, Federal Polytechnic school, field of gravitational 물리"). Pandas로 CSV 파일을 로드하는 방법을 보여주기 위해 ../data/house_tiny.csv 아래에 CSV 파일을 만듭니다. 이 파일은 주택의 데이터세트를 나타냅니다. 여기서 각 행은 고유한 주택에 해당하고 열은 방 수(NumRooms), 지붕 유형(RoofType) 및 가격(Price)에 해당합니다.

import os

os.makedirs(os.path.join('..', 'data'), exist_ok=True)
data_file = os.path.join('..', 'data', 'house_tiny.csv')
with open(data_file, 'w') as f:
    f.write('''NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000''')

주어진 코드는 Python을 사용하여 디렉토리를 생성하고 CSV 파일을 만드는 작업을 수행하는 코드입니다. 아래는 코드의 설명입니다:

import os: os 모듈을 가져옵니다. 이 모듈은 운영체제와 상호작용하여 파일 및 디렉토리를 다룰 때 유용합니다.
os.makedirs(os.path.join('..', 'data'), exist_ok=True): 이 부분은 상위 디렉토리에서 "data" 디렉토리를 생성합니다. os.path.join() 함수는 경로를 조합하는 데 사용되며, exist_ok=True를 설정하여 디렉토리가 이미 존재하면 오류를 발생시키지 않고 계속 진행합니다.
data_file = os.path.join('..', 'data', 'house_tiny.csv'): 이 부분은 CSV 파일의 경로를 생성하여 data_file 변수에 저장합니다.
with open(data_file, 'w') as f:: data_file 경로로 파일을 열고 쓰기 모드('w')로 열기 시작합니다. with 문을 사용하면 파일이 올바르게 닫히도록 보장됩니다.
f.write('''NumRooms,RoofType,Price\nNA,NA,127500\n2,NA,106000\n4,Slate,178100\nNA,NA,140000'''): 이 부분은 CSV 파일에 데이터를 씁니다. 여기서는 간단한 데이터를 작성하고 있으며, 각 줄은 쉼표로 구분된 값(NumRooms, RoofType, Price)으로 구성되어 있습니다. 이 데이터는 주택 정보를 나타내는 것으로 보입니다.

코드를 실행하면 "data" 디렉토리가 생성되고 그 안에 "house_tiny.csv" 파일이 작성됩니다. 이 CSV 파일은 주택 정보를 담고 있으며, 데이터 과학 또는 머신 러닝 작업에서 사용될 수 있습니다.

Now let’s import pandas and load the dataset with read_csv.

이제 pandas를 가져오고 read_csv를 사용하여 데이터 세트를 로드해 보겠습니다.

import pandas as pd

data = pd.read_csv(data_file)
print(data)

주어진 코드는 Python의 Pandas 라이브러리를 사용하여 CSV 파일을 읽고 데이터를 DataFrame으로 표시하는 작업을 수행하는 코드입니다. 아래는 코드의 설명입니다:

import pandas as pd: Pandas 라이브러리를 가져오고, 일반적으로 Pandas를 pd라는 별칭으로 사용합니다.
data = pd.read_csv(data_file): 이 부분은 data_file 경로에 있는 CSV 파일을 읽어와 데이터를 DataFrame으로 저장합니다. pd.read_csv() 함수는 CSV 파일을 읽고 그 내용을 DataFrame 형태로 반환합니다. 이 DataFrame은 표 형태로 데이터를 저장하고 다루기 용이합니다.
print(data): 읽어온 데이터를 출력합니다. 이 코드는 DataFrame의 내용을 표시하며, CSV 파일에 저장된 주택 정보 데이터를 출력합니다.

결과적으로, 코드를 실행하면 CSV 파일에 있는 데이터가 Pandas DataFrame으로 읽어와지고, 그 데이터가 화면에 출력됩니다. 이를 통해 데이터를 쉽게 검토하고 분석할 수 있습니다.

   NumRooms RoofType   Price
0       NaN      NaN  127500
1       2.0      NaN  106000
2       4.0    Slate  178100
3       NaN      NaN  140000

2.2.2. Data Preparation

In supervised learning, we train models to predict a designated target value, given some set of input values. Our first step in processing the dataset is to separate out columns corresponding to input versus target values. We can select columns either by name or via integer-location based indexing (iloc).

지도 학습에서는 일부 입력 값 세트가 주어지면 지정된 목표 값을 예측하도록 모델을 훈련합니다. 데이터 세트 처리의 첫 번째 단계는 입력 값과 목표 값에 해당하는 열을 분리하는 것입니다. 이름이나 정수 위치 기반 인덱싱(iloc)을 통해 열을 선택할 수 있습니다.

You might have noticed that pandas replaced all CSV entries with value NA with a special NaN (not a number) value. This can also happen whenever an entry is empty, e.g., “3,,,270000”. These are called missing values and they are the “bed bugs” of data science, a persistent menace that you will confront throughout your career. Depending upon the context, missing values might be handled either via imputation or deletion. Imputation replaces missing values with estimates of their values while deletion simply discards either those rows or those columns that contain missing values.

pandas가 값 NA가 있는 모든 CSV 항목을 특수 NaN(숫자 아님) 값으로 바꾼 것을 눈치챘을 것입니다. 이는 항목이 비어 있을 때마다(예: "3,,,270000") 발생할 수도 있습니다. 이를 누락된 값이라고 하며 데이터 과학의 " bed bugs "이며, 경력 전반에 걸쳐 직면하게 될 지속적인 위협입니다. 컨텍스트에 따라 누락된 값은 대치 또는 삭제를 통해 처리될 수 있습니다. 대치에서는 누락된 값을 예상 값으로 바꾸는 반면, 삭제에서는 누락된 값이 포함된 행이나 열을 삭제합니다.

Here are some common imputation heuristics. For categorical input fields, we can treat NaN as a category. Since the RoofType column takes values Slate and NaN, pandas can convert this column into two columns RoofType_Slate and RoofType_nan. A row whose roof type is Slate will set values of RoofType_Slate and RoofType_nan to 1 and 0, respectively. The converse holds for a row with a missing RoofType value.

다음은 몇 가지 일반적인 대치 휴리스틱입니다. 범주형 입력 필드의 경우 NaN을 범주로 처리할 수 있습니다. RoofType 열은 Slate 및 NaN 값을 사용하므로 Pandas는 이 열을 RoofType_Slate 및 RoofType_nan의 두 열로 변환할 수 있습니다. 지붕 유형이 슬레이트인 행은 RoofType_Slate 및 RoofType_nan 값을 각각 1과 0으로 설정합니다. RoofType 값이 누락된 행에는 반대의 경우가 적용됩니다.

inputs, targets = data.iloc[:, 0:2], data.iloc[:, 2]
inputs = pd.get_dummies(inputs, dummy_na=True)
print(inputs)

주어진 코드는 Python의 Pandas 라이브러리를 사용하여 데이터를 전처리하는 작업을 수행하는 코드입니다. 아래는 코드의 설명입니다:

inputs, targets = data.iloc[:, 0:2], data.iloc[:, 2]: 이 부분에서 데이터를 입력과 타겟으로 나누고 있습니다. data.iloc[:, 0:2]는 DataFrame data의 모든 행(:)과 0부터 1번 열(0:2)까지의 데이터를 선택하여 inputs로 저장합니다. data.iloc[:, 2]는 DataFrame data의 모든 행과 2번 열의 데이터를 선택하여 targets로 저장합니다. 이렇게 하면 입력 데이터와 타겟 데이터가 나누어집니다.
inputs = pd.get_dummies(inputs, dummy_na=True): 이 부분에서는 입력 데이터를 전처리합니다. pd.get_dummies() 함수를 사용하여 범주형 변수를 더미 변수로 변환합니다. dummy_na=True를 설정하면 결측치를 나타내는 더미 변수도 생성됩니다. 이렇게 하면 범주형 변수가 숫자로 인코딩되고 모델 학습에 사용할 수 있는 형태로 변환됩니다.
print(inputs): 전처리된 입력 데이터를 출력합니다. 이 코드는 더미 변수로 변환된 입력 데이터를 화면에 출력하여 확인할 수 있습니다.

결과적으로, 이 코드는 입력 데이터와 타겟 데이터를 분리하고, 입력 데이터를 범주형 변수를 더미 변수로 변환하여 전처리하는 작업을 수행합니다. 이렇게 전처리된 데이터는 머신 러닝 모델에 입력으로 사용될 수 있습니다.

   NumRooms  RoofType_Slate  RoofType_nan
0       NaN           False          True
1       2.0           False          True
2       4.0            True         False
3       NaN           False          True

inputs, targets = data.iloc[:, 0:2], data.iloc[:, 2]
inputs = pd.get_dummies(inputs, dummy_na=True)
print(inputs)
print(targets)

   NumRooms  RoofType_Slate  RoofType_nan
0       NaN               0             1
1       2.0               0             1
2       4.0               1             0
3       NaN               0             1

0    127500
1    106000
2    178100
3    140000
Name: Price, dtype: int64

For missing numerical values, one common heuristic is to replace the NaN entries with the mean value of the corresponding column.

누락된 숫자 값의 경우 일반적인 경험적 방법 중 하나는 NaN 항목을 해당 열의 평균 값으로 바꾸는 것입니다.

inputs = inputs.fillna(inputs.mean())
print(inputs)

주어진 코드는 Python의 Pandas 라이브러리를 사용하여 결측치(누락된 데이터)를 해당 열의 평균 값으로 대체하는 작업을 수행하는 코드입니다. 아래는 코드의 설명입니다:

inputs = inputs.fillna(inputs.mean()): 이 부분에서는 inputs DataFrame의 결측치를 대체하는 작업을 수행합니다. fillna() 함수는 DataFrame에서 결측치를 대체할 때 사용됩니다. 여기서는 inputs.mean()을 사용하여 각 열의 평균 값을 계산하고, 이 평균 값으로 해당 열의 결측치를 대체합니다. 즉, 각 열에 대해 결측치가 있는 경우 그 열의 평균 값으로 결측치를 채웁니다.
print(inputs): 결측치가 대체된 후의 inputs DataFrame을 출력합니다. 이 코드는 결측치가 대체된 데이터를 화면에 출력하여 확인할 수 있습니다.

결과적으로, 이 코드는 결측치를 해당 열의 평균 값으로 대체하여 데이터를 전처리하는 작업을 수행합니다. 이렇게 하면 모델 학습에 결측치가 없는 데이터로 사용할 수 있습니다.

교재

   NumRooms  RoofType_Slate  RoofType_nan
0       3.0           False          True
1       2.0           False          True
2       4.0            True         False
3       3.0           False          True

CoLab

   NumRooms  RoofType_Slate  RoofType_nan
0       3.0               0             1
1       2.0               0             1
2       4.0               1             0
3       3.0               0             1

2.2.3. Conversion to the Tensor Format

Now that all the entries in inputs and targets are numerical, we can load them into a tensor (recall Section 2.1).

이제 입력과 목표의 모든 항목이 숫자이므로 이를 텐서에 로드할 수 있습니다(섹션 2.1을 기억하세요).

import torch

X = torch.tensor(inputs.to_numpy(dtype=float))
y = torch.tensor(targets.to_numpy(dtype=float))
X, y

주어진 코드는 Python에서 PyTorch 라이브러리를 사용하여 데이터를 텐서로 변환하는 작업을 수행하는 코드입니다. 아래는 코드의 설명입니다:

import torch: PyTorch 라이브러리를 가져옵니다. PyTorch는 딥러닝 및 텐서 연산을 위한 라이브러리입니다.
X = torch.tensor(inputs.to_numpy(dtype=float)): 입력 데이터인 inputs를 PyTorch 텐서로 변환합니다. 먼저 inputs를 NumPy 배열로 변환하기 위해 to_numpy() 함수를 사용하고, 이후에 torch.tensor() 함수를 사용하여 NumPy 배열을 PyTorch 텐서로 변환합니다. dtype=float를 사용하여 텐서의 데이터 타입을 부동소수점으로 설정합니다.
y = torch.tensor(targets.to_numpy(dtype=float)): 타겟 데이터인 targets를 PyTorch 텐서로 변환합니다. 위와 같은 방식으로 NumPy 배열로 변환한 후, PyTorch 텐서로 변환합니다.
X, y: 변환된 입력 데이터와 타겟 데이터를 출력합니다. 이 코드는 데이터를 NumPy 배열에서 PyTorch 텐서로 변환한 후, 그 결과를 확인하기 위해 화면에 출력합니다.

결과적으로, 이 코드는 입력 데이터와 타겟 데이터를 PyTorch 텐서로 변환하여 딥러닝 모델 또는 텐서 연산을 수행할 준비를 마칩니다. PyTorch를 사용하면 이러한 텐서를 활용하여 모델을 학습하고 예측하는 등의 다양한 머신 러닝 작업을 수행할 수 있습니다.

(tensor([[3., 0., 1.],
         [2., 0., 1.],
         [4., 1., 0.],
         [3., 0., 1.]], dtype=torch.float64),
 tensor([127500., 106000., 178100., 140000.], dtype=torch.float64))

2.2.4. Discussion

You now know how to partition data columns, impute missing variables, and load pandas data into tensors. In Section 5.7, you will pick up some more data processing skills. While this crash course kept things simple, data processing can get hairy. For example, rather than arriving in a single CSV file, our dataset might be spread across multiple files extracted from a relational database. For instance, in an e-commerce application, customer addresses might live in one table and purchase data in another. Moreover, practitioners face myriad data types beyond categorical and numeric, for example, text strings, images, audio data, and point clouds. Oftentimes, advanced tools and efficient algorithms are required in order to prevent data processing from becoming the biggest bottleneck in the machine learning pipeline. These problems will arise when we get to computer vision and natural language processing. Finally, we must pay attention to data quality. Real-world datasets are often plagued by outliers, faulty measurements from sensors, and recording errors, which must be addressed before feeding the data into any model. Data visualization tools such as seaborn, Bokeh, or matplotlib can help you to manually inspect the data and develop intuitions about the type of problems you may need to address.

이제 데이터 열을 분할하고, 누락된 변수를 대치하고, Pandas 데이터를 텐서에 로드하는 방법을 알았습니다. 섹션 5.7에서는 데이터 처리 기술을 더 배우게 됩니다. 이 단기 집중 강좌에서는 작업을 단순하게 유지했지만 데이터 처리가 까다로울 수 있습니다. 예를 들어 단일 CSV 파일로 도착하는 대신 데이터 세트가 관계형 데이터베이스에서 추출된 여러 파일에 분산될 수 있습니다. 예를 들어 전자 상거래 애플리케이션에서 고객 주소는 한 테이블에 있고 구매 데이터는 다른 테이블에 있을 수 있습니다. 더욱이 실무자는 텍스트 문자열, 이미지, 오디오 데이터, 포인트 클라우드 등 범주형 및 숫자형을 넘어서는 수많은 데이터 유형에 직면합니다. 데이터 처리가 기계 학습 파이프라인에서 가장 큰 병목 현상이 되는 것을 방지하려면 고급 도구와 효율적인 알고리즘이 필요한 경우가 많습니다. 이러한 문제는 컴퓨터 비전과 자연어 처리에 접근할 때 발생합니다. 마지막으로 데이터 품질에 주의를 기울여야 합니다. 실제 데이터 세트는 종종 이상치, 센서의 잘못된 측정, 기록 오류로 인해 어려움을 겪습니다. 이러한 문제는 데이터를 모델에 공급하기 전에 해결해야 합니다. seaborn, Bokeh 또는 matplotlib와 같은 데이터 시각화 도구를 사용하면 데이터를 수동으로 검사하고 해결해야 할 문제 유형에 대한 직관을 개발하는 데 도움이 될 수 있습니다.

2.2.5. Exercises

Try loading datasets, e.g., Abalone from the UCI Machine Learning Repository and inspect their properties. What fraction of them has missing values? What fraction of the variables is numerical, categorical, or text?

UCI Machine Learning Repository에서 Abalone과 같은 데이터세트를 로드하고 해당 속성을 검사해 보세요. 그 중 누락된 값이 있는 부분은 얼마나 됩니까? 변수 중 숫자형, 범주형 또는 텍스트형 변수는 얼마나 됩니까?
Try indexing and selecting data columns by name rather than by column number. The pandas documentation on indexing has further details on how to do this.

열 번호 대신 이름으로 데이터 열을 인덱싱하고 선택해 보세요. 인덱싱에 대한 팬더 문서에는 이 작업을 수행하는 방법에 대한 자세한 내용이 나와 있습니다.
How large a dataset do you think you could load this way? What might be the limitations? Hint: consider the time to read the data, representation, processing, and memory footprint. Try this out on your laptop. What happens if you try it out on a server?

이 방법으로 얼마나 큰 데이터 세트를 로드할 수 있다고 생각하시나요? 한계는 무엇입니까? 힌트: 데이터, 표현, 처리 및 메모리 공간을 읽는 데 걸리는 시간을 고려하세요. 노트북에서 사용해 보세요. 서버에서 시험해 보면 어떻게 되나요?
How would you deal with data that has a very large number of categories? What if the category labels are all unique? Should you include the latter?

카테고리 수가 매우 많은 데이터를 어떻게 처리하시겠습니까? 카테고리 라벨이 모두 고유하면 어떻게 되나요? 후자를 포함해야 할까요?
What alternatives to pandas can you think of? How about loading NumPy tensors from a file? Check out Pillow, the Python Imaging Library.

팬더에 대한 어떤 대안을 생각할 수 있나요? 파일에서 NumPy 텐서를 로드하는 것은 어떻습니까? Python 이미징 라이브러리인 Pillow를 확인해 보세요.

저작자표시 (새창열림)

'Dive into Deep Learning > D2L Preliminaries' 카테고리의 다른 글

D2L - 2.7. Documentation (2)	2023.10.14
D2L - 2.6. Probability and Statistics (0)	2023.10.14
D2L - 2.5. Automatic Differentiation (0)	2023.10.12
D2L - 2.4. Calculus : 미적분 (1)	2023.10.12
D2L - 2.3. Linear Algebra - 선형 대수학 (1)	2023.10.11
D2L - 2.1. Data Manipulation (0)	2023.10.09
D2L - 2. Preliminaries (0)	2023.10.09

Dive into Deep Learning/D2L Preliminaries

D2L - 2.1. Data Manipulation

2023. 10. 9. 11:12 | Posted by 솔웅

https://d2l.ai/chapter_preliminaries/ndarray.html

2.1. Data Manipulation — Dive into Deep Learning 1.0.3 documentation

d2l.ai

2.1. Data Manipulation

In order to get anything done, we need some way to store and manipulate data. Generally, there are two important things we need to do with data: (i) acquire them; and (ii) process them once they are inside the computer. There is no point in acquiring data without some way to store it, so to start, let’s get our hands dirty with n-dimensional arrays, which we also call tensors. If you already know the NumPy scientific computing package, this will be a breeze. For all modern deep learning frameworks, the tensor class (ndarray in MXNet, Tensor in PyTorch and TensorFlow) resembles NumPy’s ndarray, with a few killer features added. First, the tensor class supports automatic differentiation. Second, it leverages GPUs to accelerate numerical computation, whereas NumPy only runs on CPUs. These properties make neural networks both easy to code and fast to run.

어떤 작업을 수행하려면 데이터를 저장하고 조작할 수 있는 방법이 필요합니다. 일반적으로 데이터와 관련하여 중요한 두 가지 작업이 있습니다. (i) 데이터를 획득합니다. (ii) 컴퓨터 내부에 있는 경우 이를 처리합니다. 데이터를 저장할 방법 없이 데이터를 얻는 것은 의미가 없습니다. 먼저 텐서라고도 불리는 n차원 배열을 사용해 보겠습니다. NumPy 과학 컴퓨팅 패키지를 이미 알고 있다면 매우 쉬울 것입니다. 모든 최신 딥 러닝 프레임워크의 경우 텐서 클래스(MXNet의 ndarray, PyTorch의 Tensor 및 TensorFlow)는 몇 가지 킬러 기능이 추가된 NumPy의 ndarray와 유사합니다. 첫째, 텐서 클래스는 자동 미분을 지원합니다. 둘째, NumPy는 CPU에서만 실행되는 반면 GPU를 활용하여 수치 계산을 가속화합니다. 이러한 속성은 신경망을 코딩하기 쉽고 빠르게 실행할 수 있게 해줍니다.

2.1.1. Getting Started

To start, we import the PyTorch library. Note that the package name is torch.

시작하려면 PyTorch 라이브러리를 가져옵니다. 패키지 이름은 torch입니다.

A tensor represents a (possibly multidimensional) array of numerical values. In the one-dimensional case, i.e., when only one axis is needed for the data, a tensor is called a vector. With two axes, a tensor is called a matrix. With k >2 axes, we drop the specialized names and just refer to the object as a k th-order tensor.

텐서는 숫자 값의 (아마도 다차원) 배열을 나타냅니다. 1차원의 경우, 즉 데이터에 하나의 축만 필요한 경우 텐서를 벡터라고 합니다. 두 개의 축이 있는 텐서를 행렬이라고 합니다. k >2 축을 사용하면 특수한 이름을 삭제하고 객체를 k차 텐서로 참조합니다.

PyTorch provides a variety of functions for creating new tensors prepopulated with values. For example, by invoking arange(n), we can create a vector of evenly spaced values, starting at 0 (included) and ending at n (not included). By default, the interval size is 1. Unless otherwise specified, new tensors are stored in main memory and designated for CPU-based computation.

PyTorch는 값이 미리 채워진 새로운 텐서를 생성하기 위한 다양한 기능을 제공합니다. 예를 들어 arange(n)을 호출하면 0(포함)에서 시작하여 n(포함되지 않음)으로 끝나는 균일한 간격의 값으로 구성된 벡터를 만들 수 있습니다. 기본적으로 간격 크기는 1입니다. 달리 지정하지 않는 한 새 텐서는 주 메모리에 저장되고 CPU 기반 계산을 위해 지정됩니다.

x = torch.arange(12, dtype=torch.float32)
x

tensor([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11.])

Each of these values is called an element of the tensor. The tensor x contains 12 elements. We can inspect the total number of elements in a tensor via its numel method.

이러한 각 값을 텐서의 요소라고 합니다. 텐서 x는 12개의 요소를 포함합니다. numel 메소드를 통해 텐서의 총 요소 수를 검사할 수 있습니다.

x.numel()

We can access a tensor’s shape (the length along each axis) by inspecting its shape attribute. Because we are dealing with a vector here, the shape contains just a single element and is identical to the size.

모양 속성을 검사하여 텐서의 모양(각 축의 길이)에 접근할 수 있습니다. 여기서는 벡터를 다루기 때문에 모양에는 단일 요소만 포함되고 크기도 동일합니다.

x.shape

torch.Size([12])

We can change the shape of a tensor without altering its size or values, by invoking reshape. For example, we can transform our vector x whose shape is (12,) to a matrix X with shape (3, 4). This new tensor retains all elements but reconfigures them into a matrix. Notice that the elements of our vector are laid out one row at a time and thus x[3] == X[0, 3].

reshape를 호출하면 크기나 값을 변경하지 않고도 텐서의 모양을 변경할 수 있습니다. 예를 들어 모양이 (12,)인 벡터 x를 모양이 (3, 4)인 행렬 X로 변환할 수 있습니다. 이 새로운 텐서는 모든 요소를 유지하지만 이를 행렬로 재구성합니다. 벡터의 요소는 한 번에 한 행씩 배치되므로 x[3] == X[0, 3]입니다.

X = x.reshape(3, 4)
X

tensor([[ 0.,  1.,  2.,  3.],
        [ 4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.]])

Note that specifying every shape component to reshape is redundant. Because we already know our tensor’s size, we can work out one component of the shape given the rest. For example, given a tensor of size n and target shape (ℎ, w), we know that w=n/ℎ. To automatically infer one component of the shape, we can place a -1 for the shape component that should be inferred automatically. In our case, instead of calling x.reshape(3, 4), we could have equivalently called x.reshape(-1, 4) or x.reshape(3, -1).

모든 shape component 특정해서 reshape 하는 것은 redundant하다는 것을 주목하세요.. 우리는 이미 텐서의 크기를 알고 있기 때문에 나머지가 주어지면 shape 의 한 구성 요소를 계산할 수 있습니다. 예를 들어 크기가 n이고 대상 shape (ℎ, w) 있는 텐서가 있으면 w=n/ℎ임을 알 수 있습니다. 모양의 한 구성 요소를 자동으로 추론하려면 자동으로 추론해야 하는 모양 구성 요소에 -1을 배치할 수 있습니다. 우리의 경우 x.reshape(3, 4)를 호출하는 대신 x.reshape(-1, 4) 또는 x.reshape(3, -1)을 호출할 수도 있습니다.

Practitioners often need to work with tensors initialized to contain all 0s or 1s. We can construct a tensor with all elements set to 0 and a shape of (2, 3, 4) via the zeros function.

Practitioners 는 모두 0 또는 1을 포함하도록 초기화된 텐서를 사용하여 작업해야 하는 경우가 많습니다. zeros 함수를 통해 모든 요소가 0으로 설정되고 모양이 (2, 3, 4)인 텐서를 구성할 수 있습니다.

torch.zeros((2, 3, 4))

tensor([[[0., 0., 0., 0.],
         [0., 0., 0., 0.],
         [0., 0., 0., 0.]],

        [[0., 0., 0., 0.],
         [0., 0., 0., 0.],
         [0., 0., 0., 0.]]])

Similarly, we can create a tensor with all 1s by invoking ones.

마찬가지로, 1을 호출하여 모두 1로 구성된 텐서를 생성할 수 있습니다.

torch.ones((2, 3, 4))

tensor([[[1., 1., 1., 1.],
         [1., 1., 1., 1.],
         [1., 1., 1., 1.]],

        [[1., 1., 1., 1.],
         [1., 1., 1., 1.],
         [1., 1., 1., 1.]]])

We often wish to sample each element randomly (and independently) from a given probability distribution. For example, the parameters of neural networks are often initialized randomly. The following snippet creates a tensor with elements drawn from a standard Gaussian (normal) distribution with mean 0 and standard deviation 1.

우리는 주어진 확률 분포에서 각 요소를 무작위로(그리고 독립적으로) 샘플링하려는 경우가 많습니다. 예를 들어 신경망의 매개변수는 무작위로 초기화되는 경우가 많습니다. 다음 스니펫은 평균이 0이고 표준편차가 1인 표준 가우스(정규) 분포에서 추출된 요소로 텐서를 생성합니다.

torch.randn(3, 4)

tensor([[-0.6921, -1.7850, -0.0397,  0.3334],
        [-0.6288, -0.7518, -0.4018, -0.9821],
        [-1.3914,  1.5492, -0.3178, -0.9031]])

Finally, we can construct tensors by supplying the exact values for each element by supplying (possibly nested) Python list(s) containing numerical literals. Here, we construct a matrix with a list of lists, where the outermost list corresponds to axis 0, and the inner list corresponds to axis 1.

마지막으로, 숫자 리터럴이 포함된 (중첩된) Python 목록을 제공하여 각 요소에 대한 정확한 값을 제공함으로써 텐서를 구성할 수 있습니다. 여기서는 목록 목록으로 행렬을 구성합니다. 여기서 가장 바깥쪽 목록은 축 0에 해당하고 내부 목록은 축 1에 해당합니다.

torch.tensor([[2, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])

tensor([[2, 1, 4, 3],
        [1, 2, 3, 4],
        [4, 3, 2, 1]])

2.1.2. Indexing and Slicing

As with Python lists, we can access tensor elements by indexing (starting with 0). To access an element based on its position relative to the end of the list, we can use negative indexing. Finally, we can access whole ranges of indices via slicing (e.g., X[start:stop]), where the returned value includes the first index (start) but not the last (stop). Finally, when only one index (or slice) is specified for a k th-order tensor, it is applied along axis 0. Thus, in the following code, [-1] selects the last row and [1:3] selects the second and third rows.

Python 목록과 마찬가지로 인덱싱(0부터 시작)을 통해 텐서 요소에 액세스할 수 있습니다. 목록 끝을 기준으로 요소의 위치를 기준으로 요소에 액세스하려면 음수 인덱싱을 사용할 수 있습니다. 마지막으로, 슬라이싱(예: X[start:stop])을 통해 전체 인덱스 범위에 액세스할 수 있습니다. 여기서 반환된 값에는 첫 번째 인덱스(start)가 포함되지만 마지막(stop)은 포함되지 않습니다. 마지막으로, k차 텐서에 대해 하나의 인덱스(또는 슬라이스)만 지정된 경우 축 0을 따라 적용됩니다. 따라서 다음 코드에서 [-1]은 마지막 행을 선택하고 [1:3]은 두 번째와 세 번째 행을 선택합니다.

X, X[0],X[1],X[2],X[-1],X[0:3],X[0:2],X[0:1], X[1:3],X[1:2],X[1:1],X[2:3],X[2:2],X[-1:3],X[-1:0],X[-1:1],X[-1:2],X[-2],X[-2:3]

(tensor([[ 0.,  1.,  2.,  3.],
         [ 4.,  5.,  6.,  7.],
         [ 8.,  9., 10., 11.]]),
 tensor([0., 1., 2., 3.]),
 tensor([4., 5., 6., 7.]),
 tensor([ 8.,  9., 10., 11.]),
 tensor([ 8.,  9., 10., 11.]),
 tensor([[ 0.,  1.,  2.,  3.],
         [ 4.,  5.,  6.,  7.],
         [ 8.,  9., 10., 11.]]),
 tensor([[0., 1., 2., 3.],
         [4., 5., 6., 7.]]),
 tensor([[0., 1., 2., 3.]]),
 tensor([[ 4.,  5.,  6.,  7.],
         [ 8.,  9., 10., 11.]]),
 tensor([[4., 5., 6., 7.]]),
 tensor([], size=(0, 4)),
 tensor([[ 8.,  9., 10., 11.]]),
 tensor([], size=(0, 4)),
 tensor([[ 8.,  9., 10., 11.]]),
 tensor([], size=(0, 4)),
 tensor([], size=(0, 4)),
 tensor([], size=(0, 4)),
 tensor([4., 5., 6., 7.]),
 tensor([[ 4.,  5.,  6.,  7.],
         [ 8.,  9., 10., 11.]]))

Beyond reading them, we can also write elements of a matrix by specifying indices.

이를 읽는 것 외에도 인덱스를 지정하여 행렬의 요소를 작성할 수도 있습니다.

X[1, 2] = 17
X

tensor([[ 0.,  1.,  2.,  3.],
        [ 4.,  5., 17.,  7.],
        [ 8.,  9., 10., 11.]])

If we want to assign multiple elements the same value, we apply the indexing on the left-hand side of the assignment operation. For instance, [:2, :] accesses the first and second rows, where : takes all the elements along axis 1 (column). While we discussed indexing for matrices, this also works for vectors and for tensors of more than two dimensions.

여러 요소에 동일한 값을 할당하려면 할당 작업의 왼쪽에 인덱싱을 적용합니다. 예를 들어, [:2, :]는 첫 번째와 두 번째 행에 액세스합니다. 여기서 :는 축 1(열)을 따라 모든 요소를 가져옵니다. 행렬에 대한 인덱싱에 대해 논의했지만 이는 벡터와 2차원 이상의 텐서에도 적용됩니다.

X[:2, :] = 12
X

tensor([[12., 12., 12., 12.],
        [12., 12., 12., 12.],
        [ 8.,  9., 10., 11.]])

2.1.3. Operations

Now that we know how to construct tensors and how to read from and write to their elements, we can begin to manipulate them with various mathematical operations. Among the most useful of these are the elementwise operations. These apply a standard scalar operation to each element of a tensor. For functions that take two tensors as inputs, elementwise operations apply some standard binary operator on each pair of corresponding elements. We can create an elementwise function from any function that maps from a scalar to a scalar.

이제 텐서를 구성하는 방법과 해당 요소를 읽고 쓰는 방법을 알았으므로 다양한 수학적 연산을 사용하여 텐서를 조작할 수 있습니다. 이들 중 가장 유용한 것 중에는 요소별 연산이 있습니다. 이는 텐서의 각 요소에 표준 스칼라 연산을 적용합니다. 두 개의 텐서를 입력으로 사용하는 함수의 경우 요소별 연산은 해당 요소의 각 쌍에 일부 표준 이진 연산자를 적용합니다. 스칼라에서 스칼라로 매핑되는 모든 함수에서 요소별 함수를 만들 수 있습니다.

In mathematical notation, we denote such unary scalar operators (taking one input) by the signature ƒ : ℝ → ℝ . This just means that the function maps from any real number onto some other real number. Most standard operators, including unary ones like e**x, can be applied elementwise.

수학적 표기법에서는 이러한 단항 스칼라 연산자(하나의 입력을 취함)를 f: ℝ → ℝ 기호로 나타냅니다. 이는 단지 함수가 실수를 다른 실수로 매핑한다는 것을 의미합니다. e**x와 같은 단항 연산자를 포함한 대부분의 표준 연산자는 요소별로 적용할 수 있습니다.

torch.exp(x)

tensor([162754.7969, 162754.7969, 162754.7969, 162754.7969, 162754.7969,
        162754.7969, 162754.7969, 162754.7969,   2980.9580,   8103.0840,
         22026.4648,  59874.1406])

Likewise, we denote binary scalar operators, which map pairs of real numbers to a (single) real number via the signature ƒ : ℝ , ℝ → ℝ . Given any two vectors u and v of the same shape, and a binary operator ƒ , we can produce a vector c=F(u,v) by setting ci← ƒ (ui,vi) for all i, where ci,ui, and vi are the i th elements of vectors c,u, and v. Here, we produced the vector-valued F: ℝ**d, ℝ**d → ℝ**d by lifting the scalar function to an elementwise vector operation. The common standard arithmetic operators for addition (+), subtraction (-), multiplication (*), division (/), and exponentiation (**) have all been lifted to elementwise operations for identically-shaped tensors of arbitrary shape.

마찬가지로, 서명 ƒ : ℝ , ℝ → ℝ을 통해 실수 쌍을 (단일) 실수로 매핑하는 이진 스칼라 연산자를 나타냅니다. 동일한 모양의 두 벡터 u 및 v와 이항 연산자 ƒ 가 주어지면 모든 i에 대해 ci← ƒ (ui,vi)를 설정하여 벡터 c=F(u,v)를 생성할 수 있습니다. 여기서 ci,ui, vi는 벡터 c,u, v의 i 번째 요소입니다. 여기서는 스칼라 함수를 요소별 벡터 연산으로 끌어올려 벡터 값 F: ℝ**d, ℝ**d → ℝ**d를 생성했습니다. . 덧셈(+), 뺄셈(-), 곱셈(*), 나눗셈(/), 지수화(**)에 대한 일반적인 표준 산술 연산자가 모두 동일한 모양의 임의 모양의 텐서에 대한 요소별 연산으로 향상되었습니다.

x = torch.tensor([1.0, 2, 4, 8])
y = torch.tensor([2, 2, 2, 2])
x + y, x - y, x * y, x / y, x ** y

(tensor([ 3.,  4.,  6., 10.]),
 tensor([-1.,  0.,  2.,  6.]),
 tensor([ 2.,  4.,  8., 16.]),
 tensor([0.5000, 1.0000, 2.0000, 4.0000]),
 tensor([ 1.,  4., 16., 64.]))

In addition to elementwise computations, we can also perform linear algebraic operations, such as dot products and matrix multiplications. We will elaborate on these in Section 2.3.

요소별 계산 외에도 내적 및 행렬 곱셈과 같은 선형 대수 연산을 수행할 수도 있습니다. 이에 대해서는 섹션 2.3에서 자세히 설명하겠습니다.

We can also concatenate multiple tensors, stacking them end-to-end to form a larger one. We just need to provide a list of tensors and tell the system along which axis to concatenate. The example below shows what happens when we concatenate two matrices along rows (axis 0) instead of columns (axis 1). We can see that the first output’s axis-0 length (6) is the sum of the two input tensors’ axis-0 lengths (3+3); while the second output’s axis-1 length (8) is the sum of the two input tensors’ axis-1 lengths (4+4).

또한 여러 개의 텐서를 연결하여 끝에서 끝까지 쌓아서 더 큰 텐서를 형성할 수도 있습니다. 텐서 목록을 제공하고 연결할 축을 시스템에 알려주기만 하면 됩니다. 아래 예는 열(축 1) 대신 행(축 0)을 따라 두 행렬을 연결할 때 어떤 일이 발생하는지 보여줍니다. 첫 번째 출력의 0축 길이(6)는 두 입력 텐서의 0축 길이(3+3)의 합이라는 것을 알 수 있습니다. 두 번째 출력의 축 1 길이(8)는 두 입력 텐서의 축 1 길이(4+4)의 합입니다.

X = torch.arange(12, dtype=torch.float32).reshape((3,4))
Y = torch.tensor([[2.0, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])
torch.cat((X, Y), dim=0), torch.cat((X, Y), dim=1)

설명

import torch

# 12개의 원소를 가지는 1차원 텐서 생성하고, 이를 (3, 4) 크기의 2차원 텐서로 변환합니다.
X = torch.arange(12, dtype=torch.float32).reshape((3, 4))

# 주어진 값으로 3x4 크기의 텐서 Y를 생성합니다.
Y = torch.tensor([[2.0, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])

# torch.cat을 사용하여 두 개의 텐서 X와 Y를 합칩니다. dim=0을 사용하면 행 방향으로 합치고, dim=1을 사용하면 열 방향으로 합칩니다.
result1 = torch.cat((X, Y), dim=0)  # 행 방향으로 합침
result2 = torch.cat((X, Y), dim=1)  # 열 방향으로 합침

# 결과 출력
print(result1)
print(result2)

이 코드는 PyTorch를 사용하여 두 개의 텐서 X와 Y를 합치고, 그 결과를 출력하는 예제입니다. torch.cat 함수를 사용하여 두 텐서를 합치는데, dim 매개변수를 사용하여 어느 방향(행 또는 열)으로 합칠지를 지정할 수 있습니다. 결과는 result1과 result2에 저장되며, 각각 행 방향과 열 방향으로 합쳐진 텐서를 나타냅니다.

결과

(tensor([[ 0.,  1.,  2.,  3.],
         [ 4.,  5.,  6.,  7.],
         [ 8.,  9., 10., 11.],
         [ 2.,  1.,  4.,  3.],
         [ 1.,  2.,  3.,  4.],
         [ 4.,  3.,  2.,  1.]]),
 tensor([[ 0.,  1.,  2.,  3.,  2.,  1.,  4.,  3.],
         [ 4.,  5.,  6.,  7.,  1.,  2.,  3.,  4.],
         [ 8.,  9., 10., 11.,  4.,  3.,  2.,  1.]]))

첫 번째 결과인 result1은 행 방향으로 텐서 X와 Y가 합쳐진 것을 나타내며, 두 번째 결과인 result2는 열 방향으로 합쳐진 것을 나타냅니다.

Sometimes, we want to construct a binary tensor via logical statements. Take X == Y as an example. For each position i, j, if X[i, j] and Y[i, j] are equal, then the corresponding entry in the result takes value 1, otherwise it takes value 0.

때로는 논리문을 통해 이진 텐서를 구성하고 싶을 때도 있습니다. X == Y를 예로 들어 보겠습니다. 각 위치 i, j에 대해 X[i, j]와 Y[i, j]가 동일하면 결과의 해당 항목은 값 1을 취하고, 그렇지 않으면 값 0을 갖습니다.

X == Y

tensor([[False,  True, False,  True],
        [False, False, False, False],
        [False, False, False, False]])

Summing all the elements in the tensor yields a tensor with only one element.

텐서의 모든 요소를 합산하면 요소가 하나만 있는 텐서가 생성됩니다.

X.sum()

tensor(66.)

2.1.4. Broadcasting

By now, you know how to perform elementwise binary operations on two tensors of the same shape. Under certain conditions, even when shapes differ, we can still perform elementwise binary operations by invoking the broadcasting mechanism. Broadcasting works according to the following two-step procedure: (i) expand one or both arrays by copying elements along axes with length 1 so that after this transformation, the two tensors have the same shape; (ii) perform an elementwise operation on the resulting arrays.

지금까지 동일한 모양의 두 텐서에 대해 요소별 이진 연산을 수행하는 방법을 알았습니다. 특정 조건에서는 모양이 다르더라도 브로드캐스팅 메커니즘을 호출하여 요소별 이진 연산을 계속 수행할 수 있습니다. 브로드캐스트는 다음 2단계 절차에 따라 작동합니다. (i) 길이가 1인 축을 따라 요소를 복사하여 하나 또는 두 배열을 확장하여 이 변환 후 두 텐서가 동일한 모양을 갖게 합니다. (ii) 결과 배열에 대해 요소별 연산을 수행합니다.

a = torch.arange(3).reshape((3, 1))
b = torch.arange(2).reshape((1, 2))
a, b

설명

import torch

# torch.arange를 사용하여 0부터 2까지의 값을 가지는 1차원 텐서를 생성하고, 이를 (3, 1) 크기의 2차원 텐서로 변환합니다.
a = torch.arange(3).reshape((3, 1))

# torch.arange를 사용하여 0부터 1까지의 값을 가지는 1차원 텐서를 생성하고, 이를 (1, 2) 크기의 2차원 텐서로 변환합니다.
b = torch.arange(2).reshape((1, 2))

# 결과 출력
print(a)
print(b)

이 코드는 PyTorch를 사용하여 두 개의 텐서 a와 b를 생성하고 그 결과를 출력하는 예제입니다.

첫 번째 부분에서 a는 0부터 2까지의 값을 가지는 1차원 텐서를 생성하고, .reshape((3, 1))을 사용하여 이를 (3, 1) 크기의 2차원 텐서로 변환합니다. 이렇게 하면 3개의 행과 1개의 열을 가진 행렬이 생성됩니다.
두 번째 부분에서 b는 0부터 1까지의 값을 가지는 1차원 텐서를 생성하고, .reshape((1, 2))를 사용하여 이를 (1, 2) 크기의 2차원 텐서로 변환합니다. 이로써 1개의 행과 2개의 열을 가진 행렬이 생성됩니다.

결과는 a와 b의 값이 출력되며, 각각 2차원 텐서의 형태를 가지고 있음을 확인할 수 있습니다.

결과

tensor([[0],
        [1],
        [2]])
tensor([[0, 1]])

첫 번째 결과는 텐서 a이며, (3, 1) 크기의 2차원 텐서입니다. 이는 3개의 행과 1개의 열을 가지며, 각 행에는 0, 1, 2라는 값을 가지고 있습니다.

두 번째 결과는 텐서 b이며, (1, 2) 크기의 2차원 텐서입니다. 이는 1개의 행과 2개의 열을 가지며, 각 열에는 0, 1이라는 값을 가지고 있습니다.

Since a and b are 3×1 and 1×2 matrices, respectively, their shapes do not match up. Broadcasting produces a larger 3×2 matrix by replicating matrix a along the columns and matrix b along the rows before adding them elementwise.

a와 b는 각각 3×1과 1×2 행렬이므로 모양이 일치하지 않습니다. 브로드캐스팅은 요소별로 추가하기 전에 열을 따라 행렬 a를, 행을 따라 행렬 b를 복제하여 더 큰 3×2 행렬을 생성합니다.

a + b

tensor([[0, 1],
        [1, 2],
        [2, 3]])

2.1.5. Saving Memory

Running operations can cause new memory to be allocated to host results. For example, if we write Y = X + Y, we dereference the tensor that Y used to point to and instead point Y at the newly allocated memory. We can demonstrate this issue with Python’s id() function, which gives us the exact address of the referenced object in memory. Note that after we run Y = Y + X, id(Y) points to a different location. That is because Python first evaluates Y + X, allocating new memory for the result and then points Y to this new location in memory.

작업을 실행하면 호스트 결과에 새 메모리가 할당될 수 있습니다. 예를 들어, Y = X + Y라고 쓰면 Y가 가리키는 데 사용된 텐서를 역참조하고 대신 새로 할당된 메모리에서 Y를 가리킵니다. 메모리에서 참조된 객체의 정확한 주소를 제공하는 Python의 id() 함수를 사용하여 이 문제를 입증할 수 있습니다. Y = Y + X를 실행한 후 id(Y)는 다른 위치를 가리킵니다. 그 이유는 Python이 먼저 Y + X를 평가하여 결과에 새 메모리를 할당한 다음 Y를 메모리의 새 위치를 가리키기 때문입니다.

before = id(Y)
Y = Y + X
id(Y) == before

False

This might be undesirable for two reasons. First, we do not want to run around allocating memory unnecessarily all the time. In machine learning, we often have hundreds of megabytes of parameters and update all of them multiple times per second. Whenever possible, we want to perform these updates in place. Second, we might point at the same parameters from multiple variables. If we do not update in place, we must be careful to update all of these references, lest we spring a memory leak or inadvertently refer to stale parameters.

이는 두 가지 이유로 바람직하지 않을 수 있습니다. 첫째, 우리는 항상 불필요하게 메모리를 할당하는 것을 원하지 않습니다. 기계 학습에서는 종종 수백 메가바이트의 매개변수가 있고 모든 매개변수를 초당 여러 번 업데이트합니다. 가능할 때마다 이러한 업데이트를 제자리에서 수행하려고 합니다. 둘째, 여러 변수에서 동일한 매개변수를 가리킬 수 있습니다. 제자리에서 업데이트하지 않으면 메모리 누수가 발생하거나 부주의하게 오래된 매개변수를 참조하지 않도록 이러한 참조를 모두 업데이트하도록 주의해야 합니다.

Fortunately, performing in-place operations is easy. We can assign the result of an operation to a previously allocated array Y by using slice notation: Y[:] = <expression>. To illustrate this concept, we overwrite the values of tensor Z, after initializing it, using zeros_like, to have the same shape as Y.

다행히도 내부 작업을 수행하는 것은 쉽습니다. 슬라이스 표기법(Y[:] = <expression>)을 사용하여 이전에 할당된 배열 Y에 연산 결과를 할당할 수 있습니다. 이 개념을 설명하기 위해 zeros_like를 사용하여 초기화한 후 텐서 Z의 값을 Y와 동일한 모양으로 덮어씁니다.

Z = torch.zeros_like(Y)
print('id(Z):', id(Z))
Z[:] = X + Y
print('id(Z):', id(Z))

설명

import torch

# Y와 동일한 크기와 데이터 타입을 가지는 모든 요소가 0인 텐서 Z를 생성합니다.
Z = torch.zeros_like(Y)

# 현재 Z의 메모리 주소를 출력합니다.
print('id(Z):', id(Z))

# Z의 모든 요소에 X와 Y의 합을 할당합니다.
Z[:] = X + Y

# 다시 Z의 메모리 주소를 출력합니다.
print('id(Z):', id(Z))

이 코드는 PyTorch를 사용하여 텐서 Z를 생성하고, 이후 X와 Y의 합을 Z에 할당하는 예제입니다. 코드를 단계별로 설명하겠습니다:

Z = torch.zeros_like(Y) : Y와 동일한 크기와 데이터 타입을 가지며, 모든 요소가 0으로 초기화된 텐서 Z를 생성합니다.
print('id(Z):', id(Z)) : id(Z)를 사용하여 현재 Z의 메모리 주소를 출력합니다.
Z[:] = X + Y : Z의 모든 요소에 X와 Y의 합을 할당합니다. 이는 요소별 덧셈을 수행하며, Z의 값이 갱신됩니다.
print('id(Z):', id(Z)) : 다시 id(Z)를 사용하여 Z의 메모리 주소를 출력합니다. 이 주소는 이전 출력과 동일해야 합니다.

결과적으로, 코드는 Z를 초기화하고 값을 할당한 후에도 Z의 메모리 주소가 변경되지 않는 것을 보여줍니다. 이는 PyTorch의 메모리 관리 방식 중 하나로, 새로운 값을 할당하더라도 기존 텐서의 메모리를 재사용하여 효율적으로 관리하는 방식을 나타냅니다.

결과

id(Z): 140381179266448
id(Z): 140381179266448

If the value of X is not reused in subsequent computations, we can also use X[:] = X + Y or X += Y to reduce the memory overhead of the operation.

X 값이 후속 계산에서 재사용되지 않는 경우 X[:] = X + Y 또는 X += Y를 사용하여 작업의 메모리 오버헤드를 줄일 수도 있습니다.

before = id(X)
X += Y
id(X) == before

설명

# 현재 X의 메모리 주소를 저장합니다.
before = id(X)

# X에 Y를 더하고 X의 메모리 주소를 다시 확인합니다.
X += Y

# X의 메모리 주소가 이전과 동일한지를 확인합니다.
id(X) == before

True

2.1.6. Conversion to Other Python Objects

Converting to a NumPy tensor (ndarray), or vice versa, is easy. The torch tensor and NumPy array will share their underlying memory, and changing one through an in-place operation will also change the other.

NumPy 텐서(ndarray)로 변환하거나 그 반대로 변환하는 것은 쉽습니다. 토치 텐서와 NumPy 배열은 기본 메모리를 공유하며 내부 작업을 통해 하나를 변경하면 다른 것도 변경됩니다.

A = X.numpy()
B = torch.from_numpy(A)
type(A), type(B)

(numpy.ndarray, torch.Tensor)

To convert a size-1 tensor to a Python scalar, we can invoke the item function or Python’s built-in functions.

크기가 1인 텐서를 Python 스칼라로 변환하려면 항목 함수나 Python의 내장 함수를 호출하면 됩니다.

a = torch.tensor([3.5])
a, a.item(), float(a), int(a)

(tensor([3.5000]), 3.5, 3.5, 3)

2.1.7. Summary

The tensor class is the main interface for storing and manipulating data in deep learning libraries. Tensors provide a variety of functionalities including construction routines; indexing and slicing; basic mathematics operations; broadcasting; memory-efficient assignment; and conversion to and from other Python objects.

텐서 클래스는 딥러닝 라이브러리에서 데이터를 저장하고 조작하기 위한 기본 인터페이스입니다. 텐서는 구성 루틴을 포함한 다음과 같은 다양한 기능을 제공합니다. 인덱싱 및 슬라이싱; 기본 수학 연산; broadcasting ; 메모리 효율적인 할당; 그리고 다른 Python 객체와의 변환.

2.1.8. Exercises

Run the code in this section. Change the conditional statement X == Y to X < Y or X > Y, and then see what kind of tensor you can get.

이 섹션의 코드를 실행하세요. 조건문 X == Y를 X < Y 또는 X > Y로 변경한 다음 어떤 종류의 텐서를 얻을 수 있는지 확인하세요.
Replace the two tensors that operate by element in the broadcasting mechanism with other shapes, e.g., 3-dimensional tensors. Is the result the same as expected?

방송 메커니즘의 요소별로 작동하는 두 개의 텐서를 다른 모양(예: 3차원 텐서)으로 교체합니다. 결과가 예상한 것과 같나요?

저작자표시 (새창열림)

'Dive into Deep Learning > D2L Preliminaries' 카테고리의 다른 글

D2L - 2.7. Documentation (2)	2023.10.14
D2L - 2.6. Probability and Statistics (0)	2023.10.14
D2L - 2.5. Automatic Differentiation (0)	2023.10.12
D2L - 2.4. Calculus : 미적분 (1)	2023.10.12
D2L - 2.3. Linear Algebra - 선형 대수학 (1)	2023.10.11
D2L - 2.2. Data Preprocessing (0)	2023.10.09
D2L - 2. Preliminaries (0)	2023.10.09

1 2 3 4 ··· 13

공지사항

최근에 올라온 글

최근에 달린 댓글

최근에 받은 트랙백

글 보관함

카테고리

'Dive into Deep Learning'에 해당되는 글 123건

Train and tune a deep learning model at scale

with Amazon SageMaker

Step 7. Tune the model with Amazon SageMaker automatic model tuning

Conclusion

'Dive into Deep Learning > Scratch' 카테고리의 다른 글

3.7. Weight Decay

3.7.1. Norms and Weight Decay

3.7.2. High-Dimensional Linear Regression

3.7.3. Implementation from Scratc

3.7.3.1. Defining ℓ2 Norm Penalty

3.7.3.2. Defining the Model

3.7.3.3. Training without Regularization

3.7.3.4. Using Weight Decay

3.7.4. Concise Implementation

3.7.5. Summary

3.7.6. Exercises

'Dive into Deep Learning > D2L Linear Neural Networks' 카테고리의 다른 글

3.6. Generalization

3.6.1. Training Error and Generalization Error

3.6.1.1. Model Complexit

3.6.2. Underfitting or Overfitting?

3.6.2.1. Polynomial Curve Fitting

3.6.2.2. Dataset Siz

3.6.3. Model Selection

3.6.3.1. Cross-Validation

3.6.4. Summary

3.6.5. Exercises

'Dive into Deep Learning > D2L Linear Neural Networks' 카테고리의 다른 글

2.7. Documentation

2.7.1. Functions and Classes in a Module

2.7.2. Specific Functions and Classes

'Dive into Deep Learning > D2L Preliminaries' 카테고리의 다른 글

2.6. Probability and Statistics

2.6.1. A Simple Example: Tossing Coins

2.6.2. A More Formal Treatment

2.6.3. Random Variables

2.6.4. Multiple Random Variables

2.6.5. An Example

2.6.6. Expectations

2.6.7. Discussion

2.6.8. Exercises

'Dive into Deep Learning > D2L Preliminaries' 카테고리의 다른 글

2.5. Automatic Differentiation

2.5.1. A Simple Function

2.5.2. Backward for Non-Scalar Variables

2.5.3. Detaching Computation

2.5.4. Gradients and Python Control Flow

2.5.5. Discussion

2.5.6. Exercises

'Dive into Deep Learning > D2L Preliminaries' 카테고리의 다른 글

2.4. Calculus

2.4.1. Derivatives and Differentiation

2.4.2. Visualization Utilities

2.4.3. Partial Derivatives and Gradients

2.4.4. Chain Rule

2.4.5. Discussion

2.4.6. Exercises

'Dive into Deep Learning > D2L Preliminaries' 카테고리의 다른 글

2.3. Linear Algebra

2.3.1. Scalars

2.3.2. Vectors

2.3.3. Matrices

2.3.4. Tensors

2.3.5. Basic Properties of Tensor Arithmetic

2.3.6. Reduction

2.3.7. Non-Reduction Sum

2.3.8. Dot Products

2.3.9. Matrix–Vector Products

2.3.10. Matrix–Matrix Multiplication

2.3.11. Norms

2.3.12. Discussion

'Dive into Deep Learning > D2L Preliminaries' 카테고리의 다른 글

2.2. Data Preprocessing