
https://d2l.ai/chapter_linear-classification/environment-and-distribution-shift.html

 


 

4.7. Environment and Distribution Shift

 
In the previous sections, we worked through a number of hands-on applications of machine learning, fitting models to a variety of datasets. And yet, we never stopped to contemplate either where data comes from in the first place or what we plan to ultimately do with the outputs from our models. Too often, machine learning developers in possession of data rush to develop models without pausing to consider these fundamental issues.

이전 섹션에서는 기계 학습의 여러 hands-on applications -실습 응용 프로그램-을 통해 다양한 데이터 세트에 모델을 맞추었습니다. 그러나 우리는 애초에 데이터가 어디에서 오는지, 또는 궁극적으로 모델의 출력으로 무엇을 할 계획인지 멈춰서 고민해 본 적이 없습니다. 데이터를 손에 쥔 기계 학습 개발자는 너무나 자주 이러한 근본적인 문제를 고려하지 않은 채 모델 개발에만 몰두합니다.

 

Many failed machine learning deployments can be traced back to this pattern. Sometimes models appear to perform marvelously as measured by test set accuracy but fail catastrophically in deployment when the distribution of data suddenly shifts. More insidiously, sometimes the very deployment of a model can be the catalyst that perturbs the data distribution. Say, for example, that we trained a model to predict who will repay vs. default on a loan, finding that an applicant’s choice of footwear was associated with the risk of default (Oxfords indicate repayment, sneakers indicate default). We might be inclined to thereafter grant loans to all applicants wearing Oxfords and to deny all applicants wearing sneakers.

 

실패한 많은 기계 학습 deployments -배포-는 이 패턴으로 거슬러 올라갈 수 있습니다. 때때로 모델은 테스트 세트 정확도로 측정했을 때 놀라운 성능을 보이는 것처럼 보이지만, 데이터 분포가 갑자기 바뀌면 배포 환경에서 치명적으로 실패합니다. 더 은밀하게는, 때로 모델의 배포 자체가 데이터 분포를 교란시키는 촉매제가 될 수 있습니다. 예를 들어, 누가 대출을 상환하고 누가 채무불이행을 할지 예측하는 모델을 훈련시켰는데, 신청자의 신발 선택이 채무불이행 위험과 관련이 있다는 것을 발견했다고 가정해 보겠습니다(옥스퍼드 구두는 상환을, 운동화는 채무불이행을 나타냅니다). 그 후 우리는 옥스퍼드 구두를 신은 모든 신청자에게 대출을 허용하고 운동화를 신은 모든 신청자를 거부하고 싶어질 수 있습니다.

 

In this case, our ill-considered leap from pattern recognition to decision-making and our failure to critically consider the environment might have disastrous consequences. For starters, as soon as we began making decisions based on footwear, customers would catch on and change their behavior. Before long, all applicants would be wearing Oxfords, without any coinciding improvement in credit-worthiness. Take a minute to digest this because similar issues abound in many applications of machine learning: by introducing our model-based decisions to the environment, we might break the model.

 

이 경우 패턴 인식에서 의사 결정으로의 잘못된 도약과 환경을 비판적으로 고려하지 못한 것이 재앙적인 결과를 초래할 수 있습니다. 우선, 우리가 신발을 기준으로 결정을 내리기 시작하자마자 고객들은 이를 알아채고 행동을 바꿀 것입니다. 오래지 않아 신용도가 전혀 나아지지 않았는데도 모든 신청자들이 옥스퍼드 구두를 신게 될 것입니다. 기계 학습의 많은 응용 분야에서 비슷한 문제가 흔하기 때문에 잠시 시간을 내어 이를 곱씹어 보십시오. 모델 기반 결정을 환경에 도입하면 모델이 깨질 수 있습니다.

 

While we cannot possibly give these topics a complete treatment in one section, we aim here to expose some common concerns, and to stimulate the critical thinking required to detect these situations early, mitigate damage, and use machine learning responsibly. Some of the solutions are simple (ask for the “right” data), some are technically difficult (implement a reinforcement learning system), and others require that we step outside the realm of statistical prediction altogether and grapple with difficult philosophical questions concerning the ethical application of algorithms.

 

이러한 주제를 한 섹션에서 완전히 다룰 수는 없지만, 여기서는 몇 가지 일반적인 우려 사항을 드러내고, 이러한 상황을 조기에 감지하고 피해를 완화하며 기계 학습을 책임감 있게 사용하는 데 필요한 비판적 사고를 자극하는 것을 목표로 합니다. 해결책 중 일부는 간단하고("올바른" 데이터 요청), 일부는 기술적으로 어렵고(강화 학습 시스템 구현), 어떤 것은 통계적 예측의 영역을 완전히 벗어나 알고리즘의 윤리적 적용에 관한 어려운 철학적 질문과 씨름해야 합니다.

 

4.7.1. Types of Distribution Shift

To begin, we stick with the passive prediction setting considering the various ways that data distributions might shift and what might be done to salvage model performance. In one classic setup, we assume that our training data was sampled from some distribution pS(x,y) but that our test data will consist of unlabeled examples drawn from some different distribution pT(x,y). Already, we must confront a sobering reality. Absent any assumptions on how pS and pT relate to each other, learning a robust classifier is impossible.

 

시작하기에 앞서, 데이터 분포가 이동할 수 있는 다양한 방식과 모델 성능을 복구하기 위해 할 수 있는 일을 살펴보면서, 수동적(passive) 예측 설정에 집중하겠습니다. 하나의 고전적인 설정에서는 훈련 데이터가 어떤 분포 pS(x,y)에서 샘플링되었고, 테스트 데이터는 그와 다른 분포 pT(x,y)에서 가져온 레이블이 없는 예제들로 구성된다고 가정합니다. 여기서 이미 우리는 냉정한 현실에 직면해야 합니다. pS와 pT가 서로 어떻게 관련되어 있는지에 대한 가정이 없으면 강건한 분류기를 학습하는 것은 불가능합니다.

 

Consider a binary classification problem, where we wish to distinguish between dogs and cats. If the distribution can shift in arbitrary ways, then our setup permits the pathological case in which the distribution over inputs remains constant: pS(x)=pT(x), but the labels are all flipped: pS(y∣x)=1−pT(y∣x). In other words, if God can suddenly decide that in the future all “cats” are now dogs and what we previously called “dogs” are now cats—without any change in the distribution of inputs p(x), then we cannot possibly distinguish this setting from one in which the distribution did not change at all.

 

개와 고양이를 구별하려는 이진 분류 문제를 생각해 보십시오. 분포가 임의의 방식으로 이동할 수 있다면, 우리의 설정은 입력에 대한 분포는 일정하게 유지되지만(pS(x)=pT(x)) 레이블이 모두 뒤집히는(pS(y∣x)=1−pT(y∣x)) 병적인 경우를 허용합니다. 다시 말해, 입력 분포 p(x)에는 아무 변화가 없는 채로 신이 갑자기 미래에는 모든 "고양이"가 개이고 이전에 "개"라고 불렀던 것이 고양이라고 결정해 버린다면, 우리는 이 상황을 분포가 전혀 변하지 않은 상황과 구별할 수 없습니다.

 

Fortunately, under some restricted assumptions on the ways our data might change in the future, principled algorithms can detect shift and sometimes even adapt on the fly, improving on the accuracy of the original classifier.

 

다행스럽게도 미래에 데이터가 변경될 수 있는 방식에 대한 제한된 가정 하에서 원칙적 알고리즘은 이동을 감지하고 때로는 즉석에서 적응하여 원래 분류기의 정확도를 향상시킬 수 있습니다.

 

4.7.1.1. Covariate Shift

Among categories of distribution shift, covariate shift may be the most widely studied. Here, we assume that while the distribution of inputs may change over time, the labeling function, i.e., the conditional distribution P(y∣x) does not change. Statisticians call this covariate shift because the problem arises due to a shift in the distribution of the covariates (features). While we can sometimes reason about distribution shift without invoking causality, we note that covariate shift is the natural assumption to invoke in settings where we believe that x causes y.

 

distribution shift -분포 이동-의 범주 중에서 covariate shift -공변량 이동-이 아마 가장 널리 연구된 유형일 것입니다. 여기서는 distribution of inputs -입력 분포-가 시간에 따라 변할 수 있지만 labeling function -레이블링 함수-, 즉 conditional distribution -조건부 분포- P(y∣x)는 변하지 않는다고 가정합니다. 통계학자들은 covariates (features) -공변량(특징)- 분포의 변화로 인해 문제가 발생하기 때문에 이를 covariate shift -공변량 이동-이라고 부릅니다. 때로는 인과 관계를 끌어들이지 않고도 분포 이동에 대해 추론할 수 있지만, 공변량 이동은 x가 y를 유발한다고 믿는 상황에서 자연스럽게 세울 수 있는 가정이라는 점에 주목합니다.
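공변량 이동을 수식으로 정리해 보면 대략 다음과 같습니다(여기서 pS는 소스(훈련) 분포, pT는 타깃(테스트) 분포를 가리키는 것으로 가정합니다):

$$p_S(\mathbf{x}) \neq p_T(\mathbf{x}), \qquad p_S(y \mid \mathbf{x}) = p_T(y \mid \mathbf{x})$$

즉, 입력의 분포는 바뀌지만 입력이 주어졌을 때의 조건부 레이블 분포는 그대로 유지됩니다.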

 

Consider the challenge of distinguishing cats and dogs. Our training data might consist of images of the kind in Fig. 4.7.1.

 


고양이와 개를 구별하는 문제를 생각해 보십시오. 훈련 데이터는 그림 4.7.1과 같은 종류의 이미지로 구성될 수 있습니다.

 

 

At test time we are asked to classify the images in Fig. 4.7.2.

 

테스트 할 때 그림 4.7.2의 이미지를 분류하라는 요청을 받았습니다.

 

 

The training set consists of photos, while the test set contains only cartoons. Training on a dataset with substantially different characteristics from the test set can spell trouble absent a coherent plan for how to adapt to the new domain.

 

훈련 세트는 사진으로 구성되어 있는 반면 테스트 세트는 만화로만 구성되어 있습니다. 새로운 도메인에 적응하기 위한 일관된 계획이 없다면, 테스트 세트와 상당히 다른 특성을 가진 데이터 세트로 훈련하는 것은 문제를 일으킬 수 있습니다.

 

4.7.1.2. Label Shift

 

Label shift describes the converse problem. Here, we assume that the label marginal P(y) can change but the class-conditional distribution P(x∣y) remains fixed across domains. Label shift is a reasonable assumption to make when we believe that y causes x. For example, we may want to predict diagnoses given their symptoms (or other manifestations), even as the relative prevalence of diagnoses are changing over time. Label shift is the appropriate assumption here because diseases cause symptoms. In some degenerate cases the label shift and covariate shift assumptions can hold simultaneously. For example, when the label is deterministic, the covariate shift assumption will be satisfied, even when y causes x. Interestingly, in these cases, it is often advantageous to work with methods that flow from the label shift assumption. That is because these methods tend to involve manipulating objects that look like labels (often low-dimensional), as opposed to objects that look like inputs, which tend to be high-dimensional in deep learning.

 

레이블 이동은 반대 문제를 설명합니다. 여기에서 레이블 한계 P(y)는 변경될 수 있지만 클래스 조건부 분포 P(x∣y)는 도메인 전체에서 고정된 상태로 유지된다고 가정합니다. 레이블 이동은 y가 x를 유발한다고 믿을 때 합리적인 가정입니다. 예를 들어, 진단의 상대적 유병률이 시간이 지남에 따라 변하더라도 증상(또는 다른 징후)에 따라 진단을 예측할 수 있습니다. 질병이 증상을 유발하기 때문에 레이블 이동이 적절한 가정입니다. 일부 변질된 경우에는 레이블 이동 및 공변량 이동 가정이 동시에 유지될 수 있습니다. 예를 들어 레이블이 결정적이면 y가 x를 유발하는 경우에도 공변량 이동 가정이 충족됩니다. 흥미롭게도 이러한 경우 레이블 이동 가정에서 파생된 방법으로 작업하는 것이 종종 유리합니다. 이러한 방법은 딥 러닝에서 고차원 경향이 있는 입력처럼 보이는 개체와 달리 레이블(종종 저차원)처럼 보이는 개체를 조작하는 경향이 있기 때문입니다.
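같은 표기로 레이블 이동을 정리하면 대략 다음과 같이 쓸 수 있습니다:

$$p_S(y) \neq p_T(y), \qquad p_S(\mathbf{x} \mid y) = p_T(\mathbf{x} \mid y)$$

즉, 레이블의 주변 분포는 바뀌지만 클래스 조건부 분포는 고정됩니다.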

 

4.7.1.3. Concept Shift

We may also encounter the related problem of concept shift, which arises when the very definitions of labels can change. This sounds weird—a cat is a cat, no? However, other categories are subject to changes in usage over time. Diagnostic criteria for mental illness, what passes for fashionable, and job titles, are all subject to considerable amounts of concept shift. It turns out that if we navigate around the United States, shifting the source of our data by geography, we will find considerable concept shift regarding the distribution of names for soft drinks as shown in Fig. 4.7.3.

 


또한 우리는 레이블의 정의 자체가 바뀔 때 발생하는 개념 이동(concept shift)이라는 관련 문제에 직면할 수도 있습니다. 이상하게 들릴 수 있습니다. 고양이는 고양이잖아요? 그러나 다른 범주들은 시간이 지남에 따라 쓰임새가 바뀝니다. 정신 질환의 진단 기준, 유행으로 통하는 것, 직책 이름 등은 모두 상당한 개념 이동을 겪습니다. 미국 전역을 돌아다니며 지역별로 데이터 소스를 바꿔 보면, 그림 4.7.3에 표시된 것처럼 청량 음료를 부르는 이름의 분포에 상당한 개념 이동이 있음을 알 수 있습니다.

 

Fig. 4.7.3  Concept shift on soft drink names in the United States.

 

If we were to build a machine translation system, the distribution P(y∣x) might be different depending on our location. This problem can be tricky to spot. We might hope to exploit knowledge that shift only takes place gradually either in a temporal or geographic sense.

 

기계 번역 시스템을 구축한다면 위치에 따라 분포 P(y∣x)가 다를 수 있습니다. 이 문제는 파악하기 까다로울 수 있습니다. 우리는 변화가 시간적 또는 지리적 의미에서 점진적으로만 발생한다는 지식을 활용하기를 희망할 수 있습니다.
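개념 이동은 조건부 분포 자체가 달라지는 경우로, 대략 다음과 같이 정리할 수 있습니다(입력 분포는 그대로일 수도 있습니다):

$$p_S(y \mid \mathbf{x}) \neq p_T(y \mid \mathbf{x})$$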

 

4.7.2. Examples of Distribution Shift

형식주의와 알고리즘을 탐구하기 전에 공변량 또는 개념 이동이 명확하지 않을 수 있는 몇 가지 구체적인 상황에 대해 논의할 수 있습니다.

 

4.7.2.1. Medical Diagnostics

Imagine that you want to design an algorithm to detect cancer. You collect data from healthy and sick people and you train your algorithm. It works fine, giving you high accuracy and you conclude that you are ready for a successful career in medical diagnostics. Not so fast.

 

암을 감지하는 알고리즘을 설계한다고 상상해 보십시오. 건강한 사람과 아픈 사람으로부터 데이터를 수집해 알고리즘을 훈련합니다. 잘 작동하고 높은 정확도를 보여 주니, 의료 진단 분야에서 성공적인 경력을 쌓을 준비가 되었다고 결론 내립니다. 하지만 속단은 금물입니다.

 

The distributions that gave rise to the training data and those you will encounter in the wild might differ considerably. This happened to an unfortunate startup that some of us (authors) worked with years ago. They were developing a blood test for a disease that predominantly affects older men and hoped to study it using blood samples that they had collected from patients. However, it is considerably more difficult to obtain blood samples from healthy men than sick patients already in the system. To compensate, the startup solicited blood donations from students on a university campus to serve as healthy controls in developing their test. Then they asked whether we could help them to build a classifier for detecting the disease.

 

훈련 데이터를 만들어 낸 분포와 실제 현장에서 마주치게 될 분포는 상당히 다를 수 있습니다. 이것은 우리(저자들) 중 일부가 몇 년 전에 함께 일했던 불운한 스타트업에 실제로 일어난 일입니다. 그들은 주로 나이 든 남성에게 영향을 미치는 질병에 대한 혈액 검사를 개발하고 있었고, 환자로부터 수집한 혈액 샘플을 사용해 연구하기를 희망했습니다. 그러나 이미 시스템 안에 있는 아픈 환자보다 건강한 남성에게서 혈액 샘플을 얻는 것이 훨씬 더 어렵습니다. 이를 보완하기 위해 이 스타트업은 대학 캠퍼스의 학생들에게 헌혈을 요청해 검사 개발의 건강한 대조군(control)으로 삼았습니다. 그런 다음 그들은 질병을 감지하는 분류기를 구축하는 데 도움을 줄 수 있는지 우리에게 물었습니다.

 

As we explained to them, it would indeed be easy to distinguish between the healthy and sick cohorts with near-perfect accuracy. However, that is because the test subjects differed in age, hormone levels, physical activity, diet, alcohol consumption, and many more factors unrelated to the disease. This was unlikely to be the case with real patients. Due to their sampling procedure, we could expect to encounter extreme covariate shift. Moreover, this case was unlikely to be correctable via conventional methods. In short, they wasted a significant sum of money.

 

우리가 그들에게 설명했듯이, 거의 완벽에 가까운 정확도로 건강한 코호트와 아픈 코호트를 구별하는 것은 실제로 쉬운 일입니다. 그러나 이는 피험자들이 연령, 호르몬 수치, 신체 활동, 식단, 알코올 섭취 등 질병과 무관한 수많은 요인에서 서로 달랐기 때문입니다. 실제 환자들에게서는 그렇지 않을 가능성이 높습니다. 이 샘플링 절차 때문에 극단적인 공변량 이동이 발생할 것으로 예상할 수 있었습니다. 게다가 이 경우는 기존의 방법으로는 교정하기 어려웠습니다. 요컨대, 그들은 상당한 돈을 낭비했습니다.

 

4.7.2.2. Self-Driving Cars

Say a company wanted to leverage machine learning for developing self-driving cars. One key component here is a roadside detector. Since real annotated data is expensive to get, they had the (smart and questionable) idea to use synthetic data from a game rendering engine as additional training data. This worked really well on “test data” drawn from the rendering engine. Alas, inside a real car it was a disaster. As it turned out, the roadside had been rendered with a very simplistic texture. More importantly, all the roadside had been rendered with the same texture and the roadside detector learned about this “feature” very quickly.

 

회사에서 자율 주행 자동차 개발을 위해 머신 러닝을 활용하고자 한다고 가정해 보겠습니다. 여기서 핵심 구성 요소 중 하나는 길가 감지기입니다. 주석이 달린 실제 데이터를 얻는 데 비용이 많이 들기 때문에 그들은 게임 렌더링 엔진의 합성 데이터를 추가 교육 데이터로 사용하는 (현명하고 의심스러운) 아이디어를 가졌습니다. 이것은 렌더링 엔진에서 가져온 "테스트 데이터"에서 정말 잘 작동했습니다. 아아, 실제 차 안에서는 재앙이었습니다. 결과적으로 길가는 매우 단순한 질감으로 렌더링되었습니다. 더 중요한 것은 모든 길가가 동일한 텍스처로 렌더링되었고 길가 탐지기가 이 "기능"에 대해 매우 빠르게 학습했다는 것입니다.

 

A similar thing happened to the US Army when they first tried to detect tanks in the forest. They took aerial photographs of the forest without tanks, then drove the tanks into the forest and took another set of pictures. The classifier appeared to work perfectly. Unfortunately, it had merely learned how to distinguish trees with shadows from trees without shadows—the first set of pictures was taken in the early morning, the second set at noon.

 

미군이 처음 숲에서 탱크를 탐지하려 했을 때 비슷한 일이 일어났습니다. 그들은 탱크가 없는 숲의 항공 사진을 찍은 다음 탱크를 숲으로 몰고 또 다른 사진을 찍었습니다. 분류기는 완벽하게 작동하는 것으로 나타났습니다. 안타깝게도 그림자가 있는 나무와 그림자가 없는 나무를 구별하는 방법을 배웠을 뿐이었습니다. 첫 번째 사진 세트는 이른 아침에, 두 번째 세트는 정오에 촬영했습니다.

 

4.7.2.3. Nonstationary Distributions

A much more subtle situation arises when the distribution changes slowly (also known as nonstationary distribution) and the model is not updated adequately. Below are some typical cases.

 

분포가 느리게 변경되고(비정상 분포라고도 함) 모델이 적절하게 업데이트되지 않는 경우 훨씬 더 미묘한 상황이 발생합니다. 다음은 몇 가지 일반적인 경우입니다.

 

  • We train a computational advertising model and then fail to update it frequently (e.g., we forget to incorporate that an obscure new device called an iPad was just launched).
    우리는 전산 광고 모델을 훈련한 다음 자주 업데이트하지 않습니다(예: iPad라는 모호한 새 장치가 방금 출시되었다는 사실을 통합하는 것을 잊었습니다).
  • We build a spam filter. It works well at detecting all spam that we have seen so far. But then the spammers wisen up and craft new messages that look unlike anything we have seen before.
    우리는 스팸 필터를 만듭니다. 지금까지 본 모든 스팸을 탐지하는 데 효과적입니다. 그러나 스패머는 정신을 차리고 우리가 이전에 본 것과는 전혀 다른 새로운 메시지를 만듭니다.
  • We build a product recommendation system. It works throughout the winter but then continues to recommend Santa hats long after Christmas.
    상품 추천 시스템을 구축합니다. 그것은 겨울 내내 작동하지만 크리스마스 이후에도 계속해서 산타 모자를 추천합니다.

 

4.7.2.4. More Anecdotes

  • We build a face detector. It works well on all benchmarks. Unfortunately it fails on test data—the offending examples are close-ups where the face fills the entire image (no such data was in the training set).
    우리는 얼굴 검출기를 만듭니다. 모든 벤치마크에서 잘 작동합니다. 안타깝게도 테스트 데이터에서는 실패합니다. 문제가 되는 예는 얼굴이 이미지 전체를 채우는 클로즈업입니다(훈련 세트에는 그런 데이터가 없었습니다).
  • We build a Web search engine for the US market and want to deploy it in the UK.
    우리는 미국 시장을 위한 웹 검색 엔진을 구축하고 영국에 배포하려고 합니다.
  • We train an image classifier by compiling a large dataset where each among a large set of classes is equally represented in the dataset, say 1000 categories, represented by 1000 images each. Then we deploy the system in the real world, where the actual label distribution of photographs is decidedly non-uniform.
    우리는 큰 클래스 집합의 각 클래스가 데이터 세트에서 동일하게 표현되도록, 예를 들어 1000개 범주를 각각 1000개의 이미지로 구성한 대규모 데이터 세트를 만들어 이미지 분류기를 훈련합니다. 그런 다음 사진의 실제 레이블 분포가 결코 균일하지 않은 실제 세계에 시스템을 배포합니다.

 

4.7.3. Correction of Distribution Shift

As we have discussed, there are many cases where training and test distributions P(x,y) are different. In some cases, we get lucky and the models work despite covariate, label, or concept shift. In other cases, we can do better by employing principled strategies to cope with the shift. The remainder of this section grows considerably more technical. The impatient reader could continue on to the next section as this material is not prerequisite to subsequent concepts.

 

논의한 바와 같이 훈련 분포와 테스트 분포 P(x,y)가 다른 경우가 많습니다. 어떤 경우에는 운이 좋아서 공변량, 레이블 또는 개념 이동에도 불구하고 모델이 잘 작동합니다. 다른 경우에는 이동에 대처하기 위한 원칙에 입각한 전략을 사용함으로써 더 잘할 수 있습니다. 이 섹션의 나머지 부분은 훨씬 더 기술적인 내용입니다. 이 내용은 이후 개념의 전제 조건이 아니므로, 급한 독자는 다음 섹션으로 넘어가도 됩니다.

 

4.7.3.1. Empirical Risk and Risk

Let’s first reflect about what exactly is happening during model training: we iterate over features and associated labels of training data {(x1,y1),…,(xn,yn)} and update the parameters of a model f after every minibatch. For simplicity we do not consider regularization, so we largely minimize the loss on the training:

 

먼저 모델 훈련 중에 정확히 어떤 일이 일어나는지 살펴보겠습니다. 훈련 데이터 {(x1,y1),…,(xn,yn)}의 특징과 그에 연결된 레이블을 반복하면서 미니배치마다 모델 f의 매개변수를 업데이트합니다. 단순화를 위해 정규화는 고려하지 않으므로, 결국 훈련 데이터에 대한 손실을 최소화하는 셈입니다:
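본문이 참조하는 식 (4.7.1), 즉 훈련 손실(경험적 위험)의 최소화를 주변 정의와 원문(d2l.ai)을 바탕으로 복원해 보면 대략 다음과 같습니다:

$$\underset{f}{\mathrm{minimize}}\; \frac{1}{n} \sum_{i=1}^{n} l(f(\mathbf{x}_i), y_i) \qquad (4.7.1)$$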

where l is the loss function measuring “how bad” the prediction f(xi) is given the associated label yi. Statisticians call the term in (4.7.1) empirical risk. The empirical risk is an average loss over the training data to approximate the risk, which is the expectation of the loss over the entire population of data drawn from their true distribution p(x,y):

 

여기서 l은 연결된 레이블 yi가 주어졌을 때 예측 f(xi)가 "얼마나 나쁜지"를 측정하는 손실 함수입니다. 통계학자들은 (4.7.1)의 항을 경험적 위험(empirical risk)이라고 부릅니다. 경험적 위험은 위험(risk)을 근사하기 위한 훈련 데이터에 대한 평균 손실이며, 위험은 실제 분포 p(x,y)에서 추출된 전체 데이터 모집단에 대한 손실의 기댓값입니다:
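여기서 말하는 위험(risk), 즉 식 (4.7.2)는 실제 분포 p(x, y)에 대한 손실의 기댓값으로, 대략 다음과 같이 쓸 수 있습니다:

$$E_{p(\mathbf{x}, y)}\big[l(f(\mathbf{x}), y)\big] = \int\!\!\int l(f(\mathbf{x}), y)\, p(\mathbf{x}, y)\, d\mathbf{x}\, dy \qquad (4.7.2)$$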

 

However, in practice we typically cannot obtain the entire population of data. Thus, empirical risk minimization, which is minimizing the empirical risk in (4.7.1), is a practical strategy for machine learning, with the hope to approximate minimizing the risk.

 

그러나 실제로는 일반적으로 전체 데이터 모집단을 얻을 수 없습니다. 따라서 (4.7.1)에서 경험적 위험을 최소화하는 경험적 위험 최소화는 위험 최소화에 근접하기를 희망하는 기계 학습을 위한 실용적인 전략입니다.

 

4.7.3.2. Covariate Shift Correction

Assume that we want to estimate some dependency P(y∣x) for which we have labeled data (xi,yi). Unfortunately, the observations xi are drawn from some source distribution q(x) rather than the target distribution p(x). Fortunately, the dependency assumption means that the conditional distribution does not change: p(y∣x)=q(y∣x). If the source distribution q(x) is “wrong”, we can correct for that by using the following simple identity in the risk:

 

레이블이 있는 데이터 (xi,yi)를 가지고 어떤 종속성 P(y∣x)를 추정하려 한다고 가정합니다. 불행히도 관측값 xi는 목표 분포 p(x)가 아니라 어떤 소스 분포 q(x)에서 추출됩니다. 다행히도 종속성 가정은 조건부 분포가 변하지 않음을 의미합니다: p(y∣x)=q(y∣x). 소스 분포 q(x)가 "잘못된" 경우, 위험에 대한 다음의 간단한 항등식을 사용해 이를 보정할 수 있습니다:
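본문이 말하는 '위험에 대한 간단한 항등식'은 q(y∣x)=p(y∣x) 가정을 이용해 대략 다음과 같이 복원할 수 있습니다:

$$\int\!\!\int l(f(\mathbf{x}), y)\, p(y \mid \mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}\, dy = \int\!\!\int l(f(\mathbf{x}), y)\, q(y \mid \mathbf{x})\, q(\mathbf{x})\, \frac{p(\mathbf{x})}{q(\mathbf{x})}\, d\mathbf{x}\, dy$$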

 

In other words, we need to reweigh each data example by the ratio of the probability that it would have been drawn from the correct distribution to that from the wrong one:

 

즉, 각 데이터 예제에 대해 그것이 올바른 분포에서 추출되었을 확률과 잘못된 분포에서 추출되었을 확률의 비율로 가중치를 다시 부여해야 합니다:
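즉, 각 데이터 예제에 곱해 줄 중요도 가중치(본문의 Bi는 이 βi를 가리키는 것으로 보입니다)는 대략 다음 비율입니다:

$$\beta_i \;\stackrel{\mathrm{def}}{=}\; \frac{p(\mathbf{x}_i)}{q(\mathbf{x}_i)}$$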

 

Plugging in the weight Bi for each data example (xi,yi) we can train our model using weighted empirical risk minimization:

 

각 데이터 예제 (xi,yi)에 가중치 Bi를 대입하면, 가중 경험적 위험 최소화를 사용해 모델을 훈련할 수 있습니다:
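이 가중치를 식 (4.7.1)에 대입한 가중 경험적 위험 최소화, 즉 본문이 참조하는 식 (4.7.5)는 대략 다음과 같습니다:

$$\underset{f}{\mathrm{minimize}}\; \frac{1}{n} \sum_{i=1}^{n} \beta_i\, l(f(\mathbf{x}_i), y_i) \qquad (4.7.5)$$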

 

 

Alas, we do not know that ratio, so before we can do anything useful we need to estimate it. Many methods are available, including some fancy operator-theoretic approaches that attempt to recalibrate the expectation operator directly using a minimum-norm or a maximum entropy principle. Note that for any such approach, we need samples drawn from both distributions—the “true” p, e.g., by access to test data, and the one used for generating the training set q (the latter is trivially available). Note however, that we only need features x∼p(x); we do not need to access labels y∼p(y).

 

아아, 우리는 그 비율을 모르기 때문에 유용한 일을 하기 전에 그것을 추정해야 합니다. 최소 표준 또는 최대 엔트로피 원리를 사용하여 기대 연산자를 직접 재조정하려는 멋진 연산자 이론적 접근 방식을 포함하여 많은 방법을 사용할 수 있습니다. 이러한 접근 방식의 경우 테스트 데이터에 대한 액세스와 같은 "참" p와 훈련 세트 q를 생성하는 데 사용되는 분포(후자는 쉽게 사용할 수 있음)의 두 분포에서 추출한 샘플이 필요합니다. 그러나 특징 x~p(x)만 필요하다는 점에 유의하십시오. 레이블 y~p(y)에 액세스할 필요가 없습니다.

 

In this case, there exists a very effective approach that will give almost as good results as the original: logistic regression, which is a special case of softmax regression (see Section 4.1) for binary classification. This is all that is needed to compute estimated probability ratios. We learn a classifier to distinguish between data drawn from p(x) and data drawn from q(x). If it is impossible to distinguish between the two distributions then it means that the associated instances are equally likely to come from either one of the two distributions. On the other hand, any instances that can be well discriminated should be significantly overweighted or underweighted accordingly.

 

이 경우 원래 방법과 거의 비슷한 결과를 주는 매우 효과적인 접근 방식이 있습니다. 바로 이진 분류를 위한 소프트맥스 회귀(섹션 4.1 참조)의 특수한 경우인 로지스틱 회귀입니다. 추정 확률 비율을 계산하는 데 필요한 것은 이것이 전부입니다. 우리는 p(x)에서 추출된 데이터와 q(x)에서 추출된 데이터를 구별하는 분류기를 학습합니다. 두 분포를 구별하는 것이 불가능하다면, 해당 인스턴스가 두 분포 중 어느 쪽에서 나왔을 가능성이 같다는 뜻입니다. 반면에 잘 구별되는 인스턴스는 그에 따라 가중치가 크게 높아지거나 낮아져야 합니다.

 

For simplicity’s sake assume that we have an equal number of instances from both distributions p(x) and q(x), respectively. Now denote by z labels that are 1 for data drawn from p and −1 for data drawn from q. Then the probability in a mixed dataset is given by

 

단순화를 위해 분포 p(x) 및 q(x) 각각에서 동일한 수의 인스턴스가 있다고 가정합니다. 이제 p에서 가져온 데이터의 경우 1이고 q에서 가져온 데이터의 경우 -1인 z 레이블로 표시합니다. 그런 다음 혼합 데이터 세트의 확률은 다음과 같이 지정됩니다.
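두 분포에서 같은 수의 샘플을 섞었다는 가정 하에, 혼합 데이터 세트에서의 확률과 그 비율은 대략 다음과 같이 쓸 수 있습니다:

$$P(z=1 \mid \mathbf{x}) = \frac{p(\mathbf{x})}{p(\mathbf{x}) + q(\mathbf{x})}, \qquad \frac{P(z=1 \mid \mathbf{x})}{P(z=-1 \mid \mathbf{x})} = \frac{p(\mathbf{x})}{q(\mathbf{x})}$$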

 

 

Thus, if we use a logistic regression approach, where P(z=1∣x)=1/(1+exp(−ℎ(x))) (ℎ is a parameterized function), it follows that

 

따라서 P(z=1∣x)=1/1+exp⁡(−ℎ(x))(ℎ는 매개변수화된 함수임)인 로지스틱 회귀 접근법을 사용하면 다음과 같이 됩니다.
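로지스틱 회귀 형태를 대입하면 가중치가 ℎ(x)만으로 표현된다는 것이 핵심입니다. 대략 다음과 같이 전개됩니다:

$$\beta_i = \frac{P(z=1 \mid \mathbf{x}_i)}{P(z=-1 \mid \mathbf{x}_i)} = \frac{1/(1+\exp(-h(\mathbf{x}_i)))}{\exp(-h(\mathbf{x}_i))/(1+\exp(-h(\mathbf{x}_i)))} = \exp(h(\mathbf{x}_i))$$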

 

 

As a result, we need to solve two problems: first one to distinguish between data drawn from both distributions, and then a weighted empirical risk minimization problem in (4.7.5) where we weigh terms by Bi.

 

결과적으로 우리는 두 가지 문제를 풀어야 합니다. 첫 번째는 두 분포에서 추출된 데이터를 구별하는 문제이고, 그다음은 (4.7.5)에서 각 항에 Bi로 가중치를 부여하는 가중 경험적 위험 최소화 문제입니다.

 

Now we are ready to describe a correction algorithm. Suppose that we have a training set {(x1,y1),…,(xn,yn)} and an unlabeled test set {u1,…,um}. For covariate shift, we assume that xi for all 1 ≤ i ≤ n are drawn from some source distribution and ui for all 1 ≤ i ≤ m are drawn from the target distribution. Here is a prototypical algorithm for correcting covariate shift:

 

이제 보정 알고리즘을 설명할 준비가 되었습니다. 트레이닝 세트 {(x1,y1),…,(xn,yn)}와 레이블이 지정되지 않은 테스트 세트 {u1,…,um}이 있다고 가정합니다. 공변량 이동의 경우 모든 1 ≤ i ≤ n에 대한 xi는 일부 소스 분포에서 도출되고 모든 1 ≤ i ≤ m에 대한 ui는 대상 분포에서 도출된다고 가정합니다. 다음은 공변량 이동을 수정하기 위한 원형 알고리즘입니다.

 

  1. Generate a binary-classification training set: {(x1,−1),…,(xn,−1),(u1,1),…,(um,1)}.
  2. Train a binary classifier using logistic regression to get function ℎ.
  3. Weigh training data using Bi=exp⁡(ℎ(xi)) or better Bi=min(exp⁡(ℎ(xi)),c) for some constant c.
  4. Use weights Bi for training on {(x1,y1),…,(xn,yn)} in (4.7.5).
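위 알고리즘의 흐름을 scikit-learn의 LogisticRegression으로 옮겨 본 간단한 파이썬 스케치입니다. 실제 구현이라기보다 절차를 보여 주기 위한 예시이며, covariate_shift_weights라는 함수 이름, clip_c 상수, X_train/X_test 같은 변수 이름은 설명을 위해 임의로 둔 가정입니다.

import numpy as np
from sklearn.linear_model import LogisticRegression

def covariate_shift_weights(X_train, X_test, clip_c=10.0):
    # 1단계: 소스(훈련) 데이터에는 -1, 타깃(테스트) 데이터에는 1 레이블을 붙여
    #         이진 분류용 데이터 세트를 만든다.
    X = np.concatenate([X_train, X_test], axis=0)
    z = np.concatenate([-np.ones(len(X_train)), np.ones(len(X_test))])

    # 2단계: 로지스틱 회귀로 두 분포를 구별하는 함수 h를 학습한다.
    clf = LogisticRegression(max_iter=1000).fit(X, z)

    # 3단계: decision_function이 로그 오즈 h(x)를 주므로,
    #         Bi = exp(h(xi))를 상수 c로 잘라 가중치로 쓴다.
    h = clf.decision_function(X_train)
    return np.minimum(np.exp(h), clip_c)

# 4단계: 이 가중치를 (4.7.5)의 가중 경험적 위험 최소화에 사용한다.
#         예: weights = covariate_shift_weights(X_train, X_test)
#             model.fit(X_train, y_train, sample_weight=weights)  # sample_weight를 받는 모델을 가정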

Note that the above algorithm relies on a crucial assumption. For this scheme to work, we need that each data example in the target (e.g., test time) distribution had nonzero probability of occurring at training time. If we find a point where p(x)>0 but q(x)=0, then the corresponding importance weight should be infinity.

 

위의 알고리즘은 중요한 가정에 의존합니다. 이 체계가 작동하려면 대상(예: 테스트 시간) 분포의 각 데이터 예제가 훈련 시간에 발생할 확률이 0이 아니어야 합니다. p(x)>0이지만 q(x)=0인 점을 찾으면 해당 중요도 가중치는 무한대여야 합니다.

 

4.7.3.3. Label Shift Correction

Assume that we are dealing with a classification task with k categories. Using the same notation in Section 4.7.3.2, q and p are the source distribution (e.g., training time) and target distribution (e.g., test time), respectively. Assume that the distribution of labels shifts over time: q(y)≠p(y), but the class-conditional distribution stays the same: q(x∣y)=p(x∣y). If the source distribution q(y) is “wrong”, we can correct for that according to the following identity in the risk as defined in (4.7.2):

 

k개의 범주가 있는 분류 작업을 다룬다고 가정합니다. 섹션 4.7.3.2와 동일한 표기법을 사용하면 q와 p는 각각 소스 분포(예: 훈련 시점)와 목표 분포(예: 테스트 시점)입니다. 레이블의 분포는 시간이 지남에 따라 이동하지만(q(y)≠p(y)), 클래스 조건부 분포는 동일하게 유지된다고 가정합니다(q(x∣y)=p(x∣y)). 소스 분포 q(y)가 "잘못된" 경우, (4.7.2)에서 정의한 위험에 대한 다음 항등식에 따라 이를 보정할 수 있습니다:
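본문이 말하는 항등식은 q(x∣y)=p(x∣y) 가정을 이용해 대략 다음과 같이 복원할 수 있습니다:

$$\int\!\!\int l(f(\mathbf{x}), y)\, p(\mathbf{x} \mid y)\, p(y)\, d\mathbf{x}\, dy = \int\!\!\int l(f(\mathbf{x}), y)\, q(\mathbf{x} \mid y)\, q(y)\, \frac{p(y)}{q(y)}\, d\mathbf{x}\, dy$$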

Here, our importance weights will correspond to the label likelihood ratios

 

여기서 중요도 가중치는 레이블 우도 비율에 해당합니다.
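즉, 여기서의 중요도 가중치는 대략 다음과 같은 레이블 우도비입니다:

$$\beta_i \;\stackrel{\mathrm{def}}{=}\; \frac{p(y_i)}{q(y_i)}$$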

One nice thing about label shift is that if we have a reasonably good model on the source distribution, then we can get consistent estimates of these weights without ever having to deal with the ambient dimension. In deep learning, the inputs tend to be high-dimensional objects like images, while the labels are often simpler objects like categories.

 

레이블 이동에 대한 한 가지 좋은 점은 소스 분포에 대해 상당히 좋은 모델이 있는 경우 주변 차원을 처리하지 않고도 이러한 가중치의 일관된 추정치를 얻을 수 있다는 것입니다. 딥 러닝에서 입력은 이미지와 같은 고차원 개체인 경향이 있는 반면 레이블은 범주와 같은 단순한 개체인 경우가 많습니다.

 

To estimate the target label distribution, we first take our reasonably good off-the-shelf classifier (typically trained on the training data) and compute its confusion matrix using the validation set (also from the training distribution). The confusion matrix, C, is simply a k×k matrix, where each column corresponds to the label category (ground truth) and each row corresponds to our model’s predicted category. Each cell’s value cij is the fraction of total predictions on the validation set where the true label was j and our model predicted i.

 

대상 레이블 분포를 추정하기 위해 먼저 합리적으로 우수한 기성 분류기(일반적으로 훈련 데이터에 대해 훈련됨)를 선택하고 유효성 검사 세트(역시 훈련 분포에서)를 사용하여 혼동 행렬을 계산합니다. 혼동 행렬 C는 단순히 k×k 행렬이며 각 열은 레이블 범주(실측 정보)에 해당하고 각 행은 모델의 예측 범주에 해당합니다. 각 셀의 값 cij는 실제 레이블이 j이고 모델이 i를 예측한 검증 세트에 대한 총 예측의 비율입니다.

 

Now, we cannot calculate the confusion matrix on the target data directly, because we do not get to see the labels for the examples that we see in the wild, unless we invest in a complex real-time annotation pipeline. What we can do, however, is average all of our model's predictions at test time together, yielding the mean model outputs u(ŷ)∈Rk, whose ith element u(ŷi) is the fraction of total predictions on the test set where our model predicted i.

 

이제 우리는 목표 데이터에 대한 혼동 행렬을 직접 계산할 수 없습니다. 복잡한 실시간 주석 파이프라인에 투자하지 않는 한, 실제 환경에서 보는 예제들의 레이블을 볼 수 없기 때문입니다. 그러나 우리가 할 수 있는 것은 테스트 시점의 모든 모델 예측을 함께 평균하여 평균 모델 출력 u(ŷ)∈Rk를 얻는 것입니다. 여기서 i번째 요소 u(ŷi)는 테스트 세트에서 우리 모델이 i를 예측한 비율입니다.

 

It turns out that under some mild conditions—if our classifier was reasonably accurate in the first place, and if the target data contains only categories that we have seen before, and if the label shift assumption holds in the first place (the strongest assumption here), then we can estimate the test set label distribution by solving a simple linear system

 

몇 가지 온건한 조건 하에서, 즉 분류기가 애초에 합리적으로 정확하고, 목표 데이터에 이전에 본 범주만 포함되어 있으며, 레이블 이동 가정 자체가 성립한다면(여기서 가장 강력한 가정), 간단한 선형 시스템을 풀어 테스트 세트의 레이블 분포를 추정할 수 있습니다.
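여기서 말하는 '간단한 선형 시스템'은 혼동 행렬 C와 평균 모델 출력 u(ŷ)를 이용해 대략 다음과 같이 쓸 수 있습니다:

$$C\, p(\mathbf{y}) = u(\hat{\mathbf{y}})$$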

 

because as an estimate ∑_{j=1}^{k} c_ij p(y_j) = u(ŷ_i) holds for all 1 ≤ i ≤ k, where p(y_j) is the j-th element of the k-dimensional label distribution vector p(y). If our classifier is sufficiently accurate to begin with, then the confusion matrix C will be invertible, and we get a solution p(y) = C^(−1) u(ŷ).

 

추정치로서 ∑_{j=1}^{k} c_ij p(y_j) = u(ŷ_i)가 모든 1 ≤ i ≤ k에 대해 성립하기 때문입니다. 여기서 p(y_j)는 k차원 레이블 분포 벡터 p(y)의 j번째 요소입니다. 분류기가 애초에 충분히 정확하다면 혼동 행렬 C는 가역적이며, 해 p(y) = C^(−1) u(ŷ)를 얻습니다.

 

Because we observe the labels on the source data, it is easy to estimate the distribution q(y). Then for any training example i with label yi, we can take the ratio of our estimated p(yi)/q(yi) to calculate the weight Bi, and plug this into weighted empirical risk minimization in (4.7.5).

 


원본 데이터의 레이블을 관찰하기 때문에 분포 q(y)를 추정하기 쉽습니다. 그런 다음 레이블 yi가 있는 훈련 예제 i에 대해 추정된 p(yi)/q(yi)의 비율을 가져와 가중치 Bi를 계산하고 이를 (4.7.5)의 가중 경험적 위험 최소화에 연결할 수 있습니다.
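위 절차(혼동 행렬 계산 → 테스트 예측 평균 → 선형 시스템 풀기 → 가중치 계산)를 NumPy로 옮겨 본 스케치입니다. label_shift_weights라는 함수 이름과 val_preds, val_labels, test_preds 같은 변수 이름은 설명을 위한 가정이며, 예측과 레이블은 0..k-1 범위의 정수 배열이라고 가정합니다.

import numpy as np

def label_shift_weights(val_preds, val_labels, test_preds, num_classes):
    # 혼동 행렬 C: C[i, j] = 검증 세트에서 실제 레이블이 j인데 모델이 i로 예측한 비율
    C = np.zeros((num_classes, num_classes))
    for pred, label in zip(val_preds, val_labels):
        C[pred, label] += 1
    C /= len(val_labels)

    # 테스트 세트에서의 평균 모델 출력 u(ŷ): u[i] = 모델이 i로 예측한 비율
    u = np.bincount(test_preds, minlength=num_classes) / len(test_preds)

    # 선형 시스템 C p(y) = u(ŷ) 를 풀어 타깃 레이블 분포 p(y)를 추정 (C가 가역적이라고 가정)
    p_y = np.linalg.solve(C, u)

    # 소스 레이블 분포 q(y)는 검증 레이블에서 바로 추정
    q_y = np.bincount(val_labels, minlength=num_classes) / len(val_labels)

    # 클래스별 가중치 p(y)/q(y); 훈련 예제 i에는 weights[y_i]를 적용한다.
    return p_y / q_y

# 사용 예(가정): per_class_w = label_shift_weights(val_preds, val_labels, test_preds, k)
#               sample_weights = per_class_w[y_train]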

 

4.7.3.4. Concept Shift Correction

Concept shift is much harder to fix in a principled manner. For instance, in a situation where suddenly the problem changes from distinguishing cats from dogs to one of distinguishing white from black animals, it will be unreasonable to assume that we can do much better than just collecting new labels and training from scratch. Fortunately, in practice, such extreme shifts are rare. Instead, what usually happens is that the task keeps on changing slowly. To make things more concrete, here are some examples:

 

개념 이동은 원칙에 입각한 방식으로 고치기가 훨씬 더 어렵습니다. 예를 들어 문제가 갑자기 고양이와 개를 구별하는 것에서 흰 동물과 검은 동물을 구별하는 것으로 바뀌는 상황이라면, 새 레이블을 수집해서 처음부터 다시 훈련하는 것보다 훨씬 더 잘할 수 있다고 가정하는 것은 무리입니다. 다행히도 실제로는 이러한 극단적인 변화는 드뭅니다. 대신 보통은 작업이 계속해서 천천히 변해 갑니다. 좀 더 구체적으로, 몇 가지 예를 들면 다음과 같습니다:

 

  • In computational advertising, new products are launched, old products become less popular. This means that the distribution over ads and their popularity changes gradually and any click-through rate predictor needs to change gradually with it.
  • 전산 광고에서는 신제품이 출시되고 오래된 제품은 인기가 떨어집니다. 즉, 광고에 대한 분포와 그 인기도가 점진적으로 변하고 모든 클릭률 예측기가 그에 따라 점진적으로 변해야 함을 의미합니다.
  • Traffic camera lenses degrade gradually due to environmental wear, affecting image quality progressively.
  • 교통 카메라 렌즈는 환경적 마모로 인해 점진적으로 저하되어 이미지 품질에 점진적으로 영향을 미칩니다.
  • News content changes gradually (i.e., most of the news remains unchanged but new stories appear).
  • 뉴스 콘텐츠는 점진적으로 변경됩니다(즉, 대부분의 뉴스는 변경되지 않지만 새로운 기사가 나타남).

In such cases, we can use the same approach that we used for training networks to make them adapt to the change in the data. In other words, we use the existing network weights and simply perform a few update steps with the new data rather than training from scratch.

 

이러한 경우 네트워크 훈련에 사용한 것과 동일한 접근 방식을 사용하여 데이터의 변화에 적응할 수 있습니다. 즉, 우리는 기존 네트워크 가중치를 사용하고 처음부터 훈련하는 대신 새 데이터로 몇 가지 업데이트 단계를 수행합니다.

 

4.7.4. A Taxonomy of Learning Problems

 

Armed with knowledge about how to deal with changes in distributions, we can now consider some other aspects of machine learning problem formulation.

 

분포의 변화를 다루는 방법에 대한 지식으로 무장한 우리는 이제 기계 학습 문제 공식화의 다른 측면을 고려할 수 있습니다.

 

4.7.4.1. Batch Learning

In batch learning, we have access to training features and labels {(x1,y1),…,(xn,yn)}, which we use to train a model f(x). Later on, we deploy this model to score new data (x,y) drawn from the same distribution. This is the default assumption for any of the problems that we discuss here. For instance, we might train a cat detector based on lots of pictures of cats and dogs. Once we trained it, we ship it as part of a smart catdoor computer vision system that lets only cats in. This is then installed in a customer’s home and is never updated again (barring extreme circumstances).

 

배치 학습에서는 모델 f(x)를 훈련하는 데 사용하는 훈련 특징과 레이블 {(x1,y1),…,(xn,yn)}에 접근할 수 있습니다. 나중에 이 모델을 배포하여 동일한 분포에서 추출된 새 데이터 (x,y)의 점수를 매깁니다. 이것은 여기서 논의하는 모든 문제의 기본 가정입니다. 예를 들어 많은 고양이와 개 사진을 기반으로 고양이 감지기를 훈련시킬 수 있습니다. 훈련을 마치면 고양이만 들여보내는 스마트 캣도어 컴퓨터 비전 시스템의 일부로 출하합니다. 그런 다음 고객의 집에 설치되고 (극단적인 상황이 아닌 한) 다시는 업데이트되지 않습니다.

 

4.7.4.2. Online Learning

 

Now imagine that the data (x1,y1) arrives one sample at a time. More specifically, assume that we first observe xi, then we need to come up with an estimate f(xi) and only once we have done this, we observe yi and with it, we receive a reward or incur a loss, given our decision. Many real problems fall into this category. For example, we need to predict tomorrow’s stock price, this allows us to trade based on that estimate and at the end of the day we find out whether our estimate allowed us to make a profit. In other words, in online learning, we have the following cycle where we are continuously improving our model given new observations:

 

이제 데이터 (x1,y1)가 한 번에 하나씩 도착한다고 상상해 보십시오. 좀 더 구체적으로, 먼저 xi를 관찰한 다음 추정치 f(xi)를 내놓아야 하고, 그렇게 한 후에야 yi를 관찰하며, 우리의 결정에 따라 보상을 받거나 손실을 입습니다. 많은 실제 문제가 이 범주에 속합니다. 예를 들어 내일의 주가를 예측해야 한다면, 그 추정치를 바탕으로 거래할 수 있고, 하루가 끝날 때 그 추정치 덕분에 수익을 냈는지 알 수 있습니다. 즉, 온라인 학습에서는 새로운 관찰이 주어질 때마다 모델을 지속적으로 개선하는 다음과 같은 주기를 갖습니다:
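원문에 있는 온라인 학습의 주기를 대략 복원해 보면 다음과 같습니다:

$$\text{model } f_t \;\longrightarrow\; \text{data } \mathbf{x}_t \;\longrightarrow\; \text{estimate } f_t(\mathbf{x}_t) \;\longrightarrow\; \text{observation } y_t \;\longrightarrow\; \text{loss } l(y_t, f_t(\mathbf{x}_t)) \;\longrightarrow\; \text{model } f_{t+1}$$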

 

 

4.7.4.3. Bandits

 

Bandits are a special case of the problem above. While in most learning problems we have a continuously parametrized function f where we want to learn its parameters (e.g., a deep network), in a bandit problem we only have a finite number of arms that we can pull, i.e., a finite number of actions that we can take. It is not very surprising that for this simpler problem stronger theoretical guarantees in terms of optimality can be obtained. We list it mainly since this problem is often (confusingly) treated as if it were a distinct learning setting.

 

산적(bandit) 문제는 위 문제의 특수한 경우입니다. 대부분의 학습 문제에서는 매개변수를 학습하려는, 연속적으로 매개변수화된 함수 f(예: 심층 네트워크)가 있는 반면, 산적 문제에서는 당길 수 있는 팔의 수, 즉 취할 수 있는 행동의 수가 유한합니다. 이 더 단순한 문제에 대해 최적성 측면에서 더 강력한 이론적 보장을 얻을 수 있다는 것은 그리 놀라운 일이 아닙니다. 이 문제가 종종 (혼란스럽게도) 별개의 학습 설정인 것처럼 취급되기 때문에 주로 언급해 둡니다.

 

4.7.4.4. Control

많은 경우 환경은 우리가 한 일을 기억합니다. 반드시 적대적인 방식은 아니지만 그냥 기억하고 반응은 이전에 일어난 일에 따라 달라집니다. 예를 들어, 커피 보일러 컨트롤러는 이전에 보일러를 가열했는지 여부에 따라 다른 온도를 관찰합니다. PID(proportional-integral-derivative) 컨트롤러 알고리즘이 널리 사용됩니다. 마찬가지로 뉴스 사이트에서 사용자의 행동은 이전에 사용자에게 보여준 내용에 따라 달라집니다(예: 사용자는 대부분의 뉴스를 한 번만 읽음). 그러한 많은 알고리즘은 의사결정이 덜 무작위적으로 보이도록 행동하는 환경의 모델을 형성합니다. 최근 제어 이론(예: PID 변형)은 하이퍼파라미터를 자동으로 조정하여 더 나은 풀림 및 재구성 품질을 달성하고 생성된 텍스트의 다양성과 생성된 이미지의 재구성 품질을 개선하는 데에도 사용되었습니다(Shao et al., 2020).

 

4.7.4.5. Reinforcement Learning

In the more general case of an environment with memory, we may encounter situations where the environment is trying to cooperate with us (cooperative games, in particular for non-zero-sum games), or others where the environment will try to win. Chess, Go, Backgammon, or StarCraft are some of the cases in reinforcement learning. Likewise, we might want to build a good controller for autonomous cars. The other cars are likely to respond to the autonomous car’s driving style in nontrivial ways, e.g., trying to avoid it, trying to cause an accident, and trying to cooperate with it.

 

메모리가 있는 환경의 보다 일반적인 경우 환경이 우리와 협력하려고 하는 상황(특히 논제로섬 게임의 경우 협력 게임) 또는 환경이 이기려고 하는 상황에 직면할 수 있습니다. 체스, 바둑, 주사위 놀이 또는 스타크래프트는 강화 학습의 일부 사례입니다. 마찬가지로 우리는 자율주행차를 위한 좋은 컨트롤러를 만들고 싶을 수도 있습니다. 다른 차량들은 자율주행차의 운전 스타일에 피하려고 노력하고, 사고를 일으키고, 협력하려고 하는 등 사소하지 않은 방식으로 반응할 가능성이 높습니다.

 

4.7.4.6. Considering the Environment

 

One key distinction between the different situations above is that the same strategy that might have worked throughout in the case of a stationary environment, might not work throughout when the environment can adapt. For instance, an arbitrage opportunity discovered by a trader is likely to disappear once he starts exploiting it. The speed and manner at which the environment changes determines to a large extent the type of algorithms that we can bring to bear. For instance, if we know that things may only change slowly, we can force any estimate to change only slowly, too. If we know that the environment might change instantaneously, but only very infrequently, we can make allowances for that. These types of knowledge are crucial for the aspiring data scientist to deal with concept shift, i.e., when the problem that he is trying to solve changes over time.

 

위의 여러 상황 사이의 주요 차이점 중 하나는 고정된 환경의 경우 전체적으로 작동했을 수 있는 동일한 전략이 환경이 적응할 수 있는 경우 내내 작동하지 않을 수 있다는 것입니다. 예를 들어, 트레이더가 발견한 차익 거래 기회는 일단 활용하기 시작하면 사라질 가능성이 높습니다. 환경이 변화하는 속도와 방식은 우리가 감당할 수 있는 알고리즘의 유형을 크게 결정합니다. 예를 들어, 사물이 천천히 변할 수 있다는 것을 알고 있다면 추정치도 천천히 변하도록 강제할 수 있습니다. 환경이 순간적으로 변할 수 있지만 매우 드물다는 것을 알고 있다면 이를 허용할 수 있습니다. 이러한 유형의 지식은 야심 찬 데이터 과학자가 개념 변화, 즉 그가 해결하려는 문제가 시간이 지남에 따라 변화하는 경우에 대처하는 데 매우 중요합니다.

 

4.7.5. Fairness, Accountability, and Transparency in Machine Learning

Finally, it is important to remember that when you deploy machine learning systems you are not merely optimizing a predictive model—you are typically providing a tool that will be used to (partially or fully) automate decisions. These technical systems can impact the lives of individuals subject to the resulting decisions. The leap from considering predictions to decisions raises not only new technical questions, but also a slew of ethical questions that must be carefully considered. If we are deploying a medical diagnostic system, we need to know for which populations it may work and which it may not. Overlooking foreseeable risks to the welfare of a subpopulation could cause us to administer inferior care. Moreover, once we contemplate decision-making systems, we must step back and reconsider how we evaluate our technology. Among other consequences of this change of scope, we will find that accuracy is seldom the right measure. For instance, when translating predictions into actions, we will often want to take into account the potential cost sensitivity of erring in various ways. If one way of misclassifying an image could be perceived as a racial sleight of hand, while misclassification to a different category would be harmless, then we might want to adjust our thresholds accordingly, accounting for societal values in designing the decision-making protocol. We also want to be careful about how prediction systems can lead to feedback loops. For example, consider predictive policing systems, which allocate patrol officers to areas with high forecasted crime. It is easy to see how a worrying pattern can emerge:

 

마지막으로, 기계 학습 시스템을 배포할 때 단순히 예측 모델을 최적화하는 것이 아니라, 일반적으로 의사 결정을 (부분적으로 또는 완전히) 자동화하는 데 사용될 도구를 제공한다는 점을 기억하는 것이 중요합니다. 이러한 기술 시스템은 그 결정의 영향을 받는 개인의 삶에 영향을 미칠 수 있습니다. 예측을 고려하는 것에서 의사 결정으로의 도약은 새로운 기술적 질문뿐만 아니라 신중하게 고려해야 할 수많은 윤리적 질문을 제기합니다. 의료 진단 시스템을 배포한다면 어떤 집단에는 효과가 있고 어떤 집단에는 그렇지 않을 수 있는지 알아야 합니다. 하위 집단의 복지에 대한 예측 가능한 위험을 간과하면 질 낮은 치료를 제공하게 될 수 있습니다. 또한 의사 결정 시스템을 고려하기 시작하면, 한 걸음 물러나 기술을 평가하는 방법 자체를 재고해야 합니다. 이러한 범위 변경의 여러 결과 중 하나로, 정확도가 올바른 척도인 경우는 거의 없다는 사실을 알게 될 것입니다. 예를 들어 예측을 행동으로 옮길 때, 다양한 방식으로 틀렸을 때의 비용 민감도를 고려하고 싶은 경우가 많습니다. 이미지를 잘못 분류하는 어떤 방식은 인종적으로 모욕적인 것으로 받아들여질 수 있는 반면 다른 범주로의 오분류는 무해하다면, 의사 결정 프로토콜을 설계할 때 사회적 가치를 반영하여 임계값을 그에 맞게 조정하고 싶을 수 있습니다. 또한 예측 시스템이 피드백 루프로 이어질 수 있는 방식에 대해서도 주의해야 합니다. 예를 들어 범죄 발생이 높게 예보된 지역에 순찰 경관을 배치하는 예측 치안 시스템을 생각해 보십시오. 걱정스러운 패턴이 어떻게 나타날 수 있는지 쉽게 알 수 있습니다:

 

  1. Neighborhoods with more crime get more patrols.
    범죄가 많은 동네일수록 더 많은 순찰을 받습니다.
  2. Consequently, more crimes are discovered in these neighborhoods, entering the training data available for future iterations.
    결과적으로 이러한 동네에서 더 많은 범죄가 발견되어, 향후 반복에 사용할 수 있는 훈련 데이터에 포함됩니다.
  3. Exposed to more positives, the model predicts yet more crime in these neighborhoods.
    더 많은 양성(positive) 사례에 노출된 모델은 이러한 동네에서 더 많은 범죄를 예측합니다.
  4. In the next iteration, the updated model targets the same neighborhood even more heavily leading to yet more crimes discovered, etc.
    다음 반복에서 업데이트된 모델은 동일한 이웃을 훨씬 더 심하게 대상으로 하여 더 많은 범죄가 발견되는 등의 결과를 낳습니다.

Often, the various mechanisms by which a model’s predictions become coupled to its training data are unaccounted for in the modeling process. This can lead to what researchers call runaway feedback loops. Additionally, we want to be careful about whether we are addressing the right problem in the first place. Predictive algorithms now play an outsize role in mediating the dissemination of information. Should the news that an individual encounters be determined by the set of Facebook pages they have Liked? These are just a few among the many pressing ethical dilemmas that you might encounter in a career in machine learning.

 

종종 모델의 예측이 훈련 데이터와 결합되는 다양한 메커니즘이 모델링 프로세스에서 설명되지 않습니다. 이것은 연구자들이 런어웨이 피드백 루프라고 부르는 것으로 이어질 수 있습니다. 또한 처음부터 올바른 문제를 해결하고 있는지에 대해 주의를 기울이고 싶습니다. 예측 알고리즘은 이제 정보 보급을 중재하는 데 큰 역할을 합니다. 개인이 접하는 뉴스는 좋아요를 누른 Facebook 페이지 집합에 의해 결정되어야 합니까? 이는 기계 학습 분야에서 경력을 쌓으면서 직면할 수 있는 많은 시급한 윤리적 딜레마 중 일부에 불과합니다.

 

4.7.6. Summary

In many cases training and test sets do not come from the same distribution. This is called distribution shift. The risk is the expectation of the loss over the entire population of data drawn from their true distribution. However, this entire population is usually unavailable. Empirical risk is an average loss over the training data to approximate the risk. In practice, we perform empirical risk minimization.

 

많은 경우 훈련 세트와 테스트 세트는 동일한 분포에서 나오지 않습니다. 이를 분포 이동(distribution shift)이라고 합니다. 위험은 실제 분포에서 추출된 전체 데이터 모집단에 대한 손실의 기댓값입니다. 그러나 이 전체 모집단은 일반적으로 얻을 수 없습니다. 경험적 위험은 위험을 근사하기 위한 훈련 데이터에 대한 평균 손실입니다. 실제로는 경험적 위험 최소화를 수행합니다.

 

Under the corresponding assumptions, covariate and label shift can be detected and corrected for at test time. Failure to account for this bias can become problematic at test time. In some cases, the environment may remember automated actions and respond in surprising ways. We must account for this possibility when building models and continue to monitor live systems, open to the possibility that our models and the environment will become entangled in unanticipated ways.

 

해당 가정 하에서 공변량 및 레이블 이동을 감지하고 테스트 시간에 수정할 수 있습니다. 이 편향을 고려하지 않으면 테스트 시 문제가 될 수 있습니다. 어떤 경우에는 환경이 자동화된 작업을 기억하고 놀라운 방식으로 대응할 수 있습니다. 우리는 모델을 구축할 때 이러한 가능성을 고려해야 하고 라이브 시스템을 계속 모니터링하여 모델과 환경이 예상치 못한 방식으로 얽힐 가능성을 열어 두어야 합니다.

 

4.7.7. Exercises

  1. What could happen when we change the behavior of a search engine? What might the users do? What about the advertisers?
  2. Implement a covariate shift detector. Hint: build a classifier.
  3. Implement a covariate shift corrector.
  4. Besides distribution shift, what else could affect how the empirical risk approximates the risk?

 
