Udemy - AWS Machine Learning, AI, SageMaker - With Python
Summary
Section 3 Linear Regression
23. Summary
Squared Loss Function is parabolic in nature. It has the important property of telling us not only the loss at a given weight, but also which way to go to minimize the loss
Gradient Descent optimization algorithm uses loss function to move the weights of all the features and iteratively adjusts the weights until optimal value is reached
Batch Gradient Descent predicts y value for all training examples and then adjusts the value of weights based on loss. It can converge much slower when training set is very large. Training set order does not matter as every single example in the training set is considered before making adjustments.
Stochastic Gradient Descent predicts y value for the next training example and immediately adjusts the value of weights. It can converge faster when training set is very large. Training set should be in random order, otherwise the model will not learn correctly. AWS ML uses Stochastic Gradient Descent
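The difference between the two update schedules can be sketched in a few lines of Python. This is a minimal illustration for a one-feature linear model with squared loss, not the AWS ML implementation; the learning rates, epoch counts, and the synthetic data (y = 2x + 1) are assumptions chosen for the example.

```python
import random

def predict(w0, w1, x):
    """Linear model: y = w0 + w1 * x."""
    return w0 + w1 * x

def batch_gradient_descent(data, lr=0.05, epochs=2000):
    """Adjust weights once per pass, using the average gradient over all examples."""
    w0, w1 = 0.0, 0.0
    n = len(data)
    for _ in range(epochs):
        g0 = sum(predict(w0, w1, x) - y for x, y in data) / n
        g1 = sum((predict(w0, w1, x) - y) * x for x, y in data) / n
        w0 -= lr * g0
        w1 -= lr * g1
    return w0, w1

def stochastic_gradient_descent(data, lr=0.01, epochs=1000, seed=0):
    """Adjust weights immediately after each example; shuffle every pass."""
    rng = random.Random(seed)
    data = list(data)
    w0, w1 = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # random order matters for SGD
        for x, y in data:
            error = predict(w0, w1, x) - y
            w0 -= lr * error
            w1 -= lr * error * x
    return w0, w1

# Hypothetical noise-free training set generated from y = 2x + 1
data = [(x, 2.0 * x + 1.0) for x in [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]]
```

Both functions should recover weights close to w0 = 1 and w1 = 2 on this data; the batch version makes one adjustment per pass, while the stochastic version makes one adjustment per example.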
Section 4 AWS - Linear Regression Models
27. Concept - How to evaluate regression model accuracy?
Linear Regression - Residuals
- AWS ML Console provides a histogram that shows the distribution of examples that were overestimated and underestimated, and to what extent
- Available as "explore model performance" option under Evaluation -> Summary
- Ideal: Over/Under estimation should be a normal curve centered at 0.
- Structural Issue: When you observe the vast majority of examples falling on one side. Adding more relevant features can help remedy the situation.
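The idea behind the residual histogram can be sketched as follows. This is a minimal illustration, assuming a residual is defined as actual minus predicted (so a positive residual means the model underestimated); the function name and sample values are hypothetical.

```python
def residual_summary(actual, predicted):
    """Count over/under-estimated examples and report the mean residual."""
    # Assumption: residual = actual - predicted
    residuals = [a - p for a, p in zip(actual, predicted)]
    return {
        "underestimated": sum(1 for r in residuals if r > 0),
        "overestimated": sum(1 for r in residuals if r < 0),
        "mean_residual": sum(residuals) / len(residuals),
    }

summary = residual_summary([10, 20, 30, 40], [12, 18, 33, 40])
```

A mean residual far from 0, or counts that fall heavily on one side, would point to the structural issue described above.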
31. Model Performance Summary and Conclusion
RMSE (Root Mean Square Error) is the evaluation metric for Linear Regression. Smaller the value of RMSE, better the predictive accuracy of model. Perfect model would have RMSE of 0.
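As a minimal sketch, RMSE can be computed directly from actual and predicted values. The function name and sample values are illustrative only.

```python
import math

def rmse(actual, predicted):
    """Root Mean Square Error: square root of the mean squared difference."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

perfect = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])  # a perfect model scores 0
imperfect = rmse([0.0, 0.0], [3.0, 4.0])          # sqrt((9 + 16) / 2)
```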
AWS ML requires the input data to be available in one of the following:
1. CSV file available in S3
2. AWS Redshift Datawarehouse
3. AWS Relational Database Service (RDS) MySQL DB
Batch Prediction results are stored by AWS ML to S3 in the specified bucket
We pulled the data from S3 to local folder and plotted them
Based on the distribution of data, AWS ML suggests a recipe for processing data.
In case of numeric features, it may suggest binning the data instead of treating it as a raw numeric value
For this example, treating x as numeric provided best results
Section 5 Adding Features To Improve Model
35. Summary
1. Underfitting occurs when model does not accurately capture relationship between features and target
2. Underfitting would cause large training errors and evaluation errors
Training RMSE: 385.1816, Evaluation RMSE: 257.8979, Baseline RMSE: 437.311
3. Evaluation Summary - The prediction overestimation and underestimation histogram provided by the AWS ML console gives important clues on how the model is behaving. Underestimation and overestimation need to be balanced and centered around 0
4. Box plot also highlights distribution differences between predicted and actual values
5. To address underfitting, add higher order polynomials or more relevant features to capture complex relationship
Training RMSE: 132.2032, Evaluation RMSE: 63.6847, Baseline RMSE: 437.311
6. When working with datasets containing 100s or even 1000s of features, it is important to rely on these metrics and distributions to gain insight into model performance
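The Baseline RMSE quoted above can be read as the score of a trivial reference model. The sketch below assumes the baseline is a hypothetical model that always predicts the mean of the target; that definition is an assumption stated here, not something spelled out in these notes.

```python
import math

def baseline_rmse(targets):
    """RMSE of a model that always predicts the mean target (assumed baseline)."""
    mean = sum(targets) / len(targets)
    return math.sqrt(sum((t - mean) ** 2 for t in targets) / len(targets))

# Hypothetical target values
baseline = baseline_rmse([1.0, 2.0, 3.0, 4.0, 5.0])
```

A trained model is only useful if its evaluation RMSE is clearly below this value.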
Section 6 Normalization
37. Concept: Normalization to smoothen magnitude differences
Normalization Transformation (Numeric)
- When there are very large differences in magnitude of features, features that have large magnitude can dominate Model
- Example : We saw this in Quadratic Extra Features dataset
- Normalization is a process of transforming features to have a mean of 0 and variance of 1. This will ensure all features have similar scale
: Feature normalized = (feature - mean) / (sigma)
where,
mean = mean of feature x
sigma = standard deviation of feature x
: Usage : normalize(numericFeature)
- Optimization algorithm may also converge faster with normalized features compared to features that have very large scale differences
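The formula above can be sketched in Python as follows. This is an illustration of the arithmetic only, not the AWS ML recipe transformation itself; using the population standard deviation is an assumption.

```python
import statistics

def normalize(values):
    """Transform values to mean 0 and variance 1: (feature - mean) / sigma."""
    mean = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("cannot normalize a constant feature")
    return [(v - mean) / sigma for v in values]

# Hypothetical feature with a large magnitude
scaled = normalize([1000.0, 2000.0, 3000.0, 4000.0, 5000.0])
```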
39. Summary
1. Having lot of features and complex features can help improve prediction accuracy
2. When feature ranges differ by orders of magnitude, the features with large magnitude can dominate the outcome. Normalization is a process of transforming features to have a mean of 0 and variance of 1. This will ensure all features have similar scale.
3. Without Normalization:
Training RMSE: 83973.66, Evaluation RMSE: 158260.62, Baseline RMSE: 437.31
4. With Normalization:
Training RMSE: 72.35, Evaluation RMSE: 51.7387, Baseline RMSE: 437.31
5. Normalization can be easily enabled using AWS ML Transformation recipes
Section 7 Adding Complex Features
46. Summary
Adding polynomial features allows us to fit more complex shapes
To add polynomial features that combine all input features, use the scikit-learn library. Anaconda includes these modules by default
We saw good performance with degree 4 and any additional feature may bring incremental improvement, but with added complexity of managing features.
1. Model Degree 1 Training RMSE:0.5063, Evaluation RMSE:0.4308, Baseline RMSE:0.689
2. Model Degree 4 Training RMSE:0.2563, Evaluation RMSE:0.1493, Baseline RMSE:0.689
3. Model Degree 15 Training RMSE:0.2984, Evaluation RMSE:0.1222, Baseline RMSE:0.689
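To show what combining all input features into polynomial terms means, here is a minimal standard-library sketch. It is not the scikit-learn implementation; unlike scikit-learn's default it omits the constant bias column, and the function name is hypothetical.

```python
from itertools import combinations_with_replacement
from math import prod

def polynomial_features(row, degree):
    """Return all products of input features up to the given degree."""
    features = []
    for d in range(1, degree + 1):
        # Each combination picks d feature indices, with repeats allowed
        for combo in combinations_with_replacement(range(len(row)), d):
            features.append(prod(row[i] for i in combo))
    return features

# Two inputs a=2, b=3 at degree 2 give: a, b, a^2, a*b, b^2
expanded = polynomial_features([2, 3], 2)
```

The number of generated features grows quickly with degree, which is the added complexity of managing features mentioned above.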
Section 8 Kaggle Bike Hourly Rental Prediction
50. Linear Regression Wrapup and Summary
AWS ML - Linear Regression
* Linear Model
* Gradient Descent and Stochastic Gradient Descent
* Squared Error Loss Function
* AWS ML Training, Evaluation, Interactive Prediction, Batch Prediction
* Prediction Quality
- RMSE
- Residual Histograms
* Data visualization
* Normalization
* Higher order polynomials
Section 9 - Logistic Regression Models
In short: Linear Regression gives continuous output. i.e. any value between a range of values. ... GLM(Generalized linear models) does not assume a linear relationship between dependent and independent variables. However, it assumes a linear relationship between link function and independent variables in logit model.
https://stackoverflow.com/questions/12146914/what-is-the-difference-between-linear-regression-and-logistic-regression
https://techdifferences.com/difference-between-linear-and-logistic-regression.html
58. Summary
Binary Classifier : Predicts positive class probability of an observation
Logistic or Sigmoid function has an important property where output is between 0 and 1 for any input. This output is used by binary classifiers as a probability of positive class.
True Positive - Samples that are actual-positives correctly predicted as positive
True Negative - Samples that are actual-negatives correctly predicted as negative
False Negative - Samples that are actual-positives incorrectly predicted as negative
False Positive - Samples that are actual-negatives incorrectly predicted as positive
Logistic Loss Function is convex in nature. It has the important property of telling us not only the loss at a given weight, but also which way to go to minimize the loss
Gradient Descent optimization algorithm uses loss function to move the weights of all the features and iteratively adjusts the weights until optimal value is reached
Batch Gradient Descent predicts y value for all training examples and then adjusts the value of weights based on loss. It can converge much slower when training set is very large. Training set order does not matter as every single example in the training set is considered before making adjustments.
Stochastic Gradient Descent predicts y value for the next training example and immediately adjusts the value of weights. It can converge faster when training set is very large. Training set should be in random order, otherwise the model will not learn correctly. AWS ML uses Stochastic Gradient Descent
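The sigmoid function and the logistic loss summarized above can be sketched as follows. This is a minimal illustration; the clipping constant `eps` is an assumption added to avoid taking the log of 0.

```python
import math

def sigmoid(z):
    """Map any real input to a value between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # avoids overflow for large negative z
    return e / (1.0 + e)

def logistic_loss(y, p, eps=1e-15):
    """Loss for one example: y is the actual class (0 or 1), p the predicted probability."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

confident_right = logistic_loss(1, 0.9)
confident_wrong = logistic_loss(1, 0.1)
```

The loss is small when the predicted probability agrees with the actual class and grows sharply when a confident prediction is wrong.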
Section 10
62
Classification Metrics
True Positive = count(model correctly predicted positives). Students who passed exam correctly classified as pass.
True Negative = count (model correctly predicted negatives). Students who failed exam correctly classified as fail.
False Positive = count (model misclassified negative as positive). Students who failed exam incorrectly classified as pass.
False Negative = count (model misclassified positive as negative). Students who passed exam incorrectly classified as fail.
* True Positive Rate, Recall, Probability of detection - Fraction of positive predicted correctly. Larger value indicates better predictive accuracy.
TPR = True Positive / Actual Positive
* False Positive Rate, probability of false alarm - Fraction of negative predicted as positive. Smaller value indicates better predictive accuracy
FPR = False Positive / Actual Negative
* Precision - Fraction of true positive among all predicted positive. Larger value indicates better predictive accuracy
Precision = True Positive / Predicted Positive
* Accuracy - Fraction of correct predictions. Larger value indicates better predictive accuracy
Accuracy = (True Positive + True Negative) / n
where, n is the number of examples
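The four metrics above follow directly from the confusion counts. A minimal sketch, assuming labels are encoded as 1 (positive, pass) and 0 (negative, fail) and that every denominator is non-zero; the sample labels are hypothetical.

```python
def classification_metrics(actual, predicted):
    """Compute TPR, FPR, Precision and Accuracy from 0/1 labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return {
        "tpr": tp / (tp + fn),              # True Positive / Actual Positive
        "fpr": fp / (fp + tn),              # False Positive / Actual Negative
        "precision": tp / (tp + fp),        # True Positive / Predicted Positive
        "accuracy": (tp + tn) / len(pairs), # correct predictions / n
    }

metrics = classification_metrics(
    actual=[1, 1, 1, 0, 0, 0, 0, 1],
    predicted=[1, 1, 0, 0, 0, 1, 0, 1],
)
```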
63.
Classification Insights with AWS Histograms
Histogram - Binary Classifier
* Positive and Negative histograms
* Interactive tool to test effect of various cut-off thresholds
* Ability to save a threshold for the model
* Available under :
Model -> Evaluation Summary -> Explore Performance
https://docs.aws.amazon.com/machine-learning/latest/dg/binary-model-insights.html
64
Concept: AUC Metric
AUC - Binary Classifier
* Area Under Curve(AUC) metric - 0 to 1. Larger Value indicates better predictive accuracy
* AUC is the area of a curve formed by plotting True Positive Rate against False positive Rate at different cut-off thresholds
* AUC value of 0.5 is baseline and it is considered random-guess
* AUC closer to 1 indicates better predictive accuracy
* AUC closer to 0 indicates model has learned correct patterns, but is flipping predictions (0's are predicted as 1's and vice versa).
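The construction described above (True Positive Rate against False Positive Rate at different cut-off thresholds) can be sketched as follows. This is an illustrative trapezoid-rule calculation, not how AWS ML computes the metric internally; it assumes the data contains at least one positive and one negative example.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points, one per distinct cut-off threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(labels, scores):
    """Area under the ROC curve using the trapezoid rule."""
    points = roc_points(labels, scores)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

labels = [0, 0, 1, 1]
perfect_auc = auc(labels, [0.1, 0.2, 0.8, 0.9])  # positives always score higher
flipped_auc = auc(labels, [0.9, 0.8, 0.2, 0.1])  # predictions are flipped
```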
69 Summary
For Binary Classification, Area Under Curve (AUC) is the evaluation metric to assess the quality of model
AUC is the area of a curve formed by plotting True Positive Rate against False Positive Rate at different cut-off thresholds.
* AUC metric closer to 1 indicates highly accurate prediction
* AUC metric 0.5 indicates random guess - Baseline AUC
* AUC metric closer to 0 indicates model has learned from the features, but predictions are flipped
Advanced Metrics
* Accuracy - Fraction of correct predictions. Larger value indicates better predictive accuracy
* True Positive Rate - Probability of detection. Out of all positive, how many were correctly predicted as positive. Larger value indicates better predictive accuracy
* False Positive Rate - Probability of false alarm. Smaller value indicates better predictive accuracy. Out of all negatives, how many were incorrectly predicted as positive.
* Precision - Out of all predicted as positive, how many are true positive? Larger value indicates better predictive accuracy.
Section 11
72 Concept: Evaluating Predictive Quality of Multiclass Classifiers
Multi-class metrics
* F1 Score - Harmonic mean of Recall and Precision. Larger F1 Score indicates better predictive accuracy. Binary metric
F1 Score = 2 X Precision X Recall / (Precision + Recall)
* Average F1 Score - For multi-class problems, the average of class-wise F1 scores is used for assessing predictive quality
* Baseline F1 Score - Hypothetical model that predicts only the most frequent class as the answer
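A minimal sketch of the three scores above, treating each class in turn as the positive class and averaging the per-class F1 scores. The function names and sample labels are hypothetical, and giving a class with no true positives an F1 of 0 is an assumption of this sketch.

```python
from collections import Counter

def f1_for_class(actual, predicted, cls):
    """Binary F1 score with cls treated as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == cls and p == cls)
    fp = sum(1 for a, p in pairs if a != cls and p == cls)
    fn = sum(1 for a, p in pairs if a == cls and p != cls)
    if tp == 0:
        return 0.0  # assumption: no true positives means F1 of 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def average_f1(actual, predicted):
    """Average of class-wise F1 scores over the classes present in actual."""
    classes = sorted(set(actual))
    return sum(f1_for_class(actual, predicted, c) for c in classes) / len(classes)

def baseline_average_f1(actual):
    """Average F1 of a model that always predicts the most frequent class."""
    most_frequent = Counter(actual).most_common(1)[0][0]
    return average_f1(actual, [most_frequent] * len(actual))

actual = ["a", "a", "a", "b", "b", "c"]
predicted = ["a", "a", "b", "b", "b", "c"]
```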
Concept: Confusion Matrix To Evaluate Predictive Quality
Multiclass - Metrics - Confusion Matrix
* Accessible from Model -> Evaluation Summary -> Explore Model performance
* Concise table that shows percentage and count of correct classification and incorrect classifications
* Visual look at model performance
* Up to 10 classes are shown - listed from most frequent to least frequent
* For more than 10 classes, the 9 most frequent classes are shown and the 10th class collapses the rest of the classes, labeled as "others"
* Option to download confusion matrix
* https://docs.aws.amazon.com/machine-learning/latest/dg/multiclass-model-insights.html
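The counts behind such a table can be sketched as follows, with rows as the true class and columns as the predicted class, and classes ordered from most to least frequent. This is an illustration only; it does not reproduce the console's percentages or the collapsing of classes beyond the 10th.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Return (classes, matrix) where matrix[i][j] counts true class i predicted as class j."""
    frequency = Counter(actual)
    for p in predicted:
        if p not in frequency:
            frequency[p] = 0  # class that was predicted but never occurs
    classes = [c for c, _ in frequency.most_common()]  # most frequent first
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return classes, matrix

classes, matrix = confusion_matrix(
    ["a", "a", "a", "b", "b", "c"],
    ["a", "a", "b", "b", "b", "c"],
)
```

Diagonal cells hold correct classifications; every off-diagonal cell is a specific kind of mistake.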
77. Summary
Multi-Class Evaluation Metric
1. F1 Score is a binary classification metric. It is the harmonic mean of precision and recall
F1 Score = 2 X Precision X Recall / (Precision + Recall)
Higher F1 Score reflects better predictive accuracy
2. Multi-Class Evaluation
Average of class wise F1 Score
3. Baseline F1 Score = Hypothetical model that predicts only most frequent class as the answer
4. Visualization - Confusion Matrix - Available on AWS ML Console
Matrix. Rows = true class. Columns = predicted class
Cell color - diagonal indicates true class prediction %
Cell color - non-diagonal indicates incorrect prediction %
Last column is F1 score for that class. Last but one column is true class distribution
Last row is predicted class distribution
Up to 10 classes are shown - listed from most frequent to least frequent
For more than 10 classes, the 9 most frequent classes are shown and the 10th class collapses the rest of the classes, labeled as "others"
You can download the confusion matrix through a link on the Explore Performance page under Evaluations
Prediction Summary
1. Eval with default recipe settings. Average F1 score: 0.905
2. Eval with numeric recipe settings: Average F1 score: 0.827
3. Batch prediction Results (predict all 150 example outcome)
a. With default recipe settings: Average F1 Score: 0.973
b. With numeric recipe settings: Average F1 Score: 0.78
4. Classification was better with binning. Versicolor classification was impacted when numeric setting was used
5. Higher F1 Score implies better prediction accuracy
Section 12 Text Based Classification with AWS Twitter Dataset
78. AWS Twitter Feed Classification for Customer Service
https://github.com/aws-samples/machine-learning-samples/tree/master/social-media
79. Lab: Train, Evaluate Model and Assess Predictive Quality, 80. Lab: Interactive Prediction with AWS
- Practice
81. Logistic Regression Summary
AWS ML - Logistic Regression
- Linear Model
- Logistic/Sigmoid Function to produce a probability
- Stochastic Gradient Descent
- Logistic Loss function
- AWS ML Training, Evaluation, Interactive Prediction, Batch Prediction
- Prediction Quality
: TPR
: FPR
: Accuracy
: Precision
: AUC Metrics
: F1 Score
: Average F1 Score for multi-class
- Data visualization
- Text Processing
- Normalization
- Higher order polynomials
Section 13
82. Recipe Overview
Recipe
- Recipe is a set of instructions for pre-processing data
- Recipe is a JSON-like document
- Consists of three parts: Groups, Assignments, Outputs
- Groups - Groups are collection of features for which similar transformation needs to be applied
: Built-in Group : ALL_TEXT, ALL_NUMERIC, ALL_CATEGORICAL, ALL_BINARY
: Define your own groups
- Assignments - Enable creation of new features derived from existing ones
- Outputs - List features used for learning process and optionally apply transformation
Recipe is automatically applied to training data, evaluation data and to data submitted through real-time and batch prediction APIs
83. Recipe Example
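As a hedged illustration of the three-part structure (Groups, Assignments, Outputs), the sketch below builds a recipe-shaped document as a Python dictionary and prints it as JSON. The feature names (temperature, windspeed, season, hour, comment) and the group name SCALED are hypothetical, and the `group(...)` syntax is an assumption; check the AWS ML recipe reference for the exact format.

```python
import json

# Hypothetical recipe: all feature and group names are made up for illustration
recipe = {
    "groups": {
        # Assumed syntax for a user-defined group of numeric features
        "SCALED": "group(temperature, windspeed)",
    },
    "assignments": {
        # New feature derived from two existing ones
        "season_hour": "cartesian(season, hour)",
    },
    "outputs": [
        "ALL_CATEGORICAL",
        "season_hour",
        "normalize(SCALED)",
        "ngram(lowercase(comment), 2)",
    ],
}

recipe_json = json.dumps(recipe, indent=2)
print(recipe_json)
```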
84. Text Transformation
* N-gram Text Transformation
- Tokenizes input text and combines the tokens into a sliding window of n words, where n is specified in the recipe
- Usage: ngram(textFeature, n), where n is the size
- By default all text data is tokenized with n=1
: Example: "Customer requests urgent response" text is tokenized as {"Customer", "requests", "urgent", "response"}
- With n=2, it generates one word and two word combinations
: {"Customer requests", "requests urgent", "urgent response", "Customer", "requests", "urgent", "response"}
- N-grams of size up to 10 are supported
- N-gram breaks text at whitespace. Punctuation will be considered part of the word
- You can remove punctuations using no_punct transformation
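The n-gram behavior described above can be sketched in plain Python. This is an illustration of the idea, not the AWS ML implementation; the ordering of the returned tokens is an arbitrary choice of this sketch.

```python
def ngram(text, n):
    """Split on whitespace and return all word combinations of size 1 to n."""
    tokens = text.split()
    grams = []
    for size in range(1, n + 1):
        for start in range(len(tokens) - size + 1):
            grams.append(" ".join(tokens[start:start + size]))
    return grams

unigrams = ngram("Customer requests urgent response", 1)
bigrams = ngram("Customer requests urgent response", 2)
```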
* OSB Text Transformation
- Orthogonal Sparse Bigram (OSB) Transformation provides more word combinations compared to n-gram
- Usage: osb(textFeature, size)
- Puts one underscore to indicate word boundary as well as every word skipped
- For example (AWS Document provided sample).
https://docs.aws.amazon.com/ko_kr/machine-learning/latest/dg/data-transformations-reference.html
Text: "The quick brown fox jumps over the lazy dog". osb(text,4)
WINDOW,{OSB GENERATED}
"The quick brown fox", {The_quick, The__brown, The___fox}
"quick brown fox jumps", {quick_brown, quick__fox, quick___jumps}
"brown fox jumps over", {brown_fox, brown__jumps, brown___over}
"fox jumps over the", {fox_jumps, fox__over, fox___the}
"jumps over the lazy", {jumps_over, jumps__the, jumps___lazy}
"over the lazy dog", {over_the, over__lazy, over___dog}
"the lazy dog", {the_lazy, the__dog}
"lazy dog", {lazy_dog}
* Lowercase and Punctuation
- Lower Case Transformation converts text to lowercase
: Usage : lowercase(textFeature)
: Example: "The Quick Brown Fox Jumps Over the Lazy Dog" -> "the quick brown fox jumps over the lazy dog"
- Remove punctuation Transformation - removes punctuations at word boundaries
: Usage: nopunct(textFeature)
: Example: "Customer Number: 123. Ord-No: AB1235" will be by default tokenized as
{"Customer","Number:","123.","Ord-No:","AB1235"}
: With nopunct transformation -> {"Customer","Number","123","Ord-No","AB1235"}
: Note: only prefix, suffix punctuations are removed. Embedded punctuations are not removed "Ord-No"
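The OSB, lowercase and punctuation-removal behaviors described in this section can be sketched in plain Python. These are illustrations that reproduce the examples above, not the AWS ML implementations; the Python function names are hypothetical.

```python
import string

def osb(text, size):
    """Pair the first word of each sliding window with every later word in it.

    One underscore marks the word boundary, plus one for each skipped word.
    """
    tokens = text.split()
    features = []
    for start in range(len(tokens) - 1):
        window = tokens[start:start + size]
        for distance, word in enumerate(window[1:], start=1):
            features.append(window[0] + "_" * distance + word)
    return features

def lowercase(text):
    """Convert text to lowercase."""
    return text.lower()

def remove_punctuation(text):
    """Strip punctuation at word boundaries only; embedded punctuation stays."""
    stripped = (token.strip(string.punctuation) for token in text.split())
    return [token for token in stripped if token]

osb_features = osb("The quick brown fox jumps over the lazy dog", 4)
cleaned = remove_punctuation("Customer Number: 123. Ord-No: AB1235")
```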
85. Numeric Transformation - Quantile Binning
* Quantile Binning Transformation (Numeric)
- Used for converting a numeric value into a categorical bin number
- Usage: quantile_bin(numericFeature, n), where n is the number of bins
- AWS ML uses this information to establish n bins of equal size based on the distribution of all values of the specified numeric feature.
- It then maps incoming numericFeature value to corresponding bin and outputs bin number as categorical value
- AWS ML Recommendation: In some cases, relationship between numeric variable and target is not linear...... binning might be useful in those scenarios
- We actually saw where binning improved predictive accuracy with Iris Dataset
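Quantile binning can be sketched in two steps: learn bin edges from the distribution of the training values, then map each incoming value to a bin number. This is a minimal illustration; the exact way AWS ML picks the edges is not stated in these notes, so the edge selection below is an assumption.

```python
from bisect import bisect_right

def fit_quantile_edges(values, n_bins):
    """Pick n_bins - 1 edges so each bin holds roughly the same number of values."""
    ordered = sorted(values)
    return [ordered[len(ordered) * k // n_bins] for k in range(1, n_bins)]

def quantile_bin(value, edges):
    """Return the bin number (0-based) for a value; treated as a categorical."""
    return bisect_right(edges, value)

training_values = [1, 2, 3, 4, 5, 6, 7, 8]
edges = fit_quantile_edges(training_values, 4)
bins = [quantile_bin(v, edges) for v in training_values]
```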
86. Numeric Transformation - Normalization
Normalization Transformation (Numeric)
- When there are very large differences in magnitude of features, features that have large magnitude can dominate Model
- Example: We saw this in Quadratic Extra Features dataset
- Normalization is a process of transforming features to have a mean of 0 and variance of 1. This will ensure all features have similar scale.
: Example Feature normalized = (feature - mean)/(sigma)
where,
mean = mean of feature x
sigma = standard deviation of feature x
: Usage: normalize(numericFeature)
- Optimization algorithm may also converge faster with normalized features compared to features that have very large scale differences
87. Cartesian Product Transformation - Categorical and Text
* Cartesian Product Transformation (Categorical, Text)
- Cartesian transformation generates permutations of two or more text and categorical input variables
- For example: Season and Hour combined may have stronger influence on bike rentals. Instead of treating these two as separate features, we can create a new feature Season_Hour that will combine these values.
- Usage cartesian(feature1, feature2)
- Combined features may be able to more accurately relate to the target attribute
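The Season_Hour example can be sketched as a simple derived column. This is an illustration of the idea in plain Python; the sample rows are hypothetical and the underscore separator is an assumption.

```python
def cartesian(*values):
    """Combine two or more categorical values into a single combined value."""
    return "_".join(str(v) for v in values)

# Hypothetical bike rental rows
rentals = [
    {"season": "summer", "hour": 8},
    {"season": "winter", "hour": 8},
]
for row in rentals:
    row["season_hour"] = cartesian(row["season"], row["hour"])
```

The same hour now maps to different values depending on the season, so a linear model can learn a separate weight for each combination.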
Table
88. Summary
Data Transformation
Section 14 Hyper Parameters, Model Optimization and Lifecycle
Hyper Parameters allow you to control the model building process and quality
90. Data Rearrangement, Maximum model Size, passes, Shuffle Type
Table
93. Improving Model Quality
Optimizing Model
- To improve a model following are some options
: Add more training examples
: Add more relevant features
: Model hyperparameter tuning
- Quality Metrics of Training Data and Evaluation Data can provide important clues to improve model performance
94. Model Maintenance
- Models may need to be periodically rebuilt or updated to
: Keep in-sync with new patterns
: Support new more relevant features
: Support new class - in multi-class problems
: Changes in assumptions or distribution of data that was used to train model
: Changes to cut-off threshold
Example: Home price changes month to month depending on several factors
- Have a plan to evaluate model with new data periodically. Example: Weekly, Monthly, Quarterly
- Models are probabilistic in nature...
: Binary Class - Provides bestAnswer(1 or 0) and a raw prediction score. Cut-off score is configurable
: Multi Class - Provides prediction score for each class. It can be interpreted as probability of observation belonging to the class. Class with highest score is the best answer
: Regression : Provides a score that contains raw numeric prediction of the target attribute.
- When models are changed, predicted results would also change - Quality metrics like AUC, F1 Score, RMSE can be used to determine whether to go ahead with proposed model change
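How the raw scores turn into a best answer can be sketched as follows. This is an illustration only; whether a score exactly equal to the cut-off counts as positive is an assumption, and the class names and scores are hypothetical.

```python
def binary_answer(score, cutoff=0.5):
    """Turn a raw prediction score into a best answer of 1 or 0."""
    # Assumption: a score at or above the cut-off is treated as positive
    return 1 if score >= cutoff else 0

def multiclass_answer(scores):
    """Pick the class with the highest prediction score."""
    return max(scores, key=scores.get)

best_class = multiclass_answer(
    {"setosa": 0.1, "versicolor": 0.7, "virginica": 0.2}
)
```

Changing the cut-off changes the binary answer for the same raw score, which is why a cut-off change is listed above as a reason to re-evaluate a model.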
95. AWS Machine Learning System Limits
- AWS ML imposes certain limits to ensure robust and reliable service
- Some are soft limits and can be increased by contacting AWS Customer Service
- Size of each observation: 100KB
- Size of training data: 100GB
- Size of batch prediction input: 1TB (single file limit. can be overcome by creating more batch files)
- No. of records per batch file: 100 million
- No. of variables/features: 10,000
- Throughput per second for realtime prediction: 200 requests/second
- Max Number of classes per multi-class model: 100
96. AWS Machine Learning Pricing
- Data Analysis and Model Building Fee - $0.42 per Hour of building time
: Number of compute hours required for data analysis, model training and evaluation
: Depends on size of input data, attributes, types of transformations applied
- Predictions Fees
: Batch predictions - $0.10/1,000 predictions, rounded up to the next 1,000
: Real-time predictions - $0.0001 per prediction + Capacity reservation charge of $0.001 per hour for each 10MB provisioned for your model
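A rough cost estimate based on the rates listed above can be sketched as follows. This is a worked-arithmetic illustration, not an official calculator; it assumes batch counts are rounded up to the next 1,000 and that capacity is charged per started 10MB, and the usage figures are hypothetical.

```python
import math

def model_building_fee(compute_hours):
    """$0.42 per hour of data analysis and model building."""
    return compute_hours * 0.42

def batch_prediction_fee(predictions):
    """$0.10 per 1,000 predictions (assumption: rounded up to the next 1,000)."""
    return math.ceil(predictions / 1000) * 0.10

def realtime_prediction_fee(predictions, hours, model_size_mb):
    """$0.0001 per prediction plus $0.001 per hour for each 10MB provisioned."""
    capacity_units = math.ceil(model_size_mb / 10)  # assumption: per started 10MB
    return predictions * 0.0001 + hours * capacity_units * 0.001

# Hypothetical usage: 1,500 batch predictions; 10,000 real-time predictions
# over 24 hours with a 25MB model
batch_cost = batch_prediction_fee(1500)
realtime_cost = realtime_prediction_fee(10000, 24, 25)
```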
Section 15 Integration of AWS Machine Learning With Your Application
98. Introduction
AWS ML Integration
- Speed!
: Turn your ideas into cool products in a matter of days
: Traditional approach would require months
- Highly scalable, secure service with redundancy built-in
: Scale automatically to train model with very large datasets
: Scale automatically to support high volume prediction needs
: Real-time prediction with capacity reservation
: Secure - Limit access to Authenticated and Authorized services and users
- Serverless!
- Software Integration
: AWS Machine Learning - Complete functionality is accessible through SDK and Command Line Interfaces
: Model building and Prediction can be fully automated using SDK
: AWS SDKs in multiple languages - Python, Java, .NET, Javascript, Ruby, C++, ....
: Complete list of languages: https://aws.amazon.com/tools/
99. Integration Scenarios
Connectivity and Security Options
- Your Data Center -> AWS ML Cloud Service
: Security: Key Based Authentication + IAM Policy + SSL
- AWS Hosted Application -> AWS ML Cloud Service
: Security : IAM Role + SSL
- Browser, Apps on Phone -> AWS ML Cloud Service
: Option 1: AWS Cognito Based Authentication + IAM Role + SSL
: Choice of authentication providers: Cognito, Google, Amazon, Facebook, Twitter, OpenID, Custom
: Option 2 : Key Based Authentication + IAM Policy + SSL
100. Security using IAM
Users belong to AWS root account. Cognito Users are application level users. Application belongs to AWS root account.