Amazon Machine Learning
Amazon Machine Learning is a service provided by Amazon that uses machine learning technology to help developers of any skill level build predictive applications, such as a fraud detection system. It was officially released by Amazon Web Services on Thursday, April 9, 2015.[1] With today's advances in technology, vast amounts of data can be collected about a system, product, or process; machine learning (abbreviated as ML) helps analyze these data and act on the results obtained from them.[2]
Key Concepts
The following are the key concepts of Amazon Machine Learning, with descriptions of how each is used within Amazon ML:
Datasources
A datasource is an object that contains metadata about your input data. Amazon ML reads your input data, computes descriptive statistics on its attributes, and stores the statistics—along with a schema and other information—as part of the datasource object.[3] Next, Amazon ML uses the datasource to train and evaluate an ML model and generate batch predictions.[3]
Term | Definition |
---|---|
Attribute | A unique, named property within an observation. In tabular data such as spreadsheets or comma-separated values (CSV) files, the column headings represent the attributes and the rows contain values for each attribute. Synonyms: variable, variable name, field, column |
Datasource Name | (Optional) A human-readable name for a datasource. These names let you find and manage your datasources in the Amazon ML console. |
Input Data | Collective name for all the observations that are referred to by a datasource. |
Location | Location of the input data. Currently, Amazon ML can use data that is stored within Amazon S3 buckets, Amazon Redshift databases, or MySQL databases in Amazon Relational Database Service (RDS). |
Observation | A single unit of input data. For example, if you are creating an ML model to detect fraudulent transactions, your input data will consist of many observations, each representing an individual transaction. Synonyms: record, example, instance, row |
Row ID | (Optional) A flag that, if specified, identifies an attribute in the input data to be included in the prediction output, making it easier to associate each prediction with its observation. Synonyms: row identifier |
Schema | The information needed to interpret the input data, including attribute names, their assigned data types, and the names of special attributes. |
Statistics | Summary statistics for each attribute in the input data. They serve two purposes: the Amazon ML console displays them in graphs to help you understand your data at a glance and identify irregularities or errors, and Amazon ML uses them during the training process to improve the quality of the resulting ML model. |
Status | Indicates the current state of the datasource, such as In Progress, Completed, or Failed. |
Target Attribute | In the context of training an ML model, the name of the attribute in the input data that contains the "correct" answers; Amazon ML uses it to discover patterns in the input data and generate an ML model. In the context of evaluating and generating predictions, the attribute whose value will be predicted by a trained ML model. Synonyms: target |
ML models
An ML model is a mathematical model that generates predictions by finding patterns in your data. Amazon ML supports three types of ML models: binary classification, multiclass classification and regression.[3]
Binary Classification Model
ML models for binary classification problems predict a binary outcome (one of two possible classes). To train binary classification models, Amazon ML uses the industry-standard learning algorithm known as logistic regression.[4]
Examples of Binary Classification Problems
- "Is this email spam or not spam?"
- "Will the customer buy this product?"
- "Is this product a book or a farm animal?"
- "Is this review written by a customer or a robot?"
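As a rough sketch of how such a model scores an observation (the weights, features, and cut-off below are made up for illustration, not Amazon ML's internals), logistic regression squashes a weighted sum of attribute values into a score between 0 and 1, and a cut-off then turns the score into a 0/1 label:

```python
import math

def logistic_score(weights, bias, features):
    """Logistic regression: squash a weighted sum into a 0-1 score."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def classify(score, cutoff=0.5):
    """Convert a prediction score into a 0/1 label at the given cut-off."""
    return 1 if score > cutoff else 0

# Hypothetical two-attribute spam example (weights are made up)
score = logistic_score(weights=[2.0, -1.0], bias=-0.5, features=[1.0, 0.3])
label = classify(score)  # score ≈ 0.77 → label 1 ("spam")
```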
Multiclass Classification Model
ML models for multiclass classification problems allow you to generate predictions for multiple classes (predict one of more than two outcomes). For training multiclass models, Amazon ML uses the industry-standard learning algorithm known as multinomial logistic regression.[4]
Examples of Multiclass Problems
- "Is this product a book, movie, or clothing?"
- "Is this movie a romantic comedy, documentary, or thriller?"
- "Which category of products is most interesting to this customer?"
Regression Model
ML models for regression problems predict a numeric value. For training regression models, Amazon ML uses the industry-standard learning algorithm known as linear regression.[4]
Examples of Regression Problems
- "What will the temperature be in Seattle tomorrow?"
- "For this product, how many units will sell?"
- "What price will this house sell for?"
Evaluations
An evaluation measures the quality of your ML model and determines if it is performing well.[3]
Term | Definition |
---|---|
Model Insights | Amazon ML provides you with a metric and a number of insights that you can use to evaluate the predictive performance of your model. |
AUC | Area Under the ROC Curve (AUC) measures the ability of a binary ML model to predict a higher score for positive examples as compared to negative examples. |
Macro-averaged F1-score | The macro-averaged F1-score is used to evaluate the predictive performance of multiclass ML models. |
RMSE | The Root Mean Square Error (RMSE) is a metric used to evaluate the predictive performance of regression ML models. |
Cut-off | ML models work by generating numeric prediction scores. By applying a cut-off value, the system converts these scores into 0 and 1 labels. |
Accuracy | Accuracy measures the percentage of correct predictions. |
Precision | Precision measures the percentage of actual positives among those examples that are predicted as positive. |
Recall | Recall measures the percentage of actual positives that are predicted as positives. |
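The accuracy, precision, recall, and RMSE metrics defined above are straightforward to compute by hand; a minimal sketch with made-up labels and predictions:

```python
def accuracy(actual, predicted):
    """Fraction of predictions that match the actual labels."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def precision(actual, predicted):
    """Of the examples predicted positive, how many are actually positive."""
    actual_for_predicted_pos = [a for a, p in zip(actual, predicted) if p == 1]
    return sum(actual_for_predicted_pos) / len(actual_for_predicted_pos)

def recall(actual, predicted):
    """Of the actual positives, how many were predicted positive."""
    predicted_for_actual_pos = [p for a, p in zip(actual, predicted) if a == 1]
    return sum(predicted_for_actual_pos) / len(predicted_for_actual_pos)

def rmse(actual, predicted):
    """Root mean square error for regression predictions."""
    return (sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)) ** 0.5

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
# accuracy = 3/5, precision = 2/3, recall = 2/3
```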
Batch Predictions (Asynchronous)
Batch predictions are generated asynchronously for a set of observations that can all be processed at once. This is ideal for predictive analyses that do not have a real-time requirement.[3]
Term | Definition |
---|---|
Output Location | The results of a batch prediction are stored in an S3 bucket output location. |
Manifest File | This file relates each input data file with its associated batch prediction results. It is stored in the S3 bucket output location. |
Real-time Predictions (Synchronous)
Real-time predictions are generated synchronously for applications with a low-latency requirement, such as interactive web, mobile, or desktop applications. Any ML model can be queried for predictions using the low-latency real-time Prediction API.[3]
Term | Definition |
---|---|
Real-time Prediction API | The Real-time Prediction API accepts a single input observation in the request payload and returns the prediction in the response. |
Real-time Prediction Endpoint | To use an ML model with the real-time prediction API, you need to create a real-time prediction endpoint. Once created, the endpoint contains the URL that you can use to request real-time predictions. |
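The Real-time Prediction API can be called through boto3, the AWS SDK for Python. The sketch below only assembles the request parameters (the model ID, endpoint, and record fields are hypothetical) and shows, commented out, how they would be sent:

```python
import json

def build_predict_request(ml_model_id, record, endpoint):
    """Assemble the parameters for a real-time Predict call.

    Record values are sent as strings, so everything is stringified here.
    """
    return {
        "MLModelId": ml_model_id,
        "Record": {k: str(v) for k, v in record.items()},
        "PredictEndpoint": endpoint,
    }

# Hypothetical model ID, record, and endpoint URL
request = build_predict_request(
    "ml-EXAMPLEMODELID",
    {"amount": 42.5, "merchant": "bookstore"},
    "https://realtime.machinelearning.us-east-1.amazonaws.com",
)
# With real AWS credentials configured, the call would look like:
#   import boto3
#   client = boto3.client("machinelearning")
#   response = client.predict(**request)
#   prediction = response["Prediction"]
print(json.dumps(request, indent=2))
```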
Evaluating ML Models
Models are evaluated to find out whether they will predict the target well on future data. Usually this is done by checking an accuracy metric of the ML model on data for which the target answer is already known, and then using that accuracy as a proxy for accuracy on future data. To evaluate an ML model, you create an Amazon ML datasource from data that was not used for training. The schema of this datasource should be the same as the schema of the datasource used for training, and it must contain actual values for the target attribute. Once you have an evaluation datasource and an ML model, you can create an evaluation and review its results.
ML Model Insights
Amazon ML provides an industry-standard metric and a number of insights to review the predictive accuracy of the ML model. The outcome of an evaluation contains the following:
- A prediction accuracy metric: to report on the overall success of the model
- Visualizations: to help explore the accuracy of the model beyond the prediction accuracy metric
- The ability to review the impact of setting a score threshold (only for binary classification)
- Alerts on criteria: to check the validity of the evaluation.
The type of ML model being evaluated determines the choice of metric and visualization. You should review these visualizations to decide whether the model performs well enough to meet the business requirements.[5]
Binary Model Insights
The output of many binary classification algorithms is a prediction score.[6] This score indicates the system's certainty that the given observation belongs to the positive class (the actual target value is 1). Binary classification models in Amazon ML output a score that ranges from 0 to 1. A threshold (cut-off) value is selected to interpret the scores: observations with a score above the threshold are predicted as target=1, and observations with a score below the threshold are predicted as target=0.
Multiclass Model Insights
The output of a multiclass model is a set of prediction scores.[7] These scores indicate the model's certainty that the given observation belongs to each of the classes. The predicted answer is the class with the highest prediction score.
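Picking the predicted class from the per-class scores is a simple argmax; a small sketch with made-up scores for one observation:

```python
def predict_class(scores):
    """Pick the class with the highest prediction score."""
    return max(scores, key=scores.get)

# Hypothetical per-class scores for one observation
scores = {"book": 0.2, "movie": 0.7, "clothing": 0.1}
predict_class(scores)  # → "movie"
```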
Regression Model Insights
The output of a regression ML model is a numeric value for the model's prediction of the target.[8] In regression model insights, the range of predictions can differ from the range of the target in the training data. It is very important to plan how to address prediction values that fall outside the acceptable range for the application.
Overfitting a Model
The goal is to select the best ML model, which means selecting the model with the best settings, or hyperparameters. Amazon ML exposes three hyperparameters: number of passes, regularization, and model size.
Overfitting is one of the common problems in ML approaches. It occurs when the model tries to fit every piece of data in the provided samples, learning the examples without generalizing beyond them. Tuning the three hyperparameters above helps avoid overfitting. Regularization in particular helps the model find commonalities in the example data and learn a solution that generalizes over the set of examples.
Cross-Validation
In cross-validation, several ML models are trained on subsets of the available input data and then they are evaluated on the complementary subset of data. Cross validation is used to detect overfitting, i.e. failing to generalize a pattern.[9]
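A minimal sketch of k-fold cross-validation (an illustration of the idea, not Amazon ML's implementation): the data is cut into k folds, and each fold takes a turn as the held-out evaluation set while the model trains on the rest:

```python
def k_fold_splits(data, k):
    """Yield (train, validation) pairs: each fold is held out exactly once."""
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, validation

data = list(range(10))
for train, validation in k_fold_splits(data, k=5):
    # Every split partitions the full dataset: 8 train + 2 validation rows
    assert len(train) == 8 and len(validation) == 2
    assert sorted(train + validation) == data
```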
Evaluation Alerts
Amazon ML also provides insights that help you check whether the model was evaluated correctly, by displaying any of the following validation criteria that have been violated.[10]
- Evaluation of ML model is done on held-out data
- Sufficient data was used for the evaluation of the predictive model
- Schema matched
- All records from evaluation files were used for predictive model performance evaluation
- Distribution of target variable
Managing Amazon ML Objects
The Amazon ML console can be used to manage the following four objects:
- Datasources
- ML models
- Evaluations
- Batch Predictions
All these objects serve different purposes and have different attributes and functionality, but they are managed in similar ways. The following sections describe the common management operations for these objects and point out their differences:
- Listing Objects
- Retrieving Object Descriptions
- Updating Objects
- Deleting Objects
Listing Objects
The following operations in the Amazon ML API can be used to list objects:
- DescribeDataSources
- DescribeMLModels
- DescribeEvaluations
- DescribeBatchPredictions
These operations include parameters for filtering, sorting and paginating through a long list of objects. There is no limit to the number of objects that can be accessed through these APIs.
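The paging pattern these Describe* operations use (a NextToken returned with each page until the list is exhausted) can be sketched generically. The stub client below stands in for a real boto3 machinelearning client, and the object IDs are made up:

```python
def list_all(describe_fn, **params):
    """Page through a Describe* operation, following NextToken to the end."""
    results = []
    token = None
    while True:
        if token:
            params["NextToken"] = token
        page = describe_fn(**params)
        results.extend(page["Results"])
        token = page.get("NextToken")
        if not token:
            return results

# With real credentials, describe_fn could be
# boto3.client("machinelearning").describe_ml_models; here a stub
# returns two canned pages so the paging logic can be seen end to end.
_pages = [
    {"Results": ["ml-model-1", "ml-model-2"], "NextToken": "t1"},
    {"Results": ["ml-model-3"]},
]
def fake_describe(**params):
    return _pages[1] if params.get("NextToken") else _pages[0]

list_all(fake_describe, SortOrder="asc")  # → ["ml-model-1", "ml-model-2", "ml-model-3"]
```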
Retrieving Object Descriptions
Detailed descriptions for any object can be seen using both console and APIs.
To see descriptions in the console, navigate to the list for a specific type of object and locate the row in the table corresponding to that object, either by browsing through the list or by searching by ID or name.
The following APIs can be used to retrieve the descriptions of these objects:
- GetDataSource
- GetMLModel
- GetEvaluation
- GetBatchPrediction
All these operations take only two parameters: the object ID and a boolean flag called Verbose. If Verbose is set to True, then extra details about the object will be included.
Updating Objects
The following APIs can be used to update the details of an Amazon ML object:
- UpdateDataSource
- UpdateMLModel
- UpdateEvaluation
- UpdateBatchPrediction
These operations require the object's ID to identify the object to be updated. The name can be modified for all object types. For ML models, the ScoreThreshold field can also be modified, as long as no real-time prediction endpoint is associated with the model. No other property can be modified for any object.
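As a sketch of this constraint, an update request for an ML model can carry only the name and the score threshold besides the ID (all values below are hypothetical):

```python
# Parameters for an UpdateMLModel call; besides the ID, only the
# name and ScoreThreshold are mutable (values here are hypothetical).
update_params = {
    "MLModelId": "ml-EXAMPLEMODELID",
    "MLModelName": "fraud-model-v2",
    "ScoreThreshold": 0.7,
}
# With AWS credentials configured, this would be applied as:
#   import boto3
#   boto3.client("machinelearning").update_ml_model(**update_params)
```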
Deleting Objects
The following APIs can be used to delete Amazon ML objects:
- DeleteDataSource
- DeleteMLModel
- DeleteEvaluation
- DeleteBatchPrediction
All these APIs take only one parameter: the ID of the object to be deleted.
Steps to Use
Input Data
The user first needs to provide raw data to Amazon ML so that it can perform the analysis.
Creating an IAM Role[11]
Creating IAM (AWS Identity and Access Management) roles can feel tricky at first, but AWS has made it quite simple for Amazon Machine Learning. The simplest way is to use the IAM role template that Amazon provides, which is reachable in a few clicks. First, go to Amazon IAM and select "Roles" in the menu, then click "Create New Role". After entering a new role name, you can find the preset role "Amazon Machine Learning Role for Redshift Data Source". Click "Select" and go through the rest of the wizard to create the role.[12]
Creating an S3 Bucket[11]
After creating the IAM role, you need to create an S3 bucket.[13] Amazon Machine Learning stages data from Redshift in this bucket before creating its datasource object. Simply create an S3 bucket, copy its URL, and fill it into the Amazon Machine Learning form. You also need to prepare a SQL query, which will be used to extract your Redshift data. Amazon Machine Learning only reads from a flat file stored in S3, so if you are analyzing data across multiple tables, you will need a SQL query that properly joins them. Once this preparation is done, click "Verify", which starts checking the data from Redshift. The next screen shows the schema of the data set that the service automatically detected based on your SQL and Redshift data; you can modify the data types as necessary. Next, you select the target of the analysis. For example, if you choose a numeric target, the analysis will be a numerical regression.
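For example, if the data were spread across two hypothetical Redshift tables, the extraction query might join them into the single flat result Amazon ML expects (the table and column names below are made up for illustration):

```python
# Hypothetical query joining two Redshift tables into the flat, single-table
# result Amazon ML reads; table and column names are illustrative only.
extraction_query = """
SELECT t.transaction_id,
       t.amount,
       c.account_age_days,
       t.is_fraud
FROM transactions t
JOIN customers c ON c.customer_id = t.customer_id
"""
```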
On the target page, you set the target value (the Y value) for your data. Once you select the target, Amazon Machine Learning starts creating a datasource object and presents basic and some advanced statistics on the data.
Creating the Model
The next step is to create the model. To do so, select "Create (train) ML model" on the datasource details page. With the default settings, the prediction model uses 70% of the data for training and the remaining 30% to evaluate the model.
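The default 70/30 split can be pictured with a small sketch (an illustration of the idea, not Amazon ML's actual implementation):

```python
import random

def train_eval_split(rows, train_fraction=0.7, seed=0):
    """Shuffle rows and split them into training and evaluation sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # deterministic shuffle for the demo
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

train, evaluation = train_eval_split(list(range(100)))
# len(train) == 70, len(evaluation) == 30
```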
After creating the prediction model, you can check the residual of your model.[11]
Performing Predictions
The final step is to use the prediction model to make predictions. Follow these steps:
- Navigate to the datasource details page.
- Click on the “Use the datasource to” dropdown and then select “Generate batch prediction”.
- In the next screen, you can select the prediction model to use, the data to analyze, and the S3 bucket for storing your result.
- Once you click on the “Finish” button on the “Step 4. Review” tab, Amazon Machine Learning will run the algorithm on your data and will save the results to the S3 bucket you specified.
The result is only available on S3, so you will need to download the data from there.[11]
References
1. "How AWS Machine Learning Can Help in Data Center Management". Data Center Knowledge. Retrieved 2016-02-01.
2. "Amazon Machine Learning – Make Data-Driven Decisions at Scale | AWS Official Blog". aws.amazon.com. Retrieved 2016-02-01.
3. "Amazon Machine Learning Key Concepts - Amazon Machine Learning". docs.aws.amazon.com. Retrieved 2016-02-05.
4. "Types of ML Models - Amazon Machine Learning". docs.aws.amazon.com. Retrieved 2016-02-05.
5. "ML Model Insights - Amazon Machine Learning". docs.aws.amazon.com. Retrieved 2016-02-09.
6. "Binary Model Insights - Amazon Machine Learning". docs.aws.amazon.com. Retrieved 2016-02-09.
7. "Multiclass Model Insights - Amazon Machine Learning". docs.aws.amazon.com. Retrieved 2016-02-09.
8. "Regression Model Insights - Amazon Machine Learning". docs.aws.amazon.com. Retrieved 2016-02-09.
9. "Cross-Validation - Amazon Machine Learning". docs.aws.amazon.com. Retrieved 2016-02-09.
10. "Evaluation Alerts - Amazon Machine Learning". docs.aws.amazon.com. Retrieved 2016-02-09.
11. "Quick Review of Amazon Machine Learning Using Amazon Redshift as a Data Source | FlyData". FlyData. Retrieved 2016-02-05.
12. "Tutorial: Using Amazon ML to Predict Responses to a Marketing Offer - Amazon Machine Learning". docs.aws.amazon.com. Retrieved 2016-02-05.
13. "Amazon S3". Wikipedia, the free encyclopedia.