House Sales in King County, USA
We will use the House Sales in King County dataset from Kaggle for this lab. Please download it from the link.
This dataset contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015. It’s a great dataset for evaluating simple regression models.
Run:
# import libraries
import boto3, re, sys, math, json, os, sagemaker, urllib.request
from sagemaker import get_execution_role
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import Image
from IPython.display import display
from time import gmtime, strftime
from sagemaker.serializers import CSVSerializer
# Define IAM role and assign S3 bucket
role = get_execution_role()
prefix = 'demo3'
bucket_name = 'raytsanghk' # <--- CHANGE THIS VARIABLE TO A UNIQUE NAME FOR YOUR BUCKET
my_region = boto3.session.Session().region_name # set the region of the instance
xgboost_container = sagemaker.image_uris.retrieve("xgboost", my_region, "latest")
print("Success - the MySageMakerInstance is in the " + my_region + " region. You will use the " + xgboost_container + " container for your SageMaker endpoint.")
Output:
Success - the MySageMakerInstance is in the ap-east-1 region. You will use the 286214385809.dkr.ecr.ap-east-1.amazonaws.com/xgboost:latest container for your SageMaker endpoint.
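The code above assumes the S3 bucket already exists. If it does not, a minimal sketch for creating it follows, assuming the notebook role has s3:CreateBucket permission (outside us-east-1 the region must be supplied as a LocationConstraint):
try:
    if my_region == 'us-east-1':
        boto3.resource('s3').create_bucket(Bucket=bucket_name)
    else:
        boto3.resource('s3').create_bucket(Bucket=bucket_name,
            CreateBucketConfiguration={'LocationConstraint': my_region})
    print('S3 bucket created successfully')
except Exception as e:
    print('S3 error: ', e)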
Prepare the data
Upload the CSV file to the SageMaker notebook instance.
Run:
try:
    data = pd.read_csv('./kc_house_data.csv', encoding='utf8')
    print('Success: Data loaded into dataframe.')
except Exception as e:
    print('Data load error: ', e)
Output:
Success: Data loaded into dataframe.
Run:
# Check Data
data
Output:
(the DataFrame is displayed: 21613 rows × 21 columns)
Run:
# Show id column
data.id
Output:
Run:
train_data, test_data = np.split(data.sample(frac=1, random_state=1729), [int(0.7 * len(data))])
print(train_data.shape, test_data.shape)
Output:
(15129, 21) (6484, 21)
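The np.split idiom above shuffles the rows with sample(frac=1) and cuts at the 70% mark. For comparison, a hedged equivalent using scikit-learn's train_test_split (the variable names are illustrative, and the exact rows selected will differ from the np.split version):
from sklearn.model_selection import train_test_split
train_alt, test_alt = train_test_split(data, test_size=0.3, random_state=1729)
print(train_alt.shape, test_alt.shape)  # same 70/30 shapes, different sampling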
Run:
train_data.head()
Output:
Note that SageMaker's built-in XGBoost expects the target variable in the first column and a CSV with no header row, so we put price first and write the file with header=False below.
Run:
attributes = ['price','bedrooms', 'bathrooms', 'sqft_living',
'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
'lat', 'long', 'sqft_living15', 'sqft_lot15']
len(attributes)
Output:
18
Save the data to a CSV file and upload it to S3.
Run:
train_data = train_data[attributes]
train_data.to_csv('train.csv', index=False, header=False)
boto3.Session().resource('s3').Bucket(bucket_name).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
s3_input_train = sagemaker.inputs.TrainingInput(s3_data='s3://{}/{}/train'.format(bucket_name, prefix), content_type='csv')
Output:
INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
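This lab trains with a train channel only. SageMaker XGBoost also accepts an optional validation channel, which makes early_stopping_rounds monitor held-out error instead of training error (the training log below notes the missing validation path). A hedged sketch, carving an illustrative 20% validation set out of train_data:
val_data = train_data.sample(frac=0.2, random_state=1729)  # illustrative split size
train_part = train_data.drop(val_data.index)               # train.csv would then be written from train_part
val_data.to_csv('validation.csv', index=False, header=False)
boto3.Session().resource('s3').Bucket(bucket_name).Object(os.path.join(prefix, 'validation/validation.csv')).upload_file('validation.csv')
s3_input_validation = sagemaker.inputs.TrainingInput(s3_data='s3://{}/{}/validation'.format(bucket_name, prefix), content_type='csv')
# then fit with both channels: xgb.fit({'train': s3_input_train, 'validation': s3_input_validation})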
Start training
Run:
sess = sagemaker.Session()
xgb = sagemaker.estimator.Estimator(xgboost_container,
role,
instance_count=1,
instance_type='ml.m5.xlarge',
output_path='s3://{}/{}/output'.format(bucket_name, prefix),
sagemaker_session=sess)
xgb.set_hyperparameters(eta=0.06,
alpha=0.8,
lambda_bias=0.8,
gamma=50,
min_child_weight=6,
subsample=0.5,
silent=0,
early_stopping_rounds=5,
objective='reg:linear',
num_round=1000)
xgb.fit({'train': s3_input_train})
Output:
INFO:sagemaker:Creating training-job with name: xgboost-2023-03-22-11-32-27-409
2023-03-22 11:32:30 Starting - Starting the training job...
2023-03-22 11:32:43 Starting - Preparing the instances for training...
2023-03-22 11:33:19 Downloading - Downloading input data...
2023-03-22 11:33:44 Training - Downloading the training image..Arguments: train
[2023-03-22:11:34:16:INFO] Running standalone xgboost training.
[2023-03-22:11:34:16:INFO] Path /opt/ml/input/data/validation does not exist!
[2023-03-22:11:34:16:INFO] File size need to be processed in the node: 1.17mb. Available memory size in the node: 8119.47mb
[2023-03-22:11:34:16:INFO] Determined delimiter of CSV input is ','
[11:34:16] S3DistributionType set as FullyReplicated
[11:34:16] 15129x17 matrix with 257193 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[11:34:16] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 98 extra nodes, 0 pruned nodes, max_depth=6
[0]#011train-rmse:614855
Will train until train-rmse hasn't improved in 5 rounds.
[11:34:16] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 102 extra nodes, 0 pruned nodes, max_depth=6
[1]#011train-rmse:581322
[...]
[11:34:26] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 62 extra nodes, 0 pruned nodes, max_depth=6
[999]#011train-rmse:48981
2023-03-22 11:34:45 Completed - Training job completed
Training seconds: 87
Billable seconds: 87
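Before deploying, the location of the trained model artifact can be read off the estimator; a quick sanity check, assuming the job above completed:
print(xgb.model_data)  # s3://<bucket>/<prefix>/output/<job-name>/output/model.tar.gz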
Run:
xgb_predictor = xgb.deploy(initial_instance_count=1, instance_type='ml.m5.xlarge')
Output:
INFO:sagemaker:Creating model with name: xgboost-2023-03-22-11-35-20-160
INFO:sagemaker:Creating endpoint-config with name xgboost-2023-03-22-11-35-20-160
INFO:sagemaker:Creating endpoint with name xgboost-2023-03-22-11-35-20-160
---!
Run:
test_data_array = test_data.drop(['price', 'id', 'sqft_above', 'date'], axis=1).values  # load the data into an array
from sagemaker.serializers import CSVSerializer
xgb_predictor.serializer = CSVSerializer() # set the serializer type
predictions = xgb_predictor.predict(test_data_array).decode('utf-8') # predict!
predictions_array = np.fromstring(predictions[1:], sep=',')  # and turn the prediction into an array
print(predictions_array.shape)
Output:
(6484,)
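The whole test set is sent to the endpoint in a single request above. Real-time endpoints cap the request payload at a few megabytes, so for larger datasets it is safer to predict in batches. A minimal sketch, where the predict_in_batches helper and the batch size of 500 are illustrative, not part of the lab:
def predict_in_batches(predictor, data, rows=500):
    # send the array in chunks and concatenate the parsed responses
    results = []
    for i in range(0, len(data), rows):
        resp = predictor.predict(data[i:i + rows]).decode('utf-8')
        results.append(np.fromstring(resp, sep=','))
    return np.concatenate(results)

batched_predictions = predict_in_batches(xgb_predictor, test_data_array)
print(batched_predictions.shape)  # (6484,)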
Run:
from sklearn.metrics import r2_score
print("R2 score : %.2f" % r2_score(test_data['price'],predictions_array))
Output:
R2 score : 0.88
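R2 alone can hide the scale of the errors; RMSE and MAE report them in the target's own units (dollars here). A small addition using the same predictions:
from sklearn.metrics import mean_squared_error, mean_absolute_error
rmse = math.sqrt(mean_squared_error(test_data['price'], predictions_array))
mae = mean_absolute_error(test_data['price'], predictions_array)
print("RMSE : %.0f" % rmse)
print("MAE  : %.0f" % mae)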
Source Code
Download the notebook here.
Download the result notebook here.
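Finally, remember to clean up: the endpoint bills for as long as it is running. A minimal cleanup sketch (deleting the bucket contents is destructive, so only run it if nothing else lives in the bucket):
xgb_predictor.delete_endpoint(delete_endpoint_config=True)  # stop the real-time endpoint
bucket_to_delete = boto3.resource('s3').Bucket(bucket_name)
bucket_to_delete.objects.all().delete()  # remove the uploaded training data and model artifacts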