House Sales in King County, USA
We will use the House Sales in King County dataset from Kaggle for this lab. Please download it from the link.
This dataset contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015. It’s a great dataset for evaluating simple regression models.
Run:
# import libraries
import boto3, re, sys, math, json, os, sagemaker, urllib.request
from sagemaker import get_execution_role
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import Image
from IPython.display import display
from time import gmtime, strftime
from sagemaker.serializers import CSVSerializer
# Define IAM role and assign S3 bucket
role = get_execution_role()
prefix = 'demo3'
bucket_name = 'raytsanghk' # <--- CHANGE THIS VARIABLE TO A UNIQUE NAME FOR YOUR BUCKET
my_region = boto3.session.Session().region_name # set the region of the instance
xgboost_container = sagemaker.image_uris.retrieve("xgboost", my_region, "latest")
print("Success - the MySageMakerInstance is in the " + my_region + " region. You will use the " + xgboost_container + " container for your SageMaker endpoint.")
Output:
Success - the MySageMakerInstance is in the ap-east-1 region. You will use the 286214385809.dkr.ecr.ap-east-1.amazonaws.com/xgboost:latest container for your SageMaker endpoint.
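The code above assumes the S3 bucket already exists. If it does not, a minimal sketch for creating it follows, assuming the notebook role has s3:CreateBucket permission (outside us-east-1 the region must be supplied as a LocationConstraint):
try:
    if my_region == 'us-east-1':
        boto3.resource('s3').create_bucket(Bucket=bucket_name)
    else:
        boto3.resource('s3').create_bucket(Bucket=bucket_name,
            CreateBucketConfiguration={'LocationConstraint': my_region})
    print('S3 bucket created successfully')
except Exception as e:
    print('S3 error: ', e)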
Prepare the data
Upload the CSV file to the SageMaker notebook instance.
Run:
try:
    data = pd.read_csv('./kc_house_data.csv', encoding='utf8')
    print('Success: Data loaded into dataframe.')
except Exception as e:
    print('Data load error: ', e)
Output:
Success: Data loaded into dataframe.
Run:
# Check Data
data
Output:
(the DataFrame is displayed: 21613 rows × 21 columns)
Run:
# Show id column
data.id
Output:
Run:
train_data, test_data = np.split(data.sample(frac=1, random_state=1729), [int(0.7 * len(data))])
print(train_data.shape, test_data.shape)
Output:
(15129, 21) (6484, 21)
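The np.split idiom above shuffles the rows with sample(frac=1) and cuts at the 70% mark. For comparison, a hedged equivalent using scikit-learn's train_test_split (the variable names are illustrative, and the exact rows selected will differ from the np.split version):
from sklearn.model_selection import train_test_split
train_alt, test_alt = train_test_split(data, test_size=0.3, random_state=1729)
print(train_alt.shape, test_alt.shape)  # same 70/30 shapes, different sampling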
Run:
train_data.head()
Output:
Note that SageMaker's built-in XGBoost expects the target variable in the first column and a CSV with no header row, so we put price first and write the file with header=False below.
Run:
attributes = ['price','bedrooms', 'bathrooms', 'sqft_living',
'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
'lat', 'long', 'sqft_living15', 'sqft_lot15']
len(attributes)
Output:
18
Save the data to a CSV file and upload it to S3.
Run:
train_data = train_data[attributes]
train_data.to_csv('train.csv', index=False, header=False)
boto3.Session().resource('s3').Bucket(bucket_name).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
s3_input_train = sagemaker.inputs.TrainingInput(s3_data='s3://{}/{}/train'.format(bucket_name, prefix), content_type='csv')
Output:
INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
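This lab trains with a train channel only. SageMaker XGBoost also accepts an optional validation channel, which makes early_stopping_rounds monitor held-out error instead of training error (the training log below notes the missing validation path). A hedged sketch, carving an illustrative 20% validation set out of train_data:
val_data = train_data.sample(frac=0.2, random_state=1729)  # illustrative split size
train_part = train_data.drop(val_data.index)               # train.csv would then be written from train_part
val_data.to_csv('validation.csv', index=False, header=False)
boto3.Session().resource('s3').Bucket(bucket_name).Object(os.path.join(prefix, 'validation/validation.csv')).upload_file('validation.csv')
s3_input_validation = sagemaker.inputs.TrainingInput(s3_data='s3://{}/{}/validation'.format(bucket_name, prefix), content_type='csv')
# then fit with both channels: xgb.fit({'train': s3_input_train, 'validation': s3_input_validation})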
Start training
Run:
sess = sagemaker.Session()
xgb = sagemaker.estimator.Estimator(xgboost_container,
role,
instance_count=1,
instance_type='ml.m5.xlarge',
output_path='s3://{}/{}/output'.format(bucket_name, prefix),
sagemaker_session=sess)
xgb.set_hyperparameters(eta=0.06,
alpha=0.8,
lambda_bias=0.8,
gamma=50,
min_child_weight=6,
subsample=0.5,
silent=0,
early_stopping_rounds=5,
objective='reg:linear',
num_round=1000)
xgb.fit({'train': s3_input_train})
Output:
INFO:sagemaker:Creating training-job with name: xgboost-2023-03-22-11-32-27-409
2023-03-22 11:32:30 Starting - Starting the training job...
2023-03-22 11:32:43 Starting - Preparing the instances for training...
2023-03-22 11:33:19 Downloading - Downloading input data...
2023-03-22 11:33:44 Training - Downloading the training image..Arguments: train
[2023-03-22:11:34:16:INFO] Running standalone xgboost training.
[2023-03-22:11:34:16:INFO] Path /opt/ml/input/data/validation does not exist!
[2023-03-22:11:34:16:INFO] File size need to be processed in the node: 1.17mb. Available memory size in the node: 8119.47mb
[2023-03-22:11:34:16:INFO] Determined delimiter of CSV input is ','
[11:34:16] S3DistributionType set as FullyReplicated
[11:34:16] 15129x17 matrix with 257193 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[11:34:16] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 98 extra nodes, 0 pruned nodes, max_depth=6
[0]#011train-rmse:614855
Will train until train-rmse hasn't improved in 5 rounds.
[11:34:16] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 102 extra nodes, 0 pruned nodes, max_depth=6
[1]#011train-rmse:581322
[...]
[11:34:26] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 62 extra nodes, 0 pruned nodes, max_depth=6
[999]#011train-rmse:48981
2023-03-22 11:34:45 Completed - Training job completed
Training seconds: 87
Billable seconds: 87
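Before deploying, the location of the trained model artifact can be read off the estimator; a quick sanity check, assuming the job above completed:
print(xgb.model_data)  # s3://<bucket>/<prefix>/output/<job-name>/output/model.tar.gz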
Run:
xgb_predictor = xgb.deploy(initial_instance_count=1, instance_type='ml.m5.xlarge')
Output:
INFO:sagemaker:Creating model with name: xgboost-2023-03-22-11-35-20-160
INFO:sagemaker:Creating endpoint-config with name xgboost-2023-03-22-11-35-20-160
INFO:sagemaker:Creating endpoint with name xgboost-2023-03-22-11-35-20-160
---!
Run:
test_data_array = test_data.drop(['price', 'id', 'sqft_above', 'date'], axis=1).values  # load the data into an array
from sagemaker.serializers import CSVSerializer
xgb_predictor.serializer = CSVSerializer() # set the serializer type
predictions = xgb_predictor.predict(test_data_array).decode('utf-8') # predict!
predictions_array = np.fromstring(predictions[1:], sep=',')  # and turn the prediction into an array
print(predictions_array.shape)
Output:
(6484,)
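The whole test set is sent to the endpoint in a single request above. Real-time endpoints cap the request payload at a few megabytes, so for larger datasets it is safer to predict in batches. A minimal sketch, where the predict_in_batches helper and the batch size of 500 are illustrative, not part of the lab:
def predict_in_batches(predictor, data, rows=500):
    # send the array in chunks and concatenate the parsed responses
    results = []
    for i in range(0, len(data), rows):
        resp = predictor.predict(data[i:i + rows]).decode('utf-8')
        results.append(np.fromstring(resp, sep=','))
    return np.concatenate(results)

batched_predictions = predict_in_batches(xgb_predictor, test_data_array)
print(batched_predictions.shape)  # (6484,)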
Run:
from sklearn.metrics import r2_score
print("R2 score : %.2f" % r2_score(test_data['price'],predictions_array))
Output:
R2 score : 0.88
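R2 alone can hide the scale of the errors; RMSE and MAE report them in the target's own units (dollars here). A small addition using the same predictions:
from sklearn.metrics import mean_squared_error, mean_absolute_error
rmse = math.sqrt(mean_squared_error(test_data['price'], predictions_array))
mae = mean_absolute_error(test_data['price'], predictions_array)
print("RMSE : %.0f" % rmse)
print("MAE  : %.0f" % mae)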
Source Code
Download the notebook here.
Download the result notebook here.
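Finally, remember to clean up: the endpoint bills for as long as it is running. A minimal cleanup sketch (deleting the bucket contents is destructive, so only run it if nothing else lives in the bucket):
xgb_predictor.delete_endpoint(delete_endpoint_config=True)  # stop the real-time endpoint
bucket_to_delete = boto3.resource('s3').Bucket(bucket_name)
bucket_to_delete.objects.all().delete()  # remove the uploaded training data and model artifacts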