This project made use of data from over 28 000 deliveries completed by Sendy in Nairobi, Kenya. The predictive model was trained on 21 201 data points and tested on 7 068 data points. The aim was to optimise the model to improve (lower) its Mean Squared Error score.
Here is a summary of the columns in the test dataset, which represent the delivery attributes that can be used to predict delivery time:
Data columns (total 25 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Order_No 7068 non-null object
1 User_Id 7068 non-null object
2 Vehicle_Type 7068 non-null object
3 Platform_Type 7068 non-null int64
4 Personal_or_Business 7068 non-null object
5 Placement_-_Day_of_Month 7068 non-null int64
6 Placement_-_Weekday_(Mo_=_1) 7068 non-null int64
7 Placement_-_Time 7068 non-null object
8 Confirmation_-_Day_of_Month 7068 non-null int64
9 Confirmation_-_Weekday_(Mo_=_1) 7068 non-null int64
10 Confirmation_-_Time 7068 non-null object
11 Arrival_at_Pickup_-_Day_of_Month 7068 non-null int64
12 Arrival_at_Pickup_-_Weekday_(Mo_=_1) 7068 non-null int64
13 Arrival_at_Pickup_-_Time 7068 non-null object
14 Pickup_-_Day_of_Month 7068 non-null int64
15 Pickup_-_Weekday_(Mo_=_1) 7068 non-null int64
16 Pickup_-_Time 7068 non-null object
17 Distance_(KM) 7068 non-null int64
18 Temperature 5631 non-null float64
19 Precipitation_in_millimeters 199 non-null float64
20 Pickup_Lat 7068 non-null float64
21 Pickup_Long 7068 non-null float64
22 Destination_Lat 7068 non-null float64
23 Destination_Long 7068 non-null float64
24 Rider_Id 7068 non-null object
dtypes: float64(6), int64(10), object(9)
1) The attributes I chose to include in the model were:
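import numpy as np
import pandas as pd

# Keep the three predictors, plus the target in the training set
train1 = train1[['Pickup_-_Weekday_(Mo_=_1)', 'Precipitation_in_millimeters',
                 'Distance_(KM)', 'Time_from_Pickup_to_Arrival']].copy()
test1 = test1[['Pickup_-_Weekday_(Mo_=_1)', 'Precipitation_in_millimeters',
               'Distance_(KM)']].copy()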
I converted the weekday and precipitation columns into binary dummy values in both the train and test datasets (the train version is shown):
# Weekdays (Monday = 1 to Friday = 5) get a dummy value of 0
train1['Pickup_-_Weekday_(Mo_=_1)'] = train1['Pickup_-_Weekday_(Mo_=_1)'].replace([1, 2, 3, 4, 5], 0)
# Weekend days (Saturday = 6, Sunday = 7) get a dummy value of 1
train1['Pickup_-_Weekday_(Mo_=_1)'] = train1['Pickup_-_Weekday_(Mo_=_1)'].replace([6, 7], 1)
# Missing precipitation is treated as a no-rain day: dummy 0
train1['Precipitation_in_millimeters'] = train1['Precipitation_in_millimeters'].fillna(0)
# Rain days get dummy 1 (select only the precipitation column, not the whole row)
train1.loc[train1['Precipitation_in_millimeters'] > 0, 'Precipitation_in_millimeters'] = 1
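Because the same encoding has to be applied to both dataframes, it can help to wrap it in one function. The following is a minimal sketch, not the code used in this project: `encode_flags` and the toy rows are hypothetical names and values introduced for illustration. It assumes, as above, that a missing precipitation value means no rain (only 199 of the 7 068 test rows carry a precipitation reading).

```python
import numpy as np
import pandas as pd

WEEKDAY_COL = 'Pickup_-_Weekday_(Mo_=_1)'
RAIN_COL = 'Precipitation_in_millimeters'

def encode_flags(df):
    """Return a copy with the weekday and rain columns as 0/1 indicators."""
    out = df.copy()
    # Monday = 1 ... Sunday = 7, so 6 and 7 are the weekend days
    out[WEEKDAY_COL] = out[WEEKDAY_COL].isin([6, 7]).astype(int)
    # Assumption: missing precipitation means no rain (0); any positive value means rain (1)
    out[RAIN_COL] = (out[RAIN_COL].fillna(0) > 0).astype(int)
    return out

# Hypothetical toy rows, not taken from the Sendy data
toy = pd.DataFrame({
    WEEKDAY_COL: [1, 5, 6, 7],
    RAIN_COL: [np.nan, 0.0, 2.5, np.nan],
    'Distance_(KM)': [4, 12, 7, 3],
})
encoded = encode_flags(toy)
print(encoded)
```

Calling `encode_flags(train1)` and `encode_flags(test1)` would then keep the two datasets consistent, since any change to the encoding is made in one place.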
I then created a column to store the predicted delivery times in the test-dataframe:
test1['Time_from_Pickup_to_Arrival'] = np.nan
test1['Time_from_Pickup_to_Arrival'].unique()
test1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7068 entries, 0 to 7067
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pickup_-_Weekday_(Mo_=_1) 7068 non-null int64
1 Precipitation_in_millimeters 7068 non-null float64
2 Distance_(KM) 7068 non-null int64
3 Time_from_Pickup_to_Arrival 0 non-null float64
dtypes: float64(2), int64(2)
memory usage: 221.0 KB
2) Model training using the training dataset:
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Features (X) and target (y)
X = train1.drop('Time_from_Pickup_to_Arrival', axis=1)
y = train1['Time_from_Pickup_to_Arrival']

# Standardise the features, then fit a linear model by stochastic gradient descent
reg = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1000, tol=1e-3))
reg.fit(X, y)
y_preds = reg.predict(X)
3) Next I calculated the root mean squared error (RMSE) between the predicted and the actual delivery times on the training data:
from sklearn.metrics import mean_squared_error

def rmse(y_test, y_predict):
    # RMSE is the square root of the mean squared error
    return np.sqrt(mean_squared_error(y_test, y_predict))

answer = rmse(y, y_preds)
answer
794.4000853649443
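The figure above is measured on the same rows the model was fitted on, so it may be optimistic about performance on unseen deliveries. One common way to check this is to hold out part of the training data for validation. The sketch below illustrates the idea on synthetic data: the arrays `X_demo` and `y_demo`, the 25% split, and the coefficients used to generate the target are all assumptions made for the example and say nothing about the real Sendy data.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the three features and the target (illustrative only)
rng = np.random.RandomState(0)
n = 500
X_demo = np.column_stack([
    rng.randint(0, 2, n),   # weekend flag (0 or 1)
    rng.randint(0, 2, n),   # rain flag (0 or 1)
    rng.randint(1, 30, n),  # distance in km
])
y_demo = 300 + 120 * X_demo[:, 2] + rng.normal(0, 200, n)

# Hold out 25% of the rows; the model never sees them during fitting
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

model = make_pipeline(
    StandardScaler(),
    SGDRegressor(max_iter=1000, tol=1e-3, random_state=0))
model.fit(X_fit, y_fit)

def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

train_rmse = rmse(y_fit, model.predict(X_fit))
val_rmse = rmse(y_val, model.predict(X_val))
print(f"train RMSE: {train_rmse:.1f}, validation RMSE: {val_rmse:.1f}")
```

With the real data, `X_demo` and `y_demo` would be replaced by `X` and `y` from step 2, and the validation RMSE would be the number to compare between model variants.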
4) Finally I applied the model to predict delivery times on the test dataset:
xt = test1.drop('Time_from_Pickup_to_Arrival', axis=1)
ytest_preds = reg.predict(xt)
daf = pd.DataFrame(ytest_preds, columns=['Time_from_Pickup_to_Arrival'])
daf.head()
Time_from_Pickup_to_Arrival
0 1418.709746
1 1124.555383
2 1124.555383
3 1124.555383
4 1222.606838
Further tuning of the model and its regression parameters could lower the RMSE.
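As one possible starting point for such tuning, the sketch below runs a small cross-validated grid search over the regularisation settings of `SGDRegressor`. It is an illustration under stated assumptions rather than a recommended configuration: the data is synthetic, and the grid values for `alpha` and `penalty` are arbitrary choices that have not been tried on the Sendy data.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the three features and the target (illustrative only)
rng = np.random.RandomState(0)
n = 300
X_demo = np.column_stack([
    rng.randint(0, 2, n),   # weekend flag
    rng.randint(0, 2, n),   # rain flag
    rng.randint(1, 30, n),  # distance in km
])
y_demo = 300 + 120 * X_demo[:, 2] + rng.normal(0, 200, n)

pipe = make_pipeline(
    StandardScaler(),
    SGDRegressor(max_iter=1000, tol=1e-3, random_state=0))

# make_pipeline names each step after its lowercased class name,
# so SGDRegressor parameters are addressed as "sgdregressor__<param>"
param_grid = {
    'sgdregressor__alpha': [1e-5, 1e-4, 1e-3, 1e-2],
    'sgdregressor__penalty': ['l2', 'l1', 'elasticnet'],
}

# scikit-learn maximises scores, so RMSE is exposed as a negative value
search = GridSearchCV(pipe, param_grid,
                      scoring='neg_root_mean_squared_error', cv=5)
search.fit(X_demo, y_demo)

best_rmse = -search.best_score_
print(search.best_params_, round(best_rmse, 1))
```

The fitted `search` object can be used like the original pipeline, for example `search.predict(xt)`, because by default it refits the best parameter combination on all the data passed to `fit`.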