Predicting Delivery Arrival Time - Sendy

This project used data from over 28 000 deliveries completed by Sendy in Nairobi, Kenya. The predictive model was trained on 21 201 data points and tested on 7 068. The aim was to minimise the model's prediction error, reported below as the root mean squared error (RMSE). Travel time was measured in seconds.

Here is a summary of the columns in the dataset (output shown for the 7 068-row test set). They represent the delivery attributes that can be used to predict delivery time:

Data columns (total 25 columns):
 #   Column                                Non-Null Count  Dtype
---  ------                                --------------  -----
 0   Order_No                              7068 non-null   object
 1   User_Id                               7068 non-null   object
 2   Vehicle_Type                          7068 non-null   object
 3   Platform_Type                         7068 non-null   int64
 4   Personal_or_Business                  7068 non-null   object
 5   Placement_-_Day_of_Month              7068 non-null   int64
 6   Placement_-_Weekday_(Mo_=_1)          7068 non-null   int64
 7   Placement_-_Time                      7068 non-null   object
 8   Confirmation_-_Day_of_Month           7068 non-null   int64
 9   Confirmation_-_Weekday_(Mo_=_1)       7068 non-null   int64
 10  Confirmation_-_Time                   7068 non-null   object
 11  Arrival_at_Pickup_-_Day_of_Month      7068 non-null   int64
 12  Arrival_at_Pickup_-_Weekday_(Mo_=_1)  7068 non-null   int64
 13  Arrival_at_Pickup_-_Time              7068 non-null   object
 14  Pickup_-_Day_of_Month                 7068 non-null   int64
 15  Pickup_-_Weekday_(Mo_=_1)             7068 non-null   int64
 16  Pickup_-_Time                         7068 non-null   object
 17  Distance_(KM)                         7068 non-null   int64
 18  Temperature                           5631 non-null   float64
 19  Precipitation_in_millimeters          199 non-null    float64
 20  Pickup_Lat                            7068 non-null   float64
 21  Pickup_Long                           7068 non-null   float64
 22  Destination_Lat                       7068 non-null   float64
 23  Destination_Long                      7068 non-null   float64
 24  Rider_Id                              7068 non-null   object
dtypes: float64(6), int64(10), object(9)
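
A summary in this shape is what pandas prints from `DataFrame.info()`. The sketch below uses a tiny made-up frame (the values are illustrative, not Sendy data, and `sample` is a hypothetical name) to show the call, together with a per-column check of the share of missing values. That check matters here because Precipitation_in_millimeters is populated for only 199 of the 7 068 rows.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Sendy data; all values are made up for illustration.
sample = pd.DataFrame({
    "Distance_(KM)": [4, 16, 3, 9],
    "Temperature": [20.4, np.nan, 24.1, 19.2],
    "Precipitation_in_millimeters": [np.nan, np.nan, 1.2, np.nan],
})

# Prints the column / non-null count / dtype summary.
sample.info()

# Fraction of missing values per column.
missing_share = sample.isna().mean()
print(missing_share)
```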

1) I chose the three attributes below for the model, as they had a more significant impact on delivery time than the others:

Day of the week - since traffic conditions differ between weekdays and weekends, which influences delivery times
Precipitation - since rain affects traffic congestion on the road system
Distance - between the pickup location and the delivery destination

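# keep only the chosen features (plus the target in the training set);
# .copy() avoids pandas' SettingWithCopyWarning when the columns are modified later
train1 = train1[['Pickup_-_Weekday_(Mo_=_1)', 'Precipitation_in_millimeters',
                 'Distance_(KM)', 'Time_from_Pickup_to_Arrival']].copy()

test1 = test1[['Pickup_-_Weekday_(Mo_=_1)', 'Precipitation_in_millimeters',
               'Distance_(KM)']].copy()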

I converted the weekday and precipitation values into binary indicators (0 or 1) in both the train and test datasets:

import numpy as np

# apply the same encoding to the train and test sets
for df in (train1, test1):
    # weekdays (1-5) get a dummy value of 0, weekend days (6-7) get a dummy value of 1
    df['Pickup_-_Weekday_(Mo_=_1)'] = df['Pickup_-_Weekday_(Mo_=_1)'].replace([1, 2, 3, 4, 5], 0).replace([6, 7], 1)

    # missing precipitation readings are treated as no-rain days
    df['Precipitation_in_millimeters'] = df['Precipitation_in_millimeters'].fillna(0)

    # rain days get dummy 1, else 0
    df['Precipitation_in_millimeters'] = np.where(df['Precipitation_in_millimeters'] > 0, 1, 0)
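
To make the encoding concrete, here is a minimal sketch on a made-up four-row frame (`toy` and its values are hypothetical). It uses `isin` and a boolean comparison as an equivalent, more compact way of producing the same 0/1 indicators. Treating a missing precipitation reading as "no rain" is the assumption taken from the steps above.

```python
import numpy as np
import pandas as pd

# Made-up rows: Monday, Friday, Saturday, Sunday.
toy = pd.DataFrame({
    "Pickup_-_Weekday_(Mo_=_1)": [1, 5, 6, 7],
    "Precipitation_in_millimeters": [np.nan, 0.0, 2.5, np.nan],
})

# Weekend indicator: 6 and 7 become 1, Monday to Friday become 0.
toy["Pickup_-_Weekday_(Mo_=_1)"] = (
    toy["Pickup_-_Weekday_(Mo_=_1)"].isin([6, 7]).astype(int)
)

# Rain indicator: missing readings are treated as no rain (assumption).
toy["Precipitation_in_millimeters"] = (
    toy["Precipitation_in_millimeters"].fillna(0) > 0
).astype(int)

print(toy)
```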

2) Model training using the training dataset:

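from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# features and target
X = train1.drop('Time_from_Pickup_to_Arrival', axis=1)
y = train1['Time_from_Pickup_to_Arrival']

# standardise the features, then fit a linear model by stochastic gradient descent
reg = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1000, tol=1e-3))
reg.fit(X, y)

# predictions on the training data
y_preds = reg.predict(X)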
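
For readers without the Sendy data, the same pipeline can be tried on synthetic inputs. The sketch below is only an illustration: the feature values, the "120 seconds per km" relationship, and the names `X_demo`, `y_demo` and `demo_reg` are all made up, and `random_state=0` is added purely for reproducibility.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins for the three features:
# weekend flag, rain flag, distance in km.
X_demo = np.column_stack([
    rng.integers(0, 2, n),
    rng.integers(0, 2, n),
    rng.uniform(1, 30, n),
])

# Made-up target: about 120 seconds per km plus noise.
y_demo = 300 + 120 * X_demo[:, 2] + rng.normal(0, 60, n)

# Same pipeline shape as above: scale the features, then fit with SGD.
demo_reg = make_pipeline(
    StandardScaler(),
    SGDRegressor(max_iter=1000, tol=1e-3, random_state=0),
)
demo_reg.fit(X_demo, y_demo)

demo_preds = demo_reg.predict(X_demo)
print(demo_preds[:5])
```

Scaling comes first because stochastic gradient descent is sensitive to feature scale: the distance column spans a much wider range than the two 0/1 flags.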

3) Next, I calculated the root mean squared error (RMSE) between the predicted and actual delivery times on the training data:

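from sklearn.metrics import mean_squared_error

def rmse(y_test, y_predict):
    # square root of the mean squared error, in the same units as the target (seconds)
    return np.sqrt(mean_squared_error(y_test, y_predict))

answer = rmse(y, y_preds)
answer
      794.4000853649443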
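
As a worked illustration of what this metric measures, the sketch below computes the RMSE step by step with NumPy on three made-up travel times (the numbers and variable names are hypothetical, not taken from the Sendy data).

```python
import numpy as np

# Made-up actual and predicted travel times, in seconds.
actual = np.array([1000.0, 1500.0, 2000.0])
predicted = np.array([1100.0, 1400.0, 2300.0])

errors = predicted - actual      # per-delivery error: 100, -100, 300
mse = np.mean(errors ** 2)       # (10000 + 10000 + 90000) / 3
rmse_value = np.sqrt(mse)        # back in seconds

print(round(float(rmse_value), 2))
```

Because the errors are squared before averaging, the single 300-second miss dominates the result, which is why a few badly predicted deliveries can inflate the RMSE.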

4) Finally, I applied the model to predict delivery times on the test dataset:

import pandas as pd

# test1 holds only the three feature columns, so there is no target column to drop
xt = test1

ytest_preds = reg.predict(xt)

daf = pd.DataFrame(ytest_preds, columns=['Time_from_Pickup_to_Arrival'])
daf.head()
   Time_from_Pickup_to_Arrival
0                  1418.709746
1                  1124.555383
2                  1124.555383
3                  1124.555383
4                  1222.606838

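An RMSE of about 794 seconds (roughly 13 minutes) indicates substantial prediction error. Note that it was computed on the training data, so the error on unseen deliveries may differ. To lower the RMSE, the model can be tuned further, for example by including more attributes. Alternatively, since SGDRegressor already fits a linear model, a plain least-squares linear regression is a simpler choice if only three attributes are used.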
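
A minimal sketch of the simpler alternative, assuming scikit-learn's `LinearRegression` as the least-squares model. The data is synthetic (the "120 seconds per km" relationship and the names `X_lin`, `y_lin` and `ols` are made up for illustration), so the numbers say nothing about how this would perform on the Sendy data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
n = 400

# Synthetic stand-ins: weekend flag, rain flag, distance in km.
X_lin = np.column_stack([
    rng.integers(0, 2, n),
    rng.integers(0, 2, n),
    rng.uniform(1, 30, n),
])

# Made-up target: about 120 seconds per km plus noise.
y_lin = 300 + 120 * X_lin[:, 2] + rng.normal(0, 60, n)

# Ordinary least squares has a closed-form solution:
# no feature scaling or iteration settings are needed.
ols = LinearRegression().fit(X_lin, y_lin)
ols_rmse = np.sqrt(mean_squared_error(y_lin, ols.predict(X_lin)))

print(ols.coef_, ols.intercept_, ols_rmse)
```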