Millions of people move to Johannesburg every year in search of opportunities, yet the housing market is not able to keep up with demand. The government’s Department of Human Settlements has various housing programs in all municipalities such as Johannesburg to provide housing subsidies to citizens. A housing subsidy is a once-off grant paid by the government to qualifying beneficiaries for housing purposes. The grant is not paid in cash to beneficiaries but, rather, is paid to the seller of a house or, in new subsidised housing developments, the grant is used to construct a house. The title deed of the constructed house is then registered in the name of the beneficiary.
This project explores the housing subsidy applicant data for citizens in the City of Johannesburg which stood at approximately 450,000 applicants in 2018.
This page is split into two sections, Insights and Data Cleaning - and covers 5 insights I extracted from the dataset and also 5 things I did to clean the dataset and improve data quality. I have included snippets of the Python code used in my analysis and made use of the Seaborn library and Tableau to create data visulisations.
sns.countplot(x=newdf['Region '], order=['A','B','C','D','E','F','G'])
Below is a breakdown of how I explored and cleaned the public housing applicant dataset.
Here is a summary of rows that have data in each column and as we can see several columns have data gaps where no values were entered:
newdf.count()
ProvinceID 455791
Municipality 455791
TownName 455791
Region 455791
Area 455791
QuestionaireID 455791
Registration Date 455791
Year 455791
Month 455791
Day 455791
StreetName 454786
HouseNo 455498
WardNo 25097
HomeTelNo 216429
WorkTelNo 1843
CellNo 290476
Relationship 455791
Age Main Member 455388
Gender Main Member 455786
Spouse Age 36302
Spouse GenderID 36305
MonthlyIncome 455717
HSS Status 455791
DwellingType 455788
EnergySource 455788
Sanitation 455788
Water 455788
Disability 73683
dtype: int64
p = newdf.duplicated('QuestionaireID')
q = []
for i in p.iteritems():
if i[1] == True:
q.append(i[0])
len(q) = 14238
newdf.drop_duplicates(subset='QuestionaireID', keep='first',inplace=True)
newdf = newdf.drop(columns=['ProvinceID','Municipality','WorkTelNo','Spouse Age', 'Spouse GenderID'])
newdf['StreetName'] = newdf['StreetName'].fillna('None')
newdf['HouseNo'] = newdf['HouseNo'].fillna('None')
newdf['WardNo'] = newdf['WardNo'].fillna(0)
newdf['HomeTelNo'] = newdf['HomeTelNo'].fillna('None')
newdf['CellNo'] = newdf['CellNo'].fillna('None')
newdf['Age Main Member'] = newdf['Age Main Member'].fillna(0)
newdf['Gender Main Member'] = newdf['Gender Main Member'].fillna('U')
newdf['MonthlyIncome'] = newdf['MonthlyIncome'].fillna('0. Response not given')
newdf['DwellingType'] = newdf['DwellingType'].fillna('9. Unknown')
newdf['EnergySource'] = newdf['EnergySource'].fillna('9. Unknown')
newdf['Sanitation'] = newdf['Sanitation'].fillna('10. Unknown')
newdf['Water'] = newdf['Water'].fillna('12. Unknown')
newdf['WardNo'].unique()
array([0, '11', '2', '12', '121', '32', '23', '22', '21', '10', '1002',
'2003', '100', '200', '65', '95', '920', '632', '0', '62', '16',
'119', '25', '66', '29', '19', ' ', 'BLOCK 8', '47', '53', '48',
'63', '321', '500', '153', '85', '42', '43', '103', '1661', '654',
'129', '44', '45', '7', '50', 'none', '84', '49', '82', '52', '9',
12.0, 14.0, 24.0, 35.0, 32.0, 200.0, 100.0, 20.0, 65.0, 95.0, 50.0,
119.0, 53.0, 952.0, 129.0, 51.0, 61.0, 54.0, 31.0, 29.0, 30.0,
10.0, 15.0, 125.0, 123.0, 120.0, 21.0, 34.0, 28.0, 632.0, 652.0,
321.0, 62.0, 654.0, 39.0, 2003.0, 1002.0, 27.0, 621.0, 962.0,
332.0, 19.0, 26.0, 25.0, 122.0, 112.0, 236.0, 3176.0, 23.0, 2001.0,
52.0, 130.0, 110.0, '26', '33', '300', '231', '35', '621', '36',
'951', '37', '34', '652', '54', '6', '24', '46', '51', '1', '17',
'13', '55', '18', 'Q', '2127', '259', '41', '38', '39', '322',
'2000', '20', '98', '962', '130', '14', '40', '102', '8', '230',
'123', '1230', '522', '620', '191', 'NONE', '31', '120', '852',
'3', '202', '125', '563', '630', '5', '4', '56', '921', '15', '27',
'28', '30', '106', '227', '92', ' ', '362', 'WARD 25',
'REGION 9', 'MAPETLA', '952', '2001', '3003', '39/41', 'WARD 39',
'213', '963', '263', '210', 'ward 15', '122', 'EXT 29', '542',
'49FARANA', '127', '113', 'ward 12', '1003', '193', '320', '520',
'57', 'NUL', 'CHRIS HANI', '149', '201', '114', 'REGION D', '651',
'400', ' N/A', '258', 'PIMVILLE EXT6', 'kh', '4061A', '203', '115',
'135', 'NU', 'ZONE 6', '128', 'DIEPKLOOF ZONE 4', '199',
'PROTEA NORTH', 'PIMVILLE ZONE 4', '67', 'non', 'WARD 75',
' ', '107', '91', '90', '600', '521', '159', 'E', '75',
'no 15 avenue', '3002', 'REGION A', ']', 'REGION6', 'A', '352',
'156', '1159', '145', '221', '911', '960', '96', '116', '154',
'152', '69', '650', '124', '77', '60', '2111', '403', '68', '118',
'279', ' CAMP', '58', '965', 'van der merwe', '178', '1830', '126',
'627', '1811', 'fre', '24 FREEDOM PARK', 'AVENUE', '1376', '3597',
'STR', '59', 'WARD 62', '61', '1259', '86', 'WARD 60', 'NO', '610',
'73', 'WARD 21', '74', '2100', '79', '105', '196', '80', 'T',
'UNKOWN', '97', '1122', '89', '64', 'BLOCK A', 'unknoww',
'THABEDE', '110', 'GRASMERE', 'MOUNTAIN VIEW', 'G', 'unknown',
'832', 'PLOT 56', 'WARD 10', '241', '358', '829', '693', 'ext 9',
22.0, 1.0, 57.0, 89.0, 6.0, 7.0, 79.0, 111.0, 77.0, '2103', '192',
'78', 'mlilo', '08 REGION A', '111', '214', '177', 'RGION 4',
'131', 'EXT 3B', '2013', '1000', 'ward 3', '953', '99', '491',
'EXT2', '71', '70', '72', '925', '624', '914', '220', '81', '83'],
dtype=object)
newdf['Age Main Member'].describe()
count 441553.000000
mean 48.312248
std 13.670387
min 0.000000
25% 39.000000
50% 47.000000
75% 57.000000
max 1814.000000
Name: Age Main Member, dtype: float64
newdf.drop(newdf[newdf['Age Main Member']>100].index, inplace=True)
newdf.drop(newdf[newdf['Age Main Member']<18].index, inplace=True)
newdf['Age Main Member'].describe()
count 440905.000000
mean 48.323522
std 13.204649
min 18.000000
25% 39.000000
50% 47.000000
75% 57.000000
max 100.000000
Name: Age Main Member, dtype: float64