Wednesday, January 25, 2017

Data Cleansing

I want to elaborate a little bit more on the process I followed to construct the Ecobici dashboard in my previous entry.

The original file included almost 690 thousand rows corresponding to the trips that public bike users made in December 2016. The raw archive was provided in the following format:

Genero_Usuario,Edad_Usuario,Bici,Ciclo_Estacion_Retiro,Fecha_Retiro,Hora_Retiro,Ciclo_Estacion_Arribo,Fecha_Arribo,Hora_Arribo
M,33,3340,24,01/12/2016,12:00:07 AM,123,01/12/2016,12:14:48 AM
M,36,7863,75,01/12/2016,12:00:28 AM,278,01/12/2016,12:09:24 AM
F,45,2964,365,01/12/2016,12:00:37 AM,369,01/12/2016,12:08:20 AM
...

It is a text file that can be opened with Excel. The guiding wizard will ask for a few parameters to determine what character is being used as a column separator (a comma in this case) and the data type for each column. I guess this file could be handled directly from Tableau, but I found it easier to make some arrangements first in Excel:

  • Merging columns

    The original file contains separate columns for the dates and times when each trip begins and also for when it ends (four columns in total). I decided to merge the information into a single column for the departures and another one for the arrivals (two columns). I simply concatenated the values in Excel, but it's important to take care of the data types that Excel assigns to each column or this concatenation will result in some bizarre values. While exporting the CSV file with the Excel wizard, make sure to choose the Text type for the dates and times columns, or you could also use the function TEXT(value,format)if you selected the default General type while pulling the data.

    I later learned that Tableau is a little picky on the time formatting and I also had to convert the values to the 24hrs convention instead of the am/pm used in the original file. In the end, this was the function I used:
    CONCATENATE(A1," ",TEXT(B1,"HH:MM:SS"))

  • Joins (vlookup)

    The original file includes the station IDs for the origin and destination of each trip. However, I also wanted to know the address of each point so that I could draw the points into a map.

    Ecobici offers a JSON service that in real time publishes the bike availability per station. This document can be easily converted to a CSV file containing the following information:

    id,district,lon,lat,bikes,slots,zip,address,addressNumber,nearbyStations,status,name,stationType
    1,CUA,-99.168051,19.433296,13,14,06500,001 - Río Balsas-Río Sena,S/N,"2,3,85",OPN,1 RIO BALSAS-RIO SENA,"BIKE,TPV"
    2,CUA,-99.171695,19.431386,2,8,06500,002 - Río Guadalquivir - Río Balsas,S/N,"1,5",OPN,2 RIO GUADALQUIVIR-RIO BALSAS,BIKE
    ...


    From this file I created a catalogue where I could find the latitud, longitud and name of each station using VLOOKUP(value,range,column).

  • Calculated fields

    There are at least a couple of measures that were not included in the original dataset, but that can be determined from the provided data. In this case, I was interested in knowing more about each trip:

    1. Time spent on each trip: This is a trivial substraction between two columns in Excel.
    2. Distance between the origin and destination: I used the haversine formula to calculate the distance between two points expressed as latitude and longitude. This is also an easy computation, but this time I made it in Tableau because I came to this idea once I had already imported the data to the tool. It is important to note that this is just an estimation because it considers a straight line between the origin and destination, and not the real route that could be very different and much longer. It also disregards the extra distance that result from biking up and down any elevation variances. 

    I have also been considering it would be interesting to observe the deficit and superavit of bikes each station may have at different times during the day. This is an important metric for service operators to relocate the bikes from where there are too many, to where there are shortages. I'll probably leave this exercise for the future.

No comments:

Post a Comment