Monday, January 30, 2017

Efficiency


I have been reflecting on my dashboards performance...

You may have noticed that the Ecobici dashboard I presented on my previous couple of posts takes a while to refresh every time you filter the views. This exercise is using a considerable number of rows and there are some optimization tricks that I should have then taken into consideation.

To begin with, it would have helped to calculate some fields in the source file instead of making the calculations in Tableau:
  • The distance calculation is a clear example and I should have done it just as I did with the trip time.
  • In the end, it was not a good idea to merge the date and time values into a single column and rely on Tableau's functions to calculate YEAR, WEEKDAY, DAY, HOUR, etc. It is much more efficient to have distinct columns for the measures being used.

There are other factors to consider for responsive dashboards and I've found a couple of nice articles that may be helpful to go deeper in this topic:

For now, I only added the date and distance calculations on the source file and this had a significant impact on the performance:




Wednesday, January 25, 2017

Data Cleansing

I want to elaborate a little bit more on the process I followed to construct the Ecobici dashboard in my previous entry.

The original file included almost 690 thousand rows corresponding to the trips that public bike users made in December 2016. The raw archive was provided in the following format:

Genero_Usuario,Edad_Usuario,Bici,Ciclo_Estacion_Retiro,Fecha_Retiro,Hora_Retiro,Ciclo_Estacion_Arribo,Fecha_Arribo,Hora_Arribo
M,33,3340,24,01/12/2016,12:00:07 AM,123,01/12/2016,12:14:48 AM
M,36,7863,75,01/12/2016,12:00:28 AM,278,01/12/2016,12:09:24 AM
F,45,2964,365,01/12/2016,12:00:37 AM,369,01/12/2016,12:08:20 AM
...

It is a text file that can be opened with Excel. The guiding wizard will ask for a few parameters to determine what character is being used as a column separator (a comma in this case) and the data type for each column. I guess this file could be handled directly from Tableau, but I found it easier to make some arrangements first in Excel:

  • Merging columns

    The original file contains separate columns for the dates and times when each trip begins and also for when it ends (four columns in total). I decided to merge the information into a single column for the departures and another one for the arrivals (two columns). I simply concatenated the values in Excel, but it's important to take care of the data types that Excel assigns to each column or this concatenation will result in some bizarre values. While exporting the CSV file with the Excel wizard, make sure to choose the Text type for the dates and times columns, or you could also use the function TEXT(value,format)if you selected the default General type while pulling the data.

    I later learned that Tableau is a little picky on the time formatting and I also had to convert the values to the 24hrs convention instead of the am/pm used in the original file. In the end, this was the function I used:
    CONCATENATE(A1," ",TEXT(B1,"HH:MM:SS"))

  • Joins (vlookup)

    The original file includes the station IDs for the origin and destination of each trip. However, I also wanted to know the address of each point so that I could draw the points into a map.

    Ecobici offers a JSON service that in real time publishes the bike availability per station. This document can be easily converted to a CSV file containing the following information:

    id,district,lon,lat,bikes,slots,zip,address,addressNumber,nearbyStations,status,name,stationType
    1,CUA,-99.168051,19.433296,13,14,06500,001 - Río Balsas-Río Sena,S/N,"2,3,85",OPN,1 RIO BALSAS-RIO SENA,"BIKE,TPV"
    2,CUA,-99.171695,19.431386,2,8,06500,002 - Río Guadalquivir - Río Balsas,S/N,"1,5",OPN,2 RIO GUADALQUIVIR-RIO BALSAS,BIKE
    ...


    From this file I created a catalogue where I could find the latitud, longitud and name of each station using VLOOKUP(value,range,column).

  • Calculated fields

    There are at least a couple of measures that were not included in the original dataset, but that can be determined from the provided data. In this case, I was interested in knowing more about each trip:

    1. Time spent on each trip: This is a trivial substraction between two columns in Excel.
    2. Distance between the origin and destination: I used the haversine formula to calculate the distance between two points expressed as latitude and longitude. This is also an easy computation, but this time I made it in Tableau because I came to this idea once I had already imported the data to the tool. It is important to note that this is just an estimation because it considers a straight line between the origin and destination, and not the real route that could be very different and much longer. It also disregards the extra distance that result from biking up and down any elevation variances. 

    I have also been considering it would be interesting to observe the deficit and superavit of bikes each station may have at different times during the day. This is an important metric for service operators to relocate the bikes from where there are too many, to where there are shortages. I'll probably leave this exercise for the future.

Monday, January 16, 2017

Bike Share in Mexico City


Playing with open data for public bike usage in Mexico City!




For this observation, I downloaded some of the open data provided by Ecobici in Mexico City. I specifically used the December 2016 results which are available as a CSV archive that contains information of 688,657 trips. Although December may be atypical because of the holidays, some interesting observations result from the exploration of the data:
  • The busiest time of the week is Thursdays 8am.
  • The busiest origin stations are not necessarily the same as the busiest destination stations.
OriginTrips
REFORMA-HAVRE 8,319
REFORMA-RIO SENA 5,837
RIO BALSAS-RIO SENA 5,482
AV. CENTRAL-J. MENESES                  5,443
REFORMA-LIEJA 5,268
      
DestinationTrips
REFORMA-HAVRE 8,335
AV. JESÚS GARCIA-CARLOS J. MENESES 6,061
REFORMA-RIO SENA 5,882
RIO BALSAS-RIO SENA 5,605
AVENIDA MEXICO-SONORA 5,390
  • It is also interesting to observe that the daily origin heatmap varies for each station:
    • If the heatmap is skewed towards the morning, probably the location corresponds to a residential area where users are riding from home to work (or school, or wherever people begin their daily activities).
      For example, select station 266 AV. JESÚS GARCIA-CARLOS J. MENESES (don't forget to click the filtered station on the origin map).
    • If the heatmap is more active around 18hrs, this may signify that more offices are located around the zone and are using the station to leave from work.
      For example, filter by station 27 REFORMA-HAVRE.
  • The longest trip in December was of 10.36km (note that this is just a straight line approximation) and it took place from 244 EJERCITO NACIONAL-F.C. DE CUERNAVACA to 445 RIFF-AVENIDA RIO CHURUBUSCO. However, the average trip is just 1.639km and the average time of usage is 14.5mins, does this mean the average speed is only 6.78km/h?