Many metropolitan areas in the US have bikesharing programs -- public bicycles the public may use for a small fee, and return to any other bike rack in the city. In Milwaukee, this system is called Bublr and is backed by BCycle (an organization partially owned by Trek).
At first, I thought the available data would be sparse, or that Bublr would be unwilling to lend us the data. We shot an email over to their support line, and heard back from their operations director. He stated that the bikeshare runs on an open data system called the General Bikeshare Feed Specification (GBFS) which is published by a consortium of bikeshare organizations called the North American Bikeshare Association (NABSA).
It's wonderful that such data is available publicly, and a record of available GBFS servers is available alongside the specification.
The introduction to the GBFS states that it is not designed considering "information whose primary purpose is historical". Therefore, we need to scrape this data ourselves if we want to see historical trends.
In order to scrape this data, we created a shell script that will look at the record of servers, and hit each one for their information. In order to be respectful, we are only downloading the information once every five minutes. Our script also compresses the data after it is downloaded, so that it takes up fewer server resources -- we are running this on a puny VPS that has other things running.
We wrote our script with a little bit of cleverness in mind: in their compressed form, the files are organized by date and with a timestamp identifier. However, each file can be decompressed individually, or in bulk, to a single directory and get the exact same directory structure back out. This allows us to select batches of dates or times that we want to analyze, decompress them en-masse, and have the entire dataset ready to work with, akin to doing a query.
However, there is a problem with the overall format of the data we gather: it is not great for doing historical analyses, because each dump is a snapshot of the stations 'right now'. If we want to do any kind of historical querying of this, we need to write programs to go through each data file to find the answer. To be honest, this sounds like a lot of work, and we like to be ~lazy~ efficient. The data itself can be shoved into a database, it just requires a little bit of tooling.
To start, we'll look to the GBFS to see which parts of the data are absolutely required. It's find to have extra data that we don't know what to do with yet, but we need some sort of baseline for creating a database. Looking at the spec, we can see three files are mandated:
station_status.json. We can be relatively sure that each of our dumps have these files in them (unless they don't follow spec, which we'll see once we try to gather the information for import).
By looking at the data contained in each of those files, we can start to see how a database can be managed. The
*_information.json files are short, and only provide things like names, phone numbers, etc. These are not important to look at historically, so this information will be the thing related to the primary information.
station_status.json is where the goods are, because it lists things important to look at historically, like the current number of bicycles available.
Before compression: data +-- <site> +-- <date> +-- <timestamp> +-- data After compression: Data +-- <date> +-- <timestamp>.tar.bz2
After we collect some information for a few days, we can start to see trends in the data. For our examples, we will look at the Milwaukee bikeshare system, Bublr.
By tracking the total number of bikes inside the entire system at one time, we can measure concurrency of rides: how many riders are riding at the same time?
With this measure, it is easy to see that during the day, the most amount of bikes have been checked out. And, at night, there are times when there isn't a single Bublr rider.
Here, we waited a few days to collect some historical data. This should be enough for us to see trends across days.
Once we have collected the information, we can start to answer some questions that would help us administer the local bikeshare system.
- What are the the peak times of Bublr usage, and how many concurrent riders are there at any point of the day?
- Which stations are the most/least frequented? How much is the difference? Can this help us to identify new areas to expand to?
- How can predictive analytics be implemented to determine when bikes will need to be serviced?
- Is there a general trend of direction that bikes are dropped off in (i.e. downhill from other points in the city)?
Peak times/concurrent riders
We can determine the number of peak riders by looking at the total number of bikes in the system. Across the day, we should be able to tell what the min and max numbers of bikes are.
The point at which there are the least bikes in the system is the point where the most bikes are currently being ridden. Additionally, we can measure rider activity by reading the amount that the total number of bikes fluctuates -- this will happen when someone checks out, then returns, a bike. Large numbers of users aren't expected to check out bikes at the same time, so this should be a good measure.
Data cleaning and normalization
The first step to processing this data is to clean up the impulse data into something that can be logged historically. Because every GBFS-speaking bikeshare system must include some basic information, we can start with that.
By writing a small shell script, we can process this data with a method called ETL (extract, transform, load). We want to have the ability to make flexible queries on this data, so we'll throw it into a MySQL database.
Processing the data intermediately from the original data lets us archive that original data for future mining purposes. This lets us work extremely fast: we can download 100% of the available data now and store it in case we need it, but we don't need to look at it all now. If there is a question that we can't answer with our intermediate database, then we can always go back to the source data to get the answer without waiting around for a new data collection cycle.
After we dump the
station_status.json files, do normalization and insert that transformed data into a database, we can start to do some analytics and build a model for the peak times.
This is a sample of how we're representing the
station_statustable in our database. The information is almost a direct translation from the original JSON files, but we have another field called
system_idso we can link the system information to this data.
To start answering this question, let's take a look at the system as a whole. Grouping these rows and summing the
num_bikes_available field will give us the total number of bikes in the system at any point of time that we've recorded.
SELECT FROM_UNIXTIME(last_reported) as timestamp, SUM(num_bikes_available) as num_bikes_available FROM station_status WHERE system_id = 'bcycle_bublr' GROUP BY last_reported ORDER BY last_reported ASC;
Plotting that data gives us a broader picture with very obvious features.
This graphic is a perfect showcase of why it's important to look at the data with a broader perspective. While it may be obvious that there are more bikes in the system during the night than during the day, viewing this validates that hypothesis and gives us confidence in the data being accurate. Fewer people ride at night than during the day, and around 3 AM the number of available bikes maxes out at 280.
TODO: Elaborate more on this model. Incorporate the postition of the sun in comparison to the bikes available.
Which stations are the busiest?
It would be good to know which Bublr stations are the busiest. We can actually find this answer in the same dataset we queried for our previous question.
The thing we want to measure isactivity. Now, since each of our datapoints are impulse and not historical, we need to derive activity as a function of the change in the number of bikes available. Simply put, when the number of bikes at any station are fluctuating, that is an indicator of activity. It would be great if we had actual data for each station instead of this estimated metric, but we'll make stew with what we have.
In order to prototype our function of activity, we'll only use a singular Bublr station. For this purpose, we'll use the station _bcycle_bublr2780.
Do define activity, we need to record every time the number of bikes changes either positively or negatively.
We can write a SQL query to automatically pull activity:
# Activity is defined as a change in the number of bikes available. SELECT * FROM ( SELECT current_row.system_id as system_id, current_row.station_id as station_id, FROM_UNIXTIME(current_row.last_reported) as timestamp, CAST(previous_row.num_bikes_available as SIGNED INTEGER) - CAST(current_row.num_bikes_available as SIGNED INTEGER) as num_bikes_available_change FROM ( SELECT @rownum:=@rownum+1 as row, a.* FROM station_status a, (SELECT @rownum:=0) r WHERE system_id = 'bcycle_bublr' AND station_id = 'bcycle_bublr_3320' ORDER BY last_reported ASC ) as current_row LEFT JOIN ( SELECT @rownum2:=@rownum2+1 as row, a.* FROM station_status a, (SELECT @rownum2:=0) r WHERE system_id = 'bcycle_bublr' AND station_id = 'bcycle_bublr_3320' ORDER BY last_reported ASC ) as previous_row ON (current_row.row = previous_row.row + 1) ORDER BY current_row.last_reported ASC ) as all_results WHERE num_bikes_available_change <> 0;
Bike Servicing Predictive Analytics
Unfortunately, we don't have the data required to do predictive analytics on these bikes, because we don't have any metrics for the individual bikes. If we had data on the # of rides each bike takes, we could generate servicing distributions for different parts. This would allow Bublr to stock their repair trucks with the appropriate number of parts, and service bikes so they don't wear out in the first place.
We really wish that we had information about individual bikes, so that we could do some predictive analytics on them, on a timeline of miles ridden. Using that data, we could predict when certain bikes fail, what their rate of crime is, etc. Unfortunatley, that data is unavailable, and so we are unable to do that kind of analysis.
Trend of bike deliveries
In order to inspect the general trend of where bikes end up going, we would ideally look at a map with the number of bikes represented on each station's geographical coordinates. As time goes by, we might expect them to change in some fashion.
For this question, we want to look at the average number of bikes across a day. If the general direction of that trend for a particular station is downard, then that station on-the-whole loses bikes during the course of a day. However, if the number of bikes in the station rise (or fill all the available slots) then the station is a common destination for riders.
Our hypothesis is that we will see more bikes being returned around popular areas, amd areas downhill from the stations of withdrawl.
A good starting point would be to first identify which stations receive the most activity. Before, we measured activity as a function of how many times the number of bikes in the station fluctuates. We'll continue using this as the activity metric, and make a quick plot using R to demonstrate where the most active stations are.