Recently I came across this library, learned a little about it, tried it, of course, and decided to share my thoughts.
From the official website: “Bokeh is an interactive visualization library that targets modern web browsers for presentation. Its goal is to provide elegant, concise construction of versatile graphics, and to extend this capability with high-performance interactivity over very large or streaming datasets. Bokeh can help anyone who would like to quickly and easily create interactive plots, dashboards, and data applications.” Pretty clear, I think, but it would be better to see it in action, wouldn’t it?
Before starting, make sure you have Bokeh installed in your environment. If you don’t have it, follow the installation instructions from here.
So I created a small case study for myself: I decided to visualize how CO2 emissions change over time and how they correlate with GDP (and to check whether that correlation even exists, because you never know :|).
So I took two files: one with CO2 emissions from Gapminder.org and another from a DataCamp course (because that file was already preprocessed 😀 yeeeeees, I am a lazy bastard 😀 ). You can also download these files from here.
How do we start analyzing the data? Correct, by importing the necessary packages and the data itself (very important :D). Then we perform some EDA (exploratory data analysis) to understand what we are dealing with, and after that we clean and transform the data into the format needed for the analysis. Pretty straightforward. As the article doesn’t focus on these steps, I will just insert the code below with all the transformations I have made.
import pandas as pd
import numpy as np
# Data cleaning and preparation
data = pd.read_csv('data/co2_emissions_tonnes_per_person.csv')
data.head()
gapminder = pd.read_csv('data/gapminder_tidy.csv')
gapminder.head()
df = gapminder[['Country', 'region']].drop_duplicates()
data_with_regions = pd.merge(data, df, left_on='country', right_on='Country', how='inner')
data_with_regions = data_with_regions.drop('Country', axis='columns')
data_with_regions.head()
new_df = pd.melt(data_with_regions, id_vars=['country', 'region'])
new_df.head()
columns = ['country', 'region', 'year', 'co2']
new_df.columns = columns
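# .copy() so that the column assignment further down works on an independent frame, not a slice of new_df
upd_new_df = new_df[new_df['year'].astype('int64') > 1963].copy()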
upd_new_df.info()
upd_new_df = upd_new_df.sort_values(by=['country', 'year'])
upd_new_df['year'] = upd_new_df['year'].astype('int64')
df_gdp = gapminder[['Country', 'Year', 'gdp']]
df_gdp.columns = ['country', 'year', 'gdp']
df_gdp.info()
final_df = pd.merge(upd_new_df, df_gdp, on=['country', 'year'], how='left')
final_df = final_df.dropna()
final_df.head()
np_co2 = np.array(final_df['co2'])
np_gdp = np.array(final_df['gdp'])
np.corrcoef(np_co2, np_gdp)
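The wide-to-long reshaping that pd.melt does in the code above is easier to see on a toy table. Here is a minimal sketch, assuming a made-up frame with two countries and two year columns (the names and numbers are invented for illustration and are not taken from the Gapminder file):

```python
import pandas as pd

# Hypothetical wide table: one row per country, one column per year
wide = pd.DataFrame({
    "country": ["A", "B"],
    "region": ["Europe", "Asia"],
    "1964": [1.0, 2.0],
    "1965": [1.5, 2.5],
})

# melt keeps the id columns and stacks every other column into rows;
# var_name/value_name name the two new columns directly
tidy = pd.melt(wide, id_vars=["country", "region"],
               var_name="year", value_name="co2")

# The former column headers are strings, so convert them to integers
tidy["year"] = tidy["year"].astype("int64")

print(tidy)
```

Passing var_name and value_name is just one way to get the final column names; renaming the columns afterwards, as in the code above, gives the same result.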
By the way, CO2 emissions and GDP do correlate, and quite strongly: the correlation coefficient is 0.78.
np.corrcoef(np_co2, np_gdp)
Out[138]:
array([[1.        , 0.78219731],
       [0.78219731, 1.        ]])
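np.corrcoef returns the full 2×2 correlation matrix, so the number quoted above is the off-diagonal entry. A small sketch with invented numbers (not the real data) that pulls out that entry and checks it against the Pearson formula computed by hand:

```python
import numpy as np

# Made-up stand-ins for the gdp and co2 columns
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

r_matrix = np.corrcoef(x, y)   # 2x2 matrix, ones on the diagonal
r = r_matrix[0, 1]             # correlation between x and y

# Pearson r by hand: sum of centered cross-products divided by
# the square root of the product of the centered sums of squares
xc, yc = x - x.mean(), y - y.mean()
r_manual = (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

print(r, r_manual)
```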
And now let’s get to the visualization part. Again, we start with the necessary imports. I will explain all of them later. For now, just relax and import.
from bokeh.io import curdoc
from bokeh.plotting import figure
from bokeh.models import HoverTool, ColumnDataSource, CategoricalColorMapper, Slider
from bokeh.palettes import Spectral6
from bokeh.layouts import widgetbox, row
We will start by preparing a few pieces for our interactive visualization app. First, we create a color mapper for the regions of the world, so that every country gets a color that depends on the region it is in. We select the unique regions and convert them to a list. Then we use CategoricalColorMapper to assign a different color to each region.
regions_list = final_df.region.unique().tolist()
color_mapper = CategoricalColorMapper(factors=regions_list, palette=Spectral6)
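One thing worth keeping in mind: Spectral6 holds exactly six colors, so this works nicely as long as there are no more than six regions (as far as I understand, factors without a matching palette entry fall back to the mapper's nan_color). A minimal sketch of the pairing, with a made-up region list instead of the real one:

```python
from bokeh.models import CategoricalColorMapper
from bokeh.palettes import Spectral6

# Hypothetical region list, standing in for final_df.region.unique().tolist()
regions = ["Europe", "Asia", "Africa"]

# Take as many colors as there are regions so the pairing is one-to-one
palette = list(Spectral6[:len(regions)])

mapper = CategoricalColorMapper(factors=regions, palette=palette)

# Factors and palette entries are matched by position
region_to_color = dict(zip(mapper.factors, mapper.palette))
print(region_to_color)
```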
Next, we will prepare a data source for our application. Bokeh accepts many different types of data as the source for graphs and visuals: you can provide data directly as lists of values, pandas DataFrames and Series, NumPy arrays and so on. But the core of most Bokeh plots is ColumnDataSource.
At the most basic level, a ColumnDataSource is simply a mapping between column names and lists of data. The ColumnDataSource takes a data parameter which is a dictionary, with string column names as keys and lists (or arrays) of data values as values. If one positional argument is passed in to the ColumnDataSource initializer, it will be taken as data. (from the official website)
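To make the two ways of filling a ColumnDataSource concrete, here is a small sketch on a toy DataFrame (the values are invented): once with an explicit dictionary, where we choose the column names ourselves, and once by passing the DataFrame as the single positional argument.

```python
import pandas as pd
from bokeh.models import ColumnDataSource

# Toy data standing in for one year of final_df
df = pd.DataFrame({
    "gdp": [1000.0, 2000.0],
    "co2": [0.5, 1.5],
    "country": ["A", "B"],
})

# 1) Explicit dictionary: the keys become the column names used by glyphs
source_from_dict = ColumnDataSource(data={
    "x": df["gdp"],
    "y": df["co2"],
    "country": df["country"],
})

# 2) DataFrame as the positional argument: the DataFrame columns are reused
#    (Bokeh also adds a column for the DataFrame index)
source_from_df = ColumnDataSource(df)

print(source_from_dict.column_names)
print(source_from_df.column_names)
```

The dictionary form is what this article uses, because it lets the plot refer to generic names like x and y no matter which DataFrame columns end up behind them.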
# Make the ColumnDataSource: source
source = ColumnDataSource(data={
    'x': final_df.gdp[final_df['year'] == 1964],
    'y': final_df.co2[final_df['year'] == 1964],
    'country': final_df.country[final_df['year'] == 1964],
    'region': final_df.region[final_df['year'] == 1964],
})
We start with a sample of our data for one year only. We basically create a dictionary of values for x, y, country and region.
The next step is to set up the limits for our axes. We can do that by finding the minimum and maximum values for ‘X’ and ‘Y’.
# Save the minimum and maximum values of the gdp column: xmin, xmax
xmin, xmax = min(final_df.gdp), max(final_df.gdp)
# Save the minimum and maximum values of the co2 column: ymin, ymax
ymin, ymax = min(final_df.co2), max(final_df.co2)
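A caveat for the log axis used later: a logarithmic scale cannot display zero or negative values, so a lower bound of 0 would be a problem. I have not checked whether the real CO2 column contains zeros, so treat the following as a precaution under that assumption. The sketch uses an invented toy frame and takes the smallest strictly positive value as the lower limit:

```python
import pandas as pd

# Toy data with a zero in the co2 column (invented values)
df = pd.DataFrame({
    "gdp": [500.0, 12000.0, 40000.0],
    "co2": [0.0, 4.2, 19.5],
})

# Linear x axis: plain minimum and maximum are fine
xmin, xmax = df["gdp"].min(), df["gdp"].max()

# Log y axis: only strictly positive values are valid as the lower bound
positive_co2 = df.loc[df["co2"] > 0, "co2"]
ymin, ymax = positive_co2.min(), df["co2"].max()

print(xmin, xmax, ymin, ymax)
```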
After that we create our figure, where we will place all our visualization objects. We give it a title, set the width and height, and set the axis ranges. (The ‘Y’ axis is set to log type just for a better view: a few types were tried and this one gave the best result.)
# Create the figure: plot
plot = figure(title='Gapminder Data for 1964',
              plot_height=600, plot_width=1000,
              x_range=(xmin, xmax),
              y_range=(ymin, ymax), y_axis_type='log')
Bokeh uses the term glyph for all the visual shapes that can appear on a plot. The full list of glyphs built into Bokeh is given below (I am not inventing anything, all the info is from the official page):
AnnularWedge
Annulus
Arc
Bezier
Ellipse
HBar
HexTile
Image
ImageRGBA
ImageURL
Line
MultiLine
MultiPolygons
Oval
Patch
Patches
Quad
Quadratic
Ray
Rect
Segment
Step
Text
VBar
Wedge
All these glyphs share a minimal common interface through their base class Glyph.
We won’t go too deep into all these shapes and will use circles, as one of the most basic ones. If you would like to play more with other glyphs, you have all the necessary documentation and links.
# Add circle glyphs to the plot
plot.circle(x='x', y='y', fill_alpha=0.8, source=source, legend='region',
            color=dict(field='region', transform=color_mapper),
            size=7)
So how do we add these circles? We assign our source to the “source” parameter of the circle glyph, specify the data for ‘X’ and ‘Y’, add a legend for the colors, and apply the previously created color mapper to the “color” parameter. “fill_alpha” adds a little transparency, and “size” is the size of the circles that will appear on the plot.
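If you are on a newer Bokeh release, the plain legend keyword may trigger a deprecation warning or no longer be accepted. As far as I know it was split into legend_label, legend_field and legend_group, and legend_field is the one that groups by a column in the browser, which suits a source that changes later. Below is a hedged sketch of the same idea with toy data, using the generic scatter method; everything in it (data, colors, title) is made up for illustration.

```python
from bokeh.models import CategoricalColorMapper, ColumnDataSource
from bokeh.plotting import figure

# Toy source with the same column names as in the article
source = ColumnDataSource(data={
    "x": [1000.0, 5000.0, 20000.0],
    "y": [0.3, 2.0, 9.5],
    "country": ["A", "B", "C"],
    "region": ["Africa", "Asia", "Europe"],
})

mapper = CategoricalColorMapper(
    factors=["Africa", "Asia", "Europe"],
    palette=["#1b9e77", "#d95f02", "#7570b3"],
)

plot = figure(title="Toy scatter", y_axis_type="log")

# scatter draws circle markers by default; legend_field builds one
# legend entry per distinct value of the 'region' column
renderer = plot.scatter(
    x="x", y="y", size=7, fill_alpha=0.8, source=source,
    color={"field": "region", "transform": mapper},
    legend_field="region",
)

plot.legend.location = "bottom_right"
```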
Next we improve the appearance of our plot by setting the location of the legend and adding labels to our axes.
# Set the legend.location attribute of the plot
plot.legend.location = 'bottom_right'
# Set the x-axis label
plot.xaxis.axis_label = 'Income per person (Gross domestic product per person adjusted for differences in purchasing power in international dollars, fixed 2011 prices, PPP based on 2011 ICP)'
# Set the y-axis label
plot.yaxis.axis_label = 'CO2 emissions (tonnes per person)'
As of now we have a basic and static plot for the year 1964, but the title of the article has a word that doesn’t fit with this situation – “Interactive” O_O. So let’s add some interactivity!
To do that we will add a slider with years, so in the end we will have a visualization for every available year. Cool, isn’t it?
Previously we imported the Slider class, and now it’s time to use it! We create an object of this class with start set to the minimum year, end set to the maximum year, the default value set to the minimum year again, step (the increment between slider values) set to 1 year, and a title.
We also create a callback for any change that happens on this slider. Callbacks registered with on_change always take the same input parameters: attr, old, new. We are going to update our data source based on the value of the slider: we create a new dictionary that corresponds to the year from the slider and use it to update our plot. We also update the title accordingly.
# Make a slider object: slider
slider = Slider(start=min(final_df.year), end=max(final_df.year), step=1, value=min(final_df.year), title='Year')
def update_plot(attr, old, new):
    # set the `yr` name to `slider.value` and `source.data = new_data`
    yr = slider.value
    new_data = {
        'x': final_df.gdp[final_df['year'] == yr],
        'y': final_df.co2[final_df['year'] == yr],
        'country': final_df.country[final_df['year'] == yr],
        'region': final_df.region[final_df['year'] == yr],
    }
    source.data = new_data
    # Add title to figure: plot.title.text
    plot.title.text = 'Gapminder data for %d' % yr
# Attach the callback to the 'value' property of slider
slider.on_change('value', update_plot)
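The same four-key dictionary is built twice in this article, once for the initial source and once inside the callback. One possible tidy-up is a small helper that both places can call. This is only a sketch: data_for_year is a name I made up, and the toy frame below stands in for final_df.

```python
import pandas as pd


def data_for_year(df, year):
    """Return the column dictionary for one year (hypothetical helper)."""
    subset = df[df["year"] == year]
    return {
        "x": subset["gdp"],
        "y": subset["co2"],
        "country": subset["country"],
        "region": subset["region"],
    }


# Toy stand-in for final_df (invented values)
toy = pd.DataFrame({
    "country": ["A", "B", "A", "B"],
    "region": ["Europe", "Asia", "Europe", "Asia"],
    "year": [1964, 1964, 1965, 1965],
    "gdp": [1000.0, 2000.0, 1100.0, 2100.0],
    "co2": [0.5, 1.5, 0.6, 1.6],
})

initial = data_for_year(toy, 1964)
print(list(initial["country"]), list(initial["x"]))
```

With a helper like this, the callback body could shrink to source.data = data_for_year(final_df, new), since new already carries the slider's current value.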
With this many data points the plot becomes messy very quickly. So, to make every little circle identifiable, I decided to also add a HoverTool to this figure.
# Create a HoverTool: hover
hover = HoverTool(tooltips=[('Country', '@country'), ('GDP', '@x'), ('CO2 emission', '@y')])
# Add the HoverTool to the plot
plot.add_tools(hover)
HoverTool accepts a list of tuples, where the first value is the label and the second is the field from the data source to display (a column name prefixed with @).
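Raw GDP and CO2 values can be long and noisy in a tooltip. Bokeh tooltips accept a format in curly braces after the field name, so a possible refinement looks like the sketch below. The specific formats ({0,0} for thousands separators, {0.00} for two decimals) are my choice for illustration, not something from the original app.

```python
from bokeh.models import HoverTool
from bokeh.plotting import figure

hover = HoverTool(tooltips=[
    ("Country", "@country"),        # plain column lookup
    ("GDP", "@x{0,0}"),             # thousands separators, no decimals
    ("CO2 emission", "@y{0.00}"),   # two decimal places
])

# Toy figure just to show how the tool is attached
plot = figure(title="Toy plot", tools="pan,wheel_zoom,reset")
plot.add_tools(hover)
```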
We are done with all the components of our little app. Just a few final lines of code remain, to create a layout and add it to the current document:
# Make a row layout of widgetbox(slider) and plot and add it to the current document
layout = row(widgetbox(slider), plot)
curdoc().add_root(layout)
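A note for readers on newer Bokeh versions: as far as I can tell, widgetbox was deprecated and later removed, with column suggested as the replacement. If the import fails for you, a layout along these lines should give a similar result. The slider range and figure here are placeholders, not the real ones from the app.

```python
from bokeh.layouts import column, row
from bokeh.models import Slider
from bokeh.plotting import figure

# Placeholder widgets; in the app these are the real slider and plot
slider = Slider(start=1964, end=2013, step=1, value=1964, title="Year")
plot = figure(title="Toy plot")

# column(slider) plays the role that widgetbox(slider) has in the article
layout = row(column(slider), plot)
```

In the app itself, the layout would then be passed to curdoc().add_root(layout) exactly as before.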
And we are done! Congratulations! We run this code and… Nothing. No errors (or maybe some errors, but then you fix them and there are no errors) and no app, no visualization O_o. Why the hell did I spend all that time creating a cool plot only to get nothing? Not even an explanation of what I did wrong?
Those were my first thoughts when I tried to run the app. But then I remembered the trick: you first have to start a server that will act as the backend for this visualization.
So the next and last thing you have to do is run the command below from your command line:
bokeh serve --show my_python_file.py
And it will automatically open your visualization in a new browser tab.
Despite being the most popular, matplotlib is not the most user-friendly data visualization tool and has its own limitations, aaaaand I don’t really like it. So Bokeh is one possible solution if you belong to the same cohort of people as I do. Try it and let me know your thoughts.
Thank you for your attention, I hope this little introduction to Bokeh was useful, and have a great day! (or night if you are reading this before going to bed :D)
P.S. I want to try plotly as well, I saw a lot of positive feedback about it.
P.P.S. Code on GitHub.
Photo by Yosh Ginsu on Unsplash
Hello,
I am trying to download the two files:
So I took two files: one with CO2 emissions from Gapminder.org and another from a DataCamp course (because that file was already preprocessed 😀 yeeeeees, I am a lazy bastard 😀 ). You can also download these files from here
but I can’t. Could you send them to me?
Thank you very much