Recently I learned that the best way to digest and assimilate information is a two-step algorithm:
- Share the new information with 2 different people within 12-24 hours.
- Apply the new knowledge: practice it.
And that’s it. Nothing complicated. So I decided to use my blog for the first part.
I constantly learn new data science techniques, so here I want to share the most recent one.
(technical text begins here) In the last lesson I learned more about pandas time series and how to work with indexes that contain this type of data.
The first, and very awesome, feature of a time series index is partial datetime string selection:
# Select sales data for the 5th of February, 2015
sales.loc['2015-02-05']
sales.loc['February 5, 2015']
sales.loc['2015-Feb-5']
# Whole month
sales.loc['2015-2']
# Whole year
sales.loc['2015']
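To try the selections above without real data, here is a minimal sketch. The `sales` frame, its `units` column and the hourly values are made up for illustration; only the `.loc` string selection comes from the lesson. The frequency is spelled "h" because newer pandas versions prefer the lowercase alias.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly sales figures for all of 2015 (illustrative data only).
hours = pd.date_range("2015-01-01", "2015-12-31 23:00", freq="h")
sales = pd.DataFrame({"units": np.arange(len(hours))}, index=hours)

one_day = sales.loc["2015-02-05"]                 # every hour of 5 February
one_month = sales.loc["2015-02"]                  # all of February 2015
one_year = sales.loc["2015"]                      # the whole year
six_days = sales.loc["2015-02-05":"2015-02-10"]   # slice, both ends inclusive

print(len(one_day), len(one_month), len(one_year), len(six_days))
```

Note that a string slice includes the whole end day, so the last selection covers six full days.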
Be careful with data types. If your index consists of strings, the tricks above won’t work. To convert strings to datetimes we can use the pandas to_datetime() function. The format parameter tells pandas how the strings are laid out. Without it, pandas already understands ISO 8601 strings (‘yyyy-mm-dd hh:mm:ss’).
# The format must describe the strings being parsed (here: dates without a time part)
pd.to_datetime(['2015-2-16', '2015-2-20'], format='%Y-%m-%d')
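Here is a small sketch of the full round trip: a frame with a string index, converted so that partial string selection works. The frame `df` and its values are invented for the example.

```python
import pandas as pd

# Hypothetical frame whose index holds dates as plain strings.
df = pd.DataFrame(
    {"units": [10, 20, 30]},
    index=["2015-02-16", "2015-02-20", "2015-03-01"],
)

# Replace the string index with a DatetimeIndex; the format matches the strings.
df.index = pd.to_datetime(df.index, format="%Y-%m-%d")

# Partial string selection now works: both February rows are returned.
february = df.loc["2015-02"]
print(february)
```

Before the conversion, the same `df.loc["2015-02"]` lookup would fail, because the index contains no label that is literally "2015-02".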
Another cool feature of a time series index is resampling. There are two types of it: downsampling and upsampling. Downsampling lowers the frequency. Say we have 9 rows of data covering 9 hours, one row per hour. We can downsample them and get a summary for each 3-hour group. Example:
>>> index = pd.date_range('1/1/2000', periods=9, freq='H')
>>> series = pd.Series(range(9), index=index)
>>> series
2000-01-01 00:00:00    0
2000-01-01 01:00:00    1
2000-01-01 02:00:00    2
2000-01-01 03:00:00    3
2000-01-01 04:00:00    4
2000-01-01 05:00:00    5
2000-01-01 06:00:00    6
2000-01-01 07:00:00    7
2000-01-01 08:00:00    8
Freq: H, dtype: int64
>>> series.resample('3H').sum()
2000-01-01 00:00:00     3
2000-01-01 03:00:00    12
2000-01-01 06:00:00    21
Freq: 3H, dtype: int64
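The sum is only one possible summary. As a sketch, the same 3-hour groups can be aggregated in other ways. The choice of mean, max and `agg` is mine, not from the lesson. The frequencies are spelled "h" and "3h" because newer pandas versions prefer the lowercase aliases.

```python
import pandas as pd

# Same hourly series as in the example above.
index = pd.date_range("2000-01-01", periods=9, freq="h")
series = pd.Series(range(9), index=index)

three_hour_mean = series.resample("3h").mean()   # average of each 3-hour group
three_hour_max = series.resample("3h").max()     # largest value in each group

# Several summaries at once: one column per aggregation.
summary = series.resample("3h").agg(["sum", "mean", "max"])
print(summary)
```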
Upsampling is the operation in the opposite direction: we raise the frequency, and the new rows have no data yet. Example: upsample the hourly series into 30-minute bins.
>>> series.resample('30T').asfreq()[0:5] # select first 5 rows
2000-01-01 00:00:00    0.0
2000-01-01 00:30:00    NaN
2000-01-01 01:00:00    1.0
2000-01-01 01:30:00    NaN
2000-01-01 02:00:00    2.0
Freq: 30T, dtype: float64
Upsample the series into 30-minute bins and fill the NaN values by carrying the last known value forward. The lesson used the pad method for this; newer pandas versions only offer it under the name ffill.
>>> series.resample('30T').ffill()[0:5]
2000-01-01 00:00:00    0
2000-01-01 00:30:00    0
2000-01-01 01:00:00    1
2000-01-01 01:30:00    1
2000-01-01 02:00:00    2
Freq: 30T, dtype: int64
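Forward filling is not the only way to fill the gaps that upsampling creates. This sketch puts a few strategies side by side on a shortened 3-hour series. The comparison and the variable names are my own addition. The frequencies use the newer "h" and "30min" spellings.

```python
import pandas as pd

# Three hourly values, upsampled to 30-minute bins in different ways.
index = pd.date_range("2000-01-01", periods=3, freq="h")
series = pd.Series([0, 1, 2], index=index)

gaps = series.resample("30min").asfreq()        # new rows stay NaN
forward = series.resample("30min").ffill()      # repeat the last known value
backward = series.resample("30min").bfill()     # use the next known value
linear = series.resample("30min").interpolate() # straight line between neighbours

print(pd.DataFrame({"asfreq": gaps, "ffill": forward,
                    "bfill": backward, "interpolate": linear}))
```

Which one fits depends on the data: forward fill suits values that hold until they change, while interpolation suits quantities that move smoothly.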
Data Visualization
We have a variety of options to customize our plots in Python. We can change the color, the marker and the line type. Below is a little summary of the available options.
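As a quick sketch of those three options, here is the hourly series from earlier drawn twice with matplotlib: once with keyword arguments and once with the short format string. The styles and labels are arbitrary picks for illustration, and the Agg backend line is only there so the script also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

index = pd.date_range("2000-01-01", periods=9, freq="h")
series = pd.Series(range(9), index=index)

fig, ax = plt.subplots()

# Explicit keywords: red dashed line with circle markers.
ax.plot(series.index, series.values,
        color="red", marker="o", linestyle="--", label="hourly values")

# Format-string shorthand: "b^:" means blue, triangle markers, dotted line.
ax.plot(series.index, series.values[::-1], "b^:", label="reversed values")

ax.set_xlabel("time")
ax.legend()
```

To look at the result outside a notebook, the figure can be written to disk with `fig.savefig` and a file name of your choice.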
That’s all for today. I will learn something new and share it here soon. Have an awesome day!