Tips on Working with Datetime Index in pandas

As you may understand from the title it is not a complete guide on Time Series or Datetime data type in Python. So if you expect to get in-depth explanation from A to Z it’s a wrong place. Seriously. There is a fantastic article on this topic, well explained, detailed and quite straightforward. Don’t waste your time on this one.

For those who have reached this part I will tell that you will find something useful here for sure. Again, seriously. I found my notes on Time Series and decided to organize it into a little article with general tips, which are aplicable, I guess, in 80 to 90% of times you work with dates. So it’s worth sharing, isn’t it?

I have a dataset with air pollutants measurements for every hour since 2016 in Madrid, so I will use it as an example.

Importing data

By default pandas will use the first column as index while importing csv file with read_csv(), so if your datetime column isn’t first you will need to specify it explicitly index_col='date'.

The beauty of pandas is that it can preprocess your datetime data during import. By specifying parse_dates=True pandas will try parsing the index, if we pass list of ints or names e.g. if [1, 2, 3] – it will try parsing columns 1, 2, 3 each as a separate date column, list of lists e.g. if [[1, 3]] – combine columns 1 and 3 and parse as a single date column, dict, e.g. {‘foo’ : [1, 3]} – parse columns 1, 3 as date and call result ‘foo’. If you are using other method to import data you can always use pd.to_datetime after it.

I have imported my data using the following code:

import pandas as pd
import glob

pattern = 'data/madrid*.csv'
csv_files = glob.glob(pattern)

frames = []

for csv in csv_files:
    df = pd.read_csv(csv, index_col='date', parse_dates=True)
    frames.append(df)

df = pd.concat(frames)
df.head()

Out[4]: 
                     BEN  CH4   CO  EBE  NMHC     NO   NO_2  NOx  O_3  PM10  
date                                                                          
2016-11-01 01:00:00  NaN  NaN  0.7  NaN   NaN  153.0   77.0  NaN  NaN   NaN   
2016-11-01 01:00:00  3.1  NaN  1.1  2.0  0.53  260.0  144.0  NaN  4.0  46.0   
2016-11-01 01:00:00  5.9  NaN  NaN  7.5   NaN  297.0  139.0  NaN  NaN   NaN   
2016-11-01 01:00:00  NaN  NaN  1.0  NaN   NaN  154.0  113.0  NaN  2.0   NaN   
2016-11-01 01:00:00  NaN  NaN  NaN  NaN   NaN  275.0  127.0  NaN  2.0   NaN

The data is gathered from 24 different stations about 14 different pollutants. We are not going to analyze this data, and to make it little bit simpler we will choose only one station, two pollutants and remove all NaN values (DANGER! please, do not repeat it at home).

df_time = df[['O_3', 'PM10']][df['station'] == 28079008].dropna()
df_time.head()
Out[9]: 
                     O_3  PM10
date                          
2016-11-01 01:00:00  4.0  46.0
2016-11-01 02:00:00  4.0  37.0
2016-11-01 03:00:00  4.0  31.0
2016-11-01 04:00:00  5.0  31.0
2016-11-01 05:00:00  6.0  27.0

Now when we have our data prepared we can play with Datetime Index.

Selecting values

Although the default pandas datetime format is ISO8601 (“yyyy-mm-dd hh:mm:ss”) when selecting data using partial string indexing it understands a lot of other different formats. For example:

df_time.loc['2016-11-01'].head()
Out[17]: 
                     O_3  PM10
date                          
2016-11-01 01:00:00  4.0  46.0
2016-11-01 02:00:00  4.0  37.0
2016-11-01 03:00:00  4.0  31.0
2016-11-01 04:00:00  5.0  31.0
2016-11-01 05:00:00  6.0  27.0

df_time.loc['November 1, 2016'].head()
Out[18]: 
                     O_3  PM10
date                          
2016-11-01 01:00:00  4.0  46.0
2016-11-01 02:00:00  4.0  37.0
2016-11-01 03:00:00  4.0  31.0
2016-11-01 04:00:00  5.0  31.0
2016-11-01 05:00:00  6.0  27.0

df_time.loc['2016-Nov-1'].head()
Out[19]: 
                     O_3  PM10
date                          
2016-11-01 01:00:00  4.0  46.0
2016-11-01 02:00:00  4.0  37.0
2016-11-01 03:00:00  4.0  31.0
2016-11-01 04:00:00  5.0  31.0
2016-11-01 05:00:00  6.0  27.0

All produce the same output. So we are free to use whatever is more comfortable for us. Also we can select data for entire month:

df_time.loc['2016-11'].head()
Out[23]: 
                     O_3  PM10
date                          
2016-11-01 01:00:00  4.0  46.0
2016-11-01 02:00:00  4.0  37.0
2016-11-01 03:00:00  4.0  31.0
2016-11-01 04:00:00  5.0  31.0
2016-11-01 05:00:00  6.0  27.0

df_time.loc['2016-11'].count()
Out[24]: 
O_3     715
PM10    715
dtype: int64

The same works if we want to select entire year:

df_time.loc['2016'].head()
Out[31]: 
                     O_3  PM10
date                          
2016-11-01 01:00:00  4.0  46.0
2016-11-01 02:00:00  4.0  37.0
2016-11-01 03:00:00  4.0  31.0
2016-11-01 04:00:00  5.0  31.0
2016-11-01 05:00:00  6.0  27.0

df_time.loc['2016'].count()
Out[32]: 
O_3     8720
PM10    8720
dtype: int64

If we want to slice data and find records for some specific period of time we continue to use loc accessor, all the rules are the same as for regular index:

df_time.loc['2017-11-02 23:00' : '2017-12-01'].head()
Out[34]: 
                     O_3  PM10
date                          
2017-11-02 23:00:00  5.0  30.0
2017-11-03 00:00:00  5.0  25.0
2017-11-03 01:00:00  5.0  12.0
2017-11-03 02:00:00  6.0   8.0
2017-11-03 03:00:00  7.0  14.0

df_time.loc['2017-11-02 23:00' : '2017-12-01'].count()
Out[35]: 
O_3     690
PM10    690
dtype: int64

Resampling

Pandas has a simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (e.g., converting secondly data into 5-minutely data). This is extremely common in, but not limited to, financial applications.

resample() is a time-based groupby, followed by a reduction method on each of its groups.

The resample function is very flexible and allows us to specify many different parameters to control the frequency conversion and resampling operation. sum, mean, std, sem,max, min, median, first, last, ohlc
are available as a method of the returned object by resample()

# Converting hourly data into monthly data
df_time.resample('M').mean().head()
Out[46]: 
                  O_3       PM10
date                            
2016-01-31  21.871622  19.990541
2016-02-29  32.241679  25.853835
2016-03-31  44.234014  16.952381
2016-04-30  46.845938  12.189076
2016-05-31  53.136671  14.671177

For upsampling, we can specify a way to upsample to interpolate over the gaps that are created:

# Converting hourly data into 10-minutely data
df_time.resample('10Min').mean().head()
Out[46]: 
                     O_3  PM10
date                          
2016-01-01 01:00:00  8.0  17.0
2016-01-01 01:10:00  NaN   NaN
2016-01-01 01:20:00  NaN   NaN
2016-01-01 01:30:00  NaN   NaN
2016-01-01 01:40:00  NaN   NaN

df_time.resample('10Min').mean().ffill().head()
Out[47]: 
                     O_3  PM10
date                          
2016-01-01 01:00:00  8.0  17.0
2016-01-01 01:10:00  8.0  17.0
2016-01-01 01:20:00  8.0  17.0
2016-01-01 01:30:00  8.0  17.0
2016-01-01 01:40:00  8.0  17.0

We can use the following methods to fill the NaN values: ‘pad’, ‘backfill’, ‘ffill’, ‘bfill’, ‘nearest’. More details on this can be found in documentation. Or we can do it using interpolation with following methods: ‘linear’, ‘time’, ‘index’, ‘values’, ‘nearest’, ‘zero’, ‘slinear’, ‘quadratic’, ‘cubic’, ‘barycentric’, ‘krogh’, ‘polynomial’, ‘spline’, ‘piecewise_polynomial’, ‘from_derivatives’, ‘pchip’, ‘akima’. And again, deeper explanation on this can be found in pandas docs.

And a table of resampling frequencies:

Input	Description
min, T	minute
H	hour
D	day
B	business day
W	week
M	month
Q	quarter
A	year

Visualization

And another one awesome feature of Datetime Index is simplicity in plotting, as matplotlib will automatically treat it as x axis, so we don’t need to explicitly specify anything.

import seaborn as sns
sns.set()
df_plot = df_time.resample('M').mean()
plt.plot(df_plot)
plt.title('Air polution by O3 and PM10')
plt.ylabel('micrograms per cubic meter (mg/m3)')
plt.xticks(rotation=45)
plt.show()

As promised in the beginning – few tips, that help in the majority of situations when working with datetime data. For me – one more refresher and organizer of thoughts that converts into knowledge. All win. Someone will find it useful, someone might not (I warned in the first paragraph :D), so actually I expect everyone reading this will find it useful.

This is the most exciting feature of knowledge – when you share it, you don’t loose anything, you only gain. To write an article, it requires some research, some verification, some learning – basically you get even more knowledge in the end.

Knowledge is just a tool. And it’s your responsibility to apply it or not. In the end of the day it doesn’t matter how much you know, it’s about how you use that knowledge. But that’s already another story…

Thank you for reading, have an incredible week, learn, spread the knowledge, use it wisely and use it for good deeds 🙂

16 thoughts on “Tips on Working with Datetime Index in pandas”

vikas kumar says:
June 6, 2020 at 07:00
my csv file is:-
“Time Stamp Total Volume Dispensed(Litres)
0 “17/07/2019 12:16:01 0
1 “17/07/2019 12:18:52 0
2 “17/07/2019 12:26:21 0
3 “17/07/2019 12:26:51 0
4 “17/07/2019 12:34:07 0
.. … …
171 “01/08/2019 16:47:35 33954
172 “01/08/2019 16:56:13 33954
173 “01/08/2019 17:06:13 33954
174 “01/08/2019 17:07:29 33954
175 “01/08/2019 17:17:29 63618
………….
[176 rows x 2 columns]…………….
df = pd.read_csv(csv, index_col=’Time Stamp’, parse_dates=True)
i have facing error:- ‘Time Stamp’ is not in list
please help sir,
i want to read csv file and calculate the total Volume Dispensed(Litres) monthly wise and plot bar chart using python
1. Sergi says:
  June 6, 2020 at 08:51
  You can try first reading the file and only after that assigning the timestamp column as index. Maybe during this process you will find out why you cannot do that directly.
  Cheers!
Dexter says:
July 21, 2020 at 18:25
Sergi,
I tried to resample my hourly rows to monthly, but raise this error:
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of ‘Index’
I try this code to fix, but don’t work.
dataset[‘datetime’] = dataset.index
dataset[‘datetime’] = to_datetime(dataset[‘datetime’])
del dataset[‘datetime’]
# resampling hourly data into monthly data
dataset.resample(‘M’).sum()
Seems the index DateTime column is the problem, but in your example, the date column also is an index.
Any idea?
1. Sergi says:
  July 22, 2020 at 17:28
  I am not sure what it can be, but check carefully if your index is DateTime Index and not string/datetime/int etc. Sometimes after some modifications you change the type and do not notice it. I make this error quite often XD
Shivam says:
July 22, 2020 at 08:43
I have data
Date Sq. Mtr Sq. Yrd KGS LBS TARE WT. OZ TIME
2020-01-01 1340.12 1603 546.0 1204 8.0 12.017467 08:29:49
2020-01-01 1340.12 1603 551.0 1215 8.0
Sir I want weekly data from this, so that I uses this
df[‘Date’] = df.to_datetime(df[‘Date’])
df = df.set_index(“Date”)
Daily_data = df.resample(‘D’).sum()
But here in daily data I want my day from 7:30 to 7:30 (means today’s 7:30 to tommorw morning’s 7:30) now I’m not able to set this as a date (because of that’s my business hours)
After daily_data I’m converting to the weekly data. By df.resample(‘W’).sum()
1. Sergi says:
  July 22, 2020 at 17:35
  What I see from the example you provided is that your “Date” column do not have hours – you have to combine “Date” and “Time” columns into one Datetime Index. Once you have it you can create an additional column, let’s call it “Business DateTime” and apply a transformation logic you want. This way you will have 2 columns: one with standard dates and another with business dates.
David Gerrard says:
July 31, 2020 at 09:22
Hi Sergi,
I have been using your example for some study I am doing but I can not work out how to change the graph into a stacked bar chart. I have tried the obvious plt.plot.bar(df_plot) etc. Have you any suggestions.
Thanks in advance
David
1. Sergi says:
  July 31, 2020 at 15:32
  Try plotting with seaborn. Here is the stackoverflow post that will help you stackoverflow.com
Simon BAL-FONTAINE says:
November 30, 2020 at 11:45
Hi Sergi,
You show how to select data using ‘loc’ depending on year, year and month, etc.
But I need to select date only with hours ( data on each day between 6AM and 10AM for exemple).
Do you have a solution or it’s impossible with this function ?
1. Sergi says:
  December 1, 2020 at 09:19
  Hi,
  In the example you have it 🙂
  df_time.loc['2017-11-02 23:00' : '2017-12-01'].head()
  You can modify it to
  df_time.loc['2017-11-02 06:00' : '2017-12-01 10:00'].head()
  But if you want to select only specific rows for specific hours you should use another function between_time()
  Example:
  df.between_time('06:00:00', '10:00:00')
  Also, please check the type of your index – if it is not datetime it will not work
  Cheers
Renzo Karsdorp says:
February 5, 2021 at 03:26
great example Sergi.
I was wondering, have you done something like this for csv’s from separate datasources? For example, what if you had a NOX.csv and PM10.csv with the same timestamps. how would you align those different files with you datetime index?
Also, how is the database going along, do you see a drop in poluttants due to decrease of activities during Covid?
1. Sergi says:
  February 7, 2021 at 12:38
  Thank you 🙂
  For different datasources I would rather combine them first into one dataframe and only after that would create an index.
  Regarding the database, I haven’t checked the dataset for new data, so cannot answer this 🙁
Pingback: Learning to use RedisTimeSeries – JJPP: JP in JP
1. Sergi says:
  March 6, 2021 at 11:00
  Thanks for mentioning my work 🙂
RIT says:
May 21, 2021 at 17:20
Sergi, thanks for this – I am struggling to pull out the index value? any advice appreciated, Thanks
1. Sergi says:
  June 27, 2021 at 20:11
  Thank you 🙂 I would love to help, but I need more info than just “struggling to pull out index value” :/

Blog

Tips on Working with Datetime Index in pandas

Importing data

Selecting values

Resampling

Visualization

16 thoughts on “Tips on Working with Datetime Index in pandas”

Leave a Reply Cancel reply

Importing data

Selecting values

Resampling

Visualization

16 thoughts on “Tips on Working with Datetime Index in pandas”

Leave a Reply Cancel reply

Sergi's Blog Cookies Policy