Extracting and Transforming Data in Python

To get insights from data, you have to play with it a little.

It is important to be able to extract, filter, and transform data from DataFrames in order to drill into the data that really matters. The pandas library offers many techniques that make this process efficient and intuitive. In today’s article I will walk through those techniques with code samples and some explanations. Let’s get started.

For this article I’ve created a sample DataFrame filled with random numbers to play with. We will use it as the running example throughout the article.

import pandas as pd
import numpy as np

cols = ['col0', 'col1', 'col2', 'col3', 'col4']
rows = ['row0', 'row1', 'row2', 'row3', 'row4']
data = np.random.randint(0, 100, size=(5, 5))
df = pd.DataFrame(data, columns=cols, index=rows)

df.head()

Out[2]: 
      col0  col1  col2  col3  col4
row0    24    78    42     7    96
row1    40     4    80    12    84
row2    83    17    80    26    15
row3    92    68    58    93    33
row4    78    63    35    70    95

Indexing DataFrames

To extract data from a pandas DataFrame we can use direct indexing or accessors. We can select the rows and columns we need using their labels:

df['col1']['row1']
Out[3]: 4

Please note the order in this type of indexing: first you specify the column label and then the row. But direct indexing like this only feels convenient on small toy datasets, which are rare – in real life we work with much heavier machinery. It is much better to select data using the accessors – .loc and .iloc. The difference between them is that .loc accepts labels while .iloc accepts integer positions. Also, when we use accessors we specify rows first and then columns. It took me some time in the beginning to get used to that – SQL background, what else can you say..

So, to select a single value using accessors you’d do the following:

df.loc['row4', 'col2']
Out[4]: 35
df.iloc[4, 2]  # .iloc[row_index, col_index]
Out[5]: 35
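For single-value lookups like these, pandas also provides the faster scalar accessors .at (labels) and .iat (positions). A small sketch, using a deterministic frame (np.arange instead of random numbers) so the result is reproducible:

```python
import numpy as np
import pandas as pd

cols = ['col0', 'col1', 'col2', 'col3', 'col4']
rows = ['row0', 'row1', 'row2', 'row3', 'row4']
# deterministic values 0..24 instead of random ones, for illustration
df = pd.DataFrame(np.arange(25).reshape(5, 5), columns=cols, index=rows)

# .at takes labels, .iat takes integer positions -- both return a single scalar
label_value = df.at['row4', 'col2']
position_value = df.iat[4, 2]
print(label_value, position_value)  # 22 22
```

They do the same job as .loc/.iloc for a single cell, just a bit faster because they skip the machinery needed for slices.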

Using indexing we can select a single value, a Series, or a DataFrame from a DataFrame (sorry for the tautology). Above I showed how to select a value.

To select a few columns, just pass a nested list of their labels and a DataFrame will be returned:

df_new = df[['col1','col2']]
df_new.head(3)

Out[6]: 
      col1  col2
row0    78    42
row1     4    80
row2    17    80

If you also want specific rows, add a range of row indexes and you will get a DataFrame again. This technique is called slicing, and it is covered in more detail below.

df_new = df[['col1','col2']][1:4]
df_new.head(3)

Out[7]: 
      col1  col2
row1     4    80
row2    17    80
row3    68    58

To select a Series, you select a single column with all rows or a range of rows. Each line of code below produces the same output:

df['col0']
df.loc[:,'col0']
df.iloc[:, 0]

Out[8]: 
row0    24
row1    40
row2    83
row3    92
row4    78
Name: col0, dtype: int32

The colon means that we want to select all rows or all columns – df.loc[:, :] or df.iloc[:, :] will return all values. And this brings us, slowly, to slicing – selecting specific ranges from our data. To slice a Series, just add the range of rows you want to select using their indexes:

df['col3'][2:5]

Out[12]: 
row2    26
row3    93
row4    70
Name: col3, dtype: int32

And don’t forget about ranges in Python – the first element is included, the second is excluded. So the code above returns the rows with indexes 2, 3 and 4. And indexes start from 0.
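Standard Python slice steps work here as well – for example, every second row. A quick sketch with a deterministic Series (made-up values, so the output is reproducible):

```python
import numpy as np
import pandas as pd

# illustrative values 0, 10, 20, 30, 40 with the same row labels as the article
s = pd.Series(np.arange(5) * 10,
              index=['row0', 'row1', 'row2', 'row3', 'row4'])

# positional slice with a step of 2: rows at positions 0, 2 and 4
every_second = s[::2]
print(every_second.tolist())  # [0, 20, 40]
```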

Slicing DataFrames works the same way, with just one nuance: when using .loc (labels), both borders are included. For example, select rows from label ‘row1’ to label ‘row4’, or from row index 1 to index 4, and all columns:

df.loc['row1':'row4', :]
Out[20]: 
      col0  col1  col2  col3  col4
row1    40     4    80    12    84
row2    83    17    80    26    15
row3    92    68    58    93    33
row4    78    63    35    70    95

df.iloc[1:4, :]
Out[21]: 
      col0  col1  col2  col3  col4
row1    40     4    80    12    84
row2    83    17    80    26    15
row3    92    68    58    93    33

The first line of code above selects row1, row2, row3 and row4, while the second selects only row1, row2 and row3. A few more examples below.

Select columns from label ‘col1’ to label ‘col4’ or from column index 1 to index 4 and all rows:

df.loc[:, 'col1':'col4']
Out[22]: 
      col1  col2  col3  col4
row0    78    42     7    96
row1     4    80    12    84
row2    17    80    26    15
row3    68    58    93    33
row4    63    35    70    95

df.iloc[:, 1:4]
Out[23]: 
      col1  col2  col3
row0    78    42     7
row1     4    80    12
row2    17    80    26
row3    68    58    93
row4    63    35    70

Select rows from label ‘row1’ to label ‘row4’ (or from row index 1 to index 4) and columns from label ‘col1’ to label ‘col4’ (or from column index 1 to index 4):

df.loc['row1':'row4', 'col1':'col4']
Out[24]: 
      col1  col2  col3  col4
row1     4    80    12    84
row2    17    80    26    15
row3    68    58    93    33
row4    63    35    70    95

df.iloc[1:4,1:4]
Out[25]: 
      col1  col2  col3
row1     4    80    12
row2    17    80    26
row3    68    58    93

Use a list to select specific columns or rows that are not in a range.

df.loc['row2':'row4', ['col1','col3']]
Out[28]: 
      col1  col3
row2    17    26
row3    68    93
row4    63    70
df.iloc[[2,4], 0:4]
Out[30]: 
      col0  col1  col2  col3
row2    83    17    80    26
row4    78    63    35    70

Filtering DataFrames

Filtering is a more general tool that selects parts of the data based on properties of the data itself, not on indexes or labels. DataFrames have several methods for filtering, and the underlying idea for all of them is a Boolean Series. For example, df['col1'] > 20 (assuming col1 holds integers) returns a Boolean Series that is True where the condition holds. I will put the output of the .head() method here, so you don’t need to scroll up to match the numbers.

Out[2]: 
      col0  col1  col2  col3  col4
row0    24    78    42     7    96
row1    40     4    80    12    84
row2    83    17    80    26    15
row3    92    68    58    93    33
row4    78    63    35    70    95
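To see what such a condition evaluates to on its own, you can inspect the Boolean Series directly. A sketch that hard-codes the col1 values from the table above so the result is reproducible:

```python
import pandas as pd

# col1 values copied from the sample table, for illustration
df = pd.DataFrame(
    {'col1': [78, 4, 17, 68, 63]},
    index=['row0', 'row1', 'row2', 'row3', 'row4'],
)

condition = df['col1'] > 20
print(condition.tolist())  # [True, False, False, True, True]
```

Passing this Series back into df[...] keeps exactly the rows marked True.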

So to select part of a DataFrame where values of col1 are bigger than 20 we will use the following code:

df[df['col1'] > 20]
# assigning variable also works
condition = df['col1'] > 20
df[condition]

Out[31]: 
      col0  col1  col2  col3  col4
row0    24    78    42     7    96
row3    92    68    58    93    33
row4    78    63    35    70    95

We can combine these filters using the standard logical operators (and – &, or – |, not – ~). Notice the parentheses around each condition – they are required because of operator precedence.

df[(df['col1'] > 25) & (df['col3'] < 30)] # logical and
Out[33]: 
      col0  col1  col2  col3  col4
row0    24    78    42     7    96

df[(df['col1'] > 25) | (df['col3'] < 30)] # logical or
Out[34]: 
      col0  col1  col2  col3  col4
row0    24    78    42     7    96
row1    40     4    80    12    84
row2    83    17    80    26    15
row3    92    68    58    93    33
row4    78    63    35    70    95

df[~(df['col1'] > 25)] # logical not
Out[35]: 
      col0  col1  col2  col3  col4
row1    40     4    80    12    84
row2    83    17    80    26    15
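As an alternative to the &/|/~ syntax, pandas also offers the .query() method, which expresses the same filters as a string using plain `and`/`or`/`not`. A sketch with the sample values hard-coded for reproducibility:

```python
import pandas as pd

# col1 and col3 values copied from the sample table, for illustration
df = pd.DataFrame(
    {'col1': [78, 4, 17, 68, 63], 'col3': [7, 12, 26, 93, 70]},
    index=['row0', 'row1', 'row2', 'row3', 'row4'],
)

# equivalent to df[(df['col1'] > 25) & (df['col3'] < 30)]
result = df.query('col1 > 25 and col3 < 30')
print(result.index.tolist())  # ['row0']
```

Which form you prefer is largely a matter of taste; .query() can read more naturally when conditions pile up.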

Dealing with 0 and NaN values

Almost every dataset has zero or NaN values, and we definitely want to know where they are. Ours is suspiciously clean, so we will modify it a little:

df.iloc[3, 3] = 0
df.iloc[1, 2] = np.nan
df.iloc[4, 0] = np.nan
df['col5'] = 0
df['col6'] = np.nan
df.head()

Out[57]: 
      col0  col1  col2  col3  col4  col5  col6
row0  24.0    78  42.0     7    96     0   NaN
row1  40.0     4   NaN    12    84     0   NaN
row2  83.0    17  80.0    26    15     0   NaN
row3  92.0    68  58.0     0    33     0   NaN
row4   NaN    63  35.0    70    95     0   NaN

To select the columns that don’t have any zero values we can use the .all() method (note that NaN counts as truthy here, which is why col6 survives):

df.loc[:, df.all()]

Out[43]: 
      col0  col1  col2  col4  col6
row0  24.0    78  42.0    96   NaN
row1  40.0     4   NaN    84   NaN
row2  83.0    17  80.0    15   NaN
row3  92.0    68  58.0    33   NaN
row4   NaN    63  35.0    95   NaN

If we want to find the columns that have at least one nonzero value, .any() will help:

df.loc[:, df.any()]

Out[47]: 
      col0  col1  col2  col3  col4
row0  24.0    78  42.0     7    96
row1  40.0     4   NaN    12    84
row2  83.0    17  80.0    26    15
row3  92.0    68  58.0     0    33
row4   NaN    63  35.0    70    95

To select columns with any NaN:

df.loc[:, df.isnull().any()]

Out[48]: 
      col0  col2  col6
row0  24.0  42.0   NaN
row1  40.0   NaN   NaN
row2  83.0  80.0   NaN
row3  92.0  58.0   NaN
row4   NaN  35.0   NaN

Select columns without NaNs:

df.loc[:, df.notnull().all()]

Out[49]: 
      col1  col3  col4  col5
row0    78     7    96     0
row1     4    12    84     0
row2    17    26    15     0
row3    68     0    33     0
row4    63    70    95     0

We can drop the rows or columns containing NaNs, but it’s a dangerous game – dropping is usually not a solution. You have to understand your data and deal with such rows wisely. I warned you.

Are you sure you wanna know it? OK.. 😀

df.dropna(how='all', axis=1) # if all values in a column are NaN it will be dropped
Out[69]: 
      col0  col1  col2  col3  col4  col5
row0  24.0    78  42.0     7    96     0
row1  40.0     4   NaN    12    84     0
row2  83.0    17  80.0    26    15     0
row3  92.0    68  58.0     0    33     0
row4   NaN    63  35.0    70    95     0

df.dropna(how='any', axis=1) # if any value in a column is NaN the column will be dropped
Out[71]: 
      col1  col3  col4  col5
row0    78     7    96     0
row1     4    12    84     0
row2    17    26    15     0
row3    68     0    33     0
row4    63    70    95     0

These methods do not modify the original DataFrame, so to continue working with the filtered data you have to assign it to a new DataFrame or reassign it to the existing one:

df = df.dropna(how='any', axis=1)
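A gentler alternative to dropping is filling the gaps with .fillna(). A small sketch that replaces the NaN in col0 with the column mean (the column values are copied from the sample; whether a mean is the right fill value depends entirely on your data):

```python
import numpy as np
import pandas as pd

# col0 from the sample frame, with its NaN in the last row
df = pd.DataFrame({'col0': [24.0, 40.0, 83.0, 92.0, np.nan]},
                  index=['row0', 'row1', 'row2', 'row3', 'row4'])

# replace NaN with the mean of the non-missing values
df['col0'] = df['col0'].fillna(df['col0'].mean())
print(df['col0'].iloc[-1])  # mean of 24, 40, 83, 92 -> 59.75
```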

The beauty of filtering is that we can actually select or modify values of one column based on another. For example, we can select values from col1 where col2 is greater than 35 and update those values by adding 5 to each:

# Find a column based on another
df['col1'][df['col2'] > 35]
Out[74]: 
row0    78
row2    17
row3    68
Name: col1, dtype: int32
# Modify a column based on another
df['col1'][df['col2'] > 35] += 5
df['col1']
Out[77]: 
row0    83
row1     4
row2    22
row3    73
row4    63
Name: col1, dtype: int32
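One caveat: chained indexing like df['col1'][mask] += 5 can trigger pandas’ SettingWithCopyWarning, because it is two indexing operations in a row. The safer idiom is a single .loc call. A sketch with the sample values hard-coded (col2 already has its NaN, which simply compares False):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'col1': [78, 4, 17, 68, 63],
     'col2': [42.0, np.nan, 80.0, 58.0, 35.0]},
    index=['row0', 'row1', 'row2', 'row3', 'row4'],
)

# modify col1 where col2 > 35, in one indexing step
df.loc[df['col2'] > 35, 'col1'] += 5
print(df['col1'].tolist())  # [83, 4, 22, 73, 63]
```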

And this brings us to the next part –

Transforming DataFrames

Once we have selected or filtered our data, we want to transform it somehow. The best way to do this is with the methods built into DataFrames or with numpy universal functions, which transform an entire column of data element-wise. Examples would be pandas’ .floordiv() method (from the documentation: ‘Integer division of dataframe and other, element-wise’) or numpy’s .floor_divide() (doc: ‘Return the largest integer smaller or equal to the division of the inputs.’).
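As a quick illustration of the element-wise idea, both of those give the same result (the column values here are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [10, 25, 33]})

# pandas method and numpy ufunc: integer division applied to every element
by_method = df['a'].floordiv(10)
by_ufunc = np.floor_divide(df['a'], 10)
print(by_method.tolist())  # [1, 2, 3]
```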

If such a function is not available, we can write our own and use it with the .apply() method:

def some_func(x):
    return x * 2

df.apply(some_func)  # updates each entry of the DataFrame without any loops

# a lambda also works
df.apply(lambda n: n * 2)  # the same

These methods do not modify the DataFrame in place – they return a new object, so we have to store the result explicitly:

df['new_col'] = df['col4'].apply(lambda n: n*2)
df.head()

Out[82]: 
      col0  col1  col2  col3  col4  col5  col6  new_col
row0  24.0    83  42.0     7    96     0   NaN      192
row1  40.0     4   NaN    12    84     0   NaN      168
row2  83.0    22  80.0    26    15     0   NaN       30
row3  92.0    73  58.0     0    33     0   NaN       66
row4   NaN    63  35.0    70    95     0   NaN      190

If the index consists of strings, it has a .str accessor that allows us to modify the entire index at once:

df.index.str.upper()
Out[83]: Index(['ROW0', 'ROW1', 'ROW2', 'ROW3', 'ROW4'], dtype='object')

Also, we cannot use the .apply() method on an index – the alternative is .map():

df.index = df.index.map(str.lower)
df.index
Out[85]: Index(['row0', 'row1', 'row2', 'row3', 'row4'], dtype='object')

But .map() can be used on columns as well. For example:

# Create the dictionary: red_vs_blue
red_vs_blue = {0: 'blue', 12: 'red'}

# Use the dictionary to map the 'col3' column to the new column df['color']
df['color'] = df['col3'].map(red_vs_blue)
df.head()

Out[92]: 
      col0  col1  col2  col3  col4  col5  col6  new_col color
row0  24.0    83  42.0     7    96     0   NaN      192   NaN
row1  40.0     4   NaN    12    84     0   NaN      168   red
row2  83.0    22  80.0    26    15     0   NaN       30   NaN
row3  92.0    73  58.0     0    33     0   NaN       66  blue
row4   NaN    63  35.0    70    95     0   NaN      190   NaN

Arithmetic operations on Series and DataFrames work directly. The expression below creates a new column ‘col7’ where each value with index n is the sum of the values with index n from ‘col3’ and ‘col4’.

df['col7'] = df['col3'] + df['col4'] 
df.head()

Out[94]: 
      col0  col1  col2  col3  col4  col5  col6  new_col color  col7
row0  24.0    83  42.0     7    96     0   NaN      192   NaN   103
row1  40.0     4   NaN    12    84     0   NaN      168   red    96
row2  83.0    22  80.0    26    15     0   NaN       30   NaN    41
row3  92.0    73  58.0     0    33     0   NaN       66  blue    33
row4   NaN    63  35.0    70    95     0   NaN      190   NaN   165

This is the second version of the article – the first one was a complete mess: errors in the code, no examples, and a few other problems. Thanks to the feedback, I went through the article one more time and I think it looks much better now. I have covered the basics of extracting and transforming data in Python with code snippets and examples, and hopefully it will be useful for people who are just starting their path in this field.

Meanwhile, love data science and smile more. We have to be positive, as we have the sexiest job of the 21st century 😀

2 thoughts on “Extracting and Transforming Data in Python”

  1. Hi, thanks for this amazing piece. Quick question: if you’re working with a huge dataset and thus large data frame, how do you do transformations using the columns if you don’t want to type out each column label for all 700 columns?

1. You can access the .columns attribute of the DataFrame and copy-paste it or store it in a variable. You will get a list of all the column names, so you can filter it using list methods. Hope this is something that might help 🙂 If that doesn’t work, reach me on social media – links in the footer 😉
      Cheers!
