Extracting and Transforming Data in Python

In order to get insights from data you have to play with it a little..

It is important to be able to extract, filter, and transform data from DataFrames in order to drill into the data that really matters. The pandas library has many techniques that make this process efficient and intuitive. And in today’s article I will list those techniques with code samples and some explanations. Let’s get started.

For this article I’ve created a sample DataFrame with random numbers to play with it. We will use this data as an example during the explanations in the article.

import pandas as pd
import numpy as np

cols = ['col0', 'col1', 'col2', 'col3', 'col4']
rows = ['row0', 'row1', 'row2', 'row3', 'row4']
data = np.random.randint(0, 100, size=(5, 5))
df = pd.DataFrame(data, columns=cols, index=rows)

df.head()

Out[2]: 
      col0  col1  col2  col3  col4
row0    24    78    42     7    96
row1    40     4    80    12    84
row2    83    17    80    26    15
row3    92    68    58    93    33
row4    78    63    35    70    95

Indexing DataFrames

To extract data from pandas DataFrame we can use direct indexing or accessors. We can select necessary rows and columns using it’s labels:

df['col1']['row1']

Out[3]: 4

Please, note the order in this type of indexing: first you specify column label and then row. But the truth is, datasets with is rare and small, while in real life we work with much heavier machinery. It is much better to select data using accessors – .loc and .iloc. The difference between them is that .loc accepts labels and .iloc – indexes. Also when we use accessors first we specify rows and then columns. I had some difficult time in the beginning to get used to it – SQL background, what else can you say..

So, to select a single value using accessors you’d do the following:

df.loc['row4', 'col2']
Out[4]: 35

df.iloc[rowindex, colindex] ==> df.iloc[4, 2]
Out[5]: 35

Using indexing we can select a single value, Series or DataFrame from a DataFrame (sorry for tautology). Above I have shown how to select a value.

To subselect few columns, just pass nested list of it’s labels and DataFrame will be returned:

df_new = df[['col1','col2']]
df_new.head(3)

Out[6]: 
      col1  col2
row0    78    42
row1     4    80
row2    17    80

If you want to select also specific rows, add its indexes and you will get a DataFrame again. This technique is called slicing and more in detail about it – below.

df_new = df[['col1','col2']][1:4]
df_new.head(3)

Out[7]: 
      col1  col2
row1     4    80
row2    17    80
row3    68    58

To select a Series you have to select a single column with all or range of rows. Each line of code will produce the same output:

df['col0']
df.loc[:,'col0']
df.iloc[:, 0]

Out[8]: 
row0    24
row1    40
row2    83
row3    92
row4    78
Name: col0, dtype: int32

The colon means that we want to select all rows or all columns – df.loc[:,:] or df.iloc[:,:] will return all values. And slowly we’ve got to the slicing – selecting specific ranges from our data. To slice a Series you just add a range of rows you want to select using its indexes:

df['col3'][2:5]

Out[12]: 
row2    26
row3    93
row4    70
Name: col3, dtype: int32

And don’t forget about ranging in Python – first element included, second excluded. So the code above will return rows with indexes 5, 6, 7, 8 and 9. And indexes start from 0.

Slicing DataFrames works the same way. With just one nuance. When using .loc (labels) both borders are included. For example, select rows from label ‘row1’ to label ‘row4’ or from row index 1 to index 4 and all columns:

df.loc['row1':'row4', :]
Out[20]: 
      col0  col1  col2  col3  col4
row1    40     4    80    12    84
row2    83    17    80    26    15
row3    92    68    58    93    33
row4    78    63    35    70    95

df.iloc[1:4, :]
Out[21]: 
      col0  col1  col2  col3  col4
row1    40     4    80    12    84
row2    83    17    80    26    15
row3    92    68    58    93    33

The first line of code above selected row1, row2, row3 and row4. While the second – row1, row2 and row3 only. And few more examples below.

Select columns from label ‘col1’ to label ‘col4’ or from column index 1 to index 4 and all rows:

df.loc[:, 'col1':'col4']
Out[22]: 
      col1  col2  col3  col4
row0    78    42     7    96
row1     4    80    12    84
row2    17    80    26    15
row3    68    58    93    33
row4    63    35    70    95

df.iloc[:, 1:4]
Out[23]: 
      col1  col2  col3
row0    78    42     7
row1     4    80    12
row2    17    80    26
row3    68    58    93
row4    63    35    70

Select rows from label ‘row1’ to label ‘row4’ or from row index 1 to index 4 and columns from label ‘col1’ to label ‘col4’ or from column index 1 to index 4:

df.loc['row1':'row4', 'col1':'col4']
Out[24]: 
      col1  col2  col3  col4
row1     4    80    12    84
row2    17    80    26    15
row3    68    58    93    33
row4    63    35    70    95

df.iloc[1:4,1:4]
Out[25]: 
      col1  col2  col3
row1     4    80    12
row2    17    80    26
row3    68    58    93

Use a list to select specific columns or rows that are not in a range.

df.loc['row2':'row4', ['col1','col3']]
Out[28]: 
      col1  col3
row2    17    26
row3    68    93
row4    63    70

df.iloc[[2,4], 0:4]
Out[30]: 
      col0  col1  col2  col3
row2    83    17    80    26
row4    78    63    35    70

Filtering DataFrames

Filtering is a more general tool to select parts of the data based on properties of interest of the data itself and not on indexes or labels. DataFrames have several methods for filtering. Underlying idea for all these methods is a Boolean Series. The df[‘col1’] > 20 (we assume col1 is of type integer) will return a Boolean Series where this condition is true. I will put here the output of .head() method, so you don’t need to scroll up to match the numbers.

Out[2]: 
      col0  col1  col2  col3  col4
row0    24    78    42     7    96
row1    40     4    80    12    84
row2    83    17    80    26    15
row3    92    68    58    93    33
row4    78    63    35    70    95

So to select part of a DataFrame where values of col1 are bigger than 20 we will use the following code:

df[df['col1'] > 20]
# assigning variable also works
condition = df['col1'] > 20
df[condition]

Out[31]: 
      col0  col1  col2  col3  col4
row0    24    78    42     7    96
row3    92    68    58    93    33
row4    78    63    35    70    95

We can combine those filters using standard logical operators (and – &, or – |, not – ~). Notice usage of parenthesis for these operations.

df[(df['col1'] > 25) & (df['col3'] < 30)] # logical and
Out[33]: 
      col0  col1  col2  col3  col4
row0    24    78    42     7    96

df[(df['col1'] > 25) | (df['col3'] < 30)] # logical or
Out[34]: 
      col0  col1  col2  col3  col4
row0    24    78    42     7    96
row1    40     4    80    12    84
row2    83    17    80    26    15
row3    92    68    58    93    33
row4    78    63    35    70    95

df[~(df['col1'] > 25)] # logical not
Out[35]: 
      col0  col1  col2  col3  col4
row1    40     4    80    12    84
row2    83    17    80    26    15

Dealing with 0 and NaN values

Almost always datasets have zero or NaN values and we definitely want to know where they are. Ours is particular, so we will modify it a little:

df.iloc[3, 3] = 0
df.iloc[1, 2] = np.nan
df.iloc[4, 0] = np.nan
df['col5'] = 0
df['col6'] = np.NaN
df.head()

Out[57]: 
      col0  col1  col2  col3  col4  col5  col6
row0  24.0    78  42.0     7    96     0   NaN
row1  40.0     4   NaN    12    84     0   NaN
row2  83.0    17  80.0    26    15     0   NaN
row3  92.0    68  58.0     0    33     0   NaN
row4   NaN    63  35.0    70    95     0   NaN

To select columns that don’t have any zero value we can use .all() method (all data is present):

df.loc[:, df.all()]

Out[43]: 
      col0  col1  col2  col4  col6
row0  24.0    78  42.0    96   NaN
row1  40.0     4   NaN    84   NaN
row2  83.0    17  80.0    15   NaN
row3  92.0    68  58.0    33   NaN
row4   NaN    63  35.0    95   NaN

If we want to find a column that have at least one nonzero (any) value, this will help:

df.loc[:, df.any()]

Out[47]: 
      col0  col1  col2  col3  col4
row0  24.0    78  42.0     7    96
row1  40.0     4   NaN    12    84
row2  83.0    17  80.0    26    15
row3  92.0    68  58.0     0    33
row4   NaN    63  35.0    70    95

To select columns with any NaN:

df.loc[:, df.isnull().any()]

Out[48]: 
      col0  col2  col6
row0  24.0  42.0   NaN
row1  40.0   NaN   NaN
row2  83.0  80.0   NaN
row3  92.0  58.0   NaN
row4   NaN  35.0   NaN

Select columns without NaNs:

df.loc[:, df.notnull().all()]

Out[49]: 
      col1  col3  col4  col5
row0    78     7    96     0
row1     4    12    84     0
row2    17    26    15     0
row3    68     0    33     0
row4    63    70    95     0

We can drop those rows containing NaNs, but it’s a dangerous game – dropping usually isn’t a solution. You have to understand your data and deal with such rows wisely. I warned you.

Are you sure you wanna know it? OK.. 😀

df.dropna(how='all', axis=1) # if all values in a column are NaN it will be dropped
Out[69]: 
      col0  col1  col2  col3  col4  col5
row0  24.0    78  42.0     7    96     0
row1  40.0     4   NaN    12    84     0
row2  83.0    17  80.0    26    15     0
row3  92.0    68  58.0     0    33     0
row4   NaN    63  35.0    70    95     0

df.dropna(how='any', axis=1) # if any value in a row is NaN it will be dropped
Out[71]: 
      col1  col3  col4  col5
row0    78     7    96     0
row1     4    12    84     0
row2    17    26    15     0
row3    68     0    33     0
row4    63    70    95     0

This methods do not modify original DataFrame, so to continue working with filtered data you have to assign it to new dataframe or reassign to the existing one

df = df.dropna(how='any', axis=1)

The beauty of filtering is that we actually can select or modify values of one column based on another. For example, we can select values from col1 where col2 is grater than 35 and update those values by adding 5 to each:

# Find a column based on another
df['col1'][df['col2'] > 35]
Out[74]: 
row0    78
row2    17
row3    68
Name: col1, dtype: int32

# Modify a column based on another
df['col1'][df['col2'] > 35] += 5
df['col1']
Out[77]: 
row0    83
row1     4
row2    22
row3    73
row4    63
Name: col1, dtype: int32

And this brings us to the next part –

Transforming DataFrames

Once we selected or filtered our data we want to transform it somehow. The best way to do this is with methods inherited to DataFrames or numpy universal funcs, that transform entire column of data element-wise. Examples would be pandas’ .floordiv() function (from documentation:
‘Integer division of dataframe and other, element-wise’) or numpy’s .floor_divide() (doc: ‘Return the largest integer smaller or equal to the division of the inputs.’).

If those functions were not available we could write our own and use it with .apply() method.

def some_func(x):
    return x * 2

df.apply(some_func) -- # update each entry of a DataFrame without any loops

# lambda also works
df.apply(lambda n: n*2) -- # the same

These functions do not return transformations, so we have to store it explicitly:

df['new_col'] = df['col4'].apply(lambda n: n*2)
df.head()

Out[82]: 
      col0  col1  col2  col3  col4  col5  col6  new_col
row0  24.0    83  42.0     7    96     0   NaN      192
row1  40.0     4   NaN    12    84     0   NaN      168
row2  83.0    22  80.0    26    15     0   NaN       30
row3  92.0    73  58.0     0    33     0   NaN       66
row4   NaN    63  35.0    70    95     0   NaN      190

If index is a string it has a .str accessor that allows us to modify entire index at once:

df.index.str.upper()
Out[83]: Index(['ROW0', 'ROW1', 'ROW2', 'ROW3', 'ROW4'], dtype='object')

Also, we cannot use .apply() method on index – the alternative for it is .map()

df.index = df.index.map(str.lower)
Out[85]: Index(['row0', 'row1', 'row2', 'row3', 'row4'], dtype='object')

But .map() can be used on columns as well. For example:

# Create the dictionary: red_vs_blue
red_vs_blue = {0:'blue', 12:'red'}

# Use the dictionary to map the 'col3' column to the new column df['color']
df['color'] = df['col3'].map(red_vs_blue)
df.head()

Out[92]: 
      col0  col1  col2  col3  col4  col5  col6  new_col color
row0  24.0    83  42.0     7    96     0   NaN      192   NaN
row1  40.0     4   NaN    12    84     0   NaN      168   red
row2  83.0    22  80.0    26    15     0   NaN       30   NaN
row3  92.0    73  58.0     0    33     0   NaN       66  blue
row4   NaN    63  35.0    70    95     0   NaN      190   NaN

Arithmetic operations on Series and DataFrames work directly. The expression below will create a new column where each value with index n is a sum of values with index n from ‘col3’ and ‘col7’.

df['col7'] = df['col3'] + df['col4'] 
df.head()

Out[94]: 
      col0  col1  col2  col3  col4  col5  col6  new_col color  col7
row0  24.0    83  42.0     7    96     0   NaN      192   NaN   103
row1  40.0     4   NaN    12    84     0   NaN      168   red    96
row2  83.0    22  80.0    26    15     0   NaN       30   NaN    41
row3  92.0    73  58.0     0    33     0   NaN       66  blue    33
row4   NaN    63  35.0    70    95     0   NaN      190   NaN   165

This is the second version of the article, because the first one was a complete mess – errors in the code, no examples and few other things. Thanks to the feedback I went through the article one more time and I think it looks much better now. I have covered basics of transforming and extracting data in Python with code snippets and examples here and hopefully it will be useful for people who are just starting their path in this field.

Meanwhile, love data science and smile more. We have to be positive as we have the sexiest job of 21st century 😀

2 думок про “Extracting and Transforming Data in Python”

Kay says:
June 29, 2020 at 20:45
Hi, thanks for this amazing piece. Quick question: if you’re working with a huge dataset and thus large data frame, how do you do transformations using the columns if you don’t want to type out each column label for all 700 columns?
1. Sergi says:
  June 30, 2020 at 18:45
  You can access the .columns attribute of the DataFrame and copy-paste it or store in variable etc. You will have a list with all columns names so you can filter it using list methods. Hope this is something that might help 🙂 if that doesn’t work reach me on social media – links in the footer 😉
  Cheers!

Блог

Extracting and Transforming Data in Python

Indexing DataFrames

Filtering DataFrames

Dealing with 0 and NaN values

Transforming DataFrames

2 думок про “Extracting and Transforming Data in Python”

Leave a Reply Cancel reply

Indexing DataFrames

Filtering DataFrames

Dealing with 0 and NaN values

Transforming DataFrames

2 думок про “Extracting and Transforming Data in Python”

Leave a Reply Cancel reply

Sergi's Blog Cookies Policy