DataFrame - Indexing

Python - Indexing in DataFrame

There are many ways of indexing in DataFrame such as .loc, .iloc, .at, and .iat. In this page, loc and iloc are briefly reviewed.

(1) Basic information of DataFrame in Python

First, need to import numpy and pandas libraries.

import numpy as np
import pandas as pd 

Let’s create the data1.

data1 = {"ID" : ["A", "A", "A", "B", "B"],
         "year" : [2014, 2015, 2016, 2015, 2016],
         "credits" : [100, 200, 250, 150, 140]}
data1
{'ID': ['A', 'A', 'A', 'B', 'B'],
 'credits': [100, 200, 250, 150, 140],
 'year': [2014, 2015, 2016, 2015, 2016]}

Let’s change it into the DataFrame format. Note that we can define its index. And the penalty column is added, but we don’t define its values.

df1 = pd.DataFrame(data1, 
      columns=["year", "ID", "credits", "penalty"],
      index =["one", "two", "three", "four", "five"])

We have year, ID, credits and penalty columns, but the values in the penalty column are NaN.

df1
year ID credits penalty
one 2014 A 100 NaN
two 2015 A 200 NaN
three 2016 A 250 NaN
four 2015 B 150 NaN
five 2016 B 140 NaN

We can call the specific column in the DataFrame using different ways (method 1 and method 2) below. But the result is series format, not the data frame.

df1["year"]  # method 1
one      2014
two      2015
three    2016
four     2015
five     2016
Name: year, dtype: int64
df1.year  # method 2 
one      2014
two      2015
three    2016
four     2015
five     2016
Name: year, dtype: int64

Let’s put the penalty’s values.

df1["penalty"] = [0.1, 0.2, 0.3, 0.4, 0.5]; df1
year ID credits penalty
one 2014 A 100 0.1
two 2015 A 200 0.2
three 2016 A 250 0.3
four 2015 B 150 0.4
five 2016 B 140 0.5

Let’s make a new column, “extra” and it ranges from 0 to 4. Therefore, the need to put range value 5. (np.aragne(5))

df1["extra"] = np.arange(5); df1
year ID credits penalty extra
one 2014 A 100 0.1 0
two 2015 A 200 0.2 1
three 2016 A 250 0.3 2
four 2015 B 150 0.4 3
five 2016 B 140 0.5 4

We also can put the values in the specific index. We add point column and put its values at the index one, two and four.

value = pd.Series([10, 20, 30], index=["two", "four", "one"])
df1["points"] = value; df1
year ID credits penalty extra points
one 2014 A 100 0.1 0 30.0
two 2015 A 200 0.2 1 10.0
three 2016 A 250 0.3 2 NaN
four 2015 B 150 0.4 3 20.0
five 2016 B 140 0.5 4 NaN

Create the total column which can be calculated by subtracting between credits and penalty columns.

df1["total"] = df1["credits"] - df1["penalty"]; df1
year ID credits penalty extra points total
one 2014 A 100 0.1 0 30.0 99.9
two 2015 A 200 0.2 1 10.0 199.8
three 2016 A 250 0.3 2 NaN 249.7
four 2015 B 150 0.4 3 20.0 149.6
five 2016 B 140 0.5 4 NaN 139.5

Delete the extra column using del command.

del df1["extra"] ; df1
year ID credits penalty points total
one 2014 A 100 0.1 30.0 99.9
two 2015 A 200 0.2 10.0 199.8
three 2016 A 250 0.3 NaN 249.7
four 2015 B 150 0.4 20.0 149.6
five 2016 B 140 0.5 NaN 139.5

Note that when adding, deleting columns, make sure which columns are in the DataFrame.

df1.columns
Index(['year', 'ID', 'credits', 'penalty', 'points', 'total'], dtype='object')

(2) Indexing Methods, loc and iloc

df1.index.name = "Order"
df1.columns.name="Info"; df1
Info year ID credits penalty points total
Order
one 2014 A 100 0.1 30.0 99.9
two 2015 A 200 0.2 10.0 199.8
three 2016 A 250 0.3 NaN 249.7
four 2015 B 150 0.4 20.0 149.6
five 2016 B 140 0.5 NaN 139.5

There are many ways of indexing. Simply use the index numbers in the square brackets [ ].

df1[0:3]
Info year ID credits penalty points total
Order
one 2014 A 100 0.1 30.0 99.9
two 2015 A 200 0.2 10.0 199.8
three 2016 A 250 0.3 NaN 249.7

But there are many better ways. Now let’s see the loc and iloc methods. Check the detailed information in the reference. But the main difference is,

Note
loc : only work on index,
-df.loc[row_indexer,column_indexer]
iloc : work on integer position

As the loc works on index, we need to put the index. We previously defined the index, one to five.

df1.loc["two"] # the result is a series format. 
Info
year        2015
ID             A
credits      200
penalty      0.2
points        10
total      199.8
Name: two, dtype: object

We can read the row of index from two to four. Here are some examples of using loc command.

df1.loc["two": "four"]
Info year ID credits penalty points total
Order
two 2015 A 200 0.2 10.0 199.8
three 2016 A 250 0.3 NaN 249.7
four 2015 B 150 0.4 20.0 149.6
df1.loc["two":"four", "credits"]
Order
two      200
three    250
four     150
Name: credits, dtype: int64

Note that df1.loc[ : , “year] is same as df1[“year”].

df1.loc[:, "year"]
Order
one      2014.0
two      2015.0
three    2016.0
four     2015.0
five     2016.0
six         NaN
Name: year, dtype: float64
df1["year"]
Order
one      2014.0
two      2015.0
three    2016.0
four     2015.0
five     2016.0
six         NaN
Name: year, dtype: float64
df1.loc[:, ["year", "ID"]]
Info year ID
Order
one 2014.0 A
two 2015.0 A
three 2016.0 A
four 2015.0 B
five 2016.0 B
six NaN NaN
df1.loc["one":"three", "year":"penalty"]
Info year ID credits penalty
Order
one 2014.0 A 100.0 0.1
two 2015.0 A 200.0 0.2
three 2016.0 A 250.0 0.3

Reminder. Our DataFrame we are dealing with is below.

df1
Info year ID credits penalty points total
Order
one 2014.0 A 100.0 0.1 30.0 99.9
two 2015.0 A 200.0 0.2 10.0 199.8
three 2016.0 A 250.0 0.3 NaN 249.7
four 2015.0 B 150.0 0.4 20.0 149.6
five 2016.0 B 140.0 0.5 NaN 139.5
six NaN NaN NaN NaN NaN NaN

We can use boolean array as well, here are some examples.

df1["year"] > 2014 # called mask 
Order
one      False
two       True
three     True
four      True
five      True
six      False
Name: year, dtype: bool
df1.loc[df1["year"] > 2014, :]
Info year ID credits penalty points total
Order
two 2015.0 A 200.0 0.2 10.0 199.8
three 2016.0 A 250.0 0.3 NaN 249.7
four 2015.0 B 150.0 0.4 20.0 149.6
five 2016.0 B 140.0 0.5 NaN 139.5
df1.loc[df1["ID"] == "B", ["ID", "credits"]]
Info ID credits
Order
four B 150.0
five B 140.0
df1.loc[ (df1["credits"] > 130) & (df1["credits"] < 220), :]
Info year ID credits penalty points total
Order
two 2015.0 A 200.0 0.2 10.0 199.8
four 2015.0 B 150.0 0.4 20.0 149.6
five 2016.0 B 140.0 0.5 NaN 139.5

Now, how about iloc command? The iloc command is primarily internet position based, it means it starts from 0 to lenght-1 of the axis.

Our DataFrame is following below. Here are some examples.

df1
Info year ID credits penalty points total
Order
one 2014.0 A 100.0 0.1 30.0 99.9
two 2015.0 A 200.0 0.2 10.0 199.8
three 2016.0 A 250.0 0.3 NaN 249.7
four 2015.0 B 150.0 0.4 20.0 149.6
five 2016.0 B 140.0 0.5 NaN 139.5
six NaN NaN NaN NaN NaN NaN
df1.iloc[3:5, 0:2]
Info year ID
Order
four 2015.0 B
five 2016.0 B
df1.iloc[3] # index four's row will be loaded in a series format. 
Info
year        2015
ID             B
credits      150
penalty      0.4
points        20
total      149.6
Name: four, dtype: object

Of course we can get the same result using two commands, iloc and loc , with different codes inside the square brackets.

df1.iloc[2:4, 0:2]
Info year ID
Order
three 2016.0 A
four 2015.0 B
df1.loc["three":"four", "year":"ID"]
Info year ID
Order
three 2016.0 A
four 2015.0 B

Download code: GitHub
Reference : Pandas Choices for Indexing

No comments:

Post a Comment