Python - Indexing in DataFrame
There are many ways of indexing in DataFrame such as .loc, .iloc, .at, and .iat. In this page, loc and iloc are briefly reviewed.
(1) Basic information of DataFrame in Python
First, need to import numpy and pandas libraries.
import numpy as np
import pandas as pd
Let’s create the data1.
data1 = {"ID" : ["A", "A", "A", "B", "B"],
"year" : [2014, 2015, 2016, 2015, 2016],
"credits" : [100, 200, 250, 150, 140]}
data1
{'ID': ['A', 'A', 'A', 'B', 'B'],
'credits': [100, 200, 250, 150, 140],
'year': [2014, 2015, 2016, 2015, 2016]}
Let’s change it into the DataFrame format. Note that we can define its index. And the penalty column is added, but we don’t define its values.
df1 = pd.DataFrame(data1,
columns=["year", "ID", "credits", "penalty"],
index =["one", "two", "three", "four", "five"])
We have year, ID, credits and penalty columns, but the values in the penalty column are NaN.
df1
year | ID | credits | penalty | |
---|---|---|---|---|
one | 2014 | A | 100 | NaN |
two | 2015 | A | 200 | NaN |
three | 2016 | A | 250 | NaN |
four | 2015 | B | 150 | NaN |
five | 2016 | B | 140 | NaN |
We can call the specific column in the DataFrame using different ways (method 1 and method 2) below. But the result is series format, not the data frame.
df1["year"] # method 1
one 2014
two 2015
three 2016
four 2015
five 2016
Name: year, dtype: int64
df1.year # method 2
one 2014
two 2015
three 2016
four 2015
five 2016
Name: year, dtype: int64
Let’s put the penalty’s values.
df1["penalty"] = [0.1, 0.2, 0.3, 0.4, 0.5]; df1
year | ID | credits | penalty | |
---|---|---|---|---|
one | 2014 | A | 100 | 0.1 |
two | 2015 | A | 200 | 0.2 |
three | 2016 | A | 250 | 0.3 |
four | 2015 | B | 150 | 0.4 |
five | 2016 | B | 140 | 0.5 |
Let’s make a new column, “extra” and it ranges from 0 to 4. Therefore, the need to put range value 5. (np.aragne(5))
df1["extra"] = np.arange(5); df1
year | ID | credits | penalty | extra | |
---|---|---|---|---|---|
one | 2014 | A | 100 | 0.1 | 0 |
two | 2015 | A | 200 | 0.2 | 1 |
three | 2016 | A | 250 | 0.3 | 2 |
four | 2015 | B | 150 | 0.4 | 3 |
five | 2016 | B | 140 | 0.5 | 4 |
We also can put the values in the specific index. We add point column and put its values at the index one, two and four.
value = pd.Series([10, 20, 30], index=["two", "four", "one"])
df1["points"] = value; df1
year | ID | credits | penalty | extra | points | |
---|---|---|---|---|---|---|
one | 2014 | A | 100 | 0.1 | 0 | 30.0 |
two | 2015 | A | 200 | 0.2 | 1 | 10.0 |
three | 2016 | A | 250 | 0.3 | 2 | NaN |
four | 2015 | B | 150 | 0.4 | 3 | 20.0 |
five | 2016 | B | 140 | 0.5 | 4 | NaN |
Create the total column which can be calculated by subtracting between credits and penalty columns.
df1["total"] = df1["credits"] - df1["penalty"]; df1
year | ID | credits | penalty | extra | points | total | |
---|---|---|---|---|---|---|---|
one | 2014 | A | 100 | 0.1 | 0 | 30.0 | 99.9 |
two | 2015 | A | 200 | 0.2 | 1 | 10.0 | 199.8 |
three | 2016 | A | 250 | 0.3 | 2 | NaN | 249.7 |
four | 2015 | B | 150 | 0.4 | 3 | 20.0 | 149.6 |
five | 2016 | B | 140 | 0.5 | 4 | NaN | 139.5 |
Delete the extra column using del command.
del df1["extra"] ; df1
year | ID | credits | penalty | points | total | |
---|---|---|---|---|---|---|
one | 2014 | A | 100 | 0.1 | 30.0 | 99.9 |
two | 2015 | A | 200 | 0.2 | 10.0 | 199.8 |
three | 2016 | A | 250 | 0.3 | NaN | 249.7 |
four | 2015 | B | 150 | 0.4 | 20.0 | 149.6 |
five | 2016 | B | 140 | 0.5 | NaN | 139.5 |
Note that when adding, deleting columns, make sure which columns are in the DataFrame.
df1.columns
Index(['year', 'ID', 'credits', 'penalty', 'points', 'total'], dtype='object')
(2) Indexing Methods, loc and iloc
df1.index.name = "Order"
df1.columns.name="Info"; df1
Info | year | ID | credits | penalty | points | total |
---|---|---|---|---|---|---|
Order | ||||||
one | 2014 | A | 100 | 0.1 | 30.0 | 99.9 |
two | 2015 | A | 200 | 0.2 | 10.0 | 199.8 |
three | 2016 | A | 250 | 0.3 | NaN | 249.7 |
four | 2015 | B | 150 | 0.4 | 20.0 | 149.6 |
five | 2016 | B | 140 | 0.5 | NaN | 139.5 |
There are many ways of indexing. Simply use the index numbers in the square brackets [ ].
df1[0:3]
Info | year | ID | credits | penalty | points | total |
---|---|---|---|---|---|---|
Order | ||||||
one | 2014 | A | 100 | 0.1 | 30.0 | 99.9 |
two | 2015 | A | 200 | 0.2 | 10.0 | 199.8 |
three | 2016 | A | 250 | 0.3 | NaN | 249.7 |
But there are many better ways. Now let’s see the loc and iloc methods. Check the detailed information in the reference. But the main difference is,
Note
loc : only work on index,
-df.loc[row_indexer,column_indexer]
iloc : work on integer position
As the loc works on index, we need to put the index. We previously defined the index, one to five.
df1.loc["two"] # the result is a series format.
Info
year 2015
ID A
credits 200
penalty 0.2
points 10
total 199.8
Name: two, dtype: object
We can read the row of index from two to four. Here are some examples of using loc command.
df1.loc["two": "four"]
Info | year | ID | credits | penalty | points | total |
---|---|---|---|---|---|---|
Order | ||||||
two | 2015 | A | 200 | 0.2 | 10.0 | 199.8 |
three | 2016 | A | 250 | 0.3 | NaN | 249.7 |
four | 2015 | B | 150 | 0.4 | 20.0 | 149.6 |
df1.loc["two":"four", "credits"]
Order
two 200
three 250
four 150
Name: credits, dtype: int64
Note that df1.loc[ : , “year] is same as df1[“year”].
df1.loc[:, "year"]
Order
one 2014.0
two 2015.0
three 2016.0
four 2015.0
five 2016.0
six NaN
Name: year, dtype: float64
df1["year"]
Order
one 2014.0
two 2015.0
three 2016.0
four 2015.0
five 2016.0
six NaN
Name: year, dtype: float64
df1.loc[:, ["year", "ID"]]
Info | year | ID |
---|---|---|
Order | ||
one | 2014.0 | A |
two | 2015.0 | A |
three | 2016.0 | A |
four | 2015.0 | B |
five | 2016.0 | B |
six | NaN | NaN |
df1.loc["one":"three", "year":"penalty"]
Info | year | ID | credits | penalty |
---|---|---|---|---|
Order | ||||
one | 2014.0 | A | 100.0 | 0.1 |
two | 2015.0 | A | 200.0 | 0.2 |
three | 2016.0 | A | 250.0 | 0.3 |
Reminder. Our DataFrame we are dealing with is below.
df1
Info | year | ID | credits | penalty | points | total |
---|---|---|---|---|---|---|
Order | ||||||
one | 2014.0 | A | 100.0 | 0.1 | 30.0 | 99.9 |
two | 2015.0 | A | 200.0 | 0.2 | 10.0 | 199.8 |
three | 2016.0 | A | 250.0 | 0.3 | NaN | 249.7 |
four | 2015.0 | B | 150.0 | 0.4 | 20.0 | 149.6 |
five | 2016.0 | B | 140.0 | 0.5 | NaN | 139.5 |
six | NaN | NaN | NaN | NaN | NaN | NaN |
We can use boolean array as well, here are some examples.
df1["year"] > 2014 # called mask
Order
one False
two True
three True
four True
five True
six False
Name: year, dtype: bool
df1.loc[df1["year"] > 2014, :]
Info | year | ID | credits | penalty | points | total |
---|---|---|---|---|---|---|
Order | ||||||
two | 2015.0 | A | 200.0 | 0.2 | 10.0 | 199.8 |
three | 2016.0 | A | 250.0 | 0.3 | NaN | 249.7 |
four | 2015.0 | B | 150.0 | 0.4 | 20.0 | 149.6 |
five | 2016.0 | B | 140.0 | 0.5 | NaN | 139.5 |
df1.loc[df1["ID"] == "B", ["ID", "credits"]]
Info | ID | credits |
---|---|---|
Order | ||
four | B | 150.0 |
five | B | 140.0 |
df1.loc[ (df1["credits"] > 130) & (df1["credits"] < 220), :]
Info | year | ID | credits | penalty | points | total |
---|---|---|---|---|---|---|
Order | ||||||
two | 2015.0 | A | 200.0 | 0.2 | 10.0 | 199.8 |
four | 2015.0 | B | 150.0 | 0.4 | 20.0 | 149.6 |
five | 2016.0 | B | 140.0 | 0.5 | NaN | 139.5 |
Now, how about iloc command? The iloc command is primarily internet position based, it means it starts from 0 to lenght-1 of the axis.
Our DataFrame is following below. Here are some examples.
df1
Info | year | ID | credits | penalty | points | total |
---|---|---|---|---|---|---|
Order | ||||||
one | 2014.0 | A | 100.0 | 0.1 | 30.0 | 99.9 |
two | 2015.0 | A | 200.0 | 0.2 | 10.0 | 199.8 |
three | 2016.0 | A | 250.0 | 0.3 | NaN | 249.7 |
four | 2015.0 | B | 150.0 | 0.4 | 20.0 | 149.6 |
five | 2016.0 | B | 140.0 | 0.5 | NaN | 139.5 |
six | NaN | NaN | NaN | NaN | NaN | NaN |
df1.iloc[3:5, 0:2]
Info | year | ID |
---|---|---|
Order | ||
four | 2015.0 | B |
five | 2016.0 | B |
df1.iloc[3] # index four's row will be loaded in a series format.
Info
year 2015
ID B
credits 150
penalty 0.4
points 20
total 149.6
Name: four, dtype: object
Of course we can get the same result using two commands, iloc and loc , with different codes inside the square brackets.
df1.iloc[2:4, 0:2]
Info | year | ID |
---|---|---|
Order | ||
three | 2016.0 | A |
four | 2015.0 | B |
df1.loc["three":"four", "year":"ID"]
Info | year | ID |
---|---|---|
Order | ||
three | 2016.0 | A |
four | 2015.0 | B |
Download code: GitHub
Reference : Pandas Choices for Indexing
No comments:
Post a Comment