Python - Indexing in DataFrame
There are many ways of indexing in DataFrame such as .loc, .iloc, .at, and .iat. In this page, loc and iloc are briefly reviewed.
(1) Basic information of DataFrame in Python
First, need to import numpy and pandas libraries.
import numpy as np
import pandas as pd Let’s create the data1.
data1 = {"ID" : ["A", "A", "A", "B", "B"],
"year" : [2014, 2015, 2016, 2015, 2016],
"credits" : [100, 200, 250, 150, 140]}data1{'ID': ['A', 'A', 'A', 'B', 'B'],
'credits': [100, 200, 250, 150, 140],
'year': [2014, 2015, 2016, 2015, 2016]}
Let’s change it into the DataFrame format. Note that we can define its index. And the penalty column is added, but we don’t define its values.
df1 = pd.DataFrame(data1,
columns=["year", "ID", "credits", "penalty"],
index =["one", "two", "three", "four", "five"])We have year, ID, credits and penalty columns, but the values in the penalty column are NaN.
df1| year | ID | credits | penalty | |
|---|---|---|---|---|
| one | 2014 | A | 100 | NaN |
| two | 2015 | A | 200 | NaN |
| three | 2016 | A | 250 | NaN |
| four | 2015 | B | 150 | NaN |
| five | 2016 | B | 140 | NaN |
We can call the specific column in the DataFrame using different ways (method 1 and method 2) below. But the result is series format, not the data frame.
df1["year"] # method 1one 2014
two 2015
three 2016
four 2015
five 2016
Name: year, dtype: int64
df1.year # method 2 one 2014
two 2015
three 2016
four 2015
five 2016
Name: year, dtype: int64
Let’s put the penalty’s values.
df1["penalty"] = [0.1, 0.2, 0.3, 0.4, 0.5]; df1| year | ID | credits | penalty | |
|---|---|---|---|---|
| one | 2014 | A | 100 | 0.1 |
| two | 2015 | A | 200 | 0.2 |
| three | 2016 | A | 250 | 0.3 |
| four | 2015 | B | 150 | 0.4 |
| five | 2016 | B | 140 | 0.5 |
Let’s make a new column, “extra” and it ranges from 0 to 4. Therefore, the need to put range value 5. (np.aragne(5))
df1["extra"] = np.arange(5); df1| year | ID | credits | penalty | extra | |
|---|---|---|---|---|---|
| one | 2014 | A | 100 | 0.1 | 0 |
| two | 2015 | A | 200 | 0.2 | 1 |
| three | 2016 | A | 250 | 0.3 | 2 |
| four | 2015 | B | 150 | 0.4 | 3 |
| five | 2016 | B | 140 | 0.5 | 4 |
We also can put the values in the specific index. We add point column and put its values at the index one, two and four.
value = pd.Series([10, 20, 30], index=["two", "four", "one"])df1["points"] = value; df1| year | ID | credits | penalty | extra | points | |
|---|---|---|---|---|---|---|
| one | 2014 | A | 100 | 0.1 | 0 | 30.0 |
| two | 2015 | A | 200 | 0.2 | 1 | 10.0 |
| three | 2016 | A | 250 | 0.3 | 2 | NaN |
| four | 2015 | B | 150 | 0.4 | 3 | 20.0 |
| five | 2016 | B | 140 | 0.5 | 4 | NaN |
Create the total column which can be calculated by subtracting between credits and penalty columns.
df1["total"] = df1["credits"] - df1["penalty"]; df1| year | ID | credits | penalty | extra | points | total | |
|---|---|---|---|---|---|---|---|
| one | 2014 | A | 100 | 0.1 | 0 | 30.0 | 99.9 |
| two | 2015 | A | 200 | 0.2 | 1 | 10.0 | 199.8 |
| three | 2016 | A | 250 | 0.3 | 2 | NaN | 249.7 |
| four | 2015 | B | 150 | 0.4 | 3 | 20.0 | 149.6 |
| five | 2016 | B | 140 | 0.5 | 4 | NaN | 139.5 |
Delete the extra column using del command.
del df1["extra"] ; df1| year | ID | credits | penalty | points | total | |
|---|---|---|---|---|---|---|
| one | 2014 | A | 100 | 0.1 | 30.0 | 99.9 |
| two | 2015 | A | 200 | 0.2 | 10.0 | 199.8 |
| three | 2016 | A | 250 | 0.3 | NaN | 249.7 |
| four | 2015 | B | 150 | 0.4 | 20.0 | 149.6 |
| five | 2016 | B | 140 | 0.5 | NaN | 139.5 |
Note that when adding, deleting columns, make sure which columns are in the DataFrame.
df1.columnsIndex(['year', 'ID', 'credits', 'penalty', 'points', 'total'], dtype='object')
(2) Indexing Methods, loc and iloc
df1.index.name = "Order"
df1.columns.name="Info"; df1| Info | year | ID | credits | penalty | points | total |
|---|---|---|---|---|---|---|
| Order | ||||||
| one | 2014 | A | 100 | 0.1 | 30.0 | 99.9 |
| two | 2015 | A | 200 | 0.2 | 10.0 | 199.8 |
| three | 2016 | A | 250 | 0.3 | NaN | 249.7 |
| four | 2015 | B | 150 | 0.4 | 20.0 | 149.6 |
| five | 2016 | B | 140 | 0.5 | NaN | 139.5 |
There are many ways of indexing. Simply use the index numbers in the square brackets [ ].
df1[0:3]| Info | year | ID | credits | penalty | points | total |
|---|---|---|---|---|---|---|
| Order | ||||||
| one | 2014 | A | 100 | 0.1 | 30.0 | 99.9 |
| two | 2015 | A | 200 | 0.2 | 10.0 | 199.8 |
| three | 2016 | A | 250 | 0.3 | NaN | 249.7 |
But there are many better ways. Now let’s see the loc and iloc methods. Check the detailed information in the reference. But the main difference is,
Note
loc : only work on index,
-df.loc[row_indexer,column_indexer]
iloc : work on integer position
As the loc works on index, we need to put the index. We previously defined the index, one to five.
df1.loc["two"] # the result is a series format. Info
year 2015
ID A
credits 200
penalty 0.2
points 10
total 199.8
Name: two, dtype: object
We can read the row of index from two to four. Here are some examples of using loc command.
df1.loc["two": "four"]| Info | year | ID | credits | penalty | points | total |
|---|---|---|---|---|---|---|
| Order | ||||||
| two | 2015 | A | 200 | 0.2 | 10.0 | 199.8 |
| three | 2016 | A | 250 | 0.3 | NaN | 249.7 |
| four | 2015 | B | 150 | 0.4 | 20.0 | 149.6 |
df1.loc["two":"four", "credits"]Order
two 200
three 250
four 150
Name: credits, dtype: int64
Note that df1.loc[ : , “year] is same as df1[“year”].
df1.loc[:, "year"]Order
one 2014.0
two 2015.0
three 2016.0
four 2015.0
five 2016.0
six NaN
Name: year, dtype: float64
df1["year"]Order
one 2014.0
two 2015.0
three 2016.0
four 2015.0
five 2016.0
six NaN
Name: year, dtype: float64
df1.loc[:, ["year", "ID"]]| Info | year | ID |
|---|---|---|
| Order | ||
| one | 2014.0 | A |
| two | 2015.0 | A |
| three | 2016.0 | A |
| four | 2015.0 | B |
| five | 2016.0 | B |
| six | NaN | NaN |
df1.loc["one":"three", "year":"penalty"]| Info | year | ID | credits | penalty |
|---|---|---|---|---|
| Order | ||||
| one | 2014.0 | A | 100.0 | 0.1 |
| two | 2015.0 | A | 200.0 | 0.2 |
| three | 2016.0 | A | 250.0 | 0.3 |
Reminder. Our DataFrame we are dealing with is below.
df1| Info | year | ID | credits | penalty | points | total |
|---|---|---|---|---|---|---|
| Order | ||||||
| one | 2014.0 | A | 100.0 | 0.1 | 30.0 | 99.9 |
| two | 2015.0 | A | 200.0 | 0.2 | 10.0 | 199.8 |
| three | 2016.0 | A | 250.0 | 0.3 | NaN | 249.7 |
| four | 2015.0 | B | 150.0 | 0.4 | 20.0 | 149.6 |
| five | 2016.0 | B | 140.0 | 0.5 | NaN | 139.5 |
| six | NaN | NaN | NaN | NaN | NaN | NaN |
We can use boolean array as well, here are some examples.
df1["year"] > 2014 # called mask Order
one False
two True
three True
four True
five True
six False
Name: year, dtype: bool
df1.loc[df1["year"] > 2014, :]| Info | year | ID | credits | penalty | points | total |
|---|---|---|---|---|---|---|
| Order | ||||||
| two | 2015.0 | A | 200.0 | 0.2 | 10.0 | 199.8 |
| three | 2016.0 | A | 250.0 | 0.3 | NaN | 249.7 |
| four | 2015.0 | B | 150.0 | 0.4 | 20.0 | 149.6 |
| five | 2016.0 | B | 140.0 | 0.5 | NaN | 139.5 |
df1.loc[df1["ID"] == "B", ["ID", "credits"]]| Info | ID | credits |
|---|---|---|
| Order | ||
| four | B | 150.0 |
| five | B | 140.0 |
df1.loc[ (df1["credits"] > 130) & (df1["credits"] < 220), :]| Info | year | ID | credits | penalty | points | total |
|---|---|---|---|---|---|---|
| Order | ||||||
| two | 2015.0 | A | 200.0 | 0.2 | 10.0 | 199.8 |
| four | 2015.0 | B | 150.0 | 0.4 | 20.0 | 149.6 |
| five | 2016.0 | B | 140.0 | 0.5 | NaN | 139.5 |
Now, how about iloc command? The iloc command is primarily internet position based, it means it starts from 0 to lenght-1 of the axis.
Our DataFrame is following below. Here are some examples.
df1| Info | year | ID | credits | penalty | points | total |
|---|---|---|---|---|---|---|
| Order | ||||||
| one | 2014.0 | A | 100.0 | 0.1 | 30.0 | 99.9 |
| two | 2015.0 | A | 200.0 | 0.2 | 10.0 | 199.8 |
| three | 2016.0 | A | 250.0 | 0.3 | NaN | 249.7 |
| four | 2015.0 | B | 150.0 | 0.4 | 20.0 | 149.6 |
| five | 2016.0 | B | 140.0 | 0.5 | NaN | 139.5 |
| six | NaN | NaN | NaN | NaN | NaN | NaN |
df1.iloc[3:5, 0:2]| Info | year | ID |
|---|---|---|
| Order | ||
| four | 2015.0 | B |
| five | 2016.0 | B |
df1.iloc[3] # index four's row will be loaded in a series format. Info
year 2015
ID B
credits 150
penalty 0.4
points 20
total 149.6
Name: four, dtype: object
Of course we can get the same result using two commands, iloc and loc , with different codes inside the square brackets.
df1.iloc[2:4, 0:2]| Info | year | ID |
|---|---|---|
| Order | ||
| three | 2016.0 | A |
| four | 2015.0 | B |
df1.loc["three":"four", "year":"ID"]| Info | year | ID |
|---|---|---|
| Order | ||
| three | 2016.0 | A |
| four | 2015.0 | B |
Download code: GitHub
Reference : Pandas Choices for Indexing
No comments:
Post a Comment