Stat n Math : DataFrame

Python - Indexing in DataFrame

There are many ways of indexing in DataFrame such as .loc, .iloc, .at, and .iat. In this page, loc and iloc are briefly reviewed.

(1) Basic information of DataFrame in Python

First, need to import numpy and pandas libraries.

import numpy as np
import pandas as pd

Let’s create the data1.

data1 = {"ID" : ["A", "A", "A", "B", "B"],
         "year" : [2014, 2015, 2016, 2015, 2016],
         "credits" : [100, 200, 250, 150, 140]}

data1

{'ID': ['A', 'A', 'A', 'B', 'B'],
 'credits': [100, 200, 250, 150, 140],
 'year': [2014, 2015, 2016, 2015, 2016]}

Let’s change it into the DataFrame format. Note that we can define its index. And the penalty column is added, but we don’t define its values.

df1 = pd.DataFrame(data1, 
      columns=["year", "ID", "credits", "penalty"],
      index =["one", "two", "three", "four", "five"])

We have year, ID, credits and penalty columns, but the values in the penalty column are NaN.

df1

	year	ID	credits	penalty
one	2014	A	100	NaN
two	2015	A	200	NaN
three	2016	A	250	NaN
four	2015	B	150	NaN
five	2016	B	140	NaN

We can call the specific column in the DataFrame using different ways (method 1 and method 2) below. But the result is series format, not the data frame.

df1["year"]  # method 1

one      2014
two      2015
three    2016
four     2015
five     2016
Name: year, dtype: int64

df1.year  # method 2

one      2014
two      2015
three    2016
four     2015
five     2016
Name: year, dtype: int64

Let’s put the penalty’s values.

df1["penalty"] = [0.1, 0.2, 0.3, 0.4, 0.5]; df1

	year	ID	credits	penalty
one	2014	A	100	0.1
two	2015	A	200	0.2
three	2016	A	250	0.3
four	2015	B	150	0.4
five	2016	B	140	0.5

Let’s make a new column, “extra” and it ranges from 0 to 4. Therefore, the need to put range value 5. (np.aragne(5))

df1["extra"] = np.arange(5); df1

	year	ID	credits	penalty	extra
one	2014	A	100	0.1	0
two	2015	A	200	0.2	1
three	2016	A	250	0.3	2
four	2015	B	150	0.4	3
five	2016	B	140	0.5	4

We also can put the values in the specific index. We add point column and put its values at the index one, two and four.

value = pd.Series([10, 20, 30], index=["two", "four", "one"])

df1["points"] = value; df1

	year	ID	credits	penalty	extra	points
one	2014	A	100	0.1	0	30.0
two	2015	A	200	0.2	1	10.0
three	2016	A	250	0.3	2	NaN
four	2015	B	150	0.4	3	20.0
five	2016	B	140	0.5	4	NaN

Create the total column which can be calculated by subtracting between credits and penalty columns.

df1["total"] = df1["credits"] - df1["penalty"]; df1

	year	ID	credits	penalty	extra	points	total
one	2014	A	100	0.1	0	30.0	99.9
two	2015	A	200	0.2	1	10.0	199.8
three	2016	A	250	0.3	2	NaN	249.7
four	2015	B	150	0.4	3	20.0	149.6
five	2016	B	140	0.5	4	NaN	139.5

Delete the extra column using del command.

del df1["extra"] ; df1

	year	ID	credits	penalty	points	total
one	2014	A	100	0.1	30.0	99.9
two	2015	A	200	0.2	10.0	199.8
three	2016	A	250	0.3	NaN	249.7
four	2015	B	150	0.4	20.0	149.6
five	2016	B	140	0.5	NaN	139.5

Note that when adding, deleting columns, make sure which columns are in the DataFrame.

df1.columns

Index(['year', 'ID', 'credits', 'penalty', 'points', 'total'], dtype='object')

(2) Indexing Methods, loc and iloc

df1.index.name = "Order"
df1.columns.name="Info"; df1

Info	year	ID	credits	penalty	points	total
Order
one	2014	A	100	0.1	30.0	99.9
two	2015	A	200	0.2	10.0	199.8
three	2016	A	250	0.3	NaN	249.7
four	2015	B	150	0.4	20.0	149.6
five	2016	B	140	0.5	NaN	139.5

There are many ways of indexing. Simply use the index numbers in the square brackets [ ].

df1[0:3]

Info	year	ID	credits	penalty	points	total
Order
one	2014	A	100	0.1	30.0	99.9
two	2015	A	200	0.2	10.0	199.8
three	2016	A	250	0.3	NaN	249.7

But there are many better ways. Now let’s see the loc and iloc methods. Check the detailed information in the reference. But the main difference is,

Note
loc : only work on index,
-df.loc[row_indexer,column_indexer]
iloc : work on integer position

As the loc works on index, we need to put the index. We previously defined the index, one to five.

df1.loc["two"] # the result is a series format.

Info
year        2015
ID             A
credits      200
penalty      0.2
points        10
total      199.8
Name: two, dtype: object

We can read the row of index from two to four. Here are some examples of using loc command.

df1.loc["two": "four"]

Info	year	ID	credits	penalty	points	total
Order
two	2015	A	200	0.2	10.0	199.8
three	2016	A	250	0.3	NaN	249.7
four	2015	B	150	0.4	20.0	149.6

df1.loc["two":"four", "credits"]

Order
two      200
three    250
four     150
Name: credits, dtype: int64

Note that df1.loc[ : , “year] is same as df1[“year”].

df1.loc[:, "year"]

Order
one      2014.0
two      2015.0
three    2016.0
four     2015.0
five     2016.0
six         NaN
Name: year, dtype: float64

df1["year"]

Order
one      2014.0
two      2015.0
three    2016.0
four     2015.0
five     2016.0
six         NaN
Name: year, dtype: float64

df1.loc[:, ["year", "ID"]]

Info	year	ID
Order
one	2014.0	A
two	2015.0	A
three	2016.0	A
four	2015.0	B
five	2016.0	B
six	NaN	NaN

df1.loc["one":"three", "year":"penalty"]

Info	year	ID	credits	penalty
Order
one	2014.0	A	100.0	0.1
two	2015.0	A	200.0	0.2
three	2016.0	A	250.0	0.3

Reminder. Our DataFrame we are dealing with is below.

df1

Info	year	ID	credits	penalty	points	total
Order
one	2014.0	A	100.0	0.1	30.0	99.9
two	2015.0	A	200.0	0.2	10.0	199.8
three	2016.0	A	250.0	0.3	NaN	249.7
four	2015.0	B	150.0	0.4	20.0	149.6
five	2016.0	B	140.0	0.5	NaN	139.5
six	NaN	NaN	NaN	NaN	NaN	NaN

We can use boolean array as well, here are some examples.

df1["year"] > 2014 # called mask

Order
one      False
two       True
three     True
four      True
five      True
six      False
Name: year, dtype: bool

df1.loc[df1["year"] > 2014, :]

Info	year	ID	credits	penalty	points	total
Order
two	2015.0	A	200.0	0.2	10.0	199.8
three	2016.0	A	250.0	0.3	NaN	249.7
four	2015.0	B	150.0	0.4	20.0	149.6
five	2016.0	B	140.0	0.5	NaN	139.5

df1.loc[df1["ID"] == "B", ["ID", "credits"]]

Info	ID	credits
Order
four	B	150.0
five	B	140.0

df1.loc[ (df1["credits"] > 130) & (df1["credits"] < 220), :]

Info	year	ID	credits	penalty	points	total
Order
two	2015.0	A	200.0	0.2	10.0	199.8
four	2015.0	B	150.0	0.4	20.0	149.6
five	2016.0	B	140.0	0.5	NaN	139.5

Now, how about iloc command? The iloc command is primarily internet position based, it means it starts from 0 to lenght-1 of the axis.

Our DataFrame is following below. Here are some examples.

df1

Info	year	ID	credits	penalty	points	total
Order
one	2014.0	A	100.0	0.1	30.0	99.9
two	2015.0	A	200.0	0.2	10.0	199.8
three	2016.0	A	250.0	0.3	NaN	249.7
four	2015.0	B	150.0	0.4	20.0	149.6
five	2016.0	B	140.0	0.5	NaN	139.5
six	NaN	NaN	NaN	NaN	NaN	NaN

df1.iloc[3:5, 0:2]

Info	year	ID
Order
four	2015.0	B
five	2016.0	B

df1.iloc[3] # index four's row will be loaded in a series format.

Info
year        2015
ID             B
credits      150
penalty      0.4
points        20
total      149.6
Name: four, dtype: object

Of course we can get the same result using two commands, iloc and loc , with different codes inside the square brackets.

df1.iloc[2:4, 0:2]

Info	year	ID
Order
three	2016.0	A
four	2015.0	B

df1.loc["three":"four", "year":"ID"]

Info	year	ID
Order
three	2016.0	A
four	2015.0	B

Download code: GitHub
Reference : Pandas Choices for Indexing

Stat n Math

Pages

DataFrame - Indexing

Python - Indexing in DataFrame

(1) Basic information of DataFrame in Python

(2) Indexing Methods, loc and iloc

No comments:

Post a Comment

Search This Blog

Popular Posts