Stat n Math : Missing Data & Date Indexing

Missing Data & DateTime Index in Pandas Dataframes

Import the numpy and pandas modules first.

import numpy as np
import pandas as pd

Create the DataFrame with random numbers

df = pd.DataFrame(np.random.randn(5,4))

df

	0	1	2	3
0	-0.765359	-0.560998	-0.209899	0.941353
1	-1.196285	-1.607525	1.191866	-0.233517
2	0.239024	-1.697216	1.860250	-0.868241
3	0.700579	-1.041329	-0.829990	1.602110
4	-0.203317	-1.359759	-1.142708	-1.512480

Let’s define the column and index. For indexing, we can define it using the date_rangecommand. Here, we use annual indexing to the beginning of an arbitrary month, August. The first row will be Aug 1, 2015. As we have 5 rows so period will be 5.

For more information, please check the reference. This information is useful in the time series.

df.columns = ["A", "B", "C", "D"]
df.index = pd.date_range(start=pd.datetime(2015,1,1), 
                  periods=5, freq='AS-Aug')

df

	A	B	C	D
2015-08-01	-0.765359	-0.560998	-0.209899	0.941353
2016-08-01	-1.196285	-1.607525	1.191866	-0.233517
2017-08-01	0.239024	-1.697216	1.860250	-0.868241
2018-08-01	0.700579	-1.041329	-0.829990	1.602110
2019-08-01	-0.203317	-1.359759	-1.142708	-1.512480

Create a new column, E, with missing values using np.nan
Handling missing values? we might have two choices, either remove it or update its value.

df["E"] = [np.nan, 0.1, -0.1, 1.1, np.nan]

df

	A	B	C	D	E
2015-08-01	-0.765359	-0.560998	-0.209899	0.941353	NaN
2016-08-01	-1.196285	-1.607525	1.191866	-0.233517	0.1
2017-08-01	0.239024	-1.697216	1.860250	-0.868241	-0.1
2018-08-01	0.700579	-1.041329	-0.829990	1.602110	1.1
2019-08-01	-0.203317	-1.359759	-1.142708	-1.512480	NaN

Drop any rows where one of cells contain NaN (missing value).

For more information about handling with missing values, please check the reference.

df.dropna(how="any")

	A	B	C	D	E
2016-08-01	-1.196285	-1.607525	1.191866	-0.233517	0.1
2017-08-01	0.239024	-1.697216	1.860250	-0.868241	-0.1
2018-08-01	0.700579	-1.041329	-0.829990	1.602110	1.1

Drop columns where all cells in that column is NaN. As not all missing values in the column E, so we can keep the column E. If we want to remove the rows with all missing values, then use df.dropna(how='all')

df.dropna(axis=1, how='all')

	A	B	C	D	E
2015-08-01	-0.765359	-0.560998	-0.209899	0.941353	NaN
2016-08-01	-1.196285	-1.607525	1.191866	-0.233517	0.1
2017-08-01	0.239024	-1.697216	1.860250	-0.868241	-0.1
2018-08-01	0.700579	-1.041329	-0.829990	1.602110	1.1
2019-08-01	-0.203317	-1.359759	-1.142708	-1.512480	NaN

When the indexing is defined by using Date_Range, we can remove the rows this way below. Firstly, need to convert df['date'] from string to datetime, then use drop command.

pd.to_datetime("20160801")

Timestamp('2016-08-01 00:00:00')

df.drop(pd.to_datetime("20160801"))

	A	B	C	D	E
2015-08-01	-0.765359	-0.560998	-0.209899	0.941353	NaN
2017-08-01	0.239024	-1.697216	1.860250	-0.868241	-0.1
2018-08-01	0.700579	-1.041329	-0.829990	1.602110	1.1
2019-08-01	-0.203317	-1.359759	-1.142708	-1.512480	NaN

Of course we can remove several rows at the same time.

df.drop([pd.to_datetime("20150801"), pd.to_datetime("20190801")])

	A	B	C	D	E
2016-08-01	-1.196285	-1.607525	1.191866	-0.233517	0.1
2017-08-01	0.239024	-1.697216	1.860250	-0.868241	-0.1
2018-08-01	0.700579	-1.041329	-0.829990	1.602110	1.1

We can remove the column with adding “axis=1”

df.drop("E", axis=1)

	A	B	C	D
2015-08-01	-0.765359	-0.560998	-0.209899	0.941353
2016-08-01	-1.196285	-1.607525	1.191866	-0.233517
2017-08-01	0.239024	-1.697216	1.860250	-0.868241
2018-08-01	0.700579	-1.041329	-0.829990	1.602110
2019-08-01	-0.203317	-1.359759	-1.142708	-1.512480

We can update its values using fillna command.

df.fillna(value=3) # df.fillna(0) with zeros

	A	B	C	D	E
2015-08-01	-0.765359	-0.560998	-0.209899	0.941353	3.0
2016-08-01	-1.196285	-1.607525	1.191866	-0.233517	0.1
2017-08-01	0.239024	-1.697216	1.860250	-0.868241	-0.1
2018-08-01	0.700579	-1.041329	-0.829990	1.602110	1.1
2019-08-01	-0.203317	-1.359759	-1.142708	-1.512480	3.0

Reminder, our data is following below.

df

	A	B	C	D	E
2015-08-01	-0.765359	-0.560998	-0.209899	0.941353	NaN
2016-08-01	-1.196285	-1.607525	1.191866	-0.233517	0.1
2017-08-01	0.239024	-1.697216	1.860250	-0.868241	-0.1
2018-08-01	0.700579	-1.041329	-0.829990	1.602110	1.1
2019-08-01	-0.203317	-1.359759	-1.142708	-1.512480	NaN

We can check boolean whether there is a missing values in our data set.

df.isnull() # boolean masks

	A	B	C	D	E
2015-08-01	False	False	False	False	True
2016-08-01	False	False	False	False	False
2017-08-01	False	False	False	False	False
2018-08-01	False	False	False	False	False
2019-08-01	False	False	False	False	True

Previously, indexing method, loc, was briefly reviewed. Check the link
When there is a column with missing values, we want to read the rows with missing values at the specific column. So here, column E has missing values. So we want to read the rows with missing values at the E column.

df.loc[df.isnull()["E"],:]

	A	B	C	D	E
2015-08-01	-0.765359	-0.560998	-0.209899	0.941353	NaN
2019-08-01	-0.203317	-1.359759	-1.142708	-1.512480	NaN

Refernces
Time Series / Date functionality
Missing values in Pandas DataFrame

Stat n Math

Pages

Missing Data & Date Indexing

Missing Data & DateTime Index in Pandas Dataframes

No comments:

Post a Comment

Search This Blog

Popular Posts