Missing Data & Date Indexing

Missing Data & DateTime Index in Pandas Dataframes


Import the numpy and pandas modules first.

import numpy as np
import pandas as pd 
Create the DataFrame with random numbers
df = pd.DataFrame(np.random.randn(5,4))
df
0 1 2 3
0 -0.765359 -0.560998 -0.209899 0.941353
1 -1.196285 -1.607525 1.191866 -0.233517
2 0.239024 -1.697216 1.860250 -0.868241
3 0.700579 -1.041329 -0.829990 1.602110
4 -0.203317 -1.359759 -1.142708 -1.512480

Let’s define the column and index. For indexing, we can define it using the date_rangecommand. Here, we use annual indexing to the beginning of an arbitrary month, August. The first row will be Aug 1, 2015. As we have 5 rows so period will be 5.

For more information, please check the reference. This information is useful in the time series.

df.columns = ["A", "B", "C", "D"]
df.index = pd.date_range(start=pd.datetime(2015,1,1), 
                  periods=5, freq='AS-Aug')
df
A B C D
2015-08-01 -0.765359 -0.560998 -0.209899 0.941353
2016-08-01 -1.196285 -1.607525 1.191866 -0.233517
2017-08-01 0.239024 -1.697216 1.860250 -0.868241
2018-08-01 0.700579 -1.041329 -0.829990 1.602110
2019-08-01 -0.203317 -1.359759 -1.142708 -1.512480

Create a new column, E, with missing values using np.nan
Handling missing values? we might have two choices, either remove it or update its value.

df["E"] = [np.nan, 0.1, -0.1, 1.1, np.nan]
df
A B C D E
2015-08-01 -0.765359 -0.560998 -0.209899 0.941353 NaN
2016-08-01 -1.196285 -1.607525 1.191866 -0.233517 0.1
2017-08-01 0.239024 -1.697216 1.860250 -0.868241 -0.1
2018-08-01 0.700579 -1.041329 -0.829990 1.602110 1.1
2019-08-01 -0.203317 -1.359759 -1.142708 -1.512480 NaN

Drop any rows where one of cells contain NaN (missing value).

For more information about handling with missing values, please check the reference.

df.dropna(how="any")
A B C D E
2016-08-01 -1.196285 -1.607525 1.191866 -0.233517 0.1
2017-08-01 0.239024 -1.697216 1.860250 -0.868241 -0.1
2018-08-01 0.700579 -1.041329 -0.829990 1.602110 1.1

Drop columns where all cells in that column is NaN. As not all missing values in the column E, so we can keep the column E. If we want to remove the rows with all missing values, then use df.dropna(how='all')

df.dropna(axis=1, how='all') 
A B C D E
2015-08-01 -0.765359 -0.560998 -0.209899 0.941353 NaN
2016-08-01 -1.196285 -1.607525 1.191866 -0.233517 0.1
2017-08-01 0.239024 -1.697216 1.860250 -0.868241 -0.1
2018-08-01 0.700579 -1.041329 -0.829990 1.602110 1.1
2019-08-01 -0.203317 -1.359759 -1.142708 -1.512480 NaN

When the indexing is defined by using Date_Range, we can remove the rows this way below. Firstly, need to convert df['date'] from string to datetime, then use drop command.

pd.to_datetime("20160801")
Timestamp('2016-08-01 00:00:00')
df.drop(pd.to_datetime("20160801"))
A B C D E
2015-08-01 -0.765359 -0.560998 -0.209899 0.941353 NaN
2017-08-01 0.239024 -1.697216 1.860250 -0.868241 -0.1
2018-08-01 0.700579 -1.041329 -0.829990 1.602110 1.1
2019-08-01 -0.203317 -1.359759 -1.142708 -1.512480 NaN

Of course we can remove several rows at the same time.

df.drop([pd.to_datetime("20150801"), pd.to_datetime("20190801")])
A B C D E
2016-08-01 -1.196285 -1.607525 1.191866 -0.233517 0.1
2017-08-01 0.239024 -1.697216 1.860250 -0.868241 -0.1
2018-08-01 0.700579 -1.041329 -0.829990 1.602110 1.1

We can remove the column with adding “axis=1”

df.drop("E", axis=1) 
A B C D
2015-08-01 -0.765359 -0.560998 -0.209899 0.941353
2016-08-01 -1.196285 -1.607525 1.191866 -0.233517
2017-08-01 0.239024 -1.697216 1.860250 -0.868241
2018-08-01 0.700579 -1.041329 -0.829990 1.602110
2019-08-01 -0.203317 -1.359759 -1.142708 -1.512480

We can update its values using fillna command.

df.fillna(value=3) # df.fillna(0) with zeros 
A B C D E
2015-08-01 -0.765359 -0.560998 -0.209899 0.941353 3.0
2016-08-01 -1.196285 -1.607525 1.191866 -0.233517 0.1
2017-08-01 0.239024 -1.697216 1.860250 -0.868241 -0.1
2018-08-01 0.700579 -1.041329 -0.829990 1.602110 1.1
2019-08-01 -0.203317 -1.359759 -1.142708 -1.512480 3.0

Reminder, our data is following below.

df
A B C D E
2015-08-01 -0.765359 -0.560998 -0.209899 0.941353 NaN
2016-08-01 -1.196285 -1.607525 1.191866 -0.233517 0.1
2017-08-01 0.239024 -1.697216 1.860250 -0.868241 -0.1
2018-08-01 0.700579 -1.041329 -0.829990 1.602110 1.1
2019-08-01 -0.203317 -1.359759 -1.142708 -1.512480 NaN

We can check boolean whether there is a missing values in our data set.

df.isnull() # boolean masks
A B C D E
2015-08-01 False False False False True
2016-08-01 False False False False False
2017-08-01 False False False False False
2018-08-01 False False False False False
2019-08-01 False False False False True

Previously, indexing method, loc, was briefly reviewed. Check the link
When there is a column with missing values, we want to read the rows with missing values at the specific column. So here, column E has missing values. So we want to read the rows with missing values at the E column.

df.loc[df.isnull()["E"],:] 
A B C D E
2015-08-01 -0.765359 -0.560998 -0.209899 0.941353 NaN
2019-08-01 -0.203317 -1.359759 -1.142708 -1.512480 NaN

Refernces
Time Series / Date functionality
Missing values in Pandas DataFrame

No comments:

Post a Comment