Missing Data & DateTime Index in Pandas Dataframes
Import the numpy and pandas modules first.
import numpy as np
import pandas as pd
Create the DataFrame with random numbers df = pd.DataFrame(np.random.randn(5,4))
df
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | -0.765359 | -0.560998 | -0.209899 | 0.941353 |
1 | -1.196285 | -1.607525 | 1.191866 | -0.233517 |
2 | 0.239024 | -1.697216 | 1.860250 | -0.868241 |
3 | 0.700579 | -1.041329 | -0.829990 | 1.602110 |
4 | -0.203317 | -1.359759 | -1.142708 | -1.512480 |
Let’s define the column and index. For indexing, we can define it using the date_range
command. Here, we use annual indexing to the beginning of an arbitrary month, August. The first row will be Aug 1, 2015. As we have 5 rows so period will be 5.
For more information, please check the reference. This information is useful in the time series.
df.columns = ["A", "B", "C", "D"]
df.index = pd.date_range(start=pd.datetime(2015,1,1),
periods=5, freq='AS-Aug')
df
A | B | C | D | |
---|---|---|---|---|
2015-08-01 | -0.765359 | -0.560998 | -0.209899 | 0.941353 |
2016-08-01 | -1.196285 | -1.607525 | 1.191866 | -0.233517 |
2017-08-01 | 0.239024 | -1.697216 | 1.860250 | -0.868241 |
2018-08-01 | 0.700579 | -1.041329 | -0.829990 | 1.602110 |
2019-08-01 | -0.203317 | -1.359759 | -1.142708 | -1.512480 |
Create a new column, E, with missing values using np.nan
Handling missing values? we might have two choices, either remove it or update its value.
df["E"] = [np.nan, 0.1, -0.1, 1.1, np.nan]
df
A | B | C | D | E | |
---|---|---|---|---|---|
2015-08-01 | -0.765359 | -0.560998 | -0.209899 | 0.941353 | NaN |
2016-08-01 | -1.196285 | -1.607525 | 1.191866 | -0.233517 | 0.1 |
2017-08-01 | 0.239024 | -1.697216 | 1.860250 | -0.868241 | -0.1 |
2018-08-01 | 0.700579 | -1.041329 | -0.829990 | 1.602110 | 1.1 |
2019-08-01 | -0.203317 | -1.359759 | -1.142708 | -1.512480 | NaN |
Drop any rows where one of cells contain NaN (missing value).
For more information about handling with missing values, please check the reference.
df.dropna(how="any")
A | B | C | D | E | |
---|---|---|---|---|---|
2016-08-01 | -1.196285 | -1.607525 | 1.191866 | -0.233517 | 0.1 |
2017-08-01 | 0.239024 | -1.697216 | 1.860250 | -0.868241 | -0.1 |
2018-08-01 | 0.700579 | -1.041329 | -0.829990 | 1.602110 | 1.1 |
Drop columns where all cells in that column is NaN. As not all missing values in the column E, so we can keep the column E. If we want to remove the rows with all missing values, then use df.dropna(how='all')
df.dropna(axis=1, how='all')
A | B | C | D | E | |
---|---|---|---|---|---|
2015-08-01 | -0.765359 | -0.560998 | -0.209899 | 0.941353 | NaN |
2016-08-01 | -1.196285 | -1.607525 | 1.191866 | -0.233517 | 0.1 |
2017-08-01 | 0.239024 | -1.697216 | 1.860250 | -0.868241 | -0.1 |
2018-08-01 | 0.700579 | -1.041329 | -0.829990 | 1.602110 | 1.1 |
2019-08-01 | -0.203317 | -1.359759 | -1.142708 | -1.512480 | NaN |
When the indexing is defined by using Date_Range, we can remove the rows this way below. Firstly, need to convert df['date']
from string to datetime, then use drop
command.
pd.to_datetime("20160801")
Timestamp('2016-08-01 00:00:00')
df.drop(pd.to_datetime("20160801"))
A | B | C | D | E | |
---|---|---|---|---|---|
2015-08-01 | -0.765359 | -0.560998 | -0.209899 | 0.941353 | NaN |
2017-08-01 | 0.239024 | -1.697216 | 1.860250 | -0.868241 | -0.1 |
2018-08-01 | 0.700579 | -1.041329 | -0.829990 | 1.602110 | 1.1 |
2019-08-01 | -0.203317 | -1.359759 | -1.142708 | -1.512480 | NaN |
Of course we can remove several rows at the same time.
df.drop([pd.to_datetime("20150801"), pd.to_datetime("20190801")])
A | B | C | D | E | |
---|---|---|---|---|---|
2016-08-01 | -1.196285 | -1.607525 | 1.191866 | -0.233517 | 0.1 |
2017-08-01 | 0.239024 | -1.697216 | 1.860250 | -0.868241 | -0.1 |
2018-08-01 | 0.700579 | -1.041329 | -0.829990 | 1.602110 | 1.1 |
We can remove the column with adding “axis=1”
df.drop("E", axis=1)
A | B | C | D | |
---|---|---|---|---|
2015-08-01 | -0.765359 | -0.560998 | -0.209899 | 0.941353 |
2016-08-01 | -1.196285 | -1.607525 | 1.191866 | -0.233517 |
2017-08-01 | 0.239024 | -1.697216 | 1.860250 | -0.868241 |
2018-08-01 | 0.700579 | -1.041329 | -0.829990 | 1.602110 |
2019-08-01 | -0.203317 | -1.359759 | -1.142708 | -1.512480 |
We can update its values using fillna
command.
df.fillna(value=3) # df.fillna(0) with zeros
A | B | C | D | E | |
---|---|---|---|---|---|
2015-08-01 | -0.765359 | -0.560998 | -0.209899 | 0.941353 | 3.0 |
2016-08-01 | -1.196285 | -1.607525 | 1.191866 | -0.233517 | 0.1 |
2017-08-01 | 0.239024 | -1.697216 | 1.860250 | -0.868241 | -0.1 |
2018-08-01 | 0.700579 | -1.041329 | -0.829990 | 1.602110 | 1.1 |
2019-08-01 | -0.203317 | -1.359759 | -1.142708 | -1.512480 | 3.0 |
Reminder, our data is following below.
df
A | B | C | D | E | |
---|---|---|---|---|---|
2015-08-01 | -0.765359 | -0.560998 | -0.209899 | 0.941353 | NaN |
2016-08-01 | -1.196285 | -1.607525 | 1.191866 | -0.233517 | 0.1 |
2017-08-01 | 0.239024 | -1.697216 | 1.860250 | -0.868241 | -0.1 |
2018-08-01 | 0.700579 | -1.041329 | -0.829990 | 1.602110 | 1.1 |
2019-08-01 | -0.203317 | -1.359759 | -1.142708 | -1.512480 | NaN |
We can check boolean whether there is a missing values in our data set.
df.isnull() # boolean masks
A | B | C | D | E | |
---|---|---|---|---|---|
2015-08-01 | False | False | False | False | True |
2016-08-01 | False | False | False | False | False |
2017-08-01 | False | False | False | False | False |
2018-08-01 | False | False | False | False | False |
2019-08-01 | False | False | False | False | True |
Previously, indexing method, loc
, was briefly reviewed. Check the link
When there is a column with missing values, we want to read the rows with missing values at the specific column. So here, column E has missing values. So we want to read the rows with missing values at the E column.
df.loc[df.isnull()["E"],:]
A | B | C | D | E | |
---|---|---|---|---|---|
2015-08-01 | -0.765359 | -0.560998 | -0.209899 | 0.941353 | NaN |
2019-08-01 | -0.203317 | -1.359759 | -1.142708 | -1.512480 | NaN |
Refernces
Time Series / Date functionality
Missing values in Pandas DataFrame