Calculating time to holidays on pandas dataframe
"time to" or "time from" are basically the same but in reverse order of each other
Let's say you want to create another column that counts the number of days from/to the nearest holiday in the calendar.
We need two columns:
- dates
- holiday : boolean
Rolling operations on dataframes are good if you only operate over one column, but when you need data from more columns it gets very complicated.
The method described here is in no way optimal, but does the job with minimal complications.
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
from datetime import date,datetime
from matplotlib.pyplot import plot
df = pd.read_csv("../input/train.csv")
# Convert the datetime to proper type
df.datetime=pd.to_datetime(df.datetime)
# Function to calculate the days from holiday
# You should set last_holiday_date before calling this function
def set_dfh(holiday,timestamp):
global last_holiday_date
if last_holiday_date=="":
last_holiday_date=datetime.date(datetime.strptime('2010-11-12', '%Y-%m-%d'))
if (holiday==1):
last_holiday_date= timestamp
return(timestamp-last_holiday_date)
# Calculate days FROM holidays (moving forward)
last_holiday_date=datetime.date(datetime.strptime('2010-11-12', '%Y-%m-%d'))
dfh=[set_dfh(x,y) for x,y in zip(df.holiday,df.datetime.dt.date)]
# Calculate days TO holidays (moving backward)
last_holiday_date=datetime.date(datetime.strptime('2013-01-01', '%Y-%m-%d'))
dth=reversed([abs(set_dfh(x,y)) for x,y in zip(reversed(df.holiday),reversed(df.datetime.dt.date))])
dth= [t for t in dth] # so dth is not an iterator but it's content
# Add to dataset in two ways
# Raw time diferences
df["dfh"]=dfh
df["dth"]=dth
# Time differences converted to int
oneday=timedelta(days=1)
df["days from holiday"]=[s / oneday for s in dfh]
df["days to holiday"] =[s / oneday for s in dth]
# Visualize both columns together
df[["days to holiday","days from holiday"]].loc[300:480].plot(legend=True,figsize=(120, 100) )
Let's say you want to create another column that counts the number of days from/to the nearest holiday in the calendar.
We need two columns:
- dates
- holiday : boolean
Rolling operations on dataframes are good if you only operate over one column, but when you need data from more columns it gets very complicated.
The method described here is in no way optimal, but does the job with minimal complications.
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
from datetime import date,datetime
from matplotlib.pyplot import plot
df = pd.read_csv("../input/train.csv")
# Convert the datetime to proper type
df.datetime=pd.to_datetime(df.datetime)
# Function to calculate the days from holiday
# You should set last_holiday_date before calling this function
def set_dfh(holiday,timestamp):
global last_holiday_date
if last_holiday_date=="":
last_holiday_date=datetime.date(datetime.strptime('2010-11-12', '%Y-%m-%d'))
if (holiday==1):
last_holiday_date= timestamp
return(timestamp-last_holiday_date)
# Calculate days FROM holidays (moving forward)
last_holiday_date=datetime.date(datetime.strptime('2010-11-12', '%Y-%m-%d'))
dfh=[set_dfh(x,y) for x,y in zip(df.holiday,df.datetime.dt.date)]
# Calculate days TO holidays (moving backward)
last_holiday_date=datetime.date(datetime.strptime('2013-01-01', '%Y-%m-%d'))
dth=reversed([abs(set_dfh(x,y)) for x,y in zip(reversed(df.holiday),reversed(df.datetime.dt.date))])
dth= [t for t in dth] # so dth is not an iterator but it's content
# Add to dataset in two ways
# Raw time diferences
df["dfh"]=dfh
df["dth"]=dth
# Time differences converted to int
oneday=timedelta(days=1)
df["days from holiday"]=[s / oneday for s in dfh]
df["days to holiday"] =[s / oneday for s in dth]
# Visualize both columns together
df[["days to holiday","days from holiday"]].loc[300:480].plot(legend=True,figsize=(120, 100) )
Summary:
1. Take the needed columns out
2. Create a list while operating with their values
3. Insert the resulting list back on the dataframe
0 Comments:
Post a Comment
<< Home