Time Series Analysis

9 minute read

This post provides a quick, concise but comprehensive overview of how to deal with time series datasets and their typical pre-processing steps. It also gives a brief introduction to stationary and non-stationary data.

Time Series:
A time series is a sequence of information that attaches a time period to each value. The value can be pretty much anything measurable that depends on time in some way, such as stock prices, humidity or the number of people. As long as the recorded values are unambiguous, any measurable quantity can be captured as a time series. There are no limitations on the total time span: it could be a minute, a day, a month or even a century. All that is usually needed is a starting and an ending point.
Time series data do not follow any of the standard distributions, because unlike regular linear regression data they never satisfy the Gauss-Markov assumptions.
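For instance, a minimal sketch in pandas (with made-up pressure values): a time series is just a Series whose index is a sequence of time stamps.

import pandas as pd

index = pd.date_range(start='2009-01-01', periods=5, freq='D')  # five daily time stamps
pressure = pd.Series([996.5, 996.6, 996.5, 995.9, 996.1], index=index)  # illustrative values only
print(pressure)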

DATASET
Here I will be using a weather time series dataset recorded by the Max Planck Institute for Biogeochemistry.
This dataset contains 14 different features such as air temperature, atmospheric pressure, and humidity, collected every 10 minutes between 2009 and 2016.

import tensorflow as tf

import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns

sns.set()
zip_path = tf.keras.utils.get_file(
    origin='https://storage.googleapis.com/tensorflow/tf-keras-datasets/jena_climate_2009_2016.csv.zip',
    fname='jena_climate_2009_2016.csv.zip',
    extract=True)
csv_path, _ = os.path.splitext(zip_path)

# Storing it into a data frame
df = pd.read_csv(csv_path)

Creating a copy of the data

data = df.copy()

Data Preprocessing

Visualizing and processing the column names

data.columns
Index(['Date Time', 'p (mbar)', 'T (degC)', 'Tpot (K)', 'Tdew (degC)',
       'rh (%)', 'VPmax (mbar)', 'VPact (mbar)', 'VPdef (mbar)', 'sh (g/kg)',
       'H2OC (mmol/mol)', 'rho (g/m**3)', 'wv (m/s)', 'max. wv (m/s)',
       'wd (deg)'],
      dtype='object')

As can be seen, the column names contain blank spaces, which can cause errors when selecting particular columns by name. Eliminating those spaces solves the problem.

data.columns = data.columns.str.replace(' ', '')
data.columns
Index(['DateTime', 'p(mbar)', 'T(degC)', 'Tpot(K)', 'Tdew(degC)', 'rh(%)',
       'VPmax(mbar)', 'VPact(mbar)', 'VPdef(mbar)', 'sh(g/kg)',
       'H2OC(mmol/mol)', 'rho(g/m**3)', 'wv(m/s)', 'max.wv(m/s)', 'wd(deg)'],
      dtype='object')

Checking for duplicate instances.

len(data) - len(data['DateTime'].unique())
327

So there are 327 duplicate values. These can be removed with the drop_duplicates() method.

data.drop_duplicates(subset='DateTime',
                     keep='first',  # Keep the first occurrence; treat later repeats as duplicates.
                     inplace=True)

Rechecking for duplicate instances now returns zero, so the duplicates have been successfully removed.

len(data) - len(data['DateTime'].unique())
0

Indexing with date-time stamps
Visualizing the data at three time stamps, transposed so that the columns represent instances and the rows represent attributes; the first row contains the time stamps.
The dataset contains 14 attributes. Moreover, the date-time stamp is currently just one of the columns; using it as the index of the dataframe is more convenient.

data.head(3).T
0 1 2
DateTime 01.01.2009 00:10:00 01.01.2009 00:20:00 01.01.2009 00:30:00
p(mbar) 996.52 996.57 996.53
T(degC) -8.02 -8.41 -8.51
Tpot(K) 265.4 265.01 264.91
Tdew(degC) -8.9 -9.28 -9.31
rh(%) 93.3 93.4 93.9
VPmax(mbar) 3.33 3.23 3.21
VPact(mbar) 3.11 3.02 3.01
VPdef(mbar) 0.22 0.21 0.2
sh(g/kg) 1.94 1.89 1.88
H2OC(mmol/mol) 3.12 3.03 3.02
rho(g/m**3) 1307.75 1309.8 1310.24
wv(m/s) 1.03 0.72 0.19
max.wv(m/s) 1.75 1.5 0.63
wd(deg) 152.3 136.1 171.6

Before using the date-time column as the index, it's better to convert it to pandas' datetime type.

data['DateTime'].describe()
count                  420224
unique                 420224
top       13.12.2012 05:00:00
freq                        1
Name: DateTime, dtype: object
data['DateTime'] = pd.to_datetime(data['DateTime'], format='%d.%m.%Y %H:%M:%S')  # explicit day-first format avoids ambiguous parsing
data['DateTime'].describe()
count                  420224
unique                 420224
top       2012-10-26 12:50:00
freq                        1
first     2009-01-01 00:10:00
last      2017-01-01 00:00:00
Name: DateTime, dtype: object

Setting date-time as the dataframe index.

data.set_index('DateTime', inplace=True)

data.head(5).T
DateTime 2009-01-01 00:10:00 2009-01-01 00:20:00 2009-01-01 00:30:00 2009-01-01 00:40:00 2009-01-01 00:50:00
p(mbar) 996.52 996.57 996.53 996.51 996.51
T(degC) -8.02 -8.41 -8.51 -8.31 -8.27
Tpot(K) 265.40 265.01 264.91 265.12 265.15
Tdew(degC) -8.90 -9.28 -9.31 -9.07 -9.04
rh(%) 93.30 93.40 93.90 94.20 94.10
VPmax(mbar) 3.33 3.23 3.21 3.26 3.27
VPact(mbar) 3.11 3.02 3.01 3.07 3.08
VPdef(mbar) 0.22 0.21 0.20 0.19 0.19
sh(g/kg) 1.94 1.89 1.88 1.92 1.92
H2OC(mmol/mol) 3.12 3.03 3.02 3.08 3.09
rho(g/m**3) 1307.75 1309.80 1310.24 1309.19 1309.00
wv(m/s) 1.03 0.72 0.19 0.34 0.32
max.wv(m/s) 1.75 1.50 0.63 0.50 0.63
wd(deg) 152.30 136.10 171.60 198.00 214.30

Let's check the summary statistics (such as count, mean, etc.) of each attribute.
It can be seen that pressure, temperature, density and wind direction (wd) have relatively high means compared to the other attributes.

data.describe().T
count mean std min 25% 50% 75% max
p(mbar) 420224.0 989.214157 8.360888 913.60 984.20 989.58 994.73 1015.35
T(degC) 420224.0 9.442421 8.421135 -23.01 3.36 9.40 15.46 37.28
Tpot(K) 420224.0 283.484880 8.502206 250.60 277.43 283.46 289.52 311.34
Tdew(degC) 420224.0 4.953472 6.731171 -25.01 0.23 5.21 10.07 23.11
rh(%) 420224.0 76.028738 16.460467 12.95 65.24 79.30 89.40 100.00
VPmax(mbar) 420224.0 13.568642 7.734770 0.95 7.77 11.81 17.59 63.77
VPact(mbar) 420224.0 9.532333 4.183996 0.79 6.21 8.86 12.35 28.32
VPdef(mbar) 420224.0 4.036225 4.891287 0.00 0.87 2.18 5.29 46.01
sh(g/kg) 420224.0 6.021503 2.656043 0.50 3.92 5.59 7.80 18.13
H2OC(mmol/mol) 420224.0 9.638778 4.235244 0.80 6.28 8.96 12.48 28.82
rho(g/m**3) 420224.0 1216.097805 39.967936 1059.45 1187.54 1213.83 1242.79 1393.54
wv(m/s) 420224.0 1.700930 65.472111 -9999.00 0.99 1.76 2.86 28.49
max.wv(m/s) 420224.0 3.054884 69.043660 -9999.00 1.76 2.96 4.73 23.50
wd(deg) 420224.0 174.748064 86.685323 0.00 124.90 198.10 234.10 360.00
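One detail worth flagging in the summary above: wv(m/s) and max.wv(m/s) have a minimum of -9999.00, which looks like a missing-value sentinel rather than a physically possible wind speed. A sketch of how such sentinels could be masked, assuming -9999.0 does mark missing measurements (not applied in the rest of this post):

for col in ['wv(m/s)', 'max.wv(m/s)']:
    data[col] = data[col].replace(-9999.0, np.nan)  # assumption: -9999.0 marks missing wind readings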

Checking for null values and fixing the data frequency
While dealing with time series data, the frequency of the time stamps is important. The frequency can be fixed to a constant value with pandas' asfreq() method.
Our data has been recorded at 10-minute intervals, every day from 01-01-2009 to 01-01-2017.
Before setting the frequency, let's check if there are any null values.

data.isna().sum()
p(mbar)           0
T(degC)           0
Tpot(K)           0
Tdew(degC)        0
rh(%)             0
VPmax(mbar)       0
VPact(mbar)       0
VPdef(mbar)       0
sh(g/kg)          0
H2OC(mmol/mol)    0
rho(g/m**3)       0
wv(m/s)           0
max.wv(m/s)       0
wd(deg)           0
dtype: int64

So there are no null values. Let's set the time interval to 10 minutes. If this inserts any extra instances, they can be filled with any of the following values, as sketched after this list (the same approaches can be used to fill null values):

  1. mean
  2. median
  3. the immediate before or after value
  4. Any user defined value
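For illustration, each of these strategies maps onto a pandas one-liner; a sketch on the temperature column (only forward fill is actually used below):

col = data['T(degC)']
filled_mean   = col.fillna(col.mean())    # 1. mean
filled_median = col.fillna(col.median())  # 2. median
filled_ffill  = col.ffill()               # 3. immediate previous value
filled_bfill  = col.bfill()               # 3. immediate following value
filled_const  = col.fillna(0.0)           # 4. any user-defined value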

Here I will be using the immediate previous value (forward fill).
But before that, it's better to ensure that the indices are all in ascending order. This can be done by sorting the dataframe on its index, which here consists of date-time values.

data = data.sort_index(ascending=True)
data = data.asfreq(freq='10min', method='ffill')
data.head().T
DateTime 2009-01-01 00:10:00 2009-01-01 00:20:00 2009-01-01 00:30:00 2009-01-01 00:40:00 2009-01-01 00:50:00
p(mbar) 996.52 996.57 996.53 996.51 996.51
T(degC) -8.02 -8.41 -8.51 -8.31 -8.27
Tpot(K) 265.40 265.01 264.91 265.12 265.15
Tdew(degC) -8.90 -9.28 -9.31 -9.07 -9.04
rh(%) 93.30 93.40 93.90 94.20 94.10
VPmax(mbar) 3.33 3.23 3.21 3.26 3.27
VPact(mbar) 3.11 3.02 3.01 3.07 3.08
VPdef(mbar) 0.22 0.21 0.20 0.19 0.19
sh(g/kg) 1.94 1.89 1.88 1.92 1.92
H2OC(mmol/mol) 3.12 3.03 3.02 3.08 3.09
rho(g/m**3) 1307.75 1309.80 1310.24 1309.19 1309.00
wv(m/s) 1.03 0.72 0.19 0.34 0.32
max.wv(m/s) 1.75 1.50 0.63 0.50 0.63
wd(deg) 152.30 136.10 171.60 198.00 214.30

PLOTTING THE DATA
Let's plot the pressure attribute.

data['p(mbar)'].plot(figsize=(20,5))
plt.title('Time_vs_Pressure', fontsize=20, color='r')
plt.ylabel('Pressure',fontsize=15)
plt.xlabel('Date_Time',fontsize=15)
plt.show()

[Plot: Time_vs_Pressure]

Q-Q Plot
A Q-Q (quantile-quantile) plot is a probability plot: a graphical method for comparing two probability distributions by plotting their quantiles against each other, typically to determine whether the data follow a certain distribution, most often the normal. It arranges all the values of a variable in ascending order along the y-axis, while the x-axis shows the theoretical quantiles, i.e. how many standard deviations away from the mean these values would be if they were normally distributed.
Below is a Q-Q plot of the pressure variable. The red line shows where the points would lie if the data were normally distributed; the deviation from it shows that the data is not normally distributed.

import scipy.stats as stats
import pylab
stats.probplot(data['p(mbar)'], plot=pylab)
pylab.show()

[Plot: Q-Q plot of the pressure variable]

Different kinds of time-series data

  1. Stationary
  2. Non-stationary

A stationary series is a stochastic process whose statistical properties, such as mean, variance (the average degree to which each point differs from the mean of all data points) and autocorrelation, do not change over time. Conversely, a non-stationary series is one whose statistical properties change over time, usually because of the presence of a trend.
Types of stationary time-series:
a) Strict stationarity: samples of identical size have identical distributions, i.e. $Sample_1$ ($x_t$, $x_{t+k}$) and $Sample_2$ ($x_{t+e}$, $x_{t+e+k}$) have the same distribution (Dist($\mu$, $\sigma^2$)). This is very restrictive and rarely observed in nature.
b) Weak stationarity: constant mean ($\mu$ = constant), constant variance ($\sigma^2$ = constant) and consistent covariance between periods at an identical distance from one another, i.e. Cov($x_n$, $x_{n+k}$) = Cov($x_m$, $x_{m+k}$), e.g. white noise.

Non-stationary data should be converted to stationary data before applying any forecasting model, by removing the underlying trend, which is itself a function of time.
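A common formal check for stationarity is the Augmented Dickey-Fuller (ADF) test, and first-order differencing is the usual way to remove a trend. A minimal sketch, assuming statsmodels is installed (it is not used elsewhere in this post):

from statsmodels.tsa.stattools import adfuller

# Null hypothesis of the ADF test: the series is non-stationary.
adf_stat, p_value = adfuller(data['p(mbar)'][:5000])[:2]  # a slice, since the full series is large
print('ADF statistic:', adf_stat, 'p-value:', p_value)

# First-order differencing removes the underlying trend:
pressure_diff = data['p(mbar)'].diff().dropna()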

WHITE NOISE
White noise is a sequence with no specific pattern: in the time series context, the mean and variance do not change over time and there is no clear relationship between present and past values.
Conditions for a time series to be white noise:

  1. mean: $\mu = 0$ (constant)
  2. variance: $\sigma^2$ = constant
  3. autocorrelation: $\rho_k = \mathrm{Corr}(x_t, x_{t+k}) = 0$ for $k \neq 0$ (thus constant, zero covariance between periods, since Cov($x_t$, $x_{t+k}$) = $\rho_k\,\sigma^2$)

Thus white noise is a sequence of random data with no periodicity; it happens sporadically, and there is no way to project it into the future.
NOTE: When the values of the sequence are drawn from a Gaussian distribution, it is called Gaussian white noise.
Let's check whether our pressure data is white noise. To do so, we can generate a white noise series (random values from a normal distribution) with the mean and variance of the pressure attribute.

white_noise = np.random.normal(loc = data['p(mbar)'].mean(), #mean of the pressure attribute 
                               scale = data['p(mbar)'].std(), #standard deviation of pressure attribute
                               size = len(data))

Let's compare the original pressure data with its white noise counterpart. For that, it's convenient to create a separate dataframe containing the two series.

press_noise = pd.DataFrame(data['p(mbar)'])
press_noise['White_Noise'] = white_noise
press_noise.head()
p(mbar) White_Noise
DateTime
2009-01-01 00:10:00 996.52 990.787520
2009-01-01 00:20:00 996.57 991.901143
2009-01-01 00:30:00 996.53 994.329886
2009-01-01 00:40:00 996.51 992.027583
2009-01-01 00:50:00 996.51 987.105276
press_noise.describe()
p(mbar) White_Noise
count 420768.000000 420768.000000
mean 989.218778 989.214672
std 8.357717 8.365454
min 913.600000 950.854988
25% 984.210000 983.574355
50% 989.590000 989.212760
75% 994.720000 994.858965
max 1015.350000 1027.655144

We can see that the newly generated white noise has a mean and standard deviation similar to those of the pressure series. Plotting the two series will further clarify the comparison.

fig = plt.figure(figsize=(20,10))
plt.subplot(2,1,1)
plt.subplots_adjust(hspace = 0.3)

press_noise['White_Noise'].plot()
plt.title('Time_vs_White_Noise')
plt.xlabel('time')
plt.ylabel('Value')
plt.ylim(910, 1030)

plt.subplot(2,1,2)
press_noise['p(mbar)'].plot()
plt.title('Time_vs_Pressure')
plt.xlabel('time')
plt.ylabel('Value')

plt.savefig('noise_vs_data.png')  # save before show(), which clears the current figure
plt.show()

[Plot: Time_vs_White_Noise and Time_vs_Pressure, full series]

As each series contains 420,768 data points, the graph is cluttered; plotting fewer data points makes the comparison clearer.

fig = plt.figure(figsize=(20,10))
plt.subplot(2,1,1)
plt.subplots_adjust(hspace = 0.4)

press_noise['White_Noise'][:3000].plot()
plt.title('Time_vs_White_Noise')
plt.xlabel('time')
plt.ylabel('Value')
plt.ylim(955,1020)

plt.subplot(2,1,2)
press_noise['p(mbar)'][:3000].plot()
plt.title('Time_vs_Pressure')
plt.xlabel('time')
plt.ylabel('Value')
plt.ylim(955,1020)
plt.savefig('noise_vs_data_trunc.png')  # save before show(), which clears the current figure
plt.show()

[Plot: Time_vs_White_Noise and Time_vs_Pressure, first 3000 points]

Now it is easy to see the difference between the patterns of the two series: the pressure series shows much smaller jumps between consecutive periods, since its values are not random. Thus, to make an accurate forecast, this pattern has to be recognised.
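The difference can also be quantified: the lag-1 autocorrelation of the pressure series should be close to 1 (neighbouring 10-minute readings are similar), while that of the white noise should be close to 0. A quick sketch:

# Lag-1 autocorrelation: near 1 for the smooth pressure series, near 0 for white noise.
print('pressure   :', press_noise['p(mbar)'].autocorr(lag=1))
print('white noise:', press_noise['White_Noise'].autocorr(lag=1))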
