
Applied Exploratory Data Analysis, Bike-Sharing. The Power of Visualization, Python.

1. Introduction

This study analyzes a Modified Bike-Sharing data set. Unlike the original data set, this “Modified” version includes nulls, zeros, and outliers, which opens the door to a detailed Exploratory Data Analysis (EDA). Many of the public studies on Bike-Sharing include a basic EDA and then go straight into modeling. The general goal of this analysis is to perform an extensive EDA that takes into account the Physics of the phenomena and provides insights for modeling Bike-Sharing rentals. After reading this post you will know about:
  • How exploiting the visual power of the human brain helps to learn about patterns or trends in congested plots.
  • The tricks to make those rows that have cells with null and non-null values visible in a plot.
  • The identification of suspicious patterns and proposed steps to address them.
  • The sequential steps that unveil the story behind the data during the EDA.
  • The importance of considering the Physics that explains the phenomena before performing imputations.
  • The preliminary assessment of the variables that influence the number of bicycle rentals.
  • The Python code to present data on plots with visual contrast, which helps to unveil patterns and understand the “puzzle” behind the data.
Note that Data Exploration is performed using Python (version 3.7.4) with Data Science libraries such as NumPy, Pandas, scikit-learn, Seaborn, category_encoders, SciPy, and datetime, among others.
Table of contents
1. Introduction
2. Attribute Information
3. Nulls, Zeros, and Outliers
4. Correlation Structure
5. Univariate Behavior of the Response Variable
6. Behavior of Response Variable With Time Variables
7. Conclusions
8. References

2. Attribute Information

The data set contains six variables related to time, four continuous variables, and the response variable, the count of bicycles ‘cnt’. The ASCII file and the Python code reside in my .




2.1. Feature Engineering and Encoding

To a) facilitate data filtering, b) extract hidden information from the data, c) analyze trends on plots, and d) model the Bike-Sharing rentals, it is handy to derive and encode some features as follows:



Notice the ‘season’ variable carries ordering (e.g., Summer is followed by Fall). Encoding algorithms such as LabelEncoder() from scikit-learn will not honor the order of seasons because they encode in alphabetical order. Instead, the category_encoders library allows the use of dictionary-mapping to establish a custom order.
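To make the difference concrete, here is a minimal sketch that uses only pandas. It is a stand-in for the dictionary-mapping idea, not the category_encoders code used in this analysis; the toy season labels and the integer codes 1 to 4 are assumptions chosen for illustration.

```python
import pandas as pd

# Toy column; the labels in the real data frame may differ.
seasons = pd.Series(["Winter", "Spring", "Summer", "Fall", "Spring"], name="season")

# Alphabetical encoding, which is what an encoder that sorts its classes produces.
alphabetical = {label: code for code, label in enumerate(sorted(seasons.unique()))}

# Custom dictionary mapping that follows the calendar order (hypothetical codes).
season_order = {"Spring": 1, "Summer": 2, "Fall": 3, "Winter": 4}
encoded = seasons.map(season_order)

print(alphabetical)      # Fall comes first alphabetically, before Spring and Summer
print(encoded.tolist())  # codes now follow Spring < Summer < Fall < Winter
```

With the alphabetical codes, Fall would be ranked before Summer; the dictionary keeps the order in which the seasons actually follow each other.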

3. Nulls, Zeros, and Outliers

Data tables have columns identifying features or variables and rows with cells holding data for each one of the features. If a row happens to have a cell with a null value, the entire row will cause some algorithms to cough even though the rest of the cells in that row are non-null values. Zeros and outliers, on the other hand, will influence the model negatively if they represent sources of error. Therefore, the first task is to identify, analyze, and then address nulls, zeros, and outliers. The use of tags facilitates this process; this section of the analysis describes this process.

3.1. Start by Tagging Nulls.

The execution of df.isna().sum() identifies features with nulls in the data set. ‘temp’ and ‘hum’ are the only two features with nulls. A column named ‘outlr_miss’ will store tags with the following values: (Note that NaN in Python means Not-a-Number, which is the reason for the suffix ‘_nan’)



Once the rows containing null values have been tagged, there should be exactly 23 rows showing ‘temp_nan’ or ‘hum_nan’ tags. Observe the following Pandas data-frame output.



3.2. Tagging Zeros




The figure on the side shows the output of df.isin([0]).sum(). Ordinal features such as hours ‘hr’ or Boolean features such as ‘holiday’ or ‘workingday’ are expected to have zeros. However, zeros in continuous features such as ‘atemp’ with 2 zero-values, ‘hum’ with 22 zero-values, and ‘windspeed’ with 2180 zero-values are suspicious. Notice the tag has the suffix ‘_zero’ indicating the row has zeros. The following is a snippet of the Python code to encode nulls and zeros.
# Tagging nulls
df['outlr_miss']='data'
df.loc[(df['outlr_miss']=='data') & df['temp'].isna(), 'outlr_miss'] = 'temp_nan' 
df.loc[(df['outlr_miss']=='data') & df['hum'].isna(), 'outlr_miss'] = 'hum_nan' 

# Tagging zeros
df.loc[(df['outlr_miss']=='data') & df['atemp'].isin([0]), 'outlr_miss'] = 'atemp_zero' 
df.loc[(df['outlr_miss']=='data') & df['hum'].isin([0]), 'outlr_miss'] = 'hum_zero' 
df.loc[(df['outlr_miss']=='data') & df['windspeed'].isin([0]), 'outlr_miss'] = 'windspeed_zero'
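As a quick sanity check of the tagging logic, the sketch below applies the same rules to a toy data frame. The values are made up and only the column names follow the article; the loop over `rules` is a compact rewrite of the statements above, not the author's code.

```python
import numpy as np
import pandas as pd

# Toy frame with the article's column names; every value here is invented.
df = pd.DataFrame({
    "temp":      [0.30, np.nan, 0.50, 0.60, 0.40],
    "atemp":     [0.32, 0.40, 0.00, 0.58, 0.41],
    "hum":       [0.80, 0.70, 0.60, np.nan, 0.00],
    "windspeed": [0.00, 0.10, 0.20, 0.30, 0.10],
})

print(df.isna().sum())     # nulls per column
print(df.isin([0]).sum())  # zeros per column

# Same rule as above: a row keeps the first tag that matches it.
df["outlr_miss"] = "data"
rules = [
    ("temp_nan", df["temp"].isna()),
    ("hum_nan", df["hum"].isna()),
    ("atemp_zero", df["atemp"].eq(0)),
    ("hum_zero", df["hum"].eq(0)),
    ("windspeed_zero", df["windspeed"].eq(0)),
]
for tag, condition in rules:
    untagged = df["outlr_miss"].eq("data")
    df.loc[untagged & condition, "outlr_miss"] = tag

tag_counts = df["outlr_miss"].value_counts()
print(tag_counts)
```

Comparing `tag_counts` against the totals from `isna().sum()` and `isin([0]).sum()` is a simple way to confirm that no suspicious row was left untagged. Because the first matching tag wins, a row with two problems is counted under only one tag.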
3.3. Scatter Matrix Plot or Pair Plots Visualize Nulls, Zeros, and Outliers
Scatter matrix plots are useful in the assessment of collinearity among features and the identification of nulls, zeros, and outliers. At first view, matrix plots look like a bunch of small monochromatic plots, sometimes so tiny that they make us wonder about their purpose. But when these plots show data with the correct color contrast, they trigger the human brain. Remember that the human brain is visual by nature; it processes information far better through images and colors. This is where tags are useful: they become a category that colors nulls, zeros, and outliers.
At this point in the analysis, the Bike-Sharing data table includes rows with cells that hold null values. This is an implicit problem because the majority of plotting and machine learning (ML) algorithms cough in the presence of nulls. By default, these algorithms address this issue by excluding rows that have null values. The consequence of that default setting is that rows that have both null and non-null values become completely invisible in the plot. The following section explains a Null-zero replacement trick that, together with tags, makes rows that have null and non-null values visible in a plot.
3.3.1. The code for plotting
The Python library seaborn facilitates the construction of a variety of plots; however, algorithms such as pairplot() cough when nulls are present. This analysis proposes two tricks to overcome this issue.
  • Null-zero replacement trick. Plots receive data in a table-like format, with columns representing features or variables and rows representing the actual data. When a table has rows with nulls, plotting algorithms generate errors. To overcome this issue, users and statistical applications automatically exclude rows that have null values. As a result, non-null values located in those same rows become ‘invisible’ on the plot.
    This analysis proposes a Null-zero replacement trick that uses zeros to replace nulls. This transformation takes place at run time, while data are fed to the Python plot algorithm; therefore, nulls remain intact in the original data set.
  • Custom color palette for tags. Python plot algorithms follow a predefined color palette, and many times the poor contrast between colors hides data patterns from the human eye. The definition of a custom color palette addresses this issue; observe the next Python code.
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

sns.set(context='talk', style='ticks', font_scale=0.98)

# One color per tag; untagged rows ('data') stay light gray so the tags stand out
colorPalette = {'data': 'lightgray', 'windspeed_zero': 'lightgreen',
                'hum_zero': 'black', 'hum_nan': 'dodgerblue',
                'temp_nan': 'orange', 'atemp_zero': 'violet', 'atemp_outlr': 'cyan'}

# ordFeatures, contFeatures, and target are lists of column names defined earlier in the notebook.
# fillna(0) applies the Null-zero replacement only to the data passed to the plot.
sns.pairplot(df[ordFeatures + contFeatures + target + ['outlr_miss']].fillna(0),
             hue='outlr_miss', palette=colorPalette, plot_kws={'s': 30})

plt.show()
plt.close('all')
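The effect of the Null-zero replacement trick can be checked without drawing anything. The following is a minimal sketch on a toy data frame (values invented for illustration) that compares the default behavior, dropping rows with nulls, against passing a zero-filled copy to the plot.

```python
import numpy as np
import pandas as pd

# Toy frame: the second row has a null 'temp' but a valid 'hum'.
df = pd.DataFrame({
    "temp": [0.30, np.nan, 0.50],
    "hum": [0.80, 0.70, 0.60],
    "outlr_miss": ["data", "temp_nan", "data"],
})

# Default behavior of many tools: any row with a null disappears entirely.
visible_default = df.dropna()

# Null-zero replacement: fillna returns a new frame that is used only for plotting.
plot_data = df.fillna(0)

print(len(visible_default), len(plot_data))  # rows available to the plot in each case
print(df["temp"].isna().sum())               # the original frame still holds its null
```

The valid humidity value of the tagged row survives in `plot_data`, while the original `df` is untouched because `fillna` is not called in place.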






Misael Uribe
Jan 30 · 17 min read



3.3.2. The analysis of nulls, zeros, and outliers
The focus of this section is the description of tips that help identify and analyze nulls, zeros and outliers in plots.
(1) Unveiling rows with null and non-null values. The Null-zero replacement trick allows the presentation of rows that have null and non-null values in the scatter matrix plot. Observe the orange dots that correspond to the ‘temp_nan’ tag; each orange dot represents cell values from a row in the data table. Keep in mind that these orange dots represent either null or non-null values depending on the plot. On one hand, orange dots represent null values in those plots where the temperature ‘temp’ is one of the variables; these plots are surrounded by an orange rectangle in the scatter matrix plot. Note that zeros represent null values in these plots; this is why the orange dots follow a linear pattern at ‘temp=0’. On the other hand, orange dots represent non-null values on those plots where temperature ‘temp’ is not one of the variables.
Both orange and blue dots follow the same interpretation. Notice that these dots have tags with the same suffix ‘_nan’. The next step of this workflow will address these null values before going into modeling.



NOTE: one improvement to the Null-zero replacement trick is to use a transparent color for those dots that represent null values. This requires additional coding to manipulate data and assign colors at run time.
(2) Outliers showing true zero-values. Observe the black dots corresponding to the ‘hum_zero’ tag; they represent humidity ‘hum’ with a value of zero. These zero values do exist in the data table. These black dots form a linear pattern that stands out as an outlier. The interpretation of dots with ‘atemp_zero’ and ‘windspeed_zero’ tags is similar to the black dots. These zero-values, which stand as outliers, need to be addressed before performing further analysis.
(3) Outliers. The scatter plots that have temperature ‘temp’ and ‘feel-like’ temperature ‘atemp’ as variables show a sequence of gray dots forming a linear pattern. These dots should be labeled as outliers because: 1) they stand out from the majority of the dots forming a 45-degree trend, 2) they show a constant value for ‘atemp’, which is suspicious, and 3) they all occur during one day, 08–17–2012, which is also suspicious. Once again, this is a data set that was modified, probably by hand, and these outliers need to be addressed before modeling. The following is a snippet of the Python code to tag these outliers as ‘atemp_outlr’:
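The original snippet is not shown in this post. As a stand-in, here is a minimal sketch of how such a tag could be assigned. It assumes a hypothetical ‘date’ column and assumes that the outliers are the rows of that day whose ‘atemp’ repeats a single value; both the column name and the rule are illustrative guesses based on the description above, and the toy values are invented.

```python
import pandas as pd

# Toy frame; 'date' is a hypothetical column name and all values are made up.
df = pd.DataFrame({
    "date": pd.to_datetime(["2012-08-16", "2012-08-17", "2012-08-17",
                            "2012-08-17", "2012-08-18"]),
    "temp": [0.70, 0.68, 0.72, 0.76, 0.70],
    "atemp": [0.66, 0.24, 0.24, 0.24, 0.65],
    "outlr_miss": "data",
})

suspicious_day = df["date"].eq(pd.Timestamp("2012-08-17"))

# Assumption: on that day 'atemp' is stuck at one value, taken here as the mode.
stuck_value = df.loc[suspicious_day, "atemp"].mode().iloc[0]
is_outlier = suspicious_day & df["atemp"].eq(stuck_value)

# Only rows that are still untagged receive the new tag.
df.loc[df["outlr_miss"].eq("data") & is_outlier, "outlr_miss"] = "atemp_outlr"
print(df[["date", "atemp", "outlr_miss"]])
```

Restricting the assignment to rows still tagged ‘data’ keeps the convention used for nulls and zeros, where an earlier tag is never overwritten.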

3.4. Addressing Nulls, Zeros, And Outliers

Strategies that use metrics such as mean, median, or mode to perform imputations seem to be suitable for univariate analysis. The Bike-Sharing data set is a multivariate data set, and the first attempt to address nulls, zeros, and outliers should involve the exploration of correlations or similarities among the features within the data set.

3.4.1. Addressing nulls, zeros, and outliers for temperature

We have all experienced it: at high temperatures ‘temp’, the ‘feel-like’ temperature is higher than the actual temperature; conversely, at low temperatures, the ‘feel-like’ temperature is lower than the actual temperature. This linear relationship between these variables helps address and correct nulls, zeros, and outliers for temperature.
The next figure shows a linear relationship between temperature ‘temp’ and ‘feel-like’ temperature ‘atemp’. One variable tells about the other. Therefore, one of these variables can be dropped from the analysis during modeling. However, we will use this relationship to emphasize the fact that when it comes to addressing nulls, zeros, or outliers, looking for relationships or similarities between the variables should be the first option in the list.



The scatter plot on the left presents all the original data. The colors, which represent tags, reveal three suspicious patterns (observe the suffix of each tag): outliers (light blue dots), zeros (magenta dots), and nulls (orange dots). Remember, orange dots represent null values replaced by zeros; the Null-zero trick makes the nulls visible as zeros on the plot, so we can visually assess how many there are and whether corrections to these null values make sense.
This is where the tags become handy; they allow filtering out the suspicious patterns before data is fed into the Linear Regression model from scikit-learn. Once the model is trained, it is used to correct the suspicious patterns in temperature.
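The filter-fit-correct workflow can be sketched as follows. This is not the scikit-learn code of the analysis: it swaps in `numpy.polyfit` for the linear fit, and the data are synthetic, generated with a made-up linear relation between ‘atemp’ and ‘temp’. Only the use of the ‘outlr_miss’ tags as a filter follows the text.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data: 'temp' is a noisy linear function of 'atemp' (invented coefficients).
atemp = rng.uniform(0.1, 0.9, size=200)
temp = 1.05 * atemp - 0.02 + rng.normal(0, 0.01, size=200)
df = pd.DataFrame({"temp": temp, "atemp": atemp, "outlr_miss": "data"})

# Inject a few nulls and tag them, as in the tagging step.
df.loc[:4, "temp"] = np.nan
df.loc[:4, "outlr_miss"] = "temp_nan"

# 1) Train only on rows tagged as clean data.
clean = df["outlr_miss"].eq("data")
slope, intercept = np.polyfit(df.loc[clean, "atemp"], df.loc[clean, "temp"], deg=1)

# 2) Use the fitted line to fill the tagged rows.
fix = df["outlr_miss"].eq("temp_nan")
df.loc[fix, "temp"] = slope * df.loc[fix, "atemp"] + intercept

print(round(slope, 3), round(intercept, 3))
print(df.head())
```

The same pattern works in the other direction (predicting ‘atemp’ from ‘temp’) for rows tagged ‘atemp_zero’ or ‘atemp_outlr’; the tag column decides which rows train the model and which rows receive predictions.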

3.4.2. Addressing Nulls and zeros for humidity

Humidity, or the amount of water vapor in the air, is usually reported as Relative Humidity on weather reports. Relative Humidity is related to dew point and temperature; therefore, the first attempt to address nulls and zeros for humidity should involve the identification of correlations or similarities with temperature and dew point.
The Bike-Sharing data set provides values for humidity and temperature, but values for dew point are missing in this data set. The next figure investigates possible relationships or correlations for humidity with the available variables in the data set. Keep in mind blue dots show null values presented as zero in these plots (these points do not exist in the data table).



The plots show the absence of a defined pattern for humidity; however, the patterns are there. The human brain is a visual learner, and the data needs to be presented in chunks and with color to reveal possible patterns. Exploring the data in more detail should help to identify possible relationships for humidity. Think about this: the temperature during the morning is low, then it rises progressively to a peak during the afternoon, and finally it decreases at the end of the day. Since temperature is related to humidity, this suggests that a humidity pattern should also exist during the day. The next graph shows temperature and humidity day by day.



The previous plots show an absence of a defined pattern for daily values of humidity, temperature, and wind speed. However, the next graph reveals a different picture; it shows a pattern when humidity and temperature interact through the hours of the day. The third plot, from left to right, shows humidity values of zero; these values exist in the data set, they are not the result of the Null-zero replacement trick.



Note that data presented by colors and chunks (by days) allows the brain to recognize the opposite patterns between temperature and humidity; it would not make sense to replace the zeros in humidity with an average value.
The next figure confirms this: black dots have a ‘hum_zero’ tag. The suffix ‘_zero’ indicates that those are in fact zeros in the original data set.



Observe that humidity and temperature follow opposite trends, which is in line with the behavior described by . Humidity is related not only to the temperature but also to the dew point; as a result, to calculate one value, . Dew point is absent in the Bike-Sharing data set; therefore, there is not a straightforward relationship that can be derived from the Bike-Sharing data set to calculate null and zero values for humidity.
Any relationship that addresses null and zero values for humidity needs to honor the daily relationship of humidity, temperature, and dew point. Random Forest is a great starting point to address nulls and zeros for humidity because:
1) the relationship between humidity, temperature, and the dew point is not linear,
2) Random Forest does a good job predicting values within the range of values where it was trained, and
3) the absence of information related to the dew point makes it challenging to derive humidity values directly.
The data set that is used to train the Random Forest Regressor has non-null values for humidity. Once the Random Forest Regressor is trained, it is used to predict null values for humidity. The following is a snippet of the Python code.
# Imports needed by this snippet
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# df_cpy is the working copy of the data frame from the previous steps.
# Rows whose humidity is null or zero need a prediction; the rest train the model.
needs_fix = df_cpy['hum'].isna() | df_cpy['hum'].eq(0)
df_Train = df_cpy.loc[~needs_fix].copy()
df_Test = df_cpy.loc[needs_fix].drop(columns=['hum'])

# Predictions with Random Forest Regressor
predictors = ['month_int', 'weekday_int', 'hr', 'Clear', 'Light Snow',
              'Slightly cloudy', 'Thunderstorm', 'temp']
rfr = RandomForestRegressor(n_estimators=100)
rfr.fit(df_Train[predictors], df_Train['hum'])
df_Test['hum'] = rfr.predict(df_Test[predictors])

# Put the rows back together in their original order
df_cpy2 = pd.concat([df_Train, df_Test], axis=0, sort=False)
df_cpy2.sort_index(inplace=True)
The next graph shows the results of the Random Forest Regressor. Observe the plot for Thursday 2011/3/10; before the application of Random Forest, this plot had a horizontal trend at ‘hum=0’. After the application of the Random Forest Regressor, this plot shows a humidity trend that: a) matches daily humidity trends reported in  and b) starts right where Wednesday's trend ends and finishes where Friday's trend begins.
Therefore, in this case, the Random Forest Regressor not only does a good job addressing nulls and zeros, but also honors the underlying Physics of humidity, temperature, and dew point.



3.4.3. Addressing zeros in wind speed

The next figure shows the behavior of wind speed with different features. Observe the green dots; they represent zero values in wind speed. These dots form not only a detached pattern in the scatter plots but also a bimodal behavior in the distribution plot. The Random Forest Regressor is a good starting point to address this suspicious pattern in wind speed.



The next figure presents the results of the Random Forest Regressor. The distribution plot shows a unimodal behavior, and the green dots no longer form a horizontal trend at ‘windspeed=0’.



4. Correlation Structure

One way to identify collinearity among the features is by using the correlation matrix, which provides the . Intuitively, the bicycle counts depend on variables that are related not only with weather conditions but also with time. Therefore, the correlation matrix includes both weather and time variables.
The collinearity between temperature ‘temp’ and the ‘feel-like’ temperature ‘atemp’ is evident. As a result, the ‘feel-like’ temperature, which is derived from ‘temp’, will be excluded during the modeling process. In general, the bicycle counts show a positive correlation with temperature and a negative correlation with humidity. The bottom row of the scatter plot matrix shows these tendencies. Also, notice that some of the time variables have a high correlation with the bicycle counts; this is especially true for hours. The next sections will explore, graphically, the relationship of bicycle counts with time variables.
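For readers who want to reproduce this kind of check, the sketch below builds a correlation matrix with pandas. The data are synthetic and were generated on purpose with the tendencies described above (an ‘atemp’ that nearly copies ‘temp’, counts that rise with temperature and fall with humidity), so the numbers illustrate the mechanics only and say nothing about the real data set.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500

# Synthetic columns with invented relationships; only the names follow the article.
hr = rng.integers(0, 24, size=n)
temp = rng.uniform(0.1, 0.9, size=n)
atemp = temp + rng.normal(0, 0.02, size=n)        # nearly a copy of temp
hum = 1.0 - temp + rng.normal(0, 0.15, size=n)
cnt = 200 * temp - 80 * hum + 5 * hr + rng.normal(0, 20, size=n)

df = pd.DataFrame({"hr": hr, "temp": temp, "atemp": atemp, "hum": hum, "cnt": cnt})

corr = df.corr()   # Pearson correlation is the pandas default
print(corr.round(2))

# A pair with a correlation close to 1 is a candidate for dropping one of its members.
collinear = corr.loc["temp", "atemp"]
```

Reading the ‘cnt’ row (or column) of `corr` gives the same information as the bottom row of the scatter plot matrix, condensed into one number per feature.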



5. Univariate Behavior of the Response Variable

The distribution of the number of bicycles ‘cnt’ is right-skewed, and a logarithmic transformation is required to correct the behavior. Even though the logarithmic transformation does not correct the skewness completely, the results are close to a normal distribution. This transformation will have a positive impact on some of the Machine Learning algorithms.
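A minimal sketch of the effect, using synthetic right-skewed counts rather than the Bike-Sharing data. The choice of `np.log1p` (the logarithm of 1 + cnt) and the hand-written skewness helper are assumptions made for this illustration; the text above does not say which logarithm variant was used.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic right-skewed counts (rounded log-normal draws), not the real 'cnt' column.
cnt = np.round(rng.lognormal(mean=4.0, sigma=1.0, size=5000))

def skewness(x):
    """Sample skewness: the third standardized moment."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# log1p stays defined even if a count is zero.
log_cnt = np.log1p(cnt)

skew_raw = skewness(cnt)
skew_log = skewness(log_cnt)
print(round(skew_raw, 2), round(skew_log, 2))
```

A skewness near zero after the transformation indicates a roughly symmetric distribution; predictions made on the log scale can be mapped back to counts with `np.expm1`.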



6. Behavior of Response Variable With Time Variables

Observe the medians (the line with 50% of the data above it) of the box plots in the next figure; they show four general characteristics of the number of bicycle rentals: 1) rentals are increasing over the years, 2) trends initially increase and then decrease over time, 3) trends are cyclical through time, and 4) outliers are present. In more detail, the following statements can be derived:
  • The first two box plots indicate that, starting in January, rentals gradually increase; they continue the increasing pattern through Spring and reach a peak in the Summer. Then rentals decrease through Fall and Winter to meet the trends of January and Spring of the next year, a sign of cyclical behavior. Peak bike rentals occur in Summer.
  • The middle-left box plot indicates the demand for bicycle rentals is higher during working days. This phenomenon is even more pronounced, almost 50% more, for the second year. This behavior signals that the popularity of bike rentals is increasing, at least in the city that is the source of this data set.
  • The middle-right box plot shows people ride less on holidays. However, this plot also reveals the increase in the popularity of bicycle rentals throughout the years.
  • The bottom box plot also shows both the increase in popularity and the cyclical behavior, this time with two cycles over a short period of time, a day. The demand for bicycles peaks between 7–9 AM and 5–6 PM. These patterns match workers going to work and leaving work; workers actually use bicycles as a means of transportation to work. Notice demand remains relatively constant between peak hours; this could be explained by “regular people”, tourists, and students using bicycles.
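The medians read from the box plots can also be computed directly. The sketch below groups hourly counts by working day and hour and takes the median of each group. The data are synthetic, with invented commute peaks at 8 and 17 hours, so the result only demonstrates the calculation; the column names ‘workingday’, ‘hr’, and ‘cnt’ follow the article.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)

# Synthetic hourly rentals for 60 days; the peak hours and levels are made up.
rows = []
for day in range(60):
    workingday = int(day % 7 < 5)
    for hr in range(24):
        if workingday:
            base = 400 if hr in (8, 17) else 150   # invented commute peaks
        else:
            base = 300 if 11 <= hr <= 15 else 100  # invented midday plateau
        rows.append({"workingday": workingday, "hr": hr, "cnt": rng.poisson(base)})
df = pd.DataFrame(rows)

# Median count per hour, one column per value of 'workingday'.
profile = df.groupby(["workingday", "hr"])["cnt"].median().unstack("workingday")

# The two hours with the highest median demand on working days.
peak_hours_working = profile[1].nlargest(2).index.sort_values().tolist()
print(peak_hours_working)
```

Replacing ‘workingday’ with another grouping column, such as the year or the season, gives the medians behind the other box plots.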




Let’s inspect the next set of plots. Observe how the brain can digest multiple trends easily, thanks to different sorting and colors on the plot. The inflections of the continuous trend lines depict the average number of rented bicycles for each hour throughout the day. These trend lines show the same patterns and characteristics that we just explained. These plots give us an additional piece of information. The dashed lines represent the variation of average temperature through the hours of the day. This dashed trend helps us determine, at least partially (remember this is a multivariate analysis; humidity, wind speed, and dew point are not in these plots), the influence of temperature on the number of rental bikes through the hours of the day.
The plot on the top reveals interesting trends. Fall shows the highest average temperatures throughout the day; in contrast, Spring shows the lowest average temperature throughout the day. These lowest average temperatures in Spring seem to correlate with the lowest average bike count, which also happens in Spring. We could say that people don’t bike much in Spring; maybe people want to stay away from pollen, dust, etc. During these periods, people don’t want to catch allergies.
There is an overlap of the trends that tell us about the average bicycle counts for Summer, Fall, and Winter. One could think that this could be related to temperature; however, the trends that tell us about the average temperature for the same seasons do not overlap. In other words, separation of the average temperature trends does not match separation for average count trends for Summer, Fall, and Winter.
(*) Therefore, the average temperature during Summer, Fall, and Winter does not change the average demand trends of bicycles. However, since temperature is not the only factor controlling the weather conditions, it seems that a combination of variables contributes to a relatively constant bicycle demand during Summer, Fall, and Winter.
(*) Also, the average bicycle demand during Summer, Fall, and Winter is higher compared to the demand during Spring.
The last plot also reveals an interesting trend. The average temperature remains constant throughout the days of the week. You may ask: how is this possible when the previous plot shows that average temperature changes for the same hour of the day? Well, this plot shows average temperatures for all Mondays and all the seasons, say at hour 23 exactly. This average temperature happens to be relatively the same for Tuesday, Wednesday, etc. You can run the calculations manually, say at hour 23, for confirmation. Notice the sinuosity of the temperature trend; even though it is an average trend, it follows the trends reported in .
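The manual calculation mentioned above can be sketched with a pandas groupby. The data here are synthetic (a daily temperature cycle plus noise, with no weekday effect built in), so the output shows how to run the check, not the values of the Bike-Sharing data set; the ‘weekday’ column is a hypothetical name.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Four full weeks of hourly rows, starting on a Monday.
n_hours = 24 * 28
timestamps = pd.Timestamp("2011-01-03") + pd.to_timedelta(np.arange(n_hours), unit="h")
hr = timestamps.hour.to_numpy()

# Synthetic temperature: a daily cycle plus small noise (invented amplitude).
temp = 0.5 + 0.2 * np.sin((hr - 9) / 24 * 2 * np.pi) + rng.normal(0, 0.02, size=n_hours)

df = pd.DataFrame({"weekday": timestamps.day_name(), "hr": hr, "temp": temp})

# Average temperature for every (weekday, hour) pair; rows are weekdays, columns are hours.
avg_temp = df.groupby(["weekday", "hr"])["temp"].mean().unstack("hr")

# The check described in the text: compare the weekdays at hour 23.
at_23 = avg_temp[23]
print(at_23.round(3))
```

If the values in `at_23` are close to each other, the average temperature at that hour does not depend on the day of the week, which is what the flat pattern across days in the plot expresses.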
There is another interesting trend in this last plot. Imagine for a moment that the peaks of average bicycle counts are not in the last plot. Then the average demand for bicycles is higher during the weekends, especially late morning and early afternoon.



7. Conclusions

In general:
  • Monochromatic plots showing all the data for all the variables are deceiving to the human brain. Humans are visual learners, and color on the plots makes a big difference. When plots show data with colors referencing different categories, tags, or short time frames, they unveil valuable patterns and insights during Exploratory Data Analysis.
  • The first line of action to address nulls, zeros, and outliers is the identification of correlations or similarities among the available variables. Strategies that use metrics such as mean or median to impute values are suitable for univariate analysis and should be the last line of action.
  • Random Forest Regressor is a good starting point to address nulls, zeros, and outliers; it provides results that honor the physics of the phenomena in this data set.
  • The trend that describes the average bicycle counts during the weekends differs from that of Monday through Friday (working days). During working days, demand for bicycles peaks during rush hours, 6:30–8:30 AM and 5:00–7:00 PM.
  • The trend that describes the average bicycle counts during Spring differs from that of Summer, Fall, and Winter. Average bicycle demand for Summer, Fall, and Winter remains relatively constant (observe these trends overlap).
  • It is not 100% clear whether temperature alone is the cause of these different trends in the demand of bicycles. Remember temperature is not the only factor in weather conditions.
  • The trends of bicycle counts are cyclical through the day and through the year.

8. References
