In this article we explore what it means to create a Jupyter Notebook template for use in initial data exploration. The resulting familiarity with the data can then be utilized as the basis for further detailed analysis, narrative development, feature engineering, etc.

Introduction

When leading teams or managing projects, one will often encounter data in different forms such as user feedback, project management metrics, system usage statistics, or performance metrics.

In each of these cases we’d like to be able to analyze the data, develop a narrative describing what the data is telling us, and then create one or more actionable recommendations and/or insights.

However, the first step is usually becoming familiar with the general ‘shape’ of the data and using that as a guide on how to proceed. We aren’t performing a deep dive yet; our first goal is simply to become generally familiar with the data, and then allow this to guide us to the next, more detailed steps.


For example, let us assume we have a corpus of unstructured user feedback on a web application we manage. In this case we might want to understand the most commonly used terms in the texts, their frequency counts, and the review lengths. Based on what we uncover, perhaps we’ll notice a high number of negative terms in the texts.

This in turn might lead us to start thinking about how we might more deeply analyze the seemingly negative comments:

  • For example, should we categorize the reviews in a way that maps to various areas of the application? Is the data structured in such a way that this is possible?
  • If there isn’t a clear categorical delineation, should we instead apply an unsupervised K-means clustering algorithm to programmatically create groupings for further investigation? (A minimal sketch of this approach follows this list.)
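By way of illustration, here is a minimal sketch of how such an unsupervised grouping might look, assuming scikit-learn is available. The feedback_texts list and the number of clusters are purely hypothetical and would need to be replaced and tuned for real data:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import pandas as pd

# feedback_texts is a hypothetical list of cleaned feedback strings
feedback_texts = [
    "login page very slow after update",
    "search results missing recent items",
    "love the new dashboard layout",
    "export to csv keeps failing",
]

# Convert the free text into TF-IDF vectors
vectorizer = TfidfVectorizer(stop_words = 'english')
X = vectorizer.fit_transform(feedback_texts)

# Group the feedback into clusters; n_clusters would need tuning for real data
kmeans = KMeans(n_clusters = 2, random_state = 1, n_init = 10)
labels = kmeans.fit_predict(X)

# See how many items landed in each cluster
print(pd.Series(labels).value_counts())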

Based on the groupings we find we could then potentially develop a narrative, a set of actionable recommendations to address the root causes for each category, and Key Performance Indicators (KPIs) to measure progress.

So, in the example above–and indeed the subject of this article–we use initial data exploration to shape and guide our deeper, secondary analysis, narrative development, and recommendations/insights.

Assumptions

Before we begin a few assumptions:

Your data is clean.

Data cleaning is a subject unto itself, and a whole separate article could be written on the topic. For the purposes of this write-up, however, we will assume that your data is mostly clean, and that any issues with missing values, incorrect data types, etc. have been handled before the initial exploration.

So, while we will take a quick look later on at dealing with a few messy records, this article will not explore the subject in detail.

Here is a good resource on the subject if you want to learn more.

You have some basic Python programming skills and access to a Jupyter Notebook environment.

If you use the template from this article most of the programming work has been done for you. You’ll simply need to adjust the template as you see fit to meet your objectives. If on the other hand you need help setting up your Jupyter Notebook environment you can refer to a previous article I wrote on this subject here.

Also note if you need a hand let me know; I’ll be happy to help out as I’m able. :)

Resources

We’ll discuss each of the following resources in more detail later on, but we include them here for easy reference:

  • You can find the template discussed in this article here
  • You can find the modified Iris Data Set utilized in the template here
  • You can find the modified IMDB Movie Reviews Set text utilized in the template here

Let’s Get Started!

So, having said all that, let’s get started!

Sample Data

The first thing we need to do, however, is obtain example data to run the template against so we can illustrate how it works.

To do this we’ll be utilizing two freely available and commonly used example data sets: the Iris Data Set and the IMDB Movie Reviews Set.

Note that we’ve made two modifications to these data sets:

  • For the Iris Data Set we’ve added two new records to the set to simulate non-numeric and NaN issues. You can find our modified version here.

  • For the IMDB Movie Reviews Set we’ve already cleaned and processed the review text so it can be used directly in the template. You can find our modified version here.

Jupyter Notebook Template

Once we have access to the data sets we can utilize the Jupyter Notebook Template (found here) to gain the initial idea of what the data looks like and plan our next actions.

Let’s break down what the template is doing section-by-section:

Load libraries and data

Libraries

We start by importing the Python libraries we’ll need for our initial data exploration:

# Configure the notebook
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')
# Load libraries
import numpy as np

import pandas as pd
from pandas import read_csv
from pandas.plotting import scatter_matrix
from pandas import set_option
from pandas import DataFrame

from matplotlib import pyplot

from wordcloud import WordCloud, STOPWORDS

import pickle

import nltk
from nltk.probability import FreqDist

import os

Data

Note that we are loading a modified Iris data set and an IMDB Movie Reviews data set. You can download these files here and here, and we assume you’ve placed the data sets in a directory called datasets in the same root folder as this template, like so:

├── datasets
│   ├── iris.data.csv
│   └── reviews.obj

Let’s go ahead and load the data files:

dataFile = os.path.join('.', 'datasets', 'iris.data.csv')
data = read_csv(dataFile, header = 0)
filehandler = open(os.path.join('.', 'datasets', 'reviews.obj'), 'rb')
reviews = pickle.load(filehandler)

Helper functions

Next we want to write a quick helper function, corrTableColors(). This function allows us to set rules on when to apply colors to the correlation table values which we’ll create below.

Note that the threshold values in the function can be altered as required to meet your needs. For example, we are highlighting any correlation values greater than 0.7 or less than -0.7, but these can easily be altered for your project data.

def corrTableColors(value):
    # Default color
    color = 'black'

    if value == 1:
        color = 'white'
    elif value < -0.7:
        color = 'red'
    elif value > 0.7:
        color = 'green'

    return 'color: %s' % color

Inspect and visualize the data (Iris Data Set)

Now we are ready for the fun part: Inspecting the data and learning more about how it is structured!

We’ll utilize the Iris Data Set for the steps below, so that we can see how the commands work on an actual data set.

Non-Visual Exploration

We start with a non-visual exploration where the properties of the data are examined such as the number of records, the data types, descriptive statistics, correlations, etc.

Shape

The command below will return the number of rows by number of columns which are also referred to as observations and variables:

print("data.shape = ", data.shape)
data.shape =  (152, 5)

Thus we have 152 rows/observations and 5 columns/variables.

First 10 rows

Using the head command we can examine the first 10 records:

print(data.head(10))
  sepalLength  sepalWidth  petalLength  petalWidth        class
0         5.1         3.5          1.4         0.2  Iris-setosa
1         4.9         3.0          1.4         0.2  Iris-setosa
2         4.7         3.2          1.3         0.2  Iris-setosa
3         4.6         3.1          1.5         0.2  Iris-setosa
4         5.0         3.6          1.4         0.2  Iris-setosa
5         5.4         3.9          1.7         0.4  Iris-setosa
6         4.6         3.4          1.4         0.3  Iris-setosa
7         5.0         3.4          1.5         0.2  Iris-setosa
8         4.4         2.9          1.4         0.2  Iris-setosa
9         4.9         3.1          1.5         0.1  Iris-setosa

We can clearly see that the first four columns are numeric data, and the last column appears to be classification labels.

At this point, however, we aren’t sure what the ranges of the numeric data are or how many labels we are dealing with. We’ll explore this later on.

Data Types

Let’s confirm our assumption that the first four columns are numeric. To do this we examine the output of the dtypes command:

print(data.dtypes)
sepalLength     object
sepalWidth     float64
petalLength    float64
petalWidth     float64
class           object
dtype: object

OK, this is interesting! The dtypes command appears to think we have some non-numeric data in the sepalLength column, and we know this because it classifies the column as the object type.

Spoiler alert: Remember when I said above that we are using a modified version of the Iris Data Set? This is the reason: I intentionally placed two problematic records into the data to simulate what might happen if a few slipped through our initial cleaning. Below we’ll take a few basic actions to mitigate this…

If we sort the data in descending order the non-numeric values should float to the top and give us a better idea of what is occurring:

data['sepalLength'].sort_values(ascending = False).head(5)
151    abc
131    7.9
135    7.7
118    7.7
117    7.7
Name: sepalLength, dtype: object

Aha! It appears that record 151 has a typo: someone (i.e. me) put the value abc into the field instead of an actual numeric value.

Let’s drop the record and recast the column into a numeric data type.

Note that in this case it is perfectly fine for us to drop the record, because we are working on initial data exploration. At this point we simply want to get a sense of the story the data is telling us, and so removing a few problematic records shouldn’t affect this outcome.

# Drop the erroneous record; likely a data entry error
data.drop(151, inplace = True)

# Recast the column into numeric values
data['sepalLength'] = pd.to_numeric(data['sepalLength'])

# Confirm our work
print(data.dtypes)
sepalLength    float64
sepalWidth     float64
petalLength    float64
petalWidth     float64
class           object
dtype: object

We now see that all the columns are numeric except the last column which we assumed contained classification label data.

Another item we should check for is NULL values. Python does not have a NULL keyword; missing values are instead represented as NaN or None.

We can quickly check for this as follows:

# isnull() will find both NaN and None values in a dataframe
# Let's see how many we have and count them up
data.isnull().sum()
sepalLength    0
sepalWidth     1
petalLength    0
petalWidth     0
class          0
dtype: int64

It appears we have one NaN/None value in our data set. Let’s go ahead and drop it, so we can continue with the exploration:

#Drop any NaN/None values from our data set and then confirm we've removed them all:
data.dropna(inplace = True)

# Confirm our work
data.isna().sum()
sepalLength    0
sepalWidth     0
petalLength    0
petalWidth     0
class          0
dtype: int64

As a reminder: data cleaning is a whole subject unto itself, and an entire article could be dedicated to it alone. For this write-up, however, we are assuming your data is mostly clean, and we’ve just covered two basic commands to deal with a few records that might have slipped through.

Descriptive stats

Next we can take a look at the descriptive statistics for the numeric data in our set:

# Expand the width of the displayed table
set_option('display.width', 200)

# Show two decimal places
set_option('display.precision', 2)

# Output descriptive statistics
data.describe()
       sepalLength  sepalWidth  petalLength  petalWidth
count       150.00      150.00       150.00      150.00
mean          5.84        3.05         3.76        1.20
std           0.83        0.43         1.76        0.76
min           4.30        2.00         1.00        0.10
25%           5.10        2.80         1.60        0.30
50%           5.80        3.00         4.35        1.30
75%           6.40        3.30         5.10        1.80
max           7.90        4.40         6.90        2.50

We can see for example that the sepalLength values fluctuate between 4.3 and 7.9, and that the standard deviation (std) is relatively small for this data set. This implies that most of the data is close to the mean, and that we don’t have a wide range of values to consider.

We can also note that the numeric values seem to be on a similar scale.

Note that scale is important: if you were planning on using an algorithm such as linear regression, neural networks, K-means, etc. as part of your narrative/analysis, you’d need to take scaling into account.

For example, when dealing with housing data you might have a value such as $450,000 for the house price and a value of 3 for the number of bathrooms. In such cases you’d need to apply normalization and/or standardization techniques to mitigate the differences between the value scales.

We wouldn’t want the model to conclude that, because the number of bathrooms is numerically tiny compared to the house price, it has no effect, when in reality quite the opposite is true!

While these topics are out-of-scope for this article, if you want to learn more you can read more about them here.
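To make that a little more concrete, here is a minimal sketch of both techniques using plain pandas. The housing DataFrame below is hypothetical and only serves to illustrate the idea:

import pandas as pd

# A hypothetical housing data set with features on very different scales
housing = pd.DataFrame({
    'price':     [450000, 325000, 610000],
    'bathrooms': [3, 2, 4]
})

# Min-max normalization: rescale each column to the range [0, 1]
normalized = (housing - housing.min()) / (housing.max() - housing.min())

# Standardization: rescale each column to mean 0 and standard deviation 1
standardized = (housing - housing.mean()) / housing.std()

print(normalized)
print(standardized)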

Class distribution

Let’s examine the last column which appeared to be categorization labels for each row. How many different classes/labels are we dealing with?

data.groupby('class').size()
class
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
dtype: int64

We note that we have an even distribution of classes/labels among the samples.

If you were planning on using data modeling this would be important, because you’d have to account for any disparity in the number of observations per class if an imbalance existed.

For example, if we have one thousand positive comments and four negative ones, it would be hard to draw convincing conclusions, identify trends, or make recommendations on how we could improve on the four negative samples alone.
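One quick way to spot an imbalance during initial exploration is to look at class proportions rather than raw counts. Shown here on the Iris class column we already grouped above; the 10% threshold is an arbitrary rule of thumb you would adjust for your own data:

# View the proportion of records per class rather than the raw counts
proportions = data['class'].value_counts(normalize = True)
print(proportions)

# Arbitrary rule of thumb: flag any class holding less than 10% of the records
if proportions.min() < 0.10:
    print("Warning: at least one class appears to be under-represented")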

Correlation

Next we examine the data for correlations using the corr() command. We also utilize the helper function, corrTableColors(), we wrote above to make reading the table easier.

As stated before, if you want to change the color thresholds to fit your needs this should be easy to do.

corr = data.corr(method = "pearson")
corr.style.applymap(corrTableColors)
             sepalLength  sepalWidth  petalLength  petalWidth
sepalLength         1.00       -0.11         0.87        0.82
sepalWidth         -0.11        1.00        -0.42       -0.36
petalLength         0.87       -0.42         1.00        0.96
petalWidth          0.82       -0.36         0.96        1.00

Unsurprisingly the data set contains several strong positive correlations, which intuition suggests we should expect in data describing plant measurements.

In a real-world data set this would potentially point us to areas to investigate further. For example, assume we had metrics on the number of clicks required to perform some task, and metrics on the time taken for a web page to load. We could examine the correlation between these two metrics and the reported user satisfaction rating for the system in order to identify potential positive or negative relationships.

We would expect to find that requiring the user to click excessively together with long loading times would increase dissatisfaction, and the correlation table could help us empirically identify this linkage.
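As a rough sketch of that idea, assuming we had collected such metrics into a DataFrame (the column names and numbers below are entirely made up), the same corr() call would reveal those relationships:

import pandas as pd

# Hypothetical usage metrics; values are invented purely for illustration
metrics = pd.DataFrame({
    'clicks_per_task':   [4, 7, 12, 3, 9, 15],
    'page_load_seconds': [1.2, 2.5, 4.1, 0.9, 3.0, 5.2],
    'satisfaction':      [9, 7, 4, 10, 6, 2]
})

# Pearson correlations between each pair of metrics
print(metrics.corr(method = 'pearson'))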

Skew

Skewness is a measure of the asymmetry of a distribution, i.e. how far one of its ‘tails’ stretches relative to the other. Here is an image to illustrate:

[Image: a positively skewed distribution with a long tail extending to the right]

In this image the ‘tail’ of the data drifts off to the right, and this is called positive skewness. If on the other hand the tail would have drifted off to the left we would have said the data had negative skewness.

Skewness might interest you if you are concerned about data that doesn’t follow the bell curve (also known as a normal distribution).

For example, let us assume we are examining data from a project concerning customer hold times. If we saw positive skewness in the data we might be concerned, because the long right-side tail would imply some customers are having potentially excessive wait times which in turn could negatively impact customer satisfaction.

Let’s see how we can view a numeric representation of skewness for our sample data:

data.skew()
sepalLength    0.31
sepalWidth     0.33
petalLength   -0.27
petalWidth    -0.10
dtype: float64

For a normal distribution the skew would be zero. Our data set has skewness values close to zero, so there does not appear to be any significant positive or negative skew.

We’ll also be able to visually confirm this in the next section.

Visualizations

Now that we’ve finished with the non-visual data exploration let’s employ some graphical techniques and see what we can find. :)

Histograms

Creating histograms for our sample data set is extremely easy, as shown in the code below:

# Control how big the histogram images are
data.hist(figsize = (8,8))

# Draw the histograms
pyplot.show()

[Image: histograms of the four numeric Iris variables]

The histograms confirm what we learned earlier about skewness in the data, and both sepalLength and sepalWidth appear to follow normal distributions (i.e. the bell curve).

petalLength and petalWidth, on the other hand, appear to be bimodal distributions. That is to say, they appear to have two ‘peaks’ in their value distributions.

A bimodal (or multimodal, if there are more than two peaks) distribution might be of interest to you if you were examining scheduling or queue analysis data, for example.

Let’s assume that we were working on a project to deploy an online restaurant point of sale (POS) system, and we wanted to ensure the system was responsive during peak hours (i.e. non-functional requirements). Examining the sales data for the restaurant hour-by-hour would likely reveal a bimodal distribution with two peaks: One at breakfast and another at lunch.

We could use these peaks to calculate the capacity and scalability requirements for the system, recommend on-demand load balancing for the back end POS servers during peak hours, etc.
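As a sketch of that scenario, here is how hypothetical hourly transaction counts might be plotted to reveal the two peaks. The numbers are invented purely for illustration:

import pandas as pd
from matplotlib import pyplot

# Hypothetical transaction counts per opening hour (06:00 through 21:00)
hourly_sales = pd.Series(
    [2, 1, 1, 3, 12, 38, 45, 20, 8, 10, 42, 55, 30, 9, 5, 4],
    index = range(6, 22)
)

# Plot the distribution; the two peaks correspond to breakfast and lunch
hourly_sales.plot(kind = 'bar')
pyplot.xlabel('Hour of day')
pyplot.ylabel('Transactions')
pyplot.show()

# The busiest hours drive the capacity/scalability requirements
print(hourly_sales.nlargest(3))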

Density plots

Density plots can be thought of as smoothed histograms. They assist in examining the ‘shape’ of the data and, for me anyhow, are easier to read when assessing distribution shapes and multimodality.

Let’s take a look:

# Create the density plots in a 2x2 layout
data.plot(
    kind = 'density',
    subplots = True,
    layout = (2,2),
    sharex = False,
    figsize = (8,8)
)

# Draw the plots
pyplot.show()

[Image: density plots of the four numeric Iris variables in a 2x2 layout]

These plots make it very easy to gauge the distribution of the data as well as the presence of a multimodal distribution.

So for example we can clearly see that sepalLength and sepalWidth follow a bell curve, and that the petalLength and petalWidth values are indeed bimodal.

We also get additional confirmation of the lack of skewness in the data, as covered earlier.

Box and whisker plots

Box and whisker plots are fantastic data visualization tools, although they can be non-intuitive to decipher at first. Here is a quick two-minute article on how to interpret them, and here is an infographic explanation from that same source:

[Image: infographic explaining the parts of a box-and-whisker plot]

Creating them in our notebook is easy too:

# Create the box-and-whisker plots in a 2x2 layout
data.plot(
    kind="box",
    subplots = True,
    layout = (2,2),
    sharex = False,
    sharey = False,
    figsize = (8,8)
)

# Draw the graphs
pyplot.show()

[Image: box-and-whisker plots of the four numeric Iris variables in a 2x2 layout]

We can note that the box-and-whisker plots contain a lot of information about our data set in a compact form:

  • By examining the Y-axis values and ‘whiskers’ it is clear the data has a value spread of about 1 to 8 centimeters
  • Other than the sepalWidth variable, the data does not appear to contain outliers (a sketch of flagging these programmatically follows this list)
  • The interquartile range (IQR) also appears to have a narrow spread, with the greatest amount of variation occurring in the petalLength values
  • We can quickly identify the median value for each variable by examining the green bar. For example, sepalLength appears to have a median of about 5.8.
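Here is a minimal sketch of that outlier check done programmatically, using the same 1.5 × IQR rule the box-and-whisker plot is based on, applied to the sepalWidth column:

# Flag values outside 1.5 * IQR from the quartiles as potential outliers
q1 = data['sepalWidth'].quantile(0.25)
q3 = data['sepalWidth'].quantile(0.75)
iqr = q3 - q1

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = data[(data['sepalWidth'] < lower) | (data['sepalWidth'] > upper)]
print(outliers)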

Scatter plot matrix

We next consider a scatter plot matrix, which allows us to visually estimate correlations, similar to what we did above using quantitative means:

scatter_matrix(data, figsize=(10, 10), s = 200)
pyplot.show()

[Image: scatter plot matrix of the four numeric Iris variables]

The scatter plot matrix reinforces visually what we saw above in the correlations table.

For example, we see that data points for petalLength vs. petalWidth increase steadily towards the upper right side of the graph in a diagonal line. If we consult the correlations table we created earlier we find that these two variables are highly correlated at 0.96.

On the other hand, if we saw the data points decreasing towards the lower right side of the graph, we would conclude that there was a negative correlation.

sepalWidth and sepalLength, on the other hand, don’t seem to have much correlation, since there doesn’t appear to be a clear pattern in the plotted data points. The correlation table confirms this with a value of -0.11 for these two variables.

Here is a visual explanation of these concepts:

[Image: examples of positive, negative, and no correlation in scatter plots]

Source

Unstructured Text (IMDB Reviews)

At this point we are starting to have a good initial understanding of what our data looks like. However, what if we have unstructured natural language data, such as customer reviews, as part of our collection?

In this section we’ll cover three easy techniques to analyze and assess unstructured textual data.

For our example data set we are going to use the IMDB Movie Reviews data set we loaded at the beginning of this notebook.

Note that we’ve already cleaned up the text by removing punctuation and other non-alphabetic values, setting all the text to lower case, and removing stop words such as ‘a’, ‘and’, ‘the’, etc. Each review is stored as a separate entry.
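The exact preprocessing used to build reviews.obj isn’t reproduced here, but a minimal sketch of that kind of cleanup, applied to a single made-up review and assuming nltk’s stop word list has been downloaded, might look like this:

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')          # one-time download of nltk's stop word list
stop_words = set(stopwords.words('english'))

raw = "This movie was NOT good -- the plot, frankly, made no sense!"

# Lower-case the text and strip anything that isn't a letter or a space
letters_only = ''.join(ch for ch in raw.lower() if ch.isalpha() or ch.isspace())

# Drop stop words such as 'a', 'and', 'the', etc.
cleaned = ' '.join(w for w in letters_only.split() if w not in stop_words)
print(cleaned)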

Let’s take a quick peek at one of the records we’ll be working with:

reviews[110]
'truly dreadful know initially kiwi effort soon started realize characters speaking hardly disguised kiwi accents fake american ones need set n america anyway could set nz actors could used normal voices surely someone production team could hear dreadful attempts speaking american accents bad bad surprised lasted long make seemed like poor attempt segal willis type action man flick total waste money taxpayer money piece trash would leading revolution money put back treasury still reeling get pun reeling absolute garbage seen continue watch well fanatic cant help'

From all the negative words I think it’s pretty clear someone didn’t like this movie!

Word Clouds

The first technique we’ll cover is a word cloud.

Google says a word cloud is “an image composed of words used in a particular text or subject, in which the size of each word indicates its frequency or importance.”

Thus, the more often a specific word appears in the text, the bigger and bolder those words will appear in the word cloud.

So, for example, if we were to build a word cloud from a corpus of customer support tickets, it might help us gain an initial understanding of the most common issues. We might see the words ‘slow’ or ‘missing’ appear prominently for a system that had latency issues and data errors.

Let’s create one from our data, and see what it looks like:

# Generate wordcloud
wordcloud = WordCloud(
    random_state=1,
    background_color='black',
    colormap='Set2',
    collocations=False,
    stopwords = STOPWORDS).generate(' '.join(reviews))

wordcloud.to_image()

[Image: word cloud generated from the movie review text]

We quickly have a powerful graphical analysis tool with almost no code!

The words ‘good’ and ‘time’ have the 2nd and 3rd largest font size respectively. We also see the word ‘bad’ in the word cloud, but its font size is much smaller.

From this we can form an initial assumption that our sample of movie reviews contains more positive reviews than negative. Now of course we’d need to empirically validate this with further analysis, but for an initial data exploration it serves the purpose of helping us form a story in our mind of what the data is telling us and how to proceed next.
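As one possible next step, here is a rough sketch of how we might begin to validate that assumption using nltk’s VADER sentiment analyzer on a small sample of the reviews. Keep in mind that VADER is tuned for raw text, and our reviews have already had punctuation and stop words removed, so treat the results as indicative only:

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')      # one-time download of the VADER lexicon
sia = SentimentIntensityAnalyzer()

# Score a small sample of reviews to keep things fast
sample = reviews[:500]
scores = [sia.polarity_scores(text)['compound'] for text in sample]

# Conventional VADER thresholds: > 0.05 positive, < -0.05 negative
positive = sum(1 for s in scores if s > 0.05)
negative = sum(1 for s in scores if s < -0.05)
print("positive:", positive, "negative:", negative)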

Numerical Word Frequency Counts

If, after creating the word cloud, we want to know the exact frequency counts for the words appearing the greatest and least number of times, we can do that like so:

# Count the number of times each word appears in the text(s)
data_analysis = nltk.FreqDist(' '.join(reviews).split())

Now we can view the top 10 most frequent words in the data set:

data_analysis.most_common(10)
[('one', 26788),
 ('like', 20274),
 ('good', 15140),
 ('time', 12724),
 ('even', 12646),
 ('would', 12436),
 ('story', 11983),
 ('really', 11736),
 ('see', 11475),
 ('well', 10662)]

And what if we wanted to know the top 10 least frequent words?

Easy enough:

data_analysis.most_common()[-10:]
[('riiiight', 1),
 ('imy', 1),
 ('jayden', 1),
 ('concider', 1),
 ('acids', 1),
 ('crahan', 1),
 ('substitutions', 1),
 ('daeseleire', 1),
 ('shortsightedness', 1),
 ('unfairness', 1)]
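Since every one of the words above appears exactly once, it is worth noting that FreqDist can also return the full list of such words (the ‘hapaxes’) directly, which is handy for spotting typos and rare noise terms:

# hapaxes() returns every word that occurs exactly once in the text(s)
rare_words = data_analysis.hapaxes()
print(len(rare_words), "words appear only once")
print(rare_words[:10])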

Review Lengths

Next we’ll examine the length of each record in the reviews.

First let’s calculate the number of words per review, and store them in their own column:

reviewsDF = pd.DataFrame(reviews, columns = ['Review Text'])
reviewsDF['Word Count'] = [len(x.split()) for x in reviewsDF['Review Text']]

We can examine how many words are in the first five reviews with the head() command:

reviewsDF.head()
Review Text Word Count
0 stuff going moment mj started listening music ... 214
1 classic war worlds timothy hines entertaining ... 80
2 starts manager nicholas bell giving welcome in... 238
3 must assumed praised greatest filmed opera eve... 184
4 superbly trashy wondrously unpretentious explo... 208

Next we can use the describe() function as we did on the Iris data to create descriptive statistics for the number of words in each review:

reviewsDF['Word Count'].describe()
count    25000.00
mean       116.16
std         89.16
min          3.00
25%         61.00
50%         86.00
75%        141.00
max       1416.00
Name: Word Count, dtype: float64

What is really interesting here is the spread of the values. The std statistic alone gives us an indication there is a lot of variance at 89.16, and if we next examine the min and max entries we see values of 3 and 1,416 respectively.

(For reference 1,416 words is about three pages!)

Clearly we had two individuals with very different levels of motivation when it came to authoring movie reviews…

We can also depict this visually via a histogram:

ax = reviewsDF['Word Count'].hist();
ax.set_xlabel("Number of words")
ax.set_ylabel("Number of records")
Text(0, 0.5, 'Number of records')

[Image: histogram of review word counts]

Whether or not the variance in record length matters will be up to you and your situation.

For example, if this were unstructured text containing user comments on application performance, perhaps I would remove records with fewer than twenty words or more than five hundred, since I might not feel such records would contain the concise details needed to perform a proper root cause analysis.

Again, however, this will be up to you and your domain knowledge of your data.
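If you did decide to apply such a filter, a minimal sketch using the thresholds mentioned above (adjust them to suit your own data) might look like this:

# Keep only the reviews whose word count falls within the chosen thresholds
min_words, max_words = 20, 500

filteredDF = reviewsDF[
    (reviewsDF['Word Count'] >= min_words) &
    (reviewsDF['Word Count'] <= max_words)
]
print("Records kept:", len(filteredDF), "of", len(reviewsDF))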

As a final note, if we wanted to review the five records with the fewest words we could do this:

reviewsDF.sort_values(by = ['Word Count'], ascending = True).head()
Review Text Word Count
20276 terrible good effects 3
22258 ming merciless little bardwork foul 5
9245 better choose paul verhoeven even watched 6
23558 rent one even dollar rental night 6
4048 adrian pasdar excellent makes fascinating woman 6

This not only gives us the word count for each record, but also the content and the row index where the record can be found in the data set (i.e. the values in the table’s first column).

If you wanted to see the records with the largest number of words you could replace ascending = True with ascending = False like so:

reviewsDF.sort_values(by = ['Word Count'], ascending = False).head()
Review Text Word Count
3485 match tag team table match bubba ray spike dud... 1416
20982 titanic directed james cameron presents fictio... 903
20347 attention spoilers first let say rob roy one b... 892
10503 probably heard bit new disney dub miyazaki cla... 853
6864 praised atlantis lost empire disney adventure ... 805

Wrapping Up

In this article we covered how to create a Jupyter Notebook template for initial data exploration. We then explored how to load the data, perform some simple cleaning operations, explore the data using numerical and graphic techniques, and we also examined three ways to initially analyze unstructured natural language data.

On the basis of this template we have gained a deeper understanding of the story the data had to tell us, and next we could take a deeper dive into narrative development, actionable recommendations, or KPIs and baselines to measure future data sets against.

If you have any methods, strategies, or experiences building out your own initial data exploration process and would like to share I would love to hear about them! Please let me know either in the comments or via email, and I will add them to this article (giving you full credit of course!).

I’d also be more than happy to answer any questions you may have after reading this, or if you have any problems setting up your own Jupyter Notebook environment and need a hand.

Thank you.