Learn Python Series (#30) - Data Science Part 1 - Pandas

Repository

https://github.com/pandas-dev/pandas
https://github.com/python/cpython

What will I learn?

You will learn what kind of toolset the pandas Python package is providing you with, how to install it (if you haven't installed it already in your current Python distribution), and import it into your projects;
how to convert data (either passed-in directly or read from another source) to a pandas DataFrame;
how to save data from a pandas DataFrame to an external file, such as CSV;
how to do some basic pandas data wrangling operations.

Requirements

A working modern computer running macOS, Windows or Ubuntu;
An installed Python 3(.7) distribution, such as (for example) the Anaconda Distribution;
The ambition to learn Python programming.

Difficulty

Beginner

Curriculum (of the `Learn Python Series`):

Additional sample code files

The full - and working! - iPython tutorial sample code file is included for you to download and run for yourself right here: https://github.com/realScipio/learn-python-series/blob/master/lps-030/learn-python-series-030-data-science-pt1-pandas.ipynb

GitHub Account

https://github.com/realScipio

Learn Python Series (#30) - Data Science Part 1 - Pandas

Welcome to episode #30 of the Learn Python Series already! It's been a while since I published my last (#29) tutorial episode on Python, after which I was busy with a number of projects, including co-developing and running UA and @steem-ua together with @holger80.

Not everybody realises that (although I can code) I'm not originally academically educated in Computer Science, ergo that I'm writing the Learn Python Series partially as a documentation project for my own Python research, study and development aspirations. By carefully writing these tutorials in a very structured format, almost (or even exactly) "book-like", I'm "cementing" my own Python knowledge and skills. Over the past months I've gained an interest in learning more about Data Science using Python, and I recently came to the conclusion that my own "research notes" were beginning to pile up, and I felt the need to better document my progress. What better way to do that than by resuming the Learn Python Series? So there you go...! ;-)

About Data Science

Data Science is about gaining insights from (huge) amounts of (structured) data by analysing that data, and about solving complex problems analytically and algorithmically; both the insights and the algorithmic solutions have the potential to generate much value. When you dig into (large / big) data sets, you might be able to discover insights that were previously hidden. The process of first exploring data, then investigating it to discover its characteristics and patterns, and then enriching it with other data often requires a combination of analytical skills and mathematical / business / tech creativity. I suppose data science sits at the intersection of those fields, which aligns with my own interests as well; that's why I personally find Data Science fascinating to learn more about.

About the Python package `pandas`

pandas is a well-known and actively developed Python package which can be summarised as a "data analysis, wrangling and management toolkit"; I suppose you could call it "Excel for Python" in a way. pandas provides powerful and flexible methods and data formats to aid data science tasks in Python, and it's built on top of NumPy ("Numerical Python", which we've already briefly talked about in episode #11 of the Learn Python Series).

pandas is positioned (as opposed to NumPy itself) as a more "high level" data analysis / wrangling toolkit, and, like Excel or OpenOffice "Calc", it works really well with "tabular data". Unlike Excel / Calc, pandas is able to handle really large data sets, with file sizes ranging from hundreds of megabytes to even gigabytes (or more!); try working with (or even opening!) those in a regular Excel / Calc application running on a regular personal computer!

pandas can therefore be used to -1- clean / munge / wrangle data sets, -2- analyse and (re-) model the data set, and -3- organise the data analysis (to plot, display in tabular form, and/or further process).

In short pandas is really powerful and cool, so let's dive right in!

Installing and importing `pandas`

If you're working with the Anaconda Python distribution, the pandas package is already installed by default, so you only need to import it in your project. If you haven't already installed pandas, that's as simple as:

pip install pandas

Then, create a new Python file, give it a relevant name (for example pandas_tut_1.py) and then simply begin with:

import pandas as pd

`pandas` Data Frame Basics

A DataFrame is a pandas data structure to represent tabular data (like a CSV file or an Excel spreadsheet with named columns and rows). Shortly hereafter, we'll be covering how to read-in an existing CSV file and convert it to a DataFrame object, but let's begin with creating a simple example DataFrame from scratch.

the `.DataFrame()` constructor

If we begin with a regular Python data object such as a dictionary, or a list of lists or tuples, pandas provides the .DataFrame() constructor to convert such data objects into a pandas DataFrame, for example like so:

import pandas as pd

weather_dict = {
    'day': ['1/1/2019', '1/2/2019', '1/3/2019', '1/4/2019', '1/5/2019'],
    'temp_celsius': [3, 2, -1, 0, 4]
}

df1 = pd.DataFrame(data=weather_dict)
df1

	day	temp_celsius
0	1/1/2019	3
1	1/2/2019	2
2	1/3/2019	-1
3	1/4/2019	0
4	1/5/2019	4

Explanation: after importing pandas as pd, and declaring a dictionary object with two keys (day and temp_celsius), each mapped to a list of 5 values, we converted the weather_dict dictionary object into a DataFrame object (called df1). Each dictionary key becomes a column name, and each list becomes that column's values.

Nota bene: as always, I'm writing this tutorial itself using Jupyter Notebook, which combines a Python interpreter, the markdown content, and a number of built-in Jupyter Notebook-specific methods and mechanisms. Running the above code inside a Jupyter Notebook outputs the df1 DataFrame simply because the variable df1 is the last expression in the cell. In case you want to print the df1 DataFrame contents from the command line after having coded the above in an external code editor (e.g. Microsoft Visual Studio Code), then you need to do:

print(df1)

        day  temp_celsius
0  1/1/2019             3
1  1/2/2019             2
2  1/3/2019            -1
3  1/4/2019             0
4  1/5/2019             4

(From here on I'm assuming you're following along on a Jupyter Notebook as well, hence I won't be explicitly printing the DataFrame objects every time in the remainder of this and following tutorial(s).)

Nota bene: in this particular (dictionary) example, I've been using a "top down" (column-by-column) approach, in which data is converted into a DataFrame object by dictionary keys. However, a more "logical" approach would be to insert that data "row-by-row", as the temperature value of "3 degrees Celsius" belongs to the associated date value "1/1/2019".
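One way to sketch that "row-by-row" idea is to pass the constructor a list of dictionaries, one dictionary per row. The names weather_rows and df_rows are just illustrative choices of mine, not part of the original sample code:

```python
import pandas as pd

# Each dictionary describes one complete row, so a date and its
# temperature stay together in the source data.
weather_rows = [
    {'day': '1/1/2019', 'temp_celsius': 3},
    {'day': '1/2/2019', 'temp_celsius': 2},
    {'day': '1/3/2019', 'temp_celsius': -1},
    {'day': '1/4/2019', 'temp_celsius': 0},
    {'day': '1/5/2019', 'temp_celsius': 4},
]

# The dictionary keys become the column names.
df_rows = pd.DataFrame(data=weather_rows)
print(df_rows)
```

The resulting DataFrame has the same two columns and five rows as df1.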

Another way to construct the same DataFrame is via a "list of lists", where each inner list is one row and the column names are passed as an additional constructor argument, like so:

import pandas as pd

weather_list = [
    ['1/1/2019', 3],
    ['1/2/2019', 2],
    ['1/3/2019', -1],
    ['1/4/2019', 0],
    ['1/5/2019', 4]
]

df2 = pd.DataFrame(data=weather_list, columns=['day', 'temp_celsius'])
df2

	day	temp_celsius
0	1/1/2019	3
1	1/2/2019	2
2	1/3/2019	-1
3	1/4/2019	0
4	1/5/2019	4

the `read_csv()` method

As we've just learned, the DataFrame() constructor needs to be passed a data= argument, which is the Python object holding the (example) data. But of course, when dealing with large data sets you're not going to declare all those values manually. Instead, you might already have saved them on disk, and you'd like to read that data from disk and convert it to a DataFrame object.

For exactly that purpose, pandas has the built-in function read_csv() (as well as a number of similar functions for other file types). Suppose the CSV file weather.csv exists in your current working directory, then you can construct the exact same DataFrame object like so:

import pandas as pd
df3 = pd.read_csv('weather.csv')
df3

	day	temp_celsius
0	1/1/2019	3
1	1/2/2019	2
2	1/3/2019	-1
3	1/4/2019	0
4	1/5/2019	4
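If you don't have a weather.csv file on disk yet, here is a minimal sketch that feeds the same CSV text to read_csv() from memory. I'm assuming the file contents below (they match the example data set), and csv_text and df_csv are names I made up for this illustration:

```python
import io
import pandas as pd

# The assumed contents of weather.csv: a header row followed by 5 data rows.
csv_text = """day,temp_celsius
1/1/2019,3
1/2/2019,2
1/3/2019,-1
1/4/2019,0
1/5/2019,4
"""

# read_csv() accepts a file path, but also any file-like object,
# such as this in-memory text buffer.
df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv)
```

The first line of the CSV text is used for the column names, and pandas adds its own 0 to 4 index.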

the `to_csv()` method

pandas also allows you to go the opposite route: to export DataFrame objects and save them to disk as .csv files. The to_csv() method is used for that.

Nota bene: to create the weather.csv example file (the one we just read via read_csv()) from the df2 DataFrame object we constructed before, it's convenient not to save the 0,1,2,3,4 index values to the CSV file (those index values are added by pandas and don't belong to the original data set). By default (at least in pandas version 0.24.0, the current version at the time of writing) both the index values and the column names / headers (the first row of the CSV file) are exported to CSV. Since we do want the column names included in the CSV file, but not the default pandas index values, we set the index= parameter to None (and leave the header= parameter at its default: True). As the first argument we pass the file name (optionally with a file path, in case you want to save it in a directory other than your current working directory):

df2.to_csv('weather.csv', index=None)

After running the above to_csv() code line, your file 'weather.csv' should be saved as a valid CSV file, located in your current working directory.
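To see what the index= parameter actually changes, here is a small sketch that doesn't write any file: when to_csv() is called without a file name, it returns the CSV text as a string. I'm using index=False here, the boolean spelling of the parameter, which should have the same effect as the index=None used above; the variable names are my own:

```python
import pandas as pd

df = pd.DataFrame({
    'day': ['1/1/2019', '1/2/2019'],
    'temp_celsius': [3, 2],
})

# No file name passed: to_csv() returns the CSV text instead of saving it.
with_index = df.to_csv()                # default: index values are included
without_index = df.to_csv(index=False)  # index values are left out

print(with_index)
print(without_index)
```

In the first output every row starts with its index value (and the header row starts with an empty field); in the second output only the original data remains.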

the `.head()` and `.tail()` methods

When working with large data sets, it's often convenient to quickly inspect the data you're working with, without having to "eyeball" large amounts of data. To display only the top 5 rows of your DataFrame (together with the column names and index numbers) you can use the .head() method, and to display only the bottom 5 rows of your DataFrame you can use .tail().

Nota bene: Please note that our (very simple) example weather.csv data set only contains 5 rows in total, for simplicity / explanation purposes; ergo, in this specific example case you wouldn't notice a difference when running either ...

df3, or
df3.head(), or
df3.tail()

However, you can also pass an integer N to both .head() and .tail(), to only show N rows at either the top or the bottom of your DataFrame, for example:

df3.head(2)

	day	temp_celsius
0	1/1/2019	3
1	1/2/2019	2

df3.tail(2)

	day	temp_celsius
3	1/4/2019	0
4	1/5/2019	4

In these specific examples, by passing the integer value of 2 to both .head(2) and .tail(2) we only show the top and bottom 2 lines of the DataFrame, respectively.
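Because the weather example is too small to show the default of 5 rows in action, here is a sketch with a made-up 20-row DataFrame (df_big and its columns are purely illustrative):

```python
import pandas as pd

# A hypothetical data set with 20 rows: the numbers 0..19 and their squares.
df_big = pd.DataFrame({
    'n': range(20),
    'n_squared': [n * n for n in range(20)],
})

top = df_big.head()           # no argument: the first 5 rows
bottom = df_big.tail()        # no argument: the last 5 rows
first_three = df_big.head(3)  # only the first 3 rows

print(top)
print(bottom)
print(first_three)
```

Note that the rows returned by .tail() keep their original index numbers (15 to 19 here).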

Index slicing

If you're interested in using only a specific set of DataFrame rows, you can use index slices, just like we've already learned about with regular Python lists.

For example, if we only want to work with rows 1 and 2:

df3[1:3]

	day	temp_celsius
1	1/2/2019	2
2	1/3/2019	-1

Nota bene: the stop parameter is non-inclusive, ergo df3[1:3] means "begin with row 1 and stop before row number 3", hence, it shows rows 1 and 2.

In case you want to work with the entire DataFrame beginning with row number 2, then use:

df3[2:]

	day	temp_celsius
2	1/3/2019	-1
3	1/4/2019	0
4	1/5/2019	4

And in case you want to work with the entire DataFrame up to (but not including) row number 3, then use:

df3[:3]

	day	temp_celsius
0	1/1/2019	3
1	1/2/2019	2
2	1/3/2019	-1
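As a side note, the same row slices can also be written with the .iloc indexer, which selects rows by position. The sketch below rebuilds the example data so it runs on its own; the variable names plain and by_position are mine:

```python
import pandas as pd

df = pd.DataFrame({
    'day': ['1/1/2019', '1/2/2019', '1/3/2019', '1/4/2019', '1/5/2019'],
    'temp_celsius': [3, 2, -1, 0, 4],
})

plain = df[1:3]             # slice directly on the DataFrame
by_position = df.iloc[1:3]  # the same slice, via the positional indexer

# Both select rows 1 and 2; the stop value 3 is non-inclusive.
print(plain.equals(by_position))
```

With the default 0,1,2,... index both spellings return exactly the same rows.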

the `columns` attribute / property

If you want to assign, return, or print all column names your DataFrame holds, access the columns attribute / property (without parentheses, since it's not a method), like so:

df3.columns

Index(['day', 'temp_celsius'], dtype='object')
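A short sketch of both reading and assigning the columns attribute; the new column names 'date' and 'temperature_c' are just examples I picked:

```python
import pandas as pd

df = pd.DataFrame({
    'day': ['1/1/2019', '1/2/2019'],
    'temp_celsius': [3, 2],
})

# Reading: convert the Index object to a regular Python list.
names = list(df.columns)
print(names)

# Assigning: replace all column names at once
# (the new list must have one name per existing column).
df.columns = ['date', 'temperature_c']
print(df.columns)
```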

the `shape` attribute / property

Accessing the shape property returns a tuple of the DataFrame's "size" or "shape" in the form of (num_rows, num_columns), like so:

df3.shape

(5, 2)
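Since shape is a plain tuple, you can unpack it into two variables. A minimal sketch, rebuilding the 5-row, 2-column example data so it runs on its own:

```python
import pandas as pd

df = pd.DataFrame({
    'day': ['1/1/2019', '1/2/2019', '1/3/2019', '1/4/2019', '1/5/2019'],
    'temp_celsius': [3, 2, -1, 0, 4],
})

# Unpack the (num_rows, num_columns) tuple.
num_rows, num_columns = df.shape
print(num_rows, num_columns)

# len() on a DataFrame gives the number of rows as well.
print(len(df) == num_rows)
```

The index itself isn't counted as a column, which is why the example data has 2 columns and not 3.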

Nota bene: it's
