Learn Python Series (#30) - Data Science Part 1 - Pandas
Repository
- https://github.com/pandas-dev/pandas
- https://github.com/python/cpython
What will I learn?
- You will learn what kind of toolset the pandas Python package provides, how to install it (if it isn't already part of your current Python distribution), and how to import it into your projects;
- how to convert data (either passed in directly or read from another source) to a pandas DataFrame;
- how to save data from a pandas DataFrame to an external file, such as CSV;
- how to do some basic pandas data wrangling operations.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.7) distribution, such as (for example) the Anaconda Distribution;
- The ambition to learn Python programming.
Difficulty
- Beginner
Curriculum (of the Learn Python Series):
- Learn Python Series - Intro
- Learn Python Series (#2) - Handling Strings Part 1
- Learn Python Series (#3) - Handling Strings Part 2
- Learn Python Series (#4) - Round-Up #1
- Learn Python Series (#5) - Handling Lists Part 1
- Learn Python Series (#6) - Handling Lists Part 2
- Learn Python Series (#7) - Handling Dictionaries
- Learn Python Series (#8) - Handling Tuples
- Learn Python Series (#9) - Using Import
- Learn Python Series (#10) - Matplotlib Part 1
- Learn Python Series (#11) - NumPy Part 1
- Learn Python Series (#12) - Handling Files
- Learn Python Series (#13) - Mini Project - Developing a Web Crawler Part 1
- Learn Python Series (#14) - Mini Project - Developing a Web Crawler Part 2
- Learn Python Series (#15) - Handling JSON
- Learn Python Series (#16) - Mini Project - Developing a Web Crawler Part 3
- Learn Python Series (#17) - Roundup #2 - Combining and analyzing any-to-any multi-currency historical data
- Learn Python Series (#18) - PyMongo Part 1
- Learn Python Series (#19) - PyMongo Part 2
- Learn Python Series (#20) - PyMongo Part 3
- Learn Python Series (#21) - Handling Dates and Time Part 1
- Learn Python Series (#22) - Handling Dates and Time Part 2
- Learn Python Series (#23) - Handling Regular Expressions Part 1
- Learn Python Series (#24) - Handling Regular Expressions Part 2
- Learn Python Series (#25) - Handling Regular Expressions Part 3
- Learn Python Series (#26) - pipenv & Visual Studio Code
- Learn Python Series (#27) - Handling Strings Part 3 (F-Strings)
- Learn Python Series (#28) - Using Pickle and Shelve
- Learn Python Series (#29) - Handling CSV
Additional sample code files
The full - and working! - iPython tutorial sample code file is included for you to download and run for yourself right here: https://github.com/realScipio/learn-python-series/blob/master/lps-030/learn-python-series-030-data-science-pt1-pandas.ipynb
GitHub Account
https://github.com/realScipio
Welcome to episode #30 of the Learn Python Series! It's been a while since I published my last (#29) tutorial episode on Python, after which I was busy with a number of projects, including co-developing and running UA and @steem-ua together with @holger80.
Not everybody realises that (although I can code) I'm not originally academically educated in Computer Sciences, and that I'm writing the Learn Python Series partially as a documentation project on my own Python research, study and development aspirations. By carefully writing these tutorials in a very structured format, almost (or even exactly) "book-like", I'm "cementing" my own Python knowledge and skills. Over the past months I've gained an interest in learning more about Data Science using Python, and I recently came to the conclusion that my own "research notes" were beginning to pile up, and felt the need to better document my progress. How better to do that than by resuming the Learn Python Series? So there you go...! ;-)
About Data Science
Data Science is about gaining insights from (huge) amounts of (structured) data by analysing that data, and about solving complex problems analytically and algorithmically; those insights and algorithmic solutions have the potential to generate much value. When you dig into (large / big) data sets, you might be able to discover new insights that were previously hidden. The process of first exploring data, investigating that data to discover data characteristics and patterns, and enriching that data with other data often requires a combination of analytical skills and mathematical / business / tech creativity and skill. I suppose data science is positioned in the intersecting areas of those fields, which aligns with my own interests as well; that is why I personally find Data Science fascinating to learn more about.
About the Python package pandas
pandas is a well-known and actively developed Python package which can be summarised as a "data analysis, wrangling and management toolkit"; I suppose you could call it "Excel for Python" in a way. pandas provides powerful and flexible methods and data formats to aid data science tasks using Python, and it's built on top of numpy ("Numerical Python", which we've already briefly talked about in episode #11 of the Learn Python Series).
pandas is positioned (as opposed to NumPy itself) as a more "high level" data analysis / wrangling toolkit, and - like Excel or OpenOffice "Calc" - it works really well with "tabular data". Unlike Excel / Calc, pandas is able to handle really large data sets, with file sizes ranging from hundreds of megabytes to even gigabytes (or more!); try working with (or even opening!) those in a regular Excel / Calc application running on a regular personal computer!
pandas can therefore be used to -1- clean / munge / wrangle data sets, -2- analyse and (re-) model the data set, and -3- organise the data analysis (to plot, display in tabular form, and/or further process).
In short, pandas is really powerful and cool, so let's dive right in!
Installing and importing pandas
If you're working with the Anaconda Python distribution, the pandas package is already installed by default, so you only need to import it in your project. If you haven't already installed pandas, that's as simple as:
pip install pandas
Then, create a new Python file, give it a relevant name (for example pandas_tut_1.py) and then simply begin with:
import pandas as pd
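A quick way to confirm that the installation and the import both worked is to print the installed version. This is only a minimal sketch; the version string you see depends on your own installation:

```python
import pandas as pd

# If the import above succeeded, pandas exposes its version as a string.
print(pd.__version__)
```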
pandas DataFrame Basics
A DataFrame is a pandas data structure to represent tabular data (like a CSV file or an Excel spreadsheet with named columns and rows). Shortly hereafter, we'll be covering how to read in an existing CSV file and convert it to a DataFrame object, but let's begin with creating a simple example DataFrame from scratch.
the .DataFrame() constructor
If we begin with a regular Python data object such as a dictionary, or a list of lists or tuples, pandas provides the .DataFrame() constructor to convert such data objects into a pandas DataFrame, for example like so:
import pandas as pd
weather_dict = {
    'day': ['1/1/2019', '1/2/2019', '1/3/2019', '1/4/2019', '1/5/2019'],
    'temp_celsius': [3, 2, -1, 0, 4]
}
df1 = pd.DataFrame(data=weather_dict)
df1
|   | day | temp_celsius |
|---|---|---|
| 0 | 1/1/2019 | 3 |
| 1 | 1/2/2019 | 2 |
| 2 | 1/3/2019 | -1 |
| 3 | 1/4/2019 | 0 |
| 4 | 1/5/2019 | 4 |
Explanation: after importing pandas as pd, and declaring a dictionary object with two keys (day and temp_celsius), each containing one list with 5 values, we then converted the weather_dict dictionary object into a DataFrame object (called df1).
Nota bene: as always, I'm writing this tutorial itself using Jupyter Notebook, which contains a Python interpreter, the markdown content, and a number of built-in Jupyter Notebook-specific methods and mechanisms. Running the above code inside a Jupyter Notebook prints/outputs the df1 DataFrame simply by calling the variable df1. In case you want to print the df1 DataFrame contents from the command line after having coded the above in an external code editor (e.g. Microsoft Visual Studio Code), then you need to do:
print(df1)
        day  temp_celsius
0  1/1/2019             3
1  1/2/2019             2
2  1/3/2019            -1
3  1/4/2019             0
4  1/5/2019             4
(From here on I'm assuming you're following along on a Jupyter Notebook as well, hence I won't be explicitly printing the DataFrame objects every time in the remainder of this and following tutorial(s).)
Nota bene: in this particular (dictionary) example, I've been using a "top down" approach, in which data is converted into a DataFrame object column by column, one dictionary key per column. However, a more "logical" approach would be to insert that data "row-by-row", as the temperature value of "3 degrees Celsius" belongs to the associated date value "1/1/2019".
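One way to express that "row-by-row" idea while still using dictionaries is to pass a list of dictionaries, one per row. The sketch below assumes the same example values as above; the names weather_rows and df_rows are just illustrative:

```python
import pandas as pd

# Each dictionary is one row, so a date and its temperature stay together.
weather_rows = [
    {'day': '1/1/2019', 'temp_celsius': 3},
    {'day': '1/2/2019', 'temp_celsius': 2},
    {'day': '1/3/2019', 'temp_celsius': -1},
    {'day': '1/4/2019', 'temp_celsius': 0},
    {'day': '1/5/2019', 'temp_celsius': 4},
]

# The dictionary keys become the column names.
df_rows = pd.DataFrame(data=weather_rows)
print(df_rows)
```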
Another way to construct the same DataFrame is via a "list of lists", with the column names passed as an additional constructor argument, like so:
import pandas as pd
weather_list = [
    ['1/1/2019', 3],
    ['1/2/2019', 2],
    ['1/3/2019', -1],
    ['1/4/2019', 0],
    ['1/5/2019', 4]
]
df2 = pd.DataFrame(data=weather_list, columns=['day', 'temp_celsius'])
df2
|   | day | temp_celsius |
|---|---|---|
| 0 | 1/1/2019 | 3 |
| 1 | 1/2/2019 | 2 |
| 2 | 1/3/2019 | -1 |
| 3 | 1/4/2019 | 0 |
| 4 | 1/5/2019 | 4 |
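To check that the column-wise and row-wise constructions really produce the same result, the two DataFrames can be compared with the .equals() method. This is a small sketch that uses a list of tuples for the row-wise variant; the variable names are illustrative only:

```python
import pandas as pd

# Column-wise: one dictionary key per column.
df_from_dict = pd.DataFrame(data={
    'day': ['1/1/2019', '1/2/2019', '1/3/2019'],
    'temp_celsius': [3, 2, -1],
})

# Row-wise: one tuple per row, column names passed separately.
df_from_tuples = pd.DataFrame(
    data=[('1/1/2019', 3), ('1/2/2019', 2), ('1/3/2019', -1)],
    columns=['day', 'temp_celsius'],
)

# .equals() compares shape, values and column types.
print(df_from_dict.equals(df_from_tuples))
```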
the read_csv() method
As we've just learned, the DataFrame() constructor can be passed a data= argument, which is the Python object holding the (example) data. But of course, when dealing with large data sets you're not going to declare all those values manually. Instead, you might already have saved them on disk, and you'd like to read the data from disk and then convert it to a DataFrame object.
For exactly that purpose, pandas has the built-in method read_csv() (as well as a number of similar methods for other file types). Suppose the CSV file weather.csv exists in your current working directory, then you can construct the exact same DataFrame object like so:
import pandas as pd
df3 = pd.read_csv('weather.csv')
df3
|   | day | temp_celsius |
|---|---|---|
| 0 | 1/1/2019 | 3 |
| 1 | 1/2/2019 | 2 |
| 2 | 1/3/2019 | -1 |
| 3 | 1/4/2019 | 0 |
| 4 | 1/5/2019 | 4 |
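If you don't have a weather.csv file at hand yet, you can still try read_csv() by feeding it CSV text from memory. The sketch below assumes the file content looks like the csv_text string (a header row followed by five data rows) and wraps it in io.StringIO, which read_csv() accepts in place of a file name:

```python
import io
import pandas as pd

# Assumed content of weather.csv: a header row plus five data rows.
csv_text = """day,temp_celsius
1/1/2019,3
1/2/2019,2
1/3/2019,-1
1/4/2019,0
1/5/2019,4
"""

# read_csv() accepts a file path or any file-like object.
df_csv = pd.read_csv(io.StringIO(csv_text))

# The first row is used as column names; numeric columns are parsed as numbers.
print(df_csv.dtypes)
```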
the to_csv() method
pandas also allows you to go the opposite route: to export DataFrame objects and save them to disk as .csv files. The to_csv() method is used for that.
Nota bene: in order to save the weather.csv example file (that we just read via read_csv()) from the df2 DataFrame object we constructed before, it's convenient not to save the 0,1,2,3,4 index values to the CSV file (those index values are generated by pandas, and don't belong to the original data set). By default (at least in pandas version 0.24.0, the current version at the time of writing) those index values would be exported to CSV, and so would the column names / headers (as the first row of the CSV file). Since we do want the column names included in the CSV file, but not the pandas default index values, we set the index= parameter to False (and leave the header= parameter at its default: True). As the first argument we pass the file name (optionally with a file path, in case you want to save it in a directory other than your current working directory):
df2.to_csv('weather.csv', index=False)
After running the above to_csv() code line, your file 'weather.csv' should be saved as a valid CSV file, located in your current working directory.
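To see what the index= parameter does without writing anything to disk, you can call to_csv() without a file name, in which case it returns the CSV text as a string. The sketch below uses index=False, the boolean spelling used in the pandas documentation for this parameter; the variable names are illustrative:

```python
import pandas as pd

df = pd.DataFrame(data={
    'day': ['1/1/2019', '1/2/2019'],
    'temp_celsius': [3, 2],
})

# Without a file name, to_csv() returns the CSV content as a string.
with_index = df.to_csv()
without_index = df.to_csv(index=False)

# The default output starts every row with the index value
# (and the header row with an empty first field).
print(with_index)

# With index=False only the original columns are written.
print(without_index)
```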
the .head() and .tail() methods
When working with large data sets, it's often convenient to quickly inspect the data you're working with, without wanting to "eyeball" large amounts of data. To only display the top 5 lines of your DataFrame (including column names and index numbers) you can use the .head() method, and to only display the bottom 5 lines of your DataFrame you can use .tail().
Nota bene: Please note that our (very simple) example weather.csv data set only contains 5 rows in total, for simplicity of explanation; ergo, in this specific example case you wouldn't notice a difference when running either df3, or df3.head(), or df3.tail().
However, you can also pass an integer N to both .head() and .tail(), to only show N lines either at the top or bottom of your DataFrame, for example:
df3.head(2)
|   | day | temp_celsius |
|---|---|---|
| 0 | 1/1/2019 | 3 |
| 1 | 1/2/2019 | 2 |
df3.tail(2)
|   | day | temp_celsius |
|---|---|---|
| 3 | 1/4/2019 | 0 |
| 4 | 1/5/2019 | 4 |
In these specific examples, by passing the integer value of 2 to both .head(2) and .tail(2) we only show the top and bottom 2 lines of the DataFrame, respectively.
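The default of 5 lines only becomes visible on a DataFrame with more than 5 rows. As a small sketch, the made-up DataFrame df_big below holds 100 rows with a single column n:

```python
import pandas as pd

# A hypothetical larger data set: 100 rows, one column.
df_big = pd.DataFrame(data={'n': range(100)})

# Without an argument, head() and tail() return 5 rows each.
print(df_big.head())

# With an argument N they return exactly N rows.
print(df_big.tail(3))
```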
Index slicing
If you only want to use a specific set of DataFrame rows, you can use index slices, just like the ones we've already learned about for regular Python lists.
For example, if we only want to work with rows 1 and 2:
df3[1:3]
|   | day | temp_celsius |
|---|---|---|
| 1 | 1/2/2019 | 2 |
| 2 | 1/3/2019 | -1 |
Nota bene: the stop parameter is non-inclusive, ergo df3[1:3] means "begin with row 1 and stop at row number 3", hence, it shows rows 1 and 2.
In case you want to work with the entire DataFrame beginning with row number 2, then use:
df3[2:]
|   | day | temp_celsius |
|---|---|---|
| 2 | 1/3/2019 | -1 |
| 3 | 1/4/2019 | 0 |
| 4 | 1/5/2019 | 4 |
And in case you want to work with the entire DataFrame up to (but not including) row number 3, then use:
df3[:3]
|   | day | temp_celsius |
|---|---|---|
| 0 | 1/1/2019 | 3 |
| 1 | 1/2/2019 | 2 |
| 2 | 1/3/2019 | -1 |
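pandas also offers the .iloc indexer for position-based row selection, which takes the same start:stop slices. The sketch below rebuilds the example data (as df, an illustrative name) and shows that both spellings select the same rows:

```python
import pandas as pd

df = pd.DataFrame(data={
    'day': ['1/1/2019', '1/2/2019', '1/3/2019', '1/4/2019', '1/5/2019'],
    'temp_celsius': [3, 2, -1, 0, 4],
})

# Plain slicing and .iloc slicing both select rows by position;
# the stop value is non-inclusive in both cases.
rows_plain = df[1:3]
rows_iloc = df.iloc[1:3]

print(rows_plain.equals(rows_iloc))
```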
the columns attribute / property
If you want to assign, return, or print all column names your DataFrame holds, access the columns attribute / property, like so:
df3.columns
Index(['day', 'temp_celsius'], dtype='object')
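The returned Index object behaves much like a sequence of column names. As a small sketch (with illustrative variable names), it can be converted to a regular Python list or used for membership tests:

```python
import pandas as pd

df = pd.DataFrame(data={
    'day': ['1/1/2019', '1/2/2019'],
    'temp_celsius': [3, 2],
})

# Convert the Index object to a plain Python list of column names.
column_names = list(df.columns)
print(column_names)

# Check whether a column exists before using it.
has_day = 'day' in df.columns
has_humidity = 'humidity' in df.columns
print(has_day, has_humidity)
```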
the shape attribute / property
Accessing the shape property returns a tuple with the DataFrame's "size" or "shape" in the form of (num_rows, num_columns), like so:
df3.shape
(5, 2)
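Because shape is a regular tuple, it can be unpacked into separate variables. The sketch below rebuilds the example data (as df, an illustrative name); note that the index is not counted as a column:

```python
import pandas as pd

df = pd.DataFrame(data={
    'day': ['1/1/2019', '1/2/2019', '1/3/2019', '1/4/2019', '1/5/2019'],
    'temp_celsius': [3, 2, -1, 0, 4],
})

# Unpack the (num_rows, num_columns) tuple.
num_rows, num_columns = df.shape
print(num_rows, num_columns)

# len() on a DataFrame returns the number of rows.
print(len(df) == num_rows)
```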
Nota bene: it's