Setting up databases with PostgreSQL, PSequel, and Python – Towards Data Science
Hamza B., PhD
As the demand for data scientists continues to increase, and the role is dubbed the "sexiest job of the 21st century" by various outlets (including Harvard Business Review), questions have been asked about which skills aspiring data scientists should master on their way to their first data analyst job.
There is now a plethora of online courses to gain the skills a data scientist needs to be good at their job (excellent reviews of online resources here and here). However, as I reviewed the various courses myself, I noticed that a lot of focus is put on exciting and flashy topics like Machine Learning and Deep Learning, without covering the basics of how to gather and store the datasets needed for such analysis.
Before we go into PostgreSQL, I suspect many of you have the same question: why should I care about SQL?
Although database management may seem like a boring topic for aspiring data scientists — implementing a dog breed classifier is very rewarding I know! — it is a necessary skill once you join the industry and the data supports this: SQL remains the most common and in-demand skill listed in LinkedIn job postings for data science jobs.
Pandas can perform the most common SQL operations well, but it is not suitable for large databases: its main limitation is the amount of data that can fit in memory. Hence, if a data scientist is working with large databases, SQL is used to cut the data down to something manageable for pandas before loading it into memory.
Furthermore, SQL is much more than just a method to have flat files dropped into a table. The power of SQL comes from the way it allows users to have a set of tables that "relate" to one another; this is often represented in an "Entity Relationship Diagram" (ERD).
Many data scientists use both simultaneously — they use SQL queries to join, slice and load data into memory; then they do the bulk of the data analysis in Python using pandas library functions.
This is particularly important when dealing with the large datasets found in Big Data applications. Such applications may involve databases of tens of terabytes holding several billion rows.
Data scientists often start with SQL queries that extract the 1% of the data they need into a CSV file, before moving to pandas in Python for the actual analysis.
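This workflow can be sketched in a few lines. The snippet below uses Python's built-in sqlite3 module as a stand-in for PostgreSQL so that it runs anywhere, but the pattern is the same: do the heavy filtering in SQL, then pull only the small result set into Python (for example into a pandas DataFrame via pandas.read_sql_query). The table and column names are invented for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for a PostgreSQL server.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE orders (id INTEGER, amount REAL, status TEXT)")
cur.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 250.0, "shipped"), (2, 40.0, "pending"), (3, 975.0, "shipped")],
)
conn.commit()

# Push the filtering and sorting down to SQL, then fetch only the slice
# needed for analysis in Python.
cur.execute(
    "SELECT id, amount FROM orders WHERE status = 'shipped' ORDER BY amount DESC"
)
rows = cur.fetchall()
print(rows)  # [(3, 975.0), (1, 250.0)]
conn.close()
```

With a real PostgreSQL database, only the connection line changes (psycopg2 instead of sqlite3); the SQL-first pattern stays the same.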
There is a way to learn SQL without leaving the much-loved Python environment in which so many Machine Learning and Deep Learning techniques are taught and used: PostgreSQL.
PostgreSQL allows you to leverage the amazing Pandas library for data wrangling when dealing with large datasets that are not stored in flat files but rather in databases.
PostgreSQL can be installed in Windows, Mac, and Linux environments (see install details here). If you have a Mac, I’d highly recommend installing the Postgres.App SQL environment. For Windows, check out BigSQL.
PostgreSQL uses a client/server model. This involves two running processes:

- A server process, which manages the database files and performs database actions on behalf of clients.
- A client application, which connects to the server and sends it requests (in our case, the PSequel GUI or Python).
In real-case scenarios, the client and the server will often be on different hosts and they would communicate over a TCP/IP network connection.
I’ll be focusing on Postgres.App for Mac OS in the rest of the tutorial.
After installing Postgres.App, you can set up your first database by following the instructions below:
You now have a PostgreSQL server running on your Mac with default settings:
Host: localhost, Port: 5432, Connection URL: postgresql://localhost
The PostgreSQL GUI client we'll use in this tutorial is PSequel. It has a minimalist, easy-to-use interface that makes common PostgreSQL tasks painless.
Once Postgres.App and PSequel are installed, you are ready to set up your first database! First, open Postgres.App and you'll see a little elephant icon appear in the top menu.
You'll also notice a button that allows you to "Open psql". This opens a command line where you can enter commands. It is mostly used to create databases, which we will do with the following command:
create database sample_db;
Then, we connect to the database we just created using PSequel. Open PSequel, enter the database's name, in our case sample_db, and click on "Connect" to connect to the database.

Let's now create a table (consisting of rows and columns) in PSequel. We define the table's name, and the name and type of each column.
The available datatypes in PostgreSQL for the columns (i.e. variables), can be found on the PostgreSQL Datatypes Documentation.
In this tutorial, we'll create a simple table of world countries. The first column will give each country an 'id' integer, and the second column will hold the country's name as a variable-length character string (255 characters max).
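A CREATE TABLE statement matching this description would look like the following (the table name country_list is an assumption, chosen to be consistent with the Python example later in the article):

```sql
-- An 'id' integer and a country name of up to 255 characters.
CREATE TABLE country_list (
    id INTEGER,
    name VARCHAR(255)
);
```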
Once ready, click “Run Query”. The table will then be created in the database. Don’t forget to click the “Refresh” icon (bottom right) to see the table listed.
We are now ready to populate our columns with data. There are many different ways to populate a table in a database. To enter data manually, the INSERT statement will come in handy. For instance, to enter the country Morocco with id number 1, and Australia with id number 2, we run an INSERT query for each row. After running the query and refreshing the tables, the two new rows appear in the table.
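The INSERT statements described above would look like this (assuming the table is named country_list, as in the Python example later in this article):

```sql
INSERT INTO country_list (id, name) VALUES (1, 'Morocco');
INSERT INTO country_list (id, name) VALUES (2, 'Australia');
```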
In practice, populating the tables in the database manually is not feasible. It is more likely that the data of interest is stored in a CSV file. To import a CSV file into the country_list_csv table, you use the COPY statement. The table, with its column names, is specified right after the COPY keyword, and the columns must be ordered in the same fashion as in the CSV file. The CSV file path is specified after the FROM keyword, and the DELIMITER must also be specified. If the CSV file contains a header line with column names, this is indicated with the HEADER keyword so that PostgreSQL ignores the first line when importing the data. In our case, the command would take the form COPY country_list_csv(id, name) FROM '/path/to/file.csv' DELIMITER ',' CSV HEADER; (the file path here is a placeholder for wherever the CSV lives on your machine).

The key to SQL is understanding statements. A few statements include:
SELECT is where you tell the query what columns you want back.
FROM is where you tell the query what table you are querying from. Note that the columns you select need to exist in this table. For example, let's say we have a table of orders with several columns, but we are only interested in a subset of three.
Also, the LIMIT statement is useful when you want to see just the first few rows of a table; this can be much faster than loading the entire dataset. The ORDER BY statement allows us to order the results by any column. We can use these two statements together to query a table of 'orders' in a database.
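Putting these statements together, a query against a hypothetical orders table might look like the following (the column names are assumptions for illustration):

```sql
-- Select three columns, sort by order total, and keep only the top 10 rows.
SELECT id, account_id, total
FROM orders
ORDER BY total DESC
LIMIT 10;
```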
Now that you know how to set up a database, create a table, and populate it in PostgreSQL, you can explore the other common SQL commands as explained in the following tutorials:
Once your PostgreSQL database and tables are set up, you can move to Python to perform any data analysis or wrangling required.
PostgreSQL can be integrated with Python using the psycopg2 module, a popular PostgreSQL database adapter for Python. It is not part of the Python standard library, but it can be installed with pip (for example, pip install psycopg2-binary).
Connecting to an existing PostgreSQL database can be achieved with:
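A minimal sketch, assuming the Postgres.App defaults described earlier and that psycopg2 is installed; the user is often your macOS username under Postgres.App, so adjust the parameters to your environment:

```python
import psycopg2

# Assumed connection parameters: Postgres.App's default server on
# localhost:5432 and the sample_db database created above.
conn = psycopg2.connect(
    host="localhost",
    port=5432,
    database="sample_db",
)

cur = conn.cursor()
cur.execute("SELECT version();")
print(cur.fetchone())  # a tuple containing the server's version string

cur.close()
conn.close()
```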
Going back to our country_list table example, inserting records into the table in sample_db can be accomplished in Python with the following commands:
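A sketch of those commands, assuming the country_list table and the local server defaults from earlier (this requires a running PostgreSQL server to execute):

```python
import psycopg2

# Connect to the sample_db database created earlier.
conn = psycopg2.connect(host="localhost", port=5432, database="sample_db")
cur = conn.cursor()

# Parameterized queries (%s placeholders) let psycopg2 handle quoting safely.
cur.execute("INSERT INTO country_list (id, name) VALUES (%s, %s)", (1, "Morocco"))
cur.execute("INSERT INTO country_list (id, name) VALUES (%s, %s)", (2, "Australia"))

conn.commit()  # persist the inserts
cur.close()
conn.close()
```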
Other commands for creating, populating and querying tables can be found on various tutorials on Tutorial Points and PostgreSQL Tutorial.
You now have a working PostgreSQL database server ready for you to populate and play with. It’s powerful, flexible, free and is used by numerous applications.