PostgreSQL Copy to/from feature

While completing an interview coding challenge recently, I learned of a PostgreSQL feature that lets you load large amounts of data into a database quickly.

Sometimes you get a request to seed large amounts of data from a text file or a CSV file into a database table. Other times, you get a request to export large amounts of data from a database table out to a CSV file. Let’s stick with the seeding example to see how we might do this in ActiveRecord. Below is an example of seeding data from a CSV into a users table.

Importing data from a CSV file
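A minimal sketch of the naive ActiveRecord approach. The file name, the name/email columns, and the User model are assumptions here; adjust them to match your actual schema.

```ruby
require "csv"

# Seed users from a CSV through ActiveRecord, one row at a time.
# Each create! issues a separate INSERT and runs all validations and
# callbacks -- which is exactly what makes this slow for millions of rows.
def seed_users_from_csv(path)
  CSV.foreach(path, headers: true) do |row|
    User.create!(name: row["name"], email: row["email"])
  end
end
```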

I didn’t test the code above, but it should look something like that, and it would work. However, if there is a large amount of user data, say millions of rows, it may take a while to run. Also, if the User model has a lot of ActiveRecord callbacks, we take a large performance hit because every row has to pass through the ActiveRecord abstraction layer.

How do we get around this issue and copy data to/from the database faster? Well, it turns out PostgreSQL has a neat little feature called COPY that can do this super fast. The link to the PostgreSQL documentation on COPY is below.

Basically, you can copy to and from a database table. Let’s convert the above script to use the COPY feature.
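A sketch of the converted script, assuming the same users table and columns as before. The statement itself is plain SQL, so in a Rails app you can send it straight through the raw connection.

```ruby
# Build a COPY statement that loads a server-side CSV file into the users
# table. CSV HEADER tells PostgreSQL the first line is a header row to skip.
def copy_users_sql(path)
  <<~SQL
    COPY users (name, email) FROM '#{path}' CSV HEADER
  SQL
end

# In a Rails app you would execute it like so:
#   ActiveRecord::Base.connection.execute(copy_users_sql("/path/to/users.csv"))
```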

Boom. And that will execute in seconds instead of potentially hours. Much faster. The first “users” refers to the table name and the list in the parentheses refers to the users table’s columns. You can actually omit the columns list if your CSV’s columns appear in the same order as the table’s columns; note that with “CSV HEADER” set like the example above, the header row is simply skipped rather than used to match columns, which is why I like a bit more specificity in my COPY statements.

One issue you may run into if you try executing the above in a production environment where the CSV file lives in the application directory is that files named in PostgreSQL’s COPY commands are read and written directly by the database server. This means the CSV file must reside on the machine where the database runs. In a typical production environment, this will not be the case, since the application server and the database server live on separate machines. Fortunately, there’s an easy way to get around this issue. Take a look at the modified script below.
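A sketch of the modified script using COPY ... FROM STDIN, which streams data over the client connection instead of reading a file on the database server. It uses the pg gem’s copy_data API on the raw connection; the table, columns, and path are assumptions.

```ruby
# Stream a CSV that lives on the application server into a database that
# lives on a separate server. `conn` is the pg gem's raw connection, e.g.
# ActiveRecord::Base.connection.raw_connection in a Rails app.
def copy_users_from_stdin(conn, path)
  conn.copy_data("COPY users (name, email) FROM STDIN CSV HEADER") do
    File.open(path, "r") do |file|
      # Read in chunks so the whole file is never held in memory.
      while (chunk = file.read(8192))
        conn.put_copy_data(chunk)
      end
    end
  end
end
```

copy_data wraps the COPY wire protocol: it sends the statement, lets you push raw data with put_copy_data, and finishes the copy cleanly when the block returns.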

This script is longer than the first example, but it allows you to copy CSV data that lives on the application server into the database in a separate server. So, if you ever do data seeding type of work in migration files in production environments, you would want to go with the example script above.

There’s one last “gotcha” you might run into when using the COPY feature with Ruby on Rails. Database tables that map to ActiveRecord models in a typical Rails application generally have created_at and updated_at columns that cannot be null. If you use the COPY feature above, you may run into a situation where the script fails because the created_at and updated_at columns cannot be null. This won’t be an issue if your CSV files have those values defined, but they most likely won’t. To get around this issue, generate a migration that sets the created_at and updated_at columns of all new database entries to the current datetime.
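A sketch of what that migration can run, assuming a users table: give the timestamp columns a database-level default of now(), so rows inserted via COPY without created_at/updated_at still satisfy their NOT NULL constraints. Inside a Rails migration you could pass this string to `execute`.

```ruby
# Build the ALTER TABLE statement that adds current-timestamp defaults
# to the Rails timestamp columns of the given table.
def timestamp_defaults_sql(table)
  <<~SQL
    ALTER TABLE #{table}
      ALTER COLUMN created_at SET DEFAULT now(),
      ALTER COLUMN updated_at SET DEFAULT now()
  SQL
end
```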

Exporting data out to a CSV file

Now, what about exporting data out to a CSV? This is pretty straightforward.
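A sketch of the export direction. COPY (query) TO writes the query’s results out as CSV; with a plain file path, the file is written on the database server, so for a remote database you would use TO STDOUT instead. The query and output path here are assumptions.

```ruby
# Build a COPY statement that exports a query's results to a CSV file
# on the database server, with a header row.
def export_users_sql(path)
  <<~SQL
    COPY (SELECT name, email FROM users) TO '#{path}' CSV HEADER
  SQL
end

# In a Rails app:
#   ActiveRecord::Base.connection.execute(export_users_sql("/tmp/users.csv"))
```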

Yep, that’s literally it. You can put any SQL query you want inside the parentheses.

Rails gem to help simplify this process

Some developers are either not comfortable writing raw SQL statements (I have to admit, I sometimes fall into this category as well if it’s a complex query) or simply prefer the abstraction that modern ORMs provide. If you’d rather perform the above operations in pretty Ruby code, there’s a gem called “postgres-copy” that provides this abstraction layer.

I personally haven’t used the gem yet because I feel fine writing raw COPY statements, but those who prefer performing these operations in Ruby should give it a try.