
COPY millions of rows to PostgreSQL with Rails

ActiveRecord is great when you need to quickly access and manipulate a few rows within a database. Loading records into a database is just as easy: seed files and custom rake tasks make inserting records a breeze. But what happens when you need to import lots of rows? To clarify, when I say lots of rows, I’m talking about millions of rows from a delimited text file.

So many rows, so little time

My first instinct was to read in the text file and call Model.create on each row, as in the sketch below. This took a long, long time (I actually gave up on it).
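For reference, that first attempt looked roughly like this. LargeTable, the file path, and the column names are placeholders standing in for the real schema:

    # One INSERT per line, each wrapped in ActiveRecord callbacks and
    # validations. That per-row overhead is what makes this so slow.
    File.foreach("data.csv") do |line|
      name, value = line.chomp.split(",")
      LargeTable.create!(name: name, value: value)
    end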

Next, I tried batching the rows into an array in an effort to limit the number of database calls, then imported each batch using the activerecord-import gem, as sketched below. Batching helps if your recordset is a few thousand rows, but it doesn’t scale efficiently to a million or more.
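A sketch of the batched version, again with placeholder names and a batch size I picked for illustration. activerecord-import’s Model.import turns each batch into a single multi-row INSERT:

    batch = []

    File.foreach("data.csv") do |line|
      name, value = line.chomp.split(",")
      batch << LargeTable.new(name: name, value: value)

      # Flush every 1,000 rows so memory stays bounded and each
      # database round trip inserts a whole batch at once.
      if batch.size >= 1_000
        LargeTable.import(batch)
        batch = []
      end
    end

    # Import whatever is left over after the last full batch.
    LargeTable.import(batch) if batch.any?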

COPY to the rescue

In order to get into the millions, I needed to take Rails out of the equation as much as possible. This meant using ActiveRecord’s raw_connection and the COPY command from PostgreSQL.

Due to permission issues, you most likely will not be able to use the filename option of the COPY command. This turns out not to be a big deal, since you can still use STDIN to pass data to the command. Confused? Let’s look at an example.
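Here’s a minimal sketch of the approach. The table name, columns, file path, and CSV format are assumptions; swap in your own schema:

    conn = ActiveRecord::Base.connection.raw_connection

    # Start a COPY in CSV format. FROM STDIN means the client streams
    # the data, so no server-side file permissions are needed.
    conn.exec("COPY large_table (name, value) FROM STDIN WITH (FORMAT csv)")

    # Stream each raw line of the file straight into the COPY stream;
    # Postgres parses the CSV itself, so Rails never builds objects.
    File.foreach("data.csv") do |line|
      conn.put_copy_data(line)
    end

    # Signal end-of-data, then fetch the final result so any error
    # reported by the server surfaces as an exception.
    conn.put_copy_end
    conn.get_last_result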

It’s actually pretty simple. I execute the COPY command on our large_table with data coming from STDIN. Then I read the file and hand each line to the connection with put_copy_data. When the file is finished, I signal the end of the copy with put_copy_end.

Conclusion

With the COPY technique above, I was able to import over 2.4 million rows in less than four minutes. Not too shabby. I would be interested in hearing what strategies you all have used for large imports.