ActiveRecord is great when you need to quickly access and manipulate a few rows within a database. Loading records into a database is just as easy… seed files and custom rake tasks make inserting records a breeze. But what happens when you need to import lots of rows? To clarify, when I say lots of rows, I’m talking about millions of rows from a delimited text file.
So many rows, so little time
My first instinct was to read in the text file and do a Model.create on each row. This took a long, long time. (I actually gave up on it.)
# This is going to be awhile!!!
require 'csv'

CSV.parse(File.read('largefile.txt'), headers: true) do |row|
  # CSV::Row is not a Hash, so convert it before handing it to create
  SomeModel.create(row.to_h)
end
Next, I tried batching the rows into an array in an effort to limit the number of database calls, then imported each batch using the activerecord-import gem. Batching helps if your recordset is a couple thousand rows, but it doesn’t scale efficiently to a million+ rows.
# Faster, but still not quite there...
require 'csv'

batch = []
batch_size = 1000

CSV.parse(File.read('largefile.txt'), headers: true) do |row|
  batch << SomeModel.new(row.to_h)

  if batch.size >= batch_size
    SomeModel.import batch
    batch = []
  end
end

# Don't forget the final partial batch
SomeModel.import batch
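If you do go the activerecord-import route, one option worth knowing is that it can skip per-record validations, which shaves some time off each batch. A minimal sketch, assuming the same batch array as above and that you trust the incoming data:

# Skips ActiveRecord validations for the whole batch; faster, but bad
# rows will go straight to the database, so clean the file first
SomeModel.import batch, validate: false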
COPY to the rescue
Due to permission issues, you most likely will not be able to use the filename option of the COPY command: with a filename, the Postgres server process opens the file itself, so the file has to live on the database server and be readable by it. This actually turns out to not be a big deal, since you can still use STDIN to pass data to the command. Confused? Let’s look at the difference between the two forms, then the full example.
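First, a quick sketch of the difference. This is illustrative only; the table, columns, and file path here are placeholders, not values from the real import:

# The filename form: Postgres itself opens the file, so it must sit on
# the database server, be readable by the postgres process, and typically
# requires superuser privileges. This is where the permission issues bite.
copy_from_file  = "COPY large_table (col1, col2) FROM '/path/on/db/server.txt' WITH CSV"

# The STDIN form: data is streamed over the client connection, so no
# server filesystem access is needed.
copy_from_stdin = "COPY large_table (col1, col2) FROM STDIN WITH CSV"

With the STDIN form in hand, the full import looks like this: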
# Setup raw connection
conn = ActiveRecord::Base.connection
rc = conn.raw_connection

# Note: if largefile.txt starts with a header row, use "WITH CSV HEADER"
# so Postgres skips it
rc.exec("COPY large_table (col1, col2, etc) FROM STDIN WITH CSV")

file = File.open('largefile.txt', 'r')

while !file.eof?
  # Add row to copy data
  rc.put_copy_data(file.readline)
end

# We are done adding copy data
rc.put_copy_end
file.close

# Display any error messages
while res = rc.get_result
  e_message = res.error_message
  p e_message unless e_message.empty?
end
It’s actually pretty simple. I execute the COPY command against our large_table, telling Postgres the data will arrive on STDIN. Then I read the file and push each line into STDIN with put_copy_data. When the file is finished, I issue an end-of-copy instruction with put_copy_end.
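As a side note, recent versions of the pg gem bundle this put_copy_data / put_copy_end / get_result dance into a copy_data helper that ends the copy and raises on errors for you. A minimal sketch of the same import, assuming the same table and file as above:

rc = ActiveRecord::Base.connection.raw_connection

# copy_data sends put_copy_end when the block exits and raises a
# PG::Error if the server reports a problem
rc.copy_data("COPY large_table (col1, col2, etc) FROM STDIN WITH CSV") do
  File.foreach('largefile.txt') do |line|
    rc.put_copy_data(line)
  end
end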
With the copy technique above, I was able to import over 2.4 million rows in less than 4 minutes. Not too shabby. I’d be interested in hearing what strategies you all have used for large imports.