ActiveRecord is great when you need to quickly access and manipulate a few rows in a database, and loading records is just as easy: seed files and custom rake tasks make inserting records a breeze. But what happens when you need to import lots of rows? To be clear, when I say lots of rows, I’m talking about millions of rows from a delimited text file.

So many rows, so little time

My first instinct was to read in the text file and call Model.create on each row. This took a long, long time (I eventually gave up on it).

  # This is going to take a while!!!
  require 'csv'

  CSV.parse(File.read('largefile.txt'), headers: true) do |row|
    SomeModel.create(row.to_h)
  end
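
One small tweak that doesn’t change the approach, but at least avoids slurping the whole file into memory, is to stream the rows with CSV.foreach from Ruby’s standard library. A minimal sketch, reusing the same placeholder file and model from above (it is still one INSERT per row, so it is still slow):

  # Stream the file a row at a time instead of using File.read
  require 'csv'

  CSV.foreach('largefile.txt', headers: true) do |row|
    SomeModel.create(row.to_h)
  end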

Next, I tried batching the rows into an array in an effort to limit the number of database calls, then imported each batch using the activerecord-import gem. Batching helps if your recordset is a couple thousand rows, but it doesn’t scale efficiently to a million+ rows.

  # Faster, but still not quite there...
  require 'csv'

  batch = []
  batch_size = 1000

  CSV.parse(File.read('largefile.txt'), headers: true) do |row|
    batch << SomeModel.new(row.to_h)
    if batch.size >= batch_size
      SomeModel.import batch
      batch = []
    end
  end

  # Import whatever is left in the final partial batch
  SomeModel.import batch
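
If you stay with activerecord-import, one knob worth knowing about is validate: false, an option the gem supports for skipping ActiveRecord validations on each import call. Whether that is acceptable depends on how much you trust the file; a quick sketch of the same import with validations skipped:

  # Same import call, but without running model validations on every record
  SomeModel.import batch, validate: false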

COPY to the rescue

To get into the millions, I needed to take Rails out of the equation as much as possible. That meant using ActiveRecord’s raw_connection and PostgreSQL’s COPY command.

Due to permission issues, you most likely will not be able to use the filename option of the COPY command (with a filename, the file is read by the database server process, which needs access to it). That turns out not to be a big deal, since you can still pass the data to the command over STDIN. Confused? Let’s look at an example.

  # Set up the raw connection (a PG::Connection)
  conn = ActiveRecord::Base.connection
  rc = conn.raw_connection
  # HEADER tells Postgres to skip the header row in the file
  rc.exec("COPY large_table (col1, col2, etc) FROM STDIN WITH CSV HEADER")

  # Stream each line of the file into the copy
  file = File.open('largefile.txt', 'r')
  until file.eof?
    rc.put_copy_data(file.readline)
  end
  file.close

  # We are done adding copy data
  rc.put_copy_end

  # Display any error messages (error_message is "" when there is no error)
  while res = rc.get_result
    p res.error_message unless res.error_message.empty?
  end

It’s actually pretty simple: I execute the COPY command against large_table, telling Postgres the data will arrive on STDIN. Then I read the file line by line, pushing each line into the copy stream with put_copy_data. When the file is finished, I signal the end of the copy with put_copy_end and check the results for errors.
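
For reuse, the same idea can be wrapped in a small method. This is just a sketch of my own (the pg_copy_import name, the column list, and the use of File.foreach are assumptions, not part of the code above), raising instead of printing when Postgres reports a problem:

  # Hypothetical wrapper around the COPY approach above.
  # table and columns are interpolated directly into the SQL,
  # so only pass trusted values.
  def pg_copy_import(path, table, columns)
    rc = ActiveRecord::Base.connection.raw_connection
    rc.exec("COPY #{table} (#{columns.join(', ')}) FROM STDIN WITH CSV HEADER")

    # File.foreach opens, streams, and closes the file for us
    File.foreach(path) { |line| rc.put_copy_data(line) }
    rc.put_copy_end

    # Raise if Postgres reported an error with the COPY
    while res = rc.get_result
      raise res.error_message unless res.error_message.empty?
    end
  end

  pg_copy_import('largefile.txt', 'large_table', %w[col1 col2])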

Conclusion

With the COPY technique above, I was able to import over 2.4 million rows in under 4 minutes. Not too shabby. I’d be interested in hearing what strategies you all have used for large imports.
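
If you want to time your own import, Benchmark from Ruby’s standard library is enough. The pg_copy_import call below refers to the sketch above, so adjust it to however you wire things up:

  require 'benchmark'

  # Benchmark.realtime returns elapsed wall-clock time in seconds
  seconds = Benchmark.realtime do
    pg_copy_import('largefile.txt', 'large_table', %w[col1 col2])
  end
  puts "Imported in #{(seconds / 60.0).round(1)} minutes"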