COPY millions of rows to PostgreSQL with Rails

ActiveRecord is great when you need to quickly access and manipulate a few rows within a database. Loading records into a database is just as easy… seed files and custom rake tasks make inserting records a breeze. But what happens when you need to import lots of rows? To clarify, when I say lots of rows, I’m talking about millions of rows from a delimited text file.

So many rows, so little time

My first instinct was to read in the text file and do a Model.create() on each row. This took a long, long time. (I actually gave up on it.)
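
Roughly, that naive version looks like the sketch below (LargeTable, the column names, and data.txt are placeholders, not the actual code from this project):

    File.foreach("data.txt") do |line|
      col1, col2 = line.chomp.split("\t")
      # One INSERT (plus validations and callbacks) per row -- painfully slow.
      LargeTable.create!(col1: col1, col2: col2)
    end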

Next, I tried batching the rows into an array in an effort to limit the number of database calls, then imported each batch using the activerecord-import gem. Batching helps if your record set is a couple thousand rows, but it doesn’t scale efficiently to a million+ rows.
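
The batched attempt looked roughly like this (a sketch only; LargeTable, the 10,000-row batch size, and data.txt are placeholders):

    batch = []
    File.foreach("data.txt") do |line|
      col1, col2 = line.chomp.split("\t")
      batch << LargeTable.new(col1: col1, col2: col2)
      if batch.size >= 10_000
        LargeTable.import(batch)   # bulk INSERT provided by activerecord-import
        batch = []
      end
    end
    LargeTable.import(batch) unless batch.empty?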

COPY to the rescue

In order to get into the millions, I needed to take Rails out of the equation as much as possible. This meant using an ActiveRecord raw_connection and the COPY command from PostgreSQL.

Due to permission issues, you most likely will not be able to use the filename option of the COPY command. This actually turns out not to be a big deal, since you can still use STDIN to pass data to the command. Confused? Let’s look at the example.
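
Here is a minimal sketch of the approach, using the pg gem’s connection that raw_connection returns; large_table, the column names, and data.txt are placeholders for your own table and file:

    conn = ActiveRecord::Base.connection.raw_connection

    # Kick off the COPY, reading rows from STDIN in CSV format.
    conn.exec("COPY large_table (col1, col2) FROM STDIN WITH CSV")

    # Stream the file to the server one line at a time.
    File.foreach("data.txt") do |line|
      conn.put_copy_data(line)
    end

    # Signal end of data, then retrieve the command's result.
    conn.put_copy_end
    conn.get_result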

It’s actually pretty simple. I execute the COPY command on our large_table with data from STDIN, then read the file and push each line to STDIN. When the file is finished, I issue an end-of-copy instruction.


With the COPY technique above, I was able to import over 2.4 million rows in less than 4 minutes. Not too shabby. I would be interested in hearing what strategies you all have used for large imports.


  • The fastest way to parse a file into the DB is to:

    1. Use ActiveRecord::Base.connection.execute("INSERT INTO … <all your data>"). You don’t need the activerecord-import gem; it implements the same thing, but with more logic, which can consume resources. (A sketch follows this list.)

    2. The fewer objects you create (arrays, hashes, variables), the faster it will run (Ruby spends a lot of resources creating objects).

    3. Turn off output to STDOUT; it also affects the duration.

    4. Also, if you use INSERT INTO, you need to delete the previous data after the new rows are successfully created; a DELETE with a LEFT OUTER JOIN is a good buddy for that goal.

    5. Use a SQL transaction to ensure that a large amount of data will not be partly loaded into your DB. (Of course it consumes more resources, but it makes the load more resilient if you hit an exception.)

    6. Use an SSD instead of an HDD, and top hardware, captain!

    7. Perform the process in the background.
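
    A minimal sketch of points 1 and 5, assuming the rows are already parsed into Ruby arrays and that large_table, col1, and col2 are placeholder names:

      conn = ActiveRecord::Base.connection

      # One transaction around the whole load (point 5),
      # multi-row INSERTs built by hand inside it (point 1).
      ActiveRecord::Base.transaction do
        rows.each_slice(1_000) do |slice|
          values = slice.map { |c1, c2| "(#{conn.quote(c1)}, #{conn.quote(c2)})" }.join(", ")
          conn.execute("INSERT INTO large_table (col1, col2) VALUES #{values}")
        end
      end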

  • Loading data from a CSV file is too primitive for my needs; I have to set up indices and do some preprocessing. But true, ActiveRecord is damn slow. There is a faster ORM, Sequel, but even that one is too slow. I use PyPy with psycopg2ct and the executemany API from DB API 2.0 for loading data. It is over 10x faster than a Ruby-based solution.

  • Sometimes when I have to insert a batch of records with some preprocessing (for example, checking whether the rows already exist), I first create a temp table:
    CREATE TEMP TABLE temp (LIKE main_table);
    Then I load the records into it with COPY or atomic INSERTs (depending on the data source). Adding records to a small table without indexes is almost as fast as writing to a CSV file:
    INSERT INTO temp …;   -- or: COPY temp FROM …;
    Now I do the preprocessing.
    Then I load the data into the main table with:
    INSERT INTO main_table SELECT * FROM temp;

    Blazing fast! Of course, removing indexes and recreating them at the end of the load makes it even faster. 😉
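
    Tied back to the article’s Rails setup, the whole staging flow might look roughly like this (a sketch only; temp_import, main_table, and data.csv are placeholder names):

      conn = ActiveRecord::Base.connection.raw_connection

      # Stage into an unindexed temp table that mirrors the main table.
      conn.exec("CREATE TEMP TABLE temp_import (LIKE main_table)")

      # COPY the file into the temp table.
      conn.exec("COPY temp_import FROM STDIN WITH CSV")
      File.foreach("data.csv") { |line| conn.put_copy_data(line) }
      conn.put_copy_end
      conn.get_result

      # ...preprocessing / de-duplication happens here...

      # Move the clean rows into the main table.
      conn.exec("INSERT INTO main_table SELECT * FROM temp_import")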

  • For MySQL there is "LOAD DATA INFILE", which lets you do this in pure SQL.

  • Thanks all for the comments! There is some great advice here : )

  • Hi,

    Great article. I have the same problem in the project I’m working on.

    After a long period of testing and benchmarking, we decided to use COPY too. The difference is that we pre-process each “contact” (the model I use) to remove duplicated emails, then send the data to the DB in batches of 10,000 lines (a string with \n separating each line; see the sketch below).

    It’s important to say that the COPY command works inside a SQL transaction, so you will not have problems with that.

    Today we import around a billion lines of data per day.
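
    A rough sketch of that batching idea, reusing the raw_connection COPY approach from the article (the contacts table, its columns, and the file path are placeholder names; the 10,000-line batch size is from the comment):

      conn = ActiveRecord::Base.connection.raw_connection
      conn.exec("COPY contacts (email, name) FROM STDIN WITH CSV")

      buffer = []
      File.foreach("contacts.csv") do |line|
        buffer << line
        if buffer.size >= 10_000
          conn.put_copy_data(buffer.join)   # one call per 10,000-line batch
          buffer = []
        end
      end
      conn.put_copy_data(buffer.join) unless buffer.empty?
      conn.put_copy_end
      conn.get_result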

  • Good approach. In fact, if you don’t have complex authentication that ActiveRecord is handling for you, you can dispense with your Ruby script altogether and just use the shell:

    $ cat largefile.txt | psql db_name -c "COPY large_table (col1, col2, ...) FROM STDIN WITH CSV"

  • Struggled with this for days; found this… My import of over 20m rows that was taking over 12 hours is now done in 7 minutes. Thanks!
