Writing batch import scripts with Grails, GSQL and GPars

At work, we use a combination of GSQL and GORM to allows us to migrate data from legacy databases to our Grails projects. However, when dealing with large sets of data, this can be quite time consuming. Our largest databases can take 2 to 5 hours to process.

In this post, I will describe how to integrate GPars with Grails and GSQL to make these scripts run much faster.

A Sample Script

Here is how one of our scripts might look like.

import groovy.sql.Sql
import org.codehaus.groovy.grails.commons.ConfigurationHolder

/* Imports photos into our Grails application */
def photoImport = {

  def sql = Sql.newInstance(ConfigurationHolder.config.db as String,
            ConfigurationHolder.config.dbUsername as String,
            ConfigurationHolder.config.dbPassword as String,
            "com.mysql.jdbc.Driver")

  /* go through each row of photos */
  sql.eachRow("""SELECT * from photos""") { photo ->

    // get the user
    def user = User.findByUsername( photo.username )

    if (!user) {

      println 'could not locate user: ' + photo.userid
      return;

    } else {

      // creates a new album if the user does not have one
      def album = Album.findByUser( user )
      if( !album ){
        album = new Album( title : "${user.username}'s album", user: user )
	  }

      // import data from database into the photo list
	  album.addToPhotos(caption: photo.title, description: photo.description, album: album)

      // save modified album
	  if (!album.save(flush: true)){
        println "could not save album : " + album.errors
      }
    }
  }
}.call()

Notes:

  1. Description: This script looks through the photo table and imports photos into our database. The real script also does a great deal of image resizing and manipulation during the import, and creates secondary domain objects, but you get the idea.
  2. Grails shell friendly: I wrap this code in a closure so it can be invoked both via grails console and grails shell. In the code, you see
    def photoImport = { ...
    }.call()
    

    Wrapping this script in a closure allows me to call this on our production servers via grails shell like this:

    grails prod shell
    groovy > load scripts/importPhotos.groovy
    

    It also allows me to edit and debug the script using the Grails console.

    I found that with Gant scripts, you have to do a lot of extra work to get your application setting loaded properly, which are handled automatically by Grails shell and console.

  3. You don’t necesarily need to use SQL – for some of our content, we import data in Excel files. For this, I use Goran Ehrsson’s ExcelBuilder script, except using the following import annotations to handle xlsx. ( Also changing all HSSF references to XSSF )

    @Grab(group = "org.apache.poi", module = "poi", version = "[3.6,)")
    @Grab(group = "org.apache.poi", module = "poi-ooxml", version = "[3.6,)")
    import org.apache.poi.xssf.usermodel.*
    
  4. Configurable data sources: For our situation, the production data source and the development data source have different locations, we handle this by keeping this information within Config.groovy and keeping different sets for each environment.
    def sql = Sql.newInstance(ConfigurationHolder.config.db as String,
                  ConfigurationHolder.config.dbUsername as String,
                  ConfigurationHolder.config.dbPassword as String,
                  "com.mysql.jdbc.Driver")
    

Make it go faster: enabling GPars

GPars is a project that enables parallel processing in Groovy aimed at leveraging multiple processors and cores. Using GPars, in theory, we can improve the speed of our scripts with very little modification.

Adding GPars

There is a Grails plugin for the GPars project, however, it hasn’t been updated to the latest version of GPars, which has a few changes in syntax.

The easiest way to add GPars to your project is to actually just include the jar files. I do so by adding them into my grails-app/conf/BuildConfig.groovy file like so:

dependencies {
  build 'org.codehaus.gpars:gpars:0.10'
  build 'org.coconut.forkjoin.jsr166y:jsr166y:070108'
}

Parallelizing the GSQL .eachRow()

The next step is converting our eachRow loop into one that uses GPars. Since processing each row is fairly independent, this makes each iteration of the loop a perfect candidate for parallelization.

Paul King pointed out in the GPars mailing list that the underlying JDBC collection used by eachRow was not friendly to GPars, but we can use the rows() syntax in GSQL.

To do so, I simply change this line in my code:

  sql.eachRow("""SELECT * from photos""") { photo -> ... }

to the following

GParsPool.withPool {
   sql.rows( """select * from photos""" ).eachParallel { photo -> ... }
}

The code is pretty self explanatory, instead of looping and waiting for each iteration, it says ‘run each cycle of this loop as a parallel collection’.

Making GORM behave with GPars – getting hibernate sessions

If you try to run the code as it, you’ll run into a hibernate exception right away when it first tries to access the Grails database:

org.hibernate.HibernateException: No Hibernate Session bound to thread, and configuration does not allow creation of non-transactional one here

Some digging reveals that this is caused because the new thread created by GPars lacks the proper hibernate session. To fix it, we can wrap our GORM calls in a withTransaction closure, which will get an existing session or create a new one to run our GORM call.

In practice, we convert GORM getters like this:

   user = User.findByUsername( photo.username )

into this:

  User.withTransaction{
      user = User.findByUsername( photo.username )
  }

Making GORM behave with GPars – enabling pessimistic locking

Once I started running my scripts, it quickly became apparent that the threads were getting into each others’ ways when modifying data. The data in our legacy database would cluster similar data together ( i.e. photos would be imported in batch ), we would run into frequent StaleObjectExceptions and OptimisticLockExceptions from our good friend Hibernate.

One way to prevent this is to use pessimistic database locks to ensure that a SELECT UPDATE is called on our code. This is accomplished via the DomainObject.lock method, which acts like a get but prevents other rows from accessing this data.

So a block that looks like this:

        def album = Album.findByUser( user )
        if( !album ){
          album = new Album( title : "${user.username}'s album", user: user )
        }

will have additional code added to ensure that only one thread is trying to access that database row at the time.

        def album = Album.findByUser( user )
        if( !album ){
          album = new Album( title : "${user.username}'s album", user: user )
        }  else {
          // add a pessimistic lock to the album
          album = Album.lock( album.id )
        }

A GPars friendly script

So after all that work, we end up with a script that looks like this:

import groovy.sql.Sql
import org.codehaus.groovy.grails.commons.ConfigurationHolder
import groovyx.gpars.GParsPool

/* Imports photos into our Grails application */
def photoImport = {

  def sql = Sql.newInstance(ConfigurationHolder.config.db as String,
          ConfigurationHolder.config.dbUsername as String,
          ConfigurationHolder.config.dbPassword as String,
          "com.mysql.jdbc.Driver")

  /* go through each row of photos */
GParsPool.withPool {
  sql.rows("""SELECT * from photos""").eachParallel  { photo ->

  // get the user
  def user
  User.withTransaction{
      user = User.findByUsername( photo.username )
  }

  if (!user) {
     println 'could not locate user: ' + photo.userid
     return;
  } else {
     Album.withTransaction {

        // creates a new album if the user does not have one
        def album = Album.findByUser( user )
        if( !album ){
          album = new Album( title : "${user.username}'s album", user: user )
        } else {
          // add a pessimistic lock to the album
          album = Album.lock( album.id )
        }

        // import data from database into the photo list
        album.addToPhotos(caption: photo.title, description: photo.description, album: album)

        // save modified album
        if (!album.save(flush: true)){
          println "could not save album : " + album.errors
        }
     }
   }
 }
}
}.call()

Speed Improvements

So why go through all this trouble? Does the parallel processing provided by GPars provide any speed improvements?

I ran three versions of a modified user import script to see if there were any notable differences. This script imported the information of about 1,000 users to our system. Here are the results:

No GPars GPars Optimistic GPars Pessimistic
Run 1 174.142 53.731 56.031
Run 2 191.816 58.770 60.779
Run 3 196.652 57.356 68.459
Average 187.536 56.619 62.756

All times are in seconds.

As you can see, there were major differences between the average speed of a GPars-enabled import vs. one that is not GPars enabled. It took about 1/3 of the time to run the GPars scripts on my local developer machine ( some multi-core Dell laptop running Windows ).

There were not that many differences between the optimistic and pessimistic locking versions of the scripts. Given that the pessimistic locking version required the least amount of change while still allowing parallel processing, we decided to stick with that.

Given how easy it is to convert a non-GPars script to one that is GPars enabled, I definitively recommend this approach to anyone faced with a similar situation. In our investigation, I took at look at Spring Batch but bulked since it would require major rewrites to our scripts without promises of better performance.

If you want to learn more about GPars, I highly recommend getting the MEAP of Groovy in Action, 2nd edition by Dierk Koenig et al., which contains an excellent new chapter on the subject.

The threaded support for hibernate seems very awkward in Grails and GORM. If you know of a better / simpler way of doing this, please let me know.

About these ads

12 thoughts on “Writing batch import scripts with Grails, GSQL and GPars

  1. Enrique Medina

    Tomas,

    Shouldn’t you clear the Hibernate session every, let’s say, 1000 save() operations, to avoid massive memory consumption?

    Great article.

    Reply
  2. R.J. Osborne

    I just did a fairly large migration with Groovy’s SQL handling for a state government commission. It included several databases/schemas and well over 150 tables moving from Oracle to MySQL. I didn’t want to write a script for each and every table– and I didn’t. Although I didn’t use parallel processes in the JVM, I did break up large schemas and even very large table ranges into separate scripts- all of which could be run and tested independently. Those scripts relied on the same core data extraction techniques and could have been put in a single script.

    I wished GORM was easily isolated from Grails, as it would have made a few things more elegant; I had to rely on old-school JDBC to preserve Oracle’s datetime and blob data.

    That said, how would this approach fair when needing to move large, incongruent, messy, denormalized real-world data structures? I think there is a way to merge these two ideas into something really cool and table-independent in design.

    Reply
  3. Vaclav Pech

    Great post, indeed! You might also consider tuning the number of threads in the pool by explicitly setting the pool size in withPool(n), provided your script doesn’t utilize your CPU fully.

    Reply
  4. Mike Miller

    Great article, I’ve been trying to clear my work-plate – so that I can get some time to try out the GPars package. Glad to hear it’s going to be covered in GinA2.

    Reply
  5. Wanderson Santos

    You refactor code to this…


    def dataSource
    def photoImport = {
    def sql = Sql.newInstance(dataSource)
    ....

    Best regards!

    Reply
  6. Pingback: Groovy as a SalesForce API client | Lean Java Engineering

  7. Daniel

    Nice article.

    I’ve written “batch launcher” plugin to handle situations where you can/want use Grails infrastructure without an webapp environment:

    http://www.grails.org/plugin/batch-launcher
    http://code.google.com/p/grails-batch-launcher-plugin/
    http://grails.1312388.n4.nabble.com/Using-Grails-outside-of-an-webapp-td3058625.html#a3091861

    It still under development and test. Services and Quartz jobs can handle these Hibernate session initialization issues pretty well, but i’m not sure if it will be a better/faster/powerful solution than GPars.

    Best regards.

    Reply
  8. Pingback: Podcast grails.org.mx: Episodio 8 de la Temporada 1: Herramientas de construcción | GrailsMX

  9. Yuri Sakhno

    You do realize that the code
    def album = Album.findByUser(user)
    if (!album ) {
    album = new Album( title : “${user.username}’s album”, user: user )
    }
    may be executed simultaneously by two (or more) threads for the same user, in which case several albums will be created, don’t you?

    Reply
  10. Pingback: Podcast grails.org.mx: Episodio 8 de la Temporada 1: Herramientas de construcción |

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s