Batch processing with Neo4j

Hi all! I’ve been planning to start a blog for quite a while now, so it’s about time to do so. As you might know, I’m a graph database enthusiast, especially loving Neo4j. A few days ago, I came across a blog post that announced the arrival of the PERIODIC COMMIT statement in Cypher. There’s already a way to try it out, but you’ll have to grab an early milestone release (which is 2.1.0-M1 at the moment). If you’re using Maven, you can get it as follows:

<dependency>
    <groupId>org.neo4j</groupId>
    <artifactId>neo4j</artifactId>
    <version>2.1.0-M1</version>
</dependency>

<repository>
    <id>neo4j</id>
    <url>http://m2.neo4j.org/content/repositories/releases/</url>
    <releases>
        <enabled>true</enabled>
    </releases>
</repository>
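
Both snippets belong in your project’s pom.xml: the dependency goes inside the <dependencies> section and the repository inside <repositories>.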

So, what does “PERIODIC COMMIT” actually mean? To quote the announcement blog post:

For the best performance, you can give Cypher a hint about transaction size so that after ‘n’ updates a transaction will be committed…

This is perfect for batch processing your data via Cypher. You can control when a running transaction commits, which greatly benefits performance. You specify the

USING PERIODIC COMMIT

command at the beginning of your Cypher statement, e.g.:

USING PERIODIC COMMIT MATCH (u:USER) SET u.name = 'changed'

This statement will commit every 10,000 operations by default. You can specify your own batch size by appending it to the command:

USING PERIODIC COMMIT 1000 MATCH (u:USER) SET u.name = 'changed'

Let’s benchmark these statements. How is performance affected if we leave the hint out? What happens if we use the default operation count, and what happens when we pick our own values? I’m going to use a very basic example, but it will be enough to illustrate the effects. First of all, let’s populate an empty database with 100,000 nodes (no relationships). As this is for demonstration purposes only, we don’t care about the performance of this statement. This is what it could look like in Java:

import org.neo4j.cypher.javacompat.ExecutionEngine;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Transaction;
import org.neo4j.test.TestGraphDatabaseFactory;

// Spin up an in-memory test database and a Cypher execution engine.
final GraphDatabaseService graphDB = new TestGraphDatabaseFactory().newImpermanentDatabase();
final ExecutionEngine engine = new ExecutionEngine(graphDB);
// Create all 100,000 nodes in a single transaction (fine for test setup).
try (final Transaction tx = graphDB.beginTx()) {
    for (int i = 0; i < 100_000; i++) {
        engine.execute("CREATE (u:USER { name : 'abc'})");
    }
    tx.success();
}
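
Note that tx.success() only marks the transaction as successful; the actual commit happens when the transaction is closed, which the try-with-resources block does automatically at the end.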

Now the database has some data we can work with. Let’s change the name property for every node that is labelled with USER. Before version 2.1.0, you would execute the following command:

MATCH (u:USER) SET u.name = 'changed'

As you can imagine, updating 100,000 nodes in a single transaction will hurt performance badly. Actually, on my machine, it took approx. 5,066 milliseconds to complete! We can do better, for sure! Luckily there’s a new kid on the block. Let’s put the “PERIODIC COMMIT” command in front of the previous statement:

USING PERIODIC COMMIT MATCH (u:USER) SET u.name = 'changed'

If you’re using Java, make sure you run this statement outside of an existing transaction. If you run it inside an open transaction, you’ll get the following exception message: “org.neo4j.cypher.PeriodicCommitInOpenTransactionException: Executing queries that use periodic commit in an open transaction is not possible.”. After execution, it appears that the performance is already slightly better than before: it took 4,160 milliseconds on average. In this particular case, the statement will commit exactly 10 times, as we have 100,000 nodes and a default commit size of 10,000 operations. But still, the difference on my machine is not that satisfying. Let’s use a custom commit count, say 1,000.

USING PERIODIC COMMIT 1000 MATCH (u:USER) SET u.name = 'changed'

And we already seem to have a winner! It took approx. 3,407 milliseconds, which is about a third faster than our initial statement! We could of course play around with more values, but at least on my machine, 1,000 seems very reasonable.
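
For reference, here’s a rough sketch of how these timings can be taken from Java, reusing the engine from the setup snippet (and run outside any open transaction, as mentioned above). The averages quoted in this post are over three runs:

// Rough timing sketch: a single wall-clock measurement of the update statement.
final long start = System.currentTimeMillis();
engine.execute("USING PERIODIC COMMIT 1000 MATCH (u:USER) SET u.name = 'changed'");
System.out.println("Took " + (System.currentTimeMillis() - start) + " ms");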

It’s great to see that there’s such a huge performance gain to be had with Cypher. But what if we used the Java API to change every node’s name property? What are the performance implications? Let’s recreate the last Cypher statement with Java code. It looks something like this:

import org.neo4j.graphdb.DynamicLabel;
import org.neo4j.graphdb.Label;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
import org.neo4j.tooling.GlobalGraphOperations;

final Label userLabel = DynamicLabel.label("USER");
final String name = "name";
final String value = "changed";
final int batchSize = 1_000;
int count = 0;
// Commit manually every batchSize updates, mirroring PERIODIC COMMIT 1000.
Transaction tx = graphDB.beginTx();
for (final Node node : GlobalGraphOperations.at(graphDB).getAllNodesWithLabel(userLabel)) {
    node.setProperty(name, value);
    if (++count % batchSize == 0) {
        // Mark the batch as successful, commit it and start a new transaction.
        tx.success();
        tx.close();
        tx = graphDB.beginTx();
    }
}
// Commit the remaining updates of the final (possibly partial) batch.
tx.success();
tx.close();
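
Closing and reopening the transaction every batchSize updates keeps its in-memory state small; it’s essentially the manual equivalent of what USING PERIODIC COMMIT 1000 does for us in Cypher.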

Again, after three runs, the average run time is 3,017 ms! That’s another 400 ms or so faster than the fastest Cypher statement. In this simple example, the Java API seems to (slightly) outperform Cypher. If you have the luxury of using Java, I’d advise you to consider it as well. If you’re accessing your Neo4j database via REST, then this Cypher enhancement is well worth your consideration. Either way, if you have the resources, take the time to investigate the fastest way to batch-manipulate your data. Play around and choose wisely. But please bear in mind that if the processing fails at some point, some of your data might already have been committed! Whether that’s a bad thing or not is up to you to decide.

Thanks, Neo4j team, for such an awesome addition! Keep up the good work (and I’m sure you will ;-))… If there are questions, corrections or suggestions, feel free to add a comment or contact me directly at timmy dot storms at gmail dot com.

2 thoughts on “Batch processing with Neo4j”

  1. Hi Timmy,

    thanks for the nice blog post. It would make sense to use parameters in your Cypher statements though, both in the create and the update statements, to make them more realistic.

    Also, in the first code snippet the less-than sign is not rendered correctly, and the initial create statement could probably also benefit from batching of commits 🙂

  2. Hi Michael,

    Thanks a lot for your feedback! I agree that my examples are really basic, but my main goal was to illustrate the performance implications of using the “PERIODIC COMMIT” statement. I did run a new benchmark with a query parameter, but it was consistently a tad slower than the corresponding example in the blog post. Nevertheless, I would love to read an article about the query caching mechanism that kicks in when parameters are used. Could you point me to an interesting article, or can I motivate you to write a blog post about it?
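
    For reference, a parameterized variant would look roughly like this (just a sketch; ExecutionEngine.execute accepts a map of parameters, and the parameter name "newName" is an arbitrary choice):

    import java.util.HashMap;
    import java.util.Map;

    final Map<String, Object> params = new HashMap<>();
    params.put("newName", "changed");
    engine.execute("USING PERIODIC COMMIT 1000 MATCH (u:USER) SET u.name = {newName}", params);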

    Also, I know that the insert script could be optimized, but as I stated, it’s for demonstration purposes only. I was eager to use an example with the MERGE statement, as that shows three new concepts of Neo4j 2: MERGE, the use of labels and, of course, PERIODIC COMMIT.
