Batch processing with Neo4j

Hi all! I’ve been planning to start a blog for quite a while now so it’s about time to do so. As you might know, I’m a graph database enthusiast, especially loving Neo4j. A few days ago, I came across a blog post that announced the arrival of the PERIODIC COMMIT statement in Cypher. There’s already a way to try it out but you’ll have to grab an early milestone release (which is 2.1.0-M1 at the moment). If you’re using maven, you can get it as follows:
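The original snippet is missing here, but the dependency block would look something like the following — note that the exact coordinates are my assumption (the org.neo4j:neo4j umbrella artifact; milestone releases are published with version strings like 2.1.0-M01 in the Neo4j Maven repository):

```xml
<!-- Assumed coordinates: the org.neo4j:neo4j umbrella artifact
     with the 2.1.0-M01 milestone version -->
<dependency>
    <groupId>org.neo4j</groupId>
    <artifactId>neo4j</artifactId>
    <version>2.1.0-M01</version>
</dependency>
```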



So, what does “PERIODIC COMMIT” actually mean? To quote the announcement blog post:

For the best performance, you can give Cypher a hint about transaction size so that after ‘n’ updates a transaction will be committed…

This is perfect for batch processing your data via Cypher. You can choose when a running transaction has to commit, so performance will greatly benefit. You specify the

USING PERIODIC COMMIT

command at the beginning of your Cypher statement, e.g.:

USING PERIODIC COMMIT
MATCH (u:USER) SET u.name = 'changed'
This statement will commit every 10,000 operations by default. You can specify your own amount by appending it to the command:

USING PERIODIC COMMIT 1000
MATCH (u:USER) SET u.name = 'changed'
Let’s benchmark these statements. How is performance affected if we leave the command out? What happens if we use the default operation count, and what happens when we play around with our own values? I’m going to use a very basic example, but it will be enough to illustrate the effects. First of all, let’s populate an empty database with 100,000 nodes (no relationships). As this is for demonstration purposes only, we don’t care about the performance of this statement. This is what it could look like in Java:

final GraphDatabaseService graphDB = new TestGraphDatabaseFactory().newImpermanentDatabase();
final ExecutionEngine engine = new ExecutionEngine(graphDB);
try (final Transaction tx = graphDB.beginTx()) {
    for (int i = 0; i < 100_000; i++) {
        engine.execute("CREATE (u:USER { name : 'abc'})");
    }
    tx.success();
}

Now the database has some data we can work with. Let’s change the name property for every node that is labelled with USER. Before version 2.1.0, you would execute the following command:

MATCH (u:USER) SET u.name = 'changed'

As you can imagine, updating 100,000 nodes in a single transaction will hurt performance badly. Actually, on my machine, it took approx. 5,066 milliseconds to complete! We can do better, for sure! Luckily there’s a new kid on the block. Let’s put the “PERIODIC COMMIT” command in front of the previous statement:

USING PERIODIC COMMIT
MATCH (u:USER) SET u.name = 'changed'
If you’re using Java, make sure you’re running this statement outside of an existing transaction. If you run it inside one, you’ll get the following exception message: “org.neo4j.cypher.PeriodicCommitInOpenTransactionException: Executing queries that use periodic commit in an open transaction is not possible.”. After execution, it appears that the performance is already slightly better than before: it took 4,160 milliseconds on average. In this particular case, the statement commits exactly 10 times, as we have 100,000 nodes and a default commit count of 10,000 operations. But still, the difference on my machine is not that satisfying. Let’s use a custom commit count, say 1,000.

USING PERIODIC COMMIT 1000
MATCH (u:USER) SET u.name = 'changed'
And we already seem to have a winner! It took approx. 3,407 milliseconds, which is about a third faster than our initial statement! We could of course play around with more values, but at least on my machine, 1,000 seems to be very reasonable.

It’s great to see that there’s such a huge performance gain to be accomplished with Cypher. But what if we used the Java API to change every node’s name property? What are the performance implications? Let’s recreate the last Cypher statement in Java code. It looks something like this:

final Label userLabel = DynamicLabel.label("USER");
final String name = "name";
final String value = "changed";
final int batchSize = 1_000;
int count = 0;
Transaction tx = graphDB.beginTx();
for (final Node node : GlobalGraphOperations.at(graphDB).getAllNodesWithLabel(userLabel)) {
    node.setProperty(name, value);
    if (++count % batchSize == 0) {
        // commit the current batch and open a fresh transaction
        tx.success();
        tx.close();
        tx = graphDB.beginTx();
    }
}
// commit the remaining operations
tx.success();
tx.close();
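The commit cadence of that loop can be sketched without Neo4j at all. This plain-Java snippet (hypothetical, purely for illustration) just counts how often the modulo check fires, confirming that a batch size of 1,000 over 100,000 nodes triggers 100 intermediate commits:

```java
public class BatchCommitCount {
    public static void main(String[] args) {
        final int totalOps = 100_000;
        final int batchSize = 1_000;
        int count = 0;
        int commits = 0;
        for (int i = 0; i < totalOps; i++) {
            // a node update would happen here
            if (++count % batchSize == 0) {
                // in the real loop: tx.success(); tx.close(); tx = graphDB.beginTx();
                commits++;
            }
        }
        System.out.println(commits); // prints 100
    }
}
```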

Again averaged over three runs, the run time is 3,017 ms! That’s about 400 ms faster than the fastest Cypher statement. In this simple example, the Java API seems to (slightly) outperform Cypher. If you have the luxury of using Java, I’d advise you to consider it as well. If you’re accessing your Neo4j database via REST, then this Cypher enhancement is well worth your consideration. Either way, if you have the resources, take your time to investigate the fastest way to batch manipulate your data. Play around and choose wisely. But please bear in mind that when the processing fails at some point, you might have data that’s already been committed! Whether that’s a bad thing or not is up to you to decide.

Thanks, Neo4j team, for such an awesome addition! Keep up the good work (and I’m sure you will ;-))… If there are questions, corrections or suggestions, feel free to add a comment or contact me directly at timmy dot storms at gmail dot com.