Spring Data Neo4j 3.0 migration

As I’m still using Neo4j 1.9 in a project, it was about time to make the switch to Neo4j 2.0. It offers a lot of great features that I want to explore. I’m a Spring Framework enthusiast and I really like the entity approach of Spring Data Neo4j. To support Neo4j 2.0, this library needs to be upgraded as well. The first thing I did was change the Maven dependencies from

<dependency>
    <groupId>org.neo4j</groupId>
    <artifactId>neo4j</artifactId>
    <version>1.9.7</version>
</dependency>
<dependency>
    <groupId>org.springframework.data</groupId>
    <artifactId>spring-data-neo4j-aspects</artifactId>
    <version>2.3.5.RELEASE</version>
</dependency>

to

<dependency>
    <groupId>org.neo4j</groupId>
    <artifactId>neo4j</artifactId>
    <version>2.0.3</version>
</dependency>
<dependency>
    <groupId>org.springframework.data</groupId>
    <artifactId>spring-data-neo4j-aspects</artifactId>
    <version>3.0.2.RELEASE</version>
</dependency>

Ideally this would just work, but as expected, that’s not the case. There are quite a few changes in Neo4j 2.0, so you’ll need to migrate a thing or two. This blog post is a good article that gets you started. I had to take the following steps to make my application run again.

  1. Explicitly define the entity packages

    To quote the Spring Data Neo4j team: “We had to make the entity metadata handling an explicit step in the lifecycle and decided to change the base-package attribute for neo4j:config to be mandatory. It allows for one or more (comma separated) package names to be scanned for entities.”

    As I’m using annotation based configuration of my Spring beans, I needed to add the following constructor to my configuration class:

    @Configuration
    @EnableNeo4jRepositories(basePackages = { "com...repository" })
    public class SpringConfiguration extends Neo4jAspectConfiguration {
        
        SpringConfiguration() {
            setBasePackage("com...entity");
        }
    
    }
    

    Note that I’m extending the Neo4jAspectConfiguration class, because I’m using the advanced mapping mode. If you’re using the simple mapping mode, the Neo4jConfiguration class will suit your needs.

  2. Switch to Java 1.7

    As Java 1.6 is no longer supported by Neo4j 2.0, your application needs to be built with Java 1.7. In my project, I’m solving this with the maven-compiler-plugin, where I explicitly define the Java version.

    <build>
        <pluginManagement>
            <plugins>
                <plugin>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <version>3.1</version>
                    <configuration>
                        <source>1.7</source>
                        <target>1.7</target>
                    </configuration>
                </plugin>
            </plugins>
        </pluginManagement>
    </build>
    
  3. Every database operation should be transactional

    This is an explicit requirement of Neo4j 2.0. Previously, only write operations needed to be wrapped in a transaction. As I frequently perform read queries in my application, I needed to add Spring’s @Transactional annotation to a couple of methods in classes of stereotype Service. To get the annotation to work, you should add the @EnableTransactionManagement annotation to a @Configuration class. The Neo4jConfiguration class conveniently defines a PlatformTransactionManager for us.

    I’m using an InitializingBean as well, and the @Transactional annotation doesn’t work on its afterPropertiesSet() method. This is expected behavior, as the class hasn’t been proxied yet. A decent workaround is to use the native Neo4j transaction API, which can be accessed via e.g. the Neo4jTemplate. Here’s a code example as illustration:

    @Service
    public class MyService implements InitializingBean, MyServiceInterface {
    
        @Autowired
        private Neo4jTemplate template;
      
        @SuppressWarnings("deprecation")
        @Override
        public void afterPropertiesSet() throws Exception {
            try (final Transaction tx = template.getGraphDatabaseService().beginTx()) {
                final IndexProvider indexProvider = template.getInfrastructure().getIndexProvider();
                try {
                    indexProvider.getIndex("custom_index");
                } catch (final NoSuchIndexException e) {
                    indexProvider.createIndex(Node.class, "custom_index", IndexType.FULLTEXT);
                }
                tx.success();
            }
        }
    
        @Override
        @Transactional
        public void thisIsTransactional() {
        }
    
    }
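
    Hooking up the @EnableTransactionManagement support mentioned above is only a small change to the configuration class. A minimal sketch (the class and package names below are placeholders, not the exact ones from my project):

    ```java
    // Sketch: enabling Spring's annotation-driven transaction management for SDN.
    // Neo4jConfiguration (and thus Neo4jAspectConfiguration) already defines a
    // PlatformTransactionManager bean, so the extra annotation is all that's needed.
    @Configuration
    @EnableTransactionManagement
    @EnableNeo4jRepositories(basePackages = { "com.example.repository" })
    public class TransactionalSpringConfiguration extends Neo4jAspectConfiguration {

        public TransactionalSpringConfiguration() {
            setBasePackage("com.example.entity");
        }

    }
    ```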
    
  4. Refactor code that contains removed SDN methods

    Lucky for me, I only came across one method that has been removed in the 3.0 release of Spring Data Neo4j. AFAIK it hadn’t been deprecated in a previous release, so I was a bit surprised to see this error. I’m talking about the Neo4jTemplate#queryEngineFor(QueryType type) method. As you’ll see, it was an easy fix and actually, it shouldn’t have been used in the first place.
    template.queryEngineFor(QueryType.Cypher).query("start n=node(*) return count(n)", null);

    becomes

    template.query("start n=node(*) return count(n)", null);

  5. Upgrade Cypher DSL

    Cypher DSL is a neat little library that allows you to write beautiful query code in a DSL way. Using this DSL, you can dynamically compose your query based on e.g. user input, so there’s no need to concatenate a bunch of Strings. To be able to use new Cypher keywords, I had to upgrade to the latest version, 2.0.1. I noticed that many of the Cypher 2.0 features are included in the library, although it seems to miss a couple of vital keywords.

    <dependency>
        <groupId>org.neo4j</groupId>
        <artifactId>neo4j-cypher-dsl</artifactId>
        <version>2.0.1</version>
    </dependency>
    
  6. Cypher changes

    Migrating to Neo4j 2.0 does impact your Cypher queries. Another good thing about Cypher DSL is that it shows compilation errors when certain syntax isn’t supported anymore after a library upgrade. I had errors because the “?” and “!” characters behind property names aren’t allowed anymore. The following example query

    START u=node:__types__("className:com...User") 
    WHERE u.since? > 2010
    RETURN u
    

    would need to be changed to something similar to

    START u=node:__types__("className:com...User") 
    WHERE coalesce(u.since, 2014) > 2010
    RETURN u
    

    Note that I intentionally still don’t use labels. I’ll get back to that soon. In Cypher DSL, the second query looks like

    import static org.neo4j.cypherdsl.CypherQuery.*;
    
    start(query("u", "__types__", "className:com...User"))
        .where(coalesce(identifier("u").property("since"), literal(2014)).gt(2010))
        .returns(identifier("u"));
    
  7. Keep legacy indexes

    I heavily rely on the Lucene indexes for e.g. wildcard searches. These indexes are considered legacy in Neo4j 2.0 in favor of the newly created Label (schema) indexes. Spring Data Neo4j also follows this approach and uses schema indexes by default from version 3.0. Schema indexes do not yet support wildcard searches, so moving to this new approach is not an option for me. I’m a big fan of the new Label approach, but until Labels are fully implemented, I’m going to have to skip them. An alternative would be to use regular expressions, but they’re way slower than Lucene.

    Pre-3.0 versions of SDN used a “__type__” property on nodes and relationships to know which entity type they belonged to. A “__types__” Lucene index would be created as well, so you could easily find entities based on their class name. As I heavily rely on this behavior, I needed to reconfigure SDN to override its default Label strategy. A potential way to do this is as follows:

    @Configuration
    public class SpringConfiguration extends Neo4jAspectConfiguration {
      
        // use legacy type representation (Strategy.Labeled is the default)
        @Override
        @Bean
        public TypeRepresentationStrategyFactory typeRepresentationStrategyFactory() throws Exception {
            return new TypeRepresentationStrategyFactory(graphDatabase(), Strategy.Indexed);
        }
        
        // Use FQN by default
        @Override
        @Bean
        protected EntityAlias entityAlias() {
            return new ClassNameAlias();
        }
    
    }
    

Now that the legacy behavior is re-enabled, I can successfully run my unit tests again. A big plus is that there’s no need to migrate my existing data. You’ll also notice that all classes related to legacy indexing are deprecated in SDN, but until Label indexes are fully implemented, that’s something I’ll have to live with (I’ll get over it ;-)).

Having switched to Neo4j 2.0, I’m getting the power of Cypher 2.0 which is more complete than its predecessor and if I’m not mistaken, should be slightly faster as well.

If you have any remarks or questions, feel free to leave a constructive comment. Happy graphing (or whatnot…) :-)

Batch processing with Neo4j

Hi all! I’ve been planning to start a blog for quite a while now, so it’s about time to do so. As you might know, I’m a graph database enthusiast, especially loving Neo4j. A few days ago, I came across a blog post that announced the arrival of the PERIODIC COMMIT statement in Cypher. There’s already a way to try it out, but you’ll have to grab an early milestone release (which is 2.1.0-M1 at the moment). If you’re using Maven, you can get it as follows:

<dependency>
    <groupId>org.neo4j</groupId>
    <artifactId>neo4j</artifactId>
    <version>2.1.0-M1</version>
</dependency>

<repository>
    <id>neo4j</id>
    <url>http://m2.neo4j.org/content/repositories/releases/</url>
    <releases>
        <enabled>true</enabled>
    </releases>
</repository>

So, what does “PERIODIC COMMIT” actually mean? To state the announcement blog post:

For the best performance, you can give Cypher a hint about transaction size so that after ‘n’ updates a transaction will be committed…

This is perfect for batch processing of your data via Cypher. You can choose when a running transaction has to commit, so performance will greatly benefit. You specify the

USING PERIODIC COMMIT

command at the beginning of your Cypher statement, e.g.:

USING PERIODIC COMMIT MATCH (u:USER) SET u.name = 'changed'

This statement will commit every 10,000 operations by default. You can specify your own amount by appending it to the command:

USING PERIODIC COMMIT 1000 MATCH (u:USER) SET u.name = 'changed'
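
Prefixing statements like this by hand is easy enough, but if you compose statements programmatically, a tiny helper keeps it tidy. This is my own illustration, not part of any Neo4j API:

```java
// Hypothetical helper (my own, not a Neo4j API): prefix a Cypher update
// statement with the periodic-commit hint, optionally with a batch size.
public final class PeriodicCommit {

    private PeriodicCommit() {
    }

    public static String withPeriodicCommit(final String cypher, final Integer batchSize) {
        if (batchSize == null) {
            // let Neo4j fall back to its default of 10,000 operations
            return "USING PERIODIC COMMIT " + cypher;
        }
        return "USING PERIODIC COMMIT " + batchSize + " " + cypher;
    }
}
```

For instance, withPeriodicCommit("MATCH (u:USER) SET u.name = 'changed'", 1000) yields exactly the statement above.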

Let’s benchmark these statements. How is performance affected if we leave the hint out? What happens if we use the default operation count, and what if we play around with our own values? I’m going to use a very basic example, but it will be enough to illustrate the effects. First of all, let’s populate an empty database with 100,000 nodes (no relationships). As this is for demonstration purposes only, we don’t care about the performance of this statement. This is what it could look like in Java:

// create an in-memory test database and populate it with 100,000 labelled nodes
final GraphDatabaseService graphDB = new TestGraphDatabaseFactory().newImpermanentDatabase();
final ExecutionEngine engine = new ExecutionEngine(graphDB);
try (final Transaction tx = graphDB.beginTx()) {
    for (int i = 0; i < 100_000; i++) {
        engine.execute("CREATE (u:USER { name : 'abc'})");
    }
    tx.success();
}

Now the database has some data we can work with. Let’s change the name property for every node that is labelled with USER. Before version 2.1.0, you would execute the following command:

MATCH (u:USER) SET u.name = 'changed'

As you can imagine, updating 100,000 nodes in a single transaction will hurt performance badly. Actually, on my machine, it took approx. 5,066 milliseconds to complete! We can do better, for sure! Luckily there’s a new kid on the block. Let’s put the “PERIODIC COMMIT” command in front of the previous statement:

USING PERIODIC COMMIT MATCH (u:USER) SET u.name = 'changed'

If you’re using Java, make sure you run this statement outside of an existing transaction. If you don’t, you’ll get the following exception message: “org.neo4j.cypher.PeriodicCommitInOpenTransactionException: Executing queries that use periodic commit in an open transaction is not possible.”. After execution, it appears that the performance is already slightly better than before: it took on average 4,160 milliseconds to execute. In this particular case, the statement will commit exactly 10 times, as we have 100,000 nodes and a default commit count of one commit every 10,000 operations. But still, the difference on my machine is not that satisfying. Let’s use a custom commit count, say 1,000.

USING PERIODIC COMMIT 1000 MATCH (u:USER) SET u.name = 'changed'

And we already seem to have a winner! It took approx. 3,407 milliseconds, which is roughly a third faster than our initial statement! We could of course play around with more values, but at least on my machine, 1,000 seems to be very reasonable.
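
As a sanity check on these numbers, the amount of commits a periodic-commit run performs is easy to compute. A quick helper, purely my own illustration:

```java
// Number of commits for a given operation count and batch size: one per
// full batch, plus one for a trailing partial batch (integer ceiling).
public final class CommitCount {

    private CommitCount() {
    }

    public static long commits(final long operations, final long batchSize) {
        return (operations + batchSize - 1) / batchSize;
    }
}
```

For our 100,000 updates this gives 10 commits with the default batch size of 10,000, and 100 commits with a batch size of 1,000.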

It’s great to see that there’s such a huge performance gain to be had with Cypher. But what if we used the Java API to change every node’s name property? What are the performance implications? Let’s recreate the last Cypher statement with Java code. It looks something like:

final Label userLabel = DynamicLabel.label("USER");
final String name = "name";
final String value = "changed";
final int batchSize = 1_000;
int count = 0;
Transaction tx = graphDB.beginTx();
for (final Node node : GlobalGraphOperations.at(graphDB).getAllNodesWithLabel(userLabel)) {
    node.setProperty(name, value);
    // commit and start a fresh transaction after every batchSize updates
    if (++count % batchSize == 0) {
        tx.success();
        tx.close();
        tx = graphDB.beginTx();
    }
}
// commit the remaining (partial) batch
tx.success();
tx.close();

Again after three runs, the average run time is 3,017 ms! That’s about 400 ms faster than the fastest Cypher statement. In this simple example, the Java API seems to (slightly) outperform Cypher. If you have the luxury of using Java, I’d advise you to consider it as well. If you’re accessing your Neo4j database via REST, then this Cypher enhancement is well worth your consideration. Either way, if you have the resources, take your time to investigate the fastest way to batch manipulate your data. Play around and choose wisely. But please bear in mind that when the processing fails at some point, you might have data that’s already been committed! Whether that’s a bad thing or not is up to you to decide.
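
If you want to run such a comparison yourself, a minimal harness along these lines is all it takes (my own sketch, not a Neo4j utility): wrap each candidate, whether a Cypher call or a Java API loop, in a Runnable and average a few runs.

```java
// Minimal benchmarking harness: average wall-clock time of a task over
// a number of runs, reported in milliseconds.
public final class Benchmark {

    private Benchmark() {
    }

    public static long averageMillis(final Runnable task, final int runs) {
        long totalNanos = 0;
        for (int i = 0; i < runs; i++) {
            final long start = System.nanoTime();
            task.run();
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / runs / 1_000_000;
    }
}
```

Bear in mind that a serious benchmark would also need warm-up runs; this only gives a rough indication.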

Thanks Neo4j team for such an awesome addition! Keep up the good work (and I’m sure you will ;-))… If there are questions, corrections or suggestions, feel free to add a comment or contact me directly at timmy dot storms at gmail dot com.