Java code skeletons, tips and pointers: July 2011

In this article I'm going to discuss how to add a search function to a Java Web Application. These are the requirements:
- You have an existing web application, so you're not starting from scratch
- The contents you want to search is stored in a database you can access
- The contents should be automatically indexed and updated when changes occur to the database

1. Why Solr?

Apache's Lucene is the de facto indexing standard for java. It's fast and has a lot of features (see http://lucene.apache.org/ for more information). Apache Solr (http://lucene.apache.org/solr/) can be seen as an extension to Lucene, made for web applications. It's actually a web application in it's own right and if you start it you get a fully working search application and even an administration application. It also allows you to easily query and update the index. It's a very impressive, but also very complex project. Fortunately we'll only need a few parts from this project. I'll tell you which parts in the next chapter.

2. Downloading and installing Solr

The first thing you'll need to do is download Solr. Go to http://www.apache.org/dyn/closer.cgi/lucene/solr/ and grab the latest release (this article is based on release 3.3.0). Download the correct file for your platform and extract the archive somewhere on your system.
Solrs distribution format is a bit unusual: it's a war (Web ARchive) file and the people at Apache seem to expect that you will just drop this war file into your servlet container (eg Tomcat) and be ready to go. Unfortunately this is not what we want: we want to add Solr to an existing web application. So the first thing you'll need to do is extract this war file somewhere on your system. I'll call this location SOLR_WAR. A war file is basically just a zip file, so you can use winrar or similar if you're on a Windows system. So go ahead and extract the apache-solr-3.3.0.war file (it's in apache-solr-3.3.0\dist).
Now you'll need to add the Solr jar files to your existing web application (like all jar files, they go in WEB-INF\lib). You'll need at lease the following files for Solr to start correctly:

SOLR_WAR\WEB-INF\lib\apache-solr-core-3.3.0.jar
SOLR_WAR\WEB-INF\lib\apache-solr-solrj-3.3.0.jar
SOLR_WAR\WEB-INF\lib\commons-codec-1.4.jar
SOLR_WAR\WEB-INF\lib\commons-fileupload-1.2.1.jar
SOLR_WAR\WEB-INF\lib\commons-httpclient-3.1
SOLR_WAR\WEB-INF\lib\commons-io-1.4.jar
SOLR_WAR\WEB-INF\lib\lucene-analyzers-3.3.0.jar
SOLR_WAR\WEB-INF\lib\lucene-core-3.3.0.jar
SOLR_WAR\WEB-INF\lib\lucene-highlighter-3.3.0.jar
SOLR_WAR\WEB-INF\lib\lucene-spatial-3.3.0.jar
SOLR_WAR\WEB-INF\lib\lucene-spellchecker-3.3.0.jar
SOLR_WAR\WEB-INF\lib\slf4j-api-1.6.1.jar
SOLR_WAR\WEB-INF\lib\slf4j-jdk14-1.6.1.jar
SOLR_WAR\WEB-INF\lib\velocity-1.6.1.jar

To use the DataImportHandler feature (which will feed the data in the database to the Lucene index), you'll also need the following jar file:

apache-solr-3.3.0\dist\apache-solr-dataimporthandler-3.3.0.jar

The next step is to edit your web.xml file to add the Solr servlets and filter. Here are the sections you should add:

<filter>
  <filter-name>SolrRequestFilter</filter-name>
  <filter-class>org.apache.solr.servlet.SolrDispatchFilter</filter-class>  
 </filter>

 <filter-mapping>
  
  <filter-name>SolrRequestFilter</filter-name>
  <url-pattern>/dataimport</url-pattern>
 </filter-mapping>
 <servlet>
  <servlet-name>SolrServer</servlet-name>
  <servlet-class>org.apache.solr.servlet.SolrServlet</servlet-class>
  <load-on-startup>1</load-on-startup>
 </servlet>
 <servlet>
  <servlet-name>SolrUpdate</servlet-name>
  <servlet-class>org.apache.solr.servlet.SolrUpdateServlet</servlet-class>
  <load-on-startup>2</load-on-startup>
 </servlet>
 <servlet>
  <servlet-name>Logging</servlet-name>
  <servlet-class>org.apache.solr.servlet.LogLevelSelection</servlet-class>
 </servlet>
<servlet-mapping>
  <servlet-name>SolrUpdate</servlet-name>
  <url-pattern>/update/*</url-pattern>
 </servlet-mapping>
 <servlet-mapping>
  <servlet-name>Logging</servlet-name>
  <url-pattern>/admin/logging</url-pattern>
 </servlet-mapping>

3. Configuring Solr

Solr has it's own configuration mechanism. It's not just a file, but an entire folder. The easiest way to set it up is to copy the solr folder from the distribution (it's in apache-solr-3.3.0\example) to a new location on your file system. I'll call this location SOLR_PATH.
The first thing you'll need to do is point your servlet container (eg Tomcat) to the location of the Solr configuration folder. This is done by adding the following VM argument:

-Dsolr.solr.home=SOLR_PATH

(where you replace SOLR_PATH by the location on your file system that you copied the solr configuration folder to). Where and how to add this argument depends on the servlet container and/or IDE you're using. For tomcat you could modify the catalina.bat file and add the following line at the top:

set %JAVA_OPTS%=%JAVA_OPTS% -Dsolr.solr.home=SOLR_PATH

(where you replace SOLR_PATH by the location on your file system that you copied the solr configuration folder to)
If you want to you can try starting your web server now. If you get any errors, please make sure you copied all the jar files and that the VM argument is configured correctly.

We're not done yet though. First we'll need to edit our schema, so Solr knows which fields you want to index. This is done in the SOLR_PATH\conf\schema.xml file. Open this in your favorite editor. The first element is the schema element. You can change the name to anything you want, but this doesn't matter (the name is only for display purposes). The schema element is followed by a types element. This is similar to a database: there are different data types depending on the type of data you want to store. I highly recommend reading http://wiki.apache.org/solr/SchemaXml to understand how the data types work. You don't need to modify anything to the types though.
The next section in the schema.xml file is the fields element. This is where you declare the fields that will be indexed (ie made searchable). To keep the example simple, I'll use products. Products have a unique id, a name and a description. You could define the fields as follows:

<field name="id" type="int" indexed="true" stored="true" required="true" />    
<field name="name" type="text_general" indexed="true" stored="false"/>
<field name="description" type="text_en" indexed="true" stored="false"/>

We'll also define an additional "text" field, which contains all of the product information. This field will allow us to search all the fields at once. Define it as follows:

<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>

After the closing fields element, define the following elements:

<uniqueKey>id</uniqueKey>

and

<defaultSearchField>text</defaultSearchField>

We'll also define the copyfields, which will copy all the data to the "text" field we defined above:

<copyField source="name" dest="text"/>
<copyField source="description" dest="text"/>

If you don't understand this and would like to know what's going on, please read http://wiki.apache.org/solr/SchemaXml.

4. Configuring the DataImportHandler

Now we need a way to index the contents of the database in the Solr index. Fortunately, Solr has a mechanism for this, called the DataImportHandler (see http://wiki.apache.org/solr/DataImportHandler). First we need to tell Solr we want to use the DataImportHandler. Open the SOLR_PATH\conf\solrconfig.xml file and add the following code:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
   <str name="config">data-config.xml</str>
  </lst>
 </requestHandler>

Now create the file SOLR_PATH\conf\data-config.xml. The contents should look like this:

<dataConfig>
<dataSource
      jndiName="java:comp/env/jdbc/myDB"
      type="JdbcDataSource"/>      
 <document>
  <entity name="product" pk="id" 
   query="select id, name, description FROM product WHERE '${dataimporter.request.clean}' != 'false' OR last_modified > '${dataimporter.last_index_time}'">
  </entity>
    </document>
</dataConfig>

You should modify the jndiName to the jndi name of the database you are using. You can also provide a connection URL and username/password if you're not using jndi, as follows:

<dataSource type="JdbcDataSource" name="ds-1" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://db1-host/dbname" user="db_username" password="db_password"/>

As you can see in the query, your product table will need a last_modified field, which is a timestamp that contains the last data at which this record was updated (or created). If you don't have this field, Solr can't know which records have been updated since the last import and you will be forced to perform a full import each time (which is a big performance hit on large tables).
Now when you go to the following url: http://localhost:8080/MY_APP/dataimport?command=full-import&clean=false the data will be imported from your database in the lucene index. Set the clean parameter to "true" to do a full import.
You can schedule a request to http://localhost:8080/MY_APP/dataimport?command=full-import&clean=false to automatically update the index each hour, day, week... On linux you could add a cron job that does a

wget http://localhost:8080/MY_APP/dataimport?command=full-import&clean=false

5. Querying the index

We're almost there. We have a working index, which gets updated with data from our database. Now all we need to do is query the index. In our backend bean we define a reference to the Solr server:

private static final SolrServer solrServer = initSolrServer();
private static SolrServer initSolrServer() {
  try {
    CoreContainer.Initializer initializer = new CoreContainer.Initializer();
    CoreContainer coreContainer = initializer.initialize();
    EmbeddedSolrServer server = new EmbeddedSolrServer(coreContainer, "");
    return server;
  } catch (Exception ex) {
   logger.log(Level.SEVERE, "Error initializing SOLR server", ex);
   return null;
  }
}

And when we want to query Solr, we do the following:

SolrQuery query = new SolrQuery(keyword);
QueryResponse response = solrServer.query(query);
SolrDocumentList documents = response.getResults();
for (SolrDocument document : documents) {
  Integer id = (Integer) document.get("id");
  //load this product from the database using its id
}

We're using Solrj to query Solr. For more info on how to construct queries using Solrj, please see http://wiki.apache.org/solr/Solrj

6. Conclusion

You should now have a working index, that gets updated with data from the database and that you can query directly from Java, all nicely integrated in our existing web application!

Java code skeletons, tips and pointers

Friday, July 15, 2011

Tomcat shutdown fails after installing Solr

Thursday, July 14, 2011

Adding Solr to an existing web application