Thursday, July 14, 2011

Adding Solr to an existing web application

In this article I'm going to discuss how to add a search function to a Java Web Application. These are the requirements:
- You have an existing web application, so you're not starting from scratch
- The content you want to search is stored in a database you can access
- The contents should be automatically indexed and updated when changes occur to the database

1. Why Solr?

Apache's Lucene is the de facto indexing standard for Java. It's fast and has a lot of features (see http://lucene.apache.org/ for more information). Apache Solr (http://lucene.apache.org/solr/) can be seen as an extension to Lucene, made for web applications. It's actually a web application in its own right and if you start it you get a fully working search application and even an administration application. It also allows you to easily query and update the index. It's a very impressive, but also very complex project. Fortunately we'll only need a few parts from this project. I'll tell you which parts in the next chapter.

2. Downloading and installing Solr

The first thing you'll need to do is download Solr. Go to http://www.apache.org/dyn/closer.cgi/lucene/solr/ and grab the latest release (this article is based on release 3.3.0). Download the correct file for your platform and extract the archive somewhere on your system.
Solr's distribution format is a bit unusual: it's a war (Web ARchive) file and the people at Apache seem to expect that you will just drop this war file into your servlet container (eg Tomcat) and be ready to go. Unfortunately this is not what we want: we want to add Solr to an existing web application. So the first thing you'll need to do is extract the apache-solr-3.3.0.war file (it's in apache-solr-3.3.0\dist) somewhere on your system. I'll call this location SOLR_WAR. A war file is basically just a zip file, so you can use WinRAR or similar if you're on a Windows system.
Now you'll need to add the Solr jar files to your existing web application (like all jar files, they go in WEB-INF\lib). You'll need at least the following files for Solr to start correctly:
  • SOLR_WAR\WEB-INF\lib\apache-solr-core-3.3.0.jar
  • SOLR_WAR\WEB-INF\lib\apache-solr-solrj-3.3.0.jar
  • SOLR_WAR\WEB-INF\lib\commons-codec-1.4.jar
  • SOLR_WAR\WEB-INF\lib\commons-fileupload-1.2.1.jar
  • SOLR_WAR\WEB-INF\lib\commons-httpclient-3.1.jar
  • SOLR_WAR\WEB-INF\lib\commons-io-1.4.jar
  • SOLR_WAR\WEB-INF\lib\lucene-analyzers-3.3.0.jar
  • SOLR_WAR\WEB-INF\lib\lucene-core-3.3.0.jar
  • SOLR_WAR\WEB-INF\lib\lucene-highlighter-3.3.0.jar
  • SOLR_WAR\WEB-INF\lib\lucene-spatial-3.3.0.jar
  • SOLR_WAR\WEB-INF\lib\lucene-spellchecker-3.3.0.jar
  • SOLR_WAR\WEB-INF\lib\slf4j-api-1.6.1.jar
  • SOLR_WAR\WEB-INF\lib\slf4j-jdk14-1.6.1.jar
  • SOLR_WAR\WEB-INF\lib\velocity-1.6.1.jar
To use the DataImportHandler feature (which will feed the data in the database to the Lucene index), you'll also need the following jar file:
  • apache-solr-3.3.0\dist\apache-solr-dataimporthandler-3.3.0.jar

The next step is to edit your web.xml file to add the Solr servlets and filter. Here are the sections you should add:
<filter>
  <filter-name>SolrRequestFilter</filter-name>
  <filter-class>org.apache.solr.servlet.SolrDispatchFilter</filter-class>
</filter>
<filter-mapping>
  <filter-name>SolrRequestFilter</filter-name>
  <url-pattern>/dataimport</url-pattern>
</filter-mapping>
<servlet>
  <servlet-name>SolrServer</servlet-name>
  <servlet-class>org.apache.solr.servlet.SolrServlet</servlet-class>
  <load-on-startup>1</load-on-startup>
</servlet>
<servlet>
  <servlet-name>SolrUpdate</servlet-name>
  <servlet-class>org.apache.solr.servlet.SolrUpdateServlet</servlet-class>
  <load-on-startup>2</load-on-startup>
</servlet>
<servlet>
  <servlet-name>Logging</servlet-name>
  <servlet-class>org.apache.solr.servlet.LogLevelSelection</servlet-class>
</servlet>
<servlet-mapping>
  <servlet-name>SolrUpdate</servlet-name>
  <url-pattern>/update/*</url-pattern>
</servlet-mapping>
<servlet-mapping>
  <servlet-name>Logging</servlet-name>
  <url-pattern>/admin/logging</url-pattern>
</servlet-mapping>

3. Configuring Solr

Solr has its own configuration mechanism. It's not just a file, but an entire folder. The easiest way to set it up is to copy the solr folder from the distribution (it's in apache-solr-3.3.0\example) to a new location on your file system. I'll call this location SOLR_PATH.
The first thing you'll need to do is point your servlet container (eg Tomcat) to the location of the Solr configuration folder. This is done by adding the following VM argument:
-Dsolr.solr.home=SOLR_PATH
(where you replace SOLR_PATH by the location on your file system that you copied the solr configuration folder to). Where and how to add this argument depends on the servlet container and/or IDE you're using. For Tomcat you could modify the catalina.bat file and add the following line at the top:
set JAVA_OPTS=%JAVA_OPTS% -Dsolr.solr.home=SOLR_PATH
(again replacing SOLR_PATH by that location)
If you want to, you can try starting your web server now. If you get any errors, please make sure you copied all the jar files and that the VM argument is configured correctly.
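If the server refuses to start, a small check like the one below can help narrow down which of the two is wrong. This is only a sketch: SolrSetupCheck and its methods are hypothetical helpers I made up for this article, not part of Solr, and the check only sees the jars in WEB-INF\lib when you call it from inside your web application (for example from a servlet).

```java
import java.io.File;

// Hypothetical helper (not part of Solr): checks the Solr home folder
// and whether the Solr classes can be found on the classpath.
class SolrSetupCheck {

    // Returns null when the Solr home looks usable, otherwise a description of the problem.
    static String checkSolrHome(String solrHome) {
        if (solrHome == null || solrHome.trim().isEmpty()) {
            return "solr.solr.home is not set";
        }
        File conf = new File(solrHome, "conf");
        if (!new File(conf, "solrconfig.xml").isFile()) {
            return "no conf/solrconfig.xml under " + solrHome;
        }
        if (!new File(conf, "schema.xml").isFile()) {
            return "no conf/schema.xml under " + solrHome;
        }
        return null;
    }

    // True when the class can be located by the class loader that loaded this helper.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, SolrSetupCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        } catch (LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String problem = checkSolrHome(System.getProperty("solr.solr.home"));
        System.out.println(problem == null ? "Solr home looks fine" : "Problem: " + problem);

        // Class names taken from the web.xml fragment above and the DataImportHandler jar
        String[] classes = {
            "org.apache.solr.servlet.SolrDispatchFilter",
            "org.apache.solr.handler.dataimport.DataImportHandler"
        };
        for (String name : classes) {
            System.out.println(name + (isOnClasspath(name) ? " found" : " MISSING"));
        }
    }
}
```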

We're not done yet though. First we'll need to edit our schema, so Solr knows which fields you want to index. This is done in the SOLR_PATH\conf\schema.xml file. Open this in your favorite editor. The first element is the schema element. You can change the name to anything you want, but this doesn't matter (the name is only for display purposes). The schema element is followed by a types element. This is similar to a database: there are different data types depending on the type of data you want to store. I highly recommend reading http://wiki.apache.org/solr/SchemaXml to understand how the data types work. You don't need to modify anything in the types element though.
The next section in the schema.xml file is the fields element. This is where you declare the fields that will be indexed (ie made searchable). To keep the example simple, I'll use products. Products have a unique id, a name and a description. You could define the fields as follows:
<field name="id" type="int" indexed="true" stored="true" required="true" />    
<field name="name" type="text_general" indexed="true" stored="false"/>
<field name="description" type="text_en" indexed="true" stored="false"/>
We'll also define an additional "text" field, which contains all of the product information. This field will allow us to search all the fields at once. Define it as follows:
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
After the closing fields element, define the following elements:
<uniqueKey>id</uniqueKey>
and
<defaultSearchField>text</defaultSearchField>
We'll also define the copyField elements, which will copy the name and description data to the "text" field we defined above:
<copyField source="name" dest="text"/>
<copyField source="description" dest="text"/>
If you don't understand this and would like to know what's going on, please read http://wiki.apache.org/solr/SchemaXml.
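If you find the copyField idea hard to picture, the following toy model may help. It is purely conceptual: CopyFieldSketch is a hypothetical class that only imitates the effect of the two copyField rules on one document, and it says nothing about how Solr implements this internally. The sample product values are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Conceptual model only: this is NOT Solr code.
class CopyFieldSketch {

    // The two copyField rules from schema.xml, as {source, dest} pairs
    static final String[][] RULES = { {"name", "text"}, {"description", "text"} };

    // A made-up product document: field name -> list of values
    static Map<String, List<String>> sampleProduct() {
        Map<String, List<String>> doc = new LinkedHashMap<String, List<String>>();
        doc.put("id", Arrays.asList("42"));
        doc.put("name", Arrays.asList("Red kettle"));
        doc.put("description", Arrays.asList("Boils water quickly"));
        return doc;
    }

    // Appends the values of each source field to its destination field.
    static Map<String, List<String>> applyCopyFields(Map<String, List<String>> doc,
                                                     String[][] rules) {
        Map<String, List<String>> result = new LinkedHashMap<String, List<String>>();
        for (Map.Entry<String, List<String>> e : doc.entrySet()) {
            result.put(e.getKey(), new ArrayList<String>(e.getValue()));
        }
        for (String[] rule : rules) {
            List<String> source = doc.get(rule[0]);
            if (source == null) {
                continue; // this document has no value for the source field
            }
            List<String> dest = result.get(rule[1]);
            if (dest == null) {
                dest = new ArrayList<String>();
                result.put(rule[1], dest);
            }
            dest.addAll(source);
        }
        return result;
    }

    public static void main(String[] args) {
        // The "text" field ends up with one value per source field
        System.out.println(applyCopyFields(sampleProduct(), RULES).get("text"));
    }
}
```

The "text" field receives two values here, one from each source, which is why it is declared with multiValued="true".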

4. Configuring the DataImportHandler

Now we need a way to index the contents of the database in the Solr index. Fortunately, Solr has a mechanism for this, called the DataImportHandler (see http://wiki.apache.org/solr/DataImportHandler). First we need to tell Solr we want to use the DataImportHandler. Open the SOLR_PATH\conf\solrconfig.xml file and add the following code:
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>
Now create the file SOLR_PATH\conf\data-config.xml. The contents should look like this:
<dataConfig>
  <dataSource
      jndiName="java:comp/env/jdbc/myDB"
      type="JdbcDataSource"/>
  <document>
    <entity name="product" pk="id"
        query="select id, name, description FROM product WHERE '${dataimporter.request.clean}' != 'false' OR last_modified > '${dataimporter.last_index_time}'">
    </entity>
  </document>
</dataConfig>
You should modify the jndiName to the JNDI name of the database you are using. You can also provide a connection URL and username/password if you're not using JNDI, as follows:
<dataSource type="JdbcDataSource" name="ds-1" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://db1-host/dbname" user="db_username" password="db_password"/>
As you can see in the query, your product table will need a last_modified field, which is a timestamp that contains the last date at which this record was updated (or created). If you don't have this field, Solr can't know which records have been updated since the last import and you will be forced to perform a full import each time (which is a big performance hit on large tables).
Now when you go to the following URL: http://localhost:8080/MY_APP/dataimport?command=full-import&clean=false the records that changed since the last import will be imported from your database into the Lucene index. Set the clean parameter to "true" to do a full import of all records.
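To see why the WHERE clause in the query behaves differently depending on the clean parameter, it helps to fill in the two placeholders by hand. The sketch below does just that with plain string replacement. DeltaQuerySketch is a hypothetical class, the real substitution is done by the DataImportHandler itself, and the timestamp is an example value I picked.

```java
// Imitates the placeholder substitution for illustration; not DataImportHandler code.
class DeltaQuerySketch {

    // The WHERE clause from data-config.xml
    static final String WHERE_TEMPLATE =
        "'${dataimporter.request.clean}' != 'false'"
        + " OR last_modified > '${dataimporter.last_index_time}'";

    // Replaces both placeholders with the given values (literal replacement, no regex).
    static String resolve(String clean, String lastIndexTime) {
        return WHERE_TEMPLATE
            .replace("${dataimporter.request.clean}", clean)
            .replace("${dataimporter.last_index_time}", lastIndexTime);
    }

    public static void main(String[] args) {
        String last = "2011-07-14 09:00:00"; // example timestamp, an assumption
        // clean=true: the first condition is always true, so every row is selected
        System.out.println(resolve("true", last));
        // clean=false: the first condition is always false, so only rows
        // modified after the last import are selected
        System.out.println(resolve("false", last));
    }
}
```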
You can schedule a request to http://localhost:8080/MY_APP/dataimport?command=full-import&clean=false to automatically update the index each hour, day, week... On Linux you could add a cron job that runs the following command (the quotes keep the shell from interpreting the & character):
wget "http://localhost:8080/MY_APP/dataimport?command=full-import&clean=false"
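If you can't use cron (on Windows, for example), you could also schedule the request from inside the web application. The following is a minimal sketch using only the JDK. ImportScheduler is a hypothetical class name, and the timeouts are values I chose arbitrarily.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: requests the dataimport URL at a fixed interval.
class ImportScheduler {

    // Builds the same URL as used in the wget example.
    static String importUrl(String baseUrl, boolean clean) {
        return baseUrl + "/dataimport?command=full-import&clean=" + clean;
    }

    // Sends a GET request and returns the HTTP status code.
    static int trigger(String url) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
        con.setConnectTimeout(5000);   // arbitrary timeouts, adjust to your setup
        con.setReadTimeout(60000);
        try {
            int status = con.getResponseCode();
            InputStream in = status < 400 ? con.getInputStream() : con.getErrorStream();
            if (in != null) {
                in.close();
            }
            return status;
        } finally {
            con.disconnect();
        }
    }

    // Starts a background thread that triggers the import every periodMinutes minutes.
    static ScheduledExecutorService start(final String url, long periodMinutes) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        executor.scheduleAtFixedRate(new Runnable() {
            public void run() {
                try {
                    trigger(url);
                } catch (Exception e) {
                    // An exception escaping run() would cancel all future runs
                    System.err.println("Import request failed: " + e);
                }
            }
        }, periodMinutes, periodMinutes, TimeUnit.MINUTES);
        return executor;
    }

    public static void main(String[] args) {
        System.out.println(importUrl("http://localhost:8080/MY_APP", false));
    }
}
```

If you go this route, remember to call shutdown() on the returned executor when the web application stops (for example in a ServletContextListener), otherwise the thread keeps running after an undeploy.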

5. Querying the index

We're almost there. We have a working index, which gets updated with data from our database. Now all we need to do is query the index. In our backend bean we define a reference to the Solr server:
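import java.util.logging.Level;
import java.util.logging.Logger;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.core.CoreContainer;

// Fields of the backend bean. The logger is declared first because static
// initializers run in textual order and initSolrServer() uses it.
private static final Logger logger = Logger.getLogger("SolrSearch"); // name is arbitrary
private static final SolrServer solrServer = initSolrServer();
private static SolrServer initSolrServer() {
  try {
    // Locates the configuration through the solr.solr.home VM argument
    CoreContainer.Initializer initializer = new CoreContainer.Initializer();
    CoreContainer coreContainer = initializer.initialize();
    // The empty string selects the default core
    return new EmbeddedSolrServer(coreContainer, "");
  } catch (Exception ex) {
    logger.log(Level.SEVERE, "Error initializing SOLR server", ex);
    return null;
  }
}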
And when we want to query Solr, we do the following:
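import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

SolrQuery query = new SolrQuery(keyword);
// query() throws SolrServerException when the search fails
QueryResponse response = solrServer.query(query);
SolrDocumentList documents = response.getResults();
for (SolrDocument document : documents) {
  // "id" is the only field declared with stored="true", so it is the only value returned
  Integer id = (Integer) document.get("id");
  //load this product from the database using its id
}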
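The comment inside the loop leaves the database lookup open. One possible approach, sketched below with plain JDBC, is to collect all ids first and fetch the products in a single query instead of one query per hit. ProductLoader is a hypothetical class, and the table and column names are the ones from the data-config.xml query. How you obtain the Connection depends on your application.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: loads the products found by Solr from the database.
class ProductLoader {

    // Builds "SELECT ... WHERE id IN (?, ?, ...)" with one placeholder per id.
    static String buildQuery(int idCount) {
        if (idCount < 1) {
            throw new IllegalArgumentException("at least one id is required");
        }
        StringBuilder sql = new StringBuilder(
            "SELECT id, name, description FROM product WHERE id IN (");
        for (int i = 0; i < idCount; i++) {
            sql.append(i == 0 ? "?" : ", ?");
        }
        return sql.append(")").toString();
    }

    // Returns the names of the products with the given ids.
    static List<String> loadNames(Connection con, List<Integer> ids) throws SQLException {
        List<String> names = new ArrayList<String>();
        if (ids.isEmpty()) {
            return names; // no search hits, nothing to load
        }
        PreparedStatement ps = con.prepareStatement(buildQuery(ids.size()));
        try {
            for (int i = 0; i < ids.size(); i++) {
                ps.setInt(i + 1, ids.get(i)); // JDBC parameters are 1-based
            }
            ResultSet rs = ps.executeQuery();
            while (rs.next()) {
                names.add(rs.getString("name"));
            }
        } finally {
            ps.close(); // closing the statement also closes its ResultSet
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(buildQuery(3));
    }
}
```

Keep in mind that an IN query returns rows in whatever order the database chooses, so if you want to keep Solr's relevance order you have to re-sort the rows by the position of their id in the result list.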
We're using SolrJ (the Java client library that ships with Solr) to query Solr. For more info on how to construct queries using SolrJ, please see http://wiki.apache.org/solr/Solrj
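One thing to watch out for: the keyword is passed to SolrQuery as-is, so characters that have a special meaning in the query syntax (a colon or an unbalanced parenthesis, for example) can cause a parse error. SolrJ ships ClientUtils.escapeQueryChars for this. The sketch below shows the idea with a hand-written, hypothetical QueryEscaper that only uses the JDK. The list of characters is my own selection, and whitespace is deliberately left alone so that several words stay separate search terms.

```java
// Hypothetical helper: puts a backslash in front of query syntax characters.
class QueryEscaper {

    // Characters treated as special here (an assumption, adjust to your needs)
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&";

    static String escape(String input) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\'); // escape the special character
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("name:(kettle"));
    }
}
```

You would then build the query with new SolrQuery(QueryEscaper.escape(keyword)). The downside is that your users can no longer use the query syntax on purpose, so only do this if you want a plain keyword search.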

6. Conclusion

You should now have a working index that gets updated with data from the database and that you can query directly from Java, all nicely integrated in your existing web application!

18 comments:

  1. This is a full tutorial for integration Slor into Webapp. It is really useful. Thank the author so much.

  2. Hi! thanks for your tutorial, however I have a problem, I am unable to get any search results , though the database import is done. Please help.

  3. Where to write 5.Querying the index in my program? means in which class should i have to write?

    This code:
    private static final SolrServer solrServer = initSolrServer();
    private static SolrServer initSolrServer() {
    try {
    CoreContainer.Initializer initializer = new CoreContainer.Initializer();
    CoreContainer coreContainer = initializer.initialize();
    EmbeddedSolrServer server = new EmbeddedSolrServer(coreContainer, "");
    return server;
    } catch (Exception ex) {
    logger.log(Level.SEVERE, "Error initializing SOLR server", ex);
    return null;
    }
    }

  4. bhavesh jogi: This happens in the backend. Depending on the technology you're using it could be in a servlet, managed bean, ejb...

    Replies
    1. I want to make search engine in java web application. I am fresher. How to start in eclipse? You said i have to write some code in servlet. like what u said in your comment. Please help me more. if possible send me this code in war file or zip file if possible. My email id : bhavesh.jogi007@gmail.com.

      Please help me. Send me code for reference.

  5. I want to make search engine in java web application. I am fresher. How to start in eclipse? You said i have to write some code in servlet. like what u said in your comment. Please help me more. if possible send me this code in war file or zip file if possible. My email id : bhavesh.jogi007@gmail.com.

    Please help me. Send me code for reference.

  6. how to read or access the solr index file i need program pls help me

  7. Hi,
    Thank you for this tutorial, I was looking lot how to integrate solr with a web application.

    We appreciate if you can post the example as an eclipse or maven project.

    In your code :

    for (SolrDocument document : documents) {
    Integer id = (Integer) document.get("id");
    //load this product from the database using its id
    }

    why did you say load this product from the database, I think it should be loaded from the index.

    Thanks lot.

    Replies
    1. Hi Majid,

      You usually don't store all the information in the index: this would make the index very large. You typically only store the fields you want to be able to search by. If the information you need is already present in the index, you don't need to query the database of course.
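      In our example, if you needed the description of the product and had declared that field with stored="true" in schema.xml (the article uses stored="false", which means the value is not returned), you could just do: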
      document.get("description")
      But if you need more information (like say the price), you would need to load the product from the database.

  8. Thank you Mathias for your fast reply, I will appreciate if you can post the source code.
    thanks

    Replies
    1. Hi Majid,

      Querying your database is typically done using JDBC or Hibernate. You probably need to query your database in other parts of your project? You can use the same code. If you've never queried a database in Java, I would suggest starting with http://docs.oracle.com/javase/tutorial/jdbc/basics/

      Unfortunately I can't post my code to query the database here: it consists of multiple classes and DAOs, which would lead us too far and not be of much help to you.

  9. hi how to access admin console for this type of integration?
    rahulsingh336@gmail.com

  10. Hi, could you please explain why you declare filters and servlets?
    If I got it right you use Solr in embedded mode (because you use EmbeddedSolrServer class), is it correct?

  11. hi guys,
    am new to development if any one have sample solr search engine web project pls send to my mail id: anil.4reddy@gmail.com

  12. hi,
    I have a column of type bfile in oracle data base, I want to index the contents of the file pointed by the bfile to apache solr 3.3, can u pls suggest me a method?

  13. Hi Please provide code for integrating java with solr to display data from database by multiple fields like (id, name, description) through application level. My mail id : kavurupavan@gmail.com . Thanks for help.

  14. where we have to set this point:::
    The first thing you'll need to do is point your servlet container (eg Tomcat) to the location of the Solr configuration folder. This is done by adding the following VM argument:
    -Dsolr.solr.home=SOLR_PATH

    Replies
    1. This is done in the script that starts your web server. This can be in a .bat or .sh file or in your IDE directly.
