Data Import Handler – import data from XML files in Solr XML format

The DataImportHandler (DIH) is a Solr contrib that provides a configuration-driven way to import data from relational databases or XML files into Solr, in both “full import” and “incremental delta import” modes.

In this article I am going to explain how to import and index data from XML files that are already in Solr XML format.

Importing and indexing data from XML files in Solr XML format:

Step 1: Create your dih-config.xml (as below) in the Solr “conf” directory.

dih-config.xml

<dataConfig>
  <dataSource type="FileDataSource" basePath="/home/import/input"
      encoding="utf-8"/>
  <document>
    <entity
      name="document"
      processor="FileListEntityProcessor"
      baseDir="/home/import/data/2011-06-27"
      fileName=".*\.xml$"
      recursive="false"
      rootEntity="false"
      dataSource="null">
      <entity
        processor="XPathEntityProcessor"
        url="${document.fileAbsolutePath}"
        useSolrAddSchema="true"
        stream="true">
      </entity>
    </entity>
  </document>
</dataConfig>

After the data source definition, we have a document definition with two nested entities.

The purpose of the outer (main) entity is to generate the list of files. To do that, we use the FileListEntityProcessor. This entity is self-supporting and doesn’t need any data source (thus dataSource="null"). It supports the following attributes:

  •     fileName (mandatory) – regular expression that selects which files to process
  •     recursive – whether subdirectories should be checked (default: false)
  •     rootEntity – whether the rows produced by this entity should be treated as documents. Because we don’t want to index the file list that this entity provides, we set this attribute to false. The nested entity is then treated as the root entity and its data is indexed.
  •     baseDir (mandatory) – the directory in which the files are located
  •     dataSource – set to "null" here because the entity doesn’t use a data source (this parameter can be omitted in Solr > 1.3)
  •     excludes – regular expression that selects which files to exclude from indexing
  •     newerThan – only files newer than the parameter value are taken into consideration. The value can be a date in the format yyyy-MM-dd HH:mm:ss, a single-quoted date math string such as 'NOW-7DAYS', or a variable that holds the date, for example ${variable}
  •     olderThan – the same as above, but selects older files
  •     biggerThan – only files bigger than the value of this parameter are taken into consideration
  •     smallerThan – only files smaller than the value of this parameter are taken into consideration
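
To illustrate the optional attributes, here is a sketch of the outer entity with a few filters added. The directory, the excludes pattern and the 7-day window are assumptions chosen for illustration only; they are not part of the configuration used in this article.

```xml
<!-- Sketch only: baseDir, excludes and newerThan values are illustrative assumptions -->
<entity
  name="document"
  processor="FileListEntityProcessor"
  baseDir="/home/import/data"
  fileName=".*\.xml$"
  excludes=".*\.tmp\.xml$"
  recursive="true"
  newerThan="'NOW-7DAYS'"
  rootEntity="false"
  dataSource="null">
  <!-- the inner XPathEntityProcessor entity goes here, unchanged -->
</entity>
```

With settings like these, the entity would walk the subdirectories of baseDir and pass on only the matching XML files modified within the last seven days.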
 
The inner entity reads the data contained in each XML file and indexes it into Solr. The file is the one specified by the outer entity, and it is read through the data source. The XPathEntityProcessor, which is used for XML files, has the following attributes:

  •     url – the location of the input data, here the absolute path of the file supplied by the outer entity
  •     useSolrAddSchema – indicates that the input data is in Solr XML format
  •     stream – whether the document should be processed as a stream. For large XML files it’s good to use stream="true", which uses far less memory because the whole XML file is not loaded into memory.
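
For reference, a file in Solr XML format looks roughly like the sketch below. The field names (id, title) and their values are assumptions for illustration; in practice they must match the fields defined in your schema.

```xml
<!-- Sketch of an input file in Solr XML (add/doc/field) format; field names are assumed -->
<add>
  <doc>
    <field name="id">doc-1</field>
    <field name="title">First sample document</field>
  </doc>
  <doc>
    <field name="id">doc-2</field>
    <field name="title">Second sample document</field>
  </doc>
</add>
```

Because useSolrAddSchema is set to true, no field or XPath mappings are declared in the entity; each field element is mapped to the Solr field of the same name.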

Step 2: Add a DIH request handler to solrconfig.xml if it’s not there already.

<requestHandler name="/update/dih" startup="lazy" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
        <str name="config">dih-config.xml</str>
    </lst>
</requestHandler>
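
Because DIH is a contrib, its jar may not be on the classpath by default. If Solr reports that the DataImportHandler class cannot be found, a lib directive in solrconfig.xml along the following lines may help. The relative path and the jar name pattern are assumptions; both depend on your Solr version and directory layout.

```xml
<!-- Sketch only: adjust dir and regex to where the DIH jar lives in your installation -->
<lib dir="../../dist/" regex=".*dataimporthandler-.*\.jar" />
```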

Step 3: Trigger DIH

http://localhost:8983/solr/update/dih?command=full-import&clean=false&commit=true