Norconex HTTP Collector

Developer(s): Norconex Inc.
Stable release: 2.x
Development status: Active
Written in: Java
Operating system: Cross-platform
Type: Web crawler
License: Apache
Website: www.norconex.com/collectors/collector-http/

Norconex HTTP Collector is a web spider, or crawler, initially created for enterprise search integrators and developers. It began as a closed-source project developed by Norconex and was released as open source in 2013.[1][2][3][4][5]

Architecture

Norconex HTTP Collector is built entirely in Java. A single Collector installation is responsible for launching one or more crawler threads, each with its own configuration.

Each step of a crawler's life-cycle is configurable and overridable. Developers can provide their own interface implementations for most steps undertaken by the crawler. The default implementations cover a vast array of crawling use cases and are built on established products such as Apache Tika and Apache Derby.
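As an illustration, a developer could replace the stock URL filtering step with a custom class. The following is a minimal sketch, assuming the 2.x IReferenceFilter contract of a single acceptReference(String) method; the class name NoStagingFilter is hypothetical:

 import com.norconex.collector.core.filter.IReferenceFilter;

 // Hypothetical filter that skips URLs pointing to a staging host.
 // Assumes the 2.x IReferenceFilter contract: acceptReference(String).
 public class NoStagingFilter implements IReferenceFilter {
     @Override
     public boolean acceptReference(String reference) {
         return !reference.contains("staging.");
     }
 }

Such a class could then be referenced by its fully qualified name in the XML configuration, like the stock filters shown in the configuration example below.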

The Importer and Committer modules are separate Apache-licensed Java libraries distributed with the Collector.

The Importer module parses incoming documents from their raw form (HTML, PDF, Word, etc.) into a set of extracted metadata and plain-text content. In addition, it provides interfaces to manipulate a document's metadata, transform its content, or simply filter documents based on their new format. While the Collector is heavily dependent on the Importer module, the latter can be used on its own as a general-purpose document parser.
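A standalone use might look like the following sketch; the class and method names involved (Importer, importDocument, ImporterResponse, ImporterMetadata) are assumptions based on the 2.x API:

 import java.io.File;
 import com.norconex.importer.Importer;
 import com.norconex.importer.doc.ImporterMetadata;
 import com.norconex.importer.response.ImporterResponse;

 // Sketch: parse one local file without running a crawl.
 // Class and method names assume the 2.x Importer API.
 public class ParseOneFile {
     public static void main(String[] args) throws Exception {
         Importer importer = new Importer(); // default configuration
         ImporterResponse response = importer.importDocument(
                 new File("report.pdf"), new ImporterMetadata());
         // Extracted fields (title, author, etc.) land in the metadata.
         System.out.println(response.getDocument().getMetadata());
     }
 }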

The Committer module is responsible for directing the parsed data to a target repository of choice. Developers can write custom implementations, allowing Norconex HTTP Collector to be used with any search engine or repository. Two committer implementations currently exist, for Apache Solr and Elasticsearch.
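A custom committer boils down to implementing a small interface. Below is a minimal sketch, assuming the 2.x ICommitter contract (add, remove, commit) from the committer-core library; ConsoleCommitter is a hypothetical name:

 import java.io.InputStream;
 import com.norconex.committer.core.ICommitter;
 import com.norconex.commons.lang.map.Properties;

 // Hypothetical committer that prints operations instead of sending
 // them to a repository. Assumes the 2.x ICommitter contract.
 public class ConsoleCommitter implements ICommitter {
     @Override
     public void add(String reference, InputStream content, Properties metadata) {
         System.out.println("ADD    " + reference);
     }
     @Override
     public void remove(String reference, Properties metadata) {
         System.out.println("DELETE " + reference);
     }
     @Override
     public void commit() {
         // A real committer would flush batched operations here.
     }
 }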

Minimum Requirements

Java Standard Edition 7.0 or higher is required. The Collector runs on any platform supporting Java.

Configuration

While Norconex HTTP Collector can be configured programmatically, it also supports XML configuration files. Configuration files are parsed with Apache Velocity; Velocity directives permit configuration re-use among different Collector installations, as well as variable substitution.
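A programmatic setup might look like the following sketch; the class and method names (HttpCollectorConfig, HttpCrawlerConfig, start) are assumptions based on the 2.x API:

 import com.norconex.collector.http.HttpCollector;
 import com.norconex.collector.http.HttpCollectorConfig;
 import com.norconex.collector.http.crawler.HttpCrawlerConfig;

 // Sketch: programmatic equivalent of a minimal crawl.
 // Class and method names assume the 2.x API.
 public class MinimalCrawl {
     public static void main(String[] args) {
         HttpCrawlerConfig crawler = new HttpCrawlerConfig();
         crawler.setId("Norconex Minimum Test Page");
         crawler.setStartURLs(
             "http://www.norconex.com/product/collector-http-test/minimum.php");
         crawler.setMaxDepth(0); // crawl only the start URL

         HttpCollectorConfig config = new HttpCollectorConfig();
         config.setId("Minimum Config HTTP Collector");
         config.setCrawlerConfigs(crawler);

         new HttpCollector(config).start(false); // false: do not resume
     }
 }

The same minimal crawl, expressed as an XML configuration file: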

 <httpcollector id="Minimum Config HTTP Collector">
   <progressDir>./examples-output/minimum/progress</progressDir>
   <logsDir>./examples-output/minimum/logs</logsDir>
   <crawlers>
     <crawler id="Norconex Minimum Test Page">
       <startURLs>
         <url>http://www.norconex.com/product/collector-http-test/minimum.php</url>
       </startURLs>
       <workDir>./examples-output/minimum</workDir>
       <!-- Crawl only the start URL; do not follow links. -->
       <maxDepth>0</maxDepth>
       <!-- Wait 5 seconds (5000 ms) between page downloads. -->
       <delay default="5000" />
       <!-- Keep the crawl within the test site. -->
       <referenceFilters>
         <filter
             class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
             onMatch="include" >
           http://www\.norconex\.com/product/collector-http-test/.*
         </filter>
       </referenceFilters>
       <!-- Keep only a few fields after parsing. -->
       <importer>
         <postParseHandlers>
           <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger"
                   fields="title,keywords,description,document.reference"/>
         </postParseHandlers>
       </importer>
       <!-- Write crawled documents to the local file system. -->
       <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
         <directory>./examples-output/minimum/crawledFiles</directory>
       </committer>
     </crawler>
   </crawlers>
 </httpcollector>
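Variable substitution allows environment-specific values to be factored out of the XML. A hypothetical fragment, assuming a Velocity-style variables file is supplied alongside the configuration at launch time:

 <crawler id="Norconex Minimum Test Page">
   <startURLs>
     <!-- ${startUrl} and ${workdir} are resolved by Velocity -->
     <url>${startUrl}</url>
   </startURLs>
   <workDir>${workdir}</workDir>
 </crawler>

with, for example, a variables file containing:

 startUrl = http://www.norconex.com/product/collector-http-test/minimum.php
 workdir  = ./examples-output/minimum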

References

  1. Source Code
  2. Beyond Search 1
  3. Beyond Search 2
  4. Big Data Made Simple
  5. Apache Solr Ecosystem
