Nutch - Highly extensible, highly scalable Web crawler

Nutch is open source web-search software. It builds on Lucene Java, adding web-specific components such as a crawler, a link-graph database, and parsers for HTML and other document formats. Its main features include:
  • Parallel and distributed fetching, parsing, and indexing
  • Plugin support
  • Ontology
  • Clustering
  • Distributed filesystem (via Hadoop)
  • Link-graph database
  • NTLM authentication
  • MapReduce
  • Many document formats: plain text, HTML, XML, ZIP, OpenDocument (OpenOffice.org), Microsoft Office (Word, Excel, PowerPoint), PDF, JavaScript, RSS, RTF, MP3 (ID3 tags)
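
To make the link-graph and MapReduce items above more concrete, here is a minimal conceptual sketch of how a map step and a reduce step could turn each page's outgoing links into a table of incoming links. It is not Nutch's actual code or API: the function names (`map_outlinks`, `reduce_inlinks`, `invert_links`) and the example URLs are hypothetical, and everything runs in a single process rather than on Hadoop.

```python
from collections import defaultdict


def map_outlinks(page_url, outlinks):
    """Map step: emit one (target, source) pair per outgoing link."""
    for target in outlinks:
        yield target, page_url


def reduce_inlinks(pairs):
    """Reduce step: group sources by target to get each page's inlinks."""
    inlinks = defaultdict(set)
    for target, source in pairs:
        inlinks[target].add(source)
    # Sort the sources so the result is deterministic.
    return {target: sorted(sources) for target, sources in inlinks.items()}


def invert_links(crawl):
    """Run the map step over every crawled page, then reduce the pairs."""
    pairs = []
    for page_url, outlinks in crawl.items():
        pairs.extend(map_outlinks(page_url, outlinks))
    return reduce_inlinks(pairs)


# Hypothetical crawl result: page URL -> list of URLs it links to.
crawl = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/"],
    "http://c.example/": [],
}

inlinks = invert_links(crawl)
for target, sources in sorted(inlinks.items()):
    print(target, "<-", sources)
```

In a distributed setting the map calls would run on many machines at once and the framework would group the emitted pairs by target before the reduce step; the sketch only shows the shape of the computation.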
http://nutch.apache.org/