I have followed the tutorial at https://wiki.apache.org/nutch/RunNutchInEclipse, but it seems a bit outdated.
I couldn’t debug ParseJob because of the parser timeout. The tutorial does not include the configuration needed to work around it:
<property>
  <name>sitemap.parser.timeout</name>
  <value>-1</value>
</property>
The tutorial also lacks the Ivy dependencies needed to use HBase:
<dependency org="org.apache.hbase" name="hbase-common" rev="0.98.8-hadoop2" conf="*->default" />
<dependency org="org.apache.hbase" name="hbase-client" rev="0.98.8-hadoop2"/>
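
For anyone unsure where the timeout property goes: a minimal sketch, assuming it belongs in conf/nutch-site.xml (the usual file for overriding Nutch defaults). The file name and the surrounding <configuration> element are my assumption here, not something the tutorial states; the property name and value are the ones I used above.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml (assumed location for local overrides) -->
<configuration>
  <!-- Timeout setting from the post above; -1 is the value that let me debug ParseJob -->
  <property>
    <name>sitemap.parser.timeout</name>
    <value>-1</value>
  </property>
</configuration>
```

If the file already contains a <configuration> element, only the <property> block needs to be added inside it.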