Mar 20, 2015

Solr for docs.typo3.org

This is work in progress!

Overview:

About Nutch

Apache Nutch is an open source web search engine written in Java. It uses Lucene for its search and index component.

Nutch Installation

Solr Indexing Server

Solr indexing server: srv136.typo3.org

Incoming

Schema Browser

http://localhost:8080/solr/t3o_latest/admin/schema.jsp

  • content
  • URL
  • type
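To look at these fields in actual documents rather than in the schema, a select query can be built like this. This is only a sketch: `solr_select_url` is a hypothetical helper, the lowercase field name `url` is an assumption (the Schema Browser shows the exact spelling), and port 8080 assumes the SSH tunnel described in these notes is open.

```shell
#!/bin/sh
# Build a Solr select URL that returns only the three fields listed above.
# Assumptions: field names are spelled content/url/type in the schema,
# and Solr is reachable on localhost:8080 through the SSH tunnel.
solr_select_url() {
    core="$1"    # Solr core, e.g. t3o_latest
    query="$2"   # query term, must already be URL-encoded
    printf 'http://localhost:8080/solr/%s/select?q=%s&fl=content,url,type&rows=5&wt=json\n' \
        "$core" "$query"
}

# Usage (needs the tunnel):
#   curl -s "$(solr_select_url t3o_latest TYPO3)"
```

The function only prints the URL, so it is safe to try without a running Solr.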

Solr:

ssh -L 8080:localhost:8080 srv136.typo3.org -N

http://localhost:8080/solr/t3o_latest/admin/

Developmental and test versions of typo3.org:

Documentation URL: http://docs.typo3.org/documents.txt

Nutch

Patch used by DKD: https://issues.apache.org/jira/browse/NUTCH-978

Tomcat log of Solr:

tail -f /var/log/tomcat6/catalina.out

Configuration files:

cd /usr/local/apache-nutch-for-typo3-2.1.0
nano conf/nutch-site.xml # (as described in the manual)
nano urls/seed.txt # (as described in the manual), seed URL: http://docs.typo3.org/documents.txt
nano conf/nutch-default.xml
nano plugins/parse-html/plugin.xml

ls -l /usr/local/apache-nutch-for-typo3-2.1.0/urls

Create symlink:

ln -s /usr/local/apache-nutch-for-typo3/urls /usr/share/solr/urls

Java PATH:

head -n 70 /etc/init.d/tomcat6

Run Nutch for latest (dev instance):

JAVA_HOME=/usr/lib/jvm/java-6-openjdk \
   /usr/local/apache-nutch-for-typo3/bin/nutch crawl urls \
   -solr http://localhost:8080/solr/t3o_latest -dir crawl \
   -depth 5 -topN 10
# or -topN 1000

Run Nutch for the live instance:

JAVA_HOME=/usr/lib/jvm/java-6-openjdk \
   /usr/local/apache-nutch-for-typo3-2.1.0/bin/nutch crawl urls \
   -solr http://localhost:8080/solr/t3o_live -dir crawl \
   -depth 5 -topN 1000
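The two invocations differ only in the install path, the Solr core and the `-topN` value. A small dry-run helper can make that explicit. This is a sketch under assumptions: `crawl_command`, `NUTCH_HOME` and `NUTCH_JAVA_HOME` are hypothetical names introduced here, not part of Nutch, and the helper only prints the command instead of running it.

```shell
#!/bin/sh
# Print (do not run) the crawl command for one Solr core.
# Defaults are the paths used in these notes; override them as needed.
NUTCH_HOME="${NUTCH_HOME:-/usr/local/apache-nutch-for-typo3-2.1.0}"
NUTCH_JAVA_HOME="${NUTCH_JAVA_HOME:-/usr/lib/jvm/java-6-openjdk}"

crawl_command() {
    core="$1"          # Solr core: t3o_latest or t3o_live
    topn="${2:-1000}"  # maximum number of pages fetched per crawl round
    printf 'JAVA_HOME=%s %s/bin/nutch crawl urls -solr http://localhost:8080/solr/%s -dir crawl -depth 5 -topN %s\n' \
        "$NUTCH_JAVA_HOME" "$NUTCH_HOME" "$core" "$topn"
}

crawl_command t3o_latest 10   # dev instance, small test crawl
crawl_command t3o_live        # live instance, default -topN 1000
```

Printing first makes it easy to check the core name before anything is written to the live index.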

Check the result: http://www.latest.dev.t3o.typo3.org/search/?id=180&L=0&q=TYPO3+Transition+Days

To do

  • We should try the latest version (2.1.1) of dkd/nutch-typo3-cms, as suggested by Olivier Dobberkau in the comments.
  • We should run Nutch with a more recent version of OpenJDK or the Sun JDK.

A problem

JAVA_HOME=/usr/lib/jvm/java-6-openjdk \
>     /usr/local/apache-nutch-for-typo3/bin/nutch crawl urls \
>     -solr http://localhost:8080/solr/t3o_latest -dir crawl \
>     -depth 5 -topN 10
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 5
solrUrl=http://localhost:8080/solr/t3o_latest
topN = 10
Injector: starting at 2015-03-20 18:00:33
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/mbless/urls
     at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
     at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
     at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:989)
     at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:981)
     at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
     at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)
     at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
     at java.security.AccessController.doPrivileged(Native Method)
     at javax.security.auth.Subject.doAs(Subject.java:416)
     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1261)
     at org.apache.nutch.crawl.Injector.inject(Injector.java:217)
     at org.apache.nutch.crawl.Crawl.run(Crawl.java:127)
     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
     at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
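The exception shows that the relative `urls` argument was resolved against the current working directory (`/home/mbless`), where no such directory exists. One way to catch this before the Injector starts is to resolve the seed directory to an absolute path first. A minimal sketch, assuming the seed directory is the one listed in the configuration notes; `seed_dir` is a hypothetical helper name:

```shell
#!/bin/sh
# Resolve the seed directory to an absolute path, or fail early
# with a message that names the current working directory.
seed_dir() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "seed directory not found: $dir (cwd: $(pwd))" >&2
        return 1
    fi
    # Run cd in a subshell so the caller's working directory is unchanged.
    (cd "$dir" && pwd)
}

# Usage sketch (paths taken from these notes):
#   urls=$(seed_dir /usr/local/apache-nutch-for-typo3-2.1.0/urls) || exit 1
#   /usr/local/apache-nutch-for-typo3-2.1.0/bin/nutch crawl "$urls" ...
```

Alternatively, changing into the directory that contains `urls` before running the crawl should have the same effect.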

t3org in a vagrant box
