Writing a crawler in PHP - テノニッキ (@hideack's diary)

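I went looking for a library for writing a crawler in PHP, found one on SourceForge, and tried it out. It seems quite well made, and after playing with it a little it looks convenient.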

PHPCrawl
PHPCrawl is a webcrawler/webspider-library written in PHP. It supports filters, limiters, cookie-handling, robots.txt-handling and other features.
http://sourceforge.net/projects/phpcrawl/

The example.php included in the downloaded archive pretty much says it all, but here is an excerpt from that sample file anyway.
It should give an idea of how a crawler can be implemented.

<?php
// Include the phpcrawl-mainclass
include("classes/phpcrawler.class.php");

// Extend the class and override the handlePageData()-method
class MyCrawler extends PHPCrawler 
{
  function handlePageData(&$page_data) 
  {
    // --- Implement what should happen each time a page is crawled here
    // Print the URL of the currently requested page or file
    echo "Page requested: ".$page_data["url"]."\n";
    // Print the first line of the header the server sent (HTTP-status)
    echo "Status: ".strtok($page_data["header"], "\n")."\n";
    // Print the referer
    echo "Referer-page: ".$page_data["referer_url"]."\n";
    // Print whether the content was received or not
    if ($page_data["received"]==true)
      echo "Content received: ".$page_data["bytes_received"]." bytes";
    else
      echo "Content not received";
    
    echo "\n\n";
    flush();
  }
}

// Plain "new"; the "&new" reference form is deprecated in PHP 5 and removed in PHP 7
$crawler = new MyCrawler();

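$crawler->setURL("www.yahoo.com");  // Website to crawl
$crawler->addReceiveContentType("/text\/html/");    // Content types the crawler should receive
$crawler->addNonFollowMatch("/\.(jpg|gif|png)$/i");  // File extensions to exclude from crawling (dot escaped)
$crawler->setTrafficLimit(1000 * 1024);  // Stop after roughly 1 MB of received data (limit is in bytes)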

$crawler->go();
?>
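
The two filters in the sample are ordinary PCRE patterns, so their behaviour can be checked with plain preg_match() before handing them to the crawler. The following is only a minimal sketch: the helper names isImageUrl() and isHtmlContentType() and the example.com URLs are made up for illustration and are not part of PHPCrawl, and how the library applies the patterns internally is not shown here.

```php
<?php
// Hypothetical helpers (not part of PHPCrawl) that apply the same
// regular expressions used in the sample above with preg_match().

// True when the URL ends in .jpg, .gif or .png (case-insensitive).
function isImageUrl(string $url): bool
{
    // "\." matches a literal dot, "$" anchors at the end, "i" ignores case.
    return preg_match('/\.(jpg|gif|png)$/i', $url) === 1;
}

// True when a Content-Type header value contains "text/html".
function isHtmlContentType(string $contentType): bool
{
    return preg_match('/text\/html/', $contentType) === 1;
}

var_dump(isImageUrl("http://example.com/logo.PNG"));        // bool(true)
var_dump(isImageUrl("http://example.com/index.html"));      // bool(false)
var_dump(isImageUrl("http://example.com/photo.jpg?w=100")); // bool(false), the query string follows the extension
var_dump(isHtmlContentType("text/html; charset=UTF-8"));    // bool(true)
```

As the third call shows, a pattern anchored with "$" does not catch image URLs that carry a query string, so the pattern may need loosening depending on the site being crawled.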

Looks like there are all sorts of fun things to do with this.