心路
死生契阔,与子成说。执子之手,与子偕老。

  • 首页
  • 关于
  • MyIcy
  • 琐记(413)
  • 技术(221)
  • 八卦(105)
  • 读书(20)
  • RSS
  • 很感谢云舒兄弟去年推荐到阿里...
  • 信春哥,得永生....
  • 据说胡适有36个博士学位...
  • 我给你力吧...
  • 当前Blog无任何评论...
  • 最新论证:先有的嫖客后有的鸡...
  • 好的呀,非常感谢! 我这个当...
  • 云老大,将你的程序修改了下,...
  • cnbeta上有中文的...
  • 当前Blog无任何评论...
  • 舒舒,不要把我给你的邮件随便...
  • 本人有项目想和你合作,有意向...
  • hehe
  • 大哥,你好,Ttyutils的rpm或...
  • 很好
  •  
     
  • 黑小子's Blog
  • Nick's blog
  • sbilly的乌托邦
  • 80 sec
  • 段段的blog
  • 肉肉的洗手间
  • JY美女
  • 小叶子的空间
  • 螺螺的blog
  • 忽尔今秋
  • Icy's Blog
  • 虚拟面包
  • 涛涛的blog
  • Tomy's blog
  • 王俊的blog
  • 狐狸的叶子
  • demonalex's blog
  • Super*Hei's Blog
  •  
    Powered by: SaBlog
    perl爬虫
    Submitted by 云舒 on 2009, December 27, 6:07 PM. 技术

    摘要:昨天阴沉了一天,闷在家里做了个简单的爬虫,主要是多线程、队列、Bloom Filter等的使用,算是个demo吧。今天终于下雪了,和LP出去走了走,很有感觉。

    update:2010年1月4日同事在项目中使用,所以我略微review了一下代码。有点小的改动,主要是线程互斥。

    #!/usr/bin/perl
    use strict;
    use warnings;
    #use Data::Dumper;
    use threads;
    use threads::shared;
    use Thread::Queue;
    use Thread::Semaphore;
     
    use Bloom::Filter;
    use URI::URL;
    use Web::Scraper;
     
    my $max_threads = 15;
    my $base_url = $ARGV[0] || 'http://www.icylife.net';
    my $host = URI::URL->new($base_url)->host;
     
    my $queue = Thread::Queue->new( );
     
    my $semaphore = Thread::Semaphore->new( $max_threads );
    my $mutex = Thread::Semaphore->new( 1 );
     
    my $filter = shared_clone( Bloom::Filter->new(capacity => 10000, error_rate => 0.0001) );
     
    $queue->enqueue( $base_url );
    $filter->add( $base_url );
     
    while( 1 )
    {
            # join all threads which can be joined
            #my $joined = 0;
            foreach ( threads->list(threads::joinable) )
            {
                    #$joined ++;
                    $_->join( );
            }
            #print $joined, " joined\n";
     
            # if there are no url need process.
            my $item = $queue->pending();
            if( $item == 0 )
            {
                    my $active = threads->list(threads::running);
                    # there are no active thread, we finish the job
                    if( $active == 0 )
                    {
                            print "All done!\n";
                            last;
                    }
                    # we will get some more url if there are some active threads, just wait for them
                    else
                    {
                            #print "[MAIN] 0 URL, but $active active thread\n";
                            sleep 1;
                            next;
                    }
            }
     
            # if there are some url need process
            #print "[MAIN] $item URLn";
            $semaphore->down;
            #print "[MAIN]Create thread.n";
            threads->create( \&ProcessUrl );
    }
     
    # join all threads which can be joined
    foreach ( threads->list() )
    {
            $_->join( );
    }
     
    sub ProcessUrl
    {
            my $scraper = scraper
            {
                    process '//a', 'links[]' => '@href';
            };
     
            my $res;
            my $link;
     
            while( my $url = $queue->dequeue_nb() )
            {
                    eval
                    {
                            $res = $scraper->scrape( URI->new($url) )->{'links'};
                    };
                    if( $@ )
                    {
                            warn "$@\n";
                            next;
                    }
                    next if (! defined $res );
     
                    #print "there are ".scalar(threads->list(threads::running))." threads, ", $queue->pending(), " urls need process.\n";
     
                    foreach( @{$res} )
                    {
                            $link = $_->as_string;
                            $link = URI::URL->new($link, $url);
     
                            # not http and not https?
                            next if( $link->scheme ne 'http' && $link->scheme ne 'https' );
                            # another domain?
                            next if( $link->host ne $host );
     
                            $link = $link->abs->as_string;
     
                            if( $link =~ /(.*?)#(.*)/ )
                            {
                                    $link = $1;
                            }
     
                            next if( $link =~ /.(jpg|png|bmp|mp3|wma|wmv|gz|zip|rar|iso|pdf)$/i );
     
                            $mutex->down();
                            if( ! $filter->check($link) )
                            {
                                    print $filter->key_count(), " ", $link, "\n";
                                    $filter->add($link);
                                    $queue->enqueue($link);
                            }
                            $mutex->up();
                            undef $link;
                    }
                    undef $res;
            }
            undef $scraper;
            $semaphore->up( );
    }
     
     
    评论
    闲得无聊随手而作,bug懒的管了,反正只是demo一下而已。
    Post by 云舒 on 2009, December 27, 10:59 PM
    博主能说下功能么?    以后不会再做任何修改了?
    Post by michael on 2009, December 28, 2:59 PM
    就是一个简单的爬虫,寻找一个网站的所有URL链接地址。本就是写的好玩的,所以应该不会修改的,只是验证一下Bloom Filter而已。
    Post by 云舒 on 2009, December 29, 1:18 PM
    perl达人,膜拜
    Post by 冒充的人 on 2009, December 29, 9:28 PM
    晕掉,刚在chinaunix上看到这一篇,我还在想这个发贴人的ID是谁呢。。。
    Post by infocert on 2009, December 31, 5:28 PM
    2010年1月4日update代码。
    Post by 云舒 on 2010, January 4, 4:56 PM
    添加评论
    您的名字:
    您的E-mail:
    评论内容:
    验证码:
    Processed in 0.355817 second(s)