Scrapy Crawling Techniques, Part 8

SitemapSpider

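class scrapy.spiders.SitemapSpider (scrapy.contrib.spiders.SitemapSpider in Scrapy releases before 1.0)
SitemapSpider lets you crawl a site by discovering its URLs through Sitemaps.

It supports nested sitemaps and can also discover sitemap URLs from robots.txt.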
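1. sitemap_urls
A list of URLs pointing to the sitemaps whose URLs you want to crawl. You can also point to a robots.txt file; the spider parses it and extracts the sitemap URLs it declares.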
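To make the robots.txt case concrete, here is a minimal standalone sketch of how sitemap URLs can be pulled out of a robots.txt body. It is not Scrapy's own code: the function name extract_sitemap_urls and the sample ROBOTS_TXT content are assumptions made up for illustration.

```python
def extract_sitemap_urls(robots_text):
    """Return the URLs declared on 'Sitemap:' lines of a robots.txt body.

    Hypothetical helper for illustration only, not part of Scrapy.
    """
    urls = []
    for line in robots_text.splitlines():
        # Split on the first colon only, so the "http://" in the URL survives.
        name, sep, value = line.partition(":")
        # The directive name is matched case-insensitively.
        if sep and name.strip().lower() == "sitemap":
            url = value.strip()
            if url:
                urls.append(url)
    return urls


# Made-up robots.txt content for the example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Sitemap: http://www.example.com/sitemap.xml
sitemap: http://www.example.com/sitemap_shops.xml
"""

found = extract_sitemap_urls(ROBOTS_TXT)
print(found)
```

Only the two Sitemap lines contribute URLs; the User-agent and Disallow directives are ignored.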
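2. sitemap_rules
A list of (regex, callback) tuples:

regex is a regular expression matched against the URLs extracted from the sitemap. It can be either a string or a compiled regex object.
callback is the function that processes URLs matching the regular expression. It can be either a string (the name of a spider method) or a callable.
For example: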
sitemap_rules=[('/product/','parse_product')]

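Rules are tried in order, and only the first one that matches is applied.
If you omit this attribute, every URL found in the sitemaps is processed by the parse callback.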
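The first-match behaviour can be sketched in plain Python as follows. This is an illustration under stated assumptions, not Scrapy's implementation: pick_callback, parse_product and parse_any are hypothetical names, and the callbacks take a URL string here instead of a response so the snippet runs without Scrapy.

```python
import re


def parse_product(url):
    return "product page: " + url


def parse_any(url):
    return "generic page: " + url


# Hypothetical rules: one plain-string pattern and one precompiled pattern,
# both paired with callables.
RULES = [
    ("/product/", parse_product),
    (re.compile(r"example\.com"), parse_any),
]


def pick_callback(url, rules):
    """Return the callback of the first rule whose regex matches url, else None."""
    for pattern, callback in rules:
        # Accept both plain strings and compiled regex objects.
        regex = re.compile(pattern) if isinstance(pattern, str) else pattern
        if regex.search(url):
            return callback  # stop at the first match; later rules are skipped
    return None


chosen = pick_callback("http://www.example.com/product/42", RULES)
print(chosen.__name__)
```

The product URL matches both rules, but because the '/product/' rule comes first it wins. Putting the broader pattern first would hide the more specific one, so order rules from most to least specific.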
3. sitemap_follow
A list of regular expressions that select which sitemaps to follow. It only applies to sites that use Sitemap index files to point to other sitemap files.

By default, all sitemaps are followed.
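As a rough sketch of that filtering step, the snippet below keeps only the sitemap index entries that match at least one pattern. The helper name sitemaps_to_follow and the sample entries are assumptions for illustration; the empty-string default pattern is chosen here to model "follow everything", since an empty regex matches any URL.

```python
import re


def sitemaps_to_follow(sitemap_locs, follow_patterns=("",)):
    """Keep the sitemap URLs that match at least one of the follow patterns.

    Hypothetical helper, not part of Scrapy.
    """
    regexes = [re.compile(p) for p in follow_patterns]
    return [loc for loc in sitemap_locs if any(r.search(loc) for r in regexes)]


# Made-up entries, as they might appear in a sitemap index file.
INDEX_ENTRIES = [
    "http://www.example.com/sitemap_shops.xml",
    "http://www.example.com/sitemap_blog.xml",
]

only_shops = sitemaps_to_follow(INDEX_ENTRIES, ["/sitemap_shops"])
everything = sitemaps_to_follow(INDEX_ENTRIES)
print(only_shops)
print(everything)
```

With the '/sitemap_shops' pattern only the shops sitemap is kept; with the default, both entries pass through.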
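4. sitemap_alternate_links
Specifies whether alternate links for a url should be followed. Some multilingual sites list links to other-language versions of the same page inside a single url block.
For example: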

<url>
    <loc>http://example.com/</loc>
    <xhtml:link rel="alternate" hreflang="de" href="http://example.com/de"/>
</url>

With sitemap_alternate_links enabled, both URLs are retrieved. With it disabled, only http://example.com/ is retrieved.

sitemap_alternate_links is disabled by default.
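The effect of the switch can be illustrated with the standard library's XML parser. This is a sketch, not how Scrapy parses sitemaps: urls_from_sitemap is a hypothetical function, and the urlset wrapper with its two namespace declarations is added here as an assumption so that the url block shown above becomes a well-formed document.

```python
import xml.etree.ElementTree as ET

# The <url> block from above, wrapped in an assumed <urlset> root.
SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>http://example.com/</loc>
    <xhtml:link rel="alternate" hreflang="de" href="http://example.com/de"/>
  </url>
</urlset>
"""

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "xhtml": "http://www.w3.org/1999/xhtml",
}


def urls_from_sitemap(xml_text, alternate_links=False):
    """Collect <loc> URLs and, if requested, the alternate-language hrefs."""
    root = ET.fromstring(xml_text)
    urls = []
    for url_el in root.findall("sm:url", NS):
        loc = url_el.find("sm:loc", NS)
        if loc is not None and loc.text:
            urls.append(loc.text.strip())
        if alternate_links:
            for link in url_el.findall("xhtml:link", NS):
                if link.get("rel") == "alternate" and link.get("href"):
                    urls.append(link.get("href"))
    return urls


print(urls_from_sitemap(SITEMAP_XML))
print(urls_from_sitemap(SITEMAP_XML, alternate_links=True))
```

The first call returns only the loc URL; the second also includes the German alternate.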

SitemapSpider examples

A simple example: process every URL discovered through the sitemap with the parse callback:

from scrapy.spiders import SitemapSpider  # scrapy.contrib.spiders before Scrapy 1.0

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/sitemap.xml']

    def parse(self, response):
        pass # ... scrape item here ...

Process some URLs with one callback and other URLs with a different callback:

from scrapy.spiders import SitemapSpider  # scrapy.contrib.spiders before Scrapy 1.0

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/sitemap.xml']
    sitemap_rules = [
        ('/product/', 'parse_product'),
        ('/category/', 'parse_category'),
    ]

    def parse_product(self, response):
        pass # ... scrape product ...

    def parse_category(self, response):
        pass # ... scrape category ...

Follow the sitemaps declared in robots.txt, but only those whose URL contains /sitemap_shops:

from scrapy.spiders import SitemapSpider  # scrapy.contrib.spiders before Scrapy 1.0

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/robots.txt']
    sitemap_rules = [
        ('/shop/', 'parse_shop'),
    ]
    sitemap_follow = ['/sitemap_shops']

    def parse_shop(self, response):
        pass # ... scrape shop here ...

Combine SitemapSpider with URLs from other sources:

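import scrapy  # needed for scrapy.Request in start_requests below
from scrapy.spiders import SitemapSpider  # scrapy.contrib.spiders before Scrapy 1.0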

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/robots.txt']
    sitemap_rules = [
        ('/shop/', 'parse_shop'),
    ]

    other_urls = ['http://www.example.com/about']

    def start_requests(self):
        requests = list(super(MySpider, self).start_requests())
        requests += [scrapy.Request(x, self.parse_other) for x in self.other_urls]
        return requests

    def parse_shop(self, response):
        pass # ... scrape shop here ...

    def parse_other(self, response):
        pass # ... scrape other here ...

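In short, each of the spiders covered so far is a subclass of the base Spider that overrides part of its behavior. You can write your own spider class and subclass it in the same way, as long as it ultimately inherits from the base Spider.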