Scrapy Crawling Techniques, Part 9

Selectors

As the name suggests, a selector selects: it extracts data from HTML source. What tools are available for this?

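BeautifulSoup is a very popular web page parsing library among programmers. It builds a Python object from the structure of the HTML code and handles bad markup reasonably well, but it has one drawback: it is slow.
lxml is a Pythonic XML parsing library (it can also parse HTML) with an API based on ElementTree.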

Scrapy has its own mechanism for extracting data. These are called selectors because they select parts of an HTML document using XPath or CSS expressions.

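XPath is a language for selecting nodes in XML documents, and it can also be used with HTML. CSS is a language for applying styles to HTML documents; it defines selectors that associate those styles with specific HTML elements.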
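To make the idea of an XPath expression concrete, here is a minimal sketch that uses only the Python 3 standard library. It is not Scrapy: xml.etree.ElementTree supports just a limited subset of XPath and needs well-formed XML, and the sample markup is an assumption made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (ElementTree cannot parse broken HTML).
markup = "<html><body><span>good</span><span class='x'>better</span></body></html>"
root = ET.fromstring(markup)

# './/span' selects every <span> descendant of the root element.
all_spans = [el.text for el in root.findall(".//span")]

# A predicate narrows the selection to elements with a given attribute value.
x_spans = [el.text for el in root.findall(".//span[@class='x']")]

print(all_spans)
print(x_spans)
```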

Using selectors

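Constructing selectors
A Scrapy selector is a Selector instance constructed from either text or a TextResponse object. It automatically picks the best parsing rules (XML vs HTML) based on the input type: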

>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse

Constructing from text:

>>> body = '<html><body><span>good</span></body></html>'
>>> Selector(text=body).xpath('//span/text()').extract()
[u'good']

Constructing from a response:

>>> response = HtmlResponse(url='http://example.com',body=body)
>>> Selector(response=response).xpath('//span/text()').extract()
[u'good']

For convenience, response objects expose a selector through their .selector attribute; you can use this shortcut at any time:

>>> response.selector.xpath('//span/text()').extract()
[u'good']

Example:
Page to scrape: http://doc.scrapy.org/en/latest/_static/selectors-sample1.html
Its HTML source:

<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
  </div>
 </body>
</html>

Open the shell and enter:

scrapy shell http://doc.scrapy.org/en/latest/_static/selectors-sample1.html

Once the shell has loaded, the response is available in the response shell variable, with a selector attached to its response.selector attribute.
Build an XPath to select the text inside the title tag:

>>> response.selector.xpath('//title/text()')
[<Selector (text) xpath=//title/text()>]

Because XPath and CSS queries on responses are so common, Scrapy provides two convenient shortcuts, response.xpath() and response.css():

>>> response.xpath('//title/text()')
[<Selector (text) xpath=//title/text()>]
>>> response.css('title::text')
[<Selector (text) xpath=//title/text()>]

As you can see, .xpath() and .css() return a SelectorList instance, which is a list of new selectors.
To extract the actual text data, call the .extract() method as follows:

>>> response.xpath('//title/text()').extract()
[u'Example website']

Note that CSS selectors can select text or attribute nodes using CSS3 pseudo-elements:

>>> response.css('title::text').extract()
[u'Example website']

Now we will get the base URL and some image links:

>>> response.xpath('//base/@href').extract()
[u'http://example.com/']
>>> response.css('base::attr(href)').extract()
[u'http://example.com/']
>>> response.xpath('//a[contains(@href,"image")]/@href').extract()
[u'image1.html',u'image2.html',u'image3.html',u'image4.html',u'image5.html']
>>> response.css('a[href*=image]::attr(href)').extract()
[u'image1.html',u'image2.html',u'image3.html',u'image4.html',u'image5.html']
>>> response.xpath('//a[contains(@href,"image")]/img/@src').extract()
[u'image1_thumb.jpg',u'image2_thumb.jpg',u'image3_thumb.jpg',u'image4_thumb.jpg',u'image5_thumb.jpg']
>>> response.css('a[href*=image] img::attr(src)').extract()
[u'image1_thumb.jpg',u'image2_thumb.jpg',u'image3_thumb.jpg',u'image4_thumb.jpg',u'image5_thumb.jpg']

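Nesting selectors

The selection methods (.xpath() or .css()) return a list of selectors of the same type, so you can call the selection methods on those selectors too.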

>>> links = response.xpath('//a[contains(@href, "image")]')
>>> links.extract()
[u'<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
 u'<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
 u'<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',
 u'<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>',
 u'<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']
>>> for index,link in enumerate(links):
        args = (index, link.xpath('@href').extract(), link.xpath('img/@src').extract())
        print 'Link number %d points to url %s and image %s' % args
Link number 0 points to url [u'image1.html'] and image [u'image1_thumb.jpg']
Link number 1 points to url [u'image2.html'] and image [u'image2_thumb.jpg']
Link number 2 points to url [u'image3.html'] and image [u'image3_thumb.jpg']
Link number 3 points to url [u'image4.html'] and image [u'image4_thumb.jpg']
Link number 4 points to url [u'image5.html'] and image [u'image5_thumb.jpg']

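Using selectors with regular expressions

Selector also has a .re() method for extracting data with regular expressions. However, unlike .xpath() or .css(), .re() returns a list of unicode strings, so you cannot construct nested .re() calls.
Example: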

>>> response.xpath('//a[contains(@href,"image")]/text()').re(r'Name:\s*(.*)')
[u'My image 1',
 u'My image 2',
 u'My image 3',
 u'My image 4',
 u'My image 5']
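
What .re() does to each selected text node can be approximated with Python's own re module. The sketch below (Python 3) is an assumption about the behaviour, not Scrapy's actual implementation, and the texts list is made-up sample input standing in for the selected text nodes.

```python
import re

# Hypothetical sample strings standing in for the text nodes selected by XPath.
texts = ["Name: My image 1", "Name:   My image 2", "no match here"]

pattern = re.compile(r"Name:\s*(.*)")

# Apply the pattern to every string and flatten the captured groups into one list.
names = [match for text in texts for match in pattern.findall(text)]

print(names)
```

The result is a plain list of strings, which is why no further .xpath(), .css() or .re() call can be chained onto it.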

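Working with relative XPaths

Keep in mind that if you nest selectors and use an XPath that starts with /, that XPath is absolute to the document, not relative to the Selector you call it from.
For example, suppose you want to extract all <p> elements inside <div> elements. First, you would get all the <div> elements:

>>> divs = response.xpath('//div')

At first, you may be tempted to use the following approach, which is wrong, because it actually extracts all <p> elements from the whole document, not only those inside the <div> elements: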

>>> for p in divs.xpath('//p'):  # wrong: gets every <p> in the whole document
...     print p.extract()

This is the proper way to do it (note the dot prefix in the .//p XPath):

>>> for p in divs.xpath('.//p'):  # extracts all <p> inside the <div> elements
...     print p.extract()

Another common case is to extract all direct <p> children:

>>> for p in divs.xpath('p'):
...     print p.extract()
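
The difference between the descendant path .//p and the direct-child path p can be tried out without Scrapy. This sketch uses the standard library's xml.etree.ElementTree (Python 3), which supports only a subset of XPath; the sample document is an assumption invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: two <p> inside the <div>, one outside it.
root = ET.fromstring(
    "<root>"
    "<div><p>inside</p><section><p>nested</p></section></div>"
    "<p>outside</p>"
    "</root>"
)
div = root.find("div")

# './/p' is relative to the div: every <p> descendant, at any depth.
descendants = [p.text for p in div.findall(".//p")]

# 'p' is also relative: only <p> elements that are direct children of the div.
children = [p.text for p in div.findall("p")]

# Searching from the document root also picks up the <p> outside the div,
# which is the effect an absolute //p has in a nested Scrapy selector.
whole_document = [p.text for p in root.findall(".//p")]

print(descendants, children, whole_document)
```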

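Using EXSLT extensions

Being built on top of lxml, Scrapy selectors also support some EXSLT extensions and come with these pre-registered namespaces for use in XPath expressions:

prefix	namespace	usage
re	http://exslt.org/regular-expressions	regular expressions
set	http://exslt.org/sets	set manipulation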

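Regular expressions
The test() function, for example, can be very useful when XPath's starts-with() or contains() are not sufficient.
For example, to select the links in list items whose "class" attribute ends with a digit: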

>>> from scrapy import Selector
>>> doc ="""
... <div>
...     <ul>
...         <li class="item-0"><a href="link1.html">first item</a></li>
...         <li class="item-1"><a href="link2.html">second item</a></li>
...         <li class="item-inactive"><a href="link3.html">third item</a></li>
...         <li class="item-1"><a href="link4.html">fourth item</a></li>
...         <li class="item-0"><a href="link5.html">fifth item</a></li>
...     </ul>
... </div>
... """
>>> sel = Selector(text=doc,type="html")
>>> sel.xpath('//li//@href').extract()
[u'link1.html', u'link2.html', u'link3.html', u'link4.html', u'link5.html']
>>> sel.xpath(r'//li[re:test(@class, "item-\d$")]//@href').extract()
[u'link1.html', u'link2.html', u'link4.html', u'link5.html']

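Note: the C library libxslt does not natively support EXSLT regular expressions, so lxml's implementation uses hooks into Python's re module. As a result, using regexp functions in XPath expressions may add a small performance penalty.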
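
Because lxml delegates re:test() to Python's re module, the filtering step can be sketched in plain Python. This only illustrates the matching logic, not how Scrapy evaluates the XPath; the items list is copied by hand from the sample document, and treating re:test() like re.search() is an assumption.

```python
import re

# (class, href) pairs copied by hand from the sample <li> elements.
items = [
    ("item-0", "link1.html"),
    ("item-1", "link2.html"),
    ("item-inactive", "link3.html"),
    ("item-1", "link4.html"),
    ("item-0", "link5.html"),
]

# Keep only the links whose class value ends with "item-" followed by one digit.
pattern = re.compile(r"item-\d$")
matched = [href for css_class, href in items if pattern.search(css_class)]

print(matched)
```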

Set operations
These can be handy for excluding parts of a document tree before extracting text elements.

>>> doc = """
... <div itemscope itemtype="http://schema.org/Product">
...   <span itemprop="name">Kenmore White 17" Microwave</span>
...   <img src="kenmore-microwave-17in.jpg" alt='Kenmore 17" Microwave' />
...   <div itemprop="aggregateRating"
...     itemscope itemtype="http://schema.org/AggregateRating">
...    Rated <span itemprop="ratingValue">3.5</span>/5
...    based on <span itemprop="reviewCount">11</span> customer reviews
...   </div>
...
...   <div itemprop="offers" itemscope itemtype="http://schema.org/Offer">
...     <span itemprop="price">$55.00</span>
...     <link itemprop="availability" href="http://schema.org/InStock" />In stock
...   </div>
...
...   Product description:
...   <span itemprop="description">0.7 cubic feet countertop microwave.
...   Has six preset cooking categories and convenience features like
...   Add-A-Minute and Child Lock.</span>
...
...   Customer reviews:
...
...   <div itemprop="review" itemscope itemtype="http://schema.org/Review">
...     <span itemprop="name">Not a happy camper</span> -
...     by <span itemprop="author">Ellie</span>,
...     <meta itemprop="datePublished" content="2011-04-01">April 1, 2011
...     <div itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
...       <meta itemprop="worstRating" content = "1">
...       <span itemprop="ratingValue">1</span>/
...       <span itemprop="bestRating">5</span>stars
...     </div>
...     <span itemprop="description">The lamp burned out and now I have to replace
...     it. </span>
...   </div>
...
...   <div itemprop="review" itemscope itemtype="http://schema.org/Review">
...     <span itemprop="name">Value purchase</span> -
...     by <span itemprop="author">Lucas</span>,
...     <meta itemprop="datePublished" content="2011-03-25">March 25, 2011
...     <div itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
...       <meta itemprop="worstRating" content = "1"/>
...       <span itemprop="ratingValue">4</span>/
...       <span itemprop="bestRating">5</span>stars
...     </div>
...     <span itemprop="description">Great microwave for the price. It is small and
...     fits in my apartment.</span>
...   </div>
...   ...
... </div>
... """
>>> sel = Selector(text=doc, type="html")
>>> for scope in sel.xpath('//div[@itemscope]'):
...     print "current scope:", scope.xpath('@itemtype').extract()
...     props = scope.xpath('''
...                 set:difference(./descendant::*/@itemprop,
...                                .//*[@itemscope]/*/@itemprop)''')
...     print "    properties:", props.extract()
...     print
...

current scope: [u'http://schema.org/Product']
    properties: [u'name', u'aggregateRating', u'offers', u'description', u'review', u'review']

current scope: [u'http://schema.org/AggregateRating']
    properties: [u'ratingValue', u'reviewCount']

current scope: [u'http://schema.org/Offer']
    properties: [u'price', u'availability']

current scope: [u'http://schema.org/Review']
    properties: [u'name', u'author', u'datePublished', u'reviewRating', u'description']

current scope: [u'http://schema.org/Rating']
    properties: [u'worstRating', u'ratingValue', u'bestRating']

current scope: [u'http://schema.org/Review']
    properties: [u'name', u'author', u'datePublished', u'reviewRating', u'description']

current scope: [u'http://schema.org/Rating']
    properties: [u'worstRating', u'ratingValue', u'bestRating']
>>>

Here we first iterate over the itemscope elements, and for each one we look for all itemprop elements and exclude those that are themselves inside another itemscope.
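
The same idea, all itemprop nodes minus those nested in an inner itemscope, can be sketched with the standard library alone. ElementTree has no set:difference(), so this Python 3 sketch emulates it with a set of element identities; the shortened markup is an assumption (attributes are given empty values so that it is well-formed XML).

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened version of the product markup, written as valid XML.
doc = """
<div itemscope="" itemtype="http://schema.org/Product">
  <span itemprop="name">Microwave</span>
  <div itemprop="offers" itemscope="" itemtype="http://schema.org/Offer">
    <span itemprop="price">$55.00</span>
  </div>
</div>
"""
scope = ET.fromstring(doc)

# First node set: every descendant of the scope that carries an itemprop.
all_props = scope.findall(".//*[@itemprop]")

# Second node set: itemprop elements that are children of a nested itemscope.
nested = scope.findall(".//*[@itemscope]/*[@itemprop]")

# Emulate set:difference() by dropping the nested elements (compared by identity).
nested_ids = {id(el) for el in nested}
own = [el.get("itemprop") for el in all_props if id(el) not in nested_ids]

print(own)
```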

Built-in selector reference

xpath(query)
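Finds nodes matching the XPath query and returns the result as a SelectorList instance with all elements flattened. The list elements implement the Selector interface too.
css(query)
Applies the given CSS selector and returns a SelectorList instance.
query is a string containing the CSS selector to apply.
Behind the scenes, CSS queries are translated into XPath queries using the cssselect library and run through the .xpath() method.
extract()
Serializes the matched nodes and returns them as a list of unicode strings. Percent-encoded content is unquoted.
re(regex)
Applies the given regex and returns a list of unicode strings with the matches. regex can be either a compiled regular expression or a string that will be compiled into a regular expression using re.compile(regex).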