The Excitement Continues! A Summary of lxml, the Python Parsing Library, and XPath Usage

2021-01-12 09:17

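4. XPath axes

An axis defines a node-set relative to the current node.

Axis name            Result
ancestor             Selects all ancestors (parent, grandparent, etc.) of the current node.
ancestor-or-self     Selects all ancestors (parent, grandparent, etc.) of the current node, plus the current node itself.
attribute            Selects all attributes of the current node.
child                Selects all children of the current node.
descendant           Selects all descendants (children, grandchildren, etc.) of the current node.
descendant-or-self   Selects all descendants (children, grandchildren, etc.) of the current node, plus the current node itself.
following            Selects everything in the document after the closing tag of the current node.
following-sibling    Selects all siblings after the current node.
namespace            Selects all namespace nodes of the current node.
parent               Selects the parent of the current node.
preceding            Selects everything in the document before the start tag of the current node (ancestors excluded).
preceding-sibling    Selects all siblings before the current node.
self                 Selects the current node.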
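
To make the axes concrete, here is a minimal sketch that walks several of them starting from one node. The sample markup, the class names, and the helper tags() are all made up for this illustration; only etree.HTML and xpath() come from lxml.

```python
from lxml import etree

# Hypothetical sample document, invented for this illustration.
SAMPLE = """
<html><body>
  <div id="menu">
    <ul>
      <li class="first"><a href="link1.html">one</a></li>
      <li class="second"><a href="link2.html">two</a></li>
      <li class="third"><a href="link3.html">three</a></li>
    </ul>
  </div>
</body></html>
"""

tree = etree.HTML(SAMPLE)


def tags(nodes):
    """Return the tag names of a list of element nodes (illustrative helper)."""
    return [node.tag for node in nodes]


# Start from the second <li> and walk along different axes.
ancestors = tags(tree.xpath('//li[@class="second"]/ancestor::*'))
children = tags(tree.xpath('//li[@class="second"]/child::*'))
before = tree.xpath('//li[@class="second"]/preceding-sibling::li/@class')
after = tree.xpath('//li[@class="second"]/following-sibling::li/@class')
attrs = tree.xpath('//li[@class="second"]/a/attribute::href')
parent = tags(tree.xpath('//a[@href="link2.html"]/parent::*'))

print(ancestors)  # ['html', 'body', 'div', 'ul']
print(children)   # ['a']
print(before)     # ['first']
print(after)      # ['third']
print(attrs)      # ['link2.html']
print(parent)     # ['li']
```

Results come back in document order, which is why the ancestors are listed from the outermost element inward.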

5. XPath operators

The following operators can be used in XPath expressions:

Operator  Description               Example                     Return value
|         Union of two node-sets    //book | //cd               A node-set containing all book and cd elements
+         Addition                  6 + 4                       10
-         Subtraction               6 - 4                       2
*         Multiplication            6 * 4                       24
div       Division                  8 div 4                     2
=         Equal to                  price=9.80                  true if price is 9.80; false if price is 9.90
!=        Not equal to              price!=9.80                 true if price is 9.90; false if price is 9.80
<         Less than                 price<9.80                  true if price is 9.00; false if price is 9.90
<=        Less than or equal to     price<=9.80                 true if price is 9.00; false if price is 9.90
>         Greater than              price>9.80                  true if price is 9.90; false if price is 9.80
>=        Greater than or equal to  price>=9.80                 true if price is 9.90; false if price is 9.70
or        Logical or                price=9.80 or price=9.70    true if price is 9.80; false if price is 9.50
and       Logical and               price>9.00 and price<9.90   true if price is 9.80; false if price is 8.50
mod       Remainder of a division   5 mod 2                     1
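
The operators can be tried directly through lxml. The sketch below assumes a small bookstore document invented for this example; the element names book, cd, title and price are not from any real data set. Note that lxml returns XPath numbers as Python floats.

```python
from lxml import etree

# Hypothetical bookstore data, invented for this example.
XML = """
<bookstore>
  <book><title>A</title><price>9.80</price></book>
  <book><title>B</title><price>9.90</price></book>
  <book><title>C</title><price>8.50</price></book>
  <cd><title>D</title><price>9.00</price></cd>
</bookstore>
"""

root = etree.fromstring(XML)

# Arithmetic operators evaluate to XPath numbers (Python floats in lxml).
total = root.xpath('6 + 4')
quotient = root.xpath('8 div 4')
remainder = root.xpath('5 mod 2')

# Comparison and boolean operators are typically used inside predicates.
cheap = root.xpath('//book[price < 9.80]/title/text()')
mid = root.xpath('//book[price > 9.00 and price < 9.90]/title/text()')
either = root.xpath('//book[price = 9.80 or price = 8.50]/title/text()')
not_980 = root.xpath('//book[price != 9.80]/title/text()')

# The union operator | merges two node-sets, keeping document order.
union = [el.tag for el in root.xpath('//book | //cd')]

print(total, quotient, remainder)  # 10.0 2.0 1.0
print(cheap)    # ['C']
print(mid)      # ['A']
print(either)   # ['A', 'C']
print(not_980)  # ['B', 'C']
print(union)    # ['book', 'book', 'book', 'cd']
```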

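That covers XPath. Next up is a real workhorse, lxml. It is very fast, and for a long time it was my favorite parser to use with BeautifulSoup, bar none, because it really is much faster than the alternatives, html.parser and html5lib.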

II. lxml

1. Installing lxml

lxml is a parsing library for XML and HTML with XPath support. Installation is easy: just run pip install lxml (or easy_install lxml on older setups).
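
A quick way to confirm that the installation worked is to import the package and print its version. This is only a minimal check, not part of the original article; LXML_VERSION and LIBXML_VERSION are version tuples exposed by lxml.etree.

```python
from lxml import etree

# LXML_VERSION is a tuple of integers describing the installed lxml release.
lxml_version = ".".join(str(part) for part in etree.LXML_VERSION)
# LIBXML_VERSION reports the underlying libxml2 C library in use.
libxml_version = ".".join(str(part) for part in etree.LIBXML_VERSION)

print("lxml version:", lxml_version)
print("libxml2 version:", libxml_version)
```

If the import fails with ModuleNotFoundError, the package was most likely installed into a different Python environment than the one running the script.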

2. Using lxml

lxml offers two ways to parse a web page: one for a local HTML file saved on disk, the other for a page fetched online.

Import the package:

from lxml import etree

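1. Parsing a local page:

# Parse a file on disk with the HTML parser
html = etree.parse('xx.html', etree.HTMLParser())
# //* matches any element, here narrowed to the one with the given id
aa = html.xpath('//*[@id="s_xmancard_news"]/div/div[2]/div/div[1]/h2/a[1]/@href')
print(aa)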

2. Parsing an online page:

from lxml import etree
import requests

# Download the page, then build the tree from the response text
rep = requests.get('https://www.baidu.com')
html = etree.HTML(rep.text)
aa = html.xpath('//*[@id="s_xmancard_news"]/div/div[2]/div/div[1]/h2/a[1]/@href')
print(aa)
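
The two entry points can be compared side by side without touching the network. The sketch below is an illustration under assumptions: the page content, the id "news" and the link targets are invented, and a temporary file stands in for a saved page. The unclosed li tags are deliberate, to show that the HTML parser repairs them.

```python
import os
import tempfile
from lxml import etree

# Hypothetical page content; the missing </li> tags are deliberate.
PAGE = ('<div id="news"><ul>'
        '<li><a href="a.html">first</a>'
        '<li><a href="b.html">second</a>'
        '</ul></div>')

# Way 1: parse a file on disk with an explicit HTML parser.
with tempfile.NamedTemporaryFile('w', suffix='.html', delete=False,
                                 encoding='utf-8') as fh:
    fh.write(PAGE)
    path = fh.name
try:
    from_file = etree.parse(path, etree.HTMLParser())
finally:
    os.remove(path)  # clean up the temporary file

# Way 2: parse a string already in memory (for example, a response body).
from_string = etree.HTML(PAGE)

query = '//*[@id="news"]//a/@href'
file_links = from_file.xpath(query)
string_links = from_string.xpath(query)

print(file_links)    # ['a.html', 'b.html']
print(string_links)  # ['a.html', 'b.html']
```

One difference worth knowing: etree.parse returns an ElementTree object, while etree.HTML returns the root Element. Both support xpath(), so the query code is the same either way.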

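So how do we get at these tags and their attribute values? It is simple: first select the tag with an XPath expression, as in the examples above.

Then, say you want the text inside an a tag and the value of its href attribute. There are two ways to do it: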

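1. Inside the expression

# text() returns the text nodes, @href returns the attribute values; both come back as lists
aa = html.xpath('//*[@id="s_xmancard_news"]/div/div[2]/div/div[1]/h2/a[1]/text()')
ab = html.xpath('//*[@id="s_xmancard_news"]/div/div[2]/div/div[1]/h2/a[1]/@href')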

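2. Outside the expression

# xpath() always returns a list, so take the first matching element before reading from it
aa = html.xpath('//*[@id="s_xmancard_news"]/div/div[2]/div/div[1]/h2/a[1]')[0]
aa.text
aa.attrib.get('href')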
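
The two approaches can be compared on a small, self-contained snippet. The markup, the id "news" and the example.com URL below are placeholders invented for this sketch. It also guards against an empty result, since xpath() returns an empty list when nothing matches.

```python
from lxml import etree

# Hypothetical snippet standing in for a real page.
html = etree.HTML(
    '<div id="news"><h2><a href="https://example.com/story">Top story</a></h2></div>'
)

# Inside the expression: text() and @href return lists of strings.
texts = html.xpath('//*[@id="news"]/h2/a[1]/text()')
hrefs = html.xpath('//*[@id="news"]/h2/a[1]/@href')

# Outside the expression: select the element, then read from the object.
links = html.xpath('//*[@id="news"]/h2/a[1]')
link = links[0] if links else None  # the list is empty when nothing matches
link_text = link.text if link is not None else None
link_href = link.get('href') if link is not None else None

print(texts, hrefs)
print(link_text, link_href)
```

Here link.get('href') is shorthand for link.attrib.get('href'). Selecting the element once and reading several values from it can be convenient when both the text and multiple attributes are needed.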

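And that is all it takes. Pretty simple, right? Ha ha.

Next, the selection rules that lxml follows:

Expression  Description
nodename    Selects the children of the current node that are named nodename
/           Selects direct children of the current node
//          Selects descendants of the current node at any depth
.           Selects the current node
..          Selects the parent of the current node
@           Selects attributes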

from lxml import etree

# Build a tree that can be queried with XPath from text (assumed to hold a string of HTML); etree repairs malformed HTML automatically
html = etree.HTML(text)
# Alternatively, read and parse a file directly
html = etree.parse('./ex.html', etree.HTMLParser())

result = html.xpath('//*')  # select all element nodes
result = html.xpath('//li')  # all li nodes
result = html.xpath('//li/a')  # all a nodes that are direct children of an li
result = html.xpath('//li//a')  # all a nodes that are descendants of an li
result = html.xpath('//a[@href="link.html"]/../@class')  # class attribute of the parent of every a whose href is link.html
result = html.xpath('//li[@class="ni"]')  # all li nodes whose class attribute is exactly ni
result = html.xpath('//li/text()')  # text of all li nodes
result = html.xpath('//li/a/@href')  # href attribute of the a nodes under every li
result = html.xpath('//li[contains(@class,"li")]/a/text()')  # when class holds several values, match with contains()
result = html.xpath('//li[contains(@class,"li") and @name="item"]/a/text()')  # match on several attributes

# Select by position; the functions inside the brackets are provided by XPath
result = html.xpath('//li[1]/a/text()')
result = html.xpath('//li[last()]/a/text()')
result = html.xpath('//li[position()<3]/a/text()')
result = html.xpath('//li[last()-2]/a/text()')

result = html.xpath('//li[1]/ancestor::*')  # all ancestor nodes
result = html.xpath('//li[1]/ancestor::div')  # only the div ancestors
result = html.xpath('//li[1]/attribute::*')  # attribute values
result = html.xpath('//li[1]/child::a[@href="link1.html"]')  # direct children
result = html.xpath('//li[1]/descendant::span')  # descendants at any depth
result = html.xpath('//li[1]/following::*[2]')  # the second of all nodes after the current node
result = html.xpath('//li[1]/following-sibling::*')  # all following siblings
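
To see what some of these expressions actually return, here is a small runnable sketch. The markup is hypothetical, written only to exercise the rules above; the class names li and ni and the link file names are placeholders.

```python
from lxml import etree

# Hypothetical markup, written only to exercise the expressions.
text = """
<div>
  <ul>
    <li class="li li-first" name="item"><a href="link1.html">first item</a></li>
    <li class="ni"><a href="link2.html">second item</a></li>
    <li class="li"><a href="link3.html">third item</a></li>
    <li class="ni"><a href="link4.html">fourth item</a></li>
  </ul>
</div>
"""
html = etree.HTML(text)

all_links = html.xpath('//li/a/@href')
ni_count = len(html.xpath('//li[@class="ni"]'))

# @class="li" needs an exact match; contains() also accepts "li li-first".
exact = html.xpath('//li[@class="li"]/a/text()')
partial = html.xpath('//li[contains(@class, "li")]/a/text()')
multi = html.xpath('//li[contains(@class, "li") and @name="item"]/a/text()')

# Positional predicates are 1-based.
first = html.xpath('//li[1]/a/text()')
last = html.xpath('//li[last()]/a/text()')
first_two = html.xpath('//li[position() < 3]/a/text()')
third_from_last = html.xpath('//li[last() - 2]/a/text()')

# .. steps up to the parent before reading its class attribute.
parent_class = html.xpath('//a[@href="link2.html"]/../@class')

print(exact)            # ['third item']
print(partial)          # ['first item', 'third item']
print(multi)            # ['first item']
print(first_two)        # ['first item', 'second item']
print(third_from_last)  # ['second item']
print(parent_class)     # ['ni']
```

Keep in mind that contains() does plain substring matching, so contains(@class, "li") would also match a class such as "list".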

3. An lxml example

To save some effort, I decided to reuse the code from the earlier urllib article. Ha ha, clever me.