Beautiful Soup Study Notes

pqcc

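BeautifulSoup is an HTML parsing library, and a very capable one: it is considerably more powerful than the Java and Python HTML parsers I had used before. My favorite part is BeautifulSoup's DOM selection, which feels a lot like jQuery; developers who like jQuery should take to it right away.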

What follows is just my own study notes, kept as a quick reference.

1. titleTag = soup.html.head.title
2. len(soup('p'))    # number of <p> tags
3. soup.findAll('p', align="center")
4. print soup.prettify()
5. Some ways to navigate the soup:
    soup.contents[0].name
    soup.contents[0].contents[0].name
    head = soup.contents[0].contents[0]
    head.next
    head.nextSibling.name
    head.nextSibling.contents[0]
6. titleTag = soup.html.head.title
   # <title>Page title</title>
   titleTag.string    # returns the content between <title> and </title>
7. <p id="firstpara" align="center">This is paragraph <b>one</b>. </p>
   soup('p', align="center")[0]['id']    # u'firstpara'
   soup.find('p', align=re.compile('^b.*'))['id']    # id of the first <p> whose align value starts with "b"
8. Modifying the soup:
    titleTag['id'] = 'theTitle'
    titleTag.contents[0].replaceWith("New title")
    soup.p.replaceWith(soup.b)
    soup.body.insert(0, "This page used to have ")
    soup.body.insert(2, " <p> tags!")
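9. Looping:
   for incident in soup('td', width="90%"):
       pass    # handle each matching <td> here
10. A major drawback of BeautifulStoneSoup is that it does not know how to handle self-closing tags unless they are listed in its selfClosingTags argument.
11. The prettify method adds newlines and spaces to make the document structure easier to see, and strips text nodes that contain only whitespace.
    The str and unicode functions do not strip those whitespace-only nodes, and they do not add any whitespace either.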
12. heading = soup.h1
    str(heading)                # '<h1>Heading</h1>'
    heading.renderContents()    # 'Heading'
13. firstPTag, secondPTag = soup.findAll('p')
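    firstPTag['id']     # u'firstPara'
    secondPTag['id']    # u'secondPara'
14. soup.p jumps to the first <P> tag in the document.
    soup.table.tr.td jumps to the first cell of the first row of the first table in the document.
    Another way to get the first <FOO> tag is to use .fooTag instead of .foo.
    For example, soup.table.tr.td can also be written as soup.tableTag.trTag.tdTag
15. findAll and find are available only on Tag objects and the top-level parser object, not on NavigableString objects.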
16. The basic find method: findAll(name, attrs, recursive, text, limit, **kwargs)
17. Find all <TITLE> and <P> tags:
    (1). soup.findAll(['title', 'p'])
    (2). soup.findAll({'title' : True, 'p' : True})   (somewhat faster)
18. Passing True matches every tag name, which means it matches every tag.
    allTags = soup.findAll(True)
    [tag.name for tag in allTags]
    # [u'html', u'head', u'title', u'body', u'p', u'b', u'p', u'b']
19. Find tags that have exactly two attributes:
    soup.findAll(lambda tag: len(tag.attrs) == 2)
    Find tags with a one-character name and no attributes:
    soup.findAll(lambda tag: len(tag.name) == 1 and not tag.attrs)
20. Keyword arguments filter on tag attributes. The first example below finds all tags whose align attribute has the value center:
    soup.findAll(align="center")
    soup.findAll(id=re.compile("para$"))
    soup.findAll(align=["center", "blah"])
    soup.findAll(align=lambda value: value and len(value) < 5)
    Special values: soup.findAll(align=True)    # tags that have an align attribute
            [tag.name for tag in soup.findAll(align=None)]    # tags with no align attribute
    soup.findAll(attrs={'id' : re.compile("para$")})
21. Searching by CSS class:
    soup.find("tagName", { "class" : "cssClass" })
    Searching by text:
    soup.findAll(text="one")
    soup.findAll(text=u'one')
    soup.findAll(text=["one", "two"])
    soup.findAll(text=re.compile("paragraph"))
    soup.findAll(text=lambda x: len(x) < 12)
22. The limit argument:
    soup.findAll('p', limit=1)
23. The find method works like findAll, but returns only the first matching object (or None when nothing matches).
24. Related search methods:
    findNextSiblings(name, attrs, text, limit, **kwargs)
    findPreviousSiblings(name, attrs, text, limit, **kwargs)
    findAllNext(name, attrs, text, limit, **kwargs)
    findAllPrevious(name, attrs, text, limit, **kwargs)
    findParents(name, attrs, limit, **kwargs)
25. subtree = soup.a
    subtree.extract()    # removes the <a> subtree from the document and returns it
26. soup.find(text="Argh!").replaceWith("Hooray!")
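
The notes above use the Beautiful Soup 3 method names under Python 2. As a rough sketch of how a few of them fit together, the block below assumes the newer bs4 package (Beautiful Soup 4) under Python 3, where findAll is spelled find_all, replaceWith is spelled replace_with, and the text= keyword is spelled string=. The sample HTML and all variable names are made up for illustration and do not come from the notes.

```python
# Assumes the third-party bs4 package (Beautiful Soup 4) is installed.
import re

from bs4 import BeautifulSoup

# Hypothetical sample document, invented for this sketch.
HTML = (
    '<html><head><title>Page title</title></head><body>'
    '<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>'
    '</body></html>'
)

soup = BeautifulSoup(HTML, "html.parser")

# Navigation (notes 1, 6, 14): attribute access jumps to the first matching tag.
title_text = soup.html.head.title.string
first_p_id = soup.p["id"]

# Searching (notes 2, 3, 7, 18-22); find_all is the bs4 spelling of findAll.
p_count = len(soup("p"))  # calling the soup is shorthand for find_all
centered_ids = [p["id"] for p in soup.find_all("p", align="center")]
b_align_id = soup.find("p", align=re.compile("^b.*"))["id"]
para_ids = [t["id"] for t in soup.find_all(id=re.compile("para$"))]
all_names = [tag.name for tag in soup.find_all(True)]
two_attr_names = [t.name for t in soup.find_all(lambda tag: len(tag.attrs) == 2)]
text_hits = soup.find_all(string=re.compile("paragraph"))
limited = soup.find_all("p", limit=1)

# Modification (notes 8, 25): set an attribute, swap text, detach a subtree.
soup.title["id"] = "theTitle"
soup.title.string.replace_with("New title")
removed = soup.b.extract()  # detaches the first <b> and returns it

print(title_text, first_p_id, p_count)
print(centered_ids, b_align_id, para_ids)
print(all_names)
print(soup.title)
```

The searches are collected before the modifications on purpose, so that the counts reflect the original document rather than the edited one.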

2010-03-29 17:26