Modifying the Document Tree with Beautiful Soup


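Besides searching the parse tree, one of the most important features of Beautiful Soup is that it lets you change a web document as needed. You can change a tag's properties through attributes such as .name and .string, or add content with the .append() method. The .new_string() and .new_tag() methods let you add new strings and tags to an existing tag. Other methods, such as .insert(), .insert_before(), and .insert_after(), support further modifications to an HTML or XML document.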

The previous article covered searching the document tree with Beautiful Soup; this one covers modifying it.

Changing Tag Names and Attributes

Once a BeautifulSoup object has been created, modifications are straightforward: you can rename a tag, change its attributes, add new attributes, and delete attributes.

from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="bolder">Beautiful Soup搜索文档树</b>', "lxml")
tag = soup.b

Rename the tag, then modify an attribute and add a new one:

tag.name = 'Blockquote'
tag['class'] = 'Bolder'
tag['id'] = 1.1
tag

Output:

<Blockquote class="Bolder" id="1.1">Beautiful Soup搜索文档树</Blockquote>

Delete attributes as follows:

del tag['class']
tag

del tag['id']
tag

Output:

<Blockquote id="1.1">Beautiful Soup搜索文档树</Blockquote>
<Blockquote>Beautiful Soup搜索文档树</Blockquote>
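The attribute operations above can be combined into one short script. This is a minimal sketch, not the article's own code: it assumes the built-in "html.parser" instead of lxml, and the attribute values ("intro", "quote") are made up for illustration. It also shows tag.get(), which returns None for a missing attribute instead of raising KeyError.

```python
from bs4 import BeautifulSoup

# Assumption: "html.parser" is used here so the sketch has no lxml dependency.
soup = BeautifulSoup('<b class="bolder">text</b>', "html.parser")
tag = soup.b

tag.name = "blockquote"   # rename the tag
tag["class"] = "quote"    # overwrite an existing attribute
tag["id"] = "intro"       # add a new attribute (illustrative value)
after_edit = str(tag)

del tag["class"]                    # delete an attribute
missing_class = tag.get("class")    # None, because the attribute is gone
remaining_attrs = dict(tag.attrs)   # only the id attribute is left

print(after_edit)
print(missing_class, remaining_attrs)
```

Indexing a missing attribute directly (tag["class"]) would raise KeyError, so tag.get() is the safer choice when an attribute may not exist.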

Modifying .string

You can assign to a tag's .string attribute to replace its contents:

markup = '<a href="https://www.pythonthree.com/python_basic/pycharm-tutorials/">Pycharm教程</a>'
soup = BeautifulSoup(markup,"lxml")
tag = soup.a
tag.string = "My Favourite tutorial."
tag

Output:

<a href="https://www.pythonthree.com/python_basic/pycharm-tutorials/">My Favourite tutorial.</a>

Note that if the tag contains other tags, they and all of their contents are replaced by the new string.
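A small sketch of that replacement behaviour, assuming the built-in "html.parser" and a made-up snippet with a nested <i> tag (the example.com URL is a placeholder):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the <a> tag holds a string, a nested <i> tag, and another string.
markup = '<a href="https://example.com/">Read <i>this</i> guide</a>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.a

children_before = len(tag.contents)   # "Read ", <i>this</i>, " guide"

tag.string = "New text"               # replaces every existing child, nested tags included

children_after = len(tag.contents)
nested_tag = tag.i                    # the <i> tag is no longer in the tree
result = str(tag)
print(result)
```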

append()

Use tag.append() to add new content to an existing tag. It works much like the append() method of a Python list.

markup = '<a href="https://www.pythonthree.com/python_basic/pycharm-tutorials/">Pycharm教程</a>'
soup = BeautifulSoup(markup, "lxml")
soup.a.append(" Really Liked it")
soup

soup.a.contents

Output:

<html><body><a href="https://www.pythonthree.com/python_basic/pycharm-tutorials/">Pycharm教程 Really Liked it</a></body></html>
['Pycharm教程', ' Really Liked it']
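append() accepts tags as well as plain strings. The sketch below assumes the built-in "html.parser" and invented sample markup, and appends one of each to the same paragraph:

```python
from bs4 import BeautifulSoup

# Assumption: "html.parser" and this sample paragraph are illustrative only.
soup = BeautifulSoup("<p>Hello</p>", "html.parser")

soup.p.append(", world")                                # a plain string becomes a new text node
link = soup.new_tag("a", href="https://example.com/")   # placeholder URL
link.string = "link"
soup.p.append(link)                                     # a tag is appended after the text nodes

contents = [str(child) for child in soup.p.contents]
result = str(soup.p)
print(contents)
print(result)
```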

NavigableString() and .new_tag()

To add a string to a document, use append() or the NavigableString() constructor:

>>> soup = BeautifulSoup("<b></b>")
>>> tag = soup.b
>>> tag.append("Start")
>>>
>>> new_string = NavigableString(" Your")
>>> tag.append(new_string)
>>> tag
<b>Start Your</b>
>>> tag.contents
['Start', ' Your']

Note: if calling NavigableString() raises NameError: name 'NavigableString' is not defined, import NavigableString from the bs4 package:

from bs4 import NavigableString

This resolves the error above. You can also add a comment to an existing tag, or any other subclass of NavigableString, simply by calling its constructor:

>>> from bs4 import Comment
>>> adding_comment = Comment("Always Learn something Good!")
>>> tag.append(adding_comment)
>>> tag
<b>Start Your<!--Always Learn something Good!--></b>
>>> tag.contents
['Start', ' Your', 'Always Learn something Good!']

To create a completely new tag (rather than appending to an existing one), use the built-in BeautifulSoup.new_tag() method:

>>> soup = BeautifulSoup("<b></b>")
>>> Otag = soup.b
>>>
>>> Newtag = soup.new_tag("a", href="https://www.pythonthree.com")
>>> Otag.append(Newtag)
>>> Otag
<b><a href="https://www.pythonthree.com"></a></b>

Only the first argument, the tag name, is required.
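Some attribute names cannot be passed to new_tag() as keyword arguments, because they are Python keywords (class) or contain a hyphen (data-id). One way around this, sketched here with the built-in "html.parser" and made-up attribute values, is to create the tag first and then set those attributes by item assignment:

```python
from bs4 import BeautifulSoup

# Assumption: "html.parser", the <div> container and all attribute values are illustrative.
soup = BeautifulSoup("<div></div>", "html.parser")

link = soup.new_tag("a", href="https://example.com/")  # ordinary attributes as keyword arguments
link["class"] = "external"   # "class" is a Python keyword, so it is set by item assignment
link["data-id"] = "42"       # hyphenated names are not valid keyword arguments either
link.string = "Example"

soup.div.append(link)        # the new tag joins the tree only once it is appended
attrs = dict(link.attrs)
print(soup.div)
```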

insert()

Like .insert() on a Python list, tag.insert() inserts a new element. Unlike tag.append(), the new element does not have to go at the end of its parent's contents; it can be added at any position.

>>> markup = '<a href="https://www.pythonthree.com/wordpress-theme-for-web-design/genesis-theme-tutorials/">Genesis主题建站教程</a>'
>>> soup = BeautifulSoup(markup, "lxml")
>>> tag = soup.a
>>>
>>> tag.insert(1, "Love this Tutorial")
>>> tag
<a href="https://www.pythonthree.com/wordpress-theme-for-web-design/genesis-theme-tutorials/">Genesis主题建站教程Love this Tutorial</a>
>>> tag.contents
['Genesis主题建站教程', 'Love this Tutorial']
>>>
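The position argument counts the children of the tag, both strings and nested tags. A minimal sketch with several children, assuming the built-in "html.parser" and invented sample markup:

```python
from bs4 import BeautifulSoup

# Assumption: "html.parser" and this sample markup are illustrative only.
soup = BeautifulSoup("<p>one<b>two</b>three</p>", "html.parser")
p = soup.p                      # children: "one", <b>two</b>, "three"

p.insert(0, "zero ")            # index 0: becomes the first child

marker = soup.new_tag("em")
marker.string = "1.5"
p.insert(2, marker)             # index 2: lands between "one" and <b>two</b>

contents = [str(child) for child in p.contents]
print(contents)
```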

insert_before() and insert_after()

To insert a tag or string immediately before something in the parse tree, use insert_before():

>>> soup = BeautifulSoup("<b>Brave</b>")
>>> tag = soup.new_tag("i")
>>> tag.string = "Be"
>>>
>>> soup.b.string.insert_before(tag)
>>> soup.b
<b><i>Be</i>Brave</b>

Similarly, to insert a tag or string immediately after something in the parse tree, use insert_after():

>>> soup.b.i.insert_after(soup.new_string(" Always "))
>>> soup.b
<b><i>Be</i> Always Brave</b>
>>> soup.b.contents
[<i>Be</i>, ' Always ', 'Brave']
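Both methods are called on the reference element, not on its parent. The sketch below, which assumes the built-in "html.parser" and a made-up list, adds siblings on either side of an existing <li>:

```python
from bs4 import BeautifulSoup

# Assumption: "html.parser" and this list markup are illustrative only.
soup = BeautifulSoup("<ul><li>b</li></ul>", "html.parser")
middle = soup.li                 # keep a reference; soup.li always means the first <li>

first = soup.new_tag("li")
first.string = "a"
last = soup.new_tag("li")
last.string = "c"

middle.insert_before(first)      # new sibling placed directly before <li>b</li>
middle.insert_after(last)        # new sibling placed directly after <li>b</li>

result = str(soup)
print(result)
```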

clear()

To remove the contents of a tag, use tag.clear():

>>> markup = '<a href="https://www.pythonthree.com/">For <i>technical & Non-technical</i> Contents</a>'
>>> soup = BeautifulSoup(markup)
>>> tag = soup.a
>>> tag
<a href="https://www.pythonthree.com/">For <i>technical &amp; Non-technical</i> Contents</a>
>>>
>>> tag.clear()
>>> tag
<a href="https://www.pythonthree.com/"></a>
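clear() empties a tag but leaves the tag itself, and its attributes, in place. A minimal sketch assuming the built-in "html.parser" and invented markup:

```python
from bs4 import BeautifulSoup

# Assumption: "html.parser" and this sample markup are illustrative only.
soup = BeautifulSoup('<div id="box">Some <b>bold</b> text</div>', "html.parser")
div = soup.div

div.clear()                      # removes every child: strings and nested tags alike

remaining_children = list(div.contents)
kept_id = div["id"]              # attributes on the tag itself are untouched
result = str(div)
print(result)
```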


extract()

To remove a tag or string from the tree, use PageElement.extract(). It returns the element that was removed:

>>> markup = '<a href="https://www.pythonthree.com/wordpress-theme-for-web-design/enfold-theme-tutorials/">For <i>technical & Non-technical</i> Contents</a>'
>>> soup = BeautifulSoup(markup)
>>> a_tag = soup.a
>>>
>>> i_tag = soup.i.extract()
>>>
>>> a_tag
<a href="https://www.pythonthree.com/wordpress-theme-for-web-design/enfold-theme-tutorials/">For  Contents</a>
>>>
>>> i_tag
<i>technical &amp; Non-technical</i>
>>>
>>> print(i_tag.parent)
None
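Because extract() hands back the removed element, it can be re-attached somewhere else in the tree. The following sketch assumes the built-in "html.parser" and invented markup, and moves an <aside> from one container to another:

```python
from bs4 import BeautifulSoup

# Assumption: "html.parser" and this sample markup are illustrative only.
markup = "<div><p>keep</p><aside>move me</aside></div><footer></footer>"
soup = BeautifulSoup(markup, "html.parser")

aside = soup.aside.extract()          # detach <aside> from the tree
parent_after_extract = aside.parent   # None while the element is detached

soup.footer.append(aside)             # re-attach it under <footer>
result = str(soup)
print(result)
```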

decompose()

tag.decompose() removes a tag from the tree and then destroys it along with all of its contents.

>>> markup = '<a href="https://www.pythonthree.com/wordpress-plugins/elementor-tutorial/">For <i>technical & Non-technical</i> Contents</a>'
>>> soup = BeautifulSoup(markup)
>>> a_tag = soup.a
>>> a_tag
<a href="https://www.pythonthree.com/wordpress-plugins/elementor-tutorial/">For <i>technical &amp; Non-technical</i> Contents</a>
>>>
>>> soup.i.decompose()
>>> a_tag
<a href="https://www.pythonthree.com/wordpress-plugins/elementor-tutorial/">For  Contents</a>
>>>
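A common use of decompose() is discarding elements that are never needed again, such as <script> and <style> blocks. This is a minimal sketch under stated assumptions: the built-in "html.parser" and a made-up snippet.

```python
from bs4 import BeautifulSoup

# Assumption: "html.parser" and this sample markup are illustrative only.
markup = "<p>Text<script>alert(1)</script> more<style>p{}</style></p>"
soup = BeautifulSoup(markup, "html.parser")

# find_all() returns a list-like result, so removing tags while looping over it is safe.
for junk in soup.find_all(["script", "style"]):
    junk.decompose()             # remove the tag and destroy its contents

result = str(soup)
visible_text = soup.get_text()
print(result)
```

Use extract() instead when the removed element is still needed afterwards; a decomposed tag should not be reused.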

replace_with()

As the name suggests, PageElement.replace_with() removes a tag or string from the tree and puts a new tag or string in its place:

>>> markup = '<a href="https://www.pythonthree.com/wordpress-web-design/">Complete Python <i>Material</i></a>'
>>> soup = BeautifulSoup(markup,"lxml")
>>> a_tag = soup.a
>>>
>>> new_tag = soup.new_tag("Official_site")
>>> new_tag.string = "https://www.pythonthree.com/"
>>> a_tag.i.replace_with(new_tag)
<i>Material</i>
>>>
>>> a_tag
<a href="https://www.pythonthree.com/wordpress-web-design/">Complete Python <Official_site>https://www.pythonthree.com/</Official_site></a>

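As the output above shows, replace_with() returns the tag or string that was replaced (<i>Material</i> in this example), so you can inspect it or add it back to another part of the tree.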
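Adding the replaced element back elsewhere might look like the following sketch, which assumes the built-in "html.parser" and invented markup:

```python
from bs4 import BeautifulSoup

# Assumption: "html.parser" and this sample markup are illustrative only.
soup = BeautifulSoup("<p>Old <b>bold</b> text</p><div></div>", "html.parser")

em = soup.new_tag("em")
em.string = "emphasis"

removed = soup.b.replace_with(em)   # returns the <b> tag that was swapped out
soup.div.append(removed)            # put the old tag back in another part of the tree

result = str(soup)
print(result)
```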

wrap()

PageElement.wrap() wraps an element in the tag you specify and returns the new wrapper:

>>> soup = BeautifulSoup("<p>tutorialspoint.com</p>")
>>> soup.p.string.wrap(soup.new_tag("b"))
<b>tutorialspoint.com</b>
>>>
>>> soup.p.wrap(soup.new_tag("Div"))
<Div><p><b>tutorialspoint.com</b></p></Div>
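When wrapping several elements, each one needs its own wrapper tag, because a single tag object can sit in only one place in the tree. A minimal sketch assuming the built-in "html.parser" and invented markup:

```python
from bs4 import BeautifulSoup

# Assumption: "html.parser" and this sample markup are illustrative only.
soup = BeautifulSoup("<p>one</p><p>two</p>", "html.parser")

wrappers = []
for paragraph in soup.find_all("p"):
    # Create a fresh <div> for every paragraph; wrap() returns that wrapper.
    wrappers.append(paragraph.wrap(soup.new_tag("div")))

result = str(soup)
print(result)
```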

unwrap()

tag.unwrap() is the opposite of wrap(): it replaces a tag with whatever is inside that tag.

>>> soup = BeautifulSoup('<a href="https://www.pythonthree.com/">I liked <i>tutorialspoint</i></a>')
>>> a_tag = soup.a
>>>
>>> a_tag.i.unwrap()
<i></i>
>>> a_tag
<a href="https://www.pythonthree.com/">I liked tutorialspoint</a>

As shown above, unwrap() returns the tag that was replaced, just like replace_with(). unwrap() is well suited to stripping markup; here is another example:

>>> soup = BeautifulSoup("<p>I <strong>AM</strong> a <i>text</i>.</p>")
>>> soup.i.unwrap()
<i></i>
>>> soup
<html><body><p>I <strong>AM</strong> a text.</p></body></html>
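To strip several kinds of formatting tags in one pass, unwrap() can be combined with find_all(). This sketch assumes the built-in "html.parser", so no <html> and <body> wrappers are added to the output:

```python
from bs4 import BeautifulSoup

# Assumption: "html.parser" is used, so the fragment is not wrapped in <html><body>.
soup = BeautifulSoup("<p>I <strong>AM</strong> a <i>text</i>.</p>", "html.parser")

for formatting_tag in soup.find_all(["strong", "i"]):
    formatting_tag.unwrap()      # keep the text, drop the surrounding tag

result = str(soup)
text = soup.p.get_text()
print(result)
```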

Summary

That covers the main ways to modify the document tree with Beautiful Soup. For more detail, see the official documentation.
