Python Example: How Do You Get a Web Page's Source Code?

wen Python Examples 2026-06-08 66

Python Example: How to Get a Web Page's Source Code? A Complete Tutorial with Hands-On Code Walkthroughs

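Contents

Why fetch a web page's source?
Preparation: Python environment and required libraries
Fetching page source with the requests library (most common)
Fetching page source with the urllib library (built-in standard library)
Handling dynamic pages: getting the full source with Selenium
Common problems and error handling (Q&A)
Hands-on example: scraping news headlines and saving them
Summary and SEO recommendations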

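Why fetch a web page's source?

Fetching a page's source is the first step in data collection, SEO analysis, content monitoring, and automated testing. The source contains the HTML structure, CSS styles, JavaScript data, and metadata, and it is the basis for all later parsing. For example, whether you want to monitor a competitor's page for updates or scrape product prices, you first need to get hold of the page source.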


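Preparation: Python environment and required libraries

First make sure Python 3.6 or later is installed. Then open a terminal or command prompt and install the following libraries:

pip install requests beautifulsoup4 selenium lxml

requests: the most popular HTTP library, used to send network requests.
urllib: built into Python, no installation needed.
selenium: drives a real browser, for content loaded dynamically by JavaScript.
beautifulsoup4: parses HTML and extracts data (used in the later examples).
lxml: a fast HTML/XML parser, used as the BeautifulSoup backend in the later examples.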

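Method 1: Fetching page source with the requests library (most common)

import requests

url = "https://example.com"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}
# A timeout keeps the call from hanging forever on an unresponsive server
response = requests.get(url, headers=headers, timeout=10)
response.encoding = 'utf-8'  # avoid garbled text; assumes the page is UTF-8
if response.status_code == 200:
    html_source = response.text
    print(html_source[:500])  # print the first 500 characters
else:
    print("Request failed, status code:", response.status_code)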

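Key points:

Add a User-Agent header to mimic a browser and reduce the chance of being blocked.
Set encoding to fix garbled Chinese text.
Check for status code 200 to confirm the request succeeded.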

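Method 2: Fetching page source with the urllib library (built-in standard library)

from urllib.request import urlopen, Request

url = "https://example.com"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
with urlopen(req, timeout=10) as response:
    # read() returns bytes; decode them, assuming the page is UTF-8
    html_source = response.read().decode('utf-8')
    print(html_source[:500])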

Pros and cons:

Pro: no third-party library to install.
Con: fewer conveniences; handling cookies and sessions is more awkward than with requests.
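
To make the cookie point concrete, here is a minimal sketch of what cookie handling looks like with the standard library alone. The helper names make_session and fetch are made up for this illustration; the classes they use (CookieJar, HTTPCookieProcessor, build_opener) come from the standard library.

```python
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, build_opener


def make_session(user_agent="Mozilla/5.0"):
    """Build an opener that stores cookies between requests (hypothetical helper)."""
    jar = CookieJar()
    opener = build_opener(HTTPCookieProcessor(jar))
    # addheaders is sent with every request made through this opener
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


def fetch(opener, url, timeout=10):
    """Fetch a URL through the opener and return the decoded body."""
    with opener.open(url, timeout=timeout) as response:
        # Use the charset from the Content-Type header, falling back to UTF-8
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Cookies set by one response are kept in the jar and sent back on later calls through the same opener, which is roughly what requests.Session does for you automatically.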

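Method 3: Handling dynamic pages (getting the full source with Selenium)

Many modern sites (e-commerce, social platforms) render content dynamically with JavaScript, so you need to drive a real browser:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')  # headless mode (no browser window)
options.add_argument('--disable-gpu')
options.add_argument('--user-agent=Mozilla/5.0')  # Chrome's flag is --user-agent=...
driver = webdriver.Chrome(options=options)  # needs a chromedriver matching your Chrome
try:
    driver.get("https://example.com")
    html_source = driver.page_source  # source of the page as currently rendered
finally:
    driver.quit()  # always close the browser, even if loading fails
print(html_source[:500])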

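Notes:

You need a chromedriver that matches your browser version, with its path configured (Selenium 4.6 and later can fetch a matching driver automatically through Selenium Manager).
Use --headless to run on a server with no display.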

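Common problems and error handling (Q&A)

Q1: What if the source I get is garbled?
A: Check the page's charset declaration, and let requests detect the encoding from the content via response.apparent_encoding:
response.encoding = response.apparent_encoding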
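
If you are working with raw bytes (for example from urllib), checking the charset declaration by hand might look like the sketch below. The function names guess_encoding and decode_body are made up for this illustration, and the regular expressions are a simplification that covers the common header and meta tag forms only.

```python
import re

# charset=... inside a Content-Type header value
_HEADER_CHARSET = re.compile(r'charset=["\']?\s*([A-Za-z0-9_\-]+)', re.IGNORECASE)
# charset=... inside a <meta> tag (covers both <meta charset> and http-equiv forms)
_META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?\s*([A-Za-z0-9_\-]+)', re.IGNORECASE)


def guess_encoding(content_type, body, default="utf-8"):
    """Pick an encoding: HTTP header first, then the HTML meta tag, then a default."""
    if content_type:
        match = _HEADER_CHARSET.search(content_type)
        if match:
            return match.group(1).lower()
    match = _META_CHARSET.search(body[:2048])  # the declaration sits near the top
    if match:
        return match.group(1).decode("ascii").lower()
    return default


def decode_body(content_type, body):
    """Decode bytes with the guessed encoding, degrading gracefully on failure."""
    encoding = guess_encoding(content_type, body)
    try:
        return body.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong declaration: keep going with replacement characters
        return body.decode("utf-8", errors="replace")
```

With requests, response.apparent_encoding usually makes this unnecessary; the sketch only shows what "checking the charset declaration" involves.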

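Q2: How do I deal with 403 Forbidden?
A: Send a more complete set of request headers, including Referer, and pass cookies where the site requires them.

cookies = {"session": "your_session_value"}
response = requests.get(url, headers=headers, cookies=cookies, timeout=10)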
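
As a rough illustration of a "more complete" header set, the sketch below derives a Referer from the target URL's own origin. The function name browser_like_headers is hypothetical, and which headers a given site actually checks is an assumption you would need to verify in your browser's developer tools.

```python
from urllib.parse import urlsplit


def browser_like_headers(url, user_agent=None):
    """Return a header dict resembling what a browser sends (hypothetical helper)."""
    parts = urlsplit(url)
    origin = f"{parts.scheme}://{parts.netloc}/"
    return {
        "User-Agent": user_agent
        or "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        # Assumption: the site accepts its own home page as the referring page
        "Referer": origin,
    }
```

The result can be passed straight to requests.get(url, headers=browser_like_headers(url), cookies=cookies).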

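Q3: Why do I get a blank page source when requesting a dynamic site with requests?
A: The site loads its data through Ajax or JavaScript. Switch to Selenium, or inspect the site's API endpoints and request the JSON data directly.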
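
When the data does come from a JSON endpoint, parsing it is usually simpler than parsing HTML. The sketch below assumes a made-up payload shape (a "data" object holding an "items" list); a real site's structure will differ, so treat SAMPLE and extract_items as placeholders.

```python
import json

# Hypothetical payload, standing in for the text of an API response
SAMPLE = '{"data": {"items": [{"title": "Keyboard", "price": 49.9}, {"title": "Mouse"}]}}'


def extract_items(payload_text):
    """Return (title, price) pairs from a payload shaped like SAMPLE."""
    payload = json.loads(payload_text)
    items = payload.get("data", {}).get("items", [])
    # price is optional in this made-up schema, so missing values become None
    return [(item["title"], item.get("price")) for item in items]


print(extract_items(SAMPLE))  # [('Keyboard', 49.9), ('Mouse', None)]
```

With requests, the payload text would be response.text, or you can call response.json() to get the parsed object directly.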

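Q4: What if my IP gets banned for requesting too often?
A: Rotate proxy IPs and add a random delay between requests:

import random
import time

time.sleep(random.uniform(1, 3))  # pause 1 to 3 seconds before the next request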
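
Proxy rotation can be sketched in the same spirit. The proxy addresses below are placeholders rather than working proxies, and proxy_cycle and polite_pause are names invented for this example; the dictionaries it yields follow the {"http": ..., "https": ...} shape that the proxies argument of requests.get expects.

```python
import itertools
import random
import time

# Placeholder addresses; substitute proxies you are actually allowed to use
PROXY_URLS = ["http://127.0.0.1:8001", "http://127.0.0.1:8002"]


def proxy_cycle(proxy_urls):
    """Yield proxy settings in round-robin order, forever."""
    for proxy in itertools.cycle(proxy_urls):
        yield {"http": proxy, "https": proxy}


def polite_pause(low=1.0, high=3.0, sleep=time.sleep):
    """Sleep for a random interval and return how long the pause was."""
    delay = random.uniform(low, high)
    sleep(delay)
    return delay


rotation = proxy_cycle(PROXY_URLS)
# Per request: requests.get(url, headers=headers, proxies=next(rotation), timeout=10)
# followed by polite_pause() before the next one.
```

Passing sleep as a parameter is only there to keep the helper easy to test without actually waiting.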

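Q5: How do I extract specific information from the HTML source?
A: Use BeautifulSoup:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_source, 'lxml')
# Example selector: every <h2 class="title">; adjust it to the page's markup
titles = soup.find_all('h2', class_='title')
for t in titles:
    print(t.get_text(strip=True))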

Hands-on example: scraping news headlines and saving them

Goal: scrape all the headlines from a news site's front page and save them to a local file.

import requests
from bs4 import BeautifulSoup

url = "https://news.ycombinator.com"  # using Hacker News as the example
headers = {"User-Agent": "Mozilla/5.0"}
resp = requests.get(url, headers=headers, timeout=10)
resp.encoding = 'utf-8'
if resp.status_code == 200:
    soup = BeautifulSoup(resp.text, 'lxml')
    # Find all headline elements (adjust the selector to the page structure)
    titles = soup.select('span.titleline > a')  # CSS selector
    with open('headlines.txt', 'w', encoding='utf-8') as f:
        for i, link in enumerate(titles, 1):
            title = link.get_text(strip=True)
            f.write(f"{i}. {title}\n")
            print(f"{i}. {title}")
    print("Saved!")
else:
    print("Request failed")

Extension: wrap the code in a function that takes a URL, and you have a general-purpose page source fetcher.
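
One possible shape for such a function is sketched below. It is not the only way to structure it: here the fetching and extracting steps are passed in as callables (fetch and extract, both hypothetical parameters), so the same function works for different sites and can be tried without network access.

```python
def scrape_titles(url, fetch, extract, out_path):
    """Fetch a page, extract its titles, save them numbered, and return them.

    fetch:   callable taking a URL and returning the page source as a string
    extract: callable taking the page source and returning an iterable of titles
    """
    html = fetch(url)
    # Drop surrounding whitespace and skip empty entries
    titles = [t.strip() for t in extract(html) if t.strip()]
    with open(out_path, "w", encoding="utf-8") as f:
        for i, title in enumerate(titles, 1):
            f.write(f"{i}. {title}\n")
    return titles
```

For the Hacker News example, fetch could wrap requests.get(url, headers=headers, timeout=10).text, and extract could return the text of every element matched by soup.select('span.titleline > a').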

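Summary and SEO recommendations

Fetching page source is at the core of data collection with Python. Start with the requests + BeautifulSoup combination, and switch to Selenium for dynamic content. The three methods in this article cover roughly 90% of everyday scenarios.

SEO recommendations:

Use the phrase "Python example: fetching web page source" naturally near the start of the article.
Use H2/H3 headings, which suit how search engines crawl pages.
Provide practical code blocks and a Q&A section to increase the time readers spend on the page.
Bookmark this article; follow-ups may extend to advanced topics such as "parsing JSON APIs" and "handling login state".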

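Disclaimer: All examples in this article are for learning purposes only. When scraping data, comply with the site's robots.txt and with applicable laws and regulations.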