Python Examples: How to Traverse the Files in a Folder

wen | Python examples | 2026-06-11

How to Traverse Folder Files in Python: Efficient File Search and Batch Processing in One Article

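Contents

Why traverse the files in a folder? – File scanning as a practical necessity, seen through real work scenarios
The three most common traversal methods – How os.walk, os.listdir, and glob differ and when to choose each
Batch-renaming image files – Smart renaming with regular expressions
Sorting files by extension – Automatically creating subfolders and moving files
Counting lines of code in a project – Recursively walking .py / .txt files and totaling their lines
Common problems and pitfalls – Path errors, stalls on large files, hidden files
Frequently asked community questions – Selected Q&A based on Stack Overflow and CSDN
Summary and best practices – Productivity advice for turning scripts into practical tools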

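Why traverse the files in a folder?

When organizing data, analyzing logs, automating backups, or refactoring code, opening every folder by hand and processing files one at a time is both impractical and error-prone.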


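For example:
You have 5,000 photos spread across 30 subfolders and need to pick out every .jpg;
You need to count the total number of lines in a project's .py and .txt files;
You need to delete every .tmp temporary file.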

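In short: traversing a folder is the first step in automating file handling, and Python's concise syntax and capable standard library make it a natural choice for the job.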

The three most common traversal methods

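Method	Characteristics	When to use
`os.walk()`	Recursively visits every subfolder and yields the current directory path, its subfolder names, and its file names	Deep folder trees, batch processing
`os.listdir()`	Lists only the entries directly inside one folder (files and subfolders alike); no recursion	Lookups in a flat directory
`glob.glob()`	Matches wildcard patterns such as `*.py`; non-recursive by default, recursive with `**` and `recursive=True`	Quick matching by file-name pattern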

Code snippets side by side:

import glob
import os
# os.walk: recursive
for root, dirs, files in os.walk('/path/to/folder'):
    for file in files:
        print(os.path.join(root, file))
# glob: recursive (Python 3.5+)
for file in glob.glob('/path/to/folder/**/*.py', recursive=True):
    print(file)
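
The snippets above cover os.walk and glob but not os.listdir. As a minimal sketch of the non-recursive case, the helper below (the name list_top_level_files is made up for illustration) returns only the files sitting directly inside one folder. os.listdir returns bare names for files and subfolders alike, so each name has to be joined with the folder and checked with os.path.isfile.

```python
import os

def list_top_level_files(folder):
    """Return the sorted names of files directly inside folder (no recursion)."""
    names = []
    for name in os.listdir(folder):
        # os.listdir gives bare names, so rebuild the full path before testing it
        if os.path.isfile(os.path.join(folder, name)):
            names.append(name)
    return sorted(names)
```

Files inside subfolders are deliberately left out here; that is the point at which os.walk or a recursive glob pattern becomes the better fit.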

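In real-world work, os.walk handles most folder-traversal tasks, because each step hands you the current directory path together with its subfolder and file lists.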

Case 1: Batch-renaming image files

Pain point: images exported from a digital camera are all named like IMG_0001.JPG and need to become 2024_01_15_001.jpg.

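import os
import re
def rename_images(root_dir, prefix="2024_01_15"):
    count = 0
    for root, dirs, files in os.walk(root_dir):
        for file in files:
            if not file.lower().endswith(('.jpg', '.jpeg', '.png')):
                continue
            # Use the first run of digits in the name as the sequence number (example rule)
            match = re.search(r'(\d+)', file)
            if match is None:
                print(f"Skipped (no number in name): {file}")
                continue
            num = int(match.group(1))  # the :03d format needs an int, not a str
            ext = os.path.splitext(file)[1].lower()  # keep the real extension
            new_name = f"{prefix}_{num:03d}{ext}"
            old_path = os.path.join(root, file)
            new_path = os.path.join(root, new_name)
            if os.path.exists(new_path):
                print(f"Skipped (target already exists): {file} -> {new_name}")
                continue
            os.rename(old_path, new_path)
            count += 1
            print(f"Renamed: {file} -> {new_name}")
    print(f"Processed {count} file(s) in total")
rename_images(r'C:\Users\YourName\Pictures\2024')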

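Key points:

Use os.path.join to avoid cross-platform path problems (Windows vs. Linux);
Use re.search to extract the digits from the original file name, convert them to an integer, and then zero-pad them when formatting the new name.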
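
Because a rename cannot easily be taken back, it can help to preview the result first. The sketch below is one possible dry run, not part of the original script: plan_renames and its prefix parameter are hypothetical names, and it only builds the list of (old path, new path) pairs without touching the disk.

```python
import os
import re

def plan_renames(root_dir, prefix="2024_01_15"):
    """Return (old_path, new_path) pairs without renaming anything."""
    plan = []
    for root, dirs, files in os.walk(root_dir):
        for file in sorted(files):
            stem, ext = os.path.splitext(file)
            if ext.lower() not in ('.jpg', '.jpeg', '.png'):
                continue
            match = re.search(r'\d+', stem)
            if match is None:
                continue  # no sequence number to reuse
            new_name = f"{prefix}_{int(match.group()):03d}{ext.lower()}"
            plan.append((os.path.join(root, file), os.path.join(root, new_name)))
    return plan
```

Printing the pairs and checking them by eye before calling os.rename on each one is a cheap safeguard.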

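Case 2: Sorting files by extension

Scenario: a cluttered downloads folder holds .pdf, .docx, .xlsx, images, and more, and everything should be sorted automatically into subfolders such as Images/, Documents/, Spreadsheets/, and Archives/.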

import os
import shutil
def organize_files(target_folder):
    # Classification rules: category folder name -> extensions
    file_types = {
        'Images': ['.jpg', '.jpeg', '.png', '.gif'],
        'Documents': ['.doc', '.docx', '.pdf', '.txt'],
        'Spreadsheets': ['.xls', '.xlsx', '.csv'],
        'Archives': ['.zip', '.rar', '.7z'],
    }
    # Create the category folders first
    for category in file_types:
        os.makedirs(os.path.join(target_folder, category), exist_ok=True)
    for root, dirs, files in os.walk(target_folder):
        if root == target_folder:
            # Do not walk into the category folders, or already-sorted files would be moved again
            dirs[:] = [d for d in dirs if d not in file_types]
        for file in files:
            base, orig_ext = os.path.splitext(file)
            ext = orig_ext.lower()
            moved = False
            for category, extensions in file_types.items():
                if ext in extensions:
                    src = os.path.join(root, file)
                    dst = os.path.join(target_folder, category, file)
                    # Avoid name clashes: try _copy, _copy2, ... until the name is free
                    n = 1
                    while os.path.exists(dst):
                        suffix = '_copy' if n == 1 else f'_copy{n}'
                        dst = os.path.join(target_folder, category, f"{base}{suffix}{orig_ext}")
                        n += 1
                    shutil.move(src, dst)
                    moved = True
                    print(f"Moved: {file} -> {category}/")
                    break
            if not moved:
                print(f"Unclassified: {file}")
organize_files(r'D:\Downloads')

Note: shutil.move removes the file from its original location; use shutil.copy2 instead if you need to keep a copy.

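Case 3: Counting lines of code in a project

Requirement: count the effective lines of code (ignoring blank lines and comments) across all .py files in a Python project.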

import os
def count_lines_in_project(project_path):
    total_lines = 0
    total_files = 0
    for root, dirs, files in os.walk(project_path):
        # Skip the virtual environment and Git metadata directories
        if 'venv' in dirs:
            dirs.remove('venv')
        if '.git' in dirs:
            dirs.remove('.git')
        for file in files:
            if file.endswith('.py'):
                file_path = os.path.join(root, file)
                try:
                    with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
                        lines = f.readlines()
                    # Drop blank lines and single-line comments
                    # (docstrings and multi-line strings still count as code)
                    code_lines = [line for line in lines
                                  if line.strip() and not line.strip().startswith('#')]
                    total_lines += len(code_lines)
                    total_files += 1
                    print(f"{file}: {len(code_lines)} lines of code")
                except OSError as e:
                    print(f"Error reading {file}: {e}")
    print(f"\nTotal files: {total_files}, total lines of code: {total_lines}")
count_lines_in_project(r'D:\MyProject')

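Optimization: removing entries from dirs in place (here with dirs.remove()) stops os.walk from descending into those folders, which can speed up the scan considerably when they are large, as venv and .git often are.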
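
The introduction also mentions totaling .py and .txt files together. One way to extend the idea, sketched under assumptions: the function name count_lines_by_extension, its parameters, and the choice to count every non-blank line (comments included) are illustrative and not taken from the script above.

```python
import os

def count_lines_by_extension(project_path, extensions=('.py', '.txt'),
                             skip_dirs=('venv', '.git')):
    """Return {extension: non-blank line count} for files under project_path."""
    totals = {ext: 0 for ext in extensions}
    for root, dirs, files in os.walk(project_path):
        # Prune in place so os.walk never descends into the skipped folders
        dirs[:] = [d for d in dirs if d not in skip_dirs]
        for file in files:
            ext = os.path.splitext(file)[1].lower()
            if ext not in totals:
                continue
            path = os.path.join(root, file)
            with open(path, 'r', encoding='utf-8', errors='ignore') as f:
                # Iterate lazily instead of loading the whole file
                totals[ext] += sum(1 for line in f if line.strip())
    return totals
```

Returning a dictionary rather than printing keeps the function easy to reuse, for example to feed a report or a test.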

Common problems and pitfalls

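❌ Problem 1: Paths containing Chinese characters or spaces cause errors

Fix: Python 3 handles non-ASCII characters and spaces in paths without special treatment; the usual culprit is the backslash in a Windows path being read as an escape sequence (a plain 'C:\Users\...' literal fails on \U). Always write Windows paths as raw strings (for example r'C:\Users\你的名字') or use forward slashes; os.path.normpath can then tidy up the separators.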
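
To make the point concrete, here is a small sketch. The Windows location is the hypothetical one from the fix above with an invented "My Documents" folder added, and list_files is an illustrative helper; it shows three equivalent spellings of one path and a listing function that needs no special handling for spaces or Chinese characters.

```python
from pathlib import Path

def list_files(folder):
    """Return the sorted file names directly inside folder."""
    return sorted(p.name for p in Path(folder).iterdir() if p.is_file())

# Three spellings of the same hypothetical Windows path:
raw = r'C:\Users\你的名字\My Documents'         # raw string: backslashes stay literal
escaped = 'C:\\Users\\你的名字\\My Documents'   # doubled backslashes
forward = 'C:/Users/你的名字/My Documents'      # forward slashes are accepted on Windows too
assert raw == escaped
```

Spaces only need extra care when a path is pasted into a shell command, where it has to be quoted; inside Python the string is passed through unchanged.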

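❌ Problem 2: Recursion gets stuck in a symbolic-link loop

Fix: os.walk does not follow symbolic links to directories by default. Pass followlinks=True if you need it to, but do so with care, because a link that points back to a parent directory leads to endless recursion.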
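
If links do have to be followed, one common guard is to remember the real path of every directory already visited. The generator below is a sketch of that idea; walk_following_links is a made-up name, and the check relies on os.path.realpath resolving links to a canonical path.

```python
import os

def walk_following_links(top):
    """Yield (root, files) while following symlinks, visiting each real directory once."""
    seen = set()
    for root, dirs, files in os.walk(top, followlinks=True):
        real = os.path.realpath(root)
        if real in seen:
            dirs[:] = []   # reached again through a link: do not descend further
            continue
        seen.add(real)
        yield root, files
```

A directory reachable through two different links is reported only under the first path that os.walk happens to reach it by.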

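❌ Problem 3: Large files make the program hang

Fix: iterate over the file object, which yields one line at a time much like a generator, instead of reading the whole file at once: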

def process_large_file(filepath):
    with open(filepath, 'r') as f:
        for line in f:  # one line at a time, so memory use stays low
            # process each line here
            pass
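
As a concrete instance of the same pattern, the sketch below counts the lines that contain a keyword. The function name count_matching_lines and the UTF-8 default are assumptions made for the example.

```python
def count_matching_lines(filepath, keyword, encoding='utf-8'):
    """Count the lines containing keyword, reading one line at a time."""
    count = 0
    with open(filepath, 'r', encoding=encoding, errors='replace') as f:
        for line in f:          # the file object yields lines lazily
            if keyword in line:
                count += 1
    return count
```

Only one line is held in memory at any moment, so the file size affects running time but not memory use.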

❌ Problem 4: Hidden or system files get in the way

Fix: add a filter inside the for file in files loop:

if file.startswith('.') or file.startswith('~$'):
    continue
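
Placed in a full loop, the filter might look like the sketch below. Two assumptions to note: iter_visible_files is a hypothetical name, and pruning dot-prefixed folders is an addition of this sketch. The dot prefix is the Unix convention for hidden entries, and ~$ marks Microsoft Office lock files; files hidden through the Windows "hidden" attribute are not detected this way.

```python
import os

def iter_visible_files(root_dir):
    """Yield paths of files that are neither dot-files nor Office lock files."""
    for root, dirs, files in os.walk(root_dir):
        # Extra assumption of this sketch: also skip hidden folders such as .git
        dirs[:] = [d for d in dirs if not d.startswith('.')]
        for file in files:
            if file.startswith('.') or file.startswith('~$'):
                continue
            yield os.path.join(root, file)
```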

Frequently asked community questions

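Q1: Which is faster, os.walk or os.scandir?
A: They are not really rivals. os.scandir (added in Python 3.5) returns DirEntry objects that carry the file-type information the operating system already supplies with the directory listing, so is_file() and is_dir() usually need no extra system call. Since Python 3.5, os.walk has itself been built on os.scandir and already benefits from that speedup. Call os.scandir directly when you need only one directory level, for instance a single folder holding thousands of files, or when you want the cached DirEntry information. Example: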

import os
for entry in os.scandir('/path'):
    if entry.is_file():
        print(entry.name)
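
os.scandir itself looks at one level only. If a recursive scan with DirEntry objects is wanted, a small generator is enough; the sketch below uses the hypothetical name scan_tree and the with form of os.scandir, which needs Python 3.6 or later.

```python
import os

def scan_tree(path):
    """Recursively yield DirEntry objects for every regular file under path."""
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                yield from scan_tree(entry.path)   # descend into real subfolders only
            elif entry.is_file(follow_symlinks=False):
                yield entry
```

Each yielded entry exposes name, path, and a cached stat(), so callers can filter by size or modification time without building paths themselves.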

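Q2: How do I exclude specific folders or file types during traversal?
A: Modifying the dirs list in place (as in Case 3) is the most elegant approach; alternatively, add a condition when you process each entry: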

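import os
exclude_dirs = {'node_modules', '__pycache__'}
for root, dirs, files in os.walk('/path/to/folder'):
    # Slice assignment edits the list os.walk keeps using, so these folders are never entered
    dirs[:] = [d for d in dirs if d not in exclude_dirs]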
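
Both kinds of exclusion can be combined in one generator. In the sketch below, walk_filtered and its default exclusion lists are illustrative choices. Note that the slice assignment matters: writing dirs = [...] would bind a new list that os.walk never sees, and the excluded folders would still be visited.

```python
import os

def walk_filtered(top, exclude_dirs=('node_modules', '__pycache__'),
                  exclude_exts=('.pyc', '.tmp')):
    """Yield file paths under top, skipping excluded folders and extensions."""
    for root, dirs, files in os.walk(top):
        # Edit the existing list object so os.walk skips these folders
        dirs[:] = [d for d in dirs if d not in exclude_dirs]
        for file in files:
            if os.path.splitext(file)[1].lower() in exclude_exts:
                continue
            yield os.path.join(root, file)
```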

Q3: How can I log batch operations so I can recover after an error?
A: Write each operation to a CSV log file:

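import csv
# Placeholder values; in a real script these come from the rename loop
old_path, new_path, success = 'IMG_0001.JPG', '2024_01_15_001.jpg', True
with open('rename_log.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['old_path', 'new_path', 'status'])
    # Write one row after each operation
    writer.writerow([old_path, new_path, success])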
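
A log is only useful for recovery if something can read it back. The sketch below pairs a logging rename with an undo step. The function names, the column names, and the 'ok' / 'failed' status strings are all assumptions of this example, not an established format.

```python
import csv
import os

def rename_with_log(pairs, log_path='rename_log.csv'):
    """Rename each (old_path, new_path) pair and record the outcome in a CSV log."""
    with open(log_path, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['old_path', 'new_path', 'status'])
        for old_path, new_path in pairs:
            try:
                os.rename(old_path, new_path)
                status = 'ok'
            except OSError as e:
                status = f'failed: {e}'
            writer.writerow([old_path, new_path, status])

def undo_from_log(log_path='rename_log.csv'):
    """Reverse every rename logged as 'ok', newest first; return how many were undone."""
    with open(log_path, newline='', encoding='utf-8') as csvfile:
        rows = list(csv.DictReader(csvfile))
    undone = 0
    for row in reversed(rows):
        if row['status'] == 'ok' and os.path.exists(row['new_path']):
            os.rename(row['new_path'], row['old_path'])
            undone += 1
    return undone
```

Undoing in reverse order matters when one rename freed a name that a later rename then took.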

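Q4: What can I do when traversing a network share is extremely slow?
A: ① Avoid extra per-file calls such as os.stat or os.path.getsize, each of which costs a network round trip; the DirEntry objects from os.scandir cache the file type (and, on Windows, most stat fields). ② Traverse only the levels you need instead of recursing through everything. ③ The rglob method of pathlib (Python 3.4+) will not make the share any faster, but it keeps the code concise: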

from pathlib import Path
for file in Path('/network/folder').rglob('*.pdf'):
    print(file)
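
For point ②, os.walk has no depth parameter, but the depth can be derived from the path and the dirs list emptied once the limit is reached. The following is a sketch under that approach; walk_max_depth and max_depth are hypothetical names.

```python
import os

def walk_max_depth(top, max_depth=1):
    """Like os.walk, but descend at most max_depth levels below top."""
    for root, dirs, files in os.walk(top):
        rel = os.path.relpath(root, top)
        depth = 0 if rel == os.curdir else rel.count(os.sep) + 1
        if depth >= max_depth:
            dirs[:] = []   # keep this level's files but go no deeper
        yield root, dirs, files
```

With max_depth=0 only the top folder is listed; with max_depth=1 its immediate subfolders are listed as well, and so on.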

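Summary and best practices

In Python you can traverse a folder with os.walk, os.listdir, glob, or pathlib; which one to pick depends on the task:

Deep recursion with full paths → os.walk (os.scandir for a single level)
Simple file-name matching → glob.glob
Modern, object-oriented style → pathlib.Path.rglob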

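Best practices:

Always build paths with os.path.join to stay cross-platform;
Use dirs[:] = [...] to prune the traversal tree on the fly and save time;
Before a batch operation, preview the changes and do a trial run in a test folder to avoid accidental deletions;
Wrap the traversal logic in a function that takes the path as a parameter so it can be reused;
Stick to streaming when handling large files to keep memory use from ballooning.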