How do you write a code-statistics script?

wen Practical Scripts 2026-06-05 80

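Writing a Code-Statistics Script Step by Step: From Basics to Practice

Contents

Why write a code-statistics script?
Core features of a code-statistics script
Three mainstream implementations compared (Bash / Python / Node.js)
Hands-on: a production-grade code counter in Python
Common problems and optimization tips
Summary and extensions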

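Why write a code-statistics script?

In project delivery, tech talks, and team management, developers often need a quick read on a project's size: total lines, file count, comment ratio, language distribution, and so on. Counting by hand is slow and error-prone, so an automated code-statistics script becomes a practical necessity.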

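Q&A
Q: In what scenarios is a line-counting script typically used?
A: Common scenarios include:

Progress reports during development (quantifying work done)
Screening open-source projects (assessing complexity)
Pre-checks before code review (spotting redundant files)
Supporting data for a personal tech blog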

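Core features of a code-statistics script

A solid code-statistics script should be able to:

Traverse directories recursively: scan every subfolder automatically
Filter by file extension: count only the specified suffixes, such as .py, .js, .ts, .java
Separate real code from comments: handle single-line and multi-line comments and exclude blank lines
Report several metrics: total lines, code lines, comment lines, blank lines, file count
Format the output: a console table, or export to JSON/CSV

Advanced features (optional):

Exclude directories such as .git, node_modules, dist
Group statistics by language
Generate charts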
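
The JSON/CSV export mentioned in the feature list can be sketched as follows. This is a minimal sketch, not part of the script built later in this article: the function names stats_to_json and stats_to_csv are hypothetical, and the input is assumed to be a mapping from extension to a dict with the keys files, code, comment, and blank.

```python
import csv
import io
import json


def stats_to_json(stats):
    """Serialize a {ext: {'files', 'code', 'comment', 'blank'}} mapping to JSON text."""
    return json.dumps(stats, indent=2, sort_keys=True)


def stats_to_csv(stats):
    """Render the same mapping as CSV text, one row per extension."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["ext", "files", "code", "comment", "blank", "total"])
    for ext, data in sorted(stats.items()):
        total = data["code"] + data["comment"] + data["blank"]
        writer.writerow([ext, data["files"], data["code"],
                         data["comment"], data["blank"], total])
    return buffer.getvalue()


if __name__ == "__main__":
    # Made-up numbers, for illustration only
    sample = {".py": {"files": 2, "code": 120, "comment": 30, "blank": 25}}
    print(stats_to_json(sample))
    print(stats_to_csv(sample), end="")
```

Building the CSV in an io.StringIO buffer keeps the function free of file handling, so the caller decides whether the text goes to a file or to standard output.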

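Three mainstream implementations compared

Language	Strengths	Weaknesses	Best suited for
Bash	Minimal, a one-line command	Limited features, no support for complex rules	Quick estimates, Linux environments
Python	Rich ecosystem, easy to extend	Requires an installed interpreter	Cross-platform use, custom requirements
Node.js	Familiar to front-end developers, many npm packages	Memory pressure on large files if they are read whole	Web project developers

Recommendation: for most developers, Python offers a good balance between ease of use and performance.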

Hands-on: a production-grade code counter in Python

Step 1: Design the function structure

import os
from collections import defaultdict


def count_lines_in_file(file_path):
    """Count code, comment, and blank lines in a single file."""
    code_lines = 0
    comment_lines = 0
    blank_lines = 0
    block_end = None  # closing delimiter of the block comment we are inside, if any
    with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
        for line in f:
            stripped = line.strip()
            # Blank line
            if not stripped:
                blank_lines += 1
                continue
            # Inside a block comment: only the matching closer ends it
            if block_end is not None:
                comment_lines += 1
                if block_end in stripped:
                    block_end = None
                continue
            # Single-line comment
            # (note: '#' also matches C/C++ preprocessor lines such as #include)
            if stripped.startswith('#') or stripped.startswith('//'):
                comment_lines += 1
                continue
            # Block comment start
            for start, end in (('/*', '*/'), ('"""', '"""'), ("'''", "'''")):
                if stripped.startswith(start):
                    comment_lines += 1
                    # Stay in block mode unless the comment closes on this line
                    if end not in stripped[len(start):]:
                        block_end = end
                    break
            else:
                code_lines += 1
    return code_lines, comment_lines, blank_lines

Step 2: Walk the directory and aggregate the results

def scan_directory(root_dir, extensions=None, exclude_dirs=None):
    """Scan a directory, filter by extension, and skip the excluded folders."""
    if extensions is None:
        extensions = {'.py', '.js', '.ts', '.java', '.c', '.cpp', '.cs', '.go'}
    if exclude_dirs is None:
        exclude_dirs = {'.git', 'node_modules', 'dist', 'build', '__pycache__', 'venv'}
    stats = defaultdict(lambda: {'files': 0, 'code': 0, 'comment': 0, 'blank': 0})
    for root, dirs, files in os.walk(root_dir):
        # Prune excluded directories in place so os.walk does not descend into them
        dirs[:] = [d for d in dirs if d not in exclude_dirs]
        for file in files:
            ext = os.path.splitext(file)[1].lower()
            if ext not in extensions:
                continue
            file_path = os.path.join(root, file)
            try:
                code, comment, blank = count_lines_in_file(file_path)
            except OSError as e:
                # Unreadable file (permissions, broken symlink, ...): report and move on
                print(f"Skipping file {file_path}: {e}")
                continue
            stats[ext]['files'] += 1
            stats[ext]['code'] += code
            stats[ext]['comment'] += comment
            stats[ext]['blank'] += blank
    return stats

Step 3: Print the results neatly

def print_stats(stats):
    """Print the statistics as a formatted table."""
    print(f"\n{'Language':<12} {'Files':<8} {'Code':<10} {'Comment':<10} {'Blank':<10} {'Total':<10}")
    print("-" * 60)
    total_files = total_code = total_comment = total_blank = 0
    for ext, data in sorted(stats.items()):
        total_lines = data['code'] + data['comment'] + data['blank']
        lang_name = ext.lstrip('.').upper() if ext else 'UNKNOWN'
        print(f"{lang_name:<12} {data['files']:<8} {data['code']:<10} {data['comment']:<10} {data['blank']:<10} {total_lines:<10}")
        total_files += data['files']
        total_code += data['code']
        total_comment += data['comment']
        total_blank += data['blank']
    print("-" * 60)
    grand_total = total_code + total_comment + total_blank
    print(f"{'Total':<12} {total_files:<8} {total_code:<10} {total_comment:<10} {total_blank:<10} {grand_total:<10}")


if __name__ == '__main__':
    target_dir = input("Directory to scan (default: current directory): ") or '.'
    stats = scan_directory(target_dir)
    print_stats(stats)

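Common problems and optimization tips

Q&A

Q: How do you handle performance on large projects (say, more than a million lines of code)?
A:

Traverse with os.walk instead of a hand-written recursive function; since Python 3.5 it is built on os.scandir, which cuts down on stat calls
Read files line by line (for line in f) so that no file is loaded into memory all at once
Use multiple threads or processes (concurrent.futures) to speed up file scanning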
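Q: What if the script's line count doesn't match what the IDE shows?
A:

Check the line endings (Windows \r\n vs. Linux \n)
Confirm whether generated files are filtered out (such as .js.map source maps)
Agree on one comment rule (for example, whether an inline comment like code # comment counts as a code line)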
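
One way to make the comment rule explicit is to give inline comments their own category and let the report decide where to count them. The sketch below assumes a '#'-style language only; classify_line and the "mixed" label are hypothetical names, and the '#' test is deliberately naive.

```python
from collections import Counter


def classify_line(line):
    """Classify one line as 'blank', 'comment', 'mixed', or 'code'."""
    # strip() removes '\n' and '\r\n' alike, so line endings do not affect the result
    stripped = line.strip()
    if not stripped:
        return "blank"
    if stripped.startswith("#"):
        return "comment"
    if "#" in stripped:
        # Naive check: a '#' inside a string literal is also reported as 'mixed'
        return "mixed"
    return "code"


def tally(lines):
    """Count how many lines fall into each category."""
    return Counter(classify_line(line) for line in lines)


if __name__ == "__main__":
    sample = ["x = 1  # set x\r\n", "# header\n", "\r\n", "y = 2\n"]
    print(dict(tally(sample)))
```

A report can then add the "mixed" count to the code lines, to the comment lines, or to both, as long as the same choice is applied on every run.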

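Optimization suggestions:

Add a --verbose option that shows per-file statistics
Support --ignore to skip specific file patterns (such as *.min.js)
Integrate cloc (Count Lines of Code) as a fallback engine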
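
The --verbose and --ignore options suggested above might be wired up with the standard argparse and fnmatch modules. This is a sketch, not the article's script: build_parser and is_ignored are hypothetical helper names, and patterns are assumed to match file names only, not full paths.

```python
import argparse
import fnmatch


def build_parser():
    """Build a command-line parser with the suggested options."""
    parser = argparse.ArgumentParser(description="Count lines of code in a directory.")
    parser.add_argument("path", nargs="?", default=".", help="directory to scan")
    parser.add_argument("--verbose", action="store_true",
                        help="print per-file statistics")
    parser.add_argument("--ignore", action="append", default=[], metavar="PATTERN",
                        help="glob pattern of file names to skip; may be repeated")
    return parser


def is_ignored(file_name, patterns):
    """Return True if the file name matches any of the glob patterns."""
    return any(fnmatch.fnmatch(file_name, pattern) for pattern in patterns)


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"path={args.path} verbose={args.verbose} ignore={args.ignore}")
```

A run such as python line_counter.py src --ignore "*.min.js" --verbose would then yield args.ignore == ["*.min.js"], and the directory walk can call is_ignored(file, args.ignore) before counting each file.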

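Summary and extensions

With this article you have a way to write a robust code-statistics script from scratch. The key points:

Clarify requirements: define which lines count as code, comments, and blank lines
Layer the design: per-file counting → directory aggregation → output
Code defensively: handle encoding errors and exclude noisy directories

Extension directions:

Combine with Git history to measure code growth between versions
Generate a pie chart of the language distribution (using matplotlib)
Package it as a command-line tool and publish it to PyPI for team reuse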
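
For the Git-history direction, one possible starting point is git diff --numstat, which prints one tab-separated line per file (added, deleted, path) and reports binary files as "-". The sketch below assumes git is on the PATH; parse_numstat and diff_stats are hypothetical names, and the revisions in the demo are placeholders.

```python
import subprocess


def parse_numstat(text):
    """Sum added and deleted lines from `git diff --numstat` output."""
    added = deleted = 0
    for line in text.splitlines():
        parts = line.split("\t", 2)
        if len(parts) != 3:
            continue  # skip anything that is not an added/deleted/path row
        a, d, _path = parts
        if a == "-" or d == "-":
            continue  # binary files have no line counts
        added += int(a)
        deleted += int(d)
    return added, deleted


def diff_stats(repo_dir, old_rev, new_rev):
    """Run git in repo_dir and return (added, deleted) between two revisions."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "diff", "--numstat", old_rev, new_rev],
        check=True, capture_output=True, text=True,
    )
    return parse_numstat(result.stdout)


if __name__ == "__main__":
    # Placeholder revisions; replace them with real tags or commit hashes
    added, deleted = diff_stats(".", "HEAD~1", "HEAD")
    print(f"+{added} -{deleted}")
```

Keeping the parsing separate from the subprocess call means the arithmetic can be checked against a captured sample without needing a repository.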

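If you would rather use an existing tool, look at the open-source cloc (written in Perl) or scc (written in Go). Still, writing one yourself gives you a deeper understanding of how code is structured.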

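Additional resources

Search GitHub for "code-line-counter" to find many reference implementations
Related technical blogs: 探索者日记, 脚本之家
Test case: download a small or mid-sized open-source project (such as Flask) to validate your script

Open your editor and start writing your first line_counter.py!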