How to Implement Sensitive-Word Filtering in Java


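Contents:

  1. Basic Implementation - String Matching
  2. Efficient Implementation - the DFA Algorithm
  3. Using a Third-Party Library - the Aho-Corasick Algorithm
  4. Complete Usage Example
  5. Performance Comparison and Recommendations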

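This article walks through several common ways to implement sensitive-word filtering in Java, from the simplest to the more advanced: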

Basic Implementation - String Matching

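The simplest approach checks the text against each word in turn. It is suitable for small systems with a short word list: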

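import java.util.ArrayList;
import java.util.List;
public class SimpleSensitiveFilter {
    private final List<String> sensitiveWords;
    public SimpleSensitiveFilter() {
        // Initialize the sensitive-word list
        sensitiveWords = new ArrayList<>();
        sensitiveWords.add("badword1");
        sensitiveWords.add("badword2");
        // The list could also be loaded from a file or a database
    }
    /**
     * Checks whether the text contains any sensitive word.
     */
    public boolean containsSensitiveWord(String text) {
        for (String word : sensitiveWords) {
            if (text.contains(word)) {
                return true;
            }
        }
        return false;
    }
    /**
     * Replaces each sensitive word with asterisks of the same length.
     */
    public String replaceSensitiveWord(String text) {
        for (String word : sensitiveWords) {
            // replace() matches the word literally; replaceAll() would interpret
            // it as a regular expression and misbehave on words such as "a.b".
            // String.repeat() requires Java 11 or later.
            text = text.replace(word, "*".repeat(word.length()));
        }
        return text;
    }
}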
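
The constructor comment mentions loading the word list from a file. A minimal sketch of that step is shown here, assuming a UTF-8 text file with one word per line and "#" for comment lines; the class name WordListLoader and the file format are assumptions for illustration, not part of the original example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: the name and the one-word-per-line format are assumptions.
class WordListLoader {
    /** Trims each line, skips blank lines and '#' comments, and drops duplicates. */
    static Set<String> parse(List<String> lines) {
        Set<String> words = new LinkedHashSet<>();
        for (String line : lines) {
            String word = line.trim();
            if (!word.isEmpty() && !word.startsWith("#")) {
                words.add(word);
            }
        }
        return words;
    }

    /** Reads the whole file as UTF-8 and parses it into a word set. */
    static Set<String> load(Path file) throws IOException {
        return parse(Files.readAllLines(file, StandardCharsets.UTF_8));
    }
}
```

The resulting set can be passed to any of the filters in this article, for example through an initSensitiveWords(Set<String>) method.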

Efficient Implementation - the DFA Algorithm

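This version stores the words in a deterministic finite automaton (DFA) shaped like a character tree. The text is checked against all words in a single walk per start position, so lookup cost no longer grows with the number of words: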

import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
public class DFAFilter {
    /** One automaton state: outgoing transitions plus an end-of-word flag. */
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isEnd;
    }
    private Node root;
    private final Set<String> sensitiveWords;
    public DFAFilter() {
        sensitiveWords = new HashSet<>();
        root = new Node();
    }
    /**
     * Initializes the sensitive-word list.
     */
    public void initSensitiveWords(Set<String> words) {
        sensitiveWords.clear();
        sensitiveWords.addAll(words);
        buildSensitiveWordMap();
    }
    /**
     * Builds the DFA: one path of nodes per word, sharing common prefixes.
     */
    private void buildSensitiveWordMap() {
        root = new Node();
        for (String word : sensitiveWords) {
            if (word.isEmpty()) {
                continue; // an empty word would match everywhere
            }
            Node node = root;
            for (int i = 0; i < word.length(); i++) {
                node = node.children.computeIfAbsent(word.charAt(i), k -> new Node());
            }
            node.isEnd = true; // the last character closes a complete word
        }
    }
    /**
     * Checks whether the text contains any sensitive word.
     */
    public boolean containsSensitiveWord(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (checkSensitiveWord(text, i) > 0) {
                return true;
            }
        }
        return false;
    }
    /**
     * Replaces every character of each matched word with replaceChar.
     */
    public String replaceSensitiveWord(String text, char replaceChar) {
        StringBuilder result = new StringBuilder(text);
        List<int[]> positions = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            int matchCount = checkSensitiveWord(text, i);
            if (matchCount > 0) {
                positions.add(new int[]{i, matchCount});
                i += matchCount - 1; // skip past the matched word
            }
        }
        for (int[] pos : positions) {
            for (int j = pos[0]; j < pos[0] + pos[1]; j++) {
                result.setCharAt(j, replaceChar);
            }
        }
        return result.toString();
    }
    /**
     * Returns the length of the first complete word that starts at beginIndex,
     * or 0 if no sensitive word starts there.
     */
    private int checkSensitiveWord(String text, int beginIndex) {
        Node node = root;
        int matchCount = 0;
        for (int i = beginIndex; i < text.length(); i++) {
            node = node.children.get(text.charAt(i));
            if (node == null) {
                return 0; // no word continues with this character
            }
            matchCount++;
            if (node.isEnd) {
                return matchCount; // stop at the shortest complete word
            }
        }
        return 0; // the text ended in the middle of a word
    }
}
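
The checkSensitiveWord scan above stops at the first complete word it reaches, so when one listed word is a prefix of another (for example "bad" and "badword") only the shorter one is masked. A possible longest-match variant is sketched here as a standalone class; LongestMatchTrie and its method names are hypothetical and chosen only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical standalone sketch; class and method names are illustrative.
class LongestMatchTrie {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean isEnd;
    }

    private final Node root = new Node();

    /** Adds one word to the tree; empty words are ignored. */
    void add(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.next.computeIfAbsent(c, k -> new Node());
        }
        if (node != root) {
            node.isEnd = true;
        }
    }

    /** Length of the longest listed word starting at 'begin', or 0 if none. */
    int longestMatch(String text, int begin) {
        Node node = root;
        int best = 0;
        for (int i = begin; i < text.length(); i++) {
            node = node.next.get(text.charAt(i));
            if (node == null) {
                break;
            }
            if (node.isEnd) {
                best = i - begin + 1; // remember it, but keep looking for a longer word
            }
        }
        return best;
    }

    /** Masks every longest match with maskChar, scanning left to right. */
    String mask(String text, char maskChar) {
        StringBuilder sb = new StringBuilder(text);
        int i = 0;
        while (i < text.length()) {
            int len = longestMatch(text, i);
            if (len == 0) {
                i++;
                continue;
            }
            for (int j = i; j < i + len; j++) {
                sb.setCharAt(j, maskChar);
            }
            i += len;
        }
        return sb.toString();
    }
}
```

Whether shortest or longest matching is preferable depends on the word list; longest matching avoids leaving the tail of a longer word visible.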

Using a Third-Party Library - the Aho-Corasick Algorithm

This version relies on an existing library, which has to be added as a Maven dependency:

<dependency>
    <groupId>org.ahocorasick</groupId>
    <artifactId>ahocorasick</artifactId>
    <version>0.6.3</version>
</dependency>

With the dependency in place, the filter class looks like this:

import org.ahocorasick.trie.Emit;
import org.ahocorasick.trie.Trie;
import org.ahocorasick.trie.Trie.TrieBuilder;
import java.util.Collection;
import java.util.Set;
public class AhoCorasickFilter {
    private Trie trie;
    /**
     * Initializes the sensitive-word list. Must be called before any lookup.
     */
    public void initSensitiveWords(Set<String> words) {
        TrieBuilder builder = Trie.builder();
        for (String word : words) {
            builder.addKeyword(word);
        }
        trie = builder.build();
    }
    /**
     * Finds every sensitive word occurring in the text.
     */
    public Collection<Emit> findSensitiveWords(String text) {
        return trie.parseText(text);
    }
    /**
     * Checks whether the text contains any sensitive word.
     */
    public boolean containsSensitiveWord(String text) {
        Collection<Emit> emits = trie.parseText(text);
        return !emits.isEmpty();
    }
    /**
     * Replaces every character of each matched word with replaceChar.
     */
    public String replaceSensitiveWord(String text, char replaceChar) {
        Collection<Emit> emits = trie.parseText(text);
        StringBuilder sb = new StringBuilder(text);
        for (Emit emit : emits) {
            int start = emit.getStart();
            int end = emit.getEnd() + 1; // getEnd() is inclusive
            for (int i = start; i < end; i++) {
                sb.setCharAt(i, replaceChar);
            }
        }
        return sb.toString();
    }
}
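
To show what the library does internally, here is a minimal self-contained sketch of the Aho-Corasick idea: a character tree plus "failure links" that let the scan continue after a mismatch without moving backwards in the text. This is a teaching sketch under simplifying assumptions, not the org.ahocorasick implementation; MiniAhoCorasick and all member names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical teaching sketch; this is NOT the org.ahocorasick library's code.
class MiniAhoCorasick {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        Node fail; // state for the longest proper suffix that is also a tree prefix
        final List<Integer> wordLengths = new ArrayList<>(); // words ending in this state
    }

    private final Node root = new Node();

    MiniAhoCorasick(Iterable<String> words) {
        // Step 1: build the character tree.
        for (String w : words) {
            if (w.isEmpty()) {
                continue;
            }
            Node node = root;
            for (char c : w.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.wordLengths.add(w.length());
        }
        // Step 2: breadth-first pass that derives each failure link from the parent's.
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.remove();
            for (Map.Entry<Character, Node> e : node.next.entrySet()) {
                Node child = e.getValue();
                Node f = node.fail;
                while (f != root && !f.next.containsKey(e.getKey())) {
                    f = f.fail;
                }
                Node target = f.next.get(e.getKey());
                child.fail = (target != null) ? target : root;
                // Words that end at the failure state also end here (they are suffixes).
                child.wordLengths.addAll(child.fail.wordLengths);
                queue.add(child);
            }
        }
    }

    /** Masks every occurrence of every word, including overlapping ones. */
    String mask(String text, char maskChar) {
        StringBuilder sb = new StringBuilder(text);
        Node state = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            while (state != root && !state.next.containsKey(c)) {
                state = state.fail; // fall back instead of rescanning the text
            }
            Node n = state.next.get(c);
            state = (n != null) ? n : root;
            for (int len : state.wordLengths) {
                for (int j = i - len + 1; j <= i; j++) {
                    sb.setCharAt(j, maskChar);
                }
            }
        }
        return sb.toString();
    }
}
```

Each character of the text is read once, which is why this family of algorithms scales well to long texts and large word lists. For production use the maintained library is the safer choice.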

Complete Usage Example

import java.util.HashSet;
import java.util.Set;
public class SensitiveFilterDemo {
    public static void main(String[] args) {
        // Prepare the sensitive-word list (sample Chinese words)
        Set<String> sensitiveWords = new HashSet<>();
        sensitiveWords.add("赌博"); // gambling
        sensitiveWords.add("色情"); // pornography
        sensitiveWords.add("暴力"); // violence
        sensitiveWords.add("毒品"); // drugs
        // Test text containing three of the four words
        String text = "这里包含赌博和色情的内容,还有暴力内容";
        // Use the DFA filter
        DFAFilter dfaFilter = new DFAFilter();
        dfaFilter.initSensitiveWords(sensitiveWords);
        System.out.println("Original text: " + text);
        System.out.println("Contains sensitive word: " + dfaFilter.containsSensitiveWord(text));
        System.out.println("Filtered: " + dfaFilter.replaceSensitiveWord(text, '*'));
        // Use the Aho-Corasick filter
        AhoCorasickFilter acFilter = new AhoCorasickFilter();
        acFilter.initSensitiveWords(sensitiveWords);
        System.out.println("\n--- AhoCorasick Filter ---");
        System.out.println("Filtered: " + acFilter.replaceSensitiveWord(text, '*'));
    }
}

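Performance Comparison and Recommendations

Method | Pros | Cons | Suitable for
String matching | Simple to implement | Slow, since every word is searched separately | Small systems with few sensitive words
DFA algorithm | Good performance, fast lookups | Building the word structure is somewhat more involved | Medium-sized systems
Aho-Corasick | Best performance, ready-made library | Requires an external dependency | Large systems with high performance requirements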
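
The comparison above gives no measurements. A rough way to observe the difference between per-word matching and a tree-based scan is sketched here; the word count, word length, and text length are arbitrary assumptions, FilterTiming is a hypothetical name, and single-shot timings without JVM warm-up are only indicative (a harness such as JMH would be needed for reliable numbers).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical timing sketch; all sizes below are arbitrary assumptions.
class FilterTiming {
    static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean isEnd;
    }

    /** Per-word matching: one String.contains call per listed word. */
    static boolean containsByLoop(List<String> words, String text) {
        for (String w : words) {
            if (text.contains(w)) {
                return true;
            }
        }
        return false;
    }

    static Node buildTrie(List<String> words) {
        Node root = new Node();
        for (String w : words) {
            Node n = root;
            for (char c : w.toCharArray()) {
                n = n.next.computeIfAbsent(c, k -> new Node());
            }
            n.isEnd = true;
        }
        return root;
    }

    /** Tree-based matching: one walk down the tree per start position. */
    static boolean containsByTrie(Node root, String text) {
        for (int i = 0; i < text.length(); i++) {
            Node n = root;
            for (int j = i; j < text.length(); j++) {
                n = n.next.get(text.charAt(j));
                if (n == null) {
                    break;
                }
                if (n.isEnd) {
                    return true;
                }
            }
        }
        return false;
    }

    static String randomWord(Random rnd, int len) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < len; i++) {
            sb.append((char) ('a' + rnd.nextInt(26)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        List<String> words = new ArrayList<>();
        for (int i = 0; i < 20_000; i++) {
            words.add(randomWord(rnd, 8));
        }
        String text = randomWord(rnd, 2_000);
        Node root = buildTrie(words);

        long t0 = System.nanoTime();
        boolean byLoop = containsByLoop(words, text);
        long t1 = System.nanoTime();
        boolean byTrie = containsByTrie(root, text);
        long t2 = System.nanoTime();
        System.out.printf("loop: %b in %d us, trie: %b in %d us%n",
                byLoop, (t1 - t0) / 1000, byTrie, (t2 - t1) / 1000);
    }
}
```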

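Recommendations

  • Small projects: use basic string matching
  • Medium projects: use the DFA algorithm
  • Large projects: use the Aho-Corasick algorithm or a library that implements it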
