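Here are several common ways to implement sensitive-word filtering in Java, ordered from simple to more sophisticated:
Basic implementation - string matching
The simplest approach, suitable for small systems with a short word list: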
import java.util.ArrayList;
import java.util.List;

public class SimpleSensitiveFilter {
    private final List<String> sensitiveWords = new ArrayList<>();

    public SimpleSensitiveFilter() {
        // Initialize the word list; in practice it can be loaded from a file or a database
        sensitiveWords.add("badword1");
        sensitiveWords.add("badword2");
    }

    /**
     * Returns true if the text contains any sensitive word.
     */
    public boolean containsSensitiveWord(String text) {
        for (String word : sensitiveWords) {
            if (text.contains(word)) {
                return true;
            }
        }
        return false;
    }

    /**
     * Replaces every sensitive word with asterisks of the same length.
     */
    public String replaceSensitiveWord(String text) {
        for (String word : sensitiveWords) {
            // String.replace matches literally; replaceAll would interpret the word as a regex.
            // String.repeat requires Java 11 or later.
            text = text.replace(word, "*".repeat(word.length()));
        }
        return text;
    }
}
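The constructor comment mentions loading the word list from a file or a database. As a rough sketch of the file case, the helper below reads one word per line. The class name `WordListLoader`, the UTF-8 encoding, and the convention that lines starting with `#` are comments are all assumptions made for illustration, not part of the original filter.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.Set;

public class WordListLoader {
    /**
     * Reads one word per line, trimming surrounding whitespace and
     * skipping blank lines and lines that start with '#'.
     * Duplicates are dropped; file order is preserved.
     */
    public static Set<String> load(Path file) throws IOException {
        Set<String> words = new LinkedHashSet<>();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            String word = line.trim();
            if (!word.isEmpty() && !word.startsWith("#")) {
                words.add(word);
            }
        }
        return words;
    }
}
```

The returned set can be passed to any of the filters in this article that accept a `Set<String>`.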
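Efficient implementation - the DFA algorithm
This approach stores the word list in a prefix tree that is walked like a deterministic finite automaton (DFA), so each step of the scan is a single map lookup. It is much faster than testing every word separately: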
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class DFAFilter {
    /** One state of the automaton: outgoing transitions plus an end-of-word flag. */
    private static class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isEnd;
    }

    private Node root = new Node();
    private final Set<String> sensitiveWords = new HashSet<>();

    /**
     * Initializes the word list and rebuilds the automaton.
     */
    public void initSensitiveWords(Set<String> words) {
        sensitiveWords.clear();
        sensitiveWords.addAll(words);
        buildSensitiveWordMap();
    }

    /**
     * Builds the DFA (a prefix tree) from the word list.
     */
    private void buildSensitiveWordMap() {
        root = new Node();
        for (String word : sensitiveWords) {
            if (word == null || word.isEmpty()) {
                continue;
            }
            Node node = root;
            for (int i = 0; i < word.length(); i++) {
                // Reuse the existing child for this character or create a new one
                node = node.children.computeIfAbsent(word.charAt(i), k -> new Node());
            }
            node.isEnd = true; // the last character marks the end of a word
        }
    }

    /**
     * Returns true if the text contains any sensitive word.
     */
    public boolean containsSensitiveWord(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (checkSensitiveWord(text, i) > 0) {
                return true;
            }
        }
        return false;
    }

    /**
     * Replaces every character of each matched word with replaceChar.
     */
    public String replaceSensitiveWord(String text, char replaceChar) {
        StringBuilder result = new StringBuilder(text);
        for (int i = 0; i < text.length(); i++) {
            int matchLength = checkSensitiveWord(text, i);
            if (matchLength > 0) {
                for (int j = i; j < i + matchLength; j++) {
                    result.setCharAt(j, replaceChar);
                }
                i += matchLength - 1; // continue after the masked word
            }
        }
        return result.toString();
    }

    /**
     * Returns the length of the longest sensitive word that starts at beginIndex, or 0 if there is none.
     */
    private int checkSensitiveWord(String text, int beginIndex) {
        Node node = root;
        int matchLength = 0;
        for (int i = beginIndex; i < text.length(); i++) {
            node = node.children.get(text.charAt(i));
            if (node == null) {
                break;
            }
            if (node.isEnd) {
                matchLength = i - beginIndex + 1; // keep scanning to prefer the longest match
            }
        }
        return matchLength;
    }
}
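The filter above only answers yes/no or masks the text. For logging or moderation it is often useful to know which words were hit. The sketch below applies the same prefix-tree walk to collect the matched words; `SensitiveWordFinder` and `findAll` are illustrative names, and the choice of the longest match at each start position is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SensitiveWordFinder {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isEnd;
    }

    private final Node root = new Node();

    public SensitiveWordFinder(Iterable<String> words) {
        for (String word : words) {
            Node node = root;
            for (char c : word.toCharArray()) {
                node = node.children.computeIfAbsent(c, k -> new Node());
            }
            if (node != root) {
                node.isEnd = true; // empty words are ignored
            }
        }
    }

    /** Returns the matched words in order of appearance, taking the longest match at each start position. */
    public List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            Node node = root;
            int end = -1; // exclusive end index of the longest match found so far
            for (int j = i; j < text.length(); j++) {
                node = node.children.get(text.charAt(j));
                if (node == null) {
                    break;
                }
                if (node.isEnd) {
                    end = j + 1;
                }
            }
            if (end > 0) {
                hits.add(text.substring(i, end));
                i = end; // resume after the match
            } else {
                i++;
            }
        }
        return hits;
    }
}
```

With the words `bad`, `badword` and `evil`, calling `findAll("a badword and evil bad")` returns `[badword, evil, bad]`: the longer `badword` wins over its prefix `bad` at the same position.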
Using a third-party library - the Aho-Corasick algorithm
An existing library can do the matching (add the Maven dependency first):
<dependency>
<groupId>org.ahocorasick</groupId>
<artifactId>ahocorasick</artifactId>
<version>0.6.3</version>
</dependency>
import org.ahocorasick.trie.Emit;
import org.ahocorasick.trie.Trie;
import org.ahocorasick.trie.Trie.TrieBuilder;
import java.util.Collection;
import java.util.Set;

public class AhoCorasickFilter {
    private Trie trie;

    /**
     * Initializes the word list. Must be called before any other method.
     */
    public void initSensitiveWords(Set<String> words) {
        TrieBuilder builder = Trie.builder();
        for (String word : words) {
            builder.addKeyword(word);
        }
        trie = builder.build();
    }

    /**
     * Finds all sensitive words in the text.
     */
    public Collection<Emit> findSensitiveWords(String text) {
        return trie.parseText(text);
    }

    /**
     * Returns true if the text contains any sensitive word.
     */
    public boolean containsSensitiveWord(String text) {
        Collection<Emit> emits = trie.parseText(text);
        return !emits.isEmpty();
    }

    /**
     * Replaces every character of each matched word with replaceChar.
     */
    public String replaceSensitiveWord(String text, char replaceChar) {
        Collection<Emit> emits = trie.parseText(text);
        StringBuilder sb = new StringBuilder(text);
        for (Emit emit : emits) {
            int start = emit.getStart();
            int end = emit.getEnd() + 1; // getEnd() is inclusive
            for (int i = start; i < end; i++) {
                sb.setCharAt(i, replaceChar);
            }
        }
        return sb.toString();
    }
}
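To give a rough idea of what an Aho-Corasick library does internally, here is a minimal hand-written sketch of the algorithm. It is not the `org.ahocorasick` implementation; `MiniAhoCorasick`, `mask` and the `longest` field are names invented for this illustration. The key difference from the DFA filter is the failure link: on a mismatch the scan falls back to the longest suffix that is still a prefix of some word, so the text position never moves backwards.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

public class MiniAhoCorasick {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        Node fail;    // state for the longest proper suffix that is also a prefix in the trie
        int longest;  // length of the longest keyword ending at this state (0 if none)
    }

    private final Node root = new Node();

    public MiniAhoCorasick(List<String> keywords) {
        // Step 1: build the trie.
        for (String word : keywords) {
            Node node = root;
            for (char c : word.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.longest = Math.max(node.longest, word.length());
        }
        // Step 2: set failure links breadth-first, so shallower states are finished first.
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node current = queue.remove();
            for (Map.Entry<Character, Node> e : current.next.entrySet()) {
                Node child = e.getValue();
                Node f = current.fail;
                while (f != root && !f.next.containsKey(e.getKey())) {
                    f = f.fail;
                }
                Node target = f.next.get(e.getKey());
                child.fail = (target != null) ? target : root;
                // A keyword ending at the failure state also ends here.
                child.longest = Math.max(child.longest, child.fail.longest);
                queue.add(child);
            }
        }
    }

    /** Masks every keyword occurrence, including overlapping ones, while reading the text once. */
    public String mask(String text, char maskChar) {
        char[] out = text.toCharArray();
        Node state = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            while (state != root && !state.next.containsKey(c)) {
                state = state.fail; // fall back instead of rescanning the text
            }
            Node nextState = state.next.get(c);
            state = (nextState != null) ? nextState : root;
            // Mask the longest keyword ending at position i; shorter ones are its suffixes.
            for (int j = i - state.longest + 1; j <= i; j++) {
                out[j] = maskChar;
            }
        }
        return new String(out);
    }
}
```

For example, with the keywords `ab` and `bc`, `mask("xabcx", '*')` yields `x***x`: the overlapping occurrences are both masked, which a scan that skips past each match would miss.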
Complete usage example
import java.util.HashSet;
import java.util.Set;

public class SensitiveFilterDemo {
    public static void main(String[] args) {
        // Prepare the word list
        Set<String> sensitiveWords = new HashSet<>();
        sensitiveWords.add("赌博"); // gambling
        sensitiveWords.add("色情"); // pornography
        sensitiveWords.add("暴力"); // violence
        sensitiveWords.add("毒品"); // drugs

        // Test text: "This contains gambling and pornography, and also violent content"
        String text = "这里包含赌博和色情的内容,还有暴力内容";

        // Use the DFA filter
        DFAFilter dfaFilter = new DFAFilter();
        dfaFilter.initSensitiveWords(sensitiveWords);
        System.out.println("Original text: " + text);
        System.out.println("Contains sensitive words: " + dfaFilter.containsSensitiveWord(text));
        System.out.println("Filtered: " + dfaFilter.replaceSensitiveWord(text, '*'));

        // Use the Aho-Corasick filter
        AhoCorasickFilter acFilter = new AhoCorasickFilter();
        acFilter.initSensitiveWords(sensitiveWords);
        System.out.println("\n--- AhoCorasick Filter ---");
        System.out.println("Filtered: " + acFilter.replaceSensitiveWord(text, '*'));
    }
}
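Performance comparison and recommendations
| Method | Pros | Cons | Best for |
|---|---|---|---|
| String matching | Simple to implement | Slow: the text is scanned once per word | Small systems with few sensitive words |
| DFA algorithm | Fast lookups, no external dependency | Building the word structure is somewhat more complex | Medium-sized systems |
| Aho-Corasick | Best performance (the text is read once regardless of how many words there are); ready-made library | Adds an external dependency | Large systems with high performance requirements |
Recommendations:
- Small projects: use basic string matching
- Medium projects: use the DFA algorithm
- Large projects: use the Aho-Corasick algorithm or a library that implements it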