Elasticsearch: Built-in Character Filters

Published: 2025-01-03 18:27
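
Preface

Character filters are defined under the char_filter setting. They operate on the character stream before it reaches the tokenizer. There are not many of them; Elasticsearch ships only three:

    HTML strip character filter (HTML Strip Char Filter)
    Mapping character filter (Mapping Char Filter)
    Pattern replace character filter (Pattern Replace Char Filter)

Let's look at how each one is used.

HTML strip character filter

The HTML strip character filter (HTML Strip Char Filter) removes HTML elements from the text and decodes HTML entities such as &apos;.

    POST _analyze
    {
      "tokenizer": "keyword",
      "char_filter": ["html_strip"],
      "text": "<p>I&apos;m so <b>happy</b>!</p>"
    }

The result is:

    {
      "tokens" : [
        {
          "token" : """
    I'm so happy!
    """,
          "start_offset" : 0,
          "end_offset" : 32,
          "type" : "word",
          "position" : 0
        }
      ]
    }

The <p> tags are replaced with line breaks, which is why the token begins and ends with a newline. The offsets still refer to the original 32-character input.

Mapping character filter

The mapping character filter (Mapping Char Filter) takes a map of keys and values. Whenever it encounters a string equal to one of the keys, it replaces that string with the value associated with the key.

    PUT pattern_test4
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_analyzer": {
              "tokenizer": "keyword",
              "char_filter": ["my_char_filter"]
            }
          },
          "char_filter": {
            "my_char_filter": {
              "type": "mapping",
              "mappings": ["苍井空 => 666", "武藤兰 => 888"]
            }
          }
        }
      }
    }

In the example above we define a custom analyzer. Its tokenizer is the keyword tokenizer, and its character filter is a custom one that replaces 苍井空 with 666 and 武藤兰 with 888.

    POST pattern_test4/_analyze
    {
      "analyzer": "my_analyzer",
      "text": "苍井空热爱武藤兰,可惜后来苍井空结婚了"
    }

The result is:

    {
      "tokens" : [
        {
          "token" : "666热爱888,可惜后来666结婚了",
          "start_offset" : 0,
          "end_offset" : 19,
          "type" : "word",
          "position" : 0
        }
      ]
    }

Pattern replace character filter

The pattern replace character filter (Pattern Replace Char Filter) uses a regular expression to match characters in the string and replace them. Be careful with the regular expression you write, though: a poorly written one can seriously slow down analysis.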
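
The lookahead pattern used in the pattern_test5 example that follows can be tried outside Elasticsearch first. Below is a minimal sketch in Python, assuming that Python's re module behaves like the Java regex engine for this simple pattern. The function name replace_hyphens is made up for illustration, and Python writes the group reference as \1 where the Elasticsearch replacement string uses $1.

```python
import re

# One or more digits, then a hyphen that is immediately followed by a digit.
# The lookahead (?=\d) checks for that digit without consuming it, so the
# next match can start from it.
PATTERN = re.compile(r"(\d+)-(?=\d)")


def replace_hyphens(text: str) -> str:
    """Hypothetical helper: keep the captured digits and turn the hyphen into an underscore."""
    return PATTERN.sub(r"\1_", text)


if __name__ == "__main__":
    print(replace_hyphens("My credit card is 123-456-789"))
    # A hyphen that is not between digits is left alone.
    print(replace_hyphens("1-x a-b"))
```

Because the lookahead does not consume the digit after the hyphen, every hyphen in a run such as 123-456-789 is matched in turn.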
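
    PUT pattern_test5
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_analyzer": {
              "tokenizer": "standard",
              "char_filter": ["my_char_filter"]
            }
          },
          "char_filter": {
            "my_char_filter": {
              "type": "pattern_replace",
              "pattern": "(\\d+)-(?=\\d)",
              "replacement": "$1_"
            }
          }
        }
      }
    }

In the example above we define a custom regular expression rule. It matches a run of digits followed by a hyphen, but only when another digit comes right after the hyphen, and replaces the hyphen with an underscore.

    POST pattern_test5/_analyze
    {
      "analyzer": "my_analyzer",
      "text": "My credit card is 123-456-789"
    }

The result is:

    {
      "tokens" : [
        {
          "token" : "My",
          "start_offset" : 0,
          "end_offset" : 2,
          "type" : "<ALPHANUM>",
          "position" : 0
        },
        {
          "token" : "credit",
          "start_offset" : 3,
          "end_offset" : 9,
          "type" : "<ALPHANUM>",
          "position" : 1
        },
        {
          "token" : "card",
          "start_offset" : 10,
          "end_offset" : 14,
          "type" : "<ALPHANUM>",
          "position" : 2
        },
        {
          "token" : "is",
          "start_offset" : 15,
          "end_offset" : 17,
          "type" : "<ALPHANUM>",
          "position" : 3
        },
        {
          "token" : "123_456_789",
          "start_offset" : 18,
          "end_offset" : 29,
          "type" : "<NUM>",
          "position" : 4
        }
      ]
    }

Because the hyphens were turned into underscores before tokenization, the standard tokenizer keeps 123_456_789 as a single token instead of splitting it at each hyphen.

We now have a rough picture of how Elasticsearch analyzes data. You may have noticed, however, that the examples rarely deal with Chinese text. That is because Elasticsearch's built-in analyzers do not handle Chinese very well. So the next topic is a heavyweight plugin: elasticsearch analysis ik (usually just called the IK analyzer).

Corrections are welcome. That's all.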