ES Search in Depth 03: Multifield Search | 我的资源库露水湾

Searches against a single field with a single match query are rare; most real searches span multiple fields.

Multiple Query Strings
GET /_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title":  "War and Peace" }},
        { "match": { "author": "Leo Tolstoy"   }}
      ]
    }
  }
}
The bool query adds up the scores of its match clauses to produce the final _score: more matches is better.

The bool query can wrap any other query type, including other bool queries.

Each clause at the same level has the same weight.

GET /_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title":  "War and Peace" }},
        { "match": { "author": "Leo Tolstoy"   }},
        { "bool": { // both levels use should, but clauses at different levels carry different weight
          "should": [ // the translator clauses share 1/3 of the weight here; moved up one level, alongside title, they would take 1/2 (2 of 4)
            { "match": { "translator": "Constance Garnett" }},
            { "match": { "translator": "Louise Maude"      }}
          ]
        }}
      ]
    }
  }
}
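The weight split described in the comments can be checked with a small sketch. It assumes every sibling clause gets an equal share of its parent's weight, which is a simplification (real scores also depend on term statistics). The function name `clause_weights` and the clause labels are hypothetical.

```python
from fractions import Fraction

def clause_weights(clauses, share=Fraction(1)):
    """Split `share` equally among sibling clauses, recursing into nested groups.

    A string is a label for a leaf match clause; a list is a nested bool/should.
    """
    weights = {}
    each = share / len(clauses)
    for clause in clauses:
        if isinstance(clause, list):
            # A nested bool gets one sibling share, divided among its own clauses.
            for name, weight in clause_weights(clause, each).items():
                weights[name] = weights.get(name, 0) + weight
        else:
            weights[clause] = weights.get(clause, 0) + each
    return weights

nested = ["title", "author", ["translator:Garnett", "translator:Maude"]]
flat = ["title", "author", "translator:Garnett", "translator:Maude"]

nested_w = clause_weights(nested)
flat_w = clause_weights(flat)
print(nested_w)
print(flat_w)
```

In the nested form the two translator clauses together hold one third of the weight; flattened to the same level as title and author, they hold one half.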
Prioritizing Clauses
Clauses can be given a priority, or weight, with the boost parameter, which defaults to 1.

{ "match": {
    "title": {
      "query": "War and Peace",
      "boost": 2
    }}},
…
The “best” value for the boost parameter is most easily determined by trial and error.
A reasonable range for boost lies between 1 and 10, maybe 15.

Boosts higher than that have little more impact because scores are normalized.
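As a minimal sketch of how a boosted clause might be assembled in client code, the helper below builds the long form of a match clause so that a boost can be attached. The helper name `boosted_match` is an assumption for illustration, and the author clause is left at the default boost of 1 only to show the contrast.

```python
import json

def boosted_match(field, text, boost=1):
    """Build a match clause in long form so that a boost can be attached."""
    return {"match": {field: {"query": text, "boost": boost}}}

query = {
    "query": {
        "bool": {
            "should": [
                boosted_match("title", "War and Peace", boost=2),
                boosted_match("author", "Leo Tolstoy"),  # default boost of 1
            ]
        }
    }
}

body = json.dumps(query, indent=2)
print(body)
```

Trying a different boost value then only means changing one argument and rerunning the test queries.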

Single Query String
Users expect to be able to type all of their search terms into a single field, and expect that the application will figure out how to give them the right results.

Know Your Data
When your only user input is a single query string, you will encounter three scenarios frequently:

Best fields
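Find the single best-matching field.
Suppose a type has two fields, title and body. If doc1 matches one term in each of the two fields, while doc2 matches in only one field but that field matches both terms, then under best fields doc2 gets the higher overall score.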
Most fields
Cross fields
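Best Fields (best match, dis_max)
"Best" means that, for each document, only the score of its single best-matching field is returned.

Consider searching blog posts that have a title and a body: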

PUT /my_index/my_type/1
{
  "title": "Quick brown rabbits",
  "body":  "Brown rabbits are commonly seen."
}
PUT /my_index/my_type/2
{
  "title": "Keeping pets healthy",
  "body":  "My quick brown fox eats rabbits on a regular basis."
}
// search for "Brown fox"
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title": "Brown fox" }},
        { "match": { "body":  "Brown fox" }}
      ]
    }
  }
}
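Result: with the bool query, doc1 scores highest.
The bool query computes the score as follows:

Run both of the query clauses in should.
Add the scores of the two clauses together; a clause that does not match scores 0.
Multiply by the number of matching clauses.
Divide by the total number of clauses, which is 2 in this example.
In the example above:
In doc1, both fields contain the term brown, so both should clauses match (a match means contains, not equals).
doc2 contains the query terms in only one field, body, but that field contains both of them, brown and fox; its title field matches nothing. The body clause therefore scores higher and is combined with a score of 0 for title.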

doc1: (1 + 1) × 2 / 2 = 2; doc2: (2 + 0) × 1 / 2 = 1, assuming the higher body score is 2.
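The four steps can be written as a short sketch. It follows the description in these notes rather than Lucene's actual similarity code, and the per-clause scores are the illustrative values used in the arithmetic (1 for a single-term match, 2 for the two-term body match). The function name `bool_should_score` is hypothetical.

```python
def bool_should_score(clause_scores):
    """Score a document the way the steps above describe a bool/should query.

    `clause_scores` holds one score per should clause; 0 means no match.
    """
    matching = sum(1 for score in clause_scores if score > 0)
    total = len(clause_scores)
    # Sum the clause scores, multiply by the matching count, divide by the total.
    return sum(clause_scores) * matching / total

doc1 = bool_should_score([1, 1])  # title and body each match "brown"
doc2 = bool_should_score([0, 2])  # only body matches, but on both terms
print(doc1, doc2)
```

With these inputs doc1 comes out ahead of doc2, even though doc2 holds both query terms in one field.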
In most cases, though, we would expect doc2 to score highest, because its body field contains both brown and fox and is therefore more relevant.

So, instead of combining the scores of all fields as bool does, we need a query that uses the score of the best-matching field as the overall score:

This would give preference to a single field that contains both of the words we are looking for, rather than the same word repeated in different fields.

dis_max(Disjunction Max Query)
Disjunction means or (while conjunction means and).

Disjunction Max Query simply means: return documents that match any of these queries, and return the score of the best matching query.

{
  "query": {
    "dis_max": { // Disjunction Max Query
      "queries": [ // a document matches if either clause matches, but only the highest clause score is returned
        { "match": { "title": "Brown fox" }},
        { "match": { "body":  "Brown fox" }}
      ]
    }
  }
}
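Under the same illustrative clause scores as in the bool walkthrough (1 for a single-term match, 2 for the two-term body match), a dis_max style score can be sketched as taking the maximum. The function name `dis_max_score` is hypothetical and the numbers are not real Elasticsearch scores.

```python
def dis_max_score(clause_scores):
    """Return the score of the best-matching clause; the others are ignored."""
    return max(clause_scores)

doc1 = dis_max_score([1, 1])  # "brown" in title and in body
doc2 = dis_max_score([0, 2])  # "brown" and "fox" both in body
print(doc1, doc2)
```

Here doc2 ranks first, which is the ordering the bool query failed to produce.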
Tuning Best Fields Queries
What happens if the search is for "quick pets"?
Both documents contain the word quick, but only doc2 contains pets.
Neither document contains both terms (quick and pets) in the same field.

{
  "query": {
    "dis_max": { // choose the single best matching field, and ignore the other
      "queries": [
        { "match": { "title": "Quick pets" }},
        { "match": { "body":  "Quick pets" }}
      ]
    }
  }
}
As shown, both docs get the same score, because in each doc the best-matching field contains just one of the two terms.

tie_breaker

{
  "query": {
    "dis_max": {
      "queries": [
        { "match": { "title": "Quick pets" }},
        { "match": { "body":  "Quick pets" }}
      ],
      "tie_breaker": 0.3 // between 0 and 1, i.e. somewhere between dis_max and bool
    }
  }
}
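Now doc2 scores higher.
tie_breaker sits between bool and dis_max. The score is computed as follows:

Take the score of the best-matching clause (the best match).
Multiply the score of each of the other matching clauses by the tie_breaker.
Add all the scores together and normalize.
With tie_breaker, the score takes all matching clauses into account, but the best-matching field still counts the most.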
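A minimal sketch of the first two steps, leaving out the normalization step. The per-field scores are hypothetical values chosen so that both docs tie under plain dis_max, as in the "Quick pets" example; `dis_max_score` is an illustrative name.

```python
import math

def dis_max_score(clause_scores, tie_breaker=0.0):
    """Best clause score plus tie_breaker times the scores of the other clauses."""
    best = max(clause_scores)
    others = sum(clause_scores) - best
    return best + tie_breaker * others

# Hypothetical per-field scores for "Quick pets", given as [title, body].
doc1 = [1.0, 0.0]  # title matches "quick"; body matches nothing
doc2 = [1.0, 0.5]  # title matches "pets"; body matches "quick"

print(dis_max_score(doc1), dis_max_score(doc2))            # tie without tie_breaker
print(dis_max_score(doc1, 0.3), dis_max_score(doc2, 0.3))  # doc2 pulls ahead
```

A tie_breaker of 0 reduces to plain dis_max, while a tie_breaker of 1 adds all clause scores together, which is closer to what bool does.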

multi_match Query
{
  "dis_max": {
    "queries": [
      {
        "match": {
          "title": { // the title field
            "query": "Quick brown fox",
            "minimum_should_match": "30%"
          }
        }
      },
      {
        "match": {
          "body": { // the body field
            "query": "Quick brown fox",
            "minimum_should_match": "30%"
          }
        }
      }
    ],
    "tie_breaker": 0.3
  }
}
This can be rewritten more concisely as:

GET my_index/_search
{
  "query": {
    "multi_match": {
      "query": "Quick brown fox",
      "type": "best_fields",
      "fields": [
        "title",
        "body"
      ],
      "tie_breaker": 0.3,
      "minimum_should_match": "30%"
    }
  }
}
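The correspondence between the two forms can be sketched as a small expansion function. It only mirrors the rewrite shown in these notes; it is not how Elasticsearch builds the query internally, and the name `expand_best_fields` is an assumption.

```python
def expand_best_fields(query, fields, tie_breaker=0.0, minimum_should_match=None):
    """Build the dis_max query that a best_fields multi_match corresponds to."""
    clauses = []
    for field in fields:
        params = {"query": query}
        if minimum_should_match is not None:
            # The same setting is copied into every per-field match clause.
            params["minimum_should_match"] = minimum_should_match
        clauses.append({"match": {field: params}})
    return {"dis_max": {"queries": clauses, "tie_breaker": tie_breaker}}

expanded = expand_best_fields(
    "Quick brown fox",
    ["title", "body"],
    tie_breaker=0.3,
    minimum_should_match="30%",
)
print(expanded)
```

Each field in the list becomes one match clause, which is why the short form stays readable as the number of fields grows.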
Field names can be specified with wildcards:

GET my_index/_search
{
  "query": {
    "multi_match": {
      "query": "Quick brown fox",
      "fields": "title*"
    }
  }
}
Boosting Individual Fields

GET my_index/_search
{
  "query": {
    "multi_match": {
      "query": "Quick brown fox",
      "fields": [
        "title^3",
        "body"
      ]
    }
  }
}
To boost a field, add ^boost after the field name.
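To make the caret syntax concrete, here is a rough sketch that splits a field spec into a name and a boost and then scales some made-up raw scores. Treating a boost as a plain multiplier is a simplification, and `parse_field` and `apply_boosts` are hypothetical helpers, not part of any Elasticsearch client.

```python
def parse_field(spec):
    """Split a "field^boost" spec into (field_name, boost); boost defaults to 1.0."""
    name, sep, boost = spec.partition("^")
    return name, (float(boost) if sep else 1.0)

def apply_boosts(field_specs, raw_scores):
    """Multiply each field's raw score by its boost (a simplification)."""
    boosted = {}
    for spec in field_specs:
        name, boost = parse_field(spec)
        boosted[name] = raw_scores.get(name, 0.0) * boost
    return boosted

raw = {"title": 0.5, "body": 1.0}  # made-up per-field scores
boosted = apply_boosts(["title^3", "body"], raw)
print(boosted)
```

With these numbers the boosted title overtakes body even though its raw score was lower.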

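Most Fields (as many as possible, most_fields)
Full-text search is a balance between returning relevant docs and leaving out irrelevant ones. The goal is to show the documents the user cares about most on the first page of results.
To improve recall, cast a wider net: return not only documents that match the user's terms exactly, but also related ones.
A search for "quick brown fox" would then also be expected to return a document containing "fast foxes".

If other documents contain "quick brown fox", the document containing "fast foxes" is less relevant and ranks below them.

A common technique for tuning full-text relevance is to index the same text in several ways (multiple fields), each of which provides a different relevance signal.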

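The main field contains terms in a form that matches as many documents as possible. For example:

Use a stemmer to index jumps, jumping, and jumped as their root form, jump. A search for jumped then also matches a doc containing jumping.
Include synonyms such as jump, leap, and hop.
Remove diacritics, or accents: for example, ésta, está, and esta would all be indexed as esta.
However, suppose there are two documents, one containing jumped and the other jumping. A user searching for jumped would probably expect the first document to rank higher.
This can be achieved by indexing the same text in other fields to provide more-precise matching: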

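unstemmed version TODO…
original word with diacritics, kept exactly as written
shingles, which provide information about word proximity. TODO…
In short: the same content is indexed into multiple fields using different analyzers.

How is a most_fields query executed?
Elasticsearch generates a separate match query for each field and wraps them in a bool query.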

Multifield Mapping
PUT /my_index
{
  "settings": { "number_of_shards": 1 }, // a single shard keeps relevance scores from being skewed
  "mappings": {
    "my_type": {
      "properties": {
        "title": { // stemmed by the english analyzer
          "type": "string",
          "analyzer": "english",
          "fields": { // note the fields keyword (multifields)
            "std": { // uses the standard analyzer and is not stemmed
              "type": "string",
              "analyzer": "standard"
            }
          }
        }
      }
    }
  }
}
Index some documents:

PUT /my_index/my_type/1
{ "title": "My rabbit jumps" }
PUT /my_index/my_type/2
{ "title": "Jumping jack rabbits" }
Run a match query:

GET /my_index/_search
{
  "query": {
    "match": {
      "title": "jumping rabbits" // becomes the stemmed terms jump and rabbit
    }
  }
}
This becomes a query for the two stemmed terms jump and rabbit, thanks to the english analyzer. The title field of both docs contains both of those terms, so both docs receive the same score.

If we query only the title.std field, only doc2 matches.
If we query both fields and combine their scores with a bool query, both docs match, but doc2 scores higher (thanks to title.std).

GET /my_index/_search
{
  "query": {
    "multi_match": {
      "query": "jumping rabbits",
      "type": "most_fields", // combine the scores from all matching fields
      "fields": [ "title", "title.std" ]
    }
  }
}
standard analyzer: not stemmed
english analyzer: stemmed
Why use the most_fields type?
It combines the scores from all matching fields: it causes the multi_match query to wrap the two field clauses in a bool query instead of a dis_max query.
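The effect on the two example docs can be imitated with toy analyzers. This is only a sketch: `toy_stem` is a crude suffix stripper standing in for the english analyzer, scores are plain counts of matching terms rather than real relevance scores, and all function names are hypothetical.

```python
def standard_tokens(text):
    """Toy stand-in for the standard analyzer: lowercase and split on whitespace."""
    return text.lower().split()

def toy_stem(token):
    """Very crude suffix stripping; NOT the real english analyzer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def english_tokens(text):
    return [toy_stem(token) for token in standard_tokens(text)]

# One analyzer per field, mirroring the title / title.std mapping.
ANALYZERS = {"title": english_tokens, "title.std": standard_tokens}

def field_score(field, query, title):
    """Count how many query terms occur in the field's indexed tokens."""
    analyze = ANALYZERS[field]
    indexed = set(analyze(title))
    return sum(1 for term in analyze(query) if term in indexed)

def most_fields(query, title):
    return sum(field_score(field, query, title) for field in ANALYZERS)

def best_fields(query, title):
    return max(field_score(field, query, title) for field in ANALYZERS)

docs = {1: "My rabbit jumps", 2: "Jumping jack rabbits"}
query = "jumping rabbits"
most = {doc_id: most_fields(query, title) for doc_id, title in docs.items()}
best = {doc_id: best_fields(query, title) for doc_id, title in docs.items()}
print(most, best)
```

Summing across fields lets the exact match in title.std lift doc2 above doc1, while taking only the best field leaves the two docs tied.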

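What if we want dis_max behavior instead?
Change the type to best_fields: the score of a document's single best-matching field becomes the document's overall score, with no summing.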

How can the weight of each field be customized?
Use ^boost:

"fields":      [ "title^10", "title.std" ] // gives title more weight

Cross-fields Entity Search
With entities like person, product, or address, the identifying information is spread across several fields.

Identifying such an entity uniquely requires information from several fields. For example:

{ // a person's name
  "firstname": "Peter",
  "lastname":  "Smith"
}
{ // an address
  "street":   "5 Poland Street",
  "city":     "London",
  "country":  "United Kingdom",
  "postcode": "W1V 3DG"
}
This looks somewhat like the Multiple Query Strings case:

GET /_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title":  "War and Peace" }},
        { "match": { "author": "Leo Tolstoy"   }}
      ]
    }
  }
}
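In the multiple-query-strings case, a different query string is used for each field; here we want to search multiple fields with a single query string.

A user might search for the person "Peter Smith". Those words appear in different fields, so using a dis_max (or best_fields) query to find the single best-matching field is clearly the wrong approach.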

{
  "query": {
    "multi_match": {
      "query": "Poland Street W1V",
      "type": "most_fields", // bool: combine the scores of all matching fields
      "fields": [ "street", "city", "country", "postcode" ]
    }
  }
}
Problems with most_fields

It is designed to find the most fields matching any words, rather than to find the most matching words across all fields.
It cannot use the operator or minimum_should_match parameters to reduce the long tail of less-relevant results.
Term frequencies are different in each field and could interfere with each other to produce badly ordered results.
Field-Centric Queries
Custom _all Fields
cross-fields Queries
A custom _all field is a good solution, provided that you set it up before you indexed your documents.
Elasticsearch also provides a search-time solution: the multi_match query with type cross_fields.

The cross_fields type takes a term-centric approach, quite different from the field-centric approach taken by best_fields and most_fields. It treats all of the fields as one big field, and looks for each term in any field.

{
  "query": {
    "multi_match": {
      "query": "peter smith",
      "type": "cross_fields",
      "operator": "and",
      "fields": [ "first_name", "last_name" ]
    }
  }
}
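The difference between the field-centric and term-centric approaches with the and operator can be illustrated with a toy matcher. It only decides whether a document matches and ignores scoring and analysis; the function names are hypothetical.

```python
def tokens(text):
    """Toy analyzer: lowercase and split on whitespace."""
    return text.lower().split()

def field_centric_and(query, doc, fields):
    """Match if at least one field contains ALL query terms (per-field `and`)."""
    terms = tokens(query)
    return any(all(term in tokens(doc[field]) for term in terms) for field in fields)

def term_centric_and(query, doc, fields):
    """Match if EVERY query term appears in at least one of the fields."""
    terms = tokens(query)
    return all(any(term in tokens(doc[field]) for field in fields) for term in terms)

person = {"first_name": "Peter", "last_name": "Smith"}
fields = ["first_name", "last_name"]

print(field_centric_and("peter smith", person, fields))  # no single field holds both terms
print(term_centric_and("peter smith", person, fields))   # each term is found in some field
```

Treating the fields as one big field is what lets the and operator work for a name that is split across first_name and last_name.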