Elasticsearch Mapping

Elasticsearch mapping, i.e. the schema definition

Elasticsearch Guide [7.17] » Mapping
https://www.elastic.co/guide/en/elasticsearch/reference/7.17/mapping.html

A mapping defines the data types of the fields in an index. It plays the same role as metadata or a schema in other systems.

Dynamic mapping

Elasticsearch Guide [7.17] » Mapping » Dynamic mapping
https://www.elastic.co/guide/en/elasticsearch/reference/7.17/dynamic-mapping.html

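No mapping has to be defined up front, and the index does not even have to exist: when a document with arbitrary fields is indexed, ES creates the index and adds the fields automatically. This is called dynamic mapping.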

Dynamic field mapping

Elasticsearch Guide [7.17] » Mapping » Dynamic mapping » Dynamic field mapping
https://www.elastic.co/guide/en/elasticsearch/reference/7.17/dynamic-field-mapping.html

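JSON field types are detected automatically as follows:

JSON data type	"dynamic":"true"	"dynamic":"runtime"
null	No field added	No field added
true or false	boolean	boolean
double	float	double
integer	long	long
object	object	No field added
array	Depends on the first non-null element	Depends on the first non-null element
string in a date format	date	date
string in a numeric format	float or long	double or long
string that is neither a date nor a number	text with a .keyword sub-field	keyword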
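
The detection rules in the table can be approximated by a small Python function. This is only a sketch of the "dynamic":"true" column: the function name infer_dynamic_type and the date regular expression are assumptions made for illustration (Elasticsearch checks strings against its configured dynamic date formats), and numeric detection for strings is left out because it is disabled by default.

```python
import re

# Simplified stand-in for Elasticsearch's date detection (an assumption of this
# sketch): accepts ISO-like dates such as 2024-01-31 or 2024-01-31T10:00:00Z.
_DATE_RE = re.compile(
    r"^\d{4}-\d{2}-\d{2}(T\d{2}:\d{2}(:\d{2}(\.\d+)?)?(Z|[+-]\d{2}:\d{2})?)?$"
)


def infer_dynamic_type(value):
    """Return the field type dynamic=true would pick, or None if no field is added."""
    if value is None:
        return None
    if isinstance(value, bool):  # checked before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        # Arrays take the type of their first non-null element.
        for item in value:
            if item is not None:
                return infer_dynamic_type(item)
        return None
    if isinstance(value, str):
        if _DATE_RE.match(value):
            return "date"
        return "text + keyword sub-field"
    raise TypeError(f"unsupported JSON value: {value!r}")
```

For instance, infer_dynamic_type([None, 2]) follows the array rule and returns "long", while a plain string such as "hello" ends up as text with a keyword sub-field.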

Explicit mapping

Elasticsearch Guide [7.17] » Mapping » Explicit mapping
https://www.elastic.co/guide/en/elasticsearch/reference/7.17/explicit-mapping.html

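Why define a mapping yourself?

Elasticsearch already offers dynamic mapping, and new fields get a mapping automatically, but the inference is mechanical and sometimes wrong, for example for geolocation data or dates in a special format. When that happens, sorting, aggregations, and similar query features no longer behave the way we need.

The most typical problem: for a new string field that is neither a number nor a date, ES infers a text field with a .keyword sub-field. If you overlook this, a later term filter on the field itself will not match as expected, because the text field is analyzed.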

{
  "str_field": {
    "type": "text",
    "fields": {
      "keyword": {
        "type": "keyword",
        "ignore_above": 256
      }
    }
  }
}
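
Why a term filter misses on such a field can be sketched in a few lines of Python. The analyzer below is a rough stand-in for the standard analyzer (split on non-word characters, lowercase), and the function names are hypothetical; the point is only that a term query looks up the exact term without analyzing it.

```python
import re


def standard_like_analyze(text):
    """Rough stand-in for the standard analyzer: split on non-word chars, lowercase."""
    return [tok.lower() for tok in re.split(r"\W+", text) if tok]


def index_field(value):
    """Index a string the way the mapping above would: analyzed text plus raw keyword."""
    return {
        "str_field": set(standard_like_analyze(value)),                # analyzed tokens
        "str_field.keyword": {value} if len(value) <= 256 else set(),  # ignore_above: 256
    }


def term_matches(index, field, term):
    """A term query looks up the exact term; the query string is not analyzed."""
    return term in index[field]


idx = index_field("New York")
print(term_matches(idx, "str_field", "New York"))          # False: tokens are new, york
print(term_matches(idx, "str_field.keyword", "New York"))  # True: raw value is kept
```

Under these assumptions the exact-match filter only works against str_field.keyword, which is why the field type should be chosen deliberately.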

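By defining the mapping explicitly we can:
specify which fields are full-text fields
specify which fields contain numbers, dates, or geolocations
specify the format of date values
specify the analyzer of a field (for example the ik analyzer for Chinese text)
define the rules that control dynamic mapping

Explicit and dynamic mapping can be combined: map the fields whose type you want to control explicitly and leave the remaining fields to dynamic mapping.

The mapping turns a JSON document into the flat format that Lucene requires.

A mapping belongs to a type of an index: every document belongs to a type, and each type has one mapping definition.
Starting with ES 7.0, the type no longer has to be specified in the mapping definition, because every index has exactly one type, named _doc.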

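Specifying the mapping when creating an index

The mapping can be supplied in the same request that creates the index, for example:

PUT /my-index-000001
{
    "mappings":{
        "properties":{
            "firstName":{
                "type":"text", // text: analyzed for full-text search
                "fields":{
                    "keyword":{
                        "type":"keyword", // keyword: supports aggregations
                        "ignore_above":256
                    }
                }
            },
            "lastName":{
                "type":"keyword",
                "null_value":"NULL" // index explicit nulls as "NULL"; supported by keyword and several other types, but not by text
            },
            "mobile":{
                "type":"text",
                "index":false // this field is not indexed
            },
            "address":{
                "type":"text",
                "index_options":"offsets" // controls what the inverted index records; offsets records the most (all four kinds of information)
            }
        }
    }
}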

Adding fields to an existing mapping

The Update mapping API adds one or more fields to the mapping of an existing index, for example:

PUT /my-index-000001/_mapping
{
  "properties": {
    "employee-id": {
      "type": "keyword",
      "index": false
    }
  }
}

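The request above adds an employee-id field of type keyword that is not indexed.

Updating the mapping of an existing index

The data type of an existing field cannot be changed, because that would invalidate the data already indexed. Only certain mapping parameters of a field can be updated.

To change the type of a field, create a new index with the desired mapping and reindex the existing data into it.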
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html

Runtime fields (7.11+)

Elastic Docs ›Elasticsearch Guide [8.8] ›Mapping
https://www.elastic.co/guide/en/elasticsearch/reference/current/runtime.html

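Runtime fields were added in ES 7.11 and can be defined either in the mapping or in a search request. They make it possible, for example, to store a numeric age but return a derived bracket at query time (senior for 60+, middle-aged for 40-60, young below that), or to store ip and port separately but return ip:port.

Runtime fields are fields that are evaluated at query time. They let you:

add fields to existing documents without reindexing your data
start working with your data without understanding how it is structured
override the value returned from an indexed field at query time
define fields for a specific use without modifying the underlying schema

Elasticsearch: using runtime fields to override indexed fields and fix errors (7.11 release)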
https://blog.csdn.net/UbuntuTouch/article/details/113795062

Metadata fields

Elasticsearch Guide [7.17] » Mapping » Metadata fields
https://www.elastic.co/guide/en/elasticsearch/reference/7.17/mapping-fields.html

_id field

Elasticsearch Guide [7.17] » Mapping » Metadata fields » _id field
https://www.elastic.co/guide/en/elasticsearch/reference/7.17/mapping-id-field.html

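Every document has a unique _id and can be fetched with GET /index/_doc/_id. The _id can be supplied when the document is indexed or generated by ES. The type of the _id field cannot be configured in the mapping.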

Loading the fielddata on the _id field is deprecated

With the Elasticsearch 7.x Java RestHighLevelClient, sorting or aggregating on the _id field makes the underlying RestClient log a warning like this:

org.elasticsearch.client.RestClient      : request [POST http://ip:9200/myindex/_search?typed_keys=true&max_concurrent_shard_requests=5&ignore_unavailable=false&expand_wildcards=open&allow_no_indices=true&ignore_throttled=true&search_type=query_then_fetch&batched_reduce_size=512&ccs_minimize_roundtrips=true] returned 1 warnings: [299 Elasticsearch-7.6.1-aa751e09be0a5072e8570670309b1f12348f023b "Loading the fielddata on the _id field is deprecated and will be removed in future versions. If you require sorting or aggregating on this field you should also include the id in the body of your documents, and map this field as a keyword field that has [doc_values] enabled"]

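The warning appears because the query tries to load fielddata on the _id field. Fielddata is the in-memory data structure Elasticsearch uses to hold field values for aggregations, sorting, and scripts; loading it for _id can consume a lot of memory, which is why this usage is deprecated.
If you need to sort or aggregate on the document id, the recommended approach is to also include the id in the document body and map that field as a keyword field with doc_values enabled. Sorting and aggregating then run against the keyword field, and no fielddata is loaded on _id.

For example, add a my_id keyword field with doc_values enabled, have the application assign it a unique id when inserting data, and use it for sorting afterwards.

{
  "mappings": {
    "properties": {
      "my_id": {
        "type": "keyword",
        "doc_values": true
      }
    }
  }
}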

Fielddata access on the _id field is disallowed

On Elasticsearch 8.x,
whether through the Java co.elastic.clients.elasticsearch.ElasticsearchClient
or directly in a /_search HTTP request,
sorting or aggregating on the _id field, for example:

{
  "sort": [
    {
      "_score": {
        "order": "desc"
      }
    },
    {
      "_id": {
        "order": "asc"
      }
    }
  ]
}

fails immediately with the error Fielddata access on the _id field is disallowed:

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Fielddata access on the _id field is disallowed, you can re-enable it by updating the dynamic cluster setting: indices.id_field_data.enabled"
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true
  },
  "status": 400
}

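Cause:
_id is a special Elasticsearch metadata field, and Elasticsearch 8.x disables fielddata (the in-memory structure used for sorting, aggregations, and similar operations) on it by default.

Solutions:
Option 1: sort on a custom business id field instead of _id.
Option 2: set the cluster setting indices.id_field_data.enabled to true so that sorting on _id is allowed: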

PUT /_cluster/settings
{
  "transient": {
    "indices.id_field_data.enabled": true
  }
}

_routing field (Elasticsearch sharding strategy / shard key)

Elasticsearch Guide [7.17] » Mapping » Metadata fields » _routing field
https://www.elastic.co/guide/en/elasticsearch/reference/7.17/mapping-routing-field.html

A document is routed to a particular shard with the following formula:

routing_factor = num_routing_shards / num_primary_shards
shard_num = (hash(_routing) % num_routing_shards) / routing_factor

num_routing_shards is the value of the index setting index.number_of_routing_shards
num_primary_shards is the value of the index setting index.number_of_shards
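
The formula can be traced with a short Python sketch. The function name shard_for is hypothetical, and zlib.crc32 is only a stand-in hash chosen so the example runs with the standard library; Elasticsearch uses its own hash function, so real shard numbers will differ.

```python
import zlib


def shard_for(routing, num_primary_shards, num_routing_shards):
    """Apply the routing formula above with a stand-in hash function."""
    if num_routing_shards % num_primary_shards != 0:
        raise ValueError("num_routing_shards must be a multiple of num_primary_shards")
    routing_factor = num_routing_shards // num_primary_shards
    hashed = zlib.crc32(routing.encode("utf-8"))  # stand-in, not the hash ES uses
    return (hashed % num_routing_shards) // routing_factor


# With 5 primary shards and 640 routing shards, every routing value lands in 0..4.
print(shard_for("user-42", num_primary_shards=5, num_routing_shards=640))
```

Because the hash is first reduced modulo num_routing_shards and only then divided by routing_factor, the result always falls in the range 0 to num_primary_shards - 1.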

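The default _routing value is the document's _id. A custom routing value can be supplied per document. When a document is indexed with a custom routing value, the same value must be supplied to get, update, or delete it, and it can be supplied at search time to restrict the search to the matching shard.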

_source

Field data types

Elasticsearch Guide [7.17] » Mapping » Field data types
https://www.elastic.co/guide/en/elasticsearch/reference/7.17/mapping-types.html

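text

https://www.elastic.co/guide/en/elasticsearch/reference/current/text.html

The text type is used to index long text. Before indexing, the text is analyzed into individual terms, and those terms are indexed so that ES can search for them. By default, text fields cannot be used for sorting or aggregations.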

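keyword

https://www.elastic.co/guide/en/elasticsearch/reference/current/keyword.html

keyword fields are not analyzed. They can be used for filtering (term-level queries), sorting, and aggregations, and a keyword field only matches on its exact, complete value.

keyword maximum length: 32766 bytes (8191 characters in the worst case)

A keyword value can be at most 32766 bytes long (a Lucene limit). Since a UTF-8 character occupies at most 4 bytes, the safe bound is 32766 / 4 = 8191 characters, rounded down; common Chinese characters take 3 bytes each in UTF-8, so the real limit for such text is somewhat higher. text fields have no such length limit.
Exceeding the limit makes indexing fail with: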

max_bytes_length_exceeded_exception: bytes can be at most 32766 in length; got xxx
type=illegal_argument_exception, reason=Document contains at least one immense term in field="content.keyword" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms. original message: bytes can be at most 32766 in length; got 63192

With ignore_above set, values longer than the given length are not indexed and cannot be found by an exact term match.

This option is also useful for protecting against Lucene’s term byte-length limit of 32766.
The value for ignore_above is the character count, but Lucene counts bytes. If you use UTF-8 text with many non-ASCII characters, you may want to set the limit to 32766 / 4 = 8191 since UTF-8 characters may occupy at most 4 bytes.
https://www.elastic.co/guide/en/elasticsearch/reference/7.6/ignore-above.html

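For example, the message field below has a sub-field named keyword of type keyword with a maximum length of 8191, which avoids the error when message grows beyond 32766 bytes.
As a rule of thumb, add ignore_above to every keyword field that may hold variable-length text, so that overly long strings do not cause indexing errors.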

PUT my_index
{
  "mappings": {
    "properties": {
      "message": {
        "type": "text",
        "fields":{
          "keyword":{
            "type": "keyword",
            "ignore_above": 8191
          }
        }
      }
    }
  }
}
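
The relation between the byte limit and the character count can be checked with plain Python. The helper names below are hypothetical; the only facts used are the 32766-byte limit quoted above and how UTF-8 encodes characters.

```python
LUCENE_MAX_TERM_BYTES = 32766  # limit quoted in the error message above


def utf8_len(text):
    """Number of bytes the string occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))


def safe_ignore_above(max_bytes_per_char=4):
    """Largest character count that can never exceed the byte limit."""
    return LUCENE_MAX_TERM_BYTES // max_bytes_per_char


def fits_in_keyword(text):
    """True if the value stays within the term byte limit."""
    return utf8_len(text) <= LUCENE_MAX_TERM_BYTES


print(safe_ignore_above())   # 8191, the worst case of 4 bytes per character
print(utf8_len("中"))         # 3, a common Chinese character
print(safe_ignore_above(3))  # 10922 if every character takes 3 bytes
```

Setting ignore_above to 8191 is therefore the conservative choice: it holds no matter which characters the field contains.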

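Differences between text and keyword

text: analyzed and searchable with full-text queries; does not support aggregations or sorting. Suited to large fields such as an article body or a content field.
keyword: supports exact matching, aggregations, and sorting. Suited to fields that are filtered by exact value, such as url, name, status, or gender.

Combining the two: a text field with a keyword sub-field

With dynamic mapping, ES type inference stores a string that is neither a number nor a date as a text field with a .keyword sub-field, so a later term filter on the field itself does not match as expected.

A short text such as a title sometimes needs full-text search and sometimes exact matching. In that case, follow the pattern that dynamic mapping uses:
title itself is a text field, optionally with the ik_max_word analyzer, and it has a sub-field of type keyword (also named keyword) for exact matching: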

"title": {
  "type": "text",
  "analyzer": "ik_max_word",
  "fields": {
    "keyword": {
      "type": "keyword"
    }
  } 
}

date

https://www.elastic.co/guide/en/elasticsearch/reference/current/date.html

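Internally, ES converts date values to UTC and stores them as a long number of milliseconds since the epoch.
Queries on date fields are implemented internally as range queries.

1. Default format of the date type
If no format is specified for a date field, the default strict_date_optional_time||epoch_millis is used, i.e. either the strict_date_optional_time format or a millisecond timestamp.

strict_date_optional_time is the ISO date-time format. The date part yyyy-MM-dd is required; if a time is present it must be separated by T, as in yyyy-MM-dd'T'HH:mm:ss.SSSZ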

Elastic Docs ›Elasticsearch Guide [8.6] ›Mapping ›Mapping parameters
https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-date-format.html#strict-date-time

2. Multiple date formats
Several date formats can be specified, separated by ||:

"mappings": {
  "properties": {
    "date": {
      "type":   "date",
      "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
    }
  }
}
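
How a multi-format date field turns input into the stored millisecond value can be sketched in Python. This is not how Elasticsearch parses dates internally: the translation table from the Java-style patterns to strptime patterns is hand-written for this example, and the function name to_epoch_millis is hypothetical. Values without a time zone are treated as UTC.

```python
from datetime import datetime, timezone

# Hand-written translation of the patterns used in the mapping above
# (an assumption of this sketch; only these two patterns are covered).
_STRPTIME = {
    "yyyy-MM-dd HH:mm:ss": "%Y-%m-%d %H:%M:%S",
    "yyyy-MM-dd": "%Y-%m-%d",
}


def to_epoch_millis(value, fmt="yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"):
    """Try each format in order and return milliseconds since the epoch (UTC)."""
    for pattern in fmt.split("||"):
        if pattern == "epoch_millis":
            try:
                return int(value)
            except (TypeError, ValueError):
                continue
        try:
            parsed = datetime.strptime(str(value), _STRPTIME[pattern])
        except ValueError:
            continue  # this format does not fit, try the next one
        return int(parsed.replace(tzinfo=timezone.utc).timestamp() * 1000)
    raise ValueError(f"{value!r} matches none of the formats in {fmt!r}")


print(to_epoch_millis("2024-01-31 00:00:01"))  # 1706659201000
print(to_epoch_millis("2024-01-31"))           # 1706659200000
print(to_epoch_millis(1706659200000))          # 1706659200000
```

All three inputs end up as the same kind of long value, which is what makes range queries across differently formatted inputs possible.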

arrays

https://www.elastic.co/guide/en/elasticsearch/reference/current/array.html

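Elasticsearch has no dedicated array data type. Any field can contain zero or more values by default, but all values in an array must have the same data type.
A long field holding one value is a long; holding several values, it is effectively an array of longs.
A keyword field holding one value is a keyword; holding several values, it is effectively an array of keywords.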

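dense_vector

What is a dense vector?
A dense vector is a vector representation in which every element is stored explicitly; no sparse encoding is used to omit elements that are zero or close to zero.

Vector fields in Elasticsearch 7.x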

https://www.elastic.co/guide/en/elasticsearch/reference/7.17/dense-vector.html

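A dense_vector field stores a dense vector of float values.
A dense_vector can have at most 2048 dimensions.
A dense_vector field is single-valued; it cannot store multiple vectors.
A dense_vector field does not support queries, sorting, or aggregations. It can only be used through a script_score query that calls the vector functions.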

Functions for vector fields
https://www.elastic.co/guide/en/elasticsearch/reference/7.17/query-dsl-script-score-query.html#vector-functions
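
What a script_score query with a vector function amounts to is a brute-force scan: score every document against the query vector and keep the best ones. The Python sketch below mimics that with cosine similarity; the function names are hypothetical, and adding 1.0 to the similarity mirrors the common practice of keeping scores non-negative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_knn(query_vector, docs, k):
    """Score every document, then return the k best (doc_id, score) pairs."""
    scored = [
        (doc_id, cosine_similarity(query_vector, vector) + 1.0)
        for doc_id, vector in docs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(brute_force_knn([1.0, 0.0], docs, k=2))  # "a" first, then "c"
```

The cost grows linearly with the number of documents, which is the reason 8.x adds indexed (approximate) kNN search.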

Vector fields in Elasticsearch 8.x

https://www.elastic.co/guide/en/elasticsearch/reference/8.18/dense-vector.html

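A dense_vector field stores a dense vector of numeric values and is mainly used for kNN vector search.
A dense_vector can have at most 4096 dimensions.
A dense_vector field does not support aggregations or sorting.
A dense_vector field is single-valued; it cannot store multiple vectors.

Indexing vectors for kNN search

A k-nearest neighbor (kNN) search finds the k vectors closest to the query vector, as measured by a chosen similarity metric.
A dense_vector field can be used in a script_score query with the vector functions, which is a brute-force scan that computes the similarity for every document. This is how vector similarity search works in Elasticsearch 7.x.
In most cases brute-force kNN is not fast enough, so the dense_vector type in Elasticsearch 8.x supports indexing the vectors to enable fast kNN search.
Arrays of float values with 128 to 4096 elements that have no explicit mapping are dynamically mapped as dense_vector, with the similarity parameter set to cosine.
If the inferred similarity of cosine is not what you want, map the dense_vector field explicitly and set another similarity.

For example, set the similarity of a dense_vector field to dot_product: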

PUT my-index-2
{
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "dense_vector",
        "dims": 3,
        "similarity": "dot_product"
      }
    }
  }
}

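Vector indexing is enabled by default for dense_vector fields, i.e. the index parameter defaults to true.

Vectors are indexed as int8_hnsw, after which kNN search can be run with the chosen similarity metric.
Indexing vectors is expensive. If the field never needs to be searched, set index to false to turn vector indexing off.

For example, my_vector is a vector field with index set to false: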

PUT my-index-2
{
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "dense_vector",
        "dims": 3,
        "index": false
      }
    }
  }
}

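Elasticsearch uses the HNSW algorithm to support efficient kNN search. Like most kNN algorithms, HNSW is approximate: it trades result accuracy for search speed.

Vector index quantization

The dense_vector type supports quantization to reduce the memory needed when searching float vectors. Three quantization strategies are supported:

int8 quantizes each dimension of the vector to a 1-byte integer. This reduces the memory footprint by 75% (4x) at the cost of some accuracy.
int4 quantizes each dimension of the vector to a half-byte integer. This reduces the memory footprint by about 87% (8x) at the cost of accuracy.
bbq Better binary quantization (BBQ) reduces each dimension to single-bit precision. This reduces the memory footprint by about 96% (32x) at a larger cost in accuracy. In general, oversampling at query time and reranking help mitigate the accuracy loss.

To use a quantized index, set the index type to int8_hnsw, int4_hnsw, or bbq_hnsw. When float vectors are indexed, the current default index type is int8_hnsw.
Quantized vectors can use oversampling and rescoring to improve the accuracy of approximate kNN search results.

Quantization keeps the original float vector values on disk, so that reranking, reindexing, and quantization improvements remain possible over the lifetime of the data. Both the quantized and the original vectors are stored, so disk usage increases:

int8 increases disk usage by about 25%
int4 increases it by about 12.5%
bbq increases it by about 3.1%

int4 quantization requires an even number of vector dimensions.
bbq quantization only supports vector dimensions greater than 64.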
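
The memory and disk figures above follow from simple bit counting, as the Python sketch below shows. It assumes a float dimension takes 32 bits and ignores any per-vector bookkeeping the index may add, so the numbers are approximations.

```python
FLOAT_BITS = 32  # bits per dimension of the original float vector
QUANT_BITS = {"int8": 8, "int4": 4, "bbq": 1}  # bits per dimension after quantization


def memory_reduction_factor(kind):
    """How many times smaller the quantized vector is than the float vector."""
    return FLOAT_BITS / QUANT_BITS[kind]


def extra_disk_fraction(kind):
    """Extra disk used by the quantized copy, relative to the float vectors."""
    return QUANT_BITS[kind] / FLOAT_BITS


for kind in QUANT_BITS:
    print(kind, memory_reduction_factor(kind), f"{extra_disk_fraction(kind):.1%}")
# int8 4.0 25.0%
# int4 8.0 12.5%
# bbq 32.0 3.1%
```

The disk overhead is the mirror image of the memory saving: the smaller the quantized copy, the less it adds on top of the float vectors that are kept anyway.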

For example, configure the dense_vector field my_vector to use int8 quantization:

PUT my-byte-quantized-index
{
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "dense_vector",
        "dims": 3,
        "index": true,
        "index_options": {
          "type": "int8_hnsw"
        }
      }
    }
  }
}

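Parameters of dense_vector fields

element_type (optional, string) The data type used to encode the vectors.
Supported types are float (default), byte, and bit:

float The default. Indexes a 4-byte floating-point value per dimension.
byte Indexes a 1-byte integer value per dimension.
bit Indexes a single bit per dimension. Useful for very high-dimensional vectors or for models that specifically support bit vectors. Note: when bit is used, the number of dimensions must be a multiple of 8 and must represent the number of bits.

dims (optional, integer) Number of vector dimensions.
Cannot exceed 4096. If dims is not specified, it is set to the length of the first vector added to the field.

index (optional, boolean) Whether to index the vectors for kNN search.
If true, the field can be searched with the kNN search API.
Defaults to true.

similarity (optional, string) The vector similarity metric used in kNN search.
Documents are ranked by the similarity between their vector field and the query vector. The _score of each document is derived from the similarity metric; it is a positive number, and a higher score means a closer match and a higher rank.
Defaults to l2_norm when element_type is bit, otherwise to cosine.
This parameter can only be specified when index is true.
bit vectors only support l2_norm as their similarity metric.
Valid values:

l2_norm Computes similarity from the L2 (Euclidean) distance between the vectors. The document _score is computed as 1 / (1 + l2_norm(query, vector)^2)
dot_product Computes the dot product of two unit vectors. The document _score is computed as (1 + dot_product(query, vector)) / 2
max_inner_product Computes the maximum inner product of two vectors.
cosine Computes the cosine similarity. During indexing, Elasticsearch automatically normalizes vectors with similarity=cosine to unit length, which allows dot_product to be used internally and makes the computation more efficient. The original, un-normalized vectors remain accessible to scripts. The document _score is computed as (1 + cosine(query, vector)) / 2. Cosine similarity does not allow vectors with zero magnitude, since the cosine is undefined in that case.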
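
The _score formulas for l2_norm, dot_product, and cosine can be written out directly. The Python sketch below uses hypothetical function names and plain float lists; it covers only the three formulas stated above and assumes float vectors (and, for dot_product, unit-length vectors).

```python
import math


def l2_norm_score(query, vector):
    """_score for l2_norm: 1 / (1 + squared Euclidean distance)."""
    squared_distance = sum((q - v) ** 2 for q, v in zip(query, vector))
    return 1.0 / (1.0 + squared_distance)


def dot_product_score(query, vector):
    """_score for dot_product; both vectors are assumed to have unit length."""
    dot = sum(q * v for q, v in zip(query, vector))
    return (1.0 + dot) / 2.0


def cosine_score(query, vector):
    """_score for cosine; zero-magnitude vectors are rejected."""
    norm_q = math.sqrt(sum(q * q for q in query))
    norm_v = math.sqrt(sum(v * v for v in vector))
    if norm_q == 0.0 or norm_v == 0.0:
        raise ValueError("cosine is undefined for zero-magnitude vectors")
    dot = sum(q * v for q, v in zip(query, vector))
    return (1.0 + dot / (norm_q * norm_v)) / 2.0


print(l2_norm_score([0.0, 0.0], [3.0, 4.0]))      # 1 / 26
print(dot_product_score([1.0, 0.0], [0.0, 1.0]))  # 0.5
print(cosine_score([1.0, 0.0], [-1.0, 0.0]))      # 0.0
```

Each formula maps the raw similarity into a non-negative range, so identical vectors score highest and the ranking can use _score directly.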

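index_options (optional, object) kNN index options
Configures optional parts of the kNN indexing algorithm. The HNSW algorithm has two internal parameters that influence how the data structure is built. They can be tuned to improve the accuracy of results, at the cost of slower indexing.
This parameter can only be specified when index is true.
Properties of index_options:
type (required, string) The type of kNN algorithm to use. One of:

hnsw Uses the HNSW algorithm for scalable approximate kNN search. Supports all element_type values.
int8_hnsw The default index type for float vectors. Uses the HNSW algorithm for scalable approximate kNN search, with automatic int8 scalar quantization. Reduces the memory footprint 4x at the cost of some accuracy.
int4_hnsw Uses the HNSW algorithm for scalable approximate kNN search, with automatic int4 scalar quantization. Reduces the memory footprint 8x at the cost of some accuracy.
bbq_hnsw Uses the HNSW algorithm for scalable approximate kNN search, with automatic binary quantization. Reduces the memory footprint 32x at the cost of accuracy.
flat Uses a brute-force search algorithm for exact kNN search. Supports all element_type values.
int8_flat Brute-force kNN search with automatic int8 quantization. Only supports an element_type of float.
int4_flat Brute-force kNN search with automatic int4 quantization. Only supports an element_type of float.
bbq_flat Brute-force kNN search with automatic binary quantization. Only supports an element_type of float.

m (optional, integer) The number of neighbors each node is connected to in the HNSW graph.
Defaults to 16.
Only applies to the hnsw, int8_hnsw, int4_hnsw, and bbq_hnsw index types.

ef_construction (optional, integer) The number of candidates to track while assembling the list of nearest neighbors for each new node.
Defaults to 100.
Only applies to the hnsw, int8_hnsw, int4_hnsw, and bbq_hnsw index types.

confidence_interval (optional, float) The confidence interval to use when quantizing the vectors.
Only applies to the int8_hnsw, int4_hnsw, int8_flat, and int4_flat index types.
Can be any value between 0.90 and 1.0, or exactly 0.
A value of 0 means that dynamic quantiles should be calculated for optimized quantization.
A value between 0.90 and 1.0 restricts the values used when calculating the quantization thresholds. For example, a value of 0.95 uses only the middle 95% of the values (the highest and lowest 2.5% are ignored).
Defaults to 1/(dims+1) for int8 quantized vectors and to 0 (dynamic quantile calculation) for int4.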

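Updating the index type of a vector field

To better fit scaling and performance needs, the type in index_options can be changed with the Update mapping API.
The index type of a dense_vector field must be changed in the following order (steps may be skipped):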

flat --> int8_flat --> int4_flat --> hnsw --> int8_hnsw --> int4_hnsw

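When updating between HNSW types (hnsw, int8_hnsw, int4_hnsw), the number of connections m must stay the same or increase.
For the scalar quantized formats (int8_flat, int4_flat, int8_hnsw, int4_hnsw), confidence_interval must always stay consistent (once defined, it cannot be changed).

Switching the index type does not re-index vectors that already exist; they keep their original index type, and only vectors indexed afterwards use the new type. To move all vectors to the new type, use reindex (/_reindex) or a force merge.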

Other types

long, integer, short, byte, double, float
boolean
IPv4&IPv6

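object

https://www.elastic.co/guide/en/elasticsearch/reference/current/object.html

The object type stores JSON objects and allows sub-fields to be nested inside a field.

The index below has a user field of type object with three sub-fields, name, age, and address; address is itself an object with the two sub-fields street and city.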

PUT /my_index
{
  "mappings": {
    "properties": {
      "user": {
        "type": "object",
        "properties": {
          "name": {"type": "text"},
          "age": {"type": "integer"},
          "address": {
            "type": "object",
            "properties": {
              "street": {"type": "text"},
              "city": {"type": "keyword"}
            }
          }
        }
      }
    }
  }
}

Index a document:

POST /my_index/_doc/1
{
  "user": {
    "name": "John Doe",
    "age": 30,
    "address": {"street": "123 Main St", "city": "New York"}
  }
}

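Note: in the underlying storage, an object field is flattened into paths such as user.name and user.address.city. As a result, a query cannot tell apart the different objects inside the same array; the nested type solves this.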

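nested

https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html

Plain objects are stored flattened in ES, which loses the association between their fields

A plain object is flattened when ES stores it, so the association between the fields of one object is lost.

For example, create the test_user index: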

PUT /test_user
{
    "mappings": {
        "properties": {
            "user_id": {
                "type": "long"
            },
            "user_name": {
                "type": "text"
            },
            "address": {
                "properties": {
                    "id": {
                        "type": "long"
                    },
                    "name": {
                        "type": "text"
                    },
                    "address": {
                        "type": "text"
                    }
                }
            }
        }
    }
}

Index some data:

POST /test_user/_doc
{
    "user_id": 1,
    "user_name": "张三",
    "address": [
        {
            "id": 1,
            "name": "公司",
            "address": "北京市海淀区"
        },
        {
            "id": 2,
            "name": "家",
            "address": "北京市昌平区"
        }
    ]
}

Query:

POST /test_user/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "address.id": 1 }},
        { "match": { "address.name":"家"}}
      ]
    }
  }
}

Result:

{
    "took": 1,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 1.287682,
        "hits": [
            {
                "_index": "test_user",
                "_type": "_doc",
                "_id": "Bi-25osB0YdeJP5JAWyt",
                "_score": 1.287682,
                "_source": {
                    "user_id": 1,
                    "user_name": "张三",
                    "address": [
                        {
                            "id": 1,
                            "name": "公司",
                            "address": "北京市海淀区"
                        },
                        {
                            "id": 2,
                            "name": "家",
                            "address": "北京市昌平区"
                        }
                    ]
                }
            }
        ]
    }
}

No address with address.id=1 && address.name="家" exists, yet the document is found, because ES stores the array of objects flattened:

{
  "user_id": 1,
  "user_name": "张三",
  "address.id": [1,2],
  "address.name": ["公司", "家"],
  "address.address": ["北京市海淀区", "北京市昌平区"]
}
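
The effect of flattening can be reproduced in a few lines of Python. The flatten function below is a hypothetical illustration of the storage layout shown above, and the two checks use exact value comparison instead of analyzed matching; the second check evaluates both conditions per object, which is the behavior the nested type provides.

```python
def flatten(doc, prefix=""):
    """Flatten nested objects into dotted paths, merging array values per path."""
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}{key}"
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):
                for sub_path, sub_values in flatten(item, path + ".").items():
                    flat.setdefault(sub_path, []).extend(sub_values)
            else:
                flat.setdefault(path, []).append(item)
    return flat


doc = {
    "user_id": 1,
    "user_name": "张三",
    "address": [
        {"id": 1, "name": "公司", "address": "北京市海淀区"},
        {"id": 2, "name": "家", "address": "北京市昌平区"},
    ],
}

flat = flatten(doc)
# Flattened object: each condition is checked against the merged value lists.
object_match = 1 in flat["address.id"] and "家" in flat["address.name"]
# Per-object check: both conditions must hold for the same address.
nested_match = any(a["id"] == 1 and a["name"] == "家" for a in doc["address"])
print(object_match, nested_match)  # True False
```

Once the values are merged per path, the information about which id belonged to which name is gone, so the false match cannot be avoided at query time.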

Using the nested type to preserve one-to-many relationships

Create the index:

PUT /test_user_nested
{
  "mappings": {
    "properties": {
      "user_id": {
        "type": "long"
      },
      "user_name": {
          "type": "text"
      },
      "address": {
        "type": "nested",
        "properties": {
          "id": {
            "type": "long"
          },
          "name": {
            "type": "text"
          },
          "address": {
            "type": "text"
          }
        }
      }
    }
  }
}

Add a document:

POST /test_user_nested/_doc
{
  "user_id": 1,
  "user_name": "张三",
  "address": [
    {
      "id": 1,
      "name": "公司",
      "address": "北京市海淀区"
    },
    {
      "id": 2,
      "name": "家",
      "address": "北京市昌平区"
    }
  ]
}

Nested query
Querying for address.id=1 && address.name="家" returns nothing; only address.id=1 && address.name="公司" finds the document:

POST /test_user_nested/_search
{
  "query": {
    "nested": {
      "path": "address",
      "query": {
        "bool": {
          "must": [
            {
              "match": {
                "address.id": 1
              }
            },
            {
              "match": {
                "address.name": "公司"
              }
            }
          ]
        }
      }
    }
  }
}

A nested query can be combined with ordinary queries in one compound query:

{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "user_id": 1
          }
        },
        {
          "nested": {
            "path": "address",
            "query": {
              "bool": {
                "must": [
                  {
                    "match": {
                      "address.id": 1
                    }
                  },
                  {
                    "match": {
                      "address.name": "公司"
                    }
                  }
                ]
              }
            }
          }
        }
      ]
    }
  }
}

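Nested sub-documents are stored separately

Only one document was indexed, yet _cat reports a document count of 3 (while /_count still returns 1). The reason is that each nested sub-document is stored inside ES as a separate hidden Lucene document; at query time ES joins them internally and presents them as a single document.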

/_cat/indices?v
health status index                    uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   test_user_nested         2tNlE4t8SqiLKDtW8B6Isg   1   1          3            0      9.7kb          4.8kb

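A user document with 100 addresses therefore occupies 101 Lucene documents internally: 1 parent user document and 100 child address documents.

Performance and limits of nested documents

index.mapping.nested_fields.limit The maximum number of nested fields per index, 50 by default
index.mapping.nested_objects.limit The maximum number of nested sub-documents per document, 10000 by default

join (parent-child documents)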

https://www.elastic.co/guide/en/elasticsearch/reference/current/parent-join.html

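Mapping parameters

dynamic: whether fields can be added dynamically

Elasticsearch Guide [7.17] » Mapping » Mapping parameters » dynamic
https://www.elastic.co/guide/en/elasticsearch/reference/7.17/dynamic.html

dynamic can be set on the mapping as a whole or on an individual field; sub-fields inherit the dynamic setting of their parent field or of the mapping.

The dynamic parameter controls whether new fields are added automatically. It accepts the following values; the default is true.

true New fields are added to the mapping automatically.
runtime New fields are added to the mapping as runtime fields. They are not indexed and are loaded from _source at query time.
false New fields are ignored. They are not indexed, not searchable, and not added to the mapping, but the data is still kept in _source, which holds the original JSON of the document, and is still returned in the _source of search hits.
strict If a new field is detected, an exception is thrown and the document is rejected. New fields must be added to the mapping explicitly.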
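
The difference between true, false, and strict can be simulated with a small Python function. Everything here is a hypothetical model (the function, the exception class, and the set-based mapping); runtime is left out because it needs query-time evaluation, which this sketch does not model.

```python
class StrictDynamicMappingError(Exception):
    """Hypothetical error standing in for the rejection under dynamic=strict."""


def index_with_dynamic(mapped_fields, doc, dynamic="true"):
    """Return (mapping fields after indexing, searchable fields, _source)."""
    mapping = set(mapped_fields)
    searchable = {}
    for field, value in doc.items():
        if field in mapping:
            searchable[field] = value
        elif dynamic == "true":
            mapping.add(field)         # new field is added to the mapping
            searchable[field] = value  # and indexed
        elif dynamic == "false":
            pass                       # ignored: not mapped, not indexed
        elif dynamic == "strict":
            raise StrictDynamicMappingError(f"unmapped field [{field}] is not allowed")
        else:
            raise ValueError(f"unsupported dynamic value: {dynamic}")
    return mapping, searchable, dict(doc)  # _source always keeps the full document


mapping, searchable, source = index_with_dynamic({"name"}, {"name": "a", "age": 3}, "false")
print(sorted(mapping), searchable, source)
# ['name'] {'name': 'a'} {'name': 'a', 'age': 3}
```

With dynamic set to false the age value is still returned in _source but cannot be searched, which matches the description of false above.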

Setting the dynamic parameter
Update the dynamic setting of the whole index:

PUT dynamic_mapping_test/_mapping
{
  "dynamic": false
}

enabled

index: whether the field is indexed

index controls whether the field is indexed. It defaults to true; if set to false, the field cannot be searched.

index_options

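index_options controls what information is recorded in the inverted index:

docs Records the doc id
freqs Records the doc id and term frequencies
positions Records the doc id, term frequencies, and term positions
offsets Records the doc id, term frequencies, term positions, and character offsets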

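null_value: replacing null values

https://www.elastic.co/guide/en/elasticsearch/reference/7.17/null-value.html

1. Null values are not indexed and not searchable
When a field is null, an empty array [], or an array of nulls (such as [null]), Elasticsearch treats the field as having no value and stores nothing for it in the inverted index. This means:

null cannot be searched directly with a term query (an illegal_argument_exception is thrown);
an exists query does not match these documents;
documents with an empty field are ignored by aggregations and statistics on that field.

2. Indexing and searching null values with null_value
When null_value is defined in the field mapping, Elasticsearch replaces explicit null values with the given placeholder at index time, which makes them indexable and searchable.

Note: the null_value must have the same type as the field itself. For example, a long field cannot have a string null_value.

For example: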

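PUT /my_index
{
  "mappings": {
    "properties": {
      "status_code": {
        "type": "keyword",
        "null_value": "NULL"  // replace null with "NULL"
      }
    }
  }
}

PUT my_index/_doc/1
{
  "status_code": null  // the value is replaced with "NULL"
}

PUT my_index/_doc/2
{
  "status_code": []   // an empty array is not replaced
}

Behavior after setting "null_value": "NULL":

When a document is indexed with the field set to null (an explicit null), the value actually indexed is "NULL".
When an empty array is indexed, it contains no explicit null, so nothing is replaced with "NULL".
A term query for "NULL" matches the documents that had an explicit null.
The original data is not affected; _source still shows the original null.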
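
The replacement rule can be modelled as a function from the field value in the document to the list of values that end up in the index. The function name is hypothetical, and treating an explicit null inside an array the same way as a top-level null is an assumption of this sketch.

```python
def indexed_values(value, null_value=None):
    """Values that get indexed for a field, given its mapping's null_value."""
    values = value if isinstance(value, list) else [value]
    result = []
    for item in values:
        if item is None:
            # An explicit null is replaced only when a null_value is configured.
            if null_value is not None:
                result.append(null_value)
        else:
            result.append(item)
    return result


print(indexed_values(None, null_value="NULL"))   # ['NULL']
print(indexed_values([], null_value="NULL"))     # []
print(indexed_values("200", null_value="NULL"))  # ['200']
print(indexed_values(None))                      # []
```

An empty list produces no indexed values because there is no explicit null to replace, which is exactly the difference between documents 1 and 2 above.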

copy_to

Elasticsearch Guide [7.17] » Mapping » Mapping parameters » copy_to
https://www.elastic.co/guide/en/elasticsearch/reference/7.17/copy-to.html

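copy_to copies the value of a field into another field. It is commonly used to merge the values of several fields into one field that is convenient to search.
The content is copied to the target field, which can then be used in query conditions, but the copied value does not appear in _source.

For example, copy the content of first_name and last_name into full_name, then search for the full name directly on the full_name field.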

PUT my-index-000001
{
  "mappings": {
    "properties": {
      "first_name": {
        "type": "text",
        "copy_to": "full_name" 
      },
      "last_name": {
        "type": "text",
        "copy_to": "full_name" 
      },
      "full_name": {
        "type": "text"
      }
    }
  }
}

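fields: indexing the same field in several ways

fields (multi-fields)
A sub-field is added under a field; it can have its own type and use a different analyzer.
The purpose of fields is to let a single field be indexed and searched in multiple ways, for example to make a Chinese field searchable by pinyin.

With dynamic mapping, a string that is neither a date nor a number is mapped to text, with a multi-field named keyword whose type is also keyword.
In the example below, name itself is text, while name.keyword is keyword.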

"name": {
    "type":"text",
    "fields":{
        "keyword":{
            "ignore_above":256,
            "type":"keyword"
        }
    }
}

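ignore_above: ignoring overly long values

https://www.elastic.co/guide/en/elasticsearch/reference/7.17/ignore-above.html
Strings longer than ignore_above are not indexed or stored and take no part in matching or aggregations, but they are still returned in full in _source.
For arrays of strings, ignore_above is applied to each element separately.
The default is 2147483647.

ignore_above does not truncate

Note: when a value exceeds ignore_above, the whole value is left out of the index. There is no automatic truncation; ES does not index the first ignore_above characters and drop the rest.

An example: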

PUT /test_ignore_above
{
  "mappings": {
    "properties": {
      "para": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 10
          }
        }
      },
      "title": {
        "type": "keyword",
        "ignore_above": 10
      }
    }
  }
}

PUT /test_ignore_above/_doc/1
{
  "title": "lesstitle",
  "para": "lesspara"
}

PUT /test_ignore_above/_doc/2
{
  "title": "longlongtitle",
  "para": "longlongpara"
}

After both documents are written, match_all finds both, but:
{ "query": { "exists": { "field": "title" } } } returns only document 1
{ "query": { "exists": { "field": "para.keyword" } } } also returns only document 1

{ "query": { "wildcard": { "title": "*title" } } } the wildcard match returns only document 1

{ "query": { "term": { "title": "lesstitle" } } } finds document 1
{ "query": { "term": { "title": "longlongtitle" } } } does not find document 2; hits.total is 0
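
The all-or-nothing behavior of ignore_above is easy to state in code. The Python function below is a hypothetical model of which values of a keyword field reach the index; it counts characters, as ignore_above does, and applies the limit per array element.

```python
def keyword_terms(value, ignore_above=2147483647):
    """Values of a keyword field that get indexed; over-long values are skipped whole."""
    values = value if isinstance(value, list) else [value]
    return [v for v in values if len(v) <= ignore_above]


print(keyword_terms("lesstitle", ignore_above=10))      # ['lesstitle'] (9 characters)
print(keyword_terms("longlongtitle", ignore_above=10))  # [] (13 characters, not truncated)
print(keyword_terms(["short", "waytoolongvalue"], ignore_above=10))  # ['short']
```

Since nothing is indexed for the over-long value, exists, term, and wildcard queries on that field all behave as if the field were missing from the document.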

With documents 1 and 2 both present, run a script query on title:

{
  "query": {
    "script": {
      "script": {
        "source": "doc['title'].value.contains(params.query)",
        "lang": "painless",
        "params": {
          "query": "tit"
        }
      }
    }
  }
}

It fails with:

{ "type": "illegal_state_exception", "reason": "A document doesn't have a value for a field! Use doc[<field>].size()==0 to check if a document is missing a field!" }

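The cause is that the title of document 2 exceeds ignore_above and was not indexed, so doc['title'] has no value for it. After document 2 is deleted, the error disappears.
Alternatively, change the script so that it tolerates documents without a title value: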

{
  "query": {
    "script": {
      "script": {
        "source": "if (doc.containsKey('title') && doc['title'].size() > 0) {return doc['title'].value.contains(params.query);} else {return false;}",
        "lang": "painless",
        "params": {
          "query": "tit"
        }
      }
    }
  }
}

Even so, document 2 still cannot be found by this query.

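analyzer: index-time analyzer

https://www.elastic.co/guide/en/elasticsearch/reference/7.17/analyzer.html

The analyzer parameter specifies the analyzer used when documents are indexed.
It tokenizes the field content at index time and decides how a document is split into terms and stored in the inverted index. For example, text fields use the standard analyzer by default, which splits text on word boundaries (roughly whitespace and punctuation) and lowercases it.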

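Built-in analyzers

Built-in analyzer reference
https://www.elastic.co/guide/en/elasticsearch/reference/7.17/analysis-analyzers.html

The built-in analyzers are:

standard The default analyzer; splits text into words and lowercases
simple Splits on non-letter characters (symbols are discarded) and lowercases
stop Lowercases and removes stop words (the, a, is)
whitespace Splits on whitespace; does not lowercase
keyword No tokenization; the input is emitted unchanged as a single term
pattern Splits with a regular expression, \W+ (non-word characters) by default
language Analyzers for more than 30 common languages (such as english and german)
Chinese analysis (provided by plugins): icu_analyzer, ik, thulac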

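search_analyzer: search-time analyzer

https://www.elastic.co/guide/en/elasticsearch/reference/7.17/search-analyzer.html

The search_analyzer parameter specifies the analyzer used at search time.
It tokenizes the query string, so that the way query terms are produced matches the way terms were produced at index time. If it is not set explicitly, the analyzer configured for the same field is used.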

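Best practice for analyzer and search_analyzer

Typical usage:
At index time: use fine-grained tokenization (such as ik_max_word) to extract as many potential keywords as possible.
At search time: use coarse-grained tokenization (such as ik_smart) to make search results more precise.

PUT /my_index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",  // analyzer used at index time
        "search_analyzer": "ik_smart"  // analyzer used at search time
      }
    }
  }
}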

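similarity: scoring algorithm

Elasticsearch lets the similarity algorithm be set per field.
The similarity parameter of a field defines the scoring algorithm used for that field.

Applicable field types: mainly text fields used for full-text search

Elasticsearch ships with three similarity algorithms that work out of the box without further configuration:

BM25 The default relevance algorithm, based on the Okapi BM25 model. It improves term-frequency saturation and document-length normalization and suits natural-language text retrieval.
classic The classic TF/IDF algorithm, which multiplies term frequency (TF) by inverse document frequency (IDF) and supports field-length normalization.
boolean For fields that do not need full-text relevance scoring; the score only reflects whether the term matches.

Customizing similarity parameters

Elasticsearch lets you tune the parameters of the built-in algorithms (such as k1 and b of BM25) through the similarity module:
1. Define a custom similarity named custom_bm25 in the index settings and change the k1 and b parameters of BM25.
2. Then set the similarity parameter of a text field to custom_bm25.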

PUT /my_index
{
  "settings": {
    "index": {
      "similarity": {
        "custom_bm25": {
          "type": "BM25",
          "k1": 1.2,     // controls term-frequency saturation
          "b": 0.75      // controls the strength of document-length normalization
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "similarity": "custom_bm25" // refer to the custom similarity
      }
    }
  }
}
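
What k1 and b control can be seen from the textbook Okapi BM25 formula for a single term. The Python sketch below implements that textbook form; it is not taken from the Elasticsearch or Lucene source, whose implementation may differ in constant factors, and the function and parameter names are chosen for this example.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Textbook Okapi BM25 score of one term in one document."""
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    length_norm = 1 - b + b * doc_len / avg_doc_len  # b scales the length penalty
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)  # k1 bounds the tf gain


once = bm25_term_score(tf=1, doc_len=100, avg_doc_len=100, n_docs=1000, doc_freq=10)
ten_times = bm25_term_score(tf=10, doc_len=100, avg_doc_len=100, n_docs=1000, doc_freq=10)
print(ten_times / once)  # well below 10: term frequency saturates
```

A larger k1 lets repeated terms keep adding to the score for longer, while b = 0 switches document-length normalization off entirely.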

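format: format of date fields

https://www.elastic.co/guide/en/elasticsearch/reference/7.17/mapping-date-format.html

If no format is specified for a date field, the default strict_date_optional_time||epoch_millis is used, i.e. either the strict_date_optional_time format or a millisecond timestamp.

Specifying the format of a date field: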

PUT my-index-000001
{
  "mappings": {
    "properties": {
      "date": {
        "type":   "date",
        "format": "yyyy-MM-dd"
      }
    }
  }
}

properties: sub-fields of nested structures

The sub-fields of an object or nested field

doc_values: the forward index that speeds up aggregations and sorting

Elasticsearch Guide [7.17] » Mapping » Mapping parameters » doc_values
https://www.elastic.co/guide/en/elasticsearch/reference/7.17/doc-values.html

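The inverted index provides full-text search, but it cannot support sorting and aggregations.
Aggregations, sorting, and script access to field values all need to read the field values of a document, which is a different access pattern from querying the inverted index. These operations cannot go from a term to the matching documents; they have to go from a document to the terms it contains.

Doc values are an on-disk data structure built when a document is indexed. They are independent of _source and take extra storage, so with both doc_values and _source enabled the original content of the field is stored twice.
Doc values are a forward index that maps doc -> field value, which makes it fast to find the terms a document contains and makes sorting and aggregations more efficient.
By default, ES stores doc values for almost all field types. The exceptions are the analyzed types text and annotated_text, which do not support doc values. If a field never needs to be sorted or aggregated on, doc values can be turned off for it.

Inverted index example, convenient for finding which documents contain a term. Every field has its own inverted index; only Field1 is shown:
| Field1 | Doc_1 | Doc_2 | Doc_3 |
| --- | --- | --- | --- |
| brown | X | X | |
| color | X | | X |
| dog | | | X |

Doc values (forward index) example, convenient for finding which terms a document contains:
| Doc | Field1 | Field2 |
| --- | --- | --- |
| Doc_1 | brown, color | meat |
| Doc_2 | brown | fruit, meat |
| Doc_3 | dog, color | meat |

If a later query aggregates on Field2, the doc values show directly that Doc_1 and Doc_3 have the same Field2 value, so they can be grouped quickly.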
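
The two access patterns can be contrasted with a small Python sketch that uses the data of the doc values table above. The dictionaries and function names are illustrative only; real doc values are a compressed columnar structure on disk.

```python
from collections import defaultdict

# doc -> field -> values, i.e. the forward (doc values) view of the example data
docs = {
    "Doc_1": {"Field1": ["brown", "color"], "Field2": ["meat"]},
    "Doc_2": {"Field1": ["brown"], "Field2": ["fruit", "meat"]},
    "Doc_3": {"Field1": ["dog", "color"], "Field2": ["meat"]},
}


def build_inverted_index(docs, field):
    """term -> set of doc ids: answers the question of which documents contain a term."""
    inverted = defaultdict(set)
    for doc_id, fields in docs.items():
        for term in fields.get(field, []):
            inverted[term].add(doc_id)
    return dict(inverted)


def terms_aggregation(docs, field, doc_ids):
    """Count terms per bucket by reading each document's values (forward access)."""
    counts = defaultdict(int)
    for doc_id in doc_ids:
        for term in docs[doc_id].get(field, []):
            counts[term] += 1
    return dict(counts)


print(build_inverted_index(docs, "Field1")["brown"])  # the two docs containing brown
print(terms_aggregation(docs, "Field2", docs.keys()))  # {'meat': 3, 'fruit': 1}
```

Searching walks from a term to documents, whereas the aggregation walks from each matching document to its values, which is why it needs the forward structure.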

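Fields that support doc values have them enabled by default. If you are sure a field will never be used for sorting, aggregations, or script access, set "doc_values": false to disable them and save disk space.
Note: a field with "doc_values": false does not support sorting, aggregations, or script access.
For example: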

PUT my-index-000001
{
  "mappings": {
    "properties": {
      "status_code": { 
        "type":       "keyword"
      },
      "session_id": { 
        "type":       "keyword",
        "doc_values": false
      }
    }
  }
}

illegal_argument_exception: Text fields are not optimised for operations

Querying ES from Java fails with:

Caused by: org.elasticsearch.ElasticsearchException: Elasticsearch exception [type=illegal_argument_exception, reason=Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [id] in order to load field data by uninverting the inverted index. Note that this can use significant memory.]
        at org.elasticsearch.ElasticsearchException.innerFromXContent(ElasticsearchException.java:485)
        at org.elasticsearch.ElasticsearchException.fromXContent(ElasticsearchException.java:396)
        at org.elasticsearch.ElasticsearchException.innerFromXContent(ElasticsearchException.java:426)

Cause:
The Java code pages through ES documents sorted by id asc:

SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
searchSourceBuilder.size(1000);
searchSourceBuilder.sort(new FieldSortBuilder("id").order(SortOrder.ASC));
SearchRequest searchRequest = new SearchRequest(index);
searchRequest.source(searchSourceBuilder);
SearchResponse searchResponse = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);

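However, the id field of the queried index is of type text.
The id field was not declared as keyword in the mappings beforehand, so when documents with a string id were written, ES inferred the type text. text fields do not support sorting, hence the error.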

{
  "id": {
    "type": "text",
    "fields": {
      "keyword": {
        "type": "keyword",
        "ignore_above": 256
      }
    }
  }
}

Solution:
Declare id as keyword in the mappings when creating the index.

eager_global_ordinals: global ordinals

Elasticsearch Guide [7.17] » Mapping » Mapping parameters » eager_global_ordinals
https://www.elastic.co/guide/en/elasticsearch/reference/7.17/eager-global-ordinals.html

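ES uses the doc values forward index to support aggregations, sorting, scripts, and other operations that need to read the field values of documents.
ES replaces the actual term values with global ordinals to compress doc values. This speeds up aggregations, saves the disk space used by doc values, and saves the memory used by the fielddata cache.

Global ordinals are built per shard, so they have to be rebuilt whenever the segments of a shard change, for example when newly written data creates a new segment or when segments are merged. If the segments do not change, the ordinals are built once and served from the cache afterwards, which suits historical data.

By default, global ordinals are built when an aggregation request that touches the field arrives, and the build happens at the very start of the query, before any filter is applied. For a field with very many distinct values the build becomes slow and seriously hurts aggregation latency, which matters in latency-sensitive scenarios. With eager_global_ordinals, the ordinals are rebuilt after every refresh and kept in memory, which removes the build time from the query.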

Removal of types starting with 7.x

Removal of mapping types
https://www.elastic.co/guide/en/elasticsearch/reference/current/removal-of-types.html
https://www.elastic.co/guide/en/elasticsearch/reference/7.17/removal-of-types.html

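The documents in an index could be grouped, and such a group was called a type. For example, a twitter index could hold data of type user and data of type tweet.

Before ES 6.x, several types could be created in one index.
ES 6.x allows only one type per index and announced that types would be removed completely in 7.x.
Starting with ES 7.x, support for several types in one index is dropped completely, including at the API level.

Why did ES remove types?

When ES was first released, an index was described as the counterpart of a database in a relational database, a type as a table, and a document as a row.
This analogy does not hold. In a relational database, columns with the same name in different tables are independent of each other, but in ES, fields with the same name in different types of the same index affect each other, because Lucene stores them internally in the same field.
Taking the twitter index with the two types user and tweet again: if both kinds of data have a user_name field, Lucene indexes both in the same user_name field, so the type of user_name must be identical in both.
Sometimes we want fields with the same name to have different types in different types of the index, for example deleted as a date in one type and as a boolean in another. This cannot be done.
In addition, storing different kinds of documents in the same index leads to sparse data, which conflicts with Lucene's ability to compress documents.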
