Elasticsearch-集群及运维

Elasticsearch 集群、配置、运维

CAT 运维 API

Compact and aligned text (CAT) APIs
https://www.elastic.co/guide/en/elasticsearch/reference/current/cat.html

JSON 参数格式的接口方便程序处理，但对人眼来说，我们熟悉的分行分列的 Linux 命令行输出格式更加友好，Elasticsearch 提供的 CAT(compact and aligned text) 接口就是以命令行输出格式返回的。

CAT 接口返回的结果只适合人眼阅读，如果需要程序处理，建议使用 JSON 返回格式的接口。

公共参数

https://www.elastic.co/guide/en/elasticsearch/reference/current/cat.html#common-parameters

?help 可用列

全部 CAT 接口都有 help 参数(query string)，可用来查询此接口提供的列信息及解释，例如 http://localhost:8200/_cat/nodes?help
返回的三列从左到右分别是：字段名、别名（缩写）、解释

curl "http://10.92.54.76:8200/_cat/indices?help"
health                           | h                              | current health status
status                           | s                              | open/close status
index                            | i,idx                          | index name
uuid                             | id,uuid                        | index uuid
pri                              | p,shards.primary,shardsPrimary | number of primary shards
rep                              | r,shards.replica,shardsReplica | number of replica shards
docs.count                       | dc,docsCount                   | available docs
docs.deleted                     | dd,docsDeleted                 | deleted docs
store.size                       | ss,storeSize                   | store size of primaries & replicas
pri.store.size                   |                                | store size of primaries

?v 详情

全部 CAT 接口都有 v 参数(query string)，用来开启详情输出，例如 localhost:9200/_cat/indices?v

?h 指定输出列

全部 CAT 接口都有 h 参数(query string)，用于指定输出列，例如 _cat/nodes?h=ip,port,heapPercent,name
h 支持通配符，例如 /_cat/thread_pool?h=ip,queue*

curl "http://localhost:8200/_cat/nodes?h=ip,port,heapPercent,name"
127.0.0.97 9300 17 es-7-master-2
127.0.0.95 9300 50 es-7-master-1
127.0.0.96 9300 38 es-7-master-0

?s 指定排序列

CAT 接口支持通过 ?s 设置排序字段，可通过列名或别名指定，可通过逗号分割指定多个排序列，默认是升序排序，列名后加 :desc 可指定降序排序，例如 s=column1,column2:desc,column3

GET _cat/templates?v=true&s=order:desc,index_patterns

?format 返回格式

CAT 接口支持通过 ?format 指定返回格式，默认是 text，支持的格式有：text, json, smile, yaml, cbor

curl "http://10.92.54.76:8200/_cat/indices"
green open user_0124 fRqr86C7QDWp2q1JzNL9DQ 1 1 104772294 2540905 760.7gb 379.7gb

curl "http://10.92.54.76:8200/_cat/indices?format=json"
[{"health":"green","status":"open","index":"user_0124","uuid":"fRqr86C7QDWp2q1JzNL9DQ","pri":"1","rep":"1","docs.count":"104772294","docs.deleted":"2540905","store.size":"760.7gb","pri.store.size":"379.7gb"}]

curl "http://10.92.54.76:8200/_cat/indices?format=yaml"
---
- health: "green"
  status: "open"
  index: "user_0124"
  uuid: "fRqr86C7QDWp2q1JzNL9DQ"
  pri: "1"
  rep: "1"
  docs.count: "104772294"
  docs.deleted: "2540905"
  store.size: "760.7gb"
  pri.store.size: "379.7gb"

/_cat 列出全部CAT接口

=^.^=
/_cat/allocation
/_cat/shards
/_cat/shards/{index}
/_cat/master
/_cat/nodes
/_cat/tasks
/_cat/indices
/_cat/indices/{index}
/_cat/segments
/_cat/segments/{index}
/_cat/count
/_cat/count/{index}
/_cat/recovery
/_cat/recovery/{index}
/_cat/health
/_cat/pending_tasks
/_cat/aliases
/_cat/aliases/{alias}
/_cat/thread_pool
/_cat/thread_pool/{thread_pools}
/_cat/plugins
/_cat/fielddata
/_cat/fielddata/{fields}
/_cat/nodeattrs
/_cat/repositories
/_cat/snapshots/{repository}
/_cat/templates
/_cat/ml/anomaly_detectors
/_cat/ml/anomaly_detectors/{job_id}
/_cat/ml/trained_models
/_cat/ml/trained_models/{model_id}
/_cat/ml/datafeeds
/_cat/ml/datafeeds/{datafeed_id}
/_cat/ml/data_frame/analytics
/_cat/ml/data_frame/analytics/{id}
/_cat/transforms
/_cat/transforms/{transform_id}

/_cat/health?v 查看集群状态

https://www.elastic.co/guide/en/elasticsearch/reference/current/cat-health.html

epoch      timestamp cluster  status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1643354450 07:20:50  es-7     green           3         3      2   1    0    0        0             0                  -                100.0%

/_cat/nodes?v 查看全部节点

https://www.elastic.co/guide/en/elasticsearch/reference/current/cat-nodes.html

ip           heap.percent ram.percent cpu load_1m load_5m load_15m node.role   master name
192.168.1.1           61          56   1    1.71    1.52     1.38 cdfhilmrstw *      es-7-master-1
192.168.1.2           28          80   0    1.71    1.52     1.38 cdfhilmrstw -      es-7-master-2
192.168.1.3           49          80   1    1.71    1.52     1.38 cdfhilmrstw -      es-7-master-0

/_cat/indices?v 查看所有索引

curl -X GET "localhost:9200/_cat/indices?v"
返回如下

health status index         uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   article      8wczi0SdTfqTjrOwqM5FOg   1   1          1            0      5.1kb          5.1kb
green  open   my_3shards   Ox5GfotcSjikFg08MHv-lQ   3   1     141527        36972      9.1gb          4.5gb

store.size 是包含全部分片和副本的总数据大小，比如索引是3分片1副本的，包括的就是全部3个分片及每个分片的一主一从副本的总和数据量

/_cat/count/index?v 查看索引的文档数

https://www.elastic.co/guide/en/elasticsearch/reference/current/cat-count.html

/_cat/count 查看全部索引的文档总数
/_cat/count/<target> 查看指定索引的文档个数

epoch      timestamp count
1643355132 07:32:12  104772294

/_cat/segments/index?v 查看索引的段数据

Elasticsearch Guide [7.17] » REST APIs » Compact and aligned text (CAT) APIs » cat segments API
https://www.elastic.co/guide/en/elasticsearch/reference/7.17/cat-segments.html

GET /_cat/segments 查看全部段数据
GET /_cat/segments/index 查看指定 index 的段数据

size.memory 是一个 segment 占用的堆内存大小。

字段说明：

GET /_cat/segments?help
index        | i,idx                 | index name                       
shard        | s,sh                  | shard name                       
prirep       | p,pr,primaryOrReplica | primary or replica               
ip           |                       | ip of node where it lives        
id           |                       | unique id of node where it lives 
segment      | seg                   | segment name                     
generation   | g,gen                 | segment generation               
docs.count   | dc,docsCount          | number of docs in segment        
docs.deleted | dd,docsDeleted        | number of deleted docs in segment
size         | si                    | segment size in bytes            
size.memory  | sm,sizeMemory         | segment memory in bytes          
committed    | ic,isCommitted        | is segment committed             
searchable   | is,isSearchable       | is segment searched              
version      | v,ver                 | version                          
compound     | ico,isCompound        | is segment compound

示例

GET /_cat/segments/index
index           shard prirep ip           segment generation docs.count docs.deleted    size size.memory committed searchable version compound
my_blog_3shards 0     p      192.168.1.1  _73            255    1185600            0   4.2gb       18244 true      true       8.10.1  false
my_blog_3shards 0     p      192.168.1.1  _da            478    1124265            0     4gb       18020 true      true       8.10.1  false
my_blog_3shards 0     p      192.168.1.1  _j9            693    1272105            0   4.6gb       18884 true      true       8.10.1  false
my_blog_3shards 0     p      192.168.1.1  _o8            872    1002084            0   3.6gb       17540 true      true       8.10.1  false
my_blog_3shards 0     p      192.168.1.1  _ug           1096    1126064            0     4gb       18404 true      true       8.10.1  false
my_blog_3shards 0     p      192.168.1.1  _zq           1286    1176128            0   4.2gb       18372 true      true       8.10.1  false
my_blog_3shards 0     p      192.168.1.1  _15i          1494     904718            0   3.2gb       17188 true      true       8.10.1  false
my_blog_3shards 0     p      192.168.1.1  _1b8          1700    1081184            0   3.9gb       18148 true      true       8.10.1  false
my_blog_3shards 0     p      192.168.1.1  _1hq          1934     915554            0   3.3gb       17012 true      true       8.10.1  true

/_cat/shards?v 查看分片状态

https://www.elastic.co/guide/en/elasticsearch/reference/current/cat-shards.html

/_cat/shards 查看全部分片信息
/_cat/shards/<target> 查看具体某个分片信息

index                   shard prirep state        docs   store ip            node
my_app_article_inf_0124 0     p      STARTED 104772294 379.7gb 127.0.64.96  es-7-master-0
my_app_article_inf_0124 0     r      STARTED 104772294 380.9gb 127.0.64.97  es-7-master-2
my_3shards_new          2     r      STARTED     47169   1.5gb 127.0.65.152 es-7-master-2
my_3shards_new          2     p      STARTED     47169   1.4gb 127.0.67.98  es-7-master-1
my_3shards_new          1     r      STARTED     47211   1.5gb 127.0.67.98  es-7-master-1
my_3shards_new          1     p      STARTED     47211   1.5gb 127.0.66.2   es-7-master-0
my_3shards_new          0     p      STARTED     47147   1.5gb 127.0.65.152 es-7-master-2
my_3shards_new          0     r      STARTED     47147   1.5gb 127.0.66.2   es-7-master-0

/_cat/allocation?v 查看磁盘使用

https://www.elastic.co/guide/en/elasticsearch/reference/7.17/cat-allocation.html

curl /_cat/allocation?v
shards disk.indices disk.used disk.avail disk.total disk.percent host         ip           node
   312       21.7gb   653.4gb      2.9tb      3.5tb           17 127.0.1.51  127.0.1.51  det-es-7-master-1
   312         26gb     2.8tb      603gb      3.4tb           82 127.0.0.116 127.0.0.116 det-es-7-master-0
   312        8.3gb   398.5gb      3.1tb      3.5tb           10 127.0.2.48  127.0.2.48  det-es-7-master-2

shards 当前节点上分配的主分片和副本分片总数。若集群中存在分片未分配（如磁盘空间不足或节点故障），该数值可能小于预期值。
disk.indices 索引数据占用的磁盘空间（仅包含已分配分片的 Lucene 文件）。
不包含未分配分片或 Translog 日志的磁盘占用
如果存在硬链接文件（如通过索引收缩、克隆生成的分片），此值可能重复计算

disk.used 节点操作系统的总磁盘使用量，包括 Elasticsearch 数据、操作系统文件及其他应用数据。
disk.avail 节点当前可用的磁盘空间。Elasticsearch 根据此值决定是否允许分配新分片（如启用磁盘水位线策略）。
disk.total 节点的磁盘总容量（已用 + 可用空间）
disk.percent 磁盘使用率百分比，计算公式为 disk.used / disk.total

Elasticsearch 数据迁移

Elastic数据迁移方法及注意事项
https://www.cnblogs.com/zhengchunyuan/p/9957851.html

Cluster 集群 API

Elasticsearch Guide [7.17] » REST APIs » Cluster APIs
https://www.elastic.co/guide/en/elasticsearch/reference/7.17/cluster.html

PUT /_cluster/settings 修改动态配置

Elasticsearch Guide [7.17] » REST APIs » Cluster APIs » Cluster update settings API
https://www.elastic.co/guide/en/elasticsearch/reference/7.17/cluster-update-settings.html

PUT /_cluster/settings

例如

PUT /_cluster/settings
{
  "persistent" : {
    "indices.recovery.max_bytes_per_sec" : "50mb"
  }
}

GET /_cluster/settings 查询集群配置

Elasticsearch Guide [7.17] » REST APIs » Cluster APIs » Cluster get settings API
https://www.elastic.co/guide/en/elasticsearch/reference/7.17/cluster-get-settings.html

GET /_cluster/settings

默认只返回显式修改过的配置，增加 include_defaults=true 参数可返回默认配置。

GET /_nodes/stats 查询节点统计信息

Elasticsearch Guide [7.17] » REST APIs » Cluster APIs » Nodes stats API
https://www.elastic.co/guide/en/elasticsearch/reference/7.17/cluster-nodes-stats.html

GET /_nodes/stats 查询全部节点的全部统计信息
GET /_nodes/<node_id>/stats 查询指定节点的全部统计信息
GET/_nodes/stats/<metric> 查询全部节点的指定统计信息，例如 /_nodes/stats/jvm 查全部节点的 jvm 信息
GET/_nodes/<node_id>/stats/<metric> 查询指定节点的指定统计信息

GET /_nodes/stats/jvm 查询节点的JVM信息

{
    "_nodes": {
        "total": 3,
        "successful": 3,
        "failed": 0
    },
    "cluster_name": "es-7",
    "nodes": {
        "JadPfrWATmu3br_YrOogIA": {
            "timestamp": 1648638934312,
            "name": "master-2",
            "transport_address": "127.0.0.1:9300",
            "host": "127.0.0.1",
            "ip": "127.0.0.1:9300",
            "roles": [
                "data",
                "data_cold",
                "data_content",
                "data_frozen",
                "data_hot",
                "data_warm",
                "ingest",
                "master",
                "ml",
                "remote_cluster_client",
                "transform"
            ],
            "attributes": {
                "ml.machine_memory": "17179869184",
                "ml.max_open_jobs": "512",
                "xpack.installed": "true",
                "ml.max_jvm_size": "8589934592",
                "transform.node": "true"
            },
            "jvm": {
                "timestamp": 1648638934312,
                "uptime_in_millis": 3454100628,
                "mem": {
                    "heap_used_in_bytes": 4971640320,
                    "heap_used_percent": 57,
                    "heap_committed_in_bytes": 8589934592,
                    "heap_max_in_bytes": 8589934592,
                    "non_heap_used_in_bytes": 185095632,
                    "non_heap_committed_in_bytes": 188481536,
                    "pools": {
                        "young": {
                            "used_in_bytes": 4160749568,
                            "max_in_bytes": 0,
                            "peak_used_in_bytes": 5146411008,
                            "peak_max_in_bytes": 0
                        },
                        "old": {
                            "used_in_bytes": 806977024,
                            "max_in_bytes": 8589934592,
                            "peak_used_in_bytes": 4854326784,
                            "peak_max_in_bytes": 8589934592
                        },
                        "survivor": {
                            "used_in_bytes": 3913728,
                            "max_in_bytes": 0,
                            "peak_used_in_bytes": 469762048,
                            "peak_max_in_bytes": 0
                        }
                    }
                },
                "threads": {
                    "count": 65,
                    "peak_count": 80
                },
                "gc": {
                    "collectors": {
                        "young": {
                            "collection_count": 24232,
                            "collection_time_in_millis": 698748
                        },
                        "old": {
                            "collection_count": 0,
                            "collection_time_in_millis": 0
                        }
                    }
                },
                "buffer_pools": {
                    "mapped": {
                        "count": 2658,
                        "used_in_bytes": 2496772173825,
                        "total_capacity_in_bytes": 2496772173825
                    },
                    "direct": {
                        "count": 52,
                        "used_in_bytes": 9262228,
                        "total_capacity_in_bytes": 9262227
                    },
                    "mapped - 'non-volatile memory'": {
                        "count": 0,
                        "used_in_bytes": 0,
                        "total_capacity_in_bytes": 0
                    }
                },
                "classes": {
                    "current_loaded_count": 24047,
                    "total_loaded_count": 24085,
                    "total_unloaded_count": 38
                }
            }
        }
    }
}

GET /_nodes/stats/indices 查看节点的索引统计信息

GET /_nodes/stats/indices 查看全部节点的索引统计信息
GET /_nodes/JadPfrWATmu3br_YrOogIA/stats/indices 查看指定节点的索引统计信息

GET /_nodes/JadPfrWATmu3br_YrOogIA/stats/indices
{
    "_nodes": {
        "total": 1,
        "successful": 1,
        "failed": 0
    },
    "cluster_name": "es-7",
    "nodes": {
        "JadPfrWATmu3br_YrOogIA": {
            "timestamp": 1649384639759,
            "name": "es-7-master-2",
            "transport_address": "127.0.0.1:9300",
            "host": "127.0.0.1",
            "ip": "127.0.0.1:9300",
            "roles": [
                "data",
                "data_cold",
                "data_content",
                "data_frozen",
                "data_hot",
                "data_warm",
                "ingest",
                "master",
                "ml",
                "remote_cluster_client",
                "transform"
            ],
            "attributes": {
                "ml.machine_memory": "17179869184",
                "ml.max_open_jobs": "512",
                "xpack.installed": "true",
                "ml.max_jvm_size": "8589934592",
                "transform.node": "true"
            },
            "indices": {
                "docs": {
                    "count": 672675191,
                    "deleted": 0
                },
                "shard_stats": {
                    "total_count": 2
                },
                "store": {
                    "size_in_bytes": 2555649098412,
                    "total_data_set_size_in_bytes": 2555649098412,
                    "reserved_in_bytes": 0
                },
                "indexing": {
                    "index_total": 672675191,
                    "index_time_in_millis": 859673193,
                    "index_current": 0,
                    "index_failed": 0,
                    "delete_total": 0,
                    "delete_time_in_millis": 0,
                    "delete_current": 0,
                    "noop_update_total": 0,
                    "is_throttled": false,
                    "throttle_time_in_millis": 8124403
                },
                "get": {
                    "total": 0,
                    "time_in_millis": 0,
                    "exists_total": 0,
                    "exists_time_in_millis": 0,
                    "missing_total": 0,
                    "missing_time_in_millis": 0,
                    "current": 0
                },
                "search": {
                    "open_contexts": 0,
                    "query_total": 557,
                    "query_time_in_millis": 913940,
                    "query_current": 0,
                    "fetch_total": 147,
                    "fetch_time_in_millis": 1528,
                    "fetch_current": 0,
                    "scroll_total": 0,
                    "scroll_time_in_millis": 0,
                    "scroll_current": 0,
                    "suggest_total": 0,
                    "suggest_time_in_millis": 0,
                    "suggest_current": 0
                },
                "merges": {
                    "current": 0,
                    "current_docs": 0,
                    "current_size_in_bytes": 0,
                    "total": 11405,
                    "total_time_in_millis": 2334432062,
                    "total_docs": 1457464133,
                    "total_size_in_bytes": 5626849591443,
                    "total_stopped_time_in_millis": 3821957,
                    "total_throttled_time_in_millis": 848917760,
                    "total_auto_throttle_in_bytes": 31632531
                },
                "refresh": {
                    "total": 9288,
                    "total_time_in_millis": 45898273,
                    "external_total": 643,
                    "external_total_time_in_millis": 8947460,
                    "listeners": 0
                },
                "flush": {
                    "total": 8442,
                    "periodic": 8161,
                    "total_time_in_millis": 373382635
                },
                "warmer": {
                    "current": 0,
                    "total": 641,
                    "total_time_in_millis": 64
                },
                "query_cache": {
                    "memory_size_in_bytes": 0,
                    "total_count": 0,
                    "hit_count": 0,
                    "miss_count": 0,
                    "cache_size": 0,
                    "cache_count": 0,
                    "evictions": 0
                },
                "fielddata": {
                    "memory_size_in_bytes": 0,
                    "evictions": 0
                },
                "completion": {
                    "size_in_bytes": 0
                },
                "segments": {
                    "count": 662,
                    "memory_in_bytes": 12082184,
                    "terms_memory_in_bytes": 7308480,
                    "stored_fields_memory_in_bytes": 4225008,
                    "term_vectors_memory_in_bytes": 0,
                    "norms_memory_in_bytes": 0,
                    "points_memory_in_bytes": 0,
                    "doc_values_memory_in_bytes": 548696,
                    "index_writer_memory_in_bytes": 0,
                    "version_map_memory_in_bytes": 0,
                    "fixed_bit_set_memory_in_bytes": 0,
                    "max_unsafe_auto_id_timestamp": -1,
                    "file_sizes": {}
                },
                "translog": {
                    "operations": 0,
                    "size_in_bytes": 110,
                    "uncommitted_operations": 0,
                    "uncommitted_size_in_bytes": 110,
                    "earliest_last_modified_age": 2840440564
                },
                "request_cache": {
                    "memory_size_in_bytes": 71040,
                    "evictions": 0,
                    "hit_count": 39,
                    "miss_count": 242
                },
                "recovery": {
                    "current_as_source": 0,
                    "current_as_target": 0,
                    "throttle_time_in_millis": 0
                }
            }
        }
    }
}

GET /_cluster/stats 查询集群统计信息

Elasticsearch Guide [7.17] » REST APIs » Cluster APIs » Cluster stats API
https://www.elastic.co/guide/en/elasticsearch/reference/7.17/cluster-stats.html

配置 Elasticsearch

Elasticsearch Guide [7.17] » Set up Elasticsearch » Configuring Elasticsearch
https://www.elastic.co/guide/en/elasticsearch/reference/7.17/settings.html#dynamic-cluster-setting

配置 Elasticsearch

配置文件位置

Elasticsearch 有三个配置文件：

elasticsearch.yml 用于配置 Elasticsearch
jvm.options 用于配置 Elasticsearch JVM 设置
log4j2.properties 用于配置 Elasticsearch 日志记录

我启动 es 容器使用的配置文件 elasticsearch.yml 如下：

# 开启跨域
http.cors.enabled: true
# 允许任何域访问
http.cors.allow-origin: "*"

# 节点名称
node.name: "node-1"
# 集群名称
cluster.name: "docker-es"
# 节点ip 单机默认回环地址 集群必须绑定真实ip
network.host: 0.0.0.0

默认情况下，Elastic 只允许本机访问，如果需要远程访问，可以修改 Elastic 安装目录的 config/elasticsearch.yml 文件，去掉 network.host 的注释，将它的值改成 0.0.0.0，然后重新启动 Elastic。

配置文件格式

ES 的配置文件是 yaml 格式

环境变量替换

配置文件中可以通过 ${...} 引用环境变量

集群和节点配置

静态配置，只能在集群启动前在 elasticsearch.yml 中配置
动态配置，可通过集群配置 API PUT /_cluster/settings 在运行中动态修改，也可以在 elasticsearch.yml 中配置

动态配置

动态配置分为：

临时的(transient) 集群重启后失效
持久的(persistent) 集群重启后还在

可通过配置 API 给配置项赋值 null 来重置持久的或临时的配置项。

如果在多处配置了相同的配置项，优先级如下：

transient 临时配置项
persistent 持久配置项
elasticsearch.yml 中的配置
默认值

transient 配置项优先级最高，可以通过 transient 配置项覆盖 persistent 配置项或 elasticsearch.yml 中的配置。

ES 不建议再使用 transient 临时配置项，因为在集群不稳定时临时配置项可能无故消失，导致潜在的问题。

静态配置

静态配置只能集群启动前在 elasticsearch.yml 中配置
静态配置在集群的每个节点都需要配置

集群分片分配和路由配置

Elasticsearch Guide [7.17] » Set up Elasticsearch » Configuring Elasticsearch » Cluster-level shard allocation and routing settings
https://www.elastic.co/guide/en/elasticsearch/reference/7.17/modules-cluster.html#disk-based-shard-allocation

磁盘分片分配设置

磁盘分配器通过 高水位(high watermark) 和 低水位(low watermark) 阈值控制分片数据在磁盘的分配，主要目标是要确保每个节点的磁盘使用率都不超过高水位，或者只是临时超过高水位。如果一个节点的磁盘使用率超过了高水位，ES 会将此节点上的分片数据移动到集群中的其他节点

注意：节点的磁盘使用率临时超过高水位值是正常的。

节点的磁盘使用率超过低水位后，分配器会停止往此节点分配分片数据，以使节点远离高水位。如果全部节点都超过了低水位，ES 就无法再分配分片数据，同时也无法在节点间移动分片数据。所以要始终保证集群中有几个节点的磁盘使用率低于低水位。

如果节点的磁盘填充速度非常快，ES 可能还来不及将分片数据移动到其他节点，可能导致磁盘被完全占满。为了避免这种情况，如果节点的磁盘使用率超过 **洪水位(flood-stage watermark)**，ES 会禁止向有分片在此节点的索引中写入数据。ES 会继续将此节点的分片数据移动到其他节点，一旦磁盘使用率低于高水位，就会解除写阻塞。

cluster.routing.allocation.disk.threshold_enabled 是否开启磁盘分配器水位阈值检查，默认 true，设为 false 禁用检查。
cluster.routing.allocation.disk.watermark.low 磁盘使用率低水位，默认值 85%，即磁盘使用率超过 85% 后 ES 就停止向此节点写入分片数据。也可以设为绝对值，例如 500mb，表示磁盘空间低于 500mb 后就禁止写入。
cluster.routing.allocation.disk.watermark.high 磁盘使用率高水位，默认值 90%，即磁盘使用率超过 90% 后 ES 会尝试将此节点的分片数据移动到其他节点。也可以设为绝对值，例如 500mb，表示磁盘空间低于 500mb 后就尝试向外移动分片数据。
cluster.routing.allocation.disk.watermark.flood_stage 磁盘使用率洪水位，默认值 95%，即磁盘使用率超过 95% 后 ES 将有分片在此节点的索引设为只读可删 index.blocks.read_only_allow_delete，这是防止节点磁盘被占满的最终手段，磁盘使用率低于高水位后，就会自动解除写阻塞。

注意：不能在这几个配置项中混合使用百分比和绝对值，因为 ES 内部要验证其合理性，保证低水位低于高水位，高水位低于洪水位

例如，通过 API 解除指定索引的只读可删限制

PUT /my-index-000001/_settings
{
  "index.blocks.read_only_allow_delete": null
}

cluster.info.update.interval ES 检查磁盘使用率的时间间隔，默认 30s

429 disk usage exceeded flood-stage watermark

问题：
es 插入数据报错

{
    "error":{
        "root_cause":[
            {
                "type":"cluster_block_exception",
                "reason":"index [my_index] blocked by: [TOO_MANY_REQUESTS/12/disk usage exceeded flood-stage watermark, index has read-only-allow-delete block];"
            }
        ],
        "type":"cluster_block_exception",
        "reason":"index [my_index] blocked by: [TOO_MANY_REQUESTS/12/disk usage exceeded flood-stage watermark, index has read-only-allow-delete block];"
    },
    "status":429
}

原因：
磁盘使用率超过 95% 洪水位，es 禁止写入数据到index

解决：
1、关闭磁盘水位阈值检查，可通过 API 不停机动态更新，或停机后修改 elasticsearch.yml

PUT /_cluster/settings
{
    "transient": {
        "cluster.routing.allocation.disk.threshold_enabled": false
    }
}

或者提高各个水位阈值，同样也可以通过 API 更新或配置文件更新，例如更新低水位到 100gb，高水位到 50gb，洪水位到 10gb，设置信息刷新间隔为 1 分钟

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "100gb",
    "cluster.routing.allocation.disk.watermark.high": "50gb",
    "cluster.routing.allocation.disk.watermark.flood_stage": "10gb",
    "cluster.info.update.interval": "1m"
  }
}

2、关闭水位检查或者调高水位值后，还是不能写入，因为各个index已经被加上了只读可删的 block，需要手动去掉 block，下面的 API 通过 _all 去掉全部索引的 block，也可以指定 index

PUT _all/_settings
{
    "index.blocks.read_only_allow_delete": null
}

https://stackoverflow.com/questions/50609417/elasticsearch-error-cluster-block-exception-forbidden-12-index-read-only-all

Elasticsearch 日志配置

Elasticsearch Guide [7.17] » Set up Elasticsearch » Configuring Elasticsearch » Logging
https://www.elastic.co/guide/en/elasticsearch/reference/7.17/logging.html

elasticsearch 暴露了三个变量
${sys:es.logs.base_path} 等于 elasticsearch.yml 中的 path.logs 目录
${sys:es.logs.cluster_name} 集群名称
${sys:es.logs.node_name} 结点名称
供 log4j2.properties 配置文件使用

elasticsearch 慢日志

es 中有两种慢日志：
索引慢日志(index slow logs) elasticsearch_index_indexing_slowlog.log
搜索慢日志(search slow logs) elasticsearch_index_search_slowlog.log

Logging configuration
https://www.elastic.co/guide/en/elasticsearch/reference/current/logging.html

Circuit Breaker 断路器配置

Elasticsearch Guide [7.17] » Set up Elasticsearch » Configuring Elasticsearch » Circuit breaker settings
https://www.elastic.co/guide/en/elasticsearch/reference/7.17/circuit-breaker.html

Elasticsearch 有很多断路器(circuit breaker)，用于阻止各种操作可能导致的 OOM 内存溢出。每个断路器都有一个阈值指定最多可以使用多少内存。此外，还有一个父断路器指定了所有断路器最多可以使用多少内存。

Request circuit breaker 请求断路器

request 断路器用于限制执行单个请求需要的内存，比如一个聚合请求可能会用 JVM 内存来做一些汇总计算。

indices.breaker.request.limit Dynamic 参数，请求允许使用的最大内存，默认 JVM 堆内存的 60%

429 circuit_breaking_exception Data too large

{
    "error": {
        "root_cause": [
            {
                "type": "circuit_breaking_exception",
                "reason": "[parent] Data too large, data for [<http_request>] would be [128107988/122.1mb], which is larger than the limit of [123273216/117.5mb], real usage: [128107696/122.1mb], new bytes reserved: [292/292b], usages [request=0/0b, fielddata=0/0b, in_flight_requests=292/292b, accounting=2309/2.2kb]",
                "bytes_wanted": 128107988,
                "bytes_limit": 123273216,
                "durability": "PERMANENT"
            }
        ],
        "type": "circuit_breaking_exception",
        "reason": "[parent] Data too large, data for [<http_request>] would be [128107988/122.1mb], which is larger than the limit of [123273216/117.5mb], real usage: [128107696/122.1mb], new bytes reserved: [292/292b], usages [request=0/0b, fielddata=0/0b, in_flight_requests=292/292b, accounting=2309/2.2kb]",
        "bytes_wanted": 128107988,
        "bytes_limit": 123273216,
        "durability": "PERMANENT"
    },
    "status": 429
}

原因
jvm 堆内存不够当前查询加载数据所以会报 data too large, 请求被熔断，indices.breaker.request.limit 默认为 jvm heap 的 60%
我的 es 的堆大小设为 128M ，只在里面创建了一个 index，插入了两个 document，每个只有一句话，就报这个错了。看来还需要给 es 多分配写内存。

Field data circuit breaker 列缓存断路器

field data 断路器可以估算将一个 field 加载到列数据缓存中需要占用多少内存，如果此加载操作会引起内存使用超过预定义的阈值，就会返回错误。

indices.breaker.fielddata.limit Dynamic 参数，列缓存允许使用的最大内存，默认是 JVM 堆内存的 40%

node query cache 节点查询缓存

Elasticsearch Guide [7.17] » Set up Elasticsearch » Configuring Elasticsearch » Node query cache settings
https://www.elastic.co/guide/en/elasticsearch/reference/7.17/query-cache.html

filter 类型的查询结果会缓存在节点的查询缓存中，每个节点都有一个全部索引共用的查询缓存，使用 LRU 淘汰策略：当缓存满的时候最早的查询结果会被删除。此缓存的内容无法被查看。

query cache 属于 node-level 缓存，能够被当前节点的所有 shard 所共享。

从 5.1.1 版本开始，term filter 查询不再被缓存，因为倒排索引本身就是 term 到文档的一种缓存，本身就很快，如果缓存 term 查询反而会冲掉 LRU 中真正需要被缓存的结果
https://www.elastic.co/blog/elasticsearch-5-1-1-released

Term queries are no longer cached. The reason for this is twofold: term queries are almost always fast, and queries for thousands of terms can trash the query cache history, preventing more expensive queries from being cached.

默认节点查询缓存可存储最多 10000 条查询结果，最多占用 10% JVM 堆内存。

indices.queries.cache.size Static 配置，节点级配置，filter 查询缓存的最大值，默认是 JVM 堆内存的 10%。可配置为 JVM 堆内存的百分比如 10%，或者绝对值比如 512mb
index.queries.cache.enabled Static 配置，索引级配置，是否开启索引的查询缓存，默认 true 开启。

shard request chache 分片请求缓存

Elasticsearch Guide [7.17] » Set up Elasticsearch » Configuring Elasticsearch » Shard request cache settings
https://www.elastic.co/guide/en/elasticsearch/reference/7.17/shard-request-cache.html

在一个或多个索引上进行检索时，所涉及的每个分片会在本地执行检索然后将局部结果返回给协调节点，协调节点将这些分片的局部结果组合成完整的全局检索结果

ES 会缓存每个分片上的检索结果，使得高频检索请求可以立即返回。这个缓存也是 LRU 缓存，满的时候最早的结果被删除。

缓存key是这个查询的DSL语句。所以如果要命中缓存查询生成的DSL一定要一样，这里的一样是指DSL这个字符串一样。

当更新文档、更新 mapping 时，缓存会自动失效

indices.requests.cache.size 默认值最大值是 JVM 堆内存的 2%

打开/关闭分片请求缓存

分片请求缓存默认是开启的，创建索引时可指定关闭缓存：

PUT /my-index-000001
{
  "settings": {
    "index.requests.cache.enable": false
  }
}

也可以动态开启/关闭已有索引的缓存：

PUT /my-index-000001/_settings
{ "index.requests.cache.enable": true }

查看请求缓存使用量

/index/_stats
"request_cache": {
    "memory_size_in_bytes": 168128,
    "evictions": 0,
    "hit_count": 64,
    "miss_count": 466
}

fielddata cache 列缓存配置

Elasticsearch Guide [7.17] » Set up Elasticsearch » Configuring Elasticsearch » Field data cache settings
https://www.elastic.co/guide/en/elasticsearch/reference/7.17/modules-fielddata.html

fielddata 缓存包含字段数据和全局序号 global ordinals，用于支持字段的聚合查询，是位于 JVM 堆内存上的一种缓存结构。

indices.fielddata.cache.size Static 配置，列缓存的最大值，默认无限制，比如堆内存的 38%，或者绝对值 12GB，这个值应该小于 indices.breaker.fielddata.limit

Elasticsearch 线程池配置

Elasticsearch Guide [7.17] » Set up Elasticsearch » Configuring Elasticsearch » Thread pools
https://www.elastic.co/guide/en/elasticsearch/reference/7.17/modules-threadpool.html

查看结点状态，里面有线程池配置数据

curl -XGET 'http://localhost:9200/_nodes/stats?pretty'

2.0 之后 5.0 之前，可通过 http api 动态修改线程池大小，无需重启，5.0 之后不能动态修改了，必须重启。

curl -XPUT 'localhost:9200/_cluster/settings' -d '{
    "transient": {
        "threadpool.index.type": "fixed",
        "threadpool.index.size": 100,
        "threadpool.index.queue_size": 500
    }
}'

elasticsearch 的3种线程池类型

elasticsearch 线程池的线程按照源码的实现来看分为 fixed 固定大小线程池， fixed_auto_queue_size 固定大小带阻塞队列的线程池和 scaling 可变大小线程池三种，其中 fixed_auto_queue_size 是实现类型，可能在之后的版本中去除。

fixed 固定大小线程池
fixed_auto_queue_size 固定大小带阻塞队列的线程池
scaling 可变大小线程池

search 线程池

用作 count/search/suggest 操作，线程池类型是 fixed_auto_queue_size ，线程池默认大小为 int((# of available_processors * 3) / 2) + 1，queue_size 默认大小为 1000
配置示例

thread_pool:
    search:
        size: 30
        queue_size: 500
        min_queue_size: 10
        max_queue_size: 1000
        auto_queue_frame_size: 2000
        target_response_time: 1s

write 线程池

用作 index/delete/update 及 bulk 批量操作，线程池类型是 fixed ，默认大小为 # of available processors，允许设置的最大值是 1 + # of available processors， queue_size 默认大小为 200，
配置示例

thread_pool:
    write:
        size: 30
        queue_size: 1000

processors 处理器个数设置

线程池配置中的 # of available processors 指的是自动检测到的 逻辑处理器 个数，等于

# 查看逻辑CPU的个数
cat /proc/cpuinfo| grep "processor"| wc -l

比如 thread_pool.write.size 要求最大值是 1 + # of available processors ，如果逻辑cpu个数为4，则线程池最大为5，如果配置项中指定了比5更大的值会报错

java.lang.IllegalArgumentException: Failed to parse value [30] for setting [thread_pool.write.size] must be <= 5

如果确定想改为更大的值，可以在配置文件 elasticsearch.yml 中手动指定 processors 个数，例如

processors: 2

429 es_rejected_execution_exception

429/Too Many Requests

写入时报错 es_rejected_execution_exception

{
    "error":{
        "root_cause":[
            {
                "type":"remote_transport_exception",
                "reason":"[ZKjMEXP][127.0.0.1:9300][indices:data/write/bulk[s][p]]"
            }
        ],
        "type":"es_rejected_execution_exception",
        "reason":"rejected execution of processing of [2026943][indices:data/write/bulk[s][p]]: request: BulkShardRequest [[user_profile_indicator_data][0]] containing [index {[user_profile_indicator_data][indicator_base_info][5725976], source[n/a, actual length: [2.4kb], max length: 2kb]}], target allocation id: IbC5nk5CSOO9ReABdDvcvA, primary term: 1 on EsThreadPoolExecutor[name = ZKjMEXP/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@1f44c59d[Running, pool size = 4, active threads = 4, queued tasks = 200, completed tasks = 1442357]]"
    },
    "status":429
}

查询时报错 EsRejectedExecutionException

[2020-05-18T15:48:31,645][DEBUG][o.e.a.s.TransportSearchAction] [ZKjMEXP] All shards failed for phase: [query]
org.elasticsearch.ElasticsearchException$1: rejected execution of org.elasticsearch.common.util.concurrent.TimedRunnable@4475dcce on QueueResizingEsThreadPoolExecutor[name = ZKjMEXP/search, queue capacity = 100, min queue capacity = 100, max queue capacity = 1000, frame size = 1000, targeted response rate = 1s, task execution EWMA = 21.5ms, adjustment amount = 50, org.elasticsearch.common.util.concurrent.QueueResizingEsThreadPoolExecutor@1328861b[Running, pool size = 30, active threads = 30, queued tasks = 384, completed tasks = 19350]]
        at org.elasticsearch.ElasticsearchException.guessRootCauses(ElasticsearchException.java:657) ~[elasticsearch-6.8.7.jar:6.8.7]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:131) ~[elasticsearch-6.8.7.jar:6.8.7]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:259) ~[elasticsearch-6.8.7.jar:6.8.7]
        at org.elasticsearch.action.search.InitialSearchPhase.onShardFailure(InitialSearchPhase.java:100) ~[elasticsearch-6.8.7.jar:6.8.7]
        at org.elasticsearch.action.search.InitialSearchPhase.access$100(InitialSearchPhase.java:48) ~[elasticsearch-6.8.7.jar:6.8.7]
        at org.elasticsearch.action.search.InitialSearchPhase$2.lambda$onFailure$1(InitialSearchPhase.java:220) ~[elasticsearch-6.8.7.jar:6.8.7]
        at org.elasticsearch.action.search.InitialSearchPhase$1.doRun(InitialSearchPhase.java:187) [elasticsearch-6.8.7.jar:6.8.7]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.8.7.jar:6.8.7]
        at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:41) [elasticsearch-6.8.7.jar:6.8.7]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) [elasticsearch-6.8.7.jar:6.8.7]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.8.7.jar:6.8.7]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_191]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_191]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]
Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.common.util.concurrent.TimedRunnable@4475dcce on QueueResizingEsThreadPoolExecutor[name = ZKjMEXP/search, queue capacity = 100, min queue capacity = 100, max queue capacity = 1000, frame size = 1000, targeted response rate = 1s, task execution EWMA = 21.5ms, adjustment amount = 50, org.elasticsearch.common.util.concurrent.QueueResizingEsThreadPoolExecutor@1328861b[Running, pool size = 30, active threads = 30, queued tasks = 384, completed tasks = 19350]]
        at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:48) ~[elasticsearch-6.8.7.jar:6.8.7]

es_rejected_execution_exception[bulk] 是批量队列错误。当对 Elasticsearch 集群的请求数超过批量队列大小 (threadpool.bulk.queue_size) 时，会发生此问题。每个节点上的批量队列可以容纳 50 到 200 个请求，具体取决于您使用的 Elasticsearch 版本。队列已满时，将拒绝新请求。

其实，Elasticsearch 分别对不同的操作【例如：index、bulk、get 等】提供不同的线程池，并设置线程池的线程个数与排队任务上限。可以在数据索引所在节点的 settings 中查看

这里面，有两种类型的线程池，一种是 fixing，一种是 scaling，其中 fixing 是固定大小的线程池，默认是 core 个数的 5 倍，也可以指定大小，scaling 是动态变化的线程池，可以设置最大值、最小值。

解决：
在不增加节点的情况下，把节点的线程池设置大一点、队列上限设置大一点，就可以处理更多的请求了。这个方法需要改变 Elasticsearch 集群的配置，然后重启集群，但是一般情况下会有风险，因为节点的硬件配置【内存、CPU】没有变化，单纯增加线程池，会给节点带来压力，可能会宕机，谨慎采用。配置信息参考如下：

-- 修改 elasticsearch.yml 配置文件
threadpool.bulk.type: fixed
threadpool.bulk.size: 64
threadpool.bulk.queue_size: 1500

Elasticsearch 中的 429 错误 es_rejected_execution_exception
https://www.playpi.org/2017042601.html

Elasticsearch 高级配置(JVM配置)

Elasticsearch Guide [7.17] » Set up Elasticsearch » Configuring Elasticsearch » Advanced configuration
https://www.elastic.co/guide/en/elasticsearch/reference/current/advanced-configuration.html

不要直接修改 /usr/share/elasticsearch/config/jvm.options，将自定义 JVM 参数放到 /usr/share/elasticsearch/config/jvm.options.d/ 目录中

ES 默认根据节点的角色和总内存大小自动配置 JVM 堆内存大小，建议直接使用默认值。

-Xms和-Xmx配置必须相同

-Xms和-Xmx配置的内存大小必须相同，避免 resize，否则 es 启动会报错

ERROR: [1] bootstrap checks failed. You must address the points described in the following [1] lines before starting Elasticsearch.
bootstrap check failure [1] of [1]: initial heap size [8589934592] not equal to maximum heap size [17179869184]; this can cause resize pauses

-Xms 和 -Xmx 建议不要超过总内存的 50%，因为除了 JVM，es还有其他占用内存的地方。

young gc 频繁

{"type": "server", "timestamp": "2022-03-05T18:28:59,824Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "es-7", "node.name": "es-7-master-0", "message": "[gc][1347730] overhead, spent [311ms] collecting in the last [1s]", "cluster.uuid": "-YDlZAJIQxKModujOTof2g", "node.id": "K_g4Ids0Rz-SHHG2jhp9dQ"  }
{"type": "server", "timestamp": "2022-03-05T18:48:00,164Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "es-7", "node.name": "es-7-master-0", "message": "[gc][1348869] overhead, spent [324ms] collecting in the last [1s]", "cluster.uuid": "-YDlZAJIQxKModujOTof2g", "node.id": "K_g4Ids0Rz-SHHG2jhp9dQ"  }

发现和集群形成

Elasticsearch Guide [7.17] » Set up Elasticsearch » Discovery and cluster formation
https://www.elastic.co/guide/en/elasticsearch/reference/7.17/modules-discovery.html

包括发现节点、选举 master、形成集群、发布集群状态。

ES 7.x 之前，Bully 算法（Zen Discovery 集群协调子系统）：根据节点的ID大小来判定谁是leader，简单粗暴，可能出现选举时集群暂时不可用、以及无法选出master的问题
ES 7.x 使用了新的选主算法，类Raft算法

ElasticSearch-新老选主算法对比
https://yemilice.com/2021/06/16/elasticsearch-%E6%96%B0%E8%80%81%E9%80%89%E4%B8%BB%E7%AE%97%E6%B3%95%E5%AF%B9%E6%AF%94/

Quorum(多数派)选举

Elasticsearch Guide [7.17] » Set up Elasticsearch » Discovery and cluster formation » Quorum-based decision making
https://www.elastic.co/guide/en/elasticsearch/reference/7.17/modules-discovery-quorums.html

MasterNotDiscoveredException

es 启动报错：
“type”: “server”, “timestamp”: “2022-03-30T02:08:52,263Z”, “level”: “WARN”, “component”: “r.suppressed”, “cluster.name”: “es-7”, “node.name”: “es-7-master-0”, “message”: “path: /_cluster/health, params: {wait_for_status=green, timeout=1s}”,
“stacktrace”: [“org.elasticsearch.discovery.MasterNotDiscoveredException: null”

原因：
从 6.x 升级到 7.x 后，需要在环境变量或 elasticsearch.yml 中配置 cluster.initial_master_nodes 指定 master 节点
例如

# 三台实例保证相同
cluster.name: my-cluster
# 设置成对应的 ${HOSTNAME}
node.name: es-01
# 设置成三台实例的 ${HOSTNAME}
discovery.seed_hosts: ["es-01", "es-02", "es-03"]
cluster.initial_master_nodes: ["es-01", "es-02", "es-03"]

当前位置 : 首页 » 文章分类 : 开发 » Elasticsearch-集群及运维

CAT 运维 API

公共参数

?help 可用列

?v 详情

?h 指定输出列

?s 指定排序列

?format 返回格式

/_cat 列出全部CAT接口

/_cat/health?v 查看集群状态

/_cat/nodes?v 查看全部节点

/_cat/indices?v 查看所有索引

/_cat/count/index?v 查看索引的文档数

/_cat/segments/index?v 查看索引的段数据

/_cat/shards?v 查看分片状态

/_cat/allocation?v 查看磁盘使用

Elasticsearch 数据迁移

Cluster 集群 API

PUT /_cluster/settings 修改动态配置

GET /_cluster/settings 查询集群配置

GET /_nodes/stats 查询节点统计信息

GET /_nodes/stats/jvm 查询节点的JVM信息

GET /_nodes/stats/indices 查看节点的索引统计信息

GET /_cluster/stats 查询集群统计信息

配置 Elasticsearch

配置 Elasticsearch

配置文件位置

配置文件格式

环境变量替换

集群和节点配置

动态配置

静态配置

集群分片分配和路由配置

磁盘分片分配设置

429 disk usage exceeded flood-stage watermark

Elasticsearch 日志配置

elasticsearch 慢日志

Circuit Breaker 断路器配置

Request circuit breaker 请求断路器

429 circuit_breaking_exception Data too large

Field data circuit breaker 列缓存断路器

node query cache 节点查询缓存

shard request chache 分片请求缓存

打开/关闭分片请求缓存

查看请求缓存使用量

fielddata cache 列缓存配置

Elasticsearch 线程池配置

elasticsearch 的3种线程池类型

search 线程池

write 线程池

processors 处理器个数设置

429 es_rejected_execution_exception

Elasticsearch 高级配置(JVM配置)

-Xms和-Xmx配置必须相同

young gc 频繁

发现和集群形成

Quorum(多数派)选举

MasterNotDiscoveredException

页面信息

评论