Troubleshooting Elasticsearch ILM Errors
Today I noticed an error in the logs of our production ES cluster:
{"type": "server", "timestamp": "2021-11-15T14:19:15,189Z", "level": "ERROR", "component": "o.e.x.i.IndexLifecycleRunner", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-2", "message": "policy [sai-log] for index [sai-detail-2021-11-02] failed on step [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"check-rollover-ready\"}]. Moving to ERROR step", "cluster.uuid": "dscxSgouRw--mhyuj5Y2fw", "node.id": "9yklgtqpTNuQU25hIOWkxQ" ,
"stacktrace": ["java.lang.IllegalArgumentException: setting [index.lifecycle.rollover_alias] for index [sai-detail-2021-11-02] is empty or not defined",
"at org.elasticsearch.xpack.core.ilm.WaitForRolloverReadyStep.evaluateCondition(WaitForRolloverReadyStep.java:65) [x-pack-core-7.10.1.jar:7.10.1]",
"at org.elasticsearch.xpack.ilm.IndexLifecycleRunner.runPeriodicStep(IndexLifecycleRunner.java:174) [x-pack-ilm-7.10.1.jar:7.10.1]",
"at org.elasticsearch.xpack.ilm.IndexLifecycleService.triggerPolicies(IndexLifecycleService.java:327) [x-pack-ilm-7.10.1.jar:7.10.1]",
"at org.elasticsearch.xpack.ilm.IndexLifecycleService.triggered(IndexLifecycleService.java:265) [x-pack-ilm-7.10.1.jar:7.10.1]",
"at org.elasticsearch.xpack.core.scheduler.SchedulerEngine.notifyListeners(SchedulerEngine.java:183) [x-pack-core-7.10.1.jar:7.10.1]",
"at org.elasticsearch.xpack.core.scheduler.SchedulerEngine$ActiveSchedule.run(SchedulerEngine.java:216) [x-pack-core-7.10.1.jar:7.10.1]",
"at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]",
"at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]",
"at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]",
"at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]",
"at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]",
"at java.lang.Thread.run(Thread.java:832) [?:?]"] }
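The useful information in this entry sits in two places: the `message` field names the policy, the index, and the failed step, and the first stack-trace line names the missing setting. As a minimal sketch, these can be pulled out with Python. The function name `parse_ilm_error` and both regular expressions are my own, written against the 7.10.1 message wording shown above, and may not match other Elasticsearch versions:

```python
import json
import re

# Patterns written against the 7.10.1 log lines shown above (an assumption:
# other versions may word these messages differently).
MESSAGE_RE = re.compile(
    r"policy \[(?P<policy>[^\]]+)\] for index \[(?P<index>[^\]]+)\] "
    r"failed on step \[(?P<step>\{.*\})\]"
)
SETTING_RE = re.compile(
    r"setting \[(?P<setting>[^\]]+)\] for index \[[^\]]+\] is empty or not defined"
)


def parse_ilm_error(message, first_stack_line):
    """Return policy, index, step key and missing setting from an ILM error."""
    found = MESSAGE_RE.search(message)
    if found is None:
        return None
    step = json.loads(found.group("step"))  # the step key is embedded JSON
    missing = SETTING_RE.search(first_stack_line)
    return {
        "policy": found.group("policy"),
        "index": found.group("index"),
        "phase": step["phase"],
        "action": step["action"],
        "step": step["name"],
        "missing_setting": missing.group("setting") if missing else None,
    }


message = (
    'policy [sai-log] for index [sai-detail-2021-11-02] failed on step '
    '[{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]. '
    "Moving to ERROR step"
)
stack = (
    "java.lang.IllegalArgumentException: setting [index.lifecycle.rollover_alias] "
    "for index [sai-detail-2021-11-02] is empty or not defined"
)
info = parse_ilm_error(message, stack)
print(info)
```

In short, the rollover action of the `sai-log` policy failed in the hot phase because `index.lifecycle.rollover_alias` is not set on the index.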
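A colleague and I looked into it and found that the Index Management page in Kibana was showing a lifecycle error notice.
Later my colleague modified the ILM policy, and the error disappeared.
After getting home from work I checked the logs and found the error was still there, so apparently the problem had not been solved. Checking one of the indices: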
GET /sai-log-2021-11-13/_ilm/explain
output:
{
  "indices" : {
    "sai-log-2021-11-13" : {
      "index" : "sai-log-2021-11-13",
      "managed" : false
    }
  }
}
POST /sai-log-2021-11-13/_ilm/retry
output:
{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "cannot retry an action for an index [sai-log-2021-11-13] that has not encountered an error when running a Lifecycle Policy"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "cannot retry an action for an index [sai-log-2021-11-13] that has not encountered an error when running a Lifecycle Policy"
  },
  "status" : 400
}
Since the index was not managed at all, I went to the ILM page in Kibana and attached the ILM policy to the index template.
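For reference, attaching a policy comes down to index settings: `index.lifecycle.name` selects the policy, and a policy that uses the rollover action also expects `index.lifecycle.rollover_alias`, the setting the stack trace reported as missing. The sketch below only illustrates those two settings; the helper name `lifecycle_settings` and the alias `sai-log-write` are hypothetical and not taken from the actual setup:

```python
import json


def lifecycle_settings(policy, rollover_alias=None):
    """Build the index settings that attach an ILM policy.

    rollover_alias is only needed when the policy uses the rollover action.
    """
    settings = {"index.lifecycle.name": policy}
    if rollover_alias:
        settings["index.lifecycle.rollover_alias"] = rollover_alias
    return settings


# "sai-log" is the policy from this post; "sai-log-write" is a made-up alias.
body = lifecycle_settings("sai-log", rollover_alias="sai-log-write")
print(json.dumps(body, indent=2))
```

This dictionary corresponds to the `settings` section of an index template. Naming the alias in the settings is not sufficient by itself: the alias must also exist and point at the index for rollover to work.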
GET /sai-log-2021-11-13/_ilm/explain
output:
{
  "indices" : {
    "sai-log-2021-11-13" : {
      "index" : "sai-log-2021-11-13",
      "managed" : true,
      "policy" : "sai-log",
      "lifecycle_date_millis" : 1636732801982,
      "age" : "2.94d",
      "phase" : "hot",
      "phase_time_millis" : 1636986555834,
      "action" : "rollover",
      "action_time_millis" : 1636733356166,
      "step" : "check-rollover-ready",
      "step_time_millis" : 1636986555834,
      "is_auto_retryable_error" : true,
      "failed_step_retry_count" : 211,
      "phase_execution" : {
        "policy" : "sai-log",
        "phase_definition" : {
          "min_age" : "0ms",
          "actions" : {
            "rollover" : {
              "max_size" : "30gb",
              "max_age" : "60d"
            }
          }
        },
        "version" : 3,
        "modified_date_in_millis" : 1636960453087
      }
    }
  }
}
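This output deserves a close read: the index is now managed, but `is_auto_retryable_error` is true and `failed_step_retry_count` is 211, which suggests ILM has been retrying the `check-rollover-ready` step over and over without success. A minimal sketch that classifies explain entries along those lines follows. The function name and the labels are my own, and the rule for `retrying` is an assumption about how to read these fields, not documented behaviour:

```python
def ilm_status(entry):
    """Classify one index entry from an _ilm/explain response."""
    if not entry.get("managed"):
        return "unmanaged"
    if entry.get("step") == "ERROR":
        return "error"
    # Assumption: a non-zero retry count means the step has failed before.
    if entry.get("failed_step_retry_count", 0) > 0:
        return "retrying"
    return "ok"


# Trimmed copies of the two explain outputs shown in this post.
before = {"index": "sai-log-2021-11-13", "managed": False}
after = {
    "index": "sai-log-2021-11-13",
    "managed": True,
    "policy": "sai-log",
    "step": "check-rollover-ready",
    "is_auto_retryable_error": True,
    "failed_step_retry_count": 211,
}
print(ilm_status(before))
print(ilm_status(after))
```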
Solved?
Back on the Index Management page in Kibana, it still showed "42 indices have lifecycle errors".
So I triggered a retry manually:
POST /sai-log-2021-11-13/_ilm/retry
output:
{
  "acknowledged" : true
}
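Running explain again gave the same output, so I took another look at the Index Management page in Kibana.
One index with an error was still left; running the retry command on it took care of it.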
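With dozens of affected indices, retrying them one by one gets tedious. The sketch below lists the indices in an error state via the `only_errors` parameter of the explain API and then calls retry on each. It is an illustration under assumptions, not the procedure used here: the function names are my own, the cluster is assumed to be reachable without authentication, and the `request` parameter exists only so the HTTP layer can be swapped out:

```python
import json
import urllib.request


def http_request(method, url):
    """Send a request without a body and return the decoded JSON response."""
    req = urllib.request.Request(url, method=method)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def retry_failed_indices(base_url, pattern="*", request=http_request):
    """Retry the failed ILM step on every index that explain reports in error."""
    explain = request("GET", f"{base_url}/{pattern}/_ilm/explain?only_errors=true")
    results = {}
    for name in sorted(explain.get("indices", {})):
        reply = request("POST", f"{base_url}/{name}/_ilm/retry")
        results[name] = reply.get("acknowledged", False)
    return results


# Example call (hypothetical URL, no authentication):
# retry_failed_indices("http://localhost:9200", "sai-*")
```

A retry only re-runs the failed step. If the underlying cause, such as the missing rollover alias, is still present, the step is likely to fail again.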
That more or less solved the problem.
Reference:
Troubleshooting Elasticsearch ILM Errors
https://blog.puresai.com/2021/11/15/380/