Troubleshooting an Elasticsearch ILM Error

Today I noticed an error in the logs of our production ES cluster:

{"type": "server", "timestamp": "2021-11-15T14:19:15,189Z", "level": "ERROR", "component": "o.e.x.i.IndexLifecycleRunner", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-2", "message": "policy [sai-log] for index [sai-detail-2021-11-02] failed on step [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"check-rollover-ready\"}]. Moving to ERROR step", "cluster.uuid": "dscxSgouRw--mhyuj5Y2fw", "node.id": "9yklgtqpTNuQU25hIOWkxQ" , 
"stacktrace": ["java.lang.IllegalArgumentException: setting [index.lifecycle.rollover_alias] for index [sai-detail-2021-11-02] is empty or not defined",
"at org.elasticsearch.xpack.core.ilm.WaitForRolloverReadyStep.evaluateCondition(WaitForRolloverReadyStep.java:65) [x-pack-core-7.10.1.jar:7.10.1]",
"at org.elasticsearch.xpack.ilm.IndexLifecycleRunner.runPeriodicStep(IndexLifecycleRunner.java:174) [x-pack-ilm-7.10.1.jar:7.10.1]",
"at org.elasticsearch.xpack.ilm.IndexLifecycleService.triggerPolicies(IndexLifecycleService.java:327) [x-pack-ilm-7.10.1.jar:7.10.1]",
"at org.elasticsearch.xpack.ilm.IndexLifecycleService.triggered(IndexLifecycleService.java:265) [x-pack-ilm-7.10.1.jar:7.10.1]",
"at org.elasticsearch.xpack.core.scheduler.SchedulerEngine.notifyListeners(SchedulerEngine.java:183) [x-pack-core-7.10.1.jar:7.10.1]",
"at org.elasticsearch.xpack.core.scheduler.SchedulerEngine$ActiveSchedule.run(SchedulerEngine.java:216) [x-pack-core-7.10.1.jar:7.10.1]",
"at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]",
"at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]",
"at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]",
"at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]",
"at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]",
"at java.lang.Thread.run(Thread.java:832) [?:?]"] }

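The part that matters in this entry is the first line of the stack trace: the rollover step needs `index.lifecycle.rollover_alias`, and that setting is empty or missing on the index. As a rough sketch, the policy, index, and missing setting can be pulled out of such an entry as follows. The helper name `summarize_ilm_error` and the regular expressions are my own, based only on the message format shown above, and are not part of Elasticsearch:

```python
import json
import re

# Hypothetical helper: pull the useful fields out of one ILM error log entry.
# The patterns are assumptions based on the message format shown above.
MESSAGE_RE = re.compile(
    r"policy \[(?P<policy>[^\]]+)\] for index \[(?P<index>[^\]]+)\] failed on step"
)
SETTING_RE = re.compile(r"setting \[(?P<setting>[^\]]+)\]")


def summarize_ilm_error(entry: dict) -> dict:
    """Return policy, index, and the first stack trace line of an ILM error entry."""
    match = MESSAGE_RE.search(entry.get("message", ""))
    cause = (entry.get("stacktrace") or [""])[0]
    setting = SETTING_RE.search(cause)
    return {
        "policy": match.group("policy") if match else None,
        "index": match.group("index") if match else None,
        "missing_setting": setting.group("setting") if setting else None,
        "cause": cause,
    }


# Shortened copy of the entry above (stack trace cut to its first line).
entry = json.loads(r'''
{"level": "ERROR",
 "message": "policy [sai-log] for index [sai-detail-2021-11-02] failed on step [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"check-rollover-ready\"}]. Moving to ERROR step",
 "stacktrace": ["java.lang.IllegalArgumentException: setting [index.lifecycle.rollover_alias] for index [sai-detail-2021-11-02] is empty or not defined"]}
''')
summary = summarize_ilm_error(entry)
print(summary["policy"], summary["index"], summary["missing_setting"])
```
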
A colleague and I looked into it together and found an error message on Kibana's Index Management page:

[Screenshot: ILM]

My colleague then modified the ILM policy, and the error message disappeared.

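After getting home from work, I checked the logs again and found the error was still there, so the problem had not actually been solved. I started with explain: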

GET /sai-log-2021-11-13/_ilm/explain

output:
{
  "indices" : {
    "sai-log-2021-11-13" : {
      "index" : "sai-log-2021-11-13",
      "managed" : false
    }
  }
}

POST /sai-log-2021-11-13/_ilm/retry

output:
{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "cannot retry an action for an index [sai-log-2021-11-13] that has not encountered an error when running a Lifecycle Policy"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "cannot retry an action for an index [sai-log-2021-11-13] that has not encountered an error when running a Lifecycle Policy"
  },
  "status" : 400
}

Next, I went to Kibana's ILM page and attached the ILM policy to the index template:
[Screenshot: ILM policy]
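
As far as I understand it, attaching a policy to a template comes down to adding lifecycle settings to that template. Here is a minimal sketch of those settings, built as a Python dict. The alias name `sai-log` is a placeholder of mine (I did not record which alias was actually used), and `build_lifecycle_settings` is a hypothetical helper:

```python
import json


def build_lifecycle_settings(policy: str, rollover_alias: str) -> dict:
    """Index settings that tie new indices to an ILM policy with a rollover action."""
    if not rollover_alias:
        # This is the condition the stack trace complained about.
        raise ValueError("index.lifecycle.rollover_alias must not be empty")
    return {
        "index.lifecycle.name": policy,
        "index.lifecycle.rollover_alias": rollover_alias,
    }


# "sai-log" as the alias is an assumed value, used for illustration only.
settings = build_lifecycle_settings("sai-log", "sai-log")
print(json.dumps({"settings": settings}, indent=2))
```

The printed object corresponds to the `settings` section of the index template.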

GET /sai-log-2021-11-13/_ilm/explain

output:
{
  "indices" : {
    "sai-log-2021-11-13" : {
      "index" : "sai-log-2021-11-13",
      "managed" : true,
      "policy" : "sai-log",
      "lifecycle_date_millis" : 1636732801982,
      "age" : "2.94d",
      "phase" : "hot",
      "phase_time_millis" : 1636986555834,
      "action" : "rollover",
      "action_time_millis" : 1636733356166,
      "step" : "check-rollover-ready",
      "step_time_millis" : 1636986555834,
      "is_auto_retryable_error" : true,
      "failed_step_retry_count" : 211,
      "phase_execution" : {
        "policy" : "sai-log",
        "phase_definition" : {
          "min_age" : "0ms",
          "actions" : {
            "rollover" : {
              "max_size" : "30gb",
              "max_age" : "60d"
            }
          }
        },
        "version" : 3,
        "modified_date_in_millis" : 1636960453087
      }
    }
  }
}
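
With dozens of indices reporting errors, reading explain output one index at a time gets tedious. Below is a small sketch that condenses an explain response into one line per index. It relies only on fields visible in the output above, and the function name `summarize_explain` is my own:

```python
def summarize_explain(response: dict) -> list:
    """One short status line per index from an _ilm/explain response."""
    lines = []
    for name, info in sorted(response.get("indices", {}).items()):
        if not info.get("managed"):
            lines.append(f"{name}: not managed by ILM")
            continue
        retries = info.get("failed_step_retry_count", 0)
        lines.append(
            f"{name}: policy={info.get('policy')} "
            f"step={info.get('phase')}/{info.get('action')}/{info.get('step')} "
            f"retries={retries}"
        )
    return lines


# Trimmed-down versions of the two responses shown in this post.
before = {"indices": {"sai-log-2021-11-13": {"index": "sai-log-2021-11-13", "managed": False}}}
after = {"indices": {"sai-log-2021-11-13": {
    "index": "sai-log-2021-11-13", "managed": True, "policy": "sai-log",
    "phase": "hot", "action": "rollover", "step": "check-rollover-ready",
    "failed_step_retry_count": 211,
}}}
for line in summarize_explain(before) + summarize_explain(after):
    print(line)
```

The retry count of 211 suggests that ILM had already been retrying the failed step on its own for quite a while.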

Solved?

Back on Kibana's Index Management page, the message "42 indices have lifecycle errors" was still there.

So I retried manually:

POST /sai-log-2021-11-13/_ilm/retry

output:
{
  "acknowledged" : true
}
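
Retrying indices one at a time does not scale to dozens of failures. As a sketch, the retry requests could be generated from an explain response. The selection rule (the index is managed and its `step` is `ERROR`) is my assumption about how a failed index shows up in explain output, and `retry_requests` is a hypothetical helper. It only builds the request lines and does not talk to the cluster:

```python
def retry_requests(response: dict) -> list:
    """Build one `POST /<index>/_ilm/retry` line per index sitting in the ERROR step."""
    failed = [
        name
        for name, info in sorted(response.get("indices", {}).items())
        if info.get("managed") and info.get("step") == "ERROR"
    ]
    return [f"POST /{name}/_ilm/retry" for name in failed]


# Made-up sample data for illustration; only the field names come from real output.
sample = {"indices": {
    "sai-log-2021-11-13": {"managed": True, "step": "ERROR"},
    "sai-log-2021-11-14": {"managed": True, "step": "check-rollover-ready"},
    "sai-log-2021-11-12": {"managed": False},
}}
for request in retry_requests(sample):
    print(request)
```

Indices that are unmanaged or not in the error step are skipped on purpose, since retrying them returns the 400 response shown earlier.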

Running explain again gave the same output as before. Another look at Kibana's Index Management page:

One index still had a lifecycle error; running the retry command on it cleared it.

That more or less solved the problem.

References:

Troubleshooting index lifecycle management errors

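This article is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Please credit the source when reposting.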