Setup: 5 machines, ~150k write QPS, application logs used for troubleshooting, ~3 TB of data per day, fields not analyzed (no tokenization), cold data kept for 30 days.
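
A quick back-of-envelope from those numbers (my own arithmetic, not part of the original notes): 3 TB/day × 30 days ≈ 90 TB retained, i.e. roughly 18 TB per node across 5 machines with no replicas, and 150k docs/s is on the order of 13 billion documents per day.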

Problem summary

  1. Importing dashboards and other saved objects -> final approach: migrate the objects from ES 2 to 5, export to JSON with Kibana 5, then import into 6 -> ran into the Kibana bug where the saved-objects list cannot scroll
  2. The JVM could not allocate its heap -> suspected memory fragmentation; after stopping the old ES 2 process, the new node started fine
  3. Write QPS was too low -> changing the shard count to 1 improved it noticeably, but then full GC kicked in; switched to G1GC to observe, and reverted the thread pool change
  4. Pain point: no spare machines, everything has to be done on these same 5 boxes... sigh...

text fields can no longer be visualized

ES 6 introduced the new text type; fields you aggregate on have to be keyword instead, because text does not enable fielddata by default. Logstash has a gotcha here: with template_overwrite, if a template named logstash already exists it is left untouched, and if there is none, Logstash writes a new default one. Very annoying. The workaround: let Logstash generate its default template first, then edit it in Cerebro and change the text fields to keyword. There is actually a switch that controls this behaviour: just turn off manage_template.
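
A minimal sketch of that workflow (host and template name are assumptions; the template Logstash installs is usually called logstash):

# Inspect the template Logstash generated
curl -s 'http://localhost:9200/_template/logstash?pretty'

# After changing text -> keyword (in Cerebro or by hand), push the edited body back
curl -s -XPUT 'http://localhost:9200/_template/logstash' \
  -H 'Content-Type: application/json' -d @logstash-template.json

Setting manage_template => false in the Logstash elasticsearch output stops Logstash from touching the template at all, which is the switch mentioned above.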

A reasonable template

{
  "order": 0,
  "index_patterns": [
    "*"
  ],
  "settings": {},
  "mappings": {
    "_default_": {
      "dynamic_templates": [
        {
          "message_field": {
            "path_match": "message",
            "match_mapping_type": "string",
            "mapping": {
              "type": "keyword",
              "norms": false
            }
          }
        },
        {
          "string_fields": {
            "match": "*",
            "match_mapping_type": "string",
            "mapping": {
              "type": "keyword",
              "norms": false
            }
          }
        }
      ],
      "properties": {
        "@timestamp": {
          "type": "date"
        },
        "@version": {
          "type": "keyword"
        },
        "geoip": {
          "dynamic": true,
          "properties": {
            "ip": {
              "type": "ip"
            },
            "location": {
              "type": "geo_point"
            },
            "latitude": {
              "type": "half_float"
            },
            "longitude": {
              "type": "half_float"
            }
          }
        }
      }
    }
  },
  "aliases": {}
}
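
Loading it into the cluster is a single PUT (host and the template name logstash_defaults are placeholders; the JSON above is assumed to be saved as template.json):

curl -s -XPUT 'http://localhost:9200/_template/logstash_defaults' \
  -H 'Content-Type: application/json' -d @template.json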

elasticdump cannot migrate dashboards

Migrating from 2 to 5 works, but to 6 it does not. wtf

kibana 4.3.0 -> 6.2.2

First, following the official migration docs, reindex the .kibana index.
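
The core of that step is a plain _reindex into a new index; a minimal sketch with placeholder host and index names (the official guide also adjusts mappings and the .kibana alias, so follow it for the full procedure):

curl -s -XPOST 'http://localhost:9200/_reindex?pretty' \
  -H 'Content-Type: application/json' -d '
{
  "source": { "index": ".kibana" },
  "dest":   { "index": ".kibana-6" }
}'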

Tried using elasticdump to move the reindexed ES 5 index onto the 6 cluster. Doesn't work; dump does not recognize the reindexed index.

Tried using a snapshot; there was no way to handle just this single index, so gave up.

Tried using reindex-from-remote to pull the upgraded .kibana from ES 5 into 6. Adding the host to reindex.remote.whitelist would require restarting the 6 cluster, so gave up.
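
For reference, what that abandoned route would have required (hosts are placeholders): whitelist the ES 5 host on every node of the 6 cluster and restart, then pull the index over with reindex-from-remote.

# elasticsearch.yml on the ES 6 nodes (this is the part that needs the restart):
#   reindex.remote.whitelist: "es5-host:9200"

curl -s -XPOST 'http://es6-host:9200/_reindex?pretty' \
  -H 'Content-Type: application/json' -d '
{
  "source": {
    "remote": { "host": "http://es5-host:9200" },
    "index": ".kibana"
  },
  "dest": { "index": ".kibana" }
}'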

Tried importing with Logstash; that worked.

Another solution

It doesn't have to be that complicated; this is enough:

  1. Use elasticdump to copy the objects from 2 into 5 (see the elasticdump sketch after this list)
  2. Export them to JSON with Kibana 5's export
  3. Import with Kibana 6; the one catch is that all the indices have to be created first
  4. Even after creating them, this shows up:
Index Pattern Conflicts
The following saved objects use index patterns that do not exist. Please select the index patterns you'd like re-associated them with.

With the indices created it still reports this error, so this approach doesn't really work.
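
For the record, step 1 with elasticdump is roughly this (hosts are placeholders; copy the mapping first, then the documents):

elasticdump --input=http://es2-host:9200/.kibana --output=http://es5-host:9200/.kibana --type=mapping
elasticdump --input=http://es2-host:9200/.kibana --output=http://es5-host:9200/.kibana --type=data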

Kibana has already fixed this: https://github.com/elastic/kibana/pull/17043

Patching the .less directly in the release build doesn't work, and the daily build is a different version, so the only option is to build Kibana ourselves.

Where does @timestamp come from?

Adding a rubydebug codec on the output shows that the timestamp has already been generated by then.

Use a consumer tool to look at the raw log in Kafka:
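
For example with the stock console consumer (the topic name lmds-log is a guess based on the log type below; the broker address is the one used elsewhere in these notes):

bin/kafka-console-consumer.sh --bootstrap-server 10.64.2.157:8092 --topic lmds-log --from-beginning --max-messages 1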

{"message":"{ \"log_time\":\"19/Mar/2018:11:36:19 +0800\", \"cid\":2, \"did\":1301001077, \"pid\":16105, \"sid\":464111163, \"remote_addr\":\"10.210.1.90\", \"host\":\"bjzjy01-video-bjyz001077.bjzjy01.ksyun.com\", \"method\":\"GET\", \"uri\":\"/play\", \"status\":302, \"event\":0, \"ss_cid\":2, \"ss_lid\":1300001090, \"ss_sid\":2024579057, \"ss_slot\":2, \"begin_time_ms\":1521430579996, \"duration_ms\":2, \"unique_name\":\"panda\", \"app\":\"live_panda\", \"name\":\"3cf08b31f7643b7cc7c10829b892b8e7_mid\" }","@version":"1","@timestamp":"2018-03-19T03:36:20.791Z","host":"bjzjy01-video-bjyz001077.bjzjy01.ksyun.com","path":"/data/logs/lmds/logstash.log","type":"lmds-log"}

So it already exists in Kafka; it must be added by the shipper.

shipper:

file input -> kafka output

From the output plugin docs:

The default codec is plain. Logstash will encode your events with not only the message field but also with a timestamp and hostname.

Kibana 5 won't start

[illegal_argument_exception] mapper [hits] cannot be changed from type [long] to [integer]

Probably caused by the earlier reindex. Deleting .kibana and letting Kibana create the index itself fixed it.
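
I.e. something like (host is a placeholder; Kibana recreates the index with the right mappings on its next start):

curl -s -XDELETE 'http://localhost:9200/.kibana'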

Kibana 6 fails to create .kibana

Several index templates are configured for * and conflict with it. Fix: rename the * templates first, let Kibana create .kibana, then change them back.
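
There is no rename API for templates, so "renaming" really means save, delete, and re-create; a sketch, with wildcard_defaults standing in for the real template name:

# Save the conflicting * template (strip the outer name wrapper, e.g. with jq), then delete it
curl -s 'http://localhost:9200/_template/wildcard_defaults' | jq '.wildcard_defaults' > wildcard_defaults.json
curl -s -XDELETE 'http://localhost:9200/_template/wildcard_defaults'

# ...start Kibana 6 and let it create .kibana, then put the template back...
curl -s -XPUT 'http://localhost:9200/_template/wildcard_defaults' \
  -H 'Content-Type: application/json' -d @wildcard_defaults.json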

How to reset a consumer group's committed offsets back to 0?

bin/kafka-consumer-groups.sh --bootstrap-server 10.64.2.157:8092 --group lktest --reset-offsets --to-latest --all-topics --execute

Note that --to-latest jumps to the end of each partition; to actually rewind to offset 0, use --to-earliest instead.

curator

curator_cli  --host "10.66.96.205" --port "9200" --timeout 600   close --filter_list '[{"filtertype":"age","source":"creation_date","direction":"older","unit":"days","unit_count":1}]'
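
That closes indices older than one day. Given the 30-day retention mentioned at the top, the matching delete would presumably look like this (a sketch, not the command actually used):

curator_cli --host "10.66.96.205" --port "9200" --timeout 600 delete_indices --filter_list '[{"filtertype":"age","source":"creation_date","direction":"older","unit":"days","unit_count":30}]'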

Problems when data ingestion started

  1. indexer_time was parsed wrong; my mistake
  2. supervisor still has to be version 3; sorted out with ansible
  3. Consumers could not keep up; suspected the disks were already maxed out; running two instances on one machine did not help either

GC after changing to a single shard

5 data nodes; GC never recovered, RSS went past 32 GB, a normal kill did nothing and only kill -9 worked:

[WARN ][o.e.m.j.JvmGcMonitorService] [video_wuqing_data6_1_158] [gc][160162] overhead, spent [51.5s] collecting in the last [51.3s]
[2018-04-12T13:18:40,900][WARN ][o.e.m.j.JvmGcMonitorService] [video_wuqing_data6_1_158] [gc][160164] overhead, spent [2.6m] collecting in the last [2.6m]
31580 work      20   0  863g  54g  21g S 100.1 10.9  18428:35 java

Trying G1 and dropping the bulk queue tweak (jvm.options sketch below).
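
The jvm.options side of that experiment would look roughly like this (the G1 tuning values are assumptions, not what was actually deployed):

# config/jvm.options: comment out the CMS flags
# -XX:+UseConcMarkSweepGC
# -XX:CMSInitiatingOccupancyFraction=75
# -XX:+UseCMSInitiatingOccupancyOnly
# ...and enable G1 instead
-XX:+UseG1GC
-XX:G1ReservePercent=25
-XX:InitiatingHeapOccupancyPercent=30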

The QPS would not go up, but that was because the indexer had stopped; I assumed it was the shard count, which turned out to be wrong. At roughly 160k QPS, iostat shows the disks saturated and about 16 CPU cores in use.

Configuration

{
  "order": 1,
  "index_patterns": [
    "*"
  ],
  "settings": {
    "index": {
      "routing": {
        "allocation": {
          "total_shards_per_node": "5"
        }
      },
      "refresh_interval": "60s",
      "number_of_shards": "20",
      "translog": {
        "flush_threshold_size": "3g",
        "sync_interval": "120s",
        "durability": "async"
      },
      "merge": {
        "scheduler": {
          "max_thread_count": "10"
        }
      },
      "max_result_window": "100000",
      "number_of_replicas": "0"
    }
  },
  "mappings": {
    "_default_": {
      "_all": {
        "omit_norms": true,
        "enabled": false
      }
    }
  },
  "aliases": {}
}
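
To confirm that newly created indices actually picked these settings up (index name is a placeholder):

curl -s 'http://localhost:9200/logstash-2018.04.12/_settings?pretty'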

tbd

  1. Enable the indexer for every module
  2. Create the index patterns
  3. Back up the index patterns?
  4. Check the dashboards for bugs
  5. Fix the commit reset issue
  6. Deploy the origin-site indexer and start writing the full volume