Sudden Alert
In the middle of the night, monitoring on a business server suddenly started firing 500 errors. The alert log showed the API erroring while connecting to Kafka, and on checking Kafka, the Docker container was restarting over and over. The alert was:

```
ERROR Error while accepting connection (kafka.network.Acceptor)
```
This suggested the connection count had exceeded a ulimit. Check the ulimit configuration on both the server and the container:

```
ulimit -aH
```
- Server parameters

```
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 30446
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 65535
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
```

- Container parameters

```
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 30446
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 4096
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
```
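Note that `ulimit -aH` reports the limits of the shell it runs in. To check what the running broker actually inherited, you can read its limits from /proc on the host; `self` below is a placeholder for the broker's PID (which you can find with `docker top <container>`):

```shell
# Effective limits of a running process; substitute the Kafka broker's
# PID for 'self' (e.g. from `docker top <container>` on the host).
grep "Max open files" /proc/self/limits
```

In this incident the broker process would have shown the container's 4096, not the host's 65535.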
The server's open files limit is 65535, which should be plenty, but the container's is only 4096, so the problem had to be here: Kafka had recently been hooked up to a new log input, and at some point the connections from several services stacked up and pushed the open connection count over the limit.

We run on AWS, and AWS sets its own ulimit on containers; see the docs:
https://docs.aws.amazon.com/zh_cn/AmazonECS/latest/APIReference/API_Ulimit.html
The ulimit settings to pass to the container.
Amazon ECS tasks hosted on Fargate use the default resource limit values set by the operating system with the exception of the nofile resource limit parameter which Fargate overrides. The nofile resource limit sets a restriction on the number of open files that a container can use. The default nofile soft limit is 1024 and hard limit is 4096.
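For ECS tasks, the nofile limit can also be overridden per container in the task definition. The field names below follow the linked API_Ulimit reference; 65535 is our target value, not a value from the doc:

```json
"ulimits": [
    {
        "name": "nofile",
        "softLimit": 65535,
        "hardLimit": 65535
    }
]
```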
Check the Docker configuration:

- sudo vi /etc/sysconfig/docker

The default ulimit matches what the documentation says:

```
# The max number of open files for the daemon itself, and all
# running containers. The default value of 1048576 mirrors the value
# used by the systemd service unit.
DAEMON_MAXFILES=1048576

# Additional startup options for the Docker daemon, for example:
# OPTIONS="--ip-forward=true --iptables=true"
# By default we limit the number of open files per container
OPTIONS="--default-ulimit nofile=1024:4096"

# How many seconds the sysvinit script waits for the pidfile to appear
# when starting the daemon.
DAEMON_PIDFILE_TIMEOUT=10
```
That pinpointed the problem; the container ulimit needs to be raised:

```
OPTIONS="--default-ulimit nofile=65535:65535"
```
After editing the configuration file, reload systemd and restart Docker:

```
$ sudo systemctl daemon-reload
$ sudo systemctl restart docker
```
Check that the configuration was applied with systemctl status docker.service:

```
● docker.service - Docker Application Container Engine
```

Confirmed, success!
This changes Docker's global default. Alternatively, you can change it for a single container by adding a ulimits setting to that container's configuration file.
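For example, with docker-compose the per-container setting is the `ulimits` key; the service name and image below are illustrative:

```yaml
services:
  kafka:
    image: bitnami/kafka        # illustrative image name
    ulimits:
      nofile:
        soft: 65535
        hard: 65535
```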
Pitfalls I Hit
At first I assumed the server's ulimit was the cause, so I tried to change the 65535 to unlimited:

```
open files (-n) 65535
```

```
# sudo vi /etc/security/limits.conf
```
After that change, all sorts of things broke. sudo stopped working, with the error:

```
sudo: pam_open_session: Permission denied
```
Logging in to the server failed too:

```
Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
```
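In hindsight the breakage makes sense: nofile cannot actually be set to unlimited (the kernel caps it at fs.nr_open), so pam_limits fails to apply the invalid entry, and PAM then refuses to open any session, which is why both sudo and SSH were denied. The safe edit would have been to raise the number instead, e.g.:

```
# /etc/security/limits.conf - raise nofile numerically; 'unlimited' is invalid here
*    soft    nofile    65535
*    hard    nofile    65535
```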
Because the default AWS user is ec2-user, not root, there was nothing I could do from inside the machine. The only way out was to reset the server the AWS way and, after the reset, redeploy the Kafka service.