Docker ulimit configuration

A sudden alert

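In the middle of the night, monitoring on our business servers suddenly started alerting on HTTP 500 errors. The alert logs showed that the API was failing to connect to Kafka. Looking at Kafka itself, I found its Docker container restarting over and over, with this error in the log: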

ERROR Error while accepting connection (kafka.network.Acceptor)
java.io.IOException: No file descriptors available

I suspected that the number of connections had exceeded the ulimit.
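One way to test that suspicion is to compare how many descriptors a process holds with the limits the kernel applies to it. The sketch below assumes a Linux host with /proc mounted; nofile_limits and open_fd_count are illustrative helper names, not part of Docker or Kafka.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative); Linux /proc layout assumed.

# Print "<soft> <hard>" from a limits file such as /proc/<pid>/limits.
nofile_limits() {
    awk '/^Max open files/ {print $4, $5}' "$1"
}

# Count the entries in a directory such as /proc/<pid>/fd (one per open fd).
open_fd_count() {
    ls "$1" | wc -l | tr -d ' '
}

# Example: inspect the current shell. For Kafka, substitute the broker's PID
# (reading another user's /proc/<pid>/fd usually needs root).
pid=$$
if [ -r "/proc/$pid/limits" ]; then
    echo "open fds: $(open_fd_count "/proc/$pid/fd")"
    echo "soft/hard nofile: $(nofile_limits "/proc/$pid/limits")"
fi
```

If the open count sits close to the soft limit, "No file descriptors available" is the error you would expect to see.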

Check the ulimit settings on both the server and the container with ulimit -aH:

  • Server limits
    core file size          (blocks, -c) unlimited
    data seg size (kbytes, -d) unlimited
    scheduling priority (-e) 0
    file size (blocks, -f) unlimited
    pending signals (-i) 30446
    max locked memory (kbytes, -l) unlimited
    max memory size (kbytes, -m) unlimited
    open files (-n) 65535
    pipe size (512 bytes, -p) 8
    POSIX message queues (bytes, -q) 819200
    real-time priority (-r) 0
    stack size (kbytes, -s) 10240
    cpu time (seconds, -t) unlimited
    max user processes (-u) unlimited
    virtual memory (kbytes, -v) unlimited
    file locks (-x) unlimited
  • Container limits
    core file size          (blocks, -c) unlimited
    data seg size (kbytes, -d) unlimited
    scheduling priority (-e) 0
    file size (blocks, -f) unlimited
    pending signals (-i) 30446
    max locked memory (kbytes, -l) unlimited
    max memory size (kbytes, -m) unlimited
    open files (-n) 4096
    pipe size (512 bytes, -p) 8
    POSIX message queues (bytes, -q) 819200
    real-time priority (-r) 0
    stack size (kbytes, -s) 10240
    cpu time (seconds, -t) unlimited
    max user processes (-u) unlimited
    virtual memory (kbytes, -v) unlimited
    file locks (-x) unlimited

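The server's open files limit is 65535, which should be plenty, but the container's open files limit is only 4096, so that looked like the culprit. Kafka had recently been hooked up to a new log input, so at certain moments connections from several services piled up and the total connection count grew too large.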
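A spike like this is easier to catch if descriptor usage is checked against the limit before it is exhausted. The following is only a sketch of such a check: fd_headroom_ok is a hypothetical helper, and the 80% threshold is an arbitrary assumption, not a value from our monitoring.

```shell
#!/bin/sh
# Hypothetical check: succeed (exit 0) while usage stays below pct% of the limit.
# $1: open descriptors, $2: nofile limit, $3: threshold percent (default 80).
fd_headroom_ok() {
    open="$1"; limit="$2"; pct="${3:-80}"
    [ $((open * 100)) -lt $((limit * pct)) ]
}

# Example with the container limit seen above: 3500 open descriptors
# against a limit of 4096 is past 80%, so the warning is printed.
if ! fd_headroom_ok 3500 4096 80; then
    echo "WARNING: file descriptors above 80% of the limit" >&2
fi
```

The inputs could come from /proc/<pid>/fd and /proc/<pid>/limits for the Kafka process.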

We run on AWS servers, and AWS imposes its own ulimit restrictions on containers.
Documentation: https://docs.aws.amazon.com/zh_cn/AmazonECS/latest/APIReference/API_Ulimit.html

The ulimit settings to pass to the container.

Amazon ECS tasks hosted on Fargate use the default resource limit values set by the operating system with the exception of the nofile resource limit parameter which Fargate overrides. The nofile resource limit sets a restriction on the number of open files that a container can use. The default nofile soft limit is 1024 and hard limit is 4096.

Check the Docker configuration

  • sudo vi /etc/sysconfig/docker
    The default ulimit here matches what the documentation says:
    # The max number of open files for the daemon itself, and all
    # running containers. The default value of 1048576 mirrors the value
    # used by the systemd service unit.
    DAEMON_MAXFILES=1048576

    # Additional startup options for the Docker daemon, for example:
    # OPTIONS="--ip-forward=true --iptables=true"
    # By default we limit the number of open files per container
    OPTIONS="--default-ulimit nofile=1024:4096"

    # How many seconds the sysvinit script waits for the pidfile to appear
    # when starting the daemon.
    DAEMON_PIDFILE_TIMEOUT=10

That pins down the problem: the container ulimit needs to be raised.

OPTIONS="--default-ulimit nofile=65535:65535"

After editing the config file, reload systemd and restart Docker:

$ sudo systemctl daemon-reload
$ sudo systemctl restart docker

Check that the setting was applied with systemctl status docker.service:

● docker.service - Docker Application Container Engine
Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
Active: active (running) since Thu 2021-05-20 10:07:40 UTC; 45s ago
Docs: https://docs.docker.com
Process: 23357 ExecStartPre=/usr/libexec/docker/docker-setup-runtimes.sh (code=exited, status=0/SUCCESS)
Process: 23345 ExecStartPre=/bin/mkdir -p /run/docker (code=exited, status=0/SUCCESS)
Main PID: 23364 (dockerd)
Tasks: 29
Memory: 45.0M
CGroup: /system.slice/docker.service
├─23364 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock --default-ulimit nofile=65535:65535

Confirmed, it worked!
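The same check can be scripted instead of read off the status output by eye. This is a minimal sketch: default_nofile is a hypothetical helper, and the ps invocation assumes the procps version of ps found on most Linux distributions.

```shell
#!/bin/sh
# Hypothetical helper: pull "soft:hard" out of a dockerd command line.
default_nofile() {
    printf '%s\n' "$1" |
        sed -n 's/.*--default-ulimit[ =]nofile=\([0-9]*:[0-9]*\).*/\1/p'
}

# Example: read the live daemon's arguments if dockerd is running.
args=$(ps -o args= -C dockerd 2>/dev/null || true)
if [ -n "$args" ]; then
    echo "daemon default nofile: $(default_nofile "$args")"
fi
```

Containers created before the restart may still carry the old limit, so it is also worth running ulimit -n inside the recreated Kafka container.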

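This changes Docker's global configuration. You can also change the limit for a single container only, in which case you add a ulimits parameter to that container's configuration file.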
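For a container started by hand, the command-line counterpart of that setting is the --ulimit flag of docker run. The sketch below only builds and prints the command: nofile_flag is a hypothetical helper and example/kafka-image is a placeholder image name, not the image we actually run.

```shell
#!/bin/sh
# Hypothetical helper: build a --ulimit flag, refusing a soft limit above the hard one.
# $1: soft limit, $2: hard limit (defaults to the soft limit).
nofile_flag() {
    soft="$1"; hard="${2:-$1}"
    if [ "$soft" -gt "$hard" ]; then
        echo "soft limit $soft exceeds hard limit $hard" >&2
        return 1
    fi
    printf '%s\n' "--ulimit nofile=$soft:$hard"
}

# Example: print (not run) a docker run command with a per-container limit.
# example/kafka-image is a placeholder image name.
echo docker run -d $(nofile_flag 65535 65535) example/kafka-image
```

A per-container value set this way applies to that container only, in place of the daemon-wide default.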

The pitfall I fell into

At first I assumed the server's ulimit was to blame, so I tried changing the 65535 to unlimited.

open files (-n) 65535

# sudo vi /etc/security/limits.conf
* soft nofile unlimited
* hard nofile unlimited

After applying this, all sorts of things broke. sudo stopped working, with the following error:

sudo: pam_open_session: Permission denied
sudo: policy plugin failed session

I could no longer log in to the server either:

Permission denied (publickey,gssapi-keyex,gssapi-with-mic).

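Because the default AWS user is ec2-user rather than root, there was nothing I could do, and I had no choice but to reset the server.
After resetting it through the AWS console, I redeployed the Kafka service.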

How to reset a server on AWS:
Instances > select the instance > Actions > Monitor and troubleshoot > Replace root volume
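A likely explanation for the lockout is that, unlike most limits, nofile cannot be unlimited on Linux: the kernel caps it at fs.nr_open (1048576 by default), and a session whose limits cannot be applied is refused. A pre-flight check along these lines would have caught the bad value before it reached limits.conf. This is only a sketch: nofile_value_ok is a hypothetical helper, and the 1048576 fallback is an assumption used when /proc/sys/fs/nr_open cannot be read.

```shell
#!/bin/sh
# Hypothetical pre-flight check for a nofile value destined for limits.conf.
# $1: proposed value; $2: kernel ceiling (defaults to fs.nr_open when readable).
nofile_value_ok() {
    value="$1"
    ceiling="${2:-$(cat /proc/sys/fs/nr_open 2>/dev/null || echo 1048576)}"
    case "$value" in
        ''|*[!0-9]*) return 1 ;;   # rejects "unlimited", "-1", empty strings
    esac
    [ "$value" -le "$ceiling" ]
}

# Example: the value that worked and the value that locked me out.
for v in 65535 unlimited; do
    if nofile_value_ok "$v"; then
        echo "$v: ok"
    else
        echo "$v: rejected"
    fi
done
```

Keeping a root session open until a fresh login has been verified is a cheap extra safeguard when editing limits.conf.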
