Octavia访问频繁引起SYN洪泛的分析与解决

Posted by elrond on May 17, 2022

1. 问题

octavia-api进程跑一会就卡死,api无响应,telnet也无响应

2. 排查

思路主要是

  • 查应用代码
  • 查系统日志得出触发了syn flooding
  • 根据tcp/socket知识找到问题原因

2.1. gdb定位下代码卡在哪里

(gdb) py-list
 442
 443                buf.seek(0, 2)  # seek end
 444                self._rbuf = StringIO()  # reset _rbuf.  we consume it via buf.
 445                while True:
 446                    try:
>447                        data = self._sock.recv(self._rbufsize)
 448                    except error, e:
 449                        if e.args[0] == EINTR:
 450                            continue
 451                        raise
 452                    if not data:

(gdb) py-bt
    self.process_request(request, client_address)
#32 Frame 0x7fd926f4de18, for file /usr/lib64/python2.7/SocketServer.py, line 238, in serve_forever (self=<WSGIServer(RequestHandlerClass=<classobj at remote 0x7fd931382598>, base_environ={'CONTENT_LENGTH': '', 'SERVER_NAME': 'controller1', 'GATEWAY_INTERFACE': 'CGI/1.1', 'SCRIPT_NAME': '', 'SERVER_PORT': '9876', 'REMOTE_HOST': ''}, _BaseServer__shutdown_request=False, server_name='controller1', server_address=('172.16.10.10', 9876), application=<RecursiveMiddleware(application=<CORS(application=<HTTPProxyToWSGI(application=<SkippingAuthProtocol(_service_token_roles=set(['service']), _service_token_roles_required=False, _interface='admin', log=<KeywordArgumentAdapter(logger=<Logger(name='keystonemiddleware.auth_token', parent=<Logger(name='keystonemiddleware', parent=<---Type <return> to continue, or q <return> to quit---
RootLogger(name='root', parent=None, handlers=[<WatchedFileHandler(stream=<file at remote 0x7fd92849f660>, level=0, lock=<_RLock(_Verbose__verbose=False, _RLock__owner=None, _RLock__block=<thread.lock at remote 0x7fd92adc3570>, _RLock__count=0) at ...(truncated)
    self._handle_request_noblock()
#36 Frame 0x7fd9284b4608, for file /usr/lib/python2.7/site-packages/octavia/cmd/api.py, line 46, in main (app=<RecursiveMiddleware(application=<CORS(application=<HTTPProxyToWSGI(application=<SkippingAuthProtocol(_service_token_roles=set(['service']), _service_token_roles_required=False, _interface='admin', log=<KeywordArgumentAdapter(logger=<Logger(name='keystonemiddleware.auth_token', parent=<Logger(name='keystonemiddleware', parent=<RootLogger(name='root', parent=None, handlers=[<WatchedFileHandler(stream=<file at remote 0x7fd92849f660>, level=0, lock=<_RLock(_Verbose__verbose=False, _RLock__owner=None, _RLock__block=<thread.lock at remote 0x7fd92adc3570>, _RLock__count=0) at remote 0x7fd92847b650>, encoding=None, dev=2053L, _name=None, ino=201451105, baseFilename='/var/log/octavia/api.log', mode='a', filters=[], formatter=<ContextFormatter(project='octavia', datefmt='%Y-%m-%d %H:%M:%S', version='unknown', _fmt='%(asctime)s.%(msecs)03d %(process)d %(levelname)s %(name)s [%(request_id)s %(user_identity)s] %(inst...(truncated)
    srv.serve_forever()

看起来正常监听

2.2. dmesg 日志

[Sun May 15 20:16:52 2022] TCP: request_sock_TCP: Possible SYN flooding on port 9876. Sending cookies.  Check SNMP counters.
[Sun May 15 20:22:22 2022] TCP: request_sock_TCP: Possible SYN flooding on port 9876. Sending cookies.  Check SNMP counters.

octavia监听端口触发 syn flooding,通过这个信息可以知道问题处在tcp连接

2.3. strace查看系统调用

strace -o output.txt -T -tt -e trace=all -p 101847
17:48:20.962664 fcntl(10, F_SETFL, O_RDWR) = 0 <0.000006>
17:48:20.962702 recvfrom(10, "\7\0\0\1\0\0\0\0\0\0\0", 8192, 0, NULL, NULL) = 11 <0.000007>
17:48:20.962738 fcntl(10, F_GETFL)      = 0x2 (flags O_RDWR) <0.000006>
17:48:20.962761 fcntl(10, F_SETFL, O_RDWR) = 0 <0.000006>
17:48:20.962819 select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=0}) = 0 (Timeout) <0.000006>
17:48:20.969238 sendto(5, "HTTP/1.0 200 OK\r\n", 17, 0, NULL, 0) = 17 <0.000019>
17:48:20.969337 sendto(5, "Date: Mon, 16 May 2022 09:48:20 "..., 37, 0, NULL, 0) = 37 <0.000008>
17:48:20.969401 sendto(5, "Server: WSGIServer/0.1 Python/2."..., 37, 0, NULL, 0) = 37 <0.000006>
17:48:20.969439 sendto(5, "Content-Length: 2934\r\nContent-Ty"..., 122, 0, NULL, 0) = 122 <0.000006>
17:48:20.969473 sendto(5, "{\"loadbalancers\": [{\"provider\": "..., 2934, 0, NULL, 0) = 2934 <0.000013>
17:48:20.969525 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=528, ...}) = 0 <0.000030>
17:48:20.969597 write(2, "172.16.204.135 - - [16/May/2022 "..., 92) = 92 <0.000012>
17:48:20.969661 shutdown(5, SHUT_WR)    = 0 <0.000013>
17:48:20.969703 close(5)                = 0 <0.000011>
17:48:20.969744 select(5, [4], [], [], {tv_sec=0, tv_usec=500000}) = 1 (in [4], left {tv_sec=0, tv_usec=499997}) <0.000009>
17:48:20.969786 accept(4, {sa_family=AF_INET, sin_port=htons(49511), sin_addr=inet_addr("172.16.205.227")}, [16]) = 5 <0.000010>
17:48:20.969849 recvfrom(5,

卡在revcfrom 接受客户端请求,到这里不太有思路了回到上面的tcp flooding

2.4. 查看连接状态

netstat -anolp|grep 9876 -w
tcp        6      0 172.16.10.10:9876      0.0.0.0:*               LISTEN      168014/python        keepalive (0.03/0/0)
tcp        0      0 172.16.10.10:9876      172.16.83.77:37110      SYN_RECV    -                    on (0.23/0/0)
tcp        0      0 172.16.10.10:9876      172.16.83.77:36202      SYN_RECV    -                    on (0.23/2/0)
tcp        0      0 172.16.10.10:9876      172.16.83.77:37090      SYN_RECV    -                    on (0.00/0/0)
tcp        0      0 172.16.10.10:9876      172.16.83.77:36308      SYN_RECV    -                    on (3.43/3/0)
tcp        0      0 172.16.10.10:9876      172.16.83.77:36302      SYN_RECV    -                    on (3.43/3/0)
tcp        0      0 172.16.10.10:9876      172.16.83.77:36301      SYN_RECV    -                    on (3.43/3/0)
tcp        0      0 172.16.10.10:9876      172.16.83.77:36934      SYN_RECV    -                    on (3.83/2/0)
tcp        0      0 172.16.10.10:9876      172.16.83.77:36348      SYN_RECV    -                    on (4.63/3/0)
tcp        0      0 172.16.10.10:9876      172.16.83.77:36178      SYN_RECV    -                    on (4.63/3/0)
tcp      384      0 172.16.10.10:9876      172.16.83.77:36962      ESTABLISHED -                    off (0.00/0/0)
tcp      384      0 172.16.10.10:9876      172.16.83.77:36732      ESTABLISHED -                    off (0.00/0/0)
tcp        0      0 172.16.10.10:9876      172.16.83.77:59818      TIME_WAIT   -                    timewait (29.91/0/0)
tcp        0      0 172.16.10.10:9876      172.16.83.77:33976      TIME_WAIT   -                    timewait (52.32/0/0)

有几个连接是SYN_RECV状态,syn flooding现象,但连接数量不多,再看看线程属性

2.5. 查看socket统计信息

ss -anlop|grep octavia-api                                                                    Tue May 17 11:27:51 2022

tcp    LISTEN     6 	 5    172.16.10.10:9876                  *:*                   users:(("octavia-api",pid=173917,fd=4))

第三列、第四列为

  • Recv-Q 接收队列
    • LISTEN 状态 表示当前等待服务器调用accept完成三次握手的Listen backlog值
    • 其他状态 接收队列的字节数
  • Send-Q 发送队列一般创建socket监听时,维护一个syn队列
    • LISTEN状态 表示最大的listen backlog值
    • 其他状态 发送队列中的字节数

tcp-half-open

客户端通过socket连接向服务端发送syc包时维护一个等待队列 进入半连接状态,如果等待队列满了,服务端就会丢弃,客户端返回连接超时,只要客户端没有收到syn+ack,客户会多次尝试,表现在体验上面就是超时无响应

看到这里在参考下面红帽的链接,就可以知道问题所在: octavia-api在启动监听套接字时backlog太小了,如果一次性的请求大于5个且持续请求就处理不过来,所以这里将octavia-api启动的backlog改大

# python listen签名
Socket.listen(backlog)

3. 解决

找到octavia-api启动时的backlog配置

cmd.api.svc.sevrve_forever -> wsgiref.simple_server.WSGIServer -> lib64/BaseHTTPServer.HTTPServer -> lib64/SocketServer.TCPServer ->

request_queue_size = 5

简单改的话地方有如下三个,他们之间是继承关系,该其中任何一个都行,但基于最小化改动原则,simple_server 是相比其他两个掉用最少的,所以修改 simple_server

  • wsgiref.simple_server.WSGIServer
  • lib64/BaseHTTPServer.HTTPServer
  • lib64/SocketServer.TCPServer
# /usr/lib64/python2.7/wsgiref/simple_server.py
class WSGIServer(HTTPServer):

    """BaseHTTPServer that implements the Python WSGI protocol"""

    request_queue_size = 64 #新增

    application = None

修改完重启 octavia-api 即可,验证使用如下命令

ss -anlop|grep octavia-api                                                                    Tue May 17 11:27:51 2022

tcp    LISTEN    0 	 64    172.16.10.10:9876                  *:*                   users:(("octavia-api",pid=173917,fd=4))

第三列64即生效,这里改大可能会影响系统其他进程,由于操作系统总的队列数是通过如下两个参数限制的,改完可以粗略的计算下,如果把所有ss输出listen状态的加起来小于下面输出则不用改下面两个参数,如果大于则需要适量调整

sysctl -n net.core.somaxconn
sysctl -n net.ipv4.tcp_max_syn_backlog

4. 总结

  • api设计时应该考虑多线程,后续octavia-api改写可能可以参考nova-api
  • 管控面API调用不应该太频繁,并发量不应该太大

5. 参考