Recording Some Interesting Bugs I Have Run Into

You never stop learning.

Posted by 大狗 on January 7, 2021


Preface

The purpose of this post is to record the interesting bugs I have run into.

Thinking about it today, it is actually kind of funny: each company builds different things, so you run into completely different classes of problems. At huayao everything was about the operating system, the kernel, and memory itself, because the codebase was ancient and the work was very low-level, running DPDK directly in user space. At the mt company it was all non-standard code, non-standard environments, and non-standard C/C++ development processes. At 轻舟 it is all the new tooling such as buildfarm and bazel, but most of the actual problems come from simulation testing, the business environment, and k8s...

Bug list

1 Coredump triggered by a mismatch between the linked library and the data structure

1.1 Bug 100606 - [FT 2000]Atcp coredump at raise () when repeat bug98307.

On the ARM platform, doing a softssl hash over a message triggers a coredump.

This bug is essentially a stack-canary example, and the immediate cause is a stack overflow. Two things combined to produce it: (1) OpenSSL's own SHA512_CTX structure does not match ours — OpenSSL's version has two extra unsigned ints at the end; (2) the library actually linked is OpenSSL's, and OpenSSL's code writes to those two trailing fields of its SHA512_CTX. That raises two questions: (1) two different files define a structure with the same name but different layouts — why does the compiler not treat this as an error? (2) why does the X86 build not fault or report an invalid memory access?

The first question is easy to answer: compiling and linking are separate stages, and the linker has no way of knowing that the two definitions differ by two fields. The second question needs a closer look, because it hides several smaller ones: does the X86 build have canary protection at all? What does the page-fault mechanism actually protect — the data, or merely invalid memory accesses? And why does the ARM build trip the canary? On the original X86 USTACK build, the compile options did not include -fstack-protector, so the compiler never emitted -fstack-protector-all style canaries and there is simply no canary on X86 to trip. The ARM build, on the other hand, was compiled with the stack protector enabled, so the two-field overflow lands on the canary and the process aborts.
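
A stripped-down model of the layout mismatch (an illustration only, not the actual USTACK/OpenSSL code; fields are simplified):

#include <cstdio>

// Simplified model of the mismatch -- the real structures are larger, but the
// relevant difference is the same: the library's layout ends with two extra
// unsigned ints that the application's stale copy lacks.
struct LibCtx {
    unsigned long long h[8];
    unsigned int       num, md_len;   // present only in the library's header
};

struct AppCtx {
    unsigned long long h[8];          // 8 bytes shorter than the library's view
};

int main() {
    std::printf("lib sees %zu bytes, app sees %zu bytes\n",
                sizeof(LibCtx), sizeof(AppCtx));
    // If the application puts an AppCtx on its stack and hands it to library
    // code compiled against LibCtx, the writes to num/md_len land just past
    // the caller's object. With -fstack-protector(-all) that is exactly where
    // the canary sits, so the ARM build aborts in __stack_chk_fail; the X86
    // build, built without the option, silently corrupts whatever is there.
    return 0;
}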

2 Coredump caused by differing CACHE_LINE_SIZE

2.1 Bug 98768 - [FT 2000]There is backend coredump when use command “debug trace tcp all” to capture packages

ATCP enqueues data; when the BACKEND process dequeues it, it coredumps.

The first suspect was reordering, since the code genuinely could be reordered — there is no memory-barrier protection at all. But seeing that br->br_cons_mask and br->br_prod_mask did not match is what raised the real doubt: those masks are the same value for both producer and consumer and never change, so how could they differ?

In the end, checking with #define offset_of(type, memb) ((unsigned long)(&((type *)0)->memb)) inside both ATCP and BACKEND showed that the two processes had been built with different CACHE_LINE_SIZE values, so the reads landed on the wrong offsets. The final fix, besides having both sides include the same header so that CACHE_LINE_SIZE matches, also updated the RTE_RING code.
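
A sketch of why the same ring header disagrees across processes when CACHE_LINE_SIZE differs (the struct below is a simplified stand-in, not the real rte_ring/buf_ring layout):

#include <cstdio>

#define offset_of(type, memb) ((unsigned long)(&((type *)0)->memb))

// Each process compiles this header with its own CACHE_LINE_SIZE; the padding
// between the producer and consumer blocks, and therefore every consumer-side
// field offset, depends on that value.
#ifndef CACHE_LINE_SIZE
#define CACHE_LINE_SIZE 64          // the other process was built with a different value
#endif

struct shared_ring {
    unsigned long br_prod_head;
    unsigned long br_prod_mask;
    char          pad_prod[CACHE_LINE_SIZE];   // keep producer/consumer on separate cache lines
    unsigned long br_cons_head;
    unsigned long br_cons_mask;
};

int main() {
    // ATCP and BACKEND each print their own view; if the numbers differ, every
    // read of br_cons_mask from the shared region is reading the wrong bytes.
    std::printf("br_cons_mask offset = %lu\n", offset_of(shared_ring, br_cons_mask));
    return 0;
}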

3 Failure caused by OpenSSL's NID not matching the GuoMi (SM2) NID

3.1 Bug 94844 - [2020.6]CLI Cannot Import CFCA SM2 Encryption Digital Envelope.

Tracking this bug down took quite a while, because the failing point was not somewhere I initially thought of — treating the head when the head hurts and the foot when the foot hurts is always the first instinct.

It started when QA took a CFCA certificate request to the CA and found that the digital envelope signed and returned could not be imported.

At first I assumed the 10.4 code was failing signature verification because we had modified OpenSSL heavily (the OpenSSL 1.1.1d code is patched with GuoMi/SM2, health checks, and a lot more). I went through the OpenSSL verification code and found no obvious logic error, which made me suspicious; importing the same envelope into a 10.3.1 system also failed, which made me even more suspicious.

Dumping the CSR with asn1 finally revealed the problem: the public key inside the CSR was 0x000…., so the CA was effectively signing a bogus public key every single time. The root cause is that OpenSSL regards the SM2 key's NID as NID_X9_62_id_ecPublicKey (408), while the domestic GuoMi patch changed the NID to NID_sm2 (1172). When OpenSSL fetches the public key and the NID does not match, it simply returns NULL — yet the CSR is still generated successfully. So this bug is the combination of a bad SM2 patch in OpenSSL plus the CA not validating the key; neither side reported an error, which is why it went unnoticed for so long. There were actually quite a few bugs caused by this NID inconsistency and I fixed several of them, but the trigger conditions were all pretty bizarre.
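
The kind of probe that exposes this early (a hypothetical helper, not the shipped CLI code): dump the NIDs a freshly parsed key actually carries before the CSR ever leaves the box.

#include <stdio.h>
#include <openssl/evp.h>
#include <openssl/ec.h>
#include <openssl/objects.h>

// Hypothetical helper: on a correctly patched build an SM2 key should report
// NID_sm2 (1172) as its key type; seeing NID_X9_62_id_ecPublicKey (408) here
// is the kind of mismatch that produced the all-zero public key in the CSR.
static void dump_key_nids(EVP_PKEY *pkey) {
    int type = EVP_PKEY_id(pkey);
    printf("key type NID = %d (%s)\n", type, OBJ_nid2sn(type));

    EC_KEY *ec = EVP_PKEY_get0_EC_KEY(pkey);          // no ownership transferred
    if (ec != NULL) {
        int curve = EC_GROUP_get_curve_name(EC_KEY_get0_group(ec));
        printf("curve NID = %d (%s)\n", curve, OBJ_nid2sn(curve));
    }
}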

4 Zhaoxin CPU bugs

4.1 Bug 103720 - [1850_ZX]The ECC key can’t import & Bug 103313 - [1850_ZX][auto] ssl does not work. & Bug 103951 - The cert process coredump when APV starts on U5580-ZX-C platform & …

The direct symptom of 103720 is that a secp384r1 EC private key cannot be imported; for 103313 it is that 2048-bit private/public keys are not generated correctly. Several other bugs show the same pattern, most of them hitting the assertion assert((*wnumtop) == 0), and most of them caused by div. The affected platform is the ZX-5580 CPU; all other x86_64 platforms are fine. The investigation took an especially long time: I guessed early on that it had to be an assembly-level error, but the error was buried very deep, and for 103313 the root cause had not been found at the time — a patch had been applied to work around it instead — which made the real root cause even harder to locate.

First, importing the private key fails with:

140737347789312:error:1012606B:elliptic curve routines:EC_POINT_set_affine_coordinates:point is not on
curve:crypto/ec/ec_lib.c:812:

In other words the point-on-curve check fails. The function behind EC_POINT_is_on_curve is nested extremely deep, but after a long hunt the problem turned out not to be in the coordinates of the key's point; instead, the public-key information attached to the curve's GROUP was wrong. Chasing why that information was wrong was painful — instrumenting at the instruction level is hard, and the conditions at the critical points were nearly impossible to pin down. After an enormous amount of effort the culprit turned out to be the ZX CPU itself:

Digging down, the root cause is in div: for specific inputs the ZX 5580 CPU computes the wrong result, and it reproduces reliably. On both the good machine and the bad machine the div %rcx instruction and its machine code are identical, and the operand values are identical. div is unsigned integer division: the high 64 bits of the dividend go in rdx and the low 64 bits in rax; when the operation finishes, the quotient is placed in rax and the remainder in rdx.

Register values on a normal machine, before and after div:

Normal machine, before div

rax            0x0	0
rcx            0xffffffffffffffff	-1
rdx            0xffffffff00000001	-4294967295

Normal machine, after div

rax            0xffffffff00000001	-4294967295
rcx            0xffffffffffffffff	-1
rdx            0xffffffff00000001	-4294967295

Register values on the faulty machine, before and after div. Faulty machine, before div:

(gdb) info reg
rax            0x0                 0
rcx            0xffffffffffffffff  -1
rdx            0xffffffff00000001  -4294967295

Faulty machine, after div

rax            0x1                 1
rcx            0xffffffffffffffff  -1
rdx            0x1                 1

To confirm that the normal machine's div result is correct, I verified it with Python:

>>> a = 0xffffffff000000010000000000000000
>>> print(a)
340282366841710300967557013911933812736
>>> print('0x%x' % a)
0xffffffff000000010000000000000000
>>> b = 0xffffffffffffffff
>>> print(b)
18446744073709551615
>>> print('0x%x' % b)
0xffffffffffffffff
>>> print(a // b)
18446744069414584321
>>> print('0x%x' % (a // b))
0xffffffff00000001
>>> print(a - b * (a // b))
18446744069414584321
>>> print('0x%x' % (a - b * (a // b)))
0xffffffff00000001

To check whether other machines have the same problem, I wrote a test.c that can be used to verify other ZX CPUs: machines that print 1 have the bug, machines that print -4294967295 are fine. In testing, only the ZHAOXIN KaiXian KX-U5580@1.8GHz has the problem; the ZHAOXIN KaiSheng KH-37800D@2.7GHz does not.

#include <stdio.h>
#define BN_ULONG unsigned long long

unsigned long long n0 = 18446744069414584321ULL;  /* high 64 bits of the dividend */
unsigned long long n1 = 0;                        /* low 64 bits of the dividend */
unsigned long long d0 = 18446744073709551615ULL;  /* divisor: all ones */
unsigned long long q = 4294967295ULL;

#if 0
BN_ULONG bn_div_words(BN_ULONG h, BN_ULONG l, BN_ULONG d)                               \
{ __asm__ (
        "pushq %rbp\n\t"
        "movq (%rsp), %rbp\n\t"
        "movabsq $0xcef9f3930, %rax\n\t"
        "pushq %rax\n\t"
        "movq (%rsp), %rbp\n\t"
        "leaq 0x30(%rsp), %rsp\n\t"
        );
}

#endif

/* Divide the 128-bit value h:l by d with the divq instruction and
 * return the 64-bit quotient (the remainder is discarded). */
BN_ULONG bn_div_words(BN_ULONG h, BN_ULONG l, BN_ULONG d)
{
    BN_ULONG ret, waste;

    asm("divq      %4" : "=a"(ret), "=d"(waste)
        : "a"(l), "d"(h), "r"(d)
        : "cc");

    return ret;
}


int main() {
        q = bn_div_words(n0, n1, d0);
        printf("q is %lld\n", q);
        return 0;
}


4.2 Bug 102140 - Atcp coredump and can not find interface when performance test SoftSSl(ECDHE-ECDSA-AES128-SHA256)

To be fair, this bug is on a different CPU model from 4.1 — this one is the C-QuadCore C4600@2.0GHz. As it stands, on the C-QuadCore C4600@2.0GHz a MOVQ involving RBP at this spot triggers a coredump every time, while the ZX-5580U does not have the problem. The root cause has not been found yet; I will dig in once I can get hold of the hardware.

5 Problems from mixing code built with different gcc compilers

5.1 gcc 4.8 unordered_map coredump in gcc 5.0

I ran into a very interesting bug today: an .so built with gcc 4.8.5 is used from code built with gcc 5.3.1, and a single find() coredumps. The structure and lookup code involved look like this:

class T {
private:
  ...
    std::unordered_map<std::string, T*> T_map;
    pthread_rwlock_t rwlock;
...
public:
    static T *instance() {
        static T inst;
        return &inst;
    }
...
};

    pthread_rwlock_rdlock(&rwlock);
    auto it = T_map.find(ns_key_name);
    if (it != T_map.end()) {
        pthread_rwlock_unlock(&rwlock);
        return it->second;
    }

This one was a real pain. The startling discovery was that in the gcc 5 build, when the lock on T is taken, rdi should clearly point at rwlock, yet what gets modified is T_map's _M_buckets. On closer inspection, gcc 4.8's unordered_map is 48 bytes, so the lock sits at unordered_map + 48; in gcc 5 unordered_map is 56 bytes, so the lock should sit at unordered_map + 56 — but the .so code still takes the lock at the +48 position, corrupting _M_buckets and causing the coredump.
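
A quick way to see the mismatch (assuming both toolchains are on hand): compile the same probe with g++ 4.8 and g++ 5.x using the same flags and compare the output.

#include <cstdio>
#include <string>
#include <unordered_map>

// gcc 4.8's libstdc++ lays std::unordered_map out in 48 bytes, gcc 5's in 56.
// A .so built with the old compiler therefore computes "the lock lives at
// map + 48", while code built with the new compiler puts it at map + 56, so
// the old code ends up writing into _M_buckets instead of the rwlock.
int main() {
    std::printf("sizeof(std::unordered_map<std::string, void*>) = %zu\n",
                sizeof(std::unordered_map<std::string, void*>));
    return 0;
}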

6 Coredump caused by signature data growing longer than expected

Today's problem was especially annoying, and I did not see it coming. I was optimizing ECDSA/RSA signing and naively assumed that since the message-digest buffer never needs more than 64 bytes, the signature buffer would not either. Then came a string of baffling symptoms: ECDSA sign/verify worked, but after RSA signing returned, printing the signature contents coredumped like crazy. lldb/gdb reported the crash inside std::string's assign, and at first I could not see why a single-threaded assignment would crash. Then it clicked: out-of-bounds memory access. What a trap!

        /* The final signature is not necessarily under 64 bytes -- for ECDSA a 256-bit key can produce up to 72 bytes. signature_buf is best kept at 1024 bytes, which also covers an RSA 4096 key outright. */
        static bool ecdsa_sha256_sign_openssl(const std::string &message, const std::string &pem_private_key, std::string &signature) {
            uint32_t digest_len = 0;
            uint32_t signature_len = 0;
            unsigned char digest[EVP_MAX_MD_SIZE];
            char signature_buf[EVP_MAX_MD_SIZE];          // BUG: 64 bytes cannot hold the signature; RSA-4096, ECDSA, etc. all exceed 64 bytes. The fix was to make every signature_buf 1024 bytes.
        
            std::string key = "-----BEGIN PRIVATE KEY-----\n" + pem_private_key + "\n-----END PRIVATE KEY-----";
            BIO *bio = BIO_new(BIO_s_mem());
            if (bio == nullptr) {
                return false;
            }
            BIO_puts(bio, key.c_str());

            EC_KEY *ec_key = PEM_read_bio_ECPrivateKey(bio, NULL, NULL, NULL);
            if (ec_key == nullptr) {
                BIO_free(bio);
                return false;
            }

            EVP_MD_CTX *md_ctx = EVP_MD_CTX_create();
            if (md_ctx == nullptr) {
                BIO_free(bio);
                EC_KEY_free(ec_key);
                return false;
            }

            bool ret = true;
            bzero(digest, EVP_MAX_MD_SIZE);
            if (EVP_DigestInit(md_ctx, EVP_sha256()) <= 0
                ||EVP_DigestUpdate(md_ctx, (const void *)message.c_str(), message.length()) <= 0
                    ||EVP_DigestFinal(md_ctx, digest, &digest_len) <= 0
                        ||ECDSA_sign(0, digest, digest_len, (unsigned char*)signature_buf, &signature_len, ec_key) <= 0) {
                ret = false;
            }


            if (ret == true) {
                std::string final_sig(signature_buf, signature_len);
                signature = final_sig;
            }

            BIO_free(bio);
            EC_KEY_free(ec_key);
            EVP_MD_CTX_destroy(md_ctx);

            return ret;

        }
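
For reference, a sketch of a more defensive sizing (the actual fix simply bumped every signature_buf to 1024 bytes): ask OpenSSL for the worst-case signature length of the key instead of guessing.

#include <string>
#include <vector>
#include <openssl/ec.h>
#include <openssl/ecdsa.h>

// Sketch only: size the buffer from ECDSA_size(), which returns the DER upper
// bound for this key (e.g. roughly 72 bytes for P-256), so a longer signature
// can never run off the end of a fixed 64-byte array.
static bool ecdsa_sign_digest(EC_KEY *ec_key,
                              const unsigned char *digest, unsigned int digest_len,
                              std::string &signature) {
    int max_len = ECDSA_size(ec_key);
    if (max_len <= 0) {
        return false;
    }
    std::vector<unsigned char> buf(max_len);
    unsigned int sig_len = 0;
    if (ECDSA_sign(0, digest, digest_len, buf.data(), &sig_len, ec_key) <= 0) {
        return false;
    }
    signature.assign(reinterpret_cast<const char *>(buf.data()), sig_len);
    return true;
}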
        

7 An OpenSSL base64 bug

While testing code today I found that OpenSSL's own base64 implementation has a bug: input that should fail to decode frequently decodes successfully instead. Both 1.1.1k and 1.0.2k behave incorrectly. I then looked at BoringSSL's implementation, and BoringSSL gets it right. What a trap!
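
A minimal harness of the kind of check that trips over this (the exact failing inputs from the tests are not reproduced here; the over-padded string below is only an illustrative guess):

#include <cstdio>
#include <cstring>
#include <openssl/evp.h>

// Decode one string with the one-shot EVP_DecodeBlock API and report whether
// OpenSSL rejected it (-1) or returned a decoded length. Feeding it inputs
// that ought to be invalid is how the "decodes things it should reject"
// behaviour shows up.
static int try_decode(const char *in) {
    unsigned char out[256];
    int n = EVP_DecodeBlock(out, (const unsigned char *)in, (int)strlen(in));
    std::printf("input \"%s\" -> %d\n", in, n);
    return n;
}

int main() {
    try_decode("aGVsbG8=");       // valid base64 for "hello"
    try_decode("aGVsbG8=====");   // over-padded: arguably should be rejected
    return 0;
}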

8 A strange log4cplus problem

No real fix was ever found for this one, so I will just describe what happened. I built a shared library that depends on log4cplus 1.1.3 and boost 1.69, and my build environment is gcc 4.8.5 (whose C++11 support is incomplete). A customer then consumed this library with gcc 8.3.1 alongside another logging library that bundles log4cplus 1.2.2 — and that triggered a very strange problem.

Every time the configuration file is loaded, the code checks whether the map holding the configuration contains the string "layout", which triggers an implicit char* to std::string conversion — and every such conversion crashes inside _int_malloc within malloc. The same crash also occurs with gcc 4.8.5. I checked everything from the string construction down to malloc, plus all the fields and arguments of Properties; everything looked normal, yet it crashed reliably every time, which was baffling. Not being able to see the actual malloc source made it even more painful.

It turned out the problem was not in the char* conversion at all: the preceding 664-byte allocation of a RollingFileAppender corrupts malloc's bookkeeping when its mutex is constructed — the mutex position falls outside the RollingFileAppender and writes 8 bytes too far.

The concrete symptoms:

With g++ main.cpp /usr/lib64/libclogV2.a ./auth_centos7_4cpp.so -std=c++11 -g -o statica.out, i.e. linking the static library, the RollingFileAppender allocated by log4cplus::RollingFileAppender::RollingFileAppender is 664 bytes, the lock lies inside the RollingFileAppender, and the mutex position matches the header files I include myself.

But with g++ main.cpp /usr/lib64/libclogV2.so ./auth_centos7_4cpp.so -std=c++11 -g -o statica.out, the RollingFileAppender is still 664 bytes, yet the mutex offset is not within those 664 bytes — it writes memory beyond them — and the mutex position matches the header files bundled with clogV2.

Linking the clogV2 dynamic/static library on its own shows that its RollingFileAppender is 680 bytes: clogV2's log4cplus is based on 1.2.2 while ours is 1.1.3, and the headers of the two versions differ, so the RollingFileAppender data structure differs in size. Because we pull in our own headers, each actual malloc only allocates 664 bytes — never 680. This is textbook undefined behavior, which is exactly why it manifests like this.
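
A quick probe that makes the mismatch obvious (which headers it picks up depends on the include path, so treat this as a sketch): print the size this translation unit believes RollingFileAppender has and compare it with what the clogV2 build reports.

#include <cstdio>
#include <log4cplus/fileappender.h>

// Built against our log4cplus 1.1.3 headers this prints 664; built against the
// 1.2.2 headers bundled with clogV2 it prints 680. Whichever side allocates the
// object sizes it from its own headers, and the other side's constructor then
// writes past the end -- the undefined behaviour described above.
int main() {
    std::printf("sizeof(log4cplus::RollingFileAppender) = %zu\n",
                sizeof(log4cplus::RollingFileAppender));
    return 0;
}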

The detailed investigation follows.

I tried to reproduce the problem myself on CentOS 7 with gcc 4.8.5.



/*
gcc 8.3.1: building against our library and clogV2.so with
g++ main.cpp ./auth_centos7_4cpp.so /usr/lib64/libclogV2.so
cores at run time
*/
(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x00007ffff79d7d18 in log4cplus::spi::ObjectRegistryBase::putVal (this=0x604270, name=..., object=<optimized out>)
    at objectregistry.cxx:87
#2  0x00007ffff79e3ab8 in log4cplus::spi::FactoryRegistry<log4cplus::spi::LayoutFactory>::put (object=..., this=0x604270)
    at ../include/log4cplus/spi/factory.h:163
#3  log4cplus::initializeFactoryRegistry () at factory.cxx:169
#4  0x00007ffff79cd47b in log4cplus::initializeLog4cplus () at global-init.cxx:374
#5  0x00007ffff79cd4ed in (anonymous namespace)::_static_log4cplus_initializer::_static_log4cplus_initializer (
    this=0x7ffff7dda74a <(anonymous namespace)::initializer>) at global-init.cxx:557
#6  __static_initialization_and_destruction_0 (__priority=<optimized out>, __initialize_p=1) at global-init.cxx:569
#7  0x00007ffff7dea9c3 in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
#8  0x00007ffff7ddc17a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#9  0x0000000000000001 in ?? ()
#10 0x00007fffffffe3a1 in ?? ()
#11 0x0000000000000000 in ?? ()
(gdb)
      
/* Linking clogV2's shared library first and the auth library second, i.e. g++ main.cpp /usr/lib64/libclogV2.so ./auth_centos7_4cpp.so, cores in the same way. The strangest part is that linking libclogV2's static library does not core. */
(gdb) bt
#0  0x00007ffff63cb3d7 in raise () from /lib64/libc.so.6
#1  0x00007ffff63ccac8 in abort () from /lib64/libc.so.6
#2  0x00007ffff640df67 in __libc_message () from /lib64/libc.so.6
#3  0x00007ffff6417b36 in _int_malloc () from /lib64/libc.so.6
#4  0x00007ffff641a78c in malloc () from /lib64/libc.so.6
#5  0x00007ffff6cda18d in operator new(unsigned long) () from /lib64/libstdc++.so.6
#6  0x00007ffff6d38cd9 in std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&) ()
   from /lib64/libstdc++.so.6
#7  0x00007ffff6d3a561 in char* std::string::_S_construct<char const*>(char const*, char const*, std::allocator<char> const&, std::forward_iterator_tag) () from /lib64/libstdc++.so.6
#8  0x00007ffff6d3a998 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, std::allocator<char> const&) () from /lib64/libstdc++.so.6
#9  0x00007ffff7b9815b in log4cplus::helpers::Properties::exists(char const*) const () from /lib64/libclogV2.so
#10 0x00007ffff7b4b895 in log4cplus::Appender::Appender(log4cplus::helpers::Properties const&) () from /lib64/libclogV2.so
#11 0x00007ffff7b6714f in log4cplus::FileAppenderBase::FileAppenderBase(log4cplus::helpers::Properties const&, std::_Ios_Openmode) ()
   from /lib64/libclogV2.so
#12 0x00007ffff7b71646 in log4cplus::FileAppender::FileAppender(log4cplus::helpers::Properties const&, std::_Ios_Openmode) ()
   from /lib64/libclogV2.so  //已经出错了
#13 0x00007ffff7b71959 in log4cplus::RollingFileAppender::RollingFileAppender(log4cplus::helpers::Properties const&) ()
   from /lib64/libclogV2.so
#14 0x00007ffff77009f1 in log4cplus::spi::FactoryTempl<log4cplus::RollingFileAppender, log4cplus::spi::AppenderFactory>::createObject (
    this=<optimized out>, props=...) at ../include/log4cplus/spi/factory.h:242
#15 0x00007ffff7b50bb9 in log4cplus::PropertyConfigurator::configureAppenders() () from /lib64/libclogV2.so
#16 0x00007ffff7b53d2e in log4cplus::PropertyConfigurator::configure() () from /lib64/libclogV2.so
#17 0x00007ffff7b53f24 in log4cplus::PropertyConfigurator::doConfigure(std::string const&, log4cplus::Hierarchy&, unsigned int) ()
   from /lib64/libclogV2.so
#18 0x00007ffff744ed07 in AuthService::Init (this=this@entry=0x6089c0)
    at /home/vgdog/code/sts/auth_sdk_cpp/source/auth_service_real.cpp:9
#19 0x00007ffff744c6ac in kms::Singleton<AuthService>::instance ()
    at /home/vgdog/code/sts/auth_sdk_cpp/source/./utils/Singleton.h:54
#20 inf::auth::AuthServiceSignInit (param=...) at /home/vgdog/code/sts/auth_sdk_cpp/source/auth_service.cpp:8
#21 0x0000000000400dee in main ()

/* Checking log4cplus::Properties shows it is fine */
 
/* So I asked SRE to install the glibc sources and kept digging: the memory is already corrupted before entering log4cplus::helpers::Properties::exists(char const*). The address is covered by watch *(struct malloc_chunk *) 0x60eed0; watching it shows the write actually happens while constructing a smart pointer, concretely helpers::SharedObjectPtr<Appender>, i.e. SharedAppenderPtr */
  
/* So the corruption happens in the allocate-memory-then-construct-the-mutex sequence. Problem found: after the allocation is returned, writing the mutex directly overwrites the next chunk's size field -- details below */
/* Arguments and call stack when the mutex is initialized */
(gdb) info reg
rax            0x7ffff7dd5bb0	140737351867312
rbx            0x60ec40	6351936
rcx            0x60ec30	6351920
rdx            0x60ec40	6351936
rsi            0x1	1
rdi            0x60eed8	6352600

(gdb) bt
#0  0x00007ffff7ba4510 in log4cplus::thread::Mutex::Mutex(log4cplus::thread::Mutex::Type) () from /lib64/libclogV2.so
#1  0x00007ffff7b71934 in log4cplus::RollingFileAppender::RollingFileAppender(log4cplus::helpers::Properties const&) ()
   from /lib64/libclogV2.so
#2  0x00007ffff77009f1 in log4cplus::spi::FactoryTempl<log4cplus::RollingFileAppender, log4cplus::spi::AppenderFactory>::createObject (
    this=<optimized out>, props=...) at ../include/log4cplus/spi/factory.h:242
#3  0x00007ffff7b50bb9 in log4cplus::PropertyConfigurator::configureAppenders() () from /lib64/libclogV2.so
#4  0x00007ffff7b53d2e in log4cplus::PropertyConfigurator::configure() () from /lib64/libclogV2.so
#5  0x00007ffff7b53f24 in log4cplus::PropertyConfigurator::doConfigure(std::string const&, log4cplus::Hierarchy&, unsigned int) ()
   from /lib64/libclogV2.so
  
(gdb) p (&((struct malloc_chunk *) 0x60eed0)->size)
$103 = (size_t *) 0x60eed8

/* When linking the clogV2 static library, malloc is called with 664 bytes, i.e. sizeof(log4cplus::RollingFileAppender), and returns 0x60cfd0. The heap grows from low to high addresses, so the size field at 0x60cfd0+664 holds the real chunk size, 1025. The mutex is then constructed at 0x60d258 -- log4cplus::thread::Mutex::Mutex (this=0x60d258, t=log4cplus::thread::Mutex::RECURSIVE) -- which is offset 648 from the start of the RollingFileAppender, i.e. still inside the allocated block */

/* When linking the clogV2 dynamic library, malloc is likewise called with 664 bytes and returns (void *) 0x60ec40, and the size field at 0x60ec40+664 is again 1025. But the mutex is constructed at 0x60eed8, an offset of 664 bytes -- in other words the mutex write is already out of bounds. So why does this happen? */


/* What is actually being initialized here is the mutex of the SharedObject part */
Dump of assembler code for function log4cplus::RollingFileAppender::RollingFileAppender(log4cplus::helpers::Properties const&):
=> 0x00007ffff79e94d0 <+0>:	mov    %rbx,-0x30(%rsp)
   0x00007ffff79e94d5 <+5>:	mov    %rbp,-0x28(%rsp)
   0x00007ffff79e94da <+10>:	mov    %rdi,%rbx
   0x00007ffff79e94dd <+13>:	mov    %r14,-0x10(%rsp)
   0x00007ffff79e94e2 <+18>:	mov    %r12,-0x20(%rsp)
   0x00007ffff79e94e7 <+23>:	lea    0x280(%rdi),%r14    /* the static library's offset is 0x280, i.e. inside the object -- correct */
   0x00007ffff79e94ee <+30>:	mov    %r13,-0x18(%rsp)
   0x00007ffff79e94f3 <+35>:	mov    %r15,-0x8(%rsp)
   0x00007ffff79e94f8 <+40>:	sub    $0x58,%rsp
   0x00007ffff79e94fc <+44>:	mov    0x3d0fa5(%rip),%rax        # 0x7ffff7dba4a8
   0x00007ffff79e9503 <+51>:	mov    %rsi,%rbp
   0x00007ffff79e9506 <+54>:	mov    $0x1,%esi
   0x00007ffff79e950b <+59>:	add    $0x10,%rax
   0x00007ffff79e950f <+63>:	mov    %rax,0x280(%rdi)
   0x00007ffff79e9516 <+70>:	lea    0x288(%rdi),%rdi
   0x00007ffff79e951d <+77>:	callq  0x7ffff76e6f40 <_ZN9log4cplus6thread5MutexC1ENS1_4TypeE@plt>
   
   (gdb) info reg
rax            0x60cfd0	6344656
rbx            0x60cfd0	6344656
rcx            0x60cfc0	6344640
rdx            0x60cfd0	6344656
rsi            0x7fffffffdf60	140737488346976
rdi            0x60cfd0	6344656
rbp            0x7fffffffe050	0x7fffffffe050
rsp            0x7fffffffdec8	0x7fffffffdec8
r8             0x400	1024
r9             0x298	664
r10            0x7fffffffd2a0	140737488343712
r11            0x7ffff79e94d0	140737347753168
r12            0x7fffffffdf60	140737488346976
r13            0x7ffff72663e0	140737339876320
r14            0x7fffffffe060	140737488347232
r15            0x7fffffffdf60	140737488346976
rip            0x7ffff79e94d0	0x7ffff79e94d0 <log4cplus::RollingFileAppender::RollingFileAppender(log4cplus::helpers::Properties const&)>
eflags         0x202	[ IF ]
cs             0x33	51
ss             0x2b	43
ds             0x0	0
es             0x0	0
fs             0x0	0
gs             0x0	0
   
   
Dump of assembler code for function _ZN9log4cplus19RollingFileAppenderC1ERKNS_7helpers10PropertiesE:
=> 0x00007ffff7b71900 <+0>:	push   %r15
   0x00007ffff7b71902 <+2>:	push   %r14
   0x00007ffff7b71904 <+4>:	push   %r13
   0x00007ffff7b71906 <+6>:	push   %r12
   0x00007ffff7b71908 <+8>:	mov    %rsi,%r12
   0x00007ffff7b7190b <+11>:	push   %rbp
   0x00007ffff7b7190c <+12>:	push   %rbx
   0x00007ffff7b7190d <+13>:	mov    %rdi,%rbx
   0x00007ffff7b71910 <+16>:	lea    0x298(%rdi),%rdi    /* rdi holds the RollingFileAppender pointer and 0x298 is added to it directly -- that offset is already too large, beyond sizeof(RollingFileAppender); after the assignment rdi is (gdb) p/x 0x60ec40+0x298 $14 = 0x60eed8, i.e. already out of bounds */
   0x00007ffff7b71917 <+23>:	mov    $0x1,%esi
   0x00007ffff7b7191c <+28>:	sub    $0x38,%rsp
   0x00007ffff7b71920 <+32>:	mov    0x266139(%rip),%rax        # 0x7ffff7dd7a60
   0x00007ffff7b71927 <+39>:	add    $0x10,%rax
   0x00007ffff7b7192b <+43>:	mov    %rax,-0x8(%rdi)
   0x00007ffff7b7192f <+47>:	callq  0x7ffff7b3fbb0 <_ZN9log4cplus6thread5MutexC1ENS1_4TypeE@plt>
   
   
rax            0x60ec40	6351936
rbx            0x60ec40	6351936
rcx            0x60ec30	6351920
rdx            0x60ec40	6351936
rsi            0x7fffffffe090	140737488347280
rdi            0x60ec40	6351936
rbp            0x7fffffffdff0	0x7fffffffdff0
rsp            0x7fffffffdf38	0x7fffffffdf38
r8             0x400	1024
r9             0x298	664
r10            0x7fffffffd320	140737488343840
r11            0x7ffff7b71900	140737349359872
r12            0x7fffffffe090	140737488347280
r13            0x7fffffffe198	140737488347544
r14            0x7fffffffdfe0	140737488347104
r15            0x0	0
rip            0x7ffff7b71900	0x7ffff7b71900 <log4cplus::RollingFileAppender::RollingFileAppender(log4cplus::helpers::Properties const&)>
eflags         0x202	[ IF ]
cs             0x33	51
ss             0x2b	43
ds             0x0	0
es             0x0	0
fs             0x0	0
gs             0x0	0

9 A bug from mixing OpenSSL and BoringSSL

10 A gRPC dependency problem

While writing a gRPC unit test today, this error suddenly appeared:

liboffboard_Sdashboard_Sproto_Slibscenarios_Ucc_Ugrpc.so: undefined symbol: _ZN4grpc24g_core_codegen_interfaceE

First, run it through c++filt:

c++filt _ZN4grpc24g_core_codegen_interfaceE
grpc::g_core_codegen_interface

It is clearly a gRPC symbol, so adding one line to the bazel BUILD file fixed it:

"@com_github_grpc_grpc//:grpc++",

11 A rebase error

While writing code today I noticed CI failing — the postsubmit bazel build broke. My instinct was to revert; after the revert, the original author pasted the original change showing the dependency had in fact been added. Looking further, an earlier MR had been indirectly rebased before this one. At first I thought an indirect dependency had been deleted, but something still bothered me: the original commit added trajectory_developer, which directly depends on that header, and the trajectory_developer it added depended on two targets, cyclist_behavior_predictor & vehicle_intention_predictor. How did the dependency end up as cyclist_behavior_predictor & vehicle_behavior_predictor after the rebase?

This really looks like a rebase error to me.

12 A double NaN problem

The postsubmit pipeline has been failing these past few days with a very strange problem. The offending code:

PlannerState other_state = state;
//EXPECT_EQ(state, state);
EXPECT_EQ(state, other_state); // <====== this is the line that fails, though even the line above reports the same error

// It normally does not fail either; you need the command below to reproduce it:
bazel test -c opt --cache_test_results=no //onboard/planner:planner_state_test --spawn_strategy=local

I was stumped until I finally asked 川 and learned that the default double value had become NaN: PlannerState contains a double member whose default initialization ended up as NaN, and NaN compared with itself is false. In other words, the comparison in the operator below evaluates to true and the test reports a failure. I added a C++11 isnan check and confirmed that the error really is printed. So why does the same test only fail in local mode? In IEEE 754-1985, a NaN is encoded with an all-ones exponent and a non-zero fraction. Taking the 32-bit IEEE single-precision NaN as an example, the bit pattern is S111 1111 1AXX XXXX XXXX XXXX XXXX XXXX: S is the sign bit and its value does not matter; A is the most significant bit of the significand and indicates the NaN type; the fraction must not be all zeros, and the fraction bits other than A are called the NaN payload.

A 64-bit IEEE double-precision NaN looks the same: the exponent starts out as 11111…, and the fraction part is non-zero.

And the bit pattern here was 1111111111111111111111111111111111111111101000000000000000000000, i.e. 18446744073703260160 in decimal, ffffffffffa00000 in hex; running isnan on it still says NaN.

Then the most interesting part: even arithmetic on this value, i.e. nan - nan, is still NaN rather than zero. In other words subtraction is unusable, equality is unusable, every comparison and arithmetic trick fails. So the most thorough fix for this problem is explicit initialization.

struct PlannerState {
    ...
      double current_route_s; // <=== the problematic field
    ...
};

bool PlannerState::operator==(const PlannerState& other) const {
    ...
    current_route_s != current_route_s // this expression evaluates to true,

    if (isnan(current_route_s)) {
      LOG(INFO) << "what is nan default?";
    }
    ...
}
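
A self-contained illustration of the NaN behaviour described above:

#include <cmath>
#include <cstdio>
#include <limits>

int main() {
    // Stand-in for the uninitialized current_route_s that happened to be NaN.
    double d = std::numeric_limits<double>::quiet_NaN();

    std::printf("d == d        -> %d\n", d == d);             // 0: NaN never compares equal, even to itself
    std::printf("d != d        -> %d\n", d != d);             // 1: exactly what makes operator== report a mismatch
    std::printf("isnan(d - d)  -> %d\n", std::isnan(d - d));  // 1: arithmetic on NaN stays NaN
    return 0;
}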

Looking at the assembly where the copy goes wrong — it is the middle part of the listing below, and very direct. So what about the place that actually assigns this field?

    0x55555557283e <+798>:  callq  0x55555558c010            ; symbol stub for: qcraft::planner::SpacetimePlannerTrajectories::SpacetimePlannerTrajectories(qcraft::planner::SpacetimePlannerTrajectories const&)
    0x555555572843 <+803>:  vmovsd 0x5e0(%r15), %xmm0        ; xmm0 = mem[0],zero 
    0x55555557284c <+812>:  vmovsd %xmm0, 0x5e0(%r13)        // offset 0x5e0 is where double current_route_s lives
    0x555555572855 <+821>:  leaq   0x5e8(%r13), %rdi 
    0x55555557285c <+828>:  leaq   0x5e8(%r15), %rsi         // offset 0x6a0 is where MissionStageProto lives
    0x555555572863 <+835>:  callq  0x555555576a00            ; ::LanePath() at lane_path.h:26
    0x555555571fd8 <+536>:  callq  0x55555558bed0            ; symbol stub for: qcraft::planner::SpacetimePlannerTrajectories::SpacetimePlannerTrajectories(google::protobuf::Arena*, bool)
->  0x555555571fdd <+541>:  vxorps %xmm0, %xmm0, %xmm0
    0x555555571fe1 <+545>:  vmovups %ymm0, 0x608(%r15)
    0x555555571fea <+554>:  vmovups %ymm0, 0x5e8(%r15)       // you can see the double itself is never explicitly set -- at most there is a write at offset 0x5e8, so the question becomes whether there is any corresponding initialization; go find the init code, i.e. the allocation path
    0x555555571ff3 <+563>:  movq   $0x0, 0x628(%r15)
    0x555555571ffe <+574>:  movb   $0x1, 0x630(%r15)
    0x555555572006 <+582>:  vxorps %xmm1, %xmm1, %xmm1
    0x55555557200a <+586>:  vmovups %xmm1, 0x648(%r15)
    0x555555572013 <+595>:  movq   $0x0, 0x658(%r15)
    0x55555557201e <+606>:  movq   $0x0, 0x668(%r15)
    0x555555572029 <+617>:  movq   %r13, 0x670(%r15)
    0x555555572030 <+624>:  leaq   0x678(%r15), %rbx
    0x555555572037 <+631>:  vmovups %ymm0, 0x678(%r15)
    0x555555572040 <+640>:  movabsq $0x7fefffffffffffff, %rax ; imm = 0x7FEFFFFFFFFFFFFF 
    0x55555557204a <+650>:  movq   %rax, 0x698(%r15)   // this stops 8 bytes short
    0x555555572051 <+657>:  leaq   0x6a0(%r15), %rdi   // 0x6a0 is the this pointer / start address passed to MissionStageProto
    0x555555572058 <+664>:  xorl   %esi, %esi
    0x55555557205a <+666>:  xorl   %edx, %edx
    0x55555557205c <+668>:  vzeroupper 
    0x55555557205f <+671>:  callq  0x55555558bee0            ; symbol stub for: qcraft::MissionStageProto::MissionStageProto(google::protobuf::Arena*, bool)

13 A folly::ThreadLocal problem under multithreading

While maintaining the codebase these past few days, the following error was reported:

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (7 ms total)
[  PASSED  ] 1 test.
terminate called after throwing an instance of 'std::system_error'
  what():  pthread_setspecific failed: Invalid argument

I chased it down; the full call stack is below. What happens is that when the main thread exits it calls pthread_key_delete to delete pkey 5, but pthread_setspecific is used on that same pkey afterwards, and that is the call that fails.

[qcraft@qcraft-dev-qcraft:/qcraft(master) ] $ 
[qcraft@qcraft-dev-qcraft:/qcraft(master) ] $ lldb bazel-bin/onboard/prediction/container/objects_history_test
(lldb) target create "bazel-bin/onboard/prediction/container/objects_history_test"
Current executable set to '/qcraft/bazel-bin/onboard/prediction/container/objects_history_test' (x86_64).
(lldb) b pthread_setspecific
Breakpoint 1: no locations (pending).
WARNING:  Unable to resolve breakpoint to any actual locations.
(lldb) b pthread_key_create
Breakpoint 2: no locations (pending).
WARNING:  Unable to resolve breakpoint to any actual locations.
(lldb) b pthread_key_delete
Breakpoint 3: no locations (pending).
WARNING:  Unable to resolve breakpoint to any actual locations.
(lldb) r
...
Process 6522 resuming

Process 6522 stopped
* thread #1, name = 'objects_history', stop reason = breakpoint 3.1
    frame #0: 0x00007ffff0c9a0d0 libpthread.so.0`pthread_key_delete
libpthread.so.0`pthread_key_delete:
->  0x7ffff0c9a0d0 <+0>:  cmpl   $0x3ff, %edi              ; imm = 0x3FF 
    0x7ffff0c9a0d6 <+6>:  ja     0x7ffff0c9a100            ; <+48>
    0x7ffff0c9a0d8 <+8>:  movl   %edi, %edi
    0x7ffff0c9a0da <+10>: leaq   0x20b23f(%rip), %rax      ; __GI___pthread_keys
(lldb) register read
General Purpose Registers:
       rax = 0x00005555556519e0  objects_history_test`folly::threadlocal_detail::PthreadKeyUnregister::instance_
       rbx = 0x0000000000000001
       rcx = 0x0000000000000001
       rdx = 0x0000000000000000
       rdi = 0x0000000000000005
       rsi = 0x000000000000ffff
       rbp = 0x00007fffffffd760
       rsp = 0x00007fffffffd738
        r8 = 0x00005555556519e0  objects_history_test`folly::threadlocal_detail::PthreadKeyUnregister::instance_
        r9 = 0x00007fffef6eea00
       r10 = 0x00000000fffffffa
       r11 = 0x0000000000000000
       r12 = 0x00007fffefcff718
       r13 = 0x00000000000005f0
       r14 = 0x00007fffefd04708
       r15 = 0x0000555555716b60
       rip = 0x00007ffff0c9a0d0  libpthread.so.0`pthread_key_delete
    rflags = 0x0000000000000257
        cs = 0x0000000000000033
        fs = 0x0000000000000000
        gs = 0x0000000000000000
        ss = 0x000000000000002b
        ds = 0x0000000000000000
        es = 0x0000000000000000

(lldb) bt
* thread #1, name = 'objects_history', stop reason = breakpoint 3.1
  * frame #0: 0x00007ffff0c9a0d0 libpthread.so.0`pthread_key_delete
    frame #1: 0x0000555555643f92 objects_history_test`folly::threadlocal_detail::PthreadKeyUnregister::~PthreadKeyUnregister(this=0x00005555556519e0) at ThreadLocalDetail.h:266:7
    frame #2: 0x00007fffef957161 libc.so.6`___lldb_unnamed_symbol109$$libc.so.6 + 369
    frame #3: 0x00007fffef95725a libc.so.6`exit + 26
    frame #4: 0x00007fffef935bfe libc.so.6`__libc_start_main + 238
    frame #5: 0x00005555555e6b5a objects_history_test`_start + 42

...

(lldb) c
Process 6522 resuming
Process 6522 stopped
* thread #2, name = 'objects_history', stop reason = breakpoint 1.1
    frame #0: 0x00007ffff0c9a1a0 libpthread.so.0`__pthread_setspecific
libpthread.so.0`__pthread_setspecific:
->  0x7ffff0c9a1a0 <+0>: pushq  %r13
    0x7ffff0c9a1a2 <+2>: pushq  %r12
    0x7ffff0c9a1a4 <+4>: pushq  %rbp
    0x7ffff0c9a1a5 <+5>: pushq  %rbx
(lldb) register read
General Purpose Registers:
       rax = 0x00007fffeeceb560
       rbx = 0x00005555556ef4f8
       rcx = 0x00005555557151b0
       rdx = 0x00005555556fe5c0
       rdi = 0x0000000000000005
       rsi = 0x00007fffeeceb560
       rbp = 0x00007fffeeaea050
       rsp = 0x00007fffeeaea018
        r8 = 0x00000000620e53ab
        r9 = 0x00007fffeeae9d88
       r10 = 0x00007fffeeae9d40
       r11 = 0x0000000000000000
       r12 = 0x00007fffeeaea880
       r13 = 0x0000000000000000
       r14 = 0x000055555571bc40
       r15 = 0x00007fffffffc6a0
       rip = 0x00007ffff0c9a1a0  libpthread.so.0`__pthread_setspecific
    rflags = 0x0000000000000202
        cs = 0x0000000000000033
        fs = 0x0000000000000000
        gs = 0x0000000000000000
        ss = 0x000000000000002b
        ds = 0x0000000000000000
        es = 0x0000000000000000

(lldb) thread list
Process 6522 stopped
  thread #1: tid = 6522, 0x00007ffff7deaf2b, name = 'objects_history'
* thread #2: tid = 7615, 0x00007ffff0c9a1a0 libpthread.so.0`__pthread_setspecific, name = 'objects_history', stop reason = breakpoint 1.1
(lldb) c
Process 6522 resuming
terminate called after throwing an instance of 'std::system_error'
  what():  pthread_setspecific failed: Invalid argument
Process 6522 stopped
* thread #2, name = 'objects_history', stop reason = signal SIGABRT
    frame #0: 0x00007fffef952fb7 libc.so.6`raise + 199
libc.so.6`raise:
->  0x7fffef952fb7 <+199>: movq   0x108(%rsp), %rcx
    0x7fffef952fbf <+207>: xorq   %fs:0x28, %rcx
    0x7fffef952fc8 <+216>: movl   %r8d, %eax
    0x7fffef952fcb <+219>: jne    0x7fffef952fec            ; <+252>
(lldb) 

(lldb) 
Process 11199 resuming
Process 11199 stopped
* thread #1, name = 'objects_history', stop reason = breakpoint 3.1
    frame #0: 0x0000555555641c8b objects_history_test`folly::threadlocal_detail::PthreadKeyUnregister::registerKey(key=5) at ThreadLocalDetail.h:272:31
(lldb) bt
* thread #1, name = 'objects_history', stop reason = breakpoint 3.1
  * frame #0: 0x0000555555641c8b objects_history_test`folly::threadlocal_detail::PthreadKeyUnregister::registerKey(key=5) at ThreadLocalDetail.h:272:31
    frame #1: 0x0000555555640d8d objects_history_test`folly::threadlocal_detail::StaticMetaBase::StaticMetaBase(this=0x00005555557151b0, threadEntry=(libonboard_Sglobal_Slibcounter.so`folly::threadlocal_detail::StaticMeta<qcraft::CounterImpl::CounterTag, void>::getThreadEntrySlow() at ThreadLocalDetail.h:468), strict=false)(), bool) at ThreadLocalDetail.cpp:73:3
    frame #2: 0x00007ffff55a525b libonboard_Sglobal_Slibcounter.so`folly::threadlocal_detail::StaticMeta<qcraft::CounterImpl::CounterTag, void>::StaticMeta(this=0x00005555557151b0) at ThreadLocalDetail.h:420:9
    frame #3: 0x00007ffff55a5138 libonboard_Sglobal_Slibcounter.so`void* folly::detail::StaticSingletonManagerWithRtti::make<folly::threadlocal_detail::StaticMeta<qcraft::CounterImpl::CounterTag, void> >() at StaticSingletonManager.h:80:16
    frame #4: 0x000055555563e3a9 objects_history_test`folly::detail::(anonymous namespace)::StaticSingletonManagerWithRttiImpl::Entry::get(this=0x00005555556ea1c0, make=0x00007ffff55a5110)()) at StaticSingletonManager.cpp:46:33
    frame #5: 0x000055555563e121 objects_history_test`folly::detail::(anonymous namespace)::StaticSingletonManagerWithRttiImpl::create(this=0x0000555555728910, key=0x00007ffff55b6740, make=0x00007ffff55a5110, cache=0x00007ffff55b9f58)(), std::atomic<void*>&) at StaticSingletonManager.cpp:34:33
    frame #6: 0x000055555563e098 objects_history_test`folly::detail::StaticSingletonManagerWithRtti::create_(arg=0x00007ffff55b9f58) at StaticSingletonManager.cpp:65:19
    frame #7: 0x00007ffff55a5053 libonboard_Sglobal_Slibcounter.so`folly::threadlocal_detail::StaticMeta<qcraft::CounterImpl::CounterTag, void>::instance() [inlined] void* folly::detail::StaticSingletonManagerWithRtti::create_<false>(arg=0x00007ffff55b9f58) at StaticSingletonManager.h:85:12
    frame #8: 0x00007ffff55a504a libonboard_Sglobal_Slibcounter.so`folly::threadlocal_detail::StaticMeta<qcraft::CounterImpl::CounterTag, void>::instance() [inlined] folly::threadlocal_detail::StaticMeta<qcraft::CounterImpl::CounterTag, void>& folly::detail::StaticSingletonManagerWithRtti::create<folly::threadlocal_detail::StaticMeta<qcraft::CounterImpl::CounterTag, void>, void>() at StaticSingletonManager.h:64
    frame #9: 0x00007ffff55a5008 libonboard_Sglobal_Slibcounter.so`folly::threadlocal_detail::StaticMeta<qcraft::CounterImpl::CounterTag, void>::instance() [inlined] folly::threadlocal_detail::StaticMeta<qcraft::CounterImpl::CounterTag, void>& folly::detail::createGlobal<folly::threadlocal_detail::StaticMeta<qcraft::CounterImpl::CounterTag, void>, void>() at StaticSingletonManager.h:97
    frame #10: 0x00007ffff55a5008 libonboard_Sglobal_Slibcounter.so`folly::threadlocal_detail::StaticMeta<qcraft::CounterImpl::CounterTag, void>::instance() at ThreadLocalDetail.h:433
    frame #11: 0x00007ffff55a4f4d libonboard_Sglobal_Slibcounter.so`folly::threadlocal_detail::StaticMeta<qcraft::CounterImpl::CounterTag, void>::getSlowReserveAndCache(ent=0x00005555556fbe30, id=0x00007fffffffc894, threadEntry=0x00007fffef2ee850, capacity=0x00007fffef2ee858) at ThreadLocalDetail.h:458:18
    frame #12: 0x00007ffff55a4e2f libonboard_Sglobal_Slibcounter.so`folly::ThreadLocalPtr<qcraft::CounterImpl::CounterPerThread, qcraft::CounterImpl::CounterTag, void>::get() const [inlined] folly::threadlocal_detail::StaticMeta<qcraft::CounterImpl::CounterTag, void>::get(ent=0x00005555556fbe30) at ThreadLocalDetail.h:448:7
    frame #13: 0x00007ffff55a4dc4 libonboard_Sglobal_Slibcounter.so`folly::ThreadLocalPtr<qcraft::CounterImpl::CounterPerThread, qcraft::CounterImpl::CounterTag, void>::get(this=0x00005555556fbe30) const at ThreadLocal.h:158
    frame #14: 0x00007ffff55941a1 libonboard_Sglobal_Slibcounter.so`folly::ThreadLocal<qcraft::CounterImpl::CounterPerThread, qcraft::CounterImpl::CounterTag, void>::operator->() const [inlined] folly::ThreadLocal<qcraft::CounterImpl::CounterPerThread, qcraft::CounterImpl::CounterTag, void>::get(this=0x00005555556fbe30) const at ThreadLocal.h:68:27
    frame #15: 0x00007ffff559419c libonboard_Sglobal_Slibcounter.so`folly::ThreadLocal<qcraft::CounterImpl::CounterPerThread, qcraft::CounterImpl::CounterTag, void>::operator->(this=0x00005555556fbe30) const at ThreadLocal.h:73
    frame #16: 0x00007ffff5588c39 libonboard_Sglobal_Slibcounter.so`qcraft::CounterImpl::AddCounterEvent(this=0x00005555556fbe30, name="schedulefuture_callback_size", val=1) at counter.cc:51:3
    frame #17: 0x00007ffff558a23d libonboard_Sglobal_Slibcounter.so`qcraft::Counter::AddCounterEventInternal(this=0x0000555555711cd0, name="schedulefuture_callback_size", val=1) at counter.cc:191:10
    frame #18: 0x00005555555f0745 objects_history_test`void qcraft::Counter::AddCounterEvent<long>(this=0x0000555555711cd0, name="schedulefuture_callback_size", val=1) at counter.h:70:5
    frame #19: 0x00005555555e9fa7 objects_history_test`qcraft::Future<std::result_of<boost::circular_buffer<qcraft::elements_history::Node<double, qcraft::prediction::PredictionObject>, std::allocator<qcraft::elements_history::Node<double, qcraft::prediction::PredictionObject> > > (void qcraft::DestroyContainerAsync<boost::circular_buffer<qcraft::elements_history::Node<double, qcraft::prediction::PredictionObject>, std::allocator<qcraft::elements_history::Node<double, qcraft::prediction::PredictionObject> > > >(qcraft::ThreadPool*, boost::circular_buffer<qcraft::elements_history::Node<double, qcraft::prediction::PredictionObject>, std::allocator<qcraft::elements_history::Node<double, qcraft::prediction::PredictionObject> > >)::'lambda'()...)>::type> qcraft::ThreadPool::Schedule<void qcraft::DestroyContainerAsync<boost::circular_buffer<qcraft::elements_history::Node<double, qcraft::prediction::PredictionObject>, std::allocator<qcraft::elements_history::Node<double, qcraft::prediction::PredictionObject> > > >(this=0x000055555572ae70, f=0x00007fffffffcbb0)::'lambda'()>(boost::circular_buffer<qcraft::elements_history::Node<double, qcraft::prediction::PredictionObject>, std::allocator<qcraft::elements_history::Node<double, qcraft::prediction::PredictionObject> > >&&, void qcraft::DestroyContainerAsync<boost::circular_buffer<qcraft::elements_history::Node<double, qcraft::prediction::PredictionObject>, std::allocator<qcraft::elements_history::Node<double, qcraft::prediction::PredictionObject> > > >(qcraft::ThreadPool*, boost::circular_buffer<qcraft::elements_history::Node<double, qcraft::prediction::PredictionObject>, std::allocator<qcraft::elements_history::Node<double, qcraft::prediction::PredictionObject> > >)::'lambda'()&&...) at thread_pool.h:88:5
    frame #20: 0x00005555555e9951 objects_history_test`auto qcraft::ScheduleFuture<void qcraft::DestroyContainerAsync<boost::circular_buffer<qcraft::elements_history::Node<double, qcraft::prediction::PredictionObject>, std::allocator<qcraft::elements_history::Node<double, qcraft::prediction::PredictionObject> > > >(qcraft::ThreadPool*, boost::circular_buffer<qcraft::elements_history::Node<double, qcraft::prediction::PredictionObject>, std::allocator<qcraft::elements_history::Node<double, qcraft::prediction::PredictionObject> > >)::'lambda'()>(thread_pool=0x000055555572ae70, f=0x00007fffffffcbb0)::'lambda'()&&...) at async_util.h:24:23
    frame #21: 0x00005555555e9835 objects_history_test`void qcraft::DestroyContainerAsync<boost::circular_buffer<qcraft::elements_history::Node<double, qcraft::prediction::PredictionObject>, std::allocator<qcraft::elements_history::Node<double, qcraft::prediction::PredictionObject> > > >(thread_pool=0x000055555572ae70, container=<unavailable>) at async_util.h:38:3
    frame #22: 0x00005555555e96fb objects_history_test`void qcraft::DestroyContainerAsync<boost::circular_buffer<qcraft::elements_history::Node<double, qcraft::prediction::PredictionObject>, std::allocator<qcraft::elements_history::Node<double, qcraft::prediction::PredictionObject> > > >(container=circular_buffer<qcraft::elements_history::Node<double, qcraft::prediction::PredictionObject>, std::allocator<qcraft::elements_history::Node<double, qcraft::prediction::PredictionObject> > > @ 0x00007fffffffcc80) at async_util.h:47:3
    frame #23: 0x00005555555e9666 objects_history_test`qcraft::elements_history::ElementHistory<double, qcraft::prediction::PredictionObject, qcraft::prediction::ObjectHistorySpan>::~ElementHistory(this=0x000055555572ae30) at elements_history.h:103:23
    frame #24: 0x00005555555e950b objects_history_test`std::default_delete<qcraft::elements_history::ElementHistory<double, qcraft::prediction::PredictionObject, qcraft::prediction::ObjectHistorySpan> >::operator(this=0x000055555570f6c8, __ptr=0x000055555572ae30)(qcraft::elements_history::ElementHistory<double, qcraft::prediction::PredictionObject, qcraft::prediction::ObjectHistorySpan>*) const at unique_ptr.h:81:2
    frame #25: 0x00005555555e9450 objects_history_test`std::unique_ptr<qcraft::elements_history::ElementHistory<double, qcraft::prediction::PredictionObject, qcraft::prediction::ObjectHistorySpan>, std::default_delete<qcraft::elements_history::ElementHistory<double, qcraft::prediction::PredictionObject, qcraft::prediction::ObjectHistorySpan> > >::~unique_ptr(this=0x000055555570f6c8) at unique_ptr.h:292:4
    frame #26: 0x00005555555e93dd objects_history_test`std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::unique_ptr<qcraft::elements_history::ElementHistory<double, qcraft::prediction::PredictionObject, qcraft::prediction::ObjectHistorySpan>, std::default_delete<qcraft::elements_history::ElementHistory<double, qcraft::prediction::PredictionObject, qcraft::prediction::ObjectHistorySpan> > > >::~pair(this=0x000055555570f6a8) at stl_iterator.h:1283:12
    frame #27: 0x00005555555e93b9 objects_history_test`void __gnu_cxx::new_allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unique_ptr<qcraft::elements_history::ElementHistory<double, qcraft::prediction::PredictionObject, qcraft::prediction::ObjectHistorySpan>, std::default_delete<qcraft::elements_history::ElementHistory<double, qcraft::prediction::PredictionObject, qcraft::prediction::ObjectHistorySpan> > > > >::destroy<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::unique_ptr<qcraft::elements_history::ElementHistory<double, qcraft::prediction::PredictionObject, qcraft::prediction::ObjectHistorySpan>, std::default_delete<qcraft::elements_history::ElementHistory<double, qcraft::prediction::PredictionObject, qcraft::prediction::ObjectHistorySpan> > > > >(this=0x00007fffffffd1b8, __p=0x000055555570f6a8) at new_allocator.h:152:10
    frame #28: 0x00005555555e938d objects_history_test`void std::allocator_traits<std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unique_ptr<qcraft::elements_history::ElementHistory<double, qcraft::prediction::PredictionObject, qcraft::prediction::ObjectHistorySpan>, std::default_delete<qcraft::elements_history::ElementHistory<double, qcraft::prediction::PredictionObject, qcraft::prediction::ObjectHistorySpan> > > > > >::destroy<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::unique_ptr<qcraft::elements_history::ElementHistory<double, qcraft::prediction::PredictionObject, qcraft::prediction::ObjectHistorySpan>, std::default_delete<qcraft::elements_history::ElementHistory<double, qcraft::prediction::PredictionObject, qcraft::prediction::ObjectHistorySpan> > > > >(__a=0x00007fffffffd1b8, __p=0x000055555570f6a8) at alloc_traits.h:496:8
    frame #29: 0x00005555555e935d objects_history_test`void absl::lts_20210324::container_internal::map_slot_policy<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::unique_ptr<qcraft::elements_history::ElementHistory<double, qcraft::prediction::PredictionObject, qcraft::prediction::ObjectHistorySpan>, std::default_delete<qcraft::elements_history::ElementHistory<double, qcraft::prediction::PredictionObject, qcraft::prediction::ObjectHistorySpan> > > >::destroy<std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unique_ptr<qcraft::elements_history::ElementHistory<double, qcraft::prediction::PredictionObject, qcraft::prediction::ObjectHistorySpan>, std::default_delete<qcraft::elements_history::ElementHistory<double, qcraft::prediction::PredictionObject, qcraft::prediction::ObjectHistorySpan> > > > > >(alloc=0x00007fffffffd1b8, slot=0x000055555570f6a8) at container_memory.h:408:7
    frame #30: 0x00005555555e932d objects_history_test`void absl::lts_20210324::container_internal::FlatHashMapPolicy<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::unique_ptr<qcraft::elements_history::ElementHistory<double, qcraft::prediction::PredictionObject, qcraft::prediction::ObjectHistorySpan>, std::default_delete<qcraft::elements_history::ElementHistory<double, qcraft::prediction::PredictionObject, qcraft::prediction::ObjectHistorySpan> > > >::destroy<std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unique_ptr<qcraft::elements_history::ElementHistory<double, qcraft::prediction::PredictionObject, qcraft::prediction::ObjectHistorySpan>, std::default_delete<qcraft::elements_history::ElementHistory<double, qcraft::prediction::PredictionObject, qcraft::prediction::ObjectHistorySpan> > > > > >(alloc=0x00007fffffffd1b8, slot=0x000055555570f6a8) at flat_hash_map.h:567:5
    frame #31: 0x00005555555e914d objects_history_test`void absl::lts_20210324::container_internal::hash_policy_traits<absl::lts_20210324::container_internal::FlatHashMapPolicy<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::unique_ptr<qcraft::elements_history::ElementHistory<double, qcraft::prediction::PredictionObject, qcraft::prediction::ObjectHistorySpan>, std::default_delete<qcraft::elements_history::ElementHistory<double, qcraft::prediction::PredictionObject, qcraft::prediction::ObjectHistorySpan> > > >, void>::destroy<std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unique_ptr<qcraft::elements_history::ElementHistory<double, qcraft::prediction::PredictionObject, qcraft::prediction::ObjectHistorySpan>, std::default_delete<qcraft::elements_history::ElementHistory<double, qcraft::prediction::PredictionObject, qcraft::prediction::ObjectHistorySpan> > > > > >(alloc=0x00007fffffffd1b8, slot=0x000055555570f6a8) at hash_policy_traits.h:101:5
    frame #32: 0x00005555555e9040 objects_history_test`absl::lts_20210324::container_internal::raw_hash_set<absl::lts_20210324::container_internal::FlatHashMapPolicy<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::unique_ptr<qcraft::elements_history::ElementHistory<double, qcraft::prediction::PredictionObject, qcraft::prediction::ObjectHistorySpan>, std::default_delete<qcraft::elements_history::ElementHistory<double, qcraft::prediction::PredictionObject, qcraft::prediction::ObjectHistorySpan> > > >, absl::lts_20210324::container_internal::StringHash, absl::lts_20210324::container_internal::StringHashEq::Eq, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unique_ptr<qcraft::elements_history::ElementHistory<double, qcraft::prediction::PredictionObject, qcraft::prediction::ObjectHistorySpan>, std::default_delete<qcraft::elements_history::ElementHistory<double, qcraft::prediction::PredictionObject, qcraft::prediction::ObjectHistorySpan> > > > > >::destroy_slots(this=0x00007fffffffd198) at raw_hash_set.h:1565:9
    frame #33: 0x00005555555e8f79 objects_history_test`absl::lts_20210324::container_internal::raw_hash_set<absl::lts_20210324::container_internal::FlatHashMapPolicy<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::unique_ptr<qcraft::elements_history::ElementHistory<double, qcraft::prediction::PredictionObject, qcraft::prediction::ObjectHistorySpan>, std::default_delete<qcraft::elements_history::ElementHistory<double, qcraft::prediction::PredictionObject, qcraft::prediction::ObjectHistorySpan> > > >, absl::lts_20210324::container_internal::StringHash, absl::lts_20210324::container_internal::StringHashEq::Eq, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unique_ptr<qcraft::elements_history::ElementHistory<double, qcraft::prediction::PredictionObject, qcraft::prediction::ObjectHistorySpan>, std::default_delete<qcraft::elements_history::ElementHistory<double, qcraft::prediction::PredictionObject, qcraft::prediction::ObjectHistorySpan> > > > > >::~raw_hash_set(this=0x00007fffffffd198) at raw_hash_set.h:970:21
    frame #34: 0x00005555555e8f55 objects_history_test`absl::lts_20210324::container_internal::raw_hash_map<absl::lts_20210324::container_internal::FlatHashMapPolicy<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::unique_ptr<qcraft::elements_history::ElementHistory<double, qcraft::prediction::PredictionObject, qcraft::prediction::ObjectHistorySpan>, std::default_delete<qcraft::elements_history::ElementHistory<double, qcraft::prediction::PredictionObject, qcraft::prediction::ObjectHistorySpan> > > >, absl::lts_20210324::container_internal::StringHash, absl::lts_20210324::container_internal::StringHashEq::Eq, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unique_ptr<qcraft::elements_history::ElementHistory<double, qcraft::prediction::PredictionObject, qcraft::prediction::ObjectHistorySpan>, std::default_delete<qcraft::elements_history::ElementHistory<double, qcraft::prediction::PredictionObject, qcraft::prediction::ObjectHistorySpan> > > > > >::~raw_hash_map(this=0x00007fffffffd198) at raw_hash_map.h:31:7
    frame #35: 0x00005555555e8b05 objects_history_test`absl::lts_20210324::flat_hash_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::unique_ptr<qcraft::elements_history::ElementHistory<double, qcraft::prediction::PredictionObject, qcraft::prediction::ObjectHistorySpan>, std::default_delete<qcraft::elements_history::ElementHistory<double, qcraft::prediction::PredictionObject, qcraft::prediction::ObjectHistorySpan> > >, absl::lts_20210324::container_internal::StringHash, absl::lts_20210324::container_internal::StringHashEq::Eq, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unique_ptr<qcraft::elements_history::ElementHistory<double, qcraft::prediction::PredictionObject, qcraft::prediction::ObjectHistorySpan>, std::default_delete<qcraft::elements_history::ElementHistory<double, qcraft::prediction::PredictionObject, qcraft::prediction::ObjectHistorySpan> > > > > >::~flat_hash_map(this=0x00007fffffffd198) at flat_hash_map.h:108:7
    frame #36: 0x00005555555f5f59 objects_history_test`qcraft::elements_history::ElementsHistory<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, double, qcraft::prediction::PredictionObject, qcraft::elements_history::ElementHistory<double, qcraft::prediction::PredictionObject, qcraft::prediction::ObjectHistorySpan> >::~ElementsHistory(this=0x00007fffffffd190) at elements_history.h:152:7
    frame #37: 0x00005555555e81f5 objects_history_test`qcraft::prediction::ObjectsHistory::~ObjectsHistory(this=0x00007fffffffd190) at objects_history.h:11:7
    frame #38: 0x00005555555e73ae objects_history_test`qcraft::prediction::ObjectsHistoryTest_Test_ObjectHistory_Push_Test::TestBody(this=0x000055555572b5c0) at objects_history_test.cc:13:1
    frame #39: 0x00007ffff11d925b libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so`void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(object=0x000055555572b5c0, method=21 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00, location="the test body")(), char const*) at gtest.cc:2607:10
    frame #40: 0x00007ffff11c6cad libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so`void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(object=0x000055555572b5c0, method=21 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00, location="the test body")(), char const*) at gtest.cc:2643:14
    frame #41: 0x00007ffff11aafe3 libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so`testing::Test::Run(this=0x000055555572b5c0) at gtest.cc:2682:5
    frame #42: 0x00007ffff11abc19 libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so`testing::TestInfo::Run(this=0x0000555555711b10) at gtest.cc:2861:11
    frame #43: 0x00007ffff11ac474 libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so`testing::TestSuite::Run(this=0x00005555556fc500) at gtest.cc:3015:28
    frame #44: 0x00007ffff11be62b libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so`testing::internal::UnitTestImpl::RunAllTests(this=0x00005555557183f0) at gtest.cc:5855:44
    frame #45: 0x00007ffff11dc0fb libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so`bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(object=0x00005555557183f0, method=d0 e1 1b f1 ff 7f 00 00 00 00 00 00 00 00 00 00, location="auxiliary test code (environments or event listeners)")(), char const*) at gtest.cc:2607:10
    frame #46: 0x00007ffff11c9243 libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so`bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(object=0x00005555557183f0, method=d0 e1 1b f1 ff 7f 00 00 00 00 00 00 00 00 00 00, location="auxiliary test code (environments or event listeners)")(), char const*) at gtest.cc:2643:14
    frame #47: 0x00007ffff11be15e libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so`testing::UnitTest::Run(this=0x00007ffff11f80f8) at gtest.cc:5438:10
    frame #48: 0x00007ffff11fcb01 libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest_Umain.so`RUN_ALL_TESTS() at gtest.h:2490:46
    frame #49: 0x00007ffff11fca9b libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest_Umain.so`main(argc=1, argv=0x00007fffffffd8a8) at gmock_main.cc:70:10
    frame #50: 0x00007fffef935bf7 libc.so.6`__libc_start_main + 231
    frame #51: 0x00005555555e6b5a objects_history_test`_start + 42
(lldb) thread list
Process 11199 stopped
* thread #1: tid = 11199, 0x0000555555641c8b objects_history_test`folly::threadlocal_detail::PthreadKeyUnregister::registerKey(key=5) at ThreadLocalDetail.h:272:31, name = 'objects_history', stop reason = breakpoint 3.1
  thread #2: tid = 11202, 0x00007ffff7ddd4ab, name = 'objects_history'
(lldb) break list
Current breakpoints:
1: name = 'pthread_create', locations = 1, resolved = 1, hit count = 2
  1.1: where = libpthread.so.0`pthread_create@@GLIBC_2.2.5, address = 0x00007ffff0c929b0, resolved, hit count = 2 

2: name = 'pthread_setspecific', locations = 1, resolved = 1, hit count = 8
  2.1: where = libpthread.so.0`__pthread_setspecific, address = 0x00007ffff0c9a1a0, resolved, hit count = 8 

3: name = 'folly::threadlocal_detail::PthreadKeyUnregister::registerKey', locations = 1, resolved = 1, hit count = 3
  3.1: where = objects_history_test`folly::threadlocal_detail::PthreadKeyUnregister::registerKey(unsigned int) + 11 at ThreadLocalDetail.h:272:31, address = 0x0000555555641c8b, resolved, hit count = 3 

(lldb) br d 2
ambiguous command 'breakpoint d'. Possible completions:
	delete
	disable
(lldb) br de 2
1 breakpoints deleted; 0 breakpoint locations disabled.
(lldb) 

I wrote a small test to check whether deleting the pkey makes pthread_setspecific return 22 — it does, which matches expectations. The problem is caused by the main thread destroying the pthread key on exit.

#include<stdio.h>
#include<pthread.h>
#include<string.h>
#include <unistd.h>


pthread_key_t p_key;

void func1() {
        int *tmp = (int*)pthread_getspecific(p_key); // data shared between functions within the same thread
        printf("%d is runing in %s\n",*tmp,__func__);
}

void *thread_func(void *args)
{
        int ret =  pthread_setspecific(p_key,args);
        printf("ret is %d \n",ret);
        int *tmp = (int*)pthread_getspecific(p_key); // fetch this thread's private slot
        printf("%d is runing in %s\n",*tmp,__func__);
        *tmp = (*tmp)*100; // modify the thread-private value
        func1();
        return (void*)0;
}

int main()
{
        pthread_t pa, pb;
        int a=1;
        int b=2;
        pthread_key_create(&p_key,NULL);
        pthread_create(&pa, NULL,thread_func,&a);
        sleep(2);
        int ret = pthread_key_delete(p_key);
        printf("pthread_key_delete ret is %d \n",ret);
        pthread_create(&pb, NULL,thread_func,&b);
        pthread_join(pa, NULL);
        pthread_join(pb, NULL);
        return 0;
}

qcraft@BJ-vgdog:~/code_test/pthread_key_delete_test$ ./a.out 
ret is 0 
1 is runing in thread_func
100 is runing in func1
pthread_key_delete ret is 0 
ret is 22 
Segmentation fault

2023/03/21, revisiting this problem: I hit it again today, the same pthread_setspecific "Invalid argument". There was actually an earlier internal analysis of it at the company, but I had not chased the problem at the time, and that analysis contained too much guesswork, so I cleared my head and walked through the debugging again. The final approach:

The problem does not look easy to debug at first, but once you pin it down as a multithreading issue plus thread-local variables it becomes straightforward to check: because the main thread exiting destructs the static variables, you only need to break on pthread_key_delete and pthread_setspecific in the main thread after gtest finishes, and freeze the other threads with set scheduler-locking on (note that several thread-local variables have already been cleaned up at this point). Then let the other threads run again and keep debugging.

That process reveals the whole chain of events:

The gtest main thread calls a planner module; that module starts thread 2 to run a task, and while running the task it calls the map-loading function. Note that the map-loading function passes a destructor back to the main thread, so when the planner module exits, the main thread invokes that destructor. And the destructor invoked by the main thread at exit creates yet another thread to do the destruction (DestroyContainerAsync, which essentially ScheduleFutures an object onto a thread pool). In other words, during the main thread's on_exit, ThreadPool::DisposalPool() is called to create a third thread to run the work.

If every memory object managed by that third thread were guaranteed to be released, this would not be a serious problem, but our code uses folly's thread-local objects, i.e. folly::ThreadLocal<XXXPerThread, XXXTag> xxx_per_thread_;. That eventually reaches pthread_setspecific in the code below and throws invalid argument, meaning the key is invalid. And there lies the puzzle: pthread_getspecific did not complain, which suggests the key was still valid at that point, while pthread_setspecific failing says the key is invalid — so the key must have been deleted in between. Why was it deleted, and where?

FOLLY_EXPORT FOLLY_NOINLINE static ThreadEntry* getThreadEntrySlow() {
    auto& meta = instance();
    auto key = meta.pthreadKey_;
    ThreadEntry* threadEntry =
        static_cast<ThreadEntry*>(pthread_getspecific(key));
    if (!threadEntry) {
      ThreadEntryList* threadEntryList = StaticMeta::getThreadEntryList();
      if (kUseThreadLocal) {
        static thread_local ThreadEntry threadEntrySingleton;
        threadEntry = &threadEntrySingleton;
      } else {
        threadEntry = new ThreadEntry();
      }
      // if the ThreadEntry already exists
      // but pthread_getspecific returns NULL
      // do not add the same entry twice to the list
      // since this would create a loop in the list
      if (!threadEntry->list) {
        threadEntry->list = threadEntryList;
        threadEntry->listNext = threadEntryList->head;
        threadEntryList->head = threadEntry;
      }

      threadEntry->tid() = std::this_thread::get_id();
      threadEntry->tid_os = folly::getOSThreadID();

      // if we're adding a thread entry
      // we need to increment the list count
      // even if the entry is reused
      threadEntryList->count++;

      threadEntry->meta = &meta;
      int ret = pthread_setspecific(key, threadEntry);
      checkPosixError(ret, "pthread_setspecific failed");

Tracking further shows that folly::threadlocal_detail::PthreadKeyUnregister::~PthreadKeyUnregister() calls pthread_key_delete on the pthread_key variables when it is destroyed. PthreadKeyUnregister is a singleton; when the main thread exits, its destructor deletes the pthread key objects, and once that destruction happens before a setspecific, you get pthread_setspecific invalid argument. The implementation below makes it clear (the comments even admit it: the singleton's logic is not safe with respect to destruction order). PthreadKeyUnregister manages every pthread_key globally: each newly registered thread-local registers its pthread_key into this object, and on destruction all of those pthread keys are released. Meanwhile another, newer thread still needs to bind a thread-local entry to one of those pthread keys — hence the problem.


class PthreadKeyUnregister {
 public:
  static constexpr size_t kMaxKeys = 1UL << 16;

  ~PthreadKeyUnregister() {
    // If static constructor priorities are not supported then
    // ~PthreadKeyUnregister logic is not safe.
#if !defined(__APPLE__) && !defined(_MSC_VER)
    MSLGuard lg(lock_);
    // 最终考虑了下,once and for all还是直接注释掉下面三句话。。。。
    while (size_) {
     pthread_key_delete(keys_[--size_]);
    }
#endif
  }

  static void registerKey(pthread_key_t key) { instance_.registerKeyImpl(key); }

 private:
  /**
   * Only one global instance should exist, hence this is private.
   * See also the important note at the top of this class about `constexpr`
   * usage.
   */
  constexpr PthreadKeyUnregister() : lock_(), size_(0), keys_() {}
  friend struct folly::threadlocal_detail::PthreadKeyUnregisterTester;

  void registerKeyImpl(pthread_key_t key) {
    MSLGuard lg(lock_);
    if (size_ == kMaxKeys) {
      throw_exception<std::logic_error>(
          "pthread_key limit has already been reached");
    }
    keys_[size_++] = key;
  }

  MicroSpinLock lock_;
  size_t size_;
  pthread_key_t keys_[kMaxKeys];

  static PthreadKeyUnregister instance_;
};

这里我们最终回顾一下:为什么说程序一开始就调用DestroyContainerAsync(std::move(x x x)); 把ThreadLocal变量强制初始化好,只是“看起来”解决了这个问题,而不是真正地解决。还是回到下面的代码:如果在程序正常运行时就把tls绑定到了pthread_key,那么之后这里的pthread_getspecific就不会再返回空的threadEntry,从而根本不会走到pthread_setspecific,自然也就看不到pthread_setspecific的报错。但它并没有保证这块内存本身是正常的。

   FOLLY_NOINLINE static void getSlowReserveAndCache(
      EntryID* ent,
      uint32_t& id,
      ThreadEntry*& threadEntry,
      size_t& capacity) {
    auto& inst = instance();
    threadEntry = inst.threadEntry_(); <=== run here
    if (UNLIKELY(threadEntry->getElementsCapacity() <= id)) {
      inst.reserve(ent);
      id = ent->getOrInvalid();
    }
    capacity = threadEntry->getElementsCapacity();
    assert(capacity > id);
  } 


FOLLY_EXPORT FOLLY_NOINLINE static ThreadEntry* getThreadEntrySlow() {
    // 拿到对应的staticMeta
    auto& meta = instance();
    // 拿到对应的pthreadkey
    auto key = meta.pthreadKey_;
    // 获取对当前线程拿到的ThreadEntry
    ThreadEntry* threadEntry =
        static_cast<ThreadEntry*>(pthread_getspecific(key));  // 就是这里实际上不会返回null了
    if (!threadEntry) {
      // threadEntry为空说明当前线程就没这个玩意,那么我们就得分配新的ThreadEntry开始用啦
      // 这里注意ThreadEntryList是StaticMeta的东西
      ThreadEntryList* threadEntryList = StaticMeta::getThreadEntryList();
      if (kUseThreadLocal) {
        static thread_local ThreadEntry threadEntrySingleton;
        threadEntry = &threadEntrySingleton;
      } else {
        threadEntry = new ThreadEntry();
      }
      // if the ThreadEntry already exists
      // but pthread_getspecific returns NULL
      // do not add the same entry twice to the list
      // since this would create a loop in the list
      // 刚才的新造出来的,当前线程的存储的局部ThreadEntry放到threadEntryList里面
      if (!threadEntry->list) {
        threadEntry->list = threadEntryList;
        threadEntry->listNext = threadEntryList->head;
        threadEntryList->head = threadEntry;
      }
      // 再存储下tid
      threadEntry->tid() = std::this_thread::get_id();
      threadEntry->tid_os = folly::getOSThreadID();

      // if we're adding a thread entry
      // we need to increment the list count
      // even if the entry is reused
      threadEntryList->count++;

      threadEntry->meta = &meta;
      // 上面都加好了,现在可以设置绑定到当前线程啦
      int ret = pthread_setspecific(key, threadEntry);   <=== run here
      checkPosixError(ret, "pthread_setspecific failed");
    }
    //不为空直接返回了拿到的当前线程的ThreadEntry
    return threadEntry;
  }

附上程序栈和部分代码的分析

(gdb) bt
#0  0x00007fffe8de1d50 in __GI___pthread_setspecific (key=9, value=0x7fffde552518) at pthread_setspecific.c:26
# 这里就到了有趣的地方了,解释看上面的部分
#1  0x00007fffee3dd94c in folly::threadlocal_detail::StaticMeta<qcraft::CounterImpl::CounterTag, void>::getThreadEntrySlow() () at external/folly/folly/detail/ThreadLocalDetail.h:497

# frame 2 getSlowReserveAndCache这个的代码就不贴了,这里实际上是拿到staticmeta<Tag>的instance
# 即只针对某一种类型的staticmeta的单例
#2  0x00007fffee3dd4ab in folly::threadlocal_detail::StaticMeta<qcraft::CounterImpl::CounterTag, void>::getSlowReserveAndCache(folly::threadlocal_detail::StaticMetaBase::EntryID*, unsigned int&, folly::threadlocal_detail::ThreadEntry*&, unsigned long&) (ent=0x5555556eaa80, id=@0x7fffde550c64: 1, threadEntry=@0x7fffde552508: 0x0, capacity=@0x7fffde552510: 0) at external/folly/folly/detail/ThreadLocalDetail.h:459

# frame 3的代码如下
#  FOLLY_EXPORT FOLLY_ALWAYS_INLINE static ElementWrapper& get(EntryID* ent) {
#    // Eliminate as many branches and as much extra code as possible in the
#    // cached fast path, leaving only one branch here and one indirection below.
#    uint32_t id = ent->getOrInvalid();
##ifdef FOLLY_TLD_USE_FOLLY_TLS
#    static FOLLY_TLS ThreadEntry* threadEntry{};   <== notice here, it's a per thread static thing as well
#    static FOLLY_TLS size_t capacity{};     <== FOLLY_TLS is macro for _thread per thread as well
##else
#    ThreadEntry* threadEntry{};
#    size_t capacity{};
##endif
#    if (FOLLY_UNLIKELY(capacity <= id)) {
#      getSlowReserveAndCache(ent, id, threadEntry, capacity);  <== run here
#    }
#    return threadEntry->elements[id];
#  }
# 这一步在做什么操作呢?实际上就是在给元素做cache缓存和分配。这里注意
# ThreadEntry* threadEntry{}只在第一次分配时才是空指针!也就是说第一次走到这里会去申请,
# 但如果已经走过一遍申请流程,因为静态(thread local)变量不会重复初始化,所以它一旦被设置过就会保留,
# 下次再访问这里的时候,id和capacity都是确定的;如果是新注册的元素,id会增大,capacity不够时才会再走一遍getSlowReserveAndCache
#
#3  0x00007fffee3dd37f in folly::threadlocal_detail::StaticMeta<qcraft::CounterImpl::CounterTag, void>::get(folly::threadlocal_detail::StaticMetaBase::EntryID*) (ent=0x5555556eaa80) at external/folly/folly/detail/ThreadLocalDetail.h:448

# frame 4的代码如下
#  T* get() const {
#    threadlocal_detail::ElementWrapper& w = StaticMeta::get(&id_); <== run here
#    return static_cast<T*>(w.ptr);
#  }
# 所以可以清楚的看到,这个时候实际上是从StaticMeta当中通过id拿到真实元素的wrapper,StaticMeta是一个类型设定好了singleton
# 对于整个程序而言,一种类型对应一种StaticMeta,用folly自己的话
# We have one of these per "Tag", by default one for the whole system,
# 在这里我们就能确定我们操作的是CounterTag的StaticMeta,但是这里的id_是什么?如果是默认构造函数,能够看到id_调用默认构造
# 由其他的ThreadLocalPtr那么就拿别人的id_
#4  0x00007fffee3dd37f in folly::ThreadLocalPtr<qcraft::CounterImpl::CounterPerThread, qcraft::CounterImpl::CounterTag, void>::get() const (this=0x5555556eaa80) at external/folly/folly/ThreadLocal.h:158

# frame 5 这里实际上就是获取对应的perthread的数据的过程,对应的代码如下
#  FOLLY_ERASE T* get() const {
#    auto const ptr = tlp_.get();  <==运行到这里
#    return FOLLY_LIKELY(!!ptr) ? ptr : makeTlp();
# }
# 第一个问题就来了tlp_是个什么东西,其定义为mutable ThreadLocalPtr<T, Tag, AccessMode> tlp_
# 所以实际上就是一个ThreadLocalPtr类型,参考https://github.com/facebook/folly/blob/main/folly/docs/ThreadLocal.md
# 所以实际上folly::ThreadLocal就是一层folly::ThreadLocalPtr的包裹,哈哈哈,这个具体干啥可以直接查到
#5  0x00007fffee3ca7d1 in folly::ThreadLocal<qcraft::CounterImpl::CounterPerThread, qcraft::CounterImpl::CounterTag, void>::get() const (this=0x5555556eaa80) at external/folly/folly/ThreadLocal.h:68

# frame 6 真正开始调用operator->(),因为其是一个threadlocal类型的变量,所以指针寻找的过程实际上重写了对应的过程
# 这里实际上就是把xxxx_per_thread_当成了一个指针
#6  0x00007fffee3ca7d1 in folly::ThreadLocal<qcraft::CounterImpl::CounterPerThread, qcraft::CounterImpl::CounterTag, void>::operator->() const (this=0x5555556eaa80) at external/folly/folly/ThreadLocal.h:73

# frame 7 的代码实际上是个xxxx_per_thread_->spin_lock.Lock(), 
# xxxx_per_thread_对应的类别是XXXPerThread,里面放了一个boost::circular_buffer和SpinLock
#7  0x00007fffee3b90a4 in qcraft::CounterImpl::AddCounterEvent(char const*, long) (this=0x5555556eaa80, name=0x7fffee41e698 "AddTraceEvent", val=12096) at onboard/global/counter.cc:70
#8  0x00007fffee3baa3d in qcraft::Counter::AddCounterEventInternal(char const*, long) (this=0x555555744ca0, name=0x7fffee41e698 "AddTraceEvent", val=12096) at onboard/global/counter.cc:208
#9  0x00007ffff76a1d25 in qcraft::Counter::AddCounterEvent<long>(char const*, long) (this=0x555555744ca0, name=0x7fffee41e698 "AddTraceEvent", val=12096) at ./onboard/global/counter.h:40
#10 0x00007ffff7668d3b in qcraft::CounterEventWrapper::~CounterEventWrapper() (this=0x7fffde550f18) at ./onboard/global/counter.h:64
#11 0x00007fffee440325 in qcraft::TraceImpl::AddTraceEvent(qcraft::TraceEvent const&) (this=0x7fffd901e0e0, trace_event=...) at onboard/global/trace.cc:75
#12 0x00007fffee42f675 in qcraft::Trace::AddTraceEvent(qcraft::TraceEvent const&) (this=0x7fffd901a0f0, trace_event=...) at onboard/global/trace.cc:342
#13 0x00007ffff7bc2a96 in qcraft::ScopedTrace::ScopedTrace(char const*, bool) (this=0x7fffde5510d0, name=0x7ffff3d3ab2b "DestroyContainerAsync", use_ftrace=true) at ./onboard/global/trace.h:150
#14 0x00007ffff3dbeb76 in void qcraft::DestroyContainerAsync<qcraft::LockFileAndDelete<qcraft::mapping::v2::SemanticMapManager> >(qcraft::ThreadPool*, qcraft::LockFileAndDelete<qcraft::mapping::v2::SemanticMapManager>)::{lambda()#1}::operator()() (this=0x555555728d78)
    at ./onboard/async/async_util.h:40
#15 0x00007ffff3dc275d in std::__invoke_impl<void, void qcraft::DestroyContainerAsync<qcraft::LockFileAndDelete<qcraft::mapping::v2::SemanticMapManager> >(qcraft::ThreadPool*, qcraft::LockFileAndDelete<qcraft::mapping::v2::SemanticMapManager>)::{lambda()#1}&>(std::__invoke_other, void qcraft::DestroyContainerAsync<qcraft::LockFileAndDelete<qcraft::mapping::v2::SemanticMapManager> >(qcraft::ThreadPool*, qcraft::LockFileAndDelete<qcraft::mapping::v2::SemanticMapManager>)::{lambda()#1}&) (__f=...)
    at /usr/lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/invoke.h:60
#16 0x00007ffff3dc272d in std::__invoke<void qcraft::DestroyContainerAsync<qcraft::LockFileAndDelete<qcraft::mapping::v2::SemanticMapManager> >(qcraft::ThreadPool*, qcraft::LockFileAndDelete<qcraft::mapping::v2::SemanticMapManager>)::{lambda()#1}&>(void qcraft::DestroyContainerAsync<qcraft::LockFileAndDelete<qcraft::mapping::v2::SemanticMapManager> >(qcraft::ThreadPool*, qcraft::LockFileAndDelete<qcraft::mapping::v2::SemanticMapManager>)::{lambda()#1}&, (std::__invoke_result&&)...) (__fn=...)
    at /usr/lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/invoke.h:95
#17 0x00007ffff3dc2709 in std::_Bind<void qcraft::DestroyContainerAsync<qcraft::LockFileAndDelete<qcraft::mapping::v2::SemanticMapManager> >(qcraft::ThreadPool*, qcraft::LockFileAndDelete<qcraft::mapping::v2::SemanticMapManager>)::{lambda()#1} ()>::__call<void>(std::tuple<>&&, std::_Index_tuple<>) (this=0x555555728d78, __args=...) at /usr/lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/functional:400
#18 0x00007ffff3dc26e6 in std::_Bind<void qcraft::DestroyContainerAsync<qcraft::LockFileAndDelete<qcraft::mapping::v2::SemanticMapManager> >(qcraft::ThreadPool*, qcraft::LockFileAndDelete<qcraft::mapping::v2::SemanticMapManager>)::{lambda()#1} ()>::operator()<, void>() (this=0x555555728d78) at /usr/lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/functional:482
#19 0x00007ffff3dc269d in std::__invoke_impl<void, std::_Bind<void qcraft::DestroyContainerAsync<qcraft::LockFileAndDelete<qcraft::mapping::v2::SemanticMapManager> >(qcraft::ThreadPool*, qcraft::LockFileAndDelete<qcraft::mapping::v2::SemanticMapManager>)::{lambda()#1} ()>&>(std::__invoke_other, std::_Bind<void qcraft::DestroyContainerAsync<qcraft::LockFileAndDelete<qcraft::mapping::v2::SemanticMapManager> >(qcraft::ThreadPool*, qcraft::LockFileAndDelete<qcraft::mapping::v2::SemanticMapManager>)::{lambda()#1} ()>&) (__f=...)
    at /usr/lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/invoke.h:60
#20 0x00007ffff3dc266d in std::__invoke<std::_Bind<void qcraft::DestroyContainerAsync<qcraft::LockFileAndDelete<qcraft::mapping::v2::SemanticMapManager> >(qcraft::ThreadPool*, qcraft::LockFileAndDelete<qcraft::mapping::v2::SemanticMapManager>)::{lambda()#1} ()>&>(std::_Bind<void qcraft::DestroyContainerAsync<qcraft::LockFileAndDelete<qcraft::mapping::v2::SemanticMapManager> >(qcraft::ThreadPool*, qcraft::LockFileAndDelete<qcraft::mapping::v2::SemanticMapManager>)::{lambda()#1} ()>&, (std::__invoke_result&&)...) (__fn=...)
    at /usr/lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/invoke.h:95
#21 0x00007ffff3dc263c in std::__future_base::_Task_state<std::_Bind<void qcraft::DestroyContainerAsync<qcraft::LockFileAndDelete<qcraft::mapping::v2::SemanticMapManager> >(qcraft::ThreadPool*, qcraft::LockFileAndDelete<qcraft::mapping::v2::SemanticMapManager>)::{lambda()#1} ()>, std::allocator<int>, void ()>::_M_run()::{lambda()#1}::operator()() const (this=0x7fffde551660) at /usr/lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/future:1421
#22 0x00007ffff3dc251c in std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::__future_base::_Task_state<std::_Bind<void qcraft::DestroyContainerAsync<qcraft::LockFileAndDelete<qcraft::mapping::v2::SemanticMapManager> >(qcraft::ThreadPool*, qcraft::LockFileAndDelete<qcraft::mapping::v2::SemanticMapManager>)::{lambda()#1} ()>, std::allocator<int>, void ()>::_M_run()::{lambda()#1}, void>::operator()() const (this=0x7fffde551640)
    at /usr/lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/future:1362
#23 0x00007ffff3dc23b0 in std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::__future_base::_Task_state<std::_Bind<void qcraft::DestroyContainerAsync<qcraft::LockFileAndDelete<qcraft::mapping::v2::SemanticMapManager> >(qcraft::ThreadPool*, qcraft::LockFileAndDelete<qcraft::mapping::v2::SemanticMapManager>)::{lambda()#1} ()>, std::allocator<int>, void ()>::_M_run()::{lambda()#1}, void> >::_M_invoke(std::_Any_data const&) (__functor=...) at /usr/lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/std_function.h:285
#24 0x00007ffff3dab658 in std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>::operator()() const (this=0x7fffde551640) at /usr/lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/std_function.h:688
#25 0x00007ffff3dab379 in std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*) (this=0x555555728d50, __f=0x7fffde551640, __did_set=0x7fffde5515c6)
    at /usr/lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/future:561
#26 0x00007ffff3dab601 in std::__invoke_impl<void, void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*>(std::__invoke_memfun_deref, void (std::__future_base::_State_baseV2::*&&)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*&&, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*&&, bool*&&) (__f=
    @0x7fffde5515b0: (void (std::__future_base::_State_baseV2::*)(std::__future_base::_State_baseV2 * const, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()> *, bool *)) 0x7ffff3dab350 <std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*)>, __t=@0x7fffde5515a8: 0x555555728d50, __args=@0x7fffde5515a0: 0x7fffde551640, __args=@0x7fffde551598: 0x7fffde5515c6)
    at /usr/lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/invoke.h:73
#27 0x00007ffff3dab50c in std::__invoke<void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*>(void (std::__future_base::_State_baseV2::*&&)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*&&, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*&&, bool*&&) (__fn=
    @0x7fffde5515b0: (void (std::__future_base::_State_baseV2::*)(std::__future_base::_State_baseV2 * const, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()> *, bool *)) 0x7ffff3dab350 <std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*)>, __args=@0x7fffde5515a8: 0x555555728d50, __args=@0x7fffde5515a0: 0x7fffde551640, __args=@0x7fffde551598: 0x7fffde5515c6)
    at /usr/lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/invoke.h:95
#28 0x00007ffff3dab49c in void std::call_once<void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*>(std::once_flag&, void (std::__future_base::_State_baseV2::*&&)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*&&, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*&&, bool*&&)::{lambda()#1}::operator()() const (this=0x7fffde551518)
    at /usr/lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/mutex:671
#29 0x00007ffff3dab424 in void std::call_once<void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*>(std::once_flag&, void (std::__future_base::_State_baseV2::*&&)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*&&, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*&&, bool*&&)::{lambda()#2}::operator()() const (this=0x7ffff7e57758)
    at /usr/lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/mutex:676
#30 0x00007ffff3dab3f9 in void std::call_once<void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*>(std::once_flag&, void (std::__future_base::_State_baseV2::*&&)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*&&, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*&&, bool*&&)::{lambda()#2}::__invoke() () at /usr/lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/mutex:676
#31 0x00007fffe8de24df in __pthread_once_slow (once_control=0x555555728d68, init_routine=0x7ffff7d51c20 <__once_proxy>) at pthread_once.c:116
#32 0x00007ffff3d938cb in __gthread_once(int*, void (*)()) (__once=0x555555728d68, __func=0x7ffff7d51c20 <__once_proxy>) at /usr/lib/gcc/x86_64-linux-gnu/9/../../../../include/x86_64-linux-gnu/c++/9/bits/gthr-default.h:700
#33 0x00007ffff3dab334 in std::call_once<void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*>(std::once_flag&, void (std::__future_base::_State_baseV2::*&&)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*&&, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*&&, bool*&&) (__once=..., __f=
---Type <return> to continue, or q <return> to quit---
    @0x7fffde5515b0: (void (std::__future_base::_State_baseV2::*)(std::__future_base::_State_baseV2 * const, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()> *, bool *)) 0x7ffff3dab350 <std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*)>, __args=@0x7fffde5515a8: 0x555555728d50, __args=@0x7fffde5515a0: 0x7fffde551640, __args=@0x7fffde551598: 0x7fffde5515c6)
    at /usr/lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/mutex:683
#34 0x00007ffff3daae2e in std::__future_base::_State_baseV2::_M_set_result(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>, bool) (this=0x555555728d50, __res=..., __ignore_failure=false)
    at /usr/lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/future:401
#35 0x00007ffff3dc1f71 in std::__future_base::_Task_state<std::_Bind<void qcraft::DestroyContainerAsync<qcraft::LockFileAndDelete<qcraft::mapping::v2::SemanticMapManager> >(qcraft::ThreadPool*, qcraft::LockFileAndDelete<qcraft::mapping::v2::SemanticMapManager>)::{lambda()#1} ()>, std::allocator<int>, void ()>::_M_run() (this=0x555555728d50) at /usr/lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/future:1423
#36 0x00007ffff3dc0a7b in std::packaged_task<void ()>::operator()() (this=0x5555558181b0) at /usr/lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/future:1551
#37 0x00007ffff3dc388d in qcraft::Future<std::result_of<void qcraft::DestroyContainerAsync<qcraft::LockFileAndDelete<qcraft::mapping::v2::SemanticMapManager> >(qcraft::ThreadPool*, qcraft::LockFileAndDelete<qcraft::mapping::v2::SemanticMapManager>)::{lambda()#1} ()>::type> qcraft::ThreadPool::Schedule<void qcraft::DestroyContainerAsync<qcraft::LockFileAndDelete<qcraft::mapping::v2::SemanticMapManager> >(qcraft::ThreadPool*, qcraft::LockFileAndDelete<qcraft::mapping::v2::SemanticMapManager>)::{lambda()#1}>(void qcraft::DestroyContainerAsync<qcraft::LockFileAndDelete<qcraft::mapping::v2::SemanticMapManager> >(qcraft::ThreadPool*, qcraft::LockFileAndDelete<qcraft::mapping::v2::SemanticMapManager>)::{lambda()#1}&&, (std::result_of&&)...)::{lambda()#1}::operator()() const (this=0x5555557287e0)
    at ./onboard/async/thread_pool.h:118
#38 0x00007ffff3dc372d in std::_Function_handler<void (), qcraft::Future<std::result_of<void qcraft::DestroyContainerAsync<qcraft::LockFileAndDelete<qcraft::mapping::v2::SemanticMapManager> >(qcraft::ThreadPool*, qcraft::LockFileAndDelete<qcraft::mapping::v2::SemanticMapManager>)::{lambda()#1} ()>::type> qcraft::ThreadPool::Schedule<void qcraft::DestroyContainerAsync<qcraft::LockFileAndDelete<qcraft::mapping::v2::SemanticMapManager> >(qcraft::ThreadPool*, qcraft::LockFileAndDelete<qcraft::mapping::v2::SemanticMapManager>)::{lambda()#1}>(void qcraft::DestroyContainerAsync<qcraft::LockFileAndDelete<qcraft::mapping::v2::SemanticMapManager> >(qcraft::ThreadPool*, qcraft::LockFileAndDelete<qcraft::mapping::v2::SemanticMapManager>)::{lambda()#1}&&, (std::result_of&&)...)::{lambda()#1}>::_M_invoke(std::_Any_data const&) (__functor=...) at /usr/lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/std_function.h:300
#39 0x00007ffff7668ad5 in std::function<void ()>::operator()() const (this=0x7fffde551748) at /usr/lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/std_function.h:688
#40 0x00007fffee4756bb in qcraft::ThreadPool::ThreadPool(int, std::function<void (int)> const&)::$_0::operator()() const (this=0x555555727e88) at onboard/async/thread_pool.cc:74
#41 0x00007fffee47542d in _ZSt13__invoke_implIvZN6qcraft10ThreadPoolC1EiRKSt8functionIFviEEE3$_0JEET_St14__invoke_otherOT0_DpOT1_ (__f=...) at /usr/lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/invoke.h:60
#42 0x00007fffee4753bd in _ZSt8__invokeIZN6qcraft10ThreadPoolC1EiRKSt8functionIFviEEE3$_0JEENSt15__invoke_resultIT_JDpT0_EE4typeEOS9_DpOSA_ (__fn=...) at /usr/lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/invoke.h:95
#43 0x00007fffee475395 in std::thread::_Invoker<std::tuple<qcraft::ThreadPool::ThreadPool(int, std::function<void (int)> const&)::$_0> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) (this=0x555555727e88) at /usr/lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/thread:244
#44 0x00007fffee475365 in std::thread::_Invoker<std::tuple<qcraft::ThreadPool::ThreadPool(int, std::function<void (int)> const&)::$_0> >::operator()() (this=0x555555727e88) at /usr/lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/thread:251
#45 0x00007fffee475229 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<qcraft::ThreadPool::ThreadPool(int, std::function<void (int)> const&)::$_0> > >::_M_run() (this=0x555555727e80) at /usr/lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/thread:195
#46 0x00007ffff7d52de4 in  () at /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#47 0x00007fffe8dd9609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#48 0x00007fffe858b133 in clone () at /lib/x86_64-linux-gnu/libc.so.6

到这里这个事情基本就结束了

14 因为升级glog导致的问题

这几天维护代码库遇到一个问题:我们有一个AsyncLogger的测试忽然开始报错了,代码大致如下

TEST(AsyncLoggerTest, SetLoggerToGlog) {
...
  AsyncLogger logger(google::base::GetLogger(google::INFO));
  google::base::SetLogger(FLAGS_minloglevel, &logger);
...
  google::ShutdownGoogleLogging();
}

//gtest的报错信息

FAIL: //cyber/logger:async_logger_test ...
INFO: From Testing //cyber/logger:async_logger_test:
==================== Test output for //cyber/logger:async_logger_test:
Running main() from gmock_main.cc
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from AsyncLoggerTest
[ RUN      ] AsyncLoggerTest.WriteAndFlush
[       OK ] AsyncLoggerTest.WriteAndFlush (3 ms)
[ RUN      ] AsyncLoggerTest.SetLoggerToGlog
E20220301 18:15:28.254946  8519 async_logger_test.cc:65] [AsyncLoggerTest2]test set async logger to glog
free(): invalid pointer
================================================================================
Target //cyber/logger:async_logger_test up-to-date:
  bazel-bin/cyber/logger/async_logger_test
INFO: Elapsed time: 1.325s, Critical Path: 1.09s
INFO: 2 processes: 2 processwrapper-sandbox.
INFO: Build completed, 1 test FAILED, 2 total actions
Test cases: finished with 0 passing and 1 failing out of 1 test cases

然后我开始介入。基本上free报invalid pointer都是传入的指针不对,看了下就是logger的地址被当成参数传给了free。一开始没反应过来,以为是哪个函数写越界把地址写坏了,后来忽然感觉不对劲:logger是个局部变量,明明在栈上,怎么就在ShutdownGoogleLogging里面被free了?然后看了下free时的寄存器等现场,发现free的确实就是这个地址!

rdx = 0x0000555555664010
rdi = 0x00007fffffffd208
rsi = 0x0000000000000000

apollo::cyber::logger::AsyncLogger::AsyncLogger


(lldb) register read
General Purpose Registers:
       rax = 0x0000000000000000
       rbx = 0x0000000000000000
       rcx = 0x0000000000000002
       rdx = 0x0000555555664010
       rdi = 0x00007fffffffd208
       rsi = 0x0000000000000000
       rbp = 0x00007fffffffd0c0
       rsp = 0x00007fffffffd0a8
        r8 = 0x000055555567dd60
        r9 = 0x00007fffffffcd10
       r10 = 0x00007fffffffcd10
       r11 = 0x0000000000000206
       r12 = 0x000055555558edd0  async_logger_test`_start
       r13 = 0x00007fffffffd8f0
       r14 = 0x0000000000000000
       r15 = 0x0000000000000000
       rip = 0x00007fffed33ca30  libc.so.6`cfree
    rflags = 0x0000000000000206
        cs = 0x0000000000000033
        fs = 0x0000000000000000
        gs = 0x0000000000000000
        ss = 0x000000000000002b
        ds = 0x0000000000000000
        es = 0x0000000000000000


(lldb) br list
Current breakpoints:
1: name = 'apollo::cyber::logger::AsyncLogger::~AsyncLogger', locations = 2, resolved = 2, hit count = 21
  1.1: where = libcyber_Slogger_Slibasync_Ulogger.so`apollo::cyber::logger::AsyncLogger::~AsyncLogger() + 20 at async_logger.cc:39:29, address = 0x00007ffff7e5efd4, resolved, hit count = 7 
  1.2: where = libcyber_Slogger_Slibasync_Ulogger.so`apollo::cyber::logger::AsyncLogger::~AsyncLogger() + 20 at async_logger.cc:39:29, address = 0x00007ffff7e5eaf4, resolved, hit count = 14 

5: name = 'free', locations = 3, resolved = 3, hit count = 16
  5.1: where = ld-2.27.so`free, address = 0x00007ffff7dee800, resolved, hit count = 0 
  5.2: where = libexternal_Sboost_Slibasio.so`boost::asio::detail::object_pool<boost::asio::detail::epoll_reactor::descriptor_state>::free(boost::asio::detail::epoll_reactor::descriptor_state*) + 24 at object_pool.hpp:128:9, address = 0x00007ffff3317648, resolved, hit count = 0 
  5.3: where = libc.so.6`cfree, address = 0x00007fffed33ca30, resolved, hit count = 16 
  
很有趣的一点,明明这个变量在栈上,为什么会送到free函数上?
看下分配的函数

->  0x55555558f5f7 <+87>:  callq  0x5555555d9a80            ; symbol stub for: google::base::GetLogger(int)
    0x55555558f5fc <+92>:  movq   %rax, %rsi
    0x55555558f5ff <+95>:  leaq   -0x78(%rbp), %rdi
    0x55555558f603 <+99>:  movq   %rdi, -0xc0(%rbp)
    0x55555558f60a <+106>: callq  0x5555555d9a90            ; symbol stub for: apollo::cyber::logger::AsyncLogger::AsyncLogger(google::base::Logger*)
    0x55555558f60f <+111>: movq   -0xc0(%rbp), %rsi
    0x55555558f616 <+118>: movq   0x4eb33(%rip), %rax
    0x55555558f61d <+125>: movl   (%rax), %edi
    0x55555558f61f <+127>: callq  0x5555555d9bf0            ; symbol stub for: google::base::SetLogger(int, google::base::Logger*)
    0x55555558f624 <+132>: jmp    0x55555558f629            ; <+137> at async_logger_test.cc
    0x55555558f629 <+137>: leaq   -0x78(%rbp), %rdi
    0x55555558f62d <+141>: callq  0x5555555d9b60            ; symbol stub for: apollo::cyber::logger::AsyncLogger::Start()
    0x55555558f632 <+146>: jmp    0x55555558f637            ; <+151> at async_logger_test.cc:63:3

非常清楚,调用apollo::cyber::logger::AsyncLogger::AsyncLogger时的this指针(即rdi)是栈上的地址,怎么最后就被送进了free函数呢?
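顺便说一句,“把栈地址交给delete/free会得到这个报错”这一点,可以用一个最小的C++示意程序验证一下(类型名是随手起的,仅为示意):

// 最小示意:对栈上对象调用 delete,glibc 发现该指针并非堆上分配,
// 通常会打印 "free(): invalid pointer" 并 abort(SIGABRT),
// 和上面 gtest 里看到的现象一致
struct FakeLogger {
  int dummy = 0;
};

int main() {
  FakeLogger logger;        // 栈上对象
  FakeLogger* p = &logger;  // 某个“接管了所有权”的组件保存了这个指针
  delete p;                 // 未定义行为;glibc 下一般报 free(): invalid pointer
  return 0;
}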

开始翻修改记录,发现同事把glog从0.4升级到了0.5,然后打了下断点看了下调用栈

Running main() from gmock_main.cc
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from AsyncLoggerTest
[ RUN      ] AsyncLoggerTest.WriteAndFlush
[       OK ] AsyncLoggerTest.WriteAndFlush (3 ms)
[ RUN      ] AsyncLoggerTest.SetLoggerToGlog
E20220301 18:26:51.491030 15911 async_logger_test.cc:65] [AsyncLoggerTest2]test set async logger to glog
free(): invalid pointer
Process 15911 stopped
* thread #1, name = 'async_logger_te', stop reason = signal SIGABRT
    frame #0: 0x00007fffed30dfb7 libc.so.6`raise + 199
libc.so.6`raise:
->  0x7fffed30dfb7 <+199>: movq   0x108(%rsp), %rcx
    0x7fffed30dfbf <+207>: xorq   %fs:0x28, %rcx
    0x7fffed30dfc8 <+216>: movl   %r8d, %eax
    0x7fffed30dfcb <+219>: jne    0x7fffed30dfec            ; <+252>
(lldb) bt
* thread #1, name = 'async_logger_te', stop reason = signal SIGABRT
  * frame #0: 0x00007fffed30dfb7 libc.so.6`raise + 199
    frame #1: 0x00007fffed30f921 libc.so.6`abort + 321
    frame #2: 0x00007fffed358967 libc.so.6`___lldb_unnamed_symbol237$$libc.so.6 + 631
    frame #3: 0x00007fffed35f9da libc.so.6`___lldb_unnamed_symbol295$$libc.so.6 + 26
    frame #4: 0x00007fffed366f0c libc.so.6`cfree + 1244
    frame #5: 0x00007ffff7e5efe2 libcyber_Slogger_Slibasync_Ulogger.so`apollo::cyber::logger::AsyncLogger::~AsyncLogger(this=0x00007fffffffd218) at async_logger.cc:39:29
    frame #6: 0x00007fffef39da79 libexternal_Scom_Ugithub_Ugoogle_Uglog_Slibglog.so`google::LogDestination::~LogDestination(this=0x0000555555691c70) at logging.cc:638:5
    frame #7: 0x00007fffef39dbff libexternal_Scom_Ugithub_Ugoogle_Uglog_Slibglog.so`google::LogDestination::DeleteLogDestinations() at logging.cc:902:5
    frame #8: 0x00007fffef3a3282 libexternal_Scom_Ugithub_Ugoogle_Uglog_Slibglog.so`google::ShutdownGoogleLogging() at logging.cc:2573:3
    frame #9: 0x000055555558f838 async_logger_test`apollo::cyber::logger::AsyncLoggerTest_SetLoggerToGlog_Test::TestBody(this=0x00005555556b5180) at async_logger_test.cc:67:3
    frame #10: 0x00007fffeeb94d9b libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so`void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(object=0x00005555556b5180, method=21 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00, location="the test body")(), char const*) at gtest.cc:2607:10
    frame #11: 0x00007fffeeb82a0d libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so`void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(object=0x00005555556b5180, method=21 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00, location="the test body")(), char const*) at gtest.cc:2643:14
    frame #12: 0x00007fffeeb673e3 libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so`testing::Test::Run(this=0x00005555556b5180) at gtest.cc:2682:5
    frame #13: 0x00007fffeeb67fc9 libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so`testing::TestInfo::Run(this=0x00005555556c2080) at gtest.cc:2861:11
    frame #14: 0x00007fffeeb68814 libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so`testing::TestSuite::Run(this=0x000055555570f3b0) at gtest.cc:3015:28
    frame #15: 0x00007fffeeb7a5d8 libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so`testing::internal::UnitTestImpl::RunAllTests(this=0x0000555555707b20) at gtest.cc:5855:44
    frame #16: 0x00007fffeeb97c2b libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so`bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(object=0x0000555555707b20, method=80 a1 b7 ee ff 7f 00 00 00 00 00 00 00 00 00 00, location="auxiliary test code (environments or event listeners)")(), char const*) at gtest.cc:2607:10
    frame #17: 0x00007fffeeb84f73 libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so`bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(object=0x0000555555707b20, method=80 a1 b7 ee ff 7f 00 00 00 00 00 00 00 00 00 00, location="auxiliary test code (environments or event listeners)")(), char const*) at gtest.cc:2643:14
    frame #18: 0x00007fffeeb7a118 libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so`testing::UnitTest::Run(this=0x00007fffeebb3b60) at gtest.cc:5438:10
    frame #19: 0x00007fffeebb5a81 libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest_Umain.so`RUN_ALL_TESTS() at gtest.h:2490:46
    frame #20: 0x00007fffeebb5a1b libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest_Umain.so`main(argc=1, argv=0x00007fffffffd908) at gmock_main.cc:70:10
    frame #21: 0x00007fffed2f0bf7 libc.so.6`__libc_start_main + 231
    frame #22: 0x000055555558edba async_logger_test`_start + 42
(lldb) 

然后比较了一下glog 0.4和glog 0.5,发现析构相关的代码确实变了,还真是这个问题:0.5直接把用户通过SetLogger设置的logger的生命周期交给了LogDestination管理,见这个commit:https://github.com/google/glog/commit/28321d8959574a8268de3ebe41d40cf65415fc9b

//glog 0.4
~LogDestination() { }

//glog 0.5
LogDestination::~LogDestination() {
  if (logger_ && logger_ != &fileobject_) {
    // Delete user-specified logger set via SetLogger().
    delete logger_;
  }
}
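所以修复思路也就清楚了:glog 0.5的~LogDestination会delete用户通过SetLogger()设置的logger,也就是说glog接管了logger的所有权,不能再把栈上对象交给它。一个可能的改法示意如下(沿用测试里的接口写法,是否还需要调用logger->Start()之类的收尾,以实际的AsyncLogger实现为准):

TEST(AsyncLoggerTest, SetLoggerToGlog) {
  // glog 0.5 起,ShutdownGoogleLogging() -> ~LogDestination() 会 delete
  // 通过 SetLogger() 设置的 logger,即 glog 接管了所有权,
  // 所以 logger 不能再是栈上对象,改成堆上分配并交给 glog 释放
  auto* logger = new AsyncLogger(google::base::GetLogger(google::INFO));
  google::base::SetLogger(FLAGS_minloglevel, logger);
  // ... 原有测试逻辑 ...
  google::ShutdownGoogleLogging();  // 这里会 delete logger,不要再手动释放
}

另一种做法是在ShutdownGoogleLogging()之前先用SetLogger把最开始GetLogger拿到的原始logger换回去,这样glog析构LogDestination时就不会再去delete我们自己的logger。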

15 奇怪的调度

这个问题说起来并不难查,但因为很久没做性能方面的优化和分析,本身又对k8s不熟悉,所以露了怯。。。。回头单独写一篇如何做性能分析的文章,总结一下排查问题的方法。轻舟做CI,目前在把仿真从aws迁移到阿里云集群(这个背景要注意)。有时仿真集群运行忽然就很慢,往往一个pod要跑一个多小时,然后因为超时被kill,而这个场景的测试又会多次重试,于是又失败了。加了几条打日志的代码看了下,代码如下:


std::vector<std::vector<cv::Mat>> PanoNet::ClassifyImagePixels(
    const std::vector<cv::Mat>& images) {
  SCOPED_QTRACE_ARG1("PanoNet::ClassifyImagePixels", "gpu_id",
                     GpuId_Name(net_param_.gpu_id()));
  absl::Time start = absl::Now();
  .....
  if (SimCacheActive() && !use_gpu_) {
    ....
    auto sim_cache = GlobalSimCache::GetSimCache();
    if (!sim_cache->GetResult(cache_key, &outputs)) {
      ....
    }
    int64_t into_check_latency_2 = absl::ToInt64Milliseconds(absl::Now() - start);
    LOG(INFO) << "2 classify image pixels into check takes " << into_check_latency_2 << "ms" << "cache key is " << cache_key;
    tmp_cache_key = cache_key;
  } else {
    QCHECK(pano_net_->GetOutputs(images, batch_size, &outputs));
  }
  int64_t into_check_latency_3 = absl::ToInt64Milliseconds(absl::Now() - start);
  LOG(INFO) << "3 classify image pixels into check takes " << into_check_latency_3 << "ms" <<" tmp cache key is " << tmp_cache_key;
....
}

结果发现日志“2 classify image pixels xxx”和“3 classify image pixels xxx”之间竟然有长达4-10s的gap,而这两行日志之间的代码本身没什么东西。查仿真进程发现,进程调度的oncpu时间和整体运行时间的比例相差极大。一开始以为是调度问题,呼叫阿里云介入,阿里云的回复是读写bps过高导致cpu被打断。我表示不能理解:一个已经读完数据、在做计算的cpu,会因为整机磁盘读写负载高而被打断?何况当时cpu的负载也不重,这个解释有点牵强,仿佛在逗我笑。但当时忘了查进程状态,所以就先搁置了,运维同学介入,换了一台跑的pod数量少的node机器,想降低并发磁盘读写。

第二天换了机器问题还在,感觉不对劲,重新介入。查了下变慢的simulator进程状态,是SLl,也就是在睡眠等待事件,这就不对了,不应该是这样。然后看了下主线程的pstack,发现好几次都停在做sim-cache析构、往metric push消息的地方,也就是说在阿里云的pod上往aws push metric。注释掉push metric的地方,重新触发仿真,5分钟就通过了。

[root@sim-pipe241251spot ~]# ps -aux|grep 188359
root      188359  3.0  3.9 8693260 5077808 ?     SLl  12:10   0:19 offboard/simulation/simulator_main --variation=0 --qevents_dir=/tmp/9f07-e87d-f912-f353 --qevents_whitelist=* --logtostderr --enable_run_analysis=false --run_analysis_config_file= --run_analysis_server_address= --use_run_map=1 --sim_cache_dir=/media/s3/sim_cache --prom_push_gateway_url=a57985cd13e064fa7a332ec0870f38ea-559580627.cn-north-1.elb.amazonaws.com.cn:9091 --prom_aggregation_gateway_url=ae1be9e5057a74bcd909c63b8f134e95-1323279091.cn-north-1.elb.amazonaws.com.cn --sim_cache_monitor_keys=scenario_test_presubmit_cn --vehicle_param_dir=onboard/params/run_params/vehicles --scenario_path=/tmp/b99d-d2a3-11eb-9dda --output_file=/tmp/0a6d-cb37-11ca-0793 --enable_localizer_gpu=false --cpu_image_codec --fen_use_gpu=false --pcn_use_gpu=false --tcn_use_gpu=false --knet_use_gpu=false --panonet_use_gpu=false --mof_net_use_gpu=false --disable_vantage_forwarding_module --undefok=variation,qevents_dir,qevents_whitelist,logtostderr,enable_run_analysis,run_analysis_config_file,run_analysis_server_address,use_run_map,sim_cache_dir,prom_push_gateway_url,prom_aggregation_gateway_url,sim_cache_monitor_keys,vehicle_param_dir,scenario_path,output_file,enable_localizer_gpu,cpu_image_codec,fen_use_gpu,pcn_use_gpu,tcn_use_gpu,knet_use_gpu,panonet_use_gpu,mof_net_use_gpu,disable_vantage_forwarding_module --sim_msg_dir=/tmp/dc5e-c1ab-58aa-441c --replace_with_icc_agent=true
root      225817  0.0  0.0 112828  2344 pts/0    S+   12:21   0:00 grep --color=auto 188359
[root@sim-pipe241251spot ~]# pstack 188359
Thread 1 (process 188359):
#0  0x00007f562243c9cf in ?? ()
#1  0x0000000000000001 in ?? ()
#2  0x00007ffc30f1ccf0 in ?? ()
#3  0x0000000000000002 in ?? ()
#4  0x000003e800000000 in ?? ()
#5  0x0000000000000001 in ?? ()
#6  0x000055f858a6f0a5 in Curl_poll ()
#7  0x000055f858a62647 in Curl_multi_wait ()
#8  0x000055f858a628aa in curl_multi_poll ()
#9  0x000055f858a53f8f in curl_easy_perform ()
#10 0x000055f857b3e866 in curlpp::internal::CurlHandle::perform() ()
#11 0x000055f857b3c580 in qcraft::HttpClient::Post(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::list<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, int, int) ()
#12 0x000055f857b3b934 in qcraft::prometheus_gateway::PromGateWayClient::PushMetric(qcraft::prometheus_gateway::PromMetric const&) ()
#13 0x000055f857b37fe4 in qcraft::SimCache::~SimCache() ()
#14 0x000055f857b39831 in qcraft::SimFileCache::~SimFileCache() ()
...
#20 0x00007f5623db247f in ?? ()
#21 0x00007f5623db24d0 in ?? ()

这次事件的反思是:

  • 代码里面if语句结束,并不是直接就到下一句,if块里面的变量会先析构,从而触发一些消耗(见这组列表后面的小例子)。说实在的,这个我大学的时候就知道,但是写C习惯了以后再看C++,有时候是真的想不起来。。。。一个variable的析构会触发一次curl,这代码也实在是神奇。。。
  • 应该在发现调度不上之后就确定进程的状态,通过ps -aux可以直接查看进程是不是处于SLl状态,即等待事件发生。这个单纯是那天太晚了,太困了,想不起来了。。。。
  • 确定进程一直等待之后就可以调用pstack 188359查看进程的主线程的栈了,这个也是老方法了,当年查内存泄露用了好多次。。。
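针对第一条,附一个最小的C++示意(类名和里面的“网络调用”都是虚构的),说明if块结束并不是“没有代码”:块内局部变量的析构完全可能做一次很贵的同步操作:

#include <chrono>
#include <iostream>
#include <thread>

// 虚构的类型:析构时做一次同步的“push”,模拟 SimCache 析构里走 curl 推 metric 的情况
struct MetricPusher {
  ~MetricPusher() {
    std::cout << "pushing metric over network..." << std::endl;
    std::this_thread::sleep_for(std::chrono::seconds(2));  // 模拟一次跨机房的慢请求
  }
};

void ClassifyOnce(bool use_sim_cache) {
  if (use_sim_cache) {
    MetricPusher pusher;
    // ... 查 cache、打日志 "2 classify ..." ...
  }  // <== 离开 if 块时这里会调用 ~MetricPusher(),日志 "2" 和 "3" 之间的 gap 就可能藏在这种地方
  // ... 打日志 "3 classify ..." ...
}

int main() {
  ClassifyOnce(true);
  return 0;
}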

16 错误的ssh key

这个问题坦白讲一般不会遇到。我们有台云上机器,需要后处理一些敏感数据,要拉取git仓库,配置的私钥格式如下:

-----BEGIN OPENSSH PRIVATE KEY-----
~~~~~~~~~~~~~~~~~~~~~~~
-----END OPENSSH PRIVATE KEY-----

但是仍然一直报错

Load key "xxx/.ssh/id_ed25519": invalid format
yyyyy: Permission denied (publickey).
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.

最后发现是因为没有在私钥最后加一个newline。。。

改成多一个newline即可。。。

-----BEGIN OPENSSH PRIVATE KEY-----
~~~~~~~~~~~~~~~~~~~~~~~
-----END OPENSSH PRIVATE KEY-----
<=== notice here

17 not support negative lookaround

这两天写正则遇到一个问题:我们有些用户创建分支都以origin打头,比方说建了origin/master、origin/dbet之类的分支。然而gitlab的分支名规则使用re2引擎,不支持negative lookaround。为了挡掉这类分支名只能写原始匹配,因为origin太长,展开写太费劲,所以只匹配了orig前缀,最后的表达式如下

^([^o].*$)|(o([^r].*$|$))|(or([^i].*$|$))|(ori([^g].*$|$))
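顺手可以用re2库(gitlab分支名规则用的就是RE2引擎)验证一下这个表达式的效果,下面是一个C++小示意(需要安装并链接re2):

#include <iostream>
#include <re2/re2.h>

int main() {
  // 与 gitlab 分支名规则里相同的表达式:允许的名字必须“不以 orig 开头”
  RE2 allowed("^([^o].*$)|(o([^r].*$|$))|(or([^i].*$|$))|(ori([^g].*$|$))");

  for (const char* name : {"master", "feature/foo", "origin/master", "origin/dbet", "orig"}) {
    // FullMatch 要求整个字符串匹配;匹配上表示这个分支名是允许的
    std::cout << name << " -> "
              << (RE2::FullMatch(name, allowed) ? "allowed" : "rejected") << std::endl;
  }
  return 0;
}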

18 git push ref missing head

这个问题,坦白讲我不知道root cause,也不知道后来的解决办法为什么有效。简单来说,我们有个巨大的git仓库放了很多地图文件,每隔一段时间就会遇到这种问题:push代码会报错

create mode 100644 zhuhai_1/semantic_map/4276747992/others.pb.bin
 create mode 100644 zhuhai_1/semantic_map/semantic_map_meta.pb.bin
Locking support detected on remote "origin". Consider enabling it with:
  $ git config lfs.https://china-map.qcraftai.com/qcraft/qcraft-maps-china-v2.git/info/lfs.locksverify true
ref xxxxx:: missing object: 019e6b672af008e6873dcaec850c8c1fa5c63623
error: failed to push some refs to 'china-map.qcraftai.com:qcraft/qcraft-maps-china-v2.git'

但是这个文件实际上是lfs文件,而且文件也存在,而报错说git missing object,是不是意味着这个对象没有track在git的repo里面?后面又做了如下操作

qcrafter@runner-dghmx1qy-project-4-concurrent-09k7q5:/builds/DgHMx1qY/0/root/qcraft/qcraft-maps-china$ git cat-file -p 019e6b672af008e6873dcaec850c8c1fa5c63623
version https://git-lfs.github.com/spec/v1
oid sha256:5990808a372ce1ff02742a9557bc3312e6cddd7439e036f51acf019ea1ffb985
size 3003

qcrafter@runner-dghmx1qy-project-4-concurrent-09k7q5:/builds/DgHMx1qY/0/root/qcraft/qcraft-maps-china$ sha256sum beijing_yizhuang2nd/semantic_map/4283367543/sections.pb.bin
5990808a372ce1ff02742a9557bc3312e6cddd7439e036f51acf019ea1ffb985  beijing_yizhuang2nd/semantic_map/4283367543/sections.pb.bin
qcrafter@runner-dghmx1qy-project-4-concurrent-09k7q5:/builds/DgHMx1qY/0/root/qcraft/qcraft-maps-china$ git cat-file -p 019e6b672af008e6873dcaec850c8c1fa5c63623
version https://git-lfs.github.com/spec/v1
oid sha256:5990808a372ce1ff02742a9557bc3312e6cddd7439e036f51acf019ea1ffb985
size 3003

看到一个了类似的issue,https://stackoverflow.com/questions/52612880/lfs-upload-missing-object-but-the-file-is-there

还有一个现象很类似的issue,https://github.com/git-lfs/git-lfs/issues/3587

19 奇怪的时间

写这个希望不会再有下一个倒霉蛋遇到这个事情了。。。。

这两天用golang的gorm写一个简单的crud程序,遇到一个有趣的问题。我们有两个数据库(都是oceanbase,一个线上环境一个测试环境),里面有张表有个datetime类型的字段,gorm的model里对应的类型是time.Time。插入数据的时候直接插了time.Time{},也就是0000-00-00 00:00:00。在测试环境数据库读取这个时间没什么问题,在生产环境就稳定出问题,读出来的时间是-0001-11-30 00:00:00 +0000。因为是同一个环境、同一套代码、同一个elf程序跑的,所以一直没理解是怎么回事。今天仔细看了一下程序里的流程,栈如下,连接测试数据库和生产数据库时都一样

(dlv) bt
 0  0x0000000001583232 in github.com/go-sql-driver/mysql.parseBinaryDateTime
    at /root/go/pkg/mod/github.com/go-sql-driver/mysql@v1.7.0/utils.go:230
 1  0x000000000157cff0 in github.com/go-sql-driver/mysql.(*binaryRows).readRow
    at /root/go/pkg/mod/github.com/go-sql-driver/mysql@v1.7.0/packets.go:1314
 2  0x000000000157ec37 in github.com/go-sql-driver/mysql.(*binaryRows).Next
    at /root/go/pkg/mod/github.com/go-sql-driver/mysql@v1.7.0/rows.go:198
 3  0x0000000001555d0a in database/sql.(*Rows).nextLocked
    at /usr/local/go/src/database/sql/sql.go:2974
 4  0x0000000001555ab4 in database/sql.(*Rows).Next.func1
    at /usr/local/go/src/database/sql/sql.go:2952
 5  0x0000000001559213 in database/sql.withLock
    at /usr/local/go/src/database/sql/sql.go:3405
 6  0x0000000001555a2b in database/sql.(*Rows).Next
    at /usr/local/go/src/database/sql/sql.go:2951
 7  0x00000000015fc622 in gorm.io/gorm.Scan
    at /root/go/pkg/mod/gorm.io/gorm@v1.25.2-0.20230530020048-26663ab9bf55/scan.go:327
 8  0x0000000001643377 in gorm.io/gorm/callbacks.Query
    at /root/go/pkg/mod/gorm.io/gorm@v1.25.2-0.20230530020048-26663ab9bf55/callbacks/query.go:28
 9  0x00000000015df4fe in gorm.io/gorm.(*processor).Execute
    at /root/go/pkg/mod/gorm.io/gorm@v1.25.2-0.20230530020048-26663ab9bf55/callbacks.go:130
10  0x00000000015e9708 in gorm.io/gorm.(*DB).First
    at /root/go/pkg/mod/gorm.io/gorm@v1.25.2-0.20230530020048-26663ab9bf55/finisher_api.go:129
11  0x000000000166505c in xxxxxxxx
    at xxxxxx
12  0x000000000167d2d4 in xxxxxxxxxx
    at xxxxx
13  0x000000000167ff56 in xxxxxxx
    at xxxxxx
14  0x0000000001681ff0 in main.main
    at ./cmd/datasource/main.go:105
15  0x00000000004affd3 in runtime.main
    at /usr/local/go/src/runtime/proc.go:250
16  0x00000000004e23a1 in runtime.goexit
    at /usr/local/go/src/runtime/asm_amd64.s:1598

然后打印了一下待解析的内容。测试环境回复的数据里,日期部分(长度+内容)为[0…],也就是说长度为0;生产环境回复的数据里,日期部分为[…,11,0,0,0,0,0,0,0,0,0,0,0,…+7 more],也就是说长度是11,内容全是0。接下来调用parse time的逻辑就是下面的代码:测试环境长度是0,直接返回了默认时间;生产环境长度是11,同样是全0的时间戳,走的是11的逻辑。写了个小程序(见后面),跑了一下,出来的就是-0001-11-30 00:00:00 +0000。

prod环境的oceanbase版本是3.2.3.1,test环境的oceanbase版本是3.2.3.3

我去看了一下mysql binary protocol的编码标准:https://dev.mysql.com/doc/dev/mysql-server/latest/page_protocol_binary_resultset.html,和golang mysql driver的代码也是匹配的,日期的长度只可能是0、4、7、11。但是按照标准,全0的时间应该只返回一个表示0长度的byte。所以我理解这应该是oceanbase 3.2.3.1的bug


func parseBinaryDateTime(num uint64, data []byte, loc *time.Location) (driver.Value, error) {
	switch num {
	case 0:
		return time.Time{}, nil
	case 4:
		return time.Date(
			int(binary.LittleEndian.Uint16(data[:2])), // year
			time.Month(data[2]),                       // month
			int(data[3]),                              // day
			0, 0, 0, 0,
			loc,
		), nil
	case 7:
		return time.Date(
			int(binary.LittleEndian.Uint16(data[:2])), // year
			time.Month(data[2]),                       // month
			int(data[3]),                              // day
			int(data[4]),                              // hour
			int(data[5]),                              // minutes
			int(data[6]),                              // seconds
			0,
			loc,
		), nil
	case 11:
		return time.Date(
			int(binary.LittleEndian.Uint16(data[:2])), // year
			time.Month(data[2]),                       // month
			int(data[3]),                              // day
			int(data[4]),                              // hour
			int(data[5]),                              // minutes
			int(data[6]),                              // seconds
			int(binary.LittleEndian.Uint32(data[7:11]))*1000, // nanoseconds
			loc,
		), nil
	}
	return nil, fmt.Errorf("invalid DATETIME packet length %d", num)
}

下面是我用来复现的小程序:

package main

import (
	"encoding/binary"
	"fmt"
	"time"
)

func main() {
	data := []byte{0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
	shot := time.Date(
		int(binary.LittleEndian.Uint16(data[:2])), // year
		time.Month(data[2]),                       // month
		int(data[3]),                              // day
		int(data[4]),                              // hour
		int(data[5]),                              // minutes
		int(data[6]),                              // seconds
		0,
		time.UTC,
	)
	// time.Date 会对超出正常范围的值做归一化:month 0 归一到上一年的 12 月,
	// day 0 再归一到上一个月的最后一天,所以全 0 的输入最终得到 -0001-11-30 00:00:00
	invalidShot := time.Date(
		0, // year
		0, // month
		0, // day
		0, // hour
		0, // minutes
		0, // seconds
		0,
		time.Local,
	)
	fmt.Printf("time is %v\n", shot)
	fmt.Printf("checked time is %v\n", invalidShot)
	fmt.Printf("year is %v\n", int(binary.LittleEndian.Uint16(data[:2])))
	fmt.Printf("month is %v\n", data[2])
	fmt.Printf("day is %v\n", data[3])
	fmt.Printf("hour is %v\n", data[4])
	fmt.Printf("minutes is %v\n", data[5])
	fmt.Printf("seconds is %v\n", data[6])

}

20 奇怪的卡顿

这几天做一个grpc的性能测试,忽然遇到一个奇怪的问题:有个同步模式的grpc服务器,测试程序不断启动新的grpc client去连接它,而且不断开旧的连接,结果跑了一段时间以后,新的grpc请求就会卡住,不被响应/调度。我一开始以为是代码问题,后来忽然想起来,grpc服务器本身能支持的连接数量是有限的,毕竟每个连接都占文件描述符,打满之后一般会报错:error reading server preface: EOF。这个时候可以看下文件描述符是不是打满了。

为了验证这一点,可以在服务端和client端执行的命令前面加上环境变量,如下所示

 GRPC_GO_LOG_VERBOSITY_LEVEL=99 GRPC_GO_LOG_SEVERITY_LEVEL=info /usr/sbin/xxx --config /etc/xxxx/config.yml

client端执行后会看到下面的报错信息

2023/12/20 17:20:11 INFO: [core] [Channel #149968 SubChannel #149969] Subchannel Connectivity change to IDLE, last error: connection error: desc = "error reading server preface: EOF"
2023/12/20 17:20:11 INFO: [core] pickfirstBalancer: UpdateSubConnState: 0xc1d0b88858, {IDLE connection error: desc = "error reading server preface: EOF"}
2023/12/20 17:20:11 INFO: [core] [Channel #149968] Channel Connectivity change to IDLE
2023/12/20 17:20:11 INFO: [core] [Channel #149968 SubChannel #149969] Subchannel Connectivity change to CONNECTING
2023/12/20 17:20:11 INFO: [core] [Channel #149968 SubChannel #149969] Subchannel picks a new address "172.21.208.11:5444" to connect
2023/12/20 17:20:11 INFO: [core] pickfirstBalancer: UpdateSubConnState: 0xc1d0b88858, {CONNECTING <nil>}
2023/12/20 17:20:11 INFO: [core] [Channel #149968] Channel Connectivity change to CONNECTING
2023/12/20 17:20:11 INFO: [transport] transport: closing: connection error: desc = "error reading server preface: EOF"
2023/12/20 17:20:11 INFO: [core] Creating new client transport to "{\n  \"Addr\": \"172.21.208.11:5444\",\n  \"ServerName\": \"172.21.208.11:5444\",\n  \"Attributes\": null,\n  \"BalancerAttributes\": null,\n  \"Type\": 0,\n  \"Metadata\": null\n}": connection error: desc = "error reading server preface: EOF"
2023/12/20 17:20:11 WARNING: [core] [Channel #149968 SubChannel #149969] grpc: addrConn.createTransport failed to connect to {
  "Addr": "172.21.208.11:5444",
  "ServerName": "172.21.208.11:5444",
  "Attributes": null,
  "BalancerAttributes": null,
  "Type": 0,
  "Metadata": null
}. Err: connection error: desc = "error reading server preface: EOF"
2023/12/20 17:20:11 INFO: [core] [Channel #149968 SubChannel #149969] Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = "error reading server preface: EOF"
2023/12/20 17:20:11 INFO: [transport] transport: loopyWriter exiting with error: transport closed by client
2023/12/20 17:20:11 INFO: [core] pickfirstBalancer: UpdateSubConnState: 0xc1d0b88858, {TRANSIENT_FAILURE connection error: desc = "error reading server preface: EOF"}
2023/12/20 17:20:11 INFO: [core] [Channel #149968] Channel Connectivity change to TRANSIENT_FAILURE

下面的代码是给golang程序生成runtime trace的改法,顺便附在这里

import (
        "context"
        "flag"
        "fmt"
+       "os"
+       "os/signal"
+       "runtime/trace"

....

func main() {
        flag.Parse()
+				f, err := os.Create("/tmp/trace.out")
+    		if err != nil {
+        		panic(err)
+    		}
+    		defer f.Close()
+
+    		// 启动trace
+    		err = trace.Start(f)
+    		if err != nil {
+        		panic(err)
+    		}
+    		defer trace.Stop()
        if *versionFlag {
                fmt.Println("pomerium:", version.FullVersion())
                fmt.Println("envoy:", files.FullVersion())
                return
        }

-       if err := run(ctx); !errors.Is(err, context.Canceled) {
-               log.Fatal().Err(err).Msg("error")
-       }
-       log.Info(ctx).Msg("cmd/command: exiting")
+       go func() {
+               ctx := context.Background()
+               run(ctx)
+       }()
+
+    		signalChan := make(chan os.Signal, 1)
+    		signal.Notify(signalChan, os.Interrupt)
+    		<-signalChan

21 低性能的Unix Domain Socket

这个问题出现有两个原因

  • 原因1(主要原因),k8s云环境配置的limit太小,线程数量很高导致都在互相抢占,上下文切换带来的性能问题很严重。这个最后看了统计里面的CPU throttling监控得到了验证。
  • 原因2(次要原因),tokio的异步await轮询本身就是牺牲了时延的产物,因此会导致时延增高。请注意,尽管这是次要原因,但如果使用了Tokio的容器本身k8s limit配置过低、它的线程也触发了k8s的CPU throttling,那么无论服务端的处理速度有多快,它这边的异步线程调度都会导致观察到的时延迅速增加!

简单描述下上下文:同一个pod里,两个不同k8s容器里的程序通过UDS socket进行通信。服务端容器的k8s limit配置为1,是C语言写的,用libevent+多线程来处理UDS通信事件;另一个是客户端,Rust+Tokio发送数据。结果tokio经常提示UDS通信超时,超过了5ms。我自己在研发环境dev测的时候平均时延并不高,显示处理UDS很快,也就几百us(我的dev环境没有配置k8s limit的限制,因此尽管CPU使用率最终显示为90%,但实际上程序的线程还是在并行跑)。可能的原因如下。

猜想1,资源不足,k8s limit为1导致的上下文切换代价比较高。这个验证也很简单,如果不做限制改成大一些的值,那么应该时延会迅速下降,超时的比例会很低。

猜想2,Rust客户端发出去了,但请求在服务端的backlog里排队,最终超时了:在backlog里待了一会,等待IO事件处理又等了一会,最终导致响应得很慢,也就是说超过了500us。这个并不太可能,因为只要backlog持续堆积,不断溢出必然会出错,但并没有相应的出错日志。

猜想3,Rust客户端多次await,这个导致多次异步轮询耽误了时间。这个验证直接看下异步下的时间延迟即可,

关于猜想1的验证执行了两个操作

  • 查看k8s告警的CPU throttling监控,判断是否出现限流。从监控图可以看出来确实有明显的限流,出于保密的原则就不截图了。这个东西用户可以进入pod的container里面,通过cat /sys/fs/cgroup/cpu,cpuacct/cpu.stat来确定有没有触发限流。同样出于保密的缘故不能贴线上数据,但我在自己的dev环境做了复现,内容如下,可以看到有一半的周期都触发了CPU限流。

    # 配置k8s cpu limit 1时
    root@hxn-debug-zslgf:/suricata# cat /sys/fs/cgroup/cpu,cpuacct/cpu.stat
    nr_periods 6227
    nr_throttled 3038
    throttled_time 3271452651481
    nr_bursts 0
    burst_time 0
      
    # 配置k8s cpu limit 放到和线程数量相等甚至更大时
    root@hxn-debug-zslgf:/suricata# cat /sys/fs/cgroup/cpu,cpuacct/cpu.stat
    nr_periods 721708
    nr_throttled 0
    throttled_time 0
    
  • 把limit修正为8,在压力测试下看超时日志有没有减少。尽管后台的CPU throttling依然触发了限流,但超时日志出现了明显的减少,从1w每分钟减少到了200每分钟(超时阈值维持5ms;如果阈值设为1ms,那么超时会从1w每分钟降低到2k每分钟)。

关于猜想2直接放弃了验证,因为理论可能性太低。

关于猜想3,我在客户端和服务端分别记录了接收和发送的时刻,结果如下。这证明了纯异步确实对时延有一定影响,但在我当时测试的情况下,它的影响不如原因1大。


Rust程序
=================================
准备建立连接
时刻为 3982289 秒和 50986778 纳秒
尝试异步建连后,处理其它事情
=================================
																				C程序
																				=================================
																				处理建立连接
																				时刻为 3982289 s, 51285037 ns
																				连接握手接受,分配资源
																				时刻为 3982289 s, 51328058 ns
																				=================================																	

Rust程序
=================================
发现连接建立完成,准备发送数据
时刻为 3982289 秒和 52508772 纳秒
尝试异步发送,先处理其它事情
=================================


Rust程序
=================================
发现发送数据完成,再次flush数据
时刻为 3982289 秒和 52774571 纳秒
异步flush完成,先处理其它事情
=================================

																				C程序
																				====================================
																				处理IO数据,接受数据
																				时刻为 3982289 s, 52787736 ns
																				处理结束,发送处理的结果
																				时刻为 3982289 s, 52940983 ns
																				====================================


Rust程序
=================================
发现发送数据结束
时刻为 3982289 秒和 52956091 纳秒
准备接受数据
=================================


...

22 优雅地退出Libevent

最近在写一个Libevent服务器,主线程启动多个IO子线程,每个IO子线程都有自己的Loop(Event Base),主线程监听Socket连接,建立连接后会把事件注册到某个子线程。现在遇到的需求是:希望在不丢弃任何IO事件的前提下,IO子线程可以被替换。初步的方案是:

  • 每个IO子线程内部放一个状态位,用于标识当前子线程是不是应该终止;另外建立一个更新子线程,专门负责创建新的IO子线程并做替换。
  • IO子线程循环里先调用 event_base_loopexit(base, &one_minute);,再调用event_base_dispatch(base);,如果没有待处理的IO事件就返回并检查线程状态,如果状态为“准备终止”,就释放内存。
  • 更新子线程监听SIGHUP信号,每当收到SIGHUP信号,就创建一批新的IO子线程,锁住主线程的保护锁,替换主线程记录的子线程。
  • 替换完成之后,更新子线程把旧的IO子线程的状态置为“准备终止”,然后等待旧的IO子线程的状态都变成“终止完成”,之后释放旧IO线程的内存。

但这个方案会莫名其妙地丢数据,后来反应过来是下面这种极端情况。

极端情况
                                                  -------------------------------
                                                  IO线程正好没事情做,工作状态还在Working
                                                  从event_base_dispatched返回
                                                  -------------------------------
------------
主线程分配任务到该IO线程
-------------
                 -----------------------------
                 更新线程执行替换操作
                 置IO线程状态为NeedTerminate
                 -----------------------------
                                                  -----------------------------------
                                                  IO线程开始加载状态,发现是NeedTerminate
                                                  释放内存,但实际上此时是有未处理的事件的
                                                  -----------------------------------
                                                  

所以最终的修复方式很简单,IOWorker的工作函数改成

while (1) {

    struct timeval one_minute = {kDefaultSleepDuration, 0};
    int state = atomic_load(&context->state_);
    if (state == kStateNeedTerminate) {
      // DCLP:看到“准备终止”后并不立刻清理,而是再跑最后一轮 loopexit + dispatch,
      // 把“检查状态之后、真正退出之前”这段窗口里主线程塞进来的事件全部消费掉,
      // 然后才释放资源,保证不会带着未处理的事件退出
      event_base_loopexit(base, &one_minute);
      event_base_dispatch(base);
      cleanupSuricataContext(context);

      break;
    }
    // 正常路径:跑一轮最多阻塞 one_minute 的事件循环,返回后再回头检查状态位
    event_base_loopexit(base, &one_minute);
    event_base_dispatch(base);
  }

23 Suricata初始化

这两天改代码,会概率性出现下面的报错

Error: misc: pcre2_substring_copy_bynumber failed [ParseSizeString:util-misc.c:101]
Error: stream-tcp: Error parsing stream.memcap from conf file - 64mb.  Killing engine [StreamTcpInitConfig:stream-tcp.c:400]

结尾

唉,尴尬
