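Tracking down a memory-stomping (buffer overflow) crash

Introduction

One night at 11 p.m., a DingTalk alert reported that a service had caught an exception and restarted. The service had been in production for over a month, so being hit with this in the middle of the night left me a bit dazed. I opened my laptop right away and downloaded the dump file to analyze it. Unfortunately, the service crashed several more times while I was still analyzing. After roughly an hour of repeated crashes and restarts, it finally returned to normal.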

Analyzing the dump

Download: https://pan.baidu.com/s/1GPzzipmxWyIr5WKNq11pIQ (extraction code: 6dsn)

Analyzing [11-13 20-53-03]full.dmp

  1. Open the dmp in WinDbg and load the symbol files
  2. Run .ecxr to switch to the exception context

    0:061> .ecxr
    eax=00000000 ebx=dfa477a6 ecx=0000020c edx=00000000 esi=10171adc edi=dfa40000
    eip=77bee41b esp=08c8f51c ebp=08c8f550 iopl=0 nv up ei pl nz na pe nc
    cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00010206
    ntdll!RtlInitUnicodeString+0x1f3:
    77bee41b 8930 mov dword ptr [eax],esi ds:002b:00000000=????????
    0:061> kv
    *** Stack trace for last set context - .thread/.cxr resets it
    ChildEBP RetAddr Args to Child
    WARNING: Stack unwind information not available. Following frames may be wrong.
    08c8f550 77bee023 00936e28 00936e28 08c8f5b4 ntdll!RtlInitUnicodeString+0x1f3
    *** ERROR: Symbol file could not be found. Defaulted to export symbols for kernel32.dll -
    08c8f568 774414ad 008e0000 00000000 1065fd28 ntdll!RtlFreeHeap+0x7e
    *** ERROR: Symbol file could not be found. Defaulted to export symbols for msvcr120.dll -
    08c8f57c 70fdecfa 008e0000 00000000 1065fd28 kernel32!HeapFree+0x14
    *** WARNING: Unable to verify checksum for topsvr.exe
    08c8f590 00315e35 1065fd28 0f611df0 002cef01 msvcr120!free+0x1a
    08c8f5a8 00315f4a 00936e28 00000000 00000000 topsvr!redisBufferWrite+0xb5
    08c8f5bc 003160b7 00936e28 08c8f5d0 1033f5e8 topsvr!redisGetReply+0x4a
    08c8f5d4 0034ae42 00936e28 0f813270 69b2eb38 topsvr!redisCommand+0x37
    08c8f6dc 0034ee4f 08c8f764 08c8f818 69b2ea74 topsvr!CRedisBase::RedisCommand+0x92 (FPO: [Non-Fpo]) (CONV: thiscall) [d:\program files (x86)\jenkins\workspace\publish_gamechannel\topsvr\redisbase.cpp @ 217]
    08c8f790 0034d24a 08c8f9fc 08c8f818 69b2e5d4 topsvr!CRedisMaster::GetYQWRoomInfo+0x5f (FPO: [Non-Fpo]) (CONV: thiscall) [d:\program files (x86)\jenkins\workspace\publish_gamechannel\topsvr\redismaster.cpp @ 623]
    08c8f830 00378957 08c8f9fc 08c8f924 69b2e414 topsvr!CRedisMaster::GetYQWRoomInfo+0x4a (FPO: [Non-Fpo]) (CONV: thiscall) [d:\program files (x86)\jenkins\workspace\publish_gamechannel\topsvr\redismaster.cpp @ 636]
    08c8f9f0 0036bab4 08c8fafc 00048e0f 69b2e604 topsvr!ToPSvrThriftHandler::GetYQWRoomByNo+0xc7 (FPO: [Non-Fpo]) (CONV: thiscall) [d:\program files (x86)\jenkins\workspace\publish_gamechannel\topsvr\thirdpart\thrift-0.10.0\gen-cpp\topsvrthrift_server.skeleton.cpp @ 86]
    08c8fbe0 0036b31f 00000000 100e3170 100e30e0 topsvr!Tcy::ToPSvr::Thrift::ToPSvrThriftProcessor::process_GetYQWRoomByNo+0x1e4 (FPO: [Non-Fpo]) (CONV: thiscall) [d:\program files (x86)\jenkins\workspace\publish_gamechannel\topsvr\thirdpart\thrift-0.10.0\gen-cpp\topsvrthrift.cpp @ 1823]
    08c8fcec 00375429 100e3170 100e30e0 08c8fd64 topsvr!Tcy::ToPSvr::Thrift::ToPSvrThriftProcessor::dispatchCall+0x28f (FPO: [Non-Fpo]) (CONV: thiscall) [d:\program files (x86)\jenkins\workspace\publish_gamechannel\topsvr\thirdpart\thrift-0.10.0\gen-cpp\topsvrthrift.cpp @ 1742]
    08c8fd90 002dcf11 100e3170 0f611e08 100e30e0 topsvr!apache::thrift::TDispatchProcessor::process+0xd9 (FPO: [Non-Fpo]) (CONV: thiscall) [d:\svn143\library\thrift-0.10.0\lib\cpp\src\thrift\tdispatchprocessor.h @ 121]
    08c8fe44 002cab31 0093f988 002cadde 69b2e34c topsvr!apache::thrift::server::TConnectedClient::run+0x121
    08c8fea8 002cf00c 69b2e334 0093fed0 0093efa8 topsvr!apache::thrift::concurrency::ThreadManager::Task::run+0x11
    08c8fed0 002cfcd1 0093efa8 0093fed0 69b2e2e8 topsvr!apache::thrift::concurrency::StdThread::threadMain+0x5c
    *** ERROR: Symbol file could not be found. Defaulted to export symbols for msvcp120.dll -
    08c8ff0c 7196f33c 432a8f49 00000000 009817d8 topsvr!std::_LaunchPad<std::_Bind<1,void,void (__cdecl*const)(boost::shared_ptr<apache::thrift::concurrency::StdThread>),boost::shared_ptr<apache::thrift::concurrency::StdThread> > >::_Run+0x71
    08c8ff34 70ffc01d 0377f764 432a8da9 00000000 msvcp120!std::_Pad::_Release+0x6c
  3. The exception originates from CRedisBase::RedisCommand, and the fault is ultimately raised under kernel32!HeapFree

  4. Run !analyze -v to see the cause of the failure
SYMBOL_NAME:  heap_corruption!heap_corruption

FOLLOWUP_NAME: MachineOwner

MODULE_NAME: heap_corruption

IMAGE_NAME: heap_corruption

DEBUG_FLR_IMAGE_TIMESTAMP: 0

STACK_COMMAND: ~61s; .ecxr ; kb

FAILURE_BUCKET_ID: HEAP_CORRUPTION_c0000005_heap_corruption!heap_corruption

BUCKET_ID: APPLICATION_FAULT_HEAP_CORRUPTION_INVALID_POINTER_WRITE_NULL_POINTER_WRITE_heap_corruption!heap_corruption
  5. At this point it is fairly certain that the crash was caused by heap corruption.

Heap corruption

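Heap corruption does not necessarily cause a failure at the moment a buffer overflows or gets stomped on.
So how can we make the corruption raise an exception right when it happens? With the Windows Application Verifier mechanism introduced in an earlier post.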
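To make the delay concrete, here is a toy model, not the real Windows heap: two blocks live in one fixed arena, each preceded by a small header holding a magic value. All names (ToyHeap, unchecked_write, free_block_ok) are hypothetical and exist only for this sketch. An oversized write into the first block damages the second block's header without any visible effect, and the damage is noticed only when the second block is "freed".

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Toy layout (an assumption for illustration): [header0][payload0][header1][payload1]
constexpr std::size_t kHeader = sizeof(std::uint32_t);
constexpr std::size_t kPayload = 8;
constexpr std::size_t kBlocks = 2;
constexpr std::uint32_t kMagic = 0xFEEDC0DEu;

struct ToyHeap {
    std::array<unsigned char, kBlocks * (kHeader + kPayload)> arena{};

    ToyHeap() {
        // Stamp every header with the magic value.
        for (std::size_t i = 0; i < kBlocks; ++i) {
            std::uint32_t magic = kMagic;
            std::memcpy(arena.data() + i * (kHeader + kPayload), &magic, sizeof(magic));
        }
    }

    // Offset of block i's payload inside the arena.
    std::size_t payload_offset(std::size_t i) const {
        assert(i < kBlocks);
        return i * (kHeader + kPayload) + kHeader;
    }

    // Writes len bytes into block i without checking len against kPayload.
    // The only check is against the arena itself, so the demo stays memory-safe.
    void unchecked_write(std::size_t i, std::size_t len, unsigned char value) {
        std::size_t off = payload_offset(i);
        assert(off + len <= arena.size());
        std::memset(arena.data() + off, value, len);
    }

    // "Freeing" block i is the first moment its header is validated.
    bool free_block_ok(std::size_t i) const {
        std::uint32_t magic = 0;
        std::memcpy(&magic, arena.data() + payload_offset(i) - kHeader, sizeof(magic));
        return magic == kMagic;
    }
};
```

Writing 10 bytes into the 8-byte block 0 succeeds silently; only a later free_block_ok(1) reports the problem, far away from the code that caused it. That matches the shape of the first dump, where the fault appears under HeapFree on a Redis code path.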

Using appverif.exe

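Page heap has two modes:

  1. Normal page heap

  2. Full page heap

Normal page heap detects damaged heap blocks through fill patterns (extra metadata added around each block). Full page heap additionally places a guard page next to each block.

Because normal page heap does not separate the metadata from the block's memory, damage to that metadata still cannot be caught at the moment it happens.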
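The fill-pattern idea can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the actual page heap layout: the class name PatternBlock, the pattern byte 0xA5 and the 8-byte suffix are all made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr unsigned char kFill = 0xA5;  // arbitrary pattern chosen for this sketch
constexpr std::size_t kSuffix = 8;     // arbitrary suffix length

// A user area followed by a run of known bytes that is verified on release.
class PatternBlock {
public:
    explicit PatternBlock(std::size_t user_size)
        : user_size_(user_size), storage_(user_size + kSuffix, kFill) {
        assert(user_size > 0);
    }

    unsigned char* data() { return storage_.data(); }
    std::size_t size() const { return user_size_; }

    // True while the bytes after the user area still hold the fill pattern.
    bool suffix_intact() const {
        for (std::size_t i = user_size_; i < storage_.size(); ++i) {
            if (storage_[i] != kFill) return false;
        }
        return true;
    }

private:
    std::size_t user_size_;
    std::vector<unsigned char> storage_;
};
```

The write that damages the suffix goes through without any error; the damage shows up only when suffix_intact() is called. A pattern check can therefore say that something overflowed, but not at the instruction that did it.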

The drawback of full page heap is its memory cost: it can raise the program's memory usage by an order of magnitude.
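A back-of-the-envelope calculation shows where the cost comes from. The sketch below assumes 4 KiB pages, one guard page per allocation and no other bookkeeping; the function names are hypothetical and the real overhead depends on the allocation mix.

```cpp
#include <cassert>
#include <cstddef>

// Rounds n up to a multiple of page.
inline std::size_t round_up(std::size_t n, std::size_t page) {
    assert(page != 0);
    return (n + page - 1) / page * page;
}

// Approximate address space used by one allocation under a guard-page scheme:
// whole pages for the data plus one inaccessible page next to it.
inline std::size_t guarded_footprint(std::size_t size, std::size_t page = 4096) {
    return round_up(size == 0 ? 1 : size, page) + page;
}
```

Under these assumptions a 128-byte request occupies 8192 bytes, so a service that makes many small allocations pays far more than its nominal heap size.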

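The usual recommendation is to run normal page heap in the test environment and switch to full page heap for periodic checks.
Use full page heap in production only when conditions allow; the scope of checking can also be narrowed by restricting it to specific DLLs or to a range of allocation sizes.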


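  • full: normal page heap or full page heap
  • dlls: which DLLs are included in heap checking. Names include the extension; separate multiple DLLs with spaces
  • size: restricts checking to allocations within a given size range
  • backward: by default full page heap places each block at the end of a page so that overruns fault immediately; this switch places blocks at the beginning of the page instead, so that underruns are caught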
    ……
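How the block is positioned relative to the guard page decides which kind of bug faults immediately. The helper below is a simplified model, not Application Verifier code: block_offset is a hypothetical name, and the 4 KiB page size and 8-byte alignment are assumptions.

```cpp
#include <cassert>
#include <cstddef>

// Offset of the user block from the start of its data page(s).
// forward (backward == false): the block ends at the guard page that follows it,
//   so a write past the end faults.
// backward: the block starts at the beginning of the page, right after an
//   inaccessible page, so a write before the start faults.
inline std::size_t block_offset(std::size_t size, bool backward,
                                std::size_t page = 4096, std::size_t align = 8) {
    assert(size > 0 && page != 0 && align != 0 && page % align == 0);
    if (backward) return 0;
    std::size_t aligned = (size + align - 1) / align * align;  // size rounded up to the alignment
    std::size_t span = (aligned + page - 1) / page * page;     // whole pages that hold the block
    return span - aligned;
}
```

In this model a 100-byte block is rounded up to 104 bytes, which leaves 4 bytes of slack before the guard page; an overrun of one to four bytes would land in that slack and not fault. Sizes that are already a multiple of the alignment, such as 88, end exactly at the page boundary.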

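Analyzing [11-13 21-10-08]full 开启heap检测.dmp (dump taken with heap checking enabled)

The next evening I stayed on watch. As soon as the crash showed up, I enabled heap checking for the service, and after running for a few minutes it crashed again.

Open this dmp in WinDbg:

  1. Run .ecxr to switch to the exception context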
0:119> .ecxr 
eax=22ac1000 ebx=08de9ce8 ecx=22ac1000 edx=22ac0fff esi=1c8c0fe0 edi=08de9ce0
eip=00bb6e31 esp=17d7f258 ebp=17d7f2ac iopl=0 nv up ei pl zr na pe nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00010246
*** WARNING: Unable to verify checksum for topsvr.exe
topsvr!Base64encode+0x1c1:
00bb6e31 c60000 mov byte ptr [eax],0 ds:002b:22ac1000=??
0:119> kv
*** Stack trace for last set context - .thread/.cxr resets it
ChildEBP RetAddr Args to Child
17d7f2ac 00bbc06e 22ac0fa8 22d3ee58 00000042 topsvr!Base64encode+0x1c1 (FPO: [Non-Fpo]) (CONV: cdecl) [d:\program files (x86)\jenkins\workspace\publish_gamechannel\topsvr\base64.cpp @ 207]
17d7f50c 00bb83d1 22d3e198 22d3ed80 17d7f620 topsvr!CBillDB::MakeParam_PushPlayerInfo+0x19e (FPO: [Non-Fpo]) (CONV: thiscall) [d:\program files (x86)\jenkins\workspace\publish_gamechannel\topsvr\billdb.cpp @ 625]
17d7f668 00bb8842 22d3e198 22d3e390 00000000 topsvr!CBillDB::PushBill2DB+0x351 (FPO: [Non-Fpo]) (CONV: thiscall) [d:\program files (x86)\jenkins\workspace\publish_gamechannel\topsvr\billdb.cpp @ 102]
17d7f718 00bec143 22d3e198 22d3e390 11588718 topsvr!CBillDB::PushBill+0xe2 (FPO: [Non-Fpo]) (CONV: thiscall) [d:\program files (x86)\jenkins\workspace\publish_gamechannel\topsvr\billdb.cpp @ 154]
17d7f884 00be7695 21df6fe8 22528fe0 1732cfb8 topsvr!CSockServer::OnYQWResultEx+0x4a3 (FPO: [Non-Fpo]) (CONV: thiscall) [d:\program files (x86)\jenkins\workspace\publish_gamechannel\topsvr\socksvr.cpp @ 1733]
17d7f91c 00c1ac94 21df6fe8 22528fe0 08de9ce8 topsvr!CSockServer::OnRequest+0x3b5 (FPO: [Non-Fpo]) (CONV: thiscall) [d:\program files (x86)\jenkins\workspace\publish_gamechannel\topsvr\socksvr.cpp @ 293]
17d7f944 00c1efeb 00000000 17324c40 17070c40 topsvr!CIocpWorker::DoWorkLoop+0xa4
17d7f95c 00c1efbb 17d7f99c 70ffc01d 08de9ce0 topsvr!CBaseWorker::WorkerThreadProc+0x2b
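  2. It is easy to see that Base64encode triggered the exception
  3. Inspecting the input arguments in WinDbg's Watch window revealed the cause: the nikename array was allocated only 128 bytes, but one player's nickname took 156 bytes, so the conversion overflowed the buffer.
  4. After the length was changed to 256 and the service was redeployed, the problem did not recur.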
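Enlarging the buffer fixes this case, but a size check before encoding would turn any future oversize input into a handled error instead of heap damage. The helpers below are a hypothetical sketch, not part of topsvr. They assume the encoder writes a trailing NUL, which is consistent with the faulting instruction storing a zero byte, but is still an assumption about Base64encode.

```cpp
#include <cassert>
#include <cstddef>

// Bytes needed for the Base64 encoding of n input bytes plus a trailing NUL:
// every group of 3 input bytes (rounded up) becomes 4 output characters.
inline std::size_t base64_encoded_size(std::size_t n) {
    assert(n <= static_cast<std::size_t>(-1) / 2);  // keeps the arithmetic below from wrapping
    return (n + 2) / 3 * 4 + 1;
}

// Hypothetical guard a caller could run before encoding into a fixed buffer.
inline bool fits_base64(std::size_t input_len, std::size_t dest_capacity) {
    return base64_encoded_size(input_len) <= dest_capacity;
}
```

Taking 156 as the raw input length purely for illustration, the encoded form needs 209 bytes: too large for a 128-byte buffer, but within 256. A caller that checks fits_base64 first can reject or truncate the input rather than rely on the buffer always being large enough.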