TL;DR: Being able to see the stack trace and investigate the values in those stack frames from a failed node process makes it much easier to understand the error messages and what has gone wrong. mdb on SmartOS makes this possible, arguably, even easy.
I run several small Triton Smart Data Center installations, both for fun and as part of my job. In both cases, I'm running the open source product, in a minimal configuration which might have more in common with a lab or development environment than a large datacenter. I'm not running any HA. I have backups and I know how to recover from them. I don't have fully independent power. Bringing the system back up when all of the HA elements were down at the same time isn't a challenge I wanted to engage in. The non HA nature of my configurations will become relevant shortly.
Upgrades to Triton Smart Data Center are smooth. Amazingly smooth. I've been doing essentially monthly upgrades of the components for nearly 2 years. In those two years, I've had two upgrades not go perfectly smoothly. This is the story of the second one.
During a standard monthly update, while doing an sdcadm update --allthe process timed out in one of the processing stages. This has happened before because my system was slow, but up until now, simply re-running the update command has cleaned up the environment. Which is what I did, which is also where the fun began.
Triton has most of its state stored in a single service called manatee. Thus, this is a somewhat sensitive service/zone. The last zone to upgrade after the failed first run was manatee, and with great concern, it did not start correctly after the update. That moment is when you know you are going to be having a longer night than you had planned.
More than one Triton service uses waferlock. In my manatee zone waferlock was not starting. The logs were... correct, but not helpful. Sadly, they have rotated out before I have written this blog, so I can not correctly reproduce them here.
The logs indicated that waferlock was crashing here in the process of attempting to start. The logs did not, however contain a full listing of what was in inst.metadata, only that DATACENTER was undefined and that a core had been dumped. I spent several hours chasing around red herrings, attempting to figure out what part of my configuration had gone so very wrong. On the second attempt to figure this out, after I took a break for sleep, I decided it was time to attempt to figure this out from the core file.
root@6a4702d7-cd1a-4648-876a-3f95b81c36e7 (us-elns-1:manatee0) /]# mdb core Loading modules: [ libumem.so.1 libc.so.1 ld.so.1 ] > ::stack _ZN2v84base2OS5AbortEv+9(93a2008, ada186f5, 80430c0, 80432c8, 80432cc, 0) _ZN2v88internal2IC9TypeErrorENS0_15MessageTemplate8TemplateENS0_6HandleINS0_6ObjectEEES6_+0x4d(804321c, 8043238, 2f, 80432cc, 80432c8, 80432cc) _ZN2v88internal6LoadIC4LoadENS0_6HandleINS0_6ObjectEEENS2_INS0_4NameEEE+0x1a2(804321c, 8043238, 80432cc, 80432c8, 8043238, 80432cc) _ZN2v88internal11LoadIC_MissEiPPNS0_6ObjectEPNS0_7IsolateE+0x17e(4, 80432cc, 93a2008, aa10a07c, aa10a001, 80432a0) 0xaa10a07c(81459af5, e, 9594c681, 9fa08099, 9fa08099, 9fa08099) 0x9f7d5337(81459911, 9fa08099, 6, ada185bd, 14, 8043338) 0xaa11a143(ada17b49, 0, 81459911, 9fa08099, 9fa08099, 81459911) 0xaa17f1ef(4, ada17b49, 9fa08099, ada185bd, 9fa08099, a3c424f1) 0x9f7324a1(9fa08099, ada185bd, ada17b49, 2, a3c42699, 14) 0xaa11a143(ada185bd, ada17b49, a3c42699, ada17bd5, ada17b49, ad9fdd1d) 0x9f7d4d13(ada17985, ada16195, ad9fe6fd, 9fa08089, 9fa6c7c9, 8) 0xaa11ef22(ada17995, 9fa6c7c9, ad9fdd1d, 9fa608bd, ada17995, ada17995) 0xaedbfc99(9fa6c7c9, 8, ad9fdd71, 14, 8043454, 9f7cc8db) 0xaa11a143(ada17985, ada16195, ad9fe6fd, 9fa08089, 9fa08099, ad9fdd71) 0x9f7cc8db(ada16b8d, ada16195, ad9fe6fd, 9fa08089, 9fa08099, ada0ba35) 0x9f7cc243(ada16195, 0, 0, aa11edc1, 10, 9fa24719) 0xaa11ef22(ada17655, ada16195, ada16789, 9fa608bd, ada17655, ada1689d) 0x9f735516(ada16195, 9fa6082d, 4, ada173a9, 6, 85889251) 0xaede2768(ada16195, 9fa08231, ada166c5, 9fa08099, 85889251, a3c4d821) 0xaa172733(9fa551ed, ada16195, a3c4d821, 8588b641, 8586db55, 8043580) 0x9f7cb300(ada16195, ada1620d, 9fa08099, 8588b641, 858890c5, 9fa08211) 0x9f71cf74(ada16195, ada1620d, 8588b641, 9fa08099, 8588f749, ada16fad) 0xaedc164f(9fa6c735, 858100f5, aa11a921, 10, 0, 80435f8) 0xaa11a9e1(0, 0, 2, 0, 93fcdf8, 93a2008) 0xaa11999f(aedc1320, 858100f5, 9fa6c735, 0, 0, 93a2008) _ZN2v88internal9Execution4CallEPNS0_7IsolateENS0_6HandleINS0_6ObjectEEES6_iPS6_b+0xff(80436ac, 93a2008, 93fcdf8, 93fccd8, 0, 0) _ZN2v88Function4CallENS_5LocalINS_7ContextEEENS1_INS_5ValueEEEiPS5_+0xd7(804372c, 93fcdf8, 93e402c, 93fccd8, 0, 0) _ZN2v88Function4CallENS_5LocalINS_5ValueEEEiPS3_+0x3c(804378c, 93fcdf8, 93fccd8, 0, 0, 93a2008) _ZN4node9AsyncWrap12MakeCallbackEN2v85LocalINS1_8FunctionEEEiPNS2_INS1_5ValueEEE+0x22e(8043818, 9410a48, 93e4018, 3, 8043824, 93fcfd8) _ZN4node10StreamBase8EmitDataEiN2v85LocalINS1_6ObjectEEES4_+0xec(9410a70, 24d, 93e400c, 0, 0, 9460848) _ZN4node10StreamWrap10OnReadImplEiPK8uv_buf_t14uv_handle_typePv+0x125(24d, 8043950, 0, 9410a48) _ZN4node10StreamWrap6OnReadEP11uv_stream_siPK8uv_buf_t+0x76(9410aa8, 24d, 8043950, 8566f5c, 10000, 0) uv__read+0x1cd(9410aa8, 0, 4, fe8c81b9) uv__stream_io+0x2c1(9366e40, 9410aec, 1, 8538914, 94517d8, 9366e6c) uv__io_poll+0x20d(9366e40, c8, 4, 9fa08255, 9366f3c, 9366e50) uv_run+0x127(9366e40, 1, fe956720, 8335a3e) _ZN4node5StartEiPPc+0x59d(2, 8047d44, 4, 400, 9355ff4, 8047cf8) main+0x42(fe958737, fe9d2548, 8047d38, 8314543, 3, 8047d44) _start+0x83(3, 8047e34, 8047e7d, 8047e7d, 0, 8047e9e) >
Yup, that's a stack. That's not very helpful. Let's load up the v8 tooling.
> ::load v8 mdb_v8 version: 1.4.1 (release, from 0cd139c) V8 version: 18.104.22.168 Autoconfigured V8 support from target C++ symbol demangling enabled > ::jsstack native: v8::base::OS::Abort+9 native: v8::internal::IC::TypeError+0x4d native: v8::internal::LoadIC::Load+0x1a2 native: v8::internal::LoadIC_Miss+0x17e (1 internal frame elided) js: <anonymous> (as <anon>) (1 internal frame elided) js: InnerArrayForEach js: forEach (1 internal frame elided) js: <anonymous> (as <anon>) (1 internal frame elided) js: <anonymous> (as <anon>) (1 internal frame elided) js: parseResponse js: done (1 internal frame elided) js: g js: emitNone js: emit js: endReadableNT js: nextTickCallbackWith2Args js: _tickCallback (1 internal frame elided) (1 internal frame elided) native: v8::internal::Execution::Call+0xff native: v8::Function::Call+0xd7 native: v8::Function::Call+0x3c native: node::AsyncWrap::MakeCallback+0x22e native: node::StreamBase::EmitData+0xec native: node::StreamWrap::OnReadImpl+0x125 native: node::StreamWrap::OnRead+0x76 native: uv__read+0x1cd native: uv__stream_io+0x2c1 native: uv__io_poll+0x20d native: uv_run+0x127 native: node::Start+0x59d native: main+0x42 native: _start+0x83 >
That's better? Not really helpful yet. Let's try the -vn0 modifier to see if that helps.
> ::jsstack -vn0 native: v8::base::OS::Abort+9 native: v8::internal::IC::TypeError+0x4d native: v8::internal::LoadIC::Load+0x1a2 native: v8::internal::LoadIC_Miss+0x17e (1 internal frame elided) js: <anonymous> (as <anon>) file: /opt/smartdc/waferlock/lib/sapi.js posn: line 184 this: 9fa08099 (Oddball: "undefined") arg1: 81459911 (<unknown subclass>: String) (1 internal frame elided) js: InnerArrayForEach file: native array.js posn: position 16657 this: 9fa08099 (Oddball: "undefined") arg1: ada185bd (JSFunction) arg2: 9fa08099 (Oddball: "undefined") arg3: ada17b49 (JSArray) arg4: 4 (SMI: value = 2) js: forEach file: native array.js posn: position 17214 this: ada185bd (JSFunction) arg1: 9fa08099 (Oddball: "undefined") (1 internal frame elided) js: <anonymous> (as <anon>) file: /opt/smartdc/waferlock/lib/sapi.js posn: line 151 this: 9fa6c7c9 (JSGlobalProxy) arg1: 9fa08089 (Oddball: "null") arg2: ad9fe6fd (JSObject: ClientRequest) arg3: ada16195 (JSObject: IncomingMessage) arg4: ada17985 (JSArray) (1 internal frame elided) js: <anonymous> (as <anon>) file: /opt/smartdc/waferlock/node_modules/mooremachine/lib/fsm.js posn: line 199 this: 9fa6c7c9 (JSGlobalProxy) (1 internal frame elided) js: parseResponse file: /opt/smartdc/waferlock/node_modules/restify-clients/lib/JsonClient.js posn: line 60 this: 9fa08099 (Oddball: "undefined") arg1: 9fa08089 (Oddball: "null") arg2: ad9fe6fd (JSObject: ClientRequest) arg3: ada16195 (JSObject: IncomingMessage) arg4: ada16b8d (SeqOneByteString) js: done file: /opt/smartdc/waferlock/node_modules/restify-clients/lib/StringClient.js posn: line 194 this: ada16195 (JSObject: IncomingMessage) (1 internal frame elided) js: g >> More [<space>, <cr>, q, n, c, a] ?
Okay, I can now clearly orient myself in the stack frame. Based on the source code, I'd like to see what was in the 'added' array at time of failure.
> ada17b49::jsprint [ "b4d65123-85b8-4406-b065-0d4a3719fd8f", "b9094c09-353c-4a49-b3cc-a16940697c00", ]
Okay, I had an immediate and strong reaction to that of 'what?'. Remember how I said earlier that I was not running an HA environment? Well that was an indication that there was more than one moray being listed, and in my environment, there's only one moray.
Additional investigation eventually led me to noting that the moray0tmp that gets created during the upgrade process hadn't been cleared out of SAPI, probably due to the crash during the first half of the update. The moray0tmp that's created during the update process does not contain all of the metadata a full moray does, and does not have the DATACENTER element, which is where waferlock was crashing.
I then manually cleaned up SAPI's record of reality, removing moray0tmp's record, and waferlock started immediately. Problem resolved. Board shows green again.
In the process of writing this entry, I stumbled over this wonderful block comment explaining that this does occasionally happen. It appears there is some cleanup code for this case, which would have cleaned up the leftover records at the next update of moray.
So, next time a system helpfully drops you a core file, just open it up and start looking.