- November 4th -- 16:30 to 15:00 UTC
- November 13th -- 15:00 to 19:30 UTC
- November 15th -- 15:30 to 18:00 UTC
The root cause of this downtime was insufficient resources, both hardware and human. This is a full post-mortem in which we will look at how npmjs.org works, what went wrong, how we changed the previous architecture of The npm Registry to fix it, and the next steps we are taking to prevent this from happening again.
All of the next steps require additional expenditure from Nodejitsu: both servers and labor. This is why along with this post-mortem we are announcing our crowdfunding campaign: scalenpm.org! Our goal is to raise enough funds so that Nodejitsu can continue to run The npm Registry as a free service for you, the community.
Please take a minute now to donate at https://scalenpm.org!
How does npmjs.org work?
There are two distinct components that make up npmjs.org operated by different people:
- http://registry.npmjs.org: The main CouchApp (GitHub: isaacs/npmjs.org) that stores both package tarballs and metadata. It has been operated by Nodejitsu since we acquired IrisCouch in May. The primary system administrator is Jason Smith, the current CTO at Nodejitsu, cofounder of IrisCouch, and the System Administrator of registry.npmjs.org since 2011.
- http://npmjs.org: The npmjs website that you interact with using a web browser. It is a Node.js program (GitHub: isaacs/npm-www) maintained and operated by Isaac and running on a Joyent Public Cloud SmartMachine.
Here is a high-level summary of the old architecture:
What went wrong and how was it fixed?
As illustrated above, before November 13th, 2013, npm operated as a single CouchDB server with regular daily backups. We briefly ran a multi-master CouchDB setup after downtime back in August, but after reports that npm login no longer worked correctly we rolled back to a single CouchDB server. On both November 4th and November 13th CouchDB became unresponsive on requests to the /registry database while requests to all other databases (e.g. /public_users) remained responsive. Although the root cause of the CouchDB failures has yet to be determined, given that only requests to /registry were slow and/or timed out, we suspect it is related to the massive number of attachments stored in the registry.
The incident on November 4th was ultimately resolved by a reboot and resize of the host machine, but when the same symptoms recurred less than 10 days later, additional steps were taken:
- The registry was moved to another machine of equal resources to exclude the possibility of a hardware issue.
- The registry database itself was compacted.
When neither of these yielded a solution, Jason Smith and I decided to move to a multi-master architecture with continuous replication, illustrated below:
This should have been the end of our story, but unfortunately our supervision logic failed to restart the secondary master on the morning of November 15th. During this time we moved briefly back to a single-master architecture. Since then the secondary master has been closely monitored by the entire Nodejitsu operations team to ensure its continued stability.
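As a rough illustration of what "multi-master with continuous replication" means at the API level, the sketch below builds the two requests that would ask each CouchDB master to pull continuously from the other through CouchDB's `/_replicate` endpoint. It is not the configuration Nodejitsu actually used: the host names, the helper names, and the choice of pull replication are assumptions made for this example, and a real deployment would also need admin credentials.

```python
import json
from urllib.request import Request


def replication_request(server_url, source, target):
    """Build (but do not send) a POST to CouchDB's /_replicate endpoint.

    "continuous": True asks CouchDB to keep the replication running
    instead of stopping after one pass.
    """
    body = {"source": source, "target": target, "continuous": True}
    return Request(
        url=server_url.rstrip("/") + "/_replicate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def multi_master_requests(master_a, master_b, db="registry"):
    """Return one request per master so that each pulls from the other.

    master_a and master_b are hypothetical base URLs; "registry" is the
    database name mentioned in the post.
    """
    return [
        replication_request(master_a, f"{master_b}/{db}", f"{master_a}/{db}"),
        replication_request(master_b, f"{master_a}/{db}", f"{master_b}/{db}"),
    ]


if __name__ == "__main__":
    # Hypothetical hosts, used only to show the request shapes.
    for req in multi_master_requests(
        "http://couch-a.example:5984", "http://couch-b.example:5984"
    ):
        print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

Passing one of these requests to `urllib.request.urlopen` would actually start the replication. Replications started through `/_replicate` do not survive a server restart, so some form of supervision that re-issues them (or a persistent document in the `_replicator` database) is needed, which is the kind of logic the paragraph above describes failing on November 15th.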
What is being done to prevent future incidents?
The public npm registry simply cannot go down. Ever. We gained a lot of operational knowledge about The npm Registry and about CouchDB as a result of these outages. This new knowledge has made clear several steps that we need to take to prevent future downtime:
- Always be in multi-master: The multi-master CouchDB architecture we have set up will scale to more than just two CouchDB servers. As npm grows we'll be able to add additional capacity!
- Decouple www.npmjs.org and registry.npmjs.org: Right now www.npmjs.org still depends directly on registry.npmjs.org. We are planning to add an additional replica to the current npm architecture so that Isaac can more easily service requests to www.npmjs.org. That means it won't go down if the registry goes down.
- Always have a spare replica: We need to have a hot spare replica running continuous replication from either master to swap in when necessary. This is also important as we need to regularly run compaction on each master, since the registry is growing ~10GB per week on disk.
- Move attachments out of CouchDB: Work has begun to move the package tarballs out of CouchDB and into Joyent's Manta service. Additionally, MaxCDN has generously offered to provide CDN services for npm, once the tarballs are moved out of the registry database. This will help improve delivery speed, while dramatically reducing the file system I/O load on the CouchDB servers. Work is progressing slowly, because at each stage in the plan, we are making sure that current replication users are minimally impacted.
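The regular compaction mentioned in the list above maps onto a single CouchDB call, `POST /{db}/_compact`, and the database info document returned by `GET /{db}` reports whether a compaction is still running. The sketch below only builds the request and interprets an info document; the host name and helper names are assumptions for illustration, and the call requires admin credentials on a real server.

```python
from urllib.request import Request


def compaction_request(server_url, db="registry"):
    """Build (but do not send) the request that triggers compaction.

    CouchDB expects a JSON content type on this endpoint even though
    the body is empty.
    """
    return Request(
        url=f"{server_url.rstrip('/')}/{db}/_compact",
        data=b"",
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def is_compacting(db_info):
    """Interpret the JSON object returned by GET /{db}.

    The "compact_running" field is true while compaction is in progress.
    """
    return bool(db_info.get("compact_running", False))


if __name__ == "__main__":
    # Hypothetical host, used only to show the request shape.
    req = compaction_request("http://couch-a.example:5984")
    print(req.get_method(), req.full_url)
    print(is_compacting({"db_name": "registry", "compact_running": True}))
```

With a hot spare in place, one plausible routine is to swap the spare in for a master, send the compaction request to that master, poll until `is_compacting` returns false, and then swap it back.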
When these new infrastructure components are in place, The npm Registry will look like this:
You are npm! And we need your help!
The npm Registry has had a 10x year. In November 2012 there were 13.5 million downloads. In October 2013 there were 114.6 million package downloads. We're honored to have been a part of sustaining this growth for the community and we want to see it continue to grow to a billion package downloads a month and beyond.
But we need your help! All of these necessary improvements require more servers, more time from Nodejitsu staff and an overall increase to what we spend maintaining the public npm registry as a free service for the Node.js community.
Please take a minute now to donate at https://scalenpm.org!
2013.11.20, Version 0.11.9 (Unstable)
uv: upgrade to v0.11.15 (Timothy J Fontaine)
v8: upgrade to 220.127.116.11 (Timothy J Fontaine)
buffer: remove warning when no encoding is passed (Trevor Norris)
build: make v8 use random seed for hash tables (Ben Noordhuis)
crypto: build with shared openssl without NPN (Ben Noordhuis)
crypto: update root certificates (Ben Noordhuis)
debugger: pass on v8 debug switches (Ben Noordhuis)
domain: use AsyncListener API (Trevor Norris)
fs: add recursive subdirectory support to fs.watch (Nick Simmons)
fs: make fs.watch() non-recursive by default (Ben Noordhuis)
http: cleanup freeSockets when socket destroyed (fengmk2)
http: force socket encoding to be null (isaacs)
http: make DELETE requests set
node: add AsyncListener support (Trevor Norris)
src: remove global HandleScope that hid memory leaks (Ben Noordhuis)
tls: add ECDH ciphers support (Erik Dubbelboer)
tls: do not default to 'localhost' servername (Fedor Indutny)
tls: more accurate wrapping of connecting socket (Fedor Indutny)
Source Code: http://nodejs.org/dist/v0.11.9/node-v0.11.9.tar.gz
Macintosh Installer (Universal): http://nodejs.org/dist/v0.11.9/node-v0.11.9.pkg
Windows Installer: http://nodejs.org/dist/v0.11.9/node-v0.11.9-x86.msi
Windows x64 Installer: http://nodejs.org/dist/v0.11.9/x64/node-v0.11.9-x64.msi
Windows x64 Files: http://nodejs.org/dist/v0.11.9/x64/
Linux 32-bit Binary: http://nodejs.org/dist/v0.11.9/node-v0.11.9-linux-x86.tar.gz
Linux 64-bit Binary: http://nodejs.org/dist/v0.11.9/node-v0.11.9-linux-x64.tar.gz
Solaris 32-bit Binary: http://nodejs.org/dist/v0.11.9/node-v0.11.9-sunos-x86.tar.gz
Solaris 64-bit Binary: http://nodejs.org/dist/v0.11.9/node-v0.11.9-sunos-x64.tar.gz
Other release files: http://nodejs.org/dist/v0.11.9/
3a1af0d716042d617e51a82b43b2f97542f6c03a node-v0.11.9-darwin-x64.tar.gz
580696c5b2f30c8394cdee265c4888ad67e03b89 node-v0.11.9-darwin-x86.tar.gz
585df0690254afc22b29d7dc5f12fcb50f1bb588 node-v0.11.9-linux-x64.tar.gz
16fb7e69b90b6b6ac84a9120202918f197d4c0c0 node-v0.11.9-linux-x86.tar.gz
bff51e2ab3752f4ae338adca14c2c453294e6017 node-v0.11.9-sunos-x64.tar.gz
9a7ada5c862174d6ce4b524d5816e26a36b763a8 node-v0.11.9-sunos-x86.tar.gz
3a4de2ae0dd8268592546b4fdcbd78f1cbc68118 node-v0.11.9-x86.msi
8c70de5cd39ecd33e6c02d4b0f6ca4010d21372e node-v0.11.9.pkg
b4fc0e38ccde4edae45db198f331499055d77ca2 node-v0.11.9.tar.gz
9a4b04f0d40251696fac9161567e97c228d4e57d node.exe
96bfa67a417599c96818461d6d27f50401a74a36 node.exp
5f56ef7c2204ea75916876f6ab9e641b312dff11 node.lib
2db6eb844b36d96a0e34370ad1d311a57facd3d6 node.pdb
4a3590db0e6131739661628632ae1b8d70e2247b pkgsrc/nodejs-ia32-0.11.9.tgz
3ce0291cf0972ac5a2c0543fd1672d8b20569891 pkgsrc/nodejs-x64-0.11.9.tgz
9b736ec896e6b1b5856730869471c7e736f6ce78 x64/node-v0.11.9-x64.msi
1ec3593262b6e281457748ff3f3f195cd682592d x64/node.exe
1cf9ba6d503d5f8b18bfc2c1554bce04eba8a536 x64/node.exp
0a3e79ceecd05add6dab97bb6d9f460a61adddbc x64/node.lib
2037b32bbcb14e24d10c6cb2abe128bc9a85a932 x64/node.pdb
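Each entry in the list above pairs a 40-hex-digit digest with a release file, which is consistent with SHA-1. Assuming that is the algorithm, a minimal sketch for checking a downloaded file against its published digest could look like this; the function names are illustrative and not part of any Node.js tooling.

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, reading it in chunks
    so that large installers do not have to fit in memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, expected_hex):
    """Compare a file's digest with the value copied from the list,
    ignoring surrounding whitespace and letter case."""
    return sha1_of_file(path) == expected_hex.strip().lower()
```

For example, after downloading the source tarball, `matches_published("node-v0.11.9.tar.gz", "b4fc0e38ccde4edae45db198f331499055d77ca2")` should return `True` if the file is intact and the digests are indeed SHA-1.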
2013.11.12, Version 0.10.22 (Stable)
npm: Upgrade to 1.3.14
uv: Upgrade to v0.10.19
child_process: don't assert on stale file descriptor events (Fedor Indutny)
darwin: Fix "Not Responding" in Mavericks activity monitor (Fedor Indutny)
debugger: Fix bug in sb() with unnamed script (Maxim Bogushevich)
repl: do not insert duplicates into completions (Maciej Małecki)
src: Fix memory leak on closed handles (Timothy J Fontaine)
tls: prevent stalls by using read(0) (Fedor Indutny)
v8: use correct timezone information on Solaris (Maciej Małecki)
Source Code: http://nodejs.org/dist/v0.10.22/node-v0.10.22.tar.gz
Macintosh Installer (Universal): http://nodejs.org/dist/v0.10.22/node-v0.10.22.pkg
Windows Installer: http://nodejs.org/dist/v0.10.22/node-v0.10.22-x86.msi
Windows x64 Installer: http://nodejs.org/dist/v0.10.22/x64/node-v0.10.22-x64.msi
Windows x64 Files: http://nodejs.org/dist/v0.10.22/x64/
Linux 32-bit Binary: http://nodejs.org/dist/v0.10.22/node-v0.10.22-linux-x86.tar.gz
Linux 64-bit Binary: http://nodejs.org/dist/v0.10.22/node-v0.10.22-linux-x64.tar.gz
Solaris 32-bit Binary: http://nodejs.org/dist/v0.10.22/node-v0.10.22-sunos-x86.tar.gz
Solaris 64-bit Binary: http://nodejs.org/dist/v0.10.22/node-v0.10.22-sunos-x64.tar.gz
Other release files: http://nodejs.org/dist/v0.10.22/
3082a8d13dfafa7212a7f75bd0a83447fb4d7b99 node-v0.10.22-darwin-x64.tar.gz
dca37fa37c8ce3c0df68e74643ed822bec7a12b3 node-v0.10.22-darwin-x86.tar.gz
3739f75bbb85c920a237ceb1c34cb872409d61f7 node-v0.10.22-linux-x64.tar.gz
7e99b654c21bc2a5cbccc33f1bae3ce6e26b3d12 node-v0.10.22-linux-x86.tar.gz
3dfb3585386ca0645ba02b5ad06014ddccda8cbe node-v0.10.22-sunos-x64.tar.gz
e6004f073fc81826335dc0c8fba04a82beada0bc node-v0.10.22-sunos-x86.tar.gz
3beff0c7893e39df54e416307b624eb642bffa62 node-v0.10.22-x86.msi
b4433b98f87f3f06130adad410e2fb5f959bbf37 node-v0.10.22.pkg
d7c6a39dfa714eae1f8da7a00c9a07efd74a03b3 node-v0.10.22.tar.gz
0ff278f5d6225d2be2a51bd4c7ba8fa0d15e98a4 node.exe
6cded62495794c53f6642745d34cbeb7a28266b1 node.exp
caaa11790ac8ec40d074e141afa7ffa611f216b4 node.lib
3c7592832d403c93a17b29852f2c828760a45128 node.pdb
f335aef2844a6bf9d8d5a9782e7c631d730acc2e pkgsrc/nodejs-ia32-0.10.22.tgz
6d47f98efd86faa71e1e9887aa63916e884bb2a8 pkgsrc/nodejs-x64-0.10.22.tgz
c3c169304c6371ee7bd119151bcbced61a322394 x64/node-v0.10.22-x64.msi
307de602a091fa2af3adaa64812200e32ee00fdc x64/node.exe
67440fca57eb4be5800434245ef1a5d16f5aea01 x64/node.exp
e6ee29859cd069ff5b8bf749a598112d9f09ed3c x64/node.lib
fee98420155b88c0c4b11616aa416d2328cec97d x64/node.pdb
2013.10.30, Version 0.11.8 (Unstable)
uv: Upgrade to v0.11.14
v8: upgrade 18.104.22.168
assert: indicate if exception message is generated (Glen Mailer)
buffer: add buf.toArrayBuffer() API (Trevor Norris)
cluster: fix premature 'disconnect' event (Ben Noordhuis)
crypto: add SPKAC support (Jason Gerfen)
debugger: count space for line numbers correctly (Alex Kocharin)
debugger: make busy loops SIGUSR1-interruptible (Ben Noordhuis)
debugger: repeat last command (Alex Kocharin)
debugger: show current line, fix for #6150 (Alex Kocharin)
dgram: send() can accept strings (Trevor Norris)
dns: rename domain to hostname (Ben Noordhuis)
dns: set hostname property on error object (Ben Noordhuis)
dtrace, mdb_v8: support more string, frame types (Dave Pacheco)
http: add statusMessage (Patrik Stutz)
http: expose supported methods (Ben Noordhuis)
http: provide backpressure for pipeline flood (isaacs)
process: Add exitCode property (isaacs)
tls: socket.renegotiate(options, callback) (Fedor Indutny)
util: format as Error if instanceof Error (Rod Vagg)
Source Code: http://nodejs.org/dist/v0.11.8/node-v0.11.8.tar.gz
Macintosh Installer (Universal): http://nodejs.org/dist/v0.11.8/node-v0.11.8.pkg
Windows Installer: http://nodejs.org/dist/v0.11.8/node-v0.11.8-x86.msi
Windows x64 Installer: http://nodejs.org/dist/v0.11.8/x64/node-v0.11.8-x64.msi
Windows x64 Files: http://nodejs.org/dist/v0.11.8/x64/
Linux 32-bit Binary: http://nodejs.org/dist/v0.11.8/node-v0.11.8-linux-x86.tar.gz
Linux 64-bit Binary: http://nodejs.org/dist/v0.11.8/node-v0.11.8-linux-x64.tar.gz
Solaris 32-bit Binary: http://nodejs.org/dist/v0.11.8/node-v0.11.8-sunos-x86.tar.gz
Solaris 64-bit Binary: http://nodejs.org/dist/v0.11.8/node-v0.11.8-sunos-x64.tar.gz
Other release files: http://nodejs.org/dist/v0.11.8/
1911bc1407fd116318edaa0cfd01bd664b2b352c node-v0.11.8-darwin-x64.tar.gz
bac43c31e257e9f2deffb08c4154f522d5925825 node-v0.11.8-darwin-x86.tar.gz
1b2dac1788f3aad51ec643854ae57771792e6647 node-v0.11.8-linux-x64.tar.gz
1f674dd1ac15561dbf99ecf80d00e2cfcdc1a23b node-v0.11.8-linux-x86.tar.gz
51d29f3624b18e75cf5736eedd62a55931551251 node-v0.11.8-sunos-x64.tar.gz
b995b05a3b14373c61faf4cd5c05157e06f410c8 node-v0.11.8-sunos-x86.tar.gz
5f6fd1f68d9f61c889c7a0148a6bfbb681a119b5 node-v0.11.8-x86.msi
95097ea074fa1b20c3bd46eae33a24935842149b node-v0.11.8.pkg
21d3927c78adaaf3fe7cc9602ffb0a85de7f6ea0 node-v0.11.8.tar.gz
f735cf8b6404428087ba759dc21818b4d968e2ba node.exe
c632e716ac2b303a4e2f3e0c81819b4020c9e0df node.exp
dea16a4911693689c3981e19ae2fa77ea2884797 node.lib
0a5bfce12045512b1f4a0341d1381459e9731321 node.pdb
25b8d468c1ef53332834a46aaae0ee1820771871 pkgsrc/nodejs-ia32-0.11.8.tgz
fb16a45a0a467aa7661048a3d00d4e81c35bbf56 pkgsrc/nodejs-x64-0.11.8.tgz
b4b2c453404f5aa0d37fbce5d55ac1e030f3e7cc x64/node-v0.11.8-x64.msi
799da7eb400d91b7eec157d25da0e138630f27e4 x64/node.exe
6482cce41d8a98ba55daaccc581929df018f2edf x64/node.exp
7e2bb85b6ca45c4df487b9cca7d420e87170b272 x64/node.lib
1aa3a1f9d767e81dbdd1af1d13f221830c467d68 x64/node.pdb