Tue, 26 Nov 2013 15:14:59 UTC - Charlie Robbins - npm

We know the availability and overall health of The npm Registry is paramount to everyone using Node.js, as well as to the larger JavaScript community and those of you using it for some awesome projects and ideas. Between November 4th and November 15th, 2013, The npm Registry had several hours of downtime over three distinct time periods:

  1. November 4th -- 16:30 to 15:00 UTC
  2. November 13th -- 15:00 to 19:30 UTC
  3. November 15th -- 15:30 to 18:00 UTC

The root cause of this downtime was insufficient resources: both hardware and human. This is a full post-mortem where we will look at how npmjs.org works, what went wrong, how we changed the previous architecture of The npm Registry to fix it, as well as the next steps we are taking to prevent this from happening again.

All of the next steps require additional expenditure from Nodejitsu: both servers and labor. This is why along with this post-mortem we are announcing our crowdfunding campaign: scalenpm.org! Our goal is to raise enough funds so that Nodejitsu can continue to run The npm Registry as a free service for you, the community.

Please take a minute now to donate at https://scalenpm.org!

How does npmjs.org work?

There are two distinct components that make up npmjs.org operated by different people:

  • http://registry.npmjs.org: The main CouchApp (GitHub: isaacs/npmjs.org) that stores both package tarballs and metadata. It has been operated by Nodejitsu since we acquired IrisCouch in May. The primary system administrator is Jason Smith, the current CTO at Nodejitsu, cofounder of IrisCouch, and the System Administrator of registry.npmjs.org since 2011.
  • http://npmjs.org: The npmjs website that you interact with using a web browser. It is a Node.js program (GitHub: isaacs/npm-www) maintained and operated by Isaac and running on a Joyent Public Cloud SmartMachine.

Here is a high-level summary of the old architecture:

Diagram 1. Old npm architecture

What went wrong and how was it fixed?

As illustrated above, before November 13th, 2013, npm operated as a single CouchDB server with regular daily backups. We briefly ran a multi-master CouchDB setup after downtime back in August, but after reports that npm login no longer worked correctly we rolled back to a single CouchDB server. On both November 13th and November 15th CouchDB became unresponsive to requests to the /registry database, while requests to all other databases (e.g. /public_users) remained responsive. Although the root cause of the CouchDB failures has yet to be determined, given that only requests to /registry were slow and/or timed out, we suspect it is related to the massive number of attachments stored in the registry.

The incident on November 4th was ultimately resolved by a reboot and resize of the host machine, but when the same symptoms recurred less than 10 days later, additional steps were taken:

  1. The registry was moved to another machine of equal resources to exclude the possibility of a hardware issue.
  2. The registry database itself was compacted.

When neither of these yielded a solution, Jason Smith and I decided to move to a multi-master architecture with continuous replication, illustrated below:

Diagram 2. Current npm architecture -- red lines denote continuous replication
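
As a rough illustration of what "multi-master with continuous replication" can mean in CouchDB terms, the sketch below builds the two request bodies one might POST to CouchDB's `/_replicate` endpoint, one per direction. It is not the configuration Nodejitsu actually used: the function name `replicationPair` and the host names are hypothetical, and credentials are omitted.

```javascript
// Build the request bodies for bidirectional continuous replication
// between two CouchDB servers. Each body would be sent as JSON in a
// POST to /_replicate. Host names passed in are assumptions.
function replicationPair(masterA, masterB, db) {
  // Strip trailing slashes so the database URL is well formed.
  const a = masterA.replace(/\/+$/, '') + '/' + db;
  const b = masterB.replace(/\/+$/, '') + '/' + db;
  return [
    { source: a, target: b, continuous: true }, // A -> B
    { source: b, target: a, continuous: true }  // B -> A
  ];
}

// Hypothetical usage:
// replicationPair('http://master-a.example.com:5984',
//                 'http://master-b.example.com:5984', 'registry');
```

With both directions running continuously, a write accepted by either master is expected to reach the other without manual intervention.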

This should have been the end of our story, but unfortunately our supervision logic failed to restart the secondary master on the morning of November 15th. During this time we moved briefly back to a single-master architecture. Since then the secondary master has been closely monitored by the entire Nodejitsu operations team to ensure its continued stability.

What is being done to prevent future incidents?

The public npm registry simply cannot go down. Ever. We gained a lot of operational knowledge about The npm Registry and about CouchDB as a result of these outages. This new knowledge has made clear several steps that we need to take to prevent future downtime:

  1. Always be in multi-master: The multi-master CouchDB architecture we have set up will scale to more than just two CouchDB servers. As npm grows we'll be able to add additional capacity!
  2. Decouple www.npmjs.org and registry.npmjs.org: Right now www.npmjs.org still depends directly on registry.npmjs.org. We are planning to add an additional replica to the current npm architecture so that Isaac can more easily service requests to www.npmjs.org. That means it won't go down if the registry goes down.
  3. Always have a spare replica: We need to have a hot spare replica running continuous replication from either master, ready to swap in when necessary. This is also important as we need to regularly run compaction on each master since the registry is growing ~10GB per week on disk.
  4. Move attachments out of CouchDB: Work has begun to move the package tarballs out of CouchDB and into Joyent's Manta service. Additionally, MaxCDN has generously offered to provide CDN services for npm, once the tarballs are moved out of the registry database. This will help improve delivery speed, while dramatically reducing the file system I/O load on the CouchDB servers. Work is progressing slowly, because at each stage in the plan, we are making sure that current replication users are minimally impacted.
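
The regular compaction mentioned in step 3 could be scripted along these lines. This is only a sketch built on CouchDB's documented `POST /{db}/_compact` endpoint. The helper names `compactionRequest` and `compact` are hypothetical, and a real deployment would also need admin credentials, which are left out here.

```javascript
const http = require('http');

// Build http.request() options for CouchDB's compaction endpoint.
// CouchDB expects a JSON content type on this POST.
function compactionRequest(host, port, db) {
  return {
    method: 'POST',
    host: host,
    port: port,
    path: '/' + encodeURIComponent(db) + '/_compact',
    headers: { 'Content-Type': 'application/json' }
  };
}

// Send the request; the callback receives (error, statusCode).
function compact(host, port, db, callback) {
  const req = http.request(compactionRequest(host, port, db), function (res) {
    res.resume(); // discard the response body
    res.on('end', function () { callback(null, res.statusCode); });
  });
  req.on('error', callback);
  req.end();
}

// Hypothetical usage against a master that has been swapped out:
// compact('master-a.example.com', 5984, 'registry', console.log);
```

Running this against a master while the hot spare serves its traffic is one way to keep compaction from competing with live requests.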

When these new infrastructure components are in place, The npm Registry will look like this:

Diagram 3. Planned npm architecture -- red lines denote continuous replication

You are npm! And we need your help!

The npm Registry has had a 10x year. In November 2012 there were 13.5 million downloads. In October 2013 there were 114.6 million package downloads. We're honored to have been a part of sustaining this growth for the community and we want to see it continue to grow to a billion package downloads a month and beyond.

But we need your help! All of these necessary improvements require more servers, more time from Nodejitsu staff and an overall increase to what we spend maintaining the public npm registry as a free service for the Node.js community.

Please take a minute now to donate at https://scalenpm.org!

Thu, 21 Nov 2013 00:40:35 UTC - release

2013.11.20, Version 0.11.9 (Unstable)

  • uv: upgrade to v0.11.15 (Timothy J Fontaine)

  • v8: upgrade to 3.22.24.5 (Timothy J Fontaine)

  • buffer: remove warning when no encoding is passed (Trevor Norris)

  • build: make v8 use random seed for hash tables (Ben Noordhuis)

  • crypto: build with shared openssl without NPN (Ben Noordhuis)

  • crypto: update root certificates (Ben Noordhuis)

  • debugger: pass on v8 debug switches (Ben Noordhuis)

  • domain: use AsyncListener API (Trevor Norris)

  • fs: add recursive subdirectory support to fs.watch (Nick Simmons)

  • fs: make fs.watch() non-recursive by default (Ben Noordhuis)

  • http: cleanup freeSockets when socket destroyed (fengmk2)

  • http: force socket encoding to be null (isaacs)

  • http: make DELETE requests set req.method (Nathan Rajlich)

  • node: add AsyncListener support (Trevor Norris)

  • src: remove global HandleScope that hid memory leaks (Ben Noordhuis)

  • tls: add ECDH ciphers support (Erik Dubbelboer)

  • tls: do not default to 'localhost' servername (Fedor Indutny)

  • tls: more accurate wrapping of connecting socket (Fedor Indutny)

Source Code: http://nodejs.org/dist/v0.11.9/node-v0.11.9.tar.gz

Macintosh Installer (Universal): http://nodejs.org/dist/v0.11.9/node-v0.11.9.pkg

Windows Installer: http://nodejs.org/dist/v0.11.9/node-v0.11.9-x86.msi

Windows x64 Installer: http://nodejs.org/dist/v0.11.9/x64/node-v0.11.9-x64.msi

Windows x64 Files: http://nodejs.org/dist/v0.11.9/x64/

Linux 32-bit Binary: http://nodejs.org/dist/v0.11.9/node-v0.11.9-linux-x86.tar.gz

Linux 64-bit Binary: http://nodejs.org/dist/v0.11.9/node-v0.11.9-linux-x64.tar.gz

Solaris 32-bit Binary: http://nodejs.org/dist/v0.11.9/node-v0.11.9-sunos-x86.tar.gz

Solaris 64-bit Binary: http://nodejs.org/dist/v0.11.9/node-v0.11.9-sunos-x64.tar.gz

Other release files: http://nodejs.org/dist/v0.11.9/

Website: http://nodejs.org/docs/v0.11.9/

Documentation: http://nodejs.org/docs/v0.11.9/api/

Shasums:

3a1af0d716042d617e51a82b43b2f97542f6c03a  node-v0.11.9-darwin-x64.tar.gz
580696c5b2f30c8394cdee265c4888ad67e03b89  node-v0.11.9-darwin-x86.tar.gz
585df0690254afc22b29d7dc5f12fcb50f1bb588  node-v0.11.9-linux-x64.tar.gz
16fb7e69b90b6b6ac84a9120202918f197d4c0c0  node-v0.11.9-linux-x86.tar.gz
bff51e2ab3752f4ae338adca14c2c453294e6017  node-v0.11.9-sunos-x64.tar.gz
9a7ada5c862174d6ce4b524d5816e26a36b763a8  node-v0.11.9-sunos-x86.tar.gz
3a4de2ae0dd8268592546b4fdcbd78f1cbc68118  node-v0.11.9-x86.msi
8c70de5cd39ecd33e6c02d4b0f6ca4010d21372e  node-v0.11.9.pkg
b4fc0e38ccde4edae45db198f331499055d77ca2  node-v0.11.9.tar.gz
9a4b04f0d40251696fac9161567e97c228d4e57d  node.exe
96bfa67a417599c96818461d6d27f50401a74a36  node.exp
5f56ef7c2204ea75916876f6ab9e641b312dff11  node.lib
2db6eb844b36d96a0e34370ad1d311a57facd3d6  node.pdb
4a3590db0e6131739661628632ae1b8d70e2247b  pkgsrc/nodejs-ia32-0.11.9.tgz
3ce0291cf0972ac5a2c0543fd1672d8b20569891  pkgsrc/nodejs-x64-0.11.9.tgz
9b736ec896e6b1b5856730869471c7e736f6ce78  x64/node-v0.11.9-x64.msi
1ec3593262b6e281457748ff3f3f195cd682592d  x64/node.exe
1cf9ba6d503d5f8b18bfc2c1554bce04eba8a536  x64/node.exp
0a3e79ceecd05add6dab97bb6d9f460a61adddbc  x64/node.lib
2037b32bbcb14e24d10c6cb2abe128bc9a85a932  x64/node.pdb
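
A downloaded file can be checked against a list like the one above with a few lines of Node.js. The sketch below assumes the digests are SHA-1, which their 40-hex-character length suggests but the list does not state. The helper names `parseShasums`, `sha1Hex` and `verifyFile` are made up for illustration.

```javascript
const crypto = require('crypto');
const fs = require('fs');

// Parse "<40 hex chars>  <file name>" lines into a name -> digest map.
function parseShasums(text) {
  const sums = Object.create(null);
  text.split('\n').forEach(function (line) {
    const m = /^([0-9a-f]{40})\s+(\S.*)$/.exec(line.trim());
    if (m) sums[m[2]] = m[1];
  });
  return sums;
}

// Hex-encoded SHA-1 digest of a string or Buffer.
function sha1Hex(data) {
  return crypto.createHash('sha1').update(data).digest('hex');
}

// Compare a local file against the digest listed under `name`.
function verifyFile(path, name, sums) {
  const expected = sums[name];
  if (!expected) throw new Error('no checksum listed for ' + name);
  return sha1Hex(fs.readFileSync(path)) === expected;
}

// Hypothetical usage, with the list saved locally as SHASUMS.txt:
// const sums = parseShasums(fs.readFileSync('SHASUMS.txt', 'utf8'));
// verifyFile('node-v0.11.9.tar.gz', 'node-v0.11.9.tar.gz', sums);
```

Reading the whole file into memory keeps the sketch short; for large installers a streaming hash would be the more careful choice.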

Tue, 12 Nov 2013 20:52:56 UTC - release

2013.11.12, Version 0.10.22 (Stable)

  • npm: Upgrade to 1.3.14

  • uv: Upgrade to v0.10.19

  • child_process: don't assert on stale file descriptor events (Fedor Indutny)

  • darwin: Fix "Not Responding" in Mavericks activity monitor (Fedor Indutny)

  • debugger: Fix bug in sb() with unnamed script (Maxim Bogushevich)

  • repl: do not insert duplicates into completions (Maciej Małecki)

  • src: Fix memory leak on closed handles (Timothy J Fontaine)

  • tls: prevent stalls by using read(0) (Fedor Indutny)

  • v8: use correct timezone information on Solaris (Maciej Małecki)

Source Code: http://nodejs.org/dist/v0.10.22/node-v0.10.22.tar.gz

Macintosh Installer (Universal): http://nodejs.org/dist/v0.10.22/node-v0.10.22.pkg

Windows Installer: http://nodejs.org/dist/v0.10.22/node-v0.10.22-x86.msi

Windows x64 Installer: http://nodejs.org/dist/v0.10.22/x64/node-v0.10.22-x64.msi

Windows x64 Files: http://nodejs.org/dist/v0.10.22/x64/

Linux 32-bit Binary: http://nodejs.org/dist/v0.10.22/node-v0.10.22-linux-x86.tar.gz

Linux 64-bit Binary: http://nodejs.org/dist/v0.10.22/node-v0.10.22-linux-x64.tar.gz

Solaris 32-bit Binary: http://nodejs.org/dist/v0.10.22/node-v0.10.22-sunos-x86.tar.gz

Solaris 64-bit Binary: http://nodejs.org/dist/v0.10.22/node-v0.10.22-sunos-x64.tar.gz

Other release files: http://nodejs.org/dist/v0.10.22/

Website: http://nodejs.org/docs/v0.10.22/

Documentation: http://nodejs.org/docs/v0.10.22/api/

Shasums:

3082a8d13dfafa7212a7f75bd0a83447fb4d7b99  node-v0.10.22-darwin-x64.tar.gz
dca37fa37c8ce3c0df68e74643ed822bec7a12b3  node-v0.10.22-darwin-x86.tar.gz
3739f75bbb85c920a237ceb1c34cb872409d61f7  node-v0.10.22-linux-x64.tar.gz
7e99b654c21bc2a5cbccc33f1bae3ce6e26b3d12  node-v0.10.22-linux-x86.tar.gz
3dfb3585386ca0645ba02b5ad06014ddccda8cbe  node-v0.10.22-sunos-x64.tar.gz
e6004f073fc81826335dc0c8fba04a82beada0bc  node-v0.10.22-sunos-x86.tar.gz
3beff0c7893e39df54e416307b624eb642bffa62  node-v0.10.22-x86.msi
b4433b98f87f3f06130adad410e2fb5f959bbf37  node-v0.10.22.pkg
d7c6a39dfa714eae1f8da7a00c9a07efd74a03b3  node-v0.10.22.tar.gz
0ff278f5d6225d2be2a51bd4c7ba8fa0d15e98a4  node.exe
6cded62495794c53f6642745d34cbeb7a28266b1  node.exp
caaa11790ac8ec40d074e141afa7ffa611f216b4  node.lib
3c7592832d403c93a17b29852f2c828760a45128  node.pdb
f335aef2844a6bf9d8d5a9782e7c631d730acc2e  pkgsrc/nodejs-ia32-0.10.22.tgz
6d47f98efd86faa71e1e9887aa63916e884bb2a8  pkgsrc/nodejs-x64-0.10.22.tgz
c3c169304c6371ee7bd119151bcbced61a322394  x64/node-v0.10.22-x64.msi
307de602a091fa2af3adaa64812200e32ee00fdc  x64/node.exe
67440fca57eb4be5800434245ef1a5d16f5aea01  x64/node.exp
e6ee29859cd069ff5b8bf749a598112d9f09ed3c  x64/node.lib
fee98420155b88c0c4b11616aa416d2328cec97d  x64/node.pdb

Wed, 30 Oct 2013 15:54:47 UTC - release

2013.10.30, Version 0.11.8 (Unstable)

  • uv: Upgrade to v0.11.14

  • v8: upgrade to 3.21.18.3

  • assert: indicate if exception message is generated (Glen Mailer)

  • buffer: add buf.toArrayBuffer() API (Trevor Norris)

  • cluster: fix premature 'disconnect' event (Ben Noordhuis)

  • crypto: add SPKAC support (Jason Gerfen)

  • debugger: count space for line numbers correctly (Alex Kocharin)

  • debugger: make busy loops SIGUSR1-interruptible (Ben Noordhuis)

  • debugger: repeat last command (Alex Kocharin)

  • debugger: show current line, fix for #6150 (Alex Kocharin)

  • dgram: send() can accept strings (Trevor Norris)

  • dns: rename domain to hostname (Ben Noordhuis)

  • dns: set hostname property on error object (Ben Noordhuis)

  • dtrace, mdb_v8: support more string, frame types (Dave Pacheco)

  • http: add statusMessage (Patrik Stutz)

  • http: expose supported methods (Ben Noordhuis)

  • http: provide backpressure for pipeline flood (isaacs)

  • process: Add exitCode property (isaacs)

  • tls: socket.renegotiate(options, callback) (Fedor Indutny)

  • util: format as Error if instanceof Error (Rod Vagg)
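
Three of the entries above add small API surface that is easy to try out. The changelog lines do not spell out the names; the sketch below uses `http.METHODS`, `response.statusMessage` and `process.exitCode` as they appear in the Node.js API documentation. The helper functions themselves are hypothetical.

```javascript
const http = require('http');

// "http: expose supported methods": http.METHODS lists the request
// methods the bundled HTTP parser understands.
function isSupportedMethod(name) {
  return http.METHODS.indexOf(String(name).toUpperCase()) !== -1;
}

// "http: add statusMessage": a handler can set the reason phrase
// sent alongside the status code.
function teapot(req, res) {
  res.statusCode = 418;
  res.statusMessage = "I'm a teapot";
  res.end();
}

// "process: Add exitCode property": record the outcome and let the
// process exit naturally instead of calling process.exit().
function finish(ok) {
  process.exitCode = ok ? 0 : 1;
}

// Hypothetical usage:
// http.createServer(teapot).listen(8080);
// finish(isSupportedMethod('DELETE'));
```

Setting `process.exitCode` rather than calling `process.exit()` lets pending I/O drain before the process ends.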

Source Code: http://nodejs.org/dist/v0.11.8/node-v0.11.8.tar.gz

Macintosh Installer (Universal): http://nodejs.org/dist/v0.11.8/node-v0.11.8.pkg

Windows Installer: http://nodejs.org/dist/v0.11.8/node-v0.11.8-x86.msi

Windows x64 Installer: http://nodejs.org/dist/v0.11.8/x64/node-v0.11.8-x64.msi

Windows x64 Files: http://nodejs.org/dist/v0.11.8/x64/

Linux 32-bit Binary: http://nodejs.org/dist/v0.11.8/node-v0.11.8-linux-x86.tar.gz

Linux 64-bit Binary: http://nodejs.org/dist/v0.11.8/node-v0.11.8-linux-x64.tar.gz

Solaris 32-bit Binary: http://nodejs.org/dist/v0.11.8/node-v0.11.8-sunos-x86.tar.gz

Solaris 64-bit Binary: http://nodejs.org/dist/v0.11.8/node-v0.11.8-sunos-x64.tar.gz

Other release files: http://nodejs.org/dist/v0.11.8/

Website: http://nodejs.org/docs/v0.11.8/

Documentation: http://nodejs.org/docs/v0.11.8/api/

Shasums:

1911bc1407fd116318edaa0cfd01bd664b2b352c  node-v0.11.8-darwin-x64.tar.gz
bac43c31e257e9f2deffb08c4154f522d5925825  node-v0.11.8-darwin-x86.tar.gz
1b2dac1788f3aad51ec643854ae57771792e6647  node-v0.11.8-linux-x64.tar.gz
1f674dd1ac15561dbf99ecf80d00e2cfcdc1a23b  node-v0.11.8-linux-x86.tar.gz
51d29f3624b18e75cf5736eedd62a55931551251  node-v0.11.8-sunos-x64.tar.gz
b995b05a3b14373c61faf4cd5c05157e06f410c8  node-v0.11.8-sunos-x86.tar.gz
5f6fd1f68d9f61c889c7a0148a6bfbb681a119b5  node-v0.11.8-x86.msi
95097ea074fa1b20c3bd46eae33a24935842149b  node-v0.11.8.pkg
21d3927c78adaaf3fe7cc9602ffb0a85de7f6ea0  node-v0.11.8.tar.gz
f735cf8b6404428087ba759dc21818b4d968e2ba  node.exe
c632e716ac2b303a4e2f3e0c81819b4020c9e0df  node.exp
dea16a4911693689c3981e19ae2fa77ea2884797  node.lib
0a5bfce12045512b1f4a0341d1381459e9731321  node.pdb
25b8d468c1ef53332834a46aaae0ee1820771871  pkgsrc/nodejs-ia32-0.11.8.tgz
fb16a45a0a467aa7661048a3d00d4e81c35bbf56  pkgsrc/nodejs-x64-0.11.8.tgz
b4b2c453404f5aa0d37fbce5d55ac1e030f3e7cc  x64/node-v0.11.8-x64.msi
799da7eb400d91b7eec157d25da0e138630f27e4  x64/node.exe
6482cce41d8a98ba55daaccc581929df018f2edf  x64/node.exp
7e2bb85b6ca45c4df487b9cca7d420e87170b272  x64/node.lib
1aa3a1f9d767e81dbdd1af1d13f221830c467d68  x64/node.pdb
