As of this past weekend, Ben Noordhuis has decided to step away from Node.js and libuv, and is no longer acting as a core committer.
Ben has done a tremendous amount of great work in the past. We're sad to lose the benefit of his continued hard work and expertise, and extremely grateful for what he has added to Node.js and libuv over the years.
Many of you have already expressed your opinions regarding the recent drama, and I'd like to ask that you please respect our wishes to let this issue rest, so that we can all focus on the road forward.
- November 4th -- 16:30 to 15:00 UTC
- November 13th -- 15:00 to 19:30 UTC
- November 15th -- 15:30 to 18:00 UTC
The root cause of this downtime was insufficient resources, both hardware and human. This is a full post-mortem in which we will look at how npmjs.org works, what went wrong, how we changed the previous architecture of The npm Registry to fix it, and the next steps we are taking to prevent this from happening again.
All of the next steps require additional expenditure from Nodejitsu: both servers and labor. This is why along with this post-mortem we are announcing our crowdfunding campaign: scalenpm.org! Our goal is to raise enough funds so that Nodejitsu can continue to run The npm Registry as a free service for you, the community.
Please take a minute now to donate at https://scalenpm.org!
How does npmjs.org work?
There are two distinct components that make up npmjs.org operated by different people:
- http://registry.npmjs.org: The main CouchApp (GitHub: isaacs/npmjs.org) that stores both package tarballs and metadata. It has been operated by Nodejitsu since we acquired IrisCouch in May. The primary system administrator is Jason Smith, the current CTO at Nodejitsu, cofounder of IrisCouch, and the System Administrator of registry.npmjs.org since 2011.
- http://npmjs.org: The npmjs website that you interact with using a web browser. It is a Node.js program (GitHub: isaacs/npm-www) maintained and operated by Isaac and running on a Joyent Public Cloud SmartMachine.
Here is a high-level summary of the old architecture:
What went wrong and how was it fixed?
As illustrated above, before November 13th, 2013, npm operated as a single CouchDB server with regular daily backups. We briefly ran a multi-master CouchDB setup after downtime back in August, but after reports that npm login no longer worked correctly we rolled back to a single CouchDB server. On both November 13th and November 15th CouchDB became unresponsive on requests to the /registry database while requests to all other databases (e.g. /public_users) remained responsive. Although the root cause of the CouchDB failures has yet to be determined, given that only requests to /registry were slow and/or timed out, we suspect it is related to the massive number of attachments stored in the registry.
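The symptom described above (one database timing out while the others stay responsive) is the kind of thing a small per-database probe can surface. The following is only a minimal sketch, not the monitoring Nodejitsu actually ran: the function name, the injectable `fetch` parameter, and the `localhost:5984` address are assumptions made for illustration. It relies on the fact that a GET on a CouchDB database URL returns that database's info document.

```python
import time
import urllib.request


def probe_databases(base_url, databases, fetch=None, timeout=5.0):
    """Return {database: seconds or None}; None means the request failed or timed out."""
    if fetch is None:
        # Default fetcher: a plain HTTP GET on the database URL.
        def fetch(url):
            with urllib.request.urlopen(url, timeout=timeout) as response:
                response.read()

    results = {}
    for name in databases:
        url = base_url.rstrip("/") + "/" + name
        started = time.monotonic()
        try:
            fetch(url)
        except OSError:
            # URLError and socket timeouts are both OSError subclasses.
            results[name] = None
        else:
            results[name] = time.monotonic() - started
    return results


if __name__ == "__main__":
    # Hypothetical local CouchDB address, for illustration only.
    print(probe_databases("http://localhost:5984", ["registry", "public_users"]))
```

Passing a custom `fetch` makes it possible to exercise the logic without a running server.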
The incident on November 4th was ultimately resolved by a reboot and resize of the host machine, but when the same symptoms recurred less than 10 days later, additional steps were taken:
- The registry was moved to another machine of equal resources to exclude the possibility of a hardware issue.
- The registry database itself was compacted.
When neither of these yielded a solution, Jason Smith and I decided to move to a multi-master architecture with continuous replication, illustrated below:
This should have been the end of our story, but unfortunately our supervision logic did not function properly to restart the secondary master on the morning of November 15th. During this time we moved briefly back to a single master architecture. Since then the secondary master has been closely monitored by the entire Nodejitsu operations team to ensure its continued stability.
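The post does not say how the continuous replication between the two masters was configured. One common way to express it in CouchDB is a pair of documents in the `_replicator` database, one per direction, each with `"continuous": true`. The sketch below builds those two documents; the host names and document IDs are hypothetical, and nothing here is taken from the actual registry configuration.

```python
import json


def replication_docs(master_a, master_b, database):
    """Build one continuous replication document per direction."""
    def doc(doc_id, source, target):
        return {
            "_id": doc_id,
            "source": source.rstrip("/") + "/" + database,
            "target": target.rstrip("/") + "/" + database,
            "continuous": True,
        }

    return [
        doc("a-to-b", master_a, master_b),
        doc("b-to-a", master_b, master_a),
    ]


if __name__ == "__main__":
    # Hypothetical host names, for illustration only.
    docs = replication_docs(
        "http://couch-a.example:5984",
        "http://couch-b.example:5984",
        "registry",
    )
    for d in docs:
        print(json.dumps(d, indent=2))
```

Each document would then be saved into the `_replicator` database of one of the servers, so that writes accepted by either master flow to the other.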
What is being done to prevent future incidents?
The public npm registry simply cannot go down. Ever. We gained a lot of operational knowledge about The npm Registry and about CouchDB as a result of these outages. This new knowledge has made clear several steps that we need to take to prevent future downtime:
- Always be in multi-master: The multi-master CouchDB architecture we have set up will scale to more than just two CouchDB servers. As npm grows we'll be able to add additional capacity!
- Decouple www.npmjs.org and registry.npmjs.org: Right now www.npmjs.org still depends directly on registry.npmjs.org. We are planning to add an additional replica to the current npm architecture so that Isaac can more easily service requests to www.npmjs.org. That means it won't go down if the registry goes down.
- Always have a spare replica: We need to have a hot spare replica running continuous replication from either master, ready to swap in when necessary. This is also important as we need to regularly run compaction on each master since the registry is growing ~10GB per week on disk.
- Move attachments out of CouchDB: Work has begun to move the package tarballs out of CouchDB and into Joyent's Manta service. Additionally, MaxCDN has generously offered to provide CDN services for npm, once the tarballs are moved out of the registry database. This will help improve delivery speed, while dramatically reducing the file system I/O load on the CouchDB servers. Work is progressing slowly, because at each stage in the plan, we are making sure that current replication users are minimally impacted.
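To make the compaction step in the list above concrete: CouchDB starts compaction of a database when it receives a POST to that database's `_compact` endpoint with a JSON content type, and the request needs admin credentials. The sketch below only builds that request and adds a rough disk-headroom estimate based on the ~10GB per week figure. The server address and the 500 GB of free space are hypothetical values chosen for illustration.

```python
import urllib.request


def compaction_request(server_url, database):
    """Build (but do not send) the request that asks CouchDB to compact a database."""
    url = server_url.rstrip("/") + "/" + database + "/_compact"
    return urllib.request.Request(
        url,
        data=b"",
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def weeks_until_full(free_gb, growth_gb_per_week=10.0):
    """Rough headroom estimate, assuming growth stays linear."""
    return free_gb / growth_gb_per_week


if __name__ == "__main__":
    # Hypothetical server address and free space, for illustration only.
    request = compaction_request("http://localhost:5984", "registry")
    print(request.get_method(), request.full_url)
    print(weeks_until_full(500), "weeks of headroom")
```

Sending the request would be a matter of passing it to `urllib.request.urlopen` together with admin authentication, which is deliberately left out here.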
When these new infrastructure components are in place, The npm Registry will look like this:
You are npm! And we need your help!
The npm Registry has had a 10x year. In November 2012 there were 13.5 million downloads. In October 2013 there were 114.6 million package downloads. We're honored to have been a part of sustaining this growth for the community and we want to see it continue to grow to a billion package downloads a month and beyond.
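As a back-of-the-envelope check on those figures, the two download counts give a growth factor of roughly 8.5x over the 11 months between November 2012 and October 2013. The sketch below works that out and, assuming the same compound monthly rate simply continues (a strong assumption, not a forecast), estimates how long it would take to reach a billion downloads a month, which comes out at about 11 more months.

```python
import math

NOV_2012_DOWNLOADS = 13.5e6   # from the post
OCT_2013_DOWNLOADS = 114.6e6  # from the post
MONTHS_BETWEEN = 11           # November 2012 to October 2013

# Overall growth factor across the period.
growth = OCT_2013_DOWNLOADS / NOV_2012_DOWNLOADS

# Equivalent compound growth factor per month.
monthly_factor = growth ** (1 / MONTHS_BETWEEN)

# Months needed to reach one billion at that same monthly factor.
months_to_billion = math.log(1e9 / OCT_2013_DOWNLOADS) / math.log(monthly_factor)

if __name__ == "__main__":
    print(f"growth over the period: {growth:.2f}x")
    print(f"monthly growth factor:  {monthly_factor:.3f}")
    print(f"months to one billion:  {months_to_billion:.1f}")
```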
But we need your help! All of these necessary improvements require more servers, more time from Nodejitsu staff and an overall increase to what we spend maintaining the public npm registry as a free service for the Node.js community.
Please take a minute now to donate at https://scalenpm.org!
2013.11.20, Version 0.11.9 (Unstable)
uv: upgrade to v0.11.15 (Timothy J Fontaine)
v8: upgrade to 3.22.24.5 (Timothy J Fontaine)
buffer: remove warning when no encoding is passed (Trevor Norris)
build: make v8 use random seed for hash tables (Ben Noordhuis)
crypto: build with shared openssl without NPN (Ben Noordhuis)
crypto: update root certificates (Ben Noordhuis)
debugger: pass on v8 debug switches (Ben Noordhuis)
domain: use AsyncListener API (Trevor Norris)
fs: add recursive subdirectory support to fs.watch (Nick Simmons)
fs: make fs.watch() non-recursive by default (Ben Noordhuis)
http: cleanup freeSockets when socket destroyed (fengmk2)
http: force socket encoding to be null (isaacs)
http: make DELETE requests set
node: add AsyncListener support (Trevor Norris)
src: remove global HandleScope that hid memory leaks (Ben Noordhuis)
tls: add ECDH ciphers support (Erik Dubbelboer)
tls: do not default to 'localhost' servername (Fedor Indutny)
tls: more accurate wrapping of connecting socket (Fedor Indutny)
Source Code: http://nodejs.org/dist/v0.11.9/node-v0.11.9.tar.gz
Macintosh Installer (Universal): http://nodejs.org/dist/v0.11.9/node-v0.11.9.pkg
Windows Installer: http://nodejs.org/dist/v0.11.9/node-v0.11.9-x86.msi
Windows x64 Installer: http://nodejs.org/dist/v0.11.9/x64/node-v0.11.9-x64.msi
Windows x64 Files: http://nodejs.org/dist/v0.11.9/x64/
Linux 32-bit Binary: http://nodejs.org/dist/v0.11.9/node-v0.11.9-linux-x86.tar.gz
Linux 64-bit Binary: http://nodejs.org/dist/v0.11.9/node-v0.11.9-linux-x64.tar.gz
Solaris 32-bit Binary: http://nodejs.org/dist/v0.11.9/node-v0.11.9-sunos-x86.tar.gz
Solaris 64-bit Binary: http://nodejs.org/dist/v0.11.9/node-v0.11.9-sunos-x64.tar.gz
Other release files: http://nodejs.org/dist/v0.11.9/
3a1af0d716042d617e51a82b43b2f97542f6c03a node-v0.11.9-darwin-x64.tar.gz
580696c5b2f30c8394cdee265c4888ad67e03b89 node-v0.11.9-darwin-x86.tar.gz
585df0690254afc22b29d7dc5f12fcb50f1bb588 node-v0.11.9-linux-x64.tar.gz
16fb7e69b90b6b6ac84a9120202918f197d4c0c0 node-v0.11.9-linux-x86.tar.gz
bff51e2ab3752f4ae338adca14c2c453294e6017 node-v0.11.9-sunos-x64.tar.gz
9a7ada5c862174d6ce4b524d5816e26a36b763a8 node-v0.11.9-sunos-x86.tar.gz
3a4de2ae0dd8268592546b4fdcbd78f1cbc68118 node-v0.11.9-x86.msi
8c70de5cd39ecd33e6c02d4b0f6ca4010d21372e node-v0.11.9.pkg
b4fc0e38ccde4edae45db198f331499055d77ca2 node-v0.11.9.tar.gz
9a4b04f0d40251696fac9161567e97c228d4e57d node.exe
96bfa67a417599c96818461d6d27f50401a74a36 node.exp
5f56ef7c2204ea75916876f6ab9e641b312dff11 node.lib
2db6eb844b36d96a0e34370ad1d311a57facd3d6 node.pdb
4a3590db0e6131739661628632ae1b8d70e2247b pkgsrc/nodejs-ia32-0.11.9.tgz
3ce0291cf0972ac5a2c0543fd1672d8b20569891 pkgsrc/nodejs-x64-0.11.9.tgz
9b736ec896e6b1b5856730869471c7e736f6ce78 x64/node-v0.11.9-x64.msi
1ec3593262b6e281457748ff3f3f195cd682592d x64/node.exe
1cf9ba6d503d5f8b18bfc2c1554bce04eba8a536 x64/node.exp
0a3e79ceecd05add6dab97bb6d9f460a61adddbc x64/node.lib
2037b32bbcb14e24d10c6cb2abe128bc9a85a932 x64/node.pdb
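The 40-character hexadecimal values listed with each release have the length of SHA-1 digests. Assuming that is what they are, a downloaded file can be checked against its published value with a few lines of Python. This is a minimal sketch; the function names are made up for this example and the local file path is whatever you saved the download as.

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, expected_hex):
    """Return True when the file's digest equals the published checksum."""
    return sha1_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_published("node-v0.11.9.tar.gz", "b4fc0e38ccde4edae45db198f331499055d77ca2")` should return True for an intact copy of the source tarball, provided the listed values are indeed SHA-1 sums.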
2013.11.12, Version 0.10.22 (Stable)
npm: Upgrade to 1.3.14
uv: Upgrade to v0.10.19
child_process: don't assert on stale file descriptor events (Fedor Indutny)
darwin: Fix "Not Responding" in Mavericks activity monitor (Fedor Indutny)
debugger: Fix bug in sb() with unnamed script (Maxim Bogushevich)
repl: do not insert duplicates into completions (Maciej Małecki)
src: Fix memory leak on closed handles (Timothy J Fontaine)
tls: prevent stalls by using read(0) (Fedor Indutny)
v8: use correct timezone information on Solaris (Maciej Małecki)
Source Code: http://nodejs.org/dist/v0.10.22/node-v0.10.22.tar.gz
Macintosh Installer (Universal): http://nodejs.org/dist/v0.10.22/node-v0.10.22.pkg
Windows Installer: http://nodejs.org/dist/v0.10.22/node-v0.10.22-x86.msi
Windows x64 Installer: http://nodejs.org/dist/v0.10.22/x64/node-v0.10.22-x64.msi
Windows x64 Files: http://nodejs.org/dist/v0.10.22/x64/
Linux 32-bit Binary: http://nodejs.org/dist/v0.10.22/node-v0.10.22-linux-x86.tar.gz
Linux 64-bit Binary: http://nodejs.org/dist/v0.10.22/node-v0.10.22-linux-x64.tar.gz
Solaris 32-bit Binary: http://nodejs.org/dist/v0.10.22/node-v0.10.22-sunos-x86.tar.gz
Solaris 64-bit Binary: http://nodejs.org/dist/v0.10.22/node-v0.10.22-sunos-x64.tar.gz
Other release files: http://nodejs.org/dist/v0.10.22/
3082a8d13dfafa7212a7f75bd0a83447fb4d7b99 node-v0.10.22-darwin-x64.tar.gz
dca37fa37c8ce3c0df68e74643ed822bec7a12b3 node-v0.10.22-darwin-x86.tar.gz
3739f75bbb85c920a237ceb1c34cb872409d61f7 node-v0.10.22-linux-x64.tar.gz
7e99b654c21bc2a5cbccc33f1bae3ce6e26b3d12 node-v0.10.22-linux-x86.tar.gz
3dfb3585386ca0645ba02b5ad06014ddccda8cbe node-v0.10.22-sunos-x64.tar.gz
e6004f073fc81826335dc0c8fba04a82beada0bc node-v0.10.22-sunos-x86.tar.gz
3beff0c7893e39df54e416307b624eb642bffa62 node-v0.10.22-x86.msi
b4433b98f87f3f06130adad410e2fb5f959bbf37 node-v0.10.22.pkg
d7c6a39dfa714eae1f8da7a00c9a07efd74a03b3 node-v0.10.22.tar.gz
0ff278f5d6225d2be2a51bd4c7ba8fa0d15e98a4 node.exe
6cded62495794c53f6642745d34cbeb7a28266b1 node.exp
caaa11790ac8ec40d074e141afa7ffa611f216b4 node.lib
3c7592832d403c93a17b29852f2c828760a45128 node.pdb
f335aef2844a6bf9d8d5a9782e7c631d730acc2e pkgsrc/nodejs-ia32-0.10.22.tgz
6d47f98efd86faa71e1e9887aa63916e884bb2a8 pkgsrc/nodejs-x64-0.10.22.tgz
c3c169304c6371ee7bd119151bcbced61a322394 x64/node-v0.10.22-x64.msi
307de602a091fa2af3adaa64812200e32ee00fdc x64/node.exe
67440fca57eb4be5800434245ef1a5d16f5aea01 x64/node.exp
e6ee29859cd069ff5b8bf749a598112d9f09ed3c x64/node.lib
fee98420155b88c0c4b11616aa416d2328cec97d x64/node.pdb