proactive redundancy

after the fiasco earlier this week, i’ve been taking steps to minimize the impact if tilde.team were to go down. it’s still a large spof (single point of failure), but i’m reasonably certain that at least the irc net will remain up and functional in the event of another outage.

the first thing that i set up was a handful of additional ircd nodes: see the tilde.chat wiki for a full list. slash.tilde.chat is on my personal vps, and bsd.tilde.chat is hosted on the bsd vps that i set up for tilde.team.

i added the ipv4 addresses for these machines, along with the ip for yourtilde.com, as A records for tilde.chat, creating a dns round-robin. running host tilde.chat will list all four. each dns response contains all of the records, typically in a rotating or semi-random order, and most clients connect to whichever address comes first. this means that when connecting to tilde.chat on 6697 for irc, you might end up on any of {your,team,bsd,slash}.tilde.chat.
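to check which machines are behind a round-robin name, here’s a minimal python sketch. the function name resolve_a_records is my own invention for illustration, and it only uses socket.getaddrinfo from the standard library. the order of the answers depends on the resolver, so the sketch sorts and de-duplicates them rather than relying on it.

```python
import socket


def resolve_a_records(host, port=6697):
    """Return the sorted, de-duplicated ipv4 addresses that `host` resolves to.

    `port` defaults to 6697 (irc over tls) only because getaddrinfo wants
    a service argument; it does not affect which addresses come back.
    """
    infos = socket.getaddrinfo(
        host, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # each entry is (family, type, proto, canonname, sockaddr),
    # and for ipv4 the sockaddr is an (address, port) tuple
    return sorted({sockaddr[0] for *_rest, sockaddr in infos})
```

calling resolve_a_records("tilde.chat") should list every A record on the name, one per ircd node. a real irc client just takes the first answer it gets, which is what spreads connections across the boxes.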

this creates the additional problem that visiting the tilde.chat site will end up at any of those 4 machines in much the same way. for the moment, the site is deployed on all of the boxes, which makes site setup issues hard to debug because you can’t tell which box served a given request. the solution to this problem is to use a subdomain as the round-robin host, as other networks like freenode do (see host chat.freenode.net for the list of servers).

i’m not sure how to make any of the other services more resilient. it’s something that i have been researching and will continue to look into.

the other main step that i have taken to prevent the same issue from happening again was to configure the firewall to drop outgoing requests to the private subnets defined in rfc 1918 (10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16).
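for a sense of which destinations a rule like that covers, here’s a rough python sketch that checks an address against the three rfc 1918 ranges. this is not the firewall config itself. the names RFC1918_NETWORKS and is_rfc1918 are made up for illustration, and it only uses the standard library ipaddress module.

```python
import ipaddress

# the three private ipv4 ranges reserved by rfc 1918
RFC1918_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def is_rfc1918(address):
    """Return True if `address` falls inside one of the rfc 1918 ranges."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in RFC1918_NETWORKS)
```

under these assumptions, is_rfc1918("192.168.1.10") is true and is_rfc1918("8.8.8.8") is false, so outgoing traffic to the first would be dropped and traffic to the second would go through.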

i’d like to consider at least this risk to be mitigated.

thanks for reading,

~ben

update: the round-robin host is now irc.tilde.chat, which resolves the site issues that we were having because of the duplicated deployments.