On June 26th I will be speaking at the first NServiceBus conference, in London. The conference, hosted by Skillsmatter, will run for two days, and great speakers such as Greg Young and Udi Dahan will be giving talks on the technologies and techniques you can use to build distributed systems.
I will focus on how our engineering team have dealt with the problem of scaling our platform to meet rapid growth in customers whilst at the same time increasing our technical headcount from fifteen to two hundred people spread across five countries. I’ll also look at how we transitioned from an n-tier to a service-oriented architecture whilst dealing with pressure from the business to deliver new features and keep the platform stable.
Our customer growth was rapid: we saw an increase of more than 10% month on month for several years. Our n-tier architecture was brittle. When a third-party service went offline or ran slowly, we would gradually see our web tier become starved of threads as the synchronous calls backed up. We tried to mitigate this by adding more servers and using async threads, but we could not guarantee that our website would stay up if there was a prolonged third-party outage during a busy period. Our load is spiky: typically we see 2-3 times more activity in the last couple of days of the month than normal. We needed a system that would keep serving our customers when under heavy load or when one of our third-party services went offline or started to run in a degraded state. NServiceBus helped us achieve this because it is inherently asynchronous, being built on top of a messaging transport layer. Now, if a third-party service goes down or starts running slowly, command messages either queue up or are transferred to the error queue for retrying later.
Our move from synchronous web service calls to asynchronous messaging did present challenges. We had to think differently about how we constructed our acceptance tests, often introducing polling into our tests to wait for messages to be processed. We developed a framework that assisted developers in writing these sorts of tests. We also had to create testing tools that could inject test messages to simulate customer activity so that we could test specific parts of the system.
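To give a flavour of what polling in an acceptance test looks like, here is a minimal sketch in Python. It is not our framework and it does not use NServiceBus; the `wait_until` helper and its parameters are names made up for illustration. The idea is simply to keep checking for the expected outcome until it appears or a timeout expires.

```python
import time


def wait_until(condition, timeout=5.0, interval=0.05):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass.

    `condition` is any zero-argument callable, for example a check that a
    message handler has written the row the test expects.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        # Sleep briefly so the test does not spin while the message is in flight.
        time.sleep(interval)
```

A test would send its message and then call something like `wait_until(lambda: order_exists(order_id))`, where `order_exists` is whatever check suits the system under test. The timeout keeps a broken handler from hanging the whole test run.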
When integrating with third-party APIs, which were mostly HTTP endpoints, we used a pattern that ensured the non-transactional nature of HTTP would not detrimentally affect our customers. For example, when processing card payments we did not want to risk collecting the same payment twice, so we set our retries to zero and created a saga to gracefully handle exceptions.
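The shape of that pattern can be sketched as follows. This is an illustration in Python under my own assumptions, not NServiceBus saga code: `PaymentSaga`, `PaymentState` and `gateway_call` are hypothetical names. The point it shows is that the payment call is attempted at most once, and a failure whose outcome is unknown is parked for follow-up instead of being retried.

```python
from dataclasses import dataclass
from enum import Enum


class PaymentState(Enum):
    PENDING = "pending"
    COLLECTED = "collected"
    NEEDS_REVIEW = "needs_review"


@dataclass
class PaymentSaga:
    """Hypothetical saga state for one payment; not an NServiceBus type."""
    payment_id: str
    state: PaymentState = PaymentState.PENDING
    attempts: int = 0

    def handle_collect(self, gateway_call):
        # If the message is delivered again, do nothing: the HTTP call is
        # only ever made while the payment is still pending.
        if self.state is not PaymentState.PENDING:
            return self.state
        self.attempts += 1
        try:
            gateway_call(self.payment_id)
        except Exception:
            # The request may or may not have reached the provider, so we
            # record the doubt rather than risk charging the card twice.
            self.state = PaymentState.NEEDS_REVIEW
        else:
            self.state = PaymentState.COLLECTED
        return self.state
```

Once a payment is in the `NEEDS_REVIEW` state, a separate step (a status query to the provider, or a person) can decide what actually happened before anything is collected again.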
As we moved into new regions and they started to gain traction, the number of messages increased. In order to continue to meet our message processing SLAs we had to split our handlers into separate endpoints. Our monitoring solution proved invaluable, as it allowed us to see the volume of each message type and the length of time it took to process them. Armed with this information, it was a relatively straightforward task to identify which message handlers needed to be moved into their own endpoint.
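As a rough illustration of that analysis, the sketch below ranks message types by the total time spent processing them. It assumes the monitoring data can be exported as simple (message type, seconds) pairs; that format and the `rank_message_types` function are assumptions for the example, not a description of our monitoring solution.

```python
from collections import defaultdict


def rank_message_types(samples):
    """Return (message_type, count, total_seconds) tuples, busiest first.

    `samples` is an iterable of (message_type, processing_seconds) pairs.
    """
    totals = defaultdict(lambda: [0, 0.0])
    for message_type, seconds in samples:
        totals[message_type][0] += 1        # how many messages of this type
        totals[message_type][1] += seconds  # how long they took in total
    ranked = [(name, count, total) for name, (count, total) in totals.items()]
    # Types that consume the most processing time are the first candidates
    # for a dedicated endpoint.
    ranked.sort(key=lambda row: row[2], reverse=True)
    return ranked
```

Sorting by total time rather than by count alone matters because a rare but slow message type can hold up a shared endpoint just as badly as a frequent, fast one.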
If these are the sort of problems you are working on and you would like to know more, you can find details about the conference here. If you have any questions, feel free to contact me on Twitter, where my nickname is @porkstone.