Chipping Away at the Monolith

September 14th, 2016
Chris Arrington, Senior Software Engineer

Many software teams find themselves living with large, monolithic code bases. But as an organization grows, working in these environments can be difficult, particularly when it comes to developing and deploying code at scale. As a result, many teams try to migrate to a microservices architecture, where a large system is decomposed into smaller services. These independently deployable and scalable services are easier to manage and can make better use of an organization's computing resources.

But transitioning to this approach is hard, particularly in a SaaS environment where systems must be available 24/7. Think about trying to replace the engine on your car while driving it. Making such critical changes in a SaaS environment is just as tough because stopping the car is not an option.

At Bronto, we've been working toward decomposing our platform into microservices for a few years. It's been a gradual process, but we're getting there. My team recently rolled out a major refactor of the way we render emails, migrating the code into a dedicated microservice. The code for email rendering is now in one place and handles all of our different types of rendering, including bulk deliveries, single message deliveries, and rendering emails for display in a web browser. We're very proud that we were able to perform this switch without any disruption, and in this post, I'll share what led to our success, what we learned during the process, and the benefits we've seen from this effort.

We Took Our Time

Every year, our engineering team is challenged with meeting increased demands on our infrastructure during the busy holiday season.
While we've consistently been successful, we have also seen the need to improve the performance of email rendering, which led us to this project, called the Content service. We embarked on this journey not knowing how long it would take, but our number one priority was quality. Rendering and sending email correctly and in a timely fashion is too critical to our business to rush or cut corners. Don't get me wrong: we set deadlines for ourselves and were very motivated to complete the effort on time, but we were not going to sacrifice quality just to say the job was done. Fortunately, our entire engineering organization, including management, believes in this philosophy, which allowed us to work through this task the right way.

Verification Mode

One advantage of developing a SaaS product is that we control how code is deployed, which allows us to test upcoming features with specific "beta" clients. For almost a year, the service was deployed and running in what we called "verification mode," allowing us to verify the correctness of our new implementation in a production environment. Whenever the legacy code rendered a batch of messages, we would send a small sampling of the results to the new service, where it would render the same messages, compare the results, and log any differences detected. Using this approach, we were able to find bugs and test the performance of our new service with no disruption to our customers – a huge advantage that gave us a lot of confidence when we finally rolled it out.

Running in verification mode also gave us insight into the ways our users leverage features like tags (placeholders for customer-specific data) and dynamic content (our high-level DSL) to create personalized emails for their customers. In some cases, we were able to see where users were submitting incorrect or missing data that resulted in incorrect rendering, even in the old system.
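The verification-mode comparison described above can be sketched roughly as follows. This is a minimal illustration, not Bronto's actual code: the renderer functions, sample rate, and message shape are all hypothetical stand-ins for the legacy pipeline and the call to the new Content service.

```python
import difflib
import logging
import random

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("render-verification")

SAMPLE_RATE = 0.01  # verify roughly 1% of each rendered batch (illustrative value)

def legacy_render(message):
    # Stand-in for the legacy rendering path.
    return message["template"].replace("{{name}}", message.get("name", ""))

def content_service_render(message):
    # Stand-in for a call to the new Content service.
    return message["template"].replace("{{name}}", message.get("name", ""))

def verify_batch(batch):
    """After the legacy code renders a batch, re-render a small sample
    with the new service, compare the results, and log any differences."""
    mismatches = 0
    for message in batch:
        if random.random() > SAMPLE_RATE:
            continue  # only a small sampling is double-rendered
        old = legacy_render(message)
        new = content_service_render(message)
        if old != new:
            mismatches += 1
            diff = "\n".join(difflib.unified_diff(
                old.splitlines(), new.splitlines(), lineterm=""))
            log.warning("Render mismatch for message %s:\n%s",
                        message.get("id"), diff)
    return mismatches
```

The key property is that the comparison happens off the critical path: the legacy output is what customers receive, so a bug in the new service surfaces only as a logged mismatch, never as a bad email.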
We were able to reach out to them about the problem and advise them on how to achieve the desired result a better way.

Features

All bulk deliveries are rendered minutes prior to being sent, with the pre-rendered message data now stored in Redis. Previously, we used Memcached to store pre-rendered message data, but it isn't as efficient at storing many objects of varying sizes.

This new implementation uses a Redis-based work queue for pre-rendering batches of messages for bulk deliveries. Each system that pre-renders batches of messages reads from a distributed Redis-based queue (based on our Reliable queue implementation). If used as is, tasks are completed in the order they were submitted, which can cause a delay if sends are scheduled in an order that differs from when they should go out. For example, if you schedule a send at 12 p.m. that should go out at 1 p.m. and then schedule another send at 12:45 p.m. that needs to go out immediately, the 1 p.m. send would land in the queue first and process first, which could delay the 12:45 p.m. send.

To address this, we first write batches to a Redis ZSet, sorted by the timestamp of when the delivery is to be sent. A thread on each pre-rendering system takes batches from the ZSet and submits them to the work queue whenever the queue size is below a defined threshold. As a result, the portion of the queue that operates in FIFO fashion remains small, and the batches of deliveries that will be sent next are prioritized.

Benefits

Now that the work is complete, we're really beginning to reap the benefits of our efforts. By moving rendering into an independent service, we have much more flexibility as to when we can update this part of the system, and as a result, we can develop and roll out new features to our customers much faster than before. The most notable improvement has been in performance, where we've seen an approximately 50% improvement in rendering speed.
Much of the code we were replacing was several years old and used older libraries that did not perform as well as newer ones. Starting from scratch gave us the opportunity to apply different approaches, techniques, and technologies. The new service also uses fewer virtual machines, meaning we are able to do more with less and make better use of our computing resources. Having dedicated infrastructure is also an advantage because we can provision virtual machines for peak sending times, such as Cyber Monday. And with separate, dedicated resources, there's a much better chance that any problems in our service won't affect other parts of the Bronto platform.

In the context of the larger Bronto platform, the API for the service serves as an architectural boundary, where underlying changes to the implementation and deployment can be hidden from the users of the service. In the future, we plan to take advantage of this to further enhance the performance and maintainability of this code.