{"id":12565,"date":"2023-04-11T11:14:00","date_gmt":"2023-04-11T10:14:00","guid":{"rendered":"https:\/\/thndrblog.bikostudio.com\/?p=12565"},"modified":"2024-04-07T12:39:13","modified_gmt":"2024-04-07T11:39:13","slug":"how-we-reduced-redis-latency-at-thndr","status":"publish","type":"post","link":"https:\/\/thndr.horizondm.com\/blogpost\/how-we-reduced-redis-latency-at-thndr\/","title":{"rendered":"Reducing Redis Latency for Enhanced Investing at Thndr"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"12565\" class=\"elementor elementor-12565\" data-elementor-post-type=\"post\">\n\t\t\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-6f576a95 elementor-section-boxed elementor-section-height-default elementor-section-height-default rt-parallax-bg-no\" data-id=\"6f576a95\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-4b48f238\" data-id=\"4b48f238\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-22ade3d4 elementor-widget elementor-widget-text-editor\" data-id=\"22ade3d4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<h2>Thndr&#8217;s all-hands meeting<\/h2>\n<p id=\"d71a\">During one of those weekly \u201call-hands\u201d meetings at Thndr<strong>\u26a1\ufe0f<\/strong>, where we get together to share the week\u2019s updates, everyone was celebrating that around\u00a0<strong>86%<\/strong>\u00a0of EGX\u2019s (Egyptian Exchange) investor growth in 2022 was registered and signed up through our Thndr app (<a 
href=\"https:\/\/play.google.com\/store\/apps\/details?id=com.axismarkets.thndr&amp;hl=en&amp;gl=US&amp;pli=1\" target=\"_blank\" rel=\"noreferrer noopener\">Android<\/a>, and\u00a0<a href=\"https:\/\/apps.apple.com\/us\/app\/thndr-invest-your-money\/id1494883259\" target=\"_blank\" rel=\"noreferrer noopener\">iOS<\/a>). Amidst the general excitement, as usual, the engineering team had their minds elsewhere and were concerned about something entirely different.\u00a0<a href=\"https:\/\/www.linkedin.com\/in\/ali-farouk-5ab179164\/\" target=\"_blank\" rel=\"noreferrer noopener\">Ali<\/a>, our engineering lead, was thinking about a technical problem that had been looming for a while and was considered a priority technical debt item that should be tackled sometime in 2023. Because of the tremendous increase in the app\u2019s usage, he thought that it would be a better idea to prioritize working on it right away. This was the start of a good conversation that I had with Ali and <a href=\"https:\/\/www.linkedin.com\/in\/seif-elkhonany\/\" target=\"_blank\" rel=\"noreferrer noopener\">Seif<\/a>\u00a0to figure out when and what to start with.<\/p>\n\n<figure class=\"wp-block-image\"><img fetchpriority=\"high\" decoding=\"async\" class=\"alignnone\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:875\/1*O97zISRmr3JuDMuBGaitbw.png\" alt=\"86% of new investors in the EGX in 2022 came from Thndr from How we reduced Redis latency for a better investing experience article\" width=\"875\" height=\"875\" \/><\/figure>\n\n<h3 id=\"a74d\" class=\"wp-block-heading\">Hey, it\u2019s Redis again!<\/h3>\n\n<p id=\"b08a\">Redis is a powerful in-memory data structure store that has become increasingly popular among developers for its speed, versatility, and ease of use. 
With its ability to handle data structures such as strings, hashes, lists, sets, and more, Redis is a great tool for solving complex problems in real-time applications.<\/p>\n\n<p id=\"1b18\">At Thndr we rely heavily on Redis. Our use cases vary between caching, distributed rate limiting, Pub\/Sub for most of our background jobs, and even as a persistent key-value store, among others. While this wasn\u2019t an issue at first, with our user base growing by more than 400% in the last 5 months, we started seeing a substantial increase in our Redis latency.<\/p>\n\n<p id=\"cf82\">We\u2019re using AWS ElastiCache as a managed Redis solution. At the beginning of all of this, we had a single Redis server and database provisioned and used by all our microservices. Some of the data that lives in this database is used by multiple services, and some of it is scoped to a single service. This is the famous \u201ccommon data coupling\u201d anti-pattern, and gradually, we started seeing more and more issues with this setup, increased latency being one of them. First, we hit our instance bandwidth limit, so we went ahead and scaled our instance vertically so that AWS would give us more bandwidth. The problem was that we were wasting a lot of money, as our CPU usage was less than 7% and our memory consumption didn\u2019t exceed 22%. Nonetheless, this solution helped us stay resilient for a while. We would still see latency spikes during market hours that could go up to 1s+ (ideally, Redis should have a latency of 1 ms for writes and much less for reads), but we were coping.<\/p>\n\n<h2 id=\"9400\" class=\"wp-block-heading\">We have a problem\u2026<\/h2>\n\n<p id=\"21e8\">Soon after, this past December, we had our first big incident. It started with a huge spike in latency. The Redis request queue became backed up under the load, with requests being queued much faster than they could be processed. 
During the incident, Redis\u2019 p99 latency reached the 40s mark. This negatively affected our SLA and the user experience. Our most critical services were affected, and the entire app became very slow. We saw 1+ minute latencies in some of our most important services. This was truly a wake-up call that relying on vertical scaling was no longer an option and that it was time to address the root cause of the problem at hand.<\/p>\n\n<p id=\"5dd0\">The following graphs demonstrate how this had a significant impact on our latency throughout that incident.<\/p>\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" class=\"alignnone\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:875\/1*EO21v0lPRIgrRw0iTrmNYQ.png\" alt=\"Redis p99 latency reached the 40s mark. \" width=\"875\" height=\"291\" \/><\/figure>\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" class=\"alignnone\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:875\/1*MA8TTBkPlubLhAj2R3NemA.png\" alt=\"Redis latency graph\" width=\"875\" height=\"322\" \/><\/figure>\n\n<h3 id=\"fd5c\" class=\"wp-block-heading\">Discovery<\/h3>\n\n<p id=\"1b26\">As mentioned before, we hit our bandwidth limit in Redis, partially because of our growing user base and partially because of some anti-patterns scattered around the codebase, but more on that later.<\/p>\n\n<p id=\"718c\">Redis is single-threaded by nature, which means no two commands can execute in parallel. This might appear to be an issue, but in most cases, network bandwidth and memory bandwidth are the bottlenecks as opposed to the CPU; thus, the performance gain from multi-threading is negligible. 
Since most threads would end up being blocked on I\/O, it wouldn\u2019t justify the overhead of multi-threading, which includes significant code complexity from resource locking and thread synchronization, and CPU overhead from thread creation, destruction, and context switching.<\/p>\n\n<p id=\"1857\">Nevertheless, when you use it to periodically perform multi-key operations on hundreds of thousands of keys, as we were doing at the time, a single command can take a long time to process and block the event loop until it finishes. Some of the expensive O(n) operations we used to run, like <strong><em>mset<\/em><\/strong>\u00a0and\u00a0<strong><em>mget<\/em><\/strong>, would fetch more than\u00a0<strong>25k values at a time<\/strong>. There\u2019s also the usage of the\u00a0<strong><em>KEYS<\/em><\/strong>\u00a0command, which is a very expensive, blocking operation that should be avoided at all costs in production and replaced with\u00a0<strong><em>SCAN<\/em><\/strong>.<\/p>\n\n<p id=\"8de8\">Through monitoring tools (we mainly rely on\u00a0<a href=\"https:\/\/www.datadoghq.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">Datadog<\/a>\u00a0for this), we found out that, unfortunately, our most impacted service was the market service, where all of the trading and most of the magic happens.<\/p>\n\n<h4 id=\"407d\" class=\"wp-block-heading\">Operation \u201cSplit and Scale\u201d<\/h4>\n\n<p id=\"c9cf\">First, we started by splitting our huge, monolithic Redis database into multiple ones, one for each service. We ended up with a Redis database per service for most of our services, except for a few with the common data coupling issue that we mentioned in the intro. 
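<\/p>

<p>One way to tame giant multi-key calls like the 25k-value <em>mget<\/em> mentioned above is to fetch keys in fixed-size batches so that no single O(n) command blocks the event loop for long. This is a minimal illustration, not code from our codebase, assuming a redis-py-style client object exposing an <em>mget<\/em> method:<\/p>

```python
def chunked(seq, size):
    """Split a flat list of keys into fixed-size batches."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def mget_in_batches(client, keys, batch_size=500):
    """Fetch many keys in several small MGETs instead of one huge
    O(n) command that would block Redis' single event loop."""
    values = []
    for batch in chunked(keys, batch_size):
        values.extend(client.mget(batch))
    return values
```

<p>The same idea applies to writes: splitting one huge <em>mset<\/em> into batches trades a little extra network chatter for much shorter blocking windows on the server.<\/p>

<p>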
Already, by doing this split, we were able to give ourselves a lot of headroom and buy ourselves some time to focus on the other contributing factors.<\/p>\n\n<p id=\"e533\">Then, we started looking into how to scale the different Redis instances, and naturally, we considered\u00a0<a href=\"https:\/\/redis.io\/docs\/management\/scaling\/\" target=\"_blank\" rel=\"noreferrer noopener\">Redis Cluster<\/a>. Since Redis operates with a single-threaded model, vertical scaling is not really a helpful option, as it cannot utilize multiple CPU cores to process commands. Instead, horizontal scaling seemed like a more plausible solution. Redis Cluster achieves horizontal scaling by not only adding several servers and replicating the data but also distributing the data set among the nodes in the cluster (what you may know as sharding), enabling the processing of requests in parallel. What makes Redis Cluster extra special, however, is its sharding algorithm; Redis Cluster does not use\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Consistent_hashing\" target=\"_blank\" rel=\"noreferrer noopener\">consistent hashing<\/a>, but a different form of sharding where every key is assigned to a hash slot.<\/p>\n\n<p id=\"fe5e\">Hash slots share the concept of using hashes or composite partitioning but do not rely on the circle-based algorithm upon which consistent hashing is based. One of the drawbacks of consistent hashing is that as the system evolves, operations such as the addition\/removal of nodes, the expiration of keys, and the addition of new keys cause the cluster to end up with imbalanced shards. In a way, this would have created more problems instead of solving them.<\/p>\n\n<p id=\"3c74\">The way \u201cHash Slots\u201d effectively solve this is by partitioning the keyspace across the different nodes in the cluster. 
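<\/p>

<p>The slot computation itself is easy to reproduce. Below is a minimal Python sketch (not from our codebase) of how a key maps to one of the 16384 slots, using the CRC16 (XMODEM) variant that the Redis Cluster specification defines:<\/p>

```python
def crc16(data: bytes) -> int:
    """CRC16-CCITT (XMODEM), the polynomial the Redis Cluster spec uses."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else crc << 1
            crc &= 0xFFFF
    return crc

def hash_slot(key: str) -> int:
    """Map a key to one of a Redis Cluster's 16384 hash slots."""
    return crc16(key.encode()) % 16384
```

<p>Each node in the cluster owns a subset of these slots, so resharding is simply a matter of reassigning slot ranges between nodes.<\/p>

<p>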
Specifically, the keyspace is split into 16384 slots, where every key is hashed into the integer range 0 ~ 16383 following the formula <strong><em>slot = CRC16(key) mod 16384<\/em><\/strong>. In other words, to compute the hash slot of a given key, we simply take the CRC16 of the key modulo 16384. Doing this makes the evolution of the cluster and its processes (such as adding or removing nodes) much easier. If we are adding a new node, we need to move some hash slots from existing nodes to the new one. Similarly, if we would like to remove a node, we can move the hash slots served by that node to the other nodes in the cluster. In return, this creates a much more balanced cluster, solving the issue with consistent hashing.<\/p>\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p id=\"54eb\"><em>One thing to take into consideration, though, is that AWS ElastiCache operates a little differently than Redis. For example, there is no sentinel mode, so sharding is always obligatory if you\u2019re thinking of adding more nodes to scale horizontally. Cluster mode also behaves a bit differently than it does in Redis; for example, there is no minimum of three nodes per cluster, as Redis requires. So the AWS ElastiCache documentation should be consulted before going with it as a solution and expecting it to behave just like Redis.<\/em><\/p>\n<\/blockquote>\n\n<h3 id=\"733e\" class=\"wp-block-heading\">Another Discovery<\/h3>\n\n<p id=\"e3e5\">We also looked at the library we were using to make calls to our Redis databases. We found that it was deprecated and didn\u2019t support Redis Cluster. Digging deeper into its source code, we found that it was also not that performant with some multi-key commands. 
For example, running commands in a pipeline runs the commands in a for loop instead of <a href=\"https:\/\/redis.io\/docs\/manual\/pipelining\/\" target=\"_blank\" rel=\"noreferrer noopener\">pipelining<\/a> them. We decided to look at other libraries that support Redis Cluster and are hopefully more performant.<\/p>\n\n<p id=\"6d3c\">Another area for improvement was the fact that we used Redis Pub\/Sub mainly to schedule Celery jobs. In a nutshell, using Pub\/Sub with Redis Cluster is generally a bad idea. Since the client can send SUBSCRIBE to any node and can also send PUBLISH to any node, published messages are replicated to all nodes in the cluster. This makes Redis Cluster inefficient as a Pub\/Sub solution.<\/p>\n\n<h2 id=\"5264\" class=\"wp-block-heading\">The Solution<\/h2>\n\n<p id=\"5667\">The first thing we did was replace our Celery jobs with Kubernetes jobs across all our services to take some load off Redis. An added bonus was that we have much better observability on Kubernetes jobs than we did on Celery jobs anyway.<\/p>\n\n<p id=\"5491\">Then, as mentioned before, our market service is not only the most impacted service but also one of the most important and most used services in our ecosystem, so we decided to start from there.<\/p>\n\n<p id=\"056e\">We wanted to try implementing Redis Cluster there first and monitor its performance. And since this was a big move, we took it in steps, so we first looked at alternatives to our deprecated Redis library. 
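<\/p>

<p>To make the pipelining distinction concrete, here is a hypothetical sketch (assuming a redis-py-style client whose <em>pipeline()<\/em> returns an object with an <em>execute()<\/em> method) contrasting the for-loop behavior we found with real pipelining:<\/p>

```python
def set_many_looped(client, items):
    """Anti-pattern: one network round trip per SET command."""
    for key, value in items.items():
        client.set(key, value)

def set_many_pipelined(client, items):
    """Real pipelining: commands are buffered client-side and flushed in a
    single round trip; all replies are then read back together."""
    pipe = client.pipeline()
    for key, value in items.items():
        pipe.set(key, value)
    return pipe.execute()
```

<p>With thousands of commands, the looped version pays a full network round trip per command, while the pipelined version pays roughly one.<\/p>

<p>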
Our criteria for choosing this library were three things:<\/p>\n\n<ul class=\"wp-block-list\">\n<li>Strong support from the Redis community<\/li>\n\n<li>Performant operations, where, for example, pipelining actually pipelines the commands instead of just putting them in a for loop under the hood.<\/li>\n\n<li>Support for Redis Cluster mode and for multi-key commands in cluster mode.<\/li>\n<\/ul>\n\n<h4 id=\"3c28\" class=\"wp-block-heading\">Obstacles along the way<\/h4>\n\n<p id=\"e500\">One of the challenges we faced while picking a library was that, for the aforementioned third point, most libraries didn\u2019t seem to support multi-key commands in cluster mode, which are essential as they are widely used in our codebase (luckily, we don\u2019t really care about atomicity when it comes to these kinds of operations). Thankfully, we found a suitable client library, and we were able to use it without introducing lots of changes to our codebase. It\u2019s important to mention that, in our case, it was relatively simple since our applications didn\u2019t care about atomicity when it came to our multi-key operations, but in case the need arises in the future, we can always use\u00a0<a href=\"https:\/\/redis.io\/docs\/reference\/cluster-spec\/#hash-tags\" target=\"_blank\" rel=\"noreferrer noopener\">hash tags<\/a>, which can be used to force certain keys to be stored in the same hash slot. This should be used carefully, however, as it can easily skew the data distribution across hash slots and across nodes.<\/p>\n\n<h3 id=\"50a3\" class=\"wp-block-heading\">Moving Forward<\/h3>\n\n<p id=\"ef9f\">As mentioned, we took this in steps. 
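<\/p>

<p>The hash tags mentioned above work by hashing only the substring between the first balanced pair of braces in the key, so keys that share a tag always land in the same slot. A minimal sketch (not from our codebase) of the tag-extraction rule from the cluster specification:<\/p>

```python
def hash_tag(key: str) -> str:
    """Return the part of the key that Redis Cluster actually hashes:
    the content of the first non-empty {...} if present,
    otherwise the whole key."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end > start + 1:  # the braces must contain something
            return key[start + 1:end]
    return key
```

<p>For example, <em>user:{42}:profile<\/em> and <em>user:{42}:orders<\/em> both hash the substring \u201c42\u201d, so a multi-key command touching both stays on one node.<\/p>

<p>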
So we started by taking out the old Redis library in the market service and replacing it with the new, much better supported one, making sure it worked well before switching to cluster mode.<\/p>\n\n<p id=\"ea66\">Then, since we use AWS ElastiCache for a managed Redis deployment, we created an ElastiCache cluster of 3 nodes with cluster mode enabled (using <a href=\"https:\/\/www.terraform.io\/\" target=\"_blank\" rel=\"noreferrer noopener\">Terraform<\/a>, which we use to define and deploy all our infrastructure) only for the market service. But when we started applying this in code on the market service, we faced an issue along the way\u2026<\/p>\n\n<h4 id=\"c191\" class=\"wp-block-heading\">More Obstacles<\/h4>\n\n<p id=\"d0bc\">It turns out Redis Cluster was not supported by the Datadog APM library that we use, and since monitoring and observability are two key components of everything we do at Thndr, we couldn\u2019t move forward. We faced a dead end and were contemplating what to do. Should we wait until it\u2019s supported? Should we fork the library and build support for it ourselves? The truth is, we were really tight on time, and we were expecting a substantial jump in our app load any day, so we had to find a solution that worked now.<\/p>\n\n<p id=\"39b0\">Monkey patching to the rescue! Even though monkey patching is not the most efficient or aesthetically pleasing code to write and maintain, it was a fast and effective option. We were able to monkey-patch the library and confirm monitoring was working on our staging environment in around 30 minutes.<\/p>\n\n<h5 id=\"652f\" class=\"wp-block-heading\">Showtime and Results<\/h5>\n\n<p id=\"73f8\">And finally, we were ready to deploy Redis cluster mode in one of our services and see it in action in production. Upon deploying our solution, the results were immediate. 
The following graph was taken during market hours, and it shows Redis latency dropping to mere microseconds!<\/p>\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:875\/1*wYnXqah9xbIHI55-7c05Cw.png\" alt=\"Redis latency dropping to mere microseconds!\" width=\"875\" height=\"306\" \/><\/figure>\n\n<h2 id=\"87d8\" class=\"wp-block-heading\">Wrapping up<\/h2>\n\n<p id=\"d67f\">To sum it up, we discussed the problem, discovery, solution, and implementation of one of the Thndr app\u2019s technical debt items, one that had become a higher priority due to the app\u2019s phenomenal growth in daily transactions and user base. At Thndr, we are firm believers in continuous improvement and doing things the right way. We also acknowledge that there will always be trade-offs and that technical debt should be prioritized during each iteration and at every level. However, we also know that there are many amazing projects and codebases for products that were not fortunate enough to continue. 
Every day, trade-offs are inevitable!<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"<p>Redis is a powerful in-memory data structure store that has become increasingly popular among developers for its speed, versatility, and ease of use.<\/p>\n","protected":false},"author":11,"featured_media":12568,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[257],"tags":[],"class_list":["post-12565","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-tech-talk"],"_links":{"self":[{"href":"https:\/\/thndr.horizondm.com\/blogpost\/wp-json\/wp\/v2\/posts\/12565","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/thndr.horizondm.com\/blogpost\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/thndr.horizondm.com\/blogpost\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/thndr.horizondm.com\/blogpost\/wp-json\/wp\/v2\/users\/11"}],"replies":[{"embeddable":true,"href":"https:\/\/thndr.horizondm.com\/blogpost\/wp-json\/wp\/v2\/comments?post=12565"}],"version-history":[{"count":19,"href":"https:\/\/thndr.horizondm.com\/blogpost\/wp-json\/wp\/v2\/posts\/12565\/revisions"}],"predecessor-version":[{"id":13365,"href":"https:\/\/thndr.horizondm.com\/blogpost\/wp-json\/wp\/v2\/posts\/12565\/revisions\/13365"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/thndr.horizondm.com\/blogpost\/wp-json\/wp\/v2\/media\/12568"}],"wp:attachment":[{"href":"https:\/\/thndr.horizondm.com\/blogpost\/wp-json\/wp\/v2\/media?parent=12565"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/thndr.horizondm.com\/blogpost\/wp-json\/wp\/v2\/categories?post=12565"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/thndr.horizondm.com\/blogpost\/wp-json\/wp\/v2\/tags?post=
12565"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}