Ragavendra Dani

Optimized a 7-hour process to under 5 mins.

09/15/2025

The Problem

We had a scheduled job that was running for over 7 hours. When it was first created a couple of years ago, it ran in about an hour. The scheduler's job was to update our database so that production customers would receive today's latest data; we ran it before the start of the business day.

Over time, as the process got bloated, the scheduler came to take almost an entire working day (~7 hours) to run.

Optimization steps

We identified two paths of optimization.

  1. Run the scheduled job for active user groups only.
  2. Run the process in parallel, split into user group blocks.

Run the scheduled job for active user groups only.

From the data, we learned that the scheduled job was running sequentially across every user group, regardless of whether the group was active. Each user group contains hundreds of thousands to a few million rows, but only a small subset of these groups was actively used.

By filtering the job to run only for active user groups, we reduced the amount of data scanned and processed. This change alone brought the scheduled job runtime down from ~7 hours to ~1 hour.
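In outline, the change is a filter in front of the existing per-group loop. This is a minimal sketch, not our production code: `run_scheduled_job`, `process_group`, and the `active` flag are illustrative names.

```python
def run_scheduled_job(groups, process_group):
    """Run the daily update only for groups flagged active.

    Hypothetical sketch: `groups` stands in for the user-group records,
    `process_group` for the per-group update work. Inactive groups are
    skipped entirely, so their rows are never scanned.
    """
    active = [g for g in groups if g.get("active")]
    for group in active:
        process_group(group)
    return len(active)

# Example: 4 groups, only 2 active, so only 2 are processed.
groups = [
    {"id": 1, "active": True},
    {"id": 2, "active": False},
    {"id": 3, "active": True},
    {"id": 4, "active": False},
]
processed = []
count = run_scheduled_job(groups, processed.append)
```

The key point is that the filter happens before any data is touched, so the cost of an inactive group drops to near zero.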


Run the process in parallel, split into user group blocks.

Since the processing boundaries were the user groups themselves, we parallelized the workflow by splitting execution into independent user group blocks and assigning one thread per user group.

We implemented a multi-threaded solution, introduced a 5-minute timeout per thread as a safety guardrail, and deployed the change to production. With this, the full process consistently completed in under 5 minutes.
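The thread-per-group pattern with a per-thread timeout could be sketched like this (an assumption-laden illustration, not our actual implementation; `process_group` and the group values are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

TIMEOUT_SECONDS = 300  # the 5-minute safety guardrail per user group

def process_all_groups(groups, process_group, timeout=TIMEOUT_SECONDS):
    """Process each user group on its own thread, collecting timeouts.

    One worker per group, as in the post. A group whose result is not
    ready within `timeout` is recorded as timed out rather than
    silently dropped. Note: result(timeout=...) raises TimeoutError in
    the caller but does not kill the worker thread.
    """
    results, timed_out = {}, []
    with ThreadPoolExecutor(max_workers=max(1, len(groups))) as pool:
        futures = {pool.submit(process_group, g): g for g in groups}
        for future, group in futures.items():
            try:
                results[group] = future.result(timeout=timeout)
            except TimeoutError:
                timed_out.append(group)
    return results, timed_out
```

Because the groups are independent, the wall-clock time of the whole run collapses to roughly the slowest single group rather than the sum of all of them.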

Initially, we suspected the 5-minute timeout was artificially reducing runtime by skipping work. However, after reviewing production logs over several weeks, we confirmed that:

  • The timeout rate was < 1% of total executions
  • The longest async user group processing duration was ~1.5 minutes
  • This was observed across 250k+ async executions in the last 7 days
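The validation itself is simple arithmetic over per-execution records. A sketch of the kind of summary we computed, assuming a hypothetical record shape with a per-execution `duration_s` field:

```python
def summarize(executions, timeout_s=300):
    """Compute timeout rate and longest duration from execution records.

    `executions` is a list of dicts with a `duration_s` field; an
    execution counts as timed out if it hit the `timeout_s` guardrail.
    The record shape is an assumption for illustration.
    """
    total = len(executions)
    timeouts = sum(1 for e in executions if e["duration_s"] >= timeout_s)
    return {
        "timeout_rate": timeouts / total,
        "longest_s": max(e["duration_s"] for e in executions),
    }

# Toy sample echoing the observed numbers: longest run ~90s, no timeouts.
stats = summarize([{"duration_s": 90}, {"duration_s": 20}, {"duration_s": 60}])
```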

This validated that the runtime improvement came from scoping to active groups and removing sequential bottlenecks, not from the timeout mechanism.

Future improvement

We did see some processes time out (<1%) when monitored over 7 days. These were database timeouts caused by blocks carrying larger amounts of data. To address this, we set up a retry mechanism that reprocesses the timed-out events.
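A retry wrapper for timed-out events might look like the following. This is a generic sketch with exponential backoff, not a description of our production mechanism; `process` and `event` are placeholder names.

```python
import time

def process_with_retry(process, event, retries=3, base_delay_s=1.0):
    """Retry a timed-out event a few times with exponential backoff.

    Re-raises after the final attempt so a persistently failing event
    still surfaces instead of being swallowed.
    """
    for attempt in range(retries):
        try:
            return process(event)
        except TimeoutError:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay_s * (2 ** attempt))
```

Since the blocks that time out are the data-heavy ones, a retry on a less-loaded database often succeeds where the first attempt did not.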

Conclusion

We thus brought the 7-hour scheduled job down to under 5 minutes. Now we can promise the business that customers are getting the latest data every day.