The other day, I had a task to design a feature that accepts a list of items and returns an ID to the caller. The caller can use that ID to query the status and download the result. It is a classic problem for which a background job immediately comes to mind.
My project is on .NET Core running on Azure. There are a number of options for implementing a background job. I asked around and also asked Sonnet 4.6. Here are three options (among more) that I could choose from:
- Azure Storage Queue + Azure Function
- Hangfire
- An in-memory background job running as a Hosted Service inside the Web API instance
Hangfire seems the most powerful and is widely used by developers and organizations. It has been around for years.
But there was a classic error in my judgement. When I generalized my specific feature into "a background job", I took the question completely off track. What I actually wanted was to run jobs in the background, with a way of keeping track of progress and storing the results. In short, what I wanted to implement:
- Ability to store the jobs.
- Ability to have a function pick them up and execute them in the background, with the ability to retry, figure out what failed, and reschedule them.
- Ability to query for status, such as counts of completed, in-progress, pending, and failed items.
- Ability to download the result.
With that reframe, Azure Storage Queue and Azure Function were a perfect fit. The combination was simple and already in our codebase, so we did not bring in any new dependencies.
This is the data model that keeps the jobs. For our purpose, we just need one table:
```csharp
public class Job
{
    public Guid Id { get; set; }

    /// <summary>
    /// A task has a number of jobs.
    /// </summary>
    public Guid TaskId { get; set; }

    /// <summary>
    /// Status of a job:
    /// Pending: not started yet
    /// InProgress: executing
    /// Completed: completed
    /// Failed: failed
    /// </summary>
    public string Status { get; set; } = null!;

    /// <summary>
    /// Number of attempts so far.
    /// If it is 1, no retry yet; if it is 2, it is the first retry, and so on.
    /// </summary>
    public int RetryCount { get; set; }

    /// <summary>
    /// When the status is Failed, this field captures the details.
    /// </summary>
    public string? ErrorMessage { get; set; }

    /// <summary>
    /// When the job moves out of Pending into InProgress.
    /// Note that the Updated field (from the base class) provides the end time.
    /// </summary>
    public DateTimeOffset? Started { get; set; }

    /// <summary>
    /// Executors use it to know what to do, with the required information.
    /// Stored as JSONB in the database, which allows us to extend it.
    /// </summary>
    public JobDetail JobDetail { get; set; } = null!;

    /// <summary>
    /// Available once the job has completed.
    /// Stored as JSONB in the database, which allows us to extend it.
    /// </summary>
    public JobResult? JobResult { get; set; }
}
```
Here is the flow:
- The API receives the request and creates one job per item. The jobs are linked together by "TaskId".
- Once the jobs are stored in the database, their IDs are sent to a queue: one job ID per queue message.
- An Azure Function picks up each queue message and executes the job. By default, an Azure Function with a queue trigger retries five times before moving the message to the poison queue. By monitoring the poison queue, we can troubleshoot the system and/or move messages back to the main queue.
- The result is stored in the "JobResult" field.
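The worker side of the flow above can be sketched roughly as follows. This is a minimal illustration, not our actual code: the queue name `jobs`, the `AppDbContext`, and the `IJobExecutor` abstraction are all hypothetical names, and it assumes the Azure Functions isolated worker model with EF Core.

```csharp
// Hypothetical queue-triggered worker. Assumes the isolated worker model
// (Microsoft.Azure.Functions.Worker), an EF Core AppDbContext exposing
// a Jobs DbSet, and an IJobExecutor that runs the actual work.
public class JobWorker
{
    private readonly AppDbContext _db;
    private readonly IJobExecutor _executor;

    public JobWorker(AppDbContext db, IJobExecutor executor)
    {
        _db = db;
        _executor = executor;
    }

    [Function("ProcessJob")]
    public async Task Run([QueueTrigger("jobs")] string message)
    {
        var jobId = Guid.Parse(message);
        var job = await _db.Jobs.FindAsync(jobId)
            ?? throw new InvalidOperationException($"Job {jobId} not found");

        job.Status = "InProgress";
        job.Started = DateTimeOffset.UtcNow;
        job.RetryCount++; // 1 on the first attempt, 2 on the first retry, ...
        await _db.SaveChangesAsync();

        try
        {
            job.JobResult = await _executor.ExecuteAsync(job.JobDetail);
            job.Status = "Completed";
        }
        catch (Exception ex)
        {
            job.Status = "Failed";
            job.ErrorMessage = ex.Message;
            throw; // rethrow so the runtime retries, up to maxDequeueCount
        }
        finally
        {
            await _db.SaveChangesAsync();
        }
    }
}
```

Rethrowing on failure is what lets the queue trigger's built-in retry do its job; after the configured number of dequeues, the runtime moves the message to the poison queue on its own.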
It is easy to query status and download results. The user sends in the "TaskId", and a single database query returns the required information.
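The status query amounts to a group-by over the jobs of one task. Here is a minimal sketch over an in-memory collection (in the real service it would be a single grouped database query); `JobRow` and `CountByStatus` are illustrative names, not from the actual codebase.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative slim projection of the Job table.
public record JobRow(Guid TaskId, string Status);

public static class JobStatusQuery
{
    // Returns counts per status (Pending/InProgress/Completed/Failed)
    // for one task, so the caller can render overall progress.
    public static Dictionary<string, int> CountByStatus(
        IEnumerable<JobRow> jobs, Guid taskId) =>
        jobs.Where(j => j.TaskId == taskId)
            .GroupBy(j => j.Status)
            .ToDictionary(g => g.Key, g => g.Count());
}
```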
Is it better than the Hangfire approach? No idea. There are trade-offs to both; I picked the one that suited me best at that point in time.
Always analyze your requirements and make them as concrete as possible. Do not generalize too early. Probably do not generalize at all.