Yet another AVaaS. This time with Azure Functions!
Problem Statement
For as long as I can remember, architects have been looking for ways to ensure that binary payloads entering the security perimeter of the systems they design are virus-free. And for just as long, there has been no clear path to adding this piece to their lovely big picture. If the customer you work for is already signed up for an enterprise AV solution, it is not too bad: you can just put another box under the wings of the existing AV scanner. But what if it is NOT there?
Amazingly enough, even in the cloud era this problem has not been sorted out. Giants like AWS or Azure, on whose shoulders we are supposed to stand, have not come up with an AV as a Service (AVaaS) solution that you could seamlessly integrate with their offerings. We should keep asking why, because that is one of the stimuli for the cloud providers to eventually start doing so. Given the ever-rising importance and criticality of cybersecurity, it should be less and less questionable that such a layer of value will eventually be commoditised (Azure brings MS Antimalware to the table, but it is primarily targeted at IaaS).
Meanwhile, architects have persistently attempted to build their own solutions, relying on established software made available through cloud marketplaces or on widely accepted packages. One of those packages is ClamAV. To use a funny metaphor, ClamAV has been like pasta served with different sauces: Kubernetes ClamAV, AWS Lambda ClamAV, IaaS ClamAV, ClamAV with scan on access, etc. It is all good till you come across a brand new sauce you need to mix your pasta with :) It should be exciting, should it not?
Recently, on an Azure project, I googled Azure and ClamAV to seek advice on running ClamAV in Azure and was surprised to see there was no well-defined way to integrate ClamAV with PaaS or FaaS services. It just so happens that my project wants to steer away from IaaS or pure containers wherever possible. If you are already invested in K8S, it is probably wiser to take advantage of it. If you are not, then you definitely want to avoid standing up a cluster just to deliver your AV solution, given its complexity. The previously mentioned marketplace options often do not provide enough flexibility. We wanted to focus on Azure Functions since they provide a very simple and clear programming model, one we have been used to for a while, where a lot of concerns come out of the box, e.g. a variety of triggers (HTTP, Timer, etc.) as well as observability.
We found a couple of articles outlining, or proving the concept of, Azure Functions reacting to Blob triggers and subsequently scanning a blob, but they fell short on the following:
- how to keep up to date with recent AV definitions: you need a clear way of keeping your AV scanner up to date, otherwise its value diminishes over time
- observability integration: you need a way to understand what happens to your solution at runtime on various events
- performance: ideally your scan is quick, i.e. it relies on a ClamAV daemon that is warmed up and waiting for requests rather than spawning the whole ClamAV process from scratch on every scan. The latter entails the overhead of loading AV definitions, which takes around 20–30 seconds. While possible, it is not the level of efficiency I usually pursue in my designs.
- scalability: you would like to be able to scale your solution. Ideally horizontally, but vertically is wholeheartedly welcome as well.
- reusability: "it always starts with one, right?" I would like to be able to run my AVaaS independently of other services. By independently I mean I would like to deploy it as a separate unit so that it can be exposed to different kinds of consumers.
- security: whatever it is and wherever it eventually gets deployed, it should implement some reasonable security model.
- cost effectiveness: the reusability of the solution should boost cost effectiveness, because a single deployment can be shared across multiple environments.
TLDR
I will be honest with you. Since there is no one-size-fits-all solution, I will focus here on outlining a few constraints which, I believe, are fair assumptions for many projects.
I want to deliver a secure, performant, scalable and cost-effective Azure Functions-based AVaaS scanner, wrapping the best of the ClamAV core libraries, that I can reuse across multiple consumers.
I want AVaaS to keep itself up to date without manual intervention.
I want to have a solid view of the health of the solution while it is running.
A single payload subject to scanning must be no larger than 100MB.
Solution
The above diagram shows a full-blown architecture with Azure API Management (APIM) placed in front of the Azure Function App. APIM allows for better security segregation around different APIs and for features such as traffic control. It is not part of the AVaaS, though.
The Azure Function App used to host AVaaS runs on an App Service Plan. I wish I could use the Consumption plan for better cost management, but it remains outside the solution space given its limitations:
- a custom Docker image cannot be deployed to it
- mounting a Storage Account as a file share is still in preview
Why do those 2 points matter?
The former is required to have ClamAV available to the function runtime. The latter is needed to have one place where AV definitions are stored so they can be shared across multiple instances of Azure Function Hosts.
It will definitely be a good idea to revisit the solution once Microsoft removes the above shortcomings (or AVaaS becomes a thing of its own in Azure/AWS).
Implementation details
With App Service Plan and Docker, your Function App is packed into a container and deployed on an Azure Function Host. A connection between the Azure Function App and a container registry (Azure Container Registry in my case) facilitates that. Each App Service Plan instance is mapped to a single container instance. By default there is no shared storage between App Service Plan instances; however, it can be enabled by flipping the WEBSITES_ENABLE_APP_SERVICE_STORAGE switch to true. This mounts the shared disk space into the /home folder of the container instance. The path /home/site/wwwroot is where the application code (NodeJS code) must be deployed for the application to run correctly. There is one place where the code is deployed and as many Function Host processes serving that code as the number of App Service Plan instances.
In the solution, during deployment, the bundled AV definitions are copied into /home/clamav so that all ClamAV daemon instances can read them from a single source of truth rather than each having to maintain a separate copy. The rationale behind that choice was to drive the refresh of the definitions through a Cron-triggered function (basically calling the freshclam command via a shell invocation) rather than having a runtime-level daemon or a crontab process refresh them, because it gives me full insight into the operational status of the update. The ClamAV daemon on each Function Host instance keeps checking the AV definitions and reloads them when it detects a change.
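To make the Cron-triggered refresh more tangible, here is a minimal sketch of what such a function could look like. It is not the exact code from the project: the names `refreshDefinitions` and `timerHandler` are hypothetical, the handler shape assumes the classic Node.js programming model for Azure Functions (schedule defined in function.json), and I assume freshclam is on the PATH inside the container and is pointed at /home/clamav through its `--datadir` option.

```javascript
const { execFile } = require('child_process');

// Assumption: definitions live on the shared /home mount described above.
const DEFINITIONS_DIR = '/home/clamav';

// Runs freshclam once. Resolves with its stdout, rejects if execFile reports
// an error (for example a non-zero exit code or a missing binary).
// `run` is injectable so the function can be exercised without ClamAV installed.
function refreshDefinitions(run = execFile) {
  return new Promise((resolve, reject) => {
    run('freshclam', [`--datadir=${DEFINITIONS_DIR}`], (error, stdout, stderr) => {
      if (error) {
        reject(new Error(`freshclam failed: ${stderr || error.message}`));
      } else {
        resolve(stdout);
      }
    });
  });
}

// Timer-triggered entry point (hypothetical name). Logging both outcomes is
// what gives the operational insight into the update status.
async function timerHandler(context) {
  try {
    const output = await refreshDefinitions();
    context.log('AV definitions refresh succeeded', output);
  } catch (err) {
    context.log.error('AV definitions refresh failed', err.message);
    throw err; // let the Functions runtime record the invocation as failed
  }
}

module.exports = timerHandler;
```

Because the success and failure branches both end up in the function logs, the refresh shows up in the same telemetry as every other invocation.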
The solution exposes one extra endpoint, a DaemonHealth HTTP endpoint, which responds with the status of the ClamAV daemon on a given instance and lets the App Service load balancer remove an unhealthy instance.
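A health endpoint of this kind could be sketched roughly as follows. The sketch relies on clamd's `PING` command, which the daemon answers with `PONG`. Everything else is an assumption for illustration: the function names are hypothetical, clamd is assumed to listen on TCP 127.0.0.1:3310 (a Unix socket would work just as well), and the 200/503 mapping assumes the load balancer treats non-2xx responses as unhealthy.

```javascript
const net = require('net');

// True only when clamd answered the PING command with PONG.
function isPong(reply) {
  return reply.replace(/\0/g, '').trim() === 'PONG';
}

// Maps daemon health onto an HTTP response for the health probe.
function healthResponse(healthy) {
  return healthy
    ? { status: 200, body: 'ClamAV daemon is up' }
    : { status: 503, body: 'ClamAV daemon is not responding' };
}

// Sends the null-terminated PING command to clamd and resolves with a boolean.
// Assumption: clamd listens on TCP host/port; both defaults are illustrative.
function pingDaemon(host = '127.0.0.1', port = 3310, timeoutMs = 2000) {
  return new Promise((resolve) => {
    const socket = net.createConnection({ host, port });
    let reply = '';
    socket.setTimeout(timeoutMs);
    socket.on('connect', () => socket.write('zPING\0'));
    socket.on('data', (chunk) => { reply += chunk.toString('utf8'); });
    socket.on('end', () => resolve(isPong(reply)));
    socket.on('timeout', () => { socket.destroy(); resolve(false); });
    socket.on('error', () => resolve(false));
  });
}

// HTTP-triggered entry point (hypothetical name), classic Node.js model.
async function daemonHealthHandler(context) {
  context.res = healthResponse(await pingDaemon());
}

module.exports = daemonHealthHandler;
```

Any connection error or timeout is treated as unhealthy, so a daemon that is still loading its definitions would simply be reported as not ready.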
Reusability — HTTP Endpoint
Consumers call the service by passing the base64-encoded binary payload to the HTTP trigger and receiving a response. I leave it up to your imagination how detailed the responses might be; they can be tailored to one’s specification. Once up, the service can be called by multiple consumers, both synchronous and asynchronous ones.
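On the receiving side, the first step is turning the base64 text back into bytes and enforcing the size constraint from the TLDR. A minimal sketch, assuming the request body is the bare base64 string and reading "100MB" as 100 * 1024 * 1024 bytes; `decodePayload` and the shape of its result are hypothetical:

```javascript
// Assumption: "100MB" is read as 100 * 1024 * 1024 bytes.
const MAX_PAYLOAD_BYTES = 100 * 1024 * 1024;
const BASE64_PATTERN = /^[A-Za-z0-9+/]*={0,2}$/;

// Validates and decodes a base64 payload. Returns either
// { ok: true, data: Buffer } or { ok: false, status, reason }.
function decodePayload(base64Text, limitBytes = MAX_PAYLOAD_BYTES) {
  if (typeof base64Text !== 'string' || base64Text.length === 0) {
    return { ok: false, status: 400, reason: 'payload must be a non-empty base64 string' };
  }
  if (base64Text.length % 4 !== 0) {
    return { ok: false, status: 400, reason: 'payload is not valid base64' };
  }
  // Every 4 base64 characters carry 3 bytes, minus the padding at the end,
  // so the decoded size is known before allocating anything.
  const padding = base64Text.endsWith('==') ? 2 : base64Text.endsWith('=') ? 1 : 0;
  const decodedBytes = (base64Text.length / 4) * 3 - padding;
  if (decodedBytes > limitBytes) {
    return { ok: false, status: 413, reason: `payload exceeds ${limitBytes} bytes` };
  }
  if (!BASE64_PATTERN.test(base64Text)) {
    return { ok: false, status: 400, reason: 'payload is not valid base64' };
  }
  return { ok: true, data: Buffer.from(base64Text, 'base64') };
}
```

For example, `decodePayload('aGVsbG8=')` yields a buffer holding `hello`, while an oversized payload is rejected with a 413 before it is ever decoded.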
Security
Since it is an HTTP endpoint, it will obviously use the latest and greatest TLS for in-transit security. The payload is scanned in a stream-wise fashion, without being stored temporarily, so encryption at rest is not really an issue.
Access security is based on an OOTB feature of Azure Functions (and every App Service), namely OAuth 2.0. In my use case I use Azure AD authentication: the project already uses a lot of Azure AD App Registrations and Azure AD Service Principals, so it fits in well.
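From a consumer's point of view, a service-to-service call would typically start with the OAuth 2.0 client credentials grant against Azure AD. The sketch below only builds the requests, so it can be wired to whatever HTTP client the consumer uses. The helper names and all identifiers passed in (tenant, client, App ID URI) are hypothetical placeholders, and I assume the v2.0 token endpoint with a `/.default` scope.

```javascript
// Builds the token request for the client credentials grant.
// All four inputs are placeholders for values from your own App Registrations.
function buildTokenRequest({ tenantId, clientId, clientSecret, appIdUri }) {
  const body = new URLSearchParams({
    grant_type: 'client_credentials',
    client_id: clientId,
    client_secret: clientSecret,
    // ".default" requests the application permissions already granted to the client.
    scope: `${appIdUri}/.default`,
  }).toString();
  return {
    url: `https://login.microsoftonline.com/${encodeURIComponent(tenantId)}/oauth2/v2.0/token`,
    method: 'POST',
    headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
    body,
  };
}

// Headers for the subsequent scan call, carrying the access token
// returned by the token endpoint.
function bearerHeaders(accessToken) {
  return { Authorization: `Bearer ${accessToken}`, 'Content-Type': 'text/plain' };
}
```

The access token from the first response then goes into the `Authorization` header of every scan request, and the Function App's built-in authentication validates it before the function code runs.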
Scalability
The solution is based on an Azure App Service Plan, so it can scale using different scaling policies: manual or scheduled. Recommending an auto-scaling policy would be nothing more than a sales pitch. It is theoretically possible; however, it takes a while for a Docker container to be started by the underlying App Service infrastructure. In my case it is a couple of minutes, so far from ideal.
Performance
The whole idea is based on the ClamAV daemon running continuously and accepting requests as they arrive, so it is very performant at scanning files, even ones of a decent size: scanning an average file takes less than a second.
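Talking to a warmed-up daemon usually means speaking clamd's `INSTREAM` command: the command itself, then the payload as chunks each prefixed with a 4-byte big-endian length, then a zero-length chunk to finish. The sketch below shows only the framing and the reply parsing, under the assumption that each frame is written to the clamd socket in turn. The function names, the 64 KiB chunk size and the result shape are my own illustrative choices.

```javascript
// Illustrative chunk size; any size within clamd's stream limit would do.
const CHUNK_SIZE = 64 * 1024;

// Yields the byte frames that make up one INSTREAM conversation.
function* instreamFrames(payload, chunkSize = CHUNK_SIZE) {
  yield Buffer.from('zINSTREAM\0'); // null-terminated command
  for (let offset = 0; offset < payload.length; offset += chunkSize) {
    const chunk = payload.subarray(offset, offset + chunkSize);
    const length = Buffer.alloc(4);
    length.writeUInt32BE(chunk.length, 0); // 4-byte big-endian chunk length
    yield Buffer.concat([length, chunk]);
  }
  yield Buffer.alloc(4); // zero-length chunk terminates the stream
}

// Interprets clamd's reply to an INSTREAM command.
function parseScanReply(reply) {
  const text = reply.replace(/\0/g, '').trim();
  if (text === 'stream: OK') {
    return { status: 'clean' };
  }
  const found = /^stream: (.+) FOUND$/.exec(text);
  if (found) {
    return { status: 'infected', signature: found[1] };
  }
  return { status: 'error', detail: text };
}
```

Since nothing is written to disk and the definitions are already loaded in the daemon's memory, the per-scan cost is essentially the socket round trip plus the scan itself. One thing to keep in mind is that clamd's StreamMaxLength setting caps the total stream size, so it has to accommodate the largest payload you intend to accept.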
Observability
As Azure Functions integrate well with App Insights and Azure Monitor, you have all the tools at hand to understand what happens with your service. Apart from the typical request/response telemetry, you would certainly like to know about:
- your AV definitions being updated
- your AV definitions failing to update
Cost
The primary cost driver here is the tier of the App Service Plan. For non-production deployments B1 or B2 will be enough, which would pencil it in at around $14–26 per App Service Plan instance. For production deployments, S1/P1v2/P1v3 would suit you better for their auto-scaling policies (I recommend scheduled ones over system-metric-based ones for practical reasons). It basically depends on the amount of RAM you need. Production would take you as high as $73, $82 or $127 per App Service Plan instance, respectively. Reservations are one of the tools to bring the price down, but do test it out against your actual needs rather than ramping it up too heavily.
Reflections and gotchas
- Since the code is deployed to a shared space, one instance copying it may overwrite another's copy. It is important to ensure the folder the code is copied to is in sync with the latest version of the code. I relied on rsync to ensure that (rsync was installed as an extra on top of the existing Docker layers).
- When planning the solution out, I wanted to keep each Azure Function Host process running the code bundled with it, but that is not currently possible. There is a GitHub issue, https://github.com/Azure/Azure-Functions/issues/1507, open for that. Resolving it would make the rsync operation redundant.
- ClamAV definitions are part of the Docker image. Once they have been copied over to the shared storage, subsequent deployments will not overwrite them. From that moment on, the definitions are updated only through periodic freshclam invocations from the Cron-triggered Azure Function. This behaviour is open to tweaking: the definitions can be rsync’ed like the application code if need be.
- This piece can easily be reworked to integrate with Blob triggers. It will work perfectly fine; however, it is then likely to be tied to a specific use case. Here we wanted to go down the route of infrastructurally decoupling AV scanning from any specific use case.
That is all, folks. Hopefully I have managed to inspire you at least a bit on how to build AVaaS with Azure Functions using ClamAV. As you could read, there are a couple of tweaks one can still apply. It would not be fun if this article covered it all, and that is probably not possible anyway!