Deepfence Authors: Thomas Legris, Ramanan Ravikumar, Manan Vaghasiya, Shyam Krishnaswamy
ThreatMapper, the fastest growing open-source CNAPP, hunts for threats in production platforms and ranks them by their risk of exploit. It uncovers vulnerable software components, exposed secrets, malware, and deviations from standard security and compliance configurations. ThreatMapper combines agent-based inspection with agentless monitoring for the widest possible threat-detection coverage.
Since the launch of the open source platform eighteen months ago, ThreatMapper has seen massive adoption across a wide variety of public, private and hybrid clouds, bare-metal servers, serverless environments like AWS Fargate, and even Raspberry Pi devices. ThreatMapper adds runtime context, such as network flows, to the thousands of scan results to build ThreatGraph: a rich visualization of the most meaningful and threatening attack paths. This has the potential to reduce the threats found by up to 97%, helping users prioritize remediation of the 3% of threats that are actually exploitable. Some of our users have already installed ThreatMapper on Kubernetes clusters spanning 2,500 nodes, around 20,000 pods and up to 50,000 containers, gaining critical security observability into their risk posture and the ability to respond to threats at runtime.
Since ThreatMapper provides unprecedented visibility across the entire infrastructure, we asked ourselves – how can we meet the demands of users who want to cover 100,000 Kubernetes nodes? Or 100,000 regular EC2 servers? Will it hold if we push the boundary to 200,000 nodes or servers? We went back to the drawing board to take a close look at our technology stack. This post explores those architectural changes and how they have helped Deepfence scale its open source CNAPP to cater to organizations of all sizes without spending an additional dollar on compute. No one should have to pay for fundamental building blocks of cloud security like cloud asset inventory, cloud misconfiguration checks, vulnerability management, sensitive secrets and malware scanning.
ThreatMapper consists of two components – Deepfence Sensors and the Deepfence Console. The sensors are simple, lightweight eBPF probes deployed onto Kubernetes clusters, Docker hosts, and bare-metal servers: any environment where users run their application workloads. They collect the relevant metadata and send it to the Deepfence Console. The Deepfence Console aggregates the metadata from the sensors and maps it back to the various threats, such as vulnerabilities, exposed secrets, malware, and security misconfigurations, to build the ThreatGraph.
A simple representation of the architecture is as follows in Figure 1:
As we can see above, the Deepfence sensors collect metadata around processes, containers, network connections, and send it to the Deepfence console using a REST API. A persistent websocket is used to communicate any control information from the Deepfence Console to the sensors. This persistent connection is used to trigger any scans. Once a trigger is received by the sensor, it gathers the relevant data about the various packages installed, various programs that are currently running, and sends it back to the Deepfence console using the REST API. The scans are then performed on the Deepfence console. If any SIEM is configured in the system, Celery jobs are spawned to send the results of the scans to the SIEM tools in an asynchronous manner.
As we started to push the boundaries of scale, we observed:
We combined our learnings from the previous architecture, and our future considerations to build the new architecture as depicted below:
As we can see above, the first major design decision that we took was to move towards a Graph Database (DB) as our primary datastore. Choosing a Graph Database as the foundation for all our operations was a strategic decision rooted in the fundamental understanding of the nature of users' infrastructure. In the world of cybersecurity, applications, network, data and identities are not isolated entities, but an interconnected web of information. Users' infrastructure, comprising a multitude of compute or cloud services, is inherently a graph. Nodes represent services while edges signify various relationships - be it a compute service tied to a particular user, or communication between different applications, deployed as pods or containers. Embracing a Graph DB is akin to reflecting the organic, interconnected structure of the digital environment within which we operate.
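To make the "infrastructure is a graph" idea concrete, here is a minimal sketch in plain Python rather than Neo4j. Every name in it (the `EDGES` list, the `CONNECTS_TO` and `RUNS_ON` relationship labels, the pod and node identifiers) is a made-up assumption for illustration and not ThreatMapper's actual schema. It models resources as nodes and relationships as edges, then uses a breadth-first search to find which flagged workloads are reachable from the internet over observed network connections.

```python
from collections import deque

# Hypothetical inventory; these names are illustrative, not ThreatMapper's schema.
# Each edge is (source, relationship, target).
EDGES = [
    ("internet", "CONNECTS_TO", "pod/frontend"),
    ("pod/frontend", "CONNECTS_TO", "pod/orders"),
    ("pod/orders", "CONNECTS_TO", "pod/db"),
    ("pod/batch", "CONNECTS_TO", "pod/db"),
    ("pod/frontend", "RUNS_ON", "node/worker-1"),
    ("pod/db", "RUNS_ON", "node/worker-2"),
]

# Workloads that a scan flagged as vulnerable (made-up data).
VULNERABLE = {"pod/db", "pod/batch"}


def build_adjacency(edges, relationship):
    """Keep only edges of one relationship type, as source -> [targets]."""
    adjacency = {}
    for source, rel, target in edges:
        if rel == relationship:
            adjacency.setdefault(source, []).append(target)
    return adjacency


def shortest_path(adjacency, start, goal):
    """Breadth-first search; returns the shortest path as a list, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in adjacency.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


def attack_paths(edges, vulnerable, entry="internet"):
    """Map each vulnerable node reachable from `entry` to its shortest path."""
    adjacency = build_adjacency(edges, "CONNECTS_TO")
    paths = {}
    for target in sorted(vulnerable):
        path = shortest_path(adjacency, entry, target)
        if path is not None:
            paths[target] = path
    return paths


if __name__ == "__main__":
    for target, path in attack_paths(EDGES, VULNERABLE).items():
        print(target, "is reachable via", " -> ".join(path))
```

In this toy data, `pod/db` is reachable from the internet through two hops, while `pod/batch` is flagged by the scan but has no inbound connection, so it drops out of the result. A graph database expresses the same kind of traversal as a query over stored nodes and relationships instead of an in-memory search.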
Choosing Neo4j, an open-source Graph DB, further enriches our capabilities, as it embodies the ethos of collaboration, continuous enhancement, and transparency, all of which are crucial in the ever-evolving landscape of cybersecurity. Neo4j's robustness and adaptability equip us to handle myriad use cases and capabilities, bolstering the future potential of our CNAPP solution. The graph-centric data model of Neo4j enables us to intuitively organize and retrieve data, thereby improving efficiency and scalability. With Neo4j, we can explore the potential of interconnected data and draw insights that empower our users to fortify their security posture. This reflects our commitment to harnessing advanced technologies to deliver a sophisticated, dynamic, and user-centric cybersecurity solution.
Putting it all together: the Deepfence sensors collect metadata about processes, containers, and network connections, and send it to the Deepfence Console using a REST API. The response to that REST call now carries any control information from the Deepfence Console back to the sensor, for example a trigger to start a scan, which replaces the persistent websocket of the previous design. Once the sensor receives a trigger, it gathers the relevant data about the installed packages and the programs that are currently running, and sends it back to the Deepfence Console using the REST API. The scans are then performed on the Deepfence Console. If a SIEM is configured in the system, worker jobs are spawned to send the scan results to the SIEM tools.
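The idea of carrying control information in the REST response can be sketched with a small in-memory simulation. This is not ThreatMapper's API: the `Console` and `Sensor` classes, the `handle_report` and `request_scan` methods, and the shape of the action dictionaries are all hypothetical, and no real HTTP is involved. The sketch only shows the pattern, in which the console queues actions per sensor and hands them back the next time that sensor reports.

```python
from collections import defaultdict, deque


class Console:
    """Toy in-memory stand-in for the console side of a report endpoint."""

    def __init__(self):
        self.reports = defaultdict(list)   # sensor_id -> metadata received so far
        self.pending = defaultdict(deque)  # sensor_id -> queued control actions

    def request_scan(self, sensor_id, scan_type):
        """Queue a scan trigger; it is delivered on the sensor's next report."""
        self.pending[sensor_id].append({"action": "start_scan", "type": scan_type})

    def handle_report(self, sensor_id, metadata):
        """What a REST handler might do: store the report, return queued actions."""
        self.reports[sensor_id].append(metadata)
        actions = list(self.pending[sensor_id])
        self.pending[sensor_id].clear()
        return {"status": "ok", "actions": actions}


class Sensor:
    """Toy sensor that reports metadata and acts on the response."""

    def __init__(self, sensor_id, console):
        self.sensor_id = sensor_id
        self.console = console
        self.started_scans = []

    def report_once(self):
        metadata = {"processes": 42, "connections": 7}  # placeholder values
        response = self.console.handle_report(self.sensor_id, metadata)
        for action in response["actions"]:
            if action["action"] == "start_scan":
                self.started_scans.append(action["type"])
        return response


if __name__ == "__main__":
    console = Console()
    sensor = Sensor("host-1", console)
    sensor.report_once()                            # nothing queued yet
    console.request_scan("host-1", "vulnerability")
    print(sensor.report_once()["actions"])          # trigger arrives with the response
```

The trade-off in a design like this is that the console no longer holds one open connection per sensor, while a trigger may wait until the sensor's next report before it is delivered.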
Since the focus of the new architecture was to build for scale:
With the Deepfence Console deployed on a single EC2 server with 16 cores and 32GB of memory, we were previously able to handle up to 7,000 servers. On the same hardware, we can now scale to 100,000 Kubernetes nodes or 100,000 EC2 servers.
For ease of navigation at scale, we also provide a tabular representation of the infrastructure.
The new CLI built as a part of the revamped architecture in action – vulnerability scans when 100,000 servers are being monitored for threats.
We now deploy the Deepfence Console on a three-node Kubernetes cluster, where each node has 8 CPU cores and 32GB of memory.
Leveraging the power of our solution, we have revolutionized the scale at which we monitor servers. While our single-node deployment capably manages 100,000 servers, we've taken it a step further. By merely augmenting compute resources, we've seamlessly scaled our console horizontally, thereby tripling our capacity. Today, we're effortlessly monitoring an astounding 300,000 servers, a significant leap from our previous capacity of 40,000 servers. This feat underscores the formidable scalability of our solution, ready to grow as your needs expand.
In conclusion, scaling a security tool like ThreatMapper to monitor hundreds of thousands of servers is no small feat. This exploration of our revamped architecture has demonstrated the capability of our technology to meet the needs of large-scale infrastructures. By transforming our backend and leveraging modern technologies like Golang, Neo4j, and Kafka, we've made ThreatMapper an even more powerful, efficient, and scalable CNAPP solution. In a few weeks, you will see a V2-tagged release in production, with an updated UI that reflects these architecture changes, in an enterprise-grade launch, all within our open source platform ThreatMapper! Stay tuned.
Not only does this provide immense value for organizations managing vast amounts of servers, but it also empowers users with greater insights and control over their infrastructure security.
Remember, it's not just about spotting vulnerabilities; it's about understanding the risk context and focusing remediation efforts on those threats that could truly impact your infrastructure. With ThreatMapper, you'll navigate your cybersecurity landscape with unparalleled vision and confidence, regardless of your infrastructure's scale. This new architecture also makes it easier for the community to implement various use cases. Key among those on our roadmap are:
Though our shift from Elasticsearch momentarily hinders free text search capabilities, it paves the way for an exciting integration - a self-hosted Security Specific Small Language Model (all in open-source)! This feature will not only function as a search interface but also enable a deeper correlation of the alerts that we detect in the infrastructure. We're turning temporary limitations into stepping stones for significant enhancements, staying true to our commitment to constant innovation and user-centric development. In the upcoming series of blog posts, we will also discuss the ability to monitor changes in security configurations of cloud accounts, without reaching the API limits of the cloud providers.
Join us on this journey to redefine cybersecurity at scale. Embrace the power of ThreatMapper to secure your infrastructure and to transform your cybersecurity posture from reactive to proactive. Make cybersecurity your strength, not your bottleneck.
As always, we welcome interested users to join our community Slack for deeper technical discussions. The code is also available on GitHub to download and hack on while we tag a formal V2 release over the next few weeks. We always welcome contributions from the community. Finally, if you are interested in working on challenging scale problems involving graphs, and hacking into the exciting world of eBPF, let's talk!