Moving to Open Source for Application Performance Management

Application Performance Management (APM) is the monitoring and management of the performance and availability of software applications. What APM means can vary between people and businesses. The most basic and most important reason to monitor your infrastructure and applications is to achieve 100% uptime for your customers and stakeholders. Many tools have been built over time to help developers do exactly that.


Different organizations use different tools depending on their requirements. With so many solutions available, it is hard to pick one, since each has its pros and cons. At Shadowfax, we have tried a few as well. As our application traffic increased over time, we wanted to set up more detailed alerts, for example when the error count of our APIs or the average response time of our tasks rises above a certain threshold.

A great start with New Relic

During our early days, we focused more on building features for our customers and for internal processes, and decided to start with New Relic APM Lite as our monitoring tool. It helped us monitor our complete application performance, and we fixed a lot of issues to improve our overall response times. With a trial account, we were allowed to monitor our whole application, i.e., both web and non-web components.


As our application grew and our trial period with New Relic ended, we started to miss a lot of insights. We had no way to keep track of our servers, and our production servers would go down without our knowledge. Production issues were mostly reported by our on-ground team when their applications stopped working. Even tracking just the disk usage was hard, and running out of disk caused downtime multiple times.

Going back to basics, we set up parsing of our server logs and inspected them each time something bad happened. Some of the usual problems that happened but went unreported included the following:

Failures in computational resources: spikes in CPU utilization, memory issues, disk usage climbing to 100%, etc.
RabbitMQ queues unable to receive or delegate tasks
Slow database queries clogging the connection pool and bringing the application down
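
The kind of log inspection described above can be sketched in a few lines of Python. This is only an illustration: the log line shape (method, path, status, request time in seconds) is an assumption and not our actual Nginx format, and the names LINE_PATTERN and summarize_access_log are hypothetical.

```python
import re
from collections import Counter

# Assumed line shape: '<method> <path> <status> <request_time_seconds>',
# e.g. 'GET /api/orders 200 0.123'. Real Nginx log formats carry more fields.
LINE_PATTERN = re.compile(
    r"^(?P<method>[A-Z]+) (?P<path>\S+) (?P<status>\d{3}) (?P<seconds>\d+\.\d+)$"
)


def summarize_access_log(lines):
    """Count 5xx responses per path and compute the mean request time."""
    errors = Counter()
    total_seconds = 0.0
    matched = 0
    for line in lines:
        match = LINE_PATTERN.match(line.strip())
        if match is None:
            continue  # skip lines that do not fit the assumed format
        matched += 1
        total_seconds += float(match.group("seconds"))
        if match.group("status").startswith("5"):
            errors[match.group("path")] += 1
    mean_seconds = total_seconds / matched if matched else 0.0
    return errors, mean_seconds


# Made-up sample lines, for illustration only.
sample = [
    "GET /api/orders 200 0.100",
    "POST /api/orders 502 0.300",
    "GET /api/riders 500 0.200",
    "garbage line",
]
errors, mean_seconds = summarize_access_log(sample)
```

A script like this only tells you what went wrong after the fact, which is exactly why it was not enough for us.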

Visualization and Debugging with ELK

As our infrastructure grew, we started using ELK (Elasticsearch, Logstash, Kibana) for debugging production issues. We moved to a central RabbitMQ, centralized our Celery nodes, and created dashboards to monitor Nginx logs and MQTT stats, along with visualizations for team-related metrics. At this point we were using New Relic APM Lite along with Nagios and ELK.

What changed:

Great Visualizations and Dashboards
Central Logging system to debug production issues
We stopped using Flower for monitoring Celery workers


Gaps that remained

X-Pack did not offer alerting or authentication on the Basic or open-source plans
Multiple monitoring tools (Nagios, ELK, Sentry, New Relic APM Lite, Flower) are hard to maintain
Gathering data and debugging production issues across multiple tools is tedious

With multiple monitoring tools to maintain, we wanted to upgrade our New Relic subscription plan and stop worrying. However, the pricing stopped us from doing it, and we decided to find an open-source solution.

Beyond the doors of open source

After a little trial and error, we decided to use Graphite with StatsD and collectd. With many collectd plugins already available and easy integration of StatsD with Django, the transition was very easy. We used collectd to gather server metrics, with plugins like collectd-rabbitmq and redis-collectd-plugin. To visualize our time-series data for application and analytics, we used Grafana, which has a better visualization component than Graphite.
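
To give a feel for why the StatsD side is so easy to integrate, here is a minimal sketch of what a StatsD client does under the hood: it formats a short text line and sends it over UDP. This is not our actual Django integration. The helper names (format_metric, send_metric), the metric names, and the host are assumptions made up for illustration; port 8125 is the conventional StatsD port.

```python
import socket


def format_metric(name, value, metric_type):
    """Build one StatsD line: '<name>:<value>|<type>'.

    Common types are 'c' (counter), 'ms' (timer) and 'g' (gauge).
    """
    return f"{name}:{value}|{metric_type}"


def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send to a StatsD daemon (host is an assumption)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# Hypothetical metric names, for illustration only.
counter_line = format_metric("api.orders.errors", 1, "c")
timer_line = format_metric("tasks.assign_rider.duration", 320, "ms")
```

Calling send_metric(counter_line) would hand the line to a local StatsD daemon, which aggregates it and flushes the result to Graphite. Because the transport is UDP, a missing or slow daemon does not block the application.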

We also added authentication to ELK and used a self-hosted version of Sentry.

What we achieved:

One tool to monitor our complete infrastructure
A better context for our systems
Visualizing aggregated data over time helps our decisions in tweaking our stack
Enabling developers to add metrics for monitoring as per their needs
Thanks to the ease of use, individuals across teams were able to create dashboards and alerts for their own use cases
Confidence in decisions on scaling infrastructure and troubleshooting
Customizing the different components used for monitoring as per our needs
Clear separation of metrics from our production, staging, and demo environments
Cost effectiveness
It was fun to set up our own thing
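
One simple way to get the separation between environments mentioned above is to prefix every metric name with the environment, since Graphite organizes metrics as dot-separated paths. The sketch below is an assumption about how such a naming helper could look, not our exact scheme; metric_path and the example names are hypothetical.

```python
ENVIRONMENTS = ("production", "staging", "demo")


def metric_path(environment, *parts):
    """Build a dot-separated Graphite-style path prefixed by the environment."""
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")
    # Dots separate path levels, so replace them (and spaces) inside a part.
    cleaned = [
        part.strip().lower().replace(" ", "_").replace(".", "_")
        for part in parts
    ]
    return ".".join([environment, *cleaned])


# Hypothetical metric names, for illustration only.
staging_errors = metric_path("staging", "api", "orders", "errors")
production_queue = metric_path("production", "celery", "queue length")
```

With names like these, a dashboard or alert can target one environment by its prefix without picking up data from the others.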

What’s next

There is always a lot to refine, and over time we will move towards clustering Graphite to handle more data. We also plan to stop using Elasticsearch as a Grafana data source, as alerting on it is still not available in the current version of Grafana.
