Hosted Graphite – Graphite as a service, with StatsD and Grafana dashboards
Start page – collectd – The system statistics collection daemon
mozilla/crontabber
Monitoring at Spotify: The Story So Far | Labs
This is the first in a two-part series about monitoring at Spotify. In it, I'll discuss our history, the challenges we faced, and how we approached them. Operational monitoring at Spotify started life as a combination of two systems: Zabbix, and a homegrown RRD-backed graphing system named "sitemon" that used Munin for collection. Zabbix was owned by our SRE team, while sitemon was run by our backend infrastructure team. In late 2013, we began to put more emphasis on self-service and distributed operational responsibility. We tried to bandage up what we could: our Chief Architect hacked together an in-memory sitemon replacement that could hold roughly one month's worth of metrics under the load at the time. Alerting as a service: alerting was the first problem we took a stab at. We considered developing Zabbix further, but found inspiration at Monitorama EU, where we stumbled upon Riemann and went on to build a library on top of it called Lyceum.
Using monitoring and metrics to learn in development
Monitoring at Spotify: Introducing Heroic | Labs
This is the second part in a series about monitoring at Spotify. In the previous post I discussed our history of operational monitoring; in this part I'll present Heroic, our scalable time series database, which is now free software. Heroic is our in-house time series database, built to address the challenges we faced with near real-time data collection and presentation at scale. We are aware that Elasticsearch has a bad reputation for data safety, so we guard against total failures by being able to rapidly rebuild the entire index from our data pipeline or from Cassandra. A key feature of Heroic is global federation. Every host in our infrastructure runs ffwd, an agent responsible for receiving and forwarding metrics. This setup allows us to rapidly experiment with our service topology. In the backend, everything is stored exactly as it was provided to the agent. All parts of Heroic are now free software; feel free to grab the code on GitHub.
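As a rough illustration of the host-local agent pattern described above, here is a minimal sketch of emitting one metric to a local ffwd agent. It assumes bash (for its `/dev/udp` redirection), a JSON-over-UDP input, and port 19000; the port, field names, and metric key are assumptions, not confirmed details of Spotify's setup, so check your agent's configured inputs before relying on them.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: fire-and-forget one metric at a local agent
# over UDP. Requires bash for the /dev/udp pseudo-device.
send_metric() {
    key=$1
    value=$2
    # The JSON shape and port 19000 are assumptions about the agent's
    # JSON input; adjust to match your ffwd configuration.
    printf '{"type": "metric", "key": "%s", "value": %s}' "$key" "$value" \
        > /dev/udp/127.0.0.1/19000
}

send_metric request-latency-ms 12.5
```

Because the transport is UDP, the sender never blocks on the agent: a single datagram is handed to the kernel and the application moves on, which matches the fire-and-forget role an on-host forwarding agent plays.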
Munin
Linux Performance Analysis in 60,000 Milliseconds
You log in to a Linux server with a performance issue: what do you check in the first minute? At Netflix we have a massive EC2 Linux cloud and numerous performance analysis tools to monitor and investigate its performance, including Atlas for cloud-wide monitoring and Vector for on-demand instance analysis. In this post, the Netflix Performance Engineering team shows the first 60 seconds of an optimized performance investigation at the command line, using standard Linux tools you should have available. In 60 seconds you can get a high-level idea of system resource usage and running processes by running the following ten commands:

uptime
dmesg | tail
vmstat 1
mpstat -P ALL 1
pidstat 1
iostat -xz 1
free -m
sar -n DEV 1
sar -n TCP,ETCP 1
top

Some of these commands require the sysstat package to be installed. The following sections summarize these commands, with examples from a production system, for instance:

$ uptime
 23:51:26 up 21:31, 1 user, load average: 30.02, 26.43, 19.02
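The ten checks above can be wrapped in a small script for convenience. This is a sketch, not part of the Netflix post: it runs each command in order with a short finite sample count (the article uses open-ended `1`-second intervals you stop by hand) and skips any tool that isn't installed, since `vmstat`, `mpstat`, `pidstat`, `iostat`, and `sar` come from the sysstat package.

```shell
#!/bin/sh
# Run the "first 60 seconds" checks in order, skipping missing tools.
# Sample counts (e.g. "1 2") are illustrative; the original commands
# run until interrupted.
first_minute_checks() {
    for cmd in "uptime" "dmesg | tail" "vmstat 1 2" "mpstat -P ALL 1 2" \
               "pidstat 1 2" "iostat -xz 1 2" "free -m" \
               "sar -n DEV 1 2" "sar -n TCP,ETCP 1 2" "top -b -n 1"; do
        tool=${cmd%% *}
        if command -v "$tool" >/dev/null 2>&1; then
            printf '== %s ==\n' "$cmd"
            sh -c "$cmd"
        else
            printf '== %s == (skipped: %s not installed)\n' "$cmd" "$tool"
        fi
    done
}

first_minute_checks
```

Keeping the commands in one list preserves the point of the exercise: a fixed, rehearsed sequence you can run on any box without thinking, rather than an ad hoc investigation.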
Ganglia Monitoring System
Nagios – The Industry Standard in IT Infrastructure Monitoring
[Sam&Max] Munin and email notifications
Having a server is good; taking care of it is better. For a long time I administered servers without worrying much about what could happen to them, and more than once everything crashed because I hadn't anticipated the disaster. A common example: I replaced Apache with Nginx, which logs HTTP access by default. After a few weeks the disk was full, the server died on me, no space left, everything broke, total drama, no way to reboot except in safe mode because the disk was full. I started out with MRTG, which is not great to install but does its job. Installing Munin: Munin consists of two programs. The master (munin) fetches the data and generates the graphs. The node (munin-node) is installed on every server to be monitored, including the master itself if needed, and sends the data to the master. On the master server, for an Ubuntu distribution:

vi /usr/local/nginx/conf.d/munin.conf
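To make the master/node split concrete, here is a minimal sketch of the two configuration files involved. The hostname and addresses are placeholders, and on Ubuntu the files usually live at /etc/munin/munin.conf and /etc/munin/munin-node.conf; adapt to your layout.

```ini
; /etc/munin/munin.conf (master) -- hypothetical host entry:
; the master polls this node and graphs whatever it reports.
[web01.example.com]
    address 192.0.2.10
    use_node_name yes
```

```ini
; /etc/munin/munin-node.conf (on each monitored node) -- allow the
; master's IP to connect; the value is a regular expression.
allow ^192\.0\.2\.1$
```

The master pulls from each node on a cron-driven schedule, so adding a server to the monitoring pool is just a new host entry on the master plus an `allow` line on the node.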