Monitorama Conference - Baltimore 2019
- Dashboard Renaissance - Cory from Splunk
- Book: The Design of Everyday Things by Don Norman
- What’s the goal/purpose of this dashboard?
- what action to take?
- most important things in your dashboard should be on the top left and least important on bottom right
- What to include
- RED and USE techniques
- prefer symptoms over causes
- Pre-attentive processing - position, angle/slope, size/length, volume, color/density
- What chart type
- Heatmaps show outliers
- Color saturation is a low-accuracy encoding
- Bar charts - better for comparison of a few values
- Scales, Units, Norms, Labels
- https://www.usability.gov/what-and-why/visual-design.html - accessibility
- Before, During, and After Chaos - Nora Jones @nora_js
- Different phases of Chaos Engineering
- Chaos Engineering
- O’Reilly e-book
- AWS re:Invent 2017 Nora Jones - YouTube video
- obscure hindsight bias
- Serial position effect
- https://www.oreilly.com/library/view/velocity-conference-2017/9781491976265/video311370.html
- Logs, Metrics, endpoints - Bryan Liles - Tanzu Build, VMware
- Logging best practices
- Best format: JSON. (Jsonnet?). slf4j for json format?
- log to stdout
- syslog - logs at scale
- add context to your log messages
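
The logging practices above (JSON format, stdout, context on every message) could be sketched roughly as follows using only the Python standard library. The `JsonFormatter` class, the `build_logger` helper, and the `context` key passed through `extra` are hypothetical names chosen for this sketch, not something from the talk.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Assumption: callers attach context via extra={"context": {...}}.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(name, stream=sys.stdout):
    """Return a logger that writes JSON lines to the given stream (stdout by default)."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate lines from the root logger
    return logger


if __name__ == "__main__":
    log = build_logger("checkout")
    log.info("payment accepted", extra={"context": {"order_id": "o-123"}})
```

Writing to stdout leaves shipping and rotation to the platform; the context fields become top-level JSON keys so they can be filtered on later.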
- Metrics
- USE (utilization, saturation, errors)
- RED (rate, errors, duration)
- Four Golden Signals (latency, traffic, errors, saturation) - this may work better at scale
- Best practices
- OpenMetrics/Prometheus
- Keep the number of metrics to manageable size (may be 100s, not 1000s)
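
As a rough illustration of the RED method mentioned above, the three numbers can be derived from a window of request records. This is a minimal sketch: the `Request` type, the `red_summary` function, and the choice of a nearest-rank 95th percentile for "duration" are assumptions made here, not details from the talk.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    ok: bool


def red_summary(requests, window_seconds):
    """Compute rate, error ratio, and p95 duration for one time window."""
    total = len(requests)
    if total == 0:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p95_ms": 0.0}
    errors = sum(1 for r in requests if not r.ok)
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank percentile: the value at position ceil(0.95 * n), 1-indexed.
    p95 = durations[math.ceil(0.95 * total) - 1]
    return {
        "rate_per_s": total / window_seconds,
        "error_ratio": errors / total,
        "p95_ms": p95,
    }
```

In practice a metrics system such as Prometheus would compute these from counters and histograms; the sketch only shows what the three signals mean.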
- Endpoints
- TCP, HTTP, Custom Response
- Best practices - look at the deck
- Traces
- Distributed tracing (OpenTracing, Jaeger)
- traces are composed of spans
- OpenTelemetry - capture metrics and distributed traces
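
To make "traces are composed of spans" concrete, here is a small data-structure sketch. It does not use the OpenTracing, Jaeger, or OpenTelemetry APIs; the `Span` fields and the `render` helper are hypothetical and only model the parent/child relationship between spans of one trace.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    name: str
    start_ms: int
    end_ms: int


def render(spans):
    """Return the span tree as indented lines, children ordered by start time."""
    children = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    lines = []

    def walk(parent_id, depth):
        for span in sorted(children.get(parent_id, []), key=lambda s: s.start_ms):
            lines.append(f"{'  ' * depth}{span.name} ({span.end_ms - span.start_ms} ms)")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


if __name__ == "__main__":
    trace = [
        Span("t1", "a", None, "GET /cart", 0, 120),
        Span("t1", "b", "a", "auth", 5, 25),
        Span("t1", "c", "a", "db query", 30, 110),
    ]
    print("\n".join(render(trace)))
```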
- Jeff from Netflix
- Mantis - Open source streaming microservices monitoring solution
- answers new questions that you forgot to log
- low latency
- cost-effective
- Developing meaningful SLIs - Alex Hidalgo (Squarespace Engineering http://slidesgala.com)
- Service Level Indicator
- SLI (engineering) = User journey (product team) = KPI (business)
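
One common way to express an SLI is as the ratio of good events to total events for a user journey. The sketch below assumes that formulation; the function names, the treatment of zero traffic, and the idea of comparing against a target (an SLO, which these notes do not cover) are assumptions added for illustration.

```python
def sli(good_events, total_events):
    """Fraction of events that were good, between 0.0 and 1.0."""
    if total_events == 0:
        return 1.0  # assumption: no traffic counts as meeting the indicator
    return good_events / total_events


def error_budget_remaining(good_events, total_events, target):
    """Share of the allowed bad events not yet used (negative means the budget is overspent)."""
    allowed_bad = (1 - target) * total_events
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        return 1.0 if actual_bad == 0 else 0.0
    return 1 - actual_bad / allowed_bad
```

For example, 9,995 good checkouts out of 10,000 with a 99.9% target leaves about half of the budget for that window.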
- M.E.L.T. Level Up - Ron from New Relic
- MELT
- Metrics - Micrometer, Istio, Prometheus, OpenTelemetry, DropWizard
- Events - alerts and deployments are examples of events
- Logs - json:api, Fluent Bit, CloudWatch
- Traces - Zipkin, OpenTelemetry, Istio
- Platform - Zipkin, OpenTelemetry, Istio, Micrometer, DropWizard
- LightStep
- Tech paper - Dapper, distributed tracing paper
- Stackdriver monitoring from Google
- Designing Alerts to Direct Attention - Ryan Frantz
- https://www.pbs.org/newshour/science/3-brain-technologies-to-watch-in-2018
- Mental model of a system
- Fitness Function-Driven Development - Rosemary Wang, ThoughtWorks
- Evolutionary Architecture
- Fitness function - borrowed from genetic algorithms
- Benefits
- KonMari method for former assumptions, tools and telemetry
- Highlights gaps in process, tooling and telemetry
- open discussions for tech debt
- develop mutual learning context
- https://github.com/joatmon08/2019-monitorama
- Presentation - Pete Cheslock @petecheslock
- http://lusislog.blogspot.com/2011/06/why-monitoring-sucks.html
- https://mattturck.com/data2019/
- Logging best practices
- Log in JSON format
- put everything in the json
- send it all to your Data Lake (georgehart.com)
- Presentation Material: https://pete.wtf/decks/MonitoramaBaltimore2019.pdf
- How to interview
- Diversity
- Preparation - get trained on interviewing. be comfortable
- Read the job description. Find the right candidate for the job. This isn’t about you
- Read the CV. Don’t Google - beware of legal issues
- Pair up
- Present yourself professionally - don’t take phone calls
- Collaborate, don’t confront
- Adopting a Product mindset for SRE and Observability teams
- You can’t spell “monitoring” without “monoid” - Kevin from New Relic
- Easy is not the same as simple - Rich Hickey’s talk - e.g., writing simple code is not easy
- What is a monoid? An algebraic structure with a single associative binary operation and an identity element
- Temporal and dimensional aggregation
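
A rough sketch of why monoids matter for temporal and dimensional aggregation: if a metric summary has an associative `combine` and an identity value, partial results can be merged in any grouping (per minute, per host, or all at once) and give the same answer. The `Summary` type and `aggregate` helper are hypothetical, written for this note rather than taken from the talk.

```python
import math
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Summary:
    """Count/sum/min/max summary; the defaults form the identity element."""
    count: int = 0
    total: float = 0.0
    minimum: float = math.inf
    maximum: float = -math.inf

    @staticmethod
    def of(value):
        return Summary(1, value, value, value)

    def combine(self, other):
        # Associative: (a + b) + c gives the same summary as a + (b + c).
        # Caveat: floating-point sums are only approximately associative.
        return Summary(
            self.count + other.count,
            self.total + other.total,
            min(self.minimum, other.minimum),
            max(self.maximum, other.maximum),
        )


def aggregate(summaries):
    """Fold any number of summaries, starting from the identity element."""
    return reduce(Summary.combine, summaries, Summary())
```

Aggregating each host over time and then across hosts yields the same `Summary` as folding every raw value in one pass, which is what lets a pipeline pre-aggregate at the edge.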
- Observability Graph - Homin Lee from Datadog
- Corollary to Conway’s Law: Observability follows your org chart.
- Gore’s hypothesis (Gore-Tex fabric)
- a.k.a. prequel to Dunbar’s number
- a.k.a. thing you heard about from The Tipping Point
- A graph with an ontology is a knowledge graph
- Observing Observability - Philip O’Toole from Google Cloud
- Why observability is difficult
- BigPanda
- Catchpoint.com
- Circonus
- Datadog
- Elastic APM
- Honeycomb.io
- InfluxDB
- Jaeger
- Loggly
- NewRelic.com
- OpenCensus
- OpenMetrics
- OpenTracing
- OpenTSDB
- Prometheus
- SignalFx
- Site24x7.com
- StackDriver from Google
- Zipkin
Follow ups
- Mantis - perhaps to visualize the service mapping and reliability of services
- Define SLIs for services we depend on
- Fitness Function-Driven Development
- Evaluate
- Create a matrix of tools and their features (MELT, alerts, USP, SaaS, real-time, cost etc.)
- GUTS (Grand Unified Telemetry System) overview videos
- Learn about AIOps, Monoids
- Convert DAP logs to JSON format - Benefits - Implications
- https://slidesgala.com
- https://vimeo.com/monitorama
- https://speakerdeck.com/monitorama
- What is ScyllaDB?
- Statistics for Engineers - Heinrich Hartmann
- If CMDB is static, what’s the alternative?