Problem: Driving food production via IoT, automating some of the cooking devices, sales data, payment processing, etc
Edge computing - every restaurant runs a Kubernetes cluster
Why edge - because every restaurant should continue to operate even during a network failure or other disaster
2000 restaurants - 2000 clusters, 6000 nodes, 7 engineers including an SRE run the infra
Restaurant data center - Intel NUC device: quad-core processor, 8GB RAM
manageable remotely, automated device discovery and self-clustering, self-healing
Fryers and other devices are controlled by this device
prometheus runs on the edge
Components
Highlander - custom tool for clustering leader election
Istio - alternative to NGINX
cluster initialization
used RKE (Rancher Kubernetes Engine)
options tried: kops (no bare-metal support), kubespray (slow, brittle), kubeadm (maybe in future), RKE (chosen)
Resetting cluster state:
need to be able to re-image remotely
solution: Overlay FS + HAMS (manages wiping clusters and restoring to base)
Hooves up - self-healing AWS SSM (Systems Manager, formerly Simple Systems Manager) registration - enables remote commands and patch reporting/management - log in remotely via SSM and issue AWS commands to install Ansible
Fleet (custom)
custom package and deployment management tool
other tools: SQS, MQTT, Helm
supports a variety of deployment models, including canary and blue/green
how distributed tracing works in microservices world - OpenTracing
what is p75 and p90 in terms of latency
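For reference, p75 is the latency that 75% of requests stay at or below, and p90 the same for 90%. A minimal sketch using the nearest-rank method (one of several common percentile definitions); the sample latencies are made up for illustration:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

# Hypothetical request latencies in milliseconds (two slow outliers).
latencies_ms = [12, 15, 11, 240, 18, 14, 13, 95, 16, 17]
p75 = percentile(latencies_ms, 75)  # 18
p90 = percentile(latencies_ms, 90)  # 95
```

The jump from p75 to p90 shows why tail percentiles are watched alongside the average: a few slow requests barely move p75 but dominate p90.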
Distributed Tracing - holistic view of a single request - each request is assigned a token - tracing happens at transport level, not application level - each service writes to a common store - don’t trace all requests, use sampling, say one out of every 100 or 1000 requests
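A rough sketch of the token-plus-sampling idea, not the OpenTracing API. The header name X-Trace-Id, the function names and the in-memory list standing in for the common store are all assumptions for illustration. Hashing the token means every service makes the same sampling decision for the same request:

```python
import uuid
import zlib

SAMPLE_ONE_IN = 100  # assumed rate: trace roughly 1 of every 100 requests

def new_trace_id():
    """Token assigned to a request when it enters the system."""
    return uuid.uuid4().hex

def is_sampled(trace_id, one_in=SAMPLE_ONE_IN):
    """Deterministic decision derived from the token, so all services agree."""
    return zlib.crc32(trace_id.encode()) % one_in == 0

def outgoing_headers(trace_id):
    """Propagate the token at the transport level (hypothetical header name)."""
    return {"X-Trace-Id": trace_id}

def record_span(store, trace_id, service, duration_ms, one_in=SAMPLE_ONE_IN):
    """Write a span to the common store, but only for sampled requests."""
    if is_sampled(trace_id, one_in):
        store.append({"trace_id": trace_id, "service": service, "duration_ms": duration_ms})
```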
Scaling Push Messaging for Millions of Devices @Netflix
PUSH - persist until something happens
Zuul - uses WebSockets/SSE
used to push real-time updates of movie suggestions - helped reduce load on Netflix servers by 12%
what is the C10K challenge - handling 10K concurrent connections to a server at a time - creating a new thread for each socket connection doesn’t scale - async I/O uses read/write callbacks on a single thread
Netflix uses Netty non-blocking I/O
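To illustrate the single-threaded async I/O model (this uses Python's asyncio, not Netty; the uppercase-echo protocol and the names handle, client and demo are made up for the example), one event loop thread serves many connections concurrently:

```python
import asyncio

async def handle(reader, writer):
    """Serve one connection; awaiting yields the thread to other connections."""
    data = await reader.readline()
    writer.write(data.upper())
    await writer.drain()
    writer.close()

async def client(port, msg):
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    writer.write(msg.encode() + b"\n")
    await writer.drain()
    reply = await reader.readline()
    writer.close()
    await writer.wait_closed()
    return reply.decode().strip()

async def demo(n_clients=50):
    # Port 0 lets the OS pick a free port.
    server = await asyncio.start_server(handle, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    async with server:
        # All clients are in flight at once, yet no extra threads are created.
        return await asyncio.gather(*(client(port, f"hello {i}") for i in range(n_clients)))

if __name__ == "__main__":
    print(asyncio.run(demo(5)))
```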
Zuul push server
clients connect to these servers by opening a WebSocket connection
connections are auto-closed periodically by the server so that clients do not stay connected to an old version of the software/cluster after a server upgrade (migrating them gradually avoids a thundering herd) - randomize each connection's lifetime (to avoid a recurring thundering herd caused by any blips) - ask the client to close the connection within a time frame
how to optimize push servers - Goldilocks strategy - more smaller instances, not fewer large instances
how to auto-scale - based on open connections per server
Amazon ELBs cannot proxy WebSockets by default because they run as HTTP load balancers - run the ELB as a TCP load balancer to solve the problem. Amazon now offers ALB (Application Load Balancer), which supports WebSockets
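The randomized connection lifetime noted above could look something like this. The 30-minute base and the 20% jitter are assumed values, not figures from the talk:

```python
import random

BASE_LIFETIME_S = 30 * 60  # assumed base lifetime: 30 minutes
JITTER_FRACTION = 0.2      # assumed spread: +/- 20%

def connection_lifetime(rng=random):
    """Pick a per-connection lifetime so reconnects spread out over time."""
    jitter = rng.uniform(-JITTER_FRACTION, JITTER_FRACTION)
    return BASE_LIFETIME_S * (1 + jitter)
```

Without the jitter, clients that reconnected together after a blip would all expire together again one lifetime later, repeating the herd.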
Zuul push registry
keeps track of which client is connected to which push server
any data store with the following features can be used as a push registry. Netflix uses Dynomite
low read latency
record expiry (when client disconnects)
sharding and replication
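A toy in-memory version of the registry's lookup and record-expiry behaviour, only to show the shape of the data. It is not Dynomite; sharding and replication are left out, and the class and method names are invented:

```python
import time

class PushRegistry:
    """Maps client id -> push server id, with records that expire after a TTL."""

    def __init__(self, ttl_s, clock=time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock
        self._records = {}  # client_id -> (server_id, expires_at)

    def register(self, client_id, server_id):
        """Called on connect and on each keep-alive to refresh the record."""
        self._records[client_id] = (server_id, self._clock() + self._ttl_s)

    def lookup(self, client_id):
        """Return the server holding the client's connection, or None if expired/unknown."""
        record = self._records.get(client_id)
        if record is None:
            return None
        server_id, expires_at = record
        if self._clock() >= expires_at:
            del self._records[client_id]  # client went away without a clean disconnect
            return None
        return server_id
```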
Message Processing
uses kafka
application uses push library to push the message to a queue
multiple queues - uses Mantis (similar to Flink) for stream processing of the queues
Forced Evolution: Shopify’s Journey to Kubernetes
3 Common Pitfalls in Microservice Integration
Google for InfoWorld article on the same topic by the presenter
Challenges of Asynchronicity
synchronous blocking call exhausts threads in pool - use circuit breaker (Netflix Hystrix)
fail fast is important but not enough - (e.g., flight check-in failure due to bar code generator)
Workflow engines, state machines - AWS Step Functions, Uber Cadence, Netflix Conductor, jBPM, Camunda (Java based), Zeebe by Camunda, Activiti
Manifold architecture options to run a workflow engine - blog article
A synchronous response is possible in the happy case, otherwise it is switched to async processing
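A bare-bones sketch of the circuit breaker idea mentioned above. This is not the Hystrix API; the class name, thresholds and half-open behaviour are simplified assumptions:

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected immediately."""

class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self._max_failures = max_failures
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after_s:
                # Fail fast instead of tying up a thread on a dead dependency.
                raise CircuitOpenError("dependency unavailable, failing fast")
            # Otherwise fall through: allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self._max_failures:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        self._failures = 0
        self._opened_at = None
        return result
```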
Communication is complex
Client has to implement timeout, retry, compensation
Service providers have to implement idempotency - (anti-pattern: "Do not refresh the page")
It is hard to differentiate communication problems - was the request never delivered, or did the service receive it but fail to respond?
Problems operating a message bus - dead messages, no context, inaccessible payload, hard to redeliver, …
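A small sketch of how client retries and provider idempotency fit together. PaymentService, charge and call_with_retry are hypothetical names, and a real client would add backoff and a retry budget. The retry is only safe because the provider deduplicates on the key:

```python
class PaymentService:
    """Provider side: remembers results by idempotency key."""

    def __init__(self):
        self._results = {}  # idempotency key -> stored result
        self.charges = []   # side effects actually performed

    def charge(self, idempotency_key, amount):
        if idempotency_key in self._results:
            # Duplicate request: return the earlier result, do not charge again.
            return self._results[idempotency_key]
        self.charges.append(amount)
        result = {"charge_id": len(self.charges), "amount": amount}
        self._results[idempotency_key] = result
        return result

def call_with_retry(send, attempts=3):
    """Client side: retry on timeout, reusing the same request each time."""
    last_error = None
    for _ in range(attempts):
        try:
            return send()
        except TimeoutError as exc:
            last_error = exc
    raise last_error
```

If the first response is lost after the service has already processed the request, the retry returns the stored result and the customer is charged once.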
Distributed Transactions
Paper: Life beyond Distributed Transactions: an Apostate’s Opinion - Pat Helland
Keynote - A brief, opinionated history of the API by Joshua Bloch
Efficient Fault Tolerant Java with Aeron Clustering
Fault Tolerance
Load balancer is another form of fault tolerance
Fault tolerance of state - partition, replication
Guaranteed messaging / queueing doesn’t store the previous state of a system
Exactly-once vs. at-least-once
blocking ACK spiral
nondeterministic restarts
Contention and coherence
poison message and error queues
Better way: Contiguous log with snapshot & replay
processing the log, build a state
Aircraft and spacecraft process the log, build state, mark a snapshot and store the state at that snapshot. In case of failure, the system restarts from the last snapshot/checkpoint and resumes from the stored state. Same concept as database checkpoints
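The log, snapshot and replay idea in miniature. This is a generic illustration, not the Aeron Cluster API; the Account class and its integer log entries are invented for the example:

```python
class Account:
    """State machine whose state is fully determined by the ordered log."""

    def __init__(self):
        self.balance = 0
        self.applied = 0  # number of log entries applied so far

    def apply(self, entry):
        self.balance += entry
        self.applied += 1

    def snapshot(self):
        """Capture the state together with the log position it corresponds to."""
        return {"balance": self.balance, "applied": self.applied}

    @classmethod
    def recover(cls, snapshot, log):
        """Restart from the snapshot, then replay only the entries after it."""
        service = cls()
        service.balance = snapshot["balance"]
        service.applied = snapshot["applied"]
        for entry in log[service.applied:]:
            service.apply(entry)
        return service
```

Because processing is deterministic, a recovered instance ends up in exactly the state the failed one had, which is also what lets replicas in a cluster stay identical.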
Clustered services
Replicated state machines
Raft consensus
Strong leader - elected member of the cluster, orders input, disseminates consensus
Application observability (like operations to devops)
debugging needs more context
Accelerate release lifecycle
Overops captures all swallowed exceptions, including the thread stack and the data flowing through the code, and also identifies the author of the code causing the error - tells you if the error is new or resurfaced
Overops gives app metrics like bugs introduced, swallowed exceptions, etc. release over release - helps towards code accountability
Debugging Microservices: How Google SREs Resolve Outages (Google)
SLO (Service level objective) - Google has only regional SLOs
Have You Tried Turning It Off and On Again? (David Blank-Edelman, Author of Seeking SRE, Founder of SRECon, @otterbook)
useless machine video (funny example of resiliency)
What is SRE (check pic)
SLO (Service level objective)
Characteristics of a resilient system
blameless postmortems - focus on improvement to the process
Nature of Work
Toiling without any value makes no sense
resiliency is a long game, it won’t happen overnight - you cannot burn out people
Interfaces
Error budget - developers and SRE meeting constantly to discuss the state of the code and system. There is no point in blaming each other about quality of code and instability of the system
Data
New technologies
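On the error budget noted above: in the usual SRE definition the budget is whatever the SLO leaves over (1 minus the SLO target), expressed here as allowed downtime. The 30-day window is an assumed default, and the function name is made up:

```python
def error_budget_minutes(slo, window_days=30):
    """Allowed downtime in minutes for an availability SLO over the window."""
    return (1 - slo) * window_days * 24 * 60

# A 99.9% SLO over 30 days leaves about 43.2 minutes of budget.
budget = error_budget_minutes(0.999)
```

Once the budget is spent, the shared number gives developers and SREs something concrete to discuss instead of trading blame.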
Companies:
Doordash
Atomist
Snyk (security company)
Shopify
Stripe - provides APIs that web developers can use to integrate payment processing into their websites and mobile applications.
Vaadin - Java framework for building web UIs, so the front end can be written in Java
DevOps
JFrog Xray - software violation rules are defined in SaaS - Xray can be installed locally and integrated with Jenkins - scans through the packages recursively including python, npm, docker, etc. looking for vulnerabilities
Hack - by Facebook - dialect of PHP - built for HHVM (HipHop VM) - both dynamically typed and statically typed (also called a gradual typing system) - Hack seems to add some features to PHP to make it a more reasonable programming language