In IT since my first job helping out with computers in my high school in 1994
Past employers: CoreOS, Red Hat, Electronic Arts among many others
Currently a Senior Consultant for Oteemo (we're hiring!)
Blood type: Caffeine-positive
Contact info:
Most of it isn't even really new -- we're just probing the state and outputs of the system. The only new things are:
Get the lay of the land...
kubectl get [-o wide|-o yaml] <nodes, pods, svc...>
kubectl describe <node, pod, svc...>
Often you will spot the issue right here
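For example -- the resource names below are made-up placeholders:
kubectl get nodes -o wide
kubectl get pods --all-namespaces -o wide
# Full spec and current status of one suspect pod:
kubectl get pod my-app-7d4b9c-x2x1z -o yaml
kubectl describe pod my-app-7d4b9c-x2x1z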
kubectl get events provides a summarized view of recently-seen events, e.g.:
NAMESPACE  LASTSEEN  FIRSTSEEN  COUNT  NAME                                        KIND
default    6s        6d         39910  data-romping-buffoon-elasticsearch-data-0  PersistentVolumeClaim

SUBOBJECT  TYPE    REASON         SOURCE                       MESSAGE
           Normal  FailedBinding  persistentvolume-controller  no persistent volumes available for this claim and no storage class is set
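To reproduce this view yourself (the --sort-by flag assumes a reasonably recent kubectl):
kubectl get events --all-namespaces
# Newest events at the bottom:
kubectl get events --all-namespaces --sort-by=.lastTimestamp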
kubectl logs
Gets logs from a container in a pod:
<probe> INFO: 2017/11/14 21:55:43.738702 Control connection to weave-scope-app.default.svc starting
<probe> INFO: 2017/11/14 21:55:45.789142 Publish loop for weave-scope-app.default.svc starting
Use -c <container> to choose a container in multi-container pods
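For example (pod and container names here are placeholders):
kubectl logs weave-scope-probe-x1b2c -c scope-probe
# Follow the log stream live:
kubectl logs my-app-7d4b9c-x2x1z -f
# Logs from the previous instance of a crashed container:
kubectl logs my-app-7d4b9c-x2x1z --previous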
Good old-fashioned SSH followed by interacting with the system logs or container runtime
"Didn't we just do that with kubectl
?" -- If you have really bad cluster issues or you're debugging an issue with a control-plane component, you might not be able to use kubectl
"Why?!"
<probe> WARN: 2017/12/08 07:39:07.762765 Error collecting weave status, backing off 10s: Get http://127.0.0.1:6784/report: dial tcp 127.0.0.1:6784: getsockopt: connection refused
<probe> WARN: 2017/12/08 07:39:07.767862 Cannot resolve 'scope.weave.local.': dial tcp 172.17.0.1:53: getsockopt: connection refused
<probe> WARN: 2017/12/08 07:39:07.816447 Error collecting weave ps, backing off 20s: exit status 1: "Link not found\n"
In my cluster, these warnings are expected when I deploy Weave Scope, not a problem: I don't have Weave Net networking deployed, so Scope simply can't find it
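A quick, hedged way to confirm that (your CNI pods may be named differently or live in another namespace):
kubectl get pods -n kube-system | grep -i weave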
(Note: no deep dive here, because the topic is vast...)
<probe> WARN: 2017/12/08 17:49:35.260659 Cannot resolve 'kubecon2018.default.svc': lookup kubecon2018.default.svc on 10.3.0.10:53: no such host
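The upstream DNS-debugging docs suggest checking resolution from inside a throwaway pod (busybox:1.28 is the image they recommend for nslookup):
kubectl run -it --rm dns-test --image=busybox:1.28 --restart=Never -- nslookup kubecon2018.default.svc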
<probe> WARN: 2017/12/08 07:56:19.104684 Error Kubernetes reflector (pods), backing off 40s: github.com/weaveworks/scope/probe/kubernetes/client.go:195: Failed to list *v1.Pod: pods is forbidden: User "system:serviceaccount:default:kubecon2017-weave-scope" cannot list pods at the cluster scope
<probe> WARN: 2017/12/08 07:56:19.106268 Error Kubernetes reflector (nodes), backing off 40s: github.com/weaveworks/scope/probe/kubernetes/client.go:195: Failed to list *v1.Node: nodes is forbidden: User "system:serviceaccount:default:kubecon2017-weave-scope" cannot list nodes at the cluster scope
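Before changing any manifests, you can confirm exactly what the service account may do (assumes your own user is allowed to impersonate):
kubectl auth can-i list pods --as=system:serviceaccount:default:kubecon2017-weave-scope
kubectl auth can-i list nodes --as=system:serviceaccount:default:kubecon2017-weave-scope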
Usually that's the easy part: Kubernetes is declarative, so just redeclare things correctly:
kubectl apply -f ...
Don't forget to fix things before you clean up old pods/etc. Kubernetes does a lot of cleaning up for you -- don't make work for yourself
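For the RBAC errors above, the redeclaration might look like this sketch (Weave Scope's real manifests already ship the correct RBAC objects; this just shows the shape of a fix):
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kubecon2017-weave-scope
rules:
- apiGroups: [""]
  resources: ["pods", "nodes"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kubecon2017-weave-scope
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kubecon2017-weave-scope
subjects:
- kind: ServiceAccount
  name: kubecon2017-weave-scope
  namespace: default
Then kubectl apply -f scope-rbac.yaml (file name is arbitrary) and the reflector warnings should stop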
API docs: https://kubernetes.io/docs/api-reference/v1.8/ -- great for resource syntax
Other good info on the main Tasks page: https://kubernetes.io/docs/tasks -- see sidebar under "Monitor, Log and Debug" (especially Troubleshoot Clusters and Troubleshoot Applications)
A lot of your traditional knowledge is still relevant
Take the time to fully describe problems you encounter (rubber-duck debugging)
"Aren't there tools for this stuff?" -- yes, but what if deploying them fails? This is about giving you base knowledge to understand what underlies those tools
There are a lot of tools out there with advanced capabilities that will help you prevent, debug and fix problems -- find some awesome ones you love and tell us all about them!
Oteemo management (hi Sam!) for getting me here
Justin Garrison and Michelle Noorali for abstract help
Many CoreOS engineers and Red Hat trainers past and present for teaching me how to do this stuff