<p><em>A scratchpad of sorts: I like to break things. Then try to fix them. (Dahlia Bock, https://dlbock.github.io/feed.xml, built with Jekyll, last updated 2023-02-16)</em></p>
<h1>Valuable asynchronous work principles</h1>
<p><em>2023-02-15</em></p>
<p>The Covid-19 pandemic showed the world that it was possible to be productive in a remote work setting. Some folks have taken things further by fully embracing asynchronous work; some were even doing so well before 2020.</p>
<p>A few things I’ll note before we move on:</p>
<ul>
<li>Working remotely does not necessarily mean working asynchronously.</li>
<li>Working asynchronously usually means you’re working remotely. I suspect it would be difficult to do so in a non-remote work setting.</li>
<li>I believe that there are various async work principles that you can adopt without fully going async by default like some companies have (e.g. Doist, GitLab). This is essentially the point I’m attempting to make in this post.</li>
<li>I’ll share some learning resources on async work that helped me at the end of this post.</li>
</ul>
<h2 id="what-is-async-work">What is async work?</h2>
<p><a href="https://remote.com/blog/elements-sustainable-remote-work-culture">Preston Wick</a> sums it up this way:</p>
<p><em>“Asynchronous work is a simple concept: Do as much as you can with what you have, document everything, transfer ownership of the project to the next person, then start working on something else.”</em></p>
<p>Another summary from <a href="https://lattice.com/library/what-is-asynchronous-work-heres-everything-you-need-to-know-to-implement-it-at-your-organization">Lattice</a>:</p>
<p><em>“Async work, collaboration, and communication simply means that employees work on their own time without the expectation of immediately responding to others.”</em></p>
<h2 id="what-are-these-principles-you-speak-of">What are these principles you speak of?</h2>
<p><strong>Note</strong>: This is not an exhaustive or perfect list. I’ve come to learn that these principles are and can be valuable in any work setting. I had to learn this the hard way when I was at a company with a distributed workforce across 3 main timezone groups: North/Central/South America, EMEA (Europe, Middle East & Africa) and APAC (Asia Pacific). My team, which was based in the Americas, was essentially brand new and had to get up to speed quickly to start delivering on some initiatives. We were often in situations where no one else was online to help us, so we had to reverse engineer context, performing an “archeological dig” of sorts, and build a library of knowledge as we went.</p>
<h3 id="assume-low-context-by-default">Assume low-context by default</h3>
<p>We have to remember that all of us move in and out of contexts constantly; new hires join a company, people go on vacation, people join a project midway, etc. We need to be considerate of the people in our audience who may have no prior knowledge of what we’re talking or writing about. This principle extends to all interactions you have with your colleagues, e.g. in documentation, meeting agendas, questions/messages on Slack, etc.</p>
<p>Some examples in reality:</p>
<ul>
<li>Business context of new projects/initiatives should be documented in written format. New folks who join a project/initiative midway can read about it after the fact and get up to speed. This democratizes context and hopefully prevents the cycle of people repeating themselves over and over in meetings, not to mention the high probability of knowledge leaks because certain people have context that others don’t.</li>
<li>When writing any content (e.g. incident playbook, how-tos, onboarding docs, project contexts), assume your audience has zero context and spell everything out.</li>
<li>Meetings should include information on why it’s being held and an agenda so the audience knows what to expect.</li>
</ul>
<h3 id="schedule-meetings-with-focus--intention">Schedule meetings with focus & intention</h3>
<p>Full async work culture discourages excessive meetings because they take time away from focused work. There is definitely a time and place to meet with your colleagues, and we should do so with intention.</p>
<p>Some examples in reality:</p>
<ul>
<li>All meetings must have an agenda.</li>
<li>Agendas should include, at minimum:
<ul>
<li>Information your audience needs to review beforehand to gain context and meaningfully participate in the meeting.</li>
<li>The reason(s) why this meeting is happening.</li>
<li>The goal(s) of this meeting.</li>
<li>Link to the recording of the meeting (after the fact). Recording a meeting is optional but highly encouraged.</li>
</ul>
</li>
<li>Avoid sending invites late in the day for meetings early the next morning.</li>
<li>In situations when an ad-hoc discussion happens, we should create a written record of significant decisions/outcomes/next steps and the necessary context that led up to it.</li>
</ul>
<h3 id="curate-your-content">Curate your content</h3>
<p>It should not be a surprise that remote and async work settings rely heavily on written communication. However, how you write, what you write, and how you organize what you write will influence whether or not it will be usable after you write it. Curation of content is not an easy thing to do successfully, which is why you hear people say things like “Documentation becomes stale as soon as you write it”.</p>
<p>A few things on content that I think are important:</p>
<ul>
<li>Documentation is a subset of content in general. How you treat documentation should be how you treat the content you generate.</li>
<li>Clearly differentiate between point-in-time documents and “living” documents so that a reader can tell whether something <em>is</em> or <em>should be</em> outdated or not. Consider creating an archive for content that is no longer current but still valuable in providing context.</li>
<li>Keep root-level pages under control. If you use Confluence, the root-level pages are the ones you see on the left-hand navigation when you open a Confluence Space. Too many root-level pages will cause a lot of confusion, especially if they’re not categorized in a way to help with navigation.</li>
<li>Ensure there is a single source of truth. If content lives in multiple sources (e.g. Confluence, Google Docs, Notion, etc) without being linked back to a single source, it will cause <em>a lot</em> of friction and confusion and make things impossible to find.</li>
<li>Consider creating well-known spaces for types of content that you create new versions of on a frequent cadence, e.g. monthly All-Hands recordings, RFCs (Request For Comments), team discussion notes, demos, etc. Use templates to make the creation of new versions easier and keep the format consistent. Having well-known spaces for these types of content will hopefully make them easier for folks to find after the fact.</li>
<li>Whatever you do, <em>do not</em> use Slack as an information repository. It is a real-time communication tool, and “real-time” is the opposite of “asynchronous”.</li>
</ul>
<p>Essentially, you want your content to be easy to add to, easy to modify, easy to find, and easy to navigate. Properly curated and organized content can be a life-saver in urgent situations, level the playing field for all team members by helping them gain the context they need, and foster an environment where everyone is empowered and equipped to do their job.</p>
<h2 id="some-resources-on-async-work-that-helped-me">Some resources on async work that helped me</h2>
<p><a href="https://about.gitlab.com/company/culture/all-remote/guide/">GitLab’s All-Remote Guide</a> has been the main source of a lot of my learning, specifically:</p>
<ul>
<li><a href="https://about.gitlab.com/company/culture/all-remote/asynchronous/">How to work asynchronously</a></li>
<li><a href="https://about.gitlab.com/company/culture/all-remote/effective-communication/">Communicating effectively & responsibly through text</a></li>
<li><a href="https://about.gitlab.com/company/culture/all-remote/meetings/">How to optimize meetings in an all-remote environment</a></li>
</ul>
<p>GitLab also has a course on <a href="https://www.coursera.org/learn/remote-team-management">Remote Team Management on Coursera</a> that repackages a lot of the information from their All-Remote Guide and presents it in an easily consumable format.</p>
<p>The folks at Twist have <a href="https://async.twist.com/">a newsletter</a> solely dedicated to the topic of async collaboration that you can subscribe to.</p>
<h1>Autoscaling Concourse workers with custom Prometheus metrics</h1>
<p><em>2021-12-06</em></p>
<p>If you’re operating a <a href="https://concourse-ci.org/">Concourse</a> cluster on Kubernetes, you may or may not need to implement autoscaling of Concourse workers to automatically handle expanding and contracting workloads. The Concourse <a href="https://github.com/concourse/concourse-chart/blob/master/values.yaml">helm chart</a> supports using Kubernetes’ <a href="https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/">Horizontal Pod Autoscaler</a> to enable autoscaling based on observed CPU utilization or custom metrics.</p>
<p>By default, each Concourse worker allows only 250 containers to run concurrently. When a worker reaches its max of 250 concurrent running containers, it is no longer able to take on additional tasks, and when all of your workers reach that point, your Concourse cluster is basically unusable. You could <a href="https://www.engineerbetter.com/blog/increasing-concourse-container-limits/">increase that limit</a>, but you should understand the implications before actually doing that. The other option is to autoscale the Concourse workers based on the average number of concurrent running containers per worker.</p>
<p>Basically, here is what we need, in order:</p>
<ul>
<li>Expose Prometheus metrics from Concourse</li>
<li>Install Prometheus to collect metrics</li>
<li>Install the Prometheus Adapter to act as a metric API server to make custom metrics available to Kubernetes</li>
<li>Enable autoscaling on the Concourse side, which will create a <code class="language-plaintext highlighter-rouge">HorizontalPodAutoscaler</code> resource and utilize the custom metric made available to Kubernetes to autoscale the Concourse worker pods</li>
</ul>
<p>There isn’t a lot of documentation out there specific to this use case (or at least I couldn’t find it), so hopefully this will be helpful to someone out there (or future me).</p>
<h2 id="enable-concourse-to-expose-prometheus-metrics">Enable Concourse to expose Prometheus metrics</h2>
<p>I mentioned in a <a href="https://dlbock.github.io/2021/10/15/operating-concourse-learnings.html">previous post</a> that Concourse can be enabled to emit metrics about itself. It supports a few types of metric emitters, but I chose Prometheus since we had the most experience with it.</p>
<h2 id="install-the-prometheus-server">Install the Prometheus server</h2>
<p>Once you’ve enabled your Concourse cluster to emit Prometheus metrics, if you already have a Prometheus server running in the same Kubernetes cluster, it’ll automatically find and collect those metrics. We installed Prometheus via its <a href="https://github.com/prometheus-community/helm-charts/tree/main/charts/prometheus">helm chart</a> and the only component we needed here is the Prometheus <code class="language-plaintext highlighter-rouge">server</code> that will pull the metrics exposed by your services via the <code class="language-plaintext highlighter-rouge">/metrics</code> endpoints and store them in the time-series database for querying.</p>
<p>Here’s the <code class="language-plaintext highlighter-rouge">values.yaml</code> configuration we used for the Prometheus helm chart:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>alertmanager:
  enabled: false
kubeStateMetrics:
  enabled: false
nodeExporter:
  enabled: false
pushgateway:
  enabled: false
server:
  nodeSelector:
    compute-load: prometheus
  persistentVolume:
    size: 20Gi
    storageClass: "${server_storage_class_name}"
  resources:
    requests:
      cpu: 2
      memory: 3Gi
  retention: "7d"
</code></pre></div></div>
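<p>As a quick aside on what the server actually collects: each <code>/metrics</code> endpoint serves plain text in the Prometheus exposition format, one sample per line. The following Python sketch parses such a line; the sample metric shown is a made-up illustration of the shape of a Concourse worker metric, and the parser deliberately ignores escaping, comments, and timestamps.</p>

```python
import re

def parse_sample(line):
    """Parse one line of the Prometheus text exposition format into
    (name, labels, value). Simplified sketch: ignores escaping,
    comments, and optional trailing timestamps."""
    m = re.match(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$', line)
    name, label_str, value = m.group(1), m.group(2), float(m.group(3))
    labels = {}
    for pair in (label_str.split(",") if label_str else []):
        key, val = pair.split("=", 1)
        labels[key] = val.strip('"')
    return name, labels, value

# A made-up sample resembling a Concourse worker container-count metric:
name, labels, value = parse_sample(
    'concourse_workers_containers{worker="concourse-worker-0"} 175'
)
```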
<h2 id="install-the-prometheus-adapter">Install the Prometheus adapter</h2>
<p>In order to make custom metrics available to Kubernetes, we need them to be exposed via Kubernetes’ custom metrics API. This is enabled via “adapter” API servers like the <a href="https://github.com/kubernetes-sigs/prometheus-adapter">Prometheus Adapter</a>. We installed the Prometheus Adapter via its <a href="https://github.com/prometheus-community/helm-charts/tree/main/charts/prometheus-adapter">helm chart</a> and here is the <code class="language-plaintext highlighter-rouge">values.yaml</code> configuration we used:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>prometheus:
  url: http://<name-of-prometheus-k8s-service>.<namespace>.svc.cluster.local
  port: 80
resources:
  requests:
    cpu: 200m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi
rules:
  default: false
  custom:
    - seriesQuery: 'concourse_workers_containers{worker=~"concourse-worker-.*", kubernetes_namespace="concourse"}'
      resources:
        overrides:
          kubernetes_namespace: { resource: "namespace" }
          worker: { resource: "pod" }
      name:
        matches: "^(.*)"
        as: "${1}_avg"
      metricsQuery: 'avg_over_time(<<.Series>>{<<.LabelMatchers>>,worker=~"concourse-worker-.*"}[5m])'
</code></pre></div></div>
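<p>The <code>name</code> section of the rule renames the discovered series: <code>matches</code> is a regular expression applied to the Prometheus series name, and <code>as</code> is the substitution template. In Python terms, the rename for this rule is roughly equivalent to:</p>

```python
import re

# Rough equivalent of the rule's name mapping:
# matches "^(.*)"  →  as "${1}_avg"
def rename_series(series_name):
    return re.sub(r"^(.*)", r"\1_avg", series_name)

print(rename_series("concourse_workers_containers"))
# → concourse_workers_containers_avg
```

<p>This is why the custom metric Kubernetes ends up seeing is called <code>concourse_workers_containers_avg</code> rather than the raw series name.</p>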
<p>This is where things got confusing and fuzzy for me to understand what exactly Kubernetes is expecting with this custom metric and I found the following references to be the most helpful:</p>
<ul>
<li><a href="https://github.com/kubernetes-sigs/prometheus-adapter/blob/master/docs/config-walkthrough.md">A very extensive walkthrough of the configuration of the Prometheus Adapter</a></li>
<li><a href="https://github.com/kubernetes-sigs/prometheus-adapter/blob/master/docs/config.md">The Prometheus Adapter configuration reference</a> which is also very helpful</li>
<li><a href="https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/">And of course the official Kubernetes docs regarding Horizontal Pod Autoscaling and how it works</a>. Personally the following paragraph was key to helping me understand what it was expecting in what resulted from the configured <code class="language-plaintext highlighter-rouge">metricsQuery</code>:</li>
</ul>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>For per-pod resource metrics (like CPU), the controller fetches the metrics from the resource metrics API
for each Pod targeted by the HorizontalPodAutoscaler. Then, if a target utilization value is set, the controller
calculates the utilization value as a percentage of the equivalent resource request on the containers in each
Pod. If a target raw value is set, the raw metric values are used directly. The controller then takes the mean
of the utilization or the raw value (depending on the type of target specified) across all targeted Pods, and
produces a ratio used to scale the number of desired replicas.
For per-pod custom metrics, the controller functions similarly to per-pod resource metrics, except that it
works with raw values, not utilization values.
</code></pre></div></div>
<h2 id="enable-autoscaling-in-concourse">Enable autoscaling in Concourse</h2>
<p>Now you’re finally ready to configure Concourse to create the <code class="language-plaintext highlighter-rouge">HorizontalPodAutoscaler</code> resource using the resulting <code class="language-plaintext highlighter-rouge">concourse_workers_containers_avg</code> metric. Since we installed Concourse via its <a href="https://github.com/concourse/concourse-chart">helm chart</a>, we just had to edit the <code class="language-plaintext highlighter-rouge">values.yaml</code> to add the following section:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>concourse:
  ...
worker:
  ...
  autoscaling:
    maxReplicas: 30
    minReplicas: 24
    customMetrics:
      - type: Pods
        pods:
          metric:
            name: concourse_workers_containers_avg
          target:
            type: AverageValue
            averageValue: 180
</code></pre></div></div>
<p>This basically creates a <code class="language-plaintext highlighter-rouge">HorizontalPodAutoscaler</code> (HPA) resource that would have a minimum of 24 Concourse worker pods, and if the average value of <code class="language-plaintext highlighter-rouge">concourse_workers_containers_avg</code> goes above 180 using the <a href="https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#algorithm-details">built-in scaling algorithm</a>, it will slowly scale up the number of pods to accommodate the load. Since the maximum replicas is configured at 30, there will never be more than 30 Concourse worker pods running with this HPA configuration.</p>
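<p>To make the scaling behaviour concrete: per the Kubernetes docs, the core of the built-in algorithm is <code>desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)</code>, clamped between <code>minReplicas</code> and <code>maxReplicas</code>. A small sketch of that arithmetic, using the bounds from the configuration above and made-up load figures:</p>

```python
import math

def desired_replicas(current_replicas, current_avg, target_avg=180,
                     min_replicas=24, max_replicas=30):
    """Core HPA formula from the Kubernetes docs, clamped to the
    configured replica bounds. target_avg=180 mirrors the
    averageValue set in the values.yaml above."""
    desired = math.ceil(current_replicas * (current_avg / target_avg))
    return max(min_replicas, min(max_replicas, desired))

# If the 24 workers average 200 containers each against the target of 180:
print(desired_replicas(24, 200))  # → 27
```

<p>Note how the clamp means a huge spike in load still only ever scales to 30 workers, and quiet periods never drop below 24.</p>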
<h2 id="a-few-additional-things-to-note">A few additional things to note</h2>
<ul>
<li>Set resource (cpu/memory) requests on your Concourse workers so that Kubernetes knows how many resources each worker pod can reasonably use. If you’re using a managed Kubernetes service like Google Kubernetes Engine (GKE), you should also configure autoscaling on the node pool that runs Concourse so that it can scale the nodes as worker pods are added/removed.</li>
<li>You should set a reasonable value for <code class="language-plaintext highlighter-rouge">terminationGracePeriodSeconds</code> when configuring the Concourse workers, as this tells Kubernetes how long to wait for the workers to drain current tasks and retire themselves before terminating them. If you have pipelines that run long-running jobs, you might have to sequester them to separate worker groups and opt them out of autoscaling, as we don’t want Kubernetes to terminate worker pods midway through running tasks.</li>
</ul>
<h1>Learnings from operating Concourse for the past year</h1>
<p><em>2021-10-15</em></p>
<p>I have had to operate a <a href="https://concourse-ci.org/">Concourse</a> cluster on GKE for the last year or so. There are so many things I have learned over the last few months that I wish I had known back when I started. So, this is mostly a reminder to my future self in the event that the information is helpful, but maybe this will also help someone else who is going through a similar journey.</p>
<h2 id="use-a-managed-database">Use a managed database</h2>
<p>It’s possible that I’m missing many skills in this area, but having to manage a PostgreSQL database that’s installed via a helm chart dependency and runs in a Kubernetes Pod with its data in a Persistent Volume is a giant PITA. You have to hand-roll your own backup & restore strategy, and you have to manually scale up the Persistent Volume size. It’s one less headache if you can use a managed database like Google CloudSQL from the get-go. Backups and sizing are automatic, and you can create & manage the resource via Terraform. I haven’t had to go through a PostgreSQL version update yet with CloudSQL, so that story remains to be told.</p>
<h2 id="enable-concourse-to-emit-metrics">Enable Concourse to emit metrics</h2>
<p>Concourse can be configured to <a href="https://concourse-ci.org/metrics.html">emit metrics</a> about its system health and the builds that it is running. This information is key to understanding how your Concourse cluster is doing, and where the actual & potential problems are. There are quite a few Metric Emitters available, Prometheus being one of them. If you’re installing Concourse on Kubernetes via its helm chart, you can see what those options are in the <a href="https://github.com/concourse/concourse-chart/blob/master/values.yaml">values.yaml</a> (search for <code class="language-plaintext highlighter-rouge">metrics:</code> and look at the available metric emitter sections below that).</p>
<p>Another thing to point out here is that if you want to implement autoscaling of the Concourse workers, you will need these metrics if you want to use anything other than the pod’s CPU & Memory usage to configure the Horizontal Pod Autoscaler.</p>
<h2 id="use-ssds-for-concourse-worker-persistent-volumes">Use SSDs for Concourse Worker Persistent Volumes</h2>
<p>If you are running Concourse in a Kubernetes cluster and installing the Concourse workers as a StatefulSet instead of a Deployment, ensure that the Persistent Volumes are using a StorageClass with type <code class="language-plaintext highlighter-rouge">pd-ssd</code>. It’s configurable in the Concourse helm chart. Builds that are I/O-heavy will run into issues very quickly if the volumes are using a <code class="language-plaintext highlighter-rouge">pd-standard</code> StorageClass.</p>
<h2 id="configure-the-appropriate-container-placement-strategy">Configure the appropriate Container Placement Strategy</h2>
<p>You will inevitably need to configure a <a href="https://concourse-ci.org/container-placement.html">container placement strategy</a> that works for you. The default set in the helm chart is <code class="language-plaintext highlighter-rouge">volume-locality</code>, which could mean that a handful of workers become overloaded because that’s where their inputs ended up. I recommend understanding all the available strategies and picking one (or more) that works for you.</p>
<p>We have been using the <code class="language-plaintext highlighter-rouge">limit-active-tasks</code> strategy for the past year, with the max allowed active build tasks per worker <code class="language-plaintext highlighter-rouge">limitActiveTasks</code> set to <code class="language-plaintext highlighter-rouge">5</code>. That number is probably too small, but we were trying to be conservative when we encountered the issue of some workers being overloaded with the default <code class="language-plaintext highlighter-rouge">volume-locality</code> strategy.</p>
<p>There is now an option to <a href="https://concourse-ci.org/container-placement.html#chaining-placement-strategies">chain more than one container placement strategy together</a>, which wasn’t available last year. We will probably be switching to a combination of <code class="language-plaintext highlighter-rouge">limit-active-tasks</code> + <code class="language-plaintext highlighter-rouge">volume-locality</code> in the very near future.</p>
<h2 id="set-an-acceptable-termination-grace-period">Set an acceptable termination grace period</h2>
<p>When <a href="https://github.com/concourse/concourse-chart#restarting-workers">workers are restarted/deleted</a> (either manually or by Kubernetes), we can configure the <code class="language-plaintext highlighter-rouge">terminationGracePeriodSeconds</code> value to provide an upper limit to how long Kubernetes will wait for Concourse to gracefully <a href="https://concourse-ci.org/internals.html#RETIRING-table">retire</a> the worker before forcefully terminating the container.</p>
<p>This is a tricky number to set, as it depends on the builds that are running in your cluster, the average/max time a build takes to complete, and how comfortable you are with the possibility of workers being killed before their tasks are drained completely. This number is also important if you implement worker autoscaling, as you probably want Kubernetes to scale down workers only after they’ve been retired properly.</p>
<h2 id="make-one-optimization-at-a-time">Make one optimization at a time</h2>
<p>When things are not going as planned (slow builds, too much load on workers, builds waiting too long to be picked up by workers, etc.), it is tempting to make whatever optimizations you think might help the situation. This can backfire: too many simultaneous optimizations can cause things to deteriorate, or leave you unsure which change actually helped. Make one change at a time and observe its effects for some time before moving on to the next. Hopefully you can also use the <a href="#enable-concourse-to-emit-metrics">Concourse metrics emitted</a> to help inform and guide your changes.</p>
<h1>Wrangling Kubernetes configuration (Part 2)</h1>
<p><em>2020-04-30</em></p>
<p>Following up on my <a href="/2020/04/02/wrangling-kubernetes-configuration-part-1.html">first post</a>, where we looked at a simple example of using a Jsonnet template to determine what kind of icon path to use for a <code class="language-plaintext highlighter-rouge">Chart.yaml</code> file, I’d like to take a look at a less simple, but possibly still contrived, example of utilizing Jsonnet to dynamically reconstruct YAML.</p>
<p>Consider the following <code class="language-plaintext highlighter-rouge">my-application.yaml</code>:</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>---
apiVersion: v1
kind: Namespace
metadata:
  labels:
    app.kubernetes.io/name: my-application
  name: my-application-namespace
---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/name: my-application
  name: my-application
  namespace: my-application-namespace
</code></pre></div></div>
<p>I’d like to be able to identify which release of <code class="language-plaintext highlighter-rouge">my-application</code> is being run by adding an additional label to each resource type:</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  labels:
    app.kubernetes.io/name: my-application
    app.kubernetes.io/version: 1.0.11
</code></pre></div></div>
<p>Since the version number changes between releases, it is unrealistic for me to have to manually edit this file every time it changes. So let’s take a look at how we can use a Jsonnet template to help us re-generate this YAML file and dynamically add the version label.</p>
<p><code class="language-plaintext highlighter-rouge">my-application.jsonnet</code></p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code>local resources = std.parseJson(std.extVar('resources'));
local version = std.extVar('version');

local addVersionToMetadataLabels(resource) = resource + {
  metadata+: { labels+:
    super.labels + { "app.kubernetes.io/version": version }
  }
};

local resourcesWithVersion = std.map(addVersionToMetadataLabels, resources);

{
  ["namespace.v" + version + ".json"]:
    std.filter(function(res) res.kind == "Namespace", resourcesWithVersion)[0],
  ["serviceaccount.v" + version + ".json"]:
    std.filter(function(res) res.kind == "ServiceAccount", resourcesWithVersion)[0]
}
</code></pre></div></div>
<p>Let’s break it down.</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code>local resources = std.parseJson(std.extVar('resources'));
local version = std.extVar('version');
</code></pre></div></div>
<p>This template accepts 2 external parameters <code class="language-plaintext highlighter-rouge">resources</code> and <code class="language-plaintext highlighter-rouge">version</code>. Here, <code class="language-plaintext highlighter-rouge">resources</code> is the contents of the <code class="language-plaintext highlighter-rouge">my-application.yaml</code> file, which has been converted to JSON format, and <code class="language-plaintext highlighter-rouge">version</code> is whatever version number you want to use.</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">local</span><span class="w"> </span><span class="err">addVersionToMetadataLabels(resource)</span><span class="w"> </span><span class="err">=</span><span class="w"> </span><span class="err">resource</span><span class="w"> </span><span class="err">+</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="err">metadata+:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="err">labels+:</span><span class="w">
</span><span class="err">super.labels</span><span class="w"> </span><span class="err">+</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"app.kubernetes.io/version"</span><span class="p">:</span><span class="w"> </span><span class="err">version</span><span class="w"> </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="err">;</span><span class="w">
</span><span class="err">local</span><span class="w"> </span><span class="err">resourcesWithVersion</span><span class="w"> </span><span class="err">=</span><span class="w"> </span><span class="err">std.map(addVersionToMetadataLabels,</span><span class="w"> </span><span class="err">resources);</span><span class="w">
</span></code></pre></div></div>
<p>Next, we have a function <code class="language-plaintext highlighter-rouge">addVersionToMetadataLabels</code> that takes a resource (in JSON format) and merges an additional label, with key <code class="language-plaintext highlighter-rouge">app.kubernetes.io/version</code> and the provided <code class="language-plaintext highlighter-rouge">version</code> value, into <code class="language-plaintext highlighter-rouge">metadata.labels</code>.</p>
<p>Then we use the Jsonnet standard library <code class="language-plaintext highlighter-rouge">map</code> function to apply that to all the JSON resources in <code class="language-plaintext highlighter-rouge">resources</code>.</p>
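<p>To make those merge semantics concrete, here is a rough Python analogue of the label-merging step. This is just an illustration: the function name and sample resources are made up, and the real pipeline does this in Jsonnet.</p>

```python
# Hypothetical Python sketch of what addVersionToMetadataLabels does:
# merge one extra label into metadata.labels without clobbering the rest,
# then apply it across all resources (the std.map step).
def add_version_to_metadata_labels(resource, version):
    metadata = resource.get("metadata", {})
    labels = dict(metadata.get("labels", {}))  # copy existing labels
    labels["app.kubernetes.io/version"] = version
    updated = dict(resource)
    updated["metadata"] = {**metadata, "labels": labels}
    return updated

resources = [
    {"kind": "Namespace", "metadata": {"name": "demo", "labels": {"team": "core"}}},
    {"kind": "ServiceAccount", "metadata": {"name": "demo-sa", "labels": {}}},
]
resources_with_version = [add_version_to_metadata_labels(r, "1.0.26") for r in resources]
```

Note that, like the Jsonnet <code class="language-plaintext highlighter-rouge">labels+:</code> override, the existing labels survive and only the version label is added.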
<p>Now that we have modified the original contents to include an additional <code class="language-plaintext highlighter-rouge">app.kubernetes.io/version</code> label, we can output the new contents and reconstruct our JSON.</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="p">[</span><span class="s2">"namespace.v"</span><span class="w"> </span><span class="err">+</span><span class="w"> </span><span class="err">version</span><span class="w"> </span><span class="err">+</span><span class="w"> </span><span class="s2">".json"</span><span class="p">]</span><span class="err">:</span><span class="w">
</span><span class="err">std.filter(function(res)</span><span class="w"> </span><span class="err">res.kind</span><span class="w"> </span><span class="err">==</span><span class="w"> </span><span class="s2">"Namespace"</span><span class="p">,</span><span class="w"> </span><span class="err">resourcesWithVersion)</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span><span class="w">
</span><span class="p">[</span><span class="s2">"serviceaccount.v"</span><span class="w"> </span><span class="err">+</span><span class="w"> </span><span class="err">version</span><span class="w"> </span><span class="err">+</span><span class="w"> </span><span class="s2">".json"</span><span class="p">]</span><span class="err">:</span><span class="w">
</span><span class="err">std.filter(function(res)</span><span class="w"> </span><span class="err">res.kind</span><span class="w"> </span><span class="err">==</span><span class="w"> </span><span class="s2">"ServiceAccount"</span><span class="p">,</span><span class="w"> </span><span class="err">resourcesWithVersion)</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>This generates 2 separate JSON files, one for <code class="language-plaintext highlighter-rouge">Namespace</code> and the other for <code class="language-plaintext highlighter-rouge">ServiceAccount</code>, and then I wrote some bash scripting to convert the JSON back to YAML and shove them back into a single YAML file. The reason for this is that Jsonnet outputs JSON in alphabetical order by key, and I needed to preserve the original order of the Kubernetes resources specified in <code class="language-plaintext highlighter-rouge">my-application.yaml</code>.</p>
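<p>For illustration, here is roughly the same pick-by-kind and computed-file-name logic sketched in Python (the names and sample data are made up; the real version is the Jsonnet above):</p>

```python
# Hypothetical sketch of the final Jsonnet object: select each resource by
# kind and key it by a computed file name, in the order we choose.
version = "1.0.26"
resources_with_version = [
    {"kind": "Namespace", "metadata": {"name": "demo"}},
    {"kind": "ServiceAccount", "metadata": {"name": "demo-sa"}},
]

def by_kind(kind):
    # Mirrors std.filter(function(res) res.kind == kind, resourcesWithVersion)[0]
    return [r for r in resources_with_version if r["kind"] == kind][0]

outputs = {
    "namespace.v" + version + ".json": by_kind("Namespace"),
    "serviceaccount.v" + version + ".json": by_kind("ServiceAccount"),
}
```

Each key becomes a file name, so reassembling the final YAML is just a matter of reading the files back in whatever order you specify.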
<p>Hopefully these few examples illustrate some of the many ways Jsonnet can help manage YAML files and consequently your Kubernetes configuration in a slightly saner way.</p>
<p>I recently gave a talk on this subject at a Women Who Code Austin virtual meetup. I’ve published the code I’ve used and the slides on <a href="https://gitlab.com/dlbock/talks/-/tree/master/wrangling-k8s-config">this repo</a>. Feel free to copy and paste whatever makes sense.</p>
<p>If anyone else has other experiences with managing Kubernetes configuration using other tools and would like to share, please do in the comments section below. Would love to learn from you.</p>Dahlia BockFollowing up on my first post, where we looked at a simple example of using a Jsonnet template to determine what kind of icon path to use for a Chart.yaml file, I’d like to take a look at a less simple, but possibly still contrived, example of utilizing Jsonnet to dynamically reconstruct YAML.Wrangling Kubernetes configuration (Part 1)2020-04-02T00:00:00+00:002020-04-02T00:00:00+00:00https://dlbock.github.io/2020/04/02/wrangling-kubernetes-configuration-part-1<p>I’ve recently been working a lot with helm charts and Kubernetes configuration and one of the challenges has been managing the differences between all the installation methods and ensuring it is deployable on multiple Kubernetes platforms e.g. helm chart on <a href="https://hub.helm.sh/">Helm Hub</a>, helm chart on the <a href="https://rancher.com/docs/rancher/v2.x/en/catalog/built-in/">Rancher Library Catalog</a>, single YAML file format for both Kubernetes and OpenShift, a <a href="https://console.cloud.google.com/marketplace">Google Cloud Platform Marketplace</a> application just to name a few.</p>
<p>The underlying Kubernetes resources that need to be created are not <em>too</em> different from one platform to another, but there was enough difference to create a fair amount of complexity, not to mention that they had to be published/pushed to various locations. There are <a href="https://blog.argoproj.io/the-state-of-kubernetes-configuration-management-d8b06c1205">many tools</a> out there for Kubernetes configuration management, but they aren’t always a one-size-fits-all solution for your needs.</p>
<p>A few things that we started doing in an attempt to manage the increasing complexity:</p>
<ol>
<li>Create a single canonical source where the resources can be generated from</li>
<li>Use <a href="https://jsonnet.org/">Jsonnet</a> templates where necessary</li>
<li>Use <code class="language-plaintext highlighter-rouge">helm template</code> to generate configuration in single YAML file format</li>
</ol>
<p>I’ll illustrate the second point a bit more with a simple example.</p>
<p>Consider a snippet of the following <code class="language-plaintext highlighter-rouge">Chart.yaml</code> for <code class="language-plaintext highlighter-rouge">My Awesome Application</code>’s helm chart for Helm Hub.</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">apiVersion</span><span class="pi">:</span> <span class="s">v1</span>
<span class="na">name</span><span class="pi">:</span> <span class="s">my-awesome-application</span>
<span class="na">version</span><span class="pi">:</span> <span class="s">1.0.26</span>
<span class="na">appVersion</span><span class="pi">:</span> <span class="m">1.1</span>
<span class="na">description</span><span class="pi">:</span> <span class="s">My Awesome Application</span>
<span class="na">home</span><span class="pi">:</span> <span class="s">https://www.example.com/</span>
<span class="na">icon</span><span class="pi">:</span> <span class="s">https://remote-site.com/my-icon.png</span>
<span class="nn">...</span>
</code></pre></div></div>
<p>Compare that with a snippet of this other <code class="language-plaintext highlighter-rouge">Chart.yaml</code> for <code class="language-plaintext highlighter-rouge">My Awesome Application</code>’s helm chart for the Rancher Library Catalog.</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">apiVersion</span><span class="pi">:</span> <span class="s">v1</span>
<span class="na">name</span><span class="pi">:</span> <span class="s">my-awesome-application</span>
<span class="na">version</span><span class="pi">:</span> <span class="s">1.0.26</span>
<span class="na">appVersion</span><span class="pi">:</span> <span class="m">1.1</span>
<span class="na">description</span><span class="pi">:</span> <span class="s">My Awesome Application</span>
<span class="na">home</span><span class="pi">:</span> <span class="s">https://www.example.com/</span>
<span class="na">icon</span><span class="pi">:</span> <span class="s">file://../my-icon.png</span>
<span class="nn">...</span>
</code></pre></div></div>
<p>The difference is the <code class="language-plaintext highlighter-rouge">icon</code> path points to a local file for the Rancher version to provide support for air-gapped users, whereas the Helm Hub version points to a remote file. This simple difference could be handled in a few ways:</p>
<ol>
<li>Python (or bash, etc) script to swap out the icon path in one of the <code class="language-plaintext highlighter-rouge">Chart.yaml</code> versions</li>
<li>Maintain 2 versions of the same file</li>
<li>Use a jsonnet template to manage the difference</li>
</ol>
<p>Option #1 got a little gnarly using <code class="language-plaintext highlighter-rouge">awk</code>, as there were slight differences between GNU <code class="language-plaintext highlighter-rouge">awk</code> on Linux distributions (for running on the CI machine) and OS X <code class="language-plaintext highlighter-rouge">awk</code> (for local development). The <code class="language-plaintext highlighter-rouge">bash</code> gurus out there might know of a different tool or way to resolve this, but as I’m not one of them, I had to find an alternative.</p>
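<p>For what it’s worth, one portable way to do Option #1 would have been a small Python script, which behaves the same on Linux CI machines and on OS X. A minimal sketch (the file contents and paths here are illustrative):</p>

```python
import re

def swap_icon(chart_text, new_icon):
    # Replace the value of the top-level `icon:` key, matching whole lines
    # so nothing else in the file is touched.
    return re.sub(r"(?m)^icon: .*$", "icon: " + new_icon, chart_text)

chart = (
    "name: my-awesome-application\n"
    "icon: https://remote-site.com/my-icon.png\n"
)
rancher_chart = swap_icon(chart, "file://../my-icon.png")
```

This sidesteps the GNU vs BSD tooling differences entirely, though it still leaves you maintaining an external script that mutates the file after the fact.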
<p>Option #2 got really annoying over time, as every time someone made any changes, they had to remember to increase the Chart <code class="language-plaintext highlighter-rouge">version</code> in <em>two</em> separate files.</p>
<p>So we went with Option #3 using the following <code class="language-plaintext highlighter-rouge">Chart.jsonnet</code> template:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">local</span><span class="w"> </span><span class="err">type</span><span class="w"> </span><span class="err">=</span><span class="w"> </span><span class="err">std.extVar('type');</span><span class="w">
</span><span class="err">local</span><span class="w"> </span><span class="err">localIcon</span><span class="w"> </span><span class="err">=</span><span class="w"> </span><span class="s2">"file://../my-icon.png"</span><span class="err">;</span><span class="w">
</span><span class="err">local</span><span class="w"> </span><span class="err">remoteIcon</span><span class="w"> </span><span class="err">=</span><span class="w"> </span><span class="s2">"https://remote-site.com/my-icon.png"</span><span class="err">;</span><span class="w">
</span><span class="err">local</span><span class="w"> </span><span class="err">iconPath</span><span class="w"> </span><span class="err">=</span><span class="w"> </span><span class="err">if</span><span class="w"> </span><span class="err">type</span><span class="w"> </span><span class="err">==</span><span class="w"> </span><span class="err">'rancher'</span><span class="w"> </span><span class="err">then</span><span class="w"> </span><span class="err">localIcon</span><span class="w"> </span><span class="err">else</span><span class="w"> </span><span class="err">remoteIcon;</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"apiVersion"</span><span class="p">:</span><span class="w"> </span><span class="s2">"v1"</span><span class="p">,</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"my-awesome-application"</span><span class="p">,</span><span class="w">
</span><span class="nl">"version"</span><span class="p">:</span><span class="w"> </span><span class="s2">"1.0.26"</span><span class="p">,</span><span class="w">
</span><span class="nl">"appVersion"</span><span class="p">:</span><span class="w"> </span><span class="mf">1.1</span><span class="p">,</span><span class="w">
</span><span class="nl">"description"</span><span class="p">:</span><span class="w"> </span><span class="s2">"My Awesome Application"</span><span class="p">,</span><span class="w">
</span><span class="nl">"home"</span><span class="p">:</span><span class="w"> </span><span class="s2">"https://www.example.com/"</span><span class="p">,</span><span class="w">
</span><span class="nl">"icon"</span><span class="p">:</span><span class="w"> </span><span class="err">iconPath</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>Then run <code class="language-plaintext highlighter-rouge">jsonnet</code> with the aforementioned template:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># helm</span>
jsonnet <span class="nt">--ext-str</span> <span class="nb">type</span><span class="o">=</span>helm <span class="nt">-o</span> <span class="nv">$TARGET_DIR</span>/Chart.json Chart.jsonnet
<span class="c"># rancher</span>
jsonnet <span class="nt">--ext-str</span> <span class="nb">type</span><span class="o">=</span>rancher <span class="nt">-o</span> <span class="nv">$TARGET_DIR</span>/Chart.json Chart.jsonnet
</code></pre></div></div>
<p>Note that <code class="language-plaintext highlighter-rouge">jsonnet</code> generates <code class="language-plaintext highlighter-rouge">json</code> files but you can use <code class="language-plaintext highlighter-rouge">python</code> to convert that to <code class="language-plaintext highlighter-rouge">yaml</code> fairly easily and then delete the generated <code class="language-plaintext highlighter-rouge">json</code> file if you no longer need it:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#!/usr/bin/env python3
</span>
<span class="kn">import</span> <span class="nn">sys</span><span class="p">,</span> <span class="n">yaml</span><span class="p">,</span> <span class="n">json</span>
<span class="n">yaml</span><span class="p">.</span><span class="n">safe_dump</span><span class="p">(</span><span class="n">json</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">sys</span><span class="p">.</span><span class="n">stdin</span><span class="p">),</span> <span class="n">sys</span><span class="p">.</span><span class="n">stdout</span><span class="p">,</span> <span class="n">default_flow_style</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
</code></pre></div></div>
<p>What I like about using a <code class="language-plaintext highlighter-rouge">jsonnet</code> template in this scenario:</p>
<ul>
<li>It’s readable and the differences are encoded in the same location as the contents of the file itself so there isn’t some external script that modifies the file after the fact</li>
<li>It’s deterministic and not system dependent</li>
<li>And of course, no duplication necessary</li>
</ul>Dahlia BockI’ve recently been working a lot with helm charts and Kubernetes configuration and one of the challenges has been managing the differences between all the installation methods and ensuring it is deployable on multiple Kubernetes platforms e.g. helm chart on Helm Hub, helm chart on the Rancher Library Catalog, single YAML file format for both Kubernetes and OpenShift, a Google Cloud Platform Marketplace application just to name a few.Beyond Just Feature Development2020-02-13T00:00:00+00:002020-02-13T00:00:00+00:00https://dlbock.github.io/2020/02/13/beyond-just-feature-development<p>Software engineering isn’t just about pushing out features, IMHO. There is so much that needs to happen between the beginning of a feature and when it is actually available to be used by customers. I find that some engineers tend to gloss over these things with the excuse that it is not their job to ensure the software they write gets into their customers’ hands and works as intended. I disagree. Here are some of the questions I <del>worry about</del> ask myself when writing software.</p>
<h3 id="testing">Testing</h3>
<ul>
<li>How do I create automatic test harnesses for these features?</li>
<li>How do I ensure that this will work in production? Is there a pre-production environment that’s a carbon copy of production? Is there a way that I can simulate the load/traffic of production?</li>
</ul>
<h3 id="deploying-changes">Deploying changes</h3>
<ul>
<li>How do I get these changes to pre-production? And then to production?</li>
<li>How quickly can I get bug fixes into production, if needed?</li>
<li>Is there an automatic deployment pipeline? Is it a manual process? Is it a black box controlled by someone else?</li>
</ul>
<h3 id="releasing-features">Releasing features</h3>
<ul>
<li>How do I enable these features for our customers? Or a subset of our customers? Is it something I can do myself, or do I have to go searching for the right person for it?</li>
<li>Do we even have a feature toggle system in place? Is it somewhere that I can find easily, or do I have to go on a digging expedition?</li>
<li>How are these feature toggles managed across environments?</li>
</ul>
<h3 id="observability-of-features">Observability of features</h3>
<ul>
<li>How do I know how these features are performing in production?</li>
<li>How do I know if they are even being used in production?</li>
<li>How do I get notified when things go wrong in production?</li>
</ul>
<h3 id="communication">Communication</h3>
<ul>
<li>Is the feature I’m working on mostly self-contained within my team? Do I need to reach out to another team to collaborate or can I work independently?</li>
<li>Is there anyone else that we need to communicate with about this feature?</li>
<li>Do I need to write up documentation about this feature?</li>
</ul>
<p>I’ll update this list over time as I learn about other things that I should <del>worry</del> care about.</p>Dahlia BockSoftware engineering isn’t just about pushing out features, IMHO. There is so much that needs to happen between the beginning of a feature and when it is actually available to be used by customers. I find that some engineers tend to gloss over these things with the excuse that it is not their job to ensure the software they write gets into their customers’ hands and works as intended. I disagree. Here are some of the questions I worry about ask myself when writing software.Pull Requests And Why They Don’t Bring Out The Best In Us2018-09-28T00:00:00+00:002018-09-28T00:00:00+00:00https://dlbock.github.io/2018/09/28/pull-requests-dont-bring-out-the-best-in-us<p>The pull request (PR) system has existed on GitHub (and other similar products) from day one, and has been revamped a few times to allow for more options of collaboration around (all types of) code, as evidenced by <a href="https://blog.github.com/2010-08-31-pull-requests-2-0/">this GitHub blog post</a> from 2010.</p>
<p>In recent years, however, I’ve realized that the pull request model causes more problems and creates more inefficiencies than it solves.</p>
<h4 id="intent-is-hard-to-capture-and-communicate"><strong>Intent is hard to capture and communicate</strong></h4>
<p>When authoring a pull request, it is important to provide some context as to why you’re making the changes that you’re making, the thought processes behind the code that was written and what the tradeoffs were. This helps a reviewer understand the what, why and how behind the code change so that they can take that into account when providing feedback. Capturing all that context is hard and time-consuming. And even if you were able to capture all of that in the description of your pull request, it’s not a guarantee that your reviewers will have the same background and experience to completely understand the intent behind what you wrote.</p>
<p>The same goes when providing feedback on a pull request. It isn’t easy to communicate nuances via the written word and we have to work extra hard to get our point across effectively.</p>
<h4 id="pull-requests-discourage-emergent-design"><strong>Pull requests discourage emergent design</strong></h4>
<p>Emergent design is the notion that engineers focus on delivering small pieces of working functionality and allow for the design to emerge as the code evolves, as opposed to having long running feature branches and anticipating design in advance.</p>
<p>This way of working allows for more frequent commits to master/trunk, and changes the focus from “I need to deliver this feature” to “How can I deliver business value in small chunks?”.</p>
<p>The end result is usually just enough design and code to enable the feature that is being delivered and less opportunity for over-engineering.</p>
<h4 id="pull-requests-end-up-being-the-conversation-conduit-as-opposed-to-just-conversation-starter"><strong>Pull requests end up being the conversation conduit as opposed to just conversation starter</strong></h4>
<p>When a team is in the same time-zone and co-located, there is no reason to use pull requests as the tool to facilitate conversation as opposed to just having the conversation in real life. In addition to the difficulty of effectively communicating intent, all sorts of other communication challenges arise when we hide behind the veneer of a pull request: bike shedding, nitpicking, etc.</p>
<hr />
<p>It is a constant reminder for me that the pull request system sometimes does not allow me to work in the most efficient way possible.
I prefer sitting down and having a conversation with my colleague about code in real life, allowing for the back-and-forth and nuances to be communicated that way. In an ideal world, we would be pairing and designing a solution together, committing to master/trunk as frequently as possible while keeping the tests passing and allowing the design to emerge from each iteration.</p>Dahlia BockThe pull request (PR) system has existed on GitHub (and other similar products) from day one, and has been revamped a few times to allow for more options of collaboration around (all types of) code, as evidenced by this GitHub blog post from 2010.Attending Lead Developer Austin 20182018-03-09T00:00:00+00:002018-03-09T00:00:00+00:00https://dlbock.github.io/2018/03/09/austin-lead-developer-conf<p>What makes a good technical leader? How does one stay competent technically while being an effective leader? How do we build more effective teams in the age where people are constantly pulled in multiple directions but still expected to focus and deliver?</p>
<p>I had the privilege of attending The Lead Developer Conference, held for the first time in Austin this year. There were a lot of good talks, but here are a few that stuck with me.</p>
<ul>
<li>
<p><strong><a href="https://www.slideshare.net/thekua/levelling-up-the-way-of-the-lead-developer">Pat Kua: Levelling Up: The Way of the Lead Developer</a></strong></p>
<ul>
<li>As engineers, our default mode is the ‘maker’ mode. But, as a leader, we don’t necessarily need to always be the one solving the problem, instead, enable the team to do so themselves.</li>
<li>There rarely is only one right answer to a problem, especially when it involves human beings. Consider the context of the situation and the tradeoffs when making decisions.</li>
<li>What you say matters as much as how you say it. Each of us come with different interpretations of reality and our own personal biases, so what you say could mean different things to different people.</li>
</ul>
</li>
<li>
<p><strong>Julia Grace: Building Engineering Teams Under Pressure</strong></p>
<ul>
<li><a href="https://www.nytimes.com/2016/02/28/magazine/what-google-learned-from-its-quest-to-build-the-perfect-team.html">When Google set out on its quest to build the perfect team</a>, the most important thing they found after studying many, many teams was that the good ones generally shared 2 behaviors: Members spoke in roughly the same proportion (equality in distribution of conversational turn-taking) and they were skilled at intuiting how others felt based on nonverbal cues (high “average social sensitivity”). <strong>Within psychology, researchers sometimes colloquially refer to traits like ‘‘conversational turn-taking’’ and ‘‘average social sensitivity’’ as aspects of what’s known as psychological safety</strong>.</li>
<li>Toxic conflict arises when different people are solving different problems that they think are the same problem.</li>
<li>Get as much clarity and information as possible about the problems that you’re solving and NOT solving.</li>
<li>
<p>Grace further illustrates the combination of psychological safety and clarity:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> Psychological
safety
^
|
Party | High performing
| team
|
----------------------------> Clarity
|
Chaos | Dictatorship
|
|
</code></pre></div> </div>
</li>
</ul>
</li>
<li>
<p><strong>Heidi Waterhouse: The Death of Data: Retention, Rot, and Risk</strong></p>
<ul>
<li>Be mindful of the data that you store. Store it in such a way that enables you to separate the wheat from the chaff.</li>
<li>It might be cheap to store data, but is it cheap to maintain, protect and convert to new formats in the future?</li>
<li>Old data is not neutral, it affects current data.</li>
<li>“Collect carefully. Ingest mindfully. Delete boldly.”</li>
</ul>
</li>
<li>
<p><strong>Nickolas Means: Who Destroyed Three Mile Island?</strong></p>
<ul>
<li>When something bad happens (e.g. we experience an outage of our services, we miss the deadline on a very important project, someone miscalculates a formula that causes us to collect the wrong amount of money from our customers, etc), it’s always too easy to find someone to blame and make them pay for their mistakes. In reality, if we believe that we have hired the right people with the right values to get the job done, the root cause is never that simple.</li>
<li>Nickolas talks about Hindsight Bias, also known as the knew-it-all-along effect: the inclination, after an event has occurred and given the knowledge of everything that took place during that event, to see it as having been predictable all along. He also talks about Outcome Bias, which is when we evaluate the quality of a decision after the outcome of that decision is already known.</li>
<li>It’s very important when something bad happens that we focus less on the human(s) that made the mistake, but rather the systemic flaws that allowed the human to make the mistake. Seek forward accountability, and always find the second story that’s hidden under what’s obvious.</li>
</ul>
</li>
</ul>Dahlia BockMy thoughts and learnings from attending the conferenceService outages and good practices around handling them2017-11-12T00:00:00+00:002017-11-12T00:00:00+00:00https://dlbock.github.io/2017/11/12/service-outages<p>As humans, we are inherently flawed, and therefore the software we write is also inherently flawed, no matter how careful we are, or how many tests we write.
It is prudent that we admit this so that we can plan for when things go wrong.</p>
<p>One of the many things I appreciated very much in my time at SoundCloud was their rigorous process around “What to do when there’s an outage?” and their dedication to continuously improve it so that the company as a whole was better equipped to handle these unwanted situations when they happened. Here are some of my learnings:</p>
<ul>
<li>
<p><strong>Plan for things to go wrong</strong></p>
<p><em>Hope for the best and prepare for the worst</em>. This might seem obvious to you (or not), but the most important thing to do to prepare for when things go wrong, is to do just that: prepare. This could take the form of:</p>
<ul>
<li>Runbooks: Document your systems and what to do when they go down. Store the runbooks in a place that is easy for your team to access and not in the same place as the rest of your services.</li>
<li>Training: Run regular sessions to make sure your team is well versed on what to do. This will also ensure that your runbooks are up-to-date.</li>
</ul>
</li>
<li>
<p><strong>Communicate when things go wrong</strong></p>
<p>One of the first things that you should do when you find out that something is wrong is notify people. Depending on the size of the problem, you might just be notifying your team, or the whole company and subsequently your public users.</p>
<p>Use any and all of the following means:</p>
<ul>
<li>If you have a public facing service, notify your users, be it through Twitter, a static status page or banner on your site.</li>
<li>Send out severity emails to your entire company, and most importantly your executives, sales and customer support folks.</li>
</ul>
</li>
<li>
<p><strong>Use a centralized channel to troubleshoot the issue</strong></p>
<p>If you have a distributed team (which is a common occurrence these days), agree upon a means by which you can troubleshoot the issue. In the case of SoundCloud, everyone knew to go to one specific Slack channel when an incident occurred. Since there was also a pretty large engineering team, the first thing we did was to establish who would take ‘point’ and ‘comms’. The ‘point’ person would be the one directing the troubleshooting, even if they weren’t the ones actually doing the work. The ‘comms’ person would be the one taking care of communicating the outage to the rest of the company, and keeping them up to date as things change.</p>
</li>
<li>
<p><strong>Find out how and what went wrong and how to prevent it from happening in the future</strong></p>
<ul>
<li>Set up a time to talk through what happened as a team. One of the things that SoundCloud did which I appreciated was to open up those postmortem meetings to any engineer who was interested in them. Engineers were actually encouraged to attend postmortem meetings once in a while just to be informed on what happened.</li>
<li>If you have different teams managing different parts of your system, ask that team to prepare the postmortem report.</li>
<li>As part of that report, also include a section for the work that needs to be done in order to prevent this specific occurrence in the future.</li>
<li>And of course, schedule that work as soon as you can.</li>
</ul>
</li>
</ul>
<p>Outages happen. We need to be ready when they do.</p>Dahlia BockMy learnings on what to do when there's a service outageBack up and blogging!2016-12-14T00:00:00+00:002016-12-14T00:00:00+00:00https://dlbock.github.io/2016/12/14/Hello-World<figure>
<a href="https://www.flickr.com/photos/dahliabock/7654250812/in/dateposted-public/" title="NYC skyline from Rockefeller Center"><img src="https://farm9.staticflickr.com/8166/7654250812_caba0b00fd_k.jpg" width="2048" height="1536" alt="NYC skyline from Rockefeller Center" /></a>
<figcaption><a href="https://www.flickr.com/photos/dahliabock/7654250812/in/dateposted-public/" title="NYC skyline from Rockefeller Center">NYC skyline from Rockefeller Center</a>.</figcaption>
</figure>
<p>After a multiple year hiatus, I’ve resurrected my blog and hope to be blogging again.</p>
<p>Be back soon!</p>Dahlia BockAfter a multiple year hiatus, I've resurrected my blog and hope to be blogging again.