31 Episodes

Ship It Weekly

Ship It Weekly is a short, practical recap of what actually matters in DevOps, SRE, cloud infrastructure, and platform engineering. Each episode, your host Brian Teller walks through the latest outages, releases, tools, and incident writeups, then translates them into “here’s what this means for your systems” instead of just reading headlines. Expect a couple of main stories with context, a quick hit of tools or releases worth bookmarking, and the occasional segment on on-call, burnout, or team culture.

Hosted by

Brian Teller

Latest
Episodes

All Releases
GitHub Actions Hardening, Airbnb Config Rollouts, Cloudflare Rust Restarts, ECS Managed Daemons, and Terraform Access Controls
Ep. 30

This episode of Ship It Weekly is about the quiet platform work that keeps things safe before they break. Brian covers GitHub Actions hardening in Kubernetes-related repos, Airbnb’s safer config rollouts, Cloudflare’s zero-downtime Rust restarts, Amazon ECS Managed Daemons, and HCP Terraform access controls with IP allow lists and temporary AWS permission delegation.

Links

GitHub Actions security roadmap

https://github.blog/news-insights/product-news/whats-coming-to-our-github-actions-2026-security-roadmap/

Airbnb config rollouts

https://medium.com/airbnb-engineering/safeguarding-dynamic-configuration-changes-at-scale-5aca5222ed68

Cloudflare graceful restarts for Rust

https://blog.cloudflare.com/ecdysis-rust-graceful-restarts/

Amazon ECS Managed Daemons

https://aws.amazon.com/about-aws/whats-new/2026/04/amazon-ecs-managed-daemons/

HCP Terraform IP allow lists

https://www.hashicorp.com/blog/hcp-terraform-adds-ip-allow-list-for-terraform-resources

HCP Terraform AWS permission delegation

https://www.hashicorp.com/blog/aws-permission-delegation-now-generally-available-in-hcp-terraform

GitHub secret scanning updates

https://github.blog/changelog/2026-03-10-secret-scanning-pattern-updates-march-2026/

GitHub secret scanning for AI coding agents

https://github.blog/changelog/2026-03-31-secret-scanning-extends-to-ai-coding-agents-via-the-github-mcp-server/

Codespaces GA with data residency

https://github.blog/changelog/2026-04-01-codespaces-is-now-generally-available-for-github-enterprise-with-data-residency

Kubernetes v1.36 sneak peek

https://kubernetes.io/blog/2026/03/30/kubernetes-v1-36-sneak-peek/

GKE Inference Gateway

https://cloud.google.com/kubernetes-engine/docs/concepts/about-gke-inference-gateway

More episodes and show notes

https://shipitweekly.fm

On Call Briefs

https://oncallbrief.com

View Details
Hackerbot-Claw Grows, Xygeni Tag Poisoning, GitHub Search HA, Windows SID Failures, and AI Skills Supply Chain
Ep. 29

This episode of Ship It Weekly is about the places where convenience quietly turns into trust.

Brian revisits the Trivy story by zooming out to the bigger hackerbot-claw GitHub Actions campaign, then gets into the Xygeni tag-poisoning compromise, GitHub’s search high availability rebuild for GitHub Enterprise Server, Windows Server 2025 surfacing duplicate SID problems in cloned images, and the agent-skills ecosystem replaying package supply chain history. Plus: a quick lightning round on GitHub pausing self-hosted runner minimum-version enforcement and March secret scanning updates.

Links

OpenSSF advisory on active GitHub Actions exploitation https://seclists.org/oss-sec/2026/q1/246

Xygeni action compromise via tag poisoning https://www.stepsecurity.io/blog/xygeni-action-compromised-c2-reverse-shell-backdoor-injected-via-tag-poisoning

GitHub Enterprise Server search high availability rebuild https://github.blog/engineering/architecture-optimization/how-we-rebuilt-the-search-architecture-for-high-availability-in-github-enterprise-server/

Microsoft on duplicate SIDs and nongeneralized Windows Server 2025 images https://learn.microsoft.com/en-us/troubleshoot/exchange/administration/exchange-server-issues-on-incorrect-windows-server-image

Socket on supply chain security for skills.sh https://socket.dev/blog/socket-brings-supply-chain-security-to-skills

Snyk ToxicSkills research https://snyk.io/blog/toxicskills-malicious-ai-agent-skills-clawhub/

GitHub self-hosted runner minimum version enforcement paused https://github.blog/changelog/2026-03-13-self-hosted-runner-minimum-version-enforcement-paused/

GitHub secret scanning pattern updates, March 2026 https://github.blog/changelog/2026-03-10-secret-scanning-pattern-updates-march-2026/

More episodes and show notes at https://shipitweekly.fm

On Call Briefs at https://oncallbrief.com

View Details
Ship It Conversations: Ang Chen on Project Vera, AI Cloud Emulation, and Safer Infrastructure Testing
Ep. 28

This is a guest conversation episode of Ship It Weekly, separate from the weekly news recaps.

In this Ship It: Conversations episode, I talk with Ang Chen from the University of Michigan about Project Vera, a cloud emulator built to help teams test infrastructure changes more safely before they touch real cloud.

We talk about why testing against real cloud APIs is slow, expensive, and risky, how Vera works under tools like Terraform and CloudFormation, what “high fidelity” actually means, and where a tool like this could fit in local dev and CI/CD.

The bigger theme is one I think matters a lot: if AI is going to play a real role in cloud operations, it probably needs a sandbox first, not direct access to production.

Note

This interview was recorded on February 13, 2026. Since then, Vera’s public project materials have expanded the framing a bit further around multi-cloud support and safe environments for agent learning, so keep that in mind while listening.

Highlights

• Why real cloud testing still creates cost, delay, and risk

• How Vera emulates cloud behavior at the API layer

• Where this could help with Terraform, CloudFormation, and CI/CD workflows

• Why “useful enough to catch real mistakes” may matter more than perfect emulation

• The limits, tradeoffs, and fidelity questions that still need to be solved

• Why safe training grounds may matter before AI agents touch real infrastructure

Ang’s links

• LinkedIn: https://www.linkedin.com/in/ang-chen-8b877a17/

• University of Michigan profile: https://eecs.engin.umich.edu/people/chen-ang/

• Publications: https://web.eecs.umich.edu/~chenang/pubs.html

Project Vera

• Project site: https://project-vera.github.io/

• GitHub: https://github.com/project-vera/vera

• The quest for AI Agents as DevOps: https://project-vera.github.io/blogs/cloudagent/cloudagent/

• No More Manual Mocks: https://project-vera.github.io/blogs/cloudemu/cloudemu/

Stuff mentioned

• A Case for Learned Cloud Emulators: https://dl.acm.org/doi/10.1145/3718958.3754799

• Cloud Infrastructure Management in the Age of AI Agents: https://dl.acm.org/doi/abs/10.1145/3759441.3759443

• LocalStack: https://www.localstack.cloud/

Our links

More episodes + show notes + links: https://shipitweekly.fm

On Call Brief: https://oncallbrief.com

View Details
McKinsey AI Flaw, Kafka Goes Diskless, Google Buys Wiz, AWS Copilot Ends, and AI Gateway on Kubernetes
Ep. 27

This week on Ship It Weekly, Brian looks at what happens when new interfaces create old responsibilities.

McKinsey patched a vulnerability in its internal AI tool Lilli, Kafka contributors are pushing a diskless-topics model that rethinks durability and replication in cloud environments, and Google officially closed Wiz acquisition in one of the biggest cloud-security moves. Plus: AWS is sunsetting Copilot CLI, Kubernetes launches an AI Gateway Working Group.

Links

McKinsey statement on Lilli

https://www.mckinsey.com/about-us/media/statement-on-strengthening-safeguards-within-the-lilli-tool

Kafka diskless topics proposal

https://cwiki.apache.org/confluence/display/KAFKA/The%2BPath%2BForward%2Bfor%2BSaving%2BCross-AZ%2BReplication%2BCosts%2BKIPs

Google completes acquisition of Wiz

https://blog.google/innovation-and-ai/infrastructure-and-cloud/google-cloud/wiz-acquisition/

AWS Copilot CLI end-of-support

https://aws.amazon.com/blogs/containers/announcing-the-end-of-support-for-the-aws-copilot-cli/

Kubernetes AI Gateway Working Group

https://kubernetes.io/blog/2026/03/09/announcing-ai-gateway-wg/

Amazon Bedrock observability for first-token latency and quota consumption

https://aws.amazon.com/about-aws/whats-new/2026/03/amazon-bedrock-observability-ttft-quota/

Cloudflare JSON responses and RFC 9457 support for 1xxx errors

https://developers.cloudflare.com/changelog/post/2026-03-11-json-rfc9457-responses-for-1xxx-errors/

Amazon S3 source-region information in server access logs

https://aws.amazon.com/about-aws/whats-new/2026/02/amazon-s3-source-region-information/

AWS Config adds 30 new resource types

https://aws.amazon.com/about-aws/whats-new/2026/03/aws-config-new-resource-types/

Amazon Bedrock AgentCore Runtime stateful MCP server features

https://aws.amazon.com/about-aws/whats-new/2026/03/amazon-bedrock-agentcore-runtime-stateful-mcp/

More episodes and show notes at

https://shipitweekly.fm

On Call Briefs at

https://oncallbrief.com

View Details
Meta Buys Moltbook, Block AI Layoffs Get Messier, Atlassian Cuts Jobs, and GitHub Explains the Outages
Ep. 26

This week on Ship It Weekly, Brian covers five “AI meets reality” stories that every DevOps, SRE, security, and platform team can learn from.

Block’s AI layoff story is getting messier as follow-up reporting pushes back on the original framing, Meta bought Moltbook and brought more attention to the trust and security problems already showing up around AI-agent platforms, and Atlassian cut about 10% of its workforce while saying AI is changing the skills and roles it needs. Plus: GitHub gives one of the more honest outage breakdowns we’ve seen lately, Anthropic and Mozilla show a more grounded AI use case with Claude finding real Firefox bugs, and there’s a quick lightning round on Bedrock AgentCore policy, Dependabot for pre-commit hooks, and Cloudflare’s latest threat report.

Links

Block layoffs follow-up

https://www.theguardian.com/technology/2026/mar/08/block-ai-layoffs-jack-dorsey

Meta acquires Moltbook

https://www.theguardian.com/technology/2026/mar/10/meta-acquires-moltbook-ai-agent-social-network

Wiz on Moltbook exposure

https://www.wiz.io/blog/exposed-moltbook-database-reveals-millions-of-api-keys

Atlassian team update

https://www.atlassian.com/blog/announcements/atlassian-team-update-march-2026

GitHub availability issues write-up

https://github.blog/news-insights/company-news/addressing-githubs-recent-availability-issues-2/

Anthropic + Mozilla Firefox security

https://www.anthropic.com/news/mozilla-firefox-security

Anthropic labor market report

https://www.anthropic.com/research/labor-market-impacts

AWS Bedrock AgentCore Policy GA

https://aws.amazon.com/about-aws/whats-new/2026/03/policy-amazon-bedrock-agentcore-generally-available/

GitHub Dependabot support for pre-commit hooks

https://github.blog/changelog/2026-03-10-dependabot-now-supports-pre-commit-hooks/

Cloudflare 2026 Threat Report

https://blog.cloudflare.com/2026-threat-report/

More episodes and show notes at

https://shipitweekly.fm

On Call Briefs at:

https://oncallbrief.com

View Details
Ship It Conversations: Yvonne Young on Linux Foundations, Mentorship, and Getting Job Ready in Cloud
Ep. 25

This is a guest conversation episode of Ship It Weekly (separate from the weekly news recaps).

In this Ship It: Conversations episode I talk with Yvonne Young, a cloud and Linux mentor active in the CloudWhistler community. We talk about the real path into cloud and DevOps, why Linux still matters as a foundation, what “job ready” actually means, and why focus, consistency, and business thinking matter more than chasing every new tool.

Highlights

  • Linux fundamentals still matter because so much of cloud and infra work sits on top of Linux
  • What “job ready” really means: prepare for both technical and behavioral interviews, know the basics, and show how you learn when you don’t know something
  • Why so many juniors stall out by trying to learn everything instead of picking a direction
  • Why daily reps beat cramming: short, consistent practice keeps skills fresh better than marathon study sessions
  • How Yvonne thinks about certifications, including why hands-on certs like RHCSA stand out
  • Hands-on practice ideas: break things on purpose, troubleshoot, fix services, inspect ports, and use the help files
  • Why tools matter less than the business problem they solve
  • Using Vault as an example of solving real issues like secret sprawl, rotation, and centralized access
  • How to think about cloud learning: pick one provider, learn the concepts, and map your path to the kinds of companies you want to work for
  • Why mentorship and community matter, especially for juniors trying not to waste time or head in the wrong direction
  • What seniors can do better: better onboarding, real availability, and giving juniors an actual lifeline when they get stuck

Yvonne’s links

Stuff mentioned

More episodes + details: https://shipitweekly.fm

View Details
AWS Bahrain/UAE Data Center Issues Amid Iran Strikes, ArgoCD vs Flux GitOps Failures, GitHub Actions Hackerbot-Claw Attacks (Trivy), RoguePilot Codespaces Prompt Injection, Block “AI Remake” Layoffs, Claude Code Security
Ep. 24

This week on Ship It Weekly, Brian looks at how the boundary of ops keeps expanding.

We cover AWS flagging issues in Bahrain/UAE amid Iran strikes, ArgoCD vs Flux and why ArgoCD can get stuck in failed sync states, GitHub Actions being exploited at scale (plus Trivy’s incident), RoguePilot prompt injection meeting real credentials in Codespaces, Block’s “AI remake” layoffs, and Anthropic’s Claude Code Security for defenders.

Lightning round: DeepSeek model access geopolitics, Vercel’s agentic security boundaries, a KEV CVE to patch, an MCP-atlassian SSRF-to-RCE chain, and Claude Cowork scheduled tasks.

Links

AWS Bahrain/UAE (Reuters) https://www.reuters.com/world/middle-east/amazon-cloud-unit-flags-issues-bahrain-uae-data-centers-amid-iran-strikes-2026-03-02/

ArgoCD to Flux https://hai.wxs.ro/migrations/argocd-to-flux/

GitHub Actions exploitation https://www.stepsecurity.io/blog/hackerbot-claw-github-actions-exploitation

Trivy incident https://github.com/aquasecurity/trivy/discussions/10265

RoguePilot https://thehackernews.com/2026/02/roguepilot-flaw-in-github-codespaces.html

Block layoffs (WSJ) https://www.wsj.com/business/jack-dorseys-block-to-lay-off-4-000-employees-in-ai-remake-28f0d869

Claude Code Security https://www.anthropic.com/news/claude-code-security

DeepSeek (Reuters) https://www.reuters.com/world/china/deepseek-withholds-latest-ai-model-us-chipmakers-including-nvidia-sources-say-2026-02-25/

Agentic boundaries https://vercel.com/blog/security-boundaries-in-agentic-architectures

CISA KEV https://www.cisa.gov/news-events/alerts/2026/03/03/cisa-adds-two-known-exploited-vulnerabilities-catalog

mcp-atlassian CVE https://arcticwolf.com/resources/blog-uk/cve-2026-27825-critical-unauthenticated-rce-and-ssrf-in-mcp-atlassian/

Claude Cowork tasks https://support.claude.com/en/articles/13854387-schedule-recurring-tasks-in-cowork

More: https://shipitweekly.fm

View Details
Cloudflare BYOIP BGP Withdrawals, Clerk’s Postgres Query-Plan Flip Outage, and AWS Kiro Permissions Lessons (Grafana Privesc + runc CVEs)
Ep. 23

This week on Ship It Weekly, Brian looks at how the boundary of ops keeps expanding.

We cover AWS flagging issues in Bahrain/UAE amid Iran strikes, ArgoCD vs Flux and why ArgoCD can get stuck in failed sync states, GitHub Actions being exploited at scale (plus Trivy’s incident), RoguePilot prompt injection meeting real credentials in Codespaces, Block’s “AI remake” layoffs, and Anthropic’s Claude Code Security for defenders.

Lightning round: DeepSeek model access geopolitics, Vercel’s agentic security boundaries, a KEV CVE to patch, an MCP-atlassian SSRF-to-RCE chain, and Claude Cowork scheduled tasks.

Links

AWS Bahrain/UAE (Reuters) https://www.reuters.com/world/middle-east/amazon-cloud-unit-flags-issues-bahrain-uae-data-centers-amid-iran-strikes-2026-03-02/

ArgoCD to Flux https://hai.wxs.ro/migrations/argocd-to-flux/

GitHub Actions exploitation https://www.stepsecurity.io/blog/hackerbot-claw-github-actions-exploitation

Trivy incident https://github.com/aquasecurity/trivy/discussions/10265

RoguePilot https://thehackernews.com/2026/02/roguepilot-flaw-in-github-codespaces.html

Block layoffs (WSJ) https://www.wsj.com/business/jack-dorseys-block-to-lay-off-4-000-employees-in-ai-remake-28f0d869

Claude Code Security https://www.anthropic.com/news/claude-code-security

DeepSeek (Reuters) https://www.reuters.com/world/china/deepseek-withholds-latest-ai-model-us-chipmakers-including-nvidia-sources-say-2026-02-25/

Agentic boundaries https://vercel.com/blog/security-boundaries-in-agentic-architectures

CISA KEV https://www.cisa.gov/news-events/alerts/2026/03/03/cisa-adds-two-known-exploited-vulnerabilities-catalog

mcp-atlassian CVE https://arcticwolf.com/resources/blog-uk/cve-2026-27825-critical-unauthenticated-rce-and-ssrf-in-mcp-atlassian/

Claude Cowork tasks https://support.claude.com/en/articles/13854387-schedule-recurring-tasks-in-cowork

More: https://shipitweekly.fm

View Details
Ship It Conversations: Mike Lady on Day Two Readiness + Guardrails in the AI Era
Ep. 22

This is a guest conversation episode of Ship It Weekly (separate from the weekly news recaps).

In this Ship It: Conversations episode I talk with Mike Lady (Senior DevOps Engineer, distributed systems) from Enterprise Vibe Code on YouTube. We talk day two readiness, guardrails/quality gates, and why shipping safely matters even more now that AI can generate code fast.

Highlights

  • Day 0 vs Day 1 vs Day 2 (launching vs operating and evolving safely)
  • What teams look like without guardrails (“hope is not a strategy”)
  • Why guardrails speed you up long-term (less firefighting, more predictable delivery)
  • Day-two audit checklist: source control/branches/PRs, branch protection, CI quality gates, secrets/config, staging→prod flow
  • AI agents: they’ll “lie, cheat, and steal” to satisfy the goal unless you gate them
  • Multi-model reviews (Claude/Gemini/Codex) as different perspectives
  • AI in prod: start read-only (logs/traces), then earn trust slowly

Mike’s links

Stuff mentioned

More episodes + details: https://shipitweekly.fm

View Details
Ship It Weekly – DevOps and SRE News for Engineers Who Run Production
Special

Ship It Weekly is a DevOps and SRE news podcast for engineers who run real systems.

Every week I break down what actually matters in cloud, Kubernetes, CI/CD, infrastructure as code, and production reliability. No hype. No vendor spin. Just practical analysis from someone who’s been on call and shipped systems at scale.

This isn’t a tutorial show. It’s a signal filter.

I cover major industry shifts, security incidents, cloud provider changes, and tooling updates, then explain what they mean for platform teams and engineers operating in production.

If you work in DevOps, SRE, platform engineering, or cloud infrastructure and want context instead of clickbait, you’re in the right place.

New episodes weekly.

You can also find detailed write-ups at: https://shipitweekly.fm

And curated production-focused briefs at: https://oncallbrief.com

Subscribe, and let’s ship.

View Details
GitHub Agentic Workflows, Gentoo Leaves GitHub, Argo CD 3.3 Upgrade Gotcha, AWS Config Scope Creep
Ep. 21

This week on Ship It Weekly, Brian hits five stories where the “defaults” are shifting under ops teams.

GitHub is bringing Agentic Workflows into Actions, Gentoo is migrating off GitHub to Codeberg, Argo CD upgrades are forcing Server-Side Apply in some paths, AWS Config quietly expanded coverage again, and EC2 nested virtualization is now possible on virtual instances.

Links

YouTube episodes https://www.youtube.com/watch?v=tuuLlo2rbI0&list=PLYLi5KINFnO7dVMbhsJQTKRFXfSSwPmuL&pp=sAgC

OnCallBrief https://oncallbrief.com

Teller’s Tech Substack https://tellerstech.substack.com/

GitHub Agentic Workflows (preview) https://github.blog/changelog/2026-02-13-github-agentic-workflows-are-now-in-technical-preview/

Gentoo moves to Codeberg https://www.theregister.com/2026/02/17/gentoo_moves_to_codeberg_amid/

Argo CD upgrade guide: 3.2 -> 3.3 (SSA) https://argo-cd.readthedocs.io/en/latest/operator-manual/upgrading/3.2-3.3/

AWS Config: 30 new resource types https://aws.amazon.com/about-aws/whats-new/2026/02/aws-config-new-resource-types

EC2 nested virtualization (virtual instances) https://aws.amazon.com/about-aws/whats-new/2026/02/amazon-ec2-nested-virtualization-on-virtual/

GitHub status page update https://github.blog/changelog/2026-02-13-updated-status-experience/

GitHub Actions: early Feb updates https://github.blog/changelog/2026-02-05-github-actions-early-february-2026-updates/

Runner min version enforcement extended https://github.blog/changelog/2026-02-05-github-actions-self-hosted-runner-minimum-version-enforcement-extended/

Open Build Service postmortem https://openbuildservice.org/2026/02/02/post-mortem/

Human story: AI SRE vs incident management https://surfingcomplexity.blog/2026/02/14/lots-of-ai-sre-no-ai-incident-management/

More episodes and show info on https://shipitweekly.fm

View Details
Special: OpenClaw Security Timeline and Fallout: CVE-2026-25253 One-Click Token Leak, Malicious ClawHub Skills, Exposed Agent Control Panels, and Why Local AI Agents Are a New DevOps/SRE Control Plane (OpenAI Hires Founder)
Ep. 20

In this Ship It Weekly special, Brian breaks down the OpenClaw situation and why it’s bigger than “another CVE.”

OpenClaw is a preview of what platform teams are about to deal with: autonomous agents running locally, wired into real tools, real APIs, and real credentials. When the trust model breaks, it’s not just data exposure. It’s an operator compromise.

We walk through the recent timeline: mass internet exposure of OpenClaw control panels, CVE-2026-25253 (a one-click token leak that can turn your browser into the bridge to your local gateway), a skills marketplace that quickly became a malware delivery channel, and the Moltbook incident showing how “agent content” becomes a new supply chain problem. We close with the signal that agents are going mainstream: OpenAI hiring the OpenClaw creator.

Chapters

  • 1. What OpenClaw Actually Is
  • 2. The Situation in One Line
  • 3. Localhost Is Not a Boundary (The CVE Lesson)
  • 4. Exposed Control Panels (How “Local” Went Public)
  • 5. The Marketplace Problem (Skills Are Supply Chain)
  • 6. The Ecosystem Spills (Agent Platforms Leaking Real Data)
  • 7. Minimum Viable Safety for Local Agents
  • 8. The Plot Twist (OpenAI Hires the Creator)

Links from this episode

Censys exposure research https://censys.com/blog/openclaw-in-the-wild-mapping-the-public-exposure-of-a-viral-ai-assistant

GitHub advisory (CVE-2026-25253) https://github.com/advisories/GHSA-g8p2-7wf7-98mq

NVD entry https://nvd.nist.gov/vuln/detail/CVE-2026-25253

Koi Security: ClawHavoc / malicious skills https://www.koi.ai/blog/clawhavoc-341-malicious-clawedbot-skills-found-by-the-bot-they-were-targeting

Moltbook leak coverage (Reuters) https://www.reuters.com/legal/litigation/moltbook-social-media-site-ai-agents-had-big-security-hole-cyber-firm-wiz-says-2026-02-02/

OpenClaw security docs https://docs.openclaw.ai/gateway/security

OpenAI hire coverage (FT) https://www.ft.com/content/45b172e6-df8c-41a7-bba9-3e21e361d3aa

More information and past episodes on https://shipitweekly.fm

View Details
When guardrails break prod: GitHub “Too Many Requests” from legacy defenses, Kubernetes nodes/proxy GET RCE, HCP Vault resilience in an AWS regional outage, and PCI DSS scope creep
Ep. 19

This week on Ship It Weekly, Brian hits four stories where the guardrails become the incident.

GitHub had “Too Many Requests” caused by legacy abuse protections that outlived their moment. Takeaway: controls need owners, visibility, and a retirement plan.

Kubernetes has a nasty edge case where nodes/proxy GET can turn into command execution via WebSocket behavior. If you’ve ever handed out “telemetry” RBAC broadly, go audit it.

HashiCorp shared how HCP Vault handled a real AWS regional disruption: control plane wobbled, Dedicated data planes kept serving. Control plane vs data plane separation paying off.

AWS expanded its PCI DSS compliance package with more services and the Asia Pacific (Taipei) region. Scope changes don’t break prod today, but they turn into evidence churn later if you don’t standardize proof.

Human story: “reasonable assurance” turning into busywork.

Links

GitHub: When protections outlive their purpose (legacy defenses + lifecycle)

https://github.blog/engineering/infrastructure/when-protections-outlive-their-purpose-a-lesson-on-managing-defense-systems-at-scale/

Kubernetes nodes/proxy GET → RCE (analysis)

https://grahamhelton.com/blog/nodes-proxy-rce

OpenFaaS guidance / mitigation notes

https://www.openfaas.com/blog/kubernetes-node-proxy-rce/

HCP Vault resilience during real AWS regional outages

https://www.hashicorp.com/blog/how-resilient-is-hcp-vault-during-real-aws-regional-outages

AWS: Fall 2025 PCI DSS compliance package update

https://aws.amazon.com/blogs/security/fall-2025-pci-dss-compliance-package-available-now/

GitHub Actions: self-hosted runner minimum version enforcement extended

https://github.blog/changelog/2026-02-05-github-actions-self-hosted-runner-minimum-version-enforcement-extended/

Headlamp in 2025: Project Highlights (SIG UI)

https://kubernetes.io/blog/2026/01/22/headlamp-in-2025-project-highlights/

AWS Network Firewall Active Threat Defense (MadPot)

https://aws.amazon.com/blogs/security/real-time-malware-defense-leveraging-aws-network-firewall-active-threat-defense/

Reasonable assurance turning into busywork (r/sre)

https://www.reddit.com/r/sre/comments/1qvwbgf/at_what_point_does_reasonable_assurance_turn_into/

More episodes + details: https://shipitweekly.fm

View Details
Azure VM Control Plane Outage, GitHub Agent HQ (Claude + Codex), Claude Opus 4.6, Gemini CLI, MCP
Ep. 18

This week on Ship It Weekly, Brian hits four “control plane + trust boundary” stories where the glue layer becomes the incident.

Azure had a platform incident that impacted VM management operations across multiple regions. Your app can be up, but ops is degraded.

GitHub is pushing Agent HQ (Claude + Codex in the repo/CI flow), and Actions added a case() function so workflow logic is less brittle.

MCP is becoming platform plumbing: Miro launched an MCP server and Kong launched an MCP Registry.

Links

Azure status incident (VM service management issues) https://azure.status.microsoft/en-us/status/history/?trackingId=FNJ8-VQZ

GitHub Agent HQ: Claude + Codex https://github.blog/news-insights/company-news/pick-your-agent-use-claude-and-codex-on-agent-hq/

GitHub Actions update (case() function) https://github.blog/changelog/2026-01-29-github-actions-smarter-editing-clearer-debugging-and-a-new-case-function/

Claude Opus 4.6 https://www.anthropic.com/news/claude-opus-4-6

How Google SREs use Gemini CLI https://cloud.google.com/blog/topics/developers-practitioners/how-google-sres-use-gemini-cli-to-solve-real-world-outages

Miro MCP server announcement https://www.businesswire.com/news/home/20260202411670/en/Miro-Launches-MCP-Server-to-Connect-Visual-Collaboration-With-AI-Coding-Tools

Kong MCP Registry announcement https://konghq.com/company/press-room/press-release/kong-introduces-mcp-registry

GitHub Actions hosted runners incident thread https://github.com/orgs/community/discussions/186184

DockerDash / Ask Gordon research https://noma.security/blog/dockerdash-two-attack-paths-one-ai-supply-chain-crisis/

Terraform 1.15 alpha https://github.com/hashicorp/terraform/releases/tag/v1.15.0-alpha20260204

Wiz Moltbook write-up https://www.wiz.io/blog/exposed-moltbook-database-reveals-millions-of-api-keys

Chainguard “EmeritOSS” https://www.chainguard.dev/unchained/introducing-chainguard-emeritoss

More episodes + details: https://shipitweekly.fm

View Details
CodeBreach in AWS CodeBuild, Bazel TLS Certificate Expiry Breaks Builds, Helm Charts Reliability Audit, and New n8n Sandbox Escape RCE
Ep. 17

This week on Ship It Weekly, Brian looks at four “glue failures” that can turn into real outages and real security risk.

We start with CodeBreach: AWS disclosed a CodeBuild webhook filter misconfig in a small set of AWS-managed repos. The takeaway is simple: CI trigger logic is part of your security boundary now.

Next is the Bazel TLS cert expiry incident. Cert failures are a binary cliff, and “auto renew” is only one link in the chain.

Third is Helm chart reliability. Prequel reviewed 105 charts and found a lot of demo-friendly defaults that don’t hold up under real load, rollouts, or node drains.

Fourth is n8n. Two new high-severity flaws disclosed by JFrog. “Authenticated” still matters because workflow authoring is basically code execution, and these tools sit next to your secrets.

Lightning round: Fence, HashiCorp agent-skills, marimo, and a cautionary agent-loop story.

Links

AWS CodeBreach bulletin https://aws.amazon.com/security/security-bulletins/2026-002-AWS/

Wiz research https://www.wiz.io/blog/wiz-research-codebreach-vulnerability-aws-codebuild

Bazel postmortem https://blog.bazel.build/2026/01/16/ssl-cert-expiry.html

Helm report https://www.prequel.dev/blog-post/the-real-state-of-helm-chart-reliability-2025-hidden-risks-in-100-open-source-charts

n8n coverage https://thehackernews.com/2026/01/two-high-severity-n8n-flaws-allow.html

Fence https://github.com/Use-Tusk/fence

agent-skills https://github.com/hashicorp/agent-skills

marimo https://marimo.io/

Agent loop story https://www.theregister.com/2026/01/27/ralph_wiggum_claude_loops/

Related n8n episodes:

https://www.tellerstech.com/ship-it-weekly/n8n-critical-cve-cve-2026-21858-aws-gpu-capacity-blocks-price-hike-netflix-temporal/

https://www.tellerstech.com/ship-it-weekly/n8n-auth-rce-cve-2026-21877-github-artifact-permissions-and-aws-devops-agent-lessons/

More episodes + details: https://shipitweekly.fm

View Details
Ship It Conversations: AI Automation for SMBs: What to Automate (And What Not To) (with Austin Reed)
Ep. 16

This is a guest conversation episode of Ship It Weekly (separate from the weekly news recaps).

In this Ship It: Conversations episode I talk with Austin Reed from horizon.dev about AI and automation for small and mid-sized businesses, and what actually works once you leave the demo world.

We get into the most common automation wins he sees (sales and customer service), why a lot of projects fail due to communication and unclear specs more than the tech, and the trap of thinking “AI makes it cheap.” Austin shares how they push teams toward quick wins first, then iterate with prototypes so you don’t spend $10k automating a thing that never even happens.

We also talk guardrails: when “human-in-the-loop” makes sense, what he avoids automating (finance-heavy logic, HIPAA/medical, government), and why the goal is usually leverage, not replacing people. On the dev side, we nerd out a bit on the tooling they’re using day to day: GPT and Claude, Cursor, PR review help, CI/CD workflows, and why knowing how to architect and validate output matters way more than people think.

If you’re a DevOps/SRE type helping the business “do AI,” or you’re just tired of automation hype that ignores real constraints like credentials, scope creep, and operational risk, this one is very much about the practical middle ground.

Links from the episode:

Austin on LinkedIn: https://www.linkedin.com/in/automationsexpert/

horizon.dev: horizon.dev

YouTube: https://www.youtube.com/@horizonsoftwaredev

Skool: https://www.skool.com/automation-masters

If you found this useful, share it with the person on your team who keeps saying “we should automate that” but hasn’t dealt with the messy parts yet.

More information on our website: https://shipitweekly.fm

View Details
curl Shuts Down Bug Bounties Due to AI Slop, AWS RDS Blue/Green Cuts Switchover Downtime to ~5 Seconds, and Amazon ECR Adds Cross-Repository Layer Sharing
Ep. 15

This week on Ship It Weekly, Brian looks at three different versions of the same problem: systems are getting faster, but human attention is still the bottleneck.

We start with curl shutting down their bug bounty program after getting flooded with low-quality “AI slop” reports. It’s not a “security vs maintainers” story, it’s an incentives and signal-to-noise story. When the cost to generate reports goes to zero, you basically DoS the people doing triage.

Next, AWS improved RDS Blue/Green Deployments to cut writer switchover downtime to typically ~5 seconds or less (single-region). That’s a big deal, but “fast switchover” doesn’t automatically mean “safe upgrade.” Your connection pooling, retries, and app behavior still decide whether it’s a blip or a cascade.

Third, Amazon ECR added cross-repository layer sharing. Sounds small, but if you’ve got a lot of repos and you’re constantly rebuilding/pushing the same base layers, this can reduce storage duplication and speed up pushes in real fleets.

Lightning round covers a practical Kubernetes clientcmd write-up, a solid “robust Helm charts” post, a traceroute-on-steroids style tool, and Docker Kanvas as another signal that vendors are trying to make “local-to-cloud” workflows feel less painful.

We wrap with Honeycomb’s interim report on their extended EU outage, and the part that always hits hardest in long incidents: managing engineer energy and coordination over multiple days is a first-class reliability concern.

Links from this episode

curl bug bounties shutdown https://github.com/curl/curl/pull/20312

RDS Blue/Green faster switchover https://aws.amazon.com/about-aws/whats-new/2026/01/amazon-rds-blue-green-deployments-reduces-downtime/

ECR cross-repo layer sharing https://aws.amazon.com/about-aws/whats-new/2026/01/amazon-ecr-cross-repository-layer-sharing/

Kubernetes clientcmd apiserver access https://kubernetes.io/blog/2026/01/19/clientcmd-apiserver-access/

Building robust Helm charts https://www.willmunn.xyz/devops/helm/kubernetes/2026/01/17/building-robust-helm-charts.html

ttl tool https://github.com/lance0/ttl

Docker Kanvas (InfoQ) https://www.infoq.com/news/2026/01/docker-kanvas-cloud-deployment/

Honeycomb EU interim report https://status.honeycomb.io/incidents/pjzh0mtqw3vt

SRE Weekly issue #504 https://sreweekly.com/sre-weekly-issue-504/

More episodes + details: https://shipitweekly.fm

View Details
n8n Auth RCE (CVE-2026-21877), GitHub Artifact Permissions, and AWS DevOps Agent Lessons
Ep. 14

This week on Ship It Weekly, the theme is simple: the automation layer has become a control plane, and that changes how you should think about risk.

We start with n8n’s latest critical vulnerability, CVE-2026-21877. This one is different from the unauth “Ni8mare” issue we covered in Episode 12. It’s authenticated RCE, which means the real question isn’t only “is it internet exposed,” it’s who can log in, who can create or modify workflows, and what those workflows can reach. Takeaway: treat workflow automation tools like CI systems. They run code, they hold credentials, and they can pivot into real infrastructure.

Next is GitHub’s new fine-grained permission for artifact metadata. Small change, big least-privilege implications for Actions workflows. It’s also a good forcing function to clean up permission sprawl across repos.

Third is AWS’s DevOps Agent story, and the best part is that it’s not hype. It’s a real look at what it takes to operationalize agents: evaluation, observability into tool calls/decisions, and control loops with brakes and approvals. Prototype is cheap. Reliability is the work.

Lightning round: GitHub secret scanning changes that can quietly impact governance, a punchy Claude Code “guardrails aren’t guaranteed” reminder, Block’s Goose as another example of agent workflows getting productized, and OpenCode as an “agent runner” pattern worth watching if you’re experimenting locally.

Links

n8n CVE-2026-21877 (authenticated RCE) https://thehackernews.com/2026/01/n8n-warns-of-cvss-100-rce-vulnerability.html?m=1

Episode 12 (n8n “Ni8mare” / CVE-2026-21858) https://www.tellerstech.com/ship-it-weekly/n8n-critical-cve-cve-2026-21858-aws-gpu-capacity-blocks-price-hike-netflix-temporal/

GitHub: fine-grained permission for artifact metadata (GA) https://github.blog/changelog/2026-01-13-new-fine-grained-permission-for-artifact-metadata-is-now-generally-available/

GitHub secret scanning: extended metadata auto-enabled (Feb 18) https://github.blog/changelog/2026-01-15-secret-scanning-extended-metadata-to-be-automatically-enabled-for-certain-repositories/

Claude Code issue thread (Bedrock guardrails gap) https://github.com/anthropics/claude-code/issues/17118

Block Goose (tutorial + sessions/context) https://block.github.io/goose/docs/tutorials/rpi https://block.github.io/goose/docs/guides/sessions/smart-context-management

OpenCode https://opencode.ai

More episodes + details: https://shipitweekly.fm

View Details
Ship It Conversations: Human-in-the-Loop Fixer Bots and AI Guardrails in CI/CD (with Gracious James)
Ep. 13

This is a guest conversation episode of Ship It Weekly (separate from the weekly news recaps).

In this Ship It: Conversations episode I talk with Gracious James Eluvathingal about TARS, his “human-in-the-loop” fixer bot wired into CI/CD.

We get into why he built it in the first place, how he stitches together n8n, GitHub, SSH, and guardrailed commands, and what it actually looks like when an AI agent helps with incident response without being allowed to nuke prod. We also dig into rollback phases, where humans stay in the loop, and why validating every LLM output before acting on it is the single most important guardrail.

If you’re curious about AI agents in pipelines but hate the idea of a fully autonomous “ops bot,” this one is very much about the middle ground: segmenting workflows, limiting blast radius, and using agents to reduce toil instead of replace engineers.

Gracious also walks through where he’d like to take TARS next (Terraform, infra-level decisions, more tools) and gives some solid advice for teams who want to experiment with agents in CI/CD without starting with “let’s give it root and see what happens.”

Links from the episode:

Gracious on LinkedIn: https://www.linkedin.com/in/gracious-james-eluvathingal

TARS overview post: https://www.linkedin.com/posts/gracious-james-eluvathingal_aiagents-devops-automation-activity-7391064503892987904-psQ4

If you found this useful, share it with the person on your team who’s poking at AI automation and worrying about guardrails.

More information on our website: https://shipitweekly.fm

View Details
n8n Critical CVE (CVE-2026-21858), AWS GPU Capacity Blocks Price Hike, Netflix Temporal
Ep. 12

This week on Ship It Weekly, Brian’s theme is basically: the “automation layer” is not a side tool anymore. It’s part of your perimeter, part of your reliability story, and sometimes part of your budget problem too.

We start with the n8n security issue. A lot of teams use n8n as glue for ops workflows, which means it tends to collect credentials and touch real systems. When something like this drops, the right move is to treat it like production-adjacent infra: patch fast, restrict exposure, and assume anything stored in the tool is high value.

Next is AWS quietly raising prices on EC2 Capacity Blocks for ML. Even if you’re not a GPU-heavy shop, it’s a useful signal: scarce compute behaves like a market. If you do rely on scheduled GPU capacity, it’s time to revisit forecasts and make sure your FinOps tripwires catch rate changes before the end-of-month surprise.

Third is Netflix’s write-up on using Temporal for reliable cloud operations. The best takeaway is not “go adopt Temporal tomorrow.” It’s the pattern: long-running operational workflows should be resumable, observable, and safe to retry. If your critical ops are still bash scripts and brittle pipelines, you’re one transient failure away from a very dumb day.

In the lightning round: Kubernetes Dashboard getting archived and the “ops dependencies die” reality check, Docker pushing hardened images as a safer baseline and Pipedash.

Links

SRE Weekly issue 504 (source roundup) https://sreweekly.com/sre-weekly-issue-504/

n8n CVE (NVD) https://nvd.nist.gov/vuln/detail/CVE-2026-21858

n8n community advisory https://community.n8n.io/t/security-advisory-security-vulnerability-in-n8n-versions-1-65-1-120-4/247305

AWS price increase coverage (The Register) https://www.theregister.com/2026/01/05/aws_price_increase/

Netflix: Temporal powering reliable cloud operations https://netflixtechblog.com/how-temporal-powers-reliable-cloud-operations-at-netflix-73c69ccb5953

Kubernetes SIG-UI thread (Dashboard archiving) https://groups.google.com/g/kubernetes-sig-ui/c/vpYIRDMysek/m/wd2iedUKDwAJ

Kubernetes Dashboard repo (archived) https://github.com/kubernetes/dashboard

Pipedash https://github.com/hcavarsan/pipedash

Docker Hardened Images https://www.docker.com/blog/docker-hardened-images-for-every-developer/

More episodes and more details on this episode can be found on our website: https://shipitweekly.fm

View Details
Ship It Conversations: Backstage vs Internal IDPs, and Why DevEx Muscle Matters (with Danny Teller)
Ep. 11

This is a guest conversation episode of Ship It Weekly (separate from the weekly news recaps).

I sat down with Danny Teller, a DevOps Architect and Tech Lead Manager at Tipalti, to talk about internal developer platforms and the reality behind “just set up a developer portal.” We get into Backstage versus internal IDPs, why adoption is the real battle, and why platform/DevEx maturity matters more than whatever tool you pick.

What we covered

Backstage vs internal IDPs Backstage is a solid starting point for a developer portal, but it doesn’t magically create standards, ownership, or platform maturity. We talk about when Backstage fits, and when teams end up building internal tooling anyway.

DevEx muscle (the make-or-break) Danny’s take: the portal UI is the easy part. The hard part is the ongoing work that makes it useful: paved roads, sane defaults, support, and keeping the catalog/data accurate so engineers trust it.

Where teams get burned Common failure mode: teams ship a portal first, then realize they don’t have the resourcing, ownership, or workflows behind it. Adoption fades fast if the portal doesn’t remove real friction.

A build vs buy gut check We walk through practical signals that push you toward open source Backstage, a managed Backstage offering, or a commercial portal. We also hit the maintenance trap: if you build too much, you’ve created a second product.

Links and resources

Danny Teller's Linkedin: https://www.linkedin.com/in/danny-teller/

matlas — one CLI for Atlas and MongoDB: https://github.com/teabranch/matlas-cli

Backstage: https://backstage.io/

Roadie (managed Backstage): https://roadie.io/

Port: https://www.port.io/

Cortex: https://www.cortex.io/

OpsLevel: https://www.opslevel.com/

Atlassian Compass: https://www.atlassian.com/software/compass

Humanitec Platform Orchestrator: https://humanitec.com/products/platform-orchestrator

Northflank: https://northflank.com/

If you enjoyed this episode Ship It Weekly is still the weekly news recap, and I’m dropping these guest convos in between. Follow/subscribe so you catch both, and if this was useful, share it with a platform/devex friend and leave a quick rating or review. It helps more than it should.

Visit our website at https://www.shipitweekly.fm

View Details
Fail Small, IaC Control Planes, and Automated RCA
Ep. 10

This week on Ship It Weekly, Brian kicks off the new year with one theme: automation is getting faster, and that makes blast radius and oversight matter more than ever.

We start with Cloudflare’s “fail small” mindset. The core idea is simple: big outages usually come from correlated failure, not one box dying. If a bad change lands everywhere at once, you’re toast. “Fail small” is about forcing problems to stay local so you can stop the bleeding before it becomes global.

Next is Pulumi’s push to be the control plane for all your IaC, including Terraform and HCL. The interesting part isn’t syntax wars. It’s the workflow layer: approvals, policy enforcement, audit trails, drift, and how teams standardize without signing up for a multi-year rewrite.

Third is Meta’s DrP, a root cause analysis platform that turns repeated incident investigation steps into software. Even if you’re not Meta, the pattern is worth stealing: automate the first 10–15 minutes of your most common incident types so on-call is consistent no matter who’s holding the pager.

In the lightning round: a follow-up on GitHub Actions direction (and a quick callback to Episode 6’s runner pricing pause), AWS ECR creating repos on push, a smarter take on incident metrics, Terraform drift visibility, and parallel “coding agent” workflows.

We wrap with a human reminder about the ironies of automation: automation doesn’t remove responsibility, it moves it. Faster systems require better brakes, better observability, and easier rollback.

Links from this episode

SRE Weekly issue 503 (source roundup - CloudFlare) https://sreweekly.com/sre-weekly-issue-503/

Pulumi: all IaC, including Terraform and HCL https://www.pulumi.com/blog/all-iac-including-terraform-and-hcl/

Meta DrP: https://engineering.fb.com/2025/12/19/data-infrastructure/drp-metas-root-cause-analysis-platform-at-scale/

GitHub Actions: “Let’s talk about GitHub Actions” https://github.blog/news-insights/product-news/lets-talk-about-github-actions/

Episode 6 (GitHub runner pricing pause, Terraform Cloud limits, AI in CI) https://www.tellerstech.com/ship-it-weekly/github-runner-pricing-pause-terraform-cloud-limits-and-ai-in-ci/

AWS ECR: create repositories on push https://aws.amazon.com/about-aws/whats-new/2025/12/amazon-ecr-creating-repositories-on-push/

DriftHound https://drifthound.io/

Superset https://superset.sh/

More episodes + contact info, and more details on this episode can be found on our website: https://shipitweekly.fm

View Details
Ship It Conversations: From Full-Stack to Cloud/DevOps, One Project at a Time (with Eric Paatey)
Ep. 9

This is a guest conversation episode of Ship It Weekly (separate from the weekly news recaps).

I sat down with Eric Paatey, a Cloud & DevOps Engineer who’s been transitioning from full-stack web development into cloud/devops, and building real skills through hands-on projects instead of just collecting tools and buzzwords.

We talk about what that transition actually feels like, what’s helped most, and why you don’t need a rack of servers to learn DevOps.

What we covered Eric’s path into DevOps How he moved from building web apps to caring about pipelines, infra, scalability, reliability, and automation. The “oh… code is only part of the job” moment that pushes a lot of people toward DevOps.

The WHY behind DevOps Eric’s take: DevOps is mainly about breaking down silos and improving communication between dev, ops, security, and the business. We also hit the idea from The DevOps Handbook: small batches win. The bigger the release, the harder it is to recover when something breaks.

Leveling up without drowning in tools DevOps has an endless tool list, so we talked about how to stay current without burning out. Eric’s recommendation: stay connected to the industry. Meet people, join user groups, go to events, and don’t silo yourself.

The homelab mindset (and why simple is fine) Eric shared his “homelab on the go” setup and why the hardware isn’t the point. It’s about using a safe environment to build habits: automation, debugging, systems thinking, monitoring, breaking things, recovering, and improving the design.

A practical first project for aspiring DevOps engineers We talked through a starter project you can actually show in interviews: Dockerize a simple app, deploy it behind an ALB, and learn basic networking/security along the way. You don’t need to understand everything on day one, but you do need to build things and learn what breaks.

Agentic AI and guardrails We also touched on AI agents and MCPs, what they could mean for ops teams, and why you should not give agents full access to anything. Least privilege and policy guardrails matter, because “non-deterministic” and “prod permissions” is a scary combo.

Links and resources Eric Paatey on LinkedIn: https://www.linkedin.com/in/eric-paatey-72a87799/

Eric’s website/portfolio: https://ericpaatey.com/

If you enjoyed this episode Ship It Weekly is still the weekly news recap, and I’m dropping these guest convos in between. Follow/subscribe so you catch both, and if this was useful, share it with a coworker or your on-call buddy and leave a quick rating or review. It helps more than it should.

Visit our website at https://www.shipitweekly.fm

View Details
Cloudflare’s Workers Scheduler, AWS DBs on Vercel, and JIT Admin Access
Ep. 8

This week on Ship It Weekly, Brian looks at real platform engineering in the wild.

We start with Cloudflare’s write-up on building an internal maintenance scheduler on Workers. It’s not marketing fluff. It’s “we hit memory limits, changed the model, and stopped pulling giant datasets into the runtime.”

Next up: AWS databases are now available inside the Vercel Marketplace. This is a quiet shift with loud consequences. Devs can click-button real AWS databases from the same place they deploy apps, and platform teams still own the guardrails: account sprawl, billing/tagging, audit trails, region choices, and networking posture.

Third story: TEAM (Temporary Elevated Access Management) for IAM Identity Center. Time-bound elevation with approvals, automatic expiry, and auditing. We cover how this fits alongside break-glass and why auto-expiry is the difference between least-privilege and privilege creep.

Lightning round: GitHub Actions workflow page performance improvements, Lambda Managed Instances (slightly cursed but interesting), a quick atmos tooling blip, and k8sdiagram.fun for explaining k8s to humans.

We close with Marc Brooker’s “What Now? Handling Errors in Large Systems” and the takeaway: error handling isn’t a local code decision, it’s architecture. Crashing vs retrying vs continuing only makes sense when you understand correlation and blast radius.

shipitweekly.fm has links + the contact email. Want to be a guest? Reach out. And if you’re enjoying the show, follow/subscribe and leave a quick rating or review. It helps a ton.

Links from this episode

Cloudflare https://blog.cloudflare.com/building-our-maintenance-scheduler-on-workers/ AWS on Vercel https://aws.amazon.com/about-aws/whats-new/2025/12/aws-databases-are-available-on-the-vercel/ https://vercel.com/changelog/aws-databases-now-available-on-the-vercel-marketplace TEAM https://aws-samples.github.io/iam-identity-center-team/ https://github.com/aws-samples/iam-identity-center-team GitHub Actions https://github.blog/changelog/2025-12-22-improved-performance-for-github-actions-workflows-page/ Lambda Managed Instances https://docs.aws.amazon.com/lambda/latest/dg/lambda-managed-instances.html Atmos https://github.com/cloudposse/atmos/issues k8sdiagram.fun https://k8sdiagram.fun/ Marc Brooker https://brooker.co.za/blog/2025/11/20/what-now.html

View Details
Ship It Conversations: The WHY Behind DevOps, Upskilling, and Agentic AI (with Maz Islam)
Ep. 7

This is a Ship It Weekly conversation episode. The weekly news recaps are still weekly. These interviews drop in between when I find someone worth talking to and the convo feels useful.

In this episode I’m joined by Mazharul “Maz” Islam (DevOps with Maz). Maz is a UK-based DevOps Engineer who shares practical, real-world DevOps content on YouTube and LinkedIn. We talk about the stuff that actually matters when you’re building systems, running infrastructure, owning reliability, and living in on-call.

We hit three big things: the importance of understanding the WHY behind DevOps (not just the tools), how to upskill and keep up with the industry without burning out, and what the agentic AI era might look like for DevOps, SRE, and platform engineering teams. We also touch on MCPs and AI agents, and what “leveling up” looks like for companies that want to move faster without breaking everything.

If you’re into DevOps culture, SRE practices, platform engineering, CI/CD, infrastructure automation, and how teams should think about reliability and security as things keep changing, this one should land.

Guest Mazharul Islam (DevOps with Maz) UK-based DevOps Engineer. Posts a lot of hands-on content around cloud, DevOps fundamentals, and leveling up as an engineer.

Links (Maz) YouTube: https://m.youtube.com/@devopswithmaz LinkedIn: https://www.linkedin.com/in/mazharul419

Topics we covered WHY behind DevOps, and why “tools” is the smallest part of it DevOps fundamentals vs tool-chasing Upskilling strategies for DevOps Engineers and SREs How to keep learning cloud and automation without drowning What strong teams measure and what “good” actually looks like (delivery, reliability, feedback loops) Agentic AI, AI agents in operations, and the next era of DevOps MCPs, automation guardrails, and safe ways to scale change How companies can “level up” their engineering org without turning it into chaos

We also discussed the previous episode of Ship It Weekly - GitHub Runner Pricing Pause, Terraform Cloud Limits, and AI in CI

https://www.tellerstech.com/ship-it-weekly/github-runner-pricing-pause-terraform-cloud-limits-and-ai-in-ci/

Book Maz recommended The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations (Paperback, Oct 6, 2016) Gene Kim, Jez Humble, Patrick Debois, John Willis

About Ship It Weekly (format) Ship It Weekly is for people running infrastructure and owning reliability. Most episodes are quick weekly news recaps for DevOps, SRE, and platform engineering. In between those weekly drops, I’ll publish interview episodes like this one.

Subscribe / help the show If you want the weekly DevOps news recaps plus these interviews, hit follow or subscribe in your podcast app. And if you’re feeling generous, leave a rating or review and share this episode with a coworker (especially your on-call buddy). That stuff genuinely helps the show get discovered.

View Details
GitHub Runner Pricing Pause, Terraform Cloud Limits, and AI in CI
Ep. 6

This week on Ship It Weekly, Brian looks at how the “platform tax” is showing up everywhere: pricing model shifts, CI dependencies, and new security boundaries thanks to AI agents.

We start with GitHub Actions. GitHub announced a new “cloud platform” charge for self-hosted runners in private/internal repos… then hit pause after backlash. Hosted runner price reductions for 2026 are still planned. We also got the perfect timing joke: a GitHub incident the same week.

Next up is HashiCorp. Legacy HCP Terraform (Terraform Cloud) Free is reaching end-of-life in 2026, with orgs moving to the newer Free tier capped at 500 managed resources. If you’re running real infrastructure, this is a good moment to audit what you’re actually managing and decide whether you’re cleaning up, paying, or planning a migration.

Then we talk PromptPwnd: why stuffing untrusted PR/issue text into AI agent prompts (inside CI) can turn into a supply chain/security problem. The short version: treat AI inputs like hostile user input, keep tokens/permissions minimal, and don’t let agents “run with scissors.”

We also cover the Home Depot report about long-lived access exposure as a reminder that secrets hygiene, blast radius, and detection still matter more than the shiny tools.

In the lightning round: CDKTF is sunset/archived, Bitbucket is cleaning up free unused workspaces, and SourceHut is proposing pricing changes. We wrap with a human note on “platform whiplash” and why a simple watchlist beats carrying all this stuff in your head.

Links from this episode

GitHub Actions pricing + pause https://runs-on.com/blog/github-self-hosted-runner-fee-2026/ https://x.com/github/status/2001372894882918548 https://www.githubstatus.com/incidents/x696x0g4t85l

HashiCorp / Terraform Cloud free plan changes https://github.com/hashicorp/terraform-cdk?tab=readme-ov-file#sunset-notice https://www.reddit.com/r/Terraform/s/slYm77wzYr

PromptPwnd / AI agents in CI https://www.aikido.dev/blog/promptpwnd-github-actions-ai-agents

Home Depot access exposure report https://techcrunch.com/2025/12/12/home-depot-exposed-access-to-internal-systems-for-a-year-says-researcher/

Bitbucket cleanup https://community.atlassian.com/forums/Bitbucket-articles/Bitbucket-cleanup-of-free-unused-workspaces-what-you-need-to/ba-p/3144063

SourceHut pricing proposal https://sourcehut.org/blog/2025-12-01-proposed-pricing-changes/

View Details
IBM Buys Confluent, React2Shell, and Netflix on Aurora
Ep. 5

In this episode of Ship It Weekly, Brian powers through a cold and digs into a very “infra grown-up” week in DevOps.

First up, IBM is buying Confluent for $11B. We talk about what that means if you’re on Confluent Cloud today, still running your own Kafka, or trying to choose between Confluent, MSK, and DIY. It’s part of a bigger pattern after IBM’s HashiCorp deal, and it has real implications for vendor concentration and “plan B” strategies.

Then we shift to React2Shell, a 10.0 RCE in React Server Components that’s already being exploited in the wild. Even if you never touch React, if you run platforms or Kubernetes for teams using Next.js or RSC, you’re on the hook for patching windows, WAF rules, and blast-radius thinking.

We also look at Netflix’s write-up on consolidating relational databases onto Aurora PostgreSQL, with big performance gains and cost savings. It’s a good excuse to step back and ask whether your own Postgres fleet still makes sense at the scale you’re at now.

In the lightning round, we hit OpenTofu 1.11’s new language features, practical Terraform “tips from the trenches,” Ghostty becoming a non-profit project, and two spec-driven dev tools (Spec Kit and OpenSpec) that show what sane AI-assisted development might look like.

For the human side, we close with “Your Brain on Incidents” and what high-stress outages actually do to people, plus a few concrete ideas for making on-call less brutal.

If you’re on a platform team, own SLOs, or you’re the person people ping when “something is wrong with prod,” this one should give you a mix of immediate to-dos and longer-term questions for your roadmap.

Links:

IBM + Confluent https://www.confluent.io/blog/ibm-to-acquire-confluent/ https://newsroom.ibm.com/2025-12-08-ibm-to-acquire-confluent-to-create-smart-data-platform-for-enterprise-generative-ai

React2Shell (CVE-2025-55182) https://react.dev/blog/2025/12/03/critical-security-vulnerability-in-react-server-components

Netflix on Aurora PostgreSQL https://aws.amazon.com/blogs/database/netflix-consolidates-relational-database-infrastructure-on-amazon-aurora-achieving-up-to-75-improved-performance/

Tools & tips https://opentofu.org/blog/opentofu-1-11-0/ https://rosesecurity.dev/2025/12/04/terraform-tips-and-tricks.html https://mitchellh.com/writing/ghostty-non-profit https://github.com/github/spec-kit https://github.com/Fission-AI/OpenSpec

Human side https://uptimelabs.io/your-brain-on-incidents/

View Details
AWS re:Invent for Platform Teams, GKE at 130k Nodes, and Killing Staging
Ep. 4

In this episode of Ship It Weekly, Brian looks at re:Invent through a platform/SRE lens and pulls out the updates that actually change how you design and run systems.

We talk about regional NAT Gateways and Route 53 Global Resolver on the networking side, ECS Express Mode and EKS Capabilities as new paved roads for app teams, S3 Vectors GA and 50 TB S3 objects for AI and data lakes, Aurora PostgreSQL dynamic data masking, CodeCommit’s return to full GA, and IAM Policy Autopilot for AI-assisted IAM policies. This was recorded mid–re:Invent, so consider it a “what matters so far” pass, not a full recap.

Outside AWS, we get into Google’s 130,000-node GKE cluster and what actually applies if you’re running normal-sized clusters, plus the “It’s time to kill staging” argument and what responsible testing in production looks like with feature flags, progressive delivery, and solid observability.

In the lightning round, we hit Zachary Loeber’s Terraform MCP server and terraform-ingest (letting AI tools speak your real Terraform modules), Runs-On’s EC2 instance rankings so you stop picking instance types by vibes, and Airbnb’s adaptive traffic management for their key-value store. We close with Nolan Lawson’s “The fate of small open source” and what it means when your platform quietly depends on one-maintainer libraries.

Links from this episode:

AWS highlights:

https://aws.amazon.com/about-aws/whats-new/2025/11/aws-nat-gateway-regional-availability

https://aws.amazon.com/blogs/aws/introducing-amazon-route-53-global-resolver-for-secure-anycast-dns-resolution-preview

https://aws.amazon.com/about-aws/whats-new/2025/11/announcing-amazon-ecs-express-mode

https://aws.amazon.com/about-aws/whats-new/2025/12/amazon-s3-vectors-generally-available/

Other topics:

https://cloud.google.com/blog/products/containers-kubernetes/how-we-built-a-130000-node-gke-cluster

https://thenewstack.io/its-time-to-kill-staging-the-case-for-testing-in-production/

https://blog.zacharyloeber.com/article/terraform-custom-module-mcp-server/

https://go.runs-on.com/instances/ranking

https://medium.com/airbnb-engineering/from-static-rate-limiting-to-adaptive-traffic-management-in-airbnbs-key-value-store-29362764e5c2

https://nolanlawson.com/2025/11/16/the-fate-of-small-open-source/

View Details
Kubernetes Config Reality Check, EKS Control Planes, and GitHub Guardrails
Ep. 3

In this episode of Ship It Weekly, Brian digs into what’s new for people actually running infra: Kubernetes config, EKS control planes and networking, and GitHub’s latest CI/CD and Copilot updates.

We start with Kubernetes’ new configuration good practices post and how to turn it into a checklist to clean up Helm/Kustomize and kill off “hotfix from my laptop” manifests.

Then we hit AWS: EKS Provisioned Control Plane to size control plane capacity for big or noisy clusters, plus new network observability so you can see who’s talking to what across clusters and AZs instead of guessing from node metrics.

On the GitHub side, Actions OIDC tokens now include a check_run_id for tighter access control, and Copilot adds instructions files and custom agents so you can encode platform and security expectations directly into reviews and workflows.

In the lightning round, we touch on Terrascan being archived, Microsoft’s write-up of a 15.72 Tbps Aisuru DDoS attack against Azure, and AWS flat-rate CloudFront plans that bundle CDN and security into more predictable pricing.

We close with Lorin Hochstein’s “Two thought experiments” and what it looks like to write incident reports as if an AI (and your future teammates) will rely on them to debug the next outage.

If run Kubernetes in prod this one should give you a few concrete ideas for your roadmap.

Links from episode

https://kubernetes.io/blog/2025/11/25/configuration-good-practices/

https://aws.amazon.com/about-aws/whats-new/2025/11/amazon-eks-provisioned-control-plane/

https://aws.amazon.com/blogs/aws/monitor-network-performance-and-traffic-across-your-eks-clusters-with-container-network-observability/

https://github.blog/changelog/2025-11-13-github-actions-oidc-token-claims-now-include-check_run_id/

https://github.blog/ai-and-ml/unlocking-the-full-power-of-copilot-code-review-master-your-instructions-files/

https://docs.github.com/en/copilot/how-tos/use-copilot-agents/coding-agent/create-custom-agents

Lightning Round

https://github.com/tenable/terrascan

https://www.bleepingcomputer.com/news/microsoft/microsoft-aisuru-botnet-used-500-000-ips-in-15-tbps-azure-ddos-attack/

https://aws.amazon.com/about-aws/whats-new/2025/11/aws-flat-rate-pricing-plans/

https://sreweekly.com/sre-weekly-issue-498/ (Lorin's Article)

View Details
Kubernetes Shake-ups, Platform Reality, and AI-Native SRE
Ep. 2

In this episode of Ship It Weekly, Brian digs into 3 big themes for anyone running Kubernetes or building internal platforms.

First, Kubernetes is officially retiring Ingress NGINX and moving it into best-effort maintenance until March 2026. We talk about what that actually means if you’re still using it and how to think about choosing and rolling out a replacement ingress.

Second, we look at how CNCF is defining platform engineering and what “platform as a product” looks like in practice, plus some hard-earned lessons from running Kubernetes in production.

Third, we talk about AI as a first-class workload on Kubernetes. CNCF’s new Certified Kubernetes AI Conformance Program aims to standardize how AI runs on K8s, and recent writing on SRE in the age of AI looks at what reliability means when systems learn and drift.

In the lightning round, we hit good reads on database migrations, Postgres upgrades, and a distributed priority queue on Kafka. We wrap with the human side of incidents: fixation during incident response and using incidents as landmarks for the tradeoffs you’ve been making over time.

If you’re on a platform team, responsible for SLOs, or the person people ping when “Kubernetes is weird,” this one should give you concrete questions to take back to your roadmap and runbooks.

Links from this episode

https://kubernetes.io/blog/2025/11/11/ingress-nginx-retirement/

https://www.haproxy.com/blog/ingress-nginx-is-retiring

https://www.cncf.io/blog/2025/11/19/what-is-platform-engineering/

https://www.cncf.io/announcements/2025/11/11/cncf-launches-certified-kubernetes-ai-conformance-program-to-standardize-ai-workloads-on-kubernetes/

https://devops.com/sre-in-the-age-of-ai-what-reliability-looks-like-when-systems-learn/

Lightning round

https://www.cncf.io/blog/2025/11/18/top-5-hard-earned-lessons-from-the-experts-on-managing-kubernetes/

https://www.tines.com/blog/zero-downtime-database-migrations-lessons-from-moving-a-live-production

https://palark.com/blog/postgresql-upgrade-no-data-loss-downtime/

https://klaviyo.tech/building-a-distributed-priority-queue-in-kafka-1b2d8063649e

https://sreweekly.com/sre-weekly-issue-497/

https://ferd.ca/ongoing-tradeoffs-and-incidents-as-landmarks.html

View Details
Special: When the Cloud Has a Bad Day: Cloudflare, AWS us-east-1 & GitHub Outages
Ep. 1

In this special kickoff episode of Ship It Weekly, Brian walks through three major outages from the last few weeks and what they actually mean for DevOps, SRE, and platform teams.

Instead of just reading status pages, we look at how each incident exposes assumptions in our own architectures and runbooks:

Topics in this episode:

• Cloudflare’s global outage and what happens when your CDN/WAF becomes a single point of failure

• The AWS us-east-1 incident and why “multi-AZ in one region” isn’t a full disaster recovery strategy

• GitHub’s Git operations / Codespaces outage and how fragile our CI/CD and GitOps flows can be

• Practical questions to ask about your own setup: CDN bypass, cross-region readiness, backups for Git and CI

This episode is more of a themed “special” to kick things off.

Going forward, most episodes will follow a lighter news format: a couple of main stories from the week in DevOps/SRE/platform engineering, a quick tools and releases segment, and one culture/on-call or burnout topic. Specials like this will pop up when there’s a big incident or theme worth unpacking.

If you’re the person people DM when production is acting weird, or you’re responsible for the platform everyone ships on, this one’s for you.

Links from this episode

Cloudflare outage – November 18, 2025

https://blog.cloudflare.com/18-november-2025-outage/

https://www.thousandeyes.com/blog/cloudflare-outage-analysis-november-18-2025

AWS us-east-1 outage – October 2025

https://aws.amazon.com/message/101925/

https://www.thousandeyes.com/blog/aws-outage-analysis-october-20-2025

GitHub outage – November 18, 2025

https://us.githubstatus.com/incidents/f3f7sg2d1m20

https://currently.att.yahoo.com/att/github-down-now-not-just-211700617.html

View Details
Prodcast Studio

Effortless monetization for podcasters using AI-driven metadata.