Modernizing Kubernetes Image Promoter: A Behind-the-Scenes Technical Transformation

2026-03-17 00:00

kpromo, the Kubernetes image promoter, handles all container image distribution to registry.k8s.io by copying images from staging to production environments, applying cosign signatures, and managing replication across registries.

Every container image pulled from registry.k8s.io passes through kpromo, the Kubernetes image promoter. This tool copies images from staging to production registries, signs them with cosign, replicates signatures across more than 20 regional mirrors, and generates SLSA provenance attestations. When this tool fails, Kubernetes releases stop. Recently, we rewrote its core from scratch, removed 20% of the codebase, and dramatically improved performance—all without disrupting production. That was exactly the goal.

A bit of history

The image promoter began in late 2018 as an internal Google project by Linus Arver. Its purpose was straightforward: replace the manual, Googler-controlled process of copying container images into k8s.gcr.io with a community-owned, GitOps-based workflow. The new approach let contributors push to a staging registry, open a PR with a YAML manifest, get it reviewed and merged, then let automation handle the rest. KEP-1734 formalized the proposal.

In early 2019, the code moved to kubernetes-sigs/k8s-container-image-promoter and expanded rapidly. Over the following years, Stephen Augustus consolidated multiple tools (cip, gh2gcs, krel promote-images, promobot-files) into a single CLI called kpromo. The repository was renamed to promo-tools. Adolfo Garcia Veytia (Puerco) added cosign signing and SBOM support. Tyler Ferrara built vulnerability scanning. Carlos Panato maintained the project in a healthy and releasable state. Across 42 contributors, roughly 3,500 commits, and more than 60 releases, the tool matured.

It worked. But by 2025 the codebase showed the strain of seven years of incremental additions from multiple SIGs and subprojects. The README acknowledged it plainly: duplicated code, multiple techniques for the same task, and numerous TODOs.

The problems we needed to solve

Production promotion jobs for Kubernetes core images routinely took over 30 minutes and often failed with rate limit errors. The core promotion logic had become a monolith that was difficult to extend and challenging to test, making new features like provenance or vulnerability scanning painful to implement.

On the SIG Release roadmap, two items had been pending: "Rewrite artifact promoter" and "Make artifact validation more robust". These had been discussed at SIG Release meetings and KubeCons, and the open research spikes on project board #171 captured eight questions that needed resolution before work could proceed.

One issue to answer them all

In February 2026, we opened issue #1701 ("Rewrite artifact promoter pipeline") and addressed all eight spikes in a single tracking issue. The rewrite was deliberately phased so each step could be reviewed, merged, and validated independently. Here's what we did:

Phase 1: Rate Limiting (#1702). Rewrote rate limiting to properly throttle all registry operations with adaptive backoff.

Phase 2: Interfaces (#1704). Placed registry and auth operations behind clean interfaces for independent testing and swapping.
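To illustrate the testing benefit (with a hypothetical interface, not kpromo's real one): promotion logic written against a small interface can be exercised with an in-memory fake, no network or credentials required.

```go
package main

import "fmt"

// RegistryClient abstracts the registry operations the promoter needs,
// so production code can use a real client and tests a fake.
// Illustrative only; kpromo's actual interfaces differ.
type RegistryClient interface {
	ListTags(repo string) ([]string, error)
	Copy(srcRef, dstRef string) error
}

// fakeRegistry is an in-memory implementation for tests.
type fakeRegistry struct {
	tags   map[string][]string
	copies []string
}

func (f *fakeRegistry) ListTags(repo string) ([]string, error) {
	return f.tags[repo], nil
}

func (f *fakeRegistry) Copy(src, dst string) error {
	f.copies = append(f.copies, src+" -> "+dst)
	return nil
}

// promoteMissing copies every tag present in staging but absent in prod,
// and returns how many copies it performed.
func promoteMissing(c RegistryClient, staging, prod string) (int, error) {
	stagingTags, err := c.ListTags(staging)
	if err != nil {
		return 0, err
	}
	prodTags, err := c.ListTags(prod)
	if err != nil {
		return 0, err
	}
	have := map[string]bool{}
	for _, t := range prodTags {
		have[t] = true
	}
	n := 0
	for _, t := range stagingTags {
		if !have[t] {
			if err := c.Copy(staging+":"+t, prod+":"+t); err != nil {
				return n, err
			}
			n++
		}
	}
	return n, nil
}

func main() {
	fake := &fakeRegistry{tags: map[string][]string{
		"staging/kube-apiserver": {"v1.33.0", "v1.33.1"},
		"prod/kube-apiserver":    {"v1.33.0"},
	}}
	n, _ := promoteMissing(fake, "staging/kube-apiserver", "prod/kube-apiserver")
	fmt.Println(n) // 1: only v1.33.1 needs promotion
}
```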

Phase 3: Pipeline Engine (#1705). Built a pipeline engine that runs promotion as a sequence of distinct phases instead of one monolithic function.

Phase 4: Provenance (#1706). Added SLSA provenance verification for staging images.

Phase 5: Scanner and SBOMs (#1709). Added vulnerability scanning and SBOM support. Switched the default to the new pipeline engine. At this point we released v4.2.0 and let it run in production before continuing.

Phase 6: Split Signing from Replication (#1713). Separated image signing from signature replication into distinct pipeline phases, eliminating the rate limit contention that caused most production failures.

Phase 7: Remove Legacy Pipeline (#1712). Deleted the old code path entirely.

Phase 8: Remove Legacy Dependencies (#1716). Deleted the audit subsystem, deprecated tools, and e2e test infrastructure.

Phase 9: Delete the Monolith (#1718). Removed the old monolithic core and its supporting packages. Thousands of lines deleted across phases 7 through 9.

Each phase shipped independently. v4.3.0 followed the next day with the legacy code fully removed.

With the new architecture in place, a series of follow-up improvements landed: parallelized registry reads (#1736), retry logic for all network operations (#1742), per-request timeouts to prevent pipeline hangs (#1763), HTTP connection reuse (#1759), local registry integration tests (#1746), removal of deprecated credential file support (#1758), a rework of attestation handling to use cosign's OCI APIs and removal of deprecated SBOM support (#1764), and a dedicated promotion record predicate type registered with the in-toto attestation framework (#1767). These improvements would have been significantly harder to implement without the clean separation the rewrite provided. v4.4.0 shipped all of these improvements and enabled provenance generation and verification by default.

The new pipeline

The promotion pipeline now has seven clearly separated phases:

Setup → Plan → Provenance → Validate → Promote → Sign → Attest

  • Setup: Validate options, prewarm the TUF cache.
  • Plan: Parse manifests, read registries, compute which images need promotion.
  • Provenance: Verify SLSA attestations on staging images.
  • Validate: Check cosign signatures; exit here for dry runs.
  • Promote: Copy images server-side, preserving digests.
  • Sign: Sign promoted images with keyless cosign.
  • Attest: Generate promotion provenance attestations using a dedicated in-toto predicate type.

Phases run sequentially, giving each one exclusive access to the full rate limit budget. No more contention. Signature replication to mirror registries is no longer part of this pipeline and runs as a dedicated periodic Prow job instead.

Making it fast

With the architecture in place, we focused on performance.

Parallel registry reads (#1736): The plan phase reads 1,350 registries. We parallelized this and the plan phase dropped from roughly 20 minutes to about 2 minutes.
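The pattern is a bounded fan-out: launch one goroutine per registry, but cap concurrency so the rate limiter stays in control. A simplified sketch (readOne stands in for a real registry read; names are illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

// readAll reads every registry with at most `workers` concurrent reads,
// the bounded fan-out shape used to parallelize the plan phase.
// readOne stands in for a real registry read.
func readAll(registries []string, workers int, readOne func(string) int) map[string]int {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		results = make(map[string]int, len(registries))
		sem     = make(chan struct{}, workers) // concurrency limiter
	)
	for _, r := range registries {
		wg.Add(1)
		go func(reg string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it
			n := readOne(reg)
			mu.Lock()
			results[reg] = n
			mu.Unlock()
		}(r)
	}
	wg.Wait()
	return results
}

func main() {
	regs := []string{"us.gcr.io/a", "eu.gcr.io/a", "asia.gcr.io/a"}
	got := readAll(regs, 2, func(reg string) int { return len(reg) })
	fmt.Println(len(got)) // 3
}
```

With network latency dominating each read, wall-clock time shrinks roughly in proportion to the worker count, which is consistent with the 20-minute to 2-minute drop above.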

Two-phase tag listing (#1761): Instead of checking all 46,000 image groups across more than 20 mirrors, we first check only the source repositories. Roughly 57% of images have no signatures at all because they were promoted before signing was enabled. We skip those entirely, cutting API calls roughly in half.

Source check before replication (#1727): Before iterating all mirrors for a given image, we check if the signature exists on the primary registry first. In steady state where most signatures are already replicated, this reduced the work from roughly 17 hours to about 15 minutes.
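One way to sketch that short-circuit (a hypothetical helper; the real logic in #1727 differs in detail): consult the primary registry once, and only fan out to the mirrors when there is actually a signature to replicate.

```go
package main

import "fmt"

// needsReplication returns the mirrors still missing an image's signature.
// Checking the primary registry first means images with no signature cost
// one lookup instead of one per mirror. onPrimary and onMirror stand in
// for real registry lookups; this is an illustrative sketch.
func needsReplication(mirrors []string, onPrimary bool, onMirror func(string) bool) []string {
	if !onPrimary {
		return nil // nothing to replicate; skip every mirror
	}
	var missing []string
	for _, m := range mirrors {
		if !onMirror(m) {
			missing = append(missing, m)
		}
	}
	return missing
}

func main() {
	mirrors := []string{"us", "eu", "asia"}
	// Signature absent on the primary: zero mirror lookups performed.
	fmt.Println(needsReplication(mirrors, false, nil))
	// Present on the primary, missing from one mirror.
	fmt.Println(needsReplication(mirrors, true, func(m string) bool { return m != "eu" }))
}
```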

Per-request timeouts (#1763): We observed intermittent hangs where a stalled connection blocked the pipeline for over 9 hours. Every network operation now has its own timeout and transient failures are retried automatically.

Connection reuse (#1759): We started reusing HTTP connections and auth state across operations, eliminating redundant token negotiations. This closed a long-standing request from 2023.
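The underlying mechanism is Go's HTTP connection pooling: one shared http.Client reuses TCP connections across requests, as long as each response body is drained and closed. A self-contained demonstration against a local test server (illustrative, not kpromo's actual wiring):

```go
package main

import (
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// sequentialGets performs n GET requests against a local test server with
// one shared http.Client and returns how many TCP connections the server
// accepted. Draining and closing each body lets the transport return the
// connection to its pool, so the count stays at 1.
func sequentialGets(n int) int32 {
	var conns int32
	srv := httptest.NewUnstartedServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "ok")
	}))
	// Count every new TCP connection the server accepts.
	srv.Config.ConnState = func(c net.Conn, s http.ConnState) {
		if s == http.StateNew {
			atomic.AddInt32(&conns, 1)
		}
	}
	srv.Start()
	defer srv.Close()

	client := &http.Client{} // shared client -> pooled, reused connections
	for i := 0; i < n; i++ {
		resp, err := client.Get(srv.URL)
		if err != nil {
			panic(err)
		}
		io.Copy(io.Discard, resp.Body) // must drain the body to allow reuse
		resp.Body.Close()
	}
	return atomic.LoadInt32(&conns)
}

func main() {
	fmt.Println(sequentialGets(5))
}
```

Creating a fresh client (or transport) per operation defeats this pooling entirely, which is why consolidating onto shared connection and auth state eliminated the redundant token negotiations.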

By the numbers

Here's what the rewrite looks like in aggregate.

  • More than 40 pull requests merged across three releases (v4.2.0, v4.3.0, v4.4.0)
  • More than 10,000 lines of code added and over 16,000 removed, resulting in a net reduction of approximately 5,000 lines—a 20% smaller codebase
  • Significant performance improvements across all operations
  • Enhanced reliability through retry logic, per-request timeouts, and adaptive rate limiting
  • 19 longstanding issues resolved

Despite shrinking by one-fifth, the codebase gained substantial new capabilities: provenance attestations, a pipeline engine, vulnerability scanning integration, parallelized operations, retry mechanisms, integration tests against local registries, and a standalone signature replication mode.

No user-facing changes

Maintaining backward compatibility was non-negotiable. The kpromo cip command still accepts the same flags and reads identical YAML manifests. The post-k8sio-image-promo Prow job ran without interruption throughout the transition. Promotion manifests in kubernetes/k8s.io required no modifications. Users didn't need to adjust workflows or configuration files.

Two regressions surfaced early in production. The first (#1731) triggered a registry key mismatch that flagged every image as "lost," halting all promotions. The second (#1733) set the default thread count to zero, blocking all goroutines. Both issues were resolved within hours. The phased rollout strategy—v4.2.0 introducing the new engine, v4.3.0 removing legacy code—provided a clear rollback path that ultimately went unused.

What comes next

Signature replication across mirror registries remains the most resource-intensive phase of the promotion cycle. Issue #1762 proposes eliminating it entirely by having archeio (the registry.k8s.io redirect service) route signature tag requests to a single canonical upstream rather than per-region backends. An alternative approach would move signing closer to the registry infrastructure itself. Both options require further discussion with SIG Release and infrastructure teams, but either would eliminate thousands of API calls per promotion cycle and further simplify the codebase.

Thank you

This project represents a seven-year community effort. Thanks to Linus, Stephen, Adolfo, Carlos, Ben, Marko, Lauri, Tyler, Arnaud, and the many others who contributed code, reviews, and planning. The SIG Release and Release Engineering communities provided essential context, discussion, and patience for a rewrite of infrastructure that every Kubernetes release depends on.

To get involved, join #release-management on Kubernetes Slack or visit the repository.