Scheduling large workloads presents far greater complexity than scheduling individual Pods. Rather than evaluating each Pod independently, the scheduler must often consider the entire group holistically. Machine learning batch jobs exemplify this challenge: workers typically need strategic placement—such as co-location on the same rack—to maximize efficiency. Complicating matters further, the Pods comprising these workloads are frequently identical from a scheduling standpoint, which fundamentally alters the optimal scheduling approach.
While numerous custom schedulers have emerged to handle workload scheduling efficiently, the prevalence and criticality of this use case—particularly in the AI era with its expanding requirements—demands native support. The time has come to elevate workloads to first-class status within kube-scheduler.
Workload-aware scheduling
Kubernetes 1.35 delivers the initial wave of workload-aware scheduling improvements. This release represents one milestone in a broader, multi-SIG initiative spanning several versions, designed to progressively enhance workload scheduling and management capabilities. The ultimate objective: seamless workload scheduling and management in Kubernetes, encompassing preemption, autoscaling, and beyond.
Version 1.35 introduces the Workload API, enabling you to specify both the desired topology and scheduling requirements of your workload. It includes an initial gang scheduling implementation that directs kube-scheduler to schedule gang Pods in an all-or-nothing manner. Additionally, the release optimizes scheduling for identical Pods—which typically constitute a gang—through opportunistic batching, substantially accelerating the process.
Workload API
The new Workload API resource belongs to the scheduling.k8s.io/v1alpha1 API group. This resource provides a structured, machine-readable specification of scheduling requirements for multi-Pod applications. While user-facing workloads like Jobs define what to execute, the Workload resource dictates how a Pod group should be scheduled and how its placement should be managed throughout its lifecycle.
A Workload lets you define a Pod group and apply a scheduling policy. Here's a gang scheduling configuration example: you define a pod group named `workers` and apply the `gang` policy with a `minCount` of 4.
```yaml
apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: training-job-workload
  namespace: some-ns
spec:
  podGroups:
  - name: workers
    policy:
      gang:
        # The gang is schedulable only if 4 pods can run at once
        minCount: 4
```
When creating your Pods, link them to this Workload using the new `workloadRef` field:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: worker-0
  namespace: some-ns
spec:
  workloadRef:
    name: training-job-workload
    podGroup: workers
  ...
```
How gang scheduling works
The gang policy enforces all-or-nothing placement. Without gang scheduling, a Job might be partially scheduled, consuming resources while unable to execute—resulting in resource waste and potential deadlocks.
When you create Pods belonging to a gang-scheduled pod group, the scheduler's GangScheduling plugin manages each pod group (or replica key) lifecycle independently:
- Upon Pod creation (whether manual or controller-driven), the scheduler blocks scheduling until:
  - The referenced Workload object exists.
  - The referenced pod group exists within that Workload.
  - The number of pending Pods in that group reaches your `minCount`.
- Once sufficient Pods arrive, the scheduler attempts placement. Rather than binding them to nodes immediately, Pods wait at a `Permit` gate.
- The scheduler verifies whether it has found valid assignments for the entire group (at least `minCount` Pods).
  - If capacity exists for the group, the gate opens and all Pods bind to nodes.
  - If only a subset successfully schedules within the timeout (currently 5 minutes), the scheduler rejects all Pods in the group. They return to the queue, releasing reserved resources for other workloads.
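To make this behavior concrete, here is a sketch of a companion Pod for the `training-job-workload` example; the container name and image are hypothetical. With a `minCount` of 4, the Pod shown earlier stays unscheduled at the Permit gate until four pending Pods exist in the `workers` group:

```yaml
# Hypothetical companion Pod for the training-job-workload example;
# worker-2 and worker-3 would follow the same pattern.
apiVersion: v1
kind: Pod
metadata:
  name: worker-1
  namespace: some-ns
spec:
  workloadRef:
    name: training-job-workload
    podGroup: workers
  containers:
  - name: trainer                           # hypothetical container
    image: registry.example/trainer:latest  # hypothetical image
```

While fewer than four Pods in the `workers` group are pending, none of them is bound to a node, so cluster capacity is not held by a partial gang.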
While this represents the initial implementation, the Kubernetes project is committed to enhancing and expanding gang scheduling in future releases. Planned improvements include single-cycle scheduling for entire gangs, workload-level preemption, and additional capabilities aligned with the north star vision.
Opportunistic batching
Beyond explicit gang scheduling, v1.35 introduces opportunistic batching, a Beta feature that reduces scheduling latency for identical Pods.
Unlike gang scheduling, this feature requires neither the Workload API nor explicit opt-in. It operates opportunistically within the scheduler by identifying Pods with identical scheduling requirements (container images, resource requests, affinities, and so on). When processing a Pod, the scheduler can reuse feasibility calculations for subsequent identical Pods in the queue, dramatically accelerating throughput.
Most users benefit from this optimization automatically, provided their Pods satisfy the necessary criteria.
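As a sketch of Pods that are identical from the scheduler's standpoint, consider a Job whose Pods all come from one template (the name, image, and resource values below are placeholders). Because every field kube-scheduler evaluates matches across the Pods, feasibility results computed for one Pod can, subject to the restrictions that follow, be reused for the rest:

```yaml
# Hypothetical Job: all 8 Pods share the same image, requests, and
# (absent) affinities, making them candidates for batched scheduling.
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-demo
spec:
  completions: 8
  parallelism: 8
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: registry.example/worker:latest  # hypothetical image
        resources:
          requests:
            cpu: "2"
            memory: 4Gi
```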
Restrictions
Opportunistic batching functions under specific conditions. All fields that kube-scheduler uses to determine placement must be identical across Pods. Additionally, certain features disable the batching mechanism for affected Pods to maintain correctness.
You may need to review your kube-scheduler configuration to verify it isn't inadvertently disabling batching for your workloads.
See the docs for detailed information about restrictions.
The north star vision
The project's ambition for workload-aware scheduling extends well beyond these initial capabilities. These new APIs and scheduling enhancements mark only the beginning. Near-term objectives include:
- Introducing a workload scheduling phase
- Enhanced support for multi-node DRA and topology-aware scheduling
- Workload-level preemption
- Tighter integration between scheduling and autoscaling
- Better interaction with external workload schedulers
- Managing workload placement throughout their entire lifecycle
- Multi-workload scheduling simulations
These focus areas may shift in priority and implementation order as development continues.
Getting started
To experiment with workload-aware scheduling improvements:
- Workload API: Enable the `GenericWorkload` feature gate on both kube-apiserver and kube-scheduler, and verify that the `scheduling.k8s.io/v1alpha1` API group is enabled.
- Gang scheduling: Enable the `GangScheduling` feature gate on kube-scheduler (the Workload API must be enabled first).
- Opportunistic batching: This Beta feature is enabled by default in v1.35. You can disable it using the `OpportunisticBatching` feature gate on kube-scheduler if necessary.
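As a sketch, assuming your control plane components accept the standard `--feature-gates` and `--runtime-config` flags, enabling these features on a test cluster could look like the following (the trailing ellipses stand for your existing flags; adapt this to however your distribution configures kube-apiserver and kube-scheduler):

```shell
# Serve the alpha API group and enable the Workload API on the API server.
kube-apiserver \
  --runtime-config=scheduling.k8s.io/v1alpha1=true \
  --feature-gates=GenericWorkload=true \
  ...

# Enable the Workload API and gang scheduling on the scheduler.
kube-scheduler \
  --feature-gates=GenericWorkload=true,GangScheduling=true \
  ...
```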
Test these features in non-production clusters and share your feedback to help refine Kubernetes scheduling. You can contribute by:
- Joining the conversation on Slack (#sig-scheduling)
- Commenting on the workload-aware scheduling tracking issue
- Opening a new issue in the Kubernetes repository
Learn more
- Review the KEPs for the Workload API and gang scheduling, and for opportunistic batching
- Follow the Workload-aware scheduling issue for the latest updates