Scheduling large workloads presents far greater complexity than scheduling individual Pods. Rather than evaluating each Pod independently, the scheduler must often consider the entire group holistically. Machine learning batch jobs exemplify this challenge: workers typically need strategic placement—such as co-location on the same rack—to maximize efficiency. Complicating matters further, the Pods comprising these workloads are frequently identical from a scheduling standpoint, which fundamentally alters the optimal scheduling approach.
While numerous custom schedulers have emerged to handle workload scheduling efficiently, the prevalence and criticality of this use case—particularly in the AI era with its expanding requirements—demands native support. The time has come to elevate workloads to first-class status within kube-scheduler.
Workload-aware scheduling
Kubernetes 1.35 delivers the initial wave of workload-aware scheduling improvements. This release represents one milestone in a broader, multi-SIG initiative spanning several versions, designed to progressively enhance workload scheduling and management capabilities. The ultimate objective: seamless workload scheduling and management in Kubernetes, encompassing preemption, autoscaling, and beyond.
Version 1.35 introduces the Workload API, enabling you to specify both the desired topology and scheduling requirements of your workload. It includes an initial gang scheduling implementation that directs kube-scheduler to schedule gang Pods in an all-or-nothing manner. Additionally, the release optimizes scheduling for identical Pods—which typically constitute a gang—through opportunistic batching, substantially accelerating the process.
Workload API
The new Workload API resource belongs to the scheduling.k8s.io/v1alpha1 API group. This resource provides a structured, machine-readable specification of scheduling requirements for multi-Pod applications. While user-facing workloads like Jobs define what to execute, the Workload resource dictates how a Pod group should be scheduled and how its placement should be managed throughout its lifecycle.
A Workload lets you define a Pod group and apply a scheduling policy. Here's a gang scheduling configuration example: you define a pod group named `workers` and apply the `gang` policy with a `minCount` of 4.
```yaml
apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: training-job-workload
  namespace: some-ns
spec:
  podGroups:
  - name: workers
    policy:
      gang:
        # The gang is schedulable only if 4 pods can run at once
        minCount: 4
```
When creating your Pods, link them to this Workload using the new `workloadRef` field:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: worker-0
  namespace: some-ns
spec:
  workloadRef:
    name: training-job-workload
    podGroup: workers
  ...
```
How gang scheduling works
The gang policy enforces all-or-nothing placement. Without gang scheduling, a Job might be partially scheduled, consuming resources while unable to execute—resulting in resource waste and potential deadlocks.
When you create Pods belonging to a gang-scheduled pod group, the scheduler's GangScheduling plugin manages each pod group (or replica key) lifecycle independently:
- Upon Pod creation (whether manual or controller-driven), the scheduler blocks scheduling until:
  - The referenced Workload object exists.
  - The referenced pod group exists within that Workload.
  - The number of pending Pods in that group reaches your `minCount`.
- Once sufficient Pods arrive, the scheduler attempts placement. Rather than binding them to nodes immediately, Pods wait at a `Permit` gate.
- The scheduler verifies whether it has found valid assignments for the entire group (at least `minCount` Pods).
  - If capacity exists for the group, the gate opens and all Pods bind to nodes.
  - If only a subset successfully schedules within the timeout (currently 5 minutes), the scheduler rejects all Pods in the group. They return to the queue, releasing reserved resources for other workloads.
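To make this behavior concrete, here is a sketch of a companion Pod for the `training-job-workload` example; the container name and image are hypothetical. With a `minCount` of 4, the Pod shown earlier stays unscheduled at the Permit gate until four pending Pods exist in the `workers` group:

```yaml
# Hypothetical companion Pod for the training-job-workload example;
# worker-2 and worker-3 would follow the same pattern.
apiVersion: v1
kind: Pod
metadata:
  name: worker-1
  namespace: some-ns
spec:
  workloadRef:
    name: training-job-workload
    podGroup: workers
  containers:
  - name: trainer                           # hypothetical container
    image: registry.example/trainer:latest  # hypothetical image
```

While fewer than four Pods in the `workers` group are pending, none of them is bound to a node, so cluster capacity is not held by a partial gang.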
While this represents the initial implementation, the Kubernetes project is committed to enhancing and expanding gang scheduling in future releases. Planned improvements include single-cycle scheduling for entire gangs, workload-level preemption, and additional capabilities aligned with the north star vision.
Opportunistic batching
Beyond explicit gang scheduling, v1.35 introduces opportunistic batching, a Beta feature that reduces scheduling latency for identical Pods.
Unlike gang scheduling, this feature requires neither the Workload API nor explicit opt-in. It operates opportunistically within the scheduler by identifying Pods with identical scheduling requirements (container images, resource requests, affinities, and so on). When processing a Pod, the scheduler can reuse feasibility calculations for subsequent identical Pods in the queue, dramatically accelerating throughput.
Most users benefit from this optimization automatically, provided their Pods satisfy the necessary criteria.
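As a sketch of Pods that are identical from the scheduler's standpoint, consider a Job whose Pods all come from one template (the name, image, and resource values below are placeholders). Because every field kube-scheduler evaluates matches across the Pods, feasibility results computed for one Pod can, subject to the restrictions that follow, be reused for the rest:

```yaml
# Hypothetical Job: all 8 Pods share the same image, requests, and
# (absent) affinities, making them candidates for batched scheduling.
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-demo
spec:
  completions: 8
  parallelism: 8
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: registry.example/worker:latest  # hypothetical image
        resources:
          requests:
            cpu: "2"
            memory: 4Gi
```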
Restrictions
Opportunistic batching functions under specific conditions. All fields that kube-scheduler uses to determine placement must be identical across Pods. Additionally, certain features disable the batching mechanism for affected Pods to maintain correctness.
You may need to review your kube-scheduler configuration to verify it isn't inadvertently disabling batching for your workloads.
See the docs for detailed information about restrictions.
The north star vision
The project's ambition for workload-aware scheduling extends well beyond these initial capabilities. These new APIs and scheduling enhancements mark only the beginning. Near-term objectives include:
- Introducing a workload scheduling phase
- Enhanced support for multi-node DRA and topology-aware scheduling
- Workload-level preemption
- Tighter integration between scheduling and autoscaling
- Better interaction with external workload schedulers
- Managing workload placement throughout their entire lifecycle
- Multi-workload scheduling simulations
These focus areas may shift in priority and implementation order as development continues.
Getting started
To experiment with workload-aware scheduling improvements:
- Workload API: Enable the `GenericWorkload` feature gate on both kube-apiserver and kube-scheduler, and verify that the `scheduling.k8s.io/v1alpha1` API group is enabled.
- Gang scheduling: Enable the `GangScheduling` feature gate on kube-scheduler (the Workload API must be enabled first).
- Opportunistic batching: This Beta feature is enabled by default in v1.35. You can disable it using the `OpportunisticBatching` feature gate on kube-scheduler if necessary.
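As a sketch, assuming your control plane components accept the standard `--feature-gates` and `--runtime-config` flags, enabling these features on a test cluster could look like the following (the trailing ellipses stand for your existing flags; adapt this to however your distribution configures kube-apiserver and kube-scheduler):

```shell
# Serve the alpha API group and enable the Workload API on the API server.
kube-apiserver \
  --runtime-config=scheduling.k8s.io/v1alpha1=true \
  --feature-gates=GenericWorkload=true \
  ...

# Enable the Workload API and gang scheduling on the scheduler.
kube-scheduler \
  --feature-gates=GenericWorkload=true,GangScheduling=true \
  ...
```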
Test these features in non-production clusters and share your feedback to help refine Kubernetes scheduling. You can contribute by:
- Joining the conversation on Slack (#sig-scheduling)
- Commenting on the workload-aware scheduling tracking issue
- Opening a new issue in the Kubernetes repository
Learn more
- Review the KEPs for the Workload API and gang scheduling, and for opportunistic batching
- Follow the Workload-aware scheduling issue for the latest updates