Because Pods represent processes running on nodes in the cluster, it is important to
allow those processes to gracefully terminate when they are no longer needed (rather
than being abruptly stopped with a KILL signal and having no chance to clean up).
The design aim is for you to be able to request deletion and know when processes terminate, but also be able to ensure that deletes eventually complete. When you request deletion of a Pod, the cluster records and tracks the intended grace period before the Pod is allowed to be forcefully killed. With that forceful shutdown tracking in place, the kubelet attempts graceful shutdown.
Typically, with this graceful termination of the pod, kubelet makes requests to the container runtime
to attempt to stop the containers in the pod by first sending a TERM (aka. SIGTERM) signal,
with a grace period timeout, to the main process in each container.
The requests to stop the containers are processed by the container runtime asynchronously.
There is no guarantee to the order of processing for these requests.
Many container runtimes respect the STOPSIGNAL value defined in the container image and,
if different, send the container image configured STOPSIGNAL instead of TERM.
Once the grace period has expired, the KILL signal is sent to any remaining
processes, and the Pod is then deleted from the
API Server. If the kubelet or the
container runtime's management service is restarted while waiting for processes to terminate, the
cluster retries from the start including the full original grace period.
The stop signal used to kill the container can be defined in the container image with the STOPSIGNAL instruction.
If no stop signal is defined in the image, the default signal of the container runtime
(SIGTERM for both containerd and CRI-O) would be used to kill the container.
Kubernetes v1.33 [alpha](disabled by default)If the ContainerStopSignals feature gate is enabled, you can configure a custom stop signal
for your containers from the container Lifecycle. We require the Pod's spec.os.name field
to be present as a requirement for defining stop signals in the container lifecycle.
The list of signals that are valid depends on the OS the Pod is scheduled to.
For Pods scheduled to Windows nodes, we only support SIGTERM and SIGKILL as valid signals.
Here is an example Pod spec defining a custom stop signal:
spec:
os:
name: linux
containers:
- name: my-container
image: container-image:latest
lifecycle:
stopSignal: SIGUSR1
If a stop signal is defined in the lifecycle, this will override the signal defined in the container image. If no stop signal is defined in the container spec, the container would fall back to the default behavior.
Pod termination flow, illustrated with an example:
You use the kubectl tool to manually delete a specific Pod, with the default grace period
(30 seconds).
The Pod in the API server is updated with the time beyond which the Pod is considered "dead"
along with the grace period.
If you use kubectl describe to check the Pod you're deleting, that Pod shows up as "Terminating".
On the node where the Pod is running: as soon as the kubelet sees that a Pod has been marked
as terminating (a graceful shutdown duration has been set), the kubelet begins the local Pod
shutdown process.
If one of the Pod's containers has defined a preStop
hook and the terminationGracePeriodSeconds
in the Pod spec is not set to 0, the kubelet runs that hook inside of the container.
The default terminationGracePeriodSeconds setting is 30 seconds.
If the preStop hook is still running after the grace period expires, the kubelet requests
a small, one-off grace period extension of 2 seconds.
preStop hook needs longer to complete than the default grace period allows,
you must modify terminationGracePeriodSeconds to suit this.The kubelet triggers the container runtime to send a TERM signal to process 1 inside each container.
There is special ordering if the Pod has any
sidecar containers defined.
Otherwise, the containers in the Pod receive the TERM signal at different times and in
an arbitrary order. If the order of shutdowns matters, consider using a preStop hook
to synchronize (or switch to using sidecar containers).
At the same time as the kubelet is starting graceful shutdown of the Pod, the control plane evaluates whether to remove that shutting-down Pod from EndpointSlice objects, where those objects represent a Service with a configured selector. ReplicaSets and other workload resources no longer treat the shutting-down Pod as a valid, in-service replica.
Pods that shut down slowly should not continue to serve regular traffic and should start terminating and finish processing open connections. Some applications need to go beyond finishing open connections and need more graceful termination, for example, session draining and completion.
Any endpoints that represent the terminating Pods are not immediately removed from
EndpointSlices, and a status indicating terminating state
is exposed from the EndpointSlice API.
Terminating endpoints always have their ready status as false (for backward compatibility
with versions before 1.26), so load balancers will not use it for regular traffic.
If traffic draining on terminating Pod is needed, the actual readiness can be checked as a
condition serving. You can find more details on how to implement connections draining in the
tutorial Pods And Endpoints Termination Flow
The kubelet ensures the Pod is shut down and terminated
SIGKILL to any processes still running in any container in the Pod.
The kubelet also cleans up a hidden pause container if that container runtime uses one.Failed or Succeeded depending on
the end state of its containers).By default, all deletes are graceful within 30 seconds. The kubectl delete command supports
the --grace-period=<seconds> option which allows you to override the default and specify your
own value.
Setting the grace period to 0 forcibly and immediately deletes the Pod from the API
server. If the Pod was still running on a node, that forcible deletion triggers the kubelet to
begin immediate cleanup.
Using kubectl, You must specify an additional flag --force along with --grace-period=0
in order to perform force deletions.
When a force deletion is performed, the API server does not wait for confirmation from the kubelet that the Pod has been terminated on the node it was running on. It removes the Pod in the API immediately so a new Pod can be created with the same name. On the node, Pods that are set to terminate immediately will still be given a small grace period before being force killed.
If you need to force-delete Pods that are part of a StatefulSet, refer to the task documentation for deleting Pods from a StatefulSet.
If your Pod includes one or more
sidecar containers
(init containers with an Always restart policy), the kubelet will delay sending
the TERM signal to these sidecar containers until the last main container has fully terminated.
The sidecar containers will be terminated in the reverse order they are defined in the Pod spec.
This ensures that sidecar containers continue serving the other containers in the Pod until they
are no longer needed.
This means that slow termination of a main container will also delay the termination of the sidecar containers. If the grace period expires before the termination process is complete, the Pod may enter forced termination. In this case, all remaining containers in the Pod will be terminated simultaneously with a short grace period.
Similarly, if the Pod has a preStop hook that exceeds the termination grace period, emergency termination may occur.
In general, if you have used preStop hooks to control the termination order without sidecar containers, you can now
remove them and allow the kubelet to manage sidecar termination automatically.
For failed Pods, the API objects remain in the cluster's API until a human or controller process explicitly removes them.
The Pod garbage collector (PodGC), which is a controller in the control plane, cleans up
terminated Pods (with a phase of Succeeded or Failed), when the number of Pods exceeds the
configured threshold (determined by terminated-pod-gc-threshold in the kube-controller-manager).
This avoids a resource leak as Pods are created and terminated over time.
Additionally, PodGC cleans up any Pods which satisfy any of the following conditions:
node.kubernetes.io/out-of-service.Along with cleaning up the Pods, PodGC will also mark them as failed if they are in a non-terminal phase. Also, PodGC adds a Pod disruption condition when cleaning up an orphan Pod. See Pod disruption conditions for more details.
preStop hook used during termination.