Scheduling and Killing Processes

Sometimes processes crash, or use too many resources, or just need to be gracefully shut down.  In this post, I'll cover the following:

  • nice
  • renice
  • kill
  • killall
  • pkill

Let's start with a word about killing. There are actually several commands with "kill" in their names (kill, killall and pkill), and they do different things. There are also different ways of killing processes, and some of them aren't even very murderey!

The first thing we should cover is how you can even tell if a process is causing problems - it's not always as easy as something becoming unresponsive.  It could be that a process is using too much memory, or too many CPU cycles - but how do you determine that?  If you're even vaguely familiar with Windows, you'll know about the Task Manager.  Linux has something similar, called top.  Just type "top" in a shell and you'll see something like this:

There's a lot to unpack here, but we're only going to focus on two areas right now - the first line with the text load average: 0.28, 1.05, 1.20, and the %CPU column.  

The load average line shows a running average of system load over 1 minute, 5 minutes and 15 minutes.  These numbers will change, and by watching them for a moment, you'll be able to tell if your system's load is increasing or decreasing.  It's important to note that this information is for all cores COMBINED, so to determine the real load on your system, you need to divide the number you're interested in (say, the 15 minute figure of 1.20) by the number of cores in your system.  My system has 6 cores, so I would divide 1.20 by 6, which is .20 - that means that over the last 15 minutes, my system has been running at 20% load.  Which isn't bad at all!  
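The divide-by-cores arithmetic above can be sketched as a small shell function. This is a rough illustration rather than a standard tool: per_core_pct is a name I made up, and the second half assumes a Linux system where /proc/loadavg and the nproc command are available.

```shell
# per_core_pct LOAD CORES: print a load average as a percentage of total capacity.
# (per_core_pct is a made-up helper name, not a standard command.)
per_core_pct() {
    awk -v avg="$1" -v cores="$2" 'BEGIN { printf "%.0f%%\n", avg / cores * 100 }'
}

# The example from the text: a 15-minute load of 1.20 on a 6-core machine.
per_core_pct 1.20 6          # prints 20%

# The same calculation with live numbers from this machine.
read -r load1 load5 load15 rest < /proc/loadavg
cores=$(nproc)
echo "1 min:  $(per_core_pct "$load1" "$cores")"
echo "5 min:  $(per_core_pct "$load5" "$cores")"
echo "15 min: $(per_core_pct "$load15" "$cores")"
```

Anything that stays well above 100% for a while means more work is queued up than the cores can handle.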

Now let's assume that your system has suddenly slowed down.  You open top and see the following:

Whoa! We've got load averages of 9.20, 4.28 and 2.18 - on my 6 cores, that's 153% over 1 minute, 71% over 5 minutes, and 36% over 15 minutes. So we can tell just from looking at these numbers that something started hammering the system within the last 5 minutes or so. And if we look below, we can see that a process called ghb is using 482.4% of the CPU! (top counts each core as 100%, so that's nearly five of my six cores.) That's actually Handbrake, which was busy converting a video, so that's kind of to be expected, since video conversion is a resource-intensive process.

But what if I want to do more than just sit back and let Handbrake bog my system down until it's finished?  That's where renice comes in!

Every process in Linux has a "nice" value. The nice value sets the scheduling priority of the process, and that priority determines how big a share of CPU time the process gets when other processes want the CPU too. In top, the nice value is in the "NI" column, and we can see that Handbrake has a nice value of 0, which is essentially neutral, and which is also the default a process gets unless someone asks for something else. Nice values go from -20, which is the LEAST nice, to 19, which is the MOST nice.
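The nice command itself is a handy way to see these values outside of top. Here's a minimal sketch, assuming the GNU coreutils version of nice: run with no arguments it prints the niceness of the current shell, and with -n it starts a command at an adjusted niceness. The inner nice is only there to report the value it was launched with; in real use you'd put your own command in its place.

```shell
# Print the niceness this shell is running at (usually 0).
base=$(nice)
echo "shell niceness: $base"

# Start a command 10 steps nicer than the shell. Here the command is nice
# itself, which simply reports the niceness it was launched with.
adjusted=$(nice -n 10 nice)
echo "niceness under 'nice -n 10': $adjusted"
```

Note that nice -n is an adjustment added to the current value, so a shell that is already at 5 would start the command at 15. The result is capped at 19.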

Think of the nice value in terms of people (processes) standing in line to get into a concert.  Let's assume three different scenarios:

  1. Scott is standing at about the halfway point of the line.  Scott has a nice value of 0 - he's not at the front of the line, but he's not at the back, either.
  2. Crispin starts out right behind Scott, but he keeps letting people get in line in front of him, which eventually puts him at the back of the line.  Crispin has a nice value of 19 - he's the nicest it's possible to be.
  3. Steve arrives right before they're about ready to let people in and he cuts the line right at the front.  Steve has a nice value of -20 - not nice at all!

Using the magic of renice, we can move these processes/people around.  Before we do that, however, it's important to know that regular users have fewer permissions than the root user when it comes to adjusting priorities, so a regular user can only make a process MORE nice, not LESS.  Only the root user (or someone who has permission to sudo) can give a process more resources/make it less nice.

So let's assume that Steve is Handbrake/ghb, and type renice -n 10 456911, where 456911 is Handbrake's process ID (PID - see the left column in top above), and -n 10 tells the system what the new nice value should be. Since Handbrake started at a nice value of 0 and was using 482% CPU, this new nice value of 10 means it will give way whenever other processes want the CPU. If nothing else is competing, it can still use whatever is idle.
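To try this without touching a real program, the renice step can be rehearsed on a throwaway process. This is a sketch under a few assumptions: sleep 300 stands in for Handbrake, the renice is the util-linux one shipped with most Linux distributions (where -n sets the new value rather than adding to it), and ps is available to read back the NI column.

```shell
# Start a disposable background process to experiment on.
sleep 300 &
pid=$!

# Make it nicer. Raising the nice value of your own process needs no sudo.
renice -n 10 -p "$pid"
renice_status=$?          # 0 means the change was accepted
echo "renice exit status: $renice_status"

# Read back the nice value; the NI column should now show 10.
ps -o pid,ni,comm -p "$pid"

# Going the other way is normally refused for a regular user
# (expect a "Permission denied" error unless you run this as root).
renice -n 0 -p "$pid"

# Clean up the test process.
kill "$pid"
```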

However, what if I want to make more resources available to Scott (Joplin), now that Steve (Handbrake) isn't using so much? I would need to use sudo, because lowering a nice value requires root privileges. So I would type something like sudo renice -n -10 273996 (Joplin's PID - see top above). This would let Joplin run smoother, as it would get a bigger share of CPU time than it did before, and significantly more than Steve/Handbrake. (Nice values only affect CPU scheduling, not memory.) That's what you get for cutting, Steve!

We're not done with Steve yet, though!  We still need to talk about kill and pkill.  Now let's assume that Steve doesn't like that he had to go almost to the back of the line, and he starts faking a seizure in an attempt to make people feel sorry for him so he can get back to the front of the line.  In Linux, this means that Handbrake starts to hang.  Oh no!  Here's where it gets interesting, though, because we've got a License to (p)Kill (I'm so sorry).

There are technically 31 standard signals we could send, but I'm not going to cover them all here. It's enough to know that each signal has a number, and the number is what you use to tell the system exactly how you want to kill a process. By default, if you don't use a number at all, the system will send signal 15 (SIGTERM), which will basically tell a process to shut itself down gracefully if it can, much like short-pressing the power button on your computer or laptop - when you do that, stuff is going to try to shut down in a way that won't cause issues like data loss. HOWEVER, if the process doesn't shut down, and we need it to just die now, immediately, don't try to save anything, just die, we would use signal 9 (SIGKILL), which causes immediate death. To do this, you simply type kill -9 456911 (kill wants a PID, not a name, so we use Handbrake's PID again) and Handbrake/Steve will die as soon as you hit enter.
NOTE: You can't kill processes owned by others without root privileges.
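The "ask nicely first, then insist" routine described above can be sketched like this. Everything here is a stand-in I picked for illustration: the child process deliberately ignores signal 15 to play the part of a hung Handbrake, and the grace periods of one and two seconds are arbitrary.

```shell
# A stand-in for a hung program: it ignores SIGTERM, so a polite kill does nothing.
sh -c 'trap "" TERM; exec sleep 300' &
pid=$!
sleep 1                       # give it a moment to start up

kill "$pid"                   # no number given, so this sends signal 15 (SIGTERM)
sleep 2                       # grace period

# Signal 0 sends nothing; it only checks whether the PID still exists.
if kill -0 "$pid" 2>/dev/null; then
    echo "still running after SIGTERM, sending SIGKILL"
    kill -9 "$pid"
fi

wait "$pid"
status=$?
echo "exit status: $status"   # 137, which is 128 + 9 (killed by SIGKILL)
```

Trying signal 15 first gives a well-behaved program the chance to save its work; signal 9 can't be caught or ignored, so it's the fallback rather than the default.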

Alternatively, if we just want to PAUSE Handbrake while we get some other stuff done, we could use kill signal 19 (SIGSTOP), which will effectively suspend a process, and when we're ready for it to use more resources, we could use signal 18 (SIGCONT), which will unpause/continue the process.  So sometimes, a process might only be MOSTLY dead, in a state of suspension, waiting for the true love's kiss of that sweet, sweet signal 18.
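Pausing and resuming is easy to try on a throwaway process. A small sketch, again with sleep 300 standing in for Handbrake: it uses the signal names rather than the numbers, because 19 and 18 are the values on x86 and ARM Linux but differ on some other architectures, and it reads the process state from /proc, which assumes Linux.

```shell
sleep 300 &
pid=$!

kill -STOP "$pid"        # same as kill -19 on x86 and ARM Linux
sleep 1
paused_state=$(awk '/^State:/ { print $2 }' "/proc/$pid/status")
echo "after SIGSTOP: $paused_state"     # T means stopped

kill -CONT "$pid"        # same as kill -18 on x86 and ARM Linux
sleep 1
resumed_state=$(awk '/^State:/ { print $2 }' "/proc/$pid/status")
echo "after SIGCONT: $resumed_state"    # S means sleeping, which is normal for sleep

# Clean up the test process.
kill "$pid"
```

The same T shows up in the S column of top for any process you've paused this way.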

These signals aren't just for you; the system uses them as well. For example, if your browser crashes due to an invalid memory reference, the kernel sends it SIGSEGV, which is one of the signals that can generate a core dump. If core dumps are enabled on your system, you'd be able to examine the dump along with the logs and hopefully determine why the process crashed.
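You can see the traces a signal leaves behind without waiting for a real crash. In this sketch a child shell sends SIGSEGV to itself as a stand-in for a crashing browser. Whether a core file actually gets written depends on how your system is configured, so the first line only reports the current limit.

```shell
# Show the core file size limit for this shell; 0 means no core file will be written.
ulimit -c

# A child shell sends SIGSEGV (signal 11) to itself, standing in for a crashing program.
sh -c 'kill -SEGV $$'
status=$?

# A process killed by a signal reports 128 + the signal number as its exit status.
echo "exit status: $status"            # 139, which is 128 + 11
echo "signal: $(kill -l "$status")"    # SEGV
```

kill -l with no arguments lists every signal name your shell knows about, which is handy when you can't remember a number.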

Sometimes you'll have processes running that have the same command name. Browsers do this a lot. If you take a look at the top example above, you'll see multiple instances of "brave" in the command column. If I wanted to kill all processes that share the same command name, I would use the killall command: killall brave and the system would send signal 15 to those processes. If that doesn't work, you could send signal 9 like this: killall -9 brave.

killall leads us nicely to the last command I wanted to cover: pkill.  Like killall, pkill can be used to kill multiple processes, but it also includes advanced features like being able to kill processes based on the owning user, the owning group, the child processes of a parent process, or processes running on a specific terminal.

So if we wanted to terminate all of Steve's processes, and ONLY Steve's, we could do this: pkill -U steve. Or, if we know that Steve is logged in to the same server we're on, and is working in a shell session in tty6, we (as root) could go after everything on that terminal with this pkill command: pkill -t tty6. This will send signal 15 to his shell and all processes in it. An interactive shell like bash ignores signal 15, though, so to actually log him out we might need to follow up with pkill -9 -t tty6.
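Since pkill can hit a lot of processes at once, it's worth previewing the match with pgrep, which takes the same selection options but only lists what it finds. This sketch assumes the procps versions of pgrep and pkill, and it uses -P $$ (children of the current shell) so that it can only ever touch the two throwaway sleep processes it starts itself.

```shell
# Start two disposable processes that share a command name.
sleep 300 &
pid1=$!
sleep 300 &
pid2=$!

# Preview what would be matched: children of this shell ($$) named "sleep".
pgrep -l -P $$ sleep
before=$(pgrep -P $$ sleep | wc -l)

# Send the default signal 15 to exactly those processes.
pkill -P $$ sleep

# Safety net so nothing is left running if pkill isn't installed.
kill "$pid1" "$pid2" 2>/dev/null
wait                      # collect the terminated children

after=$(pgrep -P $$ sleep | wc -l)
echo "matched before: $before, after: $after"
```

Swapping -P $$ for -U steve or -t tty6 gives the same preview for the commands above.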

That's what you get for cutting in line, Steve!