Saturday, February 10, 2007

Scheduling on SMP Hardware

Ever wondered how the Linux scheduler schedules a job on another processor, after the idle task has halted the CPU?

On x86 systems, the answer lies in InterProcessor Interrupts (IPI). An IPI is a type of custom interrupt that can be sent from one processor to another, using an Advanced Programmable Interrupt Controller (APIC).

One of the customers of IPIs happens to be the OS task scheduler, where a processor can be woken up after a task is added to it's per-processor run queue.

Let me add some context.

On a uni-processor machine, the idle task is scheduled when the OS has absolutely nothing to do. The main purpose of this task is to shut down the CPU by first making sure that interrupts are enabled, and then sending the hlt instruction. This happens in the default_idle() function within arch/i386/process.c, via a call to safe_halt().

100 void default_idle(void)
101 {
102 local_irq_enable();
104 if (!hlt_counter && boot_cpu_data.hlt_works_ok) {
105 clear_thread_flag(TIF_POLLING_NRFLAG);
106 smp_mb__after_clear_bit();
107 while (!need_resched()) {
108 local_irq_disable();
109 if (!need_resched())
110 safe_halt();
111 else
112 local_irq_enable();
113 }
114 set_thread_flag(TIF_POLLING_NRFLAG);
115 } else {
116 while (!need_resched())
117 cpu_relax();
118 }
119 }

The safe_halt() function, is not really a function, but a preprocessor macro that calls the sti and hlt instructions, and is defined in include/asm-i386/system.h.

#define safe_halt()             __asm__ __volatile__("sti; hlt": : :"memory")

Of course, since interrupts are enabled, the CPU is immediately woken up when an interrupt is received. This could be from any IO device, the timer, or even another processor. On an idle machine, the timer interrupt vector gates into the scheduler, which may examine it's per-processor run-queues, and try to find a task to execute. If nothing is found, the idle task gets rescheduled.

This works well on uniprocessor systems, but on an SMP system, the timer interrupt only wakes up the primary CPU. So what happens when a new task is forked, and the scheduler decides that it needs to run on another CPU?

The answer lies within the scheduler code (kernel/sched.c), in the resched_task() function.

820 #ifdef CONFIG_SMP
821 static void resched_task(task_t *p)
822 {
823 int cpu;
825 assert_spin_locked(&task_rq(p)->lock);
827 if (unlikely(test_tsk_thread_flag(p, TIF_NEED_RESCHED)))
828 return;
830 set_tsk_thread_flag(p, TIF_NEED_RESCHED);
832 cpu = task_cpu(p);
833 if (cpu == smp_processor_id())
834 return;
836 /* NEED_RESCHED must be visible before we test POLLING_NRFLAG */
837 smp_mb();
838 if (!test_tsk_thread_flag(p, TIF_POLLING_NRFLAG))
839 smp_send_reschedule(cpu);
840 }
841 #else
842 static inline void resched_task(task_t *p)
843 {
844 assert_spin_locked(&task_rq(p)->lock);
845 set_tsk_need_resched(p);
846 }
847 #endif

As you can see there are two versions of this function, one for uniprocessor systems, and the other for SMP systems. On a uniprocessor, the task is rescheduled by setting the TIF_NEED_RESCHED flag, via a call to set_tsk_need_resched().

On a multiprocessor system, the same flag is set, and a check is made to see if the task should be running on a different processor from the current one. If not, it just returns, and lets the scheduler take care of rescheduling the task. Otherwise, it wakes up the target processor by calling the smp_send_reschedule() function.

The smp_send_reschedule() function (defined in arch/i386/kernel/smp.c) simply sends an IPI to the specified processor, by calling send_IPI_mask().

472 /*
473 * This function sends a 'reschedule' IPI to another CPU.
474 * it goes straight through and wastes no time serializing
475 * anything. Worst case is that we lose a reschedule ...
476 */
477 void smp_send_reschedule(int cpu)
478 {
479 WARN_ON(cpu_is_offline(cpu));
480 send_IPI_mask(cpumask_of_cpu(cpu), RESCHEDULE_VECTOR);
481 }

The send_IPI_mask() function uses the APIC to fire an interrupt off to the target processor. It calls send_IPI_mask_bitmask(), which in-turn programs the APIC using the functions in include/asm-i386/apic.h.

160 void send_IPI_mask_bitmask(cpumask_t cpumask, int vector)
161 {
162 unsigned long mask = cpus_addr(cpumask)[0];
163 unsigned long cfg;
164 unsigned long flags;
166 local_irq_save(flags);
167 WARN_ON(mask & ~cpus_addr(cpu_online_map)[0]);
168 /*
169 * Wait for idle.
170 */
171 apic_wait_icr_idle();
173 /*
174 * prepare target chip field
175 */
176 cfg = __prepare_ICR2(mask);
177 apic_write_around(APIC_ICR2, cfg);
179 /*
180 * program the ICR
181 */
182 cfg = __prepare_ICR(0, vector);
184 /*
185 * Send the IPI. The write to APIC_ICR fires this off.
186 */
187 apic_write_around(APIC_ICR, cfg);
189 local_irq_restore(flags);
190 }

The specific IPI-gate is defined as RESCHEDULE_VECTOR, and is installed by the smp_intr_init() function in arch/i386/kernel/smpboot.c.

1454 void __init smp_intr_init(void)
1455 {
1456 /*
1457 * IRQ0 must be given a fixed assignment and initialized,
1458 * because it's used before the IO-APIC is set up.
1459 */
1460 set_intr_gate(FIRST_DEVICE_VECTOR, interrupt[0]);
1462 /*
1463 * The reschedule interrupt is a CPU-to-CPU reschedule-helper
1464 * IPI, driven by wakeup.
1465 */
1466 set_intr_gate(RESCHEDULE_VECTOR, reschedule_interrupt);
1468 /* IPI for invalidation */
1469 set_intr_gate(INVALIDATE_TLB_VECTOR, invalidate_interrupt);
1471 /* IPI for generic function call */
1472 set_intr_gate(CALL_FUNCTION_VECTOR, call_function_interrupt);
1473 }

These gates are installed at boot time, and are responsible for handling the various interrupts on multiprocessor systems. An APIC is required on all SMP systems for various reasons, and scheduling is one of them.

There is one small detail that I left out. You may have wondered about the TIF_POLLING_NRFLAG scattered about in the code. If you provide "idle=poll" as a kernel parameter at boot time, the idle function points to poll_idle(), instead of default_idle().

The poll_idle() function does not halt the processor, but spins around checking to see if a task is ready to run. As you may have guessed, although performance may improve, it's not very power-friendly. You certainly don't want this on your laptop.

124 /*
125 * On SMP it's slightly faster (but much more power-consuming!)
126 * to poll the ->work.need_resched flag instead of waiting for the
127 * cross-CPU IPI to arrive. Use this option with caution.
128 */
129 static void poll_idle (void)
130 {
131 local_irq_enable();
133 asm volatile(
134 "2:"
135 "testl %0, %1;"
136 "rep; nop;"
137 "je 2b;"
138 : : "i"(_TIF_NEED_RESCHED), "m" (current_thread_info()->flags));
139 }

If you go back and read the resched_task() function, you will notice that the IPI is only sent if TIF_POLLING_NRFLAG is not set. Since the CPU is already running, it is not necessary to interrupt it.

A final word about IPIs. Waking up a halted processor is not all that it is used for. It is also used in the memory management code, to maintain cache coherency, in certian synchronization code, and many other places in the scheduler.

If you found this interesting and want to learn more, listed below are a few places you can go digging around. Happy hacking!

Note: All file paths are relative to the base of the Linux kernel source tree. The code snippets in this article are from version 2.6.20 of the kernel.

Intel APIC Architecture


  1. Thanks for explaining, I read your blog to find out new steps to troubleshoot *NIX

  2. as always.. another great article.. thanks!

  3. superb... i have alwayes wondered about this.

    your blog is fantastic. its nice to see some good technical posts for a change.

  4. Cool. You have an excellent blog. You should update it more often.

  5. thanks for the information in such detail