Scheduling on SMP Hardware

Ever wondered how the Linux scheduler schedules a job on another processor, after the idle task has halted the CPU?

On x86 systems, the answer lies in InterProcessor Interrupts (IPI). An IPI is a type of custom interrupt that can be sent from one processor to another, using an Advanced Programmable Interrupt Controller (APIC).

One of the customers of IPIs happens to be the OS task scheduler, where a processor can be woken up after a task is added to it's per-processor run queue.

Let me add some context.

On a uni-processor machine, the idle task is scheduled when the OS has absolutely nothing to do. The main purpose of this task is to shut down the CPU by first making sure that interrupts are enabled, and then sending the hlt instruction. This happens in the default_idle() function within arch/i386/process.c, via a call to safe_halt().


100 void default_idle(void)
101 {
102         local_irq_enable();
103
104         if (!hlt_counter && boot_cpu_data.hlt_works_ok) {
105                 clear_thread_flag(TIF_POLLING_NRFLAG);
106                 smp_mb__after_clear_bit();
107                 while (!need_resched()) {
108                         local_irq_disable();
109                         if (!need_resched())
110                                 safe_halt();
111                         else
112                                 local_irq_enable();
113                 }
114                 set_thread_flag(TIF_POLLING_NRFLAG);
115         } else {
116                 while (!need_resched())
117                         cpu_relax();
118         }
119 }

The safe_halt() function, is not really a function, but a preprocessor macro that calls the sti and hlt instructions, and is defined in include/asm-i386/system.h.

#define safe_halt()             __asm__ __volatile__("sti; hlt": : :"memory")

Of course, since interrupts are enabled, the CPU is immediately woken up when an interrupt is received. This could be from any IO device, the timer, or even another processor. On an idle machine, the timer interrupt vector gates into the scheduler, which may examine it's per-processor run-queues, and try to find a task to execute. If nothing is found, the idle task gets rescheduled.

This works well on uniprocessor systems, but on an SMP system, the timer interrupt only wakes up the primary CPU. So what happens when a new task is forked, and the scheduler decides that it needs to run on another CPU?

The answer lies within the scheduler code (kernel/sched.c), in the resched_task() function.


820 #ifdef CONFIG_SMP
821 static void resched_task(task_t *p)
822 {
823         int cpu;
824
825         assert_spin_locked(&task_rq(p)->lock);
826
827         if (unlikely(test_tsk_thread_flag(p, TIF_NEED_RESCHED)))
828                 return;
829
830         set_tsk_thread_flag(p, TIF_NEED_RESCHED);
831
832         cpu = task_cpu(p);
833         if (cpu == smp_processor_id())
834                 return;
835
836         /* NEED_RESCHED must be visible before we test POLLING_NRFLAG */
837         smp_mb();
838         if (!test_tsk_thread_flag(p, TIF_POLLING_NRFLAG))
839                 smp_send_reschedule(cpu);
840 }
841 #else
842 static inline void resched_task(task_t *p)
843 {
844         assert_spin_locked(&task_rq(p)->lock);
845         set_tsk_need_resched(p);
846 }
847 #endif

As you can see there are two versions of this function, one for uniprocessor systems, and the other for SMP systems. On a uniprocessor, the task is rescheduled by setting the TIF_NEED_RESCHED flag, via a call to set_tsk_need_resched().

On a multiprocessor system, the same flag is set, and a check is made to see if the task should be running on a different processor from the current one. If not, it just returns, and lets the scheduler take care of rescheduling the task. Otherwise, it wakes up the target processor by calling the smp_send_reschedule() function.

The smp_send_reschedule() function (defined in arch/i386/kernel/smp.c) simply sends an IPI to the specified processor, by calling send_IPI_mask().


472 /*
473  * This function sends a 'reschedule' IPI to another CPU.
474  * it goes straight through and wastes no time serializing
475  * anything. Worst case is that we lose a reschedule ...
476  */
477 void smp_send_reschedule(int cpu)
478 {
479         WARN_ON(cpu_is_offline(cpu));
480         send_IPI_mask(cpumask_of_cpu(cpu), RESCHEDULE_VECTOR);
481 }

The send_IPI_mask() function uses the APIC to fire an interrupt off to the target processor. It calls send_IPI_mask_bitmask(), which in-turn programs the APIC using the functions in include/asm-i386/apic.h.


160 void send_IPI_mask_bitmask(cpumask_t cpumask, int vector)
161 {
162         unsigned long mask = cpus_addr(cpumask)[0];
163         unsigned long cfg;
164         unsigned long flags;
165
166         local_irq_save(flags);
167         WARN_ON(mask & ~cpus_addr(cpu_online_map)[0]);
168         /*
169          * Wait for idle.
170          */
171         apic_wait_icr_idle();
172                
173         /*
174          * prepare target chip field
175          */
176         cfg = __prepare_ICR2(mask);
177         apic_write_around(APIC_ICR2, cfg);
178                
179         /*
180          * program the ICR
181          */
182         cfg = __prepare_ICR(0, vector);
183                        
184         /*
185          * Send the IPI. The write to APIC_ICR fires this off.
186          */
187         apic_write_around(APIC_ICR, cfg);
188
189         local_irq_restore(flags);
190 }

The specific IPI-gate is defined as RESCHEDULE_VECTOR, and is installed by the smp_intr_init() function in arch/i386/kernel/smpboot.c.


1454 void __init smp_intr_init(void)
1455 {
1456         /*
1457          * IRQ0 must be given a fixed assignment and initialized,
1458          * because it's used before the IO-APIC is set up.
1459          */
1460         set_intr_gate(FIRST_DEVICE_VECTOR, interrupt[0]);
1461
1462         /*
1463          * The reschedule interrupt is a CPU-to-CPU reschedule-helper
1464          * IPI, driven by wakeup.
1465          */
1466         set_intr_gate(RESCHEDULE_VECTOR, reschedule_interrupt);
1467
1468         /* IPI for invalidation */
1469         set_intr_gate(INVALIDATE_TLB_VECTOR, invalidate_interrupt);
1470
1471         /* IPI for generic function call */
1472         set_intr_gate(CALL_FUNCTION_VECTOR, call_function_interrupt);
1473 }

These gates are installed at boot time, and are responsible for handling the various interrupts on multiprocessor systems. An APIC is required on all SMP systems for various reasons, and scheduling is one of them.

There is one small detail that I left out. You may have wondered about the TIF_POLLING_NRFLAG scattered about in the code. If you provide "idle=poll" as a kernel parameter at boot time, the idle function points to poll_idle(), instead of default_idle().

The poll_idle() function does not halt the processor, but spins around checking to see if a task is ready to run. As you may have guessed, although performance may improve, it's not very power-friendly. You certainly don't want this on your laptop.


124 /*
125  * On SMP it's slightly faster (but much more power-consuming!)
126  * to poll the ->work.need_resched flag instead of waiting for the
127  * cross-CPU IPI to arrive. Use this option with caution.
128  */
129 static void poll_idle (void)
130 {
131         local_irq_enable();
132
133         asm volatile(
134                 "2:"
135                 "testl %0, %1;"
136                 "rep; nop;"
137                 "je 2b;"
138                 : : "i"(_TIF_NEED_RESCHED), "m" (current_thread_info()->flags));
139 }

If you go back and read the resched_task() function, you will notice that the IPI is only sent if TIF_POLLING_NRFLAG is not set. Since the CPU is already running, it is not necessary to interrupt it.

A final word about IPIs. Waking up a halted processor is not all that it is used for. It is also used in the memory management code, to maintain cache coherency, in certian synchronization code, and many other places in the scheduler.

If you found this interesting and want to learn more, listed below are a few places you can go digging around. Happy hacking!

Note: All file paths are relative to the base of the Linux kernel source tree. The code snippets in this article are from version 2.6.20 of the kernel.

kernel/sched.c
kernel/cpu.c
arch/i386/kernel/smp.c
arch/i386/kernel/smpboot.c
arch/i386/kernel/process.c
arch/i386/kernel/head.S
include/asm-i386/apic.h
Intel APIC Architecture

5 comments:

Anonymous6:33 AM, February 14, 2007
Thanks for explaining, I read your blog to find out new steps to troubleshoot *NIX
Anonymous11:16 AM, February 15, 2007
as always.. another great article.. thanks!
Anonymous9:19 PM, February 19, 2007
superb... i have alwayes wondered about this.

your blog is fantastic. its nice to see some good technical posts for a change.
Anonymous9:07 AM, March 25, 2007
Cool. You have an excellent blog. You should update it more often.
Anonymous3:12 AM, February 15, 2008
thanks for the information in such detail

0xFE - 11111110b - 0376

Saturday, February 10, 2007

Scheduling on SMP Hardware

5 comments: