There are 14 queues and one thread state for each queue. The queues are
shown below with an example of what the queue looks like. The states are
listed later. After the list of states, the event flag is described and
examples of what is done to put a thread on a queue and take it off.

Thread scheduling is basically just moving threads among these queues. 
When a thread has to block to wait for something, it is put on the
queue for that type of event. When the thread is not waiting for an event,
but is just waiting to run, it is on the pthread_current_prio_queue. From
this queue, it will become a running thread in the order of it's priority.
Otherwise, it will be waiting around on one of the other queues. A thread
will run until it blocks for something, or until the timer goes off and
reschedules it in the priority queue to be run again.

QUEUES  and what they look like
-------------------------------
Threads are put on the queues in the order thread1, thread2, thread3, etc.
The examples show what the queues look like after the last thread is
put on the queue. The queue heads look like this:

struct pthread_queue {
    struct pthread  *q_next;
    struct pthread  *q_last;
    void            *q_data;
};

    Queue functions are:
    --------------------
	pthread_queue_init(queue); 
		- queue = {NULL, NULL, NULL}
	pthread_queue_enq(queue, thread);
		- puts thread on queue
	pthread_queue_get();
		- returns pointer to next thread on queue, thread
			stays on the queue
	pthread_queue_deq(queue);
		- returns pointer to the next thread on the queue
			and removes the thread from the queue
	pthread_queue_remove(queue, thread);
		- removes the specified thread from the queue

    in the pthread_t structure, there are 5 queue related fields:
    -------------------------------------------------
	struct pthread_queue join_queue 
		- queue of threads that are waiting to join with this one
		- the queue looks like this:
			thread.join_queue->q_next = thread1; 
			thread.join_queue->q_last = thread2; 
			thread1->queue = &thread.join_queue; 
			thread1->next = thread2; 
			thread2->queue = &thread.join_queue; 
			thread2->next = NULL;

	struct pthread_queue *queue 
		- pointer to the head of the queue that the thread is on
		- used for dead, alloc, join, r_queue, w_queue, c_queue, m_queue
		- queue head is described below
	struct pthread *next 
		- next thread in queue
		- used for dead, alloc, join, prio, r_queue, w_queue, c_queue, m_queue

	struct pthread *pll 
		- used for the pthread linked list, has all threads in any state
	struct pthread *sll 
		- used for the linked list of sleeping threads

    in globals.c:
    -----------------------------------

	struct pthread *pthread_link_list - head of list of all threads
		- looks like this:
			pthread_link_list = thread3; 
			thread3->pll = thread2; 
			thread2->pll = thread1; 
			thread1->pll = NULL;

	struct pthread_queue pthread_dead_queue
		- queue of dead threads
	struct pthread_queue pthread_alloc_queue
		- queue of available, allocated threads
		- looks like this:
			queue->q_next = thread1; 
			queue->q_last = thread2; 
			thread1->queue = queue; 
			thread1->next = thread2; 
			thread2->queue = queue; 
			thread2->next = NULL;

	struct pthread_prio_queue pthread_current_prio_queue
		- current priority queue
		- thread is in state PS_RUNNING
		- looks like this:
			pp is priority level > p

			prio_queue->level[pp].first = thread1;
			prio_queue->level[pp].last = thread2;
			prio_queue->next = thread1;
			thread1->next = thread2;
			thread2->next = thread3;
			thread3->next = NULL;
			prio_queue->level[p].first = thread3;
			prio_queue->level[p].last  =  thread3;

    in sleep.c:
    ------------------
	struct pthread *pthread_sleep
		- list of sleeping threads
		- looks like this:
			pthread_sleep = thread1; 
			thread1->sll = thread2; 
			thread2->sll = thread3; 
			thread3->sll = NULL;

    in sig.c:
    ----------------
	pthread_sigwait	linked list of threads waiting for a signal
		- looks like this:
			pthread_sigwait = thread2; 
			thread2->next = thread1; 
			thread1->next = NULL;

    in fd.c:
    --------
    each fd_table entry has:
	r_queue		list of threads waiting for a read lock on this fd
	w_queue		list of threads waiting for a write lock on this fd

    in fd_kern.c:
    -------------
	fd_wait_read		list of threads waiting for kernel read
	fd_wait_write		list of threads waiting for kernel write
	fd_wait_select		list of threads waiting for kernel select

    in wait.c:
    ----------
	wait_queue
		- is used for each of these queues:
			fd.r_queue fd.w_queue fd_wait_read fd_wait_write 
			fd_wait_select wait_queue
		- looks like this:
			queue->q_next = thread1;	
			queue->q_last = thread2;	
			thread1->queue = queue;
			thread1->next  =  thread2;
			thread2->queue = queue;
			thread2->next = NULL;


    each conditional has
    --------------------
	cond.c_queue		
		- list of threads waiting on a conditional
		- looks like this:
			cond.c_queue->q_next = thread1;
			cond.c_queue->q_last = thread2;
			thread1->queue = cond.c_queue;
			thread1->next  =  thread2;
			thread2->queue = cond.c_queue;
			thread2->next = NULL;

    each mutex has
    --------------
	mutex.m_queue
		- list of threads waiting on a mutex
		- looks like this:
			mutex.m_queue->q_next = thread1;
			mutex.m_queue->q_last = thread2;
			thread1->queue = mutex.m_queue;
			thread1->next  =  thread2;
			thread2->queue = mutex.m_queue;
			thread2->next = NULL;

THREAD STATES (from state.def) and the corresponding queue:
-----------------------------------------------------------
Queue				State		Description
					Functions used in

pthread_current_prio_queue	PS_RUNNING	running
mutex.m_queue			PS_MUTEX_WAIT	waiting for a mutex
					pthread_mutex_lock()
cond.c_queue			PS_COND_WAIT	waiting on a condition variable
					pthread_cond_wait()
					pthread_cond_timedwait()
r_queue				PS_FDLR_WAIT	waiting on a fd read lock
					fd_lock()
w_queue				PS_FDLW_WAIT	waiting on a fd write lock
					fd_lock()
fd_wait_read			PS_FDR_WAIT	waiting on a kernel fd to have data to read
					read() readv() recv() 
					recvfrom() recvmsg()
fd_wait_write			PS_FDW_WAIT	waiting on a kernel fd to write data
					write() writev() 
					send() sendto() sendmsg()
fd_wait_select			PS_SELECT_WAIT	waiting for several fds in a select
					select()
pthread_sleep			PS_SLEEP_WAIT	waiting on a sleep
					sleep() usleep() nanosleep()
wait_queue			PS_WAIT_WAIT	waiting for a child to die
					wait() waitpid() [wait3() wait4()]
pthread_sigwait			PS_SIGWAIT	waiting on a set of signals
					sigwait()
join_queue			PS_JOIN		waiting for a thread to die
					pthread_join()
pthread_dead_queue		PS_DEAD		waiting for a thread to join with it or detach it
pthread_alloc_queue		PS_UNALLOCED	available to use for a new thread

EVENT FLAGS
-----------

Event flags are used to monitor the event state of a thread. A thread is
either waiting for an event, or the event is done. The flags are mutually
exclusive - they cannot both be set at the same time. Macros are defined
to manipulate and test the flags.

    flags (in pthread.h):
    ---------------------
	PF_WAIT_EVENT	0x01
	PF_DONE_EVENT	0x02

    macros (in pthread.h):
    ----------------------
	SET_PF_DONE_EVENT(thread)
	This wants to set the flag to indicate an event is done.
	If the flag is already set to PF_DONE_EVENT, then the flag
	cannot be set now. If the flag is set to PF_WAIT_EVENT, then
	it's OK to set the flag to PF_DONE_EVENT. If the flag is
	not set to PF_DONE_EVENT or PF_WAIT_EVENT, then PANIC

		if the PF_DONE_EVENT flag is not set
		then 
			if the PF_WAIT_EVENT flag is set
			then set flags to PF_DONE_EVENT
				return OK
			else PANIC
		else return NOTOK

	SET_PF_WAIT_EVENT(thread)
	This sets the flag to indicate it's waiting for an event.
	Set the flag only if it is clear, otherwise PANIC

		if flags is set to PF_WAIT_EVENT or PF_DONE_EVENT
		then PANIC
		else set flags to PF_WAIT_EVENT
		
	CLEAR_PF_DONE_EVENT(pthread)
	Clear the flag if it is set to PF_DONE_EVENT.

		if flags is not set to PF_DONE_EVENT
		then PANIC
		else set flags to 0



PUTTING A THREAD ON A QUEUE:
----------------------------
This is a general sequence of steps to put a thread on a queue. There
are variations, but this covers what is basically needed.

pthread_sched_prevent(); - locks kernel
SET_PF_WAIT_EVENT(pthread_run);
pthread_queue_enq(queue, pthread_run);
pthread_resched_resume(state); - sets state, processes signals, changes context,
				processes signals, unlocks kernel
if event has occurred successfully CLEAR_PF_DONE_EVENT(pthread_run);
else pthread_resched_resume(state) - reschedule the thread to keep waiting


REMOVING A THREAD FROM A QUEUE:
-------------------------------
This is the general sequence of steps to take a thread off of a queue. There
are variations, but this covers what is basically needed.


pthread = pthread_queue_deq(queue); - take next thread off of queue
pthread_sched_prevent();
if SET_PF_DONE_EVENT(pthread)	- check if the event it was waiting for is done
	pthread_sched_other_resume(pthread);	- schedule thread
else pthread_sched_resume();	- continue with current thread


DESCRIPTION OF FUNCTIONS
------------------------
This section describes very tersely what some of the key functions
do. This is what helped me to understand what is going on. It may or
may not help you much. 


pthread.c
	pthread_create
		create thread structure (or take it from alloc queue)
		initialize it
		pthread_sched_prevent
		pthread_sched_other_resume(new_thread)

signal.c
	pthread_sched_other_resume(thread)
		set state to PS_RUNNING
		put on prio_queue
		if thread has higher priority than pthread_run
			and the kernel is locked
			sig_handler(SIGVTALRM) 
				- this eventually causes context switch
		pthread_sched_resume() 
				- process any signals and continue running 
				the current thread

	sig_handler(sig)
		(kernel lock must  =  1)
		sig_handler_switch(sig)
		process any other signals

	sig_handler_switch(sig)
		VTALRM: sigvtalrm();
		
	sigvtalrm()
		if (sig_count) set sig_count = 0, unblock all signals.
		context_switch()
		context_switch_done()

	pthread_sched_resume()
		loop until kernel is not locked
			if there are any signals to process,
				keep the kernel locked
				call the signal handler
			if pthread_run has a signal to process
				keep the kernel locked
				call pthread_sig_process
			
	pthread_resched_resume(state)
		changes state of pthread_run to state
		sig_handler(SIGVTALRM)
		pthread_sched_resume()
			

signal.c
	context_switch
		if state is PS_RUNNING put current thread on prio_queue
		save state info
		poll all fds
	A:	get next thread off of prio_queue
		restore that thread's state
		return

		if no threads on prio_queue, see if there are
		any threads in the linked list
			process any interrupts for the thread
			if no signals to process, wait
			then process the signal and go back to A:
		exit if there are no threads alive

	context_switch_done()
		clear SIGVTALRM
		set_thread_timer()

	set_thread_timer()
		setitimer(pthread_run)
		based on scheduling policy

	fd_lock
		lock fd's mutex with timedwait
		return error if it times out
		if the fd has an owner and it is not pthread_run
		then	stop scheduling
			put pthread_run on the fd's r or w queue
			SET_PF_WAIT_EVENT on pthread_run
			unlock mutex
			if timeout is 0
			then	set pthread_run->data
				resched_resume(PS_FDLR_WAIT)
				lock mutex
			else	
				stuff

			CLEAR_PF_DONE_EVENT on pthread_run
		else	increase lockcount or set owner to pthread_run
		unlock fd's mutex
	
	fd_unlock
		lock fd's mutex
		if pthread_run owns fd for reading/writing and is not waiting
			with another lock
		then get the next thread from the read queue,
			stop scheduling
			set the owner for the lock to the thread
			if SET_PF_DONE_EVENT on thread is successful
				schedule the thread
			else	resume the current thread
		else decrease lockcount
		unlock fd's mutex
		
QUESTIONS
---------
	i think a thread can sometimes be on two queues, like the sleep
	queue and another one. 


CANCELLATION
------------
MySQL has cancellation implemented from the cancellation code contributed
by Larry Streepy.

A thread calls a function which is a cancellation point. It sets a flag 
that says it is at a cancellation point. The scheduler sees this flag 
and looks to see if the thread has been cancelled. If so, it will cancel 
the thread. Otherwise, the thread will be put on a wait queue and block.  
If the thread gets cancelled while it is blocked, when the thread wakes 
up the scheduler will see that it is at a cancellation point and that 
the thread has been cancelled, so the cancel is processed. This results 
in the thread getting cancelled sooner - it happens when the thread 
wakes up instead of waiting until the next cancellation point. It also 
will get cancelled if the thread wakes up spuriously.

    functions in MySQL:
	pthread_testcancel()
		return if disabled
		return if already trying to cancel
		if cancelled, SET_PF_RUNNING_TO_CANCEL 
		pthread_exit()
	called by pthread_cancel, pthread_setcancelstate, pthread_setcanceltype

	at cancellation points:
		SET_PF_AT_CANCEL_POINT
		pthread_resched_resume(xxx_WAIT)
		CLEAR_PF_AT_CANCEL_POINT

	pthread_resched_resume()
		if not trying to cancel
			and thread is cancellable 
		(cancel is enabled, thread has been cancelled,
		cancel type is asynchronous or it's at a cancellation point).
			and thread is not dead or unalloced
			SET_PF_RUNNING_TO_CANCEL 
			pthread_sched_resume()
				if not trying to cancel 
				and thread is cancellable
					pthread_cancel_internal()
			pthread_cancel_internal()

	pthread_cancel_internal()
		lock kernel 
		if thread state is not RUNNING
			remove thread from the queue it's on
			set state to RUNNING
		SET_PF_RUNNING_TO_CANCEL
		fd_unlock_for_cancel()
		pthread_sched_resume() - lock the kernel and process signals
		pthread_exit()




NOTES:
------
	The following functions are missing their cancellation point:
	fcntl(), open(), close(), fsync(),  msync(), sleep(), and tcdrain()


THREAD SCHEDULING:
------------------

There are individual lists of threads for each priority. Threads in a
higher priority list are scheduled before threads in a lower priority
list. Within each list, where all threads have the same priority, the
threads are scheduled according to the policy.

SCHED_FIFO is a first-in first-out order. When a thread is preempted, it
is put back at the first of the list and continues running unless there
is a thread with a higher priority waiting. When a thread is blocked, it
is put at the end of the list. When a thread's scheduling policy or
priority is changed, it is put at the end of the new priority's list.

SCHED_RR is like SCHED_FIFO, except that each thread has a timer set so
that it won't get rescheduled indefinitely if there are other threads at
the same priority waiting. When a thread has been rescheduled and run
for PTHREAD_ITIMER_INTERVAL, the timer will expire and the thread will
go to the end of the list.

SCHED_IO is something in between the other two. It will set the timer
like RR only when there are no signals pending. Otherwise, it does not
set the timer. 

The spec says that there must be a scheduling type called SCHED_OTHER.
We added this so that is the same as SCHED_IO.

If you have comments or suggestions, email me at sdybiec@humanfactor.com