Octavian Andrei Dragoi, Jean Elizabeth Preston
This report gives a tour of the concrete architecture of the Apache web server (release 1.3.4). The goal is to provide support for anyone who wants to modify a subsystem or add extra functionality.
The main components of the concrete architecture of the Apache server are the core and the modules.
This paper covers the details of the Apache core architecture and the essential data structures with their uses, and gives an extended insight into the concrete architecture of a module. The concurrency approach employed in Apache is also detailed.
In general, anyone who wants to add extra functionality only has to write a new module. This usually means providing one or more handlers (functions) for one of the phases of processing an HTTP request. In fact, even an important part of the Apache core has the "look and feel" of a module, although it is not a proper one (it shares information with other core sub-components).
The way a module's handlers are called is transparent to the module, and all communication with a module is done through pointers to functions. Because of this, fact extractors cannot capture the interaction between core and modules.
To extract the concrete architecture we used a variety of sources: fact extractors (Portable Bookshelf, PBS), papers on Apache, README files, and, to a large extent, analysis of relevant parts of the source code.
Keywords: Apache, concrete architecture, design, web server
Available online at:
http://www.grad.math.uwaterloo.ca/~oadragoi/CS746G/a2/caa.html
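The Apache server divides the handling of requests into separate phases. The request phases covered by the table of standard modules in this report are filename translation, user ID checking, authorization checking, access checking, type checking, fixups, content handling, and logging.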
In Apache, each phase is handled by a module or set of modules. Each module is looked at in succession, to see if it has a handler for the phase. This results in a flow of control and data that is similar to a pipeline.
[Figure: conceptual movement of request_rec and flow of control through the request phases]
The above figure illustrates the conceptual movement of the data structure request_rec and the flow of control with the broken arrows. The process starts and ends in the core, where request_rec is created and where the cleanup is done after the request has been handled.
Actually, control moves from the core to each phase and then back to the core, as is shown by the solid arrow lines. As well, request_rec is first created by the core, then passed to each phase and back to the core in turn.
Two of the utility components (ap, regex) are essentially libraries of
utility functions, used by both the core and the modules. The third one
(os) is the one that makes the Apache core and the standard modules
independent of the operating system.
The header files are grouped in the include/ directory, although in
fact they declare only the functions implemented by the Apache core
(main/ subdirectory).
It is worth noting that all Apache functions, utility functions, wrappers and
re-implementations are prefixed by ap_. This rule was introduced
in release 1.3 of Apache to avoid the name conflicts that were possible with
the unprefixed names of the 1.2 release. However, there are header
files that map the old names to the new ones, so modules written for the
1.2 release can be adapted easily.
main is the Apache core, which implements the basic
functions of a web server.
modules is a component containing the different modules
that are shipped with the Apache distribution. This includes a set of standard
modules that extend and complement the Apache core.
os is a component that encapsulates the functions that depend strictly
on the operating system and platform. In the source code tree the
os/ directory contains a directory for each supported platform.
Platform is used here in the sense of a software platform that offers
a common programming environment. For example, unix/ contains
the files specific to Unix platforms. For mainframes that run a
proprietary OS and use EBCDIC character encoding instead of ASCII, separate
directories are used (bs2000, tpf). OS/2 from IBM and Windows NT
are also supported (os2, win32). The "link" between the
platform-independent components and the platform-dependent os
component is the file os.h. It is included in all Apache source
code, and it declares the functions specific to that system. Such functions are,
for instance, UNIX functions not available in a Windows environment. The
implementation of such functions goes in os.c for regular
functions and in os-inline.c for functions that can be inlined. Functions and
code strictly related to a platform should also go in the directory for that
system. For example, the Windows code for storing configuration information in the
registry (a repository of configuration information used by all applications
in this operating system) is also part of the os component for
Windows (win32/).
regex is a separate component, used as a library of
general functions for manipulating regular expressions (e.g. splitting
a string into tokens in an awk fashion). It is called from the following
files of the main component: alloc.c, http_core.c, http_request.c, util.c, util_uri.c.
ap is a component that defines wrappers for
functions whose behavior is not the same across platforms (e.g. strncpy, which has
different behaviors w.r.t. the trailing '\0' in the copied string),
re-implementations of unstable library functions, and new utility functions
(e.g. formatting functions for numbers, for Internet addresses,
etc.).
support is a separate component containing shell
scripts and source code for helper programs for the Apache server
administrator, such as rotating log files to save space, manipulating password
files, and generating statistics from the log files. The files in this
component are therefore not part of the actual Apache server.
helpers contains shell scripts used as helpers by the
compile-time configuration routine for Apache.
The Apache core is implemented in the main/ directory. The designers of Apache wanted as much
of the functionality of the web server as possible to be implemented as separate
modules, and therefore there are many interactions between the sub-components of
the Apache core. The idea is that someone extending Apache should not have to
modify anything in the core. The only sub-component that might need changing in
order to extend the server is the one that implements the HTTP protocol (which
is part of the core). Although (on the good side) the HTTP protocol is a
separate sub-component of the core, it has no well-defined API.
[Figure 3: sub-components of the Apache core and their interactions]
The http_main.c file contains
the code that starts up the server (i.e. the actual main()), the main
server loop, code for managing children, and code for managing timeouts.
The main server loop waits for a TCP/IP connection
request, accepts it (i.e. establishes a TCP/IP connection), allocates a
resource pool, reads the HTTP request from the TCP/IP stream, and calls the appropriate
function in the http_request.c file to handle the HTTP request.
After the request has been processed, it frees the resource pool and,
eventually, closes the TCP/IP connection.
Figure 3 shows the interaction of this file with the other sub-components of the Apache core. This sub-component also controls the number of active child processes, through a special shared data structure (the scoreboard) that holds information on the status of each child. More on this data structure and the way Apache manages concurrency is presented in the data structures section and the section on concurrency.
The http_core.c file implements the most basic functions of processing
an HTTP request. A comment in the source file http_core.c
describes it as being "just 'barely' functional enough to serve documents,
though not terribly well".
This file could almost have been mod_core.c. In fact it defines
a module
structure just as any module does. http_core.c defines the command table
for all the standard configuration commands. It also implements handlers for
self-initialization and for some of the phases of the HTTP request cycle; the table of standard modules in this report lists it under filename translation, type checking, fixups, and content handling.
One reason why http_core.c is not a separate module is related
to the legacy configuration commands (from NCSA web server) that Apache must
implement. These commands, and in particular Options, are more
powerful than the typical Apache configuration commands in the sense that they
affect more than one module. In order to implement this kind of behavior
http_core.c must have access to some Apache core data structures
that are not accessible to ordinary modules.
One major function of http_config.c is managing configuration.
This includes parsing configuration files and invoking the appropriate commands
in the command table advertised by the modules.
Another major function of http_config.c is walking through the
linked list of modules and invoking the appropriate handlers for a phase of the
HTTP request handling cycle, when it is asked to by the http_request.c
sub-component. The rationale behind having this function here and not in
http_request.c is that, in a way, http_config.c is the
owner of the data structure that holds information on the current modules, and it does
all the bookkeeping related to this structure. It should be noted that
http_config.c does not decide when a phase is invoked; it just performs
the implicit invocation of the appropriate handlers.
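The walk over the module list can be sketched in C as follows. This is a simplified illustration and not the actual http_config.c code: every name prefixed with `sk_` or `SK_` is hypothetical, a small array of handler slots stands in for the real module structure, and the sketch assumes that the walk stops at the first handler that does not decline.

```c
#include <stddef.h>

/* Hypothetical return codes: a handler either completes the phase
 * or declines so that the next module can try. */
enum { SK_DECLINED = -1, SK_OK = 0 };

/* Hypothetical phase identifiers, a subset of the request cycle. */
enum sk_phase { SK_TRANSLATE, SK_TYPE_CHECK, SK_FIXUP, SK_NUM_PHASES };

struct sk_request {
    const char *uri;
    const char *filename;
};

typedef int (*sk_handler)(struct sk_request *r);

/* Simplified stand-in for a module structure: one optional handler
 * per phase, plus a link to the next module in the list. */
struct sk_module {
    const char *name;
    sk_handler handlers[SK_NUM_PHASES];
    struct sk_module *next;
};

/* Walk the list and call each module's handler for the given phase.
 * Modules without a handler are skipped; the walk stops at the first
 * handler that does not decline. */
int sk_run_phase(struct sk_module *head, enum sk_phase phase,
                 struct sk_request *r)
{
    struct sk_module *m;

    for (m = head; m != NULL; m = m->next) {
        sk_handler h = m->handlers[phase];
        int rc;

        if (h == NULL)
            continue;
        rc = h(r);
        if (rc != SK_DECLINED)
            return rc;
    }
    return SK_DECLINED;
}

/* Two illustrative handlers: one always declines, the other
 * "translates" every URI to a fixed file name. */
int sk_decline_all(struct sk_request *r)
{
    (void)r;
    return SK_DECLINED;
}

int sk_translate_docroot(struct sk_request *r)
{
    r->filename = "/htdocs/index.html";
    return SK_OK;
}
```

With a list of two modules, where the first registers `sk_decline_all` and the second `sk_translate_docroot` for `SK_TRANSLATE`, a call to `sk_run_phase` for that phase tries both handlers in list order and returns the result of the second. The caller decides which phase to run, which mirrors the split between http_request.c and http_config.c described above.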
The flow of handling an HTTP request is controlled by http_request.c.
An interesting feature of the http_request sub-component is the possibility
of handling sub-requests, which can be viewed as a sort of recursion in the flow
of handling an HTTP request (e.g. while handling one phase, a module can issue a
sub-request to convert an additional URI to a file name). More on sub-requests
can be found in the section on data structures.
The protocol sub-component (http_protocol.c, rfc1413.c) is called when the connection is established and closed, and at the appropriate times during the processing cycle of the HTTP request (for example when the document to be delivered must be written to the client).
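The resources sub-component (buff.c, alloc.c) and the utilities sub-component
(util*.c) support the rest of the core. The resource functions are declared in alloc.h and buff.h.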
buff.c offers functions for buffered I/O and buffered character
conversion, which replace similar library functions whose semantics
vary from platform to platform.
alloc.c implements the management of resource pools. A resource
pool is a big memory pool which is used to allocate the memory needed to process the
current request. File descriptors are also allocated on the resource pool. The
advantage of having a unique resource pool for each HTTP request is that memory
and file descriptors can be freed at once, when the processing ends. This not
only frees the module programmer from having to explicitly free each allocated
resource, but also prevents resource leakage.
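The idea of releasing everything at once can be illustrated with a minimal pool sketch. It is not the alloc.c implementation: all `sk_` names are hypothetical, each allocation is a separate `malloc` chained into a list (a real pool would carve allocations out of larger blocks), and file descriptors are left out.

```c
#include <stdlib.h>
#include <string.h>

/* Header placed in front of every allocation. The extra members are
 * never used; they only force an alignment suitable for any object. */
union sk_block {
    union sk_block *next;
    long double align_ld;
    long align_l;
    void *align_p;
};

struct sk_pool {
    union sk_block *blocks;   /* every allocation made from this pool */
};

struct sk_pool *sk_pool_create(void)
{
    struct sk_pool *p = malloc(sizeof *p);

    if (p != NULL)
        p->blocks = NULL;
    return p;
}

/* Allocate size bytes and remember the allocation in the pool. */
void *sk_palloc(struct sk_pool *p, size_t size)
{
    union sk_block *b = malloc(sizeof *b + size);

    if (b == NULL)
        return NULL;
    b->next = p->blocks;
    p->blocks = b;
    return b + 1;             /* usable memory starts after the header */
}

/* Copy a string into memory owned by the pool. */
char *sk_pstrdup(struct sk_pool *p, const char *s)
{
    size_t len = strlen(s) + 1;
    char *copy = sk_palloc(p, len);

    if (copy != NULL)
        memcpy(copy, s, len);
    return copy;
}

/* Free every allocation in one pass; returns how many were freed. */
size_t sk_pool_clear(struct sk_pool *p)
{
    size_t freed = 0;

    while (p->blocks != NULL) {
        union sk_block *next = p->blocks->next;

        free(p->blocks);
        p->blocks = next;
        freed++;
    }
    return freed;
}

void sk_pool_destroy(struct sk_pool *p)
{
    sk_pool_clear(p);
    free(p);
}
```

Code that handles a request would call `sk_palloc` and `sk_pstrdup` freely and never free individual items; a single `sk_pool_destroy` at the end of the request releases them all.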
The Apache core component also contains a number of utility functions that are not of general interest outside the component.
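The core also contains a logging sub-component (http_log.c) and a virtual hosts sub-component (http_vhost.c).
A module's handlers are invoked at the request of the http_request.c component of the Apache core.
The core knows a module through its module
structure. Such a structure is defined by each module and is accessible by the
core. The structure holds pointers to all handlers exported by the module.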
First a module must be initialized (when it is loaded). As the core has no information on the internal structure of the module, the module must export a handler for this purpose.
The idea behind Apache is to be as versatile as possible, therefore each
module can define its own commands to be used by users in configuration files.
However, the core is the one that actually reads from the configuration files,
so there must be a way to match a command with a module, or more precisely with
the handler exported by the module to execute that command. This matching is
done through the command table stored in the module structure.
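The matching of a configuration command with a handler can be sketched as a table lookup. The sketch below is an assumption-laden illustration rather than Apache's own code: the `sk_` types and functions are hypothetical, an entry carries only a name and a handler, directive names are assumed to be compared without regard to case, and a handler is assumed to return NULL on success or an error message otherwise.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical handler type: receives the module's configuration
 * record and the directive argument. */
typedef const char *(*sk_cmd_func)(void *config, const char *arg);

/* Simplified command-table entry. */
struct sk_command {
    const char *name;
    sk_cmd_func func;
};

/* Simplified module: a name, a command table terminated by an entry
 * whose name is NULL, and a link to the next module. */
struct sk_cmd_module {
    const char *name;
    const struct sk_command *cmds;
    const struct sk_cmd_module *next;
};

/* Case-insensitive comparison of two directive names. */
static int sk_same_name(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;
}

/* Search every module's command table for the directive. On success
 * the entry is returned and *owner (if not NULL) is set to the module
 * that advertised it. */
const struct sk_command *sk_find_command(const struct sk_cmd_module *head,
                                         const char *directive,
                                         const struct sk_cmd_module **owner)
{
    const struct sk_cmd_module *m;
    const struct sk_command *c;

    for (m = head; m != NULL; m = m->next) {
        if (m->cmds == NULL)
            continue;
        for (c = m->cmds; c->name != NULL; c++) {
            if (sk_same_name(c->name, directive)) {
                if (owner != NULL)
                    *owner = m;
                return c;
            }
        }
    }
    return NULL;
}

/* Illustrative handler: stores the argument in the configuration. */
const char *sk_set_greeting(void *config, const char *arg)
{
    if (arg == NULL || *arg == '\0')
        return "Greeting requires an argument";
    *(const char **)config = arg;
    return NULL;
}

const struct sk_command sk_example_cmds[] = {
    { "Greeting", sk_set_greeting },
    { NULL, NULL }
};
```

The reader of the configuration file only needs `sk_find_command`; it never has to know which module owns a directive or how the module stores the value.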
The purpose of a module, as has been said, is to add new functionality to
the way the web server services HTTP requests. However, the flow of executing
a request is controlled by the core (http_request.c). A module is
not required to export handlers for all the distinct phases of the HTTP request
cycle. Figure 4 shows a module that implements handlers for all of the phases, but in
fact most modules will implement only one or a few of them.
A special case of handler is the one for the phase that actually delivers the
object to the client. Those handlers are called content handlers or
response handlers. They are special because a module often implements
several content handlers, one for each type of object the module knows how to
deliver (e.g. a module might know how to deliver disk directories, but might
also know how to deliver lists of people formatted in the same way as
directories). In order to discover which content handler must be called for the
current request, the core uses the content handler table in the module's
module structure. It consists of pairs of a content type (i.e.
MIME type) and a pointer to the handler. Of course, a module can export the same
content handler for more than one MIME type.
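A content handler table of this kind can be sketched as an array of (content type, handler) pairs. The names below are hypothetical, the two MIME types are made up for the illustration, and the exact-match comparison is an assumption of the sketch.

```c
#include <stddef.h>
#include <string.h>

struct sk_resp_request {
    const char *content_type;
    const char *body;
};

typedef int (*sk_content_func)(struct sk_resp_request *r);

/* One entry of the table: a MIME type and the handler that knows how
 * to deliver objects of that type. */
struct sk_handler_entry {
    const char *content_type;
    sk_content_func handler;
};

/* Illustrative handler that produces a listing; returns 0 on success. */
int sk_send_listing(struct sk_resp_request *r)
{
    r->body = "<ul><li>entry</li></ul>";
    return 0;
}

/* The same handler is exported for two (made-up) MIME types. The
 * table ends with an entry whose content type is NULL. */
const struct sk_handler_entry sk_listing_handlers[] = {
    { "application/x-sketch-directory", sk_send_listing },
    { "application/x-sketch-people",    sk_send_listing },
    { NULL, NULL }
};

/* Return the handler registered for the content type, or NULL. */
sk_content_func sk_find_content_handler(const struct sk_handler_entry *table,
                                        const char *content_type)
{
    const struct sk_handler_entry *e;

    if (content_type == NULL)
        return NULL;
    for (e = table; e->content_type != NULL; e++) {
        if (strcmp(e->content_type, content_type) == 0)
            return e->handler;
    }
    return NULL;
}
```

A lookup for either of the two registered types yields `sk_send_listing`, while an unregistered type yields NULL, so that another module's table can be consulted.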
An important element to take into account when writing a module is that, although handlers are invoked implicitly (or might not be invoked at all if handlers of other modules for the same phase inform the core that they have completed that phase), the different handlers of a module can communicate through data structures private to the module (usually static ones).
[Figure 4: generalized picture of a module]
The simplest module is one that does not provide (i.e. does not need) any initialization or configuration handlers, does not define any custom configuration commands, and implements only one handler (usually the content handler). A complex module (like the one in Figure 4) will implement all or most of the possible handlers.
The modules shipped with Apache are located in the modules directory. The standard modules, grouped in
the standard subdirectory, are essential for the functioning of Apache.
An extension of Apache (in the form of a module) enables it to work as a web
proxy. Additionally, the source contains a demo module that provides every
handler a module can implement. It is well commented and is located in
the example/ subdirectory.
When the configuration scripts are run, a special .c file,
modules.c, is automatically generated in the root of the source
code. modules.c defines two arrays of pointers to module
structures: ap_prelinked_modules[], for the modules that are
linked with the core, and ap_preloaded_modules[], for those
that will be preloaded.
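The shape of such a generated file can be sketched as follows. Only the two array names come from the text above; the reduced `module` type, the `sketch_` module variables, the NULL terminator, and the helper `sk_count_modules` are assumptions made for the illustration.

```c
#include <stddef.h>

/* Stand-in for the real module structure, reduced to a name. */
typedef struct sk_module_struct {
    const char *name;
} module;

/* Hypothetical modules; in the real server each one is defined in
 * its own source file. */
module sketch_core_module  = { "sketch_core" };
module sketch_alias_module = { "sketch_alias" };

/* Modules linked with the core (NULL-terminated in this sketch). */
module *ap_prelinked_modules[] = {
    &sketch_core_module,
    &sketch_alias_module,
    NULL
};

/* Modules that will be preloaded (NULL-terminated in this sketch). */
module *ap_preloaded_modules[] = {
    &sketch_core_module,
    &sketch_alias_module,
    NULL
};

/* Count the entries of a NULL-terminated module list. */
size_t sk_count_modules(module *const *list)
{
    size_t n = 0;

    while (list[n] != NULL)
        n++;
    return n;
}
```

Because the arrays hold only pointers, the core can iterate over them without knowing anything about the internals of the modules they refer to.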
It is interesting that even mod_core, which is the module
structure defined by the http_core.c sub-component, is listed in the
arrays generated in modules.c. The following list gives an overview of some of
the standard modules. mod_core (the pseudo-module defined in the Apache core,
sub-component http_core.c) is also included.
- mod_userdir: translates user home directories into actual paths
- mod_rewrite: rewrites URLs based on regular expressions; it has additional handlers for fix-ups and for determining the MIME type
- mod_auth, mod_auth_anon, mod_auth_db, mod_auth_dbm: user authentication using, respectively, text files, anonymous FTP-style access, Berkeley DB files, and DBM files
- mod_access: host-based access control
- mod_mime: determines document types using file extensions
- mod_mime_magic: determines document types using "magic numbers" (e.g. all GIF files start with a certain code)
- mod_alias: replaces aliases by the actual path
- mod_env: fixes up the environment (based on information in configuration files)
- mod_speling: automatically corrects minor typos in URLs
- mod_actions: file type/method-based script execution
- mod_asis: sends the file as it is
- mod_autoindex: sends an automatically generated representation of a directory listing
- mod_cgi: invokes CGI scripts and returns the result
- mod_include: handles server-side includes (documents parsed by the server, which includes certain additional data before handing the document to the client)
- mod_dir: basic directory handling
- mod_imap: handles image-map files
- mod_log_*: various types of logging modules
A summary of what each standard module has to offer (i.e. which phases
it handles) is given in the following table. Note that a module can define handlers for
more than one phase. Again, mod_core has been included.
| No | Phase | Modules |
|---|---|---|
| 2. | filename translation | mod_alias.c mod_userdir.c mod_core |
| 3. | check_user_id | mod_auth.c mod_auth_anon.c mod_auth_db.c mod_auth_dbm.c mod_digest.c |
| 4. | check auth | mod_auth.c mod_auth_anon.c mod_auth_db.c mod_auth_dbm.c mod_digest.c |
| 5. | check access | mod_access |
| 6. | type_checker | mod_mime.c mod_mime_magic.c mod_negotiation.c mod_core |
| 7. | fixups | mod_alias.c mod_cern_meta.c mod_env.c mod_expires.c mod_headers.c mod_negotiation.c mod_speling.c mod_usertrack.c mod_core |
| 8. | content handlers | mod_actions.c mod_asis.c mod_autoindex.c mod_cgi.c mod_dir.c mod_imap.c mod_include.c mod_info.c mod_negotiation.c mod_status.c mod_core |
| 9. | logger | mod_log_agent.c mod_log_config.c mod_log_referer.c |
The table shows that the functionality of mod_core is indeed minimal. One can also observe
that a module tends to provide handlers for related phases (e.g. authorization
and authentication, or MIME type check and content handlers), although this type
of behavior is not a requirement.
mod_proxy.h and mod_proxy.c are in a sense the main files
of this module, since they define the module structure. The handlers
implemented by this module are listed in that structure.
proxy_cache.c manages the cache implemented by the
proxy module. The cache is an example of a private data structure that survives
between implicit invocations of the handlers of the proxy module.
proxy_connect.c implements the code that connects this
server to a web server. The proxy acts as a client for the web server and as
a server for the HTTP client.
proxy_ftp.c and proxy_http.c implement utility routines
specific to the FTP and HTTP protocols.
proxy_util.c implements various routines that mainly
deal with matching symbolic host names, host Internet addresses, etc.
Each request that the server receives is actually handled by a copy of the httpd program. Rather than creating a new copy when it is needed and killing it when a request is finished, Apache by default maintains at least 5 and at most 10 idle children at any given time. The parent process runs a periodic check on a structure called the scoreboard, which keeps track of all existing server processes and their status. If the scoreboard lists fewer than the minimum number of idle servers, then the parent will spawn more. If the scoreboard lists more than the maximum number of idle servers, which is by default 10, then the parent will proceed to kill off the extra children.
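Incoming requests are not handed out by the parent process. Each idle child waits for a connection itself (its scoreboard status is SERVER_READY, described in scoreboard.h as "Waiting for connection (or accept() lock)"), accepts it, and handles the requests that arrive on it. The parent only supervises the pool of children.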
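The periodic check that the parent runs on the scoreboard can be sketched as a function that decides how many children to spawn or kill. This is a simplification under stated assumptions, not the code in http_main.c: the `sk_` names are hypothetical, a scoreboard slot is reduced to a status and a pid, starting children are counted as idle, and the whole adjustment is computed in one step.

```c
#include <stddef.h>

/* Status values as listed in scoreboard.h (subset). */
#define SERVER_DEAD       0
#define SERVER_STARTING   1
#define SERVER_READY      2
#define SERVER_BUSY_READ  3
#define SERVER_BUSY_WRITE 4

/* Reduced scoreboard slot: one per possible child. */
struct sk_slot {
    int status;
    int pid;
};

/* Returns the number of children to spawn (positive), the number to
 * kill (negative), or 0 when the idle count is within bounds.
 * hard_limit caps the total number of live children. */
int sk_idle_adjustment(const struct sk_slot *slots, size_t n,
                       int min_idle, int max_idle, int hard_limit)
{
    int idle = 0, live = 0, room, wanted;
    size_t i;

    for (i = 0; i < n; i++) {
        if (slots[i].status == SERVER_DEAD)
            continue;
        live++;
        if (slots[i].status == SERVER_STARTING ||
            slots[i].status == SERVER_READY)
            idle++;
    }

    if (idle > max_idle)
        return -(idle - max_idle);

    if (idle < min_idle) {
        /* Never exceed the free slots or the hard limit. */
        room = (int)n - live;
        if (hard_limit - live < room)
            room = hard_limit - live;
        if (room < 0)
            room = 0;
        wanted = min_idle - idle;
        return wanted < room ? wanted : room;
    }
    return 0;
}
```

With the defaults quoted above, the parent would call `sk_idle_adjustment(slots, n, 5, 10, 256)` on each check and act on the sign of the result.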
There is also a default limit of 256 on the total number of servers that can exist at one time. The authors of Apache provided this upper bound in order to keep the machine that the software is running on from being swamped by servers and crashing. The default was picked to keep the scoreboard file small enough so that it can be scanned by the processes without causing overhead concerns.
Since the number of requests that can be processed at any one time is limited by the number of processes that can exist, there is a queue provided for waiting requests. The maximum number of pending requests that can sit on the queue is 511.
Apache uses persistent connections to allow multiple requests from a client to be handled over one connection, rather than opening and closing a connection for each request. The default maximum number of requests allowed over one connection is 100. The connection is closed by a timeout.
This section describes some data structures that are central to the functioning of the Apache server.
Once a request has been read in, http_request.c is the code which handles the main line of request processing. It finds the right per-directory configuration, building it if necessary. It then calls all the module dispatch functions in the right order.
When the module handlers are called, the only argument passed to them is a request_rec. The pieces of this structure which are public to the modules allow them to learn what the request is and how it should be handled. Most of the handlers complete their part of the request cycle by changing some fields in the request_rec. But the response handlers must actually return something to the client. Sometimes these handlers need to direct the server to return some other file instead of the one that the client originally requested. This is a redirected request.
Some handlers can farm out part of their job in the form of a sub-request.
A request_rec can be part of a linked list if the request is redirected by a handler. The structure can contain pointers to the request_rec the request is redirected to and the request_rec it is redirected from. Or, if it is a sub-request, it can contain a pointer to the original request_rec.
The response handler may find that the request to be served is better handled as another type of request. If the request is to an imagemap, a type map, or a CGI script, then the actual resource the user requested is at some other URI than the one originally used. In this case, the module's handler generates a new request and hands it back to the server for processing.
The handler invokes ap_internal_redirect, which creates a new
request_rec. The chain of redirects is kept as a list of request_recs
linked by pointers. The result of the final response handler is passed back up
the chain to the one that caught the original request, and is then sent back to
the client.
The sub_request mechanism allows a response handler to look up files and URIs
without actually sending a response. This is done using the functions
ap_sub_req_lookup_file or ap_sub_req_lookup_uri. These
construct a new request_rec which is processed up to the point of the response.
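The links between request records can be illustrated with a reduced structure. The field names next, prev and main follow the pointers described above; everything prefixed with `sk_` is hypothetical, and the rule that a redirected request inherits the main pointer of the request it replaces is an assumption of this sketch.

```c
#include <stddef.h>

/* Reduced stand-in for request_rec: only the URI and the links. */
struct sk_request_rec {
    const char *uri;
    struct sk_request_rec *next;  /* request this one was redirected to */
    struct sk_request_rec *prev;  /* request this one was redirected from */
    struct sk_request_rec *main;  /* parent request, if a sub-request */
};

/* Record that `from` has been redirected to `to` with a new URI. */
void sk_link_redirect(struct sk_request_rec *from,
                      struct sk_request_rec *to,
                      const char *new_uri)
{
    to->uri = new_uri;
    to->next = NULL;
    to->prev = from;
    to->main = from->main;
    from->next = to;
}

/* Follow the redirect chain forward to the request that is served. */
struct sk_request_rec *sk_final_request(struct sk_request_rec *r)
{
    while (r->next != NULL)
        r = r->next;
    return r;
}

/* Walk back through sub-request and redirect links to the request
 * that the client originally sent. */
struct sk_request_rec *sk_original_request(struct sk_request_rec *r)
{
    for (;;) {
        if (r->main != NULL)
            r = r->main;
        else if (r->prev != NULL)
            r = r->prev;
        else
            return r;
    }
}
```

A handler that receives a sub-request of a redirected request can reach the client's original request with `sk_original_request`, which is useful, for example, when it wants to log or inspect the URI that was actually sent.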
Here is a partial list of the fields that are contained in a request_rec.
Rather than keeping track of which files are opened and where allocated memory is, and then explicitly tracking it all down to deallocate it, Apache uses the idea of resource pools. A resource pool is a data structure which keeps track of all allocations of finite resources that are associated with a request. When the request cycle is finished, all the resources held in the pool are released at one time.
This provides the advantages of garbage collection without the extensive code, and small amounts of space can be allocated without adding large amounts of record keeping.
One disadvantage of this method is that resources that are not being actively used cannot be released until the pool is cleared. This can create problems, especially with memory. So the modules can establish private resource pools that they can clear or destroy as they want.
The core's command table is held in http_core.c (Thau, pg 3).
Each module may have its own command table, which allows it to handle commands read from configuration files. The entries for each command listed in the table are:
The scoreboard structure is used to keep track of the child processes. The information is kept brief, basically just the status value and the pid, the process id number. The creators of Apache have plans to add a separate set of longer score structures that will give the number of requests serviced, and data on the current or most recent request.
Each time a parent process spawns a child, a record is created for the child in the scoreboard. When a child is killed, its record is removed from the scoreboard. The status value of a process is written to the scoreboard by the process itself. The parent process uses the status value of each child to determine if new children need to be created, or if there are too many idle processes. The status values defined in scoreboard.h are the following:
    SERVER_DEAD            0
    SERVER_STARTING        1  /* Server Starting up */
    SERVER_READY           2  /* Waiting for connection (or accept() lock) */
    SERVER_BUSY_READ       3  /* Reading a client request */
    SERVER_BUSY_WRITE      4  /* Processing a client request */
    SERVER_BUSY_KEEPALIVE  5  /* Waiting for more requests via keepalive */
    SERVER_BUSY_LOG        6  /* Logging the request */
    SERVER_BUSY_DNS        7  /* Looking up a hostname */
    SERVER_GRACEFUL        8  /* server is gracefully finishing request */
    SERVER_NUM_STATUS      9  /* number of status settings */
The specific contents of a module are determined by the type of function the module performs. Figure 4 shows a generalized picture of a module.
One of the main difficulties in using PBS resides in the characteristics of the Apache
source code, which defines a large number of macros, not only for data
structures but also for procedures, their parameters and their return values.
This misled PBS in many cases into showing the .h
files as suppliers/users, when the actual suppliers/users were in fact the .c files
(through macros).
As an example, Figure 7 shows the result for the Apache core structure. The content of the
sub-components is hidden in order to increase clarity. The arrows that point
down (at the bottom of the picture) go to the utilities components described
earlier (ap, regex, os).
The following table shows what has been grouped under each sub-system (file.* means both file.c and file.h, prefix*.h means all files with that prefix).
| Sub-component | Files |
|---|---|
| http_main | http_main.*, httpd.h |
| protocol | http_protocol.*, rfc1413.* |
| http_request | http_request.* |
| http_core | http_core.c, http_core.h |
| http_config | http_config.*, http_conf_globals.h, ap_*.h |
| resources | buff.*, alloc.* |
| util | util*.* |
| http_log | http_log.* |
| http_vhosts | http_vhost.* |
Another major difficulty in using PBS, or any other automatic fact extractor
is that there is no way to extract relations between the Apache core and the
modules. All the calls and references are done through pointers to functions or
data structures. These kinds of interactions are difficult, if not impossible,
to extract at compile time. Even most of the interactions between the
http_core.c and the rest of the sub-components of the Apache core
are done through the same mechanism (see the section on
http_core.c).
[Figure 7: structure of the Apache core as extracted with PBS]
It should be noted that the suppliers and users of http_config
have not been drawn. That is because nearly everybody references this module
since many .h files have been included in it. The above figure is more a
validation of the concrete architecture depicted in Figure
3, rather than a source for it.
This report has offered a tour of the concrete architecture of the Apache web server (release 1.3.4). The modular architecture of the server seems to offer great opportunity for extending the code. Designers of Apache strove to move as much of the functionality as possible into the modules. Therefore, modules must implement a well defined API.
Communication between the core and the external modules is done through the modules' handler functions. The module handlers are invoked to perform certain phases of processing a request. Handlers receive a reference to the request_rec, which contains the information about the request and the resources the handlers need.
We did not observe the same independence and well defined API between the
components of the core. Of course there are some clear utility components that
offer services to the other components of the core and to the modules, but the
important parts of the core are tightly linked together. One example of this
inter-dependence is the http_request.c which controls the flow of
processing a request, the http_config.c component which performs
the actual invocation of handlers, and the http_protocol.c which
communicates with the HTTP client. This linking of the core components makes it
somewhat more difficult to change the behavior of Apache by modifying the core.
Fortunately, modules can do the same jobs as well, if not better, and they are
usually easier to write.
Since the method of calls to its handlers is transparent to the module and all communication with a module is done through pointers to functions, we have found that fact extractors do not capture the interaction between core and modules.