Information about the AdaIC search engine
The AdaIC search engine provides a way to search many Ada-related web sites in a
single search. Since only Ada-related sites are included, you won't get piles of
unrelated pages, and you won't have to limit your search so much that you can't
find the information you need.
Tips for using the AdaIC search engine
- Usually, you will want to put your search terms into the All of the words box,
as this returns the most relevant pages.
- If you want to look for an exact phrase, put it in double quotes (").
For example, "access discriminants" returns only pages containing that exact
phrase, while access discriminants (without quotes) in the All of the words
box returns pages that contain both words.
- If your search term includes non-alphanumeric characters, it must be quoted; regular
searches ignore most non-alphanumeric characters.
- Be sure to include the archive sites (by checking them) if older information would
be relevant. (All of the information on the archive sites dates from before 1998.)
- Restrict the sites searched only when necessary. The categories of sites are not
exact, and many sites would fit into multiple categories, but appear in only one
category.
Matching on the AdaIC search engine
- The AdaIC search engine searches for words. However, very common words (the,
in, is, and so on) are not indexed. If your search includes only these words,
it may fail or take a very long time to complete.
- To the AdaIC search engine, a word consists of letters, numbers (including embedded
dots ('.')), and embedded ' and _ characters. The latter make it possible to search for
possessives, contractions, and Ada identifiers without false matches. Other punctuation
characters are ignored unless they appear in quoted text.
- Searches on the AdaIC search engine include related words (plurals, possessives, etc.).
Thus, it isn't necessary to include all forms of a word. A search for discriminant
will also find discriminants and discriminant's. If you don't want related
words included, quote the word or phrase.
- Case and white space do not matter for matching.
- Quoted text ("like this"), also known as a phrase, is matched exactly,
with the following exceptions:
  - White space is ignored, except that there must be some white space between
    words ("some thing" does not match "something", but does match "some   thing");
  - Case is ignored ("some thing" matches "Some THING").
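
The matching rules above can be sketched in a few lines of Python. This is only an illustration of the rules as described, not the engine's actual code (which is written in Ada 95). The stop-word list, the crude plural/possessive stripping in `stem`, and all function names are assumptions made for the example.

```python
import re

# Assumed stop-word list; the real list of unindexed common words is not given.
STOP_WORDS = {"the", "in", "is", "a", "an", "and", "of", "to"}

# A word: letters and digits, optionally joined by embedded '.', "'" or '_'.
WORD_RE = re.compile(r"[A-Za-z0-9]+(?:[._'][A-Za-z0-9]+)*")


def words(text):
    """Split text into lower-cased words; other punctuation is ignored."""
    return [w.lower() for w in WORD_RE.findall(text)]


def stem(word):
    """Crude stand-in for 'related words': drop possessive and plural endings."""
    if word.endswith("'s"):
        word = word[:-2]
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        word = word[:-1]
    return word


def index_terms(text):
    """Terms used for unquoted matching (stop words are dropped)."""
    return {stem(w) for w in words(text) if w not in STOP_WORDS}


def matches_all(query, page_text):
    """Unquoted 'All of the words' match: every query term must be present."""
    terms = index_terms(query)
    return bool(terms) and terms <= index_terms(page_text)


def matches_phrase(phrase, page_text):
    """Quoted match: exact words in order; case and amount of white space ignored."""
    parts = [re.escape(p) for p in phrase.split()]
    if not parts:
        return False
    pattern = r"(?<!\w)" + r"\s+".join(parts) + r"(?!\w)"
    return re.search(pattern, page_text, re.IGNORECASE) is not None
```

With these definitions, `matches_all("discriminant", ...)` accepts pages containing discriminants or discriminant's, while `matches_phrase("some thing", ...)` rejects a page that only contains something.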
About the AdaIC Site Search
The Ada Site Search is based on search indexes created for all of the relevant
Ada-related sites that we know about (these are the sites listed in Links).
Redundant sites have been eliminated, as have sites that use character sets
very different from Latin-1. (If you know of a relevant site not included
in Links, please send us the URL of the site so we can include it in the future.)
As of this writing, about 58,000 pages are included in all of the indexes. Text and
HTML pages are indexed; other file types are not indexed. Generally we trust the
site's web server to tell our indexer the type of a page (as do many web browsers).
That occasionally means we'll misidentify a page, in which case some HTML markup will appear.
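
As a rough illustration of trusting the server's reported type, the sketch below classifies a page from its Content-Type header alone. The function name and return values are hypothetical and are not taken from the actual indexer.

```python
def indexable_kind(content_type_header):
    """Return 'html', 'text', or None, based only on the server's Content-Type."""
    # Drop parameters such as "; charset=ISO-8859-1" and normalize case.
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    if media_type == "text/html":
        return "html"
    if media_type == "text/plain":
        return "text"
    return None  # other file types are not indexed
```

Under this scheme, an HTML page that a server reports as text/plain would be indexed as plain text, which is how stray markup can end up visible.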
The robots.txt file is the standard way for webmasters to tell search engines
which pages not to index. A few sites ask that most of the contents of the site
not be indexed. The Ada Site Search obeys these directions, so some sites have
little or no material included.
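
For readers unfamiliar with robots.txt, here is a minimal sketch of how a crawler can honor it, using Python's standard `urllib.robotparser` module. The rules, the agent name `ExampleIndexer`, and the example.com URLs are made up for illustration; the AdaIC indexer itself is an Ada program and does not use this module.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents for an example site.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())


def may_index(url, agent="ExampleIndexer"):
    """True if the site's robots.txt allows this agent to fetch the URL."""
    return parser.can_fetch(agent, url)
```

A site whose robots.txt disallows most paths would contribute few or no pages to the index.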
With the exception of a few major sites, sites have been categorized into Vendor sites,
Organization sites, Source Code Library sites, and Other sites. Many sites could fit
into multiple categories. However, each site appears only in one category. Thus,
the categories should be used only as a broad guide. For instance, source code appears
on many vendor and organization sites, as well as source code library sites. So we
recommend searching all of the sites unless too many results are returned.
Since many vendors sell products for many programming languages, vendor sites are
pruned to pages mentioning Ada or known Ada-specific products. Other types of sites
are not pruned unless they contain substantial non-Ada material. For large vendors,
pruning eliminates a large amount of irrelevant material, but also might lose some
valuable material. Our experiments show that more than 90% of the relevant prose
pages contain a form of the word Ada; most relevant pages that don't contain
the word Ada are Ada source code.
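
The pruning rule for vendor sites might look something like the following sketch. The pattern used for "a form of the word Ada" and the product list are assumptions; the real criteria are not spelled out here.

```python
import re

# Assumed forms: "Ada", "Ada95", "Ada83" as whole words, in any case.
ADA_WORD = re.compile(r"\bada(?:\d\d)?\b", re.IGNORECASE)

# Hypothetical list of known Ada-specific product names.
KNOWN_ADA_PRODUCTS = ("ExampleTool Pro",)


def keep_vendor_page(text, known_products=KNOWN_ADA_PRODUCTS):
    """Keep a vendor page if it mentions Ada or a known Ada-specific product."""
    if ADA_WORD.search(text):
        return True
    lowered = text.lower()
    return any(name.lower() in lowered for name in known_products)
```

Because the pattern requires a whole word, a page mentioning only "Canada" is not kept, but a relevant page that never names Ada (such as bare source code) would be lost, as noted above.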
Pages matching the criteria given are primarily scored based on the number of matches
for each word. Bonus scores are given to pages which are relatively new. Bonus scores
also are given based on the site from which the page comes. Sites which are very trusted
(such as AdaIC and AdaPower) are given the largest bonuses, while archival sites
are given the smallest bonuses. ARA member sites are given larger bonuses than
other vendors (thus giving ARA members better positioning in search results), but
results from all vendors are returned.
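
A minimal sketch of this kind of scoring follows. Only the ordering of the site bonuses comes from the description above; the numeric weights, the one-year recency window, and the class names are invented for the example.

```python
from datetime import date

# Hypothetical bonus values; only their relative order follows the text.
SITE_BONUS = {
    "trusted": 30,     # e.g. AdaIC, AdaPower
    "ara_member": 20,  # ARA member vendors
    "vendor": 10,      # other vendors
    "archive": 2,      # archival sites
}
RECENT_DAYS = 365      # assumed meaning of "relatively new"
RECENCY_BONUS = 10     # assumed value


def score_page(match_counts, site_class, last_modified, today):
    """Score = number of matches per word + recency bonus + site bonus."""
    base = sum(match_counts.values())
    is_recent = (today - last_modified).days <= RECENT_DAYS
    recency = RECENCY_BONUS if is_recent else 0
    return base + recency + SITE_BONUS[site_class]
```

For the same match counts and page age, a page from a trusted site outranks one from an ARA member, which in turn outranks other vendors and archive sites.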
Once a set of matching pages is determined, redundant pages are removed from the
set. This is done at lookup (rather than from the individual indexes) because many
separate sites may have the same pages posted. For instance, many sites have the
Ada Reference Manual and the Ada 95 Rationale posted. By removing redundant copies
of these pages, more relevant results can be shown. Redundant page removal is done
by comparing word counts of the pages. It's possible, but very unlikely, for two
significantly different pages to have the same word counts; so there is a very small
chance of a non-redundant page being removed by this check.
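
One way to picture redundant page removal is the sketch below. It assumes that "word counts" means the number of occurrences of each distinct word on a page; the actual comparison may differ, and the function names are hypothetical.

```python
import re
from collections import Counter


def word_count_signature(text):
    """Summarize a page by how many times each word occurs (order ignored)."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    return frozenset(counts.items())


def remove_redundant(pages):
    """pages: list of (url, text) in ranked order; keep the first of each signature."""
    seen = set()
    unique = []
    for url, text in pages:
        signature = word_count_signature(text)
        if signature not in seen:
            seen.add(signature)
            unique.append((url, text))
    return unique
```

The false-positive risk is easy to see in this form: two pages containing the same words in a different order ("man bites dog" and "dog bites man") have identical signatures, although real pages of any length rarely collide this way.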
The result page links are encoded versions of the real links; clicking on one will
take you immediately to the correct page. This allows us to record which pages
you found most relevant for particular search words. In the future, we'll use
that information to improve the scoring of our result pages. If the indexes have
been updated since the search was performed, these links may not work properly.
Thus, saving results pages is not recommended. However, the actual URL is
always given in the results; you can paste it directly into your browser if
necessary.
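
The actual link encoding is not documented here. Purely as an illustration of why such links can record clicks and can go stale after an index update, the sketch below encodes an index version and a page number into a link. The URL layout, parameter names, and data structures are all hypothetical.

```python
from urllib.parse import urlencode, urlparse, parse_qs


def encode_result_link(base, index_version, page_id, query):
    """Build a hypothetical click-through link for one search result."""
    return base + "?" + urlencode({"v": index_version, "p": page_id, "q": query})


def resolve_result_link(link, current_version, pages, click_log):
    """Record the click and return the real URL, or None if the link is stale."""
    params = parse_qs(urlparse(link).query)
    version = int(params["v"][0])
    page_id = int(params["p"][0])
    if version != current_version:
        return None  # the indexes were rebuilt; page numbers may have changed
    click_log.append((params["q"][0], page_id))
    return pages[page_id]
```

In a design like this, a saved results page stops working once the index version changes, which is why the plain URL shown alongside each result is the safer thing to keep.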
About the AdaIC search engine
The AdaIC search engine was created between December 2002 and March 2003 by Tom Moran
and Randy Brukardt for use on the AdaIC web server. The programs are all written
in Ada 95, and were primarily created out of existing programs and components. The
indexer was based on the link checker web crawler Finder. Most of the new
code was devoted to the Words-Files index, the page storage and abstracting, and
page scoring.
The lookup program is called directly from the Ada Server, and streams the output
directly to that server's HTTP file transfer stream. Both Finder and Ada Server are
built on top of Claw Sockets (and thus are Windows programs), although neither does
much that is Windows-specific. When Claw Sockets is ported to other operating systems, both
programs should port easily.
Ada Server is a medium-performance, reliable web server. It is written in Ada, and
interfaces only to the operating system (there is no foreign language code). Because
of this, programming errors generally raise exceptions; these are logged and the
task in question is reset. Only the request with the problem fails; all others
continue normally. Reliability and security are enhanced further by the exclusion
of foreign code of unknown reliability. Since the server is created as a single program,
attacks which try to trick the server into running another program must fail (it never
runs another program, so it cannot be fooled into running the wrong program). Of
course, the server could be compromised by an attack on
the underlying operating system or another server running on the computer. However,
the primary aim of computer security is to make the system secure enough so that
potential attackers look for easier systems to attack -- essentially, you just
have to be more secure than your neighbor.
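
The per-request failure isolation described above can be pictured with a small sketch. It is written in Python rather than Ada and uses a simple loop in place of Ada tasks, so it shows only the idea: an exception in one request is logged and the remaining requests are still served. The function names are hypothetical.

```python
import logging


def serve_requests(requests, handle):
    """Handle each request; a failure is logged and affects only that request."""
    results = []
    for request in requests:
        try:
            results.append(handle(request))
        except Exception:
            # Log the error and carry on with the next request.
            logging.exception("request %r failed; continuing", request)
            results.append(None)
    return results
```

For example, `serve_requests([1, 0, 2], lambda n: 10 // n)` logs the division error for the second request and still returns results for the first and third.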