Corporate intranets can contain an almost
unimaginable amount of information. Departments,
divisions, and individuals create a wide variety
of Web pages, both for internal and external
consumption. Human resource information,
personnel handbooks, procedures manuals, and
newsletters are all posted internally.
Databases, both those hosted directly on the
intranet and "legacy" databases on non-TCP/IP
systems, are available. Add to that all the
information that can be obtained from the Internet
through the World Wide Web, and you have a serious
case of information overload.
There are several ways to help
intranet users find the information they need.
One way is to create subject directories of
intranet data that present a highly structured
way to find information. They let you browse
through information by categories and
subcategories, such as marketing, personnel,
sales, research and development, budget,
competitors, and so on. In a Web browser, you
click on a category, and you are then presented
with a series of subcategories, such as East
Coast Sales, South Sales, Midwest Sales, and
West Sales. Depending on the size of the subject
directory, there may be several such layers of
subcategories. At some point, when you get to
the subcategory you're interested in, you'll be
presented with a list of relevant documents. To
get those documents, you click on links to them.
On the Internet, Yahoo is the best-known,
largest, and most popular subject directory.
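The browsing behavior described above can be sketched as a walk through a nested mapping. This is only an illustration: the `DIRECTORY` contents, the `intranet.example` URLs, and the `browse` function are hypothetical names, not part of any particular product.

```python
# Hypothetical subject directory: each category maps either to
# subcategories (a dict) or to a list of document links (a list).
DIRECTORY = {
    "Sales": {
        "East Coast Sales": ["http://intranet.example/sales/east/q1.html"],
        "Midwest Sales": ["http://intranet.example/sales/midwest/q1.html"],
    },
    "Personnel": {
        "Handbooks": ["http://intranet.example/hr/handbook.html"],
    },
}


def browse(directory, path):
    """Follow a sequence of category names, one 'click' per name.

    Returns the sorted subcategory names if the path ends on a category,
    or the list of document links if it ends on a leaf.
    """
    node = directory
    for name in path:
        node = node[name]
    if isinstance(node, dict):
        return sorted(node)
    return list(node)
```

Calling `browse(DIRECTORY, ["Sales"])` lists the sales subcategories, and adding `"East Coast Sales"` to the path returns the links to the documents themselves.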
Another popular way of finding
information, and in the long run probably the
more useful one for intranets, is to use search
engines, also called search tools. Search engines
operate differently from subject directories. They are
essentially massive databases that index all the
information found on the intranet, and they can
include information found on the Internet as
well. Search engines don't present information
in a hierarchical fashion. Instead, you search
through them as you would a database, by typing
in keywords that describe the information you
want.
Intranet search engines are
usually built out of three components: an
agent, spider, or crawler
that crawls across the intranet gathering
information; a database, which contains
all the information the spiders gather; and a
search tool, which people use as an
interface to search through the database. The
technology is similar to that of Internet search
engines such as Alta Vista.
Intranet search tools differ
somewhat from their Internet equivalents. The
database of information they search is built
not only by agents and spiders searching
Web pages. Agents can also be written to
go into existing corporate databases, extract
data from them, and put that data into the database
of searchable information. And people on an
intranet can fill out forms and submit their
information into the database as well.
Additionally, since they are built for a
specific corporation and its data, the
information they gather and the way they are
searched can be customized.
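As a rough illustration of an agent that pulls records out of an existing corporate database, the sketch below reads rows from a table and turns each one into an identifier plus a block of searchable text. SQLite stands in for the legacy system purely as an assumption for the example; the `extract_records` function and the table and column names are hypothetical.

```python
import sqlite3


def extract_records(conn, table, id_column, text_columns):
    """Turn each row of a table into a (doc_id, text) pair for indexing.

    The table and column names are assumed to come from trusted
    configuration, since identifiers cannot be passed as SQL parameters.
    """
    columns = ", ".join([id_column] + list(text_columns))
    rows = conn.execute(f"SELECT {columns} FROM {table}")
    records = []
    for row in rows:
        doc_id = f"{table}:{row[0]}"
        # Join the non-empty text fields into one searchable string.
        text = " ".join(str(value) for value in row[1:] if value is not None)
        records.append((doc_id, text))
    return records
```

The resulting pairs could then be handed to whatever indexing step the search engine uses for crawled Web pages, so that both kinds of content end up in the same searchable database.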
Searching and cataloging
tools, sometimes called search engines, can be
used to help people find the information they
need. Intranet search tools, such as agents,
spiders, crawlers, and robots, are used to
gather information about the documents available
on an intranet. These search tools are programs
that search Web pages, extract the hypertext
links on those pages, and automatically index
the information they find to build a database.
Each search engine has its own set of rules
guiding how documents are gathered. Some follow
every link on every page that they find, and
then in turn examine every link on each of those
new pages, and so on. Some ignore links
that lead to graphics files, sound files, and
animation files; some ignore links to certain
resources such as WAIS databases; and some are
instructed to look primarily for the most
popular home pages.
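The gathering rules just described (extract the links on a page, follow them in turn, skip links to media files) might look something like the following minimal sketch. It is not how any specific search engine works: the `crawl` function, the `fetch` callback, the depth limit, and the list of skipped extensions are all assumptions made for illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import posixpath

# Assumed rule: ignore links to graphics, sound, and animation files.
SKIPPED_EXTENSIONS = {".gif", ".jpg", ".au", ".wav", ".mov"}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return the absolute URLs of all hypertext links in the page."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]


def is_skipped(url):
    """True if the URL path ends in one of the ignored file extensions."""
    extension = posixpath.splitext(urlparse(url).path)[1].lower()
    return extension in SKIPPED_EXTENSIONS


def crawl(start_url, fetch, max_depth=2):
    """Breadth-first crawl; fetch(url) returns HTML text or None.

    The 'seen' set keeps the crawler from revisiting pages and getting
    stuck in loops; max_depth limits how many link layers are followed.
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        if depth == max_depth:
            continue
        for link in extract_links(url, html):
            if link not in seen and not is_skipped(link):
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

Passing the page fetcher in as a function keeps the sketch independent of the network: for experimentation, `fetch` can simply be the `get` method of a dictionary that maps URLs to HTML strings.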
- Agents are the "smartest"
of the tools. They can do more than just
search out records: They can perform
transactions on your behalf. Eventually that
could include finding and ordering the lowest-fare
airline ticket for your vacation. Right now
they can search sites for particular
recordings and return a list of five sites,
sorted by the lowest price first. Agents can
cope with the context of the content. Agents
can find and index other kinds of intranet
resources, not just Web pages. They can also
be programmed to extract records from legacy
databases. Whatever information the agents
index, they send back to the search engine's
database.
- General searchers are
commonly known as spiders. Spiders report the
content found. They index the information they
find and extract summary information. They
look at headers and at some of the links and
send an index of the information to the search
engine's database. There is some overlap
between the tools; spiders can be robots, for
example.
- Crawlers look at headers
and report first layer links only. Crawlers
can be spiders.
- Robots can be programmed to
go to various link depths, compile the index,
and even test the links. Because of their
nature, they can get stuck in loops, and they
take considerable Web resources going through
the system. There are methods available, such as
a robots.txt exclusion file, to prevent robots
from searching your site.
- Agents extract and index
different kinds of information. Some, for
example, index every single word in each
document, while others index only the most
important 100 words in each; some index the
size of the document and number of words in
it; some index the title, headings and
subheadings, and so on. The kind of index
built will determine what kind of searching
can be done with the search engine, and how
the information will be displayed.
- Agents can also go out to
the Internet and find information there to put
in the search engine's database. Intranet
administrators can decide which sites or kinds
of sites the agents should visit and index, for
example, competitors of the corporation or
news sources. The information is indexed and
sent to the search engine's database in the
same way as is information found on the
intranet.
- Individuals can put
information into the index by filling out a
form about the data they want put in. That
data is then put into the database.
- When someone wants to find
information available on the intranet, they
visit a Web page and fill out a form detailing
the information they're looking for. Keywords,
dates, and other criteria can be used. The
criteria in the search form must match the
criteria used by the agents for indexing the
information they found while crawling the
intranet.
- The database is searched,
based on the information specified in the
fill-out form, and a list of matching
documents is prepared by the database. The
database then applies a ranking algorithm to
determine the order in which the list of
documents will be displayed. Ideally, the
documents most relevant to a user's query will
be placed highest on the list. Different
search engines use different ranking
algorithms. The database then tags the ranked
list of documents with HTML and returns it to
the individual requesting it. Different search
engines also choose different ways of
displaying the ranked list of documents: some
just provide URLs; some show the URL as well
as the first several sentences of the
document; and some show the title of the
document as well as the URL.
- When you click on a link to
one of the documents you're interested in,
that document is retrieved from where it
resides. The document itself is not in the
database or on the search engine site.
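The indexing, searching, ranking, and display steps described in the list above can be sketched in a few lines. This is a simplified illustration under stated assumptions: every word is indexed, ranking is by a plain count of how often the query words occur, and the `SearchIndex` class and its method names are hypothetical. Real search engines use their own indexing rules and ranking algorithms.

```python
import re
from collections import Counter, defaultdict
from html import escape


def tokenize(text):
    """Split text into lowercase words and numbers."""
    return re.findall(r"[a-z0-9]+", text.lower())


class SearchIndex:
    """Stores only words, titles, and URLs; documents stay where they are."""

    def __init__(self):
        self.postings = defaultdict(Counter)  # word -> {url: occurrences}
        self.titles = {}                      # url -> document title

    def add(self, url, title, text):
        """Index every word of a document's title and body."""
        self.titles[url] = title
        for word in tokenize(title + " " + text):
            self.postings[word][url] += 1

    def search(self, query):
        """Return URLs containing every query word, best match first."""
        words = tokenize(query)
        if not words:
            return []
        matches = None
        for word in words:
            urls = set(self.postings.get(word, ()))
            matches = urls if matches is None else matches & urls
        # Assumed ranking: total occurrences of the query words.
        scores = {url: sum(self.postings[w][url] for w in words) for url in matches}
        return sorted(scores, key=lambda url: (-scores[url], url))

    def render(self, query):
        """Tag the ranked list with HTML, showing each title as a link."""
        items = [
            f'<li><a href="{escape(url)}">{escape(self.titles[url])}</a></li>'
            for url in self.search(query)
        ]
        return "<ul>" + "".join(items) + "</ul>"
```

Because the index holds only words, titles, and URLs, clicking a link in the rendered list retrieves the document from wherever it resides, not from the index itself.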