
The formation of the WWW

12 years 2 months ago #37694 by sose
this is a very interesting read I yanked out of a writeup

SPIDER WEBS, BOW TIES, SCALE-FREE NETWORKS, AND THE DEEP WEB

The World Wide Web conjures up images of a giant spider web where everything is connected to everything else in a random pattern, and you can go from one edge of the web to another just by following the right links. Theoretically, that is what makes the Web different from a typical index system: you can follow hyperlinks from one page to another. In the “small world” theory of the Web, every Web page is thought to be separated from any other Web page by an average of about 19 clicks. In 1968, sociologist Stanley Milgram pioneered small-world theory for social networks by observing that every human was separated from any other human by only six degrees of separation. On the Web, the small-world theory was supported by early research on a small sampling of Web sites. But more recent research conducted jointly by scientists at IBM, Compaq, and AltaVista found something entirely different. These scientists used AltaVista’s Web crawler “Scooter” to identify 200 million Web pages and follow 1.5 billion links on those pages.
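To make the “19 clicks” figure concrete: it is simply the average length of the shortest link path between pairs of pages. Below is a minimal sketch of how such an average can be estimated with breadth-first search over a toy link graph; the page names and links are made up purely for illustration.

```python
from collections import deque

def shortest_path_lengths(graph, source):
    """BFS from `source`; returns hop counts to every reachable page."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        page = queue.popleft()
        for neighbor in graph.get(page, []):
            if neighbor not in dist:
                dist[neighbor] = dist[page] + 1
                queue.append(neighbor)
    return dist

def average_click_distance(graph):
    """Average directed shortest-path length over all reachable ordered pairs."""
    total, pairs = 0, 0
    for source in graph:
        for target, hops in shortest_path_lengths(graph, source).items():
            if target != source:
                total += hops
                pairs += 1
    return total / pairs if pairs else float("inf")

# Hypothetical miniature link graph, for illustration only.
links = {
    "home": ["news", "about"],
    "news": ["story", "home"],
    "about": ["home"],
    "story": ["archive"],
    "archive": [],
}
print(average_click_distance(links))
```

On the real Web the same idea is applied to sampled pairs of crawled pages rather than to every pair, since an all-pairs computation over hundreds of millions of pages is impractical.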
The researchers discovered that the Web was not like a spider web at all, but rather like a bow tie. The bow-tie Web had a “strongly connected component” (SCC) composed of about 56 million Web pages. On the right side of the bow tie was a set of 44 million OUT pages that you could reach from the center but from which you could not return to the center. OUT pages tended to be corporate intranet and other Web site pages designed to keep you at the site once you land. On the left side of the bow tie was a set of 44 million IN pages from which you could get to the center, but that you could not travel to from the center. These were often recently created “newbie” pages that had not yet been linked to by many center pages. In addition, 43 million pages were classified as “tendrils”: pages that did not link to the center and could not be reached from the center. However, the tendril pages were sometimes linked to IN and/or OUT pages, and occasionally tendrils linked to one another without passing through the center (these are called “tubes”). Finally, there were 16 million pages totally disconnected from everything.
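These bow-tie categories can be computed for any directed link graph: take a strongly connected core, then sort the remaining pages by whether they can reach the core, be reached from it, or neither. The sketch below is only an illustration under simplifying assumptions: the graph is a small adjacency dict with made-up page names, the core is found as the set of pages that can both reach and be reached from a chosen seed page assumed to lie in the core, and tendrils, tubes, and disconnected pages are lumped together as “other.”

```python
from collections import deque

def reachable(adj, start):
    """All nodes reachable from `start` by following edges in `adj` (BFS)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def bow_tie(links, seed):
    """Classify pages relative to the strongly connected core containing `seed`."""
    # Build the reversed link graph so we can also walk edges backwards.
    reverse = {}
    for src, targets in links.items():
        reverse.setdefault(src, [])
        for dst in targets:
            reverse.setdefault(dst, []).append(src)

    forward = reachable(links, seed)     # pages reachable from the seed
    backward = reachable(reverse, seed)  # pages that can reach the seed
    scc = forward & backward             # the strongly connected core
    out_set = forward - scc              # reachable from the core, no way back
    in_set = backward - scc              # can reach the core, not reachable from it
    everything = set(links) | set(reverse)
    other = everything - scc - in_set - out_set  # tendrils, tubes, disconnected
    return {"SCC": scc, "IN": in_set, "OUT": out_set, "OTHER": other}

# Hypothetical miniature Web, for illustration only.
links = {
    "new-blog": ["portal"],            # IN: links into the core, nothing links back
    "portal": ["shop", "wiki"],        # core pages link to one another...
    "wiki": ["portal", "brochure"],    # ...and out to a dead-end page
    "shop": ["portal"],
    "brochure": [],                    # OUT: no links back to the core
    "island": ["islet"], "islet": [],  # disconnected pages
}
print(bow_tie(links, seed="portal"))
```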
Further evidence for the non-random and structured nature of the Web comes from research performed by Albert-László Barabási at the University of Notre Dame. Barabási’s team found that, far from being a random, exponentially exploding network of 8 billion Web pages, activity on the Web was actually highly concentrated in “very connected super nodes” that provided the connectivity to less well-connected nodes. Barabási dubbed this type of network a “scale-free” network and found parallels in the growth of cancers, disease transmission, and computer viruses. As it turns out, scale-free networks are highly vulnerable to destruction: destroy their super nodes and the transmission of messages breaks down rapidly. On the upside, if you are a marketer trying to “spread the message” about your products, place your products on one of the super nodes and watch the news spread. Or build super nodes, as Kazaa did (see the case study at the end of the chapter), and attract a huge audience.
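“Scale-free” here means that the number of links per page follows a power law, which is what emerges when new pages preferentially link to pages that are already well linked. The sketch below is a generic preferential-attachment simulation, not Barabási’s own code: it grows such a network and then compares how much of it stays connected after removing the 20 best-connected super nodes versus 20 randomly chosen nodes; removing the hubs typically shrinks the largest connected piece considerably more.

```python
import random

def grow_scale_free(n, links_per_new_node=2, seed=42):
    """Grow a graph by preferential attachment: new nodes favor high-degree nodes."""
    random.seed(seed)
    edges = [(0, 1)]             # start with one linked pair of pages
    attachment_pool = [0, 1]     # each node appears once per link it holds
    for new in range(2, n):
        targets = set()
        while len(targets) < links_per_new_node:
            targets.add(random.choice(attachment_pool))
        for t in targets:
            edges.append((new, t))
            attachment_pool.extend([new, t])
    return edges

def largest_component_size(n, edges, removed):
    """Size of the biggest connected component after deleting `removed` nodes."""
    adj = {v: set() for v in range(n) if v not in removed}
    for a, b in edges:
        if a not in removed and b not in removed:
            adj[a].add(b)
            adj[b].add(a)
    seen, best = set(), 0
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], 0
        seen.add(start)
        while stack:
            v = stack.pop()
            comp += 1
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        best = max(best, comp)
    return best

n = 2000
edges = grow_scale_free(n)
degree = {v: 0 for v in range(n)}
for a, b in edges:
    degree[a] += 1
    degree[b] += 1

hubs = set(sorted(degree, key=degree.get, reverse=True)[:20])  # the super nodes
randoms = set(random.sample(range(n), 20))                     # 20 random nodes
print("after removing 20 hubs:  ", largest_component_size(n, edges, hubs))
print("after removing 20 random:", largest_component_size(n, edges, randoms))
```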
Thus, the picture of the Web that emerges from this research is quite different from earlier reports. The notion that most pairs of Web pages are separated by a handful of links, almost always under 20, and that the number of connections would grow exponentially with the size of the Web, is not supported. In fact, there is about a 75% chance that no path exists from one randomly chosen page to another. With this knowledge, it becomes clear why even the most advanced Web search engines index only about 6 million Web sites, when the overall population of Internet hosts is over 300 million. Most Web sites cannot be found by search engines because their pages are not well connected or linked to the central core of the Web.
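The 75% figure can be checked with simple arithmetic from the component sizes reported earlier: roughly speaking, a directed path from a random page u to a random page v requires u to sit in IN or the SCC and v to sit in the SCC or OUT. The snippet below is only a back-of-the-envelope estimate that ignores paths through tendrils and tubes.

```python
# Component sizes from the AltaVista crawl, in millions of pages.
scc, in_pages, out_pages, tendrils, disconnected = 56, 44, 44, 43, 16
total = scc + in_pages + out_pages + tendrils + disconnected  # about 203 million

# A path u -> v (roughly) requires u in IN or SCC, and v in SCC or OUT.
p_source_ok = (in_pages + scc) / total
p_target_ok = (scc + out_pages) / total
p_path = p_source_ok * p_target_ok

print(f"P(path exists) = {p_path:.2f}")   # about 0.24
print(f"P(no path)     = {1 - p_path:.2f}")  # about 0.76
```

The result, roughly a 24% chance that a path exists and a 76% chance that none does, matches the figure reported by the researchers.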
Another important finding is the identification of a “deep Web” composed of over 600 billion Web pages that are not indexed at all. These pages are not easily accessible to the Web crawlers that most search engine companies use. Instead, they are either proprietary (not available to crawlers and non-subscribers, such as the pages of the Wall Street Journal) or are not easily reachable from site home pages. In the last few years, new search engines (such as the medical search engine Mamma.com) and older ones such as Yahoo! have been revised to enable them to search the deep Web. Because e-commerce revenues depend in part on customers being able to find a Web site using search engines, Web site managers need to take steps to ensure that their Web pages are part of the connected central core, or super nodes, of the Web. One way to do this is to make sure the site has as many links as possible to and from other relevant sites, especially to other sites within the SCC.




SOURCES: “Deep Web Research,” by Marcus P. Zillman, Llrx.com, July 2005; “Mamma.com Conquers Deep Web,” Mammamediasolutions.com, June 20, 2005; “Yahoo Mines the ‘Deep Web,’” by Tim Gray, Internetnews.com, June 17, 2005; Linked: The New Science of Networks, by Albert-László Barabási, Cambridge, MA: Perseus Publishing, 2002; “The Bowtie Theory Explains Link Popularity,” by John Heard, Searchengineposition.com, June 1, 2000; “Graph Structure in the Web,” by A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener, Proceedings of the 9th International World Wide Web Conference, Amsterdam, The Netherlands, pages 309–320, Elsevier Science, May 2000.