WEB BROWSERS
Most browsers have a point-and-click interface: the browser displays information on the computer’s screen and permits a user to navigate using the mouse. The information displayed includes both text and graphics. Furthermore, some of the information on the display is highlighted to indicate that an item is selectable. When the user places the cursor over a selectable item and clicks a mouse button, the browser displays new information that corresponds to the selected item.
Technically, the Web is a distributed hypermedia/hypertext system that supports interactive access. Here, information is stored as a set of documents. Besides the basic information, a document can contain pointers to other documents in the set. Each pointer is associated with a selectable item that allows a user to select the item and follow the pointer to a related document. Hypertext documents contain only textual information, while hypermedia documents can contain additional representations of information, including digitized photographic images or graphics. A hypermedia system can also be non-distributed, with all information residing within a single computer. Such a system can guarantee that all links are valid and consistent.
In contrast, the Web is a distributed hypermedia system: its documents are spread across a large set of computers. A system administrator can choose to add, remove, change, or rename a document on a computer without notifying other sites. Consequently, links among Web documents are not always consistent. For example, suppose document D1 on computer C1 contains a link to document D2 on computer C2. If the administrator responsible for computer C2 chooses to remove document D2, the link on C1 becomes invalid.
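The D1/D2 scenario can be sketched as a check over a small in-memory model. This is only an illustration: the `sites` dictionary and the `dangling_links` function are hypothetical names, and a real link checker would have to contact each remote computer.

```python
# Hypothetical model: each computer maps document names to the links they
# contain. A link is a (computer, document) pair.
sites = {
    "C1": {"D1": [("C2", "D2")]},
    "C2": {"D2": []},
}


def dangling_links(sites):
    """Return (source, target) pairs whose target document no longer exists."""
    broken = []
    for computer, documents in sites.items():
        for name, links in documents.items():
            for target in links:
                target_computer, target_doc = target
                if target_doc not in sites.get(target_computer, {}):
                    broken.append(((computer, name), target))
    return broken


print(dangling_links(sites))   # [] while D2 still exists

# The administrator of C2 removes D2 without notifying C1.
del sites["C2"]["D2"]
print(dangling_links(sites))   # the link in D1 on C1 is now invalid
```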
Most browsers have numerous buttons and features to make it easier to navigate the web. Many have a button for going back to the previous page, a button for going forward to the next page (only operative after the user has gone back from it), and a button for going straight to the user’s own home page. Most browsers have a button or menu item to set a bookmark on a given page and another one to display the list of bookmarks, making it possible to revisit any of them with a single mouse click. Pages can also be saved to disk or printed. Numerous options are generally available for controlling the screen layout and setting various user preferences.
In addition to having ordinary text (not underlined) and hypertext (underlined), Web pages can also contain icons, line drawings, maps, and photographs. Each of these can (optionally) be linked to another page. Clicking on one of these elements causes the browser to fetch the linked page and display it, the same as clicking on text. With images such as photos and maps, which page is fetched next may depend on what part of the image was clicked on.
Not all pages are viewable in the conventional way. For example, some pages consist of audio tracks, video clips, or both. When hypertext pages are mixed with other media, the result is called hypermedia. Some browsers can display all kinds of hypermedia, but others cannot. Instead they check a configuration file to see how to handle the received data. Normally, the configuration file gives the name of a program, called an external viewer or a helper application, to be run with the incoming page as input. If no viewer is configured, the browser usually asks the user to choose one. If no viewer exists, the user can tell the browser to save the incoming page to a disk file, or to discard it. Helper applications for producing speech make it possible for even blind users to access the Web. Other helper applications contain interpreters for special Web languages, making it possible to download and run programs from Web pages. This mechanism makes it possible to extend the functionality of the Web itself.
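The configuration lookup can be sketched as a simple table from the kind of data received to the program that handles it. Everything named here is an assumption made for illustration: the text does not say how the kind of data is identified, so content-type strings are used as keys, and the viewer names are invented.

```python
# Hypothetical configuration: content type -> external viewer command.
# The type names and program names are illustrative only.
HELPERS = {
    "audio/basic": "audio-player",
    "video/mpeg": "video-player",
}

# Types this imaginary browser can display without help.
BUILT_IN = {"text/html", "text/plain"}


def choose_action(content_type, helpers=HELPERS):
    """Decide what the browser does with incoming data of a given type."""
    if content_type in BUILT_IN:
        return ("display", None)
    viewer = helpers.get(content_type)
    if viewer is not None:
        return ("run_helper", viewer)
    # No viewer configured: the user is asked to pick one, save, or discard.
    return ("ask_user", None)


print(choose_action("text/html"))
print(choose_action("video/mpeg"))
print(choose_action("application/x-unknown"))
```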
Many web pages contain large images, which take a long time to load. For example, fetching an uncompressed 640 x 480 (VGA) image with 24 bits per pixel (922 KB) takes about 4 minutes over a 28.8-kbps modem line. Some browsers deal with the slow loading of images by first fetching and displaying the text, then getting the images. This strategy gives the user something to read while the images are coming in and also allows the user to kill the load if the page is not sufficiently interesting to warrant waiting. An alternative strategy is to provide an option to disable the automatic fetching and display of images.
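The figures in this example can be checked with a few lines of arithmetic. The calculation below is a rough sketch that assumes the full 28.8 kbps is available for image data and ignores any protocol or modem overhead.

```python
width, height = 640, 480       # pixels (VGA)
bits_per_pixel = 24
line_rate_bps = 28_800         # 28.8-kbps modem line

size_bits = width * height * bits_per_pixel
size_bytes = size_bits // 8
seconds = size_bits / line_rate_bps

print(size_bytes)              # 921600 bytes, about 922 KB
print(seconds)                 # 256.0 seconds
print(round(seconds / 60, 1))  # 4.3 minutes
```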
Some page writers attempt to placate potentially bored users by displaying images in a special way. First the image quickly appears in a coarse resolution. Then the details are gradually filled in. For the user, seeing the whole image after a few seconds, albeit at low resolution, is often preferable to seeing it built up slowly from the top, scan line by scan line.
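One way to get this coarse-to-fine effect is to transmit the scan lines out of order, so that widely spaced rows arrive first. The sketch below uses the four-pass row order of interlaced GIF images as an illustration; the text does not name a specific image format, so this choice is an assumption.

```python
def interlaced_row_order(height):
    """Return row indices in a four-pass, coarse-to-fine order."""
    passes = [(0, 8), (4, 8), (2, 4), (1, 2)]   # (first row, step) per pass
    order = []
    for start, step in passes:
        order.extend(range(start, height, step))
    return order


print(interlaced_row_order(16))
# [0, 8, 4, 12, 2, 6, 10, 14, 1, 3, 5, 7, 9, 11, 13, 15]
```

After the first pass the viewer already has one row in every eight, spread over the whole image, which is enough to show a rough version while the remaining rows arrive.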
Some web pages contain forms that request the user to enter information. Typical applications of these forms are searching a database for a user-supplied item, ordering a product, or participating in a public opinion survey. Other web pages contain maps that allow users to click on them to zoom in or get information about some geographical area. Handling forms and active (clickable) maps requires more sophisticated processing than just fetching a known page.
Some browsers use the local disk to cache pages that they have fetched. Before a page is fetched, a check is made to see if it is in the local cache. If so, it is only necessary to check if the page is still up to date. If it is, the page need not be loaded again. As a result, clicking on the BACK button to see the previous page is normally very fast.
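The cache check can be sketched as follows. This is a toy model, not how any particular browser works: the `PageCache` class is hypothetical, a version number stands in for whatever the browser uses to decide whether a page is up to date, and the two callables stand in for real network requests.

```python
class PageCache:
    """Toy cache mapping a URL to (version, content)."""

    def __init__(self, fetch, current_version):
        # fetch(url) -> (version, content) and current_version(url) -> version
        # stand in for real network requests.
        self._fetch = fetch
        self._current_version = current_version
        self._pages = {}
        self.full_fetches = 0

    def get(self, url):
        cached = self._pages.get(url)
        if cached is not None and cached[0] == self._current_version(url):
            return cached[1]           # still up to date: no reload needed
        self.full_fetches += 1
        version, content = self._fetch(url)
        self._pages[url] = (version, content)
        return content


# Simulated server state.
server = {"http://example.com/a.html": (1, "first version")}
cache = PageCache(fetch=lambda url: server[url],
                  current_version=lambda url: server[url][0])

cache.get("http://example.com/a.html")         # miss: fetched from the server
cache.get("http://example.com/a.html")         # hit: served from the cache
print(cache.full_fetches)                      # 1
server["http://example.com/a.html"] = (2, "second version")
print(cache.get("http://example.com/a.html"))  # stale copy is replaced
print(cache.full_fetches)                      # 2
```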
To host a web browser, a machine must be directly on the Internet, or at least have a SLIP or PPP connection to a router or other machine that is directly on the Internet. This requirement exists because the way a browser fetches a page is to establish a TCP connection to the machine where the page is, and then send a message over the connection asking for the page. If it cannot establish a TCP connection to an arbitrary machine on the Internet, a browser will not work.
The Server Side
Figure (a): The parts of the Web model.
A URL such as http://www.w3.org/hypertext/WWW/TheProject.html has three parts:
- The name of the protocol (http).
- The name of the machine where the page is located (www.w3.org).
- The name of the file containing the page (hypertext/WWW/TheProject.html).
The steps that occur between the user’s click and the page being displayed are as follows:
- The browser determines the URL (by seeing what was selected).
- The browser asks DNS for the IP address of www.w3.org.
- DNS replies with 18.23.0.23.
- The browser makes a TCP connection to port 80 on 18.23.0.23.
- It then sends a GET /hypertext/WWW/TheProject.html command.
- The www.w3.org server sends the file TheProject.html.
- The TCP connection is released.
- The browser displays all the text in TheProject.html.
- The browser fetches and displays all images in TheProject.html.
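The network steps in this list can be sketched with Python's standard `socket` module. The function names `build_request` and `fetch` are made up for this example, and the request uses the bare GET form described in the steps; servers today generally expect a protocol version and additional headers, so treat `fetch` as an outline of the sequence rather than a working client.

```python
import socket


def build_request(path):
    """Return the GET command for a page, as ASCII bytes."""
    return f"GET {path}\r\n".encode("ascii")


def fetch(host, path, port=80):
    """Resolve the host, connect, send GET, and read until the server closes."""
    address = socket.gethostbyname(host)             # ask DNS for the IP address
    with socket.create_connection((address, port)) as conn:  # TCP to port 80
        conn.sendall(build_request(path))            # send the GET command
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:                             # server has sent the file
                break
            chunks.append(data)
    return b"".join(chunks)                          # connection released above


# Only the request is built here; calling fetch() needs network access.
print(build_request("/hypertext/WWW/TheProject.html"))
```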
Many browsers display which step they are currently executing in a status line at the bottom of the screen. In this way, when the performance is poor, the user can see if it is due to DNS not responding, the server not responding, or simply network congestion during page transmission.
For each in-line image (icon, drawing, photo etc.) on a page, the browser establishes a new TCP connection to the relevant server to fetch the image. If a page contains many icons, all on the same server, establishing, using, and releasing a new connection for each one is not wildly efficient, but it keeps the implementation simple.
Because HTTP is an ASCII protocol like SMTP, it is quite easy for a person at a terminal (as opposed to a browser) to directly talk to Web servers. All that is needed is a TCP connection to port 80 on the server.
Not all servers speak HTTP. In particular, many older servers use FTP, Gopher, or other protocols. Since a great deal of useful information is available on FTP and Gopher servers, one of the design goals of the Web was to make this information available to Web users. One solution is to have the browser use these protocols when speaking to an FTP or Gopher server. Some browsers, in fact, use this solution, but making browsers understand every possible protocol makes them unnecessarily large.
Instead, a different solution is often used: proxy servers.
A proxy server is a kind of gateway that speaks HTTP to the browser but FTP, Gopher, or some other protocol to the server. It accepts HTTP requests and translates them into, say, FTP requests, so the browser does not have to understand any protocol except HTTP. The proxy server can be a program running on the same machine as the browser, but it can also be on a free-standing machine somewhere in the network serving many browsers. Figure 4 shows the difference between a browser that can speak FTP and one that uses a proxy.
Often users can configure their browsers with proxies for protocols that the browsers do not speak. In this way, the range of information sources to which the browser has access is increased.
In addition to acting as a go-between for unknown protocols, proxy servers have a number of other important functions, such as caching. A caching proxy server collects and keeps all the pages that pass through it. When a user asks for a page, the proxy server checks to see if it has the page. If so, it can check to see if the page is still current. In the event that the page is still current, it is passed to the user. Otherwise, a new copy is fetched.
Finally, an organization can put a proxy server inside its firewall to allow users to access the Web, but without giving them full Internet access. In this configuration, users can talk to the proxy server, but it is the proxy server that contacts remote sites and fetches pages on behalf of its clients. This mechanism can be used, for example, by high schools, to block access to web sites the principal feels are inappropriate for tender young minds.
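A filtering proxy of this kind can be sketched in a few lines. The blocked host name, the `proxy_request` function, and the `fake_fetch` stand-in are all hypothetical; a real proxy would accept HTTP requests from clients and make the outbound connection itself.

```python
from urllib.parse import urlsplit

# Hypothetical policy: host names the organization does not allow.
BLOCKED_HOSTS = {"blocked.example.com"}


def proxy_request(url, fetch, blocked=BLOCKED_HOSTS):
    """Fetch url on the client's behalf unless its host is blocked."""
    host = urlsplit(url).hostname
    if host in blocked:
        return "403 Forbidden by proxy policy"
    return fetch(url)              # only the proxy contacts the remote site


def fake_fetch(url):
    # Stand-in for a real outbound request made by the proxy.
    return f"contents of {url}"


print(proxy_request("http://www.example.org/index.html", fake_fetch))
print(proxy_request("http://blocked.example.com/page.html", fake_fetch))
```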
CREATING AND LOCATING INFORMATION ON THE WEB
The browser has a menu bar on top, where the user can quit, get help on using the program, and change certain display characteristics (font size, background color, etc.). Some local configuration may be required under one of the menu options. The browser may be purchased separately or may be provided by the Internet access provider.
A scroll bar allows the user to scroll through the document, forward or backward. Because there is no limit to how large or small a hypertext/hypermedia document can be, scroll bars are often needed when the document is larger than the viewing window.
Usually, the first document on the screen is a home page. This is a special document that is intended to be viewed first. It contains an introduction to the information displayed and/or a master menu of the documents contained within this collected set of topics. Home pages are generally associated with a particular site, person, or named collection. Interrelated documents are connected to one another by hyperlinks.
Typically, clicking on the word (or link) with a mouse will cause another document to appear on the screen, which may hold more images and/or hyperlinks to other places. Some browsers represent text that is linked to other things by underlining or by using special colors. Images, also known as inline images, can be displayed within a page.
Users often create their own personal documents with collections of their favorite links or biographical information and make them publicly available. Usually called home pages (they are a virtual “home” for the user), they may also be called personal pages or hyplans (hypermedia plans).
On the display screen, there is also a set of navigation buttons. A user might go to many different screens by selecting multiple hyperlinks; these buttons provide a method for retracing the user’s steps and reviewing the documents that were previously explored.
The Back button brings the user to previously viewed documents. The Forward button brings the user to the page most recently viewed prior to taking the backward steps.
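Back and Forward behavior is commonly modeled with two stacks. The `History` class below is a hypothetical sketch, not taken from any browser; in particular, discarding the forward list when a new link is followed is an assumption about typical behavior.

```python
class History:
    """Back/Forward navigation modeled as two stacks."""

    def __init__(self, home):
        self.current = home
        self._back = []
        self._forward = []

    def visit(self, page):
        self._back.append(self.current)
        self.current = page
        self._forward.clear()      # assumption: a new link discards Forward

    def back(self):
        if self._back:
            self._forward.append(self.current)
            self.current = self._back.pop()
        return self.current

    def forward(self):
        if self._forward:          # only operative after going back
            self._back.append(self.current)
            self.current = self._forward.pop()
        return self.current


h = History("home")
h.visit("A")
h.visit("B")
print(h.back())      # A
print(h.forward())   # B
print(h.forward())   # B (nothing left to go forward to)
```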
An open button allows the user to connect to other documents and networked resources by specifying the address of the desired document or resource. The user might be able to connect to a document stored locally on the machine currently being used or to one stored in another country. Such a document is normally transferred over the Internet in its entirety. Most browsers have a cache setting to allow faster access to these documents once they have been visited.
The print button allows the user to print out the document that is on the screen. The user may be given the choice of printing the document with images and formatting as seen on the screen or as a text-only document.
Typically, the person in charge of administering a World Wide Web site is listed at the bottom of a home page. Reports of problems with hyperlinks, images, or documents, as well as questions about the site, can be mailed to this Webmaster’s address.
Writing a web page in HTML
Web pages are written in a language called HTML (HyperText Markup Language). HTML allows users to produce web pages that include text, graphics, and pointers to other web pages.
The WWW distributes information and supports links to resources via Web pages. These documents can incorporate formatted text, color graphics, digitized sound, and digital video clips. Hypertext Markup Language is the language used to make these pages become whatever the user intends them to be. HTML is used to display text, graphics, sounds, movies, etc., over the Internet on the WWW. The WWW is an information system that links data from many different Internet services under one set of protocols. Web clients, called browsers or viewers, interpret Hypertext Markup Language documents delivered from Web servers. The WWW is a distributed, multimedia, hypertext system. It is distributed because information on the Web can be located on any computer system connected to the Internet around the world. It is multimedia because the information it holds can be in the form of text, graphics, sound, or even video. Hypertext means that the information is available using hypertext techniques, which involve selecting highlighted phrases or images that, once selected, retrieve information related to the selected subject. The information being retrieved can be located anywhere in the world. The normal way to provide information on the WWW is by writing documents in HTML.
HTML is designed to specify the logical organization of a document, with hypertext extensions. It achieves that goal by the use of instructions known as tags. HTML documents are plain (ASCII) text files that contain embedded HTML tags. Documents can be created in any text editor, including WYSIWYG editors in a graphical environment. There are also many other tools, including editors, designed specifically to assist in creating HTML documents. HTML defines the structural elements in a document (such as headers, citations, and addresses), layout information (bold and italics), and the use of inline graphics together with the ability to provide hypertext links. To view an HTML document, the user needs a browser. The browser interprets the instructional tags and presents the HTML document.
A proper Web page consists of a head and a body enclosed by <HTML> and </HTML> tags (formatting commands). The head is bracketed by the <HEAD> and </HEAD> tags and the body is bracketed by the <BODY> and </BODY> tags. The commands inside the tags are called directives. Most HTML tags have this format, that is, <SOMETHING> to mark the beginning of something and </SOMETHING> to mark its end. Tags can be in either lowercase or uppercase. HTML parsers ignore extra spaces and carriage returns since they have to reformat the text to make it fit the current display area.
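The page structure, the case-insensitivity of tags, and the handling of extra whitespace can be illustrated with Python's standard `html.parser` module. The sample page and the `Outline` class are invented for this sketch. The whitespace collapsing is done explicitly here to mimic what a browser does when it reformats text; it is not something the parser does on its own.

```python
from html.parser import HTMLParser

# A minimal page; tag case is mixed on purpose.
PAGE = """<HTML>
<HEAD></HEAD>
<body>
Some    text
spread over   lines.
</body>
</HTML>"""


class Outline(HTMLParser):
    """Record tags in order and collect the text with whitespace collapsed."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)            # the parser reports names in lowercase

    def handle_endtag(self, tag):
        self.tags.append("/" + tag)

    def handle_data(self, data):
        self.words.extend(data.split())  # drops extra spaces and line breaks


outline = Outline()
outline.feed(PAGE)
outline.close()
print(outline.tags)
print(" ".join(outline.words))           # Some text spread over lines.
```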
Tables can also be created in HTML, and their entries can be clicked on to activate hyperlinks. An HTML table consists of one or more rows, each consisting of one or more cells. Cells can contain a wide range of material, including text, figures, and even other tables. Cells can be merged, so, for example, a heading can span multiple columns.
HTML also provides forms for two-way traffic. Forms contain boxes or buttons that allow users to fill in information or make choices and then send the information back to the page’s owner.
CGI (Common Gateway Interface) is a standard for handling form data. Suppose that someone has a database (e.g., an index of web pages by keyword and topic) and wants to make it available to web users. The CGI way to make the database available is to write a script (a program) that acts as an interface (i.e., a gateway) between the database and the web. This script is given a URL, by convention in the directory cgi-bin. HTTP servers know that when they have to invoke a method on a page located in cgi-bin, they are to interpret the file name as being an executable script or program and start it up. CGI scripts can also produce output and do many other things as well as accepting input from forms.
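The gateway idea can be sketched as a function that turns form input into a response. The keyword index, the `keyword` field name, and the `handle` function are hypothetical. In an actual CGI script, the server passes the form data through the `QUERY_STRING` environment variable (for GET requests) and the script writes its response to standard output; here a plain dictionary and a return value are used so the example can run on its own.

```python
from urllib.parse import parse_qs

# Hypothetical keyword index standing in for the database.
INDEX = {"networks": ["/pages/tcp.html", "/pages/dns.html"]}


def handle(environ):
    """Build a CGI-style response from the QUERY_STRING variable."""
    fields = parse_qs(environ.get("QUERY_STRING", ""))
    keyword = fields.get("keyword", [""])[0]
    hits = INDEX.get(keyword, [])
    body = "\n".join(hits) if hits else "no matches"
    # A CGI script emits a header block, a blank line, then the content.
    return "Content-Type: text/plain\r\n\r\n" + body


print(handle({"QUERY_STRING": "keyword=networks"}))
```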
HTML makes it possible to describe how static web pages should appear, including tables and pictures. With the cgi-bin hack, it is also possible to have a limited amount of two-way interaction (forms, etc.). However, rapid interaction with web pages written in HTML is not possible. To make highly interactive web pages possible, the Java language and interpreter are used. The main idea of using Java for interactive web pages is that a web page can point to a small Java program called an applet. When the browser reaches it, the applet is downloaded to the client machine and executed there in a secure way. Thus applets allow web pages to become interactive. For example, a game-playing program (chess, tic-tac-toe, etc.) written in Java can be downloaded along with its web page. Complex forms (e.g., spreadsheets) can be displayed, with the users filling in items and seeing calculations made instantly. Applets also make it possible to add animation and sound to web pages.
URLs – UNIFORM RESOURCE LOCATORS
Web pages may contain pointers to other web pages. When the web was first created, it was immediately apparent that having one page point to another web page required mechanisms for naming and locating pages. In particular, there were three questions that had to be answered before a selected page could be displayed:
- What is the page called?
- Where is the page located?
- How can the page be accessed?
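If every page were somehow assigned a unique name, there would not be any ambiguity in identifying pages. Nevertheless, the problem would not be solved: a unique name answers only the first question and says nothing about where the page is located or how it can be accessed.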
The solution chosen identifies pages in a way that solves all three problems at once. Each page is assigned a URL (Uniform Resource Locator) that effectively serves as the page’s worldwide name.
URLs have three parts:
- The protocol (also called a scheme).
- The DNS name of the machine on which the page is located.
- A local name uniquely indicating the specific page (usually just a file name on the machine where it resides).
For example, a URL can be
http://www.cs.ku.in/welcome.html
This URL consists of three parts:
- The protocol (http)
- The DNS name of the host (www.cs.ku.in)
- The file name (welcome.html)
with certain punctuation separating the pieces.
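The three parts can be pulled out of the example URL with Python's standard `urllib.parse` module, shown here as a small illustration with the usual `://` separator after the protocol. Note that the library reports the local name with its leading slash.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://www.cs.ku.in/welcome.html")
print(parts.scheme)    # http
print(parts.netloc)    # www.cs.ku.in
print(parts.path)      # /welcome.html
```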
To make a piece of text clickable, the page writer must provide two items of information: the clickable text to be displayed and the URL of the page to go to if the text is selected. When the text is selected, the browser looks up the host name using DNS. Now armed with the host’s IP address, the browser establishes a TCP connection to the host. Over that connection, it sends the file name using the specified protocol, and the page comes back.
This URL scheme is open-ended in the sense that it is straightforward to have protocols other than HTTP. In fact, URLs for various other common protocols have been defined, and many browsers understand them. Slightly simplified forms of the more common ones are listed below:
Name | Used for
http | Hypertext (HTML)
ftp | FTP
file | Local file
news | News group
news | News article
gopher | Gopher
mailto | Sending email
telnet | Remote login
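Because the protocol is the first part of every URL, a browser can decide how to handle a URL by looking only at that part. The sketch below shows the idea using the entries from the table; the `describe` function and the sample URLs are made up for illustration.

```python
from urllib.parse import urlsplit

# Descriptions follow the table above; the URLs below are illustrative.
SCHEMES = {
    "http": "Hypertext (HTML)",
    "ftp": "FTP",
    "file": "Local file",
    "news": "News group or news article",
    "gopher": "Gopher",
    "mailto": "Sending email",
    "telnet": "Remote login",
}


def describe(url):
    """Return what a URL is used for, based only on its scheme."""
    scheme = urlsplit(url).scheme
    return SCHEMES.get(scheme, "unsupported scheme")


for url in ("http://www.example.org/index.html",
            "mailto:someone@example.org",
            "telnet://host.example.org",
            "xyz://host.example.org"):
    print(url, "->", describe(url))
```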
The http protocol is the web’s native language, the one spoken by HTTP servers. The ftp protocol is used to access files by FTP, the Internet’s file transfer protocol. Numerous FTP servers all over the world allow people anywhere on the Internet to log in and download whatever files have been placed on the FTP server. The Web does not change this; it just makes obtaining files by FTP easier.
It is possible to access a local file as a Web page, either by using the file protocol, or more simply, by just naming it. This approach is similar to using FTP but does not require having a server.
The news protocol allows a web user to call up a news article as though it were a Web page. This means that a Web browser is simultaneously a newsreader. Many browsers have buttons or menu items to make reading USENET news even easier than using standard news readers.
Two formats are supported for the news protocol. The first format specifies a newsgroup and can be used to get a list of articles from a preconfigured news site. The second one requires the identifier of a specific news article to be given. The browser then fetches the given article from its preconfigured news site using the NNTP protocol.
The gopher protocol is used by the Gopher system. It is an information retrieval scheme, conceptually similar to the Web itself, but supporting only text and no images. When a user logs into a Gopher server, he is presented with a menu of files and directories, any of which can be linked to another Gopher menu anywhere in the world.
The last two protocols do not really have the flavor of fetching Web pages, and are not supported by all browsers. The mailto protocol allows users to send email from a Web browser. The way to do this is to click on the OPEN button and specify a URL consisting of mailto: followed by the recipient’s email address.
The telnet protocol is used to establish an on-line connection to a remote machine. It is used in the same way as the telnet program.
In short, the URLs have been designed to not only allow users to navigate the Web, but to deal with FTP, news, Gopher, email, and telnet as well, making all the specialized user interface programs for those other services unnecessary, and thus integrating nearly all Internet access into a single program, the Web browser.