Content Filtering Technologies and Internet Service Providers
Michael Shepherd and Carolyn Watters [1]
Web Information Filtering Lab
Faculty of Computer Science
Dalhousie University
6050 University Avenue
Halifax, Nova Scotia, Canada B3H 1W5
Executive Summary
March 22, 2000
This report describes the mechanisms that Internet Service Providers
(ISPs) can choose to provide, and that users can choose to use, to
filter the content delivered to them over the Internet and to allow
authorized access to that content. The report is purely descriptive of
the filtering mechanisms available and does not provide policy or
legal advice or recommendations. It was commissioned by Industry
Canada to help promote the development, awareness and use of tools and
technologies that enable Internet users to make choices about the
content that they access on the Internet.
Classification of Mechanisms
The mechanisms are classified into two tiers: the application-level
mechanisms and the underlying core technologies. The core technologies
are classified as follows:
- Site labels
Labeling refers to schemes that assign content-related labels to
URLs and/or specific Web pages. The URL (Uniform Resource Locator)
describes the location of a specific Web page. Rating protocols
generally exist separately from the products or applications that use
the ratings. These labels can be stored as part of the Web page or
separately from it in a database. Labels may be the result of
self-rating, third-party authority rating, or community rating by
interested users. [2]
- Lists of appropriate or inappropriate sites ("white" and
"black" lists)
The most frequently used filtering technology is the use of lists of
acceptable and/or unacceptable URLs. "White" lists define a domain of
"safe" Web sites within which users can browse; such lists are
typically compiled by people who search for and select sites approved
by the provider of the list. "Black" lists are lists of URLs from
which requests will not be serviced. The lists are compiled as a
service by individuals or by communities of raters. (A small
illustrative sketch of list checking and keyword matching follows
this list.)
- Automated text analysis
Another way to analyze a Web site is to use software that scans the
text of a site to determine the relevance or suitability of pages.
Users or groups of users have profiles of interest (positive and/or
negative), consisting of keywords and phrases, that are used in this
determination. Almost all content-based filtering uses some variation
of keyword matching, in which keywords from a profile of interest are
compared against the keywords occurring in the content of the specific
Web page. Text analysis is also used to screen search terms from
search queries.
- Authorization
Encryption, password protection, and credit card validation
techniques are used to verify that a user is authorized to access
given services or data.
- Activity tracing
Internet usage can be traced by using the server log files and other
data logs. These files store details of all Web accesses and can be
used to analyze Web-related activities.
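To make the list-based and keyword-based techniques above concrete, here is a minimal Python sketch; it is not taken from the report, and the host lists, keyword profile, and threshold are hypothetical placeholders.

```python
# Hypothetical sketch of URL list checking plus keyword-profile matching.
from urllib.parse import urlparse

WHITE_LIST = {"kids.example.org"}              # hypothetical "safe" hosts
BLACK_LIST = {"www.example-blocked.com"}       # hypothetical blocked hosts
NEGATIVE_KEYWORDS = {"keyword1", "keyword2"}   # placeholder profile terms

def allowed_by_lists(url, use_white_list=False):
    """Pass the URL through a white-list or black-list check."""
    host = urlparse(url).hostname or ""
    if use_white_list:
        return host in WHITE_LIST      # only pre-approved hosts are served
    return host not in BLACK_LIST      # everything is served except blocked hosts

def allowed_by_text_analysis(page_text, max_hits=0):
    """Rough keyword matching of page text against a negative profile."""
    words = {w.strip(".,;:!?\"'()").lower() for w in page_text.split()}
    return len(words & NEGATIVE_KEYWORDS) <= max_hits

print(allowed_by_lists("http://www.example-blocked.com/a.html"))  # False
print(allowed_by_text_analysis("an innocuous page of text"))      # True
```

In a deployed system these checks could run at the ISP, at the client, or at both, using lists and profiles maintained by the provider or the user.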
The filtering applications that are built on these underlying
technologies have been classified as follows:
- Special purpose browsers for children
Browser applications can be written that are targeted at child users.
Such applications can provide easier search strategies and friendlier
graphics, remove advertisements, and provide filtering and search-safe
domains in a way that is transparent to the user.
- Child-friendly search engines and portals
The idea behind both special purpose child-friendly search engines
and portals is to use a third party gateway to Web content.
Child-friendly portals are Web access sites that try to provide a
domain of safe sites for the user to explore. As long as the user
comes in through the portal, they view a pre-selected subset of
the Web.
- Proxy applications
Proxy software runs at the ISP and acts as an intermediary between
the client or browser and the Internet. Application software can be
added to proxy server modules to perform text analysis and URL list
comparisons on each browser request and response (see the sketch
after this list).
- Activity monitors
Rather than restrict or control access to Web sites proactively,
these applications monitor and log Internet activity for parental
review.
- Restricted access applications
Applications residing on the host site can be written that restrict
access to services or data on that site to authorized users. These
applications may encrypt the data so that only authorized users can
decrypt and view the data.
- Non-HTTP applications
In addition to Web page access, applications can be written using
these core technologies to filter content of email and to control
access to ftp sites, telnet hosts, discussion and chat groups, and
newsgroups.
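As an example of the proxy approach described above, the following is a rough sketch, not the report's implementation, of an ISP-side filtering proxy in Python; the port number and blocked-host list are hypothetical.

```python
# Hypothetical sketch of a filtering proxy: each browser request is checked
# against a black list before the page is fetched and relayed.
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.parse import urlparse
from urllib.request import urlopen

BLOCKED_HOSTS = {"www.example-blocked.com"}    # hypothetical black list

class FilteringProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        # A browser configured to use a proxy sends the full URL in the request line.
        host = urlparse(self.path).hostname or ""
        if host in BLOCKED_HOSTS:
            self.send_error(403, "Blocked by filtering policy")
            return
        try:
            with urlopen(self.path, timeout=10) as upstream:
                content_type = upstream.headers.get("Content-Type", "text/html")
                body = upstream.read()
            self.send_response(200)
            self.send_header("Content-Type", content_type)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        except OSError:
            self.send_error(502, "Could not fetch the requested page")

if __name__ == "__main__":
    # Point a browser's HTTP proxy setting at localhost:8080 to try it out.
    ThreadingHTTPServer(("", 8080), FilteringProxy).serve_forever()
```

In practice such a proxy module could also apply the text analysis and label checks described earlier to the fetched body before relaying the response.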
Potential of the Core Technologies
All of these core technologies have a role to play in filtering
mechanisms for the Web. None of these core technologies provides a
long-term solution on its own. Systems will need to combine the
technologies in innovative ways to provide effective solutions. In
particular, we note that:
- Site labeling systems are the most flexible and perhaps hold the most
promise for the future. Labels may be assigned by the content provider,
third party rating services, communities of users, and/or individual
users. It is the responsibility of the ISP and/or of the client to use
the labels that have been assigned.
- URL lists are the most effective in controlling domains of access. This
method is particularly good for creating child-friendly sites.
However, the use of lists does not provide the flexibility of labels
and the Web is growing so quickly it is very hard to keep lists
up-to-date.
- Accurate automated analysis of the content of Web sites is problematic
at best, due to the vagaries of natural language, the difficulties of
cross-language filtering, and the difficulties of determining the
content of graphics and images. Using this technology to assist in the
labeling of sites based on text categorization techniques does,
however, hold some promise.
- Access authorization can be effective as a reverse filter, restricting
who can have access to a given site.
- Activity tracing can only monitor what has been done; it cannot
actually filter out any material.
Although Web crawlers are not considered a core technology, they can
act as rating agents that rate sites proactively based on content
analysis algorithms. Although this
has not proven very accurate to date, there is potential for
improvement in this area.
It must be emphasized that none of the above technologies are 100%
effective and that the content of the Web is, by its very nature,
volatile. To be (more) effective, these technologies have to be used
in combinations and in layers, both at the ISP and at the client. As
these applications and core technologies are used in combination, and
as most can be applied at either the ISP or the client or both, there
is no clear recommendation as to the best method or where it is best
applied.
The areas that hold the greatest potential and where future efforts
should be focused include:
- The development of architectures that combine various mechanisms to
work in collaboration or as layers within the architecture. For
example, labels managed by a Label Bureau, with lists managed by the
ISP, and with content analysis at the client (a layered decision is
sketched after this list).
- Further research into Web crawlers and agents for the proactive
categorization of Web sites.
- The development of architectures supporting collaborative filtering
for communities of users.
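A hedged sketch of the layered idea in the first item above, assuming a label lookup supplied by a Label Bureau, an ISP-managed black list, and a client-side keyword profile; all of the data below is a hypothetical placeholder.

```python
# Hypothetical layered decision: label check, then ISP list check,
# then client-side content analysis.
BUREAU_LABELS = {"http://www.example.com/page.html": "restricted"}  # stub label bureau
ISP_BLACK_LIST = {"www.example-blocked.com"}
CLIENT_KEYWORDS = {"keyword1", "keyword2"}

def passes_all_layers(url, host, page_text):
    if BUREAU_LABELS.get(url) == "restricted":   # layer 1: third-party label
        return False
    if host in ISP_BLACK_LIST:                   # layer 2: ISP-managed list
        return False
    words = set(page_text.lower().split())       # layer 3: client content analysis
    return not (words & CLIENT_KEYWORDS)
```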
While virtually all of the techniques reviewed in this report can be
implemented by an ISP, it is important that the use of filtering is
transparent to the user. The user should be informed when filtering is
in place or has occurred. To be effective, it is important that:
- The user knows when content has been filtered and why; and
- The user knows the criteria for filtering, i.e., what is on the list
or in the filter.
Given the dynamic nature of the Web, a concerted and continuing effort
into the development, evaluation, and maintenance of filtering and
access control mechanisms will be required at all levels, including
government, community, ISP, and individual.
Footnotes
1. With Margo Boyd, Research Assistant.
2. Balkin, J., B. Noveck and K. Roosevelt. 1999. Filtering the
Internet: A best practices model. In Protecting Our Children on the
Internet: Towards a New Culture of Responsibility. Bertelsmann
Foundation Publishers.