Frequently Asked Questions - IMRC - Resource Discovery Sub-group: GoC Search Engine
Basic Facts
Crawling of Sites
Metadata
Ranking
Troubleshooting
Q: What search product is currently being used?
The Canada site and Publiservice are presently using AltaVista 3.0 search software.
Q: How many and what file types are supported by AltaVista?
AltaVista 3.0 presently supports over 250 different file types. A complete list is
available at http://www.fastsearch.com.
Presently the Government of Canada index supports 15 file types along with the various
versions of each of those types. The 15 file types include:
| File Type | Extension |
| --- | --- |
| Adobe Acrobat document | .pdf |
| Hypertext document | .html, .shtml, .htm, .xhtml, etc. |
| Active Server Page | .asp |
| Java Server Page | .jsp |
| Word document | .doc |
| Lotus Word Pro | .lwp |
| Ami Professional document | .sam |
| Text document | .txt |
| WordPerfect | .wp (all versions) |
| Excel | .xl (all versions) |
| Lotus 123 | .wk*, .123 |
| PowerPoint | .ppt |
| CorelDraw | .cd* (all versions) |
| ColdFusion | .cfm |
Q: When does the AltaVista license expire?
The Government of Canada owns a perpetual license to use the product. The license does
not expire. The support contract is in place until September 2004.
Q: How are Web sites added to the index?
The search engine works by creating an index (repository) of all the information it
collects. It collects this information using a Web collector. Web collectors are also
called robots, spiders or crawlers. The Web collector follows all links on a Web page
based on a set of rules defined by the administrator.
The rules state:
- how many sub directories it will follow,
- whether it will only crawl sites served by a specific domain,
- the Web page where it will start building the index.
Once the crawler is put to work, it organizes all the data from each page it crawls
into the search index. Dynamic sites are currently being crawled. However, because of
complications in indexing these sites, the index might not include all of their
pages.
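The three rules above can be illustrated with a minimal sketch. It is not AltaVista's collector: the in-memory `LINKS` table, the `example.gc.ca` URLs, and the function names are all hypothetical, and real collectors fetch pages over HTTP instead of reading a dictionary.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory "web": each URL maps to the links found on that page.
LINKS = {
    "http://www.example.gc.ca/": [
        "http://www.example.gc.ca/a/",
        "http://www.other.com/",
    ],
    "http://www.example.gc.ca/a/": ["http://www.example.gc.ca/a/b/"],
    "http://www.example.gc.ca/a/b/": ["http://www.example.gc.ca/a/b/c/"],
    "http://www.example.gc.ca/a/b/c/": [],
    "http://www.other.com/": [],
}


def directory_depth(url):
    """Count the subdirectory levels in the URL path."""
    return len([part for part in urlparse(url).path.split("/") if part])


def crawl(start_url, allowed_domain, max_depth, links=LINKS):
    """Breadth-first crawl applying the start page, domain and depth rules."""
    seen = {start_url}
    queue = deque([start_url])
    collected = []
    while queue:
        url = queue.popleft()
        collected.append(url)
        for link in links.get(url, []):
            host = urlparse(link).hostname or ""
            in_domain = host == allowed_domain or host.endswith("." + allowed_domain)
            if link not in seen and in_domain and directory_depth(link) <= max_depth:
                seen.add(link)
                queue.append(link)
    return collected


pages = crawl("http://www.example.gc.ca/", "gc.ca", max_depth=2)
print(pages)
```

With these settings the sketch collects the start page and the two pages beneath it, skips the page three directories deep, and ignores the link outside the gc.ca domain.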
Q: How many collectors does the Government of Canada index presently use?
The Government of Canada index is search-recherche.gc.ca. This index presently has four
collectors populating it. Two of them run on daily schedules, which include indexing the
Canada site for "What's New" and "Canadians Gateway".
Q: What sites are included in the index?
Sites that are in the gc.ca domain are automatically discovered by the crawler as long
as other sites link to them. Because recrawling the index takes time, a newly linked site
may take up to two weeks to appear. Sites with .ca, .com, .org and .net domains may be
collected if they are a Government of Canada initiative, but they must be explicitly added
to the crawlers' lists of URLs.
The main sites for the Provinces and Territories are also indexed.
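As a rough illustration of the inclusion rule, the sketch below treats any gc.ca host as eligible and any other host as eligible only when it appears on an explicit list. The `EXPLICIT_HOSTS` entry and the function name are hypothetical, and the sketch only checks eligibility: a gc.ca site still has to be linked from another site before the crawler can discover it.

```python
from urllib.parse import urlparse

# Hypothetical hosts outside gc.ca that an administrator has added explicitly.
EXPLICIT_HOSTS = {"www.example-initiative.ca"}


def is_in_scope(url, explicit_hosts=EXPLICIT_HOSTS):
    """Return True if the URL's host is eligible for collection."""
    host = (urlparse(url).hostname or "").lower()
    # gc.ca hosts are eligible without being listed.
    if host == "gc.ca" or host.endswith(".gc.ca"):
        return True
    # Other domains (.ca, .com, .org, .net) must be listed explicitly.
    return host in explicit_hosts


print(is_in_scope("http://www.tbs-sct.gc.ca/"))
print(is_in_scope("http://www.example.com/"))
```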
Q: How many documents are presently indexed by the Government of Canada index?
As of Feb. 6, 2002 there were over 3.5 million documents indexed.
Q: Does GENet have a separate index?
GENet does have a separate index called search-recherche.publiservice.gc.ca. As of Feb.
6, 2002 there were over 250 thousand documents indexed. It employs a single collector that
continuously refreshes the index.
Q: How long does it take to refresh the indexes completely?
As a rule of thumb, it can take up to two weeks for a new site to be crawled, but it
could take longer.
Q: What Metadata does AltaVista accept and how does it use it?
The Government of Canada search engine is currently configured to recognize nine Dublin
Core metatags and four additional metatags. A configuration file tells AltaVista what
metatags to collect. It also includes a map that matches each metatag to an internal
variable. The configuration file can have up to 300 metatags defined. The table below
shows the current tags mapped in the Government of Canada search engine configuration
file.
| Metatag | Internal Tag Name |
| --- | --- |
| dc.title | dctitle |
| dc.creator | dccreator |
| dc.subject | dcsubject |
| dc.date.created | dcdatecreator |
| dc.date.modified | dcdatemodified |
| dc.language | dclanguage |
| dc.contributor | dccontributor |
| dc.source | dcsource |
| dc.description | dcdescription |
| Description | description |
| review_date | review_date |
| Keywords | keywords |
| keywords | keywords |
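The mapping can be pictured as a simple lookup table. The sketch below copies the thirteen entries listed above into a Python dictionary; it does not reproduce the format of the real AltaVista configuration file, and the function name is hypothetical. The lookup is assumed to be case-sensitive, which would explain why both "Keywords" and "keywords" are mapped.

```python
# Metatag name -> internal tag name, copied from the mapping listed above.
METATAG_MAP = {
    "dc.title": "dctitle",
    "dc.creator": "dccreator",
    "dc.subject": "dcsubject",
    "dc.date.created": "dcdatecreator",
    "dc.date.modified": "dcdatemodified",
    "dc.language": "dclanguage",
    "dc.contributor": "dccontributor",
    "dc.source": "dcsource",
    "dc.description": "dcdescription",
    "Description": "description",
    "review_date": "review_date",
    "Keywords": "keywords",
    "keywords": "keywords",
}


def map_metatags(found):
    """Keep only configured metatags and rename them to internal names.

    `found` is a list of (metatag name, content) pairs taken from a page.
    """
    mapped = {}
    for name, content in found:
        internal = METATAG_MAP.get(name)  # assumed case-sensitive lookup
        if internal is not None:
            mapped[internal] = content
    return mapped


page_tags = [("dc.title", "FAQ"), ("author", "someone"), ("Keywords", "search")]
print(map_metatags(page_tags))
```

In this sketch the unmapped `author` tag is ignored, while `dc.title` and `Keywords` are stored under their internal names.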
Q: What happens when a metatag is repeated in the document?
AltaVista collects the content from the first metatag and appends the content for any
subsequent repeating tags. Therefore, the index contains the contents of both metatags,
and will return the expected results.
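The appending behaviour can be sketched with Python's standard `html.parser` module. This is an illustration only: the class name and the sample page are hypothetical, and joining the values with a single space is an assumption, since the separator AltaVista uses is not stated here.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect metatag content, appending the content of repeated tags."""

    def __init__(self):
        super().__init__()
        self.values = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name, content = attr.get("name"), attr.get("content")
        if name is None or content is None:
            return
        if name in self.values:
            # Repeated tag: append to what was already collected.
            self.values[name] += " " + content  # separator is an assumption
        else:
            self.values[name] = content


html = """<html><head>
<meta name="dc.subject" content="search engines">
<meta name="dc.subject" content="metadata">
</head><body></body></html>"""

collector = MetaCollector()
collector.feed(html)
print(collector.values["dc.subject"])  # prints "search engines metadata"
```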
Q: Why does AltaVista rank its results?
Search engines use ranking systems to ensure the return of the most accurate results
for a given query. Pages that are found to be most relevant to the query are ranked higher
and placed at the top of the results. AltaVista is capable of displaying the percentage of
relevancy for each result returned. This feature, however, is not enabled at this time.
Q: How does AltaVista rank its results?
There are two types of ranking to consider: ranking based on criteria and ranking based
on relevancy.
Relevance ranking: based on the number of words in the search query that the
document contains and on the weight value of each word. The weight value is based on the
number of occurrences of a word in the entire index. Words that appear less often in the
index are assigned a higher weight, on the assumption that words that occur less
frequently in the index will have more relevance to a user's interest. AltaVista assigns a
rank for relevancy by default.
AltaVista ranks pages containing the given search term(s) in the Title field to have
100% relevancy. These pages are placed at the top of the list of results. Therefore, the
Title of a page is extremely important for search purposes. If the search engine does not
find the search term(s) in the title of the document, it ranks pages based on the search
term(s) found within the body.
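The behaviour described above can be approximated in a few lines. This is a sketch under stated assumptions, not AltaVista's algorithm: the weight formula (1 divided by the number of occurrences in the index), the requirement that every search term appear in the title, and the sample pages and counts are all hypothetical.

```python
def term_weight(term, index_counts):
    """Rarer words get a higher weight (assumed form: 1 / occurrences)."""
    return 1.0 / index_counts.get(term, 1)


def body_score(terms, body, index_counts):
    """Sum the weights of the search terms found in the body text."""
    body_words = set(body.lower().split())
    return sum(term_weight(t, index_counts) for t in terms if t in body_words)


def rank(query, pages, index_counts):
    """Order pages: title matches first, then by weighted body score."""
    terms = query.lower().split()

    def key(page):
        title_words = set(page["title"].lower().split())
        title_hit = bool(terms) and all(t in title_words for t in terms)
        return (title_hit, body_score(terms, page["body"], index_counts))

    return sorted(pages, key=key, reverse=True)


# Hypothetical index statistics: "canada" is common, "passport" is rare.
index_counts = {"passport": 50, "canada": 5000}
pages = [
    {"title": "Welcome", "body": "information about canada"},
    {"title": "Passport Canada", "body": "apply here"},
    {"title": "Travel", "body": "renew a passport"},
]
ranked = rank("passport canada", pages, index_counts)
print([page["title"] for page in ranked])
```

In this sketch the page with both terms in its title comes first, and the page matching the rarer word "passport" in its body outranks the page matching only the common word "canada".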
Criteria ranking: assigns a numeric factor to one or more variables based on
stated criteria. Rankings range from zero to an identified maximum, for example 0 to 10.
Advanced search capabilities defined within the Canada Site template include date range
and ranking using specific search term(s).
Q: How does AltaVista create the description?
When AltaVista displays a list of results, the description for a page is shown. If the
description metatag does not exist for a document, AltaVista creates one by using the
first 100 characters of the document as the description.
When searching for pages with a specific description, AltaVista will not find pages
that do not have a description metatag.
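The two cases can be sketched as follows. The function names are hypothetical; the point is that the 100-character fallback is used only for display, while a search on the description looks only at the metatag.

```python
def result_description(meta_description, document_text, limit=100):
    """Description shown in the results list."""
    if meta_description:
        return meta_description
    # No description metatag: fall back to the start of the document.
    return document_text[:limit]


def matches_description_search(meta_description, term):
    """A description search only sees the metatag, never the fallback text."""
    if not meta_description:
        return False
    return term.lower() in meta_description.lower()


text = "Frequently asked questions about the Government of Canada search engine. " * 3
print(result_description(None, text))
print(matches_description_search(None, "questions"))
```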
Q: Why is my site not showing up in the search results?
There are a number of possibilities as to why a Web site does not show up in a search
result:
- The site may not yet have been crawled. If the index does not contain the site it
will not display in the search results.
- Verify that the individual documents are being returned. Search for a specific
document name to ensure that it is in the index.
- Do not use language-based searches. Assume documents will contain different phrases
in different languages.
- When searching for a specific URL, if the same domain is used for both languages the
search engine will not return results for the pages under the missing domain. For
example, if the domain tpsgc.gc.ca is used for both French and English pages, a search
using url:source.pwgsc.gc.ca/english/ will produce zero results, because no pages were
indexed under the pwgsc.gc.ca domain.
- If the same subdirectory is mistakenly used for both languages, the search engine will
not return the correct results. For example, if you search for
url:http://source.pwgsc.gc.ca/english/main_e.htm and the page has been indexed as
http://source.pwgsc.gc.ca/french/main_e.htm, no results will be returned.
Q: How do I confirm that my site is being crawled?
To confirm that your site is being crawled, query the index for your specific URL using
the "URL:" search feature. If your site appears in the search results, then your
site has been crawled.
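The effect of a mismatched URL can be illustrated with a small sketch. It assumes that a "url:" query behaves like a prefix match on the indexed address with the scheme ignored; that is a simplification for illustration, not a description of how AltaVista implements the feature. The index contents and function names are hypothetical, apart from the URLs taken from the troubleshooting example.

```python
def strip_scheme(url):
    """Drop a leading scheme such as http:// if present."""
    return url.split("://", 1)[-1]


def url_search(query, indexed_urls):
    """Return indexed URLs matching a 'url:' query (assumed prefix match)."""
    if not query.lower().startswith("url:"):
        raise ValueError("expected a query of the form url:<address>")
    prefix = strip_scheme(query[len("url:"):])
    return [u for u in indexed_urls if strip_scheme(u).startswith(prefix)]


# The page was indexed under /french/ although it is the English page.
index = ["http://source.pwgsc.gc.ca/french/main_e.htm"]

print(url_search("url:http://source.pwgsc.gc.ca/english/main_e.htm", index))
print(url_search("url:source.pwgsc.gc.ca/french/", index))
```

The first query returns nothing because no indexed address starts with the /english/ path, while the second finds the page under the path where it was actually indexed.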