Treasury Board of Canada, Secretariat - Government of Canada
Frequently Asked Questions - IMRC - Resource Discovery Sub-group: GoC Search Engine

Basic Facts
Crawling of Sites
Metadata
Ranking
Troubleshooting

Basic Facts:

Q: What search product is currently being used?
The Canada site and Publiservice are presently using AltaVista 3.0 search software.

Q: How many and what file types are supported by AltaVista?
AltaVista 3.0 presently supports over 250 different file types. A complete list is available at http://www.fastsearch.com.

Presently the Government of Canada index supports 15 file types along with the various versions of each of those types. The 15 file types include:

File Type                    Extension
Adobe Acrobat document       .pdf
Hypertext document           .html, .shtml, .htm, .xhtml, etc.
Active Server Page           .asp
Java Server Page             .jsp
Word document                .doc
Lotus Word Pro               .lwp
Ami Professional document    .sam
Text document                .txt
WordPerfect                  .wp (all versions)
Excel                        .xl (all versions)
Lotus 123                    .wk*, .123
PowerPoint                   .ppt
CorelDraw                    .cd* (all versions)
ColdFusion                   .cfm

Q: When does the AltaVista license expire?
The Government of Canada owns a perpetual license to use the product. The license does not expire. The support contract is in place until September 2004.

Crawling of Sites:

Q: How are Web sites added to the index?
The search engine works by creating an index (repository) of all the information it collects. It collects this information using a Web collector. Web collectors are also called robots, spiders or crawlers. The Web collector follows all links on a Web page based on a set of rules defined by the administrator.

The rules state:

  • how many sub directories it will follow,
  • whether it will only crawl sites served by a specific domain,
  • the Web page where it will start building the index.

Once the crawler is put to work, it organizes the data from each page it crawls into the search index. Dynamic sites are currently being crawled. However, because of complications in indexing these sites, the index might not include every possible page.
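The three rules above can be illustrated with a small sketch. It is not AltaVista's collector: it walks an in-memory link graph instead of fetching pages, and the function names, the parameter names and the reading of "how many sub directories" as a limit on URL path depth are all assumptions made for illustration.

```python
from collections import deque
from urllib.parse import urlparse


def path_depth(url):
    """Number of directories in the URL path (assumed meaning of the sub directory rule)."""
    return max(0, urlparse(url).path.count("/") - 1)


def crawl(start_url, links, allowed_domain, max_depth):
    """Breadth-first walk from start_url over a dict mapping each URL to its outgoing links.

    A link is followed only if its host is allowed_domain or a subdomain of it,
    and its path is no more than max_depth directories deep.
    """
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)  # a real collector would add the page to the index here
        for target in links.get(url, []):
            if target in seen:
                continue
            host = urlparse(target).hostname or ""
            in_domain = host == allowed_domain or host.endswith("." + allowed_domain)
            if in_domain and path_depth(target) <= max_depth:
                seen.add(target)
                queue.append(target)
    return visited
```

Calling crawl with a start page, a domain such as "gc.ca" and a depth limit returns the pages in the order they would be visited; links that leave the domain or exceed the depth limit are skipped.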

Q: How many collectors does the Government of Canada index presently use?
The Government of Canada index is search-recherche.gc.ca. This index presently has four collectors populating it. Two of them run on daily schedules, which include indexing the Canada site for "What's New" and "Canadians Gateway".

Q: What sites are included in the index?
Sites that are in the gc.ca domain are automatically discovered by the crawler as long as other sites link to them. Due to the time needed to recrawl the index, this may take up to two weeks. Sites with .ca, .com, .org and .net domains may be collected if they are a Government of Canada initiative, but they must be explicitly added to the crawlers' lists of URLs.

The main sites for the Provinces and Territories are also indexed.
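The inclusion rule described in this answer can be sketched as a simple check. The host name in EXPLICIT_HOSTS is a hypothetical placeholder, not an actual entry in the crawlers' lists, and the function name is invented for the example.

```python
from urllib.parse import urlparse

# Hypothetical stand-in for the crawlers' explicit lists of non-gc.ca sites.
EXPLICIT_HOSTS = {"www.example-initiative.ca"}


def is_eligible(url, explicit_hosts=EXPLICIT_HOSTS):
    """Return True if the URL's host could be collected under the rule above.

    gc.ca hosts qualify automatically (they still need an inbound link to be
    discovered); any other host must appear in the explicit list.
    """
    host = (urlparse(url).hostname or "").lower()
    if host == "gc.ca" or host.endswith(".gc.ca"):
        return True
    return host in explicit_hosts
```

Matching on ".gc.ca" with the leading dot avoids accepting unrelated hosts whose names merely end in "gc.ca".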

Q: How many documents are presently indexed by the Government of Canada index?
As of Feb. 6, 2002 there were over 3.5 million documents indexed.

Q: Does GENet have a separate index?
GENet does have a separate index, called search-recherche.publiservice.gc.ca. As of Feb. 6, 2002 there were over 250,000 documents indexed. It employs a single collector that continuously refreshes the index.

Q: How long does it take to refresh the indexes completely?
As a rule of thumb, it can take up to two weeks for a new site to be crawled. However, it could take longer.

Metadata:

Q: What Metadata does AltaVista accept and how does it use it?
The Government of Canada search engine is currently configured to recognize nine Dublin Core metatags and four additional metatags. A configuration file tells AltaVista what metatags to collect. It also includes a map that matches each metatag to an internal variable. The configuration file can have up to 300 metatags defined. The table below shows the current tags mapped in the Government of Canada search engine configuration file.

Metatag             Internal Tag Name
dc.title            dctitle
dc.creator          dccreator
dc.subject          dcsubject
dc.date.created     dcdatecreator
dc.date.modified    dcdatemodified
dc.language         dclanguage
dc.contributor      dccontributor
dc.source           dcsource
dc.description      dcdescription
Description         description
review_date         review_date
Keywords            keywords
keywords            keywords

Q: What happens when a metatag is repeated in the document?
AltaVista collects the content from the first metatag and appends the content for any subsequent repeating tags. Therefore, the index contains the contents of both metatags, and will return the expected results.
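The mapping and the handling of repeated metatags can be sketched as follows. This is an illustration built on Python's standard html.parser, not AltaVista's configuration format. TAG_MAP copies the table above as listed; the exact-case matching of metatag names and the single space used to join repeated values are assumptions.

```python
from html.parser import HTMLParser

# Metatag name -> internal tag name, copied from the table above as listed.
TAG_MAP = {
    "dc.title": "dctitle",
    "dc.creator": "dccreator",
    "dc.subject": "dcsubject",
    "dc.date.created": "dcdatecreator",
    "dc.date.modified": "dcdatemodified",
    "dc.language": "dclanguage",
    "dc.contributor": "dccontributor",
    "dc.source": "dcsource",
    "dc.description": "dcdescription",
    "Description": "description",
    "review_date": "review_date",
    "Keywords": "keywords",
    "keywords": "keywords",
}


class MetaCollector(HTMLParser):
    """Collects mapped metatags; repeated tags are appended to the first value."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        internal = TAG_MAP.get(attr.get("name") or "")
        content = attr.get("content")
        if internal is None or not content:
            return  # unmapped metatag, or nothing to collect
        if internal in self.fields:
            # Assumed separator: a single space between appended values.
            self.fields[internal] += " " + content
        else:
            self.fields[internal] = content


def collect_metatags(html):
    """Return a dict of internal tag name -> collected content for one document."""
    parser = MetaCollector()
    parser.feed(html)
    parser.close()
    return parser.fields
```

With this sketch, a page carrying two dc.subject metatags ends up with both values under dcsubject, so a search on either value can match the page.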

Ranking:

Q: Why does AltaVista rank its results?
Search engines use ranking systems to ensure the return of the most accurate results for a given query. Pages that are found to be most relevant to the query are ranked higher and placed at the top of the results. AltaVista is capable of displaying the percentage of relevancy for each result returned. This feature, however, is not enabled at this time.

Q: How does AltaVista rank its results?
There are two types of ranking to consider: ranking based on relevancy and ranking based on criteria.

Relevance ranking: based on the number of words in the search query that the document contains and on a weight value. The weight assigned to a word is based on the number of occurrences of that word in the entire index. Words that appear less often in the index are assigned a higher weight, on the assumption that they are more relevant to a user's interest. AltaVista assigns a rank for relevancy by default.

AltaVista assigns 100% relevancy to pages that contain the given search term(s) in the Title field. These pages are placed at the top of the list of results. Therefore, the Title of a page is extremely important for search purposes. If the search engine does not find the search term(s) in the title of the document, it ranks pages based on the search term(s) found within the body.
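A rough sketch can make the relevance rules above concrete. AltaVista's actual formula is not given here, so the details are assumptions: the weight of a term is taken as 1 divided by its number of occurrences in the whole index, a page counts as a title match only when every query term appears in its title, and body scores are normalized to the range 0 to 1. All names are illustrative.

```python
from collections import Counter


def tokenize(text):
    """Lowercase, whitespace-separated words (a deliberately simple tokenizer)."""
    return text.lower().split()


def rank(query, docs):
    """Rank docs (dicts with "title" and "body") for a query.

    Returns (title, score, in_title) tuples, title matches first.
    """
    terms = tokenize(query)
    index_counts = Counter()
    for doc in docs:
        index_counts.update(tokenize(doc["title"]) + tokenize(doc["body"]))
    # Rarer terms get a higher weight (assumed form: inverse occurrence count).
    weights = {t: 1.0 / index_counts[t] for t in terms if index_counts[t]}
    total = sum(weights.values())

    results = []
    for doc in docs:
        title_words = set(tokenize(doc["title"]))
        all_words = title_words | set(tokenize(doc["body"]))
        in_title = bool(terms) and all(t in title_words for t in terms)
        if in_title:
            score = 1.0  # title match: treated as 100% relevancy
        else:
            matched = sum(w for t, w in weights.items() if t in all_words)
            score = matched / total if total else 0.0
        if score > 0:
            results.append((doc["title"], score, in_title))
    # Title matches first, then by descending score.
    results.sort(key=lambda r: (r[2], r[1]), reverse=True)
    return results
```

In this sketch a page that matches only a common query term scores lower than a page that also matches a rare one, which mirrors the weighting described above.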

Criteria ranking: assigns a numeric factor to one or more variables based on stated criteria. Rankings range from zero to an identified maximum, for example 0 to 10. Advanced search capabilities defined within the Canada Site template include date range and ranking using specific search term(s).

Q: How does AltaVista create the description?
When AltaVista displays a list of results, the description for a page is shown. If the description metatag does not exist for a document, AltaVista creates one by using the first 100 characters of the document as the description.

When searching for pages with a specific description, AltaVista will not find pages that do not have a description metatag.
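The two behaviours described in this answer can be sketched in a few lines. The function names are invented for the example, and the sketch assumes the document text is already available as a plain string; how AltaVista treats markup or whitespace when it builds the fallback is not stated here.

```python
def result_description(description_meta, document_text, limit=100):
    """Description shown in the results list.

    Uses the description metatag when present; otherwise falls back to the
    first `limit` characters of the document.
    """
    if description_meta:
        return description_meta
    return document_text[:limit]


def matches_description_search(description_meta, term):
    """A description search only sees the metatag, never the generated fallback."""
    if not description_meta:
        return False
    return term.lower() in description_meta.lower()
```

The second function shows why a page without a description metatag cannot be found by a description search, even though a description is displayed for it in the results.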

Troubleshooting:

Q: Why is my site not showing on a search result?
There are a number of possibilities as to why a Web site does not show up in a search result:

  • The site may not yet have been crawled. If the index does not contain the site it will not display in the search results.
  • Verify that the individual documents are being returned. Search for a specific document name to ensure that it is in the index.  
  • Do not use language-based searches. Assume documents will contain different phrases in different languages.
  • When searching for a specific URL, if the same domain is used for both languages, the search engine will not return results for the pages under the missing domain. For example, if the domain tpsgc.gc.ca was used for both French and English, performing a search using url:source.pwgsc.gc.ca/english/ will produce zero results.
  • If the same subdirectory is mistakenly used for both languages, the search engine will not return the correct results. For example, if you search for url:http://source.pwgsc.gc.ca/english/main_e.htm and the page has been indexed as http://source.pwgsc.gc.ca/french/main_e.htm, no results will be returned.

Q: How do I confirm that my site is being crawled?
To confirm that your site is being crawled, query the index for your specific URL using the "URL:" search feature. If your site appears in the search results, then your site has been crawled.

