Section 1: General Information
Section 2: Site Indexing and Search Invocation
Section 1: General Information

1.1 What is an indexed site?

When the search engine spider begins a crawl, it looks at each of the start pages configured by the administrator. When it looks at a page, the spider adds the page to the index and then adds all of the page's links to a list of pages to crawl. It then picks a new page from that list, reads it, adds it to the index, and adds any additional links it finds to the list. It continues this process until it cannot find any more pages.

Before the spider adds a page to the index, it checks the administrator settings and then the Robots Exclusion Protocol to see whether it is allowed to add the page. After it adds the page (or decides not to), it determines which links it is allowed to follow. Links to other pages may be disallowed by the administrator settings or by the Robots Exclusion Protocol.

NOTE: The search engine will not follow links from your pages to URLs that are not in the .psu.edu domain. For example, if your Web site at http://www.aaa.psu.edu/ includes a link to http://www.some.org/, the site www.some.org will not be indexed by the search engine. Some exceptions may be granted by sending a request to search@psu.edu.

1.2 What search engine does Penn State use?

Section 2: Site Indexing

2.1 What is the procedure for getting a site indexed by the Penn State search engine?

For a site to be indexed by the search engine, pages must:
If your Web site does not meet the criteria listed above, please contact search@psu.edu to see if your site may be added to the search engine. If your Web site does meet the criteria and your pages are not found in the search engine, contact search@psu.edu.
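If pages are missing from the index, one thing worth checking is whether the spider can actually reach them by following links from a start page. The crawl described in 1.1 can be sketched roughly as below. This is only an illustration under stated assumptions: the `PAGES` dictionary stands in for real Web pages, the breadth-first order is a guess (the description above does not say in which order pages are picked from the list), and the administrator settings and Robots Exclusion Protocol checks are left out.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory "web": page URL -> links found on that page.
PAGES = {
    "http://www.aaa.psu.edu/": [
        "http://www.aaa.psu.edu/about.html",
        "http://www.some.org/",
    ],
    "http://www.aaa.psu.edu/about.html": ["http://www.aaa.psu.edu/"],
    "http://www.some.org/": [],
}


def in_psu_domain(url):
    """Return True if the URL's host is psu.edu or a subdomain of it."""
    host = (urlparse(url).hostname or "").lower()
    return host == "psu.edu" or host.endswith(".psu.edu")


def crawl(start_pages, pages=PAGES):
    """Visit pages starting from start_pages, following only .psu.edu links."""
    index = []                  # pages added to the index, in crawl order
    queue = deque(start_pages)  # the list of pages still to crawl
    seen = set(start_pages)     # pages already queued, to avoid repeats
    while queue:
        url = queue.popleft()
        if url not in pages:    # page does not exist in the toy data
            continue
        index.append(url)
        for link in pages[url]:
            # Links outside .psu.edu are never added to the crawl list.
            if link not in seen and in_psu_domain(link):
                seen.add(link)
                queue.append(link)
    return index


print(crawl(["http://www.aaa.psu.edu/"]))
# ['http://www.aaa.psu.edu/', 'http://www.aaa.psu.edu/about.html']
```

In this toy data, www.some.org is linked from the start page but never enters the index, matching the NOTE in 1.1. A page that no indexed page links to would be skipped in the same way.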
2.2 How long does it take the search engine to recognize changes to indexed documents?

After a site that changes rarely has been updated, Penn State Search Engine administrators can schedule the site to be recrawled sooner. Penn State Search Engine administrators may also designate select sites as "changes frequently," meaning the site is crawled at least once a day, or "changes rarely," meaning the site is crawled no more than once every 90 days. Penn State Webmasters and Web content editors may submit a request to search@psu.edu to have their sites listed as "frequently" or "infrequently" changing, or to have a rarely crawled site recrawled sooner.

2.3 The server for which I am responsible is indexed by the search engine. I have materials on this server that I do not want the search engine to find. Can the search engine disallow certain sites/pages on a particular server from being indexed?

For example, you are responsible for the sites that reside on the server www.someserver.psu.edu, which is indexed by the search engine. You just created a site for a particular department; the URL is http://www.someserver.psu.edu/yoursite/. However, you have a directory, "budget," that you do not want the search engine to index. We can set up a filter to disallow the indexing of http://www.someserver.psu.edu/yoursite/budget/. The site www.someserver.psu.edu/yoursite/ will be indexed, but no files in the "budget" directory will be indexed. When submitting a request to have your site indexed, make sure to mention any disallow filters you need.

You also may use the robots.txt file to restrict access to parts of your Web site. Please see the Web Robots Pages for instructions on how to do this.

2.4 How do I prevent the search engine from indexing Web pages?

To prevent the Penn State Search Engine from indexing your Web site or your entire Web server, your Web server administrator should use the Robots Exclusion Protocol.
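As a rough illustration of the robots.txt approach, the sketch below applies it to the "budget" example from 2.3 and checks the result with Python's standard `urllib.robotparser` module. The robots.txt contents and the file name `fy.html` are assumptions made for this example, not an official Penn State configuration. The rules address the spider by its User-Agent string, PennStateSpider, and a real robots.txt file would be served from the root of the server (for example, http://www.someserver.psu.edu/robots.txt).

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt contents (an assumption, not an official file).
ROBOTS_TXT = """\
User-agent: PennStateSpider
Disallow: /yoursite/budget/
"""

# Parse the rules from the text above instead of fetching them over the network.
parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url):
    """Return True if PennStateSpider may fetch the URL under these rules."""
    return parser.can_fetch("PennStateSpider", url)


print(allowed("http://www.someserver.psu.edu/yoursite/"))                # True
print(allowed("http://www.someserver.psu.edu/yoursite/budget/fy.html"))  # False
```

Running a check like this before publishing a robots.txt file is one way to confirm that the rules block only what was intended.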
The User-Agent string used by the Penn State search engine is PennStateSpider.

To prevent the Penn State search engine from indexing or following links on an individual Web page, you can use the following Robots Meta tag between the <HEAD> and </HEAD> tags of an HTML page.

<meta name="robots" content="noindex, nofollow">

2.5 How do I invoke the search engine from my Web page?

To invoke the search engine from a Web page, you can add a search field to your site. This is done by adding an HTML form to your Web page. You also may restrict a search to just a subset of the pages indexed by the search engine (e.g. your Web site only). To add a search field like the one below to your site, add the following to your HTML file: