The Penn State Search Engine uses the Google Search Appliance. This Web page describes:
The Search Engine crawls Penn State Web servers continuously. Every crawl begins with the Penn State home page <http://www.psu.edu/>. The crawl continues by following links on the home page to other Web sites, which in turn lead to more links and more sites. The pages found this way form the search index from which search results are drawn.

Rather than recreating the entire index at once, the Penn State Search Engine crawls continuously and updates the index incrementally, one site at a time. This makes crawls faster and more efficient and keeps search results more up to date.

For a site to be indexed by the search engine, pages must:
If your Web site does not meet the criteria listed above, please contact search@psu.edu to see if your site may be added to the search engine. If your Web site does meet the criteria but your pages are not found in the search engine, contact search@psu.edu.

How to Prevent the Search Engine from Indexing Web Pages

To prevent the Penn State Search Engine from indexing your Web site or your entire Web server, your Web server administrator(s) should use the Robots Exclusion Protocol. The User-Agent string used by the Penn State Search Engine is PennStateSpider.

To prevent the Penn State Search Engine from indexing an individual Web page or following its links, place the following Robots Meta tag between the <HEAD> and </HEAD> tags of the HTML page:

<meta name="robots" content="noindex, nofollow">

You can use Googleon and Googleoff tags to exclude portions of a Web page from query matching and from search result snippets. When these tags surround the text of a link, that text is no longer associated with the link. Instructions for using the Googleon/Googleoff tags are available with a Penn State Access Account.

Not all Web servers are indexed by the search engine. The following Web servers are not indexed by the search engine:
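The Robots Exclusion Protocol rules described earlier on this page can be checked before they are deployed. The following is a minimal sketch. It assumes that Python's standard `urllib.robotparser` module interprets robots.txt in roughly the same way as the search engine, which this page does not specify. The robots.txt content, the `/private/` path, and the `is_allowed` helper are hypothetical examples; only the PennStateSpider User-Agent string and the somesite.psu.edu example host come from this page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only: it blocks PennStateSpider
# from /private/ and leaves every other crawler unrestricted.
ROBOTS_TXT = """\
User-agent: PennStateSpider
Disallow: /private/

User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "PennStateSpider") -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://somesite.psu.edu/index.html",
        "http://somesite.psu.edu/private/page.html",
    ):
        print(url, is_allowed(ROBOTS_TXT, url))
```

To exclude an entire Web server rather than one directory, the hypothetical `Disallow: /private/` line would become `Disallow: /`.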
NOTE: By default, all URLs with a question mark are excluded from the index. Webmasters and Web content providers may request that specific sites be included, for example all URLs that begin with http://somesite.psu.edu/default.asp?. Requests may be directed to search@psu.edu.

If your Web server is in the excluded list but should in fact be indexed, please contact search@psu.edu.

Document Types Indexed by the Search Engine

The Penn State Search Engine indexes many file formats, such as HTML, plain text, Microsoft Word, Excel, and PowerPoint, Adobe PDF, and PostScript. The search engine does not index graphics, video, or audio file formats such as GIF, JPEG, MPEG, imagemaps, and Flash. Please see the complete file type listing for details.

The maximum document size indexed by the search engine is 2.5 MB for HTML files and 30 MB for non-HTML files. The search engine indexes the first 2.5 MB of an HTML file and discards the rest. It discards any non-HTML file larger than 30 MB; for a non-HTML file smaller than 30 MB, it indexes the first 2.5 MB of the file.

How to Invoke the Search Engine from a Web Page

To invoke the search engine from a Web page, you can add a search field to your site. This is done by adding an HTML form to your Web page. You may also restrict a search to a subset of the pages indexed by the search engine (e.g. your Web site only).

To add a search field like the one below to your site, add the following to your HTML file: