Penn State Search Engine Documentation, Information Technology Services

Penn State Search Engine: Information for Webmasters/Web content providers

Overview

The Penn State Search Engine uses the Google Search Appliance. This Web page describes:

  • how Web pages are now indexed at Penn State.
  • how to invoke the Google Search Appliance from your Web pages.
  • some features available from Google.


Indexing Policy

The Search Engine crawls Penn State Web servers continuously. Every crawl begins at the Penn State home page <http://www.psu.edu/>. The crawler follows the links on the home page to other Web sites, which in turn lead to more links and more sites. The pages it finds are added to the search index, from which search results are drawn. Because the crawl is continuous, the index is updated incrementally, a site at a time, rather than being rebuilt all at once. This makes crawls faster and more efficient and keeps search results more up to date.
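
The link-following process can be pictured with a small sketch. This is only an illustration, not the appliance's actual algorithm: the breadth-first order is an assumption, and every URL in the toy link graph except the Penn State home page is made up.

```python
from collections import deque

# Hypothetical link graph: page URL -> links found on that page.
# Only http://www.psu.edu/ comes from the text; the other URLs are invented.
LINKS = {
    "http://www.psu.edu/": ["http://a.example.psu.edu/", "http://b.example.psu.edu/"],
    "http://a.example.psu.edu/": ["http://b.example.psu.edu/", "http://c.example.psu.edu/"],
    "http://b.example.psu.edu/": ["http://www.psu.edu/"],
    "http://c.example.psu.edu/": [],
}


def crawl_order(seed, links):
    """Return pages in the order a breadth-first crawl from `seed` discovers them."""
    seen = {seed}
    queue = deque([seed])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:  # visit each page only once
                seen.add(target)
                queue.append(target)
    return order


if __name__ == "__main__":
    for url in crawl_order("http://www.psu.edu/", LINKS):
        print(url)
```

A page that no crawled page links to is never reached in this sketch, which is one reason a site may need to be reported to search@psu.edu.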

For a site to be indexed by the search engine, pages must:

If your Web site does not meet the criteria listed above, please contact search@psu.edu to see if your site may be added to the search engine. If your Web site does meet the criteria and your pages are not found in the search engine, contact search@psu.edu.


How to Prevent the Search Engine from Indexing Web Pages

To prevent the Penn State Search Engine from indexing your Web site or your entire Web server, your Web server administrator(s) should use the Robots Exclusion Protocol. The User-Agent string used by the Penn State Search Engine is PennStateSpider.
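
As a rough illustration, the sketch below embeds a robots.txt file that blocks PennStateSpider from an entire server while leaving other crawlers unrestricted, and checks it with Python's standard urllib.robotparser module. The host name is hypothetical, and the parser only approximates how the appliance itself reads the file.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content. "Disallow: /" blocks the whole server for the
# named crawler; the empty "Disallow:" leaves all other crawlers unrestricted.
ROBOTS_TXT = """\
User-agent: PennStateSpider
Disallow: /

User-agent: *
Disallow:
"""


def allowed(agent, url, robots_txt=ROBOTS_TXT):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # www.example.psu.edu is a made-up host used only for illustration.
    page = "http://www.example.psu.edu/index.html"
    print(allowed("PennStateSpider", page))
    print(allowed("SomeOtherCrawler", page))
```

To block only part of a site, the Disallow line would name a path prefix (for example, a directory) instead of "/".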

To prevent the Penn State Search Engine from indexing an individual Web page or following the links on it, place the following robots meta tag between the <HEAD> and </HEAD> tags of the HTML page:

<meta name="robots" content="noindex, nofollow">

You can use googleon and googleoff tags to exclude portions of a Web page from matching against search queries and from appearing in search result snippets. When these tags surround the text of a link, that text is also disassociated from the link. Read how to use the Googleon/Googleoff tags using a Penn State Access Account.
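
The sketch below previews which parts of a page would remain searchable when a region is wrapped in these tags. It assumes the HTML-comment syntax from Google's appliance documentation as best recalled here (<!--googleoff: index--> and <!--googleon: index-->; that documentation also describes anchor, snippet, and all variants), so confirm the exact form against the guide mentioned above. The helper and the sample page are hypothetical, and the regular expression is only an approximation of the appliance's behavior.

```python
import re

# Matches a googleoff comment, the excluded region, and the closing googleon comment.
# Assumed tag syntax: <!--googleoff: index--> ... <!--googleon: index-->
_EXCLUDED = re.compile(
    r"<!--\s*googleoff:\s*index\s*-->.*?<!--\s*googleon:\s*index\s*-->",
    re.DOTALL | re.IGNORECASE,
)


def searchable_html(html):
    """Return `html` with every googleoff/googleon (index) region removed."""
    return _EXCLUDED.sub("", html)


# Hypothetical page: the navigation block should not match search queries.
PAGE = (
    "<p>Course catalog</p>"
    "<!--googleoff: index--><p>Site navigation</p><!--googleon: index-->"
    "<p>Contact us</p>"
)

if __name__ == "__main__":
    print(searchable_html(PAGE))
```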


Excluded Web Servers

Not all Web servers are indexed by the search engine. The following Web servers are not indexed:

NOTE: By default, all URLs that contain a question mark are excluded from the index. Webmasters and Web content providers may request that specific URLs be included, for example all URLs that begin with http://somesite.psu.edu/default.asp?. Requests may be directed to search@psu.edu.
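
The default rule and its exceptions can be sketched as follows. The function name and the list of approved prefixes are hypothetical; the real list is maintained by the search engine administrators and is not published here.

```python
# Hypothetical list of URL prefixes for which inclusion has been requested
# and approved. The single entry is the example prefix from the note above.
INCLUDED_PREFIXES = ("http://somesite.psu.edu/default.asp?",)


def passes_query_rule(url, included_prefixes=INCLUDED_PREFIXES):
    """Return True if `url` is not excluded by the question-mark rule."""
    if "?" not in url:
        return True  # no query string, so the rule does not apply
    # URLs with a question mark are indexed only if they match an approved prefix.
    return url.startswith(tuple(included_prefixes))


if __name__ == "__main__":
    print(passes_query_rule("http://somesite.psu.edu/about.html"))
    print(passes_query_rule("http://somesite.psu.edu/default.asp?page=2"))
    print(passes_query_rule("http://somesite.psu.edu/other.asp?page=2"))
```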

If your Web server is in the excluded list, but should in fact be indexed, please contact search@psu.edu.


Document Types Indexed by the Search Engine

The Penn State Search Engine indexes many file formats, including HTML, plain text, Microsoft Word, Excel, and PowerPoint, Adobe PDF, and PostScript. The search engine does not index graphics, video, or audio file formats such as GIF, JPEG, and MPEG, nor does it index imagemaps or Flash. Please see the complete file type listing for details.


Maximum Document Size

The maximum document size indexed by the search engine is 2.5 MB for HTML files and 30 MB for non-HTML files. For an HTML file, the search engine indexes the first 2.5 MB and discards the rest. A non-HTML file larger than 30 MB is discarded entirely; otherwise, the search engine indexes the first 2.5 MB of the file.
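
These size rules reduce to a few lines of arithmetic. The sketch below assumes that 1 MB means 1024 × 1024 bytes and that a non-HTML file of exactly 30 MB is still indexed; neither point is stated in the text, and the function name is made up.

```python
MB = 1024 * 1024                # assumption: binary megabytes
INDEXED_LIMIT = int(2.5 * MB)   # only the first 2.5 MB of any file is indexed
NON_HTML_MAX_SIZE = 30 * MB     # larger non-HTML files are discarded entirely


def indexed_bytes(size, is_html):
    """Return how many bytes of a file of `size` bytes would be indexed."""
    if not is_html and size > NON_HTML_MAX_SIZE:
        return 0  # non-HTML file over 30 MB: nothing is indexed
    return min(size, INDEXED_LIMIT)


if __name__ == "__main__":
    print(indexed_bytes(1 * MB, is_html=True))    # small HTML file: fully indexed
    print(indexed_bytes(10 * MB, is_html=True))   # large HTML file: first 2.5 MB
    print(indexed_bytes(31 * MB, is_html=False))  # oversized PDF, etc.: discarded
```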


How to Invoke the Search Engine from a Web Page

To invoke the search engine from a Web page, add a search field to your site by placing an HTML form on the page. You may also restrict a search to a subset of the pages indexed by the search engine (e.g. your Web site only). To add a search field to your site, add the following to your HTML file:

<form method="GET" action="http://search-results.aset.psu.edu/search">
<input type="text" name="q" size="40" maxlength="2048" value="">
<input type="submit" name="btnG" value="Search">
<input type="hidden" name="client" value="PennState">
<input type="hidden" name="proxystylesheet" value="PennState">
<input type="hidden" name="output" value="xml_no_dtd">
<input type="hidden" name="site" value="PennState">
</form>

You may set size and maxlength to any values you like. The value 2048 is the maximum length that Google supports.
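
Because the form uses the GET method, submitting it is equivalent to requesting a URL with the same fields in its query string. The sketch below builds that URL with Python's standard library. The function name is hypothetical, and the code only constructs the URL; it does not contact the server.

```python
from urllib.parse import urlencode

# Form action taken from the HTML example above.
SEARCH_URL = "http://search-results.aset.psu.edu/search"


def build_search_url(query):
    """Return the URL a browser would request when the form is submitted."""
    params = [
        ("q", query),                       # the user's search terms
        ("btnG", "Search"),                 # the submit button's name and value
        ("client", "PennState"),
        ("proxystylesheet", "PennState"),
        ("output", "xml_no_dtd"),
        ("site", "PennState"),              # default collection
    ]
    return SEARCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_search_url("financial aid"))
```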


How to Restrict Search Results for a Specific Web Site

Searches are restricted only to the collection named in the "site" parameter. To restrict a search to a particular collection instead of the default, "PennState", use the following line in the form:

<input type="hidden" name="site" value="collection_name">
e.g.
<input type="hidden" name="site" value="ITS">

NOTE: Specifying a restricted collection in the site parameter replaces the default collection, "PennState". Do not use more than one site parameter.
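
A quick way to confirm that a search URL carries exactly one site parameter is to parse its query string. This is a small sketch with hypothetical helper names, using only Python's standard library.

```python
from urllib.parse import parse_qs, urlsplit


def site_collections(url):
    """Return every value of the "site" parameter found in a search URL."""
    return parse_qs(urlsplit(url).query).get("site", [])


def has_single_site(url):
    """Return True if the URL names exactly one collection."""
    return len(site_collections(url)) == 1


if __name__ == "__main__":
    base = "http://search-results.aset.psu.edu/search?q=email&client=PennState"
    print(site_collections(base + "&site=ITS"))
    print(has_single_site(base + "&site=PennState&site=ITS"))
```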

For additional search parameters, please send an e-mail inquiry to search@psu.edu.


How to Customize the Search Result Format

You can create a custom search style via the FrontEnds feature of the search engine and the Custom Search Style Manager tool.

You also have the option to request Extensible Markup Language (XML)-formatted data from the search engine, which may be used in dynamic applications.


How to Request XML-Formatted Search Results

You can request XML-formatted search results by invoking the search engine without the proxystylesheet parameter. To receive the Document Type Definition (DTD) as well, set the output parameter to "xml" rather than "xml_no_dtd". For example, the following HTML:

<form method="GET" action="http://search-results.aset.psu.edu/search">
<input type="text" name="q" size="40" maxlength="2048" value="">
<input type="submit" name="btnG" value="Search">
<input type="hidden" name="client" value="PennState">
<input type="hidden" name="output" value="xml">
<input type="hidden" name="site" value="PennState">
</form>

will produce a search form whose results are returned as XML (note that XML tags may be hidden by your browser).

Using a Penn State Access Account, you may reference the Google Search Appliance XML vocabulary.
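
An application that requests XML results needs to extract the individual hits. The sketch below parses a made-up response with Python's standard library. The element names (GSP, RES, R, U for the URL, T for the title) are recalled from the Google Search Appliance XML vocabulary and should be checked against the reference above; the sample data and function name are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample response. Element names are assumed from the appliance's
# XML vocabulary; the URLs and titles are not real search results.
SAMPLE_XML = """\
<GSP VER="3.2">
  <Q>admissions</Q>
  <RES SN="1" EN="2">
    <R N="1"><U>http://www.example.psu.edu/a.html</U><T>Page A</T></R>
    <R N="2"><U>http://www.example.psu.edu/b.html</U><T>Page B</T></R>
  </RES>
</GSP>
"""


def parse_results(xml_text):
    """Return a list of (title, url) pairs, one for each R element."""
    root = ET.fromstring(xml_text)
    return [
        (hit.findtext("T", default=""), hit.findtext("U", default=""))
        for hit in root.iter("R")
    ]


if __name__ == "__main__":
    for title, url in parse_results(SAMPLE_XML):
        print(title, url)
```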


Synonyms

Synonyms may be created to suggest alternate queries to perform. For example, the synonym "President Spanier" could be created for "Graham Spanier" suggesting that users may make alternate queries for one topic. Synonyms are user-friendly ways of helping users find the information for which they are searching. To request creation of synonyms, please contact search@psu.edu.


Features of 5.x


The Pennsylvania State University ©2009. All rights reserved.
This site maintained by Academic Services and Emerging Technologies, a unit of Information Technology Services.

Comments and suggestions may be directed to The Penn State Search Engine Support Team.

Last revised: Tuesday, March 24, 2009.