
Search Engine FAQ

Section 1: General Information


Section 2: Site Indexing and Search Invocation


Section 3: Common Problems


Section 4: Policies


Section 1: General Information

1.1 What is an indexed site?
An indexed site is a Web site that is added to the search engine index. The search engine finds sites to index by reading Web pages and following the links to other pages; this process is known as crawling or spidering. The search engine administrator defines which pages to read first, which pages are allowed to be indexed and which links are allowed to be followed. An example of an indexed site is http://www.psu.edu/, which is the first page the Penn State search engine reads.

When the search engine spider begins the crawl, it looks at each of the start pages configured by the administrator. When it looks at a page, the spider adds the page to the index and then adds all of the page's links to a list of pages to crawl. It then picks a new page from that list, reads it, adds it to the index and adds any additional links found to the list. It continues this process until it cannot find any more pages.

Before adding a page to the index, the spider checks the administrator settings and then the Robots Exclusion Protocol to see whether the page may be added. Whether or not the page is added, the spider then determines which of the page's links it is allowed to follow. Links to other pages may be disallowed by the administrator settings or by the Robots Exclusion Protocol.

NOTE: The search engine will not follow links from your pages to URLs that are outside the .psu.edu domain. For example, if your Web site at http://www.aaa.psu.edu/ includes a link to http://www.some.org/, the site www.some.org will not be indexed by the search engine. Exceptions may be granted on request; send the request to search@psu.edu.
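
The crawl described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not the search appliance's actual implementation: the function names, the `fetch_links` callback and the small link graph are all hypothetical, and the sketch simplifies the rules by treating "may be indexed" and "may be followed" as the same .psu.edu check.

```python
from collections import deque
from urllib.parse import urlparse


def in_psu_domain(url):
    """Return True if the URL's host is psu.edu or a host under .psu.edu."""
    host = (urlparse(url).hostname or "").lower()
    return host == "psu.edu" or host.endswith(".psu.edu")


def crawl(start_pages, fetch_links, is_allowed=in_psu_domain):
    """Index a page, queue its links, and repeat until no pages are left.

    fetch_links is a caller-supplied function that maps a URL to the list of
    URLs linked from that page (hypothetical; it stands in for real fetches).
    """
    index = []                   # pages added to the index, in crawl order
    queue = deque(start_pages)   # the list of pages still to crawl
    seen = set(start_pages)      # avoids reading the same page twice
    while queue:
        url = queue.popleft()
        if not is_allowed(url):
            # Simplification: a disallowed page is skipped entirely here.
            continue
        index.append(url)
        for link in fetch_links(url):
            if link not in seen and is_allowed(link):
                seen.add(link)
                queue.append(link)
    return index


# A tiny made-up link graph used in place of the live Web.
LINKS = {
    "http://www.psu.edu/": ["http://www.aaa.psu.edu/", "http://www.psu.edu/"],
    "http://www.aaa.psu.edu/": ["http://www.some.org/", "http://www.psu.edu/"],
}

pages = crawl(["http://www.psu.edu/"], lambda url: LINKS.get(url, []))
print(pages)
```

In this made-up graph the link to http://www.some.org/ is never queued, which mirrors the NOTE above.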

1.2 What search engine does Penn State use?
The Penn State Search Engine uses the Google Search Appliance model GB-5005. Our license agreement with Google currently allows for up to 2,000,000 documents. Only those Web sites that Penn State owns or operates, or that have been developed on behalf of Penn State, may be indexed with our search engine. Note that per the license agreement, Penn State is permitted to include the Google trademark on our main search page.



Section 2: Site Indexing and Search Invocation

2.1 What is the procedure for getting a site indexed by the Penn State search engine?
The Penn State Search Engine crawls continuously, incrementally updating the index one site at a time rather than recreating the entire index at once. Crawling this way is faster and more efficient, and it keeps search results up to date.

For a site to be indexed by the search engine, pages must:

If your Web site does not meet the criteria listed above, please contact search@psu.edu to see if your site may be added to the search engine. If your Web site does meet the criteria and your pages are not found in the search engine, contact search@psu.edu.

Special note for people with pages on the Personal server www.personal.psu.edu:
Please do not send e-mail to search@psu.edu requesting that we index your personal pages. As of June 1999, faculty, staff and student personal Web pages on the personal server (www.personal.psu.edu) are not indexed by the Penn State Search Engine.

2.2 How long does it take the search engine to recognize changes to indexed documents?
The Penn State Search Engine crawls continuously so that Web pages that change frequently are revisited more often than those that rarely change. A given page may be crawled as often as every 15 minutes or as rarely as every 90 days, depending on how often it changes. During the search engine's initial crawl of a given page after the upgrade, the next crawl is scheduled for 3 to 20 days later, with high-PageRank pages at the short end of that range and low-PageRank pages at the long end. Each time a page is recrawled, the search engine automatically adjusts the recrawl interval until the page is crawled about twice as often as changes are detected (as resources allow). A Penn State Access Account is required to log in and see more details about the crawling specifications.

After a site that changes rarely has been updated, Penn State Search Engine administrators can schedule the site to be recrawled sooner. Penn State Search Engine administrators may also designate select sites as "changes frequently," meaning crawl at least once a day, or "changes rarely," meaning crawl no more than once every 90 days. Penn State Webmasters and Web content editors may submit a request to search@psu.edu to have their respective sites listed as "frequently" or "infrequently" changing sites or to have a rarely crawled site recrawled sooner.
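
As a rough illustration of the recrawl rule in 2.2, the sketch below computes a target interval that is half the observed time between changes, kept within the 15-minute and 90-day limits. The function name and the clamping logic are assumptions made for this example; the appliance's real scheduling algorithm is not documented here and also depends on available resources.

```python
from datetime import timedelta

MIN_INTERVAL = timedelta(minutes=15)   # most frequent crawl mentioned in 2.2
MAX_INTERVAL = timedelta(days=90)      # least frequent crawl mentioned in 2.2


def target_recrawl_interval(observed_change_interval):
    """Aim to crawl twice as often as the page changes, within the limits."""
    target = observed_change_interval / 2
    return max(MIN_INTERVAL, min(MAX_INTERVAL, target))


# A page that changes about every 10 days.
print(target_recrawl_interval(timedelta(days=10)))
# A page that changes every 10 minutes is held at the 15-minute floor.
print(target_recrawl_interval(timedelta(minutes=10)))
# A page that changes once a year is held at the 90-day ceiling.
print(target_recrawl_interval(timedelta(days=365)))
```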

2.3 The server for which I am responsible is indexed by the search engine. I have materials on this server that I do not want the search engine to find. Can the search engine disallow certain sites/pages on a particular server from being indexed?
Yes. For every indexed server, filters can be established to allow and disallow the indexing of certain sites on a particular server.

For example, suppose you are responsible for the sites that reside on the server www.someserver.psu.edu, which is indexed by the search engine. You just created a site for a particular department; the URL is http://www.someserver.psu.edu/yoursite/. However, the site has a directory, "budget," that you do not want the search engine to index. We can set up a filter to disallow the indexing of http://www.someserver.psu.edu/yoursite/budget/. The site www.someserver.psu.edu/yoursite/ will be indexed, but the files in the "budget" directory will not. When submitting a request to have your site indexed, make sure to mention any disallow filters you need.

You also may use the robots.txt file to restrict access to parts of your Web site. Please see the Web Robots Pages for instructions on how to do this.

2.4 How do I prevent the search engine from indexing Web pages?

To prevent the Penn State Search Engine from indexing your Web site or your entire Web server, your Web server administrator should use the Robots Exclusion Protocol. The User-Agent string used by the Penn State search engine is PennStateSpider.

To prevent the Penn State search engine from indexing or following links on an individual Web page, you can use the following Robots Meta tag between the <HEAD> and </HEAD> tags of an HTML page.

<meta name="robots" content="noindex, nofollow">
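
For the robots.txt approach mentioned in 2.3 and 2.4, the sketch below uses Python's standard urllib.robotparser module to check how a rule for the PennStateSpider user agent would be interpreted. The robots.txt content and the file names are made up for illustration, reusing the "budget" directory from the example in 2.3; the check shows how a standard parser reads the rule, not a guarantee of how the search appliance itself behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt served at http://www.someserver.psu.edu/robots.txt
ROBOTS_TXT = """\
User-agent: PennStateSpider
Disallow: /yoursite/budget/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask whether the PennStateSpider user agent may fetch each URL.
budget_ok = parser.can_fetch(
    "PennStateSpider", "http://www.someserver.psu.edu/yoursite/budget/2007.html"
)
index_ok = parser.can_fetch(
    "PennStateSpider", "http://www.someserver.psu.edu/yoursite/index.html"
)
print("budget page allowed:", budget_ok)
print("index page allowed:", index_ok)
```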

2.5 How do I invoke the search engine from my Web page?

To invoke the search engine from a Web page, add a search field to your site by adding an HTML form to the page. To add a basic search field, add the following to your HTML file:

<form method="GET" action="http://search-results.aset.psu.edu/search">
<input type="text" name="q" size="40" maxlength="2048" value="">
<input type="submit" name="btnG" value="Search">
<input type="hidden" name="client" value="PennState">
<input type="hidden" name="proxystylesheet" value="PennState">
<input type="hidden" name="output" value="xml_no_dtd">
<input type="hidden" name="site" value="PennState">
</form>

Also, you may restrict a search to just a subset of the pages indexed by the search engine (e.g. your Web site only).

How to Restrict Search Results for a Specific Web Site

To restrict the search box so that it returns results only from an entire Web domain, a single Web site or a subset of a Web site, add the as_sitesearch parameter to your search form:

<input type="hidden" name="as_sitesearch" value="your_url_here">
e.g.
<input type="hidden" name="as_sitesearch" value="its.psu.edu">

In the example above, results will be returned from all hosts in the its.psu.edu domain such as aset.its.psu.edu, tlt.its.psu.edu, etc. You may also restrict search results to a specific folder, for example http://www.psu.edu/dept/its/ with the following input tag:

<input type="hidden" name="as_sitesearch" value="www.psu.edu/dept/its">

NOTE: If a trailing slash '/' is used in the URL value, then the search will be restricted to only that folder. In the example above, which does not use a trailing slash, results will be returned for the /dept/its folder and all subfolders under it.

You cannot specify more than one distinct Web site with the as_sitesearch parameter. If you need to restrict a search to a set of site patterns, a "collection" should be created in the search engine configuration. Contact search@psu.edu to request the creation of a collection for the set of sites for which you need to create a search form. Directions on how to invoke a search from your site using the collection will be given by the Search Engine Team after the collection is created.
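
Because the form uses the GET method, the same search can also be expressed as a URL. The sketch below builds such a URL from the parameters shown in the form in 2.5, optionally adding as_sitesearch. The helper name build_search_url is hypothetical, and the btnG submit-button parameter is left out on the assumption that it is not needed to run a query.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "http://search-results.aset.psu.edu/search"


def build_search_url(query, sitesearch=None):
    """Build a search URL using the hidden fields from the form in 2.5."""
    params = {
        "q": query,
        "client": "PennState",
        "proxystylesheet": "PennState",
        "output": "xml_no_dtd",
        "site": "PennState",
    }
    if sitesearch:
        # No trailing slash: the folder and all of its subfolders are searched.
        params["as_sitesearch"] = sitesearch
    return SEARCH_ENDPOINT + "?" + urlencode(params)


url = build_search_url("parking permits", "www.psu.edu/dept/its")
print(url)
```

urlencode takes care of escaping spaces and slashes in the values, so the query text and the as_sitesearch value can be passed in as plain strings.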

2.6 What are META tags? How can I use META tags?
META tags serve a variety of different functions, depending on how you use them in your document. You can redirect or reload the page after a specified amount of time and you can use META tags to provide visitors with information about your Web pages/site. In particular, META tags can provide keywords, controlling HOW your page is indexed by the search engine.

Your META tag information should be added between the <HEAD> and </HEAD> tags.

First, you might want to describe your document, so that the search engine displays the description META tag along with the title of your document in the results. The description META tag looks like this:
<META NAME="description" CONTENT="your description">

Keywords help the search engine to categorize your site. Choose keywords that best describe your content. Choose carefully as priority is given to the first few keywords found. The keyword META tag looks like this:
<META NAME="keywords" CONTENT="keyword1, keyword2, keyword3">

The refresh META tag is a way to direct visitors of your Web site to another site after a specified amount of time. This tag is especially helpful if your Web site has moved to a new location. The refresh tag looks like this:
<meta http-equiv="refresh" CONTENT="10; url=http://www.yournewwebsite.psu.edu">
The number in the CONTENT section of the tag is the number of seconds before visitors are automatically directed to the new site.

Currently, use of the refresh META tag is not prohibited; however, it is recommended that you send the new URL to search@psu.edu with a request to index the new site and remove the old site.

In addition to META tags, the title of your document should reflect the content of your document. The title of your document should be placed between the <TITLE> and </TITLE> tags.


2.7 How can I customize the format of the search results to fit the style of my Web site?
A custom search style can be obtained via the FrontEnds feature and the Custom Search Style Manager tool. You also have the option to request Extensible Markup Language (XML)-formatted data from the search engine, which may be used in dynamic applications.

2.8 How do I request search results in XML format?
You can request XML-formatted search results by invoking the search engine without the proxystylesheet parameter. You also may choose to receive the Document Type Definition (DTD) by setting the output parameter to "xml" rather than "xml_no_dtd". For example, the following HTML:

<form method="GET" action="http://search-results.aset.psu.edu/search">
<input type="text" name="q" size="40" maxlength="2048" value="">
<input type="submit" name="btnG" value="Search">
<input type="hidden" name="client" value="PennState">
<input type="hidden" name="output" value="xml">
<input type="hidden" name="site" value="PennState">
</form>

will produce a search form whose results are returned as XML (note that XML tags may be hidden by your browser).

Using a Penn State Access Account, you may reference the Google Search Appliance XML vocabulary.
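
Once the results arrive as XML, a dynamic application can read them with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree on a made-up sample document. The element names (GSP, RES, R, U for the URL and T for the title) are an assumption about the general shape of Google Search Appliance output and should be checked against the XML vocabulary reference before use.

```python
import xml.etree.ElementTree as ET

# Made-up sample response; the element names are assumptions (see above).
SAMPLE_XML = """\
<GSP VER="3.2">
  <Q>its</Q>
  <RES SN="1" EN="2">
    <R N="1"><U>http://www.psu.edu/</U><T>Penn State</T></R>
    <R N="2"><U>http://www.psu.edu/dept/its/</U><T>ITS</T></R>
  </RES>
</GSP>
"""


def extract_results(xml_text):
    """Return a list of (url, title) pairs, one per result element."""
    root = ET.fromstring(xml_text)
    return [(r.findtext("U"), r.findtext("T")) for r in root.iter("R")]


for result_url, title in extract_results(SAMPLE_XML):
    print(title, result_url)
```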



Section 3: Common Problems

3.1 I just created a new site that is indexed by the search engine; however, files from the old site are still being indexed. What can I do?
The old site will need to be removed. If the old site was hosted via Penn State departmental Web space, then the site's administrator/supervisor will need to request removal via the ITS Accounts Services Office. After the Accounts Office verifies the request, the site can be removed. If the old site was hosted on an independent server (for example, www.aaa.psu.edu), then the site administrator for the independent server will need to be contacted. After the site has been removed, it can take up to seven days for the search engine to purge the files from the index. If only a few URLs remain in the index, we can delete them manually. If there are many URLs, you will need to wait until the search engine purges them from the index.

3.2 My site exists on a server that is indexed by the search engine. Why doesn't my site show in the index when I search on my URL or keywords?
Most likely, the indexed server on which your site resides does not link to your site from any of its pages/sites. Contact the server administrator or Webmaster for the server to request that a link to your site be added. If a link can be added, it might take up to seven days for the search engine to recognize the change and find your page. If for some reason a link cannot be added, contact search@psu.edu and your site can be added as a separate URL for the search engine to index.

3.3 I can't find a particular site. Why is this?
To be indexed by the Penn State search engine your pages must:

If your Web site does not meet the criteria listed above, please contact search@psu.edu to have your site added to the search engine. If your Web site does meet the criteria and your pages are not found in the search engine, contact search@psu.edu.

For additional information, please see the "Excluded Web servers" section of the document "Info. for Webmasters/Web Content Providers".



Section 4: Policies

4.1 Can a site from outside the psu.edu domain be indexed?
Web sites that are owned and operated by Penn State or have been developed on behalf of Penn State may be indexed with our search engine.

4.2 Can server administrators/Webmasters register Penn State pages with the various search engines, such as Yahoo, Excite, Lycos, etc?
Yes, Penn State pages/sites can be registered with other search engines.

4.3 Are there any regulations or requirements concerning which search engine Penn State departments, colleges or academic units can use?
No, each Penn State college, department or academic unit can use the search engine of its choice. Departments, colleges and units are welcome to index sites with our search engine and then query our search engine to find documents on the departmental, college, or academic unit server (see Question 2.5, "How do I invoke the search engine from my Web page?").




The Pennsylvania State University ©2007. All rights reserved.
This site maintained by Academic Services and Emerging Technologies, a unit of Information Technology Services.

Comments and suggestions may be directed to The Penn State Search Engine Support Team.

Last revised: Monday, October 8, 2007.