This document provides a brief introduction to the
effective use of the CCIR-NYC site search engine. Basic features are described below.
Advanced features are described further down the page.
|
| Basic Features
|
| SIMPLE SEARCHES: WORDS AND PHRASES |
The simplest kind of search expression is a word
or a phrase. A phrase is a sequence of one or more words. Only those
documents that contain the exact word or phrase provided will be selected
by the query.
Examples:
- mortality
- human
- population
- cause of death
- population distribution
Note that, unlike most popular search engines, this search engine does
not treat a phrase as a list of words. It interprets the phrase as a
fragment of a sentence. The exact phrase must occur in a document for
the search engine to find it.
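The difference between a phrase match and a list-of-words match can be sketched in a few lines of Python. This is only an illustration, not the engine's implementation: the tokenizer, the case-insensitive comparison, and the function names are assumptions made for the example.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens (assumed rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def contains_phrase(document, phrase):
    """True if the words of the phrase occur consecutively and in order."""
    doc_words = tokenize(document)
    phrase_words = tokenize(phrase)
    n = len(phrase_words)
    if n == 0:
        return False
    return any(doc_words[i:i + n] == phrase_words
               for i in range(len(doc_words) - n + 1))

def contains_all_words(document, phrase):
    """True if every word of the phrase occurs somewhere (list-of-words match)."""
    doc_words = set(tokenize(document))
    return all(word in doc_words for word in tokenize(phrase))

doc = "The cause was never found, but the death of the crop was recorded."
print(contains_phrase(doc, "cause of death"))     # phrase match, as described above
print(contains_all_words(doc, "cause of death"))  # list-of-words match
```

The sample document contains all three words but never the exact phrase, so only the list-of-words check succeeds.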
|
| BOOLEAN SEARCHES |
Many users will already be familiar with the concept
of using Boolean expressions in queries. You can use Boolean operators
to combine results from simpler queries in powerful ways. There are
three Boolean operators: AND, OR, and NOT.
When you write:
- word1 AND word2
you are saying that you want to locate all documents that contain
both word1 and word2. Documents that contain only one of the two
words, or neither, will not be selected.
For example, the search expression:
death AND cause
will select only those documents that contain both the word "death"
and the word "cause".
When you write:
- word1 OR word2
you are saying that you want to locate all documents that contain
either word1 or word2, or both. Documents that contain neither word1
nor word2 will not be selected.
For example, the search expression:
death OR mortality
will select only those documents that contain either the word "death"
or the word "mortality".
When you write:
- word1 NOT word2
you are saying that you want to locate all documents that contain
word1 but not word2. Documents that don't contain word1 will be ignored.
Documents containing both word1 and word2 will be ignored.
For example, the search expression:
death NOT cause
will select only those documents that contain the word "death"
but not the word "cause".
You can parenthesize Boolean expressions and combine them into ever
more complicated queries.
For example:
(disease AND cause of death) NOT natural causes
would select only documents containing both the word "disease"
and the phrase "cause of death", but not documents containing
the phrase "natural causes".
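One way to picture the Boolean operators is as set operations on the documents that match each simpler query. The sketch below assumes a tiny hypothetical index (the document IDs and the `docs` helper are invented for illustration) and says nothing about how the engine evaluates queries internally.

```python
# Hypothetical index: each word or phrase maps to the IDs of the
# documents that contain it. The IDs are made up for this example.
index = {
    "disease": {1, 2, 3},
    "cause of death": {2, 3, 4},
    "natural causes": {3},
}

def docs(term):
    """Documents containing the term (empty set if the term is unknown)."""
    return index.get(term, set())

both = docs("disease") & docs("cause of death")      # disease AND cause of death
either = docs("disease") | docs("cause of death")    # disease OR cause of death
without = docs("disease") - docs("natural causes")   # disease NOT natural causes

# (disease AND cause of death) NOT natural causes
result = (docs("disease") & docs("cause of death")) - docs("natural causes")
print(sorted(result))
```

With this sample index, documents 2 and 3 satisfy the AND, and document 3 is then removed by the NOT.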
|
| |
| Advanced Features |
| TELLING THE SEARCH ENGINE TO KEEP IT SIMPLE |
As you saw in the previous section, the search
engine reserves a small number of special words, such as AND, OR,
NOT, and WITHIN, that modify the meaning of the search
expression. Expressions such as:
- death and taxes
will be interpreted as a search for documents containing both the
word "death" and the word "taxes", rather than as a search for the
intended phrase "death and taxes".
You can tell the search engine to ignore the special meaning
of words in the search expression by writing portions of the search
expression in the form {phrase}.
For example:
death {and} taxes
to be {or not} to be
{to be or not to be}
water {within} house
(death {and} taxes AND health) NOT social security
The last example above matches only those documents that contain the
phrase "death and taxes" as well as the word "health" but excludes
documents that contain the phrase "social security".
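If you build search expressions in a program, a small helper can add the braces for you. The sketch below is an assumption-laden illustration: the `RESERVED` set lists only the special words named on this page (the engine's real list may be longer), and `escape_reserved` is a hypothetical name.

```python
from itertools import groupby

# Only the special words mentioned on this page; assumed to be incomplete.
RESERVED = {"and", "or", "not", "near", "within", "about"}

def escape_reserved(phrase):
    """Wrap each run of consecutive reserved words in braces."""
    parts = []
    is_reserved = lambda word: word.lower() in RESERVED
    for reserved, group in groupby(phrase.split(), key=is_reserved):
        words = " ".join(group)
        parts.append("{" + words + "}" if reserved else words)
    return " ".join(parts)

print(escape_reserved("death and taxes"))
print(escape_reserved("to be or not to be"))
```

The two calls reproduce the first two examples above, `death {and} taxes` and `to be {or not} to be`.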
|
| WORDS THAT ARE NEAR EACH OTHER IN A DOCUMENT |
When words or phrases occur near one another in a document,
there is a good chance that they relate to a single topic of
interest. A search that involves checking the
nearness of words and phrases is called a "proximity search".
The CIESIN search engine lets you describe the nearness of words
in two ways:
word1 NEAR word2
and
NEAR( (word1, word2, ...), n)
where n is some number.
In the first form above, the expression will select only those
documents containing word1 and word2, and only when word1 and word2
occur within 100 words of each other somewhere in the document.
For example:
death NEAR cause
The second form is more complicated. The expression will select only
those documents containing all of the words word1, word2, ..., and
only when all of the words occur in a group no longer than n words
in length. In other words, there must be some excerpt that can be
taken out of the document, consisting of no more than n words, and
that excerpt must contain all the search terms.
Consider the following example:
near((red tide,cause,sewage),50)
First, only documents that contain the phrase "red tide", and
the words "cause" and "sewage" will be considered.
Consider the following scenario:
Document 1: ...red tide...(30 words)...sewage...(30 words)...cause...
Document 2: ...cause...(20 words)...red tide...(30 words)...sewage...
With the search expression above, only the second document would match
the query, because the total distance from the first word to the last
word in Document 1 is 60 words, which exceeds the limit of 50, while
in Document 2 the distance is 50 words, which is within the limit.
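The two-document scenario can be checked with a short sketch. It assumes that each term's word positions are already known and that the distance is simply the last position minus the first. The engine's exact counting rules are not documented here, so the positions and function names below are illustrative only.

```python
from itertools import product

def smallest_span(positions_by_term):
    """Smallest distance (last position minus first) of any excerpt that
    contains one occurrence of every term, or None if a term is missing."""
    if not positions_by_term or any(not p for p in positions_by_term.values()):
        return None
    # Brute force: try every way of choosing one occurrence per term.
    return min(max(choice) - min(choice)
               for choice in product(*positions_by_term.values()))

def near(positions_by_term, n):
    """True if all terms fit inside an excerpt spanning at most n words."""
    span = smallest_span(positions_by_term)
    return span is not None and span <= n

# Hypothetical word positions matching the scenario above.
doc1 = {"red tide": [0], "sewage": [30], "cause": [60]}
doc2 = {"cause": [0], "red tide": [20], "sewage": [50]}

print(smallest_span(doc1), near(doc1, 50))
print(smallest_span(doc2), near(doc2, 50))
```

Trying every combination of occurrences is fine for a handful of terms. A real index would use a more efficient sliding-window search.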
The search engine also understands sentences and paragraphs. You can
use the WITHIN operator to require that two or more words occur in the
same sentence or paragraph, as follows:
(death AND cause) WITHIN SENTENCE
(death AND unnatural) WITHIN PARAGRAPH
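As a rough picture of what a sentence-level restriction means, the sketch below splits a document into sentences and checks whether all of the words fall into the same one. The sentence splitter is a naive assumption (a break after ".", "!" or "?"), and how the engine itself detects sentence boundaries is not described on this page.

```python
import re

def within_sentence(document, words):
    """True if a single sentence contains every one of the given words."""
    wanted = {word.lower() for word in words}
    # Naive rule: a sentence ends at ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", document):
        tokens = set(re.findall(r"[a-z0-9]+", sentence.lower()))
        if wanted <= tokens:
            return True
    return False

separate = "The cause is unknown. Death followed a week later."
together = "The cause of death is unknown."
print(within_sentence(separate, ["death", "cause"]))
print(within_sentence(together, ["death", "cause"]))
```

The first document contains both words but in different sentences, so only the second document would satisfy `(death AND cause) WITHIN SENTENCE` under these assumptions.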
|
| USING PATTERNS TO SEARCH |
|
Words can be misspelled, can occur in different tenses, can be
pluralized, and can have other forms that make it more difficult
to find matches using exact matching of words and phrases. To address
these problems, the search engine supports numerous pattern matching
tools to allow for more flexible searching. Here we will discuss
only a few of them: wildcards, word stemming, fuzzy matching, and soundex.
Wildcard (%)
A wildcard, %, matches any number of characters.
It is used when it is desirable to specify only a portion of a word
when searching. Examples are as follows:
| polluti% | matches words beginning with "polluti", such as pollution and polluting. |
| pol%ing | matches words beginning with "pol" and ending with "ing", such as polling, polluting, and politicking. |
| %lution% | matches words containing the sequence of letters "lution", such as pollution, solution, and resolutions. |
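To see how the % wildcard behaves, you can translate a pattern into a regular expression in which each % becomes "any run of characters". This is a minimal sketch. Matching whole words only and ignoring case are assumptions, and the function names are hypothetical.

```python
import re

def wildcard_to_regex(pattern):
    """Turn a %-wildcard pattern into a compiled regular expression."""
    # Escape the literal pieces, then join them with ".*" where % stood.
    pieces = [re.escape(piece) for piece in pattern.split("%")]
    return re.compile(".*".join(pieces), re.IGNORECASE)

def wildcard_match(pattern, word):
    """True if the whole word matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(word) is not None

words = ["pollution", "polluting", "polling", "politicking",
         "solution", "resolutions", "pollute"]
for pattern in ["polluti%", "pol%ing", "%lution%"]:
    print(pattern, [w for w in words if wildcard_match(pattern, w)])
```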
Word Stem ($)
The stem pattern finds words with the same stem form.
This is useful for finding "GOING" and "WENT"
from "GO", for instance. Examples:
| $go | matches words having the same stem as "go", including going, gone, and went. |
| $pollution | matches words having the same stem as "pollution", e.g. polluting, pollute, and pollutant. |
Fuzzy (?)
The fuzzy pattern finds words with similar
form. This is useful for finding mis-typed or mis-OCR'd words. The
fuzzy operator is ?. Example:
| ?cat | expands to cat, cats, calc, and case. |
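This page does not say how the engine decides that two words have a "similar form". One common measure is edit distance, the number of single-character insertions, deletions, and substitutions needed to turn one word into another. The sketch below assumes that measure, a threshold of 2, and a hypothetical vocabulary. It is an illustration, not the engine's algorithm.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                        # delete
                               current[j - 1] + 1,                     # insert
                               previous[j - 1] + (char_a != char_b)))  # substitute
        previous = current
    return previous[-1]

def fuzzy_expand(term, vocabulary, max_distance=2):
    """Vocabulary words within max_distance edits of the term."""
    return [w for w in vocabulary if edit_distance(term, w) <= max_distance]

# Hypothetical vocabulary of indexed words.
vocabulary = ["cat", "cats", "calc", "case", "dog", "category"]
print(fuzzy_expand("cat", vocabulary))
```

With this vocabulary and threshold, the expansion happens to contain the same four words as the ?cat example above.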
Soundex (!)
A soundex query finds words that sound similar. The soundex
operator is !.
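To give a feel for sound-alike matching, here is a sketch of the classic American Soundex code, which maps a word to its first letter plus three digits. Whether the search engine uses exactly this algorithm is an assumption, since this page does not say. Words that receive the same code are treated as sounding alike.

```python
# Consonant groups of classic American Soundex.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit

def soundex(word):
    """Four-character Soundex code of a word, or "" if it has no letters."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(letter, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets this, so a repeated code counts again
    return (result + "000")[:4]

for name in ["Robert", "Rupert", "Smith", "Smythe"]:
    print(name, soundex(name))
```

Under this scheme "Robert" and "Rupert" share one code and "Smith" and "Smythe" share another, which is the kind of grouping a sound-alike query relies on.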
|
| A SIMPLE TOOL THAT DOES A LOT |
ABOUT ()
ABOUT applies word stemming, wildcards, and other patterns to find
variations on the words and phrases given in the query. It uses
a variety of strategies to find as much information as possible that
might be relevant to your search expression. Examples:
about(temperature)
about(global climate change in the southern hemisphere)
|
| THE AMAZINGLY COMPLICATED FINAL EXAMPLE |
In order to illustrate the flexibility that you
have in defining search criteria, we offer the following very complicated
but potentially useful example:
about(causes of disease that result in unnatural death)
AND ($cause near water)
AND ( (pollut% AND infect%) WITHIN SENTENCE )
|
| GCRIO's search engine, developed by CIESIN and built on the InterMedia Text
Cartridge from Oracle Corporation, supports most of the InterMedia query language. |
|
For more information about CIESIN and our activities, contact CIESIN User Services.
Telephone: 1 (845) 365-8988 - FAX: 1 (845) 365-8922
CIESIN is a center within the Earth Institute at Columbia University.
Copyright © 2004-2005. The Trustees of Columbia University in the City of New York.