Advancing discovery
Promoting learning
Broadening participation
Infrastructure
Dissemination
Society at large
Advancing discovery. Wordcorr advances discovery at three levels of
user: experienced, student, and interested.
Experienced:
Experienced scholars with huge amounts of data profit from the fact
that
- no data or hypotheses leak out of Wordcorr.
- all the data are available all the time.
- observations expressing part of the investigator's analysis can be
  attached to any relevant unit.
- a Residue section holds everything for which the scholar has not yet
  found a place in the analysis, including things that do not fit the
  analysis because they are due to language contact or internal
  analogies.
As in all science, it's the things
that don't quite fit the big picture that are the cracks through
which new insights make their way to the inside of one's mental
box.
Furthermore, the fact that
scholars no longer have to spend most of their time on data
management details, to the detriment of thinking analytically,
means that novel ideas have a better chance of surfacing.
And the ability to set up separate
views to follow out the implications of several incompatible
hypotheses at the same time should lead to more thorough
documentation of the reasons for preferring or rejecting
alternative analyses.
Student:
Students of comparative linguistics have been known to fall asleep
after a certain number of pages of data that began to look all the
same. Experience so far suggests that Wordcorr's interactive approach
holds the user's attention with the intensity of the more
cerebral types of video game.
This means that the student using
a prepared data set is more motivated to retrace the original
scholar's path of discovery, and not just to read about the
conclusions the scholar reached and the controversies along the
way. In fact, the student just might see something the established
scholar missed.
Interested:
People with no linguistic background but plenty of curiosity are
welcome to try Wordcorr and browse the same demonstration data sets
that the developers of Wordcorr used to test the program. With the
help of this Web site and the Wordcorr Help facility they can learn
how to try their hand at making comparisons.
Some of them will not only be
motivated to play with real language data; they will discover
things about language in general they hadn't thought of before.
Personal discovery of that kind could lead some to become
linguists, and might soften the prejudice some people were brought
up with against languages other than their own.
Promoting learning. Graduate students in master's or doctoral
programs find Wordcorr useful for storing and archiving their field
data and combining them with other relevant data already available
from archives or publications. Then as they tabulate what they have
collected and form their own hypotheses about the patterns of
language divergence that underlie each correspondence set they find
in their tabulations, they go through what for graduate school is
the prime learning experience: doing a workmanlike job that
actually advances knowledge.
Classroom discussions at both
graduate and undergraduate levels should be interesting, since the
students are likely to notice things the professor has never dealt
with. The "interested" category may turn into an avenue
of self-directed learning.
Broadening participation. Trends in the kinds of papers accepted for
meetings such as those of the Linguistic Society of America suggest
that fewer people than before are active in comparative linguistic
research.
One reason may be that when
students of linguistics are exposed to the comparative method, they
find it exciting -- until it hits them how much picky work is
involved and how easy it is to overlook something, at which point
semantics begins to look like a better career choice. Knowing that
there is a tool that diminishes the picky work and makes it hard to
overlook anything could lead to an increase in participation.
The ability to form research teams
by exchanging files over the Internet can lead to
collaboration among many institutions, domestic and foreign. And
the Interested status may draw in students and members of the
public, including native speakers of some of the languages who are
otherwise underrepresented in linguistic scholarship because of
geographic or social isolation from mainstream academic
institutions, or lack of funding, but who nevertheless have much to
contribute.
Infrastructure.
Linguist List lets any linguist search any archive that follows the
norms put forth by the Open Language Archives Community (OLAC), including the
Linguist List's own Electronic Metastructure for Endangered
Languages Data (EMELD). In addition,
Linguist List as eventual host of this Web site provides a
practical point of contact for exchanging Wordcorr files among the
Wordcorr community over the Internet. In this way research teams
can form and discuss each other's analyses. Professors of
linguistics and their students can interchange data and analyses.
The Wordcorr design began with an
impossible alternative: to create and maintain an infrastructure
capable of managing the data and analytical work necessary to
complete a thorough classification of all the world's languages in
a single large database.
Were it to operate on that scale,
it would contain a data component of around 200 megabytes:
say 10,000 speech varieties with an average of 1,000
entries per variety, each entry holding a datum of on average
10 segments of Unicode characters in UTF-8 encoding. (The figure of
10,000 varieties is based on the 6,900 living languages in the 15th
edition of the Ethnologue, plus revisiting poorly documented dialects
of known languages and finding varieties that linguists are still not
aware of; Kurebito 2001 is an example of an actual 1,000-entry data
list.)
Such world scale tabulation could
well involve 1,000 investigators, with each investigator looking at
100 varieties and some varieties being looked at by more than one
investigator, through 10 different views with annotations of 10
bytes for each datum. That would make the results component a
little over 10 gigabytes, assuming 50 protosegments per view, 100
correspondence sets per protosegment, and 100 Unicode characters in
UTF-8 encoding in each correspondence set, plus 20 4-byte citations
per set. With a 50% overhead for the management component tables
and behind-the-scenes linking tables, this would put the worldwide
database for comparative linguistics at around 45 gigabytes, a
modest size as serious databases go.
But such a world size database
would require costly maintenance over decades. One could guess that
in five years the data component might grow to 5,000 varieties
averaging 500 entries per variety, requiring 50 megabytes. By that
time tabulation activity might reach 300 investigators working on
an average of 50 varieties each, in 3 views, giving 450 megabytes.
Results would still be around 50 protosegments per view and 100
correspondence sets per protosegment, but only 50 segments per
average correspondence set because of the 50-variety scope, and
fewer available citations per set, giving another 450 megabytes.
With overhead, the actual database in five years would be around
1.5 gigabytes, which would fit on the 2002 model laptop computer this
page is being edited on with room to spare.
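The arithmetic behind these two estimates can be sketched in a few lines of Python. This is only a rough check under stated assumptions, not the calculation the project actually used: the field names are invented for illustration, the figure of 2 bytes per UTF-8 character is an assumption, and the citation count for the five-year case is a guess, since the text says only that there would be fewer citations. The totals are therefore order-of-magnitude figures and need not match the numbers quoted above.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    varieties: int                   # speech varieties in the data component
    entries: int                     # average entries per variety
    segments_per_datum: int          # average characters per datum
    investigators: int
    varieties_per_investigator: int
    views: int                       # views per investigator
    annotation_bytes: int            # annotation bytes per datum per view
    protosegments: int               # protosegments per view
    sets_per_protosegment: int
    chars_per_set: int               # characters per correspondence set
    citations_per_set: int           # 4-byte citations per set
    bytes_per_char: int = 2          # ASSUMPTION: typical IPA text in UTF-8
    overhead: float = 0.5            # management and linking tables


def estimate_bytes(s: Scenario) -> dict:
    """Return rough byte counts for each component and the overall total."""
    data = s.varieties * s.entries * s.segments_per_datum * s.bytes_per_char
    annotations = (s.investigators * s.varieties_per_investigator
                   * s.entries * s.views * s.annotation_bytes)
    per_set = s.chars_per_set * s.bytes_per_char + s.citations_per_set * 4
    results = (s.investigators * s.views * s.protosegments
               * s.sets_per_protosegment * per_set)
    total = int((data + annotations + results) * (1 + s.overhead))
    return {"data": data, "annotations": annotations,
            "results": results, "total": total}


WORLD = Scenario(varieties=10_000, entries=1_000, segments_per_datum=10,
                 investigators=1_000, varieties_per_investigator=100,
                 views=10, annotation_bytes=10, protosegments=50,
                 sets_per_protosegment=100, chars_per_set=100,
                 citations_per_set=20)

FIVE_YEARS = Scenario(varieties=5_000, entries=500, segments_per_datum=10,
                      investigators=300, varieties_per_investigator=50,
                      views=3, annotation_bytes=10, protosegments=50,
                      sets_per_protosegment=100, chars_per_set=50,
                      citations_per_set=10)  # citation count is a guess

if __name__ == "__main__":
    for name, scenario in (("world", WORLD), ("five years", FIVE_YEARS)):
        sizes = estimate_bytes(scenario)
        print(name, {k: f"{v / 1e9:.2f} GB" for k, v in sizes.items()})
```

Changing `bytes_per_char` or `annotation_bytes` shows how sensitive the totals are to those per-item sizes, which is the main reason such estimates are best read as orders of magnitude.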
Educational use of Wordcorr to
teach comparative phonology might swell this number to 2 gigabytes;
but it is unlikely to strain the resources, because educational
users are likely to stick to small collections with relatively few
varieties. (Agard's excellent pedagogical presentation of the
Romance language family (1984),
for example, has 475 entries for 8 varieties; it would be ideal as
a data set for educational use with Wordcorr.)
The important decision for
Wordcorr turned out to be designing it so that one installation
could in principle handle a huge amount of data, but committing
resources only for what could be developed in the two years of
funding. Having multiple copies of Wordcorr data collections spread
independently around the world is a better way to go than putting
them in just a few archival databases, because dispersal can be
done free over the Internet, and is an effective kind of insurance
against catastrophes. That way, we can keep operating without an
enormous database that requires a staff of its own.
So the main focus of the Wordcorr
Project quickly came to be the standalone application, with its own
local database inside the computer of an individual investigator.
If that person is out in the field collecting and analyzing primary
data, it doesn't matter if there is easy Internet access or not.
The person may have already been working alone for years and may
not be in a position to join in team research. But whenever there
is opportunity to connect with colleagues over the Internet,
everything is ready to go.
One person's data component is not
likely to be larger than 100 varieties, and many investigators
collect only about 300 entries per variety, giving about a megabyte
total. The results of tabulation for just one investigator, not
hundreds, for 100 varieties and perhaps 3 views, give another
megabyte. The results part is comparable to the others, averaging
50 protosegments per view, 100 sets per protosegment, 100 segments
per set with citations, giving around 2 megabytes. The grand total
for a substantial amount of data is under 5 megabytes. (Behind all
that, the Wordcorr program and its database take up 13 megabytes,
and the Java Runtime Environment that Wordcorr draws on is 68
megabytes.)
Once linkages between individual
investigators start to form, file exchanges over the Internet among
colleagues (even scholars who work in remote locations can stumble
across an Internet cafe every now and then) can turn into
productive research networks.
Dissemination.
We had proposed to circulate a printed report in the usual fashion,
to maybe a few dozen interested scholars. But we realized that with
around 400 downloads already out on Release 2.0 of Wordcorr, this
Web site (which as of November 2005 attracts over 550 different
visitors each month) is much more effective as a means of
dissemination.
Society at large. There is always public interest in knowing about how
languages have developed and diverged.
- Archaeology
- Genetics
- Comparative linguistics
are our main sources of knowledge
about the paths taken by peoples whose history has never been
written, and about what may have gone on in times before any
history was written anywhere.
At the other end of the scale of
language relationships, knowing about the ways in which closely
related speech varieties can diverge from each other meshes with
Agard's typology of sound changes that result in language
differentiation by inhibiting intelligibility (Agard 1984, pp. 41-47; Grimes 1995a, pp. 4-8; Milliken 1988). Intelligibility and
lack of intelligibility among speech varieties are also of interest
to educators in multilingual or multidialectal areas.
For example, when Grimes was on
Saipan Island in the Northern Marianas consulting with a project on
Carolinian languages, he met a high school principal from Pohnpei
in the Federated States of Micronesia. The educator was concerned
with providing school texts for students on island chains. In many
parts of the Pacific, people on island chains speak related
languages that are unlike enough that their speakers do not
understand each other readily unless they have learned the other
varieties as second languages. (German and Dutch, or Spanish and
Catalán, are European examples of the same phenomenon.) Grimes had
worked out preliminary comparative tables for Carolinian by
traditional paper and pencil means, and had used them with the STAMP
program (Weber et al. 1990) to switch a folk tale from one variety to
another. When the educator saw the output, he saw it as a
possible solution for his textbook problem.
Making it easy for people at large
to try their hand at language comparison using Wordcorr could have
a modest societal impact in two different directions. First, it may
well attract more people into linguistics. Second, it may encourage
people to discover for themselves the patterning and beauty of
languages that they had previously thought of as primitive or
deficient.
Look at references for this background
material, or at the CSH Collections
to see how Wordcorr has already helped in data preservation.
How it works: Wordcorr gives you control over five main
functions, called Data, Views, Annotate, Tabulate, and Refine.
Data
includes inputting, editing, importing and exporting files, and
inspection of the data in a collection. Data are stored in text
form (audio samples are useful too, but that's for the future). You
have easy access to the entire IPA phonetic alphabet.
Views
allows you to define multiple views of the common data in order to
try out different approaches to analysis. For each individual in a
research team or linguistics class, having common data is the
starting point for defining multiple views of the data, and
annotating the same data differently for each view. Views can differ
in coverage and ordering of speech varieties. By setting up different
views, you can follow out the implications even of conflicting
hypotheses.
Annotate
lets you tell Wordcorr your judgments about which forms in an entry
might be treated as cognates, and how the segments in them are to
be lined up for comparison. You may modify the annotations as you
go. If they change, Wordcorr helps you unroll the original
tabulation and step through it again on the new arrangement.
Tabulate
takes the data and annotations for the entries and groups in a
particular view and from them generates the correspondence sets
that are the primary pieces of evidence in comparative analysis.
You specify a phonological environment and a tentative protosegment
for each correspondence set, and Wordcorr organizes the sets
accordingly. You can look at the
complete register of correspondence sets (including residual sets),
and at the annotated data they are derived from, at any stage of
the tabulation.
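The idea behind tabulation can be illustrated with a minimal Python sketch. It is not Wordcorr's own code or data model: the function name, the gap symbol, and the toy forms for varieties A, B and C are all invented for illustration. It assumes that the forms in each entry have already been judged cognate and aligned to equal length, and it simply skips entries that lack a form for some variety in the view.

```python
from collections import defaultdict

GAP = "-"  # hypothetical placeholder for a segment with no counterpart


def tabulate(view, aligned_entries):
    """Group aligned columns into correspondence sets.

    view: ordered list of variety names (a view fixes coverage and order).
    aligned_entries: {gloss: {variety: [segment, ...]}}, with the forms in
        each entry already aligned to the same length.
    Returns {correspondence_set: [(gloss, column_index), ...]}.
    """
    sets = defaultdict(list)
    for gloss, forms in aligned_entries.items():
        # Simplification: skip entries with no form for some variety.
        if any(variety not in forms for variety in view):
            continue
        rows = [forms[variety] for variety in view]
        if len({len(row) for row in rows}) != 1:
            raise ValueError(f"forms for {gloss!r} are not aligned")
        # Each aligned column is one occurrence of a correspondence set.
        for index, column in enumerate(zip(*rows)):
            sets[column].append((gloss, index))
    return dict(sets)


# Invented toy data, not taken from any real language.
ENTRIES = {
    "hand": {"A": ["p", "a", "t"], "B": ["f", "a", "t"], "C": ["p", "a", GAP]},
    "foot": {"A": ["p", "i"], "B": ["f", "i"], "C": ["p", "e"]},
}

if __name__ == "__main__":
    for corr_set, evidence in tabulate(["A", "B", "C"], ENTRIES).items():
        print(corr_set, evidence)
```

Calling `tabulate(["A", "B"], ENTRIES)` on the same data gives different sets, which is the sense in which separate views of common data support separate analyses.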
Refine
allows you to change how the results are arranged, by
correspondence sets in clusters representing a particular
protosegment and environment. You can move sets from the place
where they were registered on initial tabulation to a more
appropriate place. The cluster concept allows you to work with
correspondence sets for which data are missing, which may be
indeterminate as to where they fit the analysis best. Clusters also
help in filtering out sets that represent borrowings or internal
analogies, by moving them into Residue. Display
Evidence constructs a presentation suitable as an appendix to a
comparative monograph, containing a listing as complete as the
investigator wants of detailed evidence for each conclusion the
linguist has reached, starting with the most convincing evidence.
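The regrouping that Refine supports can be pictured with a small Python sketch. The dictionary layout, the `RESIDUE` key and the `move_set` function are hypothetical names chosen for illustration and say nothing about how Wordcorr stores its results; the sketch assumes only that each cluster is identified by a protosegment and an environment.

```python
RESIDUE = ("Residue", None)  # hypothetical key standing for the Residue section


def move_set(clusters, corr_set, source, target):
    """Move one correspondence set from cluster `source` to cluster `target`.

    clusters: {(protosegment, environment): [correspondence sets]}
    """
    if corr_set not in clusters.get(source, []):
        raise KeyError(f"{corr_set} is not registered under {source}")
    clusters[source].remove(corr_set)
    clusters.setdefault(target, []).append(corr_set)
    if not clusters[source]:
        del clusters[source]  # drop a cluster once it is empty
    return clusters


# Invented example: a set first registered under *p word-initially is
# judged to reflect borrowing and is moved into Residue.
clusters = {
    ("*p", "#_"): [("p", "f", "p"), ("p", "p", "b")],
    ("*a", "C_C"): [("a", "a", "a")],
}
move_set(clusters, ("p", "p", "b"), ("*p", "#_"), RESIDUE)
```

After the call, the cluster for `*p` keeps only the first set and the moved set is listed under `RESIDUE`.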