I think that the extent to which 'learned research papers' are unavailable on-line varies
considerably. While collecting material for my PhD, I have formed the impression that more than 50%
of the current generation of young professors in the fields of computational linguistics, NLP
and IR have put their doctorates and the majority of their publications on the web. This is not, I
believe, for self-publicity but to encourage interaction within the academic community. I would
expect other disciplines to be doing the same thing.
A simple way to recover a sub-language is to use the citation indexes, find the top 100 items
cited in the last ten years, and aim to recover 50% of these over the net. This gives you a
corpus which is, by citation count, the most influential (rather like getting top ratings
on TV) and thus both linguistically representative of and influential on the linguistic community
of the sub-language.
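
To make the selection step concrete, here is a rough sketch in Python. It assumes the citation index has already been exported as a list of records; the CitedItem type, its fields, and the function names are all hypothetical, invented for illustration, and "last ten years" is read here as publication year no earlier than current_year - 10.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CitedItem:
    """Hypothetical record for one entry exported from a citation index."""
    title: str
    year: int
    citations: int
    url: Optional[str] = None  # None means no online copy has been found


def top_cited(items: List[CitedItem], current_year: int,
              window: int = 10, n: int = 100) -> List[CitedItem]:
    """Return the n most-cited items published within the last `window` years."""
    # Assumption: "last ten years" means year >= current_year - window.
    recent = [it for it in items if it.year >= current_year - window]
    return sorted(recent, key=lambda it: it.citations, reverse=True)[:n]


def recoverable(items: List[CitedItem]) -> List[CitedItem]:
    """Keep only the items for which an online copy is known."""
    return [it for it in items if it.url is not None]


def meets_target(selected: List[CitedItem], target: float = 0.5) -> bool:
    """Check whether at least `target` of the selected items can be recovered."""
    if not selected:
        return False
    return len(recoverable(selected)) / len(selected) >= target
```

The corpus would then be whatever recoverable(top_cited(...)) returns, and meets_target says whether the 50% goal was reached. Note that the sketch ranks by raw citation count only, which is exactly the assumption I am asking about.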
What is the flaw in my method?
Christopher Brewster
University of Patras & University of Birmingham