*****  To join INSNA, visit http://www.insna.org  *****

If you have access to a Unix shell, this script will do the job on any plain text file. There might be a theoretical file size limit, but I've run dozens of gigabytes through a similar script with no problems.

Credit for this script should go to Doug McIlroy, reportedly the originator of this solution. Cite: Bentley, J., Knuth, D. and McIlroy, D. (June 1986) "Programming Pearls", Communications of the ACM, Volume 29, Number 6. Association for Computing Machinery.

#!/bin/sh
#
# wtally [files...] - word tally
#
# written by [log in to unmask] as a COMP2041 example
#
# list all the words found in standard input with a 
# count of how many times they occurred
# V2 [log in to unmask]  use traditional tr -cs to separate words,
# keep apostrophes

set -x 

cat "$@" |
    tr 'A-Z' 'a-z' |
    tr -cs "'a-z" '\n' |
    grep -v '^$' |
    sort |
    uniq -c |
    sort -nr

# Explanation of pipeline:
#   Since tr doesn't read from files, precede with cat
#   first tr is case mapper
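#   second tr isolates words (letters and apostrophes only; the hyphen
#   in 'a-z' marks a range, so hyphenated words are split)
#   grep removes the one empty line that can occur when the input
#   starts with a non-word character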
#   first sort prepares the data for counting
#   uniq -c does the counting
#   second sort reorders by decreasing frequency
#
# Note the use of double quotes to pass a literal apostrophe through to tr
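
To see the pipeline at work without saving anything to disk, here is a minimal sketch that wraps the same commands in a shell function and feeds it one line of text. The function form, the omission of set -x, and the sample sentence are my own assumptions for illustration and are not part of the original script.

```shell
#!/bin/sh
# Sketch only: the wtally pipeline as a function, run on made-up sample text.

wtally() {
    cat "$@" |                  # read the named files, or stdin if none given
        tr 'A-Z' 'a-z' |        # fold everything to lower case
        tr -cs "'a-z" '\n' |    # every run of other characters becomes one newline
        grep -v '^$' |          # drop the empty line left by a leading non-word character
        sort |                  # group identical words together
        uniq -c |               # count each group
        sort -nr                # most frequent words first
}

# The opening quote mark is a leading non-word character, so this input
# exercises the grep step as well.
printf '"The cat and the hat." Don'\''t the words repeat?\n' | wtally
```

The word "the" should come out on top with a count of 3, and "don't" stays in one piece because the apostrophe is in the set of characters that tr keeps. Note that this only handles plain text: a PDF or Word document has to be converted to text first (for a PDF, a tool such as pdftotext, if you have it installed, can do that), and the result can then be piped into the script.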

* Jay McKinnon [[log in to unmask]]
  Graduate Student
  School of Communication
  Simon Fraser University


----- Original Message -----
From: [log in to unmask]
To: [log in to unmask]
Sent: Saturday, March 26, 2011 4:29:39 PM
Subject: [SOCNET] words, words,...

Dear Colleagues

Since some issues of 'text analysis' occasionally appear on this SNA list,
let me ask your help in answering the following question:

Does there exist a computer program which

- takes as input a Word document or (even better for me) a pdf file
- returns as output the list of words which occur in the document with
  their frequencies of appearance

?

Best regards

Tad Sozanski

_____________________________________________________________________
SOCNET is a service of INSNA, the professional association for social
network researchers (http://www.insna.org). To unsubscribe, send
an email message to [log in to unmask] containing the line
UNSUBSCRIBE SOCNET in the body of the message.
