***** To join INSNA, visit http://www.insna.org *****

Dear colleagues,

 

I have extended and updated both ti.exe and fulltext.exe. Ti.exe generates a semantic co-word map from a set of lines of no more than 1000 characters each, whereas fulltext.exe uses a set of documents. The extensions are as follows:

 

First, both programs now also write the word-document matrix as matrix.txt (comma-separated values). This file can be read into SPSS when there are more than 255 variables. The file labels.sps can be used to label the variables, up to a maximum of 1023 variables. (SPSS does not read more than 255 variables from a .dbf or .xls file.) Both programs have a limit of 1023 words; the number of records (documents) is limited only by the available disk space. (A manual for applications to content analysis can be found here.)

 

After generating the word-document matrix and the Pajek input files (cosine.dat and coocc.dat; cf. Leydesdorff & Welbers, in press), the program asks whether one also wishes to run the same routines with observed/expected values. This generates obsexp.dbf (analogous to matrix.dbf), obsexp.txt (analogous to matrix.txt), and coocc_oe.dat and cos_oe.dat, analogous to the input files for Pajek (see the websites), but now containing or operating on the observed/expected values instead of the observed ones. Note that answering “y” (yes) extends the processing time of the original routine; therefore, the default is “n”. The SPSS syntax file labels.sps is the same because the variables are not changed.

 

Similarly, one can use (with the same variable labels) the file TfIdf.dbf which contains the “term frequency-inverse document frequency” values as used in library and information science. The expected values are separately stored in the file expected.dbf. The file obs_exp.dbf contains the signed (!) differences between observed and expected values at the cell level of the matrix. (These are the non-standardized residuals of the chi-square.)
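As an illustration of what these files contain, here is a minimal Python sketch of the computation. It assumes that the expected values are the usual chi-square expectations (row total times column total, divided by the grand total); the programs themselves may compute them differently in detail. The function names and the small example matrix are hypothetical and are not part of ti.exe or fulltext.exe.

```python
def expected_values(observed):
    """Expected counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def signed_differences(observed, expected):
    """Cell-level observed minus expected (non-standardized residuals)."""
    return [[o - e for o, e in zip(o_row, e_row)]
            for o_row, e_row in zip(observed, expected)]


def observed_over_expected(observed, expected):
    """Cell-level observed/expected ratio; 0.0 where the expected value is zero."""
    return [[o / e if e > 0 else 0.0 for o, e in zip(o_row, e_row)]
            for o_row, e_row in zip(observed, expected)]


# Hypothetical word-document matrix: rows are documents, columns are words.
observed = [[2, 0, 1],
            [1, 3, 0]]
expected = expected_values(observed)
differences = signed_differences(observed, expected)
ratios = observed_over_expected(observed, expected)
```

The signed differences in each row sum to zero (up to rounding), because the expected values preserve the row totals.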

 

The corresponding Pajek files can be generated by replacing the matrix values in cos_oe.dat with, for example, the cosine matrix of the values in TfIdf.dbf. (Cosine values can be generated in SPSS under Analyze > Correlate > Distances.) One can also replace the matrix values with the non-normalized values. This should work without problems. Note that the number of cases can be different because rows with no values other than zero are in some cases removed in order to prevent divisions by zero in the computation. However, the number and order of the variables remain the same.
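For readers who prefer not to use SPSS, the cosine between the word (column) vectors can also be computed directly. The following is a minimal Python sketch under the assumption that the matrix has documents as rows and words as columns; the function name and the example data are hypothetical. Columns that contain only zeros are given a cosine of 0.0 here to avoid a division by zero.

```python
import math


def cosine_matrix(matrix):
    """Cosine similarity between all pairs of column vectors of a row-major matrix."""
    columns = list(zip(*matrix))
    norms = [math.sqrt(sum(v * v for v in col)) for col in columns]
    n = len(columns)
    result = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            # Skip all-zero columns: their cosine is undefined and stays 0.0.
            if norms[a] > 0 and norms[b] > 0:
                dot = sum(x * y for x, y in zip(columns[a], columns[b]))
                result[a][b] = dot / (norms[a] * norms[b])
    return result


# Hypothetical matrix: three documents (rows) and three words (columns).
matrix = [[1, 0, 1],
          [0, 2, 0],
          [1, 0, 1]]
cosines = cosine_matrix(matrix)
```

In this example the first and third words occur in exactly the same documents, so their cosine is 1 (up to rounding), while the second word shares no documents with them and has a cosine of 0.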

 

After processing, the file words.dbf contains for all words the following summations over the (column) vectors for each word:

  1. A variable named “Chi_Sq” which provides Chi-square contributions for each of the variables; these are defined for word i as $\chi^2_i = \sum_j (\mathrm{Observed}_{ij} - \mathrm{Expected}_{ij})^2 / \mathrm{Expected}_{ij}$. In other words, the sum of the contributions over the column for the variable in each row (Mogoutov et al., 2008);
  2. A variable named “Obs_Exp” which provides the sum of |Observed – Expected| for the word as a variable summed over the column;
  3. A variable named “ObsExp” which provides the quotient Obs/Exp for the word as a variable, summed over the column;
  4. A variable named “TfIdf” (that is, Term Frequency * Inverse Document Frequency) defined as follows: $\mathrm{TfIdf}_{ik} = \mathrm{FREQ}_{ik} \cdot \log_2 (n / \mathrm{DOCFREQ}_k)$. This function assigns a high degree of importance to terms occurring in only a few documents in the collection (Salton & McGill, 1983, p. 63);
  5. The word frequency within the set.
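The five summations can be sketched in Python as follows. This is only an illustration under stated assumptions: documents are rows and words are columns, the expected values are the usual chi-square expectations, and “ObsExp” is read as the sum of the cell-level Obs/Exp ratios over the column. The function name and the example matrix are hypothetical; the actual routines in ti.exe and fulltext.exe may differ in detail.

```python
import math


def word_summaries(observed):
    """Column-wise summaries for each word in a document-by-word count matrix."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    n_docs = len(observed)
    summaries = []
    for k, column in enumerate(zip(*observed)):
        chi_sq = abs_diff = ratio_sum = 0.0
        for i, obs in enumerate(column):
            exp = row_totals[i] * col_totals[k] / grand_total
            abs_diff += abs(obs - exp)
            if exp > 0:  # cells with a zero expectation are skipped
                chi_sq += (obs - exp) ** 2 / exp
                ratio_sum += obs / exp
        # Document frequency: number of documents in which the word occurs.
        doc_freq = sum(1 for obs in column if obs > 0)
        idf = math.log2(n_docs / doc_freq) if doc_freq else 0.0
        summaries.append({
            "Chi_Sq": chi_sq,
            "Obs_Exp": abs_diff,
            "ObsExp": ratio_sum,
            "TfIdf": sum(obs * idf for obs in column),
            "Frequency": col_totals[k],
        })
    return summaries


# Hypothetical matrix: two documents (rows) and two words (columns).
observed = [[2, 0],
            [1, 3]]
summaries = word_summaries(observed)
```

In this example the first word occurs in both documents, so its Tf-Idf is zero; the second word occurs in only one of the two documents and therefore receives a positive Tf-Idf.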

 

These programs were originally written for DOS as 16-bit applications. Increasingly, 64-bit machines are no longer able to run them (unless one downloads Virtual PC and runs it in XP mode). I have therefore uploaded versions which are recompiled under Windows-32. These versions are much larger (> 7 Mbyte) and more error-prone: I only recompiled them (using a different compiler) without systematic debugging, but I will respond to feedback about error messages. One can disregard error messages which do not stop the program. The versions for 64-bit machines can be found here: ti.exe and fulltext.exe.

 

Error messages are displayed when working from the command prompt. (These programs still look like DOS programs, but they are fully Windows-32.) One can run the old versions on older machines, or on 64-bit machines using Virtual PC in XP mode. Please consider these programs as legacy software; you are most welcome to use them for scholarly purposes, but at your own risk!

 

Best wishes,

Loet

 

References:

Loet Leydesdorff & Kasper Welbers (2011). “The semantic mapping of words and co-words in contexts.” Journal of Informetrics (in press); preprint version available at http://arxiv.org/abs/1011.5209.

 

Esther Vlieger & Loet Leydesdorff (2009). “How to analyze frames using semantic maps of a collection of messages? Pajek Manual.” Amsterdam: University of Amsterdam.

 

** apologies for cross-postings


Loet Leydesdorff

Professor, University of Amsterdam
Amsterdam School of Communications Research (ASCoR)
Kloveniersburgwal 48, 1012 CX Amsterdam.
Tel. +31-20-525 6598; fax: +31-842239111

http://www.leydesdorff.net/
Visiting Professor, ISTIC, Beijing; Honorary Fellow, SPRU, University of Sussex

 

_____________________________________________________________________ SOCNET is a service of INSNA, the professional association for social network researchers (http://www.insna.org). To unsubscribe, send an email message to [log in to unmask] containing the line UNSUBSCRIBE SOCNET in the body of the message.