Rubryx: a blend of experience and knowledge

Download Rubryx 2.2 - new version of our text classification program




Rubryx

Short Manual

Introduction

Rubryx is a program for pattern-based classification of web sites. It classifies large volumes of specialized text and generates web catalogs, electronic libraries, and reference systems on the basis of expert information and full-text analysis.

System requirements:

  1. Windows 95/98/Me/NT/2000/XP

  2. Pentium 100 MHz (Pentium III or higher is recommended)

How to work with the program:

  1. Make a list of classes.

    In the main window, use the "Add" and "Delete" buttons to make a list of classes.

  2. Select a few pattern documents for each class.

    In the main window, choose a class and double-click it. The "Selection of class patterns" dialog will appear. Use the "Add" button to make a list of a few documents (4-6) that fully represent the class, then press OK. The program will automatically generate the vocabulary from the selected patterns. This can take a few minutes.

    Do the same for each class.

  3. Define the catalog of sites.

  4. Choose the index.

    The index ranges from 1 to 100 and is chosen empirically. For an initial value, use the "Statistics" button.

  5. Press "Start".
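
The manual does not describe how Rubryx scores documents internally, so the following Python sketch is only a hypothetical illustration of the steps above. A vocabulary is built from a few pattern documents per class, each document gets a score from 0 to 100 for each class, and the document is assigned to every class whose score reaches the index. The function names, the word-overlap score, and the sample data are all assumptions, not the actual method used by Rubryx.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase a text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(pattern_docs, top_n=50):
    """Hypothetical vocabulary: the most frequent words of the pattern documents."""
    counts = Counter(word for doc in pattern_docs for word in tokenize(doc))
    return {word for word, _ in counts.most_common(top_n)}

def score(doc, vocabulary):
    """Assumed score: percentage (0-100) of the document's words found in the vocabulary."""
    words = tokenize(doc)
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in vocabulary)
    return 100.0 * hits / len(words)

def classify(doc, vocabularies, index):
    """Return every class whose score reaches the index (threshold)."""
    return sorted(name for name, vocab in vocabularies.items()
                  if score(doc, vocab) >= index)

# Steps 1-2: classes and their pattern documents (made-up sample data).
patterns = {
    "parsing": ["syntax parser grammar tree", "grammar rules for a parser"],
    "speech": ["speech recognition audio signal", "acoustic speech model"],
}
vocabularies = {name: build_vocabulary(docs) for name, docs in patterns.items()}

# Steps 3-5: the catalog to classify, the index, and the run itself.
catalog = [
    "a parser builds a syntax tree",
    "speech audio signal processing",
    "cooking recipes",
]
result = {doc: classify(doc, vocabularies, index=30) for doc in catalog}
for doc, classes in result.items():
    print(doc, "->", classes)
```

In this toy run the third document matches no class, which corresponds to a document excluded from all classes.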

Practical hints

The aim of the program is to classify documents as efficiently as possible. This requires a careful choice of the classes and of the threshold value of index K. Choose the classes so that they overlap as little as possible and cover most of the documents. Choose index K so that unrelated documents are not included in a class (K is too small) and suitable documents are not rejected (K is too big). Several preliminary classifications may be required.
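
The trade-off in choosing K can be illustrated with a small Python sketch. It assumes that each document already has a score from 1 to 100 for one class and that a document is included when its score is at least K. The scores, the manual relevance labels, and the function name are invented for illustration.

```python
# Invented (score, is_relevant) pairs for six documents and one class.
scored_docs = [
    (92, True), (75, True), (58, True),
    (41, False), (33, True), (12, False),
]

def evaluate(k, docs):
    """Count unrelated documents included and suitable documents rejected at threshold k."""
    unrelated_included = sum(1 for s, relevant in docs if s >= k and not relevant)
    suitable_rejected = sum(1 for s, relevant in docs if s < k and relevant)
    return unrelated_included, suitable_rejected

for k in (10, 35, 50, 80):
    included, rejected = evaluate(k, scored_docs)
    print(f"K={k}: unrelated included={included}, suitable rejected={rejected}")
```

With these invented numbers, K=10 lets both unrelated documents in, while K=80 rejects three suitable ones; the preliminary classifications are a search for a value in between.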

For preliminary classifications, take a sample of about 1 percent of the whole document collection. For example, to classify 100 thousand web sites, 1000 sites are enough for preliminary experiments. On the one hand, 1000 sites is a representative sample; on the other hand, classifying such a sample on an up-to-date computer takes only a few moments.
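
One way to draw such a sample is sketched below in Python. It assumes the list of sites is available as a plain list of names; the function name, the placeholder site names, and the fixed random seed are illustrative choices, not part of Rubryx.

```python
import random

def preliminary_sample(sites, fraction=0.01, seed=0):
    """Draw a random sample of about `fraction` of the sites, without repetition."""
    size = min(len(sites), max(1, round(len(sites) * fraction)))
    return random.Random(seed).sample(sites, size)

# Placeholder names standing in for 100 thousand real web sites.
all_sites = [f"site-{i}.example" for i in range(100_000)]
sample = preliminary_sample(all_sites)
print(len(sample))
```

A fixed seed keeps the sample the same between runs, so results obtained with different values of the index stay comparable.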

During classification, some documents may be excluded from all classes. Study these documents carefully: new classes may need to be added to the list, and some of the remaining documents may simply not belong in the catalog being generated. If many of the same documents fall into several classes, the subject matter of the catalog has been poorly divided into classes.
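
Both checks can be sketched in a few lines of Python. The sketch assumes the classification result has been exported as a mapping from document to the list of classes it was assigned to; the manual does not describe such an export, so the data structure, the names, and the sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical classification result: document -> assigned classes.
assignments = {
    "doc1": ["parsing"],
    "doc2": ["parsing", "speech"],
    "doc3": [],
    "doc4": ["speech", "parsing"],
    "doc5": [],
}

def unassigned(assignments):
    """Documents excluded from all classes; candidates for new classes."""
    return sorted(doc for doc, classes in assignments.items() if not classes)

def overlapping_pairs(assignments):
    """Count, for each pair of classes, the documents assigned to both."""
    pairs = Counter()
    for classes in assignments.values():
        pairs.update(combinations(sorted(set(classes)), 2))
    return pairs

print(unassigned(assignments))
print(overlapping_pairs(assignments).most_common())
```

A pair of classes with a high shared count is a candidate for merging or for a sharper choice of pattern documents.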

Once the sample classification gives good results, classify the whole document collection. As a result, you get one set of web sites with quality information for each class.

How to create a new dictionary

To tune the program to a new domain, you need to create a special dictionary. The dictionary is stored in three text files:

WordList.txt - the dictionary of one-word terms

WordLst2.txt - the dictionary of two-word terms

WordLst3.txt - the dictionary of three-word terms

To create these files, you can use Notepad or any other editor that saves plain text in ANSI encoding. Sample dictionary files for the domain "Computational Linguistics" are included in the distribution.
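
The files can also be generated by a script. The Python sketch below sorts a list of terms into the three files by word count. The manual does not specify the layout inside the files, so one term per line is an assumption; check the bundled sample files before relying on it. Treating "ANSI" as Windows code page 1252, writing Windows line endings, the function name, and the sample terms are likewise assumptions.

```python
import tempfile
from pathlib import Path

# File names from the manual, keyed by the number of words in a term.
FILES = {1: "WordList.txt", 2: "WordLst2.txt", 3: "WordLst3.txt"}

def write_dictionary(folder, terms, encoding="cp1252"):
    """Write terms to the three dictionary files, grouped by word count.

    Assumptions: one term per line, "ANSI" means code page 1252.
    Terms with more than three words are skipped.
    """
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    grouped = {n: [] for n in FILES}
    for term in terms:
        words = term.split()
        if len(words) in grouped:
            grouped[len(words)].append(" ".join(words))
    for n, name in FILES.items():
        # newline="\r\n" produces Windows line endings, as Notepad would.
        with open(folder / name, "w", encoding=encoding, newline="\r\n") as f:
            f.write("".join(term + "\n" for term in grouped[n]))
    return grouped

terms = ["parser", "machine translation", "part of speech", "corpus"]
out_dir = Path(tempfile.mkdtemp())  # temporary folder for the demonstration
grouped = write_dictionary(out_dir, terms)
print(sorted(p.name for p in out_dir.iterdir()))
```

For a language other than Western European ones, a different code page would have to be passed as `encoding`.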

Information

Rubryx is now freeware. You can use and distribute it without any limitations!

You can support this project.

All questions and comments may be sent to:




Copyright 2001-2012. All rights reserved