Page last modified: Friday 19th August
This section aims to give instructions for using all of the software's different areas and functions. For information on downloading the software see the availability section. If you have a question that is not answered in this user guide then see the FAQ section.
This user guide represents VARD 2.4.2.
Running the tool
VARD 2 is built with the Java programming language, and at least version 6 of the Java Runtime Environment (JRE) is required to use the tool. This is free and can be downloaded from the Java website if not already present on your computer.
The tool is delivered in a compressed ".zip" file, from which the files must be extracted to run the program. Most operating systems now have built-in functionality to do this; alternatively, any unzipping program, such as WinRAR or WinZip, can be used. To run VARD 2 use "run.bat" in Windows, "run.command" in Mac OS X, or "run.sh" in Linux (run "chmod a+x run.sh" first). Alternatively, open a command prompt, change to the VARD directory and type the following command:
java -Xms256M -Xmx512M -jar gui.jar
256 and 512 indicate the memory, in megabytes, allocated to the Java Virtual Machine (the initial and maximum heap size respectively). The default values should be sufficient for most purposes; however, if your system only has 512MB of memory, these should be reduced to 128 and 256. It is not recommended to lower these values any further than 128/256. The memory can be raised if you are dealing with particularly large text files (100,000 tokens plus) and your system has sufficient memory. The higher figure should be no more than half of your system memory, and the lower value should usually be half of the higher value.
Upon running the program you are presented with three options: Interactive, Batch or Training modes. Select the option required for your current session. You can also select the setup folder to use for the current session, either with the browse button or by dragging and dropping the folder into the setup folder text box. If the selected folder does not exist or any of the setup files are not present, they will be created once a mode has been selected.
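The sizing guidance above can be written out as a small calculation. This is only a sketch: `suggest_heap` and `java_command` are hypothetical helper names, not part of VARD, and they simply encode the rules of thumb stated in this section.

```python
def suggest_heap(system_mb, wanted_max_mb=512):
    """Return (xms, xmx) in megabytes, following the guidance in this section."""
    # The higher figure should be no more than half of the system memory...
    xmx = min(wanted_max_mb, system_mb // 2)
    # ...but going below 128/256 is not recommended.
    xmx = max(xmx, 256)
    # The lower value should usually be half of the higher value.
    xms = xmx // 2
    return xms, xmx


def java_command(system_mb, wanted_max_mb=512):
    """Build the launch command shown above with the suggested values."""
    xms, xmx = suggest_heap(system_mb, wanted_max_mb)
    return f"java -Xms{xms}M -Xmx{xmx}M -jar gui.jar"
```

For example, `suggest_heap(512)` gives the reduced 128/256 pair, while a larger `wanted_max_mb` can be requested for particularly large texts on a machine with enough memory.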
It is not recommended to run more than one instance of the VARD 2 tool as this will require a large amount of system memory and may cause instability especially when loading and saving the tool's stored data. If you would like to change modes then close the current session and start the program again.

VARD 2 can process a variety of text formats, including plain text, rich text format (".rtf") and text containing SGML and XML tags. The tool cannot be used with Microsoft Word documents (".doc") or Adobe Portable Document Format (".pdf") files. There is no restriction on file extensions; VARD will attempt to process any file opened by the user. If your files are not successfully processed with the tool, feedback would be appreciated so that the possibility of dealing with these files in future versions can be explored.
Previously processed files saved with xml tags can also be opened with the tool and processing continued from when the file was saved.
A new feature for VARD 2.3 allows the user to define how a word should be detected, i.e. what characters are valid in a word. This is set in the options file.
Ignoring text
By default VARD 2 will ignore anything between < and > (apart from its own xml tags), [ and ], or { and }. This means that any text within these brackets will not be processed by VARD 2. You can stop the tool from ignoring text between any bracket set or add additional structures to ignore (such as text between <header>...</header>) by editing text_to_ignore.txt.
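The default behaviour can be pictured with a short regular-expression sketch. It is not VARD's implementation: the pattern and the function name `words_to_process` are assumptions for illustration, the special handling of VARD's own XML tags is not modelled, and the format of text_to_ignore.txt is not reproduced here.

```python
import re

# The three default bracket pairs described above: <...>, [...] and {...}.
IGNORED = re.compile(r"<[^>]*>|\[[^\]]*\]|\{[^}]*\}")


def words_to_process(text):
    """Blank out ignored spans, then split what is left on whitespace."""
    visible = IGNORED.sub(" ", text)
    return visible.split()
```

Called on `"the [sic] olde <b>shoppe</b> {note}"`, only `the`, `olde` and `shoppe` remain for processing.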
Encoding
VARD 2 will make a best attempt at detecting the encoding of a text, read the text accordingly and subsequently save the file with the detected encoding. However, this is not 100% reliable and it is often safer to enforce an encoding if it is known; this can be set in the options file.
Any characters encoded in structures like &(#x)1234; will be converted into Java Unicode characters. This means they will be processed like normal characters and, providing the selected font can display the character, will be displayed normally in the interactive version. Upon saving the document, all entities converted to Java Unicode characters will be converted back into their original state.
Standard XML entities will be converted into the correct character by default (and reverted back upon saving):
- & = &
- ' = '
- > = >
- < = <
- " = "
This behaviour can be edited and additional entity mappings created through entities.txt. Note that entity replacement occurs after ignoring text, hence text between &lt; and &gt; will not be ignored by default.
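The conversion on reading can be sketched as follows. This is an illustration only: `decode_entities` is a hypothetical name, the mapping covers just the five standard XML entities plus numeric character references, and the step that restores the original entities on saving is not shown.

```python
import re

# The five standard XML entities and the characters they stand for.
STANDARD = {"amp": "&", "apos": "'", "gt": ">", "lt": "<", "quot": '"'}

# Matches &#x1234; (hexadecimal), &#1234; (decimal) and named entities.
ENTITY = re.compile(r"&(#x[0-9A-Fa-f]+|#[0-9]+|[A-Za-z]+);")


def decode_entities(text):
    """Replace recognised entities with characters; leave unknown ones as-is."""
    def replace(match):
        body = match.group(1)
        if body.startswith("#x"):
            return chr(int(body[2:], 16))
        if body.startswith("#"):
            return chr(int(body[1:]))
        return STANDARD.get(body, match.group(0))

    return ENTITY.sub(replace, text)
```

Because the substitution is a single pass, a sequence such as `&amp;lt;` becomes `&lt;` rather than being decoded twice.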

Interactive version
The interactive version of the tool allows for the manual processing of a single text at a time. The user interface is similar to that found in modern word processing applications, with a main window displaying the text and spelling variants highlighted in yellow. You can right-click on any word to be presented with different options depending on the type of word (variant, not variant or normalised).

Inserting text
Text can be inserted either by opening a file or by pasting text into the application. As described above, text files can be in a variety of formats; this also applies to pasted text. Use 'Open' in the 'File' menu or the corresponding toolbar button; a standard file opening dialog will be displayed, in which you select the file you wish to process. Another option is to drag and drop a file from your system into the main text area; the file will then be loaded as if selected using 'Open'. Before the new file is loaded, you will be prompted to save all current data, as when exiting.
Text can be pasted by using 'Paste' in the 'Edit' menu or the corresponding toolbar button. The text will always be appended to the end of the document, with a line-break inserted above the pasted text. You can also add text in a similar way by dragging and dropping selected text from another application into the main text area. To clear the current document use the 'New' command in the 'File' menu or the corresponding toolbar button.
Note: an empty line will always be inserted at the start of the document to ease issues with updating the very first position in the document; this line is not saved in VARD's output.
Once text has been inserted it will be automatically analysed by the software and the words will be placed into the different categories.

Main text area
Once text is available to the software it is displayed in the main text area. The text is not manually editable as any changes made to words need to be tracked by the software. Words belonging to the category selected in the sidebar are highlighted within the text by a yellow background. Right-clicking any word in the document will bring up a pop-up menu which differs depending on the word's category; the options are detailed in the category descriptions below.

Word Categories
Words are placed (either manually or automatically) into one of three categories: variants, non-variant forms and normalised words. Right-clicking on any word in the text (or in the types list) will bring up a popup menu with different options depending on the word's category; these are detailed below. One option common to all categories is Find word in list. This changes the currently selected word category to that of the clicked-on word and selects the word in the types list. As a result, all instances of the word are highlighted with a green background (the current instance is highlighted in red), and the user can cycle through and process each instance of the word using step complete.

Variants
VARD 2 currently adds a word to the variant category when its form is not found in the modern lexicon; words from a text can also be added to the category manually by the user. The variant category contains all words which the system deems necessary to replace (either manually or through automatic processing). Options offered when right-clicking on a word from the variant category consist of:
Replacement suggestions: A ranked list of suggested replacements is given; the ranking is based on the confidence score. Each suggested replacement can be applied to the single instance (Normalise instance) or all instances of the word form (Normalise All). If the replacement word is not in the modern lexicon it can be added for future reference - this option is given as a dialog after the replacement has been made.
Normalise to...: If the correct replacement is not offered in the suggested list a user can provide their own replacement with this option. The user is presented with a textbox to input the replacement and is given the option to normalise all instances. As with normalising from the suggestions list, if the replacement word is not in the modern lexicon it can be added for future reference - this option is given as a dialog after the replacement has been made.
Mark as not variant: The variant can be 'ignored' by changing its category to not variant, this option can be invoked upon the clicked-on instance (Mark instance as Not Variant) or all instances of the word (Mark all as Not Variant). Once this option has been invoked the user is given the option to add the word to the modern lexicon so in future sessions it will not be marked as a variant.

Normalised
The normalised category contains all words which have been normalised by the user or through automatic processing. The original word form is stored and displayed with the replacement in the types list.
The only option offered other than finding the word in the list is to revert the replacement to the original variant. The user can revert just the clicked-on instance or all instances.

Not Variants
The not variants category contains all words which are in the modern lexicon.
The user may mark any word in this category as a variant, either the clicked-on instance only (Mark instance as Variant) or all instances (Mark all as Variant).

The sidebar to the right of the main text area contains extra information and options. The word category can be changed with the radio buttons at the top of the sidebar, for each category the number of tokens is shown. The other components in the sidebar are as follows:
Types List
The types list contains each word in the currently selected category (the number of words in the current category is displayed above the list), with the number of tokens for each word type displayed in brackets. Selecting a word in the list highlights all instances of the word in the main text area in green and begins the step complete component to cycle through instances of the word. Right-clicking on a word in the list gives similar options to those when right-clicking on a word in the main text area - these options are detailed in the word category descriptions.
The types list can be ordered alphabetically or by the frequency of each type in the text. Simply select 'A-Z' or 'Frequency (Desc.)' to switch between these options.
The current word list can be copied to the clipboard using Copy Current List, the list is copied in tab-delimited format so is easily pasted into programs such as Microsoft Excel.
Step Complete
The step complete component allows the user to cycle through and process instances in the text or instances of the currently selected word (text instances and type instances). The same options are available as when right-clicking a word, and they are dependent on the word's category. Once an instance is dealt with, the next available instance is selected and the relevant options displayed. The currently selected instance is highlighted red in the text. The instances can also be cycled through using the first, previous, next and last buttons.
You can choose to use the step complete feature to cycle through 'Type Instances' or 'Text Instances'. Type instances are the instances of the currently selected type in the Types List. Text instances are the tokens found in the text sequentially. If 'Type Instances' is selected and all instances of the currently selected type have been dealt with, the next type will be automatically selected for processing. If the Types List is currently sorted alphabetically, this will be the next type in the list. If the Types List is sorted by frequency, the most frequent type in the list will be selected.
Auto Normalise
There is also an option to automatically process variants if the top replacement has a confidence score equal to or higher than the selected threshold. This is very similar to the batch version of the tool, except the output is not produced immediately, instead replacements made can be viewed in the types list and further manual processing can take place. The threshold can be lowered and automatic processing invoked again, replacing any variants which now have a top replacement with a high enough confidence score. The automatic processor will always use the current confidence scores and f-score weight.

Confidence scores and methods
Each replacement offered for normalisation of a given variant is given a confidence score which is based on predicted precisions and recalls for each method used to find it. For each method all previous replacements made are taken into account as well as the current replacement being offered, all these scores are combined to produce an f-score (based on the current f-score weight). The current f-score, precision and recall for each method are displayed at the bottom of the text area in the form "F-Score (Precision|Recall)". The scores for each individual replacement are displayed for each method in the variant popup menu in the same format. This is based on the score given for the replacement and how many other replacements the method has suggested. The methods used to find and rank variant replacements are as follows:
KV: Known Variants
This method returns any replacement mappings found in the variants list for the given variant string. In most cases only one replacement will be offered, but the variants list may contain different options for a given variant string. A count is incremented or decremented each time a variant replacement is selected or reversed respectively. These counts are used to produce a precision and recall score for each offered replacement, if there is no ambiguity (i.e. only one offered replacement for the current variant string) these scores will be 100%, if there is more than one option the most picked replacement will have the higher recall score, the precision score will depend on how often the other replacements have been picked. The variants, replacements and counts are stored in the variants list file.
LR: Letter Replacement
A list of letter replacement rules (stored in the rules list file) is used to find potential variant replacements, rules are applied to the variant string and any resulting strings which are found to be in the modern lexicon or in the variants list are offered as potential replacements. The recall score is reduced if more than one rule or a mapping in the variants list is required to create the replacement string, the precision score is based on how many other replacements are found for the current variant string. The rules list can be edited in the rule list manager.
PM: Phonetic Matching
A modified version of the Soundex phonetic matching algorithm is used to find matching replacements in the modern lexicon or variants list. The recall score is reduced if a mapping from the variants list is required. The precision score is based on how many other options the algorithm produces from the current variant string.
ED: Edit Distance
A normalised Levenshtein Distance is calculated for all replacements offered by other methods. The individual replacements' scores are based on the replacement in question's distance score (recall) and the score given to all other offered replacements (precision).
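As an illustration of the edit distance method, the sketch below computes a Levenshtein distance and scales it by the length of the longer string. The guide does not state how VARD normalises the distance, so dividing by the longer length is an assumption, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed normalisation: 1.0 for identical strings, 0.0 for entirely different ones."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

Under this assumption, the variant `cou'd` and the replacement `could` differ by one substitution over five characters.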

F-Score Weight
To calculate overall confidence scores for methods and replacements, an f-score is calculated by combining the precision and recall scores. Usually, equal weight is given to precision and recall (F-Score weight 1), but a user may give priority to either precision or recall by altering this weight using the slider in the toolbar. Moving the slider towards either precision or recall will bias all F-Scores in VARD accordingly. Weights under 1 will bias towards precision; weights over 1 will bias towards recall. A weight of 1/2 equates to precision being weighted twice as much as recall, a weight of 2 equates to recall being weighted twice as much as precision, a third or 3 equates to 3 times as much, and so on. The difference this weighting makes is instantly shown in the f-scores presented for each method at the bottom of the text area. Adjusting this weight may change the ranking of replacements offered for variants, and may result in some replacements having a score above the current automatic replacement threshold.
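The effect of the weight can be tried out with the standard weighted F-measure. The guide does not give VARD's exact formula, so the sketch below assumes the usual definition, in which the weight plays the role of beta; `f_score` is a hypothetical name.

```python
from fractions import Fraction


def f_score(precision, recall, weight=1):
    """Weighted F-measure; weight may be a number or a string such as "1/2"."""
    beta_squared = float(Fraction(str(weight))) ** 2
    denominator = beta_squared * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta_squared) * precision * recall / denominator
```

With a weight of `"1/2"`, a method with high precision and low recall scores better than one with the opposite profile; with a weight of `2` the order is reversed, which matches the behaviour of the slider described above.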

Rule list manager
An important part of VARD 2 is a list of letter replacement rules which are used to find potential variant replacements. Each rule contains 3 parameters: original string, replacement string and position, e.g. replace ys with ies at the end of the word. Insertions can be achieved by leaving the original string blank and deletions can be achieved by leaving the replacement string blank. 58 rules are present by default in the software; however, rules can be added to or removed from this list with the 'Rule list manager', located in the 'Advanced' menu.
Upon opening the rule list manager the user is presented with the current list of rules. To remove rules, select them and click 'Remove selected rule(s)'. A rule can be added to the list by inputting the original, replacement and position parameters and clicking 'Add'. Leaving 'Original' blank will result in an insertion rule and leaving 'Replacement' blank will result in a deletion rule. Be careful to consider the context of a rule: surrounding letters may be provided by including them in both the original string and the replacement string. For instance, a good rule to deal with cou'd, wou'd and shou'd (could, would and should) would be replace ou'd with ould at the end of a word. If a rule is added or removed mistakenly, use 'Revert rule list'.
Once adding and removing rules is complete click 'Finished' to return to the main interface, any new rules will now be applied when searching for variant replacements, and any removed rules will no longer be used. The rule list will only be saved for future sessions if the list is written to file.
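The three rule parameters can be pictured with the sketch below. It is not VARD's implementation: the position names `"start"`, `"end"` and `"anywhere"` are assumptions (the guide only shows "at the end of the word"), the function names are hypothetical, and only one rule is applied at a time, whereas VARD may combine several.

```python
def apply_rule(word, original, replacement, position):
    """Return the set of strings produced by applying one rule to word."""
    results = set()
    if position == "end":
        if word.endswith(original):
            results.add(word[:len(word) - len(original)] + replacement)
    elif position == "start":
        if word.startswith(original):
            results.add(replacement + word[len(original):])
    else:  # "anywhere": try every offset at which the original string matches
        for i in range(len(word) + 1):
            if word.startswith(original, i):
                results.add(word[:i] + replacement + word[i + len(original):])
    results.discard(word)
    return results


def candidates(word, rules, lexicon):
    """Keep only the rule outputs that are found in the modern lexicon."""
    found = set()
    for original, replacement, position in rules:
        found |= apply_rule(word, original, replacement, position) & lexicon
    return found
```

A blank original string behaves as an insertion and a blank replacement string as a deletion, as described above. With the rule `("ou'd", "ould", "end")` and a lexicon containing `should`, the variant `shou'd` yields `should`.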

Join Function
Because the main text area is not editable, a function has been added to allow the joining of two or more word instances in the text. To do this, select (by clicking and dragging the mouse over) the words you wish to join in the main text area and select 'Join' from the 'Edit' menu or click the corresponding toolbar button.
Once the join function has been invoked all selected words will be concatenated with any characters (including ignored text, hyphens and white space) in between words removed. The whole of a word need not be selected for it to be included and words can be joined over line-breaks. The resulting new word instance will be evaluated by the system as a variant or modern form. The user can then manipulate the word like any other.
Any word instance which has already been replaced may not be joined with other words; the word instance must first be reverted to a variant and then joined.

Undo / Redo
All edits made to words within the text can be undone and redone using 'Undo' and 'Redo' in the 'Edit' menu or with the corresponding toolbar buttons. Edits made to the known variants list and the dictionary are also undoable and redoable. The undo and redo functionality works in much the same way as in word processing software such as Microsoft Word. Edits can only be undone (or redone) one edit at a time.

Various formatting options are available for the text in the main text area. Two dropdown menus are available in the toolbar to change the selected text's font face and size. The toolbar and Style menu also have functions for making the selected text Bold, Italic and Underlined. Formatting functions will always be invoked on the currently selected text; to select all of the text in the current document, use 'Select All' from the 'Edit' menu or in the toolbar.
Formatting will only be saved if the file is saved as Rich Text Format (.rtf).

Saving document
The current document can be saved with or without XML tags to represent replacements and remaining variants. To save with XML tags use 'Save With XML Tags' in the 'File' menu; to save without XML tags use 'Save Without XML Tags'. A ".rtf" extension can be used to save the current formatting; this is not possible when saving with XML tags.
'Save' in the 'File' menu or in the toolbar can be used to save to the last file the document was saved to. If the document has not previously been saved the user will be able to choose whether or not to save with xml tags.
The currently selected text can be copied to the clipboard to be pasted into other programs with 'Copy' in the 'Edit' menu or in the toolbar; no formatting will be included however.

Saving other data
Other data can be written to files for use in future sessions from the 'File' menu; the options are detailed below. Options are only available if a change has occurred in the particular data. For more information on the actual files written see the Setup and logs section.
'Save Dictionary': This option saves any changes to the modern lexicon, i.e. any words removed or added. This is the list checked against when judging whether a word is a variant and also used when searching for potential variant replacements.
'Save Variant List': Saves any additions to the known variants list. A variant to replacement mapping will be added when a replacement is made which does not already appear in the saved list.
'Save Rule List': Saves to file any changes made using the Rule list manager.
'Save Confidence Weights': Saves the current replacement searching method weights.
'Save All': Gives the option to save all of the above as well as the document; the same prompt is displayed as that shown when exiting the program.

Exiting
Exiting the program can be done through 'Exit' in the 'File' menu or by simply closing the window.
Upon exiting the user will be prompted to save any changes made, as shown in the screenshot to the right. Note: options will only be given to save data which has changed.

Batch version
The batch processing interface allows a user to exploit VARD 2's automatic processing capability over any number of text files, even an entire corpus or set of corpora. The current setup will be used; including the current confidence weights, dictionary, variants list and rule list.
Once files are chosen, a threshold and F-Score weight set and an output type and folder selected, processing can be started using 'Process listed file(s)'. The current progress of the batch processing will be shown in the progress bar.


Choosing files
All files to be processed are shown in a list in the 'Input' section. Individual files can be added to this list with 'Add file(s)', and the contents of one or more folders with 'Add folder(s)' (the entire directory tree rooted at the selected folder can also be added by changing the 'search_subfolders' field to 'true' in options.txt). A standard opening dialog is given to the user where text files can be added. You can also add files or folders to be processed by dragging and dropping them from your system into the list.
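The effect of the 'search_subfolders' setting can be pictured with the sketch below. It is an illustration rather than VARD's own code: `list_input_files` is a hypothetical name, and treating names that begin with a dot as hidden is an assumption based on the Unix convention (the guide mentions skipping hidden files only for the command line version).

```python
import os


def list_input_files(folder, search_subfolders=False):
    """Collect non-hidden files in folder, optionally descending into subfolders."""
    found = []
    for root, dirs, files in os.walk(folder):
        if search_subfolders:
            # Skip hidden subfolders; edit dirs in place so os.walk honours it.
            dirs[:] = [d for d in dirs if not d.startswith(".")]
        else:
            # Clearing dirs stops os.walk from descending any further.
            dirs[:] = []
        for name in files:
            if not name.startswith("."):
                found.append(os.path.join(root, name))
    return sorted(found)
```

With the default setting only files directly inside the chosen folder are returned; with `search_subfolders=True` the whole tree rooted at that folder is included.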
Files can be removed from the list by selecting the files and clicking 'Remove selected file(s)', or the entire list can be cleared with 'Clear list'.
Any texts which can be processed by the interactive version can be processed by the batch processing tool. Any changes made to texts with VARD 2 and saved with xml tags will be loaded when processing in the batch version.

Replacement Threshold
VARD 2 will generally find numerous potential replacements for a given variant, and each of these suggestions is given a confidence score. When automatic processing takes place, the suggested replacement with the highest confidence score is used to replace each variant. However, if the system's replacement methods 'struggle' with a particular variant, the highest confidence score may be relatively low. For these cases a threshold is required: the minimum confidence score needed for a replacement to take place. If the threshold is not met by the top replacement, the word is left as a variant.
The user can select a threshold to use when processing the current list of texts; the value must be between 0% and 100%. It is recommended to use the interactive mode to train the tool on your texts/corpus and get a feel for a suitable threshold or to test a threshold on a sample first to gauge the recall and precision of replacements made. A higher threshold will increase precision but reduce recall.
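The threshold decision amounts to a simple comparison, sketched below. The function name and the list-of-pairs input are assumptions made for illustration; the confidence scores themselves come from the methods described in the interactive version section.

```python
def choose_replacement(suggestions, threshold):
    """suggestions: list of (replacement, confidence in percent).

    Returns the top replacement if it meets the threshold, otherwise None,
    meaning the word is left as a variant.
    """
    if not suggestions:
        return None
    best, confidence = max(suggestions, key=lambda pair: pair[1])
    return best if confidence >= threshold else None
```

Raising the threshold from 50 to 90 in a call such as `choose_replacement([("could", 82.5), ("cloud", 40.0)], 50)` turns an automatic replacement into a variant left for manual processing, which is the precision and recall trade-off described above.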

Selecting Output
The text can be outputted with or without changes tagged (or both). If the original text was ".rtf", the formatting will be preserved and saved as ".rtf", this only applies to the untagged output. Both formats can be outputted simultaneously by selecting both check boxes.
An output folder must also be chosen, use the 'Browse' button to select a folder on your system. All output files and the stats file are placed in this folder. Since Version 2.1.2, if the files are from different folders the folder structure will be retained.

Stats File
Each time the batch processor is run ('Process listed file(s)') a stats file is created in tab-delimited plain text. A row is present for each file processed with columns indicating the file name, total number of words and tokens, variant words and tokens remaining, normalised variant words and tokens and non variant words and tokens.
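As an alternative to a spreadsheet, the stats file can be read with a few lines of code. The sketch below assumes the column order listed above, reads "words" as types, and skips any row whose counts are not numeric (such as a header row, if one is present). These are assumptions; check them against a real stats file before relying on the result.

```python
import csv
import io

# Assumed column order, following the description in this section.
COLUMNS = ["file", "total_types", "total_tokens",
           "variant_types", "variant_tokens",
           "normalised_types", "normalised_tokens",
           "nonvariant_types", "nonvariant_tokens"]


def percent_normalised(stats_text):
    """Map each file name to the percentage of variant tokens that were normalised."""
    result = {}
    for row in csv.reader(io.StringIO(stats_text), delimiter="\t"):
        if len(row) != len(COLUMNS):
            continue
        record = dict(zip(COLUMNS, row))
        try:
            remaining = int(record["variant_tokens"])
            normalised = int(record["normalised_tokens"])
        except ValueError:
            continue  # not a data row
        total = remaining + normalised
        result[record["file"]] = 100.0 * normalised / total if total else 0.0
    return result
```

The input is the text of the stats file, for example the result of `open(path, encoding="utf-8").read()`.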
A Microsoft Excel file is available which serves as a rough template for further stats from this file. The yellow columns should be replaced with the columns from the stats file described above; be careful to make sure that totals are calculated in the bottom section from the entire list of files, especially if new rows are added. Stats available include totals, minimums, maximums and average of the given columns as well as percentages of how many variants were normalised.

Training Mode

A training mode is available from version 2.3 which allows the user to train VARD with previously normalised (VARDed) texts. Each text is processed as if it was being manually processed in the interactive mode, with weights and the variants list being updated accordingly. This means that only the XML VARD output is needed to train VARD for batch processing.
To use the training mode, simply select files in the same way as selecting files to process in the batch mode and click Process listed file(s). VARD will process each file in turn and automatically save relevant lists. The only difference between using the interactive mode and training mode will be the editing of the dictionary by the user (although variant replacements will be added if indicated in the options file). This is something which will be addressed in future versions of VARD.

Command Line version
As of version 2.2 a command line interface has been added for VARD 2. This may be useful for running VARD 2 over a network or running a series of tools through a script. The interface will process texts in exactly the same way as the batch version. The following command will run this interface:
java -Xms256M -Xmx512M -jar clui.jar "<setup directory>" <threshold int> <f-score weight int or fraction> "<input directory>" "<output directory>"
As when running the tool normally the memory values can be changed to suit your needs, but 256/512 should be adequate for most applications.
Anything in brackets (<...>) should be replaced with your own values:
<setup directory>: This is the setup folder VARD will use, this is equivalent to that chosen during the user interface selection screen. If the setup folder does not exist it will be created and default settings used.
<threshold int>: Indicates the replacement threshold. This should be an integer value between 0 and 100.
<f-score weight int or fraction>: Indicates the f-score weight used when calculating replacement scores. 1 indicates an equal balance between precision and recall, 1/2 indicates precision weight is twice as much as recall and 2 indicates recall weight is twice as much as precision. 1 is generally fine for most purposes; the threshold has a bigger effect on VARD's performance. For precision bias a score less than one should be given in the form of a fraction; the numerator and denominator should be whole numbers between 1 and 9 (inclusive). For recall bias a score greater than one should be given; this should be a whole number between 1 and 9 (inclusive).
<input directory>: This is the directory from which texts will be read. Ensure that the directory is surrounded by quotes (" "). VARD 2 attempts to read all files that are not hidden, regardless of extension or type. By default, only files in the immediate directory are selected; the entire directory tree rooted at the selected folder can be processed by changing the 'search_subfolders' field to 'true' in options.txt.
<output directory>: This is the directory where processed texts will be placed. Two folders are created, one with tagged texts and one without, plus a stats file. If search_subfolders is set to true the original directory structure will be maintained in the two folders. Again, ensure the directory is surrounded by quotes (" ").
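When the command line version is driven from a script, the arguments can be checked before Java is started. The sketch below only assembles the command shown above; `build_clui_command` is a hypothetical helper, and the validation rules are taken from the argument descriptions in this section.

```python
import re


def build_clui_command(setup_dir, threshold, weight, input_dir, output_dir,
                       xms=256, xmx=512):
    """Return the argument list for running clui.jar, validating the values first."""
    if not (isinstance(threshold, int) and 0 <= threshold <= 100):
        raise ValueError("threshold must be an integer between 0 and 100")
    # A single digit 1-9, optionally followed by "/" and another digit 1-9.
    if not re.fullmatch(r"[1-9](/[1-9])?", str(weight)):
        raise ValueError("weight must look like 2 or 1/2, using digits 1 to 9")
    return ["java", f"-Xms{xms}M", f"-Xmx{xmx}M", "-jar", "clui.jar",
            setup_dir, str(threshold), str(weight), input_dir, output_dir]
```

The resulting list can be passed to `subprocess.run(command, check=True, cwd=vard_dir)`, where `vard_dir` stands for the folder containing clui.jar. Because the arguments are passed as a list rather than as one shell string, paths containing spaces do not need the surrounding quotes.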

Output
The format of output from the interactive, batch and command line versions is exactly the same and is explained below. When choosing the format of output to use, there are two things to consider: whether the original variants need to be retained (if so, use XML) and whether formatting information needs to be retained (if so, use Rich Text Format).

With XML Tags
VARD 2 XML output retains information about edits made to words in the text. Four pieces of information are marked in-line with the text: words marked as variants by the user (which were originally marked as not variants and have not yet been replaced), words marked as not variants by the user (which were previously marked as variants), normalised words and joins made. Any original XML tags will be retained, although any formatting will not be saved. The following tags may be added:
- <join original="old string">new string</join>: Any joined words will be surrounded with these tags. The original words, with surrounding punctuation and white space, are given in the original attribute; new lines are displayed as [n]. The join tags may surround variant or normalised tags, but join will never be surrounded by variant or normalised tags.
- <variant>word</variant>: This tag surrounds any word which the user has marked as a variant manually. Upon reading this tag, VARD 2 will mark the word as a variant regardless of whether the word is in the dictionary (and replace it if possible in the batch version).
- <notvariant>word</notvariant>: This tag surrounds any word which the user has marked as a modern form manually. Upon reading this tag, VARD 2 will mark the word as a modern form regardless of whether the word is in the dictionary, and hence not normalise it automatically.
- <normalised orig="original form" auto="true|false">normalised form</normalised>: A normalised variant will be surrounded by these tags. The original form is retained along with whether the replacement was made automatically or not - if the replacement was automatic this will not count in training.
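Tagged output can be read back by other scripts, for instance to list the manual normalisations in a text. The sketch below is an assumption-laden illustration: it uses a regular expression rather than an XML parser, assumes the closing tag matches the opening `normalised` tag, assumes the attributes appear in the order shown above, and does not handle normalised tags nested inside join tags specially.

```python
import re

NORMALISED = re.compile(
    r'<normalised orig="([^"]*)" auto="(true|false)">(.*?)</normalised>')


def manual_normalisations(tagged_text):
    """Return (original, normalised) pairs for replacements not made automatically."""
    return [(original, normalised)
            for original, auto, normalised in NORMALISED.findall(tagged_text)
            if auto == "false"]
```

Filtering on `auto="false"` mirrors the note above that automatic replacements do not count in training.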

Rich Text Format (RTF)
Saving in Rich Text Format (RTF) will retain any original formatting in the document and save any formatting completed by the user in the interactive version of the tool. The saved file could be opened in word processing software and the formatting will be present. No reference to the original variant is included for replaced words.
VARD will automatically save as RTF when a ".rtf" extension is used.

Plain Text
If saving without XML tags is chosen and the extension is not ".rtf", the text will be saved with replacements in place of variants where appropriate. References to the original variants will be lost, along with any formatting.

Setup
Various other data is saved by the tool so that future sessions can benefit from training by a user. Any of the files described below can be reset to the original version by simply deleting the file; the next time VARD 2 is run, the file will be re-initialised with default values. A file can be saved for future use by making a copy of it and renaming the copy. To use the data again, simply replace the existing file with the previously copied file (making sure to save the existing file first, if necessary). The files can also be edited manually, although this is not recommended unless you are confident in what you are doing, as a badly structured file could cause the software to crash.
As of version 2.4, you can have multiple setup folders and choose which setup to use when starting VARD.
All files are tab-delimited with one entry per line (except for the variants list). Whilst lines may appear in alphabetical order, this is not a necessity. If editing the files manually, it is important to ensure the files are saved with UTF-8 encoding. Previous versions of VARD used UTF-16, so files from those versions are unusable in VARD 2.3 and later until they are converted; other software can be used to convert these files to UTF-8. Please get in touch if this is proving problematic.

Dictionary
The dictionary (or modern lexicon) used by VARD 2 to classify words in a document and to find replacements for variants is stored in words.txt. The file is quite simple in that it only contains a maximum of 3 columns:
word string [tab] frequency [tab] (# if word is user added)
The frequency is either taken from the British National Corpus or based on the folder in SCOWL. At present new words are always given a frequency of 10.
If this file is deleted, VARD 2 will re-initialise words.txt the next time it runs. This is done by adding words with a range of 50 or more in the British National Corpus (hidden from the user) and all words contained within the files from the scowl folder. The scowl folder contains a subset of the word lists offered by SCOWL (Spell Checker Oriented Word Lists). Unwanted word lists can be deleted and new ones added; see the SCOWL webpage for extra word lists. The files simply contain lists of words, each on a new line. The file's ending indicates the frequency of the words, e.g. .10 indicates the top 10% most frequent words.
If you are creating your own word list (e.g. for another language), it may not be possible to include frequency information. The frequency score is only used when ranking candidate replacements if two confidence scores are equal. Due to the more complicated manner in which confidence scores have been calculated since version 2.3, it is unlikely that two scores will be equal (particularly highly ranked scores), so the same frequency (10, say) can be given to every word in the list.
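The words.txt layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the frequency is assumed to be a whole number, and the optional third column is assumed to contain a bare # character. The second function shows the suggestion of giving every word in a custom list the same frequency.

```python
def parse_dictionary_line(line):
    """Split one words.txt line into (word, frequency, user_added)."""
    fields = line.rstrip("\n").split("\t")
    word = fields[0]
    frequency = int(fields[1])  # assumption: frequencies are integers
    user_added = len(fields) > 2 and fields[2] == "#"
    return word, frequency, user_added


def format_dictionary_lines(words, frequency=10):
    """Build words.txt lines for a custom word list with one shared frequency."""
    return [f"{word}\t{frequency}" for word in words]


# Hypothetical entries, not taken from the real words.txt.
print(parse_dictionary_line("virtue\t2500\n"))
print(parse_dictionary_line("yclept\t10\t#\n"))
print(format_dictionary_lines(["huis", "boom"]))
```

When writing the resulting lines to disk, the file should be saved as UTF-8, for example with open("words.txt", "w", encoding="utf-8").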

Variants
The variant list containing variant -> replacement mappings is saved in variants.txt. Entries are grouped for variant strings, with a list of possible replacements for each. The file is formatted as follows:
variant
* [tab] replacement 1 [tab] count [tab] (# if mapping is user added)
* [tab] replacement 2 [tab] count [tab] (# if mapping is user added)
... etc.
All counts start at 1. If a count is reduced to 0 (by a replacement being reverted), the mapping will be removed from the variants list (and the removal recorded in the log). The format of the variants list has changed significantly from VARD 2.2, but the original format can be read by VARD 2.3 (as long as it is in UTF-8 form) and will be transformed to the new format upon saving. However, due to the removal of several erroneous entries in the default variants list, it is recommended to delete variants.txt so that a fresh file is created. The training mode can be used to reproduce any changes made. Note: variant replacements are not necessarily in the dictionary.
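To make the grouped layout of variants.txt more concrete, here is a minimal Python sketch that reads it into a dictionary. It assumes that each replacement line begins with a literal asterisk followed by a tab, as the template above suggests, and that counts are whole numbers; the function name parse_variants and the sample entries are hypothetical.

```python
def parse_variants(lines):
    """Map each variant to a list of (replacement, count, user_added) tuples."""
    mappings = {}
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if line.startswith("*\t"):
            if current is None:
                raise ValueError("replacement line found before any variant")
            fields = line.split("\t")
            replacement = fields[1]
            count = int(fields[2])  # assumption: counts are integers
            user_added = len(fields) > 3 and fields[3] == "#"
            mappings[current].append((replacement, count, user_added))
        else:
            # Any other line starts a new variant group.
            current = line
            mappings[current] = []
    return mappings


# Hypothetical file content, written by hand for this sketch.
sample_lines = [
    "vertue\n",
    "*\tvirtue\t3\n",
    "loue\n",
    "*\tlove\t1\t#\n",
    "*\tlow\t1\n",
]
variants = parse_variants(sample_lines)
print(variants)
```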

Rules
rules.txt contains a list of all letter replacement rules used by the system. Rules can be added with the Rule list manager or manually appended to this file. There are four fields for each entry:
original [tab] replacement [tab] position [tab] (# if rule is user added)
The original field may be blank if the rule is an insertion; likewise, the replacement field may be blank if the rule is a deletion. The position field should be one of Start, Second, Middle, Penultimate, End, Anywhere.
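If rules are appended to rules.txt by hand, it can help to check each line before starting VARD. The Python sketch below is one possible check, not something VARD itself provides: the function name parse_rule_line and the sample rules are hypothetical, and the optional fourth field is assumed to contain a bare # character.

```python
VALID_POSITIONS = {"Start", "Second", "Middle", "Penultimate", "End", "Anywhere"}


def parse_rule_line(line):
    """Parse and validate one rules.txt line, returning a dictionary."""
    # Only the newline is stripped, so a leading tab (blank original) survives.
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("expected at least three tab-separated fields")
    original, replacement, position = fields[:3]
    if position not in VALID_POSITIONS:
        raise ValueError(f"unknown position: {position!r}")
    if original == "" and replacement == "":
        raise ValueError("original and replacement cannot both be blank")
    if original == "":
        kind = "insertion"
    elif replacement == "":
        kind = "deletion"
    else:
        kind = "substitution"
    return {
        "original": original,
        "replacement": replacement,
        "position": position,
        "user_added": len(fields) > 3 and fields[3] == "#",
        "kind": kind,
    }


# Hypothetical rules, for illustration only.
print(parse_rule_line("u\tv\tStart"))
print(parse_rule_line("e\t\tEnd\t#"))
```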

Confidence weights
The current confidence weights for each replacement method are stored in weights.txt. These are used to calculate confidence scores for each suggested variant replacement. It is not recommended to edit this file, but each line represents:
Method code [tab] true positives [tab] false positives [tab] false negatives
The values are used to create precision and recall scores which are combined to give F-Scores. Due to the overhaul in how VARD is trained, previous versions of weights.txt cannot be used with VARD 2.3. The interactive and training modes can be used to produce relevant values for your corpus.
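The relationship between the three counts and the resulting scores can be sketched in Python as follows. This uses the standard definitions of precision and recall and assumes the balanced F-score (F1); the exact weighting VARD applies when combining them is not stated here, so treat the combination as an assumption. The method code "A" and the counts are invented for the example.

```python
def parse_weights_line(line):
    """Split one weights.txt line into (method_code, tp, fp, fn)."""
    method, tp, fp, fn = line.rstrip("\n").split("\t")
    return method, float(tp), float(fp), float(fn)


def f_score(tp, fp, fn):
    """Balanced F-score (F1) from true/false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical line: method "A" with 6 true positives, 2 false positives
# and 6 false negatives gives precision 0.75 and recall 0.5.
method, tp, fp, fn = parse_weights_line("A\t6\t2\t6")
score = f_score(tp, fp, fn)
print(method, round(score, 3))
```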

Text To Ignore
text_to_ignore.txt contains a list of regular expressions which are used to detect text which should not be processed by VARD 2. This file can be edited to add user-defined regular expressions representing text which should be ignored, although a little experience with regular expressions may be required.
For example, if your texts contained headers between <head> tags, the header could be ignored by adding the following regular expression:
<head>[^<>]*</head>
Note that certain characters need escaping with \, such as {, }, [, ]. For more information about Java regular expressions this page is quite useful.
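Before adding an expression to text_to_ignore.txt, it can be worth trying it on a sample of your text. The sketch below does this in Python rather than Java; for a simple pattern like the header example the two regular expression engines behave the same way, but more advanced syntax may differ. The sample text is invented, and removing the match here only previews which text would be skipped; it makes no claim about how VARD handles ignored text internally.

```python
import re

# The header pattern from the example above.
IGNORE = re.compile(r"<head>[^<>]*</head>")

text = "<head>THE FIRST BOOKE</head> Of vertue and loue."

# Spans that the expression would mark as text to ignore.
ignored = [match.group(0) for match in IGNORE.finditer(text)]

# What is left once those spans are set aside.
remaining = IGNORE.sub("", text).strip()

print(ignored)
print(remaining)

# Because [^<>]* excludes angle brackets, a header containing
# nested tags is not matched by this expression.
nested = IGNORE.search("<head>A <hi>B</hi></head>")
print(nested)
```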

Entities
entities.txt contains a list of regular expressions representing entities along with their desired replacement. The regular expression will be searched for in the text and replaced with the string given in the second column.
For example, one may wish to detect all instances of the long s (ſ) and replace them with a modern s. This can be done by making two additions to entities.txt:
&[#]?[x]?017[fF]; [tab] s would replace any HTML/XML Unicode entities.
\u017F [tab] s would replace any characters which Java encodes automatically.
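The effect of these two entries can be previewed with a short Python sketch. It applies each pattern in turn and substitutes the replacement string, which mirrors the search-and-replace behaviour described above; the order of application, the function name replace_entities and the sample text are assumptions made for this example. Python's re module is used in place of Java's regular expression engine, and both accept these two patterns.

```python
import re

# (pattern, replacement) pairs mirroring the two entities.txt lines above.
ENTITY_RULES = [
    (re.compile(r"&[#]?[x]?017[fF];"), "s"),  # coded entity, e.g. &#x017F;
    (re.compile(r"\u017F"), "s"),             # the literal long-s character
]


def replace_entities(text):
    """Apply every entity rule to the text, in list order."""
    for pattern, replacement in ENTITY_RULES:
        text = pattern.sub(replacement, text)
    return text


# Hypothetical sample containing both a literal long s and coded entities.
sample = "ble\u017f\u017fed &#x017F;oule and &#x017f;inne"
cleaned = replace_entities(sample)
print(cleaned)
```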

Options
options.txt contains various options which can be set by the user to influence how VARD 2 operates. The fields available are:
search_subfolders which can be set to true or false (default false). Setting this field to true will result in VARD 2 searching for files to add in the entire directory structure rooted at the input folder selected in the batch, training or command line version of the tool.
interactive_train_from_xml which again can be set to true or false (default false). Setting this field to true will result in the confidence scores being influenced by VARD XML tags read by the tool in the interactive version. This can be useful if you are reading a text which has not already influenced the confidence scores and variants list, although the training mode would normally be used for this function.
add_new_xml_reps_to_dict which again can be set to true or false (default true). A true value means that when XML tags are read (e.g. in the training mode), any variant replacements in XML tags (where auto = false) are added to the dictionary.
word_regex is a regular expression describing what should constitute a word when reading text in VARD. A good knowledge of (Java) regular expressions is needed before editing this entry. The expression lists valid characters for a word. The default expression and what each part means is as follows:
([\p{L}\'\-\^~=]|(&[#]?[a-zA-Z0-9]+;))+
- [...] defines the list of alternative characters.
- \p{L} is any 'letter' in Java, including letters with diacritics.
- \' apostrophe
- \- hyphen
- \^ caret
- ~ tilde
- = equals sign
- (&[#]?[a-zA-Z0-9]+;) detects coded Unicode entities which are later changed into Java Unicode characters. This should not be removed in most cases.
- (...) grouping to separate coded unicode entries, these should remain.
- + one or more characters.
Possible additions to the first set of square brackets ([...]) to indicate additional possible characters include 0-9 for digits (e.g. to deal with SMS spelling variation) and ` to represent a grave accent.
encoding dictates the encoding which should be used when reading texts in any mode of VARD. If this is set to detect (the default), VARD will attempt to detect the encoding. However, detection is sometimes not possible, in which case an incorrect encoding will be used and characters will be displayed and processed erroneously. If the encoding is consistent and known for your corpus, it should be enforced here. Different versions of Java will have different encodings available; however, according to the Java API, the following are available as standard:
- US-ASCII
- ISO-8859-1
- UTF-8
- UTF-16BE
- UTF-16LE
- UTF-16

Change Logs and Error log
Three log files are also stored in the "logs" folder (stored in the current setup folder); one for rule changes (rules_change_log.txt), one for the dictionary (words_change_log.txt) and one for the list of variant -> replacement mappings (variants_change_log.txt). Each time a change is saved to the rules list, dictionary or known variants list a log entry is made detailing the change.
Changes are only logged when made using the interactive or training modes of the tool. Manual edits to saved data files will not be recorded, and the batch version of the tool will never make changes to these files; it will only re-initialise the files if they are not present. Log entries are always appended to the end of the file, so the change log files may include changes made in many sessions. Deleting a log file will result in a new file being created the next time VARD 2 is run.
The log files can be very useful for the future development of letter replacement rules, the default dictionary and the pre-defined variant -> replacement mappings. It would be greatly appreciated if you could send your log files for analysis, especially after training the tool for your data.
If an error occurs whilst using VARD a log will be created in logs/error_log.txt which may help explain what has gone wrong. It would be appreciated if errors could be reported in the bugs section if not already listed there.

|