Configuration
This page describes how to customize your configuration of WikiBrain.
Env, configurator and components
Each WikiBrain runtime environment is associated with an instance of the Env class. An Env creates, configures, and keeps track of the components of WikiBrain.
Let’s say you want to create a Disambiguator component (a disambiguator returns Wikipedia articles that correspond to a particular phrase). Disambiguators are described in more detail in the SR section of this manual.
To create a new WikiBrain environment and get a particular disambiguator you would do the following:
public static void main(String[] args) {
    // Prepare the environment
    Env env = EnvBuilder.envFromArgs(args);
    // Get the configurator, which creates components
    Configurator configurator = env.getConfigurator();
    // Get a specific disambiguator called "topResult"
    Disambiguator dab1 = configurator.get(Disambiguator.class, "topResult");
    // Get the default disambiguator (named "similarity" in this case)
    Disambiguator dab2 = configurator.get(Disambiguator.class);
}
Let’s walk through this program to explain each piece.
First, we create an Env, a WikiBrain environment that provides access to the components we need:
Env env = EnvBuilder.envFromArgs(args);
The EnvBuilder provides utility methods to set the languages you want to support, the maximum number of threads available to your program, etc.
There are more advanced ways of configuring WikiBrain, both programmatically and through configuration files, described in the Customizing your configuration and Standard command-line options sections of this page.
You can also create an Env by hand, but the builder provides many convenience methods for you.
The Env provides access to a Configurator, which is essentially a Factory for creating WikiBrain components. We get the Disambiguator next:
Configurator configurator = env.getConfigurator();
Disambiguator dab1 = configurator.get(Disambiguator.class, "topResult");
Finally, you will often want the “default” version of a particular component. In that case, you can omit its name:
Disambiguator dab2 = configurator.get(Disambiguator.class);
In this case, you’ll receive the similarity disambiguator, which is an instance of SimilarityDisambiguator.
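To make the named and default lookups concrete, here is a minimal, self-contained sketch of a registry-style factory. It is not WikiBrain's Configurator implementation: ComponentRegistry, register, and setDefault are hypothetical names invented for illustration, and plain strings stand in for real components.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical illustration only: this is NOT WikiBrain's Configurator.
final class ComponentRegistry {
    // component type -> (component name -> supplier that builds it)
    private final Map<Class<?>, Map<String, Supplier<?>>> suppliers = new HashMap<>();
    // component type -> name of the default component for that type
    private final Map<Class<?>, String> defaults = new HashMap<>();

    <T> void register(Class<T> type, String name, Supplier<? extends T> supplier) {
        suppliers.computeIfAbsent(type, k -> new HashMap<>()).put(name, supplier);
    }

    <T> void setDefault(Class<T> type, String name) {
        defaults.put(type, name);
    }

    // Named lookup, analogous to configurator.get(Disambiguator.class, "topResult")
    <T> T get(Class<T> type, String name) {
        Map<String, Supplier<?>> byName = suppliers.get(type);
        if (byName == null || !byName.containsKey(name)) {
            throw new IllegalArgumentException("No " + type.getSimpleName() + " named " + name);
        }
        return type.cast(byName.get(name).get());
    }

    // Default lookup, analogous to configurator.get(Disambiguator.class)
    <T> T get(Class<T> type) {
        String name = defaults.get(type);
        if (name == null) {
            throw new IllegalStateException("No default for " + type.getSimpleName());
        }
        return get(type, name);
    }

    public static void main(String[] args) {
        ComponentRegistry registry = new ComponentRegistry();
        registry.register(String.class, "topResult", () -> "top-result component");
        registry.register(String.class, "similarity", () -> "similarity component");
        registry.setDefault(String.class, "similarity");

        System.out.println(registry.get(String.class, "topResult")); // named lookup
        System.out.println(registry.get(String.class));              // default lookup
    }
}
```

The point of the sketch is the shape of the API: callers ask for a type and, optionally, a name, and the registry decides which concrete instance to build.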
Overview of configuration file structure
WikiBrain decides how to configure components by looking at its configuration files. The default configuration file is stored in reference.conf. You should NOT edit the reference.conf file. Instead, you can specify configuration files that override these default settings (more on this later).
The configuration system is based on Typesafe’s config framework and uses a JSON-like format called HOCON.
Consider the snippet below from the default reference.conf, which defines four disambiguators named topResult, topResultConsensus, etc. It also tells WikiBrain to use the similarity disambiguator by default.
sr : {
    disambig : {
        default : similarity
        topResult : {
            type : topResult
            phraseAnalyzer : default
        }
        topResultConsensus : {
            type : topResultConsensus
            phraseAnalyzers : ["lucene","stanford","anchortext"]
        }
        milnewitten : {
            type : milnewitten
            metric : milnewitten
            phraseAnalyzer : default
        }
        similarity : {
            type : similarity
            metric : inlinknotrain
            phraseAnalyzer : default
            // how to score candidate senses. Possibilities are:
            //      popularity:  just popularity
            //      similarity:  just similarity
            //      product:     similarity * popularity
            //      sum:         similarity + popularity
            criteria : sum
        }
    }
}
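The four criteria values listed in the comments combine a similarity score and a popularity score. A minimal sketch of that arithmetic follows. The class, enum, and method names are hypothetical and the input scores are made-up numbers, so this illustrates the formulas in the comments rather than the actual code of SimilarityDisambiguator.

```java
// Hypothetical illustration of the "criteria" options described in the config comments.
final class CriteriaSketch {
    enum Criteria { POPULARITY, SIMILARITY, PRODUCT, SUM }

    // Combines the two scores of one candidate sense according to the chosen criteria.
    static double score(Criteria criteria, double similarity, double popularity) {
        switch (criteria) {
            case POPULARITY: return popularity;
            case SIMILARITY: return similarity;
            case PRODUCT:    return similarity * popularity;
            case SUM:        return similarity + popularity;
            default:         throw new AssertionError("Unknown criteria: " + criteria);
        }
    }

    public static void main(String[] args) {
        double similarity = 0.5;   // made-up example value
        double popularity = 0.25;  // made-up example value
        for (Criteria c : Criteria.values()) {
            System.out.println(c + " -> " + score(c, similarity, popularity));
        }
    }
}
```

With these example inputs, sum yields 0.75 and product yields 0.125, which shows how the choice of criteria changes the relative weight of a popular but less similar candidate.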
Customizing your configuration
You’ll commonly want to override the default reference.conf configuration file. To do so, create a text file, and include configurations for any elements you’d like. Any configuration in your file will override reference.conf.
For example, to change the default disambiguator to topResult, do the following:
sr.disambig.default : topResult
To run a program with an override configuration file, you can take advantage of the -c option that is processed by EnvBuilder.envFromArgs():
$ java my.class.Name -c /path/to/myConf.conf
Alternately, we could specify a configuration override in the Java program:
Env env = new EnvBuilder()
        .setConfigFile("/path/to/myConf.conf")
        .build();
Or we could change the property directly:
Env env = new EnvBuilder()
        .setProperty("sr.disambig.default", "topResult")
        .build();
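The override rule described in this section (a setting in your file wins, everything else falls back to reference.conf) can be sketched with plain maps. This is a simplified model using flat dotted keys, not the Typesafe config library's merge logic, and OverrideSketch and resolve are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only: models settings as flat dotted keys.
final class OverrideSketch {
    // Returns the defaults with every key present in overrides replaced or added.
    static Map<String, String> resolve(Map<String, String> defaults,
                                       Map<String, String> overrides) {
        Map<String, String> merged = new LinkedHashMap<>(defaults);
        merged.putAll(overrides);
        return merged;
    }

    public static void main(String[] args) {
        // Stand-in for two settings from reference.conf
        Map<String, String> reference = new LinkedHashMap<>();
        reference.put("sr.disambig.default", "similarity");
        reference.put("sr.disambig.similarity.criteria", "sum");

        // Stand-in for an override file containing one line
        Map<String, String> override = new LinkedHashMap<>();
        override.put("sr.disambig.default", "topResult");

        Map<String, String> effective = resolve(reference, override);
        System.out.println(effective.get("sr.disambig.default"));             // topResult
        System.out.println(effective.get("sr.disambig.similarity.criteria")); // sum
    }
}
```

Only the overridden key changes; settings that the override does not mention keep their default values.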
Standard command-line options
WikiBrain specifies a set of standard command-line options. If your program parses arguments using EnvBuilder.envFromArgs, it will recognize them.
WikiBrain recognizes the following arguments:
option | value | default | notes |
---|---|---|---|
-c | path/to/conf.txt | | |
-h | max-threads | # logical cores | |
-l | language-codes | installed languages | comma separated |
--base-dir | path/to/dir | current directory | |
--tmp-dir | path/to/dir | baseDir/.tmp | WikiBrain requires many GB of tmp space |
If you would like to add custom command line options to your program, take a look at DumpLoader.java, which shows how to incorporate custom command line processing with WikiBrain.
Using external databases
By default, WikiBrain uses an embedded H2 database. While this is convenient, it does not scale well and does not currently support WikiBrain’s spatial module. For language editions with more than 1M articles, PostgreSQL is recommended.
You can configure the project to use PostgreSQL by overriding the configuration as described above. The relevant section of the default reference.conf is:
dao : {
    dataSource : {
        default : h2
        h2 : {
            driver : org.h2.Driver
            url: "jdbc:h2:"${baseDir}"/db/h2;LOG=0;CACHE_SIZE=65536;LOCK_MODE=0;UNDO_LOG=0;MAX_OPERATION_MEMORY=100000000"
            username : sa
            password : ""
            // Connection pooling
            // This sets the total number of jdbc connections to a minimum of 16.
            // partitions defaults to max(8, num-logical-cores)
            partitions : default
            connectionsPerPartition : 2
        }
        psql : {
            driver : org.postgresql.Driver
            url: "jdbc:postgresql://localhost/wikibrain"
            username : toby
            password : ""
            // Connection pooling
            // This sets the total number of jdbc connections to a minimum of 16.
            // partitions defaults to max(8, num-logical-cores)
            partitions : default
            connectionsPerPartition : 2
        }
    }
}
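The connection pooling comments imply a small calculation: with the default of max(8, num-logical-cores) partitions and 2 connections per partition, the pool holds at least 16 connections. The sketch below reproduces that arithmetic. It assumes the total is simply partitions multiplied by connectionsPerPartition, which is consistent with the comments but is not taken from WikiBrain's source, and PoolSizing and its methods are hypothetical names.

```java
// Hypothetical illustration of the pool-size arithmetic described in the config comments.
final class PoolSizing {
    // Default partition count as stated in the comments: max(8, num-logical-cores)
    static int defaultPartitions(int logicalCores) {
        return Math.max(8, logicalCores);
    }

    // Assumption: total connections = partitions * connectionsPerPartition
    static int totalConnections(int partitions, int connectionsPerPartition) {
        return partitions * connectionsPerPartition;
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        int partitions = defaultPartitions(cores);
        System.out.println("logical cores:     " + cores);
        System.out.println("partitions:        " + partitions);
        System.out.println("total connections: " + totalConnections(partitions, 2));
    }
}
```

On a 4-core machine this gives 8 partitions and 16 connections; on a 32-core machine it gives 32 partitions and 64 connections, which is worth checking against your database server's connection limit.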
You could override these settings by creating an external override configuration file (e.g., override.conf) with:
dao.dataSource.default : psql
dao.dataSource.psql {
    username : foo
    password : bar
    url : "jdbc:postgresql://localhost/my_database_name"
}
You could then load the altered configuration by passing the -c option to your program.
Alternately, you could pass the configuration settings directly to the builder:
Env env = new EnvBuilder()
        .setProperty("dao.dataSource.default", "psql")
        .setProperty("dao.dataSource.psql.username", "foo")
        .setProperty("dao.dataSource.psql.password", "bar")
        .setProperty("dao.dataSource.psql.url", "jdbc:postgresql://localhost/my_database_name")
        .build();