Tutorial: Testing the correctness of an algorithm

Online non-stationary supervised learning

Check that online learning algorithms can learn to predict with an error below a given threshold.
Code snippet: testing the IDBD algorithm
    NoisyInputSum problem = new NoisyInputSum();
    double error = problem.evaluateLearner(new IDBD(NoisyInputSum.NbInputs, 0.001));
    Assert.assertEquals(2.0, error, 0.1);
See the package rlpark.plugin.rltoys.junit.algorithms.predictions.supervised for more tests.
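For readers unfamiliar with the algorithm under test, the update rule can be sketched in a few lines. The following is an illustrative re-implementation of IDBD (Sutton, 1992) for linear supervised prediction, not RLPark's IDBD class; the class name and all parameter values are assumptions of the sketch.

```java
// Illustrative sketch of IDBD (Incremental Delta-Bar-Delta) for linear
// supervised learning; not RLPark's IDBD class.
final class IdbdSketch {
  private final double metaStepSize;  // theta, the meta learning rate
  private final double[] w;           // learned weights
  private final double[] beta;        // log step sizes: alpha_i = exp(beta_i)
  private final double[] h;           // per-weight memory traces

  IdbdSketch(int nbInputs, double metaStepSize, double initialAlpha) {
    this.metaStepSize = metaStepSize;
    w = new double[nbInputs];
    beta = new double[nbInputs];
    h = new double[nbInputs];
    java.util.Arrays.fill(beta, Math.log(initialAlpha));
  }

  double predict(double[] x) {
    double y = 0;
    for (int i = 0; i < w.length; i++)
      y += w[i] * x[i];
    return y;
  }

  // One online update; returns the prediction error before the update
  double learn(double[] x, double target) {
    double delta = target - predict(x);
    for (int i = 0; i < w.length; i++) {
      beta[i] += metaStepSize * delta * x[i] * h[i];   // adapt the step size
      double alpha = Math.exp(beta[i]);
      w[i] += alpha * delta * x[i];                    // LMS-style weight update
      h[i] = h[i] * Math.max(0, 1 - alpha * x[i] * x[i]) + alpha * delta * x[i];
    }
    return delta;
  }
}
```

IDBD maintains one adaptive step size per input, which is what allows it to discount irrelevant, noisy inputs faster than a single global step size would on a problem such as NoisyInputSum.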

Online temporal difference learning

Check that online TD algorithms converge on-policy to the TD fixed point on a random walk problem.
Code snippet: testing the TD(λ) algorithm
    RandomWalk randomWalkProblem = new RandomWalk(new Random(0));
    FiniteStateGraphOnPolicy.testTD(randomWalkProblem, new OnPolicyTDFactory() {
      @Override
      public OnPolicyTD create(int nbFeatures) {
        return new TDLambda(0.1, 0.9, 0.01, nbFeatures);
      }
    });
See rlpark.plugin.rltoys.junit.algorithms.predictions.td.TDTest for more tests.
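The update exercised by this test is the standard linear TD(λ) rule with accumulating eligibility traces. The sketch below is illustrative, not RLPark's TDLambda class; the class name and parameter choices are assumptions of the example.

```java
// Illustrative sketch of linear TD(lambda) with accumulating traces;
// not RLPark's TDLambda class.
final class TdLambdaSketch {
  private final double alpha, gamma, lambda;
  private final double[] w;  // value-function weights
  private final double[] e;  // eligibility traces

  TdLambdaSketch(double alpha, double gamma, double lambda, int nbFeatures) {
    this.alpha = alpha;
    this.gamma = gamma;
    this.lambda = lambda;
    w = new double[nbFeatures];
    e = new double[nbFeatures];
  }

  double value(double[] x) {
    double v = 0;
    for (int i = 0; i < w.length; i++)
      v += w[i] * x[i];
    return v;
  }

  // One transition (x, reward, xNext); returns the TD error delta
  double update(double[] x, double reward, double[] xNext) {
    double delta = reward + gamma * value(xNext) - value(x);
    for (int i = 0; i < w.length; i++) {
      e[i] = gamma * lambda * e[i] + x[i];  // accumulating trace
      w[i] += alpha * delta * e[i];
    }
    return delta;
  }
}
```

As a sanity check, on a continuing two-state chain A → B → A with rewards 1 and 0 and γ = 0.9, the weights converge to the true values V(A) = 1/0.19 ≈ 5.26 and V(B) = 0.9/0.19 ≈ 4.74.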

Check that online TD algorithms converge off-policy to the TD fixed point on a random walk problem.
Code snippet: testing off-policy TD and Gradient-TD algorithms
    double lambda = 0.1, gamma = 0.9;  // example values; see GTDLambdaTest for the values used in RLPark
    double alpha_v = 0.01, alpha_w = 0.001;  // step sizes of the main and secondary weight vectors
    double targetLeftProbability = 0.2, behaviourLeftProbability = 0.5;
    OffPolicyTDFactory tdFactory = new OffPolicyTDFactory() {
      @Override
      public OffPolicyTD newTD(double lambda, double gamma, int vectorSize) {
        return new GTDLambda(lambda, gamma, alpha_v, alpha_w, vectorSize, new AMaxTraces());
      }
    };
    TestingResult<OffPolicyTD> result = RandomWalkOffPolicy.testOffPolicyGTD(lambda, gamma, targetLeftProbability,
                                                                             behaviourLeftProbability, tdFactory);
    Assert.assertTrue(result.message, result.passed);
See rlpark.plugin.rltoys.junit.algorithms.predictions.td.GTDLambdaTest for more tests.
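The GTD(λ) update itself can be sketched as follows. This is an illustrative implementation based on the published GTD(λ) algorithm (Maei, 2011), not RLPark's GTDLambda class; here alphaV and alphaW are the step sizes of the main and secondary weight vectors, and rho is the importance-sampling ratio π(a|s)/b(a|s) of the target policy over the behaviour policy.

```java
// Illustrative sketch of linear GTD(lambda) for off-policy prediction;
// not RLPark's GTDLambda class.
final class GtdLambdaSketch {
  private final double alphaV, alphaW, gamma, lambda;
  private final double[] v;  // main (value-function) weights
  private final double[] w;  // secondary weights for the gradient correction
  private final double[] e;  // eligibility traces

  GtdLambdaSketch(double lambda, double gamma, double alphaV, double alphaW, int nbFeatures) {
    this.lambda = lambda;
    this.gamma = gamma;
    this.alphaV = alphaV;
    this.alphaW = alphaW;
    v = new double[nbFeatures];
    w = new double[nbFeatures];
    e = new double[nbFeatures];
  }

  private static double dot(double[] a, double[] b) {
    double result = 0;
    for (int i = 0; i < a.length; i++)
      result += a[i] * b[i];
    return result;
  }

  double value(double[] x) {
    return dot(v, x);
  }

  // One off-policy transition (x, reward, xNext) with importance ratio rho
  double update(double rho, double[] x, double reward, double[] xNext) {
    double delta = reward + gamma * dot(v, xNext) - dot(v, x);
    for (int i = 0; i < e.length; i++)
      e[i] = rho * (gamma * lambda * e[i] + x[i]);  // importance-weighted trace
    double eTw = dot(e, w);
    double xTw = dot(x, w);
    for (int i = 0; i < v.length; i++) {
      v[i] += alphaV * (delta * e[i] - gamma * (1 - lambda) * eTw * xNext[i]);
      w[i] += alphaW * (delta * e[i] - xTw * x[i]);
    }
    return delta;
  }
}
```

A quick sanity check: with a single state and γ = 0.5, a behaviour policy picking two actions with probability 0.5 each, and a target policy that always takes the reward-1 action, the learned value converges to 1/(1 − 0.5) = 2.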

Control on a reinforcement learning problem

Check that online control algorithms can learn a policy on-policy that reaches some minimal level of performance.
Code snippet: testing an actor-critic algorithm on the pendulum problem
    Assert.assertTrue(PendulumOnPolicyLearning.evaluate(new ActorCriticFactory() {
      @Override
      public ControlLearner create(int vectorSize, double vectorNorm, PolicyDistribution policyDistribution) {
        TD critic = new TD(1.0, 0.5 / vectorNorm, vectorSize);
        Actor actor = new Actor(policyDistribution, 0.05 / vectorNorm, vectorSize);
        return new AverageRewardActorCritic(0.01, critic, actor);
      }
    }) > .75);
See rlpark.plugin.rltoys.junit.algorithms.control.actorcritic.ActorCriticOnPolicyOnPendulumTest for more tests.
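The structure being tested, a TD critic driving a policy-gradient actor in the average-reward setting, can be sketched as follows. This is an illustrative implementation with a softmax (Boltzmann) policy, not RLPark's AverageRewardActorCritic, Actor, or TD classes; the class names, parameters, and the one-state two-action task used to check it are assumptions of the sketch.

```java
// Illustrative sketch of an average-reward actor-critic with a softmax
// policy over discrete actions; not RLPark's AverageRewardActorCritic.
final class ActorCriticSketch {
  private final double alphaR, alphaV, alphaU;  // step sizes
  private double averageReward;                 // average-reward estimate
  private final double[] v;                     // critic weights
  private final double[] u;                     // actor preferences, one per action

  ActorCriticSketch(double alphaR, double alphaV, double alphaU, int nbFeatures, int nbActions) {
    this.alphaR = alphaR;
    this.alphaV = alphaV;
    this.alphaU = alphaU;
    v = new double[nbFeatures];
    u = new double[nbActions];
  }

  double[] policy() {
    double max = u[0];
    for (double ui : u)
      max = Math.max(max, ui);  // subtract max for numerical stability
    double[] pi = new double[u.length];
    double z = 0;
    for (int a = 0; a < u.length; a++) {
      pi[a] = Math.exp(u[a] - max);
      z += pi[a];
    }
    for (int a = 0; a < u.length; a++)
      pi[a] /= z;
    return pi;
  }

  int sampleAction(java.util.Random random) {
    double[] pi = policy();
    double s = random.nextDouble();
    int action = 0;
    while (action < pi.length - 1 && s >= pi[action]) {
      s -= pi[action];
      action++;
    }
    return action;
  }

  private double value(double[] x) {
    double result = 0;
    for (int i = 0; i < v.length; i++)
      result += v[i] * x[i];
    return result;
  }

  // One transition: state features x, chosen action, reward, next features
  void update(double[] x, int action, double reward, double[] xNext) {
    double delta = reward - averageReward + value(xNext) - value(x);
    averageReward += alphaR * delta;    // track the average reward
    for (int i = 0; i < v.length; i++)
      v[i] += alphaV * delta * x[i];    // critic: TD update
    double[] pi = policy();
    for (int a = 0; a < u.length; a++)  // actor: softmax policy gradient
      u[a] += alphaU * delta * ((a == action ? 1 : 0) - pi[a]);
  }
}
```

On a made-up one-state task where action 0 yields reward 1 and action 1 yields reward 0, the actor quickly learns to prefer action 0: rewards above the running average produce a positive TD error that reinforces the chosen action.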

Check that online control algorithms can learn a policy off-policy that reaches some minimal level of performance. A fixed exploration policy is run on a problem to generate data, which is used to improve a target policy. The performance of the target policy is then evaluated on a copy of the problem.
Code snippet: testing the Off-PAC algorithm on the Mountain Car problem
    final MountainCarEvaluationAgentFactory factory = new MountainCarEvaluationAgentFactory() {
      @Override
      public OffPolicyAgentEvaluable createOffPolicyAgent(Random random, MountainCar problem, Policy behaviour, double gamma) {
        Projector criticProjector = MountainCarOffPolicyLearning.createProjector(random, problem);
        OffPolicyTD critic = createCritic(criticProjector, gamma);
        StateToStateAction toStateAction = MountainCarOffPolicyLearning.createToStateAction(random, problem);
        PolicyDistribution target = new BoltzmannDistribution(random, problem.actions(), toStateAction);
        double alpha_u = 1.0 / criticProjector.vectorNorm();
        ActorOffPolicy actor = new ActorLambdaOffPolicy(0, gamma, target, alpha_u, toStateAction.vectorSize(), new ATraces());
        return new OffPolicyAgentDirect(behaviour, new OffPAC(behaviour, critic, actor));
      }

      private OffPolicyTD createCritic(Projector criticProjector, double gamma) {
        double alpha_v = .05 / criticProjector.vectorNorm();
        double alpha_w = .0001 / criticProjector.vectorNorm();
        GTDLambda gtd = new GTDLambda(0, gamma, alpha_v, alpha_w, criticProjector.vectorSize(), new ATraces());
        return new CriticAdapterFA(criticProjector, gtd);
      }
    };
    Assert.assertTrue(MountainCarOffPolicyLearning.evaluate(factory) < 115);
See rlpark.plugin.rltoys.junit.experiments.offpolicy.OffPolicyTests for more tests.

All tests in RLPark

See rlpark.alltests.RLParkAllTests
