Running Maze2D Demos

This is a simple demonstration of a maze task.

Overview

The robot is an omniwheel mobile robot that can move in any direction on a two-dimensional plane ([-1,1]x[-1,1]).

[Figure: map-ce2-rlem080907-Nt-ex10.png (map of the maze, showing the wind arrows and the walls)]

This task is performed in simulation. The state of the robot is its global position, expressed as

x = (x1, x2) ,

and its control input is the displacement of the state over one time step dt=0.01[s], expressed as

u = (Dx1, Dx2) .

In this environment, wind pushes the robot in the direction of the arrows shown in the figure above. There are also walls that the robot cannot pass through.
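The state transition described here, including the effect of wind and walls, can be sketched as follows. This is only an illustration under stated assumptions, not the demo's implementation: the wind field `wind_at`, the rectangular `Wall` type, the drift magnitude, and the rule that a blocked move leaves the robot in place are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Wall:
    """Hypothetical axis-aligned rectangular wall (not the demo's wall model)."""
    x1_min: float
    x1_max: float
    x2_min: float
    x2_max: float

    def contains(self, x):
        return (self.x1_min <= x[0] <= self.x1_max
                and self.x2_min <= x[1] <= self.x2_max)


def wind_at(x):
    """Hypothetical wind field: extra displacement per time step at position x.

    Assumption: a constant drift in +x2 inside the strip 0 <= x1 <= 0.5.
    """
    if 0.0 <= x[0] <= 0.5:
        return (0.0, 0.002)
    return (0.0, 0.0)


def step(x, u, walls):
    """Apply control u = (Dx1, Dx2) to state x = (x1, x2) for one time step."""
    w = wind_at(x)
    nxt = (x[0] + u[0] + w[0], x[1] + u[1] + w[1])
    # Assumption: a move that would end inside a wall is rejected,
    # so the robot stays where it was.
    if any(wall.contains(nxt) for wall in walls):
        return x
    return nxt
```

With no wind and no wall in the way, `step` reduces to x + u, which matches the definition of the control input as the displacement per time step.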

The objective of the navigation task is to acquire a path from the start to the goal. Accordingly, the reward function is designed as follows: a reward of 1 for reaching the goal, a small cost for each step, and a penalty for leaving the plane. Each episode begins at the start state and ends when the robot reaches the goal, leaves the plane, or the elapsed time exceeds 12 seconds (t>12[s]).
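The reward and termination rules can be sketched as below. Only the goal reward of 1, the plane [-1,1]x[-1,1], and the 12-second limit come from the text; the step cost, the penalty value, the goal radius, and the function names are assumptions chosen for illustration, and the demo may combine the terms differently.

```python
import math

GOAL_REWARD = 1.0     # from the text
STEP_COST = -0.001    # assumed value; the text only says "small"
OUT_PENALTY = -0.5    # assumed value; the text gives no number
GOAL_RADIUS = 0.05    # assumed: the goal counts as reached within this distance
MAX_TIME = 12.0       # episode time limit in seconds, from the text


def in_plane(x):
    """True while the robot is inside the plane [-1,1]x[-1,1]."""
    return -1.0 <= x[0] <= 1.0 and -1.0 <= x[1] <= 1.0


def reward_and_done(x, t, goal):
    """Return (reward, episode_finished) for state x at elapsed time t."""
    if math.hypot(x[0] - goal[0], x[1] - goal[1]) <= GOAL_RADIUS:
        return GOAL_REWARD, True
    if not in_plane(x):
        return OUT_PENALTY, True
    # Ordinary step: pay the step cost; the episode ends on timeout.
    return STEP_COST, t > MAX_TIME
```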

Running Command

Please read Common Usage first. The command to run the demo is:

 $ ./DEMO_PRG -path PATH_LIST -agent AGENT_FILE -outdir OUT_DIR

The demo-specific elements are:

DEMO_DIR: maze2d (benchmarks/maze2d).
DEMO_PRG: maze2d.out
PATH_LIST: ../cmn,m
AGENT_FILE: one of the agent scripts listed below.
OUT_DIR: any directory can be used.

For example, execute the following:

 $ mkdir -p result/rl1
 $ ./maze2d.out -path ../cmn,m -agent ql_da1 -outdir result/rl1
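To compare several agent scripts, the same command can be repeated with a different AGENT_FILE and OUT_DIR each time. The following is a minimal sketch of such a batch run; the helper names `build_command` and `run_agents` and the layout `result/<agent>` are hypothetical, and only the command-line options shown above are used.

```python
import os
import subprocess

DEMO_PRG = "./maze2d.out"   # from the text
PATH_LIST = "../cmn,m"      # from the text


def build_command(agent, outdir):
    """Build the argument list for one run of the demo program."""
    return [DEMO_PRG, "-path", PATH_LIST, "-agent", agent, "-outdir", outdir]


def run_agents(agents, result_root="result", runner=subprocess.run):
    """Run the demo once per agent script, each with its own output directory.

    Assumption: results go to <result_root>/<agent>; any directory would do.
    """
    commands = []
    for agent in agents:
        outdir = os.path.join(result_root, agent)
        os.makedirs(outdir, exist_ok=True)   # same effect as mkdir -p
        cmd = build_command(agent, outdir)
        runner(cmd, check=True)              # raises if the demo exits with an error
        commands.append(cmd)
    return commands
```

For example, calling `run_agents(["ql_da1", "fqi_da1"])` from the demo directory would run the two agents in turn and store their results in result/ql_da1 and result/fqi_da1.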

The output looks like the following (debug lines are omitted):

 random seed = 1306343562
 episode 0...
 episode 1...
 episode 2...
 episode 3...
 episode 4...
 ...
 episode 997...
 episode 998...
 episode 999...

The result files are stored in OUT_DIR (result/rl1). For example, plot the learning curve with gnuplot:

 $ gnuplot
 gnuplot> plot 'result/rl1/log-eps-ret.dat' w l

[Figure: rl1-eps-ret.png (learning curve plotted from result/rl1/log-eps-ret.dat)]
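Learning curves from single runs are often noisy, so it can help to smooth the returns before inspecting them. The sketch below assumes that log-eps-ret.dat contains two whitespace-separated columns per line (episode index, then return); that layout is a guess based on the file name and the gnuplot command, so check the file before relying on it. The function names are hypothetical.

```python
def load_returns(lines):
    """Parse (episode, return) pairs from an iterable of text lines.

    Assumption: two whitespace-separated columns; blank lines and
    lines starting with '#' are skipped.
    """
    episodes, returns = [], []
    for line in lines:
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue
        episodes.append(int(float(fields[0])))
        returns.append(float(fields[1]))
    return episodes, returns


def moving_average(values, window):
    """Trailing moving average; early entries average over fewer samples."""
    out = []
    total = 0.0
    for i, v in enumerate(values):
        total += v
        if i >= window:
            total -= values[i - window]   # drop the sample that left the window
        out.append(total / min(i + 1, window))
    return out
```

A typical use would be to open result/rl1/log-eps-ret.dat, pass the file object to `load_returns`, and then call `moving_average` on the returns with a window of, say, 20 episodes.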

Agent Script

The following files can be specified as AGENT_FILE.

ql_da1: Q(λ)-learning, linear action value function (NGnet).
fqi_da1: Fitted Q iteration (updated every 10 episodes), linear action value function (NGnet).
hrl_da1: Cohen's hierarchical RL.
ql_gwf1: Wire-fitting updated by Q(λ)-learning.
qlfqi_da1: Q(λ)-learning + Fitted Q iteration (updated every 10 episodes), linear action value function (NGnet).
qlfqi_gwf1: Q(λ)-learning + Fitted Q iteration (updated every 10 episodes) for Wire-fitting.
dyna_da1: Dyna (using McMahan and Gordon's prioritized sweeping), linear action value function (NGnet).
ql_dcob_q1: Q(λ)-learning, DCOB (action space), linear action value function (NGnet).
ql_wfdcob1: Q(λ)-learning, WF-DCOB.
chacts: In this case, the available action set changes with the state (situation).

Testing:

lspi_da1: LSPI (updated every 5 episodes), linear action value function (NGnet). Note: the LSPI module is still under testing; it does not converge in this task.
