Monday, June 28, 2010

Introduction to Prediction Markets

Prediction Markets are an implementation of the broader concept of Collective Intelligence. In general, Collective Intelligence is an intelligence that emerges from the shared knowledge of individuals which can then be used to make decisions. With Prediction Markets (PM), this intelligence emerges through the use of market mechanisms (buying/selling securities) where the pay out depends upon the outcomes of future events. In short, the collective is attempting to predict the future.

Prediction Markets should be familiar to us because a stock market is really just a forum for making predictions about the value of some underlying security. Participants buy and sell shares in a company, for example, based on information they feel is relevant to the future value of that company. A security's price is an aggregated bit of information that is not only a prediction about the future, but is also new information from which more predictions can be made. That last part is important because prices are information that cause participants to act in a market.

A real-world example of using PMs to make decisions is Best Buy's TagTrade system. This system is used by Best Buy employees to provide information back to management on issues like customer sentiment. The linked article explains one particular incident:
TagTrade indicated that sales of a new service package for laptops would be disappointing when compared with the formal forecast. When early results confirmed the prediction, the company pulled the offering and relaunched it in the fall. While far from flawless, the prediction market has been more accurate than the experts a majority of the time and has provided management with information it would not have had otherwise
Another interesting example comes from Motorola and their attempts to deal with idea/innovation requests from their employees. Their ThinkTank system was set up to allow employees to submit ideas on products and innovations. Those in charge with weeding through these requests were initially overwhelmed. To improve the process, Motorola used PM software to allow employees to purchase shares in the submitted ideas. At the end of 30 days the market was closed and those ideas that had the highest share price got pursued, and employees holding stock in those ideas got a bonus.

(Some other companies using Prediction Markets are IBM, Google (PDF), Microsoft, and Yahoo! Some of these companies use internal prediction markets (employees only) while others provide external markets (general population). The Iowa Electronics Market (IEM), associated with the University of Iowa, uses PMs to predict election outcomes. IEM has been in existence for over 20 years, and has studies showing their predictions being more accurate than phone polls.)

The bonus paid out by Motorola points to an important aspect of PMs - incentives. With good incentives participants stay interested in the process and look for ways to make more accurate predictions. Driving people to discover new information about future events can lead to interesting behavior in a company.

Another key aspect of PMs is the idea of weighting. That is, the ability of traders to put some weight behind their predictions. Those who are more confident in their predictions can purchase/sell more shares in those outcomes. Contrast this with a simple survey where an expert's opinion gets the same weight as a layman's (one person one vote).

Railinc is now starting to venture into using Prediction Markets with Inkling's software and services. Some of the topics for which predictions could be made are bonus metrics, customer surveys, project metrics, and fun things like World Cup results. One thing that will be interesting to track over the coming months is the value of PMs in such a small company (Railinc has approximately 150 employees). Value from PMs tends to come from larger populations where errors can be canceled out and participation rates stay constant. The hope is that at some point these markets will be opened to various parties in the rail industry thereby increasing the population and alleviating this concern. If the markets were opened up to external parties then the topics could be broadened to include regulatory changes, industry trends, product suggestions, and ideas to improve existing applications. So, the potential is there if the execution is handled properly.

Prediction Markets provide an interesting way to efficiently gather dispersed information. Using this innovative tool, Railinc will attempt to tap into the Collective Intelligence of its employees and, hopefully, the rail industry.

More to come.

Thursday, June 17, 2010

ESRI and Python

Railinc is using ESRI to create map services. One of these services provides information about North American rail stations. The official record of these stations is in a DB2 database that gets updated whenever stations are added, deleted, or changed in some way. When we first created the ESRI service to access these stations, we copied the data from DB2 to an Oracle table, then built an ESRI ArcSDE Geodatabase using the Oracle data.

We had some issues with the ArcSDE Geodatabase architecture, and after some consultation we decided to switch to a File Geodatabase. This architecture avoids Oracle altogether and instead uses files on the file system. With this set up we've seen better performance and better stability of the ESRI services. (N.B: This is not necessarily a statement about ESRI services in general. Our particular infrastructure caused us to move from the Oracle solution.)

The question now is how do we keep the stations data up-to-date when using the File Geodatabase approach? Enter Python.

Rail Stations Data

Before getting to the Python script, let's take a look at the structure of the rail stations table.
  • RAIL_STATION_ID - unique id for the record
  • SCAC - A four character ID, issued by Railinc, that signifies the owner of the station
  • FSAC - A four digit number that, combined with the SCAC, provides a unique identifier for the station
  • SPLC - A nine digit number that is a universal identifier for the geographic location of the station
Most of this data is going to be informational only. What's most important for this process are the latitude and longitude columns which will be used to create geospatial objects.

Python and ESRI

The end result of this process is going to be the creation of an ESRI Shapefile - a file format created and regulated by ESRI as an open specification for data interoperability. Basically, shapefiles describe geometries - points, lines, polygons, and polylines.

While working on this problem I found three ways to create shapefiles programmatically:
  • The ESRI Java API
  • The ESRI Python scripting module
  • The Open Source GeoTools Toolkit
I chose Python over the others because of its simplicity and its history with ESRI. (I do have a working solution using the GeoTools Toolkit that I may share in a future blog post.) Now, to the script.

First, I'll create the Geoprocessor object using the ESRI arcgiscripting module specifying that I want output to be overwritten (actually, this tells subsequent function calls to overwrite any output).

import arcgisscripting, cx_Oracle, datetime

gp = arcgisscripting.create(9.3)
gp.Overwriteoutput = 1
gp.workspace = "/usr/local/someworkspace"
gp.toolbox = "management"

Next, I'll create an empty feature class specifying the location (workspace), file, and type of geometry. The geometry can be POINT, MULTIPOINT, POLYGON, and POLYLINE. In this case, I'll use a POINT to represent a station. At this time I will also define the projection for the geometry.
gp.CreateFeatureclass( "/usr/local/someworkspace", "stations.shp", "POINT" )
coordsys = "Coordinate Systems/Geographic Coordinate Systems/North America/North American Datum 1983.prj"
gp.defineprojection( "stations.shp", coordsys )

Now I need to define the structure of the feature class. When I created the feature class above I defined it with the POINT geometry. So the structure is already partially defined with a Shape field. What's left is to create fields to hold the station specific structure.

gp.AddField_management( "stations.shp", "STATION_ID", "LONG", "", "", "10", "", "", "REQUIRED", "" )
gp.AddField_management( "stations.shp", "SCAC", "TEXT", "", "", "4", "", "", "REQUIRED", "" )
gp.AddField_management( "stations.shp", "FSAC", "TEXT", "", "", "4", "", "", "REQUIRED", "" )
gp.AddField_management( "stations.shp", "LATITUDE", "DOUBLE", "19", "10", "12", "", "", "REQUIRED", "" )
gp.AddField_management( "stations.shp", "LONGITUDE", "DOUBLE", "19", "10", "12", "", "", "REQUIRED", "" )
gp.AddField_management( "stations.shp", "LAST_UPD", "DATE" )

At this point I have a shapefile with a feature class based upon the station schema. Before adding data I must create a cursor to access the file. The Geoprocessor provides methods to create three types of cursors - insert, update, and search. Since I am creating a shapefile I will need an insert cursor.

cur = gp.InsertCursor( "/usr/local/someworkspace/stations.shp" )
pnt = gp.CreateObject("Point")

I've also created a Point object here that I will use repeatedly for each record's Shape field in the feature class.


Now that the output structure is ready, I need some input. To query the Oracle table I will use the cx_Oracle module. This is one of the reasons why I liked the Python solution - accessing Oracle was trivial. Simply create a connection, create a cursor to loop over, and execute the query.

dbConn = cx_Oracle.connect( username, pw, url )
dbCur = dbConn.cursor()
dbCur.execute( "SELECT * FROM RAIL_STATIONS" )

Now I can start building the shapefile. The process will loop over the database cursor and create a new feature class row, populating the row with the rail station data.

for dbRow in dbCur:

    pnt.x = dbRow[10]
    pnt.y = dbRow[9] = dbRow[0]

    fcRow = cur.NewRow()
    fcRow.shape = pnt
    fcRow.STATION_ID = dbRow[0]
    fcRow.SCAC = dbRow[1]
    fcRow.FSAC = dbRow[2]
    fcRow.SPLC = dbRow[3]
    fcRow.LATITUDE = dbRow[9]
    fcRow.LONGITUDE = dbRow[10]
    fcRow.LAST_UPD = dbRow[11].strftime( "%x %X" )


del cur, dbCur, dbConn

First, the Point object created above is used to populate the feature class's Shape field. However, before doing that the InsertCursor is used to create a new row in the feature class (this acts as a factory and only creates a new row object - it does not insert the object into the feature class). Once I have the new row from the database I can populate all of the fields in the feature class row. Finally, I insert the row into the cursor (actually, the final part is the clean up).

One problem that took me a while to figure out (since I am new to ESRI and Python) was handling dates. My first pass at populating the LAST_UPD field was to use fcRow.LAST_UPD = dbRow[11]. Consistent, right? When I did this I got the following error:

Traceback (most recent call last):
  File "", line 72, in 
    feat.LAST_UPD = row[11]
ValueError: Row: Invalid input value for setting

After searching around I figured out that what was coming back from Oracle was a datetime.datetime type that was not being accepted by the feature class date type. I found that I could convert the datetime.datetime to a string and ESRI would do the date conversion properly ("%x %X" just takes whatever the date and time formats are and outputs them as strings).


That's it. Now I have a shapefile that I can use with my ESRI File Geodatabase architecture. The next step is to swap out shapefiles when the stations data changes (which it does on a regular basis). Can this be done without recreating the ESRI service? Stay tuned.


Friday, June 11, 2010

Geospatial Analytics using Teradata: Part I

In October, I (along with a co-worker) will be giving a presentation at the Teradata PARTNERS conference. The topic will be on how Railinc uses Teradata for geospatial analytics. Since I did not propose the paper, write the abstract, or even work on geospatial analytics, I will be learning a lot during this process. So, to help with that education I will be sharing some thoughts in a series of blog posts.

To kick the series off, let me share the abstract that was originally proposed:
Linking location to information provides a new data dimension, a new precision, unlocking a huge potential in analytics. Geospatial data enables entirely new industry metrics, new insights, and better decision making. Railinc, as a trusted provider of IT services to the Rail Freight industry, is responsible for accurate and timely dissemination of more than 10 million rail events per day. This session provides an overview of how Railinc Business Analytics’ group has implemented Active Data Warehouse and Teradata GeoSpatial technologies to bring an unprecedented amount of new Rail Network insight. The real-time calculation of Geospatial metrics from rail events, has enabled Railinc to better assess; 1) Rail equipment utilization 2) repair patterns 3) geographic usage patterns, and other factors. All of which, afford insights that impact maintenance program decisions, component deployments, service designs and industry policy decisions.
Below is a first pass at an outline for the talk. It is preliminary and will most likely change over the coming months.
  1. Describe Railinc's Teradata installation
  2. Describe Railinc's source systems
    1. Rail car movement events
    2. Rail car Inventory
    3. Rail car health
    4. Commodity
  3. Describe our ETL process
  4. Explain the FRA geospatial rail track data
    1. Track ownership complexity
  5. Tie 1-4 together
    1. Current state of car portal
    2. Car Utilization analytics
    3. Traffic pattern analytics
  6. Lessons learned
    1. Study of different routing algorithms
    2. Data quality issues
Item 5 is the problem - how can we tie our source systems together with geospatial data in a compelling way? One idea is a portal that provides information about the current state of a rail car. How would geospatial data fit into this portal? Location is the most obvious answer, but is there something more interesting? What about an odometer reading for a rail car? Outside of the portal there are ideas around car utilization and traffic patterns. I like the last two but I need to learn more about them.

These are some issues/questions I need to answer over the coming months. Along the way I plan on sharing information about implementation details, possible business cases, and any problems I come across.