Monday, August 30, 2010

NFJS 2010 in Raleigh

I attended the No Fluff Just Stuff tour in Raleigh this past weekend with a bunch of others from Railinc. After the event on Sunday I tweeted that I wasn’t all that impressed with this year’s sessions. Matthew McCullough responded asking for some details on my concerns. Instead of being terse in a tweet, I thought I’d be fair and give a lengthier response here.

First off, I really wasn’t planning on going this year. I took a look at the schedule and didn’t see enough new content that interested me (I have seen/read/heard enough about REST, Grails, JRuby, Scala, etc.). In 2007 and 2008 I went with a group from Railinc and had a pretty good time while learning about some new things going on in the industry. (We didn’t go in 2009 for economic reasons.)

What changed my mind was the interest expressed by some other developers at Railinc. Since I coordinated the 2007 and 2008 trips, I offered to coordinate this one, and with that much interest I figured I’d give it a shot as well. So, to be clear, I wasn’t going in expecting much anyway.

Here were the key issues for me:
  1. Some of the sessions did not go in the direction that I expected. To be fair, though, I was warned ahead of time to review the slides before making a decision on a session. The problem here is that some presenters relied more on demos and less on slides, so in some cases it was hard to judge by just the slide deck.
  2. Like I said above, I wasn’t planning on going in the first place because of the dearth of sessions that seemed interesting to me. I ended up attending some sessions simply because they were the least irrelevant option in that time slot. There were actually two sessions that I bailed on in the middle because I wasn’t getting any value from them.
  3. Finally, and this is completely subjective, some of the speakers just didn't do it for me. While you could tell that most (if not all) of the speakers were passionate about what they were talking about, some were just annoying about it. For instance, some of the attendees I spoke to felt that the git snobbery was a bit much. Some of it was just speaker style - some click with me, some don't.
Some things I heard from the other Railinc attendees:
  • Too much duplication across speakers
  • Not enough detail along tracks
  • Some of the sessions were too introductory - could have gotten the same information from a bit of googling.
Granted, some of my concerns are subjective and specific to my own oddities. But I do remember that I had enjoyed the '07 and '08 events much more.

I did, however, enjoy Matthew's first session on Hadoop. I knew very little about the technology going in and Matthew helped crystallize some things for me. I also got some good information from Neal Ford's talks on Agile engineering practices and testing the entire stack.

I really like the No Fluff Just Stuff concept in general. I think it is an important event in the technology industry. The speakers are knowledgeable and passionate, which is great to see. My mind is still open about going next year, but it will be a harder sell.

Wednesday, August 25, 2010

Not so Stimulating

I sent the following to the Raleigh News & Observer:
E. Wayne Stewart says that “enormous fiscal stimulus ... to finance World War II led the U.S. out of the Depression.” While it is true that aggregate economic indicators (e.g., unemployment and GDP) improved during the war, it was not a time of economic prosperity.

During World War II the U.S. produced a lot of war material, not consumer goods. It was a time when citizens went without many goods and raw materials due to war-time rationing. It was also a time when wages and prices were set by government planning boards. In short, it was a time of economic privation for the general public. It wasn't until after the war, when spending was drastically reduced, that the economy returned to a sense of normalcy.

The lesson we should learn is that, yes, it is possible for government to spend enough money to improve aggregate economic indicators. That same spending, however, can distort the fundamentals of the economic structure in ways that are not wealth-producing as determined by consumer preferences.
This argument, that government spending during WWII got us out of the Depression, is used by many to justify economic stimulus. The argument I use above comes from Robert Higgs and his analysis of the economy during the Depression and WWII.

For me, though, the biggest problem with the "just spend" argument is that it ignores the nuances and subtleties of a market-based, consumer-driven economy. It is like saying that to turn a 1,000-word essay into a 2,000-word essay all you need to do is add 1,000 words. No thought is given to the fact that those extra words need to fit into the overall essay in a coherent manner. A productive economy needs spending to occur in the proper places at the proper times, and it is the market process that does this most efficiently (not perfectly efficiently, but better than the alternatives).

Prediction Markets at a Small Company

Railinc has recently started a prediction market venture using Inkling software. We have been using it internally to predict various events including monthly revenue projections and rail industry traffic volume. In July, we also had markets to predict World Cup results. While this experience has been fun and interesting, I can't claim it has been a success.

The biggest problem we've had is with participation. There is a small core group of people who participate regularly, while most of the company hasn't even asked for an account to access the software. When I first suggested this venture, I was skeptical that it would work at such a small company (just under 200 staff), primarily because of this problem. From the research I saw, other companies using prediction markets also had only a small percentage of employees participate. However, those companies were much larger than Railinc, so the total number participating was much greater.

Another problem, related to participation, is the number of questions being asked. Since we officially started this venture I've proposed all but one of the questions/markets. While I know a lot about the company, I don't know everything that is needed to make important business decisions. That raises yet another problem: in such a small company, do you really need such a unique mechanism to gather actionable information from such a limited collective?

Even considering these problems, we are venturing forward and looking for ways to make prediction markets relevant at Railinc. One way to do this is through a contest. Starting on September 1 we will hold a contest to determine the best predictor. At the Railinc holiday party in December we will give an award to the person with the largest portfolio as calculated by Inkling. (The award will be similar to door prizes we've given out at past holiday parties.) I've spent some time recently with Railinc's CIO discussing possible questions we could ask during this contest. We came up with several categories of questions, including financials, headcount, project statistics, and sales. While I am still somewhat skeptical, we will see how it plays out.

We are also looking to work with industry economists to see if Railinc could possibly host an industry prediction market. This area could be a bit more interesting, in part, because of the potential size of the population. If we can get just a small percentage of the rail industry participating in prediction markets we could tap into a sizable collective.

Over the coming months we'll learn a lot about the viability of prediction markets at Railinc. Even if the venture fails internally, my hope is to make some progress with the rail industry.

Thursday, August 12, 2010

Geospatial Analytics using Teradata: Part II - Railinc Source Systems

[This is Part II in a series of posts on Railinc's venture into geospatial analytics. See Part I.]

Before getting into the details of the various analytics that Railinc is working on, I need to explain the source data behind them. Railinc is essentially a large data store of information received from various parties in the North American rail industry. We process, distribute, and store large volumes of data on a daily basis: roughly 3 million messages are received from the industry each day, which can translate into 9 million records to process. The data is categorized in four ways:
  • Asset - rail cars and all attributes for those rail cars
  • Asset health - damage and repair information for assets
  • Movement - location and logistic information for assets
  • Industry reference - supporting data for assets including stations, commodities, routes, etc.
Assets (rail cars) are at the center of almost all of Railinc's applications. We keep the inventory of the nearly 2 million rail cars in North America. For the most part, the data we receive either has an asset component or in some way supports asset-based applications. The analytics that we are currently creating from this data fall into three main categories: 1) logistics, 2) management/utilization, and 3) health.

Logistics is an easy one because movement information makes up the bulk of the data we receive on a daily basis. If there is a question about the last reported location of a rail car, we can answer it. The key there is "last reported location." Currently we receive notifications from the industry whenever a predefined event occurs. These events tend to occur at particular locations (e.g., stations). In between those locations is a black hole for us - at least for now. More and more rail cars are being equipped with GPS devices that can pinpoint a car's exact location at any time. We are now working with the industry to start receiving such data to fill in that black hole.

Management/utilization requires more information than just location, however. If a car is currently moving and is loaded with a commodity then it is making money for its owner; if it is sitting empty somewhere then it is not. Using information provided by Railinc, car owners and the industry as a whole can get a better view into how fleets of cars are being used.

Finally, asset health analytics provide another dimension to the view of the North American fleet. Railinc, through its sister organization TTCI, has access to events recorded by track-side detectors. These devices can detect, among other things, wheel problems at speed. TTCI performs some initial analysis on these events before forwarding them on to Railinc, which then creates alert messages that are sent to subscribers. Railinc will also collect data on repairs that are performed on rail cars. With a history of such events we can perform degradation analytics to help the industry better understand the life cycle of assets and asset components.

Railinc is unique in the rail industry in that it can be viewed as a data store of various information. We are just now starting to tap into this data to get a unique view into the industry. Future posts will examine some of these efforts.

Tuesday, July 06, 2010

The Science and Art of Prediction Markets

What constitutes a good question for a prediction market? Obviously, for the question to be valuable, the answer should provide information that was not available when the question was originally asked. Otherwise, why ask the question? Value, however, is only one aspect of a good question. For prediction markets to function in a useful manner, the questions asked must also be constructed properly. There is both a science and an art to this process.

The Science

There are three criteria to keep in mind when constructing a question for a prediction market:
  • The correct answer must be concrete
  • Answers must be determined on specific dates
  • Information about possible answers can be acquired before the settlement date
Concreteness is important because it settles the question being asked - the result is not open to interpretation. An example of a question with vague answers would be "What policy should the U.S. government enact to encourage economic growth? A) Subsidizing green energy, B) free trade, C) fiscal austerity, D) health care reform." One problem here is that the time frame to accurately answer this question could be extensive. Also, the complexities of economic growth make it difficult to tease out the individual variables that would be necessary to answer the question concretely. If two or more answers are correct (whatever that may mean), then the market may end up reflecting the value judgments of the participants, not objective knowledge. This type of question is better suited to a poll than a prediction market.

Not only should answers be concrete, there should be some point in time when each answer can be determined to either have occurred or not have occurred. A question that never gets resolved can hamper the prediction process by reducing the incentive to invest in that market. (Can a non-expiring question be valuable? Could the ongoing process of information discovery be useful? Questions to ponder.)

This doesn't mean, however, that every answer must be determined on the same date. Wrong answers can be closed as the process unfolds. Once the correct answer is determined, however, the market should be closed. For example, take the question "Which candidate will win the 2012 Republican Party nomination for U.S. President?" If this question is asked in January of 2012 there could be several possible answers (one for each candidate). As the year progresses to the Republican Party convention, several candidates will drop out of the election. The prediction market would then close out those answers (candidates) but stay open for the remaining answers. Weeding out wrong answers over time is part of the discovery process.

The final criterion - the ability to acquire information before the settlement date - is what separates prediction markets from strict gambling. If all participants are in the dark about a question until that question is settled, then there is little value in asking the question. Prediction markets are powerful because they allow participants to impart knowledge into the process over a period of time. The resulting market prices can then provide information that can be acted upon throughout the process. If participants cannot acquire useful information to incorporate into the market, then market activity is nothing more than playing roulette, where all answers are equally likely until the correct answer is determined.

A good example to illustrate the above criteria is a customer satisfaction survey. Railinc uses a twice-yearly survey to gauge customer sentiment on a list of products. For each product, customers are asked a series of questions whose answers range from 1 (disagree) to 5 (agree). The answers are then averaged into a final score for each product between 1 and 5 (the goal is to get as close to 5 as possible).

The following market could be set up for Railinc employees:
What will the Fourth Quarter 2010 customer satisfaction score be for product X?
  • Less than or equal to 4.0
  • Between 4.1 and 4.4 (inclusive)
  • Greater than or equal to 4.5
The value of this market is that Railinc management and product owners may get some insight into what employees are hearing from customers. Customer Service personnel could have one view based upon their interactions with customers, while developers may have a different view. Over time, management and product owners could take actions based upon market movements.

As far as concreteness is concerned, the final answer for this question will be determined when the survey is completed (e.g., January 2011), and it will be a specific number that falls into one of the ranges given by the answers.

This market also satisfies the last criterion - the ability to acquire information before the market is settled. This is important because it is where the value of the market is realized. As Railinc employees (i.e., market participants) gain knowledge over time, they can incorporate that knowledge into the market by buying and selling shares in the provided answers.
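To make the settlement mechanics concrete, here is a minimal sketch in plain Python (this is not Inkling's software or API - the payout value and holdings are invented, and it assumes the published score is rounded to one decimal place, as the answer ranges imply). The final survey score determines the single winning answer; shares in that answer pay out and shares in the other answers expire worthless.

# Hypothetical settlement of the bucketed survey question above - illustration only.
OUTCOMES = [
    ("Less than or equal to 4.0",       lambda s: s <= 4.0),
    ("Between 4.1 and 4.4 (inclusive)", lambda s: 4.1 <= s <= 4.4),
    ("Greater than or equal to 4.5",    lambda s: s >= 4.5),
]

def settle(final_score, holdings, payout_per_share=100):
    # Pay holders of the one outcome that the final survey score falls into.
    for name, matches in OUTCOMES:
        if matches(final_score):
            return {name: holdings.get(name, 0) * payout_per_share}
    return {}

# An employee holding 10 shares of the middle answer when the Q4 score comes in at 4.3:
print( settle(4.3, {"Between 4.1 and 4.4 (inclusive)": 10}) )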

The Art

In the example given above regarding the customer satisfaction survey, the answers provided were not arbitrary - they were selected to maximize the value of the market. This is where the art of prediction markets is applied.

If the possible answers for a customer survey are 1-5, why not provide five separate answers (1-1.9, 2-2.9, 3-3.9, 4-4.9, 5)? Why not have two possible answers (below 2.5 and above 2.5)? The selection of possible answers is partially determined by what is already known about the result. In the case of the survey, past results may have shown that this particular product has averaged 4.1. It is highly unlikely that the survey results will drop to the 1-1.9 range. Providing such an answer would not be valuable because market participants would almost immediately short that position. That is still information, but it is information that is already known. What is desired is insight into what is not known. The answers provided in the above example will give some insight into whether the product is continuing to improve or whether it is regressing.

So, the selection of possible answers to market questions must take into account what is already known as well as what is unknown. What do you know about what you don't know?

Conclusion

Good questions make good prediction markets. Constructed properly, these questions can be a valuable tool in the decision making process of an organization.

Monday, June 28, 2010

Introduction to Prediction Markets

Prediction Markets are an implementation of the broader concept of Collective Intelligence. In general, Collective Intelligence is intelligence that emerges from the shared knowledge of individuals and can then be used to make decisions. With Prediction Markets (PM), this intelligence emerges through the use of market mechanisms (buying/selling securities) where the payout depends upon the outcomes of future events. In short, the collective is attempting to predict the future.

Prediction Markets should be familiar to us because a stock market is really just a forum for making predictions about the value of some underlying security. Participants buy and sell shares in a company, for example, based on information they feel is relevant to the future value of that company. A security's price is an aggregated bit of information that is not only a prediction about the future, but is also new information from which more predictions can be made. That last part is important because prices are information that cause participants to act in a market.

A real-world example of using PMs to make decisions is Best Buy's TagTrade system. This system is used by Best Buy employees to provide information back to management on issues like customer sentiment. The linked article explains one particular incident:
TagTrade indicated that sales of a new service package for laptops would be disappointing when compared with the formal forecast. When early results confirmed the prediction, the company pulled the offering and relaunched it in the fall. While far from flawless, the prediction market has been more accurate than the experts a majority of the time and has provided management with information it would not have had otherwise.
Another interesting example comes from Motorola and their attempts to deal with idea/innovation requests from their employees. Their ThinkTank system was set up to allow employees to submit ideas on products and innovations. Those charged with weeding through these requests were initially overwhelmed. To improve the process, Motorola used PM software to allow employees to purchase shares in the submitted ideas. At the end of 30 days the market was closed; the ideas with the highest share prices were pursued, and employees holding stock in those ideas got a bonus.

(Some other companies using Prediction Markets are IBM, Google (PDF), Microsoft, and Yahoo! Some of these companies use internal prediction markets (employees only) while others provide external markets (general population). The Iowa Electronic Markets (IEM), associated with the University of Iowa, uses PMs to predict election outcomes. The IEM has been in existence for over 20 years, and there are studies showing its predictions to be more accurate than phone polls.)

The bonus paid out by Motorola points to an important aspect of PMs - incentives. With good incentives participants stay interested in the process and look for ways to make more accurate predictions. Driving people to discover new information about future events can lead to interesting behavior in a company.

Another key aspect of PMs is the idea of weighting. That is, the ability of traders to put some weight behind their predictions. Those who are more confident in their predictions can purchase/sell more shares in those outcomes. Contrast this with a simple survey where an expert's opinion gets the same weight as a layman's (one person one vote).
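To illustrate how that weighting turns into a price, many prediction-market platforms run an automated market maker such as Hanson's logarithmic market scoring rule (LMSR). The sketch below is ordinary Python, not any particular vendor's implementation; it simply shows that buying more shares of an outcome pushes its price - the market's implied probability - upward, so a confident trader literally moves the prediction more than a hesitant one.

import math

def lmsr_prices(shares_outstanding, b=100.0):
    # Implied probability of each outcome under the logarithmic market scoring rule;
    # b is the liquidity parameter - larger b means prices move more slowly.
    weights = [math.exp(q / b) for q in shares_outstanding]
    total = sum(weights)
    return [w / total for w in weights]

# Two-outcome market: a confident trader buys 50 shares of the first outcome.
print( lmsr_prices([0, 0]) )    # [0.5, 0.5] - no information in the market yet
print( lmsr_prices([50, 0]) )   # first outcome's price rises to roughly 0.62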

Railinc is now starting to venture into using Prediction Markets with Inkling's software and services. Some of the topics for which predictions could be made are bonus metrics, customer surveys, project metrics, and fun things like World Cup results. One thing that will be interesting to track over the coming months is the value of PMs in such a small company (Railinc has approximately 150 employees). Value from PMs tends to come from larger populations where errors can be canceled out and participation rates stay constant. The hope is that at some point these markets will be opened to various parties in the rail industry thereby increasing the population and alleviating this concern. If the markets were opened up to external parties then the topics could be broadened to include regulatory changes, industry trends, product suggestions, and ideas to improve existing applications. So, the potential is there if the execution is handled properly.

Prediction Markets provide an interesting way to efficiently gather dispersed information. Using this innovative tool, Railinc will attempt to tap into the Collective Intelligence of its employees and, hopefully, the rail industry.

More to come.

Thursday, June 17, 2010

ESRI and Python

Railinc is using ESRI to create map services. One of these services provides information about North American rail stations. The official record of these stations is in a DB2 database that gets updated whenever stations are added, deleted, or changed in some way. When we first created the ESRI service to access these stations, we copied the data from DB2 to an Oracle table, then built an ESRI ArcSDE Geodatabase using the Oracle data.

We had some issues with the ArcSDE Geodatabase architecture, and after some consultation we decided to switch to a File Geodatabase. This architecture avoids Oracle altogether and instead uses files on the file system. With this setup we've seen better performance and better stability from the ESRI services. (N.B.: This is not necessarily a statement about ESRI services in general; our particular infrastructure is what prompted the move away from the Oracle solution.)

The question now is: how do we keep the stations data up to date with the File Geodatabase approach? Enter Python.

Rail Stations Data

Before getting to the Python script, let's take a look at the structure of the rail stations table.
  • RAIL_STATION_ID - unique id for the record
  • SCAC - A four character ID, issued by Railinc, that signifies the owner of the station
  • FSAC - A four digit number that, combined with the SCAC, provides a unique identifier for the station
  • SPLC - A nine digit number that is a universal identifier for the geographic location of the station
  • STATION_NAME
  • COUNTY
  • STATE_PROVINCE
  • COUNTRY
  • STATION_POSTAL_CODE
  • LATITUDE
  • LONGITUDE
  • LAST_UPDATED
Most of this data is informational only. What's most important for this process are the latitude and longitude columns, which will be used to create the geospatial objects.

Python and ESRI

The end result of this process is going to be the creation of an ESRI Shapefile - a file format created and regulated by ESRI as an open specification for data interoperability. Basically, shapefiles describe geometries - points, lines, polygons, and polylines.

While working on this problem I found three ways to create shapefiles programmatically:
  • The ESRI Java API
  • The ESRI Python scripting module
  • The Open Source GeoTools Toolkit
I chose Python over the others because of its simplicity and its history with ESRI. (I do have a working solution using the GeoTools Toolkit that I may share in a future blog post.) Now, to the script.

First, I'll create the Geoprocessor object using the ESRI arcgisscripting module, specifying that I want output to be overwritten (strictly speaking, this tells subsequent function calls to overwrite any existing output).


import arcgisscripting, cx_Oracle, datetime

gp = arcgisscripting.create(9.3)
gp.Overwriteoutput = 1
gp.workspace = "/usr/local/someworkspace"
gp.toolbox = "management"

Next, I'll create an empty feature class, specifying the location (workspace), file name, and type of geometry. The geometry can be POINT, MULTIPOINT, POLYGON, or POLYLINE. In this case, I'll use a POINT to represent a station. At this time I will also define the projection for the geometry.
gp.CreateFeatureclass( "/usr/local/someworkspace", "stations.shp", "POINT" )
coordsys = "Coordinate Systems/Geographic Coordinate Systems/North America/North American Datum 1983.prj"
gp.defineprojection( "stations.shp", coordsys )

Now I need to define the structure of the feature class. When I created the feature class above, I defined it with the POINT geometry, so the structure already includes a Shape field. What's left is to create the station-specific fields.

gp.AddField_management( "stations.shp", "STATION_ID", "LONG", "", "", "10", "", "", "REQUIRED", "" )
gp.AddField_management( "stations.shp", "SCAC", "TEXT", "", "", "4", "", "", "REQUIRED", "" )
gp.AddField_management( "stations.shp", "FSAC", "TEXT", "", "", "4", "", "", "REQUIRED", "" )
...
gp.AddField_management( "stations.shp", "LATITUDE", "DOUBLE", "19", "10", "12", "", "", "REQUIRED", "" )
gp.AddField_management( "stations.shp", "LONGITUDE", "DOUBLE", "19", "10", "12", "", "", "REQUIRED", "" )
gp.AddField_management( "stations.shp", "LAST_UPD", "DATE" )

At this point I have a shapefile with a feature class based upon the station schema. Before adding data I must create a cursor to access the file. The Geoprocessor provides methods to create three types of cursors - insert, update, and search. Since I am creating a shapefile I will need an insert cursor.

cur = gp.InsertCursor( "/usr/local/someworkspace/stations.shp" )
pnt = gp.CreateObject("Point")

I've also created a Point object here that I will use repeatedly for each record's Shape field in the feature class.

Oracle

Now that the output structure is ready, I need some input. To query the Oracle table I will use the cx_Oracle module. This is one of the reasons why I liked the Python solution - accessing Oracle was trivial. Simply create a connection, create a cursor to loop over, and execute the query.

dbConn = cx_Oracle.connect( username, pw, url )
dbCur = dbConn.cursor()
dbCur.execute( "SELECT * FROM RAIL_STATIONS" )

Now I can start building the shapefile. The process will loop over the database cursor and create a new feature class row, populating the row with the rail station data.

for dbRow in dbCur:

    # Longitude maps to x, latitude to y (columns 10 and 9 of the query result)
    pnt.x = dbRow[10]
    pnt.y = dbRow[9]

    pnt.id = dbRow[0]

    # NewRow only creates a row object; InsertRow below actually adds it
    fcRow = cur.NewRow()
    fcRow.shape = pnt
    
    
    fcRow.STATION_ID = dbRow[0]
    fcRow.SCAC = dbRow[1]
    fcRow.FSAC = dbRow[2]
    fcRow.SPLC = dbRow[3]
    ...
    fcRow.LATITUDE = dbRow[9]
    fcRow.LONGITUDE = dbRow[10]
    fcRow.LAST_UPD = dbRow[11].strftime( "%x %X" )

    cur.InsertRow(fcRow)

dbCur.close()
dbConn.close()
del cur, dbCur, dbConn

For each database row, the Point object created above is populated with the longitude (x) and latitude (y) values and assigned to the feature class row's Shape field. Note that the InsertCursor's NewRow method acts as a factory - it only creates a new row object; it does not insert anything into the feature class. With the new feature class row and the current database row in hand, I can populate all of the remaining fields and then insert the row via the cursor's InsertRow method. The last few lines are just cleanup of the cursors and connections.


One problem that took me a while to figure out (since I am new to ESRI and Python) was handling dates. My first pass at populating the LAST_UPD field was to use fcRow.LAST_UPD = dbRow[11]. Consistent, right? When I did this I got the following error:

Traceback (most recent call last):
  File "createStationShp.py", line 72, in 
    feat.LAST_UPD = row[11]
ValueError: Row: Invalid input value for setting

After searching around, I figured out that what was coming back from Oracle was a datetime.datetime object that was not being accepted by the feature class date field. I found that I could convert the datetime.datetime to a string and ESRI would do the date conversion properly ("%x %X" simply formats the date and time using the locale's default representations).
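Here is a minimal, stand-alone illustration of that conversion (the sample timestamp is made up):

import datetime

# cx_Oracle returns DATE columns as datetime.datetime objects
last_updated = datetime.datetime(2010, 6, 17, 14, 30, 0)

# "%x %X" formats the locale's default date and time representations,
# e.g. '06/17/10 14:30:00', which the feature class DATE field accepts
print( last_updated.strftime( "%x %X" ) )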

Conclusion

That's it. Now I have a shapefile that I can use with my ESRI File Geodatabase architecture. The next step is to swap out shapefiles when the stations data changes (which it does on a regular basis). Can this be done without recreating the ESRI service? Stay tuned.

Friday, June 11, 2010

Geospatial Analytics using Teradata: Part I

In October, a co-worker and I will be giving a presentation at the Teradata PARTNERS conference on how Railinc uses Teradata for geospatial analytics. Since I did not propose the paper, write the abstract, or even work on the geospatial analytics, I will be learning a lot during this process. To help with that education, I will be sharing some thoughts in a series of blog posts.

To kick the series off, let me share the abstract that was originally proposed:
Linking location to information provides a new data dimension, a new precision, unlocking a huge potential in analytics. Geospatial data enables entirely new industry metrics, new insights, and better decision making. Railinc, as a trusted provider of IT services to the rail freight industry, is responsible for accurate and timely dissemination of more than 10 million rail events per day. This session provides an overview of how Railinc's Business Analytics group has implemented Active Data Warehouse and Teradata Geospatial technologies to bring an unprecedented amount of new rail network insight. The real-time calculation of geospatial metrics from rail events has enabled Railinc to better assess: 1) rail equipment utilization, 2) repair patterns, 3) geographic usage patterns, and other factors, all of which afford insights that impact maintenance program decisions, component deployments, service designs, and industry policy decisions.
Below is a first pass at an outline for the talk. It is preliminary and will most likely change over the coming months.
  1. Describe Railinc's Teradata installation
  2. Describe Railinc's source systems
    1. Rail car movement events
    2. Rail car Inventory
    3. Rail car health
    4. Commodity
  3. Describe our ETL process
  4. Explain the FRA geospatial rail track data
    1. Track ownership complexity
  5. Tie 1-4 together
    1. Current state of car portal
    2. Car Utilization analytics
    3. Traffic pattern analytics
  6. Lessons learned
    1. Study of different routing algorithms
    2. Data quality issues
Item 5 is the problem - how can we tie our source systems together with geospatial data in a compelling way? One idea is a portal that provides information about the current state of a rail car. How would geospatial data fit into this portal? Location is the most obvious answer, but is there something more interesting? What about an odometer reading for a rail car? Outside of the portal there are ideas around car utilization and traffic patterns. I like the last two but I need to learn more about them.
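To make the odometer idea a bit more concrete, here is a rough sketch in ordinary Python (not our Teradata implementation - the event locations are invented). It sums great-circle distances between a car's successive reported locations; a real version would measure along the FRA track geometry rather than straight lines, so this understates the actual mileage.

import math

def haversine_miles(lat1, lon1, lat2, lon2):
    # Great-circle distance in miles between two latitude/longitude points
    radius = 3959.0  # mean Earth radius in miles
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))

def odometer(events):
    # Sum straight-line mileage over a car's time-ordered (lat, lon) movement events
    return sum(haversine_miles(a[0], a[1], b[0], b[1])
               for a, b in zip(events, events[1:]))

# Invented movement events for one car: Raleigh -> Greensboro -> Charlotte
print( odometer([(35.78, -78.64), (36.07, -79.79), (35.23, -80.84)]) )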

These are some issues/questions I need to answer over the coming months. Along the way I plan on sharing information about implementation details, possible business cases, and any problems I come across.