Thursday, June 17, 2010

ESRI and Python

Railinc is using ESRI to create map services. One of these services provides information about North American rail stations. The official record of these stations is in a DB2 database that gets updated whenever stations are added, deleted, or changed in some way. When we first created the ESRI service to access these stations, we copied the data from DB2 to an Oracle table, then built an ESRI ArcSDE Geodatabase using the Oracle data.

We had some issues with the ArcSDE Geodatabase architecture, and after some consultation we decided to switch to a File Geodatabase. This architecture avoids Oracle altogether and instead uses files on the file system. With this set up we've seen better performance and better stability of the ESRI services. (N.B: This is not necessarily a statement about ESRI services in general. Our particular infrastructure caused us to move from the Oracle solution.)

The question now is how do we keep the stations data up-to-date when using the File Geodatabase approach? Enter Python.

Rail Stations Data

Before getting to the Python script, let's take a look at the structure of the rail stations table.
  • RAIL_STATION_ID - unique id for the record
  • SCAC - A four character ID, issued by Railinc, that signifies the owner of the station
  • FSAC - A four digit number that, combined with the SCAC, provides a unique identifier for the station
  • SPLC - A nine digit number that is a universal identifier for the geographic location of the station
Most of this data is going to be informational only. What's most important for this process are the latitude and longitude columns which will be used to create geospatial objects.

Python and ESRI

The end result of this process is going to be the creation of an ESRI Shapefile - a file format created and regulated by ESRI as an open specification for data interoperability. Basically, shapefiles describe geometries - points, lines, polygons, and polylines.

While working on this problem I found three ways to create shapefiles programmatically:
  • The ESRI Java API
  • The ESRI Python scripting module
  • The Open Source GeoTools Toolkit
I chose Python over the others because of its simplicity and its history with ESRI. (I do have a working solution using the GeoTools Toolkit that I may share in a future blog post.) Now, to the script.

First, I'll create the Geoprocessor object using the ESRI arcgiscripting module specifying that I want output to be overwritten (actually, this tells subsequent function calls to overwrite any output).

import arcgisscripting, cx_Oracle, datetime

gp = arcgisscripting.create(9.3)
gp.Overwriteoutput = 1
gp.workspace = "/usr/local/someworkspace"
gp.toolbox = "management"

Next, I'll create an empty feature class specifying the location (workspace), file, and type of geometry. The geometry can be POINT, MULTIPOINT, POLYGON, and POLYLINE. In this case, I'll use a POINT to represent a station. At this time I will also define the projection for the geometry.
gp.CreateFeatureclass( "/usr/local/someworkspace", "stations.shp", "POINT" )
coordsys = "Coordinate Systems/Geographic Coordinate Systems/North America/North American Datum 1983.prj"
gp.defineprojection( "stations.shp", coordsys )

Now I need to define the structure of the feature class. When I created the feature class above I defined it with the POINT geometry. So the structure is already partially defined with a Shape field. What's left is to create fields to hold the station specific structure.

gp.AddField_management( "stations.shp", "STATION_ID", "LONG", "", "", "10", "", "", "REQUIRED", "" )
gp.AddField_management( "stations.shp", "SCAC", "TEXT", "", "", "4", "", "", "REQUIRED", "" )
gp.AddField_management( "stations.shp", "FSAC", "TEXT", "", "", "4", "", "", "REQUIRED", "" )
gp.AddField_management( "stations.shp", "LATITUDE", "DOUBLE", "19", "10", "12", "", "", "REQUIRED", "" )
gp.AddField_management( "stations.shp", "LONGITUDE", "DOUBLE", "19", "10", "12", "", "", "REQUIRED", "" )
gp.AddField_management( "stations.shp", "LAST_UPD", "DATE" )

At this point I have a shapefile with a feature class based upon the station schema. Before adding data I must create a cursor to access the file. The Geoprocessor provides methods to create three types of cursors - insert, update, and search. Since I am creating a shapefile I will need an insert cursor.

cur = gp.InsertCursor( "/usr/local/someworkspace/stations.shp" )
pnt = gp.CreateObject("Point")

I've also created a Point object here that I will use repeatedly for each record's Shape field in the feature class.


Now that the output structure is ready, I need some input. To query the Oracle table I will use the cx_Oracle module. This is one of the reasons why I liked the Python solution - accessing Oracle was trivial. Simply create a connection, create a cursor to loop over, and execute the query.

dbConn = cx_Oracle.connect( username, pw, url )
dbCur = dbConn.cursor()
dbCur.execute( "SELECT * FROM RAIL_STATIONS" )

Now I can start building the shapefile. The process will loop over the database cursor and create a new feature class row, populating the row with the rail station data.

for dbRow in dbCur:

    pnt.x = dbRow[10]
    pnt.y = dbRow[9] = dbRow[0]

    fcRow = cur.NewRow()
    fcRow.shape = pnt
    fcRow.STATION_ID = dbRow[0]
    fcRow.SCAC = dbRow[1]
    fcRow.FSAC = dbRow[2]
    fcRow.SPLC = dbRow[3]
    fcRow.LATITUDE = dbRow[9]
    fcRow.LONGITUDE = dbRow[10]
    fcRow.LAST_UPD = dbRow[11].strftime( "%x %X" )


del cur, dbCur, dbConn

First, the Point object created above is used to populate the feature class's Shape field. However, before doing that the InsertCursor is used to create a new row in the feature class (this acts as a factory and only creates a new row object - it does not insert the object into the feature class). Once I have the new row from the database I can populate all of the fields in the feature class row. Finally, I insert the row into the cursor (actually, the final part is the clean up).

One problem that took me a while to figure out (since I am new to ESRI and Python) was handling dates. My first pass at populating the LAST_UPD field was to use fcRow.LAST_UPD = dbRow[11]. Consistent, right? When I did this I got the following error:

Traceback (most recent call last):
  File "", line 72, in 
    feat.LAST_UPD = row[11]
ValueError: Row: Invalid input value for setting

After searching around I figured out that what was coming back from Oracle was a datetime.datetime type that was not being accepted by the feature class date type. I found that I could convert the datetime.datetime to a string and ESRI would do the date conversion properly ("%x %X" just takes whatever the date and time formats are and outputs them as strings).


That's it. Now I have a shapefile that I can use with my ESRI File Geodatabase architecture. The next step is to swap out shapefiles when the stations data changes (which it does on a regular basis). Can this be done without recreating the ESRI service? Stay tuned.


No comments: