New York City taxi cab data


NYのタクシーの乗降客情報が下記の記事でテストデータとなっていた。

Big Data ST_Geometry Queries up to 20X Faster in Hive
http://blogs.esri.com/esri/arcgis/2014/07/10/big-data-st_geometry-queries-up-to-20x-faster-in-hive/

20倍速くなったらしいが、使い物になるのはしばらく先だろうと予想しているので暇あったらいじってみる程度か。

New York City taxi cab data
http://chriswhong.com/open-data/foil_nyc_taxi/

とりあえず落としてぶち込んだメモ。

<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
	<RECORD>
		<FIELD ID="0" xsi:type="CharFixed" LENGTH="1"/>
		<FIELD ID="1" xsi:type="CharTerm" TERMINATOR=","/>
		<FIELD ID="2" xsi:type="CharTerm" TERMINATOR=","/>
		<FIELD ID="3" xsi:type="CharTerm" TERMINATOR=","/>
		<FIELD ID="4" xsi:type="CharTerm" TERMINATOR=","/>
		<FIELD ID="5" xsi:type="CharTerm" TERMINATOR=","/>
		<FIELD ID="6" xsi:type="CharTerm" TERMINATOR=","/>
		<FIELD ID="7" xsi:type="CharTerm" TERMINATOR=","/>
		<FIELD ID="8" xsi:type="CharTerm" TERMINATOR=","/>
		<FIELD ID="9" xsi:type="CharTerm" TERMINATOR=","/>
		<FIELD ID="10" xsi:type="CharTerm" TERMINATOR=","/>
		<FIELD ID="11" xsi:type="CharTerm" TERMINATOR=","/>
		<FIELD ID="12" xsi:type="CharTerm" TERMINATOR=","/>
		<FIELD ID="13" xsi:type="CharTerm" TERMINATOR=","/>
		<FIELD ID="14" xsi:type="CharTerm" TERMINATOR="\r\n"/>
	</RECORD>
	<ROW>
		<COLUMN SOURCE="1" NAME="medallion" xsi:type="SQLNVARCHAR"/>
		<COLUMN SOURCE="2" NAME="hack_license" xsi:type="SQLNVARCHAR"/>
		<COLUMN SOURCE="3" NAME="vendor_id" xsi:type="SQLNVARCHAR"/>
		<COLUMN SOURCE="4" NAME="rate_code" xsi:type="SQLINT"/>
		<COLUMN SOURCE="5" NAME="store_and_fwd_flag" xsi:type="SQLNVARCHAR"/>
		<COLUMN SOURCE="6" NAME="pickup_datetime" xsi:type="SQLDATETIME"/>
		<COLUMN SOURCE="7" NAME="dropoff_datetime" xsi:type="SQLDATETIME"/>
		<COLUMN SOURCE="8" NAME="passenger_count" xsi:type="SQLINT"/>
		<COLUMN SOURCE="9" NAME="trip_time_in_secs" xsi:type="SQLINT"/>
		<COLUMN SOURCE="10" NAME="trip_distance" xsi:type="SQLFLT8"/>
		<COLUMN SOURCE="11" NAME="pickup_longitude" xsi:type="SQLFLT8"/>
		<COLUMN SOURCE="12" NAME="pickup_latitude" xsi:type="SQLFLT8"/>
		<COLUMN SOURCE="13" NAME="dropoff_longitude" xsi:type="SQLFLT8"/>
		<COLUMN SOURCE="14" NAME="dropoff_latitude" xsi:type="SQLFLT8"/>

	</ROW>
</BCPFORMAT>
DECLARE @path AS VARCHAR(MAX);
DECLARE @prePath AS VARCHAR(MAX);
DECLARE @postPath AS VARCHAR(MAX);
DECLARE @colDef AS VARCHAR(MAX);

SET @path = 'C:\data\trip_data\';
SET @prePath = 'trip_data_';
SET @postPath = '.csv';
SET @colDef = 'C:\data\trip_data\TP.xml';

DECLARE @csvPath AS VARCHAR(MAX);
DECLARE @counter int;
DECLARE @ex_query AS VARCHAR(MAX);

SET @counter = 1;

WHILE (@counter < 12)
BEGIN
  SET @csvPath = @path + 
				 @prePath + convert(varchar,@counter) + @postPath
  SET @ex_query =
  'SELECT
    *
    ,GEOGRAPHY::Point(pickup_latitude,pickup_longitude ,4326) AS pickup  -- 4326=wg1984
	,GEOGRAPHY::Point(dropoff_latitude,dropoff_longitude ,4326) AS dropoff -- 4326=wg1984
  FROM OPENROWSET(
    BULK ''' + @csvPath  + ''',
    FIRSTROW = 2,
    FORMATFILE='''+ @colDef +'''
  ) AS CSV;'

  exec(@ex_query);

  SET @counter = @counter + 1;
END
カテゴリー: 開発 タグ: パーマリンク