Connect to Text Files with the Microsoft Jet ODBC Text Driver
by Steve Kass
SQL Server provides several
solutions for importing and exporting text files. Data transformation services (DTS) supports text as a data source
or destination; the bcp and osql utilities can import and export text files;
and T-SQL provides BULK INSERT for importing text. Another solution is less widely known: the Jet text file driver
can connect SQL Server directly to text files via a linked server. This ODBC driver handles some tasks better
than any of the other solutions, and it provides pure T-SQL solutions to many
problems.
This is the first of (at
least) two articles on the Jet text driver.
It shows how to connect a directory of text files as a linked server and
includes a practical example. If you’re
near a development server as you read this, open the Query Analyzer and follow
along.
Connecting a directory of
text files as a linked server is easy.
Here’s how:
That’s all that’s required
to add the linked server txtsvr as a connection to the text files in E:\txtsvr,
but there are no tables in txtsvr yet.
Also missing is a required file, schema.ini, which must contain a description
of each table’s format. You can learn more about schema.ini from the
Microsoft MSDN Library, but for now, just use Notepad or your favorite text
editor to create the file e:\txtsvr\schema.ini as shown in Listing 1. In addition, create an empty text file in
e:\txtsvr named table1.txt.
The file
e:\txtsvr\table1.txt is now an empty table in the linked server txtsvr. We can see that if we list information about
all the tables in txtserv with the system stored procedure sp_tables_ex:
EXEC sp_tables_ex
txtsvr
You’ll see that the name of
the one table in txtsvr is table1#txt, according to sp_tables_ex. The table name for a file in txtsvr will
always contain ‘#’ in place of the filename’s ‘.’ character.
We can’t issue a CREATE
TABLE statement against this linked server, but we have created a table easily
enough. We can, however, run INSERT and
SELECT queries, if we use the four-part naming convention
<server>…<table> (that’s three dots):
insert into
txtsvr...table1#txt(CustomerNumber,CustomerName,EntryDate)
select orderid, shipname, orderdate
from Northwind..orders
Be sure to specify the
column names explicitly in any INSERT query.
The ordinal positions of the columns don’t always match the order given
in schema.ini, and so you may be in for a surprise if you omit the column
names. This INSERT query exports data
from Northwind..orders to e:\txtsvr\table1.txt. It also doesn’t make much sense, but it’s just an example. We’ll get to a practical example very
soon. Here are the first few lines of
e:\txtsvr\table1.txt after we run the query.
Tab characters separate the three columns of data in each row.
10248 "Vins et alcools Chevalier" 7/4/1996 0:00:00
10249 "Toms Spezialitäten" 7/5/1996 0:00:00
10250 "Hanari Carnes" 7/8/1996 0:00:00
Now that this file contains
data, a SELECT query is worth trying.
Remember, this is both a text file in a Windows directory and a table in
the linked server txtsvr.
select CustomerName,
convert(varchar(10),EntryDate,101) as [Date]
from txtsvr...table1#txt
where datepart(weekday,EntryDate) = 4
Output (results abridged):
CustomerName Date
----------------------------------------
----------
Hanari Carnes 07/10/1996
Ernst Handel 07/17/1996
Folk och fä HB 07/24/1996
Now we certainly don’t need
the Jet text file driver or a linked server just to retrieve this
information. This was simply an example
to introduce the technique.
If you’re running the
examples as you read, you probably noticed that these queries were slow. Each one takes several seconds at best. If you run them a second time, however,
you’ll see a significant improvement.
And if you’re concerned with the time it might take if table1.txt
contained thousands, or hundreds of thousands, of rows, rest assured that this
approach will do as well as DTS, bcp, or BULK INSERT in its speed, and in other
respects, it has some real advantages.
If you’ve ever wanted to
generate a Windows directory listing, you might have found the extended stored
procedure xp_cmdshell useful. The
one-liner
EXEC
master..xp_cmdshell 'dir "e:\Program Files\Ocelot"'
shows me what’s in my Ocelot
program directory (see Listing 2).
It takes some work to turn
this into a usable table containing the name, last modification date, and size
of each non-directory file. Among the
difficulties are the commas in the size (pretend you don’t know about DIR/-C),
the non-standard AM/PM indication, and the various extra lines of information.
One of the most welcome
features of the Jet text file driver is how well it accommodates text files
contain a mix of formatted and non-formatted lines. We’ll add a table specification to schema.ini that describes the
“good” lines as fixed-length data, and then view this mixed-bag file as a
table. The lines that don’t contain a
date, size and filename in neat columns, or that are blank or too long or too
short, won’t cause any trouble, as they do if we try to use bcp or BULK INSERT
to import this information into a three-column table.
While there may be a way to
specify the one-character AM/PM indicator as a datetime format, I didn’t find
it. As a result, I put four columns
into the schema for a DIR output file, and I added an entry to schema.ini
(Listing 3).
You’ll note that I’m
importing the file size column as currency.
That serves two purposes. First,
it means I don’t have to do anything special to accommodate the commas, since
the driver welcomes them in monetary data.
Second, it eliminates the possibility of overflowing a 4-byte integer if
a file is much bigger than 2 GB. I
chose 20 for the filename size so the results would fit on this page. What’s more important to know is that any
filenames longer than 20 characters will be quietly truncated to 20 characters
when data is selected through the linked server. Conveniently, directory.txt contains the entire filename, so if
the truncation is a problem, you don’t have to generated the directory listing
again. Simply replace the 20 with a
larger number in schema.ini.
My specific goal will be to
get the following results from txtsvr...directory#txt, if
E:\txtsvr\directory.txt contains a raw directory listing.
Name
FileSize Date
-------------------- --------------------
-------------------
CHANGE.SQL
874 2001-12-28
04:00:00
CHAR.SQL
1298 2001-12-28
04:00:00
CLUSTER.OCELOT
884 2001-12-28
04:00:00
(results abridged)
This goal is just two simple
steps away. First, put the raw
directory listing into directory.txt.
I’ll do this the hard way (shown in Listing 4), as you might want to in
a stored procedure, with the directory as a parameter. Note the use of quotename(), which handles
spaces in the directory name and guards against malicious “SQL injection.”
Second, query the table
directory#txt for the data you want, understanding that if the first 17
characters of a line do not contain a date in the format mm/dd/yyyy hh:nn, directory#txt will show NULL in the
Date column, and if the 21 character-wide size column doesn’t contain a valid
currency value, Size will be NULL. This
particular query (Listing 5) retrieves only the files, not the directories, by
requiring the Size column to be non-null.
That’s it. But just in case you decide you need
something else, the entire raw directory listing is still in
directory.txt. If you later want to see
a list of the directories in E:\Program Files\Ocelot, use Listing 5 again,
but change the WHERE condition to where [Size] is null and FileDate is not null.
In the next article, you’ll
learn how powerful the Jet driver is for handling ragged file, unusual formats,
and large quantities of data.
Listing 1.
[table1.txt]
Format=TabDelimited
CharacterSet=OEM
ColNameHeader=True
Col1=CustomerNumber
Long
Col2=CustomerName
Text Width 40
Col3=EntryDate
DateTime
Listing 2.
output
-----------------------------------------------------------
Volume in drive E is DISK3PART01
Volume Serial Number is 841B-E016
NULL
Directory of e:\Program Files\Ocelot
NULL
12/31/2001 09:09p
<DIR> .
12/31/2001 09:09p
<DIR> ..
12/28/2001 04:00a 874 CHANGE.SQL
12/28/2001 04:00a 1,298 CHAR.SQL
12/28/2001 04:00a 884 CLUSTER.OCELOT
12/28/2001 04:00a 1,493 COMPOUND.SQL
(results abridged)
Listing 3.
[directory.txt]
Format=FixedLength
CharacterSet=OEM
ColNameHeader=False
DateTimeFormat=mm/dd/yyyy hh:nn
Col1=FileDate
DateTime Width 17
Col2=AMPM Text Width
1
Col3=Size Currency
Width 21
Col4=Name Text Width
20
Listing 4.
declare @directory
varchar(80)
set @directory =
'E:\Program Files\Ocelot'
declare @cmd
varchar(300)
set @cmd =
'dir ' + cast(quotename(@directory,'"')
as varchar(320))
+ ' > E:\txtsvr\Directory.txt'
exec
master..xp_cmdshell @cmd
Listing 5.
select
[Name],
cast([Size] as bigint) as FileSize,
convert(varchar(19),
dateadd(hour,
case when ampm='a' then 0
else 12 end,
FileDate),
120) as [Date]
from txtsvr...Directory#txt
where [Size] is not null
and FileDate is not null
order by fName