MATLAB - Importing Data with TDFREAD (and TEXTSCAN)
26 Feb 2009 Quan Quach 21 comments 2,545 views
Recently, we got an inquiry through the contact form about how to import a particular data set. Since importing text-based data is an important task that many people have questions about, today’s post walks through that example. Specifically, this post focuses on using tdfread to import tab-delimited data. The tdfread command ships with the Statistics Toolbox (it is not part of base MATLAB) and reads in tab-delimited data. It’s amazingly easy to use! In addition, we’ll revisit the textscan command and use it to import the same data.
Contents
- The Input Data Set
- Which Data Set Should I Work With?
- Using TDFREAD
- Using TEXTSCAN
- The Best Way to Import Data?
The Input Data Set
Scott wrote:
I have a problem that I need some help solving. Pasted onto this message is
following data. It contains column headers for the Region, State, Sales, Head
Count. What needs to be done is to read this data in (either by .txt or .csv)
and organize it then sum the data and display in table form.
The data as .txt (this data is tab delimited):
Region	State	Sales	Head Count
North	North Dakota	80078	81
North	Montana	90608	391
North	Michigan	4598	27
North	Wisconsin	11622	36
South	Florida	9788	73
South	Georgia	86456	385
South	Alabama	94766	91
South	Mississippi	13004	61
East	North Carolina	612	25
East	Virginia	85508	233
West	California	84419	262
West	Washington	92682	97
West	Oregon	53185	51
The data in .csv format:
i,Region,State,Sales,Head Count
1,North,North Dakota,80078,81
2,North,Montana,90608,391
3,North,Michigan,4598,27
4,North,Wisconsin,11622,36
5,South,Florida,9788,73
6,South,Georgia,86456,385
7,South,Alabama,94766,91
8,South,Mississippi,13004,61
9,East,North Carolina,612,25
10,East,Virginia,85508,233
11,West,California,84419,262
12,West,Washington,92682,97
13,West,Oregon,53185,51
You can download the sample files here: Text Sample Data | CSV Sample Data
Which Data Set Should I Work With?
Half the battle was deciding which data set was easier to work with. To make this post more interesting, Rob and I decided on a friendly wager to see who could import the data into MATLAB first. The victorious party (to be determined from reader comments) wins a free lunch.
My first inclination was to work with the tab delimited data. Since I’ve worked with a bunch of tab delimited data before, I was more familiar with this. I was crossing my fingers that I would prevail against Mr. Slazas.
On the other hand, our resident textscan master, Rob Slazas, chose to work with the .csv file. Since he uses textscan in his sleep, Rob immediately recognized that the .csv file was set up perfectly for textscan.
We imported both files using different techniques (and different file formats). You can be the judge of which method works better.
Using TDFREAD
A very useful function that I discovered when working with tab delimited data is tdfread.
The MATLAB help states the following when I query for help on the command tdfread:
TDFREAD Read in text and numeric data from tab-delimited file.
Using the following command, I was able to import the data quite easily.
%notice that I didn't have to use fopen!
quanData = tdfread('SampleData.txt')
The following output is stored into a structure.
quanData =
Region: [13x5 char]
State: [13x14 char]
Sales: [13x1 double]
Head_Count: [13x1 double]
At this point, the data is successfully imported into MATLAB and is ready to be processed in whatever manner.
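Scott's original request was to sum the data and display it in table form. As a minimal sketch of one way to do that with the structure returned by tdfread (not necessarily how Scott ended up solving it), the snippet below recreates SampleData.txt so it can run on its own, then totals Sales and Head Count per region. It assumes tdfread (Statistics Toolbox) is available; every variable name other than quanData is made up for illustration.

```matlab
% Recreate the tab-delimited sample file from the post so the example
% is self-contained (skip this part if SampleData.txt already exists).
rows = { ...
    'Region\tState\tSales\tHead Count', ...
    'North\tNorth Dakota\t80078\t81', ...
    'North\tMontana\t90608\t391', ...
    'North\tMichigan\t4598\t27', ...
    'North\tWisconsin\t11622\t36', ...
    'South\tFlorida\t9788\t73', ...
    'South\tGeorgia\t86456\t385', ...
    'South\tAlabama\t94766\t91', ...
    'South\tMississippi\t13004\t61', ...
    'East\tNorth Carolina\t612\t25', ...
    'East\tVirginia\t85508\t233', ...
    'West\tCalifornia\t84419\t262', ...
    'West\tWashington\t92682\t97', ...
    'West\tOregon\t53185\t51'};
fid = fopen('SampleData.txt', 'w');
for k = 1:numel(rows)
    fprintf(fid, [rows{k} '\n']);   % fprintf expands \t and \n
end
fclose(fid);

quanData = tdfread('SampleData.txt');

% Region is a space-padded char matrix; cellstr drops the trailing padding.
regions = cellstr(quanData.Region);

% idx(k) is the position of row k's region in the sorted list regionNames.
[regionNames, ~, idx] = unique(regions);

% Sum each numeric column within its region.
totalSales = accumarray(idx(:), quanData.Sales);
totalHeads = accumarray(idx(:), quanData.Head_Count);

% Print a small summary table, one line per region.
fprintf('%-8s %10s %12s\n', 'Region', 'Sales', 'Head Count');
for k = 1:numel(regionNames)
    fprintf('%-8s %10d %12d\n', regionNames{k}, totalSales(k), totalHeads(k));
end
```

Because unique sorts its output, the regions come out alphabetically rather than in file order. The cellstr step matters: without it, a padded entry such as 'East ' would not compare equal to 'East'.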
Using TEXTSCAN
The textscan command is a personal favorite of Rob Slazas. Rob had previously dedicated a post to this wonderful command. Independently of me, Rob chose to use the .csv file and the textscan command to import the data.
He came up with the following code:
%open up the data file
fid = fopen('SampleData.csv');
%use textscan to import the data
robData = textscan(fid,'%f%s%s%f%f','delimiter',',','headerlines',1);
%close the data file
fclose(fid);
At this point, the imported data is stored in a cell array as shown below. It should be straightforward to manipulate the data accordingly in this form.
robData =
[13x1 double] {13x1 cell} {13x1 cell} [13x1 double] [13x1 double]
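Each cell of the result holds one column of the file, so robData{4} is the Sales column and robData{3} is the State column. For anyone who prefers field names over positional indexing, the sketch below shows one hedged way to wrap the cell array in a structure with cell2struct. To stay self-contained it writes only the header and first three rows of the sample to a hypothetical file, SampleSubset.csv, and the field names are chosen here for illustration; textscan does not produce them.

```matlab
% Write the header and first three rows of the CSV sample to a
% hypothetical file so the example runs on its own.
csvLines = { ...
    'i,Region,State,Sales,Head Count', ...
    '1,North,North Dakota,80078,81', ...
    '2,North,Montana,90608,391', ...
    '3,North,Michigan,4598,27'};
fid = fopen('SampleSubset.csv', 'w');
fprintf(fid, '%s\n', csvLines{:});
fclose(fid);

% Same textscan call as in the post, pointed at the subset file.
fid = fopen('SampleSubset.csv');
robData = textscan(fid, '%f%s%s%f%f', 'delimiter', ',', 'headerlines', 1);
fclose(fid);

% Column k of the file lives in robData{k}.
states = robData{3};    % cell array of strings, one per row
sales  = robData{4};    % numeric column vector

% Optional: attach names to the columns (names chosen for illustration).
fieldNames = {'Index', 'Region', 'State', 'Sales', 'HeadCount'};
named = cell2struct(robData, fieldNames, 2);

fprintf('%s: %d in sales\n', named.State{2}, named.Sales(2));
```

Unlike the char matrices that tdfread returns, the string columns here are cell arrays, so each entry keeps its own length and needs no padding.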
The Best Way to Import Data?
Is there a best way to import data? No, not really. I would suggest that you first try the ready-made import functions, such as csvread, dlmread, load, importdata, and tdfread (Statistics Toolbox), as they could potentially save you some time. Depending on your data file, these functions may or may not work. If you find that they cannot successfully import your data, then the textscan command is probably your best bet. It’s very flexible and provides tons of formatting options.
So who wins the battle of importing data? Quan or Rob? The author of this post thinks that Quan takes this contest hands down. Then again, the author of this post is Quan himself.
21 Responses to “MATLAB - Importing Data with TDFREAD (and TEXTSCAN)”
OK Quan!
Even though I’m not that fond of handling structures (they can be hard to index), I concede that the tdfread method is easier. I usually prefer to work with csv files, or other types that do not use a “whitespace” type character as the delimiter, simply because there is a lot of variation between programs as to how it comes out. For example, some programs use the tab character, while others pad with spaces. This can trip you up on the import. For Scott’s application though, it seems to work great.
Nice job,
Rob
P.S. Did I give up too easily? Anyone else out there disagree?
I think you put up a valiant effort. In the end, the best man won. I was thinking lunch at Morton’s steak house would be delicious.
In all seriousness, textscan is much more versatile and tends to work well. I’m surprised it was so difficult to get it to do what I wanted in this case.
Hey guys, nice victory this time Quan, but I don’t think the war is over. I can see tdfread failing when multiple spaces look like the tab as Rob mentioned. I do like how the cells are labeled by the title of the column too.
I was just hoping to go over my understanding of how Rob used textscan.
'%f%s%s%f%f' - specifies format of the data in matlab:
[double][string][string][double][double] (what if you have too many/few parameters compared to the segments of data?)
'delimiter',',' - specifies “comma” as delimiter
'headerlines',1 - specifies number of lines to skip before reading data (how does this handle .txt files that use word wrap?)
-Do you lose the header lines with textscan unless you read them into another cell?
Question for both formats: any idea how null values would be handled if an entry was left blank?
tdfread
- easy to use, no thinking involved, categorize by field names
- harder to unpack, indexing, harder to add entries (especially string due to architecture of struct ie. Oregon is actually ‘Oregon (lots of whitespace)’ to match with the string North Carolina)
textscan
- need some tailoring, some thinking, categorize by cell arrays
- easy to unpack with linear indexing (ie. Data{:}), easy to add strings due to architecture of cell, concatenations are easier.
All said and done… it really depends on what Scott wants to do with the data. It sounds like he does not need to add many entries. importdata actually does a decent job here for both types of files (this is only because these are formatted data)
- automatically giving you the textdata in a table like format
- the data entries in double type for immediate computations!
- but need to load the data back into the table
Hey Guys,
Just thought I’d follow up on a couple of things that Zane and Dan raised:
1. Zane asked “what if you have too many/few parameters compared to the segments of data?” [with textscan]
The good news is that it doesn’t crap out with more/less format characters than there are fields. It does, however, behave differently with too many than it does with too few. With too few the output is “out of sync” since it tries to read the next field in as the first of the next line and continue. For instance, if there are 4 fields of data in the file and textscan is executed with 3 format strings, then the output will have 3 columns where the first row is correct, but the first element in the second row is really the 4th element in the first row of the file. Using too many formats is better, since it pads the output with columns of NaN’s. At least with too many format strings the left-most columns are in the right order.
2. Zane asked how the ‘headerlines’ argument handles text files with word wrap:
Well, word wrap seems to be a function of how the text file is displayed rather than the file itself (at least from what I can tell of notepad). Textscan seems to obey the EOL character.
3. Zane again “Do you lose the header lines with textscan unless you read them into another cell?”
Yes. I typically do an “fgetl” to grab the headerline before I run textscan. Then you can index the vectors to the header names. Even though it takes an extra step, I like being able to index them.
4. “how null values would be handled if an entry was left blank?”
If you leave a field blank or have a “NaN” instead of a value, textscan handles it fine. If you just have a line with one less value (and one less delimiter), then see above about too few/ too many format parameters.
5. I tested importdata with csv, txt with tabs, and txt with padded spaces. It worked right with the tabbed txt file, but not the other two. Definitely the easiest option if it likes your data’s format.
Hope this helps,
Rob
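Rob's third point (grab the header line with fgetl, then let textscan read the rest) can be sketched roughly as follows. This is an illustration under assumptions, not Rob's actual code: it writes a three-row subset of the CSV sample to a hypothetical file, HeaderDemo.csv, and the variable names are invented for the example.

```matlab
% Write the header and first three rows of the CSV sample to a
% hypothetical file so the example runs on its own.
csvLines = { ...
    'i,Region,State,Sales,Head Count', ...
    '1,North,North Dakota,80078,81', ...
    '2,North,Montana,90608,391', ...
    '3,North,Michigan,4598,27'};
fid = fopen('HeaderDemo.csv', 'w');
fprintf(fid, '%s\n', csvLines{:});
fclose(fid);

fid = fopen('HeaderDemo.csv');
headerLine = fgetl(fid);                      % consumes the first line
headers = regexp(headerLine, ',', 'split');   % 1x5 cell of column names
% textscan continues from the current file position, so the
% 'headerlines' option is not needed here.
data = textscan(fid, '%f%s%s%f%f', 'delimiter', ',');
fclose(fid);

% Look a column up by its header name instead of by position.
salesCol = find(strcmp(headers, 'Sales'));
sales = data{salesCol};
```

The lookup by name keeps the rest of a script working if the column order in the file changes, as long as the format string is updated to match.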
Great stuff. Thanks Rob.
Hello,
There’s really good information on textscan and tdfread here. However, I’m still having serious problems getting this data into MATLAB (see the text below). My first thought is to get rid of the 17 header lines and then read the data. But the comma is recognized as a delimiter, which it is not; the file is tab delimited… Moreover, the system adds a text line at the end.
It looks like I need to use a more complex structure to get this data. Can someone help me? I’m losing my hair!!!
Thank you,
Alex Paquet
Data:
SpectraSuite Data File
++++++++++++++++++++++++++++++++++++
Date: Tue Feb 10 09:18:03 EST 2009
User: alpaq40
Dark Spectrum Present: No
Reference Spectrum Present: No
Number of Sampled Component Spectra: 1
Spectrometers: USB4F01441
Integration Time (usec): 300000 (USB4F01441)
Spectra Averaged: 1 (USB4F01441)
Boxcar Smoothing: 0 (USB4F01441)
Correct for Electrical Dark: No (USB4F01441)
Strobe/Lamp Enabled: No (USB4F01441)
Correct for Detector Non-linearity: No (USB4F01441)
Correct for Stray Light: No (USB4F01441)
Number of Pixels in Processed Spectrum: 3648
>>>>>Begin Processed Spectral Data<<<<>>>>End Processed Spectral Data<<<<<
The last reply didn’t come out well. Here’s a reduced sample of my data:
Headerlines:….
+Blablablabla
+Blablablabla
+Blablablabla
>>>>Begin Processed Spectral Data<<<<>>>End Processed Spectral Data<<<<<
Many thanks to one who solves this problem
@Alex,
I would like to try reading in your data file from the comment above. Can you please submit the original file to us through the contact form? I’ll reply with what I figure out here in the comments.
Best, Rob S.
@Alex,
I took a crack at your spectral data file. Here’s what I came up with. Hope you find it useful.
Rob
import xml dataset?
Hello
I am trying to use the textscan command to work a csv file we generate on a test stand.
C_text = textscan(fid, '%s', 115, 'delimiter', ',');
The test stand results file has 115 strings, but I need to ignore/erase the first two. I find it impractical to write the 113 %s and the two %*s.
Any ideas?
Then I am thinking of getting the results (below each of the 113 strings), strings and numbers, to start working on them for statistical analysis. I will for sure ask you a lot more things.
I appreciate your help.
thanks
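On the question of dropping the first two of 115 strings without typing out 113 format specifiers: one hedged option is to build the format string with repmat, using %*s for the fields to skip. The sketch below scales the idea down to a made-up five-field line (the field names are hypothetical, not taken from Roberto's file) and parses it from a string rather than from a file.

```matlab
% Hypothetical header line standing in for the real 115-field one.
headerLine = 'TestID,Operator,Voltage,Current,Temperature';

nSkip = 2;   % fields to ignore at the start of the line
nKeep = 3;   % fields to keep (113 in the real file)

% '%*s' reads a field and discards it; '%s' reads a field and keeps it.
fmt = [repmat('%*s', 1, nSkip) repmat('%s', 1, nKeep)];
C = textscan(headerLine, fmt, 'delimiter', ',');
kept = [C{:}];              % 1x3 cell array of the retained strings

% Alternative: read everything with a single '%s', then index.
allFields = textscan(headerLine, '%s', 'delimiter', ',');
keptAlt = allFields{1}(nSkip+1:end);   % 3x1 cell array
```

The second form is closer to the original call with a single '%s' and may be simpler when only the leading entries need to go.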
@Roberto
Sadly, tdfread is clearly inadequate for your application. 8^) Have I mentioned recently how much I really like the TEXTSCAN function?
About your first question, skipping the first two lines, try using the ‘headerlines’ option of TEXTSCAN. If the lines you want to skip are at the very top of the file, then this will work. It does not work for lines you want to skip that are mixed in with the data farther down.
About your second question, parsing out the strings and numbers, try using a more specific format. For example, if all the lines are uniform in layout with a string and then a value, you could use a format like '%s%n' to separate them.
So, all together, your function call might look like this:
Hope this helps! If you want, send me the csv file so I can see exactly what you’re working with.
Rob
I pasted the csv file into the contact form. I do not know how to attach it. If I did it incorrectly please let me know.
The first two lines are at the very top, so it will work. I will give the headerlines option a try.
The format is the same to all the files and it has 4 strings, 2 numbers, 1 string, 106 numbers:
'%s%s%s%s%f%f%s%f%f%f%f…' (with %f repeated 106 times in total)
Please let me know if you receive the csv file properly so I can explain exactly what I want to do.
thanks for your help.
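A layout of 4 strings, 2 numbers, 1 string, and 106 numbers can be assembled with repmat instead of being typed out. Note that '%106f' would not repeat the specifier: in textscan, a number between % and f is a field width. The sketch below is an illustration under assumptions; the test line is a placeholder built in code, not real test-stand data.

```matlab
% Build the 113-field format: 4 strings, 2 numbers, 1 string, 106 numbers.
fmt = [repmat('%s', 1, 4) '%f%f' '%s' repmat('%f', 1, 106)];
nFields = numel(strfind(fmt, '%'));   % number of conversion specifiers

% Placeholder line with the same layout (values are made up).
testLine = ['a,b,c,d,1.5,2.5,e' repmat(',0', 1, 106)];

% One output cell per specifier; strings land in cells, numbers in doubles.
C = textscan(testLine, fmt, 'delimiter', ',');

firstNumber = C{5};     % first numeric column
label       = C{7}{1};  % the lone string after the two numbers
```

Reading from a file works the same way: pass the file identifier from fopen in place of testLine and add 'headerlines' as needed.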
It will look like this:
Me again, I think I can gather the info now. I will need to change the number of rows depending on the size of the file.
From the text already gathered, I am in a dilemma: should I create a separate variable from each column to easily start plotting them, or should I work with each of the variables in “data”? Please take a look at the code and let me know your thoughts. Is it possible to have both date/time and Serial Number (obviously, matched) on the x axis? How can I change the scale of the y axis?
I really appreciate your help, thanks
It’s funny, I’m the guy that originally asked Scott the question back in February. My actual question was how do you read it in and then make it into a simple summary table (i.e. like a pivot table which would give you total sales by region).
My solution ended up being pretty simple. I used R, which is quite happy with handling text data, and has several functions for aggregating data by category to make summary tables.
I’m still very surprised that there is not a simple way to do this in Matlab. You guys should really take a look at R… It has the same computational functionality, and is a lot simpler for basic data handling. There are a lot of ways to make R fast, and in the end (no matter what statistical software you use) you end up having to call C code for top notch speed.
Hi Guys,
Your work includes great clues about my work, but I still have problems and hope you’re the heroes that will save me :)
I’m working with the KDD’99 dataset, which seems to be huge for applications. The files include thousands of lines. I’m trying to save them as .txt documents to load into MATLAB with tdfread, but they are over the size limit of Notepad.
So how can I load these files into MATLAB? You can find the datasets here to get an idea:
http://mlr.cs.umass.edu/ml/databases/kddcup99/kddcup99.html
Thanks for your help.
Burcu
Hey.. I am trying to import an Excel file, which I could not do through dataset and xlsread… so I tried changing it to txt… the problem here is that the xls file contains blanks in between, and tdfread is not reading it properly.. please help me
Hi
I am trying to use textscan to read a csv file.
Considering the initial data file posted as an example, suppose I want to extract the information from row 5 to row 10 (the user specifies the search criteria). Will it be possible to do this using textscan?
Kindly help
Thanks
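Picking out rows 5 through 10 by position looks possible with textscan alone: skip the header plus the four rows that are not wanted, then ask for six repetitions of the format. A minimal sketch, assuming the CSV sample from the post (recreated here as SampleData.csv so the snippet is self-contained); the variable names are illustrative.

```matlab
% Recreate the CSV sample from the post.
csvLines = { ...
    'i,Region,State,Sales,Head Count', ...
    '1,North,North Dakota,80078,81', ...
    '2,North,Montana,90608,391', ...
    '3,North,Michigan,4598,27', ...
    '4,North,Wisconsin,11622,36', ...
    '5,South,Florida,9788,73', ...
    '6,South,Georgia,86456,385', ...
    '7,South,Alabama,94766,91', ...
    '8,South,Mississippi,13004,61', ...
    '9,East,North Carolina,612,25', ...
    '10,East,Virginia,85508,233', ...
    '11,West,California,84419,262', ...
    '12,West,Washington,92682,97', ...
    '13,West,Oregon,53185,51'};
fid = fopen('SampleData.csv', 'w');
fprintf(fid, '%s\n', csvLines{:});
fclose(fid);

firstRow = 5;    % first data row wanted (1 = first row after the header)
lastRow  = 10;   % last data row wanted

fid = fopen('SampleData.csv');
% Skip the header line plus the data rows before firstRow, then apply
% the format once per wanted row.
C = textscan(fid, '%f%s%s%f%f', lastRow - firstRow + 1, ...
    'delimiter', ',', 'headerlines', 1 + (firstRow - 1));
fclose(fid);

% To select by content instead of position, filter after reading,
% for example keeping only the rows whose region is 'South'.
isSouth = strcmp(C{2}, 'South');
southSales = C{4}(isSouth);
```

If the criteria depend on values rather than row numbers, it is usually simpler to read the whole file and apply a logical index, as in the last two lines.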