MATLAB - Getting What You Want With textscan
06 Oct 2008 Rob Slazas 29 comments 2,680 views
Rob Slazas is not just a statistics guru; he also uses MATLAB to do everyday tasks. Today he shares some of his tricks using the textscan function!
I was browsing the real estate listings the other day and hit a roadblock. All the data I could ever want was readily available, but scattered in awkward formats and littered with other things that weren’t important to me. “How do I pull out the data I want while ignoring everything else?” I asked myself. In this post I’ll share an example of how the textscan function came through for me - maybe you’ll find it useful too.
Too Much Clutter!
I started out by asking my real estate agent to send me a list of all the rental properties in an area, including their list prices and final rent prices. I wanted to see whether rentals were going for more than, less than, or about the same as the list price. I got a web link to 13 pages of this:

As you can see, the webpage report contains a jumble of information and graphics on it. Mixed in with everything else is the data I want: the list prices and the final prices (red circles on the last one above). There must be a way to scrape this data off the page and analyze it in MATLAB. Hmmm…
textscan in Action
After a few minutes of trying to work with the source HTML for the webpage, I decided to just copy / paste the browser window directly into a text file and go from there. At least this way I would be free of all those HTML tags and the images. Here is what I got in the text file: (download the text file here)

Getting closer. Now it’s in a more familiar format. To sift out the data I want, I need to get this into a MATLAB variable of some kind. This is where textscan comes into play. To grab each word as a separate cell in a cell array, I used textscan’s default delimiter, which is white space (spaces, tabs, and line breaks). Remember that a delimiter is just the character that triggers a separation between chunks of data. Here is what my cell array looks like now:
fid = fopen('rentals.txt');
allstrings = textscan(fid,'%s');
fclose(fid);
allstrings = allstrings{:};
%note: this last line is needed to undo
%the strange cell-within-a-cell output that textscan uses.
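To see the cell-within-a-cell output on something small, here is a minimal sketch that runs textscan on an in-memory string instead of a file. The sample text is made up for illustration and is not taken from rentals.txt.

```matlab
% Made-up sample standing in for two lines of the pasted listing text
sample = sprintf('List Price: $1,200\nSold Price: $1,150');

% With '%s' and no Delimiter option, textscan splits on white space
c = textscan(sample, '%s');

% textscan returns a 1x1 cell whose only element is the cell array of words
words = c{1};

disp(size(c))   % size of the outer cell
disp(words)     % the individual words
```

Indexing with `c{1}` does the same unwrapping as `allstrings{:}` when only one format specifier is used.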

Great! Now I can employ some classical string searching and parsing techniques to locate and extract the price data.
Searching and Parsing
Looking at the cell array, I can see that it follows a pattern with respect to the list and final sales prices. The rules seem to be:
- For each property, list prices come before sales prices
- Each cell with a price in it is preceded by a ‘Price:’ cell
- Each cell with a price in it starts with the ‘$’ character
- Every property has a list price
- Some properties are still active, and don’t have sales prices
So, if I search for the ‘Price:’ cells, then those locations +1 will mark where the list and sales data are supposed to be. I ran the following code to get a list of all those locations in the cell array:
pricecells = strmatch('Price',allstrings); %gets the list of cell indices that start with 'Price'
datacells = pricecells + 1; %adds one to each element in the array
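In newer MATLAB releases strmatch is no longer recommended. A rough equivalent, assuming the same allstrings layout, uses strncmp to test the first five characters of every cell. The token list below is made up for illustration.

```matlab
% Made-up token list in the same shape as the textscan output
allstrings = {'List'; 'Price:'; '$1,200'; 'Sold'; 'Price:'; '$1,150'};

% Logical test on the first 5 characters of each cell, then convert to indices
pricecells = find(strncmp(allstrings, 'Price', 5));

% The value of interest sits in the cell right after each match
datacells = pricecells + 1;

disp(allstrings(datacells))
```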
And since each property has 2 ‘Price:’s (list and sales), there are half as many properties as there are elements in datacells. Knowing this, I can prepare my numerical array to collect the pairs of list and sales prices for each property. We need a properties x 2 array of NaN’s to start with:
prices = nan(numel(datacells)/2,2);
%note: this creates a 2 column matrix of NaNs that will be populated later
%preallocating variables makes things go faster!
And finally, I will go to each of these cell locations and test for a dollar value. If the cell starts with a ‘$’ character, then one is present and should be saved in my prices variable. I’ll do this alternating between list prices (column 1 of prices) and sales prices (column 2 of prices). For the active rental properties that don’t have a sales price yet, I’ll leave the sales data as NaN.
for i = 1:(numel(datacells)/2)
    if allstrings{datacells(i*2-1)}(1)=='$'
        prices(i,1) = str2double(allstrings{datacells(i*2-1)}(2:end));
    end
    if allstrings{datacells(i*2)}(1)=='$'
        prices(i,2) = str2double(allstrings{datacells(i*2)}(2:end));
    end
end
%Notice the curly brackets in the if statements? These are needed
%when working with the contents of a cell.

Now THAT is exactly what I wanted in the first place! There are 103 properties, each with list prices and some with sales prices. Those without sales prices have NaN’s in the second column.
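The same extraction can also be sketched without a loop. This version leans on two behaviors of str2double: it accepts a cell array of strings, and it returns NaN for text that is not a number. The token list is made up, and the 'Active' placeholder for a property with no sale price is an assumption about what the page shows. Unlike the loop, it does not check for the '$' character, so it would also accept a bare number.

```matlab
% Made-up tokens for two properties; the second has no sale price yet
allstrings = {'List'; 'Price:'; '$1,200'; 'Sold'; 'Price:'; '$1,150'; ...
              'List'; 'Price:'; '$950';   'Sold'; 'Price:'; 'Active'};

% Cells that follow each 'Price:' marker
datacells = find(strncmp(allstrings, 'Price', 5)) + 1;

% Arrange as one row per property: column 1 = list, column 2 = sale
tokens = reshape(allstrings(datacells), 2, []).';

% Drop the dollar signs; str2double copes with the thousands commas
% and yields NaN for anything that is not a number
prices = str2double(strrep(tokens, '$', ''));

disp(prices)
```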
Answering the Question
With these pairs of prices, I can answer my original question of how final sales prices compare to the list prices in this area. Are people paying more, less, or the same as what the landlords are asking? So what I’m really interested in is the difference in price pairs here. I used the diff function, and then graphed the results this way:
pricediff = diff(prices,1,2);
h1 = figure('Position',[100 100 600 400],'Color','w');
h2 = subplot(1,6,1:3);
boxplot(prices,'Notch','on','labels',{'List Price';'Sale Price'});
ylabel('Rent ($)');
grid on;
h3 = subplot(1,6,5:6);
boxplot(pricediff,'labels','$ Change per Property');
ylabel('Rent ($)');
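For readers unsure what diff(prices,1,2) returns, here is a small sketch on made-up numbers. The third argument selects dimension 2, so each result is column 2 minus column 1 (sale minus list), and a NaN sale price carries through as NaN.

```matlab
% Made-up [list, sale] pairs; NaN marks a property with no sale price yet
prices = [1200 1150;
           950  NaN;
          1000  950;
           800  820];

% First difference along dimension 2: sale price minus list price
pricediff = diff(prices, 1, 2);

% Ignore the NaN rows before summarizing
valid = ~isnan(pricediff);
medianChange = median(pricediff(valid));

fprintf('Median change: %d dollars over %d rented properties\n', ...
        medianChange, nnz(valid));
```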

Wrapping up
So after all the digging, searching, parsing, and graphing, I can see that people are paying about $50 less than the list price, with some people paying a lot less and a few paying more. Great info to have when going out to look at rentals in this area.
Hope this example helps you become more familiar with textscan and the searching and parsing techniques used above. As always, questions and comments are welcome.
29 Responses to “MATLAB - Getting What You Want With textscan”
I feel that with MATLAB, anything is possible. Thanks for the post Rob. Textscan is my new best friend.
way to highlight some nifty uses with MATLAB!
@Quan, @Dan - you’re too kind.
Anyone else have a MATLAB solution to a “real life” problem? Share the wealth and let us know!
Rob S.
That’s great - now can you take it a step further? Wouldn’t it be slick to have your analysis/plots update automatically? Can you have MATLAB pull the text data directly from the web? I know it’s possible in Visual Basic, so it MUST be possible in MATLAB…it might be tricky to parse out the text from the HTML markup, but I’m sure regular expressions would be helpful there.
Yes, dp raises a good point. Is it possible? Only time will tell.
Regular expressions are like a different language to me though.
@dp, @Quan,
Great question! I think urlread will do the trick. I’m no wizard with regexp either, but with some teasing we can usually make it behave. Try this example code below that grabs the price of my favorite stock as it changes - is this what you had in mind? (Hit any key to stop the loop and close the figure)
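As a rough sketch of the regexp half of this idea, the snippet below pulls dollar amounts out of a hard-coded HTML fragment. The fragment and its class names are invented for illustration; in practice the string would come from a call such as urlread on the page of interest.

```matlab
% Invented HTML fragment standing in for a downloaded page
html = '<td class="price">$1,200</td><td class="price">$950</td>';

% Capture the digits and commas that follow each dollar sign.
% With 'tokens', regexp returns one 1x1 cell per match.
tokens = regexp(html, '\$([\d,]+)', 'tokens');

% Flatten the nested cells and convert the text to numbers
prices = str2double([tokens{:}]);

disp(prices)
```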
Nice, that’s exactly what I had in mind. On a similar path as the original post, I threw something together that pulls apartment rental info from craigslist - just input the city! Sorry the comments are so verbose, I like to “think out loud” while coding…
Dp,
Love the massive comments. I only wish I were disciplined enough to write code laced with meaningful comments. Very cool examples from both Rob and DP! I tried out Rob’s stock function and it worked just like advertised. Now I always have stock prices at my fingertips. And with the way the market has been moving lately, it’ll come in quite handy when I want to know how much money I’ve lost in the last 5 minutes.
Quan
Wow, that got butchered…looks like the left and right carets (greater than/less than in some cases) didn’t survive the “pre” MATLAB wrapper. How does one get around this?
hi dp, can you send your comment through the contact form. i’ll be sure to fix it!
Hey,
I have been working with MATLAB for years. I have never really used textscan before. It really is quick. (I normally work with Excel spreadsheets.) This time around I have 5 000 000 lines in a flat file. I see that textscan should be able to read chunks of the file. What parameter do I need to set to choose how many rows to import at a time?
(this way i dont have to split the file before importing)
Many thanks
Okay, so i feel a little silly
somehow when i entered:
TxnBatch = textscan(fid, '%s %s %s %s %f64 %f64 %f64 %f64 %f64 %f64', N, 'delimiter', ',');
with N as number i got errors.
but by referring to a variable N i have no problems
…
@ Reabs,
Like you, I love it when I answer my own questions! That’s weird about the N as variable versus entering a number directly. If the whole file follows the same format, you can just omit the N, then textscan will repeat until the end of file.
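For reading a large file a chunk at a time, one possible pattern is to call textscan repeatedly with a row count N on the same open file, since each call picks up where the previous one stopped. The sketch below writes its own small two-column file, so the file name, format, and block size are all stand-ins rather than the real data.

```matlab
% Write a small comma-separated file to stand in for the large flat file
fname = [tempname '.csv'];
fid = fopen(fname, 'w');
fprintf(fid, '%d,%d\n', [1:5; 10:10:50]);   % lines 1,10 through 5,50
fclose(fid);

N = 2;        % rows per block (would be much larger for a real file)
total = 0;    % running sum of the second column
fid = fopen(fname, 'r');
while ~feof(fid)
    % Read at most N rows, starting where the previous call stopped
    block = textscan(fid, '%f %f', N, 'Delimiter', ',');
    if isempty(block{1})
        break   % nothing left to read
    end
    total = total + sum(block{2});
end
fclose(fid);
delete(fname);

disp(total)
```

Processing each block inside the loop keeps only N rows in memory at once.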
Another thought, if your file does not follow the same format all the way through, you can write a simple search loop that finds the last line you want, then use that number in your textscan line. Maybe something like this:
Let’s say that you have a file with a mix of data types. The first N lines are data (numbers), followed by some lines of text. You only want the data.
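A minimal sketch of such a search loop, assuming the numeric lines come first and that a line counts as data when sscanf can read a number from its start. The temporary file and its contents are made up for illustration.

```matlab
% Made-up file: three numeric lines followed by two lines of text
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '1,2\n3,4\n5,6\nend of data\nmore text\n');
fclose(fid);

fid = fopen(fname, 'r');

% Count leading lines that start with a number
N = 0;
tline = fgetl(fid);
while ischar(tline) && ~isempty(sscanf(tline, '%f'))
    N = N + 1;
    tline = fgetl(fid);
end

% Go back to the top and read only those N lines
frewind(fid);
data = textscan(fid, '%f %f', N, 'Delimiter', ',');
fclose(fid);
delete(fname);

disp(N)
disp([data{1} data{2}])
```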
Hope this helps,
Rob
hey rob
thanks for the demo, i also like using textscan. Not very experienced with it, so always good to see other uses of it.
i’ve got one issue with it recently - in a textfile with about 22,000 lines, it only reads in the first 7,500 lines. do you know if this is a general thing of textscan and would i need to use another import tool (maybe fread) instead?
thanks in advance
marije
@ Marije,
Sounds like strange behavior from textscan. I couldn’t find anything in the documentation about a limitation like that. To test it out I ran the following code. It generates two types of 22,000 line data files and then reads them in - one with 4 elements per line and one with 1000 elements per line. Both cases read in all 22,000 lines for me. Perhaps you can also run it to see if textscan is the real problem (for instance, if line # 7501 in your file has an unexpected character, textscan might be terminating there).
* Caution, this code took several minutes to run on my laptop, so start it and then go get some coffee 8^)
If you like, you can also email your data file and the script.m file you are using, and I’ll take a look. Use rslazas(at)gmail(dot)com.
HTH,
Rob
it was indeed a sudden hash # at line 7500-ish.
*hitting head against the wall*
so sorry, i am not sure why i didn’t check for that.
Now i’m trying to use your trick above to make it skip that line, but there is one command fget1 that is unfamiliar. Is it written by you? what does it do?
thanks so much for your help.
@ Marije,
Oh, that is not a custom function “fget1”, that is fgetl (the last character is an L, not a one). It’s in the documentation:
http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/fgetl.html&http://
Rob
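As a small illustration of fgetl in this situation, the sketch below lets textscan read until it hits a field it cannot convert, then uses feof and fgetl to look at what is left. The file contents are made up, and the exact text returned by fgetl depends on where textscan leaves the file position.

```matlab
% Made-up file with a stray '#' line in the middle of numeric data
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '1 2\n3 4\n# oops\n5 6\n');
fclose(fid);

fid = fopen(fname, 'r');
C = textscan(fid, '%f %f');   % stops at the first field it cannot convert
stoppedEarly = ~feof(fid);    % true if unread text remains in the file
if stoppedEarly
    badline = fgetl(fid);     % remaining text on the line where the scan stopped
    fprintf('Read %d rows, then stopped at: %s\n', numel(C{1}), badline);
end
fclose(fid);
delete(fname);
```

If lines starting with # are meant to be comments, textscan's 'CommentStyle' option may be worth a look.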
@ Rob
Hey, thanks for the response. I am definitely going to use that bit of code to search through my flat file. This way I can pull out just the lines I need, when I need them!
(that would have been my next question)
As a note for anyone else getting going with textscan:
MATLAB will not import files saved as Unicode; resave as ANSI (in something like TextPad) and all is good again.
[...] textscan command is a personal favorite of Rob Slazas. Rob had previously dedicated a post to this wonderful command. Independently of me, Rob chose to use the .csv file and the textscan [...]
guys,i have a problem about ‘textread’,hope somebody could help
what i have is a m-file,named test.m
it’s just a test file, actually the m-file is huge…..
and i want to find some of the parameters and rewrite it with new value.
so i try to read them with matlab first,use
unfortunately,it didn’t work
hope somebody can help
thanks!!
bob
i mean, have to ignore the ‘equal sign’ and the ‘;’
don’t know how to do that
@bob,
Sorry for the delayed response, been away…
My first attempt at bringing in data is always the “importdata” function. Having tried it on your situation without success, let’s move on to a less automatic method.
I am sorry to report that the “tdfread” function won’t work here either (one of Quan’s favorites since you don’t have to use fopen / fclose). How about trying TEXTSCAN instead of textread? Have I mentioned recently that I really like the textscan function? 8^) Something like this:
This will open your test.m file for read access. It then looks at each line for a string of characters (%s) and a number (%n) separated by either an equals sign or a semicolon (the delimiters).
The output will be the “alldata” variable as a 1×2 cell array. The first cell will hold a cell array of all the left hand variable strings in the test.m file. The second cell will hold an array of doubles from the right hand value assignments. You can extract this array of values to a more easily manipulated variable by:
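A sketch of that extraction step, assuming alldata has the 1×2 layout described above. Here it is built by hand with made-up names and values rather than read from test.m.

```matlab
% Hand-built stand-in for the textscan output: names in cell 1, values in cell 2
alldata = {{'gain'; 'offset'; 'nsteps'}, [1.5; 0.25; 10]};

% Pull the two columns out into their own variables
names  = alldata{1};
values = alldata{2};

% Look up one parameter by name (the name 'offset' is hypothetical)
offsetValue = values(strcmp(names, 'offset'));

disp(offsetValue)
```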
Hope this helps,
Rob
@Rob
thanks for the idea
but now i think, with ‘textread’ or ‘textscan’ is not enough to solve this problem,because i also have some matrix in my m-file,like this
so it’s a little bit difficult with ‘textscan’ to get all the informations we need
and then, i have tried to find out another solution
i made a testfunction,like this
and it works,but not in a GUI,-_-
as we know, in a GUI, we have to use ‘assignin’ or ‘evalin’ to send or get information from the workspace. and now after we run the test.m in a function of this GUI, with ‘variable.name = who’, we can still get all the names of the variables, but i can’t get the values, don’t know how to evaluate them, because it looks like they are not in the workspace.
i used to have problem with data saving and sharing in GUI
ppffffffff
hope you have the idea
Bob
@Bob,
There’s a thing we say around here sometimes, when an answer isn’t good enough because we were only told half the problem. “You’re killing me!” (with a good natured smile) Yes, of course the straight read I laid out with textscan won’t work if you have matrices. You can still do this, just with some extra conditional “smartness” added in. I’m sure you can figure that out from here now that you can read the file.
For the workspace variable problem with the GUI, I will have to defer to my colleague Quan. He is the master of our GUI universe, and I expect will be able to help you much more expediently than I.
HEY QUAN! GUI PROBLEM OVER HERE!
HTH,
Rob
@bob
it looks like you might want to do the following:
%get the variable names from the workspace
variable.name = evalin('base','who');
@ Rob and Quan
Morning, thanks a lot for your quick answers
yup,you are right,Rob. i would keep trying with ‘textread’ or ‘textscan’with some loops and conditional added in. when it works, it must be also very useful!!
now, Quan. thanks for the hints, now i know where the problem is. i used ‘evalin’ yesterday, but we have nothing in the main Workspace. i changed the code
yes, just with ‘eval’ but not ‘evalin’
oh,thank you so much!! it’s so helpful to discuss with you guys here!
thanks thanks thanks!!
Bob
I don’t know if this thread is still working but I will ask anyway…
I have a huge .txt file of almost 500mb of information and I can’t import it because it is too big. Basically the data in the file is:
Date…Time…Latitude…Longitude…number…number…kA…number and then it starts repeating itself. This is an example of what is seen:
09/01/03 00:00:01 26.996 -98.091 -81.3 -9.7 kA 1
09/01/03 00:00:02 25.484 -99.301 31.6 9.6 kA 1
09/01/03 00:00:03 38.795 -80.904 -166.1 -24.2 kA 3
09/01/03 00:00:05 26.888 -98.099 26.9 8.8 kA 1
09/01/03 00:00:05 39.170 -88.154 -173.1 -25.4 kA 1
09/01/03 00:00:05 34.979 -85.770 -105.8 -13.9 kA 1
What I want to know is whether there is a way to read the file and arrange the data in columns using only the first 4 things given in each sequence (Date, Time, Latitude, Longitude) and delete the rest. And after that, whether there is a way to filter a specific range of information.
I know it is a lot but I would really appreciate the help.
Hi GV,
Sorry for the delayed response. I think the big task here is to get the data into variables the way you want it. After that you can filter, sort, etc.
Try this use of TEXTSCAN to get only the first 4 values of each line. Then you just have to redistribute them out of the cell array (format of TEXTSCAN output).
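One way such a call might look, as a sketch: read the date and time as strings, the latitude and longitude as numbers, and use the skip specifier %*[^\n] to discard the rest of each line. The three sample lines are copied from the question above; the latitude range used for filtering is an arbitrary example.

```matlab
% Three lines from the question, held in a string instead of the 500 MB file
txt = sprintf(['09/01/03 00:00:01 26.996 -98.091 -81.3 -9.7 kA 1\n' ...
               '09/01/03 00:00:02 25.484 -99.301 31.6 9.6 kA 1\n' ...
               '09/01/03 00:00:03 38.795 -80.904 -166.1 -24.2 kA 3\n']);

% %*[^\n] skips everything up to the end of the line
C = textscan(txt, '%s %s %f %f %*[^\n]');

% Redistribute the cell array into named variables
dates = C{1};
times = C{2};
lat   = C{3};
lon   = C{4};

% Example filter: keep rows whose latitude falls in an arbitrary range
keep = lat > 26 & lat < 30;
disp([lat(keep) lon(keep)])
```

For a file too large to load at once, the same format string can be combined with a row count so that textscan reads and filters one block at a time.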
Hope this helps!
Rob
Hi Rob & all,
I’m trying to read a file that includes strings and numerical data with textscan command. After importing this data, i’ll use it for neural network training.
My dataset is the kdd’99 dataset (available here: http://mlr.cs.umass.edu/ml/databases/kddcup99/kddcup99.html ; I will use the 10% subset) and looks like:
0,tcp,smtp,SF,829,327,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,8,113,0.88,0.25,0.12,0.02,0.00,0.00,0.00,0.00, normal
The data dimensions are very large, so I want to make a trial with the first 19 rows of the dataset, and used these commands:
fid = fopen('ked.txt');
hadi=textscan(fid, '%u8, %s, %s, %s, %u16, %u16, %u8,%u8,%u8,%u8,%u8,%u8,%u8,%u8,%u8,%u8,%u8,%u8,%u8,%u8,%u8,%u8,%u8,%u8,%u8,%f,%f,%f,%f,%f,%f,%f, %u16, %u16,%f,%f,%f,%f,%f,%f,%f,%f, %s', 19);
fclose(fid);
Then let’s say i need to set the first 41 columns of first 15 lines as training data, so i use this command:
P= hadi(1:15, 1:41);
and i get this error:
??? Index exceeds matrix dimensions.
hadi is a 1×42 matrix, and i can’t get it to be 19×42. As far as i can understand this is normal behavior for textscan, but i need a 19×42 matrix here to go on with my work. My MATLAB version is R2006a.
PS: i’ve also tried textscantool but ran into different errors and couldn’t even make it work.
All the suggestions are welcome!
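The 1×42 result is indeed how textscan reports its output: one cell per format specifier, with each cell holding a whole column. A sketch of one way to select rows and build a numeric matrix from such output follows. It uses a made-up four-column sample and reads every numeric field as %f so the columns can be concatenated; string columns cannot go into a numeric matrix and are left out.

```matlab
% Made-up sample with one string column and three numeric columns
txt = sprintf('0,tcp,829,0.00\n1,udp,105,0.25\n2,tcp,300,1.00');
C = textscan(txt, '%f %s %f %f', 'Delimiter', ',');

% C is 1x4: one cell per column, each holding all three rows
rows = 1:2;   % rows wanted for training

% Apply the same row selection to every column
P = cellfun(@(col) col(rows), C, 'UniformOutput', false);

% Concatenate only the numeric columns into an ordinary matrix
numericCols = cellfun(@isnumeric, C);
M = [P{numericCols}];

disp(size(C))
disp(M)
```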