Rob Slazas is not just a statistics guru; he also uses MATLAB to do everyday tasks. Today he shares some of his tricks using the Textscan command!

Matlab Logo

I was browsing the real estate listings the other day, and came across a roadblock. All the data I could ever want was readily available, but scattered in awkward formats and littered with other things that weren’t important to me. “How do I pull out the data I want while ignoring everything else”, I asked myself? In this post I’ll share an example of how the textscan function came through for me - maybe you’ll find it useful too.

Contents

Too Much Clutter!

I started out by asking my real estate agent to send me a list of all the rental properties in an area, including their list prices and final rent price. My interest was to see if rentals were going for more than, less than, or about the same as the list price. I got a www link to 13 pages of this:

Web Report

As you can see, the webpage report contains a jumble of information and graphics on it. Mixed in with everything else is the data I want: the list prices and the final prices (red circles on the last one above). There must be a way to scrape this data off the page and analyze it in MATLAB. Hmmm…

textscan in Action

After a few minutes of trying to work with the source html for the webpage, I decided to just copy / paste the browser window directly into a text file and go from there. At least this way I would be free of all those html tags and the images. Here is what I got in the text file: (download the text file here)

Web Report

Getting closer. Now it’s in a more familiar format. To sift out the data I want, I need to get this into a MATLAB variable of some kind. This is where textscan comes into play. To grab each word as a separate cell in a cell array, I used textscan’s default delimiter, which is simply a “space”. Remember that a delimiter is just the character that triggers a separation between chunks of data. Here is what my cell array looks like now:

fid = fopen('rentals.txt');
allstrings = textscan(fid,'%s');
fclose(fid);
allstrings = allstrings{:};
%note: this last line is needed to undo
%the strange cell-within-a-cell output that textscan uses.

Cell Array

Great! Now I can employ some classical string searching and parsing techniques to locate and extract the price data.

Searching and Parsing

Looking at the cell array, I can see that it follows a pattern with respect to the list and final sales prices. The rules seem to be:

  • For each property, list prices come before sales prices
  • Each cell with a price in it is preceded by a ‘Price:’ cell
  • Each cell with a price in it starts with the ‘$’ character
  • Every property has a list price
  • Some properties are still active, and don’t have sales prices

So, if I search for the ‘Price:’ cells, then those locations +1 will mark where the list and sales data are supposed to be. I ran the following code to get a list of all those locations in the cell array:

pricecells = strmatch('Price',allstrings);
%gets the list of cell indices that start with 'Price'
datacells = pricecells + 1;
%adds one to each element in the array

And since each property has 2 ‘Price:’s (list and sales), there are half as many properties as there are elements in datacells. Knowing this, I can prepare my numerical array to collect the pairs of list and sales prices for each property. We need a properties x 2 array of NaN’s to start with:

prices = nan(numel(datacells)/2,2);
%note: this creates a 2 column matrix of NaNs that will be populated later
%preallocating variables makes things go faster!

And finally, I will go to each of these cell locations and test for a dollar value. If the cell starts with a ‘$’ character, then one is present and should be saved in my prices variable. I’ll do this alternating between list prices (column 1 of prices) and sales prices (column 2 of prices). For the active rental properties that don’t have a sales price yet, I’ll leave the sales data as NaN.

for i = 1:(numel(datacells)/2)
    if allstrings{datacells(i*2-1)}(1)=='$'
        prices(i,1) = str2double(allstrings{datacells(i*2-1)}(2:end));
    end
    if allstrings{datacells(i*2)}(1)=='$'
        prices(i,2) = str2double(allstrings{datacells(i*2)}(2:end));
    end
end 

%Notice the curly brackets in the if statements? These are needed
%when working with the contents of a cell.

Prices

Now THAT is exactly what I wanted in the first place! There are 103 properties, each with list prices and some with sales prices. Those without sales prices have NaN’s in the second column.

Answering the Question

With these pairs of prices, I can answer my original question of how final sales prices compare to the list prices in this area. Are people paying more, less, or the same as what the landlords are asking? So what I’m really interested in is the difference in price pairs here. I used the diff function, and then graphed the results this way:

pricediff = diff(prices,1,2);
h1 = figure('Position',[100 100 600 400],'Color','w');
h2 = subplot(1,6,1:3);
boxplot(prices,'Notch','on','labels',{'List Price';'Sale Price'});
ylabel('Rent ($)'); grid on;
h3 = subplot(1,6,5:6);
boxplot(pricediff,'labels','$ Change per Property');
ylabel('Rent ($)');

Get What you Want

Wrapping up

So after all the digging, searching, parsing, and graphing, I can see that people are paying about $50 less than the list price, with some people paying a lot less and a few paying more. Great info to have when going out to look at rentals in this area.

Hope this example helps you become more familiar with textscan and the searching and parsing techniques used above. As always, questions and comments are welcome.