Friday, February 4, 2011

Java Web Services API, however I can't run a JVM on my server

I'm trying to use some data from a PlanPlusOnline account. They only provide a java web services API. The server for the site where the data will be used does not allow me to install Tomcat (edit: or a JVM for that matter). I'm not going to lie, I am a Java software engineer, and I do some web work on the side. I'm not familiar with web services or servlets, but I was willing to give it a shot. I'd much rather they have JSON access to the data, but as far as I know they don't. Any ideas?

EDIT: to clarify. The web service provided by planplusonline is Java based. I am trying to access the data from this web service without using Java. I believe this is possible now, but I need to do more research. Anyone who can help point me in the right direction is appreciated.

  • Are you trying to implement a client to a web service hosted somewhere else? If so, Java's not necessary. You can do web service clients in .NET, PHP, Ruby, or pretty much any modern web technology out there. All you need is a WSDL document to provide metadata about how to invoke the services.

    From jodonnell
  • To follow up on jodonnell's comment: a Web service connection can be made in just about any server-side language. It is just that the API example they provided was in Java, probably because PlanPlusOnline is written in Java. If you have a URL for the service and an access key, then all you really need to do is figure out how to traverse the XML returned. If you can't do Java, then I suggest PHP, because it may already be installed with the proper modules loaded. This link might be helpful:

    http://www.onlamp.com/pub/a/php/2007/07/26/php-web-services.html
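
    For what it's worth, here is a minimal sketch of that "POST a SOAP envelope and traverse the XML" idea in Python (PHP's SoapClient does the same job). The endpoint URL, SOAPAction, namespace, and element names below are hypothetical placeholders, not the real PlanPlusOnline API; take the actual values from the WSDL:

    import urllib.request
    import xml.etree.ElementTree as ET

    # Hypothetical endpoint and method names; read the real ones from the WSDL.
    ENDPOINT = "https://example.planplusonline.com/services/PlanPlusOnline"
    ENVELOPE = """<?xml version="1.0" encoding="UTF-8"?>
    <soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
      <soap:Body>
        <getSomethingInteresting xmlns="http://example.com/planplusonline">
          <param>12345</param>
        </getSomethingInteresting>
      </soap:Body>
    </soap:Envelope>"""

    request = urllib.request.Request(
        ENDPOINT,
        data=ENVELOPE.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8",
                 "SOAPAction": "getSomethingInteresting"},
    )

    with urllib.request.urlopen(request) as response:
        tree = ET.fromstring(response.read())

    # Walk the returned XML; the tag names depend entirely on the service's WSDL.
    for element in tree.iter():
        print(element.tag, element.text)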

    From hal10001
  • Thanks for making what should have been obvious to me more clear. I believe I have some reading and research ahead of me; the link seems to be a great start. PHP is the language of choice, so any further advice on this topic will be appreciated, but mostly I need to do some research.

    Thanks again to both jodonnell and hal10001!

    From drye
  • If I am understanding your question correctly, you only need to connect to an existing web service and not create your own. If that is the case (and maybe I am missing something), I do not believe you will need Tomcat at all. If you are using NetBeans you can create a new Desktop or Web application, then right-click the project name, select New, then Other, and select Web Client. Enter the information for where to find the WSDL (usually a URL) and the other required information.

    Once you have added the Web Client, create a new class that actually makes your calls to the web service. If the web service name were PlanPlusOnline, then you could have something like:

    import java.net.MalformedURLException;

    public final class PlanPlusOnlineClient
    {
        //single shared instance of this class so that we do not have to
        //reinstantiate it every time
        private static PlanPlusOnlineClient _instance = new PlanPlusOnlineClient();

        //class generated by NetBeans with information about the web service
        private PlanPlusOnlineService service = null;

        //another class generated by NetBeans; this one is a property of the
        //service and exposes the individual methods available.
        private PlanPlusOnline port = null;

        private PlanPlusOnlineClient()
        {
            try
            {
                service = new PlanPlusOnlineService();
                port = service.getPlanPlusOnlinePort();
            }
            catch (MalformedURLException ex)
            {
                MessageLog.error(this, ex.getClass().getName(), ex);
            }
        }

        public static PlanPlusOnlineClient getInstance()
        {
            return _instance;
        }

        //note: this must be an instance method (not static) because it uses the
        //port field; call it as getInstance().getSomethingInteresting(...)
        public String getSomethingInteresting(String param)
        {
             //this calls one of the actual methods the web service provides
             return port.getSomethingInteresting(param);
        }
    }
    

    I hope this helps you along your way with this. You should also check out http://www.netbeans.org/kb/60/websvc/ for some more information about Netbeans and web services. I am sure it is similar in other IDEs.

    From jschoen

What are the disadvantages of Typed DataSets

I come from a world that favors building your own rather than rely on libraries and frameworks built by others. After escaping this world I have found the joy, and ease, of using such tools as Typed DataSets within Visual Studio. So besides the loss of flexibility what else do you lose? Are there performance factors (disregarding the procs vs dynamic sql debate)? Limitations?

  • Performance is improved with typed datasets over untyped datasets (though I've never found performance issues with trivial things like that worth worrying about).

    I'd say the biggest pain is just keeping them in sync with your database--I can't speak for VS 2008, but prior versions do not provide good support for this. I literally drag the procs onto the designer every time the result set's schema changes. Not fun.

    But, you do get compile time type checking which is great and things like Customer.Name instead of Dataset.Tables(0).Rows(0)("Name").

    So, if your schema is relatively static, they may be worth it, but otherwise, I wouldn't bother.

    You could also look into a real ORM.

  • There is nothing wrong with typed datasets. They are not perfect, but they are a step toward solving the object-relational impedance mismatch problem. The only problem I faced is weak support for schema changes. Partial classes can help, but not in every case.

    From aku
  • I only gave Typed Datasets a very short try. I stopped when I found my code breaking about 2/3 of the way down a 1,000+ line file of generated code.

    The other thing I didn't like was I thought I'd get to write code like Customer.Name, but by default I seemed to get code like CustomerDataSet.Customers[0].Name, where Customers[0] was of type CustomersRow. Still nicer to read than untyped datasets, but not really the semantics I was looking for.

    Personally I headed off down the route of ActiveRecord/NHibernate, and haven't looked back since.

    From David
  • The main criticism I would extend is that they don't scale well -- performance suffers because of the overhead when you get to a higher number of transactions, compared to lightweight business entities, DTOs, or LINQ to SQL. There's also the headache of schema changes. For an "industrial strength" architecture, they're probably not the way to go; they will cause issues in the long run.

    I would still definitely use them for quick and dirty PoCs or simple utilities -- they're really convenient to work with given the tooling in Visual Studio, and get the job done.

  • Typed datasets are by far an upgrade from the world of classic ADO disconnected recordsets. I have found that they are still nice to use in simple situations where you need to perform some sort of task that's row-oriented -- i.e. you still want to work in the context of a database paradigm of rows, columns, constraints and the like. If used wisely in that context, then you're OK.

    There are a few areas where their benefits diminish:

    • I think the synchronization issues brought up here already are definitely a problem, especially if you've gone and customized them or used them as a base class.
    • Depending on the number of data tables in the dataset, they can become quite fat. I mean this in the sense that multi-table datasets typically present a relational view of data. What comes along with that, besides the in-memory footprint, are definitions of keys and potentially other constraints. Again, if that's what you need, great, but if you need to traverse data quickly, one time, then an efficient loop with a data reader might be a better candidate.
    • Because of their complex definition and potential size, using them in remoting situations is ill advised as well.
    • Finally, when you start realizing you need to work with your data in objects that are relevant to your problem domain, their use becomes more of a hindrance than a benefit. You constantly find yourself moving fields in and out of the rows and tables in the set and concerning yourself with the state of the tables and rows. You begin to realize that OO languages were made to make it easier to represent real-world problem domain objects, and that working with tables, rows and columns doesn't really fit into that way of thinking.

    Generally in my experience, I am finding that complex systems (e.g. many large enterprise systems) are better off moving away from the use of datasets and more towards a solid domain-specific object model -- how you get your data in and out of those objects (using ORMs, for example) is another topic of conversation altogether. However, in small projects where there's a form slapped in front of data that needs basic maintenance and some other simple operations, great productivity can be achieved with the dataset paradigm -- especially when coupled with Visual Studio/.NET's powerful databinding features.

    Jim Burger : Personally I think your last bullet point is the clincher: 'They made OO languages to make it easier to represent [a] real-world problem domain'. In fairness, your first point has since been mitigated slightly with the introduction of partial classes, allowing for table and row customizations that don't get blown away every time you refresh the database. Still, I would only use them in the simplest of rich client applications where performance is considered a 'hardware issue'.
  • Datasets are nice for quickly slapping something together with visual studio, if all the issues mentioned previously are ignored. One problem I did not see mentioned is the visual scalability of datasets within the design surface of Visual Studio. As the system grows, the size of the datasets inevitably becomes unwieldy. The visual aspects of the designer simply don't scale. It is a real pain to scroll around trying to find a particular table or relation when the dataset has more than 20 or so tables.

    From Todd Stout
  • I'm not a big fan of typed datasets. There is no way I can improve performance by using a typed dataset; it's purely a wrapper over the existing database objects. Access like employee.empName is just convenience -- the casting is still done inside the wrapper. Another overhead is the huge chunk of generated code: LOC goes up and there are many more active objects in memory, with no automatic update of the schema. Overall, typed datasets offer developers little beyond comfort. As developers we don't have any right to demand comfort :) Take the pain...take the pain out of the user :)

Best way to hide DB connection code in PHP5 for apps that only require one connection?

Below I present three options for simplifying my database access when only a single connection is involved (this is often the case for the web apps I work on).

The general idea is to make the DB connection transparent, such that it connects the first time my script executes a query, and then it remains connected until the script terminates.

I'd like to know which one you think is the best and why. I don't know the names of any design patterns that these might fit so sorry for not using them. And if there's any better way of doing this with PHP5, please share.

To give a brief introduction: there is a DB_Connection class containing a query method. This is a third-party class which is out of my control and whose interface I've simplified for the purpose of this example. In each option I've also provided an example model for an imaginary DB "items" table to give some context.

Option 3 is the one that provides me with the interface I like most, but I don't think it's practical unfortunately.

I've described the pros and cons (that I can see) of each in the comment blocks below.

At the moment I lean towards Option 1 since the burden is put on my DB wrapper class instead of on the models.

All comments appreciated!


<?php

/**
 * This is the 3rd-party DB interface I'm trying to wrap.
 * I've simplified the interface to one method for this example.
 *
 * This class is used in each option below.
 */
class DB_Connection {
    public function &query($sql) { }
}

/**
 * OPTION 1
 *
 * Cons: Have to wrap every public DB_Connection method.
 * Pros: The model code is simple.
 */
class DB {
    private static $connection;
    private static function &getConnection() {
        if (!self::$connection) {
            self::$connection = new DB_Connection();
        }
        return self::$connection;
    }
    public static function &query($sql) {
        $dbh = self::getConnection();
        return $dbh->query($sql);
    }
}

class Item {
    public static function &getList() {
        return DB::query("SELECT * FROM items");
    }
}

/**
 * OPTION 2
 *
 * Pros: Don't have to wrap every DB_Connection function like in Option 1
 * Cons: Every function in the model is responsible for checking the connection
 */

class DB {
    protected static $connection = null;
    public static function connect() {
        self::$connection = new DB_Connection();
    }
}

class Item extends DB {
    public static function &getList() {
        if (!self::$connection) self::connect();
        return self::$connection->query("SELECT * FROM items");
    }
}

/**
 * OPTION 3
 *
 * Use magic methods
 *
 * Pros: Simple model code AND don't have to reimplement the DB_Connection interface
 * Cons: __callStatic requires PHP 5.3.0 and its args can't be passed-by-reference.
 */
class DB {
    private static $connection = null;

    public static function &getConnection() {
        if (!self::$connection) {
            self::$connection = new DB_Connection();
        }
        return self::$connection;
    }

    public static function __callStatic($name, $args) {
        if (in_array($name, get_class_methods('DB_Connection'))) {
            return call_user_func_array(
                array(self::getConnection(), $name), $args);
        }
    }
}
  • Based on your examples above, I'd say option 1 is the best - simplicity always wins, and you can handle a failed connection differently depending on the method (you might want to fail differently for a stored procedure call than a simple SELECT, for instance).

    From Steve M
  • Semantically speaking, I think option 1 makes the most sense: if you're treating DB as a resource, then DB_Connection is an object that it uses, but not necessarily the object itself.

    However, several things I caution you against. First, don't make your DB class have all static methods as it will strongly impact your ability to test your code. Consider instead a very simple inversion of control container like this:

    class DB {
        private $connection;

        public function __construct(&$db_connection) {
            $this->connection = $db_connection;
        }

        public function &query($sql) {
            return $this->connection->query($sql);
        }
    }

    class Item {
        public function &getList() {
            return ResourceManager::getDB()->query("SELECT * FROM items");
        }
    }

    class ResourceManager {
        private static $db_connection;
        private static $db;

        private static function &getDbConnection() {
            if (!self::$db_connection) {
                self::$db_connection = new DB_Connection();
            }
            return self::$db_connection;
        }

        public static function getDB() {
            if (!self::$db) {
                self::$db = new DB(self::getDbConnection());
            }
            return self::$db;
        }
    }
    

    There are significant benefits. If you decide DB should not be used as a singleton, you just make one modification to ResourceManager. If you want to return a different instance of DB based on some context -- again, the change is in only one place.

    Now if you want to test Item in isolation of DB simply create a setDb($db) method in ResourceManager and use it to set a fake/mock database (simplemock will serve you well in that respect).

    Second - and this is another modification that this design eases - you might not want to keep your database connection open the entire time, it can end up using far more resources than need be.

    Finally, as you mention that DB_Connection has other methods not shown, it seems like it might be being used for more than simply maintaining a connection. Since you say you have no control over it, might I recommend extracting an interface from it of the methods that you DO care about, and making a MyDBConnection extends DB_Connection class that implements your interface. In my experience something like that will ultimately ease some pain as well.

Open Source Actionscript 3 or Javascript date utility classes?

I was wondering if anyone could point to an Open Source date utility class that is fairly robust. I find myself rolling my own when I want to do a lot of things I take for granted in C# and Java. For instance I did find a decent example of a DateDiff() function that I tore apart and another DatePart() function. Another examples would be parsing different date/time formats. I'm trying to avoid reinventing something if it's already built.

Another possibility may be a nice set of Javascript files that I can convert to ActionScript 3. So far I've found DateJS but I want to get a good idea of what is out there.

  • as3corelib has the DateUtil class and it should be pretty reliable since it's written by some Adobe employees. I haven't encountered any problems with it.

    From hasseg
  • There is also DP_DateExtensions, though I believe DateJS is more robust.

    From pkaeding
  • Thanks for the answers, these libraries will help tremendously.

    From Ian S

Retaining HTTP POST data when a request is interrupted by a login page

Say a user is browsing a website, and then performs some action which changes the database (let's say they add a comment). When the request to actually add the comment comes in, however, we find we need to force them to login before they can continue.

Assume the login page asks for a username and password, and redirects the user back to the URL they were going to when the login was required. That redirect works fine for a URL with only GET parameters, but if the request originally contained some HTTP POST data, that is now lost.

Can anyone recommend a way to handle this scenario when HTTP POST data is involved?

Obviously, if necessary, the login page could dynamically generate a form with all the POST parameters to pass them along (though that seems messy), but even then, I don't know of any way for the login page to redirect the user on to their intended page while keeping the POST data in the request.


Edit: One extra constraint I should have made clear - imagine we don't know if a login will be required until the user submits their comment. For example, their cookie might have expired between when they loaded the form and actually submitted the comment.

  • This is one good place where Ajax techniques might be helpful. When the user clicks the submit button, show the login dialog on client side and validate with the server before you actually submit the page.

    Another way I can think of is showing or hiding the login controls in a DIV tag dynamically in the main page itself.

    From Gulzar
  • Collect the data on the page where they submitted it and store it in your backend (database?) while they go off through the login sequence; hide a transaction id or similar on the page with the login form. When they're done, return them to the page they asked for by looking it up using the transaction id on the backend, and dump all the data they posted into the form for previewing again, or just run whatever code that page would run.
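
    A rough sketch of that store-and-replay idea in Python (framework-agnostic; the in-memory dict here is purely illustrative -- a real app would use its session store or a temp table, as described above):

    import uuid

    pending_posts = {}  # txn_id -> (original_url, form_data); use a DB/session store in practice

    def stash_post(original_url, form_data):
        """Called when an unauthenticated POST arrives; returns the token to
        embed as a hidden field in the login form."""
        txn_id = str(uuid.uuid4())
        pending_posts[txn_id] = (original_url, dict(form_data))
        return txn_id

    def replay_post(txn_id):
        """Called after a successful login; returns the stashed request so the
        controller can re-run it (or pre-fill the form for previewing)."""
        return pending_posts.pop(txn_id, None)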

    Note that many systems, eg blogs, get around this by having login fields in the same form as the one for posting comments, if the user needs to be logged in to comment and isn't yet.

    From castaway
  • Just store all the necessary data from the POST in the session until after the login process is completed, or use some sort of temp table in the DB to store it in and then retrieve it later. Obviously this is pseudo-code, but:

    if ( !loggedIn ) {
        StorePostInSession();
        ShowLoginForm();
    }
    
    if ( postIsStored ) {
        RetrievePostFromSession();
    }
    

    Or something along those lines.

    From Swish
  • 2 choices:

    1. Write out the messy form from the login page, and JavaScript form.submit() it to the page.
    2. Have the login page itself POST to the requesting page (with the previous values), and have that page's controller perform the login verification. Roll this into whatever logic you already have for detecting the not logged in user (frameworks vary on how they do this). In pseudo-MVC:
    
            CommentController {
               void AddComment() {
                 if (!Request.User.IsAuthenticated && !AuthenticateUser()) {
                    return;
                 }
                 // add comment to database
               }
    
               bool AuthenticateUser() {
                 if (Request.Form["username"] == "") {
                    // show login page
                    foreach (Key key in Request.Form) {
                       // copy form values
                       ViewData.Form.Add("hidden", key, Request.Form[key]);
                    }
                    ViewData.Form.Action = Request.Url;
    
                    ShowLoginView();
                    return false;
                  } else {
                     // validate login
                     return TryLogin(Request.Form["username"], Request.Form["password"]);
                  } 
               }
            }
    
  • I know it says language-agnostic, but why not take advantage of the conventions provided by the server-side language you are using? If it were Java, the data could persist by setting a Request attribute. You would use a controller to process the form, detect the login, and then forward through. If the attributes are set, then just prepopulate the form with that data?

    Edit: You could also use a Session as pointed out, but I'm pretty sure if you use a forward in Java back to the login page, that the Request attribute will persist.

    From hal10001
  • You might want to investigate why Django removed this feature before implementing it yourself. It doesn't seem like a Django specific problem, but rather yet another cross site forgery attack.

Simultaneous calls from CDR

I need to come up with an analysis of simultaneous events, when having only the start time and duration of each event.

Details

I have a standard CDR (call detail record) that contains, among other fields:

  • calldate (datetime of each call start)
  • duration (int, seconds of call duration)
  • channel (a string)

What I need to come up with is some sort of analysis of simultaneous calls for each second, over a given datetime period. For example, a graph of the simultaneous calls we had yesterday.

(The problem is the same if we have visitor logs with durations on a website and wish to obtain the number of simultaneous clients for a group of web pages.)

What would your algorithm be?

I can iterate over the records in the given period and fill an array, where each bucket of the array corresponds to 1 second in the overall period. This works and seems to be fast, but if the time period is big (say, 1 year), I would need lots of memory (3600 x 24 x 365 x 4 bytes ~ 120 MB approx.).
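
For concreteness, the bucket-per-second approach described above might look like this sketch in Python (assuming the CDRs are available as (calldate, duration) pairs; the names are illustrative):

def concurrency_per_second(calls, period_start, period_end):
    """calls: iterable of (calldate: datetime, duration: int seconds)."""
    total_seconds = int((period_end - period_start).total_seconds())
    # one counter per second of the period -- this is the ~120 MB array
    # the question worries about when the period is a whole year
    buckets = [0] * total_seconds

    for calldate, duration in calls:
        offset = int((calldate - period_start).total_seconds())
        start = max(offset, 0)                       # clamp to the period
        end = min(offset + duration, total_seconds)
        for second in range(start, end):
            buckets[second] += 1

    # buckets[i] = number of simultaneous calls at period_start + i seconds
    return buckets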

This is for a web-based, interactive app, so my memory footprint needs to stay small.

Edit

By simultaneous, I mean all calls active during a given second. A second would be my minimum unit; I cannot use something bigger (an hour, for example) because all the calls during an hour need not have been in progress at the same time.

  • I would implement this on the database. Using a GROUP BY clause with DATEPART, you could get a list of simultaneous calls for whatever time period you wanted, by second, minute, hour, whatever.

    On the web side, you would only have to display the histogram that is returned by the query.

  • @eric-z-beard: I would really like to be able to implement this on the database. I like your proposal, and while it seems to lead to something, I don't quite fully understand it. Could you elaborate? Please recall that each call will span several seconds, and each second needs to count. If using DATEPART (or something like it on MySQL), which second should be used for the GROUP BY? See the note on simultaneous.

    Elaborating over this, I found a way to solve it using a temporary table. Assuming temp holds all seconds from tStart to tEnd, I could do

    SELECT temp.second, count(call.id)
    FROM call, temp
    WHERE temp.second BETWEEN call.start AND call.start + call.duration
    GROUP BY temp.second
    

    Then, as suggested, the web app should use this as a histogram.

  • You can use a static Numbers table for lots of SQL tricks like this. The Numbers table simply contains integers from 0 to n for n like 10000.

    Then your temp table never needs to be created, and instead is a subquery like:

    SELECT StartTime + Numbers.Number AS Second
    FROM Numbers
    
    From Cade Roux
  • You can create table 'simultaneous_calls' with 3 fields:

    yyyymmdd   Char(8),
    day_second Number,   -- second of the day
    count      Number    -- count of simultaneous calls

    Your web service can take the 'count' value from this table and compute statistics from it.

    The simultaneous_calls table will be filled by a batch program which runs every day after the end of the day.

    Assuming that you use Oracle, the batch may start a PL/SQL procedure which does the following:

    1. Appends the table with 24 * 3600 = 86,400 records, one for each second of the day, with a default 'count' value of 0.
    2. Defines the 'day_cdrs' cursor for the query:

    Select to_char(calldate, 'yyyymmdd')            yyyymmdd,
           (calldate - trunc(calldate)) * 24 * 3600 starting_second,
           duration                                 duration
    From   cdrs
    Where  cdrs.calldate >= Trunc(Sysdate - 1)
      And  cdrs.calldate <  Trunc(Sysdate)

    3. Iterates over the cursor to increment the 'count' field for the seconds of each call:
    
    For cdr in day_cdrs
    Loop 
       Update simultaneous_calls
       Set      count = count + 1
       Where yyyymmdd = cdr.yyyymmdd
           And day_second Between cdr.starting_second And cdr.starting_second + cdr.duration;
    End Loop;
    

How do you check whether a python method is bound or not?

Given a reference to a method, is there a way to check whether the method is bound to an object or not? Can you also access the instance that it's bound to?

  • im_self attribute

  • def isbound(method):
        return method.im_self is not None
    
    def instance(bounded_method):
        return bounded_method.im_self
    

    User-defined methods:

    When a user-defined method object is created by retrieving a user-defined function object from a class, its im_self attribute is None and the method object is said to be unbound. When one is created by retrieving a user-defined function object from a class via one of its instances, its im_self attribute is the instance, and the method object is said to be bound. In either case, the new method's im_class attribute is the class from which the retrieval takes place, and its im_func attribute is the original function object.

    In Python 2.6 and 3.0:

    Instance method objects have new attributes for the object and function comprising the method; the new synonym for im_self is __self__, and im_func is also available as __func__. The old names are still supported in Python 2.6, but are gone in 3.0.
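
    For example, a quick sketch using the newer attribute names (per the above, they work from Python 2.6 onward and are the only spelling in 3.x):

    class Greeter(object):
        def hello(self):
            return "hello"

    g = Greeter()
    bound = g.hello
    unbound = Greeter.hello   # unbound method in 2.x, plain function in 3.x

    def is_bound(method):
        # getattr with a default also copes with 3.x plain functions,
        # which have no __self__ attribute at all
        return getattr(method, '__self__', None) is not None

    print(is_bound(bound))        # True
    print(is_bound(unbound))      # False
    print(bound.__self__ is g)    # True: the instance the method is bound to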

Why are SQL aggregate functions so much slower than Python and Java (or Poor Man's OLAP)

I need a real DBA's opinion. Postgres 8.3 takes 200 ms to execute this query on my Macbook Pro while Java and Python perform the same calculation in under 20 ms (350,000 rows):

SELECT count(id), avg(a), avg(b), avg(c), avg(d) FROM tuples;

Is this normal behaviour when using a SQL database?

The schema (the table holds responses to a survey):

CREATE TABLE tuples (id integer primary key, a integer, b integer, c integer, d integer);

\copy tuples from '350,000 responses.csv' delimiter as ','

I wrote some tests in Java and Python for context and they crush SQL (except for pure python):

java   1.5 threads ~ 7 ms    
java   1.5         ~ 10 ms    
python 2.5 numpy   ~ 18 ms  
python 2.5         ~ 370 ms

Even sqlite3 is competitive with Postgres despite it assuming all columns are strings (for contrast: even just switching to numeric columns instead of integers in Postgres results in a 10x slowdown).

Tunings I've tried without success (blindly following some web advice) include:

increased the shared memory available to Postgres to 256MB    
increased the working memory to 2MB
disabled connection and statement logging
used a stored procedure via CREATE FUNCTION ... LANGUAGE SQL

So my question is, is my experience here normal, and this is what I can expect when using a SQL database? I can understand that ACID must come with costs, but this is kind of crazy in my opinion. I'm not asking for realtime game speed, but since Java can process millions of doubles in under 20 ms, I feel a bit jealous.

Is there a better way to do simple OLAP on the cheap (both in terms of money and server complexity)? I've looked into Mondrian and Pig + Hadoop but not super excited about maintaining yet another server application and not sure if they would even help.

  • What does your java/python code do?

    Does it connect to Postgres and SELECT * then sum it all up? Or do you parse the CSV file, or what :-)?

  • No, the Python and Java code do all the work in-house, so to speak. I just generate 4 arrays with 350,000 random values each, then take the average. I don't include the generation in the timings, only the averaging step. The Java threads timing uses 4 threads (one per array average); overkill, but it's definitely the fastest.

    The sqlite3 timing is driven by the Python program and is running from disk (not :memory:)

    I realize Postgres is doing much more behind the scenes, but most of that work doesn't matter to me since this is read only data.

    The Postgres query doesn't change timing on subsequent runs.

    I've rerun the Python tests to include spooling it off the disk. The timing slows down considerably to nearly 4 secs. But I'm guessing that Python's file handling code is pretty much in C (though maybe not the csv lib?) so this indicates to me that Postgres isn't streaming from the disk either (or that you are correct and I should bow down before whoever wrote their storage layer!)
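
    (For reference, the in-memory averaging test being described is roughly the following -- an approximation, since the actual benchmark code isn't shown in the question:)

    import time
    import numpy as np

    N = 350000
    # four columns of random integers standing in for a, b, c, d
    cols = [np.random.randint(0, 10000, N) for _ in range(4)]

    start = time.time()
    count = len(cols[0])
    averages = [col.mean() for col in cols]
    elapsed_ms = (time.time() - start) * 1000

    print("%d rows averaged in %.1f ms" % (count, elapsed_ms))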

  • I don't think your results are all that surprising -- if anything, it's surprising that Postgres is so fast.

    Does the Postgres query run faster a second time once it has had a chance to cache the data? To be a little fairer your test for Java and Python should cover the cost of acquiring the data in the first place (ideally loading it off disk).

    If this performance level is a problem for your application in practice but you need a RDBMS for other reasons then you could look at memcached. You would then have faster cached access to raw data and could do the calculations in code.

    From Rob Walker
  • I would say your test scheme is not really useful. To fulfill the db query, the db server goes through several steps:

    1. parse the SQL
    2. work up a query plan, i. e. decide on which indices to use (if any), optimize etc.
    3. if an index is used, search it for the pointers to the actual data, then go to the appropriate location in the data or
    4. if no index is used, scan the whole table to determine which rows are needed
    5. load the data from disk into a temporary location (hopefully, but not necessarily, memory)
    6. perform the count() and avg() calculations

    So, creating an array in Python and getting the average basically skips all these steps save the last one. As disk I/O is among the most expensive operations a program has to perform, this is a major flaw in the test (see also the answers to this question I asked here before). Even if you read the data from disk in your other test, the process is completely different and it's hard to tell how relevant the results are.

    To obtain more information about where Postgres spends its time, I would suggest the following tests:

    • Compare the execution time of your query to a SELECT without the aggregating functions (i.e. cut the aggregation step)
    • If you find that the aggregation leads to a significant slowdown, see whether Python does it faster, obtaining the raw data through the plain SELECT from the comparison (a rough timing harness is sketched after this list).
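
    A rough sketch of that comparison in Python with psycopg2 (the connection string is a placeholder; adjust for your setup):

    import time
    import psycopg2

    conn = psycopg2.connect("dbname=test")   # placeholder DSN
    cur = conn.cursor()

    def timed(sql):
        start = time.time()
        cur.execute(sql)
        rows = cur.fetchall()
        return rows, (time.time() - start) * 1000

    agg, agg_ms = timed("SELECT count(id), avg(a), avg(b), avg(c), avg(d) FROM tuples")
    raw, raw_ms = timed("SELECT a, b, c, d FROM tuples")

    # aggregate the raw rows in Python for comparison
    start = time.time()
    n = len(raw)
    py_avgs = [sum(col) / float(n) for col in zip(*raw)]
    py_ms = (time.time() - start) * 1000

    print("SQL aggregate: %.0f ms" % agg_ms)
    print("raw SELECT: %.0f ms, aggregation in Python: %.0f ms" % (raw_ms, py_ms))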

    To speed up your query, reduce disk access first. I doubt very much that it's the aggregation that takes the time.

    There's several ways to do that:

    • Cache data (in memory!) for subsequent access, either via the db engine's own capabilities or with tools like memcached
    • Reduce the size of your stored data
    • Optimize the use of indices. Sometimes this can mean skipping index use altogether (after all, it's disk access, too). For MySQL, I seem to remember that it's recommended to skip indices if you assume that the query fetches more than 10% of all the data in the table.
    • If your query makes good use of indices, I know that for MySQL databases it helps to put indices and data on separate physical disks. However, I don't know whether that's applicable for Postgres.
    • There also might be more sophisticated problems such as swapping rows to disk if for some reason the result set can't be completely processed in memory. But I would leave that kind of research until I run into serious performance problems that I can't find another way to fix, as it requires knowledge about a lot of little under-the-hood details in your process.

    Update:

    I just realized that you seem to have no use for indices for the above query, and most likely aren't using any either, so my advice on indices probably wasn't helpful. Sorry. Still, I'd say that the aggregation is not the problem; disk access is. I'll leave the index stuff in anyway; it might still have some use.

  • One other thing that an RDBMS generally does for you is to provide concurrency by protecting you from simultaneous access by another process. This is done by placing locks, and there's some overhead from that.

    If you're dealing with entirely static data that never changes, and especially if you're in a basically "single user" scenario, then using a relational database doesn't necessarily gain you much benefit.

  • Those are very detailed answers, but they mostly leave open the question: how do I get these benefits without leaving Postgres, given that the data easily fits into memory, requires concurrent reads but no writes, and is queried with the same query over and over again?

    Is it possible to precompile the query and optimization plan? I would have thought the stored procedure would do this, but it doesn't really help.

    To avoid disk access it's necessary to cache the whole table in memory. Can I force Postgres to do that? I think it's already doing this, though, since the query executes in just 200 ms after repeated runs.

    Can I tell Postgres that the table is read only, so it can optimize any locking code?

    I think it's possible to estimate the query construction costs with an empty table (timings range from 20-60 ms)

    I still can't see why the Java/Python tests are invalid. Postgres just isn't doing that much more work (though I still haven't addressed the concurrency aspect, just the caching and query construction)

    UPDATE: I don't think it's fair to compare the SELECTS as suggested by pulling 350,000 through the driver and serialization steps into Python to run the aggregation, nor even to omit the aggregation as the overhead in formatting and displaying is hard to separate from the timing. If both engines are operating on in memory data, it should be an apples to apples comparison, I'm not sure how to guarantee that's already happening though.

    I can't figure out how to add comments; maybe I don't have enough reputation?

    Hanno Fietz : I'd say estimating the basic overhead with an empty table is good. Since you're doing a full table scan, index use is probably irrelevant, which leaves the bulk (160 to 180 ms) of your 200 ms to loading data from disk and aggregation. Have you compared the two SELECTs as I suggested?
    Bob Aman : You need 50 reputation to leave a comment.
  • You need to increase Postgres's caches to the point where the whole working set fits into memory before you can expect to see performance comparable to doing it in-memory with a program.

  • Postgres is doing a lot more than it looks like (maintaining data consistency for a start!)

    If the values don't have to be 100% spot on, or if the table is updated rarely, but you are running this calculation often, you might want to look into Materialized Views to speed it up.

    (Note, I have not used materialized views in Postgres; they look a little hacky, but they might suit your situation.)

    Materialized Views

    Also consider the overhead of actually connecting to the server and the round trip required to send the request to the server and back.

    I'd consider 200 ms for something like this to be pretty good. A quick test on my Oracle server, with the same table structure, about 500k rows, and no indexes, takes about 1-1.5 seconds, which is almost all just Oracle sucking the data off disk.

    The real question is, is 200ms fast enough?

    -------------- More --------------------

    I was interested in solving this using materialized views, since I've never really played with them. This is in Oracle.

    First I created a MV which refreshes every minute.

    create materialized view mv_so_x 
    build immediate 
    refresh complete 
    START WITH SYSDATE NEXT SYSDATE + 1/24/60
     as select count(*),avg(a),avg(b),avg(c),avg(d) from so_x;
    

    While it's refreshing, no rows are returned:

    SQL> select * from mv_so_x;
    
    no rows selected
    
    Elapsed: 00:00:00.00
    

    Once it refreshes, it's MUCH faster than doing the raw query:

    SQL> select count(*),avg(a),avg(b),avg(c),avg(d) from so_x;
    
      COUNT(*)     AVG(A)     AVG(B)     AVG(C)     AVG(D)
    ---------- ---------- ---------- ---------- ----------
       1899459 7495.38839 22.2905454 5.00276131 2.13432836
    
    Elapsed: 00:00:05.74
    SQL> select * from mv_so_x;
    
      COUNT(*)     AVG(A)     AVG(B)     AVG(C)     AVG(D)
    ---------- ---------- ---------- ---------- ----------
       1899459 7495.38839 22.2905454 5.00276131 2.13432836
    
    Elapsed: 00:00:00.00
    SQL>
    

    If we insert into the base table, the result is not immediately visible via the MV.

    SQL> insert into so_x values (1,2,3,4,5);
    
    1 row created.
    
    Elapsed: 00:00:00.00
    SQL> commit;
    
    Commit complete.
    
    Elapsed: 00:00:00.00
    SQL> select * from mv_so_x;
    
      COUNT(*)     AVG(A)     AVG(B)     AVG(C)     AVG(D)
    ---------- ---------- ---------- ---------- ----------
       1899459 7495.38839 22.2905454 5.00276131 2.13432836
    
    Elapsed: 00:00:00.00
    SQL>
    

    But wait a minute or so, and the MV will update behind the scenes, and the result is returned as fast as you could want.

    SQL> /
    
      COUNT(*)     AVG(A)     AVG(B)     AVG(C)     AVG(D)
    ---------- ---------- ---------- ---------- ----------
       1899460 7495.35823 22.2905352 5.00276078 2.17647059
    
    Elapsed: 00:00:00.00
    SQL>
    

    This isn't ideal. For a start, it's not realtime; inserts/updates will not be immediately visible. Also, you've got a query running to update the MV whether you need it or not (this can be tuned to whatever time frame, or run on demand). But it does show how much faster an MV can make things seem to the end user, if you can live with values which aren't quite up-to-the-second accurate.

    Jacob Rigby : I like it, I think this would work for my situation. I have to check into Postgres support for MVs; it does look a little less solid than Oracle's. This is probably 75% of the answer, but I still wonder if it's possible to speed up the live aggregate query.
    Dems : I'd be interested to see how the 200ms was established. I agree that the network round trip carries an overhead. Also, putting code in a StoredProc can remove the time to parse, compile, etc the SQL query. In my experience aggregate functions are NOT slow in SQL...
  • Thanks for the Oracle timings, that's the kind of stuff I'm looking for (disappointing though :-)

    Materialized views are probably worth considering as I think I can precompute the most interesting forms of this query for most users.

    I don't think query round trip time should be very high, as I'm running the queries on the same machine that runs Postgres, so it can't add much latency?

    I've also done some checking into the cache sizes, and it seems Postgres relies on the OS to handle caching; they specifically mention BSD as the ideal OS for this, so I think Mac OS ought to be pretty smart about bringing the table into memory. Unless someone has more specific params in mind, I think more specific caching is out of my control.

    In the end I can probably put up with 200 ms response times, but knowing that 7 ms is a possible target makes me feel unsatisfied, as even 20-50 ms times would enable more users to have more up-to-date queries and get rid of a lot of caching and precomputation hacks.

    I just checked the timings using MySQL 5 and they are slightly worse than Postgres. So barring some major caching breakthroughs, I guess this is what I can expect going the relational db route.

    I wish I could up vote some of your answers, but I don't have enough points yet.

  • I'm a MS-SQL guy myself, and we'd use DBCC PINTABLE to keep a table cached, and SET STATISTICS IO to see that it's reading from cache, and not disk.

    I can't find anything on Postgres to mimic PINTABLE, but pg_buffercache seems to give details on what is in the cache - you may want to check that, and see if your table is actually being cached.

    A quick back of the envelope calculation makes me suspect that you're paging from disk. Assuming Postgres uses 4-byte integers, you have (6 * 4) bytes per row, so your table is a minimum of (24 * 350,000) bytes ~ 8.4MB. Assuming 40 MB/s sustained throughput on your HDD, you're looking at right around 200ms to read the data (which, as pointed out, should be where almost all of the time is being spent).
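
    Spelling out that back-of-the-envelope estimate (same assumptions as above: 6 four-byte fields per row and ~40 MB/s sustained reads):

    rows = 350000
    bytes_per_row = 6 * 4                    # per the estimate above
    table_bytes = rows * bytes_per_row       # ~8.4 MB
    read_mb_per_s = 40.0                     # assumed sustained HDD throughput

    seconds = (table_bytes / 1e6) / read_mb_per_s
    print("table size: %.1f MB, read time: %.0f ms"
          % (table_bytes / 1e6, seconds * 1000))
    # -> table size: 8.4 MB, read time: 210 ms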

    Unless I screwed up my math somewhere, I don't see how it's possible that you are able to read 8MB into your Java app and process it in the times you're showing - unless that file is already cached by either the drive or your OS.

  • I retested with MySQL specifying ENGINE = MEMORY and it doesn't change a thing (still 200 ms). Sqlite3 using an in-memory db gives similar timings as well (250 ms).

    The math here looks correct (at least the size, as that's how big the sqlite db is :-)

    I'm just not buying the disk-causes-slowness argument as there is every indication the tables are in memory (the postgres guys all warn against trying too hard to pin tables to memory as they swear the OS will do it better than the programmer)

    To clarify the timings, the Java code is not reading from disk, making it a totally unfair comparison if Postgres is reading from the disk and calculating a complicated query. But that's really beside the point: the DB should be smart enough to bring a small table into memory and precompile a stored procedure, IMHO.

    UPDATE (in response to the first comment below):

    I'm not sure how I'd test the query without using an aggregation function in a way that would be fair, since if i select all of the rows it'll spend tons of time serializing and formatting everything. I'm not saying that the slowness is due to the aggregation function, it could still be just overhead from concurrency, integrity, and friends. I just don't know how to isolate the aggregation as the sole independent variable.

    Hanno Fietz : I must say I'm surprised, but it seems to me you are correct in assuming the table data is not read from disk. Still, are you certain that it's the aggregation part, rather than the retrieval part? I. e., do you get a speed gain if you do a simple SELECT * and aggregate the rows in the program?
  • Are you using TCP to access the Postgres? In that case Nagle is messing with your timing.

    Jacob Rigby : Yes I am using TCP, but it's on the localhost. Also simple queries can execute very fast, between 15-60 ms.
    gabr : If it's localhost then Nagle doesn't kick in. Sorry for not being of any help :(
    From gabr

How to find all database references

In trying to figure out this problem (which is still unsolved and I still have no clue what is going on), I am wondering if maybe an external reference to the table in question is causing the problem. For example, a trigger or view or some other such thing.

Is there an easy way to find all references to a given database table? Including all views, triggers, constraints, or anything at all, preferably from the command line, and also preferably without a 3rd party tool (we are using db2).

  • Wow, I wouldn't have thought it, but there seems to be.. Good ole DB2.

    I find the publib db2 docs view very very handy by the way:

    http://publib.boulder.ibm.com/infocenter/db2luw/v8//index.jsp

    I just found the "SYSCAT.TABDEP" catalog view in it, which seems to contain more or less what you asked for. I suspect for anything not covered there you'll have to trawl through the rest of the syscat tables which are vast. (Unfortunately I can't seem to link you to the exact page on SYSCAT.TABDEP itself, the search facility should lead you to it fairly easily though).

    Most databases these days have a set of tables which contain data about the layout of your actual schema tables, quite handy for this sort of thing.

    From castaway
  • You can write a query to search the information schema views (the definition column) to find the table in all views, triggers, procedures, etc. Not sure about FKs & indexes though.

    From Booji Boy

Getting closest element by id

I have two elements:

<input a>
<input b onclick="...">

When b is clicked, I want to access a and manipulate some of its data. A does not have a globally unique name, so document.getElementsByName is out. Looking into the event object, I thought event.target.parentNode would have some function like getElementsByName, but this does not seem to be the case with <td>s. Is there any simple way to do this?

  • If 'a' and 'b' are next to each other and have the same parent, you can use the previousSibling property of 'b' to find 'a'.

    From 17 of 26
    1. You should be able to find the element that was clicked from the event object. Depending on your browser you might want e.target or e.srcElement. The code below is from this w3schools example:

      function whichElement(e) {
        var targ;
        if (!e) var e = window.event;
        if (e.target) {
          targ=e.target;
        } else if (e.srcElement) {
          targ = e.srcElement;
        }
      
      
        if (targ.nodeType==3) { // defeat Safari bug 
          targ = targ.parentNode;
        }
      
      
        var tname;
        tname = targ.tagName;
        alert("You clicked on a " + tname + " element.");
      }
      
    2. You may then use the nextSibling and previousSibling DOM traversal properties. Some more information here. And yet again a w3schools reference for XML DOM Nodes.

  • Leave your plain vanilla JavaScript behind. Get jQuery--it will save you a ton of time.

    http://docs.jquery.com/Selectors

    From rp
  • Prototype also has nice functions to move around in the DOM. In your example something like the following would do the trick:

    b.up().down('a')
    

    And if there is more than one a element at that level, you have the power of CSS selectors at hand to specify exactly which element you want.

    From ujh

Is there an ASP.NET pagination control (Not MVC)?

I've got a search results page that basically consists of a repeater with content in it. What I need is a way to paginate the results. Getting paginated results isn't the problem; what I'm after is a web control that will display a list of the available paged data, preferably by providing the number of results and a page size.

  • Repeaters don't do this by default.

    However, GridViews do.

    Personally, I hate GridViews, so I wrote a Paging/Sorting Repeater control.

    Basic Steps:

    • Subclass the Repeater Control
    • Add a private PagedDataSource to it
    • Add a public PageSize property
    • Override Control.DataBind
      • Store the Control.DataSource in the PagedDataSource.
      • Bind the Control.DataSource to PagedDataSource
    • Override Control.Render
      • Call Base.Render()
      • Render your paging links.

    For a walkthrough, you could try this link:

    http://aspnet.4guysfromrolla.com/articles/081804-1.aspx

    Neil N : is rolling your own paginated repeater worth not using a gridview?
    From FlySwat

Looking for experiences working with Qt

I had an idea for a plug-in for my favorite program runner Launchy, however after downloading the source code I found that it makes heavy use of the Qt library. The bulk of my experience is in C# and PHP and until that moment I had never really even heard of this library.

Which leads me to ask: does anyone have any impressions of it that they can share with me? Do you find working with it to be hard or easy? What professional/hobbyist prospects would experience with Qt make available to me? Is it fun? Do you think it's worth my time overall? Are there any common gotchas I should keep an eye out for, should I decide to learn it? Just trying to solicit some opinions.

  • I use Qt for commercial development and find it a very good library to work with. The main problem that you are likely to have is that it is C++. If you don't have much C++ experience most of the issues are likely to be C++ related rather than Qt related.

    The part of using Qt which needs the most work to learn is the signal-slot mechanism. However, this is well documented.

    You tagged the question with vc++, but the free version of Qt on Windows only officially supports the MinGW compiler. The integration with Visual Studio is part of the commercial edition. (Qt open source edition) The Launchy plugin page does describe using Qt with Visual Studio, but then you need to build it from source yourself.

    The KDE environment is based on Qt, so learning Qt will be useful if you want to work on any other Qt applications. Professionally, Qt is generally used for cross platform applications.

  • I just want to add that learning Qt is definitely worth it, especially when programming with C++. Besides the GUI classes, it brings a lot of framework classes for handling containers, databases, and threads.

    One thing that is not often mentioned is the way Qt helps you translate your app into multiple languages. For us, it is definitely a big plus.

    As David mentioned, the documentation is really good. Class and method naming is very consistent, which helps in learning and using the whole framework.

    And once you understand signals and slots, you won't stop using them!

    We've been using it commercially for five years now, and I would definitely recommend it. Even though the support is not as fast as it used to be, it is still excellent.

    From Jérôme
  • Take a look at Launchy#, a C# binding for Launchy: http://sourceforge.net/projects/launchysharp/

    ip : Hi, do you know if I need to install and build Qt as well as Launchy#? I'm looking for very basic functionality along the sample go-y# lines...

Cannot delete from the database...?

So, I have 2 database instances, one is for development in general, another was copied from development for unit tests.

Something changed in the development database that I can't figure out, and I don't know how to see what is different.

When I try to delete from a particular table, with for example:

delete from myschema.mytable where id = 555

I get the following normal response from the unit test DB indicating no row was deleted:

SQL0100W No row was found for FETCH, UPDATE or DELETE; or the result of a query is an empty table. SQLSTATE=02000

However, the development database fails to delete at all with the following error:

DB21034E The command was processed as an SQL statement because it was not a valid Command Line Processor command. During SQL processing it returned: SQL0440N No authorized routine named "=" of type "FUNCTION" having compatible arguments was found. SQLSTATE=42884

My best guess is there is some trigger or view that was added or changed that is causing the problem, but I have no idea how to go about finding the problem... has anyone had this problem or know how to figure out what the root of the problem is?

(note that this is a DB2 database)

  • You might have an open transaction on the dev db...that gets me sometimes on SQL Server

    From w4ik
  • Is the type of id compatible with 555? Or has it been changed to a non-integer type?

    Alternatively, does the 555 argument somehow go missing (e.g. if you are using JDBC and the prepared statement did not get its arguments set before executing the query)?

  • @Leigh Caldwell

    I am doing this directly in the db2 shell, exactly as I have above, so the arguments can't be going missing.

    When I do a describe on the table in the 2 instances, they are exactly the same.

    From Mike Stone
  • Can you add more to your question? That error sounds like the sql statement parser is very confused about your statement. Can you do a select on that table for the row where id = 555 ?

    You could try running a RUNSTATS and REORG TABLE on that table, those are supposed to sort out wonky tables.

    From castaway
  • @castaway

    A select with the same "where" condition works just fine, just not delete. Neither runstats nor reorg table have any affect on the problem.

    From Mike Stone
  • Hmm, applying the great oracle to this question, I came up with:

    http://bytes.com/forum/thread830774.html

    It seems to suggest that another table has a foreign key pointing at the problematic one, when that FK on the other table is dropped, the delete should work again. (Presumably you can re-create the foreign key as well)

    Does that help any?

    From castaway
  • @castaway

    We actually just solved the problem, and indeed it is just what you said (a coworker found that exact same page too).

    The solution was to drop foreign key constraints and re-add them.

    Another post on the subject:

    http://www.ibm.com/developerworks/forums/thread.jspa?threadID=208277&tstart=-1

    Which indicates that the problem is a referential constraint corruption, and is actually (or supposedly, anyway) fixed in a later version of DB2 V9 (which we are not yet using).

    Thanks for the help!

    From Mike Stone
  • Please check: 1. the arguments of your triggers, procedures, functions, etc.; 2. the data types of those arguments.