Thank you! Your feedback has been delivered
Thank you! Your feedback has been sent

SQL - Devising a query to get StackOverflow

I need an SQL query that will get me the reply times to all questions in Stack Overflow sorted in to 8 groups:

  • Less then an hour

  • Between an hour and 2 hours

  • 2 hours to 6 hours

  • 6 hours to a day

  • A day to 3 days

  • 3 days to a week

  • Week to a month

  • Later then a month

You can access the Stack Overflow query browser here: http://data.stackexchange.com/stackoverflow/query/new

It allows you to see the DB structure and query it freely. It's an MS SQL Server.

A reply is considered either an answer or a comment.

Looking forward to your answers guys!

User Gravatar

BadGoat

Posted Jan 22 2014 5:12 UTC

$100


  • Assigned To beyal
  • Solved
  • sql
    mssql
  • 1194 Views

17 Replies


do you want each question per category or the number of questions in a category?

User Gravatar

guyisra

Posted Jan 22 2014 6:09 UTC

The number of questions in a group.

For example:

  • Less then an hour: 1,200,000 questions.
  • 1 - 2 hours: 200,000 questions.
  • And so on...
User Gravatar

BadGoat

Posted Jan 22 2014 6:14 UTC

http://data.stackexchange.com/stackoverflow/query/161930

WITH Answers AS (
    select questions.id, DATEDIFF(ss,questions.CreationDate,answers.CreationDate) as tta
    from Posts as questions
    join Posts answers on answers.id = questions.AcceptedAnswerId
    where questions.AcceptedAnswerId is not null
  ) 
  , less_minute AS (
      select id 
      from Answers
      where tta <= 60)
  , less_hour  AS (
      select id 
      from Answers
      where tta <= 3600)

  , hour_two AS (
      select id 
      from Answers
      where tta <= 3600*2 and tta > 3600)

  , two_six AS (
      select id 
      from Answers
      where tta <= 3600*6 and tta > 3600 * 2)

  ,six_day AS (
      select id 
      from Answers
      where tta <= 3600*24 and tta > 3600 * 6)

  ,day_three AS (
      select id 
      from Answers
      where tta <= 3600*3*24 and tta > 3600 * 24)

  ,week_month AS (
      select id 
      from Answers
      where tta <= 3600*24*30 and tta > 3600 *24*7)

  ,more_than_month AS (
      select id 
      from Answers
      where tta > 3600*24*30)

select (select count(*) from less_minute) as [less than minute],

(select count(*) from less_hour) as [less than hour],

(select count(*) from hour_two) as [between hour and two],

(select count(*) from two_six) as [two and six hours],

(select count(*) from six_day) as [six and day],

(select count(*) from day_three) as [day three],

 (select count(*) from week_month) as [week month],

(select count(*) from more_than_month) as [more than month]

made it look better

User Gravatar

guyisra

Posted Jan 22 2014 6:45 UTC

First of all thanks for the quick response!

Second of all it seems that the total number of question I get from your query is 3,732,517 when the total amount of questions in Stack Overflow about 6,500,000.

From the queries I ran there are only about 2000 question with no replies at all at Stack Overflow so the numbers don't correlate.

The query I used is:

SELECT count(*) FROM Posts
WHERE PostTypeId = 1
AND AnswerCount = 0
AND CommentCount = 0

Also it seems that you screen by only accepted answers. I was talking about all question in the site.

User Gravatar

BadGoat

Posted Jan 22 2014 7:03 UTC

maybe there was misunderstanding

I checked for the question with accepted answers (yes you are right I forgot the filter, I add it in http://data.stackexchange.com/stackoverflow/query/161958)

do you want time to respond or time to accept an answer?

User Gravatar

guyisra

Posted Jan 22 2014 7:13 UTC

ok I see

what about this http://data.stackexchange.com/stackoverflow/query/161976 ?

User Gravatar

guyisra

Posted Jan 22 2014 7:24 UTC

It's closer but still only questions with accepted answers..

Should be all questions regardless of if they are solved or not.

User Gravatar

BadGoat

Posted Jan 22 2014 7:39 UTC

I updated my Select:

http://data.stackexchange.com/stackoverflow/query/edit/162010

WITH qus_replay_times_desc AS ( SELECT qus_id, replay_time = CASE WHEN DATE_DIFF_IN_SEC < 3600 THEN '1 - Less then an hour' WHEN DATE_DIFF_IN_SEC >= 3600 and DATE_DIFF_IN_SEC < 7200 THEN '2 - Between an hour and 2 hours' WHEN DATE_DIFF_IN_SEC >= 7200 and DATE_DIFF_IN_SEC < 21600 THEN '3 - 2 hours to 6 hours' WHEN DATE_DIFF_IN_SEC >= 21600 and DATE_DIFF_IN_SEC < 86400 THEN '4 - 6 hours to a day' WHEN DATE_DIFF_IN_SEC >= 86400 and DATE_DIFF_IN_SEC < 259200 THEN '5 - A day to 3 days' WHEN DATE_DIFF_IN_SEC >= 259200 and DATE_DIFF_IN_SEC < 604800 THEN '6 - 3 days to a week' WHEN DATE_DIFF_IN_SEC >= 604800 and DATE_DIFF_IN_SEC < 2592000 THEN '7 - Week to a month' ELSE '8 - Later then a month' END FROM (SELECT qus_id, MIN(DATE_DIFF_IN_SEC) DATE_DIFF_IN_SEC FROM (SELECT questions.id qus_id,
DATEDIFF(s, questions.creationdate, answers.creationdate) DATE_DIFF_IN_SEC FROM posts questions, posts answers WHERE questions.id=answers.parentid) qus_ans_response_time GROUP BY qus_id) qus_min_response_time ) SELECT SUBSTRING(replay_time, 5, len(replay_time)), count(qus_id) FROM qus_replay_times_desc GROUP BY replay_time ORDER BY replay_time asc

My Select considers all questions while taking the first response of each question.

PAY ATTANTION that there are 14249248 questions with no answer (AcceptedAnswerId column is null) and 3817732 questions with answer (AcceptedAnswerId is not null) according to the StackOverflow Database, which means only around 20% of the questions asked in StackOverflow got a good answer which the question author manually accepted. It looks very low if you ask me, so relaying on AcceptedAnswerId column can miss a lot of questions.

User Gravatar

beyal

Posted Jan 22 2014 7:59 UTC

Hey beyal and thanks for the reply.

I guess I wasn't clear enough about what I needed.

I don't need the average time to solve a question on Stack Overflow.

I need the time it takes to get a first reply - any reply like an answer or a comment - to a question. Even if the question was not eventually solved.

Sorry for the confusion, hope this clarifies it.

User Gravatar

BadGoat

Posted Jan 22 2014 8:31 UTC

Hey,

Great, so this is exactly what my query does. It calculates the time from the creation of a question until the time of the first response or comment for all questions in the DB.

User Gravatar

beyal

Posted Jan 22 2014 8:37 UTC

I updated my query to include the comments: http://data.stackexchange.com/stackoverflow/query/edit/162023

WITH qus_replay_times_desc AS ( SELECT qus_id, replay_time = CASE WHEN DATE_DIFF_IN_SEC < 3600 THEN '1 - Less then an hour' WHEN DATE_DIFF_IN_SEC >= 3600 and DATE_DIFF_IN_SEC < 7200 THEN '2 - Between an hour and 2 hours' WHEN DATE_DIFF_IN_SEC >= 7200 and DATE_DIFF_IN_SEC < 21600 THEN '3 - 2 hours to 6 hours' WHEN DATE_DIFF_IN_SEC >= 21600 and DATE_DIFF_IN_SEC < 86400 THEN '4 - 6 hours to a day' WHEN DATE_DIFF_IN_SEC >= 86400 and DATE_DIFF_IN_SEC < 259200 THEN '5 - A day to 3 days' WHEN DATE_DIFF_IN_SEC >= 259200 and DATE_DIFF_IN_SEC < 604800 THEN '6 - 3 days to a week' WHEN DATE_DIFF_IN_SEC >= 604800 and DATE_DIFF_IN_SEC < 2592000 THEN '7 - Week to a month' ELSE '8 - Later then a month' END FROM (SELECT qus_id, MIN(DATE_DIFF_IN_SEC) DATE_DIFF_IN_SEC FROM (SELECT questions.id qus_id,
DATEDIFF(s, questions.creationdate, answers.creationdate) DATE_DIFF_IN_SEC FROM posts questions, posts answers WHERE questions.id=answers.parentid UNION ALL SELECT questions.id qus_id,
DATEDIFF(s, questions.creationdate, comments.creationdate) DATE_DIFF_IN_SEC FROM posts questions, comments WHERE questions.id=comments.postid) qus_ans_response_time GROUP BY qus_id) qus_min_response_time ) SELECT SUBSTRING(replay_time, 5, len(replay_time)), count(qus_id) FROM qus_replay_times_desc GROUP BY replay_time ORDER BY replay_time asc

User Gravatar

beyal

Posted Jan 22 2014 8:45 UTC

Solution

This didn't solve your task? Get your own custom solution.

That's great but the numbers don't add.

There are 6.5M questions on Stack Overflow that are not closed (that have a ClosedDate that is null) - if I calculate the total number of questions from your query I get a little over 5M.

User Gravatar

BadGoat

Posted Jan 22 2014 8:47 UTC

Why do you refer to the questions which are not closed? Don't you want to have the info on all questions?

User Gravatar

beyal

Posted Jan 22 2014 8:50 UTC

Even with the closed questions the numbers are too low.

User Gravatar

BadGoat

Posted Jan 22 2014 8:53 UTC

This select gets you all questions that has answers (questions that has no answers would not have posts with parentid that relates to them and therefore wouldn't be included in the results):

SELECT questions.id qus_id,
DATEDIFF(s, questions.creationdate, answers.creationdate) DATE_DIFF_IN_SEC FROM posts questions, posts answers WHERE questions.id=answers.parentid

This select gets you all questions that has comments (questions that has no comments would not have comments with postid that relates to them and therefore wouldn't be included in the results):

SELECT questions.id qus_id,
DATEDIFF(s, questions.creationdate, comments.creationdate) DATE_DIFF_IN_SEC FROM posts questions, comments WHERE questions.id=comments.postid

I combine the two select by union all to get all questions with all their responses and comments and the next step is that i take the earliest (=MIN(DATE_DIFF_IN_SEC)) response comment for each question.

It can't be more elaborated then this.

Notice that answercount can be null and can be zero so searching "where answercount is not null" is not enough and you need to search "where answercount is not null and answercount > 0"

User Gravatar

beyal

Posted Jan 22 2014 9:04 UTC

Hii,

Have you tried using my query? Did it fulfilled your request?

User Gravatar

beyal

Posted Jan 23 2014 1:54 UTC

Hey sorry for the late response..

Yes! it worked like a charm! :)

Thanks so much!

User Gravatar

BadGoat

Posted Jan 23 2014 2:11 UTC

Add a reply

By posting a reply on CodersClan you agree to our Terms & Conditions