Ian Mulvany

September 5, 2022

Using GTP-3 to do some data cleaning


My son is just entering year six in primary school. That means that over the next few months we have to visit secondary schools in our area to learn which ones we want to apply to. The schools all have different visiting dates, so I wanted to pull them together to see them in one location. 

Hackney provides the information on this pdf. The information is presented in a variety of ways, and I didn’t want to transcribe it all by hand, so I wondered if I could use GPT-3 to help me with the task. 

I copied out the text from the schools that I thought we would be most interested in, and I pasted it into the OpenAI playground. I then played around with a number of prompts to see if it could do the data extraction job for me. 

I had to have a few goes to get the data out, but broadly it worked well. 

The following prompt

extract the visit dates, visit times, school name, and website, from the below text, and output as a csv file, with one row for every visit date 

Applied to the following text

THE URSWICK SCHOOL Paragon Road, London E9 6NR 020 8985 2430 www.theurswickschool.co.uk Tuesday 4 October, 5.30–7.30pm (Headteacher speaks at 6.15pm and 7pm. No booking required) Tours of the school at 9.30am every Tuesday and Thursday between 20 September and 20 October. Contact the school by phone or email admin@theurswickschool.co.uk to book your tour. 
THE PETCHEY ACADEMY Shacklewell Lane, London E8 2EY 020 7275 1500 www.petcheyacademy.org.uk Monday 26 September, 5.30-7.30pm (Headteacher talks at 6-6.30pm and 7-7.30pm; tours between 5.30-6pm and 6.30-7pm) Tours (by appointment only) every Thursday between 15 September and 20 October, 8.30-9.30am. Details on how to book on the academy’s’ website 
MOSSBOURNE COMMUNITY ACADEMY 100 Downs Park Road, London E5 8JY 020 8525 5200 www.mca.mossbourne.org Thursday 29 September, 4.30-7.30pm (Last entry at 7.15pm) Please note, this is the only date and time offered for visits by prospective applicants and their parents/carers. The Principal's speech will be shown throughout the evening and available on our website from 29 September. 
MOSSBOURNE VICTORIA PARK ACADEMY Victoria Park Road, London E9 7HD 020 8510 4550 www.mvpa.mossbourne.org Thursday 29 September, 4.30–8pm (Last entry 7.15pm. Last Principal’s speech at 7.30pm. Evening ends at 8pm) Please note, this is the only date and time offered for visits by prospective applicants and their parents/carers. HAGGERSTON SCHOOL Weymouth Terrace, London E2 8LS 020 7739 7324 www.haggerston.hackney.sch.uk Wednesday 28 September, 9-10.30am Monday 3 October, 5-7.30pm Wednesday 5 October, 9-10.30am Wednesday 12 October, 9-10.30am Wednesday 19 October, 9-10.30am

Produced the following output

The Urswick School, Tuesday 4 October, 5.30-7.30pm, www.theurswickschool.co.uk 

The Petchey Academy, Monday 26 September, 5.30-7.30pm, www.petcheyacademy.org.uk
Mossbourne Community Academy, Thursday 29 September, 4.30-7.30pm, www.mca.mossbourne.org
Mossbourne Victoria Park Academy, Thursday 29 September, 4.30-8pm, www.mvpa.mossbourne.org
Haggerston School, Wednesday 28 September, 9-10.30am, www.haggerston.hackney.sch.uk

With a small number of iterations I was able to build up this spreadsheet: 

Interestingly I wanted to improve the date formatting, and add in days that were missing. I tried the following prompt 

Take the following dates, and format them consistently, including the name of the day if it is missing

And applied it to this list of dates:

Thursday 13 October Thursday 6 October Thursday 6 October Wednesday 21 September Wednesday 12 October Thursday 13 October 13th September, 20th September, 27th September, 20 September 27th September, 6th October 7th October Tuesday 4 October Monday 26 September Thursday 29 September Thursday 29 September Wednesday 28 September 28th September, 3rd October, 5th October, 12 October, 19th October

The result was as follows:

Thursday, October 13 Thursday, October 6 Thursday, October 6 Wednesday, September 21 Wednesday, October 12 Thursday, October 13 Thursday, September 13 Thursday, September 20 Thursday, September 27 Thursday, September 20 Thursday, September 27 Thursday, October 6 Friday, October 7 Tuesday, October 4 Monday, September 26 Thursday, September 29 Thursday, September 29 Wednesday, September 28 Thursday, September 28 Saturday, October 3 Monday, October 5 Monday, October 12 Thursday, October 19

Which is interesting because the days that it adds in are the wrong days for the given dates in 2022.