[TRU Research] Web App Data Schema

Stephen DeSanto rachidian at gmail.com
Sun Aug 25 15:37:11 PDT 2019


Hi everyone. Made a few updates to the master list of employers
<https://docs.google.com/spreadsheets/d/1HmOcG7hJLD1G0unCMPcsDnXr4RIA_PMKEE5ne-hhQR8/edit?usp=sharing>
for the upcoming website:

   - Added companies and transit benefits raw descriptions from TRU survey
   data
   - Added "Likely CTR Targets":
   https://seattletransitpasses-research.pbworks.com/w/page/133438365/Likely%20Target%20Assessment
   - Added "Potential CTR Targets":
   https://seattletransitpasses-research.pbworks.com/w/page/133437828/Potential%20CTR%20Targets
   - "Likely" / high-profile targets (hotels, banks) are highlighted in
   ORANGE
   - Added "Potential Poster Children":
   https://seattletransitpasses-research.pbworks.com/w/page/133439169/Potential%20Poster%20Children
   - Poster children are highlighted in GREEN

Highlight colors are just to make it easier to find rows that a) someone
said should be included in the list, and b) probably needs benefits data
incorporated

Things to be done:

   - Normalize locations data
   - Assign "leaf scores" to all companies that don't have one
   - Assign "polluter" etc badges to companies we want to name&shame
   - Add all of the hotels?
   https://seattletransitpasses-research.pbworks.com/w/page/133666440/Hotels
   - Add Choice participants?
   - Add column for Commute Seattle participants?
   https://seattletransitpasses-research.pbworks.com/w/page/133438167/Commute%20Seattle%20List%20of%20Passport%20Participants
   - Citations, descriptions of benefits, etc. for companies that need it

There's still a lot we're not 100% sure about for employer benefits, but we
can do the best we have with what we've got, and make changes as we get new
information.

My thinking was, we can use the "master employer list" to get as much
information about the companies we're interested in. When we're satisfied
that a row is finished and ready for publication, check the checkbox in the
"__publish" column. Then, when we export this data to the website, we can
only get the rows where "__publish" is checked. This hopefully will ensure
that someone manually reviewed and verified all the data for an employer
before it gets published, and that unfinished rows won't be accidentally
exported.

Is this helpful? Am I just spinning my wheels in the mud?

On Thu, Aug 22, 2019 at 7:06 PM Stephen DeSanto <rachidian at gmail.com> wrote:

> FYI, added most of you as editors on the spreadsheet I'm working on, in
> case anyone has time for tedious data tasks (or a quick way to do tedious
> data tasks). I'm currently adding in data from the TRU survey, from
> respondents whose employers offer transit benefits. Eventually, we'll need
> these tagged with industry and fix the neighborhoods data? And add in any
> other company data we have from the other research spreadsheets on the
> wiki? And eventually some subset of this data ends up on the website?
>
> On Tue, Aug 20, 2019 at 3:03 PM Tom Chartrand <tmchartrand at gmail.com>
> wrote:
>
>> Oh you're right, sorry for the confusion everyone! was just fairly hidden
>> in the view i looked at. Column S!
>>
>> On Tue, Aug 20, 2019 at 3:00 PM Katie Wilson <katie at transitriders.org>
>> wrote:
>>
>>> I think the spreadsheet with PII removed still does include the Employer
>>> column, no?
>>>
>>> Sorry I’m being slow to respond to all this good stuff, I am still
>>> digging myself out from being away last week and I’m at an all-day thing
>>> today… but I should have time to pay more attention before the end of the
>>> week!
>>>
>>> On Aug 19, 2019, at 6:26 PM, Stephen DeSanto <rachidian at gmail.com>
>>> wrote:
>>>
>>> I have time to go through the survey data and find the reported transit
>>> benefits per employer, though I'll need the data set that contains that
>>> data. :)
>>>
>>> Otherwise, I am going to be trying to match CTR neighborhoods to the
>>> employers already in our spreadsheet, as well as adding any employers
>>> mentioned in other sources/sheets on our wiki.
>>>
>>> On Mon, Aug 19, 2019 at 6:02 PM Tom Chartrand <tmchartrand at gmail.com>
>>> wrote:
>>>
>>>> This is looking great, Stephen!
>>>> I had put myself down to organize the survey data with respect to
>>>> employers for this, but I just realized that info was removed as PII (of
>>>> course)! So either Mike will need to take that on (I think Mike did the
>>>> original PII removal) or we'll need to figure out an appropriate way of
>>>> sharing that.
>>>> I'm feeling pretty swamped myself lately, so if you (Stephen) were down
>>>> to help him with the task that could be great. I can certainly still take
>>>> on some of it if needed though, once we get this sorted out.
>>>> Katie, maybe you could help coordinate this and make sure Mike sees
>>>> this sooner rather than later?
>>>>
>>>> Also, do let me know if you have any more specific spots in the report
>>>> where some backup from the PSRC dataset could be useful!
>>>>
>>>> On Sun, Aug 18, 2019 at 3:59 PM Stephen DeSanto <rachidian at gmail.com>
>>>> wrote:
>>>>
>>>>> I've added the list of industry categories to the Google Sheet, so
>>>>> that should help validate the data we add there, though it's going to
>>>>> likely be a manual task to fill in industries for all the employers.
>>>>>
>>>>> I've also added a "citation" column, which can be the public
>>>>> representation of where we got the data to make our claim. We can fuss with
>>>>> the wording later.
>>>>>
>>>>> I should have time this week to go through our survey data and other
>>>>> wiki tables to add or modify employers in the Google Sheet. Agree that
>>>>> it'll be good to have solid information on our primary targets and
>>>>> champions.
>>>>>
>>>>> On Tue, Aug 13, 2019 at 10:48 PM Harry Maher <harryb.maher at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Just a quick update with regard to qualitative data analysis: I made
>>>>>> a "Commute Survey Qualitative Data Analysis" folder on pbworks and put a
>>>>>> doc with some quotes in it for the report. I tried to pull out the main
>>>>>> relevant themes that I noticed discussed in the two qualitative questions
>>>>>> currently in the file with a couple of quote options for each
>>>>>> theme/category of response to the question.
>>>>>>
>>>>>> -Harry
>>>>>>
>>>>>> On Sun, Aug 11, 2019 at 12:54 PM Tom Chartrand <tmchartrand at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Regarding where to have this discussion - I'm just gonna continue
>>>>>>> the email chain cause I haven't followed where to put the discussion on the
>>>>>>> wiki, but someone feel free to steer it over there if we want to!
>>>>>>>
>>>>>>> A brief update regarding establishing a larger list of employers to
>>>>>>> include in the dataset: basic contact information for all seattle
>>>>>>> businesses, sorted by the North American Industry Classification System, is
>>>>>>> available at
>>>>>>> https://web6.seattle.gov/fas/slimbizsearch/ResultsPage.aspx?NAICList=Top100,
>>>>>>> but it's a huge list of course, with no info on number of employees or
>>>>>>> revenue to filter out the smaller ones. Still, I did send off an email
>>>>>>> about getting a copy of the database just for purposes of cross-referencing
>>>>>>> names and such.
>>>>>>> On 8/10/19 6:42 PM, Katie Wilson wrote:
>>>>>>>
>>>>>>> For “neighborhood” I think it makes sense to use the “CTR Network
>>>>>>> Areas” as defined here
>>>>>>> <https://www.seattle.gov/transportation/projects-and-programs/programs/transportation-options-program/commute-trip-reduction-program/draft-2019-2023-networks-and-targets>
>>>>>>> .
>>>>>>>
>>>>>>> For “industry” I think it makes sense to use the “Employment
>>>>>>> Sector” categories listed on Page 12 of this CTR strategic plan.
>>>>>>> <https://www.seattle.gov/Documents/Departments/SDOT/TransportationOptionsProgram/CTR_Draft_Strategic_Plan_Jan2019.pdf>
>>>>>>>
>>>>>>> On the ratings, I think it does make sense to lump "piggy bank" and
>>>>>>> "brown tortoise" in the same rating (0), and then add a tortoise badge for
>>>>>>> employers that aren’t even doing the pre-tax thing.
>>>>>>>
>>>>>>> Another simplification option to consider would be to lump together
>>>>>>> 3 and 4 leaves. But let’s leave them separate for now and depending on how
>>>>>>> things shake out we can easily combine them later.
>>>>>>>
>>>>>>> We don’t have any major sources of data on what benefits employers
>>>>>>> provide other than:
>>>>>>> — Metro public disclosure request spreadsheet
>>>>>>> <https://seattletransitpasses-research.pbworks.com/w/page/133438080/First%20Public%20Records%20Request>
>>>>>>> — Our commute survey
>>>>>>> — Info gleaned online from company websites, asking around,
>>>>>>> glassdoor etc (what I’ve found I’ve added to the relevant tables in
>>>>>>> the wiki
>>>>>>> <https://seattletransitpasses-research.pbworks.com/w/page/132177123/Employers>,
>>>>>>> on CTR employers and “potential poster children” and “likely target
>>>>>>> assessment” and “hotels”)
>>>>>>>
>>>>>>> Maybe it makes sense to have another string indicating sufficient
>>>>>>> certainty — when we have two sources, or one very reliable source, we enter
>>>>>>> an X or whatever, and that gives us the green light to display that data.
>>>>>>> Also it may not make sense to put a lot of work into categorizing employers
>>>>>>> into Network Area and Employment Sector until we have reliable data on what
>>>>>>> benefits they’re offering.
>>>>>>>
>>>>>>> Speaking of Seattle Coffee Works, I spoke with their HR person a few
>>>>>>> months ago and actually employees have to pay $20/month (pre-tax $) if they
>>>>>>> want an ORCA card. Still a great deal but not 100% subsidy as reported in
>>>>>>> the Metro data— which, I then learned, is self-reported by the company.
>>>>>>> Metro only knows that all those companies are signed up for the Passport
>>>>>>> program. I noted the real situation on this page
>>>>>>> <https://seattletransitpasses-research.pbworks.com/w/page/133439169/Potential%20Poster%20Children>.
>>>>>>> Anyway, the point is we should probably crosscheck the Metro data as much
>>>>>>> as we can with our survey or other sources of information.
>>>>>>>
>>>>>>> (Also speaking of Seattle Coffee Works they have locations in Capitol
>>>>>>> Hill & Cascade too
>>>>>>> <https://www.seattlecoffeeworks.com/our-cafes.aspx>. From talking
>>>>>>> with the HR person I’m pretty sure all are include in their passport
>>>>>>> program, and the employees swap around a lot from location to location.
>>>>>>> They probably use the Ballard location as home base for transit pass
>>>>>>> purposes since that’s the least expensive zone.)
>>>>>>>
>>>>>>> One project would be to come up with a list of employers that have
>>>>>>> name recognition (or that we are interested in for some other reason) and
>>>>>>> put a little work into attaining sufficient certainty. If we posted the
>>>>>>> list to a page and put a call out on social media and email I bet we’d get
>>>>>>> some answers.
>>>>>>>
>>>>>>> On Aug 8, 2019, at 5:26 PM, Stephen DeSanto <rachidian at gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> I've taken a first pass at the data schema for showing employer
>>>>>>> transit benefits in our upcoming web app. In this draft, each employer
>>>>>>> record is represented as follows:
>>>>>>>
>>>>>>> {
>>>>>>>     "employer": string,
>>>>>>>     "industry": [string],
>>>>>>>     "neighborhood": [string],
>>>>>>>     "alias": [string],
>>>>>>>     "rating": int,
>>>>>>>     "description": string
>>>>>>>     "badges": [string]
>>>>>>> }
>>>>>>>
>>>>>>> *Employer* is a plain text string.
>>>>>>> *Industry* is a list of strings (or a single string, if we want to
>>>>>>> limit one employer = one industry).
>>>>>>> *Neighborhood* is treated similarly to industry
>>>>>>> *Alias* is a list of other names for the same company. For example,
>>>>>>> *Rating* is a numerical scale that represents the "worker's monthly
>>>>>>> cost of an unlimited transit pass". The scale provided during the meeting
>>>>>>> went from "4 leaves" to "brown tortoise"; aligning to the leaves, that
>>>>>>> gives us a scale of [-1, 0, 1, 2, 3, 4]. We could adjust this up to 0-5, or
>>>>>>> lump "piggy bank" and "brown tortoise" in the same rating.
>>>>>>> *Description* is a string that describes the employer's transit
>>>>>>> benefits, i.e. why they got the rating they did.
>>>>>>> *Badges* is a list of strings that represent any additional
>>>>>>> categories we want to assign to a company (e.g. "industry leader",
>>>>>>> "polluter").
>>>>>>>
>>>>>>> We can make changes to this schema if it makes it easier to work
>>>>>>> with our underlying data visualization platform (e.g. Tableau?
>>>>>>> DataTables?), but hopefully this is a suitable starting place.
>>>>>>>
>>>>>>> As an example, take a hypothetical record for Seattle Coffee Works.
>>>>>>>
>>>>>>> {
>>>>>>>     "employer": "Seattle Coffee Works",
>>>>>>>     "industry": ["restaurant"],
>>>>>>>     "neighborhood": ["cbd", "ballard"],
>>>>>>>     "alias": ["Ballard Coffee Works"],
>>>>>>>     "rating": 4,
>>>>>>>     "description": "Provides 100% ORCA Passport subsidy."
>>>>>>>     "badges": ["leader"]
>>>>>>> }
>>>>>>>
>>>>>>> *Where Our Data Lives (For Now)*
>>>>>>>
>>>>>>> I've also taken a rough chop at getting started with the data. Here,
>>>>>>> I've just taken the raw list of ORCA Business Passport employers and
>>>>>>> assigned a score based on their subsidy percentage, as an example:
>>>>>>>
>>>>>>>
>>>>>>> https://docs.google.com/spreadsheets/d/1HmOcG7hJLD1G0unCMPcsDnXr4RIA_PMKEE5ne-hhQR8/edit?usp=sharing
>>>>>>>
>>>>>>> The spreadsheet contains columns for each item of the employer
>>>>>>> record, as well as some additional columns to record the raw data we have
>>>>>>> on file for that employer, so we can use that data to automatically or
>>>>>>> manually determine an employer's rating.
>>>>>>>
>>>>>>> If we have data from other sources not listed (e.g. survey data,
>>>>>>> City of Seattle data), the "source_" columns can be renamed or added to
>>>>>>> represent that source's data. For example, if I want to add data from the
>>>>>>> TRU survey, I might rename "__source_b" to "__TRU Survey", then include
>>>>>>> results from that survey in that column for each company. (The columns
>>>>>>> beginning with two underscores are ones I don't expect to be publicly
>>>>>>> available.)
>>>>>>>
>>>>>>> PBworks feels really inadequate for editing large data sets, and I
>>>>>>> don't know where else to put it, so it's living in Google Sheets for now.
>>>>>>> Set to read-only with the link, for now, but please request editing
>>>>>>> permissions so you can add stuff to the sheet.
>>>>>>>
>>>>>>> Currently, my expectation is that the spreadsheet will be
>>>>>>> hand-edited in Google Sheets, and then when we're ready to put live data in
>>>>>>> the web app, we can export the sheet to a flat file, which we can then
>>>>>>> import into a format appropriate for the website (big ol' JSON file,
>>>>>>> database, whatever). Manual process, but probably fine for a project of
>>>>>>> this scale; I'm open to alternatives.
>>>>>>>
>>>>>>> *Things To Do Next*
>>>>>>>
>>>>>>> Aside from the ORCA Passport data and the data we collected through
>>>>>>> TRU survey / legwork (on PBworks), do we have any other data sources that
>>>>>>> would provide context for a score?
>>>>>>>
>>>>>>> For the data sources we have, we'll have to start filling out the
>>>>>>> rest of the spreadsheet, I guess?
>>>>>>>
>>>>>>> Also, we will need to determine:
>>>>>>> a) master list of "industries" we want to support, and
>>>>>>> b) "industry" field(s) for each employer
>>>>>>> c) "neighborhood" field(s) for each employer we don't have one for
>>>>>>> (or being more precise than what I have now)
>>>>>>> d) which companies get tagged with which badges
>>>>>>>
>>>>>>> Hope that helps.
>>>>>>>
>>>>>>> In solidarity,
>>>>>>>
>>>>>>> Stephen
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>
>>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.transitriders.org/pipermail/research/attachments/20190825/a5375d91/attachment.html>


More information about the Research mailing list