What's new in this version:
Screaming Frog 18.1
- Bug fixes
Screaming Frog 18.0
- It’s taken a little while, but like most SEOs, we’ve finally come to terms with the fact that we’ll actually have to switch to GA4. You’re now able to (begrudgingly) connect to GA4 and pull in analytics data in a crawl via their new API.
- Connect via ‘Config > API Access > GA4’, select from 65 available metrics, and adjust the date and dimensions.
- Similar to the existing UA integration, data will quickly appear in real-time under the ‘Analytics’ and ‘Internal’ tabs when you start crawling.
- You can apply ‘filter’ dimensions like in the GA UI, including first user or session channel grouping, with dimension values such as ‘organic search’ to refine to a specific channel.
- If there are any other dimensions or filters you’d like to see supported, then do let us know
- PDFs are not the sexiest thing in the world, but due to the number of corporates and educational institutions that have requested this over the years, we felt compelled to provide support parsing them. The SEO Spider will now crawl PDFs, discover links within them and show the document title as the page title.
- This means users can check to see whether links within PDFs are functioning as expected and issues like broken links will be reported in the usual way in the Response Codes tab. The outlinks tab will be populated, and include details such as response codes, anchor text and even what page of the PDF a link is on.
- You can also choose to ‘Extract PDF Properties’ and ‘Store PDF’ under ‘Config > Spider > Extraction’ and the PDF subject, author, created and modified dates, page count and word count will be stored.
- PDFs can be bulk saved and exported via ‘Bulk Export > Web > All PDF Documents’.
- If you’re interested in how search engines crawl and index PDFs, check out a couple of tweets where we shared some insights from internal experiments for both Google and Bing.
- There’s a new Validation tab, which performs some basic best practice validations that can impact crawlers when crawling and indexing. This isn’t W3C HTML validation, which is a little too strict. The aim of this tab is to identify issues that can prevent search bots from parsing and understanding a page reliably.
- Most SEOs know about invalid HTML elements in the head causing it to close early, but there are other interesting fix-ups and quirks that browsers like Chrome (and subsequently Google) apply if they see a non-head element prior to the head in the HTML (they create their own blank head), or if there are multiple or missing HTML elements, etc.
- The new filters include –
- Invalid HTML Elements In <head> – Pages with invalid HTML elements within the <head>. When an invalid element is used in the <head>, Google assumes the end of the <head> element and ignores any elements that appear after the invalid element. This means critical <head> elements that appear after the invalid element will not be seen. The <head> element as per the HTML standard is reserved for title, meta, link, script, style, base, noscript and template elements only.
- <head> Not First In <html> Element – Pages with an HTML element that precedes the <head> element in the HTML. The <head> should be the first element in the <html> element. Browsers and Googlebot will automatically generate a <head> element if it’s not first in the HTML. While ideally <head> elements would be in the <head>, if a valid <head> element is first in the <html> it will be considered as part of the generated <head>. However, if non-<head> elements are used before the intended <head> element and its metadata, then Google assumes the end of the <head> element. This means the intended <head> element and its metadata may only be seen in the <body> and ignored.
- Missing <head> Tag – Pages missing a <head> element within the HTML. The <head> element is a container for metadata about the page, that’s placed between the <html> and <body> tag. Metadata is used to define the page title, character set, styles, scripts, viewport and other data that are critical to the page. Browsers and Googlebot will automatically generate a <head> element if it’s omitted in the markup, however it may not contain meaningful metadata for the page and this should not be relied upon.
- Multiple <head> Tags – Pages with multiple <head> elements in the HTML. There should only be one <head> element in the HTML which contains all critical metadata for the document. Browsers and Googlebot will combine metadata from subsequent <head> elements if they are both before the <body>, however, this should not be relied upon and is open to potential mix-ups. Any <head> tags after the <body> starts will be ignored.
- Missing <body> Tag – Pages missing a <body> element within the HTML. The <body> element contains all the content of a page, including links, headings, paragraphs, images and more. There should be one <body> element in the HTML of the page. Browsers and Googlebot will automatically generate a <body> element if it’s omitted in the markup, however, this should not be relied upon.
- Multiple <body> Tags – Pages with multiple <body> elements in the HTML. There should only be one <body> element in the HTML which contains all content for the document. Browsers and Googlebot will try to combine content from subsequent <body> elements, however, this should not be relied upon and is open to potential mix-ups.
- We plan on extending our validation checks and filters over time.
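The checks described above can be approximated on raw markup. The following is a minimal sketch, assuming Python's standard html.parser; the class and function names are hypothetical, and it only inspects the markup as written. It does not reproduce the fix-ups browsers or Googlebot apply, nor the SEO Spider's own implementation.

```python
from html.parser import HTMLParser

# Elements the release notes list as valid inside <head>.
VALID_HEAD_ELEMENTS = {
    "title", "meta", "link", "script", "style", "base", "noscript", "template",
}


class HeadBodyAudit(HTMLParser):
    """Counts <head>/<body> start tags and records invalid elements inside <head>."""

    def __init__(self):
        super().__init__()
        self.head_count = 0
        self.body_count = 0
        self.in_head = False
        self.invalid_in_head = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.head_count += 1
            self.in_head = True
        elif tag == "body":
            self.body_count += 1
            self.in_head = False
        elif self.in_head and tag not in VALID_HEAD_ELEMENTS:
            self.invalid_in_head.append(tag)

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def audit_html(markup):
    """Return the names of the validation filters the markup would trigger."""
    parser = HeadBodyAudit()
    parser.feed(markup)
    parser.close()
    issues = []
    if parser.invalid_in_head:
        issues.append("Invalid HTML Elements In <head>")
    if parser.head_count == 0:
        issues.append("Missing <head> Tag")
    elif parser.head_count > 1:
        issues.append("Multiple <head> Tags")
    if parser.body_count == 0:
        issues.append("Missing <body> Tag")
    elif parser.body_count > 1:
        issues.append("Multiple <body> Tags")
    return issues
```

For example, a page with a <div> inside its <head> and two <body> tags would be reported under both ‘Invalid HTML Elements In <head>’ and ‘Multiple <body> Tags’.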
- Every time we release an update there will always be one or two users that remind us that they have to painstakingly visit our website, and click a button to download and install the new version.
- WHY do we have to put them through this torture?
- The simple answer is that historically we’ve thought it wasn’t a big deal and it’s a bit of a boring enhancement to prioritise over so many other super cool features we could build. With that said, we do listen to our users, so we went ahead and prioritised the boring-but-useful feature.
- You will now be alerted in-app when there’s a new version available, which will have already silently downloaded in the background. You can then install in a few clicks.
- We’re planning on switching our installer, so a reduction in the number of clicks required to install, along with auto-restart, will be implemented soon, too. We can barely contain our excitement.
Authentication for Scheduling / CLI:
- Previously, the only way to authenticate via scheduling or the CLI was to supply an ‘Authorization’ HTTP header with a username and password via the HTTP header config, which worked for standards based authentication – rather than web forms
- We’ve now made this much simpler, and not just for basic or digest authentication, but web form authentication as well. In ‘Config > Authentication’, you can now provide the username and password for any standards based authentication, which will be remembered so you only need to provide it once.
- You can also login as usual via ‘Forms Based’ authentication and the cookies will be stored
- When you have provided the relevant details or logged in, you can visit the new ‘Profiles’ tab, and export a new .seospiderauthconfig file
- This file, which stores saved credentials for both standards-based and forms-based authentication, can then be supplied in scheduling or the CLI.
- This means for scheduled or automated crawls the SEO Spider can login to not just standards based authentication, but web forms where feasible as well
New Filters & Issues:
- There’s a variety of new filters and issues available across existing tabs that help better filter data, or communicate issues discovered
- Many of these were already available either via another filter, or from an existing report like ‘Redirect Chains’. However, they now have their own dedicated filter and issue in the UI, to help raise awareness. These include –
- ‘Response Codes > Redirect Chains’ – Internal URLs that redirect to another URL, which also then redirects. This can occur multiple times in a row, and each redirect is referred to as a ‘hop’. Full redirect chains can be viewed and exported via ‘Reports > Redirects > Redirect Chains’.
- ‘Response Codes > Redirect Loop’ – Internal URLs that redirect to another URL, which also then redirects. This can occur multiple times in a row, and each redirect is referred to as a ‘hop’. This filter will only populate if a URL redirects to a previous URL within the redirect chain. Redirect chains with a loop can be viewed and exported via ‘Reports > Redirects > Redirect Chains’ with the ‘Loop’ column filtered to ‘True’.
- ‘Images > Background Images’ – CSS background and dynamically loaded images discovered across the website, which should be used for non-critical and decorative purposes. Background images are not typically indexed by Google and browsers do not provide alt attributes or text on background images to assistive technology.
- ‘Canonicals > Multiple Conflicting’ – Pages with multiple canonicals set for a URL that have different URLs specified (via either multiple link elements, HTTP header, or both combined). This can lead to unpredictability, as there should only be a single canonical URL set by a single implementation (link element, or HTTP header) for a page.
- ‘Canonicals > Canonical Is Relative’ – Pages that have a relative rather than absolute rel="canonical" link tag. While the tag, like many HTML tags, accepts both relative and absolute URLs, it’s easy to make subtle mistakes with relative paths that could cause indexing-related issues.
- ‘Canonicals > Unlinked’ – URLs that are only discoverable via rel="canonical" and are not linked-to via hyperlinks on the website. This might be a sign of a problem with internal linking, or the URLs contained in the canonical.
- ‘Links > Non-Indexable Page Inlinks Only’ – Indexable pages that are only linked-to from pages that are non-indexable, which includes noindex, canonicalised or robots.txt disallowed pages. Pages with noindex and links from them will initially be crawled, but noindex pages will be removed from the index and be crawled less over time. Links from these pages may also be crawled less and it has been debated by Googlers whether links will continue to be counted at all. Links from canonicalised pages can be crawled initially, but PageRank may not flow as expected if indexing and link signals are passed to another page as indicated in the canonical. This may impact discovery and ranking. Pages disallowed by robots.txt can’t be crawled, so links from these pages will not be seen.
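To illustrate the difference between the ‘Redirect Chains’ and ‘Redirect Loop’ filters above, here is a minimal sketch. It assumes the redirects have already been collected into a plain dictionary; the function name, the data structure and the max_hops cap are hypothetical and are not taken from the SEO Spider.

```python
def trace_redirects(start_url, redirects, max_hops=10):
    """Follow redirects from start_url.

    `redirects` maps a URL to the URL it redirects to (hypothetical data).
    Returns (chain, is_loop): the URLs visited in order, and whether the
    chain redirected back to a URL it had already visited.
    """
    chain = [start_url]
    seen = {start_url}
    current = start_url
    # Each iteration follows one redirect, i.e. one 'hop'.
    while current in redirects and len(chain) - 1 < max_hops:
        current = redirects[current]
        chain.append(current)
        if current in seen:
            return chain, True
        seen.add(current)
    return chain, False


redirects = {"/a": "/b", "/b": "/c", "/x": "/y", "/y": "/x"}
chain, is_loop = trace_redirects("/a", redirects)
hops = len(chain) - 1  # two or more hops makes this a redirect chain
```

In this sketch "/a" produces a two-hop chain without a loop, while "/x" redirects back to itself via "/y" and is flagged as a loop.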
Flesch Readability Scores:
- Flesch readability scores are now calculated and included within the ‘Content’ tab with new filters for ‘Readability Difficult’ and ‘Readability Very Difficult’.
- Please note, the readability scores are suited to the English language, and we may provide support for additional languages or alternative readability scores for other languages in the future.
- Readability scores can be disabled under ‘Config > Spider > Extraction’
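For reference, the standard Flesch reading-ease formula is 206.835 − 1.015 × (words ÷ sentences) − 84.6 × (syllables ÷ words). The sketch below applies it with a deliberately rough vowel-group syllable counter. The function names are hypothetical, and the band boundaries (below 30 and below 50) follow the commonly published Flesch table rather than anything confirmed about how the SEO Spider assigns its filters.

```python
import re


def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels, minimum one."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text):
    """Standard Flesch reading-ease score for English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


def readability_band(score):
    """Assumed bands, based on the commonly published Flesch table."""
    if score < 30:
        return "Very Difficult"
    if score < 50:
        return "Difficult"
    return "Other"
```

Because the syllable count is heuristic, scores from this sketch will differ slightly from tools that use a pronunciation dictionary.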
Auto Complete URL Bar:
- The URL bar will now show suggested URLs to enter as you type based upon previous URL bar history, which a user can quickly select to help save precious seconds.
Response Code Colours for Visualisations:
- You’re now able to select to ‘Use Response Code Node Colours’ in crawl visualisations.
- This means nodes for no responses, 2XX, 3XX, 4XX and 5XX buckets will be coloured individually, to help users spot issues related to responses more effectively.
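The bucketing described above can be sketched as a simple mapping from status code to bucket name. This is only an illustration: the function name is hypothetical, and treating None or 0 as ‘no response’ is an assumption.

```python
def response_bucket(status_code):
    """Map an HTTP status code to a response bucket name.

    None or 0 is treated as 'no response' (an assumption for this sketch).
    """
    if not status_code:
        return "No Response"
    if 200 <= status_code <= 299:
        return "2XX"
    if 300 <= status_code <= 399:
        return "3XX"
    if 400 <= status_code <= 499:
        return "4XX"
    if 500 <= status_code <= 599:
        return "5XX"
    return "Other"
```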
XML Sitemap Source In Scheduling:
- You can now choose an XML Sitemap URL as the source in scheduling and via the CLI in list mode, like the regular UI.
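Conceptually, using an XML Sitemap as a list-mode source means turning its <loc> entries into a list of URLs to crawl. A minimal sketch of that extraction with Python's standard library is shown below; the function name is hypothetical, and it handles only a plain URL set, not a sitemap index.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text):
    """Return the <loc> URLs listed in a sitemap <urlset> document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


sample = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/</loc></url>"
    "<url><loc>https://example.com/about</loc></url>"
    "</urlset>"
)
url_list = urls_from_sitemap(sample)
```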
Screaming Frog 17.2
- Bug fixes
Screaming Frog 17.1
- Bug fixes
Screaming Frog 17.0
- Issues Tab, Links Tab, New Limits, ‘Multiple Properties’ Config For URL Inspection API, Apple Silicon Version & RPM for Fedora and Detachable Tabs
Screaming Frog 16.7
This release is mainly bug fixes and small improvements:
- URL inspection can now be resumed from a saved crawl
- The automated Screaming Frog Data Studio Crawl Report now has a URL Inspection page
- Added ‘Days Since Last Crawl’ column for the URL Inspection integration
- Added URL Inspection data to the lower ‘URL Details’ tab
- Translations are now available for the URL Inspection integration
- Fixed a bug moving tabs and filters related to URL Inspection in scheduling
- Renamed two ‘Search Console’ filters – ‘No Search Console Data’ to ‘No Search Analytics Data’ and ‘Non-Indexable with Search Console Data’ to ‘Non-Indexable with Search Analytics Data’ to be more specific regarding the API used
- Fix crash loading scheduled tasks
- Fix crash removing URLs
Screaming Frog 16.6
- Change log not available for this version
Screaming Frog 16.5
- Update to Apache log4j 2.17.0 to fix CVE-2021-45046 and CVE-2021-45105
- Show more detailed crawl analysis progress in the bottom status bar when active
- Improve Google Sheets exporting when Google responds with 403s and 502s
- Be more tolerant of leading/trailing spaces for all tab and filter names when using the CLI
- Add auto naming for GSC accounts, to avoid tasks clashing
- Fix crash running link score on crawls with URLs that have a status of “Rendering Failed”
Screaming Frog 16.4
- Bug fixes
Screaming Frog 16.3
- The Google Search Console integration now has new filters for search type (Discover, Google News, Web etc) and supports regex as per the recent Search Analytics API update
- Fix issue with Shopify and CloudFront sites loading in Forms Based authentication browser
- Fix issue with cookies not being displayed in some cases
- Give unique names to Google Rich Features and Google Rich Features Summary report file names
- Fix crash running on macOS Monterey
- Fix right click focus in visualisations
- Fix crash in Spelling and Grammar UI
- Fix crash when exporting invalid custom extraction tabs on the CLI
- Fix crash when flattening shadow DOM
- Fix crash generating a crawl diff
- Fix crash when Chromium can’t be initialised
Screaming Frog 16.2
- Updated some Spanish translations based on feedback
- Updated SERP Snippet preview to be more in sync with current SERPs
- Fix issue preventing the Custom Crawl Overview report for Data Studio working in languages other than English
- Fix issue resuming crawls with saved Internal URL configuration
- Fix issue caused by highlighting a selection then clicking another cell in both list and tree views
- Fix issue duplicating a scheduled crawl